JP4686438B2

JP4686438B2 - Data classification apparatus, data classification method, data classification program, and recording medium

Info

Publication number: JP4686438B2
Application number: JP2006319416A
Authority: JP
Inventors: 保志櫻井
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2006-11-28
Filing date: 2006-11-28
Publication date: 2011-05-25
Anticipated expiration: 2026-11-28
Also published as: JP2008134750A

Description

The present invention relates to a database technique for classifying a plurality of data items into a plurality of clusters according to the similarity between the data items. In particular, it relates to a data classification device, a data classification method, a data classification program, and a recording medium in which the data to be classified are distribution data, each indicating the distribution of a set of vector data in a multidimensional space.

Conventionally, various clustering (classification) methods have been proposed in database technology for classifying a large set of vector data in a multidimensional space based on the similarity between vectors. The most representative clustering method is, for example, the k-means method (see Non-Patent Document 1). The k-means method classifies a set of vector data into a predetermined number k of clusters. Specifically, it uses the mean of the vector data included in each cluster as the representative point of that cluster, and assigns each vector to a cluster so that the sum of the distances between the cluster representative points and the vector data is minimized, thereby classifying the set of vector data into k clusters.

In addition, BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is known as a technique for hierarchically clustering a set of vector data (see Non-Patent Document 2). BIRCH generates a tree-structured summary called a CF (Clustering Feature) tree as summary information for the set of vector data. When new vector data to be classified are added to the original set of vector data, the CF-tree is updated without scanning the original set.
Non-Patent Document 1: Jiawei Han and Micheline Kamber, "Data Mining", Morgan Kaufmann, 2000
Non-Patent Document 2: Tian Zhang, Raghu Ramakrishnan and Miron Livny, "BIRCH: An Efficient Data Clustering Method for Very Large Databases", In Proceedings of ACM SIGMOD, 1996, pp. 103-114

However, the conventional clustering methods are all designed to classify sets of vector data, and a technique for classifying a set of distribution data, each indicating the distribution of a set of vector data in a multidimensional space, has been desired. In addition, when the set of distribution data is large, containing for example thousands to tens of thousands of distribution data, classifying it is expected to require an enormous amount of computation time.

The present invention has been made in view of the above problems, and an object thereof is to provide a technique capable of classifying a set of distribution data.
Another object of the present invention is to provide a technique capable of classifying a set of distribution data at high speed.

To solve the above problems, the data classification device according to claim 1 is a data classification device that classifies a set of distribution data, each indicating the distribution of a set of vector data in a multidimensional space, into a plurality of clusters corresponding to the similarity between the distribution data, the device comprising: data summarization means for summarizing, for each of a predetermined number of clusters into which all the distribution data to be classified have been classified in advance, the data of all the previously created histograms of all the distribution data currently belonging to that cluster, as a histogram representing that cluster; similarity calculation means for calculating the similarity between distribution data based on the histogram of a distribution data item to be classified that currently belongs to one of the predetermined number of clusters and on the histograms respectively representing the predetermined number of clusters; classification means for classifying the distribution data to be classified into the predetermined number of clusters based on the calculated similarity; and classification result output means for outputting, as a classification result, information on the distribution data belonging to the predetermined number of clusters after the classification.

The data classification method according to claim 5 is a data classification method of a data classification device that classifies a set of distribution data, each indicating the distribution of a set of vector data in a multidimensional space, into a plurality of clusters corresponding to the similarity between the distribution data, the method comprising: a data summarization step in which data summarization means summarizes, for each of a predetermined number of clusters into which all the distribution data to be classified have been classified in advance, the data of all the previously created histograms of all the distribution data currently belonging to that cluster, as a histogram representing that cluster; a similarity calculation step in which similarity calculation means calculates the similarity between distribution data based on the histogram of a distribution data item to be classified that currently belongs to one of the predetermined number of clusters and on the histograms respectively representing the predetermined number of clusters; a classification step in which classification means classifies the distribution data to be classified into the predetermined number of clusters based on the calculated similarity; and a classification result output step in which classification result output means outputs, as a classification result, information on the distribution data belonging to the predetermined number of clusters after the classification.

According to the data classification device of claim 1 or the data classification method of claim 5, the data classification device uses, as the cluster representative point, a histogram that summarizes the histogram data of the distribution data in the cluster. Unlike the conventional clustering approach, in which the mean of all vector data in a cluster serves as the cluster representative point, this representative point can accurately reflect the distribution characteristics of all the distribution data in the cluster. Consequently, the similarity between distribution data calculated with the summarized histogram as the cluster representative point is an accurate indicator of how closely the distribution characteristics of a given distribution data item match those of the cluster as a whole. The data classification device can therefore accurately classify the set of distribution data into the predetermined number of clusters by assigning every distribution data item to a cluster so that the similarity value between distribution data (a distance-type measure, where a smaller value means greater similarity) is minimized. As the method of assigning distribution data to clusters, the k-means method can be used, for example.

The data classification device according to claim 2 is the data classification device according to claim 1, wherein the similarity calculation means calculates initial values of the similarity between distribution data based on the histograms of the predetermined number of distribution data selected in advance from all the distribution data to be classified and on the histogram of each distribution data item to be classified, and the device further comprises cluster initialization means for classifying all the distribution data to be classified into the predetermined number of clusters based on those initial values, thereby determining the distribution data that initially belong to the predetermined number of clusters.

The data classification method according to claim 6 is the data classification method according to claim 5, further comprising a cluster initialization step in which, before the data summarization step, the similarity calculation means calculates initial values of the similarity between distribution data based on the histograms of the predetermined number of distribution data selected in advance from all the distribution data to be classified and on the histogram of each distribution data item to be classified, and cluster initialization means classifies all the distribution data to be classified into the predetermined number of clusters based on those initial values, thereby determining the distribution data that initially belong to the predetermined number of clusters.

According to the data classification device of claim 2 or the data classification method of claim 6, the data classification device can create an initial state for each cluster before summarizing the histogram data of the distribution data into cluster representative points. A set of distribution data can therefore be classified easily and quickly.

The data classification device according to claim 3 is the data classification device according to claim 1 or claim 2, further comprising histogram creation means for creating a histogram from the distribution data to be classified.

The data classification method according to claim 7 is the data classification method according to claim 5 or claim 6, further comprising a histogram creation step in which histogram creation means creates a histogram from the distribution data to be classified.

According to the data classification device of claim 3 or the data classification method of claim 7, the data classification device can create histograms from distribution data, so histograms are created automatically simply by inputting the distribution data. The input set of distribution data can therefore be classified easily and quickly.

The data classification device according to claim 4 is the data classification device according to any one of claims 1 to 3, further comprising data conversion means for converting the bucket data of the histogram of the distribution data into frequency-domain data, wherein the similarity calculation means calculates the similarity between distribution data based on the data converted into the frequency domain.

The data classification method according to claim 8 is the data classification method according to any one of claims 5 to 7, further comprising a data conversion step in which data conversion means converts the bucket data of the histogram into frequency-domain data, wherein the similarity calculation step calculates the similarity between distribution data based on the data converted into the frequency domain.

According to the data classification device of claim 4 or the data classification method of claim 8, the data classification device converts the bucket data of the histogram of the distribution data into frequency-domain data. The frequency-domain data then reflect the bucket data, which represent the distribution of counts; among the frequency-domain data, those reflecting small counts contribute little overall, so an approximation that excludes them from the calculation becomes possible. By ignoring the frequency-domain data with small contributions when calculating the similarity between distribution data, a set of distribution data can be classified at high speed. The wavelet transform, for example, can be used to convert the data into the frequency domain.
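The conversion and truncation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the choice of the orthonormal Haar wavelet, the function names `haar_transform` and `keep_largest`, the rule of keeping the coefficients of largest magnitude, and the sample histogram are all assumptions made for the example.

```python
import math


def haar_transform(values):
    """Orthonormal Haar wavelet transform of a list whose length is a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    coeffs = list(values)
    length = n
    while length > 1:
        half = length // 2
        # Pairwise scaled sums (coarse part) and differences (detail part).
        averages = [(coeffs[2 * i] + coeffs[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        details = [(coeffs[2 * i] - coeffs[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        coeffs[:length] = averages + details
        length = half
    return coeffs


def keep_largest(coeffs, c):
    """Zero out all but the c coefficients of largest magnitude (assumed truncation rule)."""
    keep = set(sorted(range(len(coeffs)), key=lambda i: -abs(coeffs[i]))[:c])
    return [v if i in keep else 0.0 for i, v in enumerate(coeffs)]


# Hypothetical normalized histogram with m = 8 buckets.
hist = [0.1, 0.1, 0.3, 0.3, 0.05, 0.05, 0.05, 0.05]
coeffs = haar_transform(hist)
approx = keep_largest(coeffs, 2)
```

Because the transform is orthonormal, the sum of squares of the coefficients equals that of the bucket data, so dropping small coefficients discards only a small share of the total. How the similarity itself is evaluated on the truncated coefficients is not covered by this sketch.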

The data classification program according to claim 9 causes a computer to execute the data classification method according to any one of claims 5 to 8. With this configuration, a computer on which the program is installed can realize each function based on the program.

The computer-readable recording medium according to claim 10 has the data classification program according to claim 9 recorded on it. With this configuration, a computer in which the recording medium is loaded can realize each function based on the program recorded on the medium.

According to the present invention, a set of distribution data can be classified at high speed.

The best mode for carrying out the data classification device and data classification method of the present invention (hereinafter referred to as the "embodiment") is described in detail below with reference to the drawings.

(First embodiment)
[Configuration of data classification device]
FIG. 1 is a block diagram schematically showing the configuration of the data classification device according to the first embodiment of the present invention. The data classification device 1 takes a set of n multidimensional distribution data (hereinafter simply called distribution data), each indicating the distribution of a set of vector data in a multidimensional space, classifies the set into a predetermined number (k) of clusters based on the similarity between distribution data using the histogram of each distribution data item, and outputs a distribution data classification list indicating the classification result.

The data classification device 1 is composed of, for example, a CPU (Central Processing Unit), RAM (Random Access Memory), ROM (Read Only Memory), and an HDD (Hard Disk Drive). As shown in FIG. 1, it broadly comprises input/output means 2, storage means 3, and control means 4.

The input/output means 2 is composed of a predetermined input/output interface, communication interface, and the like, and inputs predetermined information and commands and outputs predetermined information. For example, the input/output means 2 receives a set of distribution data via a communication network or the like. It also receives the number of clusters k entered from an input device (not shown) such as a mouse or keyboard.

The storage means 3 comprises a RAM 31, a ROM 32, and histogram storage means 33 and classification cluster storage means 34, each consisting of an HDD or the like.
The RAM 31 is used for arithmetic processing by the control means 4 and stores information acquired through the input/output means 2. The ROM 32 stores predetermined programs and the like.
The histogram storage means 33 stores the histogram of each distribution data item created by the histogram creation means 41 described later, together with the identification information of that distribution data item. The histogram storage means 33 may also store the input distribution data themselves.
The classification cluster storage means 34 stores each cluster created by the cluster initialization means 43 and the cluster update means 45 described later, together with the identification information of the distribution data. The classification cluster storage means 34 also stores, for each cluster, the histogram data summarized by the data summarization means 44 described later.

The control means 4 is composed of, for example, a CPU, and controls the input/output means 2 and the storage means 3. As shown in FIG. 1, it comprises histogram creation means 41, similarity calculation means 42, cluster initialization means 43, data summarization means 44, cluster update means 45, reclassification determination means 46, and classification result output means 47.

<Histogram creation means>
The histogram creation means 41 creates a histogram from each input distribution data item to be classified. For one-dimensional data, for example, a histogram divides the range in which the data exist into several (for example, α) equal intervals (hereinafter called buckets) and records the count of data falling in each bucket. In the one-dimensional case, the histogram is thus expressed as a bar graph with α bars. In the following, the histogram is extended to d dimensions (d = 1, 2, 3, ...). When each axis of the d-dimensional space is divided into α intervals, the number of buckets m in the histogram is given by Formula (1).

$$ m = \alpha^{d} \tag{1} $$

FIG. 2 is an explanatory diagram showing examples of histograms.
Here, α = 2 for simplicity. With d = 1 in Formula (1), the number of buckets m is 2 (m = 2). In this case, as shown in FIG. 2(a), for example, the data of bucket B₁ (a one-dimensional bucket) covering the interval 0 ≤ x < 5 is 1, and the data of bucket B₂ covering the interval 5 ≤ x < 10 is 2.

With d = 2 in Formula (1), the number of buckets m is 4 (m = 4). In this case, as shown in FIG. 2(b), there are, for example, bucket B₁₁ (a two-dimensional bucket) covering 0 ≤ x < 5 and 0 ≤ y < 5, bucket B₁₂ covering 5 ≤ x < 10 and 0 ≤ y < 5, bucket B₂₁ covering 0 ≤ x < 5 and 5 ≤ y < 10, and bucket B₂₂ covering 5 ≤ x < 10 and 5 ≤ y < 10. The data of each bucket are not shown in FIG. 2(b), but they could be drawn, for example, as bars perpendicular to the page.

With d = 3 in Formula (1), the number of buckets m is 8 (m = 8). In this case, as shown in FIG. 2(c), there are, for example, bucket B₁₁₁ (a three-dimensional bucket) covering 0 ≤ x < 5, 0 ≤ y < 5, and 0 ≤ z < 5, and so on up to bucket B₂₂₂ covering 5 ≤ x < 10, 5 ≤ y < 10, and 5 ≤ z < 10. The data of each bucket may be drawn, for example, as a bar inside the cube representing the bucket, but they need not be displayed as a bar graph.

Likewise, for d = 4 or more in Formula (1), the histogram has m d-dimensional buckets. That is, a histogram has m values as its bucket data. When the count in a bucket is 0, that bucket's data has the value 0, and the number of bucket data remains m. The histogram creation means 41 therefore converts one distribution data item into one histogram having m values (including zeros), and converts a set of n distribution data into n histograms, each having m values (including zeros).
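As a rough illustration of the conversion just described, the sketch below counts d-dimensional vectors into α^d equal-width buckets. The function name `build_histogram`, the row-major bucket ordering, the clamping of out-of-range values, and the sample points are assumptions made for the example; the patent text does not specify them.

```python
def build_histogram(vectors, lower, upper, alpha):
    """Count d-dimensional vectors into alpha**d equal-width buckets.

    lower/upper give the data range per axis; buckets are stored in a flat
    list in row-major order (an assumed layout).
    """
    d = len(lower)
    counts = [0] * (alpha ** d)  # m = alpha^d buckets, zeros included
    for v in vectors:
        index = 0
        for axis in range(d):
            width = (upper[axis] - lower[axis]) / alpha
            cell = int((v[axis] - lower[axis]) // width)
            cell = min(max(cell, 0), alpha - 1)  # clamp boundary values into range
            index = index * alpha + cell
        counts[index] += 1
    return counts


# One-dimensional case modeled on FIG. 2(a): alpha = 2 over 0 <= x < 10.
hist_1d = build_histogram([(2.0,), (6.0,), (9.0,)], lower=(0.0,), upper=(10.0,), alpha=2)

# Three-dimensional case: alpha = 2 gives m = 2**3 = 8 buckets.
hist_3d = build_histogram([(1.0, 1.0, 1.0), (9.0, 9.0, 9.0)],
                          lower=(0.0, 0.0, 0.0), upper=(10.0, 10.0, 10.0), alpha=2)
```

With these hypothetical points, the one-dimensional histogram has one value in the bucket for 0 ≤ x < 5 and two in the bucket for 5 ≤ x < 10, matching the bucket data given for FIG. 2(a).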

<Similarity calculation means>
The similarity calculation means 42 calculates the similarity between distribution data based on the histogram of a distribution data item to be classified that currently belongs to one of the predetermined number (k) of clusters and on the histograms respectively representing the predetermined number of clusters.
In the present embodiment, the similarity between distribution data is expressed as the distance between the distribution data, and the distance between distribution data is defined as the distance between their histograms. The similarity calculation means 42 calculates the distance between histograms as the symmetric KL information amount between them.

The similarity calculation means 42 also calculates the initial values of the symmetric KL information amount between distribution data based on the histograms of the k initial cluster representatives (the initial values of the clusters) set up by the cluster initialization means 43 described later and on the histogram of each distribution data item. In addition, the similarity calculation means 42 calculates the symmetric KL information amount between distribution data based on the histograms summarized by the data summarization means 44 described later and on the histogram of each distribution data item.

The symmetric KL information amount is now explained. A histogram P having m bucket data p₁, ..., p_m is written P = (p₁, ..., p_m). Similarly, a histogram Q having m bucket data q₁, ..., q_m is written Q = (q₁, ..., q_m). The KL information amount is a known measure of the similarity between two distributions. It is described, for example, in Richard O. Duda, Peter E. Hart and David G. Stork, "Pattern Classification", Wiley-Interscience, October 2000.
The KL information amount d_KL(P′, Q′) between a probability distribution P′ and a probability distribution Q′ is expressed by Formula (2). This KL information amount d_KL represents the distance from the probability distribution P′ to the probability distribution Q′. Here, the probability distributions P′ and Q′ correspond to the discrete histograms P and Q, and are assumed to be expressible as continuous values p_x and q_x, respectively, as functions of x.

However, as Formula (3) shows, the KL information amount d_KL(P′, Q′) is not a symmetric measure with respect to the probability distributions P′ and Q′.

In contrast, the symmetric KL information amount d_SKL(P′, Q′) shown in Formula (4) is symmetric with respect to the probability distributions P′ and Q′.
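The images of Formulas (2) to (4) are not reproduced in this text. Under the standard textbook definition of the KL information amount (Kullback-Leibler divergence), and assuming the symmetric version is the plain sum of the two directed amounts (some references add a factor of 1/2), they would read roughly as follows. This is a reconstruction under those assumptions, not the wording of the original drawings.

```latex
% Formula (2): KL information amount from P' to Q' (standard definition)
d_{KL}(P', Q') = \int p_x \log\frac{p_x}{q_x}\,dx

% Formula (3): lack of symmetry
d_{KL}(P', Q') \neq d_{KL}(Q', P')

% Formula (4): symmetric KL information amount (assumed to be the unweighted sum)
d_{SKL}(P', Q') = d_{KL}(P', Q') + d_{KL}(Q', P')
                = \int (p_x - q_x) \log\frac{p_x}{q_x}\,dx
```

The last equality follows by writing the second integral as the integral of -q_x log(p_x / q_x) and combining the two integrands.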

In the present embodiment, based on the symmetric KL information amount d_SKL(P′, Q′) between the probability distributions P′ and Q′ in Formula (4), the symmetric KL information amount d_SKL(P, Q) of the discrete histograms P and Q is adopted, as shown in Formula (5). The bucket data of the histograms P and Q are normalized as shown in Formula (6).
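A discrete version of this distance can be sketched as below. Since the images of Formulas (5) and (6) are not reproduced here, the sketch assumes the usual discrete form, d_SKL(P, Q) = Σ_i (p_i − q_i) log(p_i / q_i), with bucket data normalized so that Σ_i p_i = Σ_i q_i = 1. The function names and the small constant `eps`, added to avoid taking the logarithm of zero for empty buckets, are assumptions of this example and do not come from the patent text.

```python
import math


def normalize(buckets):
    """Scale bucket counts so they sum to 1 (assumed form of Formula (6))."""
    total = sum(buckets)
    if total <= 0:
        raise ValueError("histogram must contain at least one count")
    return [b / total for b in buckets]


def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL information amount between two normalized histograms.

    Assumed discrete form: sum_i (p_i - q_i) * log(p_i / q_i).
    eps is a hypothetical smoothing constant that guards against empty buckets.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of buckets")
    return sum((pi - qi) * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


p = normalize([1, 2, 1])
q = normalize([2, 1, 1])
distance_pq = symmetric_kl(p, q)
distance_qp = symmetric_kl(q, p)
```

Every term in the sum is non-negative, because p_i − q_i and log(p_i / q_i) always have the same sign, so the distance is zero only when the two histograms coincide.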

<Cluster initialization means>
The cluster initialization means 43 classifies all the distribution data to be classified into the predetermined number (k) of clusters based on the initial values of the similarity between distribution data calculated by the similarity calculation means 42, thereby determining the distribution data that initially belong to the predetermined number of clusters.
In the present embodiment, the cluster initialization means 43 selects k distribution data as the initial cluster representatives. It then classifies each distribution data item into one of the k initial clusters based on the initial values of the symmetric KL information amount. In the present embodiment, the classification (clustering) method of the cluster initialization means 43 is based on the k-means method. Various clustering methods have been proposed, and methods other than k-means can also be used to cluster a set of distribution data.

<Data summarization means>
The data summarization means 44 summarizes, for each of the predetermined number (k) of clusters into which all the distribution data to be classified have been classified in advance, the data of all the previously created histograms of all the distribution data currently belonging to that cluster, producing a histogram that represents the cluster. Summarization is explained next. In the following, the summary of the bucket data of a histogram P and the bucket data of a histogram Q is simply called the summary of histogram P and histogram Q. The summary R of histogram P and histogram Q is given by Formula (7).

R = (P + Q) / 2 … Equation (7)

Here, the summary R has data r_1, …, r_m for m buckets, corresponding to the histograms P and Q. This is written as R = (r_1, …, r_m). Equation (7) can then be rewritten bucket by bucket as Equation (8).

r_i = (p_i + q_i) / 2 … Equation (8)

In the same way as the summary of two histograms, the summary R of n histograms P_j (j = 1, …, n) is defined by Equation (9).
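Equation (9) is not reproduced in this text. Assuming it extends Equations (7) and (8) to the bucket-wise arithmetic mean of the n histograms, the summary could be sketched as follows. The function name `summarize` and the sample values are hypothetical.

```python
def summarize(histograms):
    """Bucket-wise mean of n histograms that share the same m buckets.

    For n = 2 this reduces to Equation (8): r_i = (p_i + q_i) / 2.
    Assumption: Equation (9) is the n-histogram analogue of that mean.
    """
    if not histograms:
        raise ValueError("need at least one histogram")
    m = len(histograms[0])
    if any(len(h) != m for h in histograms):
        raise ValueError("all histograms must have the same number of buckets")
    n = len(histograms)
    return [sum(h[i] for h in histograms) / n for i in range(m)]


P = [4, 0, 6]
Q = [2, 2, 6]
R = summarize([P, Q])  # bucket-wise mean of P and Q
```

For the two histograms above, each bucket of `R` is the average of the corresponding buckets of `P` and `Q`, as in Equation (8).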

<Cluster update means>
The cluster update means (classification means) 45 classifies the distribution data to be classified into the predetermined number (k) of clusters based on the similarity calculated by the similarity calculation means 42. Specifically, it reassigns each item of distribution data to one of the k clusters based on that similarity, thereby updating the clusters. In this embodiment, the cluster update means 45 reassigns each item of distribution data to one of the k clusters based on the symmetric KL divergence and updates the clusters. The classification (clustering) performed by the cluster update means 45 is based on the k-means method. Various clustering methods have been proposed, and methods other than k-means can also be used to cluster a set of distribution data. The cluster update means 45 and the cluster initialization means 43 may also be implemented as a single means rather than separately.

<Reclassification determination means>
The reclassification determination means 46 determines whether the cluster to which each item of distribution data belongs has changed between before and after the cluster update means 45 updates the clusters. In this embodiment, it determines whether the assignments of all distribution data are unchanged and outputs the result to the classification result output means 47.

<Classification result output means>
The classification result output means 47 outputs, as the classification result, information on the distribution data currently belonging to the k clusters when the reclassification determination means 46 determines that the assignment has not changed for any of the distribution data to be classified. In this embodiment, it outputs a distribution data classification list when no assignment has changed. The distribution data classification list is, for example, a list giving the identification information of the distribution data for each cluster.
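The unchanged-assignment check and the classification list can be pictured with a small sketch. It assumes integer cluster labels 0 to k-1 and string identifiers; the function names and the identifiers are hypothetical.

```python
def unchanged(previous, current):
    """True only when no item of distribution data moved to another cluster."""
    return previous == current


def classification_list(ids, assignment, k):
    """Group distribution-data identifiers by the cluster they belong to."""
    clusters = {c: [] for c in range(k)}
    for item_id, cluster in zip(ids, assignment):
        clusters[cluster].append(item_id)
    return clusters


ids = ["company-a", "company-b", "company-c"]
before = [1, 0, 1]
after = [1, 0, 1]
result = classification_list(ids, after, k=2) if unchanged(before, after) else None
```

The list is produced only when the check succeeds, which mirrors the condition under which the classification result is output.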

Note that the histogram creation means 41, similarity calculation means 42, cluster initialization means 43, data summarization means 44, cluster update means 45, reclassification determination means 46, and classification result output means 47 of the control means 4 are realized by a CPU loading a predetermined program stored in the HDD or the like of the storage means 3 into the RAM 31 and executing it.

[Operation of the data classification device]
The operation of the data classification device 1 shown in FIG. 1 is described with reference to FIG. 3 (and FIG. 1 as appropriate). FIG. 3 is a flowchart of the process by which the data classification device shown in FIG. 1 classifies distribution data into clusters. It is assumed that the number of clusters k into which the set of distribution data is to be classified has been input in advance. When a set of distribution data is input via the input/output means 2, the data classification device 1 uses the histogram creation means 41 to create a histogram from each input item of distribution data (step S1: histogram creation step) and stores each histogram, together with the identification information of its distribution data, in the histogram storage means 33. The cluster initialization means 43 then selects k items of distribution data as initial cluster representatives (step S2) and stores their identification information as the initial clusters in the classification cluster storage means 34. Specifically, k (for example, 3) items are selected at random from the n (for example, 1000) items of distribution data and used as provisional cluster representatives.

Next, the similarity calculation means 42 calculates the initial values of the symmetric KL divergence between distribution data from the histograms of the k initial cluster representatives and the histogram of each item of distribution data (step S3). Specifically, for one item of distribution data, it calculates the k (for example, 3) initial values of the symmetric KL divergence between that item's histogram and the histograms of the k representatives, and then does the same for each remaining item. The cluster initialization means 43 then classifies each item of distribution data into one of the k initial clusters based on these initial values (step S4: cluster initialization step). Specifically, each item is assigned to the cluster for which the initial value of the symmetric KL divergence is smallest. Steps S3 and S4 may instead be executed together for each item of distribution data in turn. The processing so far provisionally forms k clusters from the set of n items of distribution data (the initial clusters are determined).

Subsequently, the data summarization means 44 summarizes, for each cluster, the histogram data of all distribution data currently belonging to that cluster according to Equation (9) (step S5: data summarization step). The similarity calculation means 42 then calculates the symmetric KL divergence between distribution data from the summarized histograms and the histogram of each item of distribution data (step S6: similarity calculation step). The cluster update means 45 reassigns each item of distribution data to one of the k clusters based on the calculated symmetric KL divergence and updates the clusters (step S7: classification step). Specifically, each item is assigned to the cluster for which the symmetric KL divergence is smallest. The reclassification determination means 46 then determines whether the assignments of all distribution data are unchanged (step S8: reclassification determination step). If the assignment of any item has changed (step S8: No), the process returns to step S5. If no assignment has changed (step S8: Yes), the classification result output means 47 outputs the distribution data classification list for the current clusters from the input/output means 2 (step S9: classification result output step).
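Steps S2 to S8 can be sketched as a k-means-style loop over histograms. This is an illustration under stated assumptions, not the patented implementation: the symmetric KL divergence is taken in its standard form, sum over i of (p_i - q_i) log(p_i / q_i), because Equation (5) is not reproduced here; the cluster summary is the bucket-wise mean (an assumed reading of Equation (9)); and the `eps` smoothing, the `initial` and `max_iter` parameters, and all function names are hypothetical additions.

```python
import math
import random


def symmetric_kl(p, q, eps=1e-9):
    """Symmetric KL divergence between two histograms with the same buckets.

    Assumed form: sum_i (p_i - q_i) * log(p_i / q_i) on normalized values.
    eps is added to every bucket so empty buckets do not hit log(0);
    this smoothing is an assumption of the sketch.
    """
    ps = [x + eps for x in p]
    qs = [x + eps for x in q]
    sp, sq = sum(ps), sum(qs)
    return sum((a / sp - b / sq) * math.log((a / sp) / (b / sq))
               for a, b in zip(ps, qs))


def summarize(histograms):
    """Bucket-wise mean of the histograms in one cluster (step S5)."""
    n = len(histograms)
    return [sum(col) / n for col in zip(*histograms)]


def nearest(h, representatives):
    """Index of the representative with the smallest divergence from h."""
    distances = [symmetric_kl(h, r) for r in representatives]
    return distances.index(min(distances))


def classify(histograms, k, initial=None, seed=0, max_iter=100):
    """Cluster histograms into k groups following steps S2 to S8."""
    if initial is None:                                   # step S2
        initial = random.Random(seed).sample(range(len(histograms)), k)
    reps = [histograms[i] for i in initial]
    assignment = [nearest(h, reps) for h in histograms]   # steps S3, S4
    for _ in range(max_iter):                             # guard against cycling
        for c in range(k):                                # step S5
            members = [h for h, a in zip(histograms, assignment) if a == c]
            if members:      # an empty cluster keeps its old representative
                reps[c] = summarize(members)
        updated = [nearest(h, reps) for h in histograms]  # steps S6, S7
        if updated == assignment:                         # step S8: Yes
            break
        assignment = updated                              # step S8: No
    return assignment


histograms = [[10, 0, 0], [9, 1, 0], [8, 2, 0],
              [0, 0, 10], [0, 1, 9], [0, 2, 8]]
labels = classify(histograms, k=2, initial=[0, 3])
```

Passing `initial` fixes the representatives chosen in step S2 so the run is reproducible; leaving it as `None` selects them at random, as in the description.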

The data classification device 1 can also be realized by having a general-purpose computer execute a data classification program that causes it to perform the steps described above. This program can be distributed over a communication line, or written to a recording medium such as a CD-ROM and distributed.

The data classification device 1 of this embodiment can be applied to the analysis of sets of distribution data in many fields, for example business, finance, medicine, facilities, meteorology, earth science, information and communications, and multimedia. Applied to electronic commerce in the business field, for instance, it can classify customers by their purchase histories. Specifically, when a customer purchases several products over a communication network, data such as product price, purchase quantity, number of times the customer browsed before purchasing, and browsing time can be expressed as a multidimensional vector. The distribution data that these vectors generate over time constitute the purchase history of that one customer. The purchase histories of many (n) customers are therefore expressed as a set of distribution data. Information obtained by classifying (clustering) this set can be used for marketing or for discovering various rules. Conventional clustering methods, by contrast, are all designed to classify sets of vector data, so even when applied to electronic commerce it is difficult for them to classify customers finely or quickly by the purchase history of a product.

According to the first embodiment, when classifying the distribution data, the data classification device 1 uses, as each cluster's representative point, a histogram that summarizes the histogram data of the distribution data, and calculates the symmetric KL divergence between the histograms corresponding to the distribution data as the similarity between distribution data. A set of distribution data can therefore be classified accurately into a predetermined number of clusters.

(Second Embodiment)
In the second embodiment, as in the first, the similarity between distribution data is expressed as a distance between them, but this distance is defined as the distance between data obtained by converting the histogram bucket data into frequency-domain data.

[Configuration of the data classification device]
FIG. 4 is a block diagram schematically showing the configuration of the data classification device according to the second embodiment of the present invention. The data classification device 1A has the same configuration as the data classification device 1 shown in FIG. 1, except that the control means 4 further includes a data conversion means 51 and the storage means 3 further includes a wavelet coefficient value storage means 52. Identical components are therefore given the same reference numerals and their description is omitted.

<Data conversion means>
The data conversion means 51 converts the bucket data of the histogram of each item of distribution data into frequency-domain data. In this embodiment, it calculates wavelet coefficient values by applying a wavelet transform to the bucket data.
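The text does not say which wavelet is used. As one concrete possibility, the following sketch applies the orthonormal Haar transform to a flattened list of bucket values. The choice of Haar, the power-of-two length requirement, and the name `haar_transform` are assumptions of this example.

```python
import math


def haar_transform(values):
    """Orthonormal Haar wavelet transform of a list whose length is a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = [float(v) for v in values]
    root2 = math.sqrt(2.0)
    length = n
    while length > 1:
        half = length // 2
        # Pairwise scaled sums (coarse part) and differences (detail part).
        sums = [(out[2 * i] + out[2 * i + 1]) / root2 for i in range(half)]
        diffs = [(out[2 * i] - out[2 * i + 1]) / root2 for i in range(half)]
        out[:length] = sums + diffs
        length = half
    return out


buckets = [4, 2, 5, 5]
coefficients = haar_transform(buckets)
```

Because this transform is orthonormal, the sum of squared coefficients equals the sum of squared bucket values, so "energy" in the coefficient domain has a direct meaning in the bucket domain.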

The wavelet transform performed by the data conversion means 51 is described with reference to Equations (10) to (17). Hereinafter, the result of wavelet-transforming the bucket data of a distribution data histogram is called a wavelet coefficient value, the wavelet transform of a distribution data histogram is simply called a wavelet, and the logarithm of the data in each bucket of a histogram is simply called the logarithmic value of the histogram. First, the expression for the logarithmic value of the histogram P is defined by Equation (10). The logarithmic value of the data in each bucket of the histogram P in Equation (10) is then given by Equation (11).

The expression for the wavelet of the histogram P is defined by Equation (12). Likewise, the expression for the wavelet of the logarithmic value of the histogram P in Equation (10) is defined by Equation (13).

Rewriting the symmetric KL divergence d_SKL(P, Q) of the histograms P and Q given in Equation (5) using the relations in Equations (11) to (13) yields Equations (14) and (15).

The histogram of an item of distribution data has as many wavelet coefficient values as it has buckets (m). In some buckets the frequency is small (the value is 0). Therefore, this embodiment does not use all m wavelet coefficient values; it selects a small number (< m) of coefficient values with relatively large energy (large absolute value) and calculates the symmetric KL divergence from the selected values. Specifically, for example, f_pq(t) in Equation (15) is calculated when either w_p(t) or w_q(t) has been selected; the remaining coefficient values have only small energy (small absolute value) and are not used in calculating the symmetric KL divergence. The symmetric KL divergence approximated in this way, shown in Equation (16), is also called the approximate symmetric KL divergence of P and Q. In this embodiment, the similarity calculation means 42 therefore calculates the distance between distribution data as the approximate symmetric KL divergence d_SKL(P, Q) given by Equation (16), where f_pq(t) is defined by Equation (15). Hereinafter, the approximate symmetric KL divergence is simply called the symmetric KL divergence.
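The selection step can be sketched as follows. The sketch only picks the positions t at which f_pq(t) would be evaluated, namely those where either w_p(t) or w_q(t) is among the largest in absolute value. Equations (15) and (16) are not reproduced here, so f_pq itself is left out. The function names, the parameter `c` (number of coefficients kept per histogram), and the sample values are hypothetical.

```python
def top_positions(coefficients, c):
    """Positions of the c coefficient values with the largest absolute value."""
    order = sorted(range(len(coefficients)),
                   key=lambda t: abs(coefficients[t]),
                   reverse=True)
    return set(order[:c])


def selected_positions(w_p, w_q, c):
    """Positions where w_p(t) or w_q(t) was selected; other positions are skipped."""
    return top_positions(w_p, c) | top_positions(w_q, c)


w_p = [8.0, -2.0, 1.4, 0.0]
w_q = [0.1, 0.2, -5.0, 3.0]
positions = selected_positions(w_p, w_q, c=2)
```

Because the two wavelets generally keep different positions, the union can contain up to 2c entries, which is still small when c is much less than m.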

The positions of the wavelet coefficient values selected in Equation (16) (the positions of the corresponding buckets) differ between W_p, the wavelet given by Equation (12), and W_q, the corresponding wavelet for the histogram Q. That is, when the wavelet coefficient value w_q(t) of the histogram Q has been selected but the wavelet coefficient value w_p(t) of the histogram P has not, the relation in Equation (17) is assumed to hold between the unselected wavelet coefficient value of the histogram P and its logarithmic value.

In this embodiment, the cluster initialization means 43 and the cluster update means 45 classify the distribution data into the predetermined number (k) of clusters based on the symmetric KL divergence that the similarity calculation means 42 calculates from the wavelet coefficient values, instead of the symmetric KL divergence between histograms.

Also in this embodiment, the data summarization means 44 obtains the summary W_R of the histogram wavelets according to Equation (18), instead of the histogram summary R given by Equations (7) and (8).

W_R = (W_p + W_q) / 2 … Equation (18)

Here, the wavelet summary W_R has m wavelet coefficient values w_r(1), …, w_r(m), corresponding to the wavelets W_p and W_q of the histograms P and Q. This is written as W_R = (w_r(1), …, w_r(m)). Equation (18) can then be rewritten coefficient by coefficient as Equation (19).

w_r(i) = (w_p(i) + w_q(i)) / 2 … Equation (19)

In the same way as the summary of two wavelets, the summary W_R of n wavelets W_j (j = 1, …, n) is defined by Equation (20).
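Equation (20) is not reproduced in this text. Assuming it extends Equations (18) and (19) from two wavelets to n in the same way, it would be the coefficient-wise mean, where w_j(i) (a symbol introduced here for illustration) denotes the i-th coefficient value of W_j:

```latex
W_R = \frac{1}{n} \sum_{j=1}^{n} W_j ,
\qquad
w_r(i) = \frac{1}{n} \sum_{j=1}^{n} w_j(i), \quad i = 1, \dots, m
```

Under that assumption, and because a wavelet transform is linear, the mean of the wavelets equals the wavelet of the bucket-wise mean of the underlying histograms.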

<Wavelet coefficient value storage means>
The wavelet coefficient value storage means 52 stores the wavelet coefficient values, calculated by the data conversion means 51, of the histograms corresponding to the distribution data, and consists of, for example, an ordinary hard disk.

[Operation of the data classification device]
The operation of the data classification device 1A shown in FIG. 4 is described with reference to FIG. 5 (and FIG. 4 as appropriate). FIG. 5 is a flowchart of the process by which the data classification device shown in FIG. 4 classifies distribution data into clusters. The data classification device 1A uses the histogram creation means 41 to create a histogram from each input item of distribution data (step S21) and uses the data conversion means 51 to wavelet-transform the bucket data of each histogram (step S22: data conversion step). It then executes steps S23 to S30, which are the same as steps S2 to S9 shown in FIG. 3 except for the processing involved in calculating the symmetric KL divergence, summarizing the data, and classifying. Specifically, steps S24 to S28 performed by the data classification device 1A are the same as steps S3 to S7 performed by the data classification device 1 of the first embodiment, except that they operate on the wavelet coefficient values of the histogram bucket data instead of on the histograms. A detailed description of the operation of the data classification device 1A is therefore omitted.

According to the second embodiment, the similarity between distribution data is calculated while ignoring those wavelet coefficient values, obtained by wavelet-transforming the histogram bucket data, whose contribution is small, so a set of distribution data can be classified at high speed. For example, when each axis of a d-dimensional space is divided into α intervals to create a d-dimensional histogram, the number of buckets is m = α^d, as shown in Equation (1). The number of buckets grows exponentially with the number of dimensions, and computational cost and memory consumption grow dramatically as a result. In this embodiment, however, calculating the approximate symmetric KL divergence speeds up processing, which reduces computational cost and memory consumption.
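The growth m = α^d is easy to tabulate. A minimal sketch, with the hypothetical function name `bucket_count`:

```python
def bucket_count(alpha, d):
    """Number of histogram buckets when each of d axes is split into alpha intervals."""
    return alpha ** d


# Bucket counts for alpha = 8 as the number of dimensions grows.
for d in range(1, 5):
    print(d, bucket_count(8, d))
```

With α = 8 the counts for d = 1 to 4 are 8, 64, 512, and 4096, so each added dimension multiplies the number of buckets, and hence the number of wavelet coefficient values, by α.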

The embodiments of the present invention have been described above, but the invention is not limited to them and can be practiced within a scope that does not depart from its spirit. For example, the data classification device 1 (1A) has been described as calculating the distance between distribution data based on the symmetric KL divergence, but it is not limited to this. Likewise, the data classification device 1A has been described as wavelet-transforming the histogram bucket data with the data conversion means 51, but any method that converts the data into frequency-domain data may be used; the conversion is not limited to the wavelet transform.

To confirm the effect of the present invention, a set of distribution data was classified using the data classification devices 1 and 1A of the embodiments. Hereinafter, the clustering method using the data classification device 1 is called the generalized method, and the clustering method using the data classification device 1A is called the accelerated method.
Each of the data classification devices 1 and 1A was configured as a Linux (registered trademark) machine with 2 GB of memory and an Intel Xeon 2.53 GHz CPU. The number n of items of distribution data was 1000 and the number of clusters k was 3. In both devices, the number of histogram buckets was varied from 500 to 5000. In the data classification device 1A, the total number of wavelet coefficient values was varied from 1000 to 4000 and the number of wavelet coefficient values adopted was varied from 50 to 600.

The vector data underlying each item of distribution data are artificial stock data for a company, generated by a simulation program, and consist of the day's opening price, closing price, and trading volume. Each item of distribution data is three-dimensional distribution data formed from 100 days of statistics on one company's stock price and trading volume (FIG. 6 shows two-dimensional distribution data). The set of distribution data consists of the 100-day stock price and trading volume histories of 1000 companies. The data classification devices 1 and 1A classified the 1000 companies into three clusters (first, second, and third) whose stock price and trading volume histories are similar to one another. FIG. 6 shows an example of the result as two-dimensional distribution data of closing price and trading volume.

In each graph in FIGS. 6(a) to 6(i), the horizontal axis is the closing price and the vertical axis is the trading volume, and 100 points (shown as +) represent one company's stock price and trading volume over 100 days. FIGS. 6(a) to 6(i) show the distribution data of 9 of the 1000 companies. The first cluster of the classification result comprises the graphs in FIGS. 6(a), 6(b), and 6(c); the second cluster comprises FIGS. 6(d), 6(e), and 6(f); and the third cluster comprises FIGS. 6(g), 6(h), and 6(i). As FIGS. 6(a) to 6(i) show, the three clusters are appropriately classified according to the similarity of the distributions. When each item of distribution data has 100 points (shown as +) in this way, the data values (frequencies) of the buckets of its histogram sum to 100.

<Relationship between number of buckets and calculation time>
For three-dimensional distribution data, dividing each of the three axes into 8 intervals gives a histogram with m = 512 (8 × 8 × 8) buckets. Likewise, dividing each axis into 16 intervals gives m = 4096 (16 × 16 × 16) buckets. As shown in FIG. 7, when the number of buckets m increases 8-fold, the calculation time increases about 12-fold. Regardless of m, the calculation time of the accelerated method was about half that of the generalized method.

<Relationship between the number of wavelet coefficient values adopted and the accuracy of the symmetric KL divergence>
FIG. 8 graphs the error rate obtained in the data classification device 1A when the total number of calculated wavelet coefficient values was 1000 and the number actually adopted was varied. The error rate is defined as the proportion by which the value of the symmetric KL divergence obtained by the accelerated method differs from the value obtained by the generalized method. The number of wavelet coefficient values adopted was varied through 50, 100, 200, 300, 400, 500, and 600. As FIG. 8 shows, when 5% (200) of all wavelet coefficient values were adopted, the error rate was 4.5%, and the symmetric KL divergence that the accelerated method uses as its similarity criterion had the same accuracy as the one used by the generalized method.

Apart from this error in the symmetric KL divergence, classification errors in the clustering result were as follows. When 50 of 4000 wavelet coefficient values were adopted, 95.4% of the entries in the distribution data classification list obtained by the accelerated method matched the entries in the list obtained by the generalized method. In other words, the accelerated method classifies a set of distribution data into clusters with substantially the same accuracy as the generalized method.
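The text does not state how the match rate between the two classification lists was computed. One plausible way, sketched here as an assumption, is to compare the cluster labels item by item after allowing for the fact that the two methods may number the same clusters differently. The function name `agreement_rate` and the sample labels are hypothetical.

```python
from itertools import permutations


def agreement_rate(labels_a, labels_b, k):
    """Fraction of items placed in matching clusters, maximized over relabelings.

    Cluster numbers are arbitrary, so every permutation of the k labels of
    labels_b is tried and the best match count is kept (practical for small k).
    """
    n = len(labels_a)
    best = 0
    for perm in permutations(range(k)):
        matches = sum(1 for a, b in zip(labels_a, labels_b) if a == perm[b])
        best = max(best, matches)
    return best / n


generalized = [0, 0, 1, 1, 2, 2]
accelerated = [1, 1, 2, 2, 0, 0]   # same grouping, different cluster numbers
rate = agreement_rate(generalized, accelerated, k=3)
```

With k = 3 only six permutations are tried, so the exhaustive search is cheap; for a much larger k a matching algorithm would be needed instead.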

<Comparison between the generalized method and the speed-up method>
For both the speed-up method and the generalized method, histograms with m = 1000 buckets were created from three-dimensional distribution data consisting of the opening price, closing price, and trading volume of stocks, and the calculation time required to classify the set of distribution data into clusters was measured. The results are shown in FIG. 9. The graph in FIG. 9 plots the results of the speed-up method (× in the figure) and of the generalized method (◇ in the figure) for up to n = 1000 pieces of distribution data. As FIG. 9 shows, with 1000 pieces of distribution data the generalized method requires about three times the calculation time of the speed-up method. In other words, the speed-up method performs the calculation about three times faster than the generalized method.

FIG. 1 is a block diagram schematically showing the configuration of a data classification device according to a first embodiment of the present invention.
FIG. 2 is an explanatory diagram showing an example of a histogram.
FIG. 3 is a flowchart showing the process by which the data classification device shown in FIG. 1 classifies distribution data into clusters.
FIG. 4 is a block diagram schematically showing the configuration of a data classification device according to a second embodiment of the present invention.
FIG. 5 is a flowchart showing the process by which the data classification device shown in FIG. 4 classifies distribution data into clusters.
FIG. 6 is an explanatory diagram showing an example of distribution data.
FIG. 7 is a graph showing an example of the relationship between the number of buckets and the calculation time.
FIG. 8 is a graph showing an example of the relationship between the number of wavelet coefficient values and the error rate.
FIG. 9 is a graph showing an example of the relationship between the number of distribution data and the calculation time.

Explanation of symbols

1 Data classification device
1A Data classification device
2 Input/output means
3 Storage means
31 RAM
32 ROM
33 Histogram storage means
34 Classification cluster storage means
4 Control means
41 Histogram creation means
42 Similarity calculation means
43 Cluster initialization means
44 Data summarization means
45 Cluster update means (classification means)
46 Reclassification determination means
47 Classification result output means
51 Data conversion means
52 Wavelet coefficient value storage means

Claims

A data classification device for classifying a set of distribution data, each indicating the distribution of a set of vector data in a multidimensional space, into a plurality of clusters according to the similarity between the distribution data, the device comprising:
data summarization means for summarizing, for each of a predetermined number of clusters into which all distribution data to be classified have been classified in advance, all of the histograms created in advance for the distribution data currently belonging to that cluster into a histogram representing the cluster;
similarity calculation means for calculating the similarity between distribution data based on the histogram of each distribution data to be classified that currently belongs to any one of the predetermined number of clusters and the histogram representing each of the predetermined number of clusters;
classification means for classifying the distribution data to be classified into the predetermined number of clusters based on the calculated similarity; and
classification result output means for outputting, as a classification result, information on the distribution data belonging to the predetermined number of clusters after the classification.
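The summarize, compare, and reclassify cycle recited in this claim can be sketched as follows. This is only an illustration under stated assumptions, not the claimed implementation: the cluster summary is assumed to be the bucket-wise mean of the member histograms, the similarity criterion is assumed to be the symmetric KL divergence (smaller means more similar), and the names `symmetric_kl`, `summarize`, `classify`, and the constant `EPS` are hypothetical.

```python
import math

EPS = 1e-12  # assumed smoothing constant; avoids log(0) for empty buckets


def symmetric_kl(p, q):
    """Symmetric KL divergence between two normalized histograms (0 = identical)."""
    return sum((a - b) * math.log((a + EPS) / (b + EPS)) for a, b in zip(p, q))


def summarize(histograms, members):
    """Assumed summary: bucket-wise mean of the member histograms."""
    buckets = len(histograms[0])
    return [sum(histograms[i][j] for i in members) / len(members)
            for j in range(buckets)]


def classify(histograms, assignment, k, max_iter=100):
    """Repeat summarize -> compare -> reassign until no item changes cluster."""
    assignment = list(assignment)
    for _ in range(max_iter):
        representatives = {}
        for c in range(k):
            members = [i for i, a in enumerate(assignment) if a == c]
            if members:  # an empty cluster has no representative
                representatives[c] = summarize(histograms, members)
        updated = [
            min(representatives,
                key=lambda c: symmetric_kl(h, representatives[c]))
            for h in histograms
        ]
        if updated == assignment:
            break
        assignment = updated
    return assignment


# Four hypothetical 3-bucket histograms; the initial assignment is deliberately poor.
histograms = [
    [0.80, 0.15, 0.05],
    [0.70, 0.20, 0.10],
    [0.05, 0.15, 0.80],
    [0.10, 0.20, 0.70],
]
result = classify(histograms, assignment=[0, 1, 1, 1], k=2)
print(result)
```

Comparing each distribution only with the k representative histograms, rather than with every other distribution, is what keeps the per-iteration cost proportional to the number of distributions times k.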

The data classification device according to claim 1, wherein the similarity calculation means calculates an initial value of the similarity between distribution data based on the histogram of each of the predetermined number of distribution data selected in advance from all distribution data to be classified and the histogram of each distribution data to be classified,
the device further comprising cluster initialization means for determining the distribution data that initially belong to the predetermined number of clusters by classifying all distribution data to be classified into the predetermined number of clusters based on the initial value of the similarity.
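A minimal sketch of this initialization step is given below. It assumes that the preselected distribution data act as seeds, one per cluster, and that each distribution is placed in the cluster of the seed with the smallest symmetric KL divergence. The names `initialize_clusters` and `seed_indices`, the smoothing parameter `eps`, and the sample data are hypothetical.

```python
import math


def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL divergence between two normalized histograms."""
    return sum((a - b) * math.log((a + eps) / (b + eps)) for a, b in zip(p, q))


def initialize_clusters(histograms, seed_indices):
    """Assign every histogram to the cluster of its most similar seed."""
    assignment = []
    for h in histograms:
        distances = [symmetric_kl(h, histograms[s]) for s in seed_indices]
        assignment.append(distances.index(min(distances)))
    return assignment


# Hypothetical histograms; items 0 and 2 are the preselected seeds.
histograms = [
    [0.80, 0.15, 0.05],
    [0.70, 0.20, 0.10],
    [0.05, 0.15, 0.80],
    [0.10, 0.20, 0.70],
]
initial = initialize_clusters(histograms, seed_indices=[0, 2])
print(initial)
```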

The data classification device according to claim 1, further comprising histogram creation means for creating a histogram from the distribution data to be classified.
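One way to create such a histogram is sketched below, assuming equal-width buckets along each dimension and normalization to relative frequencies; the source does not state the bucketing scheme here. With 10 buckets per dimension, three-dimensional data would yield 1000 buckets in total. The function name `build_histogram` and its parameters are hypothetical.

```python
def build_histogram(vectors, bins_per_dim, lower, upper):
    """Normalized histogram of multidimensional vectors with equal-width buckets.

    lower and upper give the assumed value range of each dimension.
    """
    if not vectors:
        raise ValueError("at least one vector is required")
    dims = len(lower)
    counts = [0] * (bins_per_dim ** dims)
    for v in vectors:
        index = 0
        for x, lo, hi in zip(v, lower, upper):
            b = int((x - lo) / (hi - lo) * bins_per_dim)
            b = min(max(b, 0), bins_per_dim - 1)  # clamp values on or outside the range
            index = index * bins_per_dim + b      # row-major bucket index
        counts[index] += 1
    return [c / len(vectors) for c in counts]


# Hypothetical 2-D example with 2 buckets per dimension (4 buckets in total).
vectors = [(0.1, 0.1), (0.9, 0.9), (0.9, 0.95)]
hist = build_histogram(vectors, bins_per_dim=2, lower=(0.0, 0.0), upper=(1.0, 1.0))
print(hist)
```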

The data classification device according to claim 1, further comprising data conversion means for converting the bucket data of the histogram of the distribution data into frequency domain data,
wherein the similarity calculation means calculates the similarity between the distribution data based on the data converted into the frequency domain data.
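As an illustration of converting bucket data into frequency domain data, the sketch below applies an orthonormal Haar wavelet transform and keeps only a few coefficients. The choice of the Haar wavelet, the rule of keeping the largest-magnitude coefficients, and the names `haar_transform` and `keep_largest` are assumptions; this passage does not state which wavelet is used or how the adopted coefficient values are selected.

```python
import math


def haar_transform(values):
    """Orthonormal 1-D Haar transform; len(values) must be a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    data = list(values)
    details = []
    while len(data) > 1:
        pairs = range(0, len(data), 2)
        averages = [(data[i] + data[i + 1]) / math.sqrt(2) for i in pairs]
        differences = [(data[i] - data[i + 1]) / math.sqrt(2) for i in pairs]
        details = differences + details  # coarser levels go in front
        data = averages
    return data + details


def keep_largest(coeffs, c):
    """Zero all but the c largest-magnitude coefficients (assumed selection rule)."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = set(order[:c])
    return [v if i in kept else 0.0 for i, v in enumerate(coeffs)]


coeffs = haar_transform([4.0, 2.0, 1.0, 1.0])
reduced = keep_largest(coeffs, 2)
print(coeffs)
print(reduced)
```

Because this transform is orthonormal, the sum of squared values is unchanged by it, so distances computed on a small set of dominant coefficients approximate distances on the full bucket data.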

A data classification method performed by a data classification device that classifies a set of distribution data, each indicating the distribution of a set of vector data in a multidimensional space, into a plurality of clusters according to the similarity between the distribution data, the method comprising:
a data summarization step in which data summarization means summarizes, for each of a predetermined number of clusters into which all distribution data to be classified have been classified in advance, all of the histograms created in advance for the distribution data currently belonging to that cluster into a histogram representing the cluster;
a similarity calculation step in which similarity calculation means calculates the similarity between distribution data based on the histogram of each distribution data to be classified that currently belongs to any one of the predetermined number of clusters and the histogram representing each of the predetermined number of clusters;
a classification step of classifying the distribution data to be classified into the predetermined number of clusters based on the calculated similarity; and
a classification result output step in which classification result output means outputs, as a classification result, information on the distribution data belonging to the predetermined number of clusters after the classification.

The data classification method according to claim 5, further comprising, prior to the data summarization step:
a step in which the similarity calculation means calculates an initial value of the similarity between distribution data based on the histogram of each of the predetermined number of distribution data selected in advance from all distribution data to be classified and the histogram of each distribution data to be classified; and
a cluster initialization step in which cluster initialization means determines the distribution data that initially belong to the predetermined number of clusters by classifying all distribution data to be classified into the predetermined number of clusters based on the initial value of the similarity.

The data classification method according to claim 5 or 6, further comprising a histogram creation step in which histogram creation means creates a histogram from the distribution data to be classified.

The data classification method according to claim 5, further comprising a data conversion step in which data conversion means converts the bucket data of the histogram into frequency domain data,
wherein the similarity calculation step calculates the similarity between the distribution data based on the data converted into the frequency domain data.

A data classification program for causing a computer to execute the data classification method according to any one of claims 5 to 8.

A computer-readable recording medium on which the data classification program according to claim 9 is recorded.