JP2013182341A

JP2013182341A - Clustering device and clustering method

Info

Publication number: JP2013182341A
Application number: JP2012044540A
Authority: JP
Inventors: Manabu Kawasaki; 学川▲崎▼; Yasutaka Tanaka; 康貴田中; Masumi Tanimoto; 益巳谷本
Original assignee: Sohgo Security Services Co Ltd
Current assignee: Sohgo Security Services Co Ltd
Priority date: 2012-02-29
Filing date: 2012-02-29
Publication date: 2013-09-12
Anticipated expiration: 2032-02-29
Also published as: JP5912667B2

Abstract

PROBLEM TO BE SOLVED: To provide a clustering device which performs clustering sequentially with a small operation quantity. SOLUTION: A clustering device includes: a data storage part 40 for storing data items to which clustering is applied, and a cluster to which the data items belong, in such a manner that they are associated with each other; a centroid storage part 50 for storing the cluster and a representative value representing feature quantities of the data items belonging to the cluster, in such a manner that they are associated with each other; a data acquisition part 10 for acquiring a new data item which is subject to clustering; a data distance calculating part 31 for calculating a data distance between a feature quantity of the new data item and the representative value stored in the centroid storage part 50; a cluster determining part 32 for determining, on the basis of the data distance, a cluster to which the new data item belongs; a centroid calculating part 33 for calculating, on the basis of the feature quantity of the new data item, a representative value of the cluster to which the new data item belongs; and a data updating part 60 for updating, on the basis of the new data item, the data storage part 40 and the centroid storage part 50.

Description

本発明は、クラスタリング装置およびクラスタリング方法に関する。 The present invention relates to a clustering apparatus and a clustering method.

従来、クラスタリングの対象となるデータを外的基準を設定することなく、自動的にクラスタリングする手法が知られている。クラスタリングの手法は、樹形図によって表現されるような階層的手法と、クラスタの妥当性を基準とする非階層的手法とに大別される（例えば、非特許文献１参照）。いずれの手法も、クラスタリングの対象となるデータをクラスタリングの前に準備し、類似したデータをグループ化するものである。 Conventionally, a method for automatically clustering data to be clustered without setting an external reference is known. Clustering methods are roughly classified into a hierarchical method represented by a tree diagram and a non-hierarchical method based on the validity of the cluster (for example, see Non-Patent Document 1). In either method, data to be clustered is prepared before clustering, and similar data is grouped.

また、特許文献１には、過去のクラスタリング結果と、これに付随する確率パラメータを用いて、記述長最小の基準に基づいて、新しくデータを追加する度にクラスタリングを行う技術が開示されている。 Patent Document 1 discloses a technique for performing clustering every time new data is added based on a criterion of minimum description length using past clustering results and probability parameters associated therewith.

特許第３２４３６９３号公報 Japanese Patent No. 3243693

奥野忠一、久米均、芳賀敏郎、吉澤正著「多変量解析法」日科技連出版社ｐ．１２４−１５７ Tadashi Okuno, Hitoshi Kume, Toshiro Haga, Tadashi Yoshizawa, "Multivariate Analysis", JUSE Press (Nikkagiren Shuppansha), pp. 124-157

特許文献１の技術においては、新しくデータを追加する度にクラスタリングを行うことができるので、クラスタリング前にクラスタリングの対象となるデータをすべて準備する必要がない。しかしながら、特許文献１の技術では、過去にクラスタリングを行ったすべてのデータを見直す必要がないものの、データを追加する毎にクラスタの統合を可能な限り繰り返す必要があり、演算量が多いという問題があった。 With the technique of Patent Document 1, clustering can be performed each time new data is added, so it is not necessary to prepare all the data to be clustered before clustering. However, although that technique does not need to revisit all previously clustered data, it must repeat cluster merging as far as possible every time data is added, so the amount of computation is large.

本発明は、上記に鑑みてなされたものであって、少ない演算量で、データが追加される度に逐次クラスタリングを行うことのできるクラスタリング装置およびクラスタリング方法を提供することを目的とする。 The present invention has been made in view of the above, and an object of the present invention is to provide a clustering apparatus and a clustering method capable of performing sequential clustering each time data is added with a small amount of calculation.

上述した課題を解決し、目的を達成するために、本発明は、取得データを逐次クラスタリングするクラスタリング装置であって、既にクラスタリングされたデータと、前記データが所属するクラスタとを対応付けて記憶するデータ記憶部と、前記クラスタと、前記クラスタに所属する前記データの特徴量を代表する代表値とを対応付けて記憶する代表値記憶部と、クラスタリングの対象となる新データを取得するデータ取得部と、前記データ取得部が取得した前記新データの前記特徴量と前記代表値記憶部に記憶されている前記代表値の間のデータ距離を算出するデータ距離算出部と、前記データ距離に基づいて、前記新データが属するクラスタを決定するクラスタ決定部と、前記新データの前記特徴量に基づいて、前記新データの属するクラスタの前記代表値を算出する代表値算出部と前記新データと、前記新データの属するクラスタとを対応付けて前記データ記憶部に書き込み、前記新データの属するクラスタに対応付けて前記新データの属するクラスタの前記代表値を前記代表値記憶部に書き込むデータ更新部とを備えることを特徴とする。 In order to solve the above-described problems and achieve the object, the present invention is a clustering device that sequentially clusters acquired data, comprising: a data storage unit that stores already-clustered data in association with the cluster to which the data belongs; a representative value storage unit that stores each cluster in association with a representative value representing the feature quantities of the data belonging to that cluster; a data acquisition unit that acquires new data to be clustered; a data distance calculation unit that calculates a data distance between the feature quantity of the new data acquired by the data acquisition unit and the representative value stored in the representative value storage unit; a cluster determination unit that determines, based on the data distance, the cluster to which the new data belongs; a representative value calculation unit that calculates, based on the feature quantity of the new data, the representative value of the cluster to which the new data belongs; and a data update unit that writes the new data to the data storage unit in association with the cluster to which the new data belongs, and writes the representative value of that cluster to the representative value storage unit in association with that cluster.

また、本発明は、取得データを逐次クラスタリングするクラスタリング装置で実行されるクラスタリング方法であって、前記クラスタリング装置は、既にクラスタリングされたデータと、前記データが所属するクラスタとを対応付けて記憶するデータ記憶部と、前記クラスタと、前記クラスタに所属する前記データの特徴量を代表する代表値とを対応付けて記憶する代表値記憶部とを備え、クラスタリングの対象となる新データを取得するデータ取得工程と、前記データ取得工程において取得した前記新データの前記特徴量と前記代表値記憶部に記憶されている前記代表値の間のデータ距離を算出するデータ距離算出工程と、前記データ距離に基づいて、前記新データが属するクラスタを決定するクラスタ決定工程と、前記新データの前記特徴量に基づいて、前記新データの属するクラスタの前記代表値を算出する代表値算出工程と、前記新データと、前記新データの属するクラスタとを対応付けて前記データ記憶部に書き込み、前記新データの属するクラスタに対応付けて前記新データの属するクラスタの前記代表値を前記代表値記憶部に書き込むデータ更新工程とを含むことを特徴とする。 The present invention is also a clustering method executed by a clustering device that sequentially clusters acquired data, the clustering device including a data storage unit that stores already-clustered data in association with the cluster to which the data belongs, and a representative value storage unit that stores each cluster in association with a representative value representing the feature quantities of the data belonging to that cluster. The method comprises: a data acquisition step of acquiring new data to be clustered; a data distance calculation step of calculating a data distance between the feature quantity of the new data acquired in the data acquisition step and the representative value stored in the representative value storage unit; a cluster determination step of determining, based on the data distance, the cluster to which the new data belongs; a representative value calculation step of calculating, based on the feature quantity of the new data, the representative value of the cluster to which the new data belongs; and a data update step of writing the new data to the data storage unit in association with the cluster to which the new data belongs, and writing the representative value of that cluster to the representative value storage unit in association with that cluster.

本発明によれば、少ない演算量で、データが追加される度に逐次クラスタリングを行うことができるという効果を奏する。 According to the present invention, it is possible to perform sequential clustering each time data is added with a small amount of calculation.

図１は、実施の形態にかかるクラスタリング装置の構成を示すブロック図である。 FIG. 1 is a block diagram illustrating the configuration of the clustering apparatus according to the embodiment.
図２は、セントロイドを説明するための図である。 FIG. 2 is a diagram for explaining the centroid.
図３は、データ記憶部のデータ構成を模式的に示す図である。 FIG. 3 is a diagram schematically illustrating the data configuration of the data storage unit.
図４は、セントロイド記憶部のデータ構成を模式的に示す図である。 FIG. 4 is a diagram schematically illustrating the data configuration of the centroid storage unit.
図５は、クラスタリング装置によるクラスタリング処理を示すフローチャートである。 FIG. 5 is a flowchart showing the clustering process performed by the clustering apparatus.
図６は、クラスタの生成過程を示す図である。 FIG. 6 is a diagram illustrating the cluster generation process.
図７は、音データを取得するのに伴い、古いデータを削除する過程を示す図である。 FIG. 7 is a diagram showing the process of deleting old data as sound data is acquired.
図８−１は、実施例１にかかるクラスタリング装置によるデータ範囲内のランダムデータに対するクラスタリング結果を示す図である。 FIG. 8-1 is a diagram showing the clustering result of the clustering apparatus according to Example 1 for random data within the data range.
図８−２は、比較例１にかかる群平均化法のクラスタリング結果を示す図である。 FIG. 8-2 is a diagram showing the clustering result of the group average method according to Comparative Example 1.
図９−１は、実施例２にかかるクラスタリング装置によるクラスタリング結果を示す図である。 FIG. 9-1 is a diagram showing the clustering result of the clustering apparatus according to Example 2.
図９−２は、比較例２にかかる群平均化法のクラスタリング結果を示す図である。 FIG. 9-2 is a diagram showing the clustering result of the group average method according to Comparative Example 2.
図１０は、実施例３において入力されるデータのデータ範囲を示す図である。 FIG. 10 is a diagram showing the data range of the data input in Example 3.
図１１は、実施例３において入力されるデータの入力順と、クラスタ遷移を示す図である。 FIG. 11 is a diagram showing the input order of the data input in Example 3 and the cluster transitions.
図１２−１は、距離閾値（Ｄ_ｐ）７０に設定された場合のクラスタリング結果を示す図である。 FIG. 12-1 is a diagram showing the clustering result when the distance threshold (D_p) is set to 70.
図１２−２は、距離閾値（Ｄ_ｐ）８０に設定された場合のクラスタリング結果を示す図である。 FIG. 12-2 is a diagram showing the clustering result when the distance threshold (D_p) is set to 80.
図１３−１は、距離閾値（Ｄ_ｐ）７０に設定された場合のクラスタリング結果を示す図である。 FIG. 13-1 is a diagram showing the clustering result when the distance threshold (D_p) is set to 70.
図１３−２は、距離閾値（Ｄ_ｐ）８０に設定された場合のクラスタリング結果を示す図である。 FIG. 13-2 is a diagram showing the clustering result when the distance threshold (D_p) is set to 80.
図１４−１は、距離閾値（Ｄ_ｐ）７０に設定された場合のクラスタリング結果を示す図である。 FIG. 14-1 is a diagram showing the clustering result when the distance threshold (D_p) is set to 70.
図１４−２は、距離閾値（Ｄ_ｐ）８０に設定された場合のクラスタリング結果を示す図である。 FIG. 14-2 is a diagram showing the clustering result when the distance threshold (D_p) is set to 80.
図１５−１は、距離閾値（Ｄ_ｐ）７０に設定された場合のクラスタリング結果を示す図である。 FIG. 15-1 is a diagram showing the clustering result when the distance threshold (D_p) is set to 70.
図１５−２は、距離閾値（Ｄ_ｐ）８０に設定された場合のクラスタリング結果を示す図である。 FIG. 15-2 is a diagram showing the clustering result when the distance threshold (D_p) is set to 80.
図１６−１は、距離閾値（Ｄ_ｐ）７０に設定された場合のクラスタリング結果を示す図である。 FIG. 16-1 is a diagram showing the clustering result when the distance threshold (D_p) is set to 70.
図１６−２は、距離閾値（Ｄ_ｐ）８０に設定された場合のクラスタリング結果を示す図である。 FIG. 16-2 is a diagram showing the clustering result when the distance threshold (D_p) is set to 80.
図１７−１は、距離閾値（Ｄ_ｐ）７０に設定された場合のクラスタリング結果を示す図である。 FIG. 17-1 is a diagram showing the clustering result when the distance threshold (D_p) is set to 70.
図１７−２は、距離閾値（Ｄ_ｐ）８０に設定された場合のクラスタリング結果を示す図である。 FIG. 17-2 is a diagram showing the clustering result when the distance threshold (D_p) is set to 80.

以下に添付図面を参照して、クラスタリング装置およびクラスタリング方法の実施の形態を詳細に説明する。 Exemplary embodiments of a clustering apparatus and a clustering method will be described below in detail with reference to the accompanying drawings.

図１は、実施の形態にかかるクラスタリング装置１の構成を示すブロック図である。クラスタリング装置１は、データを取得する度にデータのクラスタリングを逐次行っていく。クラスタリング装置１は、データ取得部１０と、特徴パラメータ算出部２０と、逐次クラスタリング部３０と、データ記憶部４０と、セントロイド記憶部５０と、データ更新部６０とを備えている。 FIG. 1 is a block diagram illustrating a configuration of a clustering apparatus 1 according to the embodiment. The clustering apparatus 1 sequentially performs data clustering every time data is acquired. The clustering apparatus 1 includes a data acquisition unit 10, a feature parameter calculation unit 20, a sequential clustering unit 30, a data storage unit 40, a centroid storage unit 50, and a data update unit 60.

データ取得部１０は、クラスタリングの対象となるデータを取得する。特徴パラメータ算出部２０は、データ取得部１０が取得したデータの特徴量としての特徴パラメータを算出する。なお、本実施の形態のクラスタリング装置１は、クラスタリングの対象データとして音データを取得し、特徴パラメータとして、ＬＰＣケプストラム係数を算出することとする。ただし、クラスタリングの対象データの種類および特徴パラメータの種類は実施の形態に限定されるものではない。 The data acquisition unit 10 acquires data to be clustered. The feature parameter calculation unit 20 calculates a feature parameter as a feature amount of the data acquired by the data acquisition unit 10. Note that the clustering apparatus 1 of the present embodiment acquires sound data as clustering target data and calculates an LPC cepstrum coefficient as a feature parameter. However, the types of data to be clustered and the types of feature parameters are not limited to the embodiment.

データ取得部１０は、例えば、１６ｋＨｚ、１６ｂｉｔで量子化された音響信号（以下、音データと称する）をクラスタリングの対象データとして取得する。また、特徴パラメータ算出部２０は、例えば、分析フレーム長６４ｍｓｅｃ（１０２４ｐ）、分析フレーム間隔１６ｍｓｅｃ（２５６ｐ）で１６次のＬＰＣケプストラム係数を算出することにより、音の周波数構造を表す１６次元の特徴パラメータを算出する。 The data acquisition unit 10 acquires, for example, an acoustic signal sampled at 16 kHz and quantized with 16 bits (hereinafter referred to as sound data) as the data to be clustered. The feature parameter calculation unit 20 calculates, for example, 16th-order LPC cepstrum coefficients with an analysis frame length of 64 msec (1024 points) and an analysis frame interval of 16 msec (256 points), thereby obtaining a 16-dimensional feature parameter that represents the frequency structure of the sound.
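The frame sizes quoted above follow directly from the sampling rate. The following is a minimal sketch of that arithmetic only; the function names `ms_to_samples` and `frame_starts` are illustrative assumptions and do not come from the source, and the LPC cepstrum computation itself is not shown.

```python
SAMPLE_RATE_HZ = 16_000  # 16 kHz, as stated in the embodiment


def ms_to_samples(duration_ms, sample_rate_hz=SAMPLE_RATE_HZ):
    """Convert a duration in milliseconds to a number of samples (points)."""
    return sample_rate_hz * duration_ms // 1000


def frame_starts(num_samples, frame_length, frame_shift):
    """Start index of every full analysis frame in a signal of num_samples points."""
    return list(range(0, num_samples - frame_length + 1, frame_shift))


frame_length = ms_to_samples(64)  # 64 msec analysis frame
frame_shift = ms_to_samples(16)   # 16 msec analysis frame interval

# One second of audio, split into overlapping analysis frames.
starts = frame_starts(SAMPLE_RATE_HZ, frame_length, frame_shift)
print(frame_length, frame_shift, len(starts))
```

Each analysis frame would then yield one 16-dimensional feature vector.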

なお他の例としては、クラスタリングの対象となる対象データとともに特徴パラメータを外部から取得してもよい。この場合には、クラスタリング装置１は、特徴パラメータ算出部２０を備えなくともよい。 As another example, feature parameters may be acquired from the outside together with target data to be clustered. In this case, the clustering device 1 does not have to include the feature parameter calculation unit 20.

逐次クラスタリング部３０は、特徴パラメータ算出部２０により算出された特徴パラメータに基づいて、データ取得部１０がデータを取得する度に、データのクラスタリングを逐次行う。 The sequential clustering unit 30 sequentially performs data clustering each time the data acquisition unit 10 acquires data based on the feature parameters calculated by the feature parameter calculation unit 20.

データ記憶部４０は、逐次クラスタリング部３０によりクラスタリングされた音データを記憶している。セントロイド記憶部５０は、逐次クラスタリング部３０により生成されたクラスタのセントロイドの値を記憶している。ここで、セントロイドとは、各クラスタに属する音データの特徴パラメータの重心である。本実施の形態においては、クラスタの特徴を示す代表値としてセントロイドを用いるが、代表値は、クラスタの特徴を示すような値であればよく、例えば、クラスタに属する音データの平均値などセントロイド以外の値を用いてもよい。 The data storage unit 40 stores the sound data clustered by the sequential clustering unit 30. The centroid storage unit 50 stores the centroid values of the clusters generated by the sequential clustering unit 30. Here, the centroid is the center of gravity of the feature parameters of the sound data belonging to each cluster. In the present embodiment, the centroid is used as the representative value indicating the characteristics of a cluster, but the representative value may be any value that indicates the characteristics of the cluster; for example, a value other than the centroid, such as the average value of the sound data belonging to the cluster, may be used.

図２は、セントロイドを説明するための図である。なお、本実施の形態においては、説明の便宜上、音データの特徴パラメータがｘ，ｙ座標上の値を有する二次元データである場合を例に説明する。図２に示すように、クラスタＡに属するデータａ１〜ａ５の特徴パラメータのｘｙ平面上の位置の重心位置がクラスタＡのセントロイドである。 FIG. 2 is a diagram for explaining the centroid. In the present embodiment, for the sake of convenience of explanation, a case where the feature parameter of sound data is two-dimensional data having values on the x and y coordinates will be described as an example. As shown in FIG. 2, the centroid position of the position on the xy plane of the characteristic parameters of the data a1 to a5 belonging to the cluster A is the centroid of the cluster A.
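As a small illustration of the centroid described for FIG. 2, the sketch below averages two-dimensional feature parameters coordinate by coordinate. The function name `centroid` and the five sample coordinates are made up for illustration; the source gives no numeric values for a1 to a5.

```python
def centroid(points):
    """Return the center of gravity of equal-weight points of the same dimension."""
    n = len(points)
    dim = len(points[0])
    return tuple(sum(p[k] for p in points) / n for k in range(dim))


# Hypothetical feature parameters for data a1..a5 of cluster A.
cluster_a = [(1, 1), (2, 1), (3, 2), (2, 3), (2, 3)]
print(centroid(cluster_a))  # (2.0, 2.0)
```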

図１に戻り、データ更新部６０は、データ取得部１０が新たな音データを取得すると、新たな音データに基づいて、データ記憶部４０およびセントロイド記憶部５０のデータを更新する。なお、本実施の形態のクラスタリング装置１においては、データ取得部１０は、クラスタリングの対象となる音データを逐次取得し、データ更新部６０は、データ記憶部４０に記憶されている音データの数が予め設定された閾値を越えた場合には、最も古い音データを削除し、これに替えて新たな音データをデータ記憶部４０に書き込む。すなわち、データ更新部６０は、音データを削除する削除部としても機能する。 Returning to FIG. 1, when the data acquisition unit 10 acquires new sound data, the data update unit 60 updates the data in the data storage unit 40 and the centroid storage unit 50 based on the new sound data. In the clustering apparatus 1 of the present embodiment, the data acquisition unit 10 sequentially acquires sound data to be clustered, and when the number of sound data items stored in the data storage unit 40 exceeds a preset threshold, the data update unit 60 deletes the oldest sound data and writes the new sound data to the data storage unit 40 in its place. That is, the data update unit 60 also functions as a deletion unit that deletes sound data.

図３は、データ記憶部４０のデータ構成を模式的に示す図である。図３に示すように、データ記憶部４０は、データ取得部１０が音データを取得した取得順と、音データを識別するデータＩＤと、音データと、音データに対し特徴パラメータ算出部２０により算出された特徴パラメータと、音データが属するクラスタを識別するクラスタＩＤとを対応付けて記憶している。なお、取得順は、データを取得する度に付与される連続番号などであってもよく、また他の例としては、取得時刻であってもよい。 FIG. 3 is a diagram schematically illustrating the data configuration of the data storage unit 40. As shown in FIG. 3, the data storage unit 40 stores, in association with one another, the order in which the data acquisition unit 10 acquired the sound data, a data ID identifying the sound data, the sound data itself, the feature parameter calculated for the sound data by the feature parameter calculation unit 20, and a cluster ID identifying the cluster to which the sound data belongs. The acquisition order may be a serial number assigned each time data is acquired or, as another example, the acquisition time.

図４は、セントロイド記憶部５０のデータ構成を模式的に示す図である。図４に示すように、セントロイド記憶部５０は、逐次クラスタリング部３０により生成されたクラスタのクラスタＩＤと、クラスタのセントロイドとを対応付けて記憶している。 FIG. 4 is a diagram schematically illustrating a data configuration of the centroid storage unit 50. As illustrated in FIG. 4, the centroid storage unit 50 stores the cluster ID of the cluster generated by the sequential clustering unit 30 and the centroid of the cluster in association with each other.

図１に戻り、逐次クラスタリング部３０は、データ距離算出部３１と、クラスタ決定部３２と、セントロイド算出部３３とを有している。データ距離算出部３１は、データ距離を算出する。ここで、データ距離とは、データ取得部１０が新たに取得した音データ（新データ）の特徴パラメータと、既に逐次クラスタリング部３０により生成されたクラスタのセントロイドとの距離である。データ距離算出部３１は、特徴パラメータ算出部２０により算出された特徴パラメータと、セントロイド記憶部５０に記憶されているクラスタのセントロイドに基づいて、データ距離を算出する。なお、セントロイド記憶部５０に複数のクラスタのセントロイドが記憶されている場合には、データ距離算出部３１は、すべてのセントロイドとのデータ距離を算出する。 Returning to FIG. 1, the sequential clustering unit 30 includes a data distance calculation unit 31, a cluster determination unit 32, and a centroid calculation unit 33. The data distance calculation unit 31 calculates the data distance. Here, the data distance is a distance between the feature parameter of the sound data (new data) newly acquired by the data acquisition unit 10 and the centroid of the cluster already generated by the sequential clustering unit 30. The data distance calculation unit 31 calculates the data distance based on the feature parameter calculated by the feature parameter calculation unit 20 and the cluster centroid stored in the centroid storage unit 50. When the centroids of a plurality of clusters are stored in the centroid storage unit 50, the data distance calculation unit 31 calculates the data distances with all centroids.

クラスタ決定部３２は、データ距離算出部３１により算出されたデータ距離と、予め設定された距離閾値とを比較する。そして、クラスタ決定部３２は、所定のクラスタとのデータ距離が距離閾値以下の場合には、距離閾値以下のデータ距離が算出されたクラスタを新データが属するクラスタに決定する。クラスタ決定部３２はまた、データ距離が距離閾値よりも大きい場合には、新データの属するクラスタとして新たなクラスタを生成し、これを新データが属するクラスタに決定する。 The cluster determination unit 32 compares the data distance calculated by the data distance calculation unit 31 with a preset distance threshold. Then, when the data distance from the predetermined cluster is equal to or smaller than the distance threshold, the cluster determining unit 32 determines the cluster for which the data distance equal to or smaller than the distance threshold is calculated as the cluster to which the new data belongs. If the data distance is greater than the distance threshold, the cluster determination unit 32 also generates a new cluster as a cluster to which the new data belongs, and determines this as a cluster to which the new data belongs.
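The decision rule of the cluster determination unit 32 can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `decide_cluster`, the dictionary of distances, and the use of `None` to mean "create a new cluster" are all hypothetical conventions, not taken from the source.

```python
def decide_cluster(distances, threshold):
    """Pick the cluster for new data from its data distances.

    distances: dict mapping cluster ID -> data distance to that cluster's centroid.
    threshold: the preset distance threshold.
    Returns the ID of the nearest cluster if its distance is at or below the
    threshold, otherwise None (the caller would then create a new cluster).
    """
    if not distances:  # no cluster exists yet
        return None
    nearest = min(distances, key=distances.get)
    return nearest if distances[nearest] <= threshold else None


print(decide_cluster({"A": 3.0, "B": 1.5}, threshold=2.0))  # B
print(decide_cluster({"A": 3.0, "B": 1.5}, threshold=1.0))  # None
```

A distance exactly equal to the threshold joins the existing cluster, matching the "equal to or smaller than" wording above.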

なお、距離閾値は、クラスタリング装置１に予め設定しておく。なお、距離閾値の値は任意であるが、距離閾値を大きく設定することにより、生成されるクラスタの数を少なくすることができ、特徴量の異なるデータを同一クラスタに所属させることができる。一方で、距離閾値を小さく設定することにおり、比較的多くのクラスタを生成することができ、特徴量が比較的類似するデータのみを同一クラスタに所属させることができる。 Note that the distance threshold is set in the clustering apparatus 1 in advance. Although the value of the distance threshold is arbitrary, by setting the distance threshold to a large value, the number of clusters to be generated can be reduced, and data with different feature values can belong to the same cluster. On the other hand, by setting the distance threshold to be small, a relatively large number of clusters can be generated, and only data having relatively similar feature amounts can belong to the same cluster.

セントロイド算出部３３は、クラスタ決定部３２により決定された新データのクラスタのセントロイドを算出する。セントロイド算出部３３はまた、データ更新部６０により所定の音データがデータ記憶部４０から削除された場合には、削除された音データが属していたクラスタのセントロイドを算出する。すなわち、セントロイド算出部３３は、音データの削除後に、削除された音データのクラスタに属する、残りの音データの特徴パラメータに基づいて、削除された音データが属していたクラスタのセントロイドを算出する。 The centroid calculation unit 33 calculates the centroid of the cluster determined for the new data by the cluster determination unit 32. When the data update unit 60 deletes given sound data from the data storage unit 40, the centroid calculation unit 33 also calculates the centroid of the cluster to which the deleted sound data belonged. That is, after the sound data is deleted, the centroid calculation unit 33 calculates the centroid of that cluster from the feature parameters of the sound data remaining in it.

なお、クラスタ決定部３２により新データが属するクラスタが決定されると、データ更新部６０は、新データと、新データのデータＩＤと、新データの特徴パラメータと、新データに対して決定されたクラスタのクラスタＩＤとをデータ記憶部４０に書き込む。また、セントロイド算出部３３によりセントロイドが算出されると、データ更新部６０は、セントロイド算出部３３により算出されたセントロイドをクラスタＩＤに対応付けてセントロイド記憶部５０に書き込む。 When the cluster determination unit 32 determines the cluster to which the new data belongs, the data update unit 60 writes the new data, the data ID of the new data, the feature parameter of the new data, and the cluster ID of the cluster determined for the new data to the data storage unit 40. When the centroid calculation unit 33 calculates a centroid, the data update unit 60 writes that centroid to the centroid storage unit 50 in association with the cluster ID.

図５は、クラスタリング装置１によるクラスタリング処理を示すフローチャートである。クラスタリング処理は、クラスタリング装置１のデータ取得部１０が音データを取得する度に実行される。クラスタリング処理においては、まずデータ取得部１０が音データを取得すると（ステップＳ１００）、特徴パラメータ算出部２０は、音データの特徴パラメータを算出する（ステップＳ１０１）。次に、データ更新部６０は、データ記憶部４０を参照し、データ記憶部４０にデータが記憶されているか否かを確認する。 FIG. 5 is a flowchart showing the clustering process performed by the clustering apparatus 1. The clustering process is executed every time the data acquisition unit 10 of the clustering apparatus 1 acquires sound data. In the clustering process, first, when the data acquisition unit 10 acquires sound data (step S100), the feature parameter calculation unit 20 calculates feature parameters of the sound data (step S101). Next, the data update unit 60 refers to the data storage unit 40 and confirms whether data is stored in the data storage unit 40.

データ記憶部４０に音データが記憶されておらず、まだクラスタが生成されていない場合には（ステップＳ１０２，Ｎｏ）、データ更新部６０は、自身が有するデータ配列Ｘ［ｊ］のアドレス［ｊ＝０］にステップＳ１０１において算出された新データの特徴パラメータを格納する（ステップＳ１０３）。データ更新部６０は、さらにデータ記憶部４０に新たなデータ、新たなデータの取得順、データＩＤ、特徴パラメータを書き込む。 If no sound data is stored in the data storage unit 40 and no cluster has been generated yet (No in step S102), the data update unit 60 stores the feature parameter of the new data calculated in step S101 at address [j = 0] of its own data array X[j] (step S103). The data update unit 60 further writes the new data, its acquisition order, its data ID, and its feature parameter to the data storage unit 40.

次に、逐次クラスタリング部３０のクラスタ決定部３２は、新データが属するクラスタ（新クラスタ）を新たに生成する（ステップＳ１０４）。これに対応し、データ記憶部４０においては、データ更新部６０により、新データに対応付けて、新クラスタのクラスタＩＤが書き込まれる。 Next, the cluster determination unit 32 of the sequential clustering unit 30 newly generates a cluster to which the new data belongs (new cluster) (step S104). Correspondingly, in the data storage unit 40, the data update unit 60 writes the cluster ID of the new cluster in association with the new data.

次に、セントロイド算出部３３は、新クラスタのセントロイドを算出する（ステップＳ１０５）。なお、ステップＳ１０４において生成された新クラスタに属するデータは新データのみであるので、ステップＳ１０５では、新データの特徴パラメータがクラスタのセントロイドとして算出される。 Next, the centroid calculation unit 33 calculates the centroid of the new cluster (step S105). Note that since the data belonging to the new cluster generated in step S104 is only new data, the feature parameter of the new data is calculated as the centroid of the cluster in step S105.

算出されたセントロイドは、データ更新部６０により、新クラスタのクラスタＩＤに対応付けてセントロイド記憶部５０に書き込まれ、以上で、最初の音データを取得した場合のクラスタリング処理が終了する。 The calculated centroid is written to the centroid storage unit 50 in association with the cluster ID of the new cluster by the data updating unit 60, and the clustering process when the first sound data is acquired is completed.

一方、ステップＳ１０２において、既にデータ記憶部４０に音データが記憶されており、クラスタ生成済みである場合には（ステップＳ１０２，Ｙｅｓ）、データ更新部６０は、データ配列Ｘ［ｊ］のアドレス［ｊ］を１だけ進める（ステップＳ１１０）。ここで、アドレス［ｊ］がデータ配列Ｘ［ｊ］の最終アドレスよりも大きい場合には（ステップＳ１１１，Ｙｅｓ）、データ更新部６０は、データ配列Ｘ［ｊ］のアドレスを［０］に戻す（ステップＳ１１２）。なお、ステップＳ１１１において、アドレス［ｊ］がデータ配列Ｘ［ｊ］の最終アドレス以下である場合には（ステップＳ１１１，Ｎｏ）、ステップＳ１１３へ進む。 On the other hand, if sound data is already stored in the data storage unit 40 and a cluster has already been generated (Yes in step S102), the data update unit 60 advances the address [j] of the data array X[j] by 1 (step S110). If the address [j] is now larger than the final address of the data array X[j] (Yes in step S111), the data update unit 60 returns the address of the data array X[j] to [0] (step S112). If the address [j] is equal to or smaller than the final address of the data array X[j] (No in step S111), the process proceeds directly to step S113.
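Steps S110 to S112 amount to advancing an index around a fixed-size circular buffer. A minimal sketch, in which the function name `next_address` and the parameter `array_length` are illustrative assumptions:

```python
def next_address(j, array_length):
    """Advance address j by one and wrap to 0 past the final address (S110-S112)."""
    j += 1                      # step S110
    if j > array_length - 1:    # step S111: beyond the final address?
        j = 0                   # step S112
    return j


# With a data array of length 4, the address cycles 0 -> 1 -> 2 -> 3 -> 0.
j = 0
visited = []
for _ in range(5):
    visited.append(j)
    j = next_address(j, 4)
print(visited)  # [0, 1, 2, 3, 0]
```

Once the array is full, the slot at the new address always holds the oldest item, which is why overwriting it implements "delete the oldest data".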

次に、セントロイド算出部３３は、データ配列Ｘ［ｊ］の各アドレスに格納されている特徴パラメータのうち、アドレス［ｊ］に格納されている特徴パラメータ以外の特徴パラメータに基づいて、アドレス［ｊ］の特徴パラメータが属するクラスタのセントロイドを算出し、データ更新部６０は、算出結果に基づいて、セントロイド記憶部５０のセントロイドを更新する（ステップＳ１１３）。 Next, the centroid calculation unit 33 calculates the centroid of the cluster to which the feature parameter at address [j] belongs, using the feature parameters stored in the data array X[j] other than the one stored at address [j], and the data update unit 60 updates the centroid in the centroid storage unit 50 based on the result (step S113).

なお、ここでは、アドレス［ｊ］に格納されている特徴パラメータが除外されるので、アドレス［ｊ］の特徴パラメータが属するクラスタのセントロイドの値が、前回算出されたセントロイドの値と異なる可能性があるが、これ以外のクラスタのセントロイドの値に変更はない。そこで、ステップＳ１１３においては、アドレス［ｊ］の特徴パラメータが属するクラスタのセントロイドの算出、更新を行えばよい。 Here, since the feature parameter stored at the address [j] is excluded, the centroid value of the cluster to which the feature parameter at the address [j] belongs may be different from the previously calculated centroid value. There is no change in the centroid values of other clusters. Therefore, in step S113, the centroid of the cluster to which the feature parameter of the address [j] belongs may be calculated and updated.

次に、データ距離算出部３１は、新データと既に生成済みのすべてのクラスタのセントロイドとの間のデータ距離をそれぞれ算出する（ステップＳ１１４）。新データの特徴パラメータｘ_ｋ（ｋ＝０，１，２…Ｋ）とクラスタ［i］（ｉ＝０，１，２…Ｉ）のセントロイドＣ_ｉ，ｋの間のデータ距離ｄ_ｉは、（式１）により算出される。ここで、ｋは、特徴パラメータの次元であり、Ｉは、生成済みのクラスタの数である。なお、本実施の形態においては、特徴パラメータは２次元である。

Next, the data distance calculation unit 31 calculates the data distance between the new data and the centroid of every cluster generated so far (step S114). The data distance d_i between the feature parameter x_k (k = 0, 1, 2, ..., K) of the new data and the centroid C_{i,k} of cluster [i] (i = 0, 1, 2, ..., I) is calculated by (Equation 1). Here, k is the dimension of the feature parameter, and I is the number of generated clusters. In the present embodiment, the feature parameter is two-dimensional.
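Equation 1 itself does not appear in this text. As a hedged reconstruction only, assuming the data distance is the ordinary Euclidean distance over the dimensions of the feature parameter (the surviving text does not confirm the metric), it could be written as:

```latex
d_i = \sqrt{\sum_{k=0}^{K} \left( x_k - C_{i,k} \right)^2}, \qquad i = 0, 1, 2, \ldots, I
```

Under this assumption, one distance d_i is obtained per existing cluster, and the smallest of them is the value compared against the distance threshold D_p.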

次に、クラスタ決定部３２は、複数のクラスタが存在する場合には、複数のクラスタそれぞれに対して算出された新データのデータ距離の最小値と、予め設定された距離閾値（Ｄ_ｐ）とを比較する。なお、１つのクラスタのみ存在する場合には、算出されたデータ距離と距離閾値（Ｄ_ｐ）とを比較する。データ距離の最小値が距離閾値（Ｄ_ｐ）以下である場合には（ステップＳ１１５，Ｙｅｓ）、クラスタ決定部３２は、データ距離の最小値が得られたクラスタを新データが属するクラスタに決定する（ステップＳ１１６）。 Next, when a plurality of clusters exist, the cluster determination unit 32 compares the minimum of the data distances calculated for the new data against each of those clusters with the preset distance threshold (D_p). When only one cluster exists, the single calculated data distance is compared with the distance threshold (D_p). If the minimum data distance is equal to or smaller than the distance threshold (D_p) (Yes in step S115), the cluster determination unit 32 determines the cluster that gave the minimum data distance to be the cluster to which the new data belongs (step S116).

次に、セントロイド算出部３３は、新データのクラスタのセントロイドを算出し、データ更新部６０は、算出結果に基づいて、セントロイド記憶部５０のセントロイドを更新する（ステップＳ１１７）。具体的には、セントロイド算出部３３は、新データと、新データが属するクラスタに属する音データの特徴パラメータに基づいて、新データが属するクラスタのセントロイドを算出する。そして、データ更新部６０は、新データが属するクラスタのクラスタＩＤに対応付けられているセントロイドの値を、セントロイド算出部３３により算出されたセントロイドの値、すなわち新データを追加後のクラスタのセントロイドの値に更新する。 Next, the centroid calculation unit 33 calculates the centroid of the cluster of the new data, and the data update unit 60 updates the centroid in the centroid storage unit 50 based on the result (step S117). Specifically, the centroid calculation unit 33 calculates the centroid of the cluster to which the new data belongs from the feature parameter of the new data and the feature parameters of the sound data already belonging to that cluster. The data update unit 60 then replaces the centroid value associated with the cluster ID of that cluster with the value calculated by the centroid calculation unit 33, that is, the centroid of the cluster after the new data has been added.
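The source describes recomputing the centroid from the feature parameters of the cluster members. As an aside that is not taken from the source, if the centroid is the plain arithmetic mean and the member count n of the cluster is known, the same result can be obtained without revisiting the members. For a cluster with centroid C and n members, adding a feature vector x gives:

```latex
C' = \frac{n\,C + x}{n + 1} = C + \frac{x - C}{n + 1}
```

and removing a member x from a cluster with n >= 2 members gives:

```latex
C' = \frac{n\,C - x}{n - 1}
```

Both identities follow from writing the centroid as the sum of the member vectors divided by their count; the member count n is an extra quantity that would have to be stored alongside each centroid.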

次に、データ更新部６０は、アドレス［ｊ］に新データの特徴パラメータを格納し、新データ、新データの取得順、データＩＤ、特徴パラメータ、クラスタＩＤをデータ記憶部４０に書き込む（ステップＳ１１８）。以上で、処理が終了する。 Next, the data update unit 60 stores the feature parameter of the new data at the address [j], and writes the new data, the acquisition order of the new data, the data ID, the feature parameter, and the cluster ID in the data storage unit 40 (step S118). ). This is the end of the process.

なお、ステップＳ１１５において、データ距離の最小値が距離閾値（Ｄ_ｐ）よりも大きい場合には（ステップＳ１１５，Ｎｏ）、新データは既に生成済みのいずれのクラスタにも属さないと判断し、ステップＳ１０４に進み、新データのみを所属データとする新クラスタを生成する。 If the minimum data distance is larger than the distance threshold (D_p) in step S115 (No in step S115), the new data is judged not to belong to any of the clusters generated so far, and the process proceeds to step S104, where a new cluster containing only the new data is generated.

以上のように、本実施の形態にかかるクラスタリング装置１は、新データが追加された場合には、新データの特徴パラメータと既に生成されているクラスタのセントロイドのみに基づいて、クラスタを更新する。すなわち、本実施の形態にかかるクラスタリング装置１は、少ない演算量で逐次クラスタリングを行うことができる。 As described above, when new data is added, the clustering apparatus 1 according to the present embodiment updates the cluster based only on the feature parameter of the new data and the centroid of the already generated cluster. . That is, the clustering apparatus 1 according to the present embodiment can perform sequential clustering with a small amount of calculation.

また、データ配列数以上の数のデータを取得した場合には、古い音データから順に削除し、この場合には、削除されたデータが属していたクラスタについてのみクラスタの更新を行えばよいので、時々刻々と変化するデータに対し、少ない演算量で、常に最新の一定期間に得られたデータを適切にクラスタリングすることができる。 In addition, once the number of acquired data items reaches or exceeds the size of the data array, sound data is deleted starting from the oldest. In that case only the cluster to which the deleted data belonged needs to be updated, so that, for data that changes from moment to moment, the data obtained in the most recent fixed period can always be clustered appropriately with a small amount of computation.

さらに、データ配列の数を設定することにより、クラスタリングの対象となるデータの最大数を設定することができるので、利用者は、希望するデータ数、または希望する期間に相当するデータ数を設定するだけで、常に希望するデータ数のデータを対象としたクラスタリング結果を自動的に得ることができる。 Furthermore, since the maximum number of data to be clustered can be set by setting the number of data arrays, the user sets the desired number of data or the number of data corresponding to the desired period. Thus, it is possible to automatically obtain a clustering result for a desired number of data.

FIG. 6 is a diagram illustrating the cluster generation process. Assume that sound data 1 to 5 are input to the clustering apparatus 1 in numerical order. First, in response to the input of sound data 1, cluster A containing one data item is generated. The centroid of cluster A is the value of the feature parameter of sound data 1.

Next, when sound data 2 is input, the data distance between sound data 2 and the centroid of cluster A is calculated. Assume that this data distance is larger than the distance threshold (D_p). In this case, a new cluster B to which sound data 2 belongs is generated, and the centroid of cluster B is the value of the feature parameter of sound data 2.

Next, when sound data 3 is input, the data distance between sound data 3 and the centroid of cluster A and the data distance between sound data 3 and the centroid of cluster B are calculated. Assume that both data distances are larger than the distance threshold (D_p). In this case, a new cluster C (C1 in the figure) to which sound data 3 belongs is generated, and the centroid of cluster C is the value of the feature parameter of sound data 3.

Next, when sound data 4 is input, the data distances between sound data 4 and the centroids of clusters A to C, that is, three data distances, are calculated. Assume that the data distance to cluster C is the smallest of the three and that this minimum is equal to or less than the distance threshold (D_p). In this case, sound data 4 is determined to belong to cluster C. The centroid of cluster C (C2 in the figure) is then updated with sound data 3 and sound data 4 as its members.

Next, when sound data 5 is input, the data distances between sound data 5 and the centroids of clusters A to C are calculated. Assume that the data distance to cluster C (C2) is the smallest of the three and that this minimum is equal to or less than the distance threshold (D_p). In this case, sound data 5 is determined to belong to cluster C. The centroid of cluster C (C3) is then updated with sound data 3, sound data 4, and sound data 5 as its members.
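The walk-through of FIG. 6 can be reproduced with a short sketch. The coordinates and the threshold below are invented so that the five items fall into clusters A, B, C, C, C as in the figure; they are not values from the embodiment. The centroid is taken as the arithmetic mean of the members' feature values and the data distance as Euclidean distance, both of which are assumptions of this example.

```python
import math

def centroid(points):
    """Arithmetic mean of the member feature vectors (assumed centroid definition)."""
    return tuple(sum(xs) / len(points) for xs in zip(*points))

def cluster_sequentially(stream, d_p):
    members = {}    # cluster label -> list of member points
    centroids = {}  # cluster label -> current centroid
    labels = []
    for point in stream:
        # Nearest existing centroid, or None while no cluster exists.
        best = min(centroids,
                   key=lambda c: math.dist(point, centroids[c]),
                   default=None)
        if best is None or math.dist(point, centroids[best]) > d_p:
            best = chr(ord("A") + len(members))  # next unused label
            members[best] = []
        members[best].append(point)
        centroids[best] = centroid(members[best])  # only this cluster changes
        labels.append(best)
    return labels, centroids


# Hypothetical feature values standing in for sound data 1 to 5.
sound = [(0.0, 0.0), (100.0, 0.0), (0.0, 100.0), (4.0, 100.0), (2.0, 106.0)]
labels, centroids = cluster_sequentially(sound, d_p=10.0)
print(labels)          # ['A', 'B', 'C', 'C', 'C']
print(centroids["C"])  # (2.0, 102.0)
```

After sound data 4 the centroid of cluster C moves from (0, 100) to (2, 100), and after sound data 5 to (2, 102), which corresponds to the C1, C2, C3 progression in the figure.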

FIG. 7 is a diagram showing the process of deleting old data as sound data is acquired. In the example of FIG. 7, the data array X[j] holds ten sound data items, and when further data is input to the clustering apparatus 1, the oldest sound data is deleted first. Assume that by timing t1 in FIG. 7, sound data 1 to 10 have been clustered into clusters A to D.

At timing t1, new data 11 is input to the clustering apparatus 1. In this case, sound data 1, the oldest data stored in the data array X[j], is deleted at timing t1. Then, at timing t2, new data 11 is determined to belong to cluster C, and the centroid of cluster C is updated.

Along with the deletion of sound data 1, the centroid of cluster A, to which sound data 1 belonged, is updated at timing t2. Assume further that new data 12 is input at timing t2. In this case, sound data 2, the oldest data then stored in the data array X[j], is deleted at timing t2. At timing t3, a new cluster E to which new data 12 belongs is generated. With the deletion of sound data 2, cluster B, to which sound data 2 belonged, disappears. When no sound data belongs to a cluster, the data update unit 60 deletes from the centroid storage unit 50 either the centroid value associated with the cluster ID of that cluster, or both the cluster ID and the centroid value.
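The deletion process of FIG. 7 can be sketched with a fixed-size window. The class name `WindowedClusterer`, its attributes, and the sample values are hypothetical. The sketch also assumes that the oldest item is dropped and its cluster refreshed before the new item is assigned, and that Euclidean distance is the data distance; the text does not fix either detail at this level.

```python
import math
from collections import deque

class WindowedClusterer:
    """Sequential clustering over the most recent `capacity` items (illustrative only)."""

    def __init__(self, capacity, d_p):
        self.capacity = capacity  # size of the data array X[j]
        self.d_p = d_p            # distance threshold D_p
        self.window = deque()     # (point, cluster_id), oldest first
        self.centroids = {}       # cluster_id -> centroid
        self.next_id = 0

    def _refresh(self, cid):
        """Recompute one cluster's centroid, or drop the cluster if it is empty."""
        pts = [p for p, c in self.window if c == cid]
        if pts:
            self.centroids[cid] = tuple(sum(xs) / len(pts) for xs in zip(*pts))
        else:
            self.centroids.pop(cid, None)

    def add(self, point):
        if len(self.window) >= self.capacity:
            _, old = self.window.popleft()  # delete the oldest item
            self._refresh(old)              # only its cluster is touched
        best = min(self.centroids,
                   key=lambda c: math.dist(point, self.centroids[c]),
                   default=None)
        if best is None or math.dist(point, self.centroids[best]) > self.d_p:
            best = self.next_id             # open a new cluster
            self.next_id += 1
        self.window.append((point, best))
        self._refresh(best)
        return best


wc = WindowedClusterer(capacity=3, d_p=5.0)
stream = [(0.0, 0.0), (100.0, 100.0), (1.0, 1.0),
          (101.0, 101.0), (200.0, 200.0), (201.0, 201.0)]
ids = [wc.add(p) for p in stream]
```

In this made-up stream, cluster 0 loses its last member when the sixth item arrives and is removed from `centroids`, mirroring the disappearance of cluster B in FIG. 7.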

(Example 1)
Two-dimensional random data was clustered using the clustering apparatus 1 according to the embodiment. The input to the clustering apparatus 1 was 100 random data items in the data range given by (Equation 2):
−50 ≤ x ≤ 50, −50 ≤ y ≤ 50 (Equation 2)
The distance threshold (D_p) was set to 50. FIG. 8-1 shows the result of clustering the random data within this range with the clustering apparatus 1.

(Comparative Example 1)
Using the same data as in Example 1, clustering was performed with the group average method as a conventional method. FIG. 8-2 shows the clustering result of the group average method.
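For contrast with the sequential approach, a batch group average (average linkage) clustering can be sketched as below. This is a naive illustration under stated assumptions, not the procedure used for the comparative example: the function name `group_average`, the stopping rule based on a distance `cutoff`, and the sample points are all hypothetical, and the text does not say how the comparative clustering was terminated.

```python
import math
from itertools import combinations

def group_average(points, cutoff):
    """Agglomerative clustering: repeatedly merge the two clusters with the
    smallest mean pairwise distance until that distance exceeds `cutoff`."""
    clusters = [[p] for p in points]  # every point starts as its own cluster

    def avg_dist(a, b):
        return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: avg_dist(clusters[ij[0]], clusters[ij[1]]))
        if avg_dist(clusters[i], clusters[j]) > cutoff:
            break
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters


pts = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
result = group_average(pts, cutoff=3.0)
```

Unlike the sequential method, this procedure needs the whole data set before it starts and re-evaluates all cluster pairs at every merge, which is the cost the embodiment aims to avoid.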

(Example 2)
Using the clustering apparatus 1, clustering was performed on 103 data items: the 100 random data items in the data range of (Equation 2) plus three data items outside that range (referred to as singular data). The distance threshold (D_p) was set to 50. FIG. 9-1 shows the clustering result obtained with the clustering apparatus 1.
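A setup of this kind can be sketched as follows. The random seed, the coordinates of the three singular data items, and the function name `sequential_clusters` are invented for illustration, and Euclidean distance is assumed as the data distance; the sketch does not reproduce the data or the figures of the example.

```python
import math
import random

def sequential_clusters(points, d_p):
    """Nearest-centroid assignment with threshold d_p; returns the member lists."""
    members, centroids = [], []
    for p in points:
        dists = [math.dist(p, c) for c in centroids]
        k = dists.index(min(dists)) if dists else None
        if k is None or dists[k] > d_p:
            members.append([p])  # open a new cluster holding only p
            centroids.append(p)
        else:
            members[k].append(p)
            centroids[k] = tuple(sum(xs) / len(members[k])
                                 for xs in zip(*members[k]))
    return members


rng = random.Random(0)  # fixed seed, purely for repeatability
in_range = [(rng.uniform(-50, 50), rng.uniform(-50, 50)) for _ in range(100)]
# Hypothetical singular data, placed far outside the range of (Equation 2).
singular = [(300.0, 300.0), (-300.0, 300.0), (300.0, -300.0)]
clusters = sequential_clusters(in_range + singular, d_p=50)
singletons = [m for m in clusters if len(m) == 1 and m[0] in singular]
print(len(clusters), len(singletons))
```

Because every centroid of in-range data lies inside the square of (Equation 2), each of these far-away points is more than D_p from every existing centroid and ends up in a cluster of its own, which is the separation behaviour the example describes.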

(Comparative Example 2)
Using the same data as in Example 2, clustering was performed with the group average method. FIG. 9-2 shows the clustering result of the group average method.

As shown in FIGS. 8-1 and 8-2, clustering by the clustering apparatus 1 according to the present embodiment produced results similar to those of clustering by the group average method. As shown in FIG. 9-1, when clustering a data group containing singular data, the clustering apparatus 1 placed the singular data in clusters separate from the other data. Furthermore, as shown in FIGS. 9-1 and 9-2, for the data group containing singular data as well, clustering by the clustering apparatus 1 produced results similar to those of the group average method.

(Example 3)
Two-dimensional data in the data ranges shown in FIG. 10 was clustered using the clustering apparatus 1 according to the embodiment. As shown in FIG. 10, the data input to the clustering apparatus 1 is random data in the different data ranges of groups A to D, and the number of data items in each group is as shown in FIG. 10. FIG. 11 shows the data input order, the data numbers and data counts, the data ranges, and the cluster transitions. In the cluster transition column, for each input order the upper row shows the transition for a distance threshold (D_p) of 70 and the lower row shows the transition for a distance threshold (D_p) of 80. The data of each data range was input to the clustering apparatus 1 sequentially in the input order shown in FIG. 11.

FIGS. 12-1 to 17-2 show the clustering results after data input in the input order shown in FIG. 11. In each figure, branch number 1 and branch number 2 show the clustering results with the distance threshold (D_p) set to 70 and 80, respectively.

After the 100 data items of input order 1 were input, four clusters were generated with a distance threshold (D_p) of 70, as shown in FIG. 12-1. With a distance threshold (D_p) of 80, three clusters were generated, as shown in FIG. 12-2.

After the 120 data items of input order 2 were then input, cluster 5 was newly generated with a distance threshold (D_p) of 70, as shown in FIG. 13-1. With a distance threshold (D_p) of 80, cluster 4 was newly generated, as shown in FIG. 13-2.

After the 120 data items of input order 3 were then input, cluster 5 disappeared and cluster 6 was newly generated with a distance threshold (D_p) of 70, as shown in FIG. 14-1. With a distance threshold (D_p) of 80, cluster 4 disappeared and cluster 5 was generated, as shown in FIG. 14-2. For both thresholds, 70 and 80, it was confirmed that the data expected to be singular data, that is, data differing from the other data, was placed in cluster 6 and cluster 5 respectively, which are clusters separate from the other data.

After the 140 data items of input order 4 were then input, cluster 7 was newly generated with a distance threshold (D_p) of 70, as shown in FIG. 15-1. With a distance threshold (D_p) of 80, cluster 6 was generated, as shown in FIG. 15-2. Here too, it was confirmed that the data expected to be singular data was placed in cluster 7 and cluster 6 respectively, which are clusters separate from the other data.

After the 120 data items of input order 5 were then input, clusters 6 and 7 both disappeared and cluster 8 was newly generated with a distance threshold (D_p) of 70, as shown in FIG. 16-1. With a distance threshold (D_p) of 80, cluster 6 disappeared and cluster 7 was newly generated, as shown in FIG. 16-2.

After the 100 data items of input order 6 were then input, cluster 8 disappeared with a distance threshold (D_p) of 70, as shown in FIG. 17-1. With a distance threshold (D_p) of 80, cluster 7 disappeared, as shown in FIG. 17-2.

As described above, it was confirmed that, for a data group in which the feature amounts of the input data change over time, the clustering apparatus 1 of the present embodiment can perform clustering that follows those changes.

Furthermore, although varying the distance threshold (D_p) changes the number of clusters and the cluster structure, it was confirmed that singular data can be separated accurately when the distance threshold (D_p) is set to a value within an appropriate range.

When the distance threshold (D_p) is small relative to the spread of the data, the number of clusters increases, and when it is relatively large, the number of clusters decreases. The optimal distance threshold (D_p) depends on the feature amounts of the input data. It is therefore desirable to estimate in advance the feature amounts of the input data and the variation of the input data group, and to set the distance threshold (D_p) based on these values.
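The dependence of the cluster count on the threshold can be illustrated with a toy data set. The evenly spaced points and the three threshold values are invented for this sketch, and Euclidean distance is assumed as the data distance; the sketch says nothing about which threshold suits real data.

```python
import math

def count_clusters(points, d_p):
    """Run threshold-based sequential clustering and return the number of clusters."""
    members, centroids = [], []
    for p in points:
        dists = [math.dist(p, c) for c in centroids]
        k = dists.index(min(dists)) if dists else None
        if k is None or dists[k] > d_p:
            members.append([p])
            centroids.append(p)
        else:
            members[k].append(p)
            centroids[k] = tuple(sum(xs) / len(members[k])
                                 for xs in zip(*members[k]))
    return len(centroids)


# Ten hypothetical points spaced 10 apart along the x axis.
points = [(10.0 * i, 0.0) for i in range(10)]
counts = {d_p: count_clusters(points, d_p) for d_p in (5, 27, 1000)}
print(counts)  # {5: 10, 27: 2, 1000: 1}
```

With a threshold smaller than the spacing every point opens its own cluster, while a threshold larger than the whole span absorbs everything into one cluster, so the useful values lie between those extremes and depend on the spread of the data.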

As described above, in the clustering apparatus 1 according to the present embodiment, input data is not stored indefinitely but is deleted sequentially starting from the oldest, so memory can be used effectively. The clustering apparatus 1 can also perform clustering that reflects only the trend of relatively recent data. In addition, when data is input, the only data subject to computation is the newly input data and the other data belonging to the same cluster, and no iterative computation is needed to generate a new cluster. The clustering apparatus 1 according to the present embodiment can therefore reduce the amount of computation required when new data is input and improve processing efficiency.

DESCRIPTION OF SYMBOLS
1 Clustering apparatus
10 Data acquisition unit
20 Feature parameter calculation unit
30 Sequential clustering unit
31 Data distance calculation unit
32 Cluster determination unit
33 Centroid calculation unit
40 Data storage unit
50 Centroid storage unit
60 Data update unit

Claims

A clustering apparatus that sequentially clusters acquired data, comprising:
a data storage unit that stores data that has already been clustered in association with the cluster to which the data belongs;
a representative value storage unit that stores the cluster in association with a representative value representing the feature amount of the data belonging to the cluster;
a data acquisition unit that acquires new data to be clustered;
a data distance calculation unit that calculates a data distance between the feature amount of the new data acquired by the data acquisition unit and the representative value stored in the representative value storage unit;
a cluster determination unit that determines, based on the data distance, the cluster to which the new data belongs;
a representative value calculation unit that calculates, based on the feature amount of the new data, the representative value of the cluster to which the new data belongs; and
a data update unit that writes the new data to the data storage unit in association with the cluster to which the new data belongs, and writes the representative value of that cluster to the representative value storage unit in association with the cluster.

The clustering apparatus according to claim 1, wherein
the cluster determination unit compares the data distance calculated by the data distance calculation unit with a distance threshold and, if the data distance is larger than the distance threshold, determines a new cluster other than the clusters stored in the data storage unit as the cluster to which the new data belongs, and
the representative value calculation unit calculates the representative value of the new cluster based on the feature amount of the new data.

The clustering apparatus according to claim 1, wherein
the cluster determination unit compares the data distance calculated by the data distance calculation unit with a distance threshold and, if the data distance is equal to or less than the distance threshold, determines the cluster corresponding to the representative value for which that data distance was calculated as the cluster to which the new data belongs,
the representative value calculation unit refers to the data storage unit and calculates the representative value of the cluster to which the new data belongs, based on the feature amounts of the data belonging to the cluster determined for the new data and the feature amount of the new data, and
the data update unit updates the representative value, stored in the representative value storage unit, of the cluster determined for the new data to the representative value calculated by the representative value calculation unit.

The clustering apparatus according to any one of claims 1 to 3, wherein
the data storage unit further stores, in association with the data, the acquisition order in which the data acquisition unit acquired the data,
the clustering apparatus further comprises a deletion unit that, when the data acquisition unit acquires the data, compares the total number of data items stored in the data storage unit with a preset data number threshold and, if the total number is larger than the data number threshold, deletes the data with the earliest acquisition order among the data stored in the data storage unit,
the representative value calculation unit calculates the representative value of the cluster to which the deleted data belonged, based on the feature amounts of the data stored in the data storage unit after the deletion by the deletion unit, and
the data update unit updates the representative value, stored in the representative value storage unit, of the cluster to which the deleted data belonged to the representative value calculated by the representative value calculation unit.

The clustering apparatus according to any one of claims 1 to 4, wherein the representative value is the centroid position of the feature amounts of the data belonging to the cluster.

A clustering method executed by a clustering apparatus that sequentially clusters acquired data,
the clustering apparatus including a data storage unit that stores data that has already been clustered in association with the cluster to which the data belongs, and
a representative value storage unit that stores the cluster in association with a representative value representing the feature amount of the data belonging to the cluster,
the clustering method comprising:
a data acquisition step of acquiring new data to be clustered;
a data distance calculation step of calculating a data distance between the feature amount of the new data acquired in the data acquisition step and the representative value stored in the representative value storage unit;
a cluster determination step of determining, based on the data distance, the cluster to which the new data belongs;
a representative value calculation step of calculating, based on the feature amount of the new data, the representative value of the cluster to which the new data belongs; and
a data update step of writing the new data to the data storage unit in association with the cluster to which the new data belongs, and writing the representative value of that cluster to the representative value storage unit in association with the cluster.