JP6160445B2

JP6160445B2 - Analysis apparatus, analysis method, and analysis program

Info

Publication number: JP6160445B2
Application number: JP2013226058A
Authority: JP
Inventors: 松本　和宏
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2013-10-30
Filing date: 2013-10-30
Publication date: 2017-07-12
Anticipated expiration: 2033-10-30
Also published as: JP2015087966A

Description

The present invention relates to an analysis apparatus and the like.

Cluster analysis is a process of classifying a collection of data into a plurality of clusters based on the similarity between the data. Examples of cluster analysis include hierarchical cluster analysis and non-hierarchical cluster analysis.

In hierarchical cluster analysis, for example, each individual piece of data is set as one cluster, the similarity between clusters is calculated, and the process of merging the most similar clusters is executed repeatedly.

Non-hierarchical cluster analysis uses a function that represents the state of the classification and searches for a classification for which the value of the function is optimal.

JP 2007-179143 A; JP 2005-293048 A

However, the conventional techniques described above have the problem that cluster analysis of large-scale data takes a long time.

For example, hierarchical cluster analysis and non-hierarchical cluster analysis assume that cluster analysis is performed on small-scale data and medium-scale data, respectively. For this reason, when hierarchical or non-hierarchical cluster analysis is performed on large-scale data in a realistic computing environment, the calculation may not finish within a realistic amount of time.

In one aspect, an object is to provide an analysis apparatus, an analysis method, and an analysis program that can reduce the time required for cluster analysis.

In a first aspect, an analysis apparatus includes a sampling execution unit, a cluster analysis unit, a cluster prediction unit, a determination unit, and a final cluster calculation unit. The sampling execution unit generates a plurality of sampling data by repeatedly sampling input data and extracting a part of the data from the input data. The cluster analysis unit performs cluster analysis on the plurality of sampling data and, for each sampling data, classifies the data included in that sampling data into different clusters. The cluster prediction unit generates, based on the plurality of classification results of the cluster analysis unit for the plurality of sampling data and on the input data, a plurality of prediction data, each of which predicts the cluster to which each piece of data included in the input data belongs. The determination unit calculates an evaluation value for each prediction data based on the inter-cluster distance and the intra-cluster distance of the prediction data, and determines the prediction data whose evaluation values are Pareto solutions. The final cluster calculation unit classifies the data included in the input data into clusters based on the prediction data whose evaluation values are Pareto solutions. Prediction data that is a Pareto solution is, for example, prediction data whose evaluation value is superior to that of the other prediction data.

According to one embodiment of the present invention, the time required for cluster analysis can be reduced.

FIG. 1 is a functional block diagram illustrating the configuration of the analysis apparatus according to the first embodiment.
FIG. 2 is a diagram illustrating an example of the data structure of the analysis target data.
FIG. 3 is a diagram illustrating an example of the data structure of the sampling data table.
FIG. 4 is a diagram illustrating an example of the data structure of the prediction data table.
FIG. 5 is a diagram illustrating an example of the data structure of the evaluation value data table.
FIG. 6 is a diagram illustrating an example of the data structure of the intermediate data table.
FIG. 7 is a diagram illustrating an example of the data structure of the final data.
FIG. 8 is a diagram (1) illustrating the relationship between the intra-cluster distance and the inter-cluster distance of each prediction data.
FIG. 9 is a diagram illustrating an example of the final data candidate table.
FIG. 10 is a flowchart illustrating the processing procedure of the analysis apparatus according to the first embodiment.
FIG. 11 is a functional block diagram illustrating the configuration of the analysis apparatus according to the second embodiment.
FIG. 12 is a diagram (2) illustrating the relationship between the intra-cluster distance and the inter-cluster distance of each prediction data.
FIG. 13 is a flowchart (1) illustrating the processing procedure of the analysis apparatus according to the second embodiment.
FIG. 14 is a flowchart (2) illustrating the processing procedure of the analysis apparatus according to the second embodiment.
FIG. 15 is a diagram illustrating an example of a computer that executes the analysis program.

Embodiments of the analysis apparatus, analysis method, and analysis program disclosed in the present application are described in detail below with reference to the drawings. The present invention is not limited by these embodiments.

FIG. 1 is a functional block diagram illustrating the configuration of the analysis apparatus according to the first embodiment. As illustrated in FIG. 1, the analysis apparatus 100 includes a communication unit 110, an input unit 120, an output unit 130, a storage unit 140, and a control unit 150.

The communication unit 110 is a processing unit that connects to a network wirelessly or by wire and performs data communication with other devices via the network. The communication unit 110 corresponds to a communication device.

The input unit 120 is an input device for entering various types of information. The input unit 120 corresponds to, for example, a keyboard, a mouse, or a touch panel.

The output unit 130 is a display device that displays information output from the control unit 150. The output unit 130 corresponds to, for example, a monitor, a liquid crystal display, or a touch panel.

The storage unit 140 holds analysis target data 141, a sampling data table 142, a prediction data table 143, an evaluation value data table 144, an intermediate data table 145, and final data 146. The storage unit 140 corresponds to, for example, a semiconductor memory device such as a RAM (Random Access Memory), a ROM (Read Only Memory), or a flash memory, or a storage device such as an HDD (Hard Disk Drive).

The analysis target data 141 is the data to be subjected to cluster analysis. FIG. 2 is a diagram illustrating an example of the data structure of the analysis target data. As illustrated in FIG. 2, the analysis target data 141 includes an identification number, age, sex, height, weight, and the like. The identification number is information that uniquely identifies each record. The age, sex, height, and weight are information indicating the age, sex, height, and weight of a specific person, respectively. In the example illustrated in FIG. 2, the sex is represented by 1 or 2: sex "1" indicates male, and sex "2" indicates female.

The sampling data table 142 is a table that holds a plurality of sampling data. Each sampling data is generated by a sampling execution unit 151 described later, which samples the analysis target data 141. FIG. 3 is a diagram illustrating an example of the data structure of the sampling data table. As illustrated in FIG. 3, the sampling data table 142 includes sampling data 142a, 142b, and 142c. FIG. 3 shows the sampling data 142a, 142b, and 142c as an example, but the table may include other sampling data.

In FIG. 3, for example, the sampling data 142a includes an identification number, age, sex, height, weight, and cluster number. The identification number, age, sex, height, and weight are the same as the corresponding items described with reference to FIG. 2. The cluster number is assigned by a cluster analysis unit 152 described later.

The prediction data table 143 is a table that holds a plurality of prediction data. Each prediction data is generated by a cluster prediction unit 153 described later. The cluster prediction unit 153 generates prediction data by predicting the cluster number of each record of the analysis target data 141 based on sampling data. One prediction data is generated for each sampling data. FIG. 4 is a diagram illustrating an example of the data structure of the prediction data table. As illustrated in FIG. 4, the prediction data table 143 includes prediction data 143a, 143b, and 143c. FIG. 4 shows the prediction data 143a, 143b, and 143c as an example, but the table may include other prediction data.

The evaluation value data table 144 is a table that holds the evaluation value of each prediction data. FIG. 5 is a diagram illustrating an example of the data structure of the evaluation value data table. As illustrated in FIG. 5, the evaluation value data table 144 associates prediction data identification information with an evaluation value. The prediction data identification information is information that uniquely identifies prediction data.

In FIG. 5, the evaluation value includes an inter-cluster distance and an intra-cluster distance. The inter-cluster distance indicates the distance between different clusters. In general, the larger the inter-cluster distance, the better the cluster analysis result is rated. The intra-cluster distance indicates the diameter of a cluster. In general, the smaller the intra-cluster distance, the better the cluster analysis result is rated. In other words, a cluster analysis result is better when the inter-cluster distance is larger and the intra-cluster distance is smaller.

The intermediate data table 145 is a table that holds a plurality of intermediate data. Each intermediate data is created for prediction data that has a good evaluation value. FIG. 6 is a diagram illustrating an example of the data structure of the intermediate data table. As illustrated in FIG. 6, the intermediate data table 145 includes intermediate data 145a, 145g, and 145z. FIG. 6 shows the intermediate data 145a, 145g, and 145z as an example, but the table may include other intermediate data.

In FIG. 6, each intermediate data associates an identification number with a cluster number. The identification number corresponds to the identification number in the analysis target data 141. For example, the first row of the intermediate data 145a indicates that the record with the identification number "1001" is classified into cluster number "1". That is, the record with the identification number "1001" in the analysis target data 141 illustrated in FIG. 2 belongs to the cluster with cluster number "1".

The final data 146 indicates the final cluster analysis result for the analysis target data 141. FIG. 7 is a diagram illustrating an example of the data structure of the final data. As illustrated in FIG. 7, the final data 146 associates an identification number with a cluster number. The identification number corresponds to the identification number in the analysis target data 141. For example, the first row of the final data 146 indicates that the record with the identification number "1001" is classified into cluster number "1".

The description now returns to FIG. 1. The control unit 150 includes a sampling execution unit 151, a cluster analysis unit 152, a cluster prediction unit 153, a determination unit 154, and a final cluster calculation unit 155. The control unit 150 corresponds to, for example, an integrated device such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array). Alternatively, the control unit 150 corresponds to, for example, an electronic circuit such as a CPU (Central Processing Unit) or an MPU (Micro Processing Unit).

The sampling execution unit 151 is a processing unit that generates a plurality of sampling data by sampling the analysis target data 141 a plurality of times. The sampling execution unit 151 stores each generated sampling data in the sampling data table 142.

For example, the sampling execution unit 151 acquires the number of calculations and the number of samples via the input unit 120, and performs sampling as many times as the acquired number of calculations. The sampling execution unit 151 may change the sampling interval each time it performs sampling, or it may perform random sampling.

The sampling execution unit 151 sets the number of records in each sampling data to the number of samples acquired from the input unit 120. For example, when the specified number of samples is N2, each sampling data contains N2 records. If the analysis target data contains N1 records, then N1 > N2.
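The repeated sampling described above can be sketched as follows. This is a minimal illustration that assumes the random sampling option; the function name `make_samples`, its parameters, and the toy records are hypothetical and do not appear in the source.

```python
import random


def make_samples(records, num_runs, sample_size, seed=0):
    """Draw `num_runs` sampling data, each holding `sample_size` records.

    `records` plays the role of the analysis target data (N1 records),
    `sample_size` plays the role of N2, and `num_runs` is the number of
    calculations. Random sampling without replacement is an assumption;
    the text also allows interval-based sampling.
    """
    if sample_size >= len(records):
        raise ValueError("sample_size (N2) must be smaller than len(records) (N1)")
    rng = random.Random(seed)
    return [rng.sample(records, sample_size) for _ in range(num_runs)]


# Toy analysis target data: 100 records with hypothetical field values.
records = [{"id": 1001 + i, "age": 20 + i % 50} for i in range(100)]
samples = make_samples(records, num_runs=3, sample_size=10)
```

Each element of `samples` corresponds to one sampling data such as 142a, 142b, or 142c.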

The cluster analysis unit 152 is a processing unit that acquires each sampling data stored in the sampling data table 142 and performs cluster analysis on it. The cluster analysis unit 152 assigns a cluster number to each record of the sampling data according to the cluster analysis result.

The description uses the sampling data table 142 illustrated in FIG. 3 as an example. The cluster analysis unit 152 first performs cluster analysis on the sampling data 142a, classifies each record of the sampling data 142a into one of a plurality of clusters, and assigns cluster numbers according to the classification result. The cluster analysis unit 152 likewise performs cluster analysis on the sampling data 142b and 142c, classifies each record into one of a plurality of clusters, and assigns cluster numbers according to the classification result. The number of clusters into which the cluster analysis unit 152 classifies records is assumed to be set in advance.

The cluster analysis performed by the cluster analysis unit 152 may be hierarchical or non-hierarchical. As an example, the case where the cluster analysis unit 152 performs hierarchical cluster analysis is described here.

When performing hierarchical cluster analysis, the cluster analysis unit 152 first sets each individual piece of data as one cluster and calculates the similarity between clusters. The cluster analysis unit 152 then merges the most similar clusters. The cluster analysis unit 152 repeats this processing until the number of clusters equals the preset number of clusters.

For example, the cluster analysis unit 152 calculates the Euclidean distance between the clusters in each pair of clusters and merges the pair with the smallest Euclidean distance. In this case, the Euclidean distance between clusters corresponds to the similarity between clusters: the shorter the Euclidean distance, the higher the similarity.
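The merge loop described above can be sketched as follows. The source does not state how the Euclidean distance between two multi-record clusters is measured, so this sketch assumes the distance between cluster centroids; the names `centroid` and `agglomerate` are hypothetical, and the naive pair search is meant for illustration rather than efficiency.

```python
import math


def centroid(cluster):
    """Mean point of a cluster (assumed way to represent a cluster)."""
    dims = len(cluster[0])
    return [sum(p[d] for p in cluster) / len(cluster) for d in range(dims)]


def agglomerate(points, num_clusters):
    """Merge the closest pair of clusters until `num_clusters` remain."""
    if num_clusters < 1:
        raise ValueError("num_clusters must be at least 1")
    clusters = [[p] for p in points]  # every record starts as its own cluster
    while len(clusters) > num_clusters:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the most similar pair
        del clusters[j]
    return clusters


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
clusters = agglomerate(points, num_clusters=2)
# clusters == [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]
```

Because the loop runs only on one sampling data of N2 records rather than on all N1 records, the cost of the pairwise search stays bounded by the sample size.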

The cluster prediction unit 153 is a processing unit that predicts the cluster number of each record of the analysis target data 141 based on the cluster analysis results of the sampling data in the sampling data table 142, and generates the prediction data table 143. The cluster prediction unit 153 generates as many prediction data as there are sampling data in the sampling data table 142, and registers the generated prediction data in the prediction data table 143.

For example, the cluster prediction unit 153 generates the prediction data 143a based on the sampling data 142a in the sampling data table 142, the prediction data 143b based on the sampling data 142b, and the prediction data 143c based on the sampling data 142c. When there are N sampling data, the cluster prediction unit 153 creates N prediction data.

An example of the processing by which the cluster prediction unit 153 generates the prediction data 143a based on the sampling data 142a is described here. First, the cluster prediction unit 153 copies the relationship between the identification numbers and the cluster numbers included in the sampling data 142a, as is, into the prediction data 143a.

For example, when the cluster number of the record with the identification number "1001" in the sampling data 142a is "1", the cluster prediction unit 153 sets the cluster number for the identification number "1001" in the prediction data 143a to "1". In the same way, the cluster prediction unit 153 sets the relationship between every identification number and cluster number present in the sampling data 142a in the prediction data 143a.

Next, the cluster prediction unit 153 performs the following processing on the records whose cluster number is still unset after the processing above. First, the cluster prediction unit 153 detects a representative record from the records classified into each cluster. For example, among the records with cluster number "1", the record that has average values is detected as the representative record. The cluster prediction unit 153 detects the representative records for the other cluster numbers in the same way.

The cluster prediction unit 153 calculates the Euclidean distance between a record whose cluster number is unset and each representative record, and identifies the pair with the smallest Euclidean distance. The cluster prediction unit 153 sets the cluster number of the representative record in the identified pair as the cluster number of the record in question.

For example, the cluster prediction unit 153 calculates the Euclidean distance between a record whose cluster number is unset and each representative record, and when the Euclidean distance between that record and the representative record of cluster number "1" is the smallest, it sets the cluster number of that record to "1". The cluster prediction unit 153 generates the prediction data table 143 by repeating this processing for the unset records.
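The prediction step described above can be sketched as follows. The source only says that the representative record "has average values", so this sketch assumes it is the member record closest to the cluster mean; the names `representative` and `predict_clusters` and the two-field toy records are hypothetical.

```python
import math


def representative(records):
    """Member record nearest to the cluster mean (an assumed reading of
    'the record that has average values')."""
    dims = len(records[0])
    mean = [sum(r[d] for r in records) / len(records) for d in range(dims)]
    return min(records, key=lambda r: math.dist(r, mean))


def predict_clusters(all_records, sampled_labels):
    """Extend the labels of one sampling data to all analysis target records.

    all_records:    {identification number: feature tuple}
    sampled_labels: {identification number: cluster number} for sampled records
    """
    members = {}
    for rec_id, label in sampled_labels.items():
        members.setdefault(label, []).append(all_records[rec_id])
    reps = {label: representative(recs) for label, recs in members.items()}

    predicted = dict(sampled_labels)  # sampled records keep their cluster numbers
    for rec_id, features in all_records.items():
        if rec_id not in predicted:  # cluster number still unset
            predicted[rec_id] = min(
                reps, key=lambda label: math.dist(features, reps[label])
            )
    return predicted


# Hypothetical (age, height) records; 1005 and 1006 were not sampled.
all_records = {
    1001: (20, 170), 1002: (22, 172), 1003: (60, 150),
    1004: (62, 152), 1005: (21, 171), 1006: (61, 151),
}
sampled_labels = {1001: 1, 1002: 1, 1003: 2, 1004: 2}
predicted = predict_clusters(all_records, sampled_labels)
# predicted[1005] == 1 and predicted[1006] == 2
```

Each unset record needs only one distance calculation per cluster, which is why labeling the full data set this way is much cheaper than clustering it directly.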

The determination unit 154 is a processing unit that generates the evaluation value data table 144 based on the prediction data table 143. The determination unit 154 calculates an evaluation value for each prediction data included in the prediction data table 143.

The determination unit 154 calculates the inter-cluster distance and the intra-cluster distance for each prediction data, and uses them as the evaluation value of that prediction data. An example of the processing for calculating the inter-cluster distance of prediction data is described first. Clusters with cluster numbers "1" to "3" are assumed to exist. The determination unit 154 detects a first representative record belonging to cluster number "1", a second representative record belonging to cluster number "2", and a third representative record belonging to cluster number "3". As an example of detecting a representative record, the record that has average values among the records belonging to the same cluster number is detected as the representative record.

The determination unit 154 calculates the Euclidean distance between the first representative record and the second representative record, and the Euclidean distance between the first representative record and the third representative record. The determination unit 154 uses the average of the calculated Euclidean distances as the inter-cluster distance of the prediction data.

For example, let the age, sex, height, and weight values of the first representative record be a1, a2, a3, and a4, those of the second representative record be b1, b2, b3, and b4, and those of the third representative record be c1, c2, c3, and c4. In this case, the Euclidean distance X1 between the first representative record and the second representative record is calculated by Expression (1), and the Euclidean distance X2 between the first representative record and the third representative record is calculated by Expression (2). The inter-cluster distance of the prediction data is then given by Expression (3).

$$X_1 = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + (a_3 - b_3)^2 + (a_4 - b_4)^2} \tag{1}$$

$$X_2 = \sqrt{(a_1 - c_1)^2 + (a_2 - c_2)^2 + (a_3 - c_3)^2 + (a_4 - c_4)^2} \tag{2}$$

$$\text{inter-cluster distance} = \frac{X_1 + X_2}{2} \tag{3}$$
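Expressions (1) to (3) can be checked numerically with a short sketch. It follows the worked example literally, averaging the distances from the first representative record to each of the other representative records; the function name `inter_cluster_distance` and the sample values are hypothetical.

```python
import math


def inter_cluster_distance(representatives):
    """Average Euclidean distance from the first representative record to
    every other representative record, as in Expressions (1) to (3)."""
    if len(representatives) < 2:
        raise ValueError("at least two representative records are required")
    first, *others = representatives
    distances = [math.dist(first, other) for other in others]  # X1, X2, ...
    return sum(distances) / len(distances)


# Hypothetical (age, sex, height, weight) values chosen so that
# X1 = 5 and X2 = 10.
reps = [(0, 0, 0, 0), (3, 4, 0, 0), (0, 0, 6, 8)]
value = inter_cluster_distance(reps)
# value == 7.5
```

Note that the fields are used on their raw scales here, as in the expressions; whether the source normalizes fields such as height and sex before taking distances is not stated.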

Next, the processing for calculating the intra-cluster distance is described. First, the determination unit 154 calculates the Euclidean distances between the records that belong to the same cluster number, and uses the average of those distances as the intra-cluster distance of that cluster. The determination unit 154 then calculates the intra-cluster distance of the prediction data by averaging the intra-cluster distances of the clusters of all cluster numbers.

For example, when clusters with cluster numbers "1" to "3" exist, the determination unit 154 calculates the intra-cluster distance of each of the clusters "1" to "3" and averages them to obtain the intra-cluster distance of the prediction data.

As an example, the calculation of the intra-cluster distance of cluster number "1" is described. The cluster is assumed to contain three records: a first record, a second record, and a third record. Let the age, sex, height, and weight values of the first record be d1, d2, d3, and d4, those of the second record be e1, e2, e3, and e4, and those of the third record be f1, f2, f3, and f4. In this case, the Euclidean distance Y1 between the first record and the second record is calculated by Expression (4), and the Euclidean distance Y2 between the first record and the third record is calculated by Expression (5). The intra-cluster distance of the cluster with cluster number "1" is then given by Expression (6).

$$Y_1 = \sqrt{(d_1 - e_1)^2 + (d_2 - e_2)^2 + (d_3 - e_3)^2 + (d_4 - e_4)^2} \tag{4}$$

$$Y_2 = \sqrt{(d_1 - f_1)^2 + (d_2 - f_2)^2 + (d_3 - f_3)^2 + (d_4 - f_4)^2} \tag{5}$$

$$\text{intra-cluster distance} = \frac{Y_1 + Y_2}{2} \tag{6}$$

The determination unit 154 calculates the intra-cluster distance of the other clusters in the same way, and calculates the intra-cluster distance of the prediction data by averaging the intra-cluster distances of all clusters.
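The intra-cluster calculation can be sketched in the same style. This sketch follows the worked example in Expressions (4) to (6), which averages the distances from the first record of a cluster to each other record; treating a single-record cluster as having distance 0 is an assumption, and the function names and sample values are hypothetical.

```python
import math


def cluster_intra_distance(records):
    """Average distance from the first record to each other record of one
    cluster, as in Expressions (4) to (6)."""
    first, *others = records
    if not others:
        return 0.0  # assumption: a single-record cluster has zero spread
    return sum(math.dist(first, other) for other in others) / len(others)


def intra_cluster_distance(clusters):
    """Average of the per-cluster distances over all clusters.

    clusters: {cluster number: list of feature tuples}
    """
    per_cluster = [cluster_intra_distance(recs) for recs in clusters.values()]
    return sum(per_cluster) / len(per_cluster)


# Hypothetical two-field records: cluster 1 gives (5 + 10) / 2 = 7.5,
# cluster 2 gives 2.0, so the prediction data scores (7.5 + 2.0) / 2.
clusters = {1: [(0, 0), (3, 4), (6, 8)], 2: [(1, 1), (1, 3)]}
value = intra_cluster_distance(clusters)
# value == 4.75
```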

The determination unit 154 calculates the evaluation value of each prediction data by performing the processing above for every prediction data included in the prediction data table 143, and generates the evaluation value data table 144.

最終クラスタ計算部１５５は、分析対象データ１４１の最終的なクラスタ分析結果となる最終データ１４６を生成する処理部である。最終クラスタ計算部１５５は、評価値データテーブル１４４から中間データテーブル１４５を生成する処理を行った後に、中間データテーブル１４５を基にして、最終データ１４６を生成する。 The final cluster calculation unit 155 is a processing unit that generates final data 146 that is a final cluster analysis result of the analysis target data 141. The final cluster calculation unit 155 generates the final data 146 based on the intermediate data table 145 after performing the process of generating the intermediate data table 145 from the evaluation value data table 144.

最終クラスタ計算部１５５が、評価値データテーブル１４４から中間データテーブル１４５を生成する処理の一例について説明する。最終クラスタ計算部１５５は、評価値データテーブル１４４の予測データ毎の評価値を比較して、パレート解となる予測データを特定し、特定したパレート解となる予測データを、中間データテーブル１４５に設定する。例えば、パレート解となる予測データは、一つ以上の項目について他の予測データよりも優れているものとなる。 An example of the process by which the final cluster calculation unit 155 generates the intermediate data table 145 from the evaluation value data table 144 will be described. The final cluster calculation unit 155 compares the evaluation values of the prediction data in the evaluation value data table 144, identifies the prediction data that are Pareto solutions, and sets the identified prediction data in the intermediate data table 145. For example, prediction data that is a Pareto solution is superior to the other prediction data in one or more items.

図８は、各予測データのクラスタ内距離とクラスタ間距離との関係を示す図（１）である。図８において、縦軸はクラスタ内距離を示し、横軸はクラスタ間距離を示す。一般的に、クラスタ間距離が大きいほど、また、クラスタ内距離が小さいほど、予測データは、良い予測データであると言える。このため、図８に示す例では、最終クラスタ計算部１５５は、予測データ１４３ａ，１４３ｇ，１４３ｚを、パレート解として特定する。 FIG. 8 is a diagram (1) showing the relationship between the intra-cluster distance and the inter-cluster distance of each prediction data. In FIG. 8, the vertical axis represents the intra-cluster distance, and the horizontal axis represents the inter-cluster distance. In general, it can be said that the prediction data is better prediction data as the inter-cluster distance is larger and the intra-cluster distance is smaller. Therefore, in the example illustrated in FIG. 8, the final cluster calculation unit 155 identifies the prediction data 143a, 143g, and 143z as Pareto solutions.
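The selection of Pareto solutions can be illustrated with a short sketch. It assumes the standard non-dominated definition (one prediction data dominates another when it is no worse on both criteria and strictly better on at least one); the function name and the numeric values in the usage note are hypothetical and are not taken from FIG. 8.

```python
def pareto_front(evaluations):
    """Return the names of prediction data that no other prediction data dominates.

    evaluations: dict mapping name -> (intra_cluster_distance, inter_cluster_distance).
    A smaller intra-cluster distance and a larger inter-cluster distance are better.
    """
    def dominates(p, q):
        # p is at least as good as q on both criteria and strictly better on one
        return p[0] <= q[0] and p[1] >= q[1] and (p[0] < q[0] or p[1] > q[1])

    return [
        name
        for name, value in evaluations.items()
        if not any(
            dominates(other, value)
            for other_name, other in evaluations.items()
            if other_name != name
        )
    ]
```

For instance, with the made-up values `{"143a": (1.0, 2.0), "143b": (2.0, 1.5), "143g": (2.0, 4.0), "143z": (3.0, 6.0)}`, the entry "143b" is dominated by "143a" and is dropped, while the other three remain.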

続いて、最終クラスタ計算部１５５が、中間データテーブル１４５から最終データ１４６を生成する処理について説明する。まず、最終クラスタ計算部１５５は、最終データ候補テーブルを生成する。図９は、最終データ候補テーブルの一例を示す図である。図９に示すように、この最終データ候補テーブル１０は、最終データ候補１０ａ，１０ｂ，１０ｃを有する。ここでは一例として、最終データ候補１０ａ，１０ｂ，１０ｃを示すが、これ以外に、最終データ候補を含んでいても良い。 Subsequently, a process in which the final cluster calculation unit 155 generates the final data 146 from the intermediate data table 145 will be described. First, the final cluster calculation unit 155 generates a final data candidate table. FIG. 9 is a diagram illustrating an example of the final data candidate table. As shown in FIG. 9, the final data candidate table 10 includes final data candidates 10a, 10b, and 10c. Here, as an example, the final data candidates 10a, 10b, and 10c are shown, but other final data candidates may be included.

最終クラスタ計算部１５５は、最終データ候補１０ａ，１０ｂ，１０ｃの各クラスタ番号を０の初期値に設定する。そして、最終クラスタ計算部１５５は、各識別番号の各クラスタ番号の値のいずれか一つが「１」となるように、ランダムに「１」を割り振る。例えば、図９に示す例では、最終データ候補１０ａの識別番号「１００１」に対してランダムに「１」を割り振ることで、クラスタ番号「１」に対応するものが「１」に設定され、その他のクラスタ番号については「０」が設定される。 The final cluster calculation unit 155 initializes every cluster number of the final data candidates 10a, 10b, and 10c to 0. Then, for each identification number, the final cluster calculation unit 155 randomly assigns “1” so that exactly one of the cluster-number values for that identification number is “1”. For example, in FIG. 9, randomly assigning “1” for the identification number “1001” of the final data candidate 10a sets the value for cluster number “1” to “1”, and “0” is set for the other cluster numbers.
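The initialization of a final data candidate can be sketched as below. The function name, the row layout (one 0/1 list per identification number), and the use of an explicit random generator are assumptions made for illustration.

```python
import random


def make_candidate(record_ids, num_clusters, rng):
    """Build one final data candidate with exactly one '1' per identification number."""
    candidate = {}
    for rid in record_ids:
        row = [0] * num_clusters             # every cluster number starts at 0
        row[rng.randrange(num_clusters)] = 1  # one randomly chosen cluster gets 1
        candidate[rid] = row
    return candidate
```

Passing a seeded `random.Random` instance makes the candidates reproducible, which can help when comparing runs.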

最終クラスタ計算部１５５は、最終データ候補テーブル１０の各最終データ候補１０ａ，１０ｂ，１０ｃと、中間データテーブル１４５の各中間データとの類似度を計算し、最も類似度の高い最終データ候補を、最終データ１４６として特定する。 The final cluster calculation unit 155 calculates the similarity between each of the final data candidates 10a, 10b, and 10c in the final data candidate table 10 and each intermediate data in the intermediate data table 145, and identifies the final data candidate with the highest similarity as the final data 146.

最終クラスタ計算部１５５は、中間データの識別番号および識別番号に対応するクラスタ番号と、最終データ候補の識別番号および識別番号に対応するクラスタ番号とを比較し、一致する数を計数する。最終クラスタ計算部１５５は、一致する数を、全レコード数で除算することで、類似度を算出する。以下の説明では、一致する数を、一致数と表記する。 The final cluster calculation unit 155 compares the identification numbers of the intermediate data and their corresponding cluster numbers with the identification numbers of the final data candidate and their corresponding cluster numbers, and counts the number of matches. The final cluster calculation unit 155 calculates the similarity by dividing the number of matches by the total number of records. In the following description, the number of matches is referred to as the match count.

例えば、最終クラスタ計算部１５５が、最終データ候補１０ａの類似度を算出する場合について説明する。最終クラスタ計算部１５５は、最終データ候補１０ａと中間データ１４５ａとを比較し、一致数が「Ｌ１」であり、最終データ候補１０ａの全レコード数が「Ｍ１」の場合には、最終データ候補１０ａと中間データ１４５ａとの類似度は「Ｌ１／Ｍ１」となる。最終クラスタ計算部１５５は、最終データ候補１０ａと中間データ１４５ｇとを比較し、一致数が「Ｌ２」であり、最終データ候補１０ａの全レコード数が「Ｍ２」の場合には、最終データ候補１０ａと中間データ１４５ｇとの類似度は「Ｌ２／Ｍ２」となる。最終クラスタ計算部１５５は、最終データ候補１０ａと中間データ１４５ｚとを比較し、一致数が「Ｌ３」であり、最終データ候補１０ａの全レコード数が「Ｍ３」の場合には、最終データ候補１０ａと中間データ１４５ｚとの類似度は「Ｌ３／Ｍ３」となる。この場合には、最終クラスタ計算部１５５は、最終データ候補１０ａの類似度を「Ｌ１／Ｍ１＋Ｌ２／Ｍ２＋Ｌ３／Ｍ３」と特定する。 For example, a case where the final cluster calculation unit 155 calculates the similarity of the final data candidate 10a will be described. The final cluster calculation unit 155 compares the final data candidate 10a with the intermediate data 145a. If the match count is “L1” and the total number of records of the final data candidate 10a is “M1”, the similarity between the final data candidate 10a and the intermediate data 145a is “L1 / M1”. The final cluster calculation unit 155 compares the final data candidate 10a with the intermediate data 145g. If the match count is “L2” and the total number of records of the final data candidate 10a is “M2”, the similarity between the final data candidate 10a and the intermediate data 145g is “L2 / M2”. The final cluster calculation unit 155 compares the final data candidate 10a with the intermediate data 145z. If the match count is “L3” and the total number of records of the final data candidate 10a is “M3”, the similarity between the final data candidate 10a and the intermediate data 145z is “L3 / M3”. In this case, the final cluster calculation unit 155 determines the similarity of the final data candidate 10a to be “L1 / M1 + L2 / M2 + L3 / M3”.

最終クラスタ計算部１５５は、最終データ候補１０ｂ，１０ｃに関しても、最終データ候補１０ａと同様にして、類似度を算出する。最終クラスタ計算部１５５は、最終データ候補１０ａの類似度、最終データ候補１０ｂの類似度、最終データ候補１０ｃの類似度を比較し、類似度が最大となる最終データ候補を特定する。最終クラスタ計算部１５５は、特定した最終データ候補を、最終データ１４６として設定する。最終クラスタ計算部１５５は、最終データ１４６を、出力部１３０に出力しても良い。 The final cluster calculation unit 155 calculates the similarity for the final data candidates 10b and 10c in the same manner as the final data candidate 10a. The final cluster calculation unit 155 compares the similarity of the final data candidate 10a, the similarity of the final data candidate 10b, and the similarity of the final data candidate 10c, and specifies the final data candidate having the maximum similarity. The final cluster calculation unit 155 sets the specified final data candidate as the final data 146. The final cluster calculation unit 155 may output the final data 146 to the output unit 130.
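The similarity calculation and the selection of the final data can be sketched as follows. The sketch assumes a compact representation in which each candidate and each intermediate data is a dict mapping an identification number to a single cluster number; the function names and sample values are hypothetical. Cluster numbers are compared directly, as the text describes.

```python
def match_ratio(candidate, intermediate):
    """Match count divided by the candidate's total record count (e.g. L1 / M1)."""
    matches = sum(
        1 for rid, cluster in candidate.items() if intermediate.get(rid) == cluster
    )
    return matches / len(candidate)


def candidate_similarity(candidate, intermediates):
    """Sum of the ratios over all intermediate data (L1/M1 + L2/M2 + ...)."""
    return sum(match_ratio(candidate, inter) for inter in intermediates)


def select_final(candidates, intermediates):
    """Return the name of the candidate whose similarity is largest."""
    return max(
        candidates,
        key=lambda name: candidate_similarity(candidates[name], intermediates),
    )
```

Because the similarity is a sum of ratios, its upper bound equals the number of intermediate data rather than 1.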

次に、本実施例１にかかる分析装置１００の処理手順について説明する。図１０は、本実施例１にかかる分析装置の処理手順を示すフローチャートである。図１０に示すように、分析装置１００は、分析対象データ１４１を受け付ける（ステップＳ１０１）。また、分析装置１００は、繰り返し計算回数を受け付け（ステップＳ１０２）、サンプリング件数を受け付ける。また、カウント値を初期化する（ステップＳ１０３）。分析装置１００は、カウント値に１を加算する（ステップＳ１０４）。カウント値の初期値を０とする。 Next, the processing procedure of the analysis apparatus 100 according to the first embodiment will be described. FIG. 10 is a flowchart illustrating the processing procedure of the analysis apparatus according to the first embodiment. As shown in FIG. 10, the analysis apparatus 100 receives the analysis target data 141 (step S101). The analysis apparatus 100 also receives the number of repeated calculations (step S102) and the number of records to sample, and initializes the count value (step S103). The analysis apparatus 100 adds 1 to the count value (step S104). The initial value of the count value is 0.

分析装置１００は、分析対象データ１４１をサンプリングし、サンプリングデータを生成する（ステップＳ１０５）。各サンプリングデータは、サンプリングデータテーブル１４２に格納される。分析装置１００は、サンプリングデータに対してクラスタ分析処理を実行し、各々のレコードに対してクラスタ番号を割り振る（ステップＳ１０６）。 The analysis apparatus 100 samples the analysis target data 141 and generates sampling data (step S105). Each sampling data is stored in the sampling data table 142. The analysis apparatus 100 performs a cluster analysis process on the sampling data, and assigns a cluster number to each record (step S106).

分析装置１００は、クラスタ番号を割り振ったサンプリングデータと分析対象データ１４１とを比較して、分析対象データ１４１に含まれる各々のレコードに対してクラスタ番号を割り振ることで予測データを生成する（ステップＳ１０７）。各予測データは、予測データテーブル１４３に格納される。 The analysis apparatus 100 compares the sampling data to which the cluster number is assigned and the analysis target data 141, and generates prediction data by assigning a cluster number to each record included in the analysis target data 141 (step S107). ). Each prediction data is stored in the prediction data table 143.
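Step S107 can be illustrated with a small sketch. The comparison rule is not restated in this passage, so the sketch assumes, purely for illustration, that each record of the analysis target data takes the cluster number of its nearest labeled sampling record (a 1-nearest-neighbor rule). The function name and data layout are hypothetical.

```python
import math


def predict_clusters(labeled_sample, records):
    """Assign a cluster number to every record of the analysis target data.

    labeled_sample: list of (values, cluster_number) pairs from the sampling data.
    records: dict mapping identification number -> values.
    Assumption: the nearest labeled sample by Euclidean distance decides the cluster.
    """
    prediction = {}
    for rid, values in records.items():
        nearest = min(labeled_sample, key=lambda item: math.dist(values, item[0]))
        prediction[rid] = nearest[1]
    return prediction
```

The result has one cluster number per identification number, which is the shape a prediction data entry needs for the later evaluation step.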

分析装置１００は、予測データを基にして、クラスタ内距離およびクラスタ間距離を算出し、評価値データテーブル１４４を生成する（ステップＳ１０８）。分析装置１００は、繰り返しの計算回数がカウント値未満であるか否かを判定する（ステップＳ１０９）。分析装置１００は、繰り返しの計算回数がカウント値未満の場合には（ステップＳ１０９，Ｙｅｓ）、ステップＳ１０４に移行する。 The analysis apparatus 100 calculates the intra-cluster distance and the inter-cluster distance based on the prediction data, and generates the evaluation value data table 144 (step S108). The analysis apparatus 100 determines whether or not the number of repeated calculations is less than the count value (step S109). When the number of repeated calculations is less than the count value (step S109, Yes), the analysis apparatus 100 proceeds to step S104.

一方、分析装置１００は、繰り返しの計算回数がカウント値以上である場合には（ステップＳ１０９，Ｎｏ）、パレート解に対応する予測データを選択して、中間データテーブル１４５を作成する（ステップＳ１１０）。 On the other hand, when the number of repeated calculations is equal to or greater than the count value (No at Step S109), the analysis apparatus 100 selects prediction data corresponding to the Pareto solution and creates the intermediate data table 145 (Step S110). .
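The loop formed by steps S103 to S109 can be sketched as a skeleton. The sketch assumes the intent is to repeat sampling, clustering, prediction, and evaluation until the count value reaches the requested number of repeated calculations. The clustering, prediction, and evaluation steps are passed in as callables because their details are outside this passage; all names are hypothetical.

```python
import random


def run_iterations(records, repeat_count, sample_size,
                   cluster_fn, predict_fn, evaluate_fn, rng):
    """Collect one (prediction, evaluation) pair per repetition."""
    results = []
    count = 0                                         # S103: initialize the count value
    while True:
        count += 1                                    # S104
        sample = rng.sample(records, sample_size)     # S105: sampling data
        labeled = cluster_fn(sample)                  # S106: cluster the sample
        prediction = predict_fn(labeled, records)     # S107: prediction data
        results.append((prediction, evaluate_fn(prediction)))  # S108
        if count >= repeat_count:                     # S109 (assumed stop condition)
            break
    return results
```

The collected evaluations are what the Pareto selection in step S110 would then consume.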

分析装置１００は、ランダムにクラスタ番号を割り振った複数の最終データ候補を生成する（ステップＳ１１１）。ステップＳ１１１において、分析装置１００は、類似度が大きくなるようにクラスタ番号を割り振る。例えば、分析装置１００は、ランダムにクラスタ番号を割り振り、類似度を計算する。そして、分析装置１００は、類似度が大きい、クラスタ番号の割り振りを少し変更して、類似度が大きくなるか、試行する処理を利用者が設定した回数繰り返す。 The analysis apparatus 100 generates a plurality of final data candidates to which cluster numbers are randomly assigned (step S111). In step S111, the analysis apparatus 100 assigns cluster numbers so that the similarity increases. For example, the analysis apparatus 100 randomly assigns cluster numbers and calculates the similarity. Then, the analysis apparatus 100 slightly changes a cluster number assignment that has a high similarity, checks whether the similarity increases, and repeats this trial the number of times set by the user.
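The trial-and-improve procedure of step S111 resembles a simple hill climb, which the following sketch illustrates. It assumes that "slightly changing" an assignment means reassigning one randomly chosen record, and that a change is kept only when the similarity strictly increases. The function names and the dict-based representation (identification number to cluster number) are hypothetical.

```python
import random


def similarity(candidate, intermediates):
    """Sum over intermediate data of (match count / total record count)."""
    total = len(candidate)
    return sum(
        sum(1 for rid, c in candidate.items() if inter.get(rid) == c) / total
        for inter in intermediates
    )


def refine(candidate, intermediates, num_clusters, trials, rng):
    """Repeat small random changes and keep those that raise the similarity."""
    best = dict(candidate)                 # work on a copy; the input is not modified
    best_score = similarity(best, intermediates)
    ids = list(best)
    for _ in range(trials):                # trials: the user-set repetition count
        trial = dict(best)
        trial[rng.choice(ids)] = rng.randrange(1, num_clusters + 1)
        score = similarity(trial, intermediates)
        if score > best_score:
            best, best_score = trial, score
    return best, best_score
```

The returned score is never lower than the starting similarity, since non-improving trials are discarded.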

分析装置１００は、中間データと各最終データ候補とを比較して、類似度が最大となる最終データ候補を判定する（ステップＳ１１２）。ステップＳ１１２で判定した類似度が最大となる最終データ候補が、最終データ１４６となる。分析装置１００は、判定結果を出力する（ステップＳ１１３）。 The analysis apparatus 100 compares the intermediate data with each final data candidate, and determines the final data candidate having the maximum similarity (step S112). The final data candidate having the maximum similarity determined in step S112 is the final data 146. The analyzer 100 outputs the determination result (step S113).

次に、本実施例１にかかる分析装置１００の効果について説明する。分析装置１００は分析対象データ１４１から抽出したサンプリングデータをクラスタ分析し、サンプリングデータのクラスタ分析結果を基にして、分析対象データの各データが属するクラスタを予測した複数の予測データを生成する。そして、分析装置１００は、複数の予測データのうち、評価値のよい予測データのクラスタ分類結果を用いて、分析対象データ１４１の最終的なクラスタ分類結果を特定する。これにより、分析装置１００によれば、クラスタ分析に要する時間を削減することができる。 Next, the effects of the analysis apparatus 100 according to the first embodiment will be described. The analysis apparatus 100 performs cluster analysis on the sampling data extracted from the analysis target data 141 and, based on the cluster analysis results of the sampling data, generates a plurality of prediction data, each predicting the cluster to which each data item of the analysis target data belongs. The analysis apparatus 100 then identifies the final cluster classification result of the analysis target data 141 using the cluster classification results of the prediction data that have good evaluation values. The analysis apparatus 100 can thereby reduce the time required for cluster analysis.

また、現実的な計算機で、現実的な時間内に計算できない、大規模なデータに対して、現実的な計算機で、現実的な時間内に、クラスタ分析を実行することができる。 In addition, cluster analysis can be performed on a realistic computer within a realistic time, even for large-scale data that conventional cluster analysis could not process on such a computer within a realistic time.

図１１は、本実施例２にかかる分析装置の構成を示す機能ブロック図である。図１１に示すように、この分析装置２００は、通信部２１０、入力部２２０、出力部２３０、記憶部２４０、制御部２５０を有する。 FIG. 11 is a functional block diagram of the configuration of the analyzer according to the second embodiment. As illustrated in FIG. 11, the analysis device 200 includes a communication unit 210, an input unit 220, an output unit 230, a storage unit 240, and a control unit 250.

通信部２１０、入力部２２０、出力部２３０に関する説明は、図１に示した、通信部１１０、入力部１２０、出力部１３０に関する説明と同様である。 The description regarding the communication unit 210, the input unit 220, and the output unit 230 is the same as the description regarding the communication unit 110, the input unit 120, and the output unit 130 illustrated in FIG.

記憶部２４０は、分析対象データ２４１、サンプリングデータテーブル２４２、予測データテーブル２４３、評価値データテーブル２４４、中間データテーブル２４５、最終データ２４６を有する。記憶部２４０は、例えば、ＲＡＭ、ＲＯＭ、フラッシュメモリなどの半導体メモリ素子や、ＨＤＤなどの記憶装置に対応する。 The storage unit 240 includes analysis target data 241, a sampling data table 242, a prediction data table 243, an evaluation value data table 244, an intermediate data table 245, and final data 246. The storage unit 240 corresponds to, for example, a semiconductor memory element such as a RAM, a ROM, or a flash memory, or a storage device such as an HDD.

分析対象データ２４１は、クラスタ分析の対象となるデータである。分析対象データ２４１のデータ構造は、図２に示した分析対象データ１４１のデータ構造と同様である。 The analysis target data 241 is data to be subjected to cluster analysis. The data structure of the analysis target data 241 is the same as the data structure of the analysis target data 141 shown in FIG.

サンプリングデータテーブル２４２は、複数のサンプリングデータを有するテーブルである。各サンプリングデータは、後述するサンプリング実行部２５１によって生成される。サンプリング実行部２５１が分析対象データ２４１をサンプリングすることで、各サンプリングデータが生成される。サンプリングデータテーブル２４２のデータ構造は、図３に示したサンプリングデータテーブル１４２のデータ構造と同様である。 The sampling data table 242 is a table having a plurality of sampling data. Each sampling data is generated by a sampling execution unit 251 described later. The sampling execution unit 251 samples the analysis target data 241 so that each sampling data is generated. The data structure of the sampling data table 242 is the same as the data structure of the sampling data table 142 shown in FIG.

予測データテーブル２４３は、複数の予測データを有するテーブルである。各予測データは、後述するクラスタ予測部２５３によって生成される。クラスタ予測部２５３が、サンプリングデータを基にして、分析対象データ２４１の各レコードのクラスタ番号を予測することで、予測データを生成する。サンプリングデータ毎に予測データが生成される。予測データテーブル２４３のデータ構造は、図４に示した予測データテーブル１４３のデータ構造と同様である。 The prediction data table 243 is a table having a plurality of prediction data. Each prediction data is generated by a cluster prediction unit 253 described later. The cluster prediction unit 253 generates prediction data by predicting the cluster number of each record of the analysis target data 241 based on the sampling data. Prediction data is generated for each sampling data. The data structure of the prediction data table 243 is the same as the data structure of the prediction data table 143 shown in FIG.

評価値データテーブル２４４は、各予測データの評価値をそれぞれ保持するテーブルである。評価値データテーブル２４４のデータ構造は、図５に示した評価値データテーブル１４４のデータ構造と同様である。 The evaluation value data table 244 is a table that holds the evaluation value of each prediction data. The data structure of the evaluation value data table 244 is the same as the data structure of the evaluation value data table 144 shown in FIG.

中間データテーブル２４５は、複数の中間データを有するテーブルである。各中間データは、評価値の良い予測データに対応して作成される。中間データテーブル２４５のデータ構造は、図６に示した中間データテーブル１４５のデータ構造と同様である。 The intermediate data table 245 is a table having a plurality of intermediate data. Each intermediate data is created corresponding to prediction data having a good evaluation value. The data structure of the intermediate data table 245 is the same as the data structure of the intermediate data table 145 shown in FIG.

最終データ２４６ａ，２４６ｂ，２４６ｃは、分析対象データ２４１の最終的なクラスタ分析結果を示す。各最終データ２４６ａ，２４６ｂ，２４６ｃのデータ構造は、図７に示した最終データ１４６のデータ構造と同様である。 The final data 246a, 246b, and 246c indicate final cluster analysis results of the analysis target data 241. The data structure of each final data 246a, 246b, 246c is the same as the data structure of the final data 146 shown in FIG.

図１１の説明に戻る。制御部２５０は、サンプリング実行部２５１、クラスタ分析部２５２、クラスタ予測部２５３、判定部２５４、最終クラスタ計算部２５５を有する。制御部２５０は、例えば、ＡＳＩＣや、ＦＰＧＡなどの集積装置に対応する。また、制御部２５０は、例えば、ＣＰＵやＭＰＵ等の電子回路に対応する。 Returning to the description of FIG. 11. The control unit 250 includes a sampling execution unit 251, a cluster analysis unit 252, a cluster prediction unit 253, a determination unit 254, and a final cluster calculation unit 255. The control unit 250 corresponds to, for example, an integrated device such as an ASIC or an FPGA. Alternatively, the control unit 250 corresponds to, for example, an electronic circuit such as a CPU or an MPU.

サンプリング実行部２５１は、分析対象データ２４１に対してサンプリングを複数回繰り返し実行することで、複数のサンプリングデータを生成する処理部である。サンプリング実行部２５１は、生成した各サンプリングデータを、サンプリングデータテーブル２４２に格納する。サンプリング実行部２５１の具体的な処理は、図１に示したサンプリング実行部１５１と同様である。 The sampling execution unit 251 is a processing unit that generates a plurality of sampling data by repeatedly performing sampling on the analysis target data 241 a plurality of times. The sampling execution unit 251 stores each generated sampling data in the sampling data table 242. Specific processing of the sampling execution unit 251 is the same as that of the sampling execution unit 151 shown in FIG.

クラスタ分析部２５２は、サンプリングデータテーブル２４２に格納された各サンプリングデータを取得し、各サンプリングデータをクラスタ分析する処理部である。クラスタ分析部２５２は、クラスタ分析結果に応じて、サンプリングデータの各レコードについてクラスタ番号を割り当てる。クラスタ分析部２５２の具体的な処理は、図１に示したクラスタ分析部１５２と同様である。 The cluster analysis unit 252 is a processing unit that acquires each sampling data stored in the sampling data table 242 and performs cluster analysis on each sampling data. The cluster analysis unit 252 assigns a cluster number to each record of sampling data according to the cluster analysis result. The specific processing of the cluster analysis unit 252 is the same as that of the cluster analysis unit 152 shown in FIG.

クラスタ予測部２５３は、サンプリングデータテーブル２４２のサンプリングデータのクラスタ分析結果に基づいて、分析対象データ２４１の各レコードのクラスタ番号を予測し、予測データテーブル２４３を生成する処理部である。クラスタ予測部２５３の具体的な処理は、図１に示したクラスタ予測部１５３と同様である。 The cluster prediction unit 253 is a processing unit that predicts the cluster number of each record of the analysis target data 241 based on the cluster analysis result of the sampling data in the sampling data table 242, and generates the prediction data table 243. The specific processing of the cluster prediction unit 253 is the same as that of the cluster prediction unit 153 illustrated in FIG.

判定部２５４は、予測データテーブル２４３を基にして、評価値データテーブル２４４を生成する処理部である。判定部２５４は、予測データテーブル２４３に含まれる予測データ毎に評価値を算出する。例えば、判定部２５４は、予測データ毎にクラスタ間距離およびクラスタ内距離を算出し、クラスタ間距離およびクラスタ内距離を予測データの評価値とする。 The determination unit 254 is a processing unit that generates the evaluation value data table 244 based on the prediction data table 243. The determination unit 254 calculates an evaluation value for each prediction data included in the prediction data table 243. For example, the determination unit 254 calculates the inter-cluster distance and the intra-cluster distance for each prediction data, and uses the inter-cluster distance and the intra-cluster distance as evaluation values of the prediction data.

最終クラスタ計算部２５５は、分析対象データ２４１の最終的なクラスタ分析結果となる最終データ２４６を生成する処理部である。最終クラスタ計算部２５５は、評価値データテーブル２４４から中間データテーブル２４５を生成する処理を行った後に、中間データテーブル２４５を基にして、最終データ２４６ａ，２４６ｂ，２４６ｃを生成する。 The final cluster calculation unit 255 is a processing unit that generates final data 246 that is a final cluster analysis result of the analysis target data 241. The final cluster calculation unit 255 performs processing for generating the intermediate data table 245 from the evaluation value data table 244, and then generates final data 246a, 246b, and 246c based on the intermediate data table 245.

最終クラスタ計算部２５５が、評価値データテーブル２４４から中間データテーブル２４５を生成する処理について説明する。最終クラスタ計算部２５５は、評価値データテーブル２４４の予測データ毎の評価値を比較し、パレート解となる予測データを特定し、特定したパレート解となる予測データを、中間データテーブル２４５に設定する。 The process by which the final cluster calculation unit 255 generates the intermediate data table 245 from the evaluation value data table 244 will be described. The final cluster calculation unit 255 compares the evaluation values of the prediction data in the evaluation value data table 244, identifies the prediction data that are Pareto solutions, and sets the identified prediction data in the intermediate data table 245.

図１２は、各予測データのクラスタ内距離とクラスタ間距離との関係を示す図（２）である。図１２において、縦軸はクラスタ内距離を示し、横軸はクラスタ間距離を示す。一般的に、クラスタ間距離が大きいほど、また、クラスタ内距離が小さいほど、中間データは、良い予測データであると言える。このため、図１２に示す例では、最終クラスタ計算部２５５は、中間データ２４３ａ，２４３ｃ，２４３ｆ，２４３ｇ，２４３ｚを、パレート解として特定する。 FIG. 12 is a diagram (2) illustrating the relationship between the intra-cluster distance and the inter-cluster distance of each prediction data. In FIG. 12, the vertical axis represents the intra-cluster distance, and the horizontal axis represents the inter-cluster distance. In general, it can be said that the intermediate data is better prediction data as the inter-cluster distance is larger and the intra-cluster distance is smaller. Therefore, in the example illustrated in FIG. 12, the final cluster calculation unit 255 identifies the intermediate data 243a, 243c, 243f, 243g, and 243z as Pareto solutions.

続いて、最終クラスタ計算部２５５が、中間データテーブル２４５から最終データ２４６を生成する処理について説明する。最終クラスタ計算部２５５は、中間データテーブル２４５の各予測データの評価値を比較して、類似する予測データ同士を同一グループに分類する処理を行う。例えば、最終クラスタ計算部２５５は、各予測データのクラスタ間距離の差分が閾値未満となり、かつ、各予測データのクラスタ内距離の差分が閾値未満となる予測データを、同一のグループに分類する。 Next, a process in which the final cluster calculation unit 255 generates the final data 246 from the intermediate data table 245 will be described. The final cluster calculation unit 255 compares the evaluation values of the respective prediction data in the intermediate data table 245 and classifies similar prediction data into the same group. For example, the final cluster calculation unit 255 classifies the prediction data in which the difference between the inter-cluster distances of each prediction data is less than the threshold and the difference in the intra-cluster distance of each prediction data is less than the threshold into the same group.
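The threshold-based grouping can be sketched as follows. The text does not say which members of a group a new prediction data is compared against, so the sketch assumes a greedy rule that compares against the first member of each existing group. The function name, thresholds, and values in the usage note are hypothetical.

```python
def group_predictions(evaluations, inter_threshold, intra_threshold):
    """Group prediction data whose evaluation values are close to each other.

    evaluations: dict mapping name -> (intra_cluster_distance, inter_cluster_distance).
    Assumption: a prediction data joins the first group whose first member differs
    from it by less than both thresholds; otherwise it starts a new group.
    """
    groups = []
    for name, (intra, inter) in evaluations.items():
        for group in groups:
            ref_intra, ref_inter = evaluations[group[0]]
            if (abs(inter - ref_inter) < inter_threshold
                    and abs(intra - ref_intra) < intra_threshold):
                group.append(name)
                break
        else:
            groups.append([name])  # no existing group was close enough
    return groups
```

With made-up values such as `{"243a": (1.0, 2.0), "243c": (1.2, 2.3), "243f": (3.0, 5.0), "243g": (3.1, 5.2)}` and both thresholds set to 0.5, the sketch yields two groups of two, analogous to groups 50a and 50b.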

図１２に示す例では、最終クラスタ計算部２５５は、予測データ２４３ａ，２４３ｃをグループ５０ａに分類し、予測データ２４３ｆ，２４３ｇをグループ５０ｂに分類し、予測データ２４３ｉ，２４３ｚをグループ５０ｃに分類する。最終クラスタ計算部２５５は、分類したグループ毎に、最終データ２４６を生成する。 In the example illustrated in FIG. 12, the final cluster calculation unit 255 classifies the prediction data 243a and 243c into the group 50a, classifies the prediction data 243f and 243g into the group 50b, and classifies the prediction data 243i and 243z into the group 50c. The final cluster calculation unit 255 generates final data 246 for each classified group.

例えば、最終クラスタ計算部２５５は、グループ５０ａに含まれる予測データ２４３ａ，２４３ｃを基にして、最終データ２４６ａを生成する。最終クラスタ計算部２５５は、グループ５０ｂに含まれる予測データ２４３ｆ，２４３ｇを基にして、最終データ２４６ｂを生成する。最終クラスタ計算部２５５は、グループ５０ｃに含まれる予測データ２４３ｉ，２４３ｚを基にして、最終データ２４６ｃを生成する。 For example, the final cluster calculation unit 255 generates final data 246a based on the prediction data 243a and 243c included in the group 50a. The final cluster calculation unit 255 generates final data 246b based on the prediction data 243f and 243g included in the group 50b. The final cluster calculation unit 255 generates final data 246c based on the prediction data 243i and 243z included in the group 50c.

最終クラスタ計算部２５５が、予測データを基にして、最終データを特定する処理は、図１の最終クラスタ計算部１５５が、中間データテーブル１４５の予測データを基にして、最終データを特定する処理と同様である。図１２に示す例では、グループ５０ａ，５０ｂ，５０ｃについて、最終データが特定され、最終データ２４６ａ，２４６ｂ，２４６ｃが生成される。 The process by which the final cluster calculation unit 255 identifies the final data based on the prediction data is the same as the process by which the final cluster calculation unit 155 in FIG. 1 identifies the final data based on the prediction data in the intermediate data table 145. In the example shown in FIG. 12, final data is identified for each of the groups 50a, 50b, and 50c, and the final data 246a, 246b, and 246c are generated.

次に、本実施例２に係る分析装置２００の処理手順について説明する。図１３および図１４は、実施例２にかかる分析装置の処理手順を示すフローチャートである。図１３に示すように、分析装置２００は、分析対象データ２４１を受け付ける（ステップＳ２０１）。また、分析装置２００は、繰り返し計算回数を受け付け（ステップＳ２０２）、サンプリング件数を受け付ける。また、カウント値を初期化する（ステップＳ２０３）。分析装置２００は、カウント値に１を加算する（ステップＳ２０４）。カウント値の初期値を０とする。 Next, the processing procedure of the analysis apparatus 200 according to the second embodiment will be described. FIGS. 13 and 14 are flowcharts illustrating the processing procedure of the analysis apparatus according to the second embodiment. As illustrated in FIG. 13, the analysis apparatus 200 receives the analysis target data 241 (step S201). The analysis apparatus 200 also receives the number of repeated calculations (step S202) and the number of records to sample, and initializes the count value (step S203). The analysis apparatus 200 adds 1 to the count value (step S204). The initial value of the count value is 0.

分析装置２００は、分析対象データ２４１をサンプリングし、サンプリングデータを生成する（ステップＳ２０５）。各サンプリングデータは、サンプリングデータテーブル２４２に格納される。分析装置２００は、サンプリングデータに対してクラスタ分析処理を実行し、各々のレコードに対してクラスタ番号を割り振る（ステップＳ２０６）。 The analysis device 200 samples the analysis target data 241 and generates sampling data (step S205). Each sampling data is stored in the sampling data table 242. The analysis apparatus 200 performs cluster analysis processing on the sampling data and assigns a cluster number to each record (step S206).

分析装置２００は、クラスタ番号を割り振ったサンプリングデータと分析対象データ２４１とを比較して、分析対象データ２４１に含まれる各々のレコードに対してクラスタ番号を割り振ることで予測データを生成する（ステップＳ２０７）。各予測データは、予測データテーブル２４３に格納される。 The analysis apparatus 200 compares the sampling data to which the cluster number is assigned and the analysis target data 241 and generates prediction data by assigning the cluster number to each record included in the analysis target data 241 (step S207). ). Each prediction data is stored in the prediction data table 243.

分析装置２００は、予測データを基にして、クラスタ内距離およびクラスタ間距離を算出し、評価値データテーブル２４４を生成する（ステップＳ２０８）。分析装置２００は、繰り返しの計算回数がカウント値未満であるか否かを判定する（ステップＳ２０９）。分析装置２００は、繰り返しの計算回数がカウント値未満の場合には（ステップＳ２０９，Ｙｅｓ）、ステップＳ２０４に移行する。 The analysis apparatus 200 calculates the intra-cluster distance and the inter-cluster distance based on the prediction data, and generates the evaluation value data table 244 (step S208). The analysis apparatus 200 determines whether or not the number of repeated calculations is less than the count value (step S209). When the number of repeated calculations is less than the count value (step S209, Yes), the analysis apparatus 200 proceeds to step S204.

一方、分析装置２００は、繰り返しの計算回数がカウント値以上である場合には（ステップＳ２０９，Ｎｏ）、図１４のステップＳ２１０に移行する。 On the other hand, when the number of repeated calculations is equal to or greater than the count value (No at Step S209), the analyzer 200 proceeds to Step S210 in FIG.

図１４の説明に移行する。分析装置２００は、パレート解に対応する予測データを選択して、中間データテーブル２４５を作成する（ステップＳ２１０）。分析装置２００は、パレート解に対応する各予測データの類似度を算出する（ステップＳ２１１）。分析装置２００は、類似する各予測データを、グループに分類する（ステップＳ２１２）。 The description shifts to the description of FIG. The analysis apparatus 200 selects prediction data corresponding to the Pareto solution and creates the intermediate data table 245 (step S210). The analysis device 200 calculates the similarity of each prediction data corresponding to the Pareto solution (step S211). The analysis apparatus 200 classifies each similar prediction data into a group (step S212).

分析装置２００は、未選択のグループを選択し（ステップＳ２１３）、ランダムにクラスタ番号を割り振った複数の最終データ候補を生成する（ステップＳ２１４）。ステップＳ２１４において、分析装置２００は、類似度が大きくなるようにクラスタ番号を割り振る。例えば、分析装置２００は、ランダムにクラスタ番号を割り振り、類似度を計算する。そして、分析装置２００は、類似度が大きい、クラスタ番号の割り振りを少し変更して、類似度が大きくなるか、試行する処理を利用者が設定した回数繰り返す。 The analysis apparatus 200 selects an unselected group (step S213) and generates a plurality of final data candidates to which cluster numbers are randomly assigned (step S214). In step S214, the analysis apparatus 200 assigns cluster numbers so that the similarity increases. For example, the analysis apparatus 200 randomly assigns cluster numbers and calculates the similarity. Then, the analysis apparatus 200 slightly changes a cluster number assignment that has a high similarity, checks whether the similarity increases, and repeats this trial the number of times set by the user.

分析装置２００は、グループに含まれる予測データと各最終データ候補とを比較して、類似度が最大となる最終データ候補を判定する（ステップＳ２１５）。 The analysis apparatus 200 compares the prediction data included in the group with each final data candidate, and determines the final data candidate having the maximum similarity (step S215).

分析装置２００は、未選択のグループが存在するか否かを判定する（ステップＳ２１６）。分析装置２００は、未選択のグループが存在する場合には（ステップＳ２１６，Ｙｅｓ）、ステップＳ２１３に移行する。一方、分析装置２００は、未選択のグループが存在しない場合には（ステップＳ２１６，Ｎｏ）、各グループの判定結果を出力する（ステップＳ２１７）。 The analysis apparatus 200 determines whether there is an unselected group (step S216). If there is an unselected group (step S216, Yes), the analysis apparatus 200 proceeds to step S213. On the other hand, when there is no unselected group (step S216, No), the analysis apparatus 200 outputs the determination result of each group (step S217).

次に、本実施例２に係る分析装置２００の効果について説明する。分析装置２００は、複数の予測データのうち、評価値のよい予測データを類似する予測データ同士でグルーピングし、グループ毎に、最終データ２４６を生成する。このため、分析装置２００によれば、クラスタ分析に要する時間を削減することができる。また、類似する予測データに応じた最終データの候補を複数得ることが出来る。 Next, the effects of the analysis apparatus 200 according to the second embodiment will be described. Of the plurality of prediction data, the analysis apparatus 200 groups the prediction data that have good evaluation values so that similar prediction data fall into the same group, and generates final data 246 for each group. The analysis apparatus 200 can therefore reduce the time required for cluster analysis. In addition, a plurality of final data candidates, one for each group of similar prediction data, can be obtained.

次に、上記実施例に示した分析装置１００，２００と同様の機能を実現する分析プログラムを実行するコンピュータの一例について説明する。図１５は、分析プログラムを実行するコンピュータの一例を示す図である。 Next, an example of a computer that executes an analysis program that realizes the same function as the analysis apparatuses 100 and 200 shown in the above embodiment will be described. FIG. 15 is a diagram illustrating an example of a computer that executes an analysis program.

図１５に示すように、コンピュータ３００は、各種演算処理を実行するＣＰＵ３０１と、ユーザからのデータの入力を受け付ける入力装置３０２と、ディスプレイ３０３を有する。また、コンピュータ３００は、記憶媒体からプログラム等を読取る読み取り装置３０４と、ネットワークを介して他のコンピュータとの間でデータの授受を行うインターフェース装置３０５とを有する。また、コンピュータ３００は、各種情報を一時記憶するＲＡＭ３０６と、ハードディスク装置３０７を有する。そして、各装置３０１〜３０７は、バス３０８に接続される。 As illustrated in FIG. 15, the computer 300 includes a CPU 301 that executes various arithmetic processes, an input device 302 that receives input of data from a user, and a display 303. The computer 300 also includes a reading device 304 that reads a program or the like from a storage medium, and an interface device 305 that exchanges data with other computers via a network. The computer 300 also includes a RAM 306 that temporarily stores various types of information and a hard disk device 307. The devices 301 to 307 are connected to the bus 308.

ハードディスク装置３０７は、サンプリングプログラム３０７ａ、クラスタ分析プログラム３０７ｂ、クラスタ予測プログラム３０７ｃ、判定プログラム３０７ｄ、最終クラスタ計算プログラム３０７ｅを有する。ＣＰＵ３０１は、各プログラム３０７ａ〜３０７ｅを読み出してＲＡＭ３０６に展開する。 The hard disk device 307 stores a sampling program 307a, a cluster analysis program 307b, a cluster prediction program 307c, a determination program 307d, and a final cluster calculation program 307e. The CPU 301 reads the programs 307a to 307e and loads them into the RAM 306.

サンプリングプログラム３０７ａは、サンプリングプロセス３０６ａとして機能する。クラスタ分析プログラム３０７ｂは、クラスタ分析プロセス３０６ｂとして機能する。クラスタ予測プログラム３０７ｃは、クラスタ予測プロセス３０６ｃとして機能する。判定プログラム３０７ｄは、判定プロセス３０６ｄとして機能する。最終クラスタ計算プログラム３０７ｅは、最終クラスタ計算プロセス３０６ｅとして機能する。 The sampling program 307a functions as a sampling process 306a. The cluster analysis program 307b functions as a cluster analysis process 306b. The cluster prediction program 307c functions as a cluster prediction process 306c. The determination program 307d functions as a determination process 306d. The final cluster calculation program 307e functions as a final cluster calculation process 306e.

例えば、サンプリングプロセス３０６ａは、サンプリング実行部１５１，２５１に対応する。クラスタ分析プロセス３０６ｂは、クラスタ分析部１５２，２５２に対応する。クラスタ予測プロセス３０６ｃは、クラスタ予測部１５３，２５３に対応する。判定プロセス３０６ｄは、判定部１５４，２５４に対応する。最終クラスタ計算プロセス３０６ｅは、最終クラスタ計算部１５５，２５５に対応する。 For example, the sampling process 306a corresponds to the sampling execution units 151 and 251. The cluster analysis process 306b corresponds to the cluster analysis units 152 and 252. The cluster prediction process 306c corresponds to the cluster prediction units 153 and 253. The determination process 306d corresponds to the determination units 154 and 254. The final cluster calculation process 306e corresponds to the final cluster calculation units 155 and 255.

Note that the programs 307a to 307e need not be stored in the hard disk device 307 from the beginning. For example, each program may be stored on a "portable physical medium" that is inserted into the computer 300, such as a flexible disk (FD), a CD-ROM, a DVD, a magneto-optical disk, or an IC card. The computer 300 may then read the programs 307a to 307e from the medium and execute them.

The following supplementary notes are further disclosed with respect to the embodiments including the above examples.

(Supplementary Note 1) An analysis apparatus comprising:
a sampling execution unit that repeatedly executes a process of sampling input data to extract a part of the input data, thereby generating a plurality of sampling data;
a cluster analysis unit that executes cluster analysis on the plurality of sampling data and, for each sampling data, classifies the data included in that sampling data into different clusters;
a cluster prediction unit that generates, based on the input data and on the plurality of classification results obtained by the cluster analysis unit for the plurality of sampling data, a plurality of prediction data each indicating the predicted cluster to which each data item included in the input data belongs;
a determination unit that calculates an evaluation value for each prediction data based on the inter-cluster distance and the intra-cluster distance of the prediction data, and determines the prediction data corresponding to evaluation values that are Pareto solutions; and
a final cluster calculation unit that classifies the data included in the input data into clusters based on the prediction data corresponding to the evaluation values that are Pareto solutions.
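
The first three units of Supplementary Note 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiments: a small k-means routine stands in for the unspecified cluster analysis, and nearest-centroid assignment stands in for the unspecified cluster prediction. All function names and parameters (`kmeans`, `predict`, `make_predictions`, `n_samples`, `sample_size`, `k`) are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def kmeans(points, k, rng, iters=20):
    """Tiny k-means, an assumed stand-in for the cluster analysis."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            groups[nearest].append(p)
        # Keep the previous centroid when a cluster ends up empty.
        centroids = [mean(g) if g else centroids[c] for c, g in enumerate(groups)]
    return centroids

def predict(data, centroids):
    """Assumed prediction rule: label every input item by its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in data]

def make_predictions(data, n_samples, sample_size, k, seed=0):
    """Generate one prediction data (a label per input item) per sampling data."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_samples):
        sample = rng.sample(data, sample_size)         # sampling execution unit
        centroids = kmeans(sample, k, rng)             # cluster analysis unit
        predictions.append(predict(data, centroids))   # cluster prediction unit
    return predictions

data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
predictions = make_predictions(data, n_samples=4, sample_size=4, k=2)
```

Because only the small sampling data are clustered, the cost of the cluster analysis depends on `sample_size` rather than on the size of the input data; the prediction step is a single pass over the input data per sampling data.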

(Supplementary Note 2) The analysis apparatus according to Supplementary Note 1, wherein the final cluster calculation unit groups similar prediction data corresponding to evaluation values that are Pareto solutions, and executes, for each group, a process of classifying the data included in the input data into different clusters based on the prediction data included in the same group.

(Supplementary Note 3) The analysis apparatus according to Supplementary Note 1 or 2, wherein the final cluster calculation unit generates a plurality of final cluster data in which clusters are randomly assigned to the input data, and selects specific final cluster data based on the similarity between each final cluster data and the prediction data.
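
The grouping of Supplementary Note 2 and the random-candidate selection of Supplementary Note 3 can be sketched together as follows. The notes do not fix a similarity measure or a grouping rule, so this sketch assumes the Rand index (the fraction of data pairs on which two labelings agree) as the similarity and a simple greedy threshold rule for grouping. The names `rand_index`, `group_predictions`, `select_final`, `threshold`, and `n_candidates` are hypothetical.

```python
import random
from itertools import combinations

def rand_index(a, b):
    """Fraction of item pairs that both labelings treat alike
    (both in the same cluster, or both in different clusters).
    Requires at least two items."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

def group_predictions(predictions, threshold=0.8):
    """Greedy grouping (assumed rule): a prediction data joins the first
    group whose first member is at least `threshold` similar to it."""
    groups = []
    for pred in predictions:
        for g in groups:
            if rand_index(pred, g[0]) >= threshold:
                g.append(pred)
                break
        else:
            groups.append([pred])
    return groups

def select_final(group, n_items, k, n_candidates=200, seed=0):
    """Generate random cluster assignments and keep the one with the
    highest total similarity to the prediction data in the group."""
    rng = random.Random(seed)
    candidates = [[rng.randrange(k) for _ in range(n_items)]
                  for _ in range(n_candidates)]
    return max(candidates, key=lambda c: sum(rand_index(c, p) for p in group))

# Two prediction data that describe the same partition with swapped labels,
# and one that describes a different partition.
predictions = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 1, 0, 1]]
groups = group_predictions(predictions)
final = select_final(groups[0], n_items=4, k=2)
```

A pair-based similarity is used here because cluster labels are arbitrary: the first two prediction data above differ in every label yet describe the same clustering. Purely random candidate generation is only practical for very small inputs; a real search over final cluster data would need a more directed strategy.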

(Supplementary Note 4) An analysis method executed by a computer, the method comprising:
repeatedly executing a process of sampling input data to extract a part of the input data, thereby generating a plurality of sampling data;
executing cluster analysis on the plurality of sampling data and, for each sampling data, classifying the data included in that sampling data into different clusters;
generating, based on the input data and on the plurality of classification results obtained by the cluster analysis for the plurality of sampling data, a plurality of prediction data each indicating the predicted cluster to which each data item included in the input data belongs;
calculating an evaluation value for each prediction data based on the inter-cluster distance and the intra-cluster distance of the prediction data, and determining the prediction data corresponding to evaluation values that are Pareto solutions; and
classifying the data included in the input data into clusters based on the prediction data corresponding to the evaluation values that are Pareto solutions.
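
The evaluation and Pareto determination step can be sketched as follows. The exact definitions of the two distances are assumptions made for illustration: the intra-cluster distance is taken as the mean distance from each data item to the centroid of its cluster (smaller is better), and the inter-cluster distance as the mean distance between cluster centroids (larger is better). The names `evaluate`, `dominates`, and `pareto_indices` are hypothetical.

```python
from math import dist

def evaluate(data, labels):
    """Return the evaluation value (intra, inter) of one prediction data."""
    clusters = {}
    for p, c in zip(data, labels):
        clusters.setdefault(c, []).append(p)
    centroids = {c: tuple(sum(x) / len(ps) for x in zip(*ps))
                 for c, ps in clusters.items()}
    # Mean distance of every item to its own cluster centroid.
    intra = sum(dist(p, centroids[c]) for p, c in zip(data, labels)) / len(data)
    # Mean distance over all pairs of centroids (0.0 if only one cluster).
    cs = list(centroids.values())
    pairs = [(a, b) for i, a in enumerate(cs) for b in cs[i + 1:]]
    inter = sum(dist(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
    return intra, inter

def dominates(a, b):
    """a dominates b if it is no worse on both objectives and strictly
    better on at least one (lower intra, higher inter)."""
    return a[0] <= b[0] and a[1] >= b[1] and (a[0] < b[0] or a[1] > b[1])

def pareto_indices(evals):
    """Indices of evaluation values not dominated by any other one."""
    return [i for i, e in enumerate(evals)
            if not any(dominates(o, e) for j, o in enumerate(evals) if j != i)]

data = [(0, 0), (0, 2), (10, 0), (10, 2)]
candidates = [[0, 0, 1, 1], [0, 1, 0, 1]]   # two hypothetical prediction data
evals = [evaluate(data, labels) for labels in candidates]
pareto = pareto_indices(evals)
```

In this small example the first prediction data has a smaller intra-cluster distance and a larger inter-cluster distance than the second, so only the first is a Pareto solution. When neither objective clearly wins, several prediction data remain on the Pareto front and all of them are passed on to the final classification.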

(Supplementary Note 5) The analysis method according to Supplementary Note 4, wherein the process of classifying the data included in the input data into clusters groups similar prediction data corresponding to evaluation values that are Pareto solutions, and executes, for each group, a process of classifying the data included in the input data into different clusters based on the prediction data included in the same group.

(Supplementary Note 6) The analysis method according to Supplementary Note 4 or 5, wherein the process of classifying the data included in the input data into clusters generates a plurality of final cluster data in which clusters are randomly assigned to the input data, and selects specific final cluster data based on the similarity between each final cluster data and the prediction data.

(Supplementary Note 7) An analysis program that causes a computer to execute a process comprising:
repeatedly executing a process of sampling input data to extract a part of the input data, thereby generating a plurality of sampling data;
executing cluster analysis on the plurality of sampling data and, for each sampling data, classifying the data included in that sampling data into different clusters;
generating, based on the input data and on the plurality of classification results obtained by the cluster analysis for the plurality of sampling data, a plurality of prediction data each indicating the predicted cluster to which each data item included in the input data belongs;
calculating an evaluation value for each prediction data based on the inter-cluster distance and the intra-cluster distance of the prediction data, and determining the prediction data corresponding to evaluation values that are Pareto solutions; and
classifying the data included in the input data into clusters based on the prediction data corresponding to the evaluation values that are Pareto solutions.

(Supplementary Note 8) The analysis program according to Supplementary Note 7, wherein the process of classifying the data included in the input data into clusters groups similar prediction data corresponding to evaluation values that are Pareto solutions, and executes, for each group, a process of classifying the data included in the input data into different clusters based on the prediction data included in the same group.

(Supplementary Note 9) The analysis program according to Supplementary Note 7 or 8, wherein the process of classifying the data included in the input data into clusters generates a plurality of final cluster data in which clusters are randomly assigned to the input data, and selects specific final cluster data based on the similarity between each final cluster data and the prediction data.

100, 200 Analysis apparatus
151, 251 Sampling execution unit
152, 252 Cluster analysis unit
153, 253 Cluster prediction unit
154, 254 Determination unit
155, 255 Final cluster calculation unit

Claims

An analysis apparatus comprising:
a sampling execution unit that repeatedly executes a process of sampling input data to extract a part of the input data, thereby generating a plurality of sampling data;
a cluster analysis unit that executes cluster analysis on the plurality of sampling data and, for each sampling data, classifies the data included in that sampling data into different clusters;
a cluster prediction unit that generates, based on the input data and on the plurality of classification results obtained by the cluster analysis unit for the plurality of sampling data, a plurality of prediction data each indicating the predicted cluster to which each data item included in the input data belongs;
a determination unit that calculates an evaluation value for each prediction data based on the inter-cluster distance and the intra-cluster distance of the prediction data, and determines the prediction data corresponding to evaluation values that are Pareto solutions; and
a final cluster calculation unit that classifies the data included in the input data into clusters based on the prediction data corresponding to the evaluation values that are Pareto solutions.

The analysis apparatus according to claim 1, wherein the final cluster calculation unit groups similar prediction data corresponding to evaluation values that are Pareto solutions, and executes, for each group, a process of classifying the data included in the input data into different clusters based on the prediction data included in the same group.

The analysis apparatus according to claim 1, wherein the final cluster calculation unit generates a plurality of final cluster data in which clusters are randomly assigned to the input data, and selects specific final cluster data based on the similarity between each final cluster data and the prediction data.

An analysis method executed by a computer, the method comprising:
repeatedly executing a process of sampling input data to extract a part of the input data, thereby generating a plurality of sampling data;
executing cluster analysis on the plurality of sampling data and, for each sampling data, classifying the data included in that sampling data into different clusters;
generating, based on the input data and on the plurality of classification results obtained by the cluster analysis for the plurality of sampling data, a plurality of prediction data each indicating the predicted cluster to which each data item included in the input data belongs;
calculating an evaluation value for each prediction data based on the inter-cluster distance and the intra-cluster distance of the prediction data, and determining the prediction data corresponding to evaluation values that are Pareto solutions; and
classifying the data included in the input data into clusters based on the prediction data corresponding to the evaluation values that are Pareto solutions.

An analysis program that causes a computer to execute a process comprising:
repeatedly executing a process of sampling input data to extract a part of the input data, thereby generating a plurality of sampling data;
executing cluster analysis on the plurality of sampling data and, for each sampling data, classifying the data included in that sampling data into different clusters;
generating, based on the input data and on the plurality of classification results obtained by the cluster analysis for the plurality of sampling data, a plurality of prediction data each indicating the predicted cluster to which each data item included in the input data belongs;
calculating an evaluation value for each prediction data based on the inter-cluster distance and the intra-cluster distance of the prediction data, and determining the prediction data corresponding to evaluation values that are Pareto solutions; and
classifying the data included in the input data into clusters based on the prediction data corresponding to the evaluation values that are Pareto solutions.