JP2001134577A

JP2001134577A - Device and method for data analysis and recording medium stored with computer program thereof

Info

Publication number: JP2001134577A
Application number: JP31122599A
Authority: JP
Inventors: Kazuhiro Matsumoto; 和宏松本
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1999-11-01
Filing date: 1999-11-01
Publication date: 2001-05-18

Abstract

PROBLEM TO BE SOLVED: To enable highly reliable analysis by reflecting the user's knowledge in the analysis, in a data analysis device that classifies data items having various attributes into clusters of similar data according to a cluster classification procedure and predicts the values of unknown attributes of the data from the cluster classification results. SOLUTION: The device has data output means that outputs the cluster classification results and the cluster classification procedure, correction input means that modifies the cluster classification procedure, cluster creating means that performs cluster classification according to the modified procedure, and prediction means that predicts the values of unknown attributes of each data item from the cluster classification results and the attribute values included in the cluster classification procedure.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to the field of analyzing accumulated data, such as the data in a data warehouse, with data mining tools and the like. In this field, it is required not only to accumulate data but also to extract useful information from the various attributes of the data and put it to use in business.

[0002] Specifically, there are data analysis devices that take a plurality of data items each having a plurality of attributes, for example POS (Point of Sales) sales data with attributes such as the purchaser's age and gender and the product's type, size, and color, classify them into clusters of similar items, and predict, for example, the next purchasing tendency from the tendencies of the attributes contained in each cluster.

[0003]

[Prior Art] When a conventional data analysis device classifies a plurality of data items having a plurality of attributes into clusters according to attribute values, the cluster classification procedure is either registered in advance or generated by applying simple statistical processing, such as frequency counts, averages, regression analysis, or principal component analysis, to the attributes of previously accumulated data. Cluster classification is then performed with that procedure.

[0004] In the analysis of new input data, cluster classification is performed according to this cluster classification procedure, and the classification result is output in a form such as a data-cluster correspondence table.

[0005] When an unknown attribute of the input data is to be predicted, a prediction model is created from the attributes of previously accumulated data by a method such as AI or a neural network, the value of the unknown attribute is predicted, and the predicted value is output.

[0006] FIG. 11 shows a configuration example of a conventional data analysis device. The following describes, as an example, the processing performed when the data in Table 1 is input to the conventional data analysis device of FIG. 11 as product sales record data.

[0007]

[Table 1]

When the cluster creating means classifies the input data into clusters, the cluster classification procedure holding unit is assumed to hold the rules of FIG. 12(1).

[0008] The input data is classified by size, color, and gender, in that order, into one of 18 possible clusters. As a result, the data falls into the five clusters shown in FIG. 12(2).

[0009] Next, an example is described in which the data of Table 1 with the gender attribute missing is input to the prediction means and the gender is predicted. When the cluster classification procedure holding unit holds the rules of FIG. 13(1), given as an example of the result of statistically processing Table 1, the prediction result of FIG. 13(2) is output.

[0010]

[Problems to Be Solved by the Invention] However, even when the user has a priori knowledge about the input data, a conventional data analysis device cannot reflect that knowledge in the analysis.

[0011] For example, suppose the user has the following a priori knowledge of the product:
1. Products of size "large" are for men.
2. Products of size "small" are for women.

[0012] It follows that, if the data is classified by size first, classification by gender is needed only for size "medium". Even so, a conventional device cannot reflect this knowledge in classification or prediction. As a result, unnecessary classification is performed, the number of clusters increases needlessly, and the analysis does not inspire sufficient confidence.

[0013]

[Means for Solving the Problems] The present invention is characterized by presenting the progress of the data analysis to the user and by having correction input means through which the user's knowledge is reflected in that analysis.

[0014] Specifically, the invention has data output means that outputs the cluster classification result and the cluster classification procedure; correction input means that modifies the cluster classification procedure; cluster creating means that performs cluster classification according to the modified cluster classification procedure; and prediction means that predicts the value of an unknown attribute of each data item from the cluster classification result and the values of the attributes included in the cluster classification procedure. It is characterized by performing the analysis as follows.

[0015]
1. A plurality of input data items, each having a plurality of attributes, are classified into clusters according to the cluster classification procedure, and the cluster classification procedure and a data-cluster correspondence table are output.
2. For a plurality of input data items, each having a plurality of attributes, the value of the unknown attribute of each data item is predicted from the values of the attributes included in the cluster classification procedure, and the prediction result is output together with the cluster classification procedure on which the prediction was based.
3. A modification by the user of the cluster classification procedure, or of the cluster classification procedure on which the prediction was based, is accepted, and cluster classification and prediction of the unknown attribute value are performed according to that modification.

[0016] Cluster classification and prediction of unknown attribute values can therefore reflect the user's knowledge, which enables analysis that is more accurate and more trustworthy.

[0017] The invention is described below with reference to the drawings. FIG. 1 shows a configuration example of the present invention. The cluster creating means 10 has an attribute analysis unit 11 that analyzes the attributes of the input data, a cluster classification holding unit 12 that holds the analysis results, and a cluster classification procedure holding unit 13. The attribute analysis unit 11 may use a conventional data processing technique, as long as it analyzes the attributes of the input data and creates clusters.

[0018] The attribute analysis unit 11 according to claim 2 is characterized by analyzing the attributes in the following procedure.
1. For all the input data, provisional cluster classification is performed on the basis of the values of all the attributes of each input data item. Among the provisional cluster classification results, the attribute having the largest cluster, that is, the attribute with the highest classification efficiency, is taken as the first attribute, and the cluster classification by the first attribute is taken as the first classification.

[0019] 2. For each cluster of the first classification, provisional cluster classification is performed on the basis of the values of all attributes other than the first attribute. Among the provisional cluster classification results, the attribute having the largest cluster is taken as the second attribute, and the cluster classification by the second attribute is taken as the second classification.

[0020] 3. The above processing is repeated a predetermined number of times, for example until the number of clusters reaches a value specified by the user or a reference value held by the device.

[0021] 4. The cluster classification is registered in the cluster classification holding unit 12, and the cluster classification procedure is registered in the cluster classification procedure holding unit 13.
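Steps 1 to 4 of the attribute analysis can be sketched in Python as a recursive split. This is only an illustration under stated assumptions: the scoring function `efficiency` is one possible reading of "classification efficiency" (the average share of the largest group, as paragraph [0038] words it), the stopping rule is a simple depth limit `max_depth`, and the `sales` records are hypothetical rather than the data of Table 1. All function and variable names are invented for the sketch.

```python
from collections import Counter, defaultdict


def split_by(records, attr):
    """Group records by the value of one attribute (a provisional classification)."""
    groups = defaultdict(list)
    for record in records:
        groups[record[attr]].append(record)
    return dict(groups)


def efficiency(records, attr, other_attrs):
    """Assumed score: after splitting on attr, average the share taken by the
    most common value of every other attribute inside each group."""
    shares = []
    for group in split_by(records, attr).values():
        for other in other_attrs:
            top_count = Counter(r[other] for r in group).most_common(1)[0][1]
            shares.append(top_count / len(group))
    return sum(shares) / len(shares) if shares else 0.0


def build_procedure(records, attrs, max_depth):
    """Pick the highest-scoring attribute, split, and repeat inside each cluster."""
    if max_depth == 0 or not attrs or len(records) <= 1:
        return {"records": records}  # leaf: one final cluster
    best = max(attrs, key=lambda a: efficiency(records, a, [b for b in attrs if b != a]))
    rest = [a for a in attrs if a != best]
    children = {value: build_procedure(group, rest, max_depth - 1)
                for value, group in split_by(records, best).items()}
    return {"attr": best, "children": children}


# Hypothetical sales records, not the patent's Table 1.
sales = [
    {"gender": "male", "size": "large", "color": "red"},
    {"gender": "male", "size": "large", "color": "blue"},
    {"gender": "female", "size": "small", "color": "red"},
    {"gender": "female", "size": "small", "color": "blue"},
    {"gender": "male", "size": "medium", "color": "red"},
    {"gender": "male", "size": "medium", "color": "blue"},
]
procedure = build_procedure(sales, ["gender", "size", "color"], max_depth=2)
```

In this toy data, size separates the genders cleanly, so `procedure["attr"]` comes out as `"size"`. The nested dictionary plays the role of the cluster classification procedure, and its leaves play the role of the cluster classification.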

[0022] The procedure designating means 2 allows the user to specify the format in which the cluster classification procedure is expressed, for example a decision tree, rules, an attribute order, or representative points.

[0023] A decision tree expresses the classification process in the form of branching paths.

[0024] Rules express the classification process by writing each branch in the form "if ..., and ..., and ..., then cluster X".
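The relation between the decision-tree form and the rule form can be illustrated with a short Python sketch that walks every branching path of a tree and emits one rule per leaf. The nested-dictionary tree layout, the cluster names `C1` to `C4`, and the function name `tree_to_rules` are assumptions made for this illustration; the patent does not prescribe a data structure.

```python
def tree_to_rules(node, path=()):
    """Turn each root-to-leaf path into a rule: (conditions, cluster)."""
    if "cluster" in node:
        return [(dict(path), node["cluster"])]
    rules = []
    for value, child in node["children"].items():
        rules.extend(tree_to_rules(child, path + ((node["attr"], value),)))
    return rules


# Hypothetical tree: split on size first, then on gender only for "medium".
tree = {
    "attr": "size",
    "children": {
        "large": {"cluster": "C1"},
        "small": {"cluster": "C2"},
        "medium": {
            "attr": "gender",
            "children": {
                "male": {"cluster": "C3"},
                "female": {"cluster": "C4"},
            },
        },
    },
}

rules = tree_to_rules(tree)
for conditions, cluster in rules:
    text = " and ".join(f"{attr} is {value}" for attr, value in conditions.items())
    print(f"if {text}, then cluster {cluster}")
```

Each printed line corresponds to one rule of the "if ..., then cluster X" form, so the same procedure can be shown to the user in either representation.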

[0025] The attribute order lists the attributes in descending order of classification efficiency.

[0026] Representative points express each cluster by the attribute values of a point that is representative of that cluster. As one way of choosing a representative point, if there are N kinds of attributes, each data item may be regarded as a point mapped to coordinates in an N-dimensional space, and the representative point may be taken as the mean of the coordinate values within each cluster.
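The mean-of-coordinates choice of representative point can be sketched as follows. The sketch assumes the attribute values have already been encoded as numbers (the text does not say how categorical values such as color would be mapped to coordinates), and the function name and sample points are hypothetical.

```python
def representative_points(points, labels):
    """Return the per-cluster mean of N-dimensional points."""
    sums, counts = {}, {}
    for point, label in zip(points, labels):
        acc = sums.setdefault(label, [0.0] * len(point))
        for i, x in enumerate(point):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(s / counts[label] for s in acc)
            for label, acc in sums.items()}


# Hypothetical 2-attribute data already mapped to numeric coordinates.
points = [(0.0, 0.0), (2.0, 2.0), (10.0, 10.0)]
labels = ["A", "A", "B"]
centers = representative_points(points, labels)
```

Here cluster `A` is represented by the point `(1.0, 1.0)` and cluster `B` by `(10.0, 10.0)`.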

[0027] A modification of the representative points may be, besides moving a representative point, the addition or deletion of a representative point, that is, the addition or deletion of a cluster.

[0028] As another example of a modification, the contents of the cluster classification procedure holding unit 13 may be replaced in their entirety.

[0029] When the input data already includes a cluster classification, that classification is treated as one more attribute in the attribute analysis, and the cluster classification procedure is output.

[0030] The prediction means 20 may use a conventional data processing procedure, as long as it classifies a plurality of data items having a plurality of attributes into clusters according to the cluster classification procedure held by the attribute analysis unit 11 and predicts the values of unknown attributes.

[0031] The prediction means 20 according to claim 3 is characterized by creating a correct/incorrect order table in the following procedure, in a state where a cluster classification procedure that includes the attribute to be predicted is registered in the cluster classification procedure holding unit 13. The registered cluster classification procedure may be one generated by analyzing the input data or one input from outside.

[0032]
1. The cluster creating means 10 classifies a plurality of data items having a plurality of attributes into clusters according to the cluster classification procedure.
2. The correct/incorrect analysis unit 21 takes the values of the unknown attributes included in the classification result as the prediction result, compares the prediction result with the attribute values of the input data, and creates a correct/incorrect order table in which the attributes are arranged in descending order of correct answer rate.
3. The prediction result is registered in the prediction result holding unit 22, and the correct/incorrect order table is registered in the correct/incorrect order table holding unit 23.
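A correct/incorrect order table of this kind can be sketched in Python as below. The formula the patent uses for the correct answer rate is not reproduced in this text, so the sketch assumes the plain ratio of matching predictions to total data items; the function name and the sample records are likewise hypothetical.

```python
def correct_incorrect_order(actual, predicted, attrs):
    """Rank attributes by assumed correct answer rate (matches / total)."""
    rates = {}
    for attr in attrs:
        hits = sum(1 for a, p in zip(actual, predicted) if a[attr] == p[attr])
        rates[attr] = hits / len(actual)
    # Highest correct answer rate first.
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Hypothetical input data and the values predicted for the same items.
actual = [
    {"gender": "male", "color": "red"},
    {"gender": "female", "color": "blue"},
    {"gender": "male", "color": "blue"},
    {"gender": "female", "color": "red"},
]
predicted = [
    {"gender": "male", "color": "red"},
    {"gender": "female", "color": "red"},
    {"gender": "male", "color": "red"},
    {"gender": "male", "color": "red"},
]
order_table = correct_incorrect_order(actual, predicted, ["color", "gender"])
```

With these sample values, gender is predicted correctly for three of four items and color for two of four, so gender is listed first.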

[0033] When the user wishes to change the prediction conditions, the user modifies the contents of the cluster classification procedure holding unit 13 through the correction input means 1; prediction is then performed again under the modified conditions, and the result is output.

[0034]

[Embodiments of the Invention] The following describes, as an example, the processing performed when the data in Table 1 is input, as product sales record data, to the data analysis device of the present invention shown in FIG. 1.

[0035] The attribute analysis unit 11 classifies the input data items, each having a plurality of attributes, into clusters according to the contents of the cluster classification procedure holding unit 13, and registers the result in the cluster classification holding unit 12.

[0036] The data output means outputs the cluster classification procedure and the data-cluster correspondence table.

[0037] As an example of a technique for generating the cluster classification procedure, FIGS. 2 to 4 show the processing performed when the data in Table 1 is input to the attribute analysis unit 11 according to claim 2. Any other technique may be used, as long as it classifies the data into clusters attribute by attribute, starting from the attribute with the highest classification efficiency.

[0038] As a result of this analysis, attribute 2 (size), for which the average share of the whole data occupied by the largest cluster is highest, that is, whose classification efficiency is highest, is taken as the first attribute, and the first classification is performed. FIG. 5 shows the result of carrying this processing down to the level at which cluster classification is still possible. In FIG. 5 the cluster classification procedure is expressed as a decision tree, but if the user so specifies it can also be expressed as rules, an attribute order, or representative points, as shown in FIGS. 6 to 8.

[0039] FIG. 6(1) shows an example in which the cluster classification procedure obtained by the above technique is expressed as rules. FIG. 6(2) shows an example in which, as general statistical processing, rules that match only a small number of cases have been deleted through the correction input means 1.

[0040] FIG. 8(1) shows an example in which the cluster classification procedure obtained by the above technique is expressed as representative points. FIG. 8(2) shows an example in which, as a general representative-point selection process, representative points that lie close to one another have been merged.

[0041] The above are examples of modifications made as general statistical processing. When the user has other a priori knowledge about the data, further modifications can be made through the correction input means 1.

[0042] Next, as an example in which the prediction means 20 predicts the value of an unknown attribute, the case of the following conditions is described.
1. The rules of FIG. 6(2), obtained when the user modified the result of analyzing the data in Table 1, are registered in the cluster classification procedure holding unit 13.
2. The data of Table 1 with the gender attribute missing is given as input, and the gender is predicted.

[0043] In this case, the rules of FIG. 6(2) are applied in order from the top, and the prediction result of FIG. 9(1) is obtained.
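Applying rules in order from the top can be sketched as a first-match lookup. The rule list below is hypothetical and does not reproduce FIG. 6(2), which is not available in this text; its first two entries merely echo the user knowledge quoted in paragraph [0011]. The function name `predict_gender` is also an assumption.

```python
def predict_gender(record, rules, default=None):
    """Return the prediction of the first rule whose conditions all match."""
    for conditions, prediction in rules:
        if all(record.get(attr) == value for attr, value in conditions.items()):
            return prediction
    return default  # no rule matched


# Hypothetical rule table, ordered from top to bottom.
rules = [
    ({"size": "large"}, "male"),
    ({"size": "small"}, "female"),
    ({"size": "medium", "color": "blue"}, "male"),
]

# Records whose gender attribute is missing.
unknown = [
    {"size": "large", "color": "red"},
    {"size": "small", "color": "blue"},
    {"size": "medium", "color": "blue"},
    {"size": "medium", "color": "red"},
]
predictions = [predict_gender(r, rules) for r in unknown]
```

The last record matches no rule and yields `None`. Because the matching rule is known for every prediction, it can be output alongside the predicted value, in the spirit of showing the user the basis of the prediction.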

[0044] The correct/incorrect analysis unit 21 according to claim 3 is characterized by comparing the prediction result with the attribute values of the input data, performing the per-attribute correct/incorrect analysis shown in FIG. 10(1), creating a correct/incorrect order table in which the attributes are arranged in descending order of correct answer rate, and registering the table in the correct/incorrect order table holding unit 23.

[0045] The correct answer rate is given, for example, by the following formula, but any other calculation may be used as long as it yields a correct answer rate.

[0046]

[Equation 1]

FIG. 10(2) shows an example of a correct/incorrect order table based on the correct answer rate obtained with the above formula. The data output means outputs the predicted values, the cluster classification procedure, and the correct/incorrect order table. This lets the user see the basis of the prediction.

[0047] It is also possible to add rule B, shown in FIG. 6(1), on the basis of the user's a priori knowledge of the product:
1. Products of size "large" are for men.
2. Products of size "small" are for women.
The modified rule table for this case is shown in FIG. 9(2).

[0048] FIG. 9(3) shows the improved prediction result obtained with the rule table of FIG. 9(2).

[0049]

[Effects of the Invention] As described above, according to the present invention, in an analysis that classifies a plurality of data items having a plurality of attributes into several clusters of similar items, the cluster classification procedure is output together with the cluster to which each data item belongs. This makes the analysis process easier to understand. Also, in an analysis that predicts the values of unknown attributes of the input data, the cluster classification procedure on which the prediction was based is output and presented to the user, and the user can make modifications in response. This makes it possible to realize data analysis that reflects the user's knowledge and experience.

[Brief description of the drawings]

FIG. 1: Configuration example of the present invention

FIG. 2: Attribute analysis 1 (gender)

FIG. 3: Attribute analysis 2 (size)

FIG. 4: Attribute analysis 3 (color)

FIG. 5: Analysis result example (decision tree)

FIG. 6: Analysis result example and modification example (rules)

FIG. 7: Analysis result example (attribute order table)

FIG. 8: Analysis result example and modification example (cluster-representative point correspondence table)

FIG. 9: Prediction result example

FIG. 10: Per-attribute correct/incorrect analysis example and correct/incorrect order table example

FIG. 11: Configuration example of a conventional data analysis device

FIG. 12: Cluster classification procedure and result example (rules)

FIG. 13: Rule example and prediction result example

[Explanation of symbols]

1 Correction input means
2 Procedure designating means
10 Cluster creating means
11 Attribute analysis unit
12 Cluster classification holding unit
13 Cluster classification procedure holding unit
20 Prediction means
21 Correct/incorrect analysis unit
22 Prediction result holding unit
23 Correct/incorrect order table holding unit

Claims

[Claims]

1. A data analysis device that classifies a plurality of data items having a plurality of attributes into clusters according to a cluster classification procedure and predicts the value of an unknown attribute of each data item from the cluster classification result, the device comprising: data output means that outputs the cluster classification result, the cluster classification procedure, and the prediction result; correction input means that modifies the cluster classification procedure; cluster creating means that performs cluster classification according to the modified cluster classification procedure; and prediction means that predicts the value of the unknown attribute of each data item from the cluster classification result and the attributes included in the cluster classification procedure.

2. The data analysis device according to claim 1, wherein the cluster creating means comprises an attribute analysis unit that: performs provisional cluster classification on all input data on the basis of the values of all attributes of each input data item; takes, among the provisional cluster classification results, the attribute having the largest cluster as a first attribute and the cluster classification by the first attribute as a first classification; performs, for each cluster of the first classification, provisional cluster classification on the basis of all attributes other than the first attribute; takes, among those provisional cluster classification results, the attribute having the largest cluster as a second attribute and the cluster classification by the second attribute as a second classification; and repeats the above processing until the number of clusters reaches a value specified by the user or a reference value held by the device, or until the minimum number of data items in a cluster reaches a value specified by the user or a reference value held by the device.

3. The data analysis device according to claim 1, wherein the prediction means comprises a correct/incorrect analysis unit that classifies the input data into clusters according to the cluster classification procedure held by a cluster classification procedure holding unit, holds the values of the unknown attributes included in the cluster classification result as a prediction result, calculates the correct answer rate for each attribute, and holds a correct/incorrect order table in which the attributes are arranged in descending order of correct answer rate.

4. A data analysis method for classifying a plurality of data items having a plurality of attributes into clusters of similar items according to a cluster classification procedure and predicting the value of an unknown attribute of each data item from the cluster classification result, wherein: correction input means modifies the cluster classification procedure held by cluster creating means; the cluster creating means performs cluster classification according to the modified cluster classification procedure; the value of the unknown attribute of each data item is predicted from the attributes included in the cluster classification procedure; and data output means outputs the cluster classification result, the cluster classification procedure, and the prediction result.

5. A recording medium storing a computer program for classifying a plurality of data items having a plurality of attributes into clusters of similar items according to a cluster classification procedure and predicting the value of an unknown attribute of each data item from the cluster classification result, the computer program providing: data output means that outputs the cluster classification result and the cluster classification procedure; correction input means that modifies the cluster classification procedure; cluster creating means that performs cluster classification according to the modified cluster classification procedure; and prediction means that predicts the value of the unknown attribute of each data item from the cluster classification result and the attributes included in the cluster classification procedure.