JPH1115832A

JPH1115832A - Cluster generation device and recording medium

Info

Publication number: JPH1115832A
Application number: JP9168267A
Authority: JP
Inventors: Tadako Oota; 唯子太田; Nobuhiro Yugami; 伸弘湯上; Aoshi Okamoto; 青史岡本; Osamu Sato; 理佐藤
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1997-06-25
Filing date: 1997-06-25
Publication date: 1999-01-22

Abstract

PROBLEM TO BE SOLVED: To shorten cluster generation time and to generate a cluster hierarchy in which one case may belong to plural clusters, by regarding clusters in a cluster set that share a large number of cases as highly related and generating new clusters from such clusters. SOLUTION: An initial cluster generation means 2 generates a cluster for each attribute value from the given cases read from a case database 5. A pair selection means 3 selects a pair of clusters from the generated clusters. A cluster generation means 4 counts the number of cases contained in common in the selected pair, generates a new cluster from the plural clusters of the pair with the largest count, and stores it in a cluster database 6. Clusters in the cluster set that share a large number of cases are regarded as highly related, and the generation of a new cluster from plural clusters is repeated to build the cluster hierarchy.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a cluster generation device that generates clusters from given cases.

[0002] When no classes are given for classifying cases, deciding which classes to use becomes an important problem. Clustering is a technique that, in such a situation, generates the classes themselves or clusters that are subsets of them. In general, a cluster is a set of similar cases; clustering can be used, for example, to detect groups of similar people in questionnaire data. The quality of the resulting cluster hierarchy depends on how the similarity between clusters is measured and on how clusters are generated. The present invention concerns both of these: how similarity between clusters is measured and how clusters are generated.

[0003]

[Prior Art] Conventional cluster generation methods are either sequential methods or batch methods. In a sequential method, the cases to be clustered are supplied not all at once but a few at a time. When a new case is given, the clusters built so far are modified rather than rebuilt from scratch. The time required for cluster generation is therefore short, but the quality of the generated cluster hierarchy depends heavily on the order in which the cases are input.

[0004] In a batch method, all cases are given at once, so the input order is not an issue, but cluster generation takes time. For example, the computational cost of Ward's method, a representative batch method, is proportional to the cube of the total number of cases.

[0005]

[Problems to Be Solved by the Invention] As described above, a conventional sequential method can generate clusters quickly when a new case is given, but the generated cluster hierarchy depends heavily on the input order of the cases.

[0006] A conventional batch method has no input-order problem, but cluster generation takes time. Moreover, most methods, sequential and batch alike, generate a cluster hierarchy in which one case belongs to only one class. Solving a problem often requires several viewpoints, and with a conventional method that generates a hierarchy in which one case is contained in only one class, a separate hierarchy has to be generated for each viewpoint. A cluster hierarchy generation method is therefore desired that does not depend on the input order of the cases, needs little time to generate the hierarchy, and allows one case to belong to plural clusters.

[0007] To solve these problems, an object of the present invention is to collect, for each value of each attribute, the cases having that value into a cluster, to regard clusters in the cluster set that share a large number of cases as highly related, and to repeatedly generate new clusters from such plural clusters, thereby generating, in a short time, a cluster hierarchy that allows one case to belong to plural clusters.

[0008]

[Means for Solving the Problems] The means for solving the problems are described with reference to Fig. 1. In Fig. 1, a processing device 1 loads a program from a recording medium (not shown) into main memory, starts it, and performs various kinds of processing. Here it comprises an initial cluster generation means 2, a pair selection means 3, a cluster generation means 4, and the like.

[0009] The initial cluster generation means 2 generates clusters. The pair selection means 3 selects pairs of clusters. The cluster generation means 4 counts the number of commonly appearing cases for each selected pair and, among other things, generates a new cluster from the pair with the largest count.

[0010] The case database 5 stores the cases. The cluster database 6 stores the generated clusters. The display device 7 displays clusters and the like.

[0011] The input device 8 is used to input various instructions and data. The operation is as follows. The initial cluster generation means 2 generates a cluster for each attribute value from the given cases read from the case database 5. The pair selection means 3 selects a pair of plural (for example, two) clusters from the generated clusters. The cluster generation means 4 counts the number of cases contained in common in the selected pair and, for the pair with the largest count, generates a new cluster from the plural clusters of that pair and stores it in the cluster database 6.

[0012] In doing so, the union of the plural clusters is generated as the new cluster. The intersection of the plural clusters may also be generated as a new cluster.

[0013] Accordingly, by collecting, for each value of each attribute, the cases having that value into a cluster, regarding clusters in the cluster set that share a large number of cases as highly related, and repeatedly generating new clusters from such plural clusters, a cluster hierarchy that allows one case to belong to plural clusters can be generated in a short time.

[0014]

[Embodiments of the Invention] Next, embodiments and the operation of the present invention are described in detail, in order, with reference to Figs. 2 to 5.

[0015] Fig. 2 is a flowchart explaining the operation of the present invention. It describes the detailed operation of the configuration of Fig. 1. In Fig. 2, S1 inputs the case set.

[0016] S2 determines the parameter K and the cluster limit number L. As noted on the right-hand side of the flowchart, K and L are input, and from them K is fixed as the positive real multiplier applied to the standard deviation (K × standard deviation), and L is fixed as the maximum total number of classes or their subsets (clusters).

[0017] S3 generates the initial cluster hierarchy. For the case set input in S1, the initial cluster hierarchy shown in Fig. 3 (described later) is generated: the cluster containing all input cases is C_root, and each leaf cluster collects the cases for which attribute a_m has the value v_mv.

[0018] S4 determines whether processing is finished, that is, whether the processing from S5 onward has been repeated until the number of clusters has reached L or no clusters remain to be selected as a pair. If YES, processing ends. If NO, processing proceeds to S5.

[0019] S5 selects a cluster pair. S6 counts the number of commonly appearing cases, that is, the number of cases that appear in both clusters of the selected pair.

[0020] S7 calculates the estimated number of common cases (described later). S8 determines whether the number of commonly appearing cases counted in S6 is sufficiently larger than the estimated number of common cases calculated in S7. If YES, processing proceeds to S9. If NO, processing returns to S4 and repeats.

[0021] S9 calculates the ratio of the numbers of cases. S10 stores the largest ratio. S11 determines whether all pairs have been tried. If YES, S12 selects the pair with the largest ratio, S13 generates a new cluster, and processing returns to S4 and repeats. If NO in S11, processing returns to S4 and repeats.

[0022] S3 to S13 are now described in detail. As shown in Fig. 3 (described later), the cluster containing all cases is generated as C_root, and the sets of cases for which attribute a_m has the value v_mv are generated as leaf clusters (S3). The relatedness of a pair of clusters is then evaluated by the number of cases that appear in both. If that number is sufficiently larger than the estimated number of common cases (YES in S8), the pair is regarded as highly related. The estimated number of common cases is calculated by the following equation.

[0023]

$$e_{ij} = \frac{|C_i| \times |C_j|}{N} \tag{1}$$

Here, |C_i| and |C_j| are the numbers of cases contained in clusters i and j, respectively, and N is the total number of cases. The corresponding standard deviation is

$$d_{ij} = \frac{\{|C_i| \times |C_j| \times (N - |C_i|) \times (N - |C_j|)\}^{1/2}}{\{N^2 \times (N - 1)\}^{1/2}} \tag{2}$$

When the following expression (3), which includes the parameter K, is satisfied, the two clusters are regarded as highly related and become a selection candidate (S8).

[0024]

$$|C_i \cap C_j| > e_{ij} + K \cdot d_{ij} \tag{3}$$

Here, the left-hand side is the intersection shown in Fig. 4(e) (described later). Next, the following expression (4) is calculated (S9).

[0025]

$$\frac{|C_i \cap C_j|}{|C_i| + |C_j|} \tag{4}$$

Here, the numerator is the intersection shown in Fig. 4(e) (described later), and the denominator is the union shown in Fig. 4(d) (described later), which is expressed as |C_i| + |C_j| and is the same quantity used in steps S38 and S39 of Fig. 7.
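As a rough numerical illustration of expressions (1) to (4), the following worked example uses hypothetical values that do not appear in the source (N = 100, |C_i| = 20, |C_j| = 30, an observed overlap of 10, and K = 2), and takes the denominator of (4) as |C_i| + |C_j|, as in steps S38 and S39 of Fig. 7.

```latex
\begin{align*}
e_{ij} &= \frac{20 \times 30}{100} = 6 \\
d_{ij} &= \frac{\sqrt{20 \times 30 \times 80 \times 70}}{\sqrt{100^2 \times 99}}
        = \sqrt{\frac{3\,360\,000}{990\,000}} \approx 1.842 \\
e_{ij} + K \cdot d_{ij} &\approx 6 + 2 \times 1.842 = 9.684 \\
|C_i \cap C_j| = 10 &> 9.684 \quad \text{(expression (3) holds)} \\
\frac{|C_i \cap C_j|}{|C_i| + |C_j|} &= \frac{10}{50} = 0.2
\end{align*}
```

With these assumed numbers the pair would become a selection candidate with ratio 0.2; an overlap of 9 would fall below the threshold and the pair would be skipped.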

[0026] From these results, the cluster pair that satisfies expression (3) and maximizes expression (4) is selected. This mitigates the tendency for clusters with many cases to be selected together (S9, S10).
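The selection rule described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names, the dictionary-of-sets representation, and the sample data are assumptions, and the denominator of (4) is taken as |C_i| + |C_j| as in steps S38 and S39.

```python
from itertools import combinations
from math import sqrt


def expected_overlap(size_i, size_j, n):
    """Expression (1): estimated number of common cases."""
    return size_i * size_j / n


def overlap_std(size_i, size_j, n):
    """Expression (2): standard deviation of the overlap (requires n > 1)."""
    return sqrt(size_i * size_j * (n - size_i) * (n - size_j)) / sqrt(n * n * (n - 1))


def select_pair(clusters, n, k):
    """Return the pair that passes test (3) and maximizes ratio (4).

    clusters: dict mapping a cluster name to a set of case ids (assumed layout).
    Returns (None, 0.0) when no pair passes test (3).
    """
    best_ratio, best_pair = 0.0, None
    for (name_i, c_i), (name_j, c_j) in combinations(clusters.items(), 2):
        common = len(c_i & c_j)
        e_ij = expected_overlap(len(c_i), len(c_j), n)
        d_ij = overlap_std(len(c_i), len(c_j), n)
        if common > e_ij + k * d_ij:                 # expression (3)
            ratio = common / (len(c_i) + len(c_j))   # expression (4)
            if ratio > best_ratio:
                best_ratio, best_pair = ratio, (name_i, name_j)
    return best_pair, best_ratio


# Hypothetical data: 10 cases with ids 0..9.
example = {
    "A": {0, 1, 2, 3},
    "B": {0, 1, 2, 4},
    "C": {5, 6, 7, 8, 9},
    "D": {0, 5},
}
print(select_pair(example, n=10, k=1.0))
```

Dividing by the combined size in (4) is what keeps large clusters from dominating: a large cluster overlaps many others in absolute terms, but its ratio stays small.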

[0027] The union of the selected cluster pair is then generated as a new cluster and linked above the selected pair, which gives Fig. 4(b) (described later). Likewise, the intersection of the selected cluster pair is generated as a new cluster and linked below the selected pair, which gives Fig. 4(c) (described later).

[0028] As described above, by generating a cluster for each attribute value and repeatedly generating new clusters from highly related cluster pairs to build the cluster hierarchy, a cluster hierarchy that allows one case to belong to plural classes can be generated in a short time. The details are described in order below.

[0029] Fig. 3 is an explanatory diagram (part 1) of the present invention. It illustrates the concept of the initial clusters. Fig. 3(a) is a conceptual diagram of the initial clusters. First, the cluster containing all cases is generated as C_root. Next, the sets of cases for which attribute a_m has the value v_mv are generated as leaf clusters, as illustrated.

[0030] Fig. 3(b) shows a concrete example of the initial clusters. Here, the cluster "furniture" is generated as the cluster C_root containing all cases. Next, as leaf clusters, the cases whose attribute "number of legs" has the value "0" or "3" are generated as the clusters "number of legs = 0" and "number of legs = 3", as illustrated. Similarly, the cases whose attribute "material" has the value "wood" are generated as the leaf cluster "material = wood", as illustrated.

[0031] As described above, the initial clusters can be generated from the case set by generating the cluster C_root containing all cases and generating, as leaf clusters, one set of cases for each attribute value.
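The initial cluster generation can be sketched in Python as below. The data layout (a dictionary from case id to attribute/value pairs), the function name, and the three furniture cases are illustrative assumptions; only the idea of one leaf cluster per attribute value plus a root cluster holding every case comes from the description.

```python
from collections import defaultdict


def build_initial_clusters(cases):
    """Build C_root and one leaf cluster per (attribute, value) pair.

    cases: dict mapping a case id to a dict of attribute -> value (assumed layout).
    Returns (root, leaves), where leaves maps (attribute, value) to a frozenset of ids.
    """
    root = frozenset(cases)                      # C_root contains every case
    leaves = defaultdict(set)
    for case_id, attributes in cases.items():
        for attribute, value in attributes.items():
            leaves[(attribute, value)].add(case_id)
    return root, {key: frozenset(ids) for key, ids in leaves.items()}


# Hypothetical furniture cases, loosely modeled on the example of Fig. 3(b).
furniture = {
    "stool": {"number of legs": 3, "material": "wood"},
    "chair": {"number of legs": 4, "material": "wood"},
    "box": {"number of legs": 0, "material": "plastic"},
}
root, leaves = build_initial_clusters(furniture)
print(sorted(leaves[("material", "wood")]))
```

Because a case contributes to one leaf per attribute, every case already belongs to several leaf clusters at this stage, which is what later allows one case to sit in plural clusters of the hierarchy.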

[0032] Fig. 4 is an explanatory diagram (part 2) of the present invention. It illustrates the concept of generating one new cluster from plural strongly related clusters, starting from the initial clusters.

[0033] Fig. 4(a) shows the two clusters C_i and C_j of a pair selected from the cluster hierarchy. Fig. 4(b) shows an example of the method that takes the union as the new cluster. When the union of the two clusters C_i and C_j is taken as a new cluster C_m, the new cluster C_m is linked above the pair, as illustrated.

[0034] Fig. 4(c) shows an example of the method that takes the intersection as the new cluster. When the intersection of the two clusters C_i and C_j is taken as a new cluster C_m, the new cluster C_m is linked below the pair, as illustrated.

[0035] Fig. 4(d) schematically shows the union. The union (the case of Fig. 4(b)) is expressed as |C_i| + |C_j| and is the sum of the hatched portions of cluster C_i and cluster C_j.

[0036] Fig. 4(e) schematically shows the intersection. The intersection (the case of Fig. 4(c)) is expressed as |C_i ∩ C_j| and is the hatched portion where cluster C_i and cluster C_j overlap.
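The two linking operations of Fig. 4(b) and Fig. 4(c) can be sketched as follows. The `Cluster` class, its field names, and the label format are hypothetical; the source only states that a union is linked above the selected pair and an intersection below it.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity comparison avoids recursing through parent/child cycles
class Cluster:
    label: str
    cases: frozenset
    parents: list = field(default_factory=list)
    children: list = field(default_factory=list)


def link_union(c_i, c_j):
    """Create the union cluster and link it above the pair (Fig. 4(b))."""
    new = Cluster(f"{c_i.label} or {c_j.label}", c_i.cases | c_j.cases)
    for member in (c_i, c_j):
        new.children.append(member)
        member.parents.append(new)
    return new


def link_intersection(c_i, c_j):
    """Create the intersection cluster and link it below the pair (Fig. 4(c))."""
    new = Cluster(f"{c_i.label} and {c_j.label}", c_i.cases & c_j.cases)
    for member in (c_i, c_j):
        new.parents.append(member)
        member.children.append(new)
    return new


# Hypothetical leaf clusters.
legs3 = Cluster("number of legs = 3", frozenset({"stool"}))
wood = Cluster("material = wood", frozenset({"stool", "chair"}))
upper = link_union(legs3, wood)
lower = link_intersection(legs3, wood)
print(upper.label, sorted(upper.cases))
print(lower.label, sorted(lower.cases))
```

Keeping both parent and child lists means the result is a general graph rather than a tree, which matches a hierarchy in which one cluster, and hence one case, can hang under several others.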

[0037] Fig. 5 shows an example of a cluster hierarchy according to the present invention. It is an example of the cluster hierarchy obtained from the initial clusters of Fig. 3(b) after generating a new cluster by the union method and a new cluster by the intersection method, both described above.

[0038] For example, the union cluster on the left side is the new cluster "number of legs = 3 or material = wood", generated as the union of the pair consisting of the cluster "number of legs = 3" and the cluster "material = wood" and linked above them.

[0039] Similarly, the intersection cluster on the right side is the new cluster "material = plastic and shape = square", generated as the intersection of the pair consisting of the cluster "material = plastic" and the cluster "shape = square" and linked below them.

[0040] As described above, by finding a related pair in the cluster set and linking its union above it or its intersection below it, a new cluster can be generated from a highly related cluster pair.

[0041] Figs. 6 and 7 are flowcharts explaining the system operation of the present invention. They describe the detailed system operation of Fig. 2 described above; the corresponding steps S1 to S13 of Fig. 2 are noted at the left edge.

[0042] In Fig. 6, S21 inputs the case set (S1). S22 determines the parameter K and the cluster limit number L. S23 initializes m = 0, that is, it initializes the variable m used when a new cluster C_m is generated.

[0043] S24 initializes i = 0. S25 initializes j = 0. S26 creates the cluster C_m consisting of all cases for which attribute a_i has the value v_ij.

[0044] S27 sets m = m + 1. S28 determines whether j = (number of values taken by a_i) - 1. If YES, processing proceeds to S30. If NO, S29 sets j = j + 1, and processing returns to S26 to create the next cluster C_m.

[0045] S30 determines whether i = (total number of attributes) - 1. If YES, processing proceeds to S32. If NO, S31 sets i = i + 1, and processing returns to S25 and repeats. S32 initializes P_max = 0 and C_max = 0.

[0046] S33 initializes i = 0. S34 sets j = i + 1. S35 counts the number of cases |C_i ∩ C_j| contained in both cluster C_i and cluster C_j.

[0047] S36 calculates the estimated number of common cases e_ij for the case where C_i and C_j are independent, and its standard deviation d_ij. S37 in Fig. 7 determines whether |C_i ∩ C_j| > e_ij + K · d_ij. If YES, processing proceeds to S38. If NO, processing proceeds to S40.

[0048] S38 determines whether P_max < |C_i ∩ C_j| / (|C_i| + |C_j|). If YES, processing proceeds to S39. If NO, processing proceeds to S40. S39 sets P_max = |C_i ∩ C_j| / (|C_i| + |C_j|) and C_max ← (C_i, C_j).

[0049] S40 determines whether j = m - 1. If YES, processing proceeds to S42. If NO, S41 sets j = j + 1, and processing returns to S35 and repeats. S42 determines whether i = m - 2. If YES, processing proceeds to S44. If NO, S43 sets i = i + 1, and processing returns to S34 and repeats.

[0050] S44 determines whether C_max ≠ 0. If YES, processing proceeds to S45. If NO, processing ends. S45 creates C_m from C_max using the operator.

[0051] S46 sets m = m + 1. S47 determines whether m = L. If YES, processing ends. If NO, processing returns to S32 and repeats.
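The flow of Figs. 6 and 7 can be approximated by the Python sketch below. It is a simplified reading under stated assumptions, not the patented procedure: the function and parameter names are invented, the root cluster is not counted toward L, and a candidate whose combined cluster already exists is skipped. That last rule is an added assumption so that the same pair is not selected again on every pass; the description only says that processing stops when no pair remains to be selected.

```python
from itertools import combinations
from math import sqrt


def generate_hierarchy(cases, k, limit, combine=frozenset.union):
    """Return the list of clusters (leaves first, then generated clusters).

    cases:   dict mapping a case id to a dict of attribute -> value (assumed layout)
    k:       parameter K of expression (3)
    limit:   cluster limit number L
    combine: frozenset.union or frozenset.intersection (the "operator" of S45)
    """
    n = len(cases)
    # S23-S31: one leaf cluster per (attribute, value) pair.
    leaves = {}
    for case_id, attributes in cases.items():
        for attribute, value in attributes.items():
            leaves.setdefault((attribute, value), set()).add(case_id)
    clusters = [frozenset(ids) for ids in leaves.values()]
    if n < 2:
        return clusters  # expression (2) is undefined for fewer than two cases

    while len(clusters) < limit:                           # S47
        p_max, c_max = 0.0, None                           # S32
        for c_i, c_j in combinations(clusters, 2):         # S33-S43
            size_i, size_j = len(c_i), len(c_j)
            common = len(c_i & c_j)                        # S35
            e_ij = size_i * size_j / n                     # S36, expression (1)
            d_ij = sqrt(size_i * size_j * (n - size_i) * (n - size_j)) / sqrt(n * n * (n - 1))
            if common <= e_ij + k * d_ij:                  # S37, expression (3)
                continue
            candidate = combine(c_i, c_j)
            if candidate in clusters:                      # assumption: skip duplicates
                continue
            ratio = common / (size_i + size_j)             # S38, expression (4)
            if ratio > p_max:
                p_max, c_max = ratio, candidate            # S39
        if c_max is None:                                  # S44: nothing selectable
            break
        clusters.append(c_max)                             # S45, S46
    return clusters
```

Passing `combine=frozenset.intersection` gives the intersection variant; linking the new cluster above or below its pair is omitted here to keep the sketch short.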

[0052]

[Effects of the Invention] As described above, the present invention collects, for each value of each attribute, the cases having that value into a cluster, regards clusters in the cluster set that share a large number of cases as highly related, and repeatedly generates new clusters from such plural clusters to build the cluster hierarchy. A cluster hierarchy that allows one case to belong to plural classes can therefore be generated in a short time. As a result, (1) one cluster is generated for each attribute value and new clusters are generated from clusters that share many cases, so that situations in which one case belongs to plural classes can also be handled.

[0053] (2) The processing time required to generate the cluster hierarchy is proportional to the cube of the number of clusters in the hierarchy and to the total number of cases. A class set that is too complex is hard to use as knowledge and is usually not needed. The number of clusters in the hierarchy, which is set as a parameter, can therefore be made small compared with the number of cases. As a result, clusters can be generated in a relatively short time.
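One way to see where this cost could come from is the rough estimate below. It rests on assumptions not stated in the source: each new cluster is preceded by a scan over all cluster pairs, counting the common cases of one pair takes time linear in N, and at most L clusters are ever held.

```latex
T \;\lesssim\;
\underbrace{L}_{\text{new clusters}}
\times
\underbrace{\binom{L}{2}}_{\text{pairs per scan}}
\times
\underbrace{O(N)}_{\text{count } |C_i \cap C_j|}
\;=\; O(L^{3} N)
```

Under these assumptions the cost grows only linearly in the number of cases N, so choosing L much smaller than N keeps the total well below the cubic cost in N attributed to Ward's method in the prior-art discussion.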

[0054] (3) Owing to (1) and (2) above, the clusters that constitute classes can be learned effectively even from a large-scale case set in which one case belongs to plural concepts.

[Brief description of the drawings]

[Fig. 1] System configuration diagram of the present invention.

[Fig. 2] Flowchart explaining the operation of the present invention.

[Fig. 3] Explanatory diagram (part 1) of the present invention.

[Fig. 4] Explanatory diagram (part 2) of the present invention.

[Fig. 5] Example of a cluster hierarchy according to the present invention.

[Fig. 6] Flowchart (part 1) explaining the system operation of the present invention.

[Fig. 7] Flowchart (part 2) explaining the system operation of the present invention.

[Explanation of symbols]

1: Processing device
2: Initial cluster generation means
3: Pair selection means
4: Cluster generation means
5: Case database
6: Cluster database
7: Display device
8: Input device

Continuation of front page
(72) Inventor: Seishi Okamoto, c/o Fujitsu Limited, 4-1-1 Kamikodanaka, Nakahara-ku, Kawasaki-shi, Kanagawa
(72) Inventor: Osamu Sato, c/o Fujitsu Limited, 4-1-1 Kamikodanaka, Nakahara-ku, Kawasaki-shi, Kanagawa

Claims

[Claims]

1. A cluster generation device for generating clusters based on given cases, comprising: means for generating a cluster for each attribute value based on the given cases; and means for generating a new cluster from a plurality of clusters among the generated clusters when the number of cases commonly included in the plurality of clusters is large.

2. The cluster generation device according to claim 1, wherein, in generating the new cluster from the plurality of clusters, the union of the plurality of clusters is taken as the new cluster.

3. The cluster generation device according to claim 1, wherein, in generating the new cluster from the plurality of clusters, the intersection of the plurality of clusters is taken as the new cluster.

4. A recording medium storing a program that functions as: means for generating a cluster for each attribute value based on given cases; and means for generating a new cluster from a plurality of clusters among the generated clusters when the number of cases commonly included in the plurality of clusters is large.