JPH10260988A

JPH10260988A - Clustering method

Info

Publication number: JPH10260988A
Application number: JP9068109A
Authority: JP
Inventors: Hideki Tanaka; 英輝田中
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 1997-03-21
Filing date: 1997-03-21
Publication date: 1998-09-29
Anticipated expiration: 2017-03-21
Also published as: JP4004584B2

Abstract

PROBLEM TO BE SOLVED: To hierarchically classify a large number of cases by taking as input, as an initial case set, a case set consisting of the observed values of a plurality of observation items, and successively bisecting the case set so that the reduction in variation within the case sets is maximized. SOLUTION: An initial case set is input as the first input data (step S1), and a case set is input to the termination-condition check (step S2). If the number of cases in the initial case set exceeds a predefined number K (step S2A) and the observed values are not all identical across the cases of the initial case set (step S2B), the optimal observation item and a division threshold are calculated from the input case set (step S3). Each input case whose value for the optimal observation item is greater than the division threshold is classified into group 1; otherwise it is classified into group 2 (step S4), and a tree diagram is generated (step S5A).

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a clustering method belonging to the family of hierarchical classification methods.

[0002] More specifically, the present invention relates to a clustering method suited to information processing such as speech recognition or information retrieval.

[0003]

[Prior Art] Clustering is a statistical classification technique widely used in research that requires classification, such as the social sciences and biology. It is also widely used in industry, mainly in information processing fields such as speech recognition devices and information retrieval devices.

[0004] There are many clustering techniques. They are surveyed in detail in references such as the "Statistics Dictionary" (Hiroshi Takenaka, Toyo Keizai Shinposha), and can be broadly divided as follows.

[0005] The first is the partitioning method, which divides a set of cases (the objects to be classified are called cases) into non-overlapping subsets.

[0006] The second is the hierarchical method, which classifies a set of cases into upper and lower levels of a hierarchy.

[0007] Among these, the representative conventional hierarchical method is agglomerative clustering. Its input is a table of the observed values of each observation item for each case. For example, Table 1 below records the observed values of two observation items for each of six cases.

[0008]

[Table 1]

[0009] Using Table 1, a hierarchical classification of the cases is created by the following procedure.

[0010] 1. Calculate the similarity (distance) between cases from the table of observed values. Here, ₆C₂ = 15 distances are obtained.

[0011] 2. Repeat the following operations until the number of clusters becomes one.

[0012] 3. Merge the two clusters (or cases) separated by the smallest distance to create a new cluster.

[0013] 4. Recalculate the distances between clusters (or cases).

[0014]

[Problems to Be Solved by the Invention] In the conventional hierarchical method described above, however, the number of distances between cases clearly grows roughly in proportion to the square of the number of cases. That is, when the number of cases is N, the amount of storage needed to hold the distances grows in proportion to N².
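The quadratic growth can be checked with a short sketch. The function name `num_pairwise_distances` is illustrative and does not appear in the original text; the only fact used is that N cases give N-choose-2 distinct pairs.

```python
import math

def num_pairwise_distances(n_cases: int) -> int:
    """Number of distinct case-to-case distances: n choose 2."""
    return math.comb(n_cases, 2)

# Six cases give the 15 distances mentioned for Table 1; larger sets grow quadratically.
for n in (6, 1_000, 100_000):
    print(n, num_pairwise_distances(n))
```

For six cases this gives 15, and for 1,000 cases it already gives 499,500.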

[0015] Consequently, when the number of cases increases, the storage area for holding the distances runs short and clustering becomes difficult to execute. In addition, the number of calculations needed to find the minimum distance also increases, so execution takes a long time.

[0016] To avoid these problems, the number of target cases has conventionally been limited, or the method has been combined with a non-hierarchical technique.

[0017] In view of the above, an object of the present invention is to provide a clustering method capable of hierarchically classifying a large number of cases without resorting to such conventional workarounds.

[0018]

[Means for Solving the Problems] To achieve the above object, the clustering method according to the present invention takes as input, as an initial case set, a case set consisting of the observed values of a plurality of observation items, and successively bisects the case set so that the reduction in variation of the case set is maximized.

[0019] Here, the storage area required for processing can be reduced by forming the case set in a way that compresses the storage area needed to hold the observed values.

[0020] Further, the required number of operations can be reduced by sorting the observation items in descending order of their variation in the initial case set. Likewise, the number of operations can be reduced by bisecting only between differing frequencies among the observed values.

[0021]

[Embodiments of the Invention]

Overview of the Embodiments

The clustering method detailed below successively bisects the case set, based on the table of observed values of the cases, so as to maximize the reduction in variation of the case set.

[0022] The table of observed values can also be compressed to reduce the storage area required for processing.

[0023] In the above method, the number of calculations can also be reduced by sorting the observation items in descending order of their variation in the initial case set.

[0024] Furthermore, the number of calculations can be reduced by bisecting only between differing frequencies among the observed values.

[0025] In other words, the clustering method to which the present invention is applied has features (i), (ii), and (iii) listed below.

[0026] (i) Sequential bisection: The whole case set is first treated as a single cluster, which is then successively bisected. This is the reverse of the conventional agglomerative approach, which merges cases until a single cluster finally remains.

[0027] (ii) No use of distances: Distances between cases are not used. The clustering method to which the present invention is applied clusters the cases directly from their observed values. If the number of cases is N and the number of observation items is M, the required storage is proportional to N × M. In typical clustering problems M can be regarded as a small constant, so the storage is proportional to N.

[0028] The required storage is therefore far smaller than in the conventional method, and the processing time is correspondingly much shorter.

[0029] (iii) Cluster characteristics can be explained: The hierarchy obtained by executing the clustering method is a binary tree. The root node of this tree is the largest cluster, covering the entire case set. The leaf nodes hold case sets. Each internal node corresponds to an observation item and has two branches leading downward. A case is placed under the left branch if its observed value for the node's observation item exceeds the threshold, and under the right branch otherwise.

[0030] Thanks to this feature, the reason a case set was classified into a particular cluster can be found by tracing the binary tree from the root down to that case set; that is, the reason is expressed in terms of whether the frequencies of the observation items encountered along the way are large or small. Conventional agglomerative clustering has no such feature; it is unique to the clustering method to which the present invention is applied.
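As an illustration of how such a tree can be traced, here is a minimal sketch. The `Node` class, the `explain` function, and the hand-built tree are all assumptions made for this example; the tree merely mirrors the {b, e, f} branch of the worked example given later in the description.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    cases: list                        # names of the cases covered by this node
    item: Optional[int] = None         # observation item tested here (None for a leaf)
    threshold: Optional[float] = None  # division threshold (None for a leaf)
    left: Optional["Node"] = None      # cases whose value exceeds the threshold
    right: Optional["Node"] = None     # all other cases

def explain(node: Node, case: str) -> list:
    """Return the tests met on the way from the root to the leaf holding `case`."""
    path = []
    while node.item is not None:
        if case in node.left.cases:
            path.append(f"item {node.item} > {node.threshold}")
            node = node.left
        else:
            path.append(f"item {node.item} <= {node.threshold}")
            node = node.right
    return path

# Hypothetical tree; the right subtree is left as a single leaf for brevity.
tree = Node(
    cases=list("abcdef"), item=2, threshold=2,
    left=Node(cases=["b", "e", "f"], item=1, threshold=2,
              left=Node(cases=["e", "f"]), right=Node(cases=["b"])),
    right=Node(cases=["a", "c", "d"]),
)
print(explain(tree, "e"))
```

Each entry of the returned list names one observation item and the side of its threshold on which the case fell, which is the kind of explanation the paragraph describes.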

[0031] This clustering method is described concretely below.

[0032] Embodiment 1

First, the structure of the input data required by the clustering method, and the terms and formulas used in its processing, are described.

[0033] (1) Case set

The data input to the clustering apparatus that executes the clustering method (see FIG. 1) represents a set of cases. Each case consists of the observed values of a plurality of observation items, and the observed values are numerical.

[0034] Table 1, shown in the Prior Art section, is an example of a case set. It contains six cases, named a through f, and two observation items, numbered 1 and 2.

[0035] In what follows, this table is input to the clustering apparatus (described later in detail with reference to FIG. 1) as the initial case set, and the clustering method is executed on it.

[0036] (2) Array-type memory

Two ways of referring to array elements are defined for use in this text.

[0037] The j-th element of array-type memory A is written A[j]. This refers to a single element. Because an array is a collection of elements, the terms "array" and "set" are used interchangeably below.

[0038] The elements of array A from the s-th through the e-th (s ≦ e) are written A[s, e]. This refers to a subset.

[0039] For array A, the notation

[0040]

[External character 1]

[0041] is used.

[0042] The number of elements of array A[s, e] is written |A[s, e]|.
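The 1-based, inclusive notation differs from the 0-based, half-open slices of most programming languages, so a small translation layer can help when following the examples. The helper names `elem` and `span` below are hypothetical and chosen only for this sketch.

```python
def elem(A, j):
    """A[j] in the text's notation: the j-th element, counting from 1."""
    return A[j - 1]

def span(A, s, e):
    """A[s, e] in the text's notation: elements s through e inclusive, counting from 1."""
    if not 1 <= s <= e <= len(A):
        raise IndexError("need 1 <= s <= e <= len(A)")
    return A[s - 1:e]

A = [10, 20, 30, 40, 50]
print(elem(A, 2))          # second element
print(span(A, 2, 4))       # second through fourth elements
print(len(span(A, 2, 4)))  # |A[2, 4]|
```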

[0043] (3) Observed-value set

Suppose observation item i of the case set has n observed values. This set of observed values is stored in an array (memory) of length n, denoted X⁽ⁱ⁾. The superscript (i) indicates that the array belongs to observation item i.

[0044] Following the convention of the preceding section, the full set of observed values of observation item i is X⁽ⁱ⁾[1, n].

[0045] For Table 1,

[0046]

[External character 2]

[0047] holds.

[0048] (4) Variation of an observed-value set

Suppose an observed-value set X⁽ⁱ⁾[1, n] is given. Its variation is calculated by the following equation.

[0049]

[Equation 1]
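The equation image itself is not reproduced in this text. Judging from the worked example in paragraph [0051], where each term is a squared deviation from the mean, Equation (1) is presumably the sum of squared deviations. The overbar notation for the mean is an assumption made here, not taken from the original.

```latex
\bar{X}^{(i)}[1,n] = \frac{1}{n}\sum_{j=1}^{n} X^{(i)}[j],
\qquad
t\bigl(X^{(i)}[1,n]\bigr) = \sum_{j=1}^{n}\Bigl(X^{(i)}[j] - \bar{X}^{(i)}[1,n]\Bigr)^{2}
```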

[0050] If the observed values are gathered closely around their mean, the variation is small. In other words, the variation is a measure of how cohesive the observed-value set is.

[0051] For example, the variations of observation items 1 and 2 in Table 1 are calculated as follows. For observation item 1, whose observed values have a mean of 2, t(X⁽¹⁾[1, 6]) = (1 - 2)² × 2 + (2 - 2)² × 2 + (3 - 2)² × 2 = 4.

[0052] For observation item 2, the mean of the observed values is 2, and t(X⁽²⁾[1, 6]) = (1 - 2)² × 3 + (3 - 2)² × 3 = 6. Thus the observed-value set of observation item 2 is the less cohesive of the two.
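The two calculations can be reproduced with a few lines of code. This is a minimal sketch that takes the variation to be the sum of squared deviations from the mean, as the arithmetic above implies; the function name `variation` is illustrative.

```python
def variation(values):
    """Sum of squared deviations of the values from their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

x1 = [1, 1, 2, 2, 3, 3]  # observed values of item 1, as listed in the description
x2 = [1, 1, 1, 3, 3, 3]  # observed values of item 2, as listed in the description

print(variation(x1))  # variation of observation item 1
print(variation(x2))  # variation of observation item 2
```

The results are 4 and 6, matching the hand calculation.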

[0053] (5) Bisection of an observed-value set

Suppose an observed-value set X⁽ⁱ⁾[1, n] is given. Bisecting this array at the p-th element means the following operation.

[0054] First, rearrange the elements of X⁽ⁱ⁾[1, n] from smallest to largest; that is, sort the array X⁽ⁱ⁾[1, n] in ascending order.

[0055] Next, divide X⁽ⁱ⁾[1, n] into the two subsets X⁽ⁱ⁾[1, p] and X⁽ⁱ⁾[p + 1, n] (1 ≦ p < n).

[0056] For example, bisecting X⁽¹⁾[1, 6] = {1, 1, 2, 2, 3, 3} at the second element yields X⁽¹⁾[1, 2] = {1, 1} and X⁽¹⁾[3, 6] = {2, 2, 3, 3}.

[0057] Likewise, bisecting X⁽²⁾[1, 6] = {1, 1, 1, 3, 3, 3} at the third element yields X⁽²⁾[1, 3] = {1, 1, 1} and X⁽²⁾[4, 6] = {3, 3, 3} (note that the elements have been sorted in ascending order).
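The bisection operation can be sketched as follows. The function name `bisect_at` is hypothetical, and the unsorted input orders are invented for the example, since only the sorted forms are given above.

```python
def bisect_at(values, p):
    """Sort ascending, then split into X[1, p] and X[p+1, n] (1-based, 1 <= p < n)."""
    if not 1 <= p < len(values):
        raise ValueError("p must satisfy 1 <= p < n")
    ordered = sorted(values)
    return ordered[:p], ordered[p:]

# Input order is arbitrary (assumed here); sorting happens inside the function.
print(bisect_at([2, 1, 3, 1, 2, 3], 2))  # item 1, split at the second element
print(bisect_at([3, 1, 3, 1, 1, 3], 3))  # item 2, split at the third element
```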

[0058] (6) Reduction in variation

Bisect the observed-value set X⁽ⁱ⁾[1, n] into X⁽ⁱ⁾[1, p] and X⁽ⁱ⁾[p + 1, n]. For this result, t(X⁽ⁱ⁾[1, n]) (the variation of the set before division), t(X⁽ⁱ⁾[1, p]) (the variation of subset 1), and t(X⁽ⁱ⁾[p + 1, n]) (the variation of subset 2) can be calculated.

[0059] From these, the reduction in variation b(X⁽ⁱ⁾[1, n], p) produced by bisecting at p can be calculated by the following equation.

[0060]

[Equation 2]

[0061] The larger the reduction in variation, the more cohesive the two subsets obtained by the division have become.
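A direct way to compute the reduction in variation is sketched below. It assumes that Equation (2), whose image is not reproduced here, is the variation before division minus the sum of the variations of the two subsets, which is how paragraph [0063] describes it. The function names are illustrative.

```python
def variation(values):
    """Sum of squared deviations of the values from their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def reduction(values, p):
    """b(X[1, n], p): variation before the split minus the variations of both parts."""
    ordered = sorted(values)
    left, right = ordered[:p], ordered[p:]
    return variation(ordered) - (variation(left) + variation(right))

x1 = [1, 1, 2, 2, 3, 3]
x2 = [1, 1, 1, 3, 3, 3]
print(reduction(x1, 2))  # item 1 split into {1, 1} and {2, 2, 3, 3}
print(reduction(x2, 3))  # item 2 split into {1, 1, 1} and {3, 3, 3}
```

Splitting item 2 at p = 3 removes all of its variation, because both resulting subsets are constant.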

[0062] (7) Method for calculating the reduction in variation

This section presents an efficient way to obtain the reduction in variation b(X⁽ⁱ⁾[1, n], p) (1 ≦ p < n) for every possible bisection of the observed-value set X⁽ⁱ⁾[1, n].

[0063] Equation (2) is the total variation of the set minus the variations of its subsets. According to statistics (see, for example, Haruo Yanai et al., "Multivariate Analysis Handbook", Gendai Sugakusha, p. 17), this equals the variation between the subsets. That is,

[0064]

[Equation 3]

[0065] holds.

[0066] In Equation (3),

[0067]

[External character 3]

[0068] holds, and this needs to be calculated only once at the outset. In addition,

[0069]

[External character 4]

[0070] can be calculated efficiently using the recurrence formulas shown below. Since the observation item i is clear from context, the superscript (i) is omitted.

[0071]

[Equation 4]
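Equations (3) and (4) are not reproduced in this text, so the following is only a sketch of the general idea rather than the patent's exact formulas. It relies on the standard identity that the total variation minus the within-subset variations equals the between-subset variation, p(m_L - m)² + (n - p)(m_R - m)², where m is the overall mean and m_L, m_R are the subset means. A running sum stands in for whatever recurrence the original uses to update the means. All names are assumptions.

```python
def all_reductions(values):
    """Return [b(X, 1), ..., b(X, n-1)] in one pass over the sorted values."""
    x = sorted(values)
    n = len(x)
    total = sum(x)
    mean_all = total / n          # computed once, before the loop
    reductions = []
    left_sum = 0.0
    for p in range(1, n):
        left_sum += x[p - 1]      # running sum updates the left mean incrementally
        mean_left = left_sum / p
        mean_right = (total - left_sum) / (n - p)
        between = (p * (mean_left - mean_all) ** 2
                   + (n - p) * (mean_right - mean_all) ** 2)
        reductions.append(between)
    return reductions

print(all_reductions([1, 1, 1, 3, 3, 3]))
```

After sorting, each split point costs constant work, whereas recomputing both subset variations from scratch for every p would cost time proportional to n per split.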

[0072] Next, the hardware configuration for executing the clustering method is described with reference to FIG. 1. In FIG. 1, reference numeral 2 denotes an input device, through which the initial case set is input. Reference numeral 4 denotes an arithmetic unit, which executes the clustering method according to the processing procedure shown in FIG. 2. Reference numeral 6 denotes a ROM, which stores the flowchart of FIG. 2 in the form of a program. Reference numeral 8 denotes a RAM, which is used for the array-type memory described above and as a work area. Reference numeral 10 denotes an output device for outputting the calculation results.

[0073] Next, the processing steps of the clustering method are described with reference to the flowchart shown in FIG. 2.

[0074] (1) Step S1 (input of the initial case set)

Here, the initial case set illustrated in Table 1 is input as the first input data.

[0075] (2) Step S2 (checking the termination condition)

A set of cases is input to this step. The first time, the initial case set, namely Table 1, is input via step S1. Step S2 comprises steps S2A and S2B, as follows.

[0076] If the number of cases in the initial case set exceeds a predetermined number (denoted K; here K = 2) (step S2A), processing proceeds to step S2B.

[0077] Next, if the observed values are not all identical across the cases of the initial case set (step S2B), processing proceeds to step S3.

[0078] As already noted, the initial case set (Table 1) contains six cases. Table 1 also contains observed values that differ between cases, so processing proceeds to step S3.

[0079] The input case set is passed to step S3.

[0080] The initial case set may contain K or fewer cases, or all observed values may be identical across its cases; in either situation the initial case set cannot be divided further, and all processing ends.
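The two checks of step S2 can be sketched as a single predicate. The image of Table 1 is not reproduced in this text, so the `TABLE1` values below are an assumed reconstruction, chosen to agree with the sorted value sets {1, 1, 2, 2, 3, 3} and {1, 1, 1, 3, 3, 3} and with the groups {b, e, f} and {a, c, d} given in the description. In particular, which of a, c, d carries the value 1 for item 1 is a guess. The name `can_split` is hypothetical.

```python
# Assumed reconstruction of Table 1: case name -> (item 1, item 2).
TABLE1 = {
    "a": (1, 1), "b": (1, 3), "c": (2, 1),
    "d": (2, 1), "e": (3, 3), "f": (3, 3),
}

def can_split(cases, k=2):
    """True if the case set passes both step S2A and step S2B."""
    if len(cases) <= k:                         # S2A: K or fewer cases, so stop
        return False
    rows = list(cases.values())
    return any(row != rows[0] for row in rows)  # S2B: at least one value differs

print(can_split(TABLE1))                                    # six cases, differing values
print(can_split({"e": (3, 3), "f": (3, 3)}))                # only two cases
print(can_split({"x": (1, 1), "y": (1, 1), "z": (1, 1)}))   # identical observations
```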

[0081] (3) Step S3 (selection of the optimal observation item)

Let n be the number of cases received; every observation item therefore has n observed values. The following processing is performed.

[0082] 1. Perform the following for each observation item i.

[0083] 2. Rearrange the elements of X⁽ⁱ⁾[1, n] in ascending order of value.

[0084] 3. Perform every possible bisection of X⁽ⁱ⁾[1, n] and record the resulting reduction in variation b(X⁽ⁱ⁾[1, n], p) (1 ≦ p < n).

[0085] From the recorded results, find the observation item i_b and the division point p_b that give the largest reduction in variation. These are called the optimal observation item and the optimal division point.

[0086] Further,

[0087]

[External character 5]

[0088] is calculated.

[0089] The following are passed to the next step, "case set division (step S4)":

1. the input case set
2. the optimal observation item
3. the division threshold

[0090] When Table 1 is input, this step operates as follows. The reductions in variation obtained for every bisection of observation items 1 and 2 are shown in Table 2 below.

[0091]

[Table 2]

[0092] Table 2 is read as follows. The reduction in variation for bisecting observation item 2 at its p = 3rd element is 6.0. This means the following.

[0093] The observed-value set X⁽²⁾[1, 6] of observation item 2 in Table 1 is sorted in ascending order to give {1, 1, 1, 3, 3, 3}. Bisecting this set at its third element yields {1, 1, 1} and {3, 3, 3}, from which the reduction in variation b(X⁽²⁾[1, 6], 3) = 6.0 is obtained.

[0094] Table 2 also shows that this bisection gives the largest reduction in variation. As a result, the six input cases, observation item number 2, and the division threshold 2 (the value calculated as (1 + 3) / 2) are passed to the next step, "case set division (step S4)".
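Step S3 can be sketched as an exhaustive search over items and split points. The `TABLE1` values are an assumed reconstruction (the table image is not reproduced in this text), consistent with the sorted value sets quoted above. The threshold is taken as the midpoint of the two values adjacent to the split, following the (1 + 3) / 2 example; keeping the first maximum found on ties is a choice made for this sketch. All function names are illustrative.

```python
# Assumed reconstruction of Table 1: case name -> (item 1, item 2).
TABLE1 = {
    "a": (1, 1), "b": (1, 3), "c": (2, 1),
    "d": (2, 1), "e": (3, 3), "f": (3, 3),
}

def variation(values):
    """Sum of squared deviations of the values from their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def select_best(cases):
    """Return (item_number, p, threshold, reduction) with the largest reduction."""
    n_items = len(next(iter(cases.values())))
    best = None
    for i in range(n_items):
        x = sorted(row[i] for row in cases.values())
        total = variation(x)
        for p in range(1, len(x)):               # every possible bisection
            b = total - variation(x[:p]) - variation(x[p:])
            if best is None or b > best[3]:      # first maximum wins on ties
                threshold = (x[p - 1] + x[p]) / 2
                best = (i + 1, p, threshold, b)
    return best

item, p, threshold, b = select_best(TABLE1)
print(item, p, threshold, b)
```

With these values the search selects observation item 2, division point 3, and threshold 2, in line with the result stated in paragraph [0094].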

[0095] (4) Step S4 (division of the case set)

This step receives the case set, the optimal observation item, and the division threshold from the preceding step, and performs the following processing.

[0096] Each input case whose value for the optimal observation item is greater than the division threshold is classified into group 1; every other case is classified into group 2. The two resulting case sets are passed to the next step.

[0097] The current case set is Table 1 itself. Dividing it into two groups using the optimal observation item 2 and its division threshold 2 gives Tables 3 and 4 below, which are passed to the next step.
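The division rule of step S4 is a simple threshold test. In the sketch below, `TABLE1` is again an assumed reconstruction of the table, whose image is not reproduced in this text, and `divide` is a hypothetical name.

```python
# Assumed reconstruction of Table 1: case name -> (item 1, item 2).
TABLE1 = {
    "a": (1, 1), "b": (1, 3), "c": (2, 1),
    "d": (2, 1), "e": (3, 3), "f": (3, 3),
}

def divide(cases, item, threshold):
    """Split cases into group 1 (value > threshold) and group 2 (all others)."""
    group1 = {name: row for name, row in cases.items() if row[item - 1] > threshold}
    group2 = {name: row for name, row in cases.items() if row[item - 1] <= threshold}
    return group1, group2

group1, group2 = divide(TABLE1, item=2, threshold=2)
print(sorted(group1))  # cases with item 2 above the threshold
print(sorted(group2))  # remaining cases
```

Group 1 comes out as {b, e, f} and group 2 as {a, c, d}, the same membership the description later attributes to Tables 3 and 4.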

[0098]

[Table 3]

[0099]

[Table 4]

[0100] (5) Step S5 (creation of the tree diagram)

The following processing is executed here.

[0101] 1. Create a node (corresponding to the case set before bisection) and generate two branches below it (step S5A). The case set of group 1 is attached under the left branch, and the case set of group 2 under the right branch.

[0102] 2. Using the case set belonging to group 1, recursively execute the processing from step S2 through step S5 (the portion enclosed by the dash-dot line in FIG. 2) (step S5B).

[0103] 3. Using the case set belonging to group 2, recursively execute the processing from step S2 through step S5 (the portion enclosed by the dash-dot line in FIG. 2) (step S5C).

[0104] In step S2, if the number of cases subject to a given recursive call is less than or equal to the predetermined number K (step S2A), or if all observed values are identical across the cases subject to that recursive call (step S2B), no further division is possible, so that recursive call ends and processing proceeds to the next step.

[0105] This is illustrated with a concrete example. When the initial case set of Table 1 is used, the data of Tables 3 and 4 are input to step S5, and tree diagram 1 shown in FIG. 3 is obtained.

[0106] Next, recursive processing starts at step S2 with the cases of Table 3, namely {b, e, f}, as group 1.

[0107] For this case set the optimal observation item is 1, the optimal division point is 1, and the division threshold is 2. Bisecting with these yields {e, f} and {b}, giving tree diagram 2 shown in FIG. 4.

[0108] Next, recursive processing (step S5B) starts again with {e, f} as group 1, but because the number of cases satisfies 2 ≦ 2, this recursive call ends (step S2A) and processing proceeds to the next step. That is, the next recursive call (step S5C) is made with {b} as group 2, but it too ends because the number of cases satisfies 1 ≦ 2. This completes the recursive processing (step S5B) that used {b, e, f} as group 1.

[0109] Next, recursive processing (step S5C) is performed with {a, c, d} as group 2. Proceeding in the same way yields tree diagram 3 shown in FIG. 5, at which point all processing stops.
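Steps S2 through S5 can be put together as one recursive function. This is a sketch under several assumptions, not the patent's implementation: `TABLE1` is a reconstruction of a table whose image is not reproduced here (the item 1 values of a, c, d in particular are guessed), the tree is represented with plain dictionaries and lists, and splits between equal values are skipped so that both groups are always non-empty.

```python
# Assumed reconstruction of Table 1: case name -> (item 1, item 2).
TABLE1 = {
    "a": (1, 1), "b": (1, 3), "c": (2, 1),
    "d": (2, 1), "e": (3, 3), "f": (3, 3),
}

def variation(values):
    """Sum of squared deviations of the values from their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def select_best(cases):
    """Step S3: return (item_number, threshold) with the largest reduction."""
    n_items = len(next(iter(cases.values())))
    best = None
    for i in range(n_items):
        x = sorted(row[i] for row in cases.values())
        total = variation(x)
        for p in range(1, len(x)):
            if x[p - 1] == x[p]:
                continue  # guard added in this sketch: no threshold separates equal values
            b = total - variation(x[:p]) - variation(x[p:])
            if best is None or b > best[0]:
                best = (b, i + 1, (x[p - 1] + x[p]) / 2)
    return best[1], best[2]

def build_tree(cases, k=2):
    """Steps S2 to S5: return a leaf (sorted case names) or an internal node (dict)."""
    rows = list(cases.values())
    if len(cases) <= k or all(r == rows[0] for r in rows):  # S2A / S2B
        return sorted(cases)
    item, threshold = select_best(cases)                    # S3
    group1 = {n: r for n, r in cases.items() if r[item - 1] > threshold}   # S4
    group2 = {n: r for n, r in cases.items() if r[item - 1] <= threshold}
    return {                                                # S5A, S5B, S5C
        "item": item,
        "threshold": threshold,
        "left": build_tree(group1, k),
        "right": build_tree(group2, k),
    }

tree = build_tree(TABLE1)
print(tree)
```

The left subtree reproduces the {e, f} and {b} split described in paragraph [0107]. The shape of the right subtree depends on the guessed values for a, c, and d, so it should not be read as a reproduction of FIG. 5.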

[0110] (Effects of Embodiment 1)

As described above, Embodiment 1 produces a tree diagram that classifies the cases, and does so without computing a distance matrix from the case set. As noted earlier, when the number of cases is N, the size of the distance matrix is N².

[0111] By contrast, the storage required by Embodiment 1 is N × M, where M is the number of observation items. Since M << N in general, Embodiment 1 needs less storage. Moreover, because the processing takes place within a smaller storage area, execution is also faster.

[0112] In addition, the characteristics of each cluster can be read explicitly from FIG. 5. For example, the cluster containing cases {e, f} is characterized by "the frequency of observation item 2 is 2 or more and the frequency of observation item 1 is 2 or more".

[0113] This concludes the description of Embodiment 1.

[0114] Other Embodiments

The processing procedure described above is effective in the following cases:

(a) identical values rarely occur among the observed values of an observation item;
(b) the number of observation items is smaller than the number of cases.

For observed values that do not satisfy these properties, however, the techniques described below are more efficient (identical values do occur in Table 1, so processing it with the techniques described below would be more efficient).

[0115] As such an example, a technique designed for the problem of clustering a document set is presented concretely below.

[0116] One way to cluster a document set by applying the present invention is as follows. The distinct words occurring in the entire document set are surveyed. These words are taken as the observation items, and their frequencies of occurrence as the observed values. For example, building such a table for a set of news articles gives Table 5 below.

[0117]

[Table 5]

[0118] Table 5 shows the following.

[0119] (A) The number of observation items (words) becomes enormous, and so does the table.

[0120] (B) The great majority of observed values (frequencies) are 0, followed by 1, then 2, and so on.

[0121] Moreover, for data of this nature, finding the i_b and p_b that give the largest b(X⁽ⁱ⁾[1, n], p) in the "selection of the optimal observation item (step S3)" described earlier raises the following problems.

[0122] Problem α. As the number of observation items i grows, the number of bisections to evaluate grows accordingly, and the processing time increases.

[0123] Problem β. For a given observation item i,

[0124]

[External character 6]

[0125] involves much wasted effort. For example, for the observed values of the observation item "airplane" in Table 5 (considering only the visible portion; the same applies below),

[0126]

[External character 7]

[0127] three bisections are required. However, only the bisection between 0 and 1 needs to be evaluated. A bisection between 0 and 0 cannot give the largest reduction in variation, so evaluating it is wasted effort.
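The idea of evaluating only the splits that fall between differing values can be sketched as follows. Table 5 is not reproduced in this text, so the frequency list `[0, 0, 0, 1]` is a hypothetical stand-in with three possible split points, and the name `candidate_points` is illustrative.

```python
def candidate_points(sorted_values):
    """Return the 1-based split points p at which X[p] and X[p+1] differ."""
    return [p for p in range(1, len(sorted_values))
            if sorted_values[p - 1] != sorted_values[p]]

# Hypothetical sorted frequencies of one word: three possible split points,
# but only the split between 0 and 1 is worth evaluating.
print(candidate_points([0, 0, 0, 1]))
print(candidate_points([1, 1, 2, 2, 3, 3]))
```

For word-frequency data dominated by zeros, this shrinks the number of evaluated bisections per observation item from n - 1 to at most the number of distinct values minus one.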

【０１２８】The following describes concrete techniques for solving the problem of the enormous table and the two problems (α and β) that arise in bisection.

【０１２９】Embodiment 2

As described above, a case table built from language data (for example, Table 5) becomes huge. However, because entries with frequency 0 make up most of it, the table can be compressed substantially.

【０１３０】That is, for each case (article), only the words with nonzero frequency and their frequencies are stored in the storage device (this is called the word frequency list table). In addition, a list of all words (observation items) is created. Table 6 below shows the word frequency list table corresponding to Table 5.

【０１３１】[0131]

【表６】 [Table 6]

【０１３２】この表６に対応する全単語リストは｛飛行
機，大統領，条約｝となる。観測項目の番号は、この単
語リストの順番に従うものとする。「大統領」は観測項
目２となる。A list of all words corresponding to Table 6 is {airplane, president, treaty}. Observation item numbers follow the order of this word list. "President" is the second observation item.

【０１３３】A bisection is performed with respect to one particular word. What it requires is the frequency of that word in each article, which can be obtained immediately from the word frequency list table and the all-word list. In other words, with the word frequency list table and the all-word list, processing equivalent to using Table 5 can be carried out without the enormous table.

【０１３４】(Effects of Embodiment 2) As described above, this embodiment requires neither the enormous table nor the processing needed to handle it.
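
The word frequency list table and the all-word list can be sketched as follows. This is a minimal illustration, not the patent's implementation: the articles are hypothetical data (they are not the contents of Table 5 or Table 6), and the names `build_word_frequency_lists` and `observations` are introduced here only for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical articles, each already split into words (not Table 5 or 6).
articles: List[List[str]] = [
    ["president", "treaty", "president"],
    ["airplane"],
    ["treaty"],
    [],
]


def build_word_frequency_lists(
    docs: List[List[str]],
) -> Tuple[List[str], List[Dict[str, int]]]:
    """Return the all-word list (first-seen order) and, per article,
    a dict that stores only the words with nonzero frequency."""
    all_words: List[str] = []
    seen = set()
    freq_lists: List[Dict[str, int]] = []
    for doc in docs:
        counts: Dict[str, int] = {}
        for word in doc:
            counts[word] = counts.get(word, 0) + 1
            if word not in seen:
                seen.add(word)
                all_words.append(word)
        freq_lists.append(counts)
    return all_words, freq_lists


def observations(
    item_number: int, all_words: List[str], freq_lists: List[Dict[str, int]]
) -> List[int]:
    """Observed values of one observation item (1-based number) over all
    articles; words absent from an article count as frequency 0."""
    word = all_words[item_number - 1]
    return [counts.get(word, 0) for counts in freq_lists]


all_words, freq_lists = build_word_frequency_lists(articles)
```

Only the nonzero entries are stored, yet `observations` recovers the full column of the dense table for any single word, which is all that a bisection needs.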

【０１３５】Embodiment 3

As the number of observation items i grows, the number of bisection evaluations in the "optimal observation item selection processing step S3" grows and the processing time becomes longer. This embodiment presents a technique that finds the optimal observation item and division point while evaluating bisections for only some of the items i. In other words, it avoids evaluating observation items needlessly. It exploits a property of the "variation".

【０１３６】First, a basic property of variation reduction is described. Consider an observation value set X⁽ⁱ⁾[1,n] consisting of n elements. The variation of this set is t(X⁽ⁱ⁾[1,n]). Dividing X⁽ⁱ⁾[1,n] at an arbitrary point p produces a variation reduction b(X⁽ⁱ⁾[1,n], p). These quantities are known to satisfy the relation shown in equation (7) (see, for example, the "Multivariate Analysis Handbook" cited above).

【０１３７】[0137]

【数５】 (Equation 5)

【０１３８】This inequality shows that dividing a set always yields a non-negative variation reduction, and that the variation reduction can be at most the variation of the original set.
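
The bounds can be checked numerically with a short sketch. It assumes that the variation t is the sum of squared deviations from the mean and that b is the variation of the whole set minus the variations of the two parts; both definitions come from earlier in the document and are restated here as assumptions. The sample values are hypothetical.

```python
def variation(xs):
    """Assumed definition of t: sum of squared deviations from the mean."""
    if not xs:
        return 0.0
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs)


def variation_reduction(sorted_xs, p):
    """Assumed definition of b(X[1,n], p): variation removed by splitting
    the sorted values into X[1,p] and X[p+1,n]."""
    return (
        variation(sorted_xs)
        - variation(sorted_xs[:p])
        - variation(sorted_xs[p:])
    )


xs = sorted([1, 3, 1, 3, 1, 3])  # hypothetical observed values
total = variation(xs)
reductions = [variation_reduction(xs, p) for p in range(1, len(xs))]

# Every split satisfies 0 <= b <= t (up to floating-point rounding).
assert all(-1e-9 <= b <= total + 1e-9 for b in reductions)
```

For these values the split at p = 3 separates the 1s from the 3s, so the reduction reaches the upper bound and equals the full variation.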

【０１３９】Suppose that the initial case set X⁽ⁱ⁾[1,N] (1≦i≦M, n≦N) is given. The observation items of this set are sorted in descending order of the variation t(X⁽ⁱ⁾[1,N]), so that the following relation holds.

【０１４０】[0140]

【数６】 (Equation 6)

【０１４１】In the "optimal observation item selection processing step S3", the optimal division point is computed for each observation item in turn, starting from the first item in the sorted order. The following rule can then be proved.

【０１４２】Suppose that bisecting the observation value set X⁽ⁱ⁾[1,n] yields

【０１４３】[0143]

【外８】 [Outside 8]

【０１４４】as the result. If

【０１４５】[0145]

【数７】 (Equation 7)

【０１４６】holds, then the bisections of observation items i+1 and later need not be evaluated. The optimal observation item and optimal division point obtained up to that point are the solution sought.

【０１４７】[0147]

【数８】 (Equation 8)

【０１４８】It suffices to show that the above holds.

【０１４９】式（７）が成立するので、Since equation (7) holds,

【０１５０】[0150]

【外９】 [Outside 9]

【０１５１】で[0151]

【０１５２】[0152]

【数９】 (Equation 9)

【０１５３】holds. Because the observation items are sorted in descending order of variation,

【０１５４】[0154]

【数１０】 (Equation 10)

【０１５５】holds. If

【０１５６】[0156]

【数１１】 [Equation 11]

【０１５７】holds, then

【０１５８】[0158]

【数１２】 (Equation 12)

【０１５９】follows, which proves the claim.
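
Because the equation images are not reproduced in this text, the argument can be summarized by the chain below. It is a reconstruction under stated assumptions, not the patent's own formulas: the symbol \hat{b}_i (the largest variation reduction found among items 1 to i) is introduced here, and the second step assumes that the variation of the current case set does not exceed that of the initial case set, which contains it.

```latex
% Assumed stopping condition after evaluating items 1..i:
%   \hat{b}_i \ge t\bigl(X^{(i+1)}[1,N]\bigr)
% Then, for every later item j \ge i+1 and every division point p:
\begin{align*}
b\bigl(X^{(j)}[1,n],\,p\bigr)
  &\le t\bigl(X^{(j)}[1,n]\bigr)   && \text{by inequality (7)} \\
  &\le t\bigl(X^{(j)}[1,N]\bigr)   && \text{since } n \le N \\
  &\le t\bigl(X^{(i+1)}[1,N]\bigr) && \text{items sorted by descending variation} \\
  &\le \hat{b}_i                   && \text{by the stopping condition.}
\end{align*}
```

Under these assumptions no later item can give a larger reduction than the best one already found.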

【０１６０】This is illustrated with a concrete example. From the calculation in "Variation of an observation value set" above, the variation of observation item 1 in Table 1 is 4 and that of observation item 2 is 6. Sorting the observation items in descending order of variation gives the order 2, 1.

【０１６１】Next, the optimal bisection of each observation item is computed. First, the optimal bisection of observation item 2 and its variation reduction are obtained. Table 2 shows that the reduction is 6.0 at p=3. This variation reduction of 6.0 is larger than 4, the variation of the next observation item, item 1. That is, the condition of the above rule is satisfied. The bisection of observation item 1 therefore need not be evaluated, which saves a substantial amount of computation.

【０１６２】The operation of the "optimal observation item selection processing step S3" using the above technique is as follows. As a precondition, the variation of each observation item in the initial case set is calculated and the observation items are sorted in descending order of variation.

【０１６３】１．各観測項目ｉを対象に以下の処理を行
なう。1. The following processing is performed for each observation item i.

【０１６４】2. Sort the elements of X⁽ⁱ⁾[1,n] in ascending order of value.

【０１６５】3. Perform all possible bisections of X⁽ⁱ⁾[1,n], and

【０１６６】[0166]

【外１０】 [Outside 10]

【０１６７】is recorded.

【０１６８】4. If

【０１６９】[0169]

【外１１】 [Outside 11]

【０１７０】then terminate the processing.

【０１７１】From the results obtained, the observation item i_b that gives the maximum variation reduction and its optimal division point p_b are determined.

【０１７２】さらに、Further,

【０１７３】[0173]

【外１２】 [Outside 12]

【０１７４】is calculated.

【０１７５】次の処理ステップへ渡すデータは、先に述
べた「最適観測項目の処理ステップＳ３」の場合と同じ
である。The data to be passed to the next processing step is the same as in the case of the above-mentioned "processing step S3 for the optimum observation item".
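
Steps 1 to 4 above can be sketched as follows. This is an illustration under assumptions, not the patent's implementation: the stopping test (best reduction so far is at least the initial variation of the next item) is inferred from the worked example, since the condition itself is an image not reproduced here; the variation is taken to be the sum of squared deviations; item indices are 0-based; and the data in the usage lines are hypothetical values chosen to have variations 4 and 6.

```python
def variation(xs):
    """Assumed definition: sum of squared deviations from the mean."""
    if not xs:
        return 0.0
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs)


def best_split(xs):
    """Return (max reduction, p) over all bisections of the sorted values."""
    s = sorted(xs)
    total = variation(s)
    best_b, best_p = 0.0, None
    for p in range(1, len(s)):
        b = total - variation(s[:p]) - variation(s[p:])
        if b > best_b:
            best_b, best_p = b, p
    return best_b, best_p


def select_item(columns, initial_variations):
    """columns[i]: observed values of item i in the current case set.
    initial_variations[i]: variation of item i in the initial case set.
    Returns (item index, division point, reduction, items evaluated)."""
    order = sorted(range(len(columns)), key=lambda i: -initial_variations[i])
    best_item, best_p, best_b = None, None, 0.0
    evaluated = 0
    for rank, i in enumerate(order):
        b, p = best_split(columns[i])
        evaluated += 1
        if b > best_b:
            best_item, best_p, best_b = i, p, b
        # Assumed stopping rule: no remaining item can beat best_b.
        if rank + 1 < len(order) and best_b >= initial_variations[order[rank + 1]]:
            break
    return best_item, best_p, best_b, evaluated


# Hypothetical data: index 0 has variation 4, index 1 has variation 6.
columns = [[1, 1, 2, 2, 3, 3], [1, 3, 1, 3, 1, 3]]
result = select_item(columns, [variation(c) for c in columns])
```

With these values the item at index 1 is evaluated first, its best reduction of 6.0 already exceeds the other item's variation of 4, and the search stops after a single item.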

【０１７６】Embodiment 4

This embodiment presents a technique for solving problem β of bisection described earlier.

【０１７７】As noted earlier, bisections for an observation item need only be evaluated between different frequencies. For this purpose, it is convenient to store the observation value set grouped by frequency rather than directly in a flat array. A way of realizing this is shown below.

【０１７８】Suppose that an observation item i has k⁽ⁱ⁾ distinct frequencies. These are stored, in ascending order, in the array f⁽ⁱ⁾[j] (1≦j≦k⁽ⁱ⁾). For example, in Table 5 the word "president" (i=2) has three distinct frequencies: 0, 1, and 2. Hence f⁽²⁾[1]=0, f⁽²⁾[2]=1, and f⁽²⁾[3]=2.

【０１７９】Next, the observed values having the j-th frequency are enumerated and stored in the array element E⁽ⁱ⁾[j]. For example, for the word "president" (i=2), enumerating the elements that have the first frequency, 0, gives E⁽²⁾[1]={0,0}. Similarly, E⁽²⁾[2]={1} and E⁽²⁾[3]={2}. Since it is clear from here on that a single word i is being considered, the superscript (i) is omitted.

【０１８０】Because each E[j] holds an enumeration of observed values, E is an array of arrays. The way its elements are referenced therefore differs from the "array-type memory" described earlier. The differences are as follows.

【０１８１】（イ）｜Ｅ［ｊ］｜は配列Ｅ［ｊ］が格納
している集合の要素数を示す。(A) | E [j] | indicates the number of elements in the set stored in the array E [j].

【０１８２】(B) E[s,e] denotes the union of the observation value sets from array E[s] through E[e].

【０１８３】（ハ）Ｅの要素の平均は下記の式に従って
計算する。(C) The average of the elements of E is calculated according to the following equation.

【０１８４】[0184]

【数１３】 (Equation 13)
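
The arrays f and E and the three reference operations can be sketched as follows. This is a minimal illustration with assumptions: indices are 0-based here (the text uses 1-based indices), the order of the sample values is made up (only the multiset {0, 0, 1, 2} follows from the "president" example), and `mean` is one plausible way to compute the average, since equation (13) is not reproduced in this text.

```python
from collections import Counter


def group_by_frequency(values):
    """Return (f, E): f lists the distinct values in ascending order and
    E[j] lists the observed values equal to f[j]."""
    counts = Counter(values)
    f = sorted(counts)
    E = [[v] * counts[v] for v in f]
    return f, E


def size(E, j):
    """|E[j]|: number of observed values stored in E[j]."""
    return len(E[j])


def union(E, s, e):
    """E[s,e]: all observed values in E[s] through E[e], inclusive."""
    return [v for group in E[s:e + 1] for v in group]


def mean(E, s, e):
    """Mean of E[s,e], computed from group sizes and group values
    (an assumed formula, not a transcription of equation (13))."""
    groups = E[s:e + 1]
    n = sum(len(g) for g in groups)
    total = sum(len(g) * g[0] for g in groups)
    return total / n


# Observed values of "president"; the order is hypothetical.
f, E = group_by_frequency([0, 1, 0, 2])
```

Each group holds identical values, so the mean needs only one multiplication per distinct frequency rather than one addition per observed value.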

【０１８５】Next, bisection using frequencies is described. As before, let X[1,n] be the set of all observed values of observation item i. This coincides with E[1,k]. Bisecting the observation value set E[1,k] at the p-th frequency means dividing it into E[1,p] and E[p+1,k] (1≦p<k).

【０１８６】また、二分割によって発生する変動減少は
次式のように計算できる。Further, the fluctuation reduction caused by the two divisions can be calculated as follows.

【０１８７】[0187]

【数１４】 [Equation 14]

【０１８８】With the above data and formulas, bisection between frequencies becomes possible. In this case, the "optimal observation item selection processing step S3" proceeds as follows.

【０１８９】１．各観測項目ｉを対象に以下の処理を行
なう。1. The following processing is performed for each observation item i.

【０１９０】2. Store the data X⁽ⁱ⁾[1,n] in the array memory E⁽ⁱ⁾[1,k⁽ⁱ⁾].

【０１９１】3. Perform all possible bisections of E⁽ⁱ⁾[1,k⁽ⁱ⁾] and record the variation reduction b(E⁽ⁱ⁾[1,k⁽ⁱ⁾], p) for each (1≦p<k⁽ⁱ⁾).

【０１９２】Find the i_b and p_b that give the maximum variation reduction. In addition,

【０１９３】[0193]

【外１３】 [Outside 13]

【０１９４】is obtained. The data passed to the next processing step is the same as in step S3 described above.

【０１９５】Because this technique evaluates bisections only between different frequencies, the number of evaluations drops sharply when there are few distinct frequencies. Table 1, shown at the beginning, also has few distinct frequencies. For example, observation item 2 in that table needs only a single evaluation, between 1 and 3, whereas the earlier technique needs five.

【０１９６】Note that

【０１９７】[0197]

【外１４】 [Outside 14]

【０１９８】can be computed efficiently with the following recurrence formula.

【０１９９】[0199]

【数１５】 (Equation 15)
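
Bisection between frequencies with an incremental update can be sketched as follows. This is an illustration under assumptions rather than a transcription of equations (14) and (15), which are not reproduced in this text: the reduction is computed as the between-group sum of squares, which equals the variation of the whole set minus the variations of the two parts, and the running count and sum stand in for the recurrence. The function name and the sample values are hypothetical.

```python
from collections import Counter


def best_frequency_split(values):
    """Evaluate only the splits that fall between distinct values.
    Returns (best reduction, largest value in the lower group,
    number of splits evaluated)."""
    if not values:
        return 0.0, None, 0
    counts = Counter(values)
    f = sorted(counts)                 # distinct frequencies, ascending
    n = len(values)
    total_sum = sum(values)
    grand_mean = total_sum / n
    left_n, left_sum = 0, 0
    best_b, best_value, evaluations = 0.0, None, 0
    for value in f[:-1]:               # split after each value but the last
        left_n += counts[value]        # running update, no re-scan
        left_sum += counts[value] * value
        right_n = n - left_n
        left_mean = left_sum / left_n
        right_mean = (total_sum - left_sum) / right_n
        # Between-group sum of squares for this split.
        b = (left_n * (left_mean - grand_mean) ** 2
             + right_n * (right_mean - grand_mean) ** 2)
        evaluations += 1
        if b > best_b:
            best_b, best_value = b, value
    return best_b, best_value, evaluations


# Hypothetical values with two distinct frequencies, 1 and 3.
two_level = best_frequency_split([1, 3, 1, 3, 1, 3])
# The "president" frequencies {0, 0, 1, 2}; the order is hypothetical.
president = best_frequency_split([0, 1, 0, 2])
```

Six values with two distinct frequencies need one evaluation instead of five, and four values with three distinct frequencies need two instead of three.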

【０２００】(Effects of Embodiments 3 and 4) Sorting the observation items as in Embodiment 3 is effective when the variations of the observation items in the initial case set differ widely.

【０２０１】Bisecting between frequencies as in Embodiment 4 is effective when the same frequency occurs many times.

【０２０２】Language data has both of these properties, so applying both techniques can greatly improve processing efficiency.

【０２０３】[0203]

[Effects of the Invention] As described above, the present invention provides a clustering method capable of hierarchically classifying a large number of cases without relying on the conventionally used techniques. As a result, given a case set, the present invention can cluster it at high speed in the form of a tree diagram.

【０２０４】Specifically, conventional clustering techniques require the case set to be converted first into a matrix of distances between cases, whereas the present invention does not. Clustering can therefore be performed with less memory.

【０２０５】Furthermore, depending on the nature of the frequencies in the case set, the efficiency-improving techniques specific to the present invention can be applied. The technique is particularly effective for clustering data with a large number of observation items and sparse data in which most observed values are zero. Language data, which is important in applications, is a typical example.

[Brief description of the drawings]

【図１】本発明を実施するためのハードウェア構成を示
すブロック図である。FIG. 1 is a block diagram showing a hardware configuration for implementing the present invention.

【図２】本発明の実施の形態の一例を示すフローチャー
トである。FIG. 2 is a flowchart illustrating an example of an embodiment of the present invention.

【図３】図２による処理の結果として得られた樹形図の
一例を示す説明図である。FIG. 3 is an explanatory diagram showing an example of a tree diagram obtained as a result of the processing according to FIG. 2;

【図４】図２による処理の結果として得られた樹形図の
一例を示す説明図である。FIG. 4 is an explanatory diagram showing an example of a tree diagram obtained as a result of the processing according to FIG. 2;

【図５】図２による処理の結果として得られた樹形図の
一例を示す説明図である。FIG. 5 is an explanatory diagram showing an example of a tree diagram obtained as a result of the processing according to FIG. 2;

Claims

1. A clustering method comprising: inputting, as an initial case set, a case set consisting of observed values of a plurality of observation items; and successively dividing the case set into two so that the reduction in variation within the case sets is maximized.

2. The clustering method according to claim 1, wherein the case set is formed by compressing a storage area necessary for storing the observation value.

3. The clustering method according to claim 1, wherein the observation items are arranged in descending order of a variation value in the initial case set.

4. The clustering method according to claim 1, wherein the division into two is performed between frequencies of the observed values.