JPH0259980A - Clustering processing method

Info

Publication number: JPH0259980A
Application number: JP63212192A
Authority: JP
Inventors: Masaharu Kurakake; 正治倉掛
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1988-08-26
Filing date: 1988-08-26
Publication date: 1990-02-28
Anticipated expiration: 2010-08-16
Also published as: JPH0776986B2

Abstract

PURPOSE: To determine multiple templates that recognize, with high accuracy, test samples different from the training samples, by performing clustering based on a prediction error. CONSTITUTION: Each sample is made into one cluster. The distances between all clusters are obtained, all cluster pairs are ranked in ascending order of distance, and the pair with the shortest distance is taken as C1 and C2. The prediction error is obtained both for the case in which C1 and C2 are merged and for the case in which they are not. When the prediction error is smaller in the merged case, C1 and C2 are merged into one cluster and processing returns to the distance computation. When the prediction error is smaller in the unmerged case, and a next-ranked cluster pair exists whose distance is shorter than a certain fixed value, that next-ranked pair is taken as C1 and C2 and processing returns to the prediction-error computation. Otherwise, the convex polyhedron formed by the samples of each cluster is obtained, the samples that are its end points are averaged, and the average is used as the template. Clustering can thus be performed stably.

Description

DETAILED DESCRIPTION OF THE INVENTION

(1) Technical field to which the invention pertains

The present invention relates to a clustering processing method used, for example, in (i) methods for unsupervised classification, (ii) methods for determining subclasses in character and figure recognition systems, and (iii) methods for creating multiple templates when constructing a recognition dictionary.

(2) Prior art

Conventional clustering methods assume that the training samples correctly reflect the properties of the population, and perform clustering based only on the distances between samples and the distances between clusters. Consequently, when the number of training samples is small, the clustering result is strongly affected by biased samples among them. For example, when multiple templates are determined by applying a clustering method during construction of a recognition dictionary, the choice of templates is strongly influenced by the bias of the training samples and is not effective when recognizing test samples.

(3) Object of the invention

An object of the present invention is to provide a clustering processing method that reduces the influence of bias in the training samples and performs clustering stably even when the number of training samples is small.

(4) Structure of the invention

The following description takes as an example the case of creating multiple templates when constructing a recognition dictionary for a character and figure recognition system.

(4-1) Features of the invention and differences from the prior art

FIG. 3 is an example of a processing block diagram of a multiple-template determination method to which a conventional clustering method is applied.

The training samples are represented by feature vectors.

Before processing begins, the number of templates is decided (call it K). Let N be the number of samples. Each cluster consists of T = N/K samples.

In process 11, one arbitrary sample Si is selected from the training samples.

In process 12, the T-1 training samples nearest to Si are selected, in order of increasing distance.

In process 13, the samples selected in process 12 and Si, T samples in total, are removed from the training samples.

In process 14, the average of these T samples is taken as a template.

In process 15, if any samples remain, processing returns to process 11; otherwise processing ends.

The distance used in process 12 may be the Euclidean distance, the city-block distance, or the like.
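The conventional procedure of processes 11 to 15 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names `euclidean` and `conventional_templates` are hypothetical, the "arbitrary" seed of process 11 is simply the first remaining sample, and K is assumed to divide N exactly.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def conventional_templates(samples, k, dist=euclidean):
    """Sketch of processes 11-15: peel off a seed plus its T-1 nearest samples."""
    if k < 1 or len(samples) % k != 0:
        raise ValueError("this sketch assumes K >= 1 and K divides N")
    remaining = list(samples)
    t = len(remaining) // k                      # T = N / K samples per cluster
    templates = []
    while remaining:                             # process 15: loop until empty
        seed = remaining[0]                      # process 11 (assumption: first sample)
        rest = sorted(remaining[1:], key=lambda s: dist(s, seed))
        group = [seed] + rest[:t - 1]            # process 12: T-1 nearest samples
        remaining = rest[t - 1:]                 # process 13: remove the T samples
        dim = len(seed)
        templates.append(                        # process 14: mean of the T samples
            tuple(sum(s[d] for s in group) / len(group) for d in range(dim))
        )
    return templates

samples = [(0, 0), (0, 2), (10, 0), (10, 2)]
templates = conventional_templates(samples, k=2)
```

Because the grouping depends only on the seed and on raw distances, a different seed or a slightly different training set can produce quite different templates, which is the weakness the text goes on to describe.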

As described above, the conventional processing method does not take the influence of bias in the training samples into account, so the templates that are determined are likely to change significantly if a different set of training samples is used.

This means that the multiple templates determined from the training samples are of little benefit when recognizing test samples that differ from the training samples. The present invention provides a technique for reducing the influence of training-sample bias when determining templates, and thereby makes it possible to determine multiple templates that recognize test samples with high accuracy.

(4-2) Embodiment

FIG. 1 is an example of a processing block diagram of the present invention.

The training samples are assumed to be represented by feature vectors.

In process 21, each sample is made into a cluster of its own.

In process 22, the distances between all clusters are obtained and the cluster pairs are ranked in ascending order of distance. The cluster pair with the smallest distance is taken as C1 and C2.

In process 23, the prediction error (defined later) is obtained both for the case in which C1 and C2 are merged and for the case in which they are not.

In process 24, if the prediction error is smaller when C1 and C2 are merged, C1 and C2 are merged into one cluster and processing returns to process 22. If the prediction error is smaller when C1 and C2 are not merged, processing proceeds to process 25.

In process 25, if a next-ranked cluster pair exists and the distance between its clusters is smaller than a certain fixed value, that next-ranked pair is taken as C1 and C2 and processing returns to process 23. Otherwise, processing proceeds to process 26.

In process 26, for each cluster, the convex polyhedron formed by its samples is obtained, the samples that are the end points (vertices) of that convex polyhedron are averaged, and the average is taken as the template. Processing then ends.
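The control flow of processes 21 to 25 can be sketched as follows. This is a minimal sketch, not the patent's implementation: the parameter names (`pair_distance`, `error_merged`, `error_split`, `max_distance`) are hypothetical, and the callables used in the demonstration are simple one-dimensional stand-ins rather than the distance and prediction-error definitions given in the text.

```python
def cluster_by_prediction_error(samples, pair_distance, error_merged,
                                error_split, max_distance):
    """Sketch of processes 21-25; returns the clusters handed to process 26."""
    clusters = [[s] for s in samples]            # process 21: one cluster per sample
    while True:
        # process 22: rank every cluster pair by ascending distance
        pairs = sorted(
            (pair_distance(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        merged = False
        for rank, (d, a, b) in enumerate(pairs):
            # process 25: a next-ranked pair is tried only below the fixed value
            if rank > 0 and d >= max_distance:
                break
            # processes 23-24: merge only if merging gives the smaller error
            if error_merged(clusters[a], clusters[b]) < error_split(clusters[a], clusters[b]):
                clusters[a] = clusters[a] + clusters[b]
                del clusters[b]
                merged = True
                break                            # back to process 22
        if not merged:
            return clusters                      # continue with process 26

# Toy stand-ins (assumptions for illustration only), on 1-D samples.
def farthest(c1, c2):
    return max(abs(x - y) for x in c1 for y in c2)

def spread_if_merged(c1, c2):
    return max(c1 + c2) - min(c1 + c2)

def fixed_split_error(c1, c2):
    return 2.0

result = cluster_by_prediction_error(
    [0.0, 0.5, 1.0, 10.0, 10.4], farthest, spread_if_merged,
    fixed_split_error, max_distance=3.0,
)
```

With these stand-ins the three samples near 0 and the two samples near 10 end up in separate clusters, because merging the two groups would give the larger error.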

The distance between clusters is defined as follows.

The distance from C1 to C2 is the largest of the distances from C1 to the samples belonging to C2.

Likewise, the distance from C2 to C1 is the largest of the distances from C2 to the samples belonging to C1. The larger of the distance from C1 to C2 and the distance from C2 to C1 is taken as the distance between clusters C1 and C2.
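The inter-cluster distance defined above can be written directly in terms of a cluster-to-sample distance. The sketch below takes that distance as a parameter; the names `directed_distance`, `cluster_distance`, and `centroid_distance` are hypothetical, and the centroid-based stand-in used in the demonstration is an assumption for illustration, not the cluster-to-sample distance the patent defines.

```python
def directed_distance(source, target, cluster_to_sample):
    """Largest distance from cluster `source` to any sample of `target`."""
    return max(cluster_to_sample(source, x) for x in target)

def cluster_distance(c1, c2, cluster_to_sample):
    """Larger of the two directed distances between c1 and c2."""
    return max(
        directed_distance(c1, c2, cluster_to_sample),
        directed_distance(c2, c1, cluster_to_sample),
    )

# Stand-in cluster-to-sample distance on 1-D samples (assumption):
# the distance from the cluster mean to the sample.
def centroid_distance(cluster, x):
    return abs(sum(cluster) / len(cluster) - x)

d = cluster_distance([0.0, 2.0], [5.0, 9.0], centroid_distance)
```

Taking the maximum in both directions makes the measure symmetric even though the underlying cluster-to-sample distance is not.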

The distance between cluster C1 and a sample X is defined as follows; FIG. 2 illustrates the definition. The end points (vertices) of the convex polyhedron T formed by the samples Wi in the cluster are found, and their average is denoted Y. Let Z be the projection of sample X onto the hyperplane containing the samples of cluster C1. Let D1 be the distance from X to Z, D2 the distance from Y to Z, and D3 the distance from Y to the point where the straight line connecting Y and Z intersects the boundary of the convex polyhedron. The distance between cluster C1 and sample X is then defined as D1 + D2/D3. Any distance that satisfies the distance axioms, such as the Euclidean distance or the city-block distance, may be used to compute D1, D2, and D3.
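As a worked illustration of D1 + D2/D3, the sketch below handles only the simplest non-trivial case: a cluster made of exactly two distinct samples, using the Euclidean distance. Several points are assumptions rather than statements from the text: the "hyperplane containing the samples" is read as the straight line through the two samples, the convex polyhedron degenerates to the segment between them (so D3 is half its length), and the function name `segment_cluster_distance` is hypothetical.

```python
import math

def segment_cluster_distance(a, b, x):
    """D1 + D2/D3 for a cluster consisting of two distinct samples a and b."""
    ab = [q - p for p, q in zip(a, b)]
    length_sq = sum(c * c for c in ab)
    if length_sq == 0:
        raise ValueError("the two samples must be distinct")
    # Y: average of the end points of the segment
    y = [(p + q) / 2 for p, q in zip(a, b)]
    # Z: orthogonal projection of X onto the line through a and b
    t = sum((xi - p) * c for xi, p, c in zip(x, a, ab)) / length_sq
    z = [p + t * c for p, c in zip(a, ab)]
    d1 = math.dist(x, z)                 # how far X lies off the line
    d2 = math.dist(y, z)                 # how far Z lies from the centre Y
    d3 = math.sqrt(length_sq) / 2        # from Y to the boundary (an end point)
    return d1 + d2 / d3

print(segment_cluster_distance((0, 0), (4, 0), (6, 3)))   # D1 = 3, D2 = 4, D3 = 2
```

In this reading the second term is scale-free: it is 0 at the centre Y, 1 at either end point, and greater than 1 beyond the end points, while D1 adds the off-line component in the units of the feature space.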

The prediction error is defined as follows. Let the samples belonging to cluster C be Xi (i = 1, ..., n), and let the test samples generated using random numbers be Zj (j = 1, ..., m). Let Dj be the distance from cluster C to Zj, and let Dj(-k) be the distance to Zj from the cluster formed by the samples of C with Xk (1 <= k <= n) removed. The prediction error of cluster C is computed from the difference between Dj and Dj(-k), with each sample of cluster C removed once in turn and over all test samples, and is defined as follows.

Σ_{j=1..m} Σ_{k=1..n} (Dj − Dj(−k))^2

The prediction error when clusters C1 and C2 are merged is the prediction error computed by treating C1 and C2 together as a single cluster. The prediction error when C1 and C2 are not merged is computed by taking the distance to each test sample as the distance from whichever of C1 and C2 is closer. As this definition shows, the prediction error is small when the clustering result does not change even if the number of samples is reduced by one.
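The leave-one-out computation described above can be sketched as follows. This is an illustrative sketch under several assumptions: the garbled mark after the summand in the source is read as a square; in the unmerged case each sample is removed from its own cluster while the other cluster is left intact; the function names are hypothetical; the cluster-to-sample distance is passed in as a parameter, with a centroid-based stand-in used in the demonstration; and fixed test samples replace the randomly generated ones for reproducibility.

```python
def distance_to_nearest(clusters, z, cluster_to_sample):
    """Distance from test sample z to the closest non-empty cluster."""
    return min(cluster_to_sample(c, z) for c in clusters if c)

def prediction_error(clusters, tests, cluster_to_sample):
    """Sum over test samples and removed samples of (Dj - Dj(-k))**2.

    Pass [c1 + c2] for the merged case and [c1, c2] for the unmerged case.
    Every removal must leave at least one non-empty cluster.
    """
    total = 0.0
    for z in tests:
        full = distance_to_nearest(clusters, z, cluster_to_sample)      # Dj
        for ci, cluster in enumerate(clusters):
            for k in range(len(cluster)):
                # rebuild the clusters with sample k of cluster ci removed
                reduced = [
                    cluster[:k] + cluster[k + 1:] if i == ci else c
                    for i, c in enumerate(clusters)
                ]
                left_out = distance_to_nearest(reduced, z, cluster_to_sample)  # Dj(-k)
                total += (full - left_out) ** 2
    return total

# Stand-in cluster-to-sample distance on 1-D samples (assumption).
def centroid_distance(cluster, x):
    return abs(sum(cluster) / len(cluster) - x)

c1, c2 = [0.0, 2.0], [10.0, 12.0]
tests = [1.0, 11.0]                       # fixed here; random in the text
split_error = prediction_error([c1, c2], tests, centroid_distance)
merged_error = prediction_error([c1 + c2], tests, centroid_distance)
```

For these two well-separated groups the unmerged error is the smaller one, so the merge test of process 24 would keep C1 and C2 apart.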

(5) Effects of the invention

As described above, according to the present invention, clustering is performed on the basis of the prediction error, so the resulting templates hardly change even if the training samples change somewhat. In other words, it becomes possible to determine multiple templates that are not affected by a small number of biased training samples and that recognize test samples different from the training samples with high accuracy.

[Brief explanation of the drawing]

FIG. 1 is a processing block diagram of the present invention, FIG. 2 is an explanatory diagram illustrating the definition of the distance from a cluster to a sample point, and FIG. 3 is an example of a processing block diagram of a multiple-template determination method to which a conventional clustering method is applied.

Patent applicant: Nippon Telegraph and Telephone Corporation

Claims

[Claims] A clustering processing method for clustering a plurality of samples represented by N-dimensional vectors, characterized in that, in the course of merging clusters starting from an initial state in which each sample forms one cluster, the method comprises: a step of calculating the distance from a cluster to an arbitrary point in the N-dimensional vector space; a step of calculating the distance between clusters; a step of calculating, as a prediction error, the change in the distance to test samples when some of the samples constituting a cluster are removed; a step of determining, on the basis of the prediction error, whether to merge cluster pairs, taken in ascending order of the inter-cluster distance calculated in the step of calculating the distance between clusters; and a step of determining the end of processing on the basis of the inter-cluster distance and the prediction error.