JPH0259980A - Treatment of clustering - Google Patents

Treatment of clustering

Info

Publication number
JPH0259980A
Authority
JP
Japan
Prior art keywords
cluster
distance
samples
sample
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP63212192A
Other languages
Japanese (ja)
Other versions
JPH0776986B2 (en)
Inventor
Masaharu Kurakake
正治 倉掛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp
Priority to JP63212192A (patent JPH0776986B2)
Publication of JPH0259980A
Publication of JPH0776986B2
Anticipated expiration
Expired - Fee Related

Landscapes

  • Image Analysis (AREA)

Abstract

PURPOSE: To determine multiple templates that recognize, with high accuracy, test samples that differ from the training samples, by performing clustering on the basis of a prediction error. CONSTITUTION: Each sample is first made a cluster of its own. The distances between all clusters are obtained, all cluster pairs are ranked in ascending order of distance, and the pair with the shortest distance is taken as C1, C2. Prediction errors are obtained for the case where C1 and C2 are merged and for the case where they are not. When the prediction error is smaller in the merged case, C1 and C2 are merged into one cluster and processing returns to the distance calculation. When the prediction error is smaller in the unmerged case, and a next-ranked cluster pair exists whose distance is shorter than a certain fixed value, C1 and C2 are left as they are and processing returns to the prediction-error step with the next-ranked pair. Otherwise, the convex polyhedron formed by the samples of each cluster is obtained, the samples that are its vertices are averaged, and the average is used as the template. Clustering can thus be performed stably.

Description

DETAILED DESCRIPTION OF THE INVENTION
(1) Technical field to which the invention pertains
The present invention relates to a clustering processing method used in, for example, (i) methods for unsupervised classification, (ii) methods for determining subclasses in character and figure recognition systems and the like, and (iii) methods for providing multiple templates when constructing a recognition dictionary.

(2) Prior art
Conventional clustering techniques perform clustering based only on the distances between samples and the distances between clusters, on the assumption that the training samples correctly reflect the properties of the population. When the number of training samples is small, the clustering result is therefore strongly affected by biased samples among the training samples. For example, when a clustering technique is applied to determine multiple templates while constructing a recognition dictionary, the choice of templates is strongly affected by the bias of the training samples, and the templates are not effective when recognizing test samples.

(3) Object of the Invention
An object of the present invention is to provide a clustering processing method that reduces the effect of training-sample bias and performs clustering stably even when the number of training samples is small.

(4) Structure of the Invention
The following description takes as an example the case of providing multiple templates when constructing a recognition dictionary for a character and figure recognition system.

(4-1) Features of the invention and differences from the prior art
Fig. 3 is an example of a processing block diagram of a multiple-template determination method that applies a conventional clustering technique.

The training samples are represented as feature vectors.

Before processing starts, the number of templates is decided (call it K). Let N be the number of samples. Each cluster consists of T = N/K samples.

In process 11, an arbitrary sample Si is selected from the training samples.

In process 12, the T-1 samples closest to Si are selected from the training samples.

In process 13, the samples selected in process 12 and Si, T samples in total, are removed from the training samples.

In process 14, the average of these T samples is taken as a template.

In process 15, if any samples remain, processing returns to process 11; otherwise processing ends.

The distance used in process 12 may be the Euclidean distance, the city-block distance, or the like.
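The conventional procedure of processes 11 to 15 can be sketched in Python as follows. This is only a minimal illustration: the function names `euclid` and `conventional_templates`, the use of the first remaining sample as the "arbitrary" sample Si, the Euclidean distance, and the assumption that K divides N are choices made for this sketch and are not taken from the patent.

```python
import math

def euclid(p, q):
    """Euclidean distance between two points given as coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def conventional_templates(samples, k):
    """Return templates, each the mean of T = N / K samples (processes 11-15)."""
    remaining = list(samples)
    t = len(remaining) // k                # T = N / K (assumes K divides N)
    if t < 1:
        raise ValueError("need at least K samples")
    templates = []
    while remaining:                       # process 15: repeat while samples remain
        seed = remaining[0]                # process 11: Si (assumed: first remaining sample)
        # process 12: order by distance to Si; Si itself (distance 0) comes first
        remaining.sort(key=lambda s: euclid(s, seed))
        # process 13: take Si and its T-1 nearest neighbours out of the pool
        group, remaining = remaining[:t], remaining[t:]
        # process 14: the template is the coordinate-wise mean of the group
        templates.append(tuple(sum(c) / len(group) for c in zip(*group)))
    return templates
```

For example, `conventional_templates([(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)], 2)` groups the two samples near the origin and the two samples near 10. Nothing in the procedure checks whether a grouping is stable under small changes to the sample set.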

As described above, the conventional processing method does not take the bias of the training samples into account, so the templates it determines are likely to change substantially when a different set of training samples is used.

This means that templates determined from the training samples are of little benefit when recognizing test samples that differ from the training samples. The present invention provides a technique that reduces the effect of training-sample bias when determining templates, and thereby makes it possible to determine multiple templates that recognize test samples with high accuracy.

(4-2) Embodiment
Fig. 1 is an example of a processing block diagram of the present invention.

The training samples are assumed to be represented as feature vectors.

In process 21, each sample is made a cluster of its own.

In process 22, the distances between all clusters are computed and the cluster pairs are ranked in ascending order of distance. The cluster pair with the smallest distance is taken as C1, C2.

In process 23, the prediction error, defined later, is computed both for the case where C1 and C2 are merged and for the case where they are not.

In process 24, if the prediction error is smaller when C1 and C2 are merged, C1 and C2 are merged into one cluster and processing returns to process 22. If the prediction error is smaller when they are not merged, processing proceeds to process 25.

In process 25, if a next-ranked cluster pair exists and its distance is smaller than a certain fixed value, that pair is taken as C1, C2 and processing returns to process 23. Otherwise, processing proceeds to process 26.

In process 26, for each cluster the convex polyhedron formed by its samples is obtained, the samples that are vertices of that polyhedron are averaged, and the average is taken as the template; processing then ends.
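The control flow of processes 21 to 25 can be sketched in Python as follows. This is a rough outline under stated assumptions, not the patented implementation: the names `merge_clusters`, `pair_distance`, `merge_is_better`, and `max_distance` are hypothetical, the inter-cluster distance and the prediction-error comparison are passed in as callbacks, and `complete_linkage` is only a stand-in distance for trying the loop out. Process 26 (template extraction from the convex polyhedron) is omitted.

```python
from itertools import combinations

def merge_clusters(samples, pair_distance, merge_is_better, max_distance):
    """Processes 21-25: return the final clusters as lists of samples."""
    clusters = [[s] for s in samples]                      # process 21
    while True:
        # process 22: rank all cluster pairs, smallest distance first
        ranked = sorted(
            (pair_distance(clusters[i], clusters[j]), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        for rank, (d, i, j) in enumerate(ranked):
            # process 25: a next-ranked pair is tried only below the fixed value
            if rank > 0 and d >= max_distance:
                return clusters
            # processes 23-24: merge when the callback reports a smaller error
            if merge_is_better(clusters[i], clusters[j]):
                merged = clusters[i] + clusters[j]
                clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
                clusters.append(merged)
                break                                      # back to process 22
        else:
            return clusters                                # no pair left to try


def complete_linkage(c1, c2):
    """Stand-in inter-cluster distance: largest pairwise Euclidean distance."""
    return max(sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
               for p in c1 for q in c2)
```

Passing the two decisions in as callbacks keeps the loop independent of how the distance and the prediction error are actually computed.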

The distance between clusters is defined as follows.

The distance from C1 to C2 is the largest of the distances from C1 to the samples belonging to C2.

Likewise, the distance from C2 to C1 is the largest of the distances from C2 to the samples belonging to C1. The larger of the distance from C1 to C2 and the distance from C2 to C1 is taken as the distance between clusters C1 and C2.
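The symmetric inter-cluster distance described above can be sketched in Python as follows. The function names are hypothetical, and `nearest_member` is only a stand-in for the cluster-to-sample distance, chosen so that the example is runnable on its own; it is not the distance defined in the patent.

```python
def directed_distance(c_from, c_to, cluster_to_sample):
    """Largest distance from cluster c_from to any sample of c_to."""
    return max(cluster_to_sample(c_from, x) for x in c_to)

def cluster_distance(c1, c2, cluster_to_sample):
    """Larger of the two directed distances between C1 and C2."""
    return max(directed_distance(c1, c2, cluster_to_sample),
               directed_distance(c2, c1, cluster_to_sample))

def nearest_member(cluster, x):
    """Stand-in cluster-to-sample distance: Euclidean distance to the closest member."""
    return min(sum((a - b) ** 2 for a, b in zip(w, x)) ** 0.5 for w in cluster)
```

With the `nearest_member` stand-in, `cluster_distance` coincides with the Hausdorff distance between the two finite sample sets. With a different cluster-to-sample distance the values differ, but the max-of-max structure stays the same.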

The distance between cluster C1 and a sample X is defined as follows; Fig. 2 illustrates the definition. The vertices of the convex polyhedron T formed by the samples Wi in the cluster are found, and their average is denoted Y. The projection of sample X onto the hyperplane containing the samples of cluster C1 is denoted Z. Let D1 be the distance from X to Z, D2 the distance from Y to Z, and D3 the distance from Y to the point where the straight line joining Y and Z crosses the boundary of the convex polyhedron. The distance between cluster C1 and sample X is then defined as D1 + D2/D3. Any distance that satisfies the distance axioms, such as the Euclidean distance or the city-block distance, may be used to compute D1, D2, and D3.
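As a worked special case, the D1 + D2/D3 definition can be sketched in Python for a cluster consisting of exactly two distinct samples. The sketch makes several assumptions that are not stated in the patent: the "convex polyhedron" of two samples is taken to be the segment joining them, the "hyperplane containing the samples" is read as the straight line through them, the Euclidean distance is used, and the function name `two_sample_cluster_distance` is hypothetical. Clusters with one sample or with three or more samples are not handled.

```python
import math

def euclid(p, q):
    """Euclidean distance between two points given as coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def two_sample_cluster_distance(w1, w2, x):
    """D1 + D2/D3 for a cluster whose only samples are w1 and w2."""
    y = [(a + b) / 2 for a, b in zip(w1, w2)]      # Y: mean of the two end points
    u = [b - a for a, b in zip(w1, w2)]            # direction of the line through w1, w2
    uu = sum(c * c for c in u)
    if uu == 0:
        raise ValueError("the two samples must differ")
    # Z: orthogonal projection of X onto the line through w1 and w2
    t = sum((xi - yi) * ui for xi, yi, ui in zip(x, y, u)) / uu
    z = [yi + t * ui for yi, ui in zip(y, u)]
    d1 = euclid(x, z)                              # D1: X to Z
    d2 = euclid(y, z)                              # D2: Y to Z
    d3 = euclid(y, w1)                             # D3: Y to the segment boundary (an end point)
    return d1 + d2 / d3
```

In this reading, D1 measures how far X lies off the line, while D2/D3 measures how far the projection lies from the centre relative to the extent of the cluster: it is 0 at the centre Y and 1 at either end point.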

The prediction error is defined as follows. Let the samples belonging to cluster C be Xi (i = 1, ..., n), and let the test samples generated with random numbers be Zj (j = 1, ..., m). Let Dj be the distance from cluster C to Zj, and let Dj(-k) be the distance to Zj from the cluster formed by the samples of C with Xk (1 <= k <= n) removed. The prediction error of cluster C is obtained by removing each sample of C once in turn and accumulating the difference between Dj and Dj(-k) over all test samples; it is defined as follows.

$$\sum_{k=1}^{n} \sum_{j=1}^{m} \bigl( D_j - D_j(-k) \bigr)^2$$

The prediction error when clusters C1 and C2 are merged is computed by treating C1 and C2 together as a single cluster. The prediction error when C1 and C2 are not merged is computed by taking the distance to each test sample as the distance from whichever of C1 and C2 is closer. As this shows, the prediction error is small when the clustering result does not change even if the number of samples is reduced by one.
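The leave-one-out prediction error for the merged and unmerged cases can be sketched in Python as follows. Several points are assumptions made for this sketch rather than details from the patent: the function names are hypothetical, `centroid_distance` is a stand-in for the cluster-to-sample distance (not the D1 + D2/D3 definition), the squared difference is used, a cluster emptied by removing its only sample is simply ignored, and the test samples are passed in rather than generated with random numbers.

```python
import math

def centroid_distance(cluster, z):
    """Stand-in cluster-to-sample distance: Euclidean distance from the centroid."""
    centre = [sum(p[i] for p in cluster) / len(cluster) for i in range(len(z))]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(centre, z)))

def nearest_cluster_distance(clusters, z, dist):
    """Distance from z to the closest non-empty cluster."""
    return min(dist(c, z) for c in clusters if c)

def prediction_error(clusters, tests, dist=centroid_distance):
    """Sum over removed samples k and test samples j of (Dj - Dj(-k))^2.

    Pass [c1 + c2] for the merged case and [c1, c2] for the unmerged case.
    """
    full = [nearest_cluster_distance(clusters, z, dist) for z in tests]
    total = 0.0
    for ci, cluster in enumerate(clusters):
        for k in range(len(cluster)):
            # remove the k-th sample of cluster ci, leave the other clusters intact
            reduced = [c[:k] + c[k + 1:] if i == ci else c
                       for i, c in enumerate(clusters)]
            if not any(reduced):
                continue               # nothing left to measure from (assumption)
            for dj, z in zip(full, tests):
                total += (dj - nearest_cluster_distance(reduced, z, dist)) ** 2
    return total
```

A merge decision in the sense of process 24 would then compare `prediction_error([c1 + c2], tests)` with `prediction_error([c1, c2], tests)` and merge only when the former is smaller.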

(5) Effects of the Invention
As described above, according to the present invention, clustering is performed on the basis of the prediction error, so the resulting templates change very little even when the training samples change somewhat. In other words, multiple templates can be determined that are not affected by a small number of biased training samples and that recognize test samples differing from the training samples with high accuracy.

[Brief Description of the Drawings]

Fig. 1 is a processing block diagram of the present invention, Fig. 2 is an explanatory diagram illustrating the definition of the distance from a cluster to a sample point, and Fig. 3 is an example of a processing block diagram of a multiple-template determination method that applies a conventional clustering technique.
Patent applicant: Nippon Telegraph and Telephone Corporation

Claims (1)

[Claims]
A clustering processing method for clustering a plurality of samples represented by N-dimensional vectors, in which clusters are merged starting from an initial state in which each sample forms a cluster of its own, the method comprising: a step of calculating the distance from a cluster to an arbitrary point in the N-dimensional vector space; a step of calculating the distance between clusters; a step of calculating, as a prediction error, the change in the distance to test samples when some of the samples constituting a cluster are removed; a step of determining, on the basis of the prediction error, whether to merge each cluster pair, taken in ascending order of the inter-cluster distance calculated in the step of calculating the distance between clusters; and a step of determining the end of processing on the basis of the inter-cluster distance and the prediction error.
JP63212192A 1988-08-26 1988-08-26 Clustering processing method Expired - Fee Related JPH0776986B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP63212192A JPH0776986B2 (en) 1988-08-26 1988-08-26 Clustering processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP63212192A JPH0776986B2 (en) 1988-08-26 1988-08-26 Clustering processing method

Publications (2)

Publication Number Publication Date
JPH0259980A (en) 1990-02-28
JPH0776986B2 (en) 1995-08-16

Family

ID=16618450

Family Applications (1)

Application Number Title Priority Date Filing Date
JP63212192A Expired - Fee Related JPH0776986B2 (en) 1988-08-26 1988-08-26 Clustering processing method

Country Status (1)

Country Link
JP (1) JPH0776986B2 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE602005025666D1 (en) 2004-02-13 2011-02-10 Jtekt Corp Electric power steering

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008041764A1 (en) * 2006-10-05 2008-04-10 National Institute Of Advanced Industrial Science And Technology Music artist search device and method
GB2456103A (en) * 2006-10-05 2009-07-08 Nat Inst Of Advanced Ind Scien Music artist search device and method
US8117214B2 (en) 2006-10-05 2012-02-14 National Institute Of Advanced Industrial Science And Technology Music artist retrieval system and method of retrieving music artist
JP4894026B2 (en) * 2006-10-05 2012-03-07 独立行政法人産業技術総合研究所 Music artist search apparatus and method

Also Published As

Publication number Publication date
JPH0776986B2 (en) 1995-08-16

Similar Documents

Publication Publication Date Title
Fränti et al. Randomised local search algorithm for the clustering problem
US7796821B2 (en) Method and system for fuzzy clustering of images
US6009199A (en) Classification technique using random decision forests
JPH05346915A (en) Learning machine and neural network, and device and method for data analysis
JPH07296117A (en) Constitution method of sort weight matrix for pattern recognition system using reduced element feature section set
Tscherepanow TopoART: A topology learning hierarchical ART network
Lerner et al. A classification-driven partially occluded object segmentation (CPOOS) method with application to chromosome analysis
JPH08227408A (en) Neural network
CN113850811B (en) Three-dimensional point cloud instance segmentation method based on multi-scale clustering and mask scoring
CN112836629A (en) Image classification method
JPH0259980A (en) Treatment of clustering
CN115827932A (en) Data outlier detection method, system, computer device and storage medium
CN115102868A (en) Web service QoS prediction method based on SOM clustering and depth self-encoder
CN113449524B (en) Named entity identification method, system, equipment and medium
CN114238634B (en) Regular expression generation method, application, device, equipment and storage medium
CN117152538B (en) Image classification method and device based on class prototype cleaning and denoising
CN115758139A (en) Training method and device for wind control model
CN117575004B (en) Nuclear function determining method, computing device and medium based on double-layer decision tree
Stevens et al. Feature-based target classification in laser radar
CN114187457A (en) Iterative graph alignment method
CN116702830A (en) Automatic generation method and system of graph neural network architecture
JPH07225837A (en) Method for generating pattern recognition dictionary
JPH03294983A (en) Character recognizing device
CN115795003A (en) New problem discovery method and device based on clustering
CN114168760A (en) Multimedia information recommendation method, device and storage medium

Legal Events

Date Code Title Description
LAPS Cancellation because of no payment of annual fees