JP3901540B2

JP3901540B2 - Clustering program

Info

Publication number: JP3901540B2
Application number: JP2002039485A
Authority: JP
Inventors: 隆洋中井; 眞蓼沼
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2002-02-18
Filing date: 2002-02-18
Publication date: 2007-04-04
Anticipated expiration: 2022-02-18
Also published as: JP2003242508A

Description

【０００１】
【産業上の利用分野】
この発明はクラスタリング方法に関し、特にたとえば複数の標本点をクラスタリングする、クラスタリング方法に関する。
【０００２】
【従来の技術】
従来のこの種のクラスタリング方法としては、単純クラスタリング方法がある。これは、「画像工学（増補）−画像のエレクトロニクス−（テレビジョン学会教科書シリーズ１）」（コロナ社、２０００年４月２０日発行）の４．６．１節に記述されている。
【０００３】
具体的に説明すると、ｎ個のベクトル（Ｘ₁,Ｘ₂,…,Ｘ_n）をクラスタリングする場合には、まず、任意のベクトル、たとえば、Ｘ_iをとり、これを第１クラスタＣ₁の中心Ｙ₁（Ｙ₁＝Ｘ_i）とする。
【０００４】
次に、ベクトルＸ_jをとり、Ｙ₁とＸ_jとの距離Ｄ_1,jを求める。ここで、Ｄ_1,j＞Ｔであれば、Ｘ_jを第２のクラスタＣ₂の中心Ｙ₂（Ｙ₂＝Ｘ_j）とする。しかし、Ｄ_1,j≦Ｔであれば、Ｘ_j∈Ｃ₁とする。
【０００５】
続いて、Ｘ_kをとり、Ｙ₁およびＹ₂とＸ_kとの距離Ｄ_1,k，Ｄ_2,kを求める。ここで、Ｄ_1,k＞Ｔであり、かつＤ_2,k＞Ｔであれば、Ｘ_kを第３クラスタＣ₃の中心Ｙ₃（Ｙ₃＝Ｘ_k）とする。しかし、Ｄ_1,k≦ＴまたはＤ_2,k≦Ｔであれば、Ｘ_kの中心との距離の短い方のクラスタに所属するものとする。
【０００６】
このような処理をすべてのベクトルについて実行することにより、クラスタリングが完了する。
【０００７】
【発明が解決しようとする課題】
しかし、従来の方法では、決定したクラスタ数が妥当であるかどうか明確でなく、クラスタリングの結果から人間がその妥当性を判断するため、場合によってはあまり好ましくないクラスタ数であることがあった。このため、その後の解析処理等に支障を来たしていた。
【０００８】
それゆえに、この発明の主たる目的は、クラスタリング結果の利便性を向上できる、クラスタリング方法を提供することである。
【０００９】
【課題を解決するための手段】
第１の発明は、複数の標本点をクラスタリングするクラスタリングプログラムであって、コンピュータを、各クラスタ内のすべての標本点と当該クラスタ内の重心となる代表点とのユークリッド距離をすべての標本点のそれぞれについて検出する距離検出手段、距離検出手段によって検出したユークリッド距離の第１総和値を求める第１総和値算出手段、第１総和値算出手段によって求めた第１総和値を各クラスタに渡って総和を取った第２総和値を求める第２総和値算出手段、クラスタ数を最大値から最小値に向けて変化させたときに第２総和値算出手段によって求めた第２総和値が急激に変化する個所を挟む前後数箇所の第２総和値のいずれかに対応するクラスタ数を最適クラスタ数に決定する最適クラスタ数決定手段、および最適クラスタ数決定手段によって決定した最適クラスタ数へクラスタリングするクラスタリング手段として機能させる、クラスタリングプログラムである。
【００１０】
第２の発明は、複数の標本点をクラスタリングするクラスタリングプログラムであって、コンピュータを、各クラスタ内のすべての標本点と当該クラスタ内の重心となる代表点とのユークリッド距離をすべての標本点のそれぞれについて検出する距離検出手段、距離検出手段によって検出したユークリッド距離の第１総和値を求める第１総和値算出手段、第１総和値算出手段によって求めた第１総和値を各クラスタに渡って総和を取った第２総和値を求める第２総和値算出手段、クラスタ数を最大値から最小値に向けてｎ（ｎは自然数）個ずつ減少させるとき、前回のクラスタリング結果において最もクラスタ間距離が短い２つのクラスタを１つのクラスタにまとめる操作をｎ回繰り返すことにより、クラスタ数をｎ個減少させたときのクラスタリングを決定して第２総和値を求める第２総和値再計算手段、第２総和値再計算手段によって求めた第２総和値が急激に変化する個所を挟む前後数箇所の第２総和値のいずれかに対応するクラスタ数を最適クラスタ数に決定する最適クラスタ数決定手段、および最適クラスタ数決定手段によって決定した最適クラスタ数へクラスタリングするクラスタリング手段として機能させる、クラスタリングプログラムである。
【００１１】
第３の発明は、複数の標本点をクラスタリングするクラスタリングプログラムであって、コンピュータを、各クラスタ内のすべての標本点と当該クラスタ内の重心となる代表点とのユークリッド距離をすべての標本点のそれぞれについて検出する距離検出手段、距離検出手段によって検出したユークリッド距離の第１総和値を求める第１総和値算出手段、第１総和値算出手段によって求めた第１総和値を各クラスタに渡って総和を取った第２総和値を求める第２総和値算出手段、クラスタ数を最大値から最小値に向けて変化させるときに各クラスタ数に分けるすべての分け方について第２総和値を求める第２総和値再計算手段、第２総和値再計算手段によって求めた複数の第２総和値の中で最も小さいものを第３総和値として算出する第３総和値算出手段、クラスタ数を最大値から最小値に向けて変化させたときに第３総和値算出手段によって算出した第３総和値が急激に変化する前後いずれかの第３総和値に対応するクラスタ数を最適クラスタ数に決定する最適クラスタ数決定手段、および最適クラスタ数決定手段によって決定した最適クラスタ数へクラスタリングするクラスタリング手段として機能させる、クラスタリングプログラムである。
【００１２】
【作用】
第１の発明では、クラスタリングには、重心法が用いられ、まず、クラスタ内の標本点とクラスタの代表点（重心点）との距離を当該クラスタに属する全標本点に渡って総和を取った第１総和値を求める。この第１総和値は、各クラスタについて求められ、それらの総和を取った第２総和値が求められる。たとえば、クラスタ数を最大値から最小値に向かって変化させたとき、第２総和値が急激に変化する前後いずれかの第２総和値に対応するクラスタ数を最適クラスタ数に決定する。そして、決定された最適クラスタ数で標本点がクラスタリングされる。
【００１３】
第２の発明は、第１の発明とほとんど同じであるが、クラスタ数を最大値から最小値に向けてｎ（ｎは自然数）個ずつ減少させるときに、前回のクラスタ間距離が最も短い２つのクラスタを１つにまとめる操作をｎ回繰り返すことにより、クラスタ数をｎ個減少させたときのクラスタリングについて第２総和値を求める。そして、第２総和値が急激に変化する個所を挟む前後数箇所の第２総和値のいずれかに対応するクラスタ数を最適クラスタ数に決定する。つまり、前回のクラスタ数におけるクラスタリング結果を利用できるので、演算処理の負担を軽くできる。
【００１４】
第３の発明もまた、第１の発明とほとんど同じであるが、クラスタ数をｎ個減らす場合には、クラスタ数をｎ個減少させたときのすべてのクラスタリングについての第２総和値を求め、その中の最小値を第３総和値とし、クラスタ数を最大値から最小値に向けて変更する場合に第３総和値が急激に変化した前後いずれかの第３総和値に対応するクラスタ数が最適クラスタ数に決定される。つまり、上述の発明よりも正確な（適切な）クラスタ数を決定することができる。
【００１５】
【発明の効果】
この発明によれば、最適なクラスタ数へのクラスタリングが可能であるため、クラスタリング結果の信頼性が高く、その後の処理における利便性を向上させることができる。
【００１６】
この発明の上述の目的,その他の目的,特徴および利点は、図面を参照して行う以下の実施例の詳細な説明から一層明らかとなろう。
【００１７】
【実施例】
この実施例のクラスタリングシステムは、図示は省略するが、パーソナルコンピュータ或いはワークステーションのようなコンピュータとデータベースとを備える。
【００１８】
なお、データベースは、コンピュータの内部に設けられてもよく、直接或いはインターネット等を介してコンピュータと通信可能に外部接続されてもよい。
【００１９】
データベースは、たとえば航空機や衛星から撮影した写真（航空写真、衛星写真）或いは絵画に対応する様々な画像（自然画像）の画像データが記憶（蓄積）される画像データベースである。また、複数の画像データについてのテクスチャの特徴量（後で詳細に説明する、Ｄ，Ｖ，Ｐ）を取得（算出）した結果である結果データも、対応する画像データと関連付けて記憶される。
【００２０】
図１に示すように、この実施例では、画像のサイズは、２０４８画素×２０４８画素である。また、テクスチャの平坦度を判別する単位（画像領域）は、この実施例では、３２画素×３２画素の大きさであり、これを小領域ａ_iとする。したがって、この実施例では、小領域ａ_iの個数は４０９６である。
【００２１】
たとえば、ユーザから標本点（テクスチャ特徴量）の抽出指示が入力されると、コンピュータ、実際には、コンピュータ内部に設けられたＣＰＵ（図示せず）が図２に示すフロー図に従ってテクスチャの特徴量を（抽出）取得する特徴量取得処理を実行して、上述したような結果データ（標本点）を得る。
【００２２】
図２を参照して、ユーザからの指示に応じて、コンピュータのＣＰＵが処理を開始すると、ステップＳ１では、ユーザの指示に従って特徴量を取得すべき所望の画像に対応する画像データを読み出す。続くステップＳ３では、読み出した画像を小領域ａ_iに分割する。上述したように、小領域ａ_iは３２画素×３２画素の大きさであるため、分割される小領域ａ_iの個数は４０９６である。
【００２３】
続いて、ステップＳ５では、ＣＰＵはカウンタのカウント値を初期化（ｉ＝１）し、後で詳細に説明するように、ステップＳ７ではｉ番目の小領域ａ_iについてのテクスチャ特徴量Ｄ（第１特徴量），Ｖ（第２特徴量）およびＰ（第３特徴量）の計算処理を実行して、ステップＳ９でカウント値ｉが４０９６かどうかを判断する。つまり、すべての小領域ａ_iについてのテクスチャ特徴量を計算したかどうかを判断する。
【００２４】
ステップＳ９で“ＮＯ”であれば、つまりカウント値ｉが４０９６でなければ、すべての小領域ａ_iについてのテクスチャ特徴量を終了していないと判断して、ステップＳ１１でカウント値ｉをインクリメント（ｉ＝ｉ＋１）してからステップＳ７に戻る。一方、ステップＳ９で“ＹＥＳ”であれば、つまりカウント値ｉが４０９６であれば、すべての小領域ａ_iについてテクスチャ特徴量の計算処理を終了したと判断して、ステップＳ１３に進む。ステップＳ１３では、４０９６個の小領域ａ_iについてのテクスチャ特徴量（Ｄ，Ｖ，Ｐ）を画像データに関連付けてデータベースに記憶する。
【００２５】
図３は、図２に示したステップＳ７の小領域ａ_iのテクスチャ特徴量の計算処理を示すフロー図であり、特徴量の計算処理が開始されると、ＣＰＵは、ステップＳ２１で小領域ａ_iの画像データをＲ，Ｇ，ＢデータからＹデータに変換する。たとえば、小領域ａ_i上の位置（ｎ，ｍ）における画素データをｑ_i（ｎ，ｍ）とすると、各画素はＲ，Ｇ，Ｂの３種類の画素値を持つので、数１のように表記することができる。
【００２６】
【数１】

【００２７】
ただし、ｎ，ｍは、それぞれ、１≦ｎ≦３２，１≦ｍ≦３２を満たす自然数である。ここで、位置すなわち座標系（ｎ，ｍ）は、画像に対する一般的な座標系（行列座標系）であり、水平位置ｎは右向きを正の方向とし、垂直位置ｍは下向きを正の方向とする。したがって、小領域ａ_iの画像の左上が（１，１）であり、右下が（３２，３２）である。
【００２８】
以後の処理においては、テクスチャは濃淡画像として扱うため、ＣＰＵは、数２に従って各画素のＲ，Ｇ，ＢデータをＹデータに変換する。
【００２９】
【数２】

【００３０】
また、これ以降の処理では、テクスチャの平均的な明るさ（平均輝度）は問題にしないので、次のステップＳ２３では、Ｙデータから小領域ａ_iの平均輝度を除去する。
【００３１】
まず、数３に従って小領域ａ_iの平均輝度を求める。
【００３２】
【数３】

【００３３】
次に、小領域ａ_iの各画素の輝度値Ｙ_i（ｎ，ｍ）から小領域ａ_iの平均輝度を減算した減算値（減算データ）ＹＴ_i（ｎ，ｍ）を数４に従って求める。
【００３４】
【数４】

【００３５】
ただし、１≦ｎ≦３２,１≦ｍ≦３２
次に、ステップＳ２５では、小領域ａ_iの減算データＹＴ_i（ｎ，ｍ）を２次元離散的フーリエ変換（２次元ＤＦＴ）する。つまり、数５に従って演算し、複素数Ｆ_i（ｕ，ｖ）を算出する。
【００３６】
【数５】

【００３７】
ただし、ｊは虚数単位であり、ｕ，ｖは、それぞれ、−１６≦ｕ≦１６，−１６≦ｖ≦１６を満たす整数である。この複素数Ｆ_i（ｕ，ｖ）は３２画素×３２画素の減算データＹＴ_i（ｎ，ｍ）の２次元周波数（ｕ，ｖ）における周波数成分を表す。
【００３８】
ここで、ｕの単位は[cpw : cycle per picture width]であり、ｖの単位は[cph : cycle per picture height]である。
【００３９】
一般的に、或る物理量が複素数で表される場合、その物理量の大きさや強さ（強度）は、その複素数の大きさすなわち絶対値で表すことができる。そこで、ＣＰＵは、続くステップＳ２７で２次元周波数平面上に数５によって得られた複素数Ｆ_i（ｕ，ｖ）の絶対値を取る（算出する）。具体的には、数６に従って複素数Ｆ_i（ｕ，ｖ）の絶対値（「Ｗ_i（ｕ，ｖ）」と表記する。）を取り、絶対値Ｗ_i（ｕ，ｖ）をもって２次元周波数成分Ｆ_i（ｕ，ｖ）の強度とする。
【００４０】
【数６】

【００４１】
ただし、上述したように、ｕ，ｖは、それぞれ、−１６≦ｕ≦１６，−１６≦ｖ≦１６を満たす整数である。
【００４２】
また、座標系（ｎ，ｍ）に対応して、２次元周波数座標系（ｕ，ｖ）についても、水平周波数ｕは右向きを正の方向とし、垂直周波数ｖは下向きを正の方向とする。このような２次元周波数座標系（平面上）に、２次元周波数強度分布Ｗ_i（ｕ，ｖ）を等高線表示した一例が図４のように示される。この図４では、グレースケールによって周波数分布の強度を表しており、黒に近いほど、強度が弱く、白に近いほど、強度が強い。
【００４３】
続いて、ステップＳ２９では、この強度分布Ｗ_i（ｕ，ｖ）を２次元周波数平面上の荷重分布とみなし、その重心位置を決定する（求める）。ただし、図４から分かるように、荷重分布（強度分布）は点（原点）対称であるため、重心位置が常に原点となることを避けるように、この実施例では、周波数平面の右半分の領域（水平周波数ｕが０から＋１６cpw、垂直周波数ｖが−１６から＋１６cphの範囲）についての重心Ｇ_iの位置（ｕ_iG，ｖ_iG）が数７に従って求められる。
【００４４】
【数７】

【００４５】
ここで、ＳＷ_iは小領域ａ_iの周波数強度分布の右半分の領域における総荷重を表し、数８で与えられる。また、重心位置の測定例が図４内に点Ｐで示される。
【００４６】
【数８】

【００４７】
なお、この実施例では、重心位置を周波数平面の右半分の領域で求めたが、これに限定する必要はなく、周波数平面の左半分、上半分或いは下半分の領域で求めてもよい。
【００４８】
また、計算の煩わしさを考慮しなければ、上、下、左、右半分の領域以外の場合で、原点を通る直線で周波数平面を２分割した一方の領域（範囲）内で重心位置を求めるようにしてもよい。
【００４９】
次に、この実施例では、テクスチャの方向は問題にしていないので、ステップＳ３１では、ＣＰＵは重心位置によらず，原点から重心位置までの距離Ｄ_iを小領域ａ_iのテクスチャの第１特徴量（テクスチャ特徴量Ｄ）として数９により算出する。
【００５０】
【数９】

【００５１】
続いて、ステップＳ３３で、小領域a_iのテクスチャの第２特徴量（テクスチャ特徴量Ｖ）として、上述したような周波数平面の右半分の領域の周波数範囲で、周波数強度分布の重心周りの分散ｖ_iを数１０により求める。このとき、周波数範囲の強度分布が、この小領域ａ_iに関する全サンプルであるので、不偏分散を求める必要はない。このため、数１０においては、分母がＳＷ_i-１ではなく、ＳＷ_iとされる。
【００５２】
【数１０】

【００５３】
ここで、小領域のテクスチャに関して、右半分の領域における水平周波数ｕを第１変量、垂直周波数ｖを第２変量とすると、この２つの変量ｕ，ｖを次のように統合して新しい２つの変量、すなわち第１主成分z_i1および第２主成分z_i2に変換することができる。
【００５４】
まず、第１変量ｕの分散（水平周波数ｕに沿って分布する荷重による分散）Ｖ_iuu、第２変量ｖの分散（垂直周波数ｖに沿って分布する荷重による分散）Ｖ_ivv、および第１変量ｕ、第２変量ｖの共分散Ｖ_iuvは数１１〜数１３によってそれぞれ求められる。
【００５５】
【数１１】

【００５６】
【数１２】

【００５７】
【数１３】

【００５８】
ここで、これらの分散値を要素とする分散・共分散行列（実対称行列）Ｖ_iは数１４で示される。
【００５９】
【数１４】

【００６０】
この実対称行列Ｖ_iの固有値λ_i1，λ_i2（ただし、λ_i1≧λ_i2≧０である。）および固有値λ_i1，λ_i2に対応する固有ベクトルｌ_i1，ｌ_i2は次のように求められる。具体的には、固有値λ_i1，λ_i2は、Ｖ_iの固有多項式（λに関する２次方程式）、すなわち数１５の根として求められる。また、固有ベクトルｌ_i1，ｌ_i2は、数１６を解くことにより求められる。
【００６１】
【数１５】

【００６２】
【数１６】

【００６３】
次に、新しい変量である第１主成分z_i1および第２主成分z_i2は、ｕ，ｖの１次結合であり、数１７によってそれぞれ定義される。
【００６４】
【数１７】

【００６５】
ここで、固有ベクトルｌ_i1，ｌ_i2は、それぞれ、数１８のように表している。
【００６６】
【数１８】

【００６７】
以上より、ＣＰＵは、ステップＳ３５で、２次元周波数平面の右半分の領域で第１主成分の寄与率Ｐ_iを小領域ａ_iのテクスチャの第３特徴量（テクスチャ特徴量Ｐ）として求める。すなわち、第１主成分の寄与率Ｐ_iは数１９で算出することができる。
【００６８】
【数１９】

【００６９】
しかし、第１固有値λ_i1は、第１主成分z_i1の分散Ｖ_imに等しく、第２固有値λ_i2は、第２主成分z_i2の分散Ｖ_isに等しいことが知られており、また、この実施例では、全変量の個数は２個であることから、第１主成分ｚ_i1の分散Ｖ_imと第２主成分ｚ_i2の分散Ｖ_isの和は、全分散Ｖ_iすなわち重心周りの分散に等しくなる。このため、寄与率Ｐ_iは、第１主成分ｚ_i1の分散Ｖ_imが全分散Ｖ_iに占める割合であるということもでき、数２０が成立する。
【００７０】
【数２０】

【００７１】
なお、この第１主成分ｚ_i1の座標軸は、重心を通るあらゆる軸の中で、その軸に沿った分散が最も大きくなる軸を意味する。第１主成分軸と第２主成分軸との測定例は、図４においてそれぞれ実線と点線とで示される。
【００７２】
以上のように、Ｓ３１，Ｓ３３およびＳ３５の処理を経て、この小領域ａ_iのテクスチャ特徴量すなわち、距離Ｄ_i、分散Ｖ_i、第１主成分の寄与率Ｐ_iが得られる。これらが、上述したように、標本値としてデータベースに記憶される。
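The three feature quantities of steps S31, S33 and S35 can be illustrated with a minimal Python sketch. It assumes that the variance V is the load-weighted mean squared distance from the centroid (so that V = V_uu + V_vv, consistent with the statement that the two principal-component variances sum to the total variance), and that P = λ_1 / (λ_1 + λ_2). The function name `texture_features` and the dictionary representation of W_i(u, v) are hypothetical, not taken from the patent.

```python
import math

def texture_features(w, half):
    """Return (D, V, P) for a load distribution w = {(u, v): weight}.

    Only the right half-plane 0 <= u <= half, -half <= v <= half is used.
    """
    pts = [(u, v, w.get((u, v), 0.0))
           for u in range(0, half + 1)
           for v in range(-half, half + 1)]
    sw = sum(wt for _, _, wt in pts)              # total load SW
    if sw <= 0:
        raise ValueError("total load must be positive")
    ug = sum(u * wt for u, _, wt in pts) / sw     # centroid, horizontal
    vg = sum(v * wt for _, v, wt in pts) / sw     # centroid, vertical
    d = math.hypot(ug, vg)                        # feature D: origin to centroid
    vuu = sum(wt * (u - ug) ** 2 for u, _, wt in pts) / sw
    vvv = sum(wt * (v - vg) ** 2 for _, v, wt in pts) / sw
    vuv = sum(wt * (u - ug) * (v - vg) for u, v, wt in pts) / sw
    total_var = vuu + vvv                         # feature V (divisor SW, not SW - 1)
    if total_var == 0:
        raise ValueError("contribution rate is undefined for zero variance")
    # Larger eigenvalue of the symmetric matrix [[vuu, vuv], [vuv, vvv]].
    lam1 = (total_var + math.sqrt((vuu - vvv) ** 2 + 4 * vuv ** 2)) / 2
    return d, total_var, lam1 / total_var         # feature P = lam1 / (lam1 + lam2)

line = texture_features({(1, 0): 1.0, (3, 0): 1.0}, half=4)
cross = texture_features({(1, 0): 1.0, (3, 0): 1.0, (2, 1): 1.0, (2, -1): 1.0}, half=4)
```

A distribution stretched along one axis (`line`) gives P = 1, while an isotropic one (`cross`) gives P = 0.5, which is the lower bound for two variates.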
【００７３】
たとえば、標本値としての結果データが最適なクラスタ数にクラスタリングされ、そのクラスタリング結果を用いて、テクスチャ平坦度の判別等のようなその後の処理に役立てることができる。
【００７４】
ただし、上述したテクスチャ特徴量は、個別に得た値であり、すなわちそれぞれ単位が異なるため、少なくともクラスタリングの開始時において、３つの特徴量を正規化する必要がある。具体的には、３種類の特徴量Ｄ_i，Ｖ_i，Ｐ_iのそれぞれについて、その４０９６個の測定データに渡る平均値が０、分散が１になるように正規化が行なわれる。以下においては、正規化された特徴量（正規化特徴量）を、それぞれ、ＺＤ_i，ＺＶ_i，ＺＰ_iと表記することにする。
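The normalization described here (mean 0, variance 1 over all measured values of one feature) corresponds to the usual z-score. The following is a minimal sketch, assuming the population standard deviation (divisor N) is intended, since the equation images are not reproduced in this text. The function name `zscore` is hypothetical.

```python
import math

def zscore(values):
    """Normalize a list of feature values to mean 0 and variance 1."""
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation (divisor n); this is an assumption.
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    if std == 0:
        raise ValueError("all values are identical; cannot normalize")
    return [(x - mean) / std for x in values]

zd = zscore([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

The same function would be applied separately to D_i, V_i and P_i to obtain ZD_i, ZV_i and ZP_i.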
【００７５】
まず、テクスチャ特徴量Ｄ_iは、数２１に従って正規化される。
【００７６】
【数２１】

【００７７】
ここで、Ｄ_M，σ_Dは、それぞれＤ_iの平均および標準偏差を表し、数２２および数２３によって計算される。
【００７８】
【数２２】

【００７９】
【数２３】

【００８０】
また、テクスチャ特徴量Ｖ_iは、数２４に従って正規化される。
【００８１】
【数２４】

【００８２】
ここで、Ｖ_M，σ_Vは、それぞれＶ_iの平均および標準偏差を表し、数２５および数２６によって計算される。
【００８３】
【数２５】

【００８４】
【数２６】

【００８５】
さらに、テクスチャ特徴量Ｐ_iは、数２７に従って正規化される。
【００８６】
【数２７】

【００８７】
ここで、Ｐ_M，σ_Pは、それぞれＰ_iの平均および標準偏差を表し、数２８および数２９によって計算される。
【００８８】
【数２８】

【００８９】
【数２９】

【００９０】
続いて、クラスタリングについて簡単に説明すると、クラスタ数ａがＮ個（全標本個数と同じ数）の場合に、クラスタ間の距離の決め方としてたとえば重心法に基づいてクラスタリングすると、クラスタリングは一意に決まる。
【００９１】
クラスタ数がa（ただし、ａ＝Ｎ，Ｎ−１，…，２，１）のときのｉ番目のクラスタをＣ（ａ，ｉ）と表記する。ここで、ｉ＝１，２，…，ａである。また、クラスタＣ（ａ,ｉ）に含まれる標本点の個数をｂ（ａ，ｉ）とし、クラスタＣ（ａ，ｉ）に含まれるｋ番目の標本点をＳ（ａ，ｉ，ｋ）と表記する。ここで、ｋ＝１，２，…，ｂ（ａ，ｉ）である。さらに、クラスタＣ（ａ，ｉ）の重心をＧ（ａ，ｉ）とし、重心Ｇ（ａ，ｉ）と標本点Ｓ（ａ，ｉ，ｋ）との距離をｄ（ａ，ｉ，ｋ）と表記すると、クラスタＣ（ａ，ｉ）内の各標本点を重心Ｇ（ａ，ｉ）で代表させたことによるクラスタＣ（ａ，ｉ）についてのクラスタリング誤差は、数３０で示される。
【００９２】
【数３０】

【００９３】
なお、以下においては、簡単のため、このＤ（ａ，ｉ）をｉ番目のクラスタＣ（ａ，ｉ）の誤差と呼ぶことにする。
【００９４】
最初のクラスタ数ａ＝Ｎのときには、各クラスタＣ（Ｎ，ｉ）（ただし、ｉ＝１,２,…,Ｎ）は1個の標本点のみを含むので、各クラスタＣ（Ｎ，ｉ）内の標本点と各クラスタの重心Ｇ（Ｎ，ｉ）は一致する。したがって、ｂ（Ｎ，ｉ）＝１，ｋ＝１，ｄ（Ｎ，ｉ，ｋ）＝０となるので、Ｄ（ａ，ｉ）＝０となる。すなわち、ａ＝Ｎのとき、どのクラスタＣ（ａ，ｉ）についても、クラスタＣ（ａ，ｉ）の誤差は０である。
【００９５】
また、ｉ番目のクラスタＣ（ａ，ｉ）の誤差Ｄ（ａ，ｉ）を（クラスタ数がａのときの）全クラスタ（ｉ＝１，２，…，ａ）について加算した誤差Ｄ（ａ，ｉ）の総和ＳＤ（ａ）は、数３１で算出される。
【００９６】
【数３１】

【００９７】
つまり、ＳＤ（ａ）は、Ｎ個の標本点をａ個のクラスタにクラスタリングしたことによるクラスタリングの誤差を表していると考えられる。以下、この実施例では、このＳＤ（ａ）をa個へのクラスタリングの誤差と呼ぶことにする。最初のクラスタ数ａ＝Ｎのときのクラスタリングについては、上述したように各クラスタの誤差Ｄ（ａ,ｉ）は０であるので、ＳＤ（ａ）＝ＳＤ（Ｎ）＝０となる。すなわちクラスタ数ａ＝Ｎのとき、ａ個へのクラスタリングの誤差は０である。
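The error of one cluster, D(a, i), and the clustering error SD(a) defined above can be sketched as follows. This assumes that D(a, i) is the plain sum of the Euclidean distances d(a, i, k) and that SD(a) is the sum of D(a, i) over all clusters, as the surrounding text states. The function names are hypothetical.

```python
import math

def centroid(points):
    """Centroid G of a non-empty list of equal-length coordinate tuples."""
    dim = len(points[0])
    return tuple(sum(p[k] for p in points) / len(points) for k in range(dim))

def cluster_error(points):
    """D(a, i): sum of distances from each sample point to the cluster centroid."""
    g = centroid(points)
    return sum(math.dist(p, g) for p in points)

def clustering_error(clusters):
    """SD(a): sum of the cluster errors over all clusters."""
    return sum(cluster_error(c) for c in clusters)

sd_singletons = clustering_error([[(0.0, 0.0)], [(1.0, 1.0)]])
sd_pair = clustering_error([[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0)]])
```

When every cluster holds one sample point, as for a = N, the error is 0, matching the statement SD(N) = 0.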
【００９８】
次に、クラスタ数ａ＝Ｎ−１の場合に、重心法に基づいてＮ個の全標本点をクラスタリングすると、Ｎ個の標本点の内、最も近い２点をまとめて１つのクラスタとし、残りのクラスタはクラスタ数ａ＝Ｎの場合のままとして、クラスタ数ａ＝Ｎ−１個へのクラスタリングを決定する。そして、決定されたクラスタリングについて、上記の諸量を求める処理（数３１を用いた誤差の総和の算出処理）を繰り返せば、最も誤差の少ないクラスタ数ａ＝Ｎ−１個へのクラスタリングの誤差ＳＤ（ａ）＝ＳＤ（Ｎ−１）を得ることができる。
【００９９】
続いて、クラスタ数ａ＝Ｎ−２の場合には、重心法に基づいて、現在のクラスタ数ａ＝Ｎ−１個のクラスタリングがクラスタ数ａ＝Ｎ−２個へのクラスタリングに変更される。この場合、Ｎ−１個の重心の内、最も近い２つの重心を１つのクラスタにまとめるように、クラスタ数ａ＝Ｎ−２個へのクラスタリングを決定する。
【０１００】
なお、このようなクラスタリングの分け方の列挙は、一般に流通している統計解析ソフトウェアを使用するか、自分でプログラムを作成すれば、可能であり、さらに自動化することも容易である。以下、クラスタリングする場合については同様のことが言える。また、一般的に流通している統計処理ソフトウェアとしては、ＳＰＳＳ Ｉｎｃ．社製の「ＳＰＳＳ」やＳｔａｔＳｏｆｔ社製の「ＳＴＡＴＩＳＴＩＣＡ」を用いることができる。
【０１０１】
そして、決定されたクラスタリングで、上記の諸量を求める処理（数３１を用いた誤差の総和の算出処理）を繰り返せば、実用上差し支えのない程度に誤差の少ないクラスタ数ａ＝Ｎ−２個へのクラスタリングの誤差ＳＤ（ａ）＝ＳＤ（Ｎ−２）を得ることができる。
【０１０２】
ただし、Ｎ−２個へのクラスタリングを決める場合には、前回のクラスタリングの結果、すなわちＮ−１個へのクラスタリングの結果を用いずに、Ｎ個の標本をＮ−２個のクラスタに分けるすべての分け方について、上記の手順によってＳＤ（ａ）＝ＳＤ（Ｎ−２）を算出し、その中で最も小さいものを真のＳＤ（ａ）とするようにしてもよい。
【０１０３】
この後者の方法の方がＳＤ（ａ）の評価としてはより正確ではあるが、全標本の個数Ｎが非常に大きい場合には、コンピュータの処理能力にもよるが、現実的な方法ではなくなってしまう。このような場合には、前者の方法を用いて、実用上差し支えのない程度に誤差の少ないＳＤ（ａ）を求めるのがよいと考えられる。
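The former method, which reuses the previous clustering result, can be sketched as a loop that repeatedly merges the two clusters with the nearest centroids and records SD(a) at each step. This is a minimal illustration, not the patent's implementation. It assumes that the centroid of a merged cluster is recomputed from all of its member points and that ties are broken by the first pair found. All function names are hypothetical.

```python
import math
from itertools import combinations

def centroid(points):
    dim = len(points[0])
    return tuple(sum(p[k] for p in points) / len(points) for k in range(dim))

def clustering_error(clusters):
    """SD(a): total distance from every sample point to its cluster centroid."""
    total = 0.0
    for c in clusters:
        g = centroid(c)
        total += sum(math.dist(p, g) for p in c)
    return total

def centroid_method_errors(samples):
    """Return {a: SD(a)} for a = N, N-1, ..., 1 using greedy centroid merging."""
    clusters = [[tuple(p)] for p in samples]          # a = N: one point per cluster
    errors = {len(clusters): clustering_error(clusters)}
    while len(clusters) > 1:
        # Pick the pair of clusters whose centroids are closest.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: math.dist(centroid(clusters[ij[0]]),
                                            centroid(clusters[ij[1]])))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        errors[len(clusters)] = clustering_error(clusters)
    return errors

errors = centroid_method_errors([(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)])
```

For these four collinear points the error stays small down to a = 2 and jumps at a = 1, which is the kind of abrupt change the method looks for.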
【０１０４】
ここで、クラスタに分けるすべての分け方について具体的に説明することにする。たとえば、５個（Ｎ＝５）の標本点を３個のクラスタに分ける（クラスタリングする）すべての分け方は次のように、場合分けすることができる。ただし、５個の標本点（要素）は１〜５の数字で表すことにする。
【０１０５】
（ｉ）要素の数が３，１，１の３つのクラスタに分ける場合には、クラスタの組み合わせは次のようになる。なお、組み合わせ数は１０（₅Ｃ₃）である。
【０１０６】
（１，２，３），（４），（５）
（１，２，４），（３），（５）
（１，２，５），（３），（４）
（１，３，４），（２），（５）
（１，３，５），（２），（４）
（１，４，５），（２），（３）
（２，３，４），（１），（５）
（２，３，５），（１），（４）
（２，４，５），（１），（３）
（３，４，５），（１），（２）
（ii）要素の数が２，２，１の３つのクラスタに分ける場合には、クラスタの組み合わせは次のようになる。
【０１０７】
（１，２），（３，４），（５）
（１，３），（２，４），（５）
（１，４），（２，３），（５）
（１，２），（３，５），（４）
（１，３），（２，５），（４）
（１，５），（２，３），（４）
（１，２），（４，５），（３）
（１，４），（２，５），（３）
（１，５），（２，４），（３）
（１，３），（４，５），（２）
（１，４），（３，５），（２）
（１，５），（３，４），（２）
（２，３），（４，５），（１）
（２，４），（３，５），（１）
（２，５），（３，４），（１）
以上、（ｉ），（ii）で示すように、５個の標本点を３つのクラスタにクラスタリングする場合には、２５通りのクラスタリング方法がある。このように、分け方は多数存在するため、上述したように、前回の結果を利用してクラスタリングすることにより、コンピュータ（ＣＰＵ）の処理量を軽減してもよい。
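The enumeration in (i) and (ii) can be reproduced with a short recursive generator. This is a sketch with a hypothetical function name: each element either starts a new cluster or joins one of the clusters already formed from the remaining elements.

```python
def partitions_into_k(items, k):
    """Yield every way to split items into exactly k non-empty, unordered clusters."""
    items = list(items)
    if k == 0:
        if not items:
            yield []
        return
    if len(items) < k:
        return
    first, rest = items[0], items[1:]
    # Case 1: `first` forms a cluster on its own.
    for p in partitions_into_k(rest, k - 1):
        yield [[first]] + p
    # Case 2: `first` joins one of the k clusters built from the rest.
    for p in partitions_into_k(rest, k):
        for idx in range(len(p)):
            yield p[:idx] + [[first] + p[idx]] + p[idx + 1:]

parts = list(partitions_into_k([1, 2, 3, 4, 5], 3))
by_size = {}
for p in parts:
    key = tuple(sorted(len(c) for c in p))
    by_size[key] = by_size.get(key, 0) + 1
print(len(parts), by_size[(1, 1, 3)], by_size[(1, 2, 2)])  # 25 10 15
```

The count grows very quickly with N, which is why the exhaustive method is impractical for large sample sets.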
【０１０８】
さらに、クラスタ数ａ＝Ｎ−３の場合には、重心法に基づいて、現在のクラスタ数ａ＝Ｎ−２個のクラスタリングがクラスタ数ａ＝Ｎ−３個へのクラスタリングに変更される。この場合、Ｎ−２個の重心の内、最も近い２つの重心を１つのクラスタにまとめるように、クラスタ数ａ＝Ｎ−３個へのクラスタリングを決め、上記の諸量を求める処理（数３１を用いた誤差の総和の算出処理）を繰り返せば、実用上差し支えのない程度に誤差の少ないクラスタ数ａ＝Ｎ−３個へのクラスタリングの誤差ＳＤ（ａ）＝ＳＤ（Ｎ−３）を得ることができる。
【０１０９】
ただし、上述した場合と同様に、Ｎ−３個へのクラスタリングを決める場合には、前回のクラスタリングの結果、すなわちＮ−２個へのクラスタリングの結果を用いずに、Ｎ個の標本をＮ−３個のクラスタに分けるすべての分け方について上記の手順にてＳＤ（ａ）＝ＳＤ（Ｎ−３）を算出し、その中で最も小さいものを真のＳＤ（ａ）とするようにしてもよい。
【０１１０】
このようにして、クラスタリングが繰り返され、クラスタ数ａ＝Ｎ−（Ｎ−２）＝２の場合には、重心法に基づいて、現在のクラスタ数ａ＝３個のクラスタリングが２個のクラスタリングに変更される。この場合、３個の重心の内、最も近い２つの重心を１つのクラスタにまとめるように、クラスタ数ａ＝２個へのクラスタリングを決め、上記の諸量を求める処理（数３１を用いた誤差の総和の算出処理）を繰り返せば、実用上差し支えのない程度に誤差の少ないクラスタ数（ａ＝２）へのクラスタリングの誤差ＳＤ（ａ）＝ＳＤ（２）を得ることができる。
【０１１１】
ただし、２個へのクラスタリングを決める場合には、前回のクラスタリングの結果、つまり、３個へのクラスタリングの結果を用いずに、Ｎ個の標本を２個のクラスタに分けるすべての分け方について、上記の手順によってＳＤ（ａ）＝ＳＤ（２）を算出し、その中で最も小さいものを真のＳＤ（ａ）とするようにしてもよい。
【０１１２】
最後に、ａ＝Ｎ−（Ｎ−１）＝１の場合には、重心法に基づいて、現在のクラスタ数ａ＝２個のクラスタリングをクラスタ数ａ＝１個へのクラスタリングに変える。この場合のクラスタリングは一意に決まる。統合された1つのクラスタの重心は２個の重心の中点となる。このクラスタリングについて、上記の諸量を求める処理（数３１を用いた誤差の総和の算出処理）を繰り返せば、最も誤差の少ないａ＝１個へのクラスタリングの誤差ＳＤ（ａ）＝ＳＤ（１）を得ることができる。
【０１１３】
上述のようにして得たａ個へのクラスタリングの誤差ＳＤ（ａ）（ただし、ａ＝Ｎ，Ｎ−１，…，２，１）を、横軸をクラスタ数ａとしてプロットする。この実施例では、Ｎ＝１８であり、前回のクラスタリングの結果を用いる方法で得たＳＤ（ａ）をプロットした一例を図５に示す。
【０１１４】
図５を参照して、最適なクラスタ数（最適クラスタ数）ａ_optは、クラスタ数ａをＮから１に向かって減少させたときに、急激にＳＤ（ａ）が増大（変化）する個所を含む変化前（直前）のクラスタ数ａ、または、それより大きなクラスタ数ａであって、許せる程度の大きさ（許容範囲）のＳＤ（ａ）を与えるクラスタ数ａに決定する。したがって、図５に示す例では、最適クラスタ数ａ_optは「３」である。
【０１１５】
ただし、この実施例では、許容範囲は最大誤差の２０パーセント未満としてあるが、設計者や使用者等によって自由に変更可能である。また、最適クラスタ数ａ_optは、急激にＳＤ（ａ）が変化する変化後（直後）のクラスタ数ａであって、許容範囲を超えないクラスタ数ａに決定するようにしてもよい。したがって、許容範囲の決め方によっては、数個所の最適クラスタ数ａ_optの候補が存在することとなり、その中から１つを決定する場合がある。
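One possible reading of this selection rule is sketched below: take the smallest number of clusters whose error SD(a) is still below the tolerance, here 20 percent of the maximum error. This is an interpretation, not the patent's definitive procedure, and it relies on SD(a) not increasing as a grows. The function name and the SD values are made up for illustration; only the maximum of about 55 and the tolerance of one fifth come from the text.

```python
def choose_optimal_clusters(errors, tolerance_ratio=0.2):
    """Return the smallest a whose SD(a) is below tolerance_ratio * max SD.

    errors: dict mapping number of clusters a to the clustering error SD(a).
    """
    limit = tolerance_ratio * max(errors.values())
    acceptable = [a for a, e in errors.items() if e < limit]
    if not acceptable:
        raise ValueError("no cluster count satisfies the tolerance")
    return min(acceptable)

# Hypothetical SD(a) values with an abrupt increase below a = 3.
sd = {1: 55.0, 2: 30.0, 3: 8.0, 4: 6.0, 5: 4.5, 6: 3.0, 7: 0.0}
a_opt = choose_optimal_clusters(sd)
```

With these illustrative values the tolerance is 11, so a = 3 is chosen; a designer could pass a different `tolerance_ratio` to widen or narrow the set of candidates.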
【０１１６】
具体的なクラスタリングの処理は、図６および図７のフロー図によって示される。図６を参照して、ユーザによってクラスタリングの指示が与えられると、ＣＰＵは処理を開始し、ステップＳ４１でクラスタリング指示に応じた４０９６個の標本値をデータベースから読み出す。続くステップＳ４３では、読み出した標本値すなわち（テクスチャ）特徴量（Ｄ，Ｖ，Ｐ）を正規化し、正規化した特徴量（正規化特徴量）ＺＤ，ＺＶおよびＺＰを求める。
【０１１７】
そして、ステップＳ４５では、正規化特徴量ＺＤ，ＺＶおよびＺＰを正規化特徴量空間すなわちＺＤ，ＺＶ，ＺＰ（３次元）空間上にプロットし、図８に示すような散布図を描く。
【０１１８】
以下、この実施例において、４０９６個の小領域ａ_iのテクスチャ特徴量Ｄ_i，Ｖ_i，Ｐ_iをまとめて（Ｄ，Ｖ，Ｐ）と表記することとする。同様に、正規化特徴量ＺＤ_i，ＺＶ_i，ＺＰ_iをまとめて（ＺＤ，ＺＶ，ＺＰ）と表記することにする。
【０１１９】
また、上述したステップＳ４５では、４０９６個の正規化特徴量の測定値（ＺＤ，ＺＶ，ＺＰ）が、ＺＤ，ＺＶ，ＺＰ空間上にプロットされる。ただし、この実施例では、簡単に説明するため、或る画像について、１８個の測定値（ＺＤ，ＺＶ，ＺＰ）がプロットされた例を図８に示してある。
【０１２０】
次にステップＳ４７では、クラスタ数ａを初期化（ａ＝Ｎ）する。続いて、ステップＳ４９では、重心法によりＮ個の全標本値をクラスタリングする。そして、ステップＳ５１でカウンタのカウント値ｉを初期化（ｉ＝１）して、ステップＳ５３でクラスタＣ（ａ，ｉ）の重心Ｇ（ａ，ｉ）を算出する。
【０１２１】
ステップＳ５５では、カウンタのカウント値ｋを初期化（ｋ＝１）し、ステップＳ５７では、クラスタＣ（ａ，ｉ）内の重心Ｇ（ａ，ｉ）と各標本点Ｓ（ａ，ｉ，ｋ）との距離ｄ（ａ，ｉ，ｋ）を算出する。
【０１２２】
次に図７に示すように、次のステップＳ５９では、ｋ＝ｂ（ａ，ｉ）かどうかを判断する。つまり、クラスタＣ（ａ，ｉ）における全標本点との距離ｄ（ａ，ｉ，ｋ）を算出したかどうかを判断する。
【０１２３】
ステップＳ５９で“ＮＯ”であれば、つまりｋ＝ｂ（ａ，ｉ）でなければ、クラスタＣ（ａ，ｉ）における全標本点との距離ｄ（ａ，ｉ，ｋ）を算出していないと判断し、ステップＳ６１でカウント値ｋをインクリメント（ｋ＝ｋ＋１）して、図６に示したステップＳ５７に戻る。
【０１２４】
一方、ステップＳ５９で“ＹＥＳ”であれば、つまりｋ＝ｂ（ａ，ｉ）であれば、クラスタＣ（ａ，ｉ）における全標本点との距離ｄ（ａ，ｉ，ｋ）を算出したと判断し、ステップＳ６３でｉ番目のクラスタＣ（ａ，ｉ）の誤差Ｄ（ａ，ｉ）を数３０に従って算出する。
【０１２５】
続くステップＳ６５では、カウント値ｉがクラスタ数ａと等しいかどうかを判断する。このステップＳ６５で“ＮＯ”であれば、つまりカウント値ｉがクラスタ数ａと等しくなければ、ステップＳ６７でカウンタをインクリメント（ｉ＝ｉ＋１）して、図６に示したステップＳ５３に戻る。
【０１２６】
一方、ステップＳ６５で“ＹＥＳ”であれば、つまりカウント値ｉがクラスタ数ａと等しければ、ステップＳ６９でａ個へのクラスタリングの誤差ＳＤ（ａ）を算出する。具体的には、数３１を用いて、ａ個のクラスタにクラスタリングしたことによるクラスタリングの誤差を求める。
【０１２７】
続いて、ステップＳ７１では、クラスタ数ａが１かどうかを判断する。このステップＳ７１で“ＮＯ”であれば、つまりクラスタ数ａが１でなければ、ステップＳ７３でクラスタ数ａをデクリメント（ａ＝ａ−１）し、重心法に基づいて、今回の（デクリメント後の）クラスタ数ａ（ａ個）のクラスタにクラスタリングしてから、図６に示したステップＳ５１に戻る。
【０１２８】
一方、ステップＳ７１で“ＹＥＳ”であれば、つまりクラスタ数ａが１であれば、ステップＳ７５で誤差ＳＤ(ａ)が急激に増大する直前のクラスタ数ａ、または、それより大きなクラスタ数ａで許容できる大きさ（許容範囲）における誤差ＳＤ（ａ）を与えるａを最適クラスタ数ａ_optに決定してから処理を終了する。
【０１２９】
ただし、この実施例では、誤差ＳＤ(ａ)の許容範囲は、最大誤差（約５５）の１／５（約１０）とし、誤差ＳＤ（ａ）が１０を超えない範囲で最適クラスタ数ａ_optが決定される。つまり、許容範囲は、プログラム（ソフト）の設計者や使用者が任意に設定できる値である。
【０１３０】
この実施例によれば、クラスタ数を最適（妥当）な値に決定してクラスタリングできるので、単なる統計として検証する際の信頼性が高く、その後の処理においてクラスタリング結果が悪影響を与えることはない。つまり、クラスタリング結果の利便性を向上させることができる。
【０１３１】
なお、上述の実施例では、クラスタ数ａを１個ずつ減らすようにしたが、全標本個数Ｎが膨大である場合には、クラスタ数ａを複数個（たとえば、１０個）ずつ減らすようにしてもよい。
【０１３２】
また、上述の実施例では、クラスタリングの手法として、ユークリッド距離に基づく重心法を適用した場合についてのみ説明したが、類似度を表す距離としてはユークリッド距離に限定する必要はない。たとえば、ユークリッド距離の２乗、マハラノビス距離、相関係数等を用いるようにしてもよい。
【０１３３】
さらに、クラスタ間の距離の決め方についても、重心法に限定する必要はなく、最短距離法、最長距離法、群平均法、メディアン法またはウォード法等を用いることも可能である。
【０１３４】
さらにまた、上述の実施例では、テクスチャ特徴量（標本値）を正規化するようにしたが、予め正規化されているデータが標本値である場合には正規化する必要はない。また、標本値はテクスチャ特徴量に限定される必要はなく、他の様々なデータを標本値にすることができる。
【０１３５】
また、上述の実施例では、小領域ａ_iに対する前処理としてＲ，Ｇ，ＢデータからＹデータを求めるようにしたが、Ｒ，Ｇ，Ｂデータから色相Ｈ，彩度Ｓ，明度Ｉ（または、Ｖと表記されることもある。）のデータすなわちＨＳＩ空間のデータに変換し、その結果である明度Ｉのデータを用いるようにしてもよい。
【０１３６】
さらに、上述の実施例では、２次元周波数成分の算出にあたり、２次元離散的フーリエ変換（ＤＦＴ）を用いたが、特に２次元高速フーリエ変換（ＦＦＴ）、或いは、２次元離散的コサイン変換（ＤＣＴ）を用いてもよい。
【図面の簡単な説明】
【図１】この発明の一実施例のクラスタリングで取り扱われる画像を示す図解図である。
【図２】テクスチャ特徴量の抽出処理を示すフロー図である。
【図３】テクスチャ特徴量の計算処理を示すフロー図である。
【図４】テクスチャ特徴量の計算処理において、減算データを２次元ＤＦＴして得られた複素数の絶対値を２次元周波数平面上に等高線表示した図解図である。
【図５】クラスタリング処理のクラスタ数に対する誤差の一例を示すグラフである。
【図６】クラスタリング処理の一部を示すフロー図である。
【図７】クラスタリング処理の他の一部を示すフロー図である。
【図８】テクスチャ特徴量の抽出処理で抽出したテクスチャ特徴量に従ってプロットした一例を示すグラフである。

[0001]
[Field of Industrial Application]
The present invention relates to a clustering method, and more particularly to a clustering method that clusters, for example, a plurality of sample points.
[0002]
[Prior art]
One conventional clustering method of this kind is the simple clustering method, described in Section 4.6.1 of “Image Engineering (Enlarged Edition): Electronics of Images (Television Society Textbook Series 1)” (Corona Publishing, published April 20, 2000).
[0003]
More specifically, to cluster n vectors (X_1, X_2, ..., X_n), an arbitrary vector, for example X_i, is first taken and set as the center Y_1 of the first cluster C_1 (Y_1 = X_i).
[0004]
Next, a vector X_j is taken and the distance D_{1,j} between Y_1 and X_j is obtained. If D_{1,j} > T, where T is a distance threshold, X_j is set as the center Y_2 of a second cluster C_2 (Y_2 = X_j). If D_{1,j} ≦ T, however, X_j ∈ C_1.
[0005]
Next, X_k is taken and the distances D_{1,k} and D_{2,k} from Y_1 and Y_2 to X_k are obtained. If D_{1,k} > T and D_{2,k} > T, X_k is set as the center Y_3 of a third cluster C_3 (Y_3 = X_k). If D_{1,k} ≦ T or D_{2,k} ≦ T, however, X_k is assigned to the cluster whose center is nearer.
[0006]
Clustering is completed by executing such processing for all vectors.
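The conventional procedure of paragraphs [0003] to [0006] can be sketched in Python as follows. This is a minimal illustration under two assumptions that the text does not state: the vectors are processed in the order given, and a cluster center stays fixed at the first vector assigned to it. The function name `simple_clustering` is hypothetical.

```python
import math

def simple_clustering(vectors, threshold):
    """Threshold-based sequential clustering.

    Returns (centers, labels), where labels[i] is the index of the cluster
    that vectors[i] was assigned to.
    """
    centers = []
    labels = []
    for x in vectors:
        dists = [math.dist(x, c) for c in centers]
        if not dists or min(dists) > threshold:
            # Farther than T from every existing center: start a new cluster.
            centers.append(tuple(x))
            labels.append(len(centers) - 1)
        else:
            # Within T of at least one center: join the nearest cluster.
            labels.append(dists.index(min(dists)))
    return centers, labels

points = [(0.0, 0.0), (0.5, 0.0), (5.0, 5.0), (5.2, 4.9), (0.1, 0.3)]
centers, labels = simple_clustering(points, threshold=1.0)
```

The number of clusters here depends entirely on the threshold T and on the input order, which is the weakness the invention addresses.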
[0007]
[Problems to be solved by the invention]
However, with the conventional method it is not clear whether the determined number of clusters is appropriate. Because a person judges its validity from the clustering result, the resulting number of clusters is sometimes undesirable, which has hindered subsequent analysis and other processing.
[0008]
Therefore, a main object of the present invention is to provide a clustering method that can improve the convenience of the clustering result.
[0009]
[Means for Solving the Problems]
A first invention is a clustering program for clustering a plurality of sample points, the program causing a computer to function as: distance detection means for detecting, for each of all the sample points, the Euclidean distance between every sample point in each cluster and the representative point that is the centroid of that cluster; first sum value calculation means for obtaining a first sum value of the Euclidean distances detected by the distance detection means; second sum value calculation means for obtaining a second sum value by summing, over all clusters, the first sum values obtained by the first sum value calculation means; optimum cluster number determination means for determining, as the optimum number of clusters, the number of clusters corresponding to one of the second sum values at the several points before and after the point where the second sum value obtained by the second sum value calculation means changes abruptly when the number of clusters is varied from its maximum value toward its minimum value; and clustering means for clustering into the optimum number of clusters determined by the optimum cluster number determination means.
[0010]
A second invention is a clustering program for clustering a plurality of sample points, the program causing a computer to function as: distance detection means for detecting, for each of all the sample points, the Euclidean distance between every sample point in each cluster and the representative point that is the centroid of that cluster; first sum value calculation means for obtaining a first sum value of the Euclidean distances detected by the distance detection means; second sum value calculation means for obtaining a second sum value by summing, over all clusters, the first sum values obtained by the first sum value calculation means; second sum value recalculation means for, when the number of clusters is decreased n at a time (n being a natural number) from its maximum value toward its minimum value, determining the clustering with n fewer clusters by repeating n times the operation of merging into one cluster the two clusters with the shortest inter-cluster distance in the previous clustering result, and obtaining the second sum value for that clustering; optimum cluster number determination means for determining, as the optimum number of clusters, the number of clusters corresponding to one of the second sum values at the several points before and after the point where the second sum value obtained by the second sum value recalculation means changes abruptly; and clustering means for clustering into the optimum number of clusters determined by the optimum cluster number determination means.
[0011]
A third invention is a clustering program for clustering a plurality of sample points, the program causing a computer to function as: distance detection means for detecting, for each of all the sample points, the Euclidean distance between every sample point in each cluster and the representative point that is the centroid of that cluster; first sum value calculation means for obtaining a first sum value of the Euclidean distances detected by the distance detection means; second sum value calculation means for obtaining a second sum value by summing, over all clusters, the first sum values obtained by the first sum value calculation means; second sum value recalculation means for obtaining the second sum value for every way of dividing the sample points into each number of clusters when the number of clusters is varied from its maximum value toward its minimum value; third sum value calculation means for calculating, as a third sum value, the smallest of the plurality of second sum values obtained by the second sum value recalculation means; optimum cluster number determination means for determining, as the optimum number of clusters, the number of clusters corresponding to the third sum value either before or after the point where the third sum value calculated by the third sum value calculation means changes abruptly when the number of clusters is varied from its maximum value toward its minimum value; and clustering means for clustering into the optimum number of clusters determined by the optimum cluster number determination means.
[0012]
[Action]
In the first invention, the centroid method is used for clustering. First, a first sum value is obtained by summing the distance between each sample point in a cluster and the representative point (centroid) of the cluster over all sample points belonging to that cluster. The first sum value is obtained for each cluster, and a second sum value is obtained by summing them. For example, when the number of clusters is varied from the maximum value toward the minimum value, the number of clusters corresponding to the second sum value either before or after the point where the second sum value changes abruptly is determined as the optimum number of clusters. The sample points are then clustered into the determined optimum number of clusters.
[0013]
The second invention is almost the same as the first invention, but when the number of clusters is decreased n at a time (n being a natural number) from the maximum value toward the minimum value, the operation of merging into one the two clusters with the shortest inter-cluster distance in the previous result is repeated n times, and the second sum value is obtained for the clustering with n fewer clusters. The number of clusters corresponding to one of the second sum values at the several points before and after the point where the second sum value changes abruptly is then determined as the optimum number of clusters. Because the clustering result for the previous number of clusters can be reused, the computational load is reduced.
[0014]
The third invention is also almost the same as the first invention, but when the number of clusters is reduced by n, the second sum value is obtained for every clustering with n fewer clusters, and the minimum among them is taken as the third sum value. When the number of clusters is varied from the maximum value toward the minimum value, the number of clusters corresponding to the third sum value either before or after the point where the third sum value changes abruptly is determined as the optimum number of clusters. A more accurate (appropriate) number of clusters can therefore be determined than with the inventions described above.
[0015]
[Effects of the Invention]
According to the present invention, clustering to the optimum number of clusters is possible, so the reliability of the clustering result is high, and convenience in subsequent processing can be improved.
[0016]
The above object, other objects, features and advantages of the present invention will become more apparent from the following detailed description of embodiments with reference to the drawings.
[0017]
[Embodiment]
Although not shown, the clustering system of this embodiment includes a computer such as a personal computer or a workstation and a database.
[0018]
The database may be provided inside the computer, or may be externally connected so as to be communicable with the computer directly or via the Internet.
[0019]
The database is an image database that stores (accumulates) image data of various images (natural images), such as photographs taken from an aircraft or a satellite (aerial photographs, satellite photographs) or paintings. Result data, that is, the result of obtaining (calculating) the texture feature quantities (D, V, and P, described in detail later) for a plurality of image data, is also stored in association with the corresponding image data.
[0020]
As shown in FIG. 1, the image size in this embodiment is 2048 pixels × 2048 pixels. The unit (image region) for judging the flatness of the texture is 32 pixels × 32 pixels in this embodiment and is called a small region a_i. The number of small regions a_i in this embodiment is therefore (2048 / 32)^2 = 4096.
[0021]
For example, when a user inputs an instruction to extract sample points (texture feature quantities), the computer, in practice a CPU (not shown) provided inside the computer, executes the feature quantity acquisition processing, which acquires (extracts) the texture feature quantities according to the flowchart shown in FIG. 2, and obtains the result data (sample points) described above.
[0022]
Referring to FIG. 2, when the CPU of the computer starts processing in response to an instruction from the user, in step S1 it reads the image data corresponding to the desired image whose feature quantities are to be acquired, in accordance with the user's instruction. In the following step S3, the read image is divided into small regions a_i. As described above, each small region a_i is 32 pixels × 32 pixels, so the image is divided into 4096 small regions a_i.
[0023]
Subsequently, in step S5 the CPU initializes the count value of a counter (i = 1). In step S7, as described in detail later, it calculates the texture feature quantities D (first feature quantity), V (second feature quantity), and P (third feature quantity) for the i-th small region a_i, and in step S9 it determines whether the count value i is 4096, that is, whether the texture feature quantities have been calculated for all small regions a_i.
[0024]
If “NO” in step S9, that is, if the count value i is not 4096, the CPU determines that the calculation of the texture feature quantities has not finished for all small regions a_i, increments the count value i (i = i + 1) in step S11, and returns to step S7. If “YES” in step S9, that is, if the count value i is 4096, it determines that the calculation has finished for all small regions a_i and proceeds to step S13. In step S13, the texture feature quantities (D, V, P) of the 4096 small regions a_i are stored in the database in association with the image data.
[0025]
FIG. 3 is a flowchart showing the calculation of the texture feature quantities of a small region a_i in step S7 of FIG. 2. When the calculation starts, the CPU converts the image data of the small region a_i from R, G, B data to Y data in step S21. For example, if the pixel data at position (n, m) in the small region a_i is denoted q_i(n, m), each pixel has three pixel values, R, G, and B, and can therefore be written as in Equation 1.
[0026]
[Equation 1]

[0027]
Here, n and m are natural numbers satisfying 1 ≦ n ≦ 32 and 1 ≦ m ≦ 32, respectively. The position, that is, the coordinate system (n, m), is the usual coordinate system for images (matrix coordinate system): the horizontal position n is positive to the right and the vertical position m is positive downward. The upper left of the image of the small region a_i is therefore (1, 1) and the lower right is (32, 32).
[0028]
In the subsequent processing the texture is handled as a grayscale image, so the CPU converts the R, G, B data of each pixel into Y data according to Equation 2.
[0029]
[Equation 2]

[0030]
In the subsequent processing the average brightness (average luminance) of the texture is not of interest, so in the next step S23 the average luminance of the small region a_i is removed from the Y data.
[0031]
First, the average luminance of the small region a_i is obtained according to Equation 3.
[0032]
[Equation 3]

[0033]
Next, the subtraction value (subtraction data) YT_i(n, m), obtained by subtracting the average luminance of the small region a_i from the luminance value Y_i(n, m) of each pixel of the small region a_i, is obtained according to Equation 4.
[0034]
[Equation 4]

[0035]
where 1 ≦ n ≦ 32 and 1 ≦ m ≦ 32.
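Steps S21 and S23 can be illustrated with the short Python sketch below. The coefficients of Equation 2 are not reproduced in this text, so the common ITU-R BT.601 luma weights (0.299, 0.587, 0.114) are used purely as an assumption. The function names and the 2 × 2 sample block are hypothetical, and a real small region would be 32 × 32.

```python
def to_luminance(r, g, b, weights=(0.299, 0.587, 0.114)):
    """Convert one RGB pixel to luminance Y (weights are an assumed choice)."""
    return weights[0] * r + weights[1] * g + weights[2] * b

def remove_mean(y_block):
    """Subtract the block's average luminance from every pixel (YT = Y - mean)."""
    n_pixels = sum(len(row) for row in y_block)
    mean = sum(sum(row) for row in y_block) / n_pixels
    return [[value - mean for value in row] for row in y_block]

rgb_block = [[(10, 20, 30), (40, 50, 60)],
             [(70, 80, 90), (100, 110, 120)]]
y_block = [[to_luminance(*pixel) for pixel in row] for row in rgb_block]
yt_block = remove_mean(y_block)
```

After mean removal the values of `yt_block` sum to zero, so the zero-frequency component of the later Fourier transform vanishes.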
Next, in step S25, the subtraction data YT_i(n, m) of the small region a_i is subjected to a two-dimensional discrete Fourier transform (two-dimensional DFT). That is, the complex number F_i(u, v) is calculated according to Equation 5.
[0036]
[Equation 5]

[0037]
Here, j is the imaginary unit, and u and v are integers satisfying −16 ≦ u ≦ 16 and −16 ≦ v ≦ 16, respectively. The complex number F_i(u, v) represents the frequency component, at the two-dimensional frequency (u, v), of the 32 pixel × 32 pixel subtraction data YT_i(n, m).
[0038]
Here, the unit of u is [cpw: cycles per picture width] and the unit of v is [cph: cycles per picture height].
[0039]
In general, when a physical quantity is represented by a complex number, its magnitude or strength (intensity) can be represented by the magnitude of the complex number, that is, its absolute value. In the following step S27, the CPU therefore takes (calculates) the absolute value of the complex number F_i(u, v) obtained by Equation 5 on the two-dimensional frequency plane. Specifically, the absolute value of F_i(u, v), written W_i(u, v), is taken according to Equation 6 and used as the intensity of the two-dimensional frequency component F_i(u, v).
[0040]
[Equation 6]

[0041]
However, as described above, u and v are integers that satisfy −16 ≦ u ≦ 16 and −16 ≦ v ≦ 16, respectively.
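As a rough illustration of steps S23 to S27, the following Python sketch removes the mean luminance of a block, evaluates a two-dimensional DFT directly, and takes the magnitudes. The equations themselves are not reproduced in this text, so the DFT convention used here (no normalization factor, negative exponent, n as the horizontal pixel index) is an assumption, and the function names `remove_mean` and `dft_magnitude` are hypothetical. A small 4 × 4 block is used to keep the direct summation cheap.

```python
import cmath


def remove_mean(y):
    """Subtract the block's average luminance from every pixel."""
    size = len(y)
    mean = sum(sum(row) for row in y) / (size * size)
    return [[value - mean for value in row] for row in y]


def dft_magnitude(yt):
    """Return {(u, v): |F(u, v)|} for -size/2 <= u, v <= size/2.

    Assumed convention (not taken from the source):
    F(u, v) = sum over n, m of YT(n, m) * exp(-2*pi*j*(u*n + v*m) / size),
    with n the horizontal and m the vertical pixel index, both from 1 to size.
    """
    size = len(yt)
    half = size // 2
    w = {}
    for u in range(-half, half + 1):
        for v in range(-half, half + 1):
            total = 0j
            for m in range(1, size + 1):
                for n in range(1, size + 1):
                    phase = -2 * cmath.pi * (u * n + v * m) / size
                    total += yt[m - 1][n - 1] * cmath.exp(1j * phase)
            w[(u, v)] = abs(total)
    return w


# Vertical stripes: the luminance alternates every pixel in the horizontal direction.
block = [[10, 20, 10, 20] for _ in range(4)]
w = dft_magnitude(remove_mean(block))
```

With the mean removed, the intensity at the origin vanishes and the energy of this striped block sits at (u, v) = (±2, 0). For a 32 × 32 block the same direct summation works but is slow in pure Python.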
[0042]
Also, in the two-dimensional frequency coordinate system (u, v) corresponding to the coordinate system (n, m), the horizontal frequency u is positive to the right and the vertical frequency v is positive downward. FIG. 4 shows an example in which the two-dimensional frequency intensity distribution W_i(u, v) is displayed as contour lines on this two-dimensional frequency plane. In FIG. 4, the intensity of the frequency distribution is represented in gray scale: the closer to black, the weaker the intensity, and the closer to white, the stronger the intensity.
[0043]
Subsequently, in step S29, the intensity distribution W_i(u, v) is regarded as a load distribution on the two-dimensional frequency plane, and the position of its center of gravity is obtained. However, as can be seen from FIG. 4, the load distribution (intensity distribution) is point-symmetric about the origin, so the center of gravity of the whole plane would always be the origin. To avoid this, in this embodiment the position (u_iG, v_iG) of the center of gravity G_i is obtained according to Equation 7 for the right half region of the frequency plane (horizontal frequency u in the range of 0 to +16 cpw and vertical frequency v in the range of −16 to +16 cph).
[0044]
[Expression 7]

[0045]
Here, SW_i represents the total load in the right half region of the frequency intensity distribution of the small area a_i, and is given by Equation 8. A measured example of the center-of-gravity position is indicated by the point P in FIG. 4.
[0046]
[Equation 8]

[0047]
In this embodiment, the center-of-gravity position is obtained in the right half region of the frequency plane, but is not limited to this, and may be obtained in the left half region, upper half region, or lower half region of the frequency plane.
[0048]
If computational complexity is not a concern, the center-of-gravity position may also be obtained in either of the two regions produced by dividing the frequency plane with any straight line passing through the origin, not only in the upper, lower, left, or right half region.
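The center-of-gravity calculation over the right half region can be sketched as follows. This is a minimal illustration rather than the patented implementation: the intensity distribution is assumed to be a dictionary keyed by (u, v), the function name `right_half_centroid` is hypothetical, and the column u = 0 is included because the text gives the range of u as 0 to +16.

```python
def right_half_centroid(w, half=16):
    """Return (u_g, v_g, sw) for the region 0 <= u <= half, -half <= v <= half.

    w maps (u, v) to a non-negative intensity; missing keys count as zero load.
    sw is the total load in the region.
    """
    sw = 0.0
    su = 0.0
    sv = 0.0
    for u in range(0, half + 1):
        for v in range(-half, half + 1):
            load = w.get((u, v), 0.0)
            sw += load
            su += u * load
            sv += v * load
    if sw == 0:
        raise ValueError("no load in the right half region")
    return su / sw, sv / sw, sw


# Hypothetical point-symmetric distribution; only the u >= 0 half is used.
example = {(1, 1): 1.0, (3, -1): 3.0, (-1, -1): 1.0, (-3, 1): 3.0}
u_g, v_g, sw = right_half_centroid(example, half=3)
```

Over the full plane this symmetric example would have its center of gravity at the origin, which is the situation the half-plane restriction avoids.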
[0049]
Next, since the direction of the texture is not of interest in this embodiment, in step S31 the CPU uses not the center-of-gravity position itself but the distance D_i from the origin to the center-of-gravity position as the first feature amount of the texture of the small area a_i (texture feature amount D), and calculates it by Equation 9.
[0050]
[Equation 9]

[0051]
Subsequently, in step S33, the variance V_i around the center of gravity of the frequency intensity distribution in the right half region of the frequency plane described above is obtained by Equation 10 as the second feature amount of the texture of the small area a_i (texture feature amount V). Here, the intensity distribution in this frequency range is the complete distribution for the small area a_i rather than a sample drawn from it, so there is no need to find the unbiased variance. Therefore, in Equation 10 the denominator is SW_i rather than SW_i − 1.
[0052]
[Expression 10]

[0053]
Here, regarding the texture of the small area, when the horizontal frequency u in the right half region is taken as the first variable and the vertical frequency v as the second variable, the two variables u and v can be converted into composite variables, namely the first principal component z_i1 and the second principal component z_i2, as follows.
[0054]
First, the variance V_iuu of the first variable u (the variance due to the load distributed along the horizontal frequency u), the variance V_ivv of the second variable v (the variance due to the load distributed along the vertical frequency v), and the covariance V_iuv of the first variable u and the second variable v are obtained by Equations 11 to 13, respectively.
[0055]
[Expression 11]

[0056]
[Expression 12]

[0057]
[Formula 13]

[0058]
Here, the variance-covariance matrix (real symmetric matrix) V_i having these variances and the covariance as elements is shown in Equation 14.
[0059]
[Expression 14]

[0060]
The eigenvalues λ_i1 and λ_i2 (where λ_i1 ≧ λ_i2 ≧ 0) of this real symmetric matrix V_i, and the eigenvectors l_i1 and l_i2 corresponding to them, are obtained as follows. Specifically, the eigenvalues λ_i1 and λ_i2 are obtained as the roots of Equation 15 for V_i. The eigenvectors l_i1 and l_i2 are obtained by solving Equation 16.
[0061]
[Expression 15]

[0062]
[Expression 16]

[0063]
Next, the first principal component z_i1 and the second principal component z_i2, which are the new variables, are linear combinations of u and v and are each defined by Equation 17.
[0064]
[Expression 17]

[0065]
Here, the eigenvectors l_i1 and l_i2 are each expressed as in Equation 18.
[0066]
[Formula 18]

[0067]
Based on the above, in step S35 the CPU obtains the contribution ratio P_i of the first principal component in the right half region of the two-dimensional frequency plane as the third feature amount of the small area a_i (texture feature amount P). The contribution ratio P_i of the first principal component can be calculated by Equation 19.
[0068]
[Equation 19]

[0069]
Here, the first eigenvalue λ_i1 equals the variance V_im of the first principal component z_i1, and the second eigenvalue λ_i2 equals the variance V_is of the second principal component z_i2. Since the total number of variables in this example is two, the sum of the variance V_im of the first principal component z_i1 and the variance V_is of the second principal component z_i2 equals the total variance V_i, that is, the variance around the center of gravity. The contribution ratio P_i can therefore also be regarded as the proportion of the total variance V_i accounted for by the variance V_im of the first principal component z_i1, as expressed in Equation 20.
[0070]
[Expression 20]

[0071]
The coordinate axis of the first principal component z_i1 is the axis with the greatest variance among the axes passing through the center of gravity. Measured examples of the first principal component axis and the second principal component axis are indicated by a solid line and a dotted line in FIG. 4, respectively.
[0072]
As described above, the texture feature amounts of the small area a_i, namely the distance D_i, the variance V_i, and the contribution ratio P_i of the first principal component, are obtained through the processing of steps S31, S33, and S35. These are stored in the database as sample values as described above.
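The three feature amounts can be computed together from an intensity distribution, as in the Python sketch below. It follows the description (distance of the center of gravity from the origin, load-weighted variance with the total load as denominator, and the share of the larger eigenvalue of the 2 × 2 variance-covariance matrix), but the function name `texture_features`, the dictionary input, and the choice to return P as a fraction rather than a percentage are assumptions.

```python
import math


def texture_features(w, half=16):
    """Return (D, V, P) for w = {(u, v): intensity} over 0 <= u <= half."""
    cells = [(u, v, w.get((u, v), 0.0))
             for u in range(0, half + 1)
             for v in range(-half, half + 1)]
    sw = sum(load for _, _, load in cells)            # total load
    u_g = sum(u * load for u, _, load in cells) / sw  # center of gravity
    v_g = sum(v * load for _, v, load in cells) / sw
    d = math.hypot(u_g, v_g)                          # feature D

    # Load-weighted variances and covariance around the center of gravity.
    v_uu = sum(load * (u - u_g) ** 2 for u, _, load in cells) / sw
    v_vv = sum(load * (v - v_g) ** 2 for _, v, load in cells) / sw
    v_uv = sum(load * (u - u_g) * (v - v_g) for u, v, load in cells) / sw
    total = v_uu + v_vv                               # feature V

    # Larger eigenvalue of [[v_uu, v_uv], [v_uv, v_vv]] in closed form.
    root = math.sqrt((v_uu - v_vv) ** 2 + 4 * v_uv ** 2)
    lam1 = (total + root) / 2
    p = lam1 / total if total > 0 else float("nan")   # feature P as a fraction
    return d, total, p


# All load on the diagonal: the first principal component explains everything.
d_line, v_line, p_line = texture_features({(1, 1): 1.0, (3, 3): 1.0}, half=3)
# Load spread equally along u and v: both components contribute equally.
d_iso, v_iso, p_iso = texture_features(
    {(1, 0): 1.0, (3, 0): 1.0, (2, 1): 1.0, (2, -1): 1.0}, half=3)
```

The two toy distributions show the extremes of P: a purely directional spectrum gives P = 1, while an isotropic one gives P = 0.5.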
[0073]
For example, the result data as sample values are clustered to the optimum number of clusters, and the clustering result can be used for subsequent processing such as discrimination of texture flatness.
[0074]
However, the texture feature amounts described above are obtained individually and have different units. Therefore, at least at the start of clustering, the three feature amounts need to be normalized. Specifically, the three feature amounts D_i, V_i, and P_i are each normalized so that the mean over the 4096 measurement data is 0 and the variance is 1. In the following, the normalized feature amounts are written as ZD_i, ZV_i, and ZP_i, respectively.
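Normalization to zero mean and unit variance can be sketched in a few lines of Python. The helper name `normalize` is hypothetical, and the population standard deviation (dividing by the number of values) is assumed here because it makes the variance of the result exactly 1; the same function would be applied separately to the D, V, and P values.

```python
import math


def normalize(values):
    """Return the values shifted and scaled to mean 0 and variance 1."""
    count = len(values)
    mean = sum(values) / count
    # Population standard deviation (assumption: divide by count, not count - 1).
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / count)
    if std == 0:
        raise ValueError("all values are identical; cannot normalize")
    return [(x - mean) / std for x in values]


# Hypothetical feature values with mean 5 and standard deviation 2.
z = normalize([2, 4, 4, 4, 5, 5, 7, 9])
```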
[0075]
First, the texture feature amount D_i is normalized according to Equation 21.
[0076]
[Expression 21]

[0077]
Here, D_M and σ_D are the mean and the standard deviation of D_i, respectively, and are calculated by Equations 22 and 23.
[0078]
[Expression 22]

[0079]
[Expression 23]

[0080]
The texture feature amount V_i is likewise normalized according to Equation 24.
[0081]
[Expression 24]

[0082]
Here, V_M and σ_V are the mean and the standard deviation of V_i, respectively, and are calculated by Equations 25 and 26.
[0083]
[Expression 25]

[0084]
[Equation 26]

[0085]
Furthermore, the texture feature amount P_i is normalized according to Equation 27.
[0086]
[Expression 27]

[0087]
Here, P_M and σ_P are the mean and the standard deviation of P_i, respectively, and are calculated by Equations 28 and 29.
[0088]
[Expression 28]

[0089]
[Expression 29]

[0090]
Subsequently, clustering will be briefly described. When the number of clusters a is N (the same number as the total number of samples), clustering is uniquely determined by clustering based on, for example, the centroid method as a method of determining the distance between clusters.
[0091]
The i-th cluster when the number of clusters is a (where a = N, N−1, ..., 2, 1) is denoted C(a, i), where i = 1, 2, ..., a. The number of sample points in the cluster C(a, i) is written b(a, i), and the k-th sample point in the cluster C(a, i) is written S(a, i, k), where k = 1, 2, ..., b(a, i). Further, let G(a, i) be the center of gravity of the cluster C(a, i), and let d(a, i, k) be the distance between the center of gravity G(a, i) and the sample point S(a, i, k). Then the clustering error D(a, i) of the cluster C(a, i), caused by representing each sample point in the cluster by the center of gravity G(a, i), is expressed by Equation 30.
[0092]
[Equation 30]
D(a, i) = \sum_{k=1}^{b(a, i)} d(a, i, k)
[0093]
In the following, for simplicity, this D (a, i) will be referred to as an error of the i-th cluster C (a, i).
[0094]
When the number of clusters is at its initial value a = N, each cluster C(N, i) (where i = 1, 2, ..., N) contains only one sample point, so the sample point in each cluster coincides with the center of gravity G(N, i) of that cluster. Since b(N, i) = 1, k = 1, and d(N, i, k) = 0, it follows that D(N, i) = 0. That is, when a = N, the error is zero for every cluster C(a, i).
[0095]
Further, when the number of clusters is a, the total SD(a) obtained by summing the error D(a, i) of the i-th cluster C(a, i) over all clusters (i = 1, 2, ..., a) is calculated by Equation 31.
[0096]
[Equation 31]
SD(a) = \sum_{i=1}^{a} D(a, i)
[0097]
That is, SD(a) can be regarded as the clustering error caused by clustering the N sample points into a clusters. Hereinafter, in this embodiment, SD(a) is referred to as the error of clustering into a clusters. For the initial clustering with a = N, the error D(a, i) of each cluster is 0 as described above, so SD(a) = SD(N) = 0. That is, when the number of clusters a = N, the clustering error is zero.
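The two error quantities can be written directly from their definitions: the error of one cluster is the sum of the Euclidean distances from its sample points to its center of gravity, and the clustering error is the sum of those values over all clusters. The Python sketch below assumes clusters are lists of coordinate tuples; the function names are hypothetical.

```python
import math


def centroid(points):
    """Center of gravity G of a cluster given as a list of coordinate tuples."""
    dim = len(points[0])
    return tuple(sum(p[k] for p in points) / len(points) for k in range(dim))


def cluster_error(points):
    """D(a, i): sum of Euclidean distances from each sample point to the centroid."""
    g = centroid(points)
    return sum(math.dist(p, g) for p in points)


def clustering_error(clusters):
    """SD(a): sum of the cluster errors over all clusters."""
    return sum(cluster_error(c) for c in clusters)


# One cluster per sample point (a = N): every distance is zero.
singletons = [[(0.0, 0.0, 0.0)], [(2.0, 0.0, 0.0)], [(5.0, 5.0, 5.0)]]
sd_n = clustering_error(singletons)

# After merging the first two points, each lies at distance 1 from the centroid.
merged = [[(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)], [(5.0, 5.0, 5.0)]]
sd_n_minus_1 = clustering_error(merged)
```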
[0098]
Next, when the N sample points are clustered into a = N−1 clusters based on the centroid method, the two nearest of the N sample points are combined into one cluster and the remaining clusters are left as they were for a = N, which determines the clustering into N−1 clusters. Repeating the above-described processing for obtaining the various quantities (calculating the total error using Equation 31) on this clustering gives the error SD(a) = SD(N−1) for clustering into N−1 clusters, which is the smallest possible error for that number of clusters.
[0099]
Subsequently, when the cluster number a = N−2, the clustering of the current cluster number a = N−1 is changed to clustering to the cluster number a = N−2 based on the centroid method. In this case, the clustering to cluster number a = N−2 is determined so that the nearest two centroids out of N−1 centroids are combined into one cluster.
[0100]
Such clustering can be carried out with commercially available statistical analysis software or with a self-written program, and it is also easy to automate. The same applies to the clustering described below. Commercially available statistical processing software that can be used includes "SPSS" manufactured by SPSS Inc. (KK) and "STATISTICA" manufactured by StatSoft.
[0101]
Then, repeating the above-described processing for obtaining the various quantities (calculating the total error using Equation 31) on the clustering so determined gives the error SD(a) = SD(N−2) for clustering into N−2 clusters, with an error small enough to pose no practical problem.
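One reduction step of the centroid method, as described above, merges the two clusters whose centers of gravity are closest. The Python sketch below is only an illustration: the function name `merge_nearest` is hypothetical, ties are broken by taking the first closest pair found, and the centroid of a merged cluster is taken as the mean of all its member points, which is an assumption of this sketch.

```python
import math
from itertools import combinations


def centroid(points):
    """Mean of the member points of a cluster (list of coordinate tuples)."""
    dim = len(points[0])
    return tuple(sum(p[k] for p in points) / len(points) for k in range(dim))


def merge_nearest(clusters):
    """Return a new clustering with the two nearest-centroid clusters merged."""
    centroids = [centroid(c) for c in clusters]
    i, j = min(
        combinations(range(len(clusters)), 2),
        key=lambda pair: math.dist(centroids[pair[0]], centroids[pair[1]]),
    )
    merged = clusters[i] + clusters[j]
    rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
    return rest + [merged]


# Four hypothetical sample points, one per cluster (a = N = 4).
step_n = [[(0.0, 0.0)], [(1.0, 0.0)], [(10.0, 0.0)], [(10.0, 3.0)]]
step_n_minus_1 = merge_nearest(step_n)          # a = 3
step_n_minus_2 = merge_nearest(step_n_minus_1)  # a = 2
```

Each call reuses the previous clustering, so reducing from N clusters to 1 takes N − 1 merges rather than an exhaustive search.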
[0102]
Alternatively, when determining the clustering into N−2 clusters, SD(a) = SD(N−2) may be calculated by the above procedure for every way of dividing the N samples into N−2 clusters, without using the previous clustering result (the clustering into N−1 clusters), and the smallest of these values may be taken as the true SD(a).
[0103]
The latter method evaluates SD(a) more accurately. However, when the total number of samples N is very large, it is no longer a realistic method, although this depends on the processing capability of the computer. In such a case, the former method should be used to obtain SD(a) with an error small enough to be acceptable in practice.
[0104]
Here, all the ways of dividing into clusters will be specifically described. For example, all the ways of dividing (clustering) 5 (N = 5) sample points into 3 clusters can be divided as follows. However, five sample points (elements) are represented by numbers 1 to 5.
[0105]
(i) When the elements are divided into three clusters containing 3, 1, and 1 elements, the combinations of clusters are as follows. The number of combinations is 10 (₅C₃).
[0106]
(1, 2, 3), (4), (5)
(1, 2, 4), (3), (5)
(1,2,5), (3), (4)
(1, 3, 4), (2), (5)
(1, 3, 5), (2), (4)
(1, 4, 5), (2), (3)
(2, 3, 4), (1), (5)
(2, 3, 5), (1), (4)
(2, 4, 5), (1), (3)
(3,4,5), (1), (2)
(ii) When the elements are divided into three clusters containing 2, 2, and 1 elements, the combinations of clusters are as follows.
[0107]
(1,2), (3,4), (5)
(1,3), (2,4), (5)
(1,4), (2,3), (5)
(1,2), (3,5), (4)
(1,3), (2,5), (4)
(1,5), (2,3), (4)
(1,2), (4,5), (3)
(1,4), (2,5), (3)
(1,5), (2,4), (3)
(1,3), (4,5), (2)
(1,4), (3,5), (2)
(1,5), (3,4), (2)
(2,3), (4,5), (1)
(2,4), (3,5), (1)
(2,5), (3,4), (1)
As shown in (i) and (ii), there are 25 ways of clustering five sample points into three clusters. Because the number of ways of dividing is this large, the processing load on the computer (CPU) may be reduced by clustering using the previous result, as described above.
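The count of 25 can be checked by enumerating every division of five elements into three non-empty clusters. The recursive generator below is a small Python sketch written for this check; the name `partitions` is hypothetical.

```python
from collections import Counter


def partitions(items, k):
    """Yield every split of items into exactly k non-empty, unordered clusters."""
    if k == 0:
        if not items:
            yield []
        return
    if len(items) < k:
        return
    first, rest = items[0], items[1:]
    # Case 1: the first element forms a cluster on its own.
    for split in partitions(rest, k - 1):
        yield [[first]] + split
    # Case 2: the first element joins one of the k clusters of the rest.
    for split in partitions(rest, k):
        for idx in range(len(split)):
            yield split[:idx] + [[first] + split[idx]] + split[idx + 1:]


all_splits = list(partitions([1, 2, 3, 4, 5], 3))
# Group the splits by their cluster sizes, largest first.
by_sizes = Counter(
    tuple(sorted((len(c) for c in split), reverse=True)) for split in all_splits
)
```

Here `by_sizes` separates the 10 divisions of sizes (3, 1, 1) from the 15 divisions of sizes (2, 2, 1). The total grows very quickly with the number of samples, which is why the exhaustive alternative becomes impractical for large N.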
[0108]
Further, when the number of clusters is a = N−3, the current clustering into N−2 clusters is changed to a clustering into N−3 clusters based on the centroid method. In this case, the clustering into N−3 clusters is determined by combining the two nearest of the N−2 centers of gravity into one cluster. Repeating the above-described processing for obtaining the various quantities (calculating the total error using Equation 31) then gives the error SD(a) = SD(N−3) for clustering into N−3 clusters, with an error small enough to pose no practical problem.
[0109]
As in the case described above, however, when determining the clustering into N−3 clusters, SD(a) = SD(N−3) may instead be calculated by the above procedure for every way of dividing the N samples into N−3 clusters, without using the previous clustering result (the clustering into N−2 clusters), and the smallest of these values may be taken as the true SD(a).
[0110]
When clustering is repeated in this way until a = N − (N−2) = 2, the current clustering into three clusters is changed to a clustering into two clusters based on the centroid method. In this case, the clustering into two clusters is determined by combining the two nearest of the three centers of gravity into one cluster. Repeating the above-described processing for obtaining the various quantities (calculating the total error using Equation 31) then gives the error SD(a) = SD(2) for clustering into two clusters, with an error small enough to pose no practical problem.
[0111]
Here too, when determining the clustering into two clusters, SD(a) = SD(2) may instead be calculated by the above procedure for every way of dividing the N samples into two clusters, without using the previous clustering result (the clustering into three clusters), and the smallest of these values may be taken as the true SD(a).
[0112]
Finally, when a = N − (N−1) = 1, the current clustering into two clusters is changed to a clustering into one cluster based on the centroid method. The clustering in this case is uniquely determined. The center of gravity of the single merged cluster is the midpoint between the two centers of gravity. Repeating the above-described processing for obtaining the various quantities (calculating the total error using Equation 31) on this clustering gives the error SD(a) = SD(1) for clustering into one cluster, which is the smallest possible error for a = 1.
[0113]
The clustering error SD (a) (a = N, N−1,..., 2, 1) obtained as described above is plotted with the horizontal axis as the number of clusters a. In this embodiment, N = 18, and FIG. 5 shows an example in which SD (a) obtained by the method using the previous clustering result is plotted.
[0114]
Referring to FIG. 5, the optimal number of clusters a_opt is determined to be the number of clusters immediately before the point where SD(a) suddenly increases (changes) as the number of clusters a decreases from N to 1, or a larger number of clusters, for which SD(a) is within an allowable range. Therefore, in the example shown in FIG. 5, a_opt is "3".
[0115]
In this embodiment the allowable range is less than 20% of the maximum error, but it can be freely changed by a designer, a user, or the like. The optimal number of clusters a_opt may also be set to the number of clusters a immediately after the sudden change in SD(a), provided that SD(a) does not exceed the allowable range. Depending on how the allowable range is set, therefore, the optimal number of clusters a_opt may be chosen as one of several candidates around the point of sudden change.
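The selection rule can be sketched in Python as follows. The text does not define "suddenly increases" numerically, so this sketch assumes it means the largest single-step increase in SD(a) as a decreases; that criterion, the function name `optimal_cluster_count`, and the example values are all hypothetical. The 20% tolerance matches the allowable range stated for this embodiment.

```python
def optimal_cluster_count(sd, tolerance=0.2):
    """Pick a cluster count from sd = {a: SD(a)} for a = 1, ..., N.

    Assumed rule: find the count a just before the largest increase
    SD(a - 1) - SD(a), then move to larger a until SD(a) falls below
    tolerance * max(SD).
    """
    limit = tolerance * max(sd.values())
    counts = sorted(sd)
    # Increase in error caused by going from a clusters to a - 1 clusters.
    jumps = {a: sd[a - 1] - sd[a] for a in counts if a - 1 in sd}
    a_before_jump = max(jumps, key=jumps.get)
    for a in range(a_before_jump, counts[-1] + 1):
        if sd[a] < limit:
            return a
    return counts[-1]


# Hypothetical error curve, not data from the figures.
example_sd = {1: 55.0, 2: 30.0, 3: 8.0, 4: 6.0, 5: 4.0, 6: 0.0}
a_opt = optimal_cluster_count(example_sd)
a_opt_loose = optimal_cluster_count(example_sd, tolerance=1.0)
```

In this made-up curve the largest jump occurs when going from 2 clusters to 1, but SD(2) exceeds the 20% limit, so the rule moves up to 3 clusters. With the tolerance relaxed to 100% it stays at 2, which shows how the result depends on the allowable range.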
[0116]
The specific clustering process is shown by the flowcharts in FIGS. 6 and 7. Referring to FIG. 6, when a clustering instruction is given by the user, the CPU starts processing, and reads 4096 sample values corresponding to the clustering instruction from the database in step S41. In the subsequent step S43, the read sample values, that is, the (texture) feature amounts (D, V, P), are normalized, and the normalized feature amounts ZD, ZV, and ZP are obtained.
[0117]
In step S45, the normalized feature values ZD, ZV, and ZP are plotted on the normalized feature value space, that is, ZD, ZV, ZP (three-dimensional) space, and a scatter diagram as shown in FIG. 8 is drawn.
[0118]
Hereinafter, in this embodiment, the texture feature amounts D_i, V_i, and P_i of the 4096 small areas a_i are collectively expressed as (D, V, P). Similarly, the normalized feature amounts ZD_i, ZV_i, and ZP_i are collectively expressed as (ZD, ZV, ZP).
[0119]
In step S45 described above, 4096 measured values of normalized features (ZD, ZV, ZP) are plotted on the ZD, ZV, ZP space. However, in this embodiment, for simplicity of explanation, FIG. 8 shows an example in which 18 measurement values (ZD, ZV, ZP) are plotted for a certain image.
[0120]
In step S47, the cluster number a is initialized (a = N). Subsequently, in step S49, all N sample values are clustered by the centroid method. In step S51, the count value i of the counter is initialized (i = 1), and in step S53, the center of gravity G (a, i) of the cluster C (a, i) is calculated.
[0121]
In step S55, the count value k of the counter is initialized (k = 1), and in step S57 the distance d(a, i, k) between the center of gravity G(a, i) of the cluster C(a, i) and the sample point S(a, i, k) is calculated.
[0122]
Next, as shown in FIG. 7, in the next step S59, it is determined whether k = b (a, i). That is, it is determined whether the distance d (a, i, k) from all the sample points in the cluster C (a, i) has been calculated.
[0123]
If “NO” in step S59, that is, if k is not equal to b(a, i), the distances d(a, i, k) have not yet been calculated for all the sample points in the cluster C(a, i), so the count value k is incremented (k = k + 1) in step S61 and the process returns to step S57 shown in FIG. 6.
[0124]
On the other hand, if “YES” in step S59, that is, if k = b(a, i), the distances d(a, i, k) have been calculated for all the sample points in the cluster C(a, i), so in step S63 the error D(a, i) of the i-th cluster C(a, i) is calculated according to Equation 30.
[0125]
In the succeeding step S65, it is determined whether or not the count value i is equal to the number of clusters a. If “NO” in step S65, that is, if the count value i is not equal to the number of clusters a, the counter is incremented (i = i + 1) in step S67, and the process returns to step S53 shown in FIG. 6.
[0126]
On the other hand, if “YES” in the step S65, that is, if the count value i is equal to the cluster number a, an error SD (a) for clustering to a is calculated in a step S69. Specifically, using Equation 31, the clustering error due to clustering into a clusters is obtained.
[0127]
Subsequently, in step S71, it is determined whether or not the number of clusters a is 1. If “NO” in step S71, that is, if the number of clusters a is not 1, the number of clusters a is decremented (a = a − 1) in step S73, the samples are clustered into the current (decremented) number of clusters a based on the centroid method, and the process returns to step S51 shown in FIG. 6.
[0128]
On the other hand, if “YES” in step S71, that is, if the number of clusters a is 1, then in step S75 the number of clusters a immediately before the error SD(a) suddenly increases, or a larger number of clusters a that gives an error SD(a) within the allowable range, is determined as the optimal number of clusters a_opt, and the process is terminated.
[0129]
In this embodiment, the allowable range of the error SD(a) is 1/5 (about 10) of the maximum error (about 55), and the optimal number of clusters a_opt is determined within the range where the error SD(a) does not exceed 10. The allowable range is a value that can be set arbitrarily by the program (software) designer or the user.
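The loop of steps S47 to S73 can be condensed into a short Python sketch that starts with one cluster per sample, records SD(a), merges the two nearest centroids, and repeats until a single cluster remains. This is an illustration under assumptions, not the flowchart itself: the function names are hypothetical, samples are coordinate tuples, ties are broken by the first closest pair, and the centroid of a merged cluster is the mean of its member points.

```python
import math
from itertools import combinations


def centroid(points):
    dim = len(points[0])
    return tuple(sum(p[k] for p in points) / len(points) for k in range(dim))


def clustering_error(clusters):
    """SD(a): distances from every sample to its cluster centroid, summed."""
    total = 0.0
    for points in clusters:                  # loop over i (steps S51 to S67)
        g = centroid(points)                 # step S53
        total += sum(math.dist(p, g) for p in points)  # steps S55 to S63
    return total                             # step S69


def error_curve(samples):
    """Return {a: SD(a)} for a = N, N - 1, ..., 1 using successive merges."""
    clusters = [[s] for s in samples]        # a = N (steps S47, S49)
    sd = {len(clusters): clustering_error(clusters)}
    while len(clusters) > 1:                 # step S71
        cents = [centroid(c) for c in clusters]
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: math.dist(cents[pair[0]], cents[pair[1]]),
        )
        merged = clusters[i] + clusters[j]   # step S73: a = a - 1
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        sd[len(clusters)] = clustering_error(clusters)
    return sd


# Two well-separated pairs of hypothetical points.
curve = error_curve([(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)])
```

For these four points the error stays small down to two clusters and then rises sharply when the two pairs are forced into one cluster, which is the kind of jump the determination in step S75 looks for.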
[0130]
According to this embodiment, since the number of clusters can be determined to an optimal (reasonable) value and clustering can be performed, the reliability when verifying as simple statistics is high, and the clustering result does not adversely affect the subsequent processing. That is, the convenience of the clustering result can be improved.
[0131]
In the above embodiment, the number of clusters a is reduced one at a time. However, when the total number of samples N is enormous, the number of clusters a may be reduced several at a time (for example, by 10).
[0132]
In the above-described embodiment, only the case where the center of gravity method based on the Euclidean distance is applied as the clustering method has been described. However, the distance representing the similarity need not be limited to the Euclidean distance. For example, the square of Euclidean distance, Mahalanobis distance, correlation coefficient, or the like may be used.
[0133]
Furthermore, the method for determining the distance between the clusters is not limited to the centroid method, and the shortest distance method, the longest distance method, the group average method, the median method, the Ward method, or the like can be used.
[0134]
Furthermore, in the above-described embodiment, the texture feature amount (sample value) is normalized, but it is not necessary to normalize when the data that has been normalized in advance is a sample value. The sample value need not be limited to the texture feature amount, and other various data can be used as the sample value.
[0135]
In the above-described embodiment, Y data is obtained from the R, G, B data as preprocessing for the small area a_i. Alternatively, the R, G, B data may be converted into hue H, saturation S, and lightness I (sometimes written V) data, that is, data in the HSI space, and the resulting lightness I data may be used.
[0136]
Furthermore, in the above-described embodiment, the two-dimensional frequency components are calculated using the two-dimensional discrete Fourier transform (DFT), but a two-dimensional fast Fourier transform (FFT) or a two-dimensional discrete cosine transform (DCT) may also be used.
[Brief description of the drawings]
FIG. 1 is an illustrative view showing an image handled in clustering according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the texture feature amount extraction process.
FIG. 3 is a flowchart showing the texture feature amount calculation process.
FIG. 4 is an illustrative view showing, as contour lines on the two-dimensional frequency plane, the absolute values of the complex numbers obtained by applying the two-dimensional DFT to the subtraction data in the texture feature amount calculation process.
FIG. 5 is a graph showing an example of the error with respect to the number of clusters in the clustering process.
FIG. 6 is a flowchart showing a part of the clustering process.
FIG. 7 is a flowchart showing another part of the clustering process.
FIG. 8 is a graph showing an example in which the texture feature amounts extracted by the texture feature amount extraction process are plotted.

Claims

A clustering program for clustering a plurality of sample points, the program causing a computer to function as:
distance detecting means for detecting, for each of all the sample points, the Euclidean distance between the sample point in each cluster and the representative point serving as the center of gravity of that cluster;
first sum value calculating means for obtaining a first sum value of the Euclidean distances detected by the distance detecting means;
second sum value calculating means for obtaining a second sum value by summing, over all the clusters, the first sum values obtained by the first sum value calculating means;
optimal cluster number determining means for determining, as the optimal number of clusters, the number of clusters corresponding to any one of the second sum values at several points before and after the point where the second sum value obtained by the second sum value calculating means changes abruptly when the number of clusters is changed from a maximum value toward a minimum value; and
clustering means for clustering into the optimal number of clusters determined by the optimal cluster number determining means.

A clustering program for clustering a plurality of sample points, the program causing a computer to function as:
distance detecting means for detecting, for each of all the sample points, the Euclidean distance between the sample point in each cluster and the representative point serving as the center of gravity of that cluster;
first sum value calculating means for obtaining a first sum value of the Euclidean distances detected by the distance detecting means;
second sum value calculating means for obtaining a second sum value by summing, over all the clusters, the first sum values obtained by the first sum value calculating means;
second sum value recalculating means for, when the number of clusters is decreased by n (n being a natural number) at a time from a maximum value toward a minimum value, determining the clustering for the number of clusters decreased by n by repeating n times the operation of combining into one cluster the two clusters with the shortest inter-cluster distance in the previous clustering result, and obtaining the second sum value;
optimal cluster number determining means for determining, as the optimal number of clusters, the number of clusters corresponding to any one of the second sum values at several points before and after the point where the second sum value obtained by the second sum value recalculating means changes abruptly; and
clustering means for clustering into the optimal number of clusters determined by the optimal cluster number determining means.

A clustering program for clustering a plurality of sample points, the program causing a computer to function as:
distance detecting means for detecting, for each of all the sample points, the Euclidean distance between the sample point in each cluster and the representative point serving as the center of gravity of that cluster;
first sum value calculating means for obtaining a first sum value of the Euclidean distances detected by the distance detecting means;
second sum value calculating means for obtaining a second sum value by summing, over all the clusters, the first sum values obtained by the first sum value calculating means;
second sum value recalculating means for obtaining the second sum value for every way of dividing the sample points into each number of clusters when the number of clusters is changed from a maximum value toward a minimum value;
third sum value calculating means for taking, as a third sum value, the smallest of the plurality of second sum values obtained by the second sum value recalculating means;
optimal cluster number determining means for determining, as the optimal number of clusters, the number of clusters corresponding to the third sum value either before or after the point where the third sum value obtained by the third sum value calculating means changes abruptly when the number of clusters is changed from the maximum value toward the minimum value; and
clustering means for clustering into the optimal number of clusters determined by the optimal cluster number determining means.