JP2004341959A

JP2004341959A - Data classification device, data classification method, and program for making computer execute the method

Info

Publication number: JP2004341959A
Application number: JP2003139512A
Authority: JP
Inventors: Takashi Nakagawa; 尚中川
Original assignee: JustSystems Corp
Current assignee: JustSystems Corp
Priority date: 2003-05-16
Filing date: 2003-05-16
Publication date: 2004-12-02

Abstract

PROBLEM TO BE SOLVED: To classify data accurately even when the distribution of training data is not uniform. Conventional SVM classification places the separating hyperplane that divides positive and negative examples uniformly at the center of the margin, so when the training data are unevenly distributed, data tend to be classified into the class with the relatively higher density.

SOLUTION: The local densities of positive-example and negative-example training data in the vicinity of the data to be classified are calculated, and the position of the separating hyperplane is corrected according to those densities, specifically toward the side with the higher training data density. This raises the probability that data classified into the high-density class before the correction are classified into the low-density class after it, which addresses the loss of classification accuracy where the training data density is relatively low. In classification methods other than SVM (kNN, etc.), the classification threshold can be corrected by the training data density in the same way.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
この発明は、複数のデータの存在する空間を分離面（平面や曲面など）により各分類ごとの空間に分割するデータ分類装置、データ分類方法およびその方法をコンピュータに実行させるプログラムに関する。
【０００２】
【従来の技術】
ＳＶＭ（ＳｕｐｐｏｒｔＶｅｃｔｏｒＭａｃｈｉｎｅ）は、２種類以上の分類のいずれかが付与されたデータの存在する空間を、各分類ごとの空間に分割する手法である。このＳＶＭでは、個々の訓練データを表すベクトルを素性空間（一般に入力空間より高次元）上に写像し、正例と負例を最も幅広く分割できるマージンを見つけ出す。そして、そのマージンの中央に、正例と負例とを切り分ける分離超平面を構築する（図１）。
【０００３】
一般に高次元空間上でデータを分類した場合、分類は容易になる一方、データに特化した過学習を起こしやすくなることが知られているが、ＳＶＭではマージン最大化という戦略により、実際にはマージンに接しているデータ（ＳＶ：ＳｕｐｐｏｒｔＶｅｃｔｏｒ）の数に依存する次元しか、判別に利用しない。つまり、高次元上の分類に有効な軸のみ利用することで、高次元空間におけるデータの分類しやすさを維持しつつ、過学習を防止していることになる。
【０００４】
【発明が解決しようとする課題】
しかしながら、マージンの中央を分離超平面とすることの根拠については、従来ほとんど検討されてこなかった。従来技術においては、素性空間上における正例・負例の訓練データの分布は均一であることが暗黙の前提となっている。一方、実際の訓練データの分布は不均一であるため、後述のように、訓練データの密度が相対的に低い場所で分類精度が悪化してしまうという問題があった。
【０００５】
この発明は上記従来技術による問題を解決するため、計算量を抑制しつつ、訓練データの分布が均一でない場合にも精度よくデータを分類することが可能なデータ分類装置、データ分類方法およびその方法をコンピュータに実行させるプログラムを提供することを目的とする。
【０００６】
【課題を解決するための手段】
上述した課題を解決し、目的を達成するため、請求項１に記載の発明にかかるデータ分類装置は、複数のデータの存在する空間を分離面により各分類ごとの空間に分割するデータ分類装置において、分類対象データの位置における、第１の分類に分類される訓練データの密度を算出する第１の密度算出手段と、前記分類対象データの位置における、第２の分類に分類される訓練データの密度を算出する第２の密度算出手段と、前記第１の密度算出手段により算出された密度および前記第２の密度算出手段により算出された密度にもとづいて前記分離面の位置を補正する分離面位置補正手段と、を備えたことを特徴とする。
【０００７】
この請求項１に記載の発明によれば、第１の分類と第２の分類との境界となる分離面の位置が、上記各分類に属する訓練データの密度に応じて補正される。
【０００８】
また、請求項２に記載の発明にかかるデータ分類装置は、前記請求項１に記載の発明において、前記分離面位置補正手段が、前記第１の分類または前記第２の分類のうち、前記第１の密度算出手段または前記第２の密度算出手段により算出された密度が相対的に高い側へ前記分離面の位置を補正することを特徴とする。
【０００９】
この請求項２に記載の発明によれば、訓練データの分布が均一でない場合に、訓練データ密度の相対的に高い分類へデータが分類されやすいという特性（この特性が分類精度の悪化の原因となっている）が修正される。
【００１０】
また、請求項３に記載の発明にかかるデータ分類装置は、前記請求項１または請求項２に記載の発明において、前記第１の密度算出手段が前記第１の分類に分類されていない訓練データも含めて前記密度を算出することを特徴とする。
【００１１】
この請求項３に記載の発明によれば、補正後の分離面の位置を決定する訓練データ密度は、第１の分類が付与された訓練データのほか、正確な分類は不明であるものの、第１の分類が付与される確率の高い訓練データを含めて計算される。
【００１２】
また、請求項４に記載の発明にかかるデータ分類装置は、前記請求項１〜請求項３のいずれか一つに記載の発明において、前記第１の密度算出手段が前記第１の分類に分類される訓練データのうち一部の訓練データを用いて前記密度を算出することを特徴とする。
【００１３】
この請求項４に記載の発明によれば、補正後の分離面の位置を決定する訓練データ密度は、前記空間内に存在する一部の訓練データの密度によって近似される。
【００１４】
また、請求項５に記載の発明にかかるデータ分類装置は、前記請求項４に記載の発明において、前記一部の訓練データは前記分離面からの距離が所定の条件を満たす訓練データであることを特徴とする。
【００１５】
この請求項５に記載の発明によれば、補正後の分離面の位置を決定する訓練データ密度は、たとえば補正前の分離面からの距離が一定の訓練データの密度（補正前の分離面に対して平行な面上での訓練データ密度）によって近似される。
【００１６】
また、請求項６に記載の発明にかかるデータ分類装置は、前記請求項１〜請求項５のいずれか一つに記載の発明において、前記第１の密度算出手段が前記第１の分類に分類される訓練データを含む第１の訓練データ集合および前記第１の訓練データ集合から一部の訓練データを除外した第２の訓練データ集合を用いて前記密度を算出することを特徴とする。
【００１７】
この請求項６に記載の発明によれば、補正後の分離面の位置を決定する訓練データ密度は、補正前の分離面に対して垂直な方向での訓練データ密度によって近似される。
【００１８】
また、請求項７に記載の発明にかかるデータ分類装置は、前記請求項１〜請求項６のいずれか一つに記載の発明において、前記第１の密度算出手段が前記訓練データ間の距離にもとづいて前記密度を算出することを特徴とする。
【００１９】
この請求項７に記載の発明によれば、補正後の分離面の位置を決定する訓練データ密度の算出に複雑な計算を必要としない。
【００２０】
また、請求項８に記載の発明にかかるデータ分類方法は、複数のデータの存在する空間を分離面により各分類ごとの空間に分割するデータ分類方法において、分類対象データの位置における、第１の分類に分類される訓練データの密度を算出する第１の密度算出工程と、前記分類対象データの位置における、第２の分類に分類される訓練データの密度を算出する第２の密度算出工程と、前記第１の密度算出工程で算出された密度および前記第２の密度算出工程で算出された密度にもとづいて前記分離面の位置を補正する分離面位置補正工程と、を含んだことを特徴とする。
【００２１】
この請求項８に記載の発明によれば、第１の分類と第２の分類との境界となる分離面の位置が、上記各分類に属する訓練データの密度に応じて補正される。
【００２２】
また、請求項９に記載の発明にかかるデータ分類方法は、前記請求項８に記載の発明において、前記分離面位置補正工程では、前記第１の分類または前記第２の分類のうち、前記第１の密度算出工程または前記第２の密度算出工程で算出された密度が相対的に高い側へ前記分離面の位置を補正することを特徴とする。
【００２３】
この請求項９に記載の発明によれば、訓練データの分布が均一でない場合に、訓練データ密度の相対的に高い分類へデータが分類されやすいという特性が修正される。
【００２４】
また、請求項１０に記載の発明にかかるデータ分類方法は、前記請求項８または請求項９に記載の発明において、前記第１の密度算出工程では前記第１の分類に分類されていない訓練データも含めて前記密度を算出することを特徴とする。
【００２５】
この請求項１０に記載の発明によれば、補正後の分離面の位置を決定する訓練データ密度は、第１の分類が付与された訓練データのほか、正確な分類は不明であるものの、第１の分類が付与される確率の高い訓練データを含めて計算される。
【００２６】
また、請求項１１に記載の発明にかかるデータ分類方法は、前記請求項８〜請求項１０のいずれか一つに記載の発明において、前記第１の密度算出工程では前記第１の分類に分類される訓練データのうち一部の訓練データを用いて前記密度を算出することを特徴とする。
【００２７】
この請求項１１に記載の発明によれば、補正後の分離面の位置を決定する訓練データ密度は、前記空間内に存在する一部の訓練データの密度によって近似される。
【００２８】
また、請求項１２に記載の発明にかかるデータ分類方法は、前記請求項１１に記載の発明において、前記一部の訓練データは前記分離面からの距離が所定の条件を満たす訓練データであることを特徴とする。
【００２９】
この請求項１２に記載の発明によれば、補正後の分離面の位置を決定する訓練データ密度は、たとえば補正前の分離面からの距離が一定の訓練データの密度（補正前の分離面に対して平行な面上での訓練データ密度）によって近似される。
【００３０】
また、請求項１３に記載の発明にかかるデータ分類方法は、前記請求項８〜請求項１２のいずれか一つに記載の発明において、前記第１の密度算出工程では前記第１の分類に分類される訓練データを含む第１の訓練データ集合および前記第１の訓練データ集合から一部の訓練データを除外した第２の訓練データ集合を用いて前記密度を算出することを特徴とする。
【００３１】
この請求項１３に記載の発明によれば、補正後の分離面の位置を決定する訓練データ密度は、補正前の分離面に対して垂直な方向での訓練データ密度によって近似される。
【００３２】
また、請求項１４に記載の発明にかかるデータ分類方法は、前記請求項８〜請求項１３のいずれか一つに記載の発明において、前記第１の密度算出工程では前記訓練データ間の距離にもとづいて前記密度を算出することを特徴とする。
【００３３】
この請求項１４に記載の発明によれば、補正後の分離面の位置を決定する訓練データ密度の算出に複雑な計算を必要としない。
【００３４】
また、請求項１５に記載の発明にかかるプログラムによれば、前記請求項８〜請求項１４のいずれか一つに記載された方法がコンピュータによって実行される。
【００３５】
【発明の実施の形態】
以下に添付図面を参照して、この発明によるデータ分類装置、データ分類方法およびその方法をコンピュータに実行させるプログラムの好適な実施の形態を詳細に説明するが、その前に、本発明の骨子について簡単に説明する。
【００３６】
（発明の骨子）
上述のように、素性空間上における訓練データの分布が均一である場合には、マージンの中央を分離超平面とすることが合理的と考えられる。しかしながら、実際の訓練データの分布はしばしば不均一であり、その場合にも上記が合理的であるとは必ずしも言えない。
【００３７】
たとえば図２に示すように、同じ大きさの空間に正例訓練データ（図中「○」）は３個、負例訓練データ（同「×」）は２４個分布していたとする。このとき以下が成り立つと仮定する。
【００３８】
（ａ）正例と負例を完全に判別できる素性空間上の超平面がただ一つ存在する。
（ｂ）正例データの分布は、素性空間上の正例である場所に対して均一である。
（ｃ）正例訓練データは、正例データの中からランダムに選択される。
（ｄ）負例データの分布は、素性空間上の負例である場所に対して均一である。
（ｅ）負例訓練データは、負例データの中からランダムに選択される。
【００３９】
上記条件のもとで、素性空間上の理想的な分離超平面に垂直な軸上の分布を考えると、統計学上、最も分離超平面に近い正例訓練データ・負例訓練データの距離比も、訓練データ密度比と同じ８：１となる確率が最も高い。つまり分離超平面は、マージンを１：１に内分する位置（すなわちマージンの中央）ではなく、マージンを８：１に内分する位置にあるほうが合理的である。
【００４０】
そこで本発明では、分離超平面の位置を訓練データ密度の相対的に高い側へと補正する。そしてこの移動は、ＳＶＭの判別式を以下のように修正することで実現できる（なお、ｓｉｇｎ（ｐ）はｐの正負を返す関数である）。
（修正前）
【数式１】

（修正後）
【数式２】

【００４１】
問題はこのｃの値をどうやって算出するか、言い換えれば、分離超平面の移動量と方向とをどうやって決定するかであり、後述する実施の形態１および２は、いずれもこのｃ値の算出の詳細にかかるものである。
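The images for Equations (1) and (2) are not reproduced in this text. The following is a hedged reconstruction, not the published formulas: it assumes the standard SVM decision function (the symbols \alpha_i, y_i and b are conventional notation and do not appear in the surviving text) and a correction term c whose form is inferred from the numerical examples in the description (k = 1 giving c = (3 − 24)/(3 + 24), and the thresholds −253/541 and 0.4 in Embodiments 1 and 2).

```latex
% Equation (1), before correction: standard SVM decision function (assumed form)
f(x) = \operatorname{sign}\Bigl(\sum_{i} \alpha_i y_i K(x_i, x) + b\Bigr)

% Equation (2), after correction: the decision threshold moves from 0 to c (inferred form)
f(x) = \operatorname{sign}\Bigl(\sum_{i} \alpha_i y_i K(x_i, x) + b - c\Bigr),
\qquad
c = k \, \frac{d_p - d_n}{d_p + d_n}
```

Under this reading, c is negative when d_n > d_p, so the threshold drops below 0 and the hyperplane moves toward the denser negative side.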
【００４２】
最も単純には、同じ大きさの空間あたりの正例訓練データの個数と負例訓練データの個数を、それぞれ正例訓練データ密度ｄｐ、負例訓練データ密度ｄｎとすればよい。たとえば図２の例では、同じ大きさの空間に正例訓練データは３個、負例訓練データは２４個分布していることから、ｋ＝１とするとｃ＝（３−２４）／（３＋２４）＝−７／９となる。
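As a minimal sketch of the c value computation described in this paragraph, the snippet below evaluates c = k(dp − dn)/(dp + dn) for the FIG. 2 counts. The function name `c_value` is an illustrative choice, and exact rational arithmetic is used only so that the result matches the fraction quoted in the text.

```python
from fractions import Fraction


def c_value(dp, dn, k=1):
    """Threshold shift c = k * (dp - dn) / (dp + dn) for densities dp and dn."""
    dp, dn = Fraction(dp), Fraction(dn)
    return k * (dp - dn) / (dp + dn)


# FIG. 2 example: 3 positive and 24 negative training data in equal volumes.
c_fig2 = c_value(3, 24)
print(c_fig2)  # -7/9
```

Equal densities give c = 0, which reduces to the conventional hyperplane at the center of the margin.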
【００４３】
あるいは、正例訓練データ・負例訓練データの中でも特にマージンに接しているもの（マージンの境界面であるサポーティング・ハイパープレーン上に存在するもの）の個数、すなわち正例ＳＶの総数ＳＶｐおよび負例ＳＶの総数ＳＶｎは、それぞれ正例訓練データ密度ｄｐ・負例訓練データ密度ｄｎに比例すると仮定して、
【数式３】

あるいは
【数式４】

などとしてもよい。
【００４４】
なお、ソフトマージンＳＶＭを用いた場合は、ＳＶはマージン境界面上にあるとは限らないが（境界面上にあればＳＶであるが、ＳＶであれば境界面上にあるとは必ずしも言えない）、この場合はＳＶの中でも、特にマージン境界面上にあるＳＶの個数だけをＳＶｐ・ＳＶｎとしてもよい。マージン境界面に存在するＳＶとは、具体的にはその重みがＳＶＭのコスト値（訓練データの分類誤りに対するペナルティー）未満であるようなＳＶである。
【００４５】
ここまで説明してきたように、訓練データの空間あたりの個数、訓練データのうち特にＳＶであるものの個数、あるいはＳＶの中でも特にマージン境界面上のものの個数などから、正例訓練データ密度および負例訓練データ密度を推測し、この密度の相対的に高い側へ分離超平面をずらすことで、当該密度が相対的に低い側での分類精度を向上させることができる。
【００４６】
ただ、上記の計算式はいずれも、正例データ（および正例データの中からランダムに選択される正例訓練データ）は素性空間上の正例である場所に対して均一に、負例データ（および負例データの中からランダムに選択される負例訓練データ）も負例である場所に対して均一に、それぞれ分布していることを前提としている。
【００４７】
しかし実際の訓練データでは、たとえば図３に示すように、同じ負例側でもこの部分ではデータが密であり、この部分では疎であるといった分布の偏りがある。すなわち、上記で仮定した（ｂ）〜（ｅ）の条件は、実際には成立しないことが多い。
【００４８】
そこで本発明では、個々の分類対象データの近傍における訓練データの局所的密度から、分離超平面の位置を補正するようにモデルを拡張する。すなわち図３に示すように、分類対象データｘ（図中「？」）の位置における正例訓練データ・負例訓練データの密度比を算出し、この密度比に応じて分離超平面の位置を補正する。
【００４９】
そしてこの局所的密度の尺度として、本発明では
（１）分類対象データの近傍における、分離超平面に平行な方向での相対データ密度（後述する実施の形態１）
（２）分類対象データの近傍における、分離超平面に垂直な方向での相対データ密度（後述する実施の形態２）
のいずれかを利用する。
【００５０】
すなわち実施の形態１では、分類対象データｘの位置での訓練データ密度を、正例側・負例側のそれぞれのマージン境界面上で分類対象データｘに最も近い点における正例ＳＶ・負例ＳＶの密度から推定する。マージン境界面より内側の訓練データ密度を、マージン境界面上の訓練データであるＳＶの密度から推定すると言ってもよい。
【００５１】
まず、個々のＳＶの位置における相対データ密度を求める。素性空間上のベクトル間の距離は、式（１）（２）にもあるカーネル関数Ｋを用いて下記式により求められる。
【数式５】
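The image for Equation (5) is not reproduced here. A common way to obtain a feature-space distance from a kernel is the identity ||φ(a) − φ(b)||² = K(a, a) − 2K(a, b) + K(b, b); the sketch below assumes that Equation (5) expresses this identity. The kernel functions `linear` and `rbf` and the parameter `gamma` are illustrative and are not taken from the source.

```python
import math


def kernel_distance(K, a, b):
    """Distance between phi(a) and phi(b), computed from kernel values only."""
    squared = K(a, a) - 2 * K(a, b) + K(b, b)
    return math.sqrt(max(squared, 0.0))  # clamp tiny negative rounding errors


def linear(a, b):
    """Linear kernel: the ordinary dot product."""
    return sum(x * y for x, y in zip(a, b))


def rbf(a, b, gamma=0.5):
    """Gaussian (RBF) kernel with an illustrative gamma."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))


print(kernel_distance(linear, (0.0, 0.0), (3.0, 4.0)))  # 5.0
```

With the linear kernel the result is the usual Euclidean distance; with other kernels the same call gives the distance in the implicit feature space without ever computing φ.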

【００５２】
ここで、ｉ番目の正例ＳＶであるＳＶｐｉの位置での相対正例データ密度（ＳＶｐｉ）、およびｉ番目の負例ＳＶであるＳＶｎｉの位置での相対負例データ密度（ＳＶｎｉ）を、
【数式６】

と定義する。
【００５３】
上記各式により算出される各ＳＶの相対データ密度は、当該ＳＶに近い（当該ＳＶとの距離の小さい）ＳＶが多くあるほど、すなわちＳＶがより密に分布している場所ほど高くなる。逆に言えば、ＳＶの寄り集まっている場所ほど高く、ＳＶのまばらな場所ほど低くなるように相対データ密度を算出できるのであれば、計算式は上記に限らず、どのようなものであってもよい。他の計算式の例としては、たとえば下記のようなものがある。
【数式７】

【００５４】
次に、上記で算出した各ＳＶの位置における相対データ密度から、正例側マージン境界面上で分類対象データｘに最も近い点における相対正例データ密度、および負例側マージン境界面上で分類対象データｘに最も近い点における相対負例データ密度を算出する。そしてこれらを、分類対象データｘの位置における相対正例データ密度（ｘ）および相対負例データ密度（ｘ）であるとみなす。
【数式８】

【００５５】
上記式から分かるように、相対正例データ密度（ｘ）はすべての正例ＳＶの位置における相対正例データ密度（ＳＶｐｉ）を、当該ＳＶから分類対象データｘまでの距離に反比例する重み（ｗｐｉ）で足し合わせたものとなっている。すなわち、分類対象データｘに近いＳＶほど大きな重み、遠いＳＶほど小さな重みで計算した、相対正例データ密度（ＳＶｐｉ）の加重平均となっている。相対負例データ密度（ｘ）も同様に、すべての負例ＳＶの位置における相対負例データ密度（ＳＶｎｉ）の加重平均である。
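The images for Equations (6) and (8) are not reproduced in this text. The reconstruction below is a sketch inferred from the prose and from the worked example with t = 1 in Embodiment 1; where exactly the exponent t appears in the published formulas is an assumption. Here "dist" denotes the feature-space distance of Equation (5), and the negative-example formulas are the same with SVn in place of SVp.

```latex
% Equation (6), inferred: relative density at the i-th positive SV
\mathrm{density}(SV_{p_i}) = \sum_{j \neq i} \frac{1}{\mathrm{dist}(SV_{p_i}, SV_{p_j})^{t}}

% Equation (8), inferred: distance-weighted average at the data x to be classified
\mathrm{density}_p(x) = \sum_{i} w_{p_i} \, \mathrm{density}(SV_{p_i}),
\qquad
w_{p_i} = \frac{1 / \mathrm{dist}(x, SV_{p_i})^{t}}{\sum_{j} 1 / \mathrm{dist}(x, SV_{p_j})^{t}}
```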
【００５６】
逆に言えば、近いＳＶの重みほど大きく、遠いＳＶの重みほど小さくなるのであれば、ｗｐｉやｗｎｉの計算式は上記に限らず、どのようなものであってもよい。他の計算式の例としては、たとえば下記のようなものがある。
【数式９】

【００５７】
そして上記で算出した、分類対象データｘの位置での相対正例データ密度（ｘ）を式（２）のｄｐ、相対負例データ密度（ｘ）をｄｎに代入してｃ値を算出する。
【００５８】
一方、実施の形態２では分離超平面に沿う方向（平行方向）ではなく、分離超平面から離れる方向（垂直方向）での正例訓練データ・負例訓練データの密度比を求め、この密度比に応じてｃ値すなわち分離超平面の移動量と方向とを算出する。
【００５９】
すなわち、まず訓練データの全体を用いてＳＶＭを学習させ、学習後のＳＶＭを甲とする。次に訓練データの中から、ＳＶＭ甲でＳＶとなったものと同一の訓練データを取り除く（なお、ＳＶとまったく同じ訓練データが複数あった場合は、それらをすべて除いてもよいし、一つだけ除いてもよい）。そして、ＳＶと同じものを除いた残りの訓練データを用いて再度ＳＶＭを学習させ、学習後のＳＶＭを乙とする。
【００６０】
図４に示すように、ＳＶＭ乙ではＳＶＭ甲のＳＶを除いた残りの訓練データの中で最大となるマージンを採用するので、正例側・負例側のマージン境界面は、多かれ少なかれＳＶＭ甲の分離超平面から離れる方向へと後退する。
【００６１】
ただし、その後退量は正例側・負例側で同一ではない。ＳＶＭ甲の分離超平面に対して垂直な直線を考えると、この直線上における境界面の後退量は、訓練データ密度が相対的に高い側（図示する例では負例側）のほうが、当該密度の相対的に低い側（同正例側）よりも小さくなっている。すなわち、訓練データ密度が高いほど、ＳＶの除去によってもマージン境界面が後退しづらい。
【００６２】
本発明ではこの点に注目し、マージン境界面の後退量は訓練データ密度に反比例するとの前提のもとに、ＳＶを除去する前のＳＶＭ甲とＳＶを除去した後のＳＶＭ乙とを用いて、正例側・負例側でのマージン境界面の後退量の比を求め、この比から逆に正例訓練データ・負例訓練データの密度比を推定する。
【００６３】
入力空間から素性空間への写像関数をφとおく。図４に示すように、分類対象データｘに対応する素性空間上の点φ（ｘ）（図中「？」）を通り、ＳＶＭ甲の分離超平面に垂直な直線（図中「垂線Ｖ」）を考える。垂線Ｖ上のφ（ｘ）以外の点ｘ’を考えると、ｘ’は下記式により表現できる。
【数式１０】

【００６４】
そして、式（２）に示したｆ（ｘ）の値をスコアと呼ぶことにすると、ＳＶＭ甲におけるｘ’のスコア甲（ｘ’）は下記のようになる。
【数式１１】

なお、ＳＶＭ乙におけるｘ’のスコア乙（ｘ’）も同様に計算できる。
【００６５】
ここで、垂線Ｖと、ＳＶＭ甲の正例側マージン境界面との交点をｐ、ＳＶＭ乙の正例側マージン境界面との交点をＰとする。ＳＶＭの学習による正例側マージン境界面の後退量は、ｐからＰまでのスコアの増加分、すなわち甲（Ｐ）−甲（ｐ）により定義できる。
【００６６】
甲（ｐ）＝１であるから、後退量を求めるには甲（Ｐ）、すなわちＳＶＭ乙の正例側マージン境界面にある点Ｐのスコアが、ＳＶＭ甲で計算するといくらになるかが分かればよい。そして、求めるスコア甲（Ｐ）は下記式により算出できる。
【数式１２】

【００６７】
一方、負例側マージン境界面の後退量も、垂線ＶとＳＶＭ甲の負例側マージン境界面との交点をｎ、ＳＶＭ乙の負例側マージン境界面との交点をＮとすると、甲（ｎ）−甲（Ｎ）により定義できる。そして甲（ｎ）は−１であり、甲（Ｎ）は下記式により算出できる。
【数式１３】

【００６８】
これにより、正例側マージン境界面の後退量（甲（Ｐ）−甲（ｐ））と負例側マージン境界面の後退量（甲（ｎ）−甲（Ｎ））とが求められるので、それぞれの逆数をｄｐおよびｄｎとする。そして、上述の式（２）によりｃ値を算出する。
【００６９】
（実施の形態１）
図５は、本発明の実施の形態１によるデータ分類装置のハードウエア構成の一例を示す説明図である。図中、５０１は装置全体を制御するＣＰＵを、５０２は基本入出力プログラムを記憶したＲＯＭを、５０３はＣＰＵ５０１のワークエリアとして使用されるＲＡＭを、それぞれ示している。
【００７０】
また、５０４はＣＰＵ５０１の制御にしたがってＨＤ（ハードディスク）５０５に対するデータのリード／ライトを制御するＨＤＤ（ハードディスクドライブ）を、５０５はＨＤＤ５０４の制御にしたがって書き込まれたデータを記憶するＨＤを、それぞれ示している。
【００７１】
また、５０６はＣＰＵ５０１の制御にしたがってＦＤ（フレキシブルディスク）５０７に対するデータのリード／ライトを制御するＦＤＤ（フレキシブルディスクドライブ）を、５０７はＦＤＤ５０６の制御にしたがって書き込まれたデータを記憶する着脱自在のＦＤを、それぞれ示している。
【００７２】
また、５０８はＣＰＵ５０１の制御にしたがってＣＤ−ＲＷ５０９に対するデータのリード／ライトを制御するＣＤ−ＲＷドライブを、５０９はＣＤ−ＲＷドライブ５０８の制御にしたがって書き込まれたデータを記憶する着脱自在のＣＤ−ＲＷを、それぞれ示している。
【００７３】
また、５１０はカーソル、メニュー、ウィンドウ、あるいは文字や画像などの各種データを表示するディスプレイを、５１１は文字、数値、各種指示などの入力のための複数のキーを備えたキーボードを、５１２は各種指示の選択や実行、処理対象の選択、マウスポインタの移動などをおこなうマウスを、それぞれ示している。
【００７４】
また、５１３は通信ケーブル５１４を介してＬＡＮやＷＡＮなどのネットワークに接続され、当該ネットワークとＣＰＵ５０１とのインターフェースとして機能するネットワークＩ／Ｆを、５００は上記各部を接続するためのバスを、それぞれ示している。
【００７５】
次に、図６は本発明の実施の形態１によるデータ分類装置の構成を機能的に示す説明図である。また、図７は本発明の実施の形態１によるデータ分類装置における、データ分類処理の手順を示すフローチャートである。以下、図７に示す手順に沿って、図６に示す各部の機能を順次説明する。
【００７６】
なお、本フローチャートによる処理に先立って、訓練データあるいは分類対象データとして使用する入力情報および出力（分類）情報の形式を決定しておく。これらの情報の形式には本発明は関知しないので、一般的な教師あり学習方法に準じた形式を採用すればよい。また、ＳＶＭを学習させるには入力情報をベクトル化する必要があるので、その形式も決定しておく。このベクトルの形式にも本発明は関知しないので、一般的なＳＶＭの学習方法に準じた形式とすればよい。
【００７７】
まず、訓練データあるいは分類対象データとして使用される入力情報および出力情報を、データ入力部６００により装置内部に取り込み（ステップＳ７０１）、取り込んだ入力情報を、データ変換部６０１により所定の形式に変換（具体的にはベクトル化）する（ステップＳ７０２）。
【００７８】
そしてＳＶＭ学習部６０２により、あらかじめ分類が付与された訓練データを用いてＳＶＭを学習させる（ステップＳ７０３）。その学習方法は任意であり、たとえばＳＭＯ（ＳｅｑｕｅｎｔｉａｌＭｉｎｉｍａｌＯｐｔｉｍｉｚａｔｉｏｎ：ＡＦａｓｔＡｌｇｏｒｉｔｈｍｆｏｒＴｒａｉｎｉｎｇＳｕｐｐｏｒｔＶｅｃｔｏｒＭａｃｈｉｎｅｓ（１９９８）ＪｏｈｎＣ．Ｐｌａｔｔ）法や、Ｋｅｒｎｅｌ−Ａｄａｔｒｏｎ（ＴｈｅＫｅｒｎｅｌ−Ａｄａｔｒｏｎ：ａｆａｓｔａｎｄｓｉｍｐｌｅｔｒａｉｎｉｎｇｐｒｏｃｅｄｕｒｅｆｏｒｓｕｐｐｏｒｔｖｅｃｔｏｒｍａｃｈｉｎｅｓ．（１９９８）Ｔ．Ｆｒｉｅｓｓ，Ｎ．ＣｒｉｓｔｉａｎｉｎｉａｎｄＣ．Ｃａｍｐｂｅｌｌ）法などを利用できる。なお、ＳＶＭのパラメータも一般的な手法（交差検定など）により最適化することで、分類精度を向上できる。ここでは、コスト値が１のソフトマージンＳＶＭで学習を行うものとする。
【００７９】
その後データを指定して、当該データの分類先を判定するようユーザから指示があると（ステップＳ７０４：Ｙｅｓ）、まずｃ値算出部６０３により、上述のｃ値すなわち分離超平面の移動量と方向とを算出する（ステップＳ７０５）。
【００８０】
図８は、本発明の実施の形態１によるデータ分類装置における、ｃ値算出処理（図７のステップＳ７０５）の手順を詳細に示すフローチャートである。ｃ値算出部６０３は、まずＳＶＭ学習部６０２による学習後のＳＶＭから、マージン境界面上に存在するＳＶを抽出する（ステップＳ８０１）。ここでは、上記ＳＶＭのＳＶが以下のようであったとする。
【００８１】

【００８２】
ここで、コスト値（ここでは１）以上の重みを持つＳＶｐ４は、マージン境界面には存在しないＳＶなので、ステップＳ８０１で抽出されるのはＳＶｐ１〜ＳＶｐ３とＳＶｎ１〜ＳＶｎ４の計７つである。
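The selection in step S801 can be sketched as a filter on the SV weights: an SV is treated as lying on the margin boundary when its weight is below the cost value. The table of SV weights is not reproduced in this text, so the weights below are hypothetical; only the fact that SVp4 has a weight equal to or greater than the cost value (here 1) comes from the description. The function name is also illustrative.

```python
def margin_boundary_svs(weights, cost):
    """Keep the SVs whose weight is positive and strictly below the cost value."""
    return [name for name, w in weights.items() if 0 < w < cost]


# Hypothetical weights (assumption); only "SVp4 >= cost" is stated in the text.
weights = {
    "SVp1": 0.4, "SVp2": 0.7, "SVp3": 0.2, "SVp4": 1.0,
    "SVn1": 0.5, "SVn2": 0.3, "SVn3": 0.6, "SVn4": 0.9,
}
selected = margin_boundary_svs(weights, cost=1.0)
print(selected)
```

With these values the filter returns the seven SVs named in the text, SVp1 to SVp3 and SVn1 to SVn4.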
【００８３】
次に、ｃ値算出部６０３は上述の式（５）により、ステップＳ８０１で抽出した各ＳＶ間の距離を算出する（ステップＳ８０２）。ここでは計算の結果、各ＳＶ間の距離が以下のようになったとする。
【００８４】

【００８５】
次に、ｃ値算出部６０３は上述の式（６）により、ステップＳ８０２で算出した距離を用いて、各ＳＶの位置における相対正例データ密度（ＳＶｐｉ）および相対負例データ密度（ＳＶｎｉ）を算出する（ステップＳ８０３）。
【００８６】
ここで、式（６）のｔ＝１とすると、
相対正例データ密度（ＳＶｐ１）
＝１／距離（ＳＶｐ１〜ＳＶｐ２）＋１／距離（ＳＶｐ１〜ＳＶｐ３）
＝１／０．５＋１／０．５
＝４
である。
【００８７】
同様に、
相対正例データ密度（ＳＶｐ２）＝１／０．５＋１／０．２＝７
相対正例データ密度（ＳＶｐ３）＝１／０．５＋１／０．２＝７
相対負例データ密度（ＳＶｎ１）＝１／０．５＋１／０．５＋１／０．５＝６
相対負例データ密度（ＳＶｎ２）＝１／０．５＋１／０．２５＋１／０．２＝１１
相対負例データ密度（ＳＶｎ３）＝１／０．５＋１／０．２５＋１／０．１＝１６
相対負例データ密度（ＳＶｎ４）＝１／０．５＋１／０．２＋１／０．１＝１７である。
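The arithmetic of step S803 can be checked with a short sketch. The table of distances between SVs is not reproduced in this text, so the pairwise distances below are read back from the terms of the sums above (for example, SVp1 to SVp2 = 0.5); treat them as inferred values. The density rule used is the t = 1 case, a sum of inverse distances to the other SVs of the same class, and the function name is illustrative.

```python
from fractions import Fraction as F


def relative_densities(names, pair_dist, t=1):
    """Density at each SV: sum of 1 / distance**t over the other SVs in `names`."""
    return {
        a: sum(1 / pair_dist[frozenset((a, b))] ** t for b in names if b != a)
        for a in names
    }


# Pairwise distances inferred from the worked sums in the text.
pos_dist = {
    frozenset(("SVp1", "SVp2")): F("0.5"),
    frozenset(("SVp1", "SVp3")): F("0.5"),
    frozenset(("SVp2", "SVp3")): F("0.2"),
}
neg_dist = {
    frozenset(("SVn1", "SVn2")): F("0.5"),
    frozenset(("SVn1", "SVn3")): F("0.5"),
    frozenset(("SVn1", "SVn4")): F("0.5"),
    frozenset(("SVn2", "SVn3")): F("0.25"),
    frozenset(("SVn2", "SVn4")): F("0.2"),
    frozenset(("SVn3", "SVn4")): F("0.1"),
}
pos = relative_densities(["SVp1", "SVp2", "SVp3"], pos_dist)
neg = relative_densities(["SVn1", "SVn2", "SVn3", "SVn4"], neg_dist)
print(pos, neg)
```

The results are 4, 7, 7 for the positive SVs and 6, 11, 16, 17 for the negative SVs, matching the values listed above.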
【００８８】
次に、ｃ値算出部６０３は上述の式（５）により、分類対象データｘからステップＳ８０１で抽出した各ＳＶまでの距離を算出する（ステップＳ８０４）。ここでは計算の結果、各ＳＶまでの距離が以下のようになったとする。
【００８９】
距離（ｘ〜ＳＶｐ１）＝０．２
距離（ｘ〜ＳＶｐ２）＝０．５
距離（ｘ〜ＳＶｐ３）＝０．５
距離（ｘ〜ＳＶｎ１）＝０．５
距離（ｘ〜ＳＶｎ２）＝０．２
距離（ｘ〜ＳＶｎ３）＝０．１
距離（ｘ〜ＳＶｎ４）＝０．１
【００９０】
次に、ｃ値算出部６０３は上述の式（８）により、ステップＳ８０３で算出した各ＳＶの位置での相対データ密度、およびステップＳ８０４で算出した各ＳＶまでの距離を用いて、分類対象データｘの位置（厳密にはその近傍のマージン境界面）における相対正例データ密度（ｘ）および相対負例データ密度（ｘ）を算出する（ステップＳ８０５）。
【００９１】
ここで、式（８）のｔ＝１とすると、ｗｐｉの分母Σ（１／距離（ｘ〜ＳＶｐｊ））＝１／０．２＋１／０．５＋１／０．５＝９であることから、
ｗｐ１＝１／０．２×１／９＝５／９
ｗｐ２＝１／０．５×１／９＝２／９
ｗｐ３＝１／０．５×１／９＝２／９
【００９２】
よって、相対正例データ密度（ｘ）
＝（５／９×４）＋（２／９×７）＋（２／９×７）
＝１６／３
＝５．３３３・・・
【００９３】
また、ｗｎｉの分母Σ（１／距離（ｘ〜ＳＶｎｊ））＝１／０．５＋１／０．２＋１／０．１＋１／０．１＝２７であることから、
ｗｎ１＝１／０．５×１／２７＝２／２７
ｗｎ２＝１／０．２×１／２７＝５／２７
ｗｎ３＝１／０．１×１／２７＝１０／２７
ｗｎ４＝１／０．１×１／２７＝１０／２７
【００９４】
よって、相対負例データ密度（ｘ）
＝（２／２７×６）＋（５／２７×１１）＋（１０／２７×１６）＋（１０／２７×１７）
＝３９７／２７
＝１４．７０３７０３・・・
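Step S805 is a distance-weighted average of the per-SV densities. The sketch below reproduces the two results above using the distances from x given in paragraph 0089 and the densities from step S803, with t = 1. The function name `density_at` is illustrative, and exact fractions are used so the output can be compared with 16/3 and 397/27 directly.

```python
from fractions import Fraction as F


def density_at(x_dist, sv_density, t=1):
    """Average the SV densities with weights proportional to 1 / distance**t."""
    inv = {sv: 1 / d ** t for sv, d in x_dist.items()}
    total = sum(inv.values())
    return sum(inv[sv] / total * sv_density[sv] for sv in inv)


dp = density_at(
    {"SVp1": F("0.2"), "SVp2": F("0.5"), "SVp3": F("0.5")},
    {"SVp1": 4, "SVp2": 7, "SVp3": 7},
)
dn = density_at(
    {"SVn1": F("0.5"), "SVn2": F("0.2"), "SVn3": F("0.1"), "SVn4": F("0.1")},
    {"SVn1": 6, "SVn2": 11, "SVn3": 16, "SVn4": 17},
)
print(dp, dn)  # 16/3 397/27
```

Because the weights are normalized to sum to 1, the result always lies between the smallest and largest SV density of that class.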
【００９５】
最後に、ｃ値算出部６０３はステップＳ８０５で算出した相対正例データ密度（ｘ）をｄｐ、相対負例データ密度（ｘ）をｄｎとして、式（２）によりｃ値を算出する（ステップＳ８０６）。たとえば式（２）のｋ＝１とすると、
ｃ＝（１６／３−３９７／２７）／（１６／３＋３９７／２７）＝−２５３／５４１
となる。この時点で図８のフローチャートによる処理は終了し、図７のステップＳ７０６へと復帰する。
【００９６】
上記のようにしてｃ値を求めた後、実施の形態１によるデータ分類装置は、次にそのデータ分類部６０４により式（２）の結果が正負のいずれとなるか、すなわち分類対象データｘが正例・負例のいずれに該当するかを判定する（ステップＳ７０６）。
【００９７】
ここでの判定結果は、分類対象データｘのスコアｆ（ｘ）に応じて図９のようになる。正例と負例との境目となる閾値が、従来技術では０であるのに対して、本発明では−２５３／５４１に減少している（図形的には分離超平面の位置が、負例側に補正されたことを意味する）。その結果スコアが−２５３／５４１から０までの範囲で、従来技術では負例と判定されていたデータが、本発明では正例と判定されることになる。
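The effect of the shifted threshold can be illustrated as follows. This sketch assumes that a score strictly greater than the threshold is judged positive; how a score exactly equal to the threshold is handled is not stated in the text and is an assumption here. The sample scores are arbitrary.

```python
from fractions import Fraction


def classify(score, threshold=0):
    """Positive when the score exceeds the threshold (tie handling is assumed)."""
    return "positive" if score > threshold else "negative"


c = Fraction(-253, 541)  # corrected threshold from the worked example
for score in (Fraction(-1, 2), Fraction(-3, 10), Fraction(1, 10)):
    print(score, classify(score), classify(score, c))
```

Only the middle score, which lies between −253/541 and 0, changes from negative under the conventional threshold to positive under the corrected one.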
【００９８】
図７に戻り、最後に実施の形態１によるデータ分類装置は、データ出力部６０５によりデータ分類部６０４の処理結果をファイルあるいはディスプレイなどに出力して（ステップＳ７０７）、本フローチャートによる処理を終了する。
【００９９】
なお、ここでは説明の便宜上、個々のＳＶの位置での相対データ密度ＳＶｐｉ・ＳＶｎｉはｃ値の算出の都度算出するフローとしたが（図８ステップＳ８０１〜Ｓ８０３）、これらの値は分類対象データｘには依存しないので、実際上は毎回計算し直す必要はない。たとえば、上記で算出したＳＶｐｉやＳＶｎｉの値をテーブルに格納しておき、２回目以降の分類ではこのテーブルを参照することで、ステップＳ８０１〜Ｓ８０３の処理をスキップすることができる。
【０１００】
（実施の形態２）
一方、実施の形態２によるデータ分類装置も、そのハードウエア構成、機能的構成および当該装置におけるデータ分類処理の手順については、それぞれ図５、図６および図７に示した実施の形態１のそれと同様である。ただし、図７のステップＳ７０５におけるｃ値算出の手順のみは、図８に示した実施の形態１のそれとは異なっている。
【０１０１】
図１０は、本発明の実施の形態２によるデータ分類装置における、ｃ値算出処理（図７のステップＳ７０５）の手順を詳細に示すフローチャートである。実施の形態２によるｃ値算出部６０３は、まずステップＳ７０３による学習後のＳＶＭ（このＳＶＭを甲とする）から、マージン境界面に存在するＳＶを抽出する（ステップＳ１００１）。
【０１０２】
次に、ｃ値算出部６０３はステップＳ７０３でＳＶＭ甲の学習に用いた訓練データの中から、上記で抽出したＳＶと同一のものを除去する（ステップＳ１００２）。そして、データ除去後の訓練データをＳＶＭ学習部６０２に出力するとともに、当該訓練データによるＳＶＭ甲の学習を指示し、これを受けたＳＶＭ学習部６０２により、ＳＶＭ甲が再度学習される（ステップＳ１００３）。なお、ステップＳ１００３による学習後のＳＶＭを乙とする。
【０１０３】
次に、ｃ値算出部６０３は分類対象データφ（ｘ）と、φ（ｘ）を通る垂線Ｖ上の他の点ｘ’との、ＳＶＭ甲およびＳＶＭ乙におけるスコアを算出する（ステップＳ１００４）。ここでは計算の結果、上記スコアが以下のようになったとする。
【０１０４】
甲（φ（ｘ））＝０．２
甲（ｘ’）＝２．２
乙（φ（ｘ））＝０．３
乙（ｘ’）＝１．３
【０１０５】
次に、ｃ値算出部６０３は上記で算出した４点のスコアから、ＳＶＭ乙の正例側・負例側マージン境界面と垂線Ｖとの交点である点Ｐおよび点ＮのＳＶＭ甲におけるスコアを、上述の式（１２）および式（１３）により算出する（ステップＳ１００５）。
甲（Ｐ）＝（２．２−０．２）（１−０．３）／（１．３−０．３）＋０．２＝１．６
甲（Ｎ）＝（２．２−０．２）（−１−０．３）／（１．３−０．３）＋０．２＝−２．４
【０１０６】
次に、ｃ値算出部６０３は上記で算出したスコアから、正例側・負例側マージン境界面の後退量をそれぞれ算出する（ステップＳ１００６）。上記の例では、正例側のマージン境界面は、ＳＶＭ甲の１に対してＳＶＭ乙では１．６の位置まで、すなわち０．６だけ後退したことになる。また、負例側のマージン境界面は、ＳＶＭ甲の−１に対してＳＶＭ乙では−２．４の位置まで、すなわち１．４だけ後退したことになる。
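Steps S1005 and S1006 amount to a linear interpolation along the perpendicular V. Both SVMs are linear in the feature space, so each score is an affine function of the position on V, and two points suffice to convert a score under SVM 乙 (called B below) into a score under SVM 甲 (called A below). The images for Equations (12) and (13) are not reproduced, so the function is written from the two numerical lines above; its name and argument names are illustrative.

```python
from fractions import Fraction as F


def score_under_a(a_x, a_x2, b_x, b_x2, b_target):
    """Score under SVM A of the point on V whose score under SVM B is b_target."""
    return (a_x2 - a_x) * (b_target - b_x) / (b_x2 - b_x) + a_x


a_x, a_x2 = F("0.2"), F("2.2")   # scores of phi(x) and x' under SVM A
b_x, b_x2 = F("0.3"), F("1.3")   # scores of phi(x) and x' under SVM B

a_P = score_under_a(a_x, a_x2, b_x, b_x2, 1)    # positive margin boundary of B
a_N = score_under_a(a_x, a_x2, b_x, b_x2, -1)   # negative margin boundary of B
setback_pos = a_P - 1      # A's positive boundary has score +1
setback_neg = -1 - a_N     # A's negative boundary has score -1
print(a_P, a_N, setback_pos, setback_neg)  # 8/5 -12/5 3/5 7/5
```

The setbacks 3/5 and 7/5 correspond to the 0.6 and 1.4 stated above.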
【０１０７】
最後に、ｃ値算出部６０３はステップＳ１００６で算出した正例側マージン境界面の後退量の逆数をｄｐ、負例側マージン境界面の後退量の逆数をｄｎとして、式（２）によりｃ値を算出する（ステップＳ１００７）。たとえば式（２）のｋ＝１とすると、
ｃ＝（１／０．６−１／１．４）／（１／０．６＋１／１．４）＝０．４
となる。この時点で図１０のフローチャートによる処理は終了し、図７のステップＳ７０６へと復帰する。
【０１０８】
この後、図７のステップＳ７０６における判定結果は、分類対象データｘのスコアｆ（ｘ）に応じて図１１のようになる。正例と負例との境目となる閾値が、従来技術では０であるのに対して、本発明では０．４に増加している（図形的には分離超平面の位置が、正例側に補正されたことを意味する）。その結果スコアが０から０．４までの範囲で、従来技術では正例と判定されていたデータが、本発明では負例と判定されることになる。
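The threshold of 0.4 follows from the setbacks 0.6 and 1.4 when their reciprocals are used as the densities with k = 1. The sketch below checks that value and shows which scores change class; the rule that a score strictly greater than the threshold is positive is an assumption, and the sample scores are arbitrary.

```python
from fractions import Fraction as F

setback_pos, setback_neg = F(3, 5), F(7, 5)   # 0.6 and 1.4 from step S1006
dp, dn = 1 / setback_pos, 1 / setback_neg     # density ~ 1 / setback
c = (dp - dn) / (dp + dn)                     # k = 1


def classify(score, threshold=0):
    """Positive when the score exceeds the threshold (tie handling is assumed)."""
    return "positive" if score > threshold else "negative"


print(c)  # 2/5
print(classify(F(1, 5)), classify(F(1, 5), c))  # positive negative
```

Here the positive side is the denser one, so the threshold rises and a score of 0.2 moves from positive to negative.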
【０１０９】
なお、ここでは説明の便宜上、ＳＶＭの再学習はｃ値の算出の都度行うフローとしたが（図１０ステップＳ１００１〜Ｓ１００３）、学習後のＳＶＭ乙は分類対象データｘには依存しないので、実際上は毎回計算し直す必要はない。たとえば、２回目以降の分類では１回目の分類で求められたＳＶＭ乙を使い回すことで、ステップＳ１００１〜Ｓ１００３の処理をスキップすることができる。
【０１１０】
以上説明した実施の形態１および２によれば、個々の分類対象データの位置における局所的な訓練データの密度比に応じて、正例と負例とを切り分ける分離超平面の位置を補正するので、素性空間上のどの場所でも、同じ量だけ同じ方向に分離超平面を移動させるよりも、より精度よくデータを分類することができる。
【０１１１】
なお、上述した実施の形態１および２では、正例訓練データ密度ｄｐの算出時には素性空間上の複数のデータのうち正例訓練データのみ（かつその一部。具体的には、実施の形態１ではマージン境界面上のＳＶ、実施の形態２では垂線Ｖ上のＳＶ）に、負例訓練データ密度ｄｎの算出時には負例訓練データのみ（同上）に、それぞれ注目したが、逆に密度の計算に用いる訓練データは上記に限定されるものではない。
【０１１２】
たとえば正例訓練データだけでなく、負例訓練データも参照して正例訓練データ密度ｄｐを算出するようにしてもよいし、負例訓練データだけでなく、正例訓練データも参照して負例訓練データ密度ｄｎを算出するようにしてもよい。また、訓練データの中に正例でも負例でもない分類不明のデータを含ませておき、ｄｐやｄｎの算出にあたって、正例・負例のほか分類不明の訓練データの分布状況も考慮するようにしてもよい。
【０１１３】
（実施の形態３）
さて、上述した実施の形態１および２はもっぱらＳＶＭについて分類精度の向上をはかったものであるが、訓練データの分布状況に応じて分類の閾値を補正する工夫は、ＳＶＭに限らず他の分類手法にも広く応用可能である。実施の形態３ではその一例として、ｋＮＮ（ｋ−ＮｅａｒｅｓｔＮｅｉｇｈｂｏｒ）法による分類において、訓練データの密度比に応じた閾値の補正を行う。なお、簡単のためｋ＝１とする。
【０１１４】
実施の形態３によるデータ分類装置のハードウエア構成は、図５に示した実施の形態１のそれと同様であるので説明を省略する。図１２は、本発明の実施の形態３によるデータ分類装置の構成を機能的に示す説明図である。また、図１３は本発明の実施の形態３によるデータ分類装置における、データ分類処理の手順を示すフローチャートである。以下、図１３に示す手順に沿って、図１２に示す各部の機能を順次説明する。
【０１１５】
まず、訓練データあるいは分類対象データとして使用される入力情報および出力情報を、データ入力部１２００により装置内部に取り込み（ステップＳ１３０１）、取り込んだ入力情報を、データ変換部１２０１により所定の形式に変換（具体的にはベクトル化）する（ステップＳ１３０２）。
【０１１６】
なお、ベクトル化された訓練データを以下では事例という。通常は分類の判明しているデータを訓練データとして使用するが、本発明では事例の中に、分類の判明しているものとそうでないものとが混在しているものとする（すなわち、事例の一部には分類が付与されていない）。この分類不明の事例は、もっぱら以下で説明する相対データ密度の算出時に考慮される、密度補完用の事例である。なお、分類不明の事例の存在は必須ではなく、すべての事例に分類が付与されているのでももちろんよい。
【０１１７】
次に、距離算出部１２０２により、事例のうち特に分類の付与されているものについて、他のすべての事例（分類の付与の有無を問わない）との間の距離を算出する（ステップＳ１３０３）。ここでは、訓練データは事例１〜５の５つであるものとし、分類Ａまたは分類Ｂに分類される事例１〜３と他の事例との距離が、それぞれ図１４に示すようであったとする。
【０１１８】
次に、相対データ密度算出部１２０３により、事例のうち特に分類の付与されているものについて、各事例の位置における相対データ密度を算出する（ステップＳ１３０４）。
【０１１９】
この相対データ密度は上述の式（６）や式（７）と同様に求めてもよいが、ここでは簡略化して、距離が一定以内、たとえば距離４以内の範囲に存在するデータの個数により表現するものとする。なお、この距離の閾値（ここでは４）は、交差検定などによって最適な数値を求めることができる。また、これ以外の手法で密度を算出することももちろん可能である。
【０１２０】
事例１の位置での相対データ密度＝１（距離４以内には事例１しかない）
事例２の位置での相対データ密度＝４（距離４以内には事例２〜５の４つがある）
事例３の位置での相対データ密度＝３（距離４以内には事例２・３・５の３つがある）
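The simplified density of step S1304 is a neighbor count within a fixed radius. FIG. 14 is not reproduced in this text, so the distance table below is hypothetical: it was chosen only so that the counts agree with the three densities listed above, and it should not be read as the figure's actual values. The function name is illustrative.

```python
def density_within(dist, case, radius):
    """Count the cases, including `case` itself, within `radius` of `case`."""
    return sum(1 for d in dist[case].values() if d <= radius)


# Hypothetical distances from the labeled cases 1-3 to all cases 1-5 (assumption).
dist = {
    1: {1: 0, 2: 7, 3: 8, 4: 6, 5: 9},
    2: {1: 7, 2: 0, 3: 3, 4: 2, 5: 4},
    3: {1: 8, 2: 3, 3: 0, 4: 5, 5: 2},
}
densities = {case: density_within(dist, case, radius=4) for case in dist}
print(densities)  # {1: 1, 2: 4, 3: 3}
```

Cases 4 and 5 carry no class label, yet they still raise the counts for cases 2 and 3, which is the role of the density-completion cases described earlier.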
【０１２１】
その後データを指定して、当該データの分類先を判定するようユーザから指示があると（ステップＳ１３０５：Ｙｅｓ）、再度距離算出部１２０２により、当該データと上記事例のうち特に分類が付与されているものとの距離を算出する（ステップＳ１３０６）。ここでは分類対象データと、分類の判明している事例との距離が以下のようであったとする。
【０１２２】
距離（分類対象データ〜事例１）＝６
距離（分類対象データ〜事例２）＝２
距離（分類対象データ〜事例３）＝３
【０１２３】
そして、従来技術のｋＮＮ法、ここではｋ＝１なので１ＮＮ法によれば、分類対象データからの距離が最も小さい（分類対象データに最も近い）事例の分類を当該データの分類とするので、上記距離が最も小さい事例２の分類、すなわち分類Ｂが分類結果となるところである。
【０１２４】
しかしながら本発明では、さらに距離補正部１２０４により、ステップＳ１３０６で算出した各事例までの距離（従来のｋＮＮ法で利用する距離）を、ステップＳ１３０４で算出した各事例の位置での相対データ密度により補正する（ステップＳ１３０７）。この補正方法にも種々のものがあるが、ここではステップＳ１３０６の距離に、ステップＳ１３０４の相対データ密度を掛け合わせたものを補正後の距離とする。
【０１２５】
補正後距離（分類対象データ〜事例１）＝６×１＝６
補正後距離（分類対象データ〜事例２）＝２×４＝８
補正後距離（分類対象データ〜事例３）＝３×３＝９
【０１２６】
そしてデータ分類部１２０５により、ステップＳ１３０７で算出した補正後の距離を用いて、ｋＮＮ法（ここではｋ＝１なので１ＮＮ法）による分類を行う（ステップＳ１３０８）。ここでは、補正後の距離が最も小さい事例１の分類、すなわち分類Ａが分類結果となる。最後にこの分類結果を、データ出力部１２０６によりファイルあるいはディスプレイなどに出力して（ステップＳ１３０９）、本フローチャートによる処理を終了する。
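Steps S1306 to S1308 can be sketched as a 1-NN search over density-corrected distances. The distances and densities are the ones given in the worked example; the label of case 3 is not stated in the text (only that cases 1 to 3 belong to class A or B), so it is set to "B" here as an assumption that does not affect either result. The function name is illustrative.

```python
cases = {
    "case1": {"dist": 6, "density": 1, "label": "A"},
    "case2": {"dist": 2, "density": 4, "label": "B"},
    "case3": {"dist": 3, "density": 3, "label": "B"},  # label is an assumption
}


def nearest_label(cases, corrected):
    """1-NN label using the raw distance or the distance times the local density."""
    def key(name):
        c = cases[name]
        return c["dist"] * c["density"] if corrected else c["dist"]
    return cases[min(cases, key=key)]["label"]


print(nearest_label(cases, corrected=False))  # B
print(nearest_label(cases, corrected=True))   # A
```

Multiplying by the density penalizes neighbors in crowded regions, so the sparse case 1 (corrected distance 6) wins over case 2 (corrected distance 8).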
【０１２７】
以上説明したように実施の形態３によれば、訓練データ（分類の付与の有無を問わない）の密度が事例１付近では低く、事例２付近では高いという分布の偏りを考慮して、分類Ａと分類Ｂとの分離面を、相対的に訓練データ密度が高い分類Ｂ側に移動するため、訓練データの分布の偏りが原因で誤って分類Ｂに分類されたデータを、正しい分類Ａに分類し直せる確率が高くなる。
【０１２８】
そして、実施の形態１〜３のような分離面の位置の補正により、補正のない場合に比べて分類精度が向上することで、逆に、同じ精度を達成するための訓練データの必要数を減らせるという利点もある。さらに、訓練データの数が減ることで学習や判別の速度の向上も期待できる。
【０１２９】
なお、上述したデータ入力部６００／１２００、データ変換部６０１／１２０１、ＳＶＭ学習部６０２、ｃ値算出部６０３、データ分類部６０４／１２０５、データ出力部６０５／１２０６、距離算出部１２０２、相対データ密度算出部１２０３および距離補正部１２０４は、具体的にはＨＤ５０５からＲＡＭ５０３に読み出されたプログラムをＣＰＵ５０１が実行することにより実現される。このプログラムはＨＤ５０５のほか、ＦＤ５０７、ＣＤ−ＲＷ５０９、ＭＯなどの各種の記録媒体に格納して配布することができ、ネットワークを介して配布することも可能である。
【０１３０】
【発明の効果】
以上説明したように請求項１に記載の発明は、複数のデータの存在する空間を分離面により各分類ごとの空間に分割するデータ分類装置において、分類対象データの位置における、第１の分類に分類される訓練データの密度を算出する第１の密度算出手段と、前記分類対象データの位置における、第２の分類に分類される訓練データの密度を算出する第２の密度算出手段と、前記第１の密度算出手段により算出された密度および前記第２の密度算出手段により算出された密度にもとづいて前記分離面の位置を補正する分離面位置補正手段と、を備えたので、第１の分類と第２の分類との境界となる分離面の位置が、上記各分類に属する訓練データの密度に応じて補正され、これによって、訓練データの分布が均一でない場合にも精度よくデータを分類することが可能なデータ分類装置が得られるという効果を奏する。
【０１３１】
また、請求項２に記載の発明は、前記請求項１に記載の発明において、前記分離面位置補正手段が、前記第１の分類または前記第２の分類のうち、前記第１の密度算出手段または前記第２の密度算出手段により算出された密度が相対的に高い側へ前記分離面の位置を補正するので、訓練データの分布が均一でない場合に、訓練データ密度の相対的に高い分類へデータが分類されやすいという特性が修正され、これによって、訓練データの分布が均一でない場合にも精度よくデータを分類することが可能なデータ分類装置が得られるという効果を奏する。
【０１３２】
また、請求項３に記載の発明は、前記請求項１または請求項２に記載の発明において、前記第１の密度算出手段が前記第１の分類に分類されていない訓練データも含めて前記密度を算出するので、補正後の分離面の位置を決定する訓練データ密度は、第１の分類が付与された訓練データのほか、正確な分類は不明であるものの、第１の分類が付与される確率の高い訓練データを含めて計算され、これによって、訓練データの分布が均一でない場合にも精度よくデータを分類することが可能なデータ分類装置が得られるという効果を奏する。
【０１３３】
また、請求項４に記載の発明は、前記請求項１〜請求項３のいずれか一つに記載の発明において、前記第１の密度算出手段が前記第１の分類に分類される訓練データのうち一部の訓練データを用いて前記密度を算出するので、補正後の分離面の位置を決定する訓練データ密度は、前記空間内に存在する一部の訓練データの密度によって近似され、これによって、計算量を抑制しつつ、訓練データの分布が均一でない場合にも精度よくデータを分類することが可能なデータ分類装置が得られるという効果を奏する。
【０１３４】
また、請求項５に記載の発明は、前記請求項４に記載の発明において、前記一部の訓練データは前記分離面からの距離が所定の条件を満たす訓練データであるので、補正後の分離面の位置を決定する訓練データ密度は、たとえば補正前の分離面からの距離が一定の訓練データの密度（補正前の分離面に対して平行な面上での訓練データ密度）によって近似され、これによって、計算量を抑制しつつ、訓練データの分布が均一でない場合にも精度よくデータを分類することが可能なデータ分類装置が得られるという効果を奏する。
【０１３５】
また、請求項６に記載の発明は、前記請求項１〜請求項５のいずれか一つに記載の発明において、前記第１の密度算出手段が前記第１の分類に分類される訓練データを含む第１の訓練データ集合および前記第１の訓練データ集合から一部の訓練データを除外した第２の訓練データ集合を用いて前記密度を算出するので、補正後の分離面の位置を決定する訓練データ密度は、補正前の分離面に対して垂直な方向での訓練データ密度によって近似され、これによって、計算量を抑制しつつ、訓練データの分布が均一でない場合にも精度よくデータを分類することが可能なデータ分類装置が得られるという効果を奏する。
【０１３６】
また、請求項７に記載の発明は、前記請求項１〜請求項６のいずれか一つに記載の発明において、前記第１の密度算出手段が前記訓練データ間の距離にもとづいて前記密度を算出するので、補正後の分離面の位置を決定する訓練データ密度の算出に複雑な計算を必要とせず、これによって、計算量を抑制しつつ、訓練データの分布が均一でない場合にも精度よくデータを分類することが可能なデータ分類装置が得られるという効果を奏する。
【０１３７】
また、請求項８に記載の発明は、複数のデータの存在する空間を分離面により各分類ごとの空間に分割するデータ分類方法において、分類対象データの位置における、第１の分類に分類される訓練データの密度を算出する第１の密度算出工程と、前記分類対象データの位置における、第２の分類に分類される訓練データの密度を算出する第２の密度算出工程と、前記第１の密度算出工程で算出された密度および前記第２の密度算出工程で算出された密度にもとづいて前記分離面の位置を補正する分離面位置補正工程と、を含んだので、第１の分類と第２の分類との境界となる分離面の位置が、上記各分類に属する訓練データの密度に応じて補正され、これによって、訓練データの分布が均一でない場合にも精度よくデータを分類することが可能なデータ分類方法が得られるという効果を奏する。
【０１３８】
また、請求項９に記載の発明は、前記請求項８に記載の発明において、前記分離面位置補正工程では、前記第１の分類または前記第２の分類のうち、前記第１の密度算出工程または前記第２の密度算出工程で算出された密度が相対的に高い側へ前記分離面の位置を補正するので、訓練データの分布が均一でない場合に、訓練データ密度の相対的に高い分類へデータが分類されやすいという特性が修正され、これによって、訓練データの分布が均一でない場合にも精度よくデータを分類することが可能なデータ分類方法が得られるという効果を奏する。
【０１３９】
また、請求項１０に記載の発明は、前記請求項８または請求項９に記載の発明において、前記第１の密度算出工程では前記第１の分類に分類されていない訓練データも含めて前記密度を算出するので、補正後の分離面の位置を決定する訓練データ密度は、第１の分類が付与された訓練データのほか、正確な分類は不明であるものの、第１の分類が付与される確率の高い訓練データを含めて計算され、これによって、訓練データの分布が均一でない場合にも精度よくデータを分類することが可能なデータ分類方法が得られるという効果を奏する。
【０１４０】
また、請求項１１に記載の発明は、前記請求項８〜請求項１０のいずれか一つに記載の発明において、前記第１の密度算出工程では前記第１の分類に分類される訓練データのうち一部の訓練データを用いて前記密度を算出するので、補正後の分離面の位置を決定する訓練データ密度は、前記空間内に存在する一部の訓練データの密度によって近似され、これによって、計算量を抑制しつつ、訓練データの分布が均一でない場合にも精度よくデータを分類することが可能なデータ分類方法が得られるという効果を奏する。
【０１４１】
また、請求項１２に記載の発明は、前記請求項１１に記載の発明において、前記一部の訓練データは前記分離面からの距離が所定の条件を満たす訓練データであるので、補正後の分離面の位置を決定する訓練データ密度は、たとえば補正前の分離面からの距離が一定の訓練データの密度（補正前の分離面に対して平行な面上での訓練データ密度）によって近似され、これによって、計算量を抑制しつつ、訓練データの分布が均一でない場合にも精度よくデータを分類することが可能なデータ分類方法が得られるという効果を奏する。
【０１４２】
また、請求項１３に記載の発明は、前記請求項８〜請求項１２のいずれか一つに記載の発明において、前記第１の密度算出工程では前記第１の分類に分類される訓練データを含む第１の訓練データ集合および前記第１の訓練データ集合から一部の訓練データを除外した第２の訓練データ集合を用いて前記密度を算出するので、補正後の分離面の位置を決定する訓練データ密度は、補正前の分離面に対して垂直な方向での訓練データ密度によって近似され、これによって、計算量を抑制しつつ、訓練データの分布が均一でない場合にも精度よくデータを分類することが可能なデータ分類方法が得られるという効果を奏する。
【０１４３】
また、請求項１４に記載の発明は、前記請求項８〜請求項１３のいずれか一つに記載の発明において、前記第１の密度算出工程では前記訓練データ間の距離にもとづいて前記密度を算出するので、補正後の分離面の位置を決定する訓練データ密度の算出に複雑な計算を必要とせず、これによって、計算量を抑制しつつ、訓練データの分布が均一でない場合にも精度よくデータを分類することが可能なデータ分類方法が得られるという効果を奏する。
【０１４４】
また、請求項１５に記載の発明によれば、前記請求項８〜請求項１４のいずれか一つに記載された方法をコンピュータに実行させることが可能なプログラムが得られるという効果を奏する。
【図面の簡単な説明】
【図１】従来技術のＳＶＭによる分類の原理を模式的に示す説明図である。
【図２】従来技術のＳＶＭによる分類の原理と、本発明のＳＶＭによる分類の原理とを対比して示す説明図である。
【図３】従来技術のＳＶＭによる分類の原理と、本発明のＳＶＭによる分類の原理とを対比して示す説明図である。
【図４】本発明の実施の形態２にかかるデータ分類装置における、訓練データ密度の推定の原理を示す説明図である。
【図５】本発明の実施の形態１によるデータ分類装置のハードウエア構成の一例を示す説明図である。
【図６】本発明の実施の形態１によるデータ分類装置の構成を機能的に示す説明図である。
【図７】本発明の実施の形態１によるデータ分類装置における、データ分類処理の手順を示すフローチャートである。
【図８】本発明の実施の形態１によるデータ分類装置における、ｃ値算出処理（図７のステップＳ７０５）の手順を詳細に示すフローチャートである。
【図９】本発明の実施の形態１によるデータ分類装置における、分類対象データのスコアとその分類結果との関係を示す説明図である。
【図１０】本発明の実施の形態２によるデータ分類装置における、ｃ値算出処理（図７のステップＳ７０５）の手順を詳細に示すフローチャートである。
【図１１】本発明の実施の形態２によるデータ分類装置における、分類対象データのスコアとその分類結果との関係を示す説明図である。
【図１２】本発明の実施の形態３によるデータ分類装置の構成を機能的に示す説明図である。
【図１３】本発明の実施の形態３によるデータ分類装置における、データ分類処理の手順を示すフローチャートである。
【図１４】本発明の実施の形態３によるデータ分類装置に入力される、事例（訓練データ）間の距離の一覧を示す説明図である。
【符号の説明】
５００バス
５０１ＣＰＵ
５０２ＲＯＭ
５０３ＲＡＭ
５０４ＨＤＤ
５０５ＨＤ
５０６ＦＤＤ
５０７ＦＤ
５０８ＣＤ−ＲＷドライブ
５０９ＣＤ−ＲＷ
５１０ディスプレイ
５１１キーボード
５１２マウス
５１３ネットワークＩ／Ｆ
５１４通信ケーブル
６００，１２００データ入力部
６０１，１２０１データ変換部
６０２ＳＶＭ学習部
６０３ｃ値算出部
６０４，１２０５データ分類部
６０５，１２０６データ出力部
１２０２距離算出部
１２０３相対データ密度算出部
１２０４距離補正部

[0001]
[Technical field of the invention]
The present invention relates to a data classification device that divides a space in which a plurality of data items exist into a space for each classification by means of a separation surface (a plane, a curved surface, or the like), to a data classification method, and to a program that causes a computer to execute the method.
[0002]
[Prior art]
The SVM (Support Vector Machine) is a technique for dividing a space containing data, each labeled with one of two or more classes, into a region for each class. An SVM maps the vector representing each training datum into a feature space (generally of higher dimension than the input space) and finds the margin that separates the positive examples from the negative examples most widely. A separating hyperplane that divides the positive and negative examples is then constructed at the center of that margin (FIG. 1).
[0003]
In general, classifying data in a high-dimensional space makes classification easier but is known to make overfitting to the particular data more likely. With its margin-maximization strategy, however, the SVM in practice uses for discrimination only as many dimensions as depend on the number of data in contact with the margin (SVs: Support Vectors). In other words, by using only the axes that are effective for classification in the high-dimensional space, it prevents overfitting while retaining the ease of classification that a high-dimensional space offers.
[0004]
[Problems to be solved by the invention]
However, the rationale for placing the separating hyperplane at the center of the margin has hardly been examined. The prior art implicitly assumes that the positive-example and negative-example training data are uniformly distributed in the feature space. Because actual training data are not uniformly distributed, classification accuracy deteriorates where the training data density is relatively low, as described later.
[0005]
To solve the above problem of the prior art, an object of the present invention is to provide a data classification device, a data classification method, and a program for causing a computer to execute the method that can classify data accurately even when the distribution of training data is not uniform, while suppressing the amount of calculation.
[0006]
[Means for Solving the Problems]
To solve the problem described above and achieve the object, the data classification device according to the invention of claim 1 divides a space in which a plurality of data items exist into a space for each classification by means of a separation surface, and comprises: first density calculating means for calculating the density of training data classified into a first classification at the position of the data to be classified; second density calculating means for calculating the density of training data classified into a second classification at the position of the data to be classified; and separation surface position correcting means for correcting the position of the separation surface based on the density calculated by the first density calculating means and the density calculated by the second density calculating means.
[0007]
According to the invention of claim 1, the position of the separation surface that forms the boundary between the first classification and the second classification is corrected according to the density of the training data belonging to each classification.
[0008]
In the data classification device according to the invention of claim 2, in the invention of claim 1, the separation surface position correcting means corrects the position of the separation surface toward whichever of the first classification and the second classification has the relatively higher density as calculated by the first density calculating means or the second density calculating means.
[0009]
According to the invention of claim 2, the tendency for data to be classified into the class with the relatively higher training data density when the training data are not uniformly distributed (the tendency that causes the deterioration in classification accuracy) is corrected.
[0010]
In the data classification device according to the invention of claim 3, in the invention of claim 1 or claim 2, the first density calculating means calculates the density including training data that are not classified into the first classification.
[0011]
According to the invention of claim 3, the training data density that determines the position of the corrected separation surface is calculated using not only the training data assigned the first classification but also training data whose exact classification is unknown yet which have a high probability of being assigned the first classification.
[0012]
In the data classification device according to the invention of claim 4, in the invention of any one of claims 1 to 3, the first density calculating means calculates the density using a part of the training data classified into the first classification.
[0013]
According to the invention of claim 4, the training data density that determines the position of the corrected separation surface is approximated by the density of a part of the training data existing in the space.
[0014]
In the data classification device according to the invention of claim 5, in the invention of claim 4, the part of the training data is training data whose distance from the separation surface satisfies a predetermined condition.
[0015]
According to the invention of claim 5, the training data density that determines the position of the corrected separation surface is approximated by, for example, the density of the training data at a constant distance from the uncorrected separation surface (the training data density on a surface parallel to the uncorrected separation surface).
[0016]
In the data classification device according to the invention of claim 6, in the invention of any one of claims 1 to 5, the first density calculating means calculates the density using a first training data set that includes the training data classified into the first classification and a second training data set obtained by excluding some training data from the first training data set.
[0017]
According to the invention of claim 6, the training data density that determines the position of the corrected separation surface is approximated by the training data density in the direction perpendicular to the uncorrected separation surface.
[0018]
Further, in the data classification device according to the invention of claim 7, in the invention according to any one of claims 1 to 6, the first density calculation means calculates the density based on the distances between the training data.
[0019]
According to the seventh aspect of the present invention, the calculation of the training data density for determining the position of the separation plane after the correction does not require a complicated calculation.
[0020]
The data classification method according to the invention of claim 8 is a data classification method in which a space containing a plurality of data is divided into spaces for each classification by a separation plane, the method comprising: a first density calculation step of calculating, at the position of classification target data, the density of the training data classified into a first classification; a second density calculation step of calculating, at the position of the classification target data, the density of the training data classified into a second classification; and a separation plane position correcting step of correcting the position of the separation plane based on the density calculated in the first density calculation step and the density calculated in the second density calculation step.
[0021]
According to the invention described in claim 8, the position of the separation plane that is the boundary between the first classification and the second classification is corrected according to the density of the training data belonging to each classification.
[0022]
According to a ninth aspect of the present invention, in the data classification method according to the eighth aspect, in the separation plane position correcting step, the position of the separation plane is corrected toward whichever of the first classification and the second classification has the relatively higher density as calculated in the first density calculation step or the second density calculation step.
[0023]
According to the ninth aspect of the present invention, when the distribution of the training data is not uniform, the characteristic that the data is easily classified into a class having a relatively high training data density is corrected.
[0024]
According to a tenth aspect of the present invention, in the data classification method according to the eighth or ninth aspect, in the first density calculation step, the density is calculated including training data that is not classified into the first classification.
[0025]
According to the tenth aspect of the present invention, the training data density used to determine the corrected position of the separation plane is calculated from not only the training data assigned the first classification but also training data whose exact classification is unknown yet which is highly likely to be assigned the first classification.
[0026]
Further, in the data classification method according to the invention described in claim 11, in the invention according to any one of claims 8 to 10, in the first density calculation step, the density is calculated using a part of the training data classified into the first classification.
[0027]
According to the eleventh aspect, the training data density for determining the position of the separation plane after the correction is approximated by the density of a part of the training data existing in the space.
[0028]
According to a twelfth aspect of the present invention, in the data classification method according to the eleventh aspect, the part of the training data is training data whose distance from the separation plane satisfies a predetermined condition.
[0029]
According to the twelfth aspect of the present invention, the training data density used to determine the corrected position of the separation plane is, for example, the density of the training data lying at a fixed distance from the separation plane before the correction (that is, the training data density on a plane parallel to the separation plane before the correction).
[0030]
In the data classification method according to the invention described in claim 13, in the invention described in any one of claims 8 to 12, in the first density calculation step, the density is calculated using a first training data set including the training data classified into the first classification and a second training data set obtained by removing some training data from the first training data set.
[0031]
According to the thirteenth aspect, the training data density for determining the position of the separation plane after correction is approximated by the training data density in the direction perpendicular to the separation plane before correction.
[0032]
According to a fourteenth aspect of the present invention, in the data classification method according to any one of the eighth to thirteenth aspects, in the first density calculation step, the density is calculated based on the distances between the training data.
[0033]
According to the fourteenth aspect of the present invention, no complicated calculation is required for calculating the training data density for determining the position of the corrected separation plane.
[0034]
A program according to a fifteenth aspect of the present invention causes a computer to execute the method according to any one of the eighth to fourteenth aspects.
[0035]
BEST MODE FOR CARRYING OUT THE INVENTION
Preferred embodiments of a data classification device, a data classification method, and a program for causing a computer to execute the method according to the present invention will be described in detail below with reference to the accompanying drawings. First, the gist of the invention is briefly described.
[0036]
(Gist of the invention)
As described above, when the distribution of the training data in the feature space is uniform, placing the separating hyperplane at the center of the margin is considered reasonable. However, the distribution of actual training data is often non-uniform, and in that case this placement is not always reasonable.
[0037]
For example, as shown in FIG. 2, it is assumed that three positive example training data (“中” in the figure) and 24 negative example training data (“×”) are distributed in a space of the same size. At this time, it is assumed that the following holds.
[0038]
(A) There is only one hyperplane in the feature space that can completely distinguish positive and negative examples.
(B) The distribution of the positive example data is uniform with respect to a location that is a positive example in the feature space.
(C) The positive example training data is randomly selected from the positive example data.
(D) The distribution of the negative example data is uniform with respect to a location that is a negative example in the feature space.
(E) Negative example training data is randomly selected from the negative example data.
[0039]
Under the above conditions, consider the distribution along the axis perpendicular to the ideal separation hyperplane in the feature space. Statistically, the most probable ratio between the distance from the separation hyperplane to the nearest positive example training data and the distance to the nearest negative example training data is 8:1, in keeping with the training data density ratio. In other words, it is more reasonable to place the separating hyperplane at the position that internally divides the margin 8:1 than at the position that divides it 1:1 (that is, the center of the margin).
[0040]
Therefore, in the present invention, the position of the separation hyperplane is corrected toward the side with the relatively higher training data density. This movement can be realized by modifying the SVM discriminant as follows (sign (p) is a function that returns the sign of p).
(Before correction)
[Formula 1]

(Revised)
[Formula 2]
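The two discriminants are not reproduced in this text. The following is a hedged reconstruction, not the source equations: it assumes the standard kernel SVM notation (weights α_i, labels y_i, training vectors x_i, bias b), and it chooses the sign of c so that the threshold on the uncorrected score equals c, which is what the worked examples later in the description imply (thresholds of −253/541 and 0.4). The form of c is taken from the numerical example c = (3 − 24) / (3 + 24) with k = 1.

```latex
\begin{align}
\text{(before correction)}\quad
  f(x) &= \operatorname{sign}\Bigl(\sum_i \alpha_i y_i K(x_i, x) + b\Bigr) \\
\text{(after correction)}\quad
  f(x) &= \operatorname{sign}\Bigl(\sum_i \alpha_i y_i K(x_i, x) + b - c\Bigr),
  \qquad c = k\,\frac{d_p - d_n}{d_p + d_n}
\end{align}
```

With this convention, a negative c (negative examples denser) lowers the threshold, so more data is classified as positive.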

[0041]
The problem is how to calculate the value of c, in other words, how to determine the amount and direction of movement of the separation hyperplane. Embodiments 1 and 2 described below differ in the details of how this c value is calculated.
[0042]
In the simplest case, the numbers of positive and negative example training data per space of the same size may be used as the positive example training data density dp and the negative example training data density dn, respectively. For example, in FIG. 2, three positive example training data and 24 negative example training data are distributed in a space of the same size. Therefore, if k = 1, c = (3 − 24) / (3 + 24) = −7/9.
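The arithmetic of this simplest case can be checked with a short sketch. The function name `correction_c` is hypothetical, and the formula c = k (dp − dn) / (dp + dn) is the form implied by the numerical example rather than a quotation of the source equation.

```python
from fractions import Fraction


def correction_c(dp, dn, k=1):
    """Offset c = k * (dp - dn) / (dp + dn), as implied by the FIG. 2 example."""
    dp, dn = Fraction(dp), Fraction(dn)
    return k * (dp - dn) / (dp + dn)


# FIG. 2: 3 positive and 24 negative training data in spaces of equal size.
c = correction_c(3, 24)
print(c)  # -7/9
```

Exact fractions are used so that the result can be compared directly with the value given in the text.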
[0043]
Alternatively, the numbers of positive and negative example training data that touch the margin (that is, lie on the supporting hyperplanes forming the boundary surfaces of the margin), namely the total number SVp of positive example SVs and the total number SVn of negative example SVs, may be assumed to be proportional to the positive example training data density dp and the negative example training data density dn, respectively, giving:
(Equation 3)

Or
(Equation 4)

And so on.
[0044]
Note that when a soft margin SVM is used, an SV is not always on the margin boundary surface (training data on the boundary surface is always an SV, but an SV is not necessarily on the boundary surface). In this case, only the SVs that lie on the margin boundary surface may be counted as SVp and SVn. Specifically, an SV on the margin boundary surface is an SV whose weight is less than the cost value of the SVM (the penalty for misclassifying training data).
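The selection rule in this paragraph (keep an SV only if its weight is below the cost value) can be sketched as follows. The function name and the weights are hypothetical illustrations; they are not values from the source.

```python
def margin_boundary_svs(support_vectors, cost):
    """Return the names of SVs whose weight is strictly below the cost value."""
    return [name for name, weight in support_vectors if weight < cost]


# Hypothetical (name, weight) pairs; SVp4 has reached the cost value of 1.0,
# so it is treated as lying off the margin boundary surface.
svs = [("SVp1", 0.4), ("SVp2", 0.7), ("SVp4", 1.0), ("SVn1", 0.9)]
on_boundary = margin_boundary_svs(svs, cost=1.0)
print(on_boundary)  # ['SVp1', 'SVp2', 'SVn1']
```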
[0045]
As described above, the positive and negative example training data densities can be estimated from the number of training data per unit space, from the number of training data that are SVs, or from the number of SVs that lie on the margin boundary surface. Shifting the separation hyperplane toward the higher-density side then improves the classification accuracy on the lower-density side.
[0046]
However, each of the above formulas assumes that the positive example data (and the positive example training data randomly selected from it) are uniformly distributed over the positive example region of the feature space, and that the negative example data (and the negative example training data randomly selected from it) are likewise uniformly distributed over the negative example region.
[0047]
However, in actual training data, as shown for example in FIG. 3, the data may be dense in some parts and sparse in others even within the same negative example side. That is, conditions (B) to (E) assumed above often do not actually hold.
[0048]
Therefore, in the present invention, the model is extended so that the position of the separation hyperplane is corrected based on the local density of the training data in the vicinity of each classification target data. That is, as shown in FIG. 3, the density ratio between the positive and negative example training data at the position of the classification target data x (“?” in the figure) is calculated, and the position of the separation hyperplane is corrected according to that density ratio.
[0049]
As a measure of this local density, the present invention uses one of the following:
(1) Relative data density in the direction parallel to the separation hyperplane near the classification target data (Embodiment 1 described later)
(2) Relative data density in the direction perpendicular to the separation hyperplane near the classification target data (Embodiment 2 described later)
[0050]
That is, in the first embodiment, the training data density at the position of the classification target data x is estimated from the densities of the positive example SVs and the negative example SVs at the points closest to x on the positive example side and negative example side margin boundary surfaces, respectively. In other words, the training data density inside the margin boundary is estimated from the density of the SVs, which are the training data on the margin boundary.
[0051]
First, the relative data density at each SV position is determined. The distance between the vectors in the feature space is obtained by the following equation using the kernel function K which is also in the equations (1) and (2).
(Equation 5)
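The source equation is not reproduced in this text. As an illustration only, the standard identity for distance in a kernel-induced feature space is ‖φ(x) − φ(y)‖ = sqrt(K(x, x) − 2K(x, y) + K(y, y)); whether Equation 5 takes exactly this form is an assumption. The kernel functions and names below are hypothetical examples.

```python
import math


def linear_kernel(x, y):
    """Plain dot product; the feature space equals the input space."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian kernel; gamma is an arbitrary illustrative value."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def feature_distance(x, y, kernel):
    """Distance between phi(x) and phi(y), computed from kernel values only."""
    squared = kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)
    return math.sqrt(max(squared, 0.0))  # clamp tiny negative rounding errors


print(feature_distance((0.0, 0.0), (3.0, 4.0), linear_kernel))  # 5.0
```

With the linear kernel the result is the ordinary Euclidean distance, which gives a quick sanity check before switching to another kernel.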

[0052]
Here, the relative positive example data density (SVpi) at the position of SVpi, the i-th positive example SV, and the relative negative example data density (SVni) at the position of SVni, the i-th negative example SV, are defined as follows.
(Equation 6)

[0053]
The relative data density of each SV calculated by the above formulas increases as more SVs lie close to that SV (at a small distance from it), that is, where the SVs are more densely distributed. Any formula may be used instead, provided that the relative data density it yields is higher where SVs cluster and lower where they are sparse. Other possible formulas include the following.
[Formula 7]

[0054]
Next, from the relative data densities at the individual SV positions calculated above, the relative positive example data density at the point on the positive example side margin boundary surface closest to the classification target data x, and the relative negative example data density at the point on the negative example side margin boundary surface closest to x, are calculated. These are regarded as the relative positive example data density (x) and the relative negative example data density (x) at the position of the classification target data x.
(Equation 8)

[0055]
As can be seen from the above equation, the relative positive example data density (x) is obtained by multiplying the relative positive example data density (SVpi) at each positive example SV position by a weight (wpi) that is inversely proportional to the distance from that SV to the classification target data x, and summing the results. That is, it is a weighted average of the relative positive example data densities (SVpi) in which SVs closer to x receive larger weights and SVs farther from x receive smaller weights. Similarly, the relative negative example data density (x) is a weighted average of the relative negative example data densities (SVni) at all the negative example SV positions.
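Since the equations themselves are not reproduced in this text, the following is a hedged reconstruction based on the description and on the t = 1 arithmetic in the embodiment. The symbol ρ is introduced here for the relative data density, and placing t as an exponent on the distance is an assumption, because the source only shows the case t = 1. The negative example side is written the same way with SVni and wni.

```latex
\begin{align}
\rho(SV_{pi}) &= \sum_{j \ne i} \frac{1}{\operatorname{distance}(SV_{pi}, SV_{pj})^{t}} \\
w_{pi} &= \frac{1 / \operatorname{distance}(x, SV_{pi})^{t}}
               {\sum_{j} 1 / \operatorname{distance}(x, SV_{pj})^{t}} \\
\rho_p(x) &= \sum_{i} w_{pi}\, \rho(SV_{pi})
\end{align}
```

The weights sum to 1 by construction, so the last line is a weighted average.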
[0056]
As long as nearer SVs receive larger weights and farther SVs receive smaller weights, the formulas for wpi and wni are not limited to the above, and any formula may be used. Other possible formulas include the following.
(Equation 9)

[0057]
Then, the c value is calculated by equation (2), substituting the relative positive example data density (x) at the position of the classification target data x for dp and the relative negative example data density (x) for dn.
[0058]
In the second embodiment, by contrast, the density ratio of positive to negative example training data is calculated not in the direction along the separation hyperplane (the parallel direction) but in the direction away from it (the perpendicular direction), and the c value, that is, the amount and direction of movement of the separation hyperplane, is calculated according to that ratio.
[0059]
That is, first, the SVM is trained using all of the training data, and the resulting SVM is designated SVM A. Next, the training data identical to the SVs of SVM A is removed from the training data (if several training data are identical to an SV, all of them may be removed, though this is optional). The SVM is then trained again using the remaining training data, and the resulting SVM is designated SVM B.
[0060]
As shown in FIG. 4, since SVM B adopts the largest margin among the training data that remains after the SVs of SVM A are removed, the margin boundary surfaces on both the positive example side and the negative example side recede, to a greater or lesser extent, away from the separation hyperplane of SVM A.
[0061]
However, the retreat amount is not the same on the positive example side and the negative example side. Considering a straight line perpendicular to the separation hyperplane of SVM A, the retreat amount of the boundary surface along this line is smaller on the side where the training data density is relatively higher (the negative example side in the illustrated example) than on the side where it is relatively lower (the positive example side in the same example). That is, the higher the training data density, the less the margin boundary surface recedes when the SVs are removed.
[0062]
The present invention focuses on this point. On the premise that the retreat amount of the margin boundary surface is inversely proportional to the training data density, the ratio of the retreat amounts of the margin boundary surfaces on the positive example side and the negative example side is obtained from SVM A (before the SVs are removed) and SVM B (after they are removed), and the density ratio of the positive to negative example training data is estimated from that ratio.
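The idea can be illustrated with a deliberately simplified sketch: one-dimensional, linearly separable data with a hard margin, where the SVs are simply the nearest point on each side. This toy setting, the helper `hard_margin_1d`, and the data values are all assumptions for illustration; the source describes a general kernel SVM, not this special case.

```python
from fractions import Fraction as F


def hard_margin_1d(pos, neg):
    """Score function and SVs for separable 1-D data (all neg < all pos).

    The score is scaled so that the nearest positive scores +1 and the
    nearest negative scores -1, as on the SVM margin boundaries.
    """
    p, n = min(pos), max(neg)

    def score(x):
        return (2 * x - (p + n)) / (p - n)

    return score, p, n


# Hypothetical data: sparse positives, dense negatives.
pos = [F(1), F(2), F(3)]
neg = [F(-1), F(-5, 4), F(-3, 2), F(-7, 4), F(-2)]

score_a, p, n = hard_margin_1d(pos, neg)                 # "SVM A"
_, P, N = hard_margin_1d([x for x in pos if x != p],     # "SVM B": retrain
                         [x for x in neg if x != n])     # without A's SVs

retreat_pos = score_a(P) - score_a(p)   # 1   (sparse side recedes far)
retreat_neg = score_a(n) - score_a(N)   # 1/4 (dense side recedes little)
dp, dn = 1 / retreat_pos, 1 / retreat_neg
c = (dp - dn) / (dp + dn)               # k = 1
print(retreat_pos, retreat_neg, c)      # 1 1/4 -3/5
```

The denser negative side recedes less, so dn exceeds dp and c comes out negative, in line with the premise stated above.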
[0063]
Let φ be a mapping function from the input space to the feature space. As shown in FIG. 4, consider the straight line that passes through the point φ (x) (“?” in the figure) in the feature space corresponding to the classification target data x and is perpendicular to the separation hyperplane of SVM A (“perpendicular V” in the figure). A point x ′ other than φ (x) on the perpendicular V can be expressed by the following equation.
(Equation 10)

[0064]
If the value of f (x) shown in the equation (2) is called a score, the score A (x ') of x' in the SVM A is as follows.
[Equation 11]

Note that the score of x ′ in SVM B can be calculated in the same manner.
[0065]
Here, let p be the intersection of the perpendicular V with the positive example margin boundary surface of SVM A, and let P be its intersection with the positive example margin boundary surface of SVM B. The retreat amount of the positive example margin boundary surface caused by retraining the SVM can be defined as the increase in the score from p to P, that is, A (P) - A (p).
[0066]
Since A (p) = 1, determining the retreat amount only requires calculating A (P), that is, the score that the point P on the positive example side margin boundary surface of SVM B receives when evaluated by SVM A. The required score A (P) can be calculated by the following equation.
(Equation 12)

[0067]
Similarly, let n be the intersection of the perpendicular V with the negative example margin boundary surface of SVM A, and let N be its intersection with the negative example margin boundary surface of SVM B. The retreat amount of the negative example margin boundary surface is then A (n) - A (N). A (n) is -1, and A (N) can be calculated by the following equation.
[Formula 13]

[0068]
As a result, the retreat amount of the positive example margin boundary surface (A (P) - A (p)) and the retreat amount of the negative example margin boundary surface (A (n) - A (N)) are obtained. Their reciprocals are used as dp and dn, respectively. Then, the c value is calculated by the above equation (2).
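Equations (12) and (13) are not reproduced in this text. The following reconstruction is inferred from the numerical substitutions in the second embodiment, where the four known scores are those of φ(x) and x′ under SVM A and SVM B; it is a sketch under that assumption, not a quotation. It relies on both scores varying linearly along the perpendicular V, so that the point where the SVM B score equals +1 (or −1) can be located by linear interpolation and then scored by SVM A.

```latex
\begin{align}
A(P) &= \frac{\bigl(A(x') - A(\varphi(x))\bigr)\bigl(1 - B(\varphi(x))\bigr)}
             {B(x') - B(\varphi(x))} + A(\varphi(x)) \\
A(N) &= \frac{\bigl(A(x') - A(\varphi(x))\bigr)\bigl(-1 - B(\varphi(x))\bigr)}
             {B(x') - B(\varphi(x))} + A(\varphi(x))
\end{align}
```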
[0069]
(Embodiment 1)
FIG. 5 is an explanatory diagram illustrating an example of a hardware configuration of the data classification device according to the first embodiment of the present invention. In the figure, reference numeral 501 denotes a CPU for controlling the entire apparatus, 502 denotes a ROM storing a basic input / output program, and 503 denotes a RAM used as a work area of the CPU 501.
[0070]
Reference numeral 504 denotes an HDD (hard disk drive) that controls reading and writing of data from and to an HD (hard disk) 505 under the control of the CPU 501, and reference numeral 505 denotes the HD that stores data written under the control of the HDD 504.
[0071]
Reference numeral 506 denotes an FDD (flexible disk drive) that controls reading and writing of data from and to an FD (flexible disk) 507 under the control of the CPU 501, and 507 denotes a removable FD that stores data written under the control of the FDD 506.
[0072]
Reference numeral 508 denotes a CD-RW drive that controls reading and writing of data to and from the CD-RW 509 under the control of the CPU 501, and reference numeral 509 denotes a removable CD-RW that stores data written under the control of the CD-RW drive 508.
[0073]
Reference numeral 510 denotes a display that shows a cursor, menus, windows, and various data such as characters and images; 511 denotes a keyboard having a plurality of keys for inputting characters, numerical values, various instructions, and the like; and 512 denotes a mouse for selecting and executing instructions, selecting a processing target, moving the mouse pointer, and the like.
[0074]
Reference numeral 513 denotes a network I / F that is connected to a network such as a LAN or WAN via a communication cable 514 and functions as an interface between the network and the CPU 501. Reference numeral 500 denotes a bus that connects the above units.
[0075]
Next, FIG. 6 is an explanatory diagram functionally showing the configuration of the data classification device according to the first embodiment of the present invention. FIG. 7 is a flowchart showing a procedure of a data classification process in the data classification device according to the first embodiment of the present invention. Hereinafter, the function of each unit shown in FIG. 6 will be sequentially described according to the procedure shown in FIG.
[0076]
Prior to the processing according to this flowchart, the format of input information and output (classification) information used as training data or classification target data is determined. Since the present invention is not concerned with the format of such information, a format according to a general supervised learning method may be adopted. Further, since input information needs to be vectorized to train the SVM, its format is also determined. Since the present invention is not concerned with this vector format, the vector format may be a format based on a general SVM learning method.
[0077]
First, input information and output information used as training data or classification target data are read into the apparatus by the data input unit 600 (step S701), and the input information is converted into a predetermined format (specifically, vectorized) by the data conversion unit 601 (step S702).
[0078]
Then, the SVM learning unit 602 trains the SVM using the training data to which classifications have been assigned in advance (step S703). The learning method is arbitrary. For example, SMO (John C. Platt, "Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines", 1998) or the method of T. Friess, N. Cristianini and C. Campbell ("... support vector machines", 1998) can be used. The classification accuracy can be improved by optimizing the parameters of the SVM by a general method (such as cross-validation). Here, it is assumed that training is performed with a soft margin SVM having a cost value of 1.
[0079]
Thereafter, when the user designates data and instructs the device to determine its classification (step S704: Yes), the c-value calculating unit 603 first calculates the above-described c value, that is, the amount and direction of movement of the separation hyperplane (step S705).
[0080]
FIG. 8 is a flowchart showing in detail the procedure of the c-value calculation process (step S705 in FIG. 7) in the data classification device according to the first embodiment of the present invention. First, the c-value calculation unit 603 extracts an SV existing on the margin boundary from the SVM after the learning by the SVM learning unit 602 (step S801). Here, it is assumed that the SV of the SVM is as follows.
[0081]

[0082]
Here, SVp4, whose weight is equal to or greater than the cost value (here, 1), is an SV that does not lie on the margin boundary surface. Therefore, a total of seven SVs, SVp1 to SVp3 and SVn1 to SVn4, are extracted in step S801.
[0083]
Next, the c-value calculation unit 603 calculates the distance between the SVs extracted in step S801 by using the above equation (5) (step S802). Here, it is assumed that the distance between each SV is as follows as a result of the calculation.
[0084]

[0085]
Next, the c-value calculation unit 603 uses the distances calculated in step S802 to calculate, by the above equation (6), the relative positive example data density (SVpi) and the relative negative example data density (SVni) at the position of each SV (step S803).
[0086]
Here, if t = 1 in equation (6),
Relative positive example data density (SVp1)
= 1 / distance (SVp1 to SVp2) + 1 / distance (SVp1 to SVp3)
= 1 / 0.5 + 1 / 0.5
= 4.
[0087]
Similarly,
Relative positive example data density (SVp2) = 1 / 0.5 + 1 / 0.2 = 7
Relative positive example data density (SVp3) = 1 / 0.5 + 1 / 0.2 = 7
Relative negative example data density (SVn1) = 1 / 0.5 + 1 / 0.5 + 1 / 0.5 = 6
Relative negative example data density (SVn2) = 1 / 0.5 + 1 / 0.25 + 1 / 0.2 = 11
Relative negative example data density (SVn3) = 1 / 0.5 + 1 / 0.25 + 1 / 0.1 = 16
Relative negative example data density (SVn4) = 1 / 0.5 + 1 / 0.2 + 1 / 0.1 = 17
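The per-SV densities above can be reproduced with a short sketch. The pairwise distances below are read off the arithmetic in this example (the distance table itself is not reproduced in this text), the function name is hypothetical, and treating t as an exponent on the distance is an assumption, since the text only shows t = 1.

```python
from fractions import Fraction as F

# Pairwise distances inferred from the worked arithmetic.
pos_pairs = {("SVp1", "SVp2"): F(1, 2), ("SVp1", "SVp3"): F(1, 2),
             ("SVp2", "SVp3"): F(1, 5)}
neg_pairs = {("SVn1", "SVn2"): F(1, 2), ("SVn1", "SVn3"): F(1, 2),
             ("SVn1", "SVn4"): F(1, 2), ("SVn2", "SVn3"): F(1, 4),
             ("SVn2", "SVn4"): F(1, 5), ("SVn3", "SVn4"): F(1, 10)}


def relative_densities(pairwise, t=1):
    """Sum 1 / distance**t over all other SVs of the same class."""
    density = {}
    for (a, b), d in pairwise.items():
        for sv in (a, b):
            density[sv] = density.get(sv, 0) + 1 / d ** t
    return density


pos_density = relative_densities(pos_pairs)
neg_density = relative_densities(neg_pairs)
print({k: int(v) for k, v in pos_density.items()})
print({k: int(v) for k, v in neg_density.items()})
```

Because these values do not depend on the classification target data, they can be computed once and reused.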
[0088]
Next, the c-value calculation unit 603 calculates the distance from the classification target data x to each SV extracted in step S801 from the above equation (5) (step S804). Here, as a result of the calculation, it is assumed that the distance to each SV is as follows.
[0089]
Distance (x to SVp1) = 0.2
Distance (x to SVp2) = 0.5
Distance (x to SVp3) = 0.5
Distance (x to SVn1) = 0.5
Distance (x to SVn2) = 0.2
Distance (x to SVn3) = 0.1
Distance (x to SVn4) = 0.1
[0090]
Next, the c-value calculation unit 603 uses the relative data density at the position of each SV calculated in step S803 and the distance to each SV calculated in step S804 to calculate the classification target data using the above-described equation (8). The relative positive example data density (x) and the relative negative example data density (x) at the position of x (strictly speaking, the margin boundary surface in the vicinity thereof) are calculated (step S805).
[0091]
Here, assuming that t = 1 in equation (8), since the denominator of wpi Σ (1 / distance (x to SVpj)) = 1 / 0.2 + 1 / 0.5 + 1 / 0.5 = 9,
wp1 = 1 / 0.2 × 1/9 = 5/9
wp2 = 1 / 0.5 × 1/9 = 2/9
wp3 = 1 / 0.5 × 1/9 = 2/9
[0092]
Therefore, the relative positive example data density (x)
= (5/9 × 4) + (2/9 × 7) + (2/9 × 7)
= 16/3
= 5.333 ...
[0093]
Also, since the denominator of wni Σ (1 / distance (x to SVnj)) = 1 / 0.5 + 1 / 0.2 + 1 / 0.1 + 1 / 0.1 = 27,
wn1 = 1 / 0.5 × 1/27 = 2/27
wn2 = 1 / 0.2 × 1/27 = 5/27
wn3 = 1 / 0.1 × 1/27 = 10/27
wn4 = 1 / 0.1 × 1/27 = 10/27
[0094]
Therefore, the relative negative example data density (x)
= (2/27 × 6) + (5/27 × 11) + (10/27 × 16) + (10/27 × 17)
= 397/27
= 14.703703...
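The two weighted averages can be verified with exact fractions. The function name `density_at` is hypothetical, and the use of t as an exponent is an assumption (only t = 1 appears in the text); the distances and per-SV densities are the values given in this example.

```python
from fractions import Fraction as F


def density_at(dist_to_x, sv_density, t=1):
    """Weighted average of SV densities, weights proportional to 1 / distance**t."""
    inverse = {sv: 1 / d ** t for sv, d in dist_to_x.items()}
    total = sum(inverse.values())
    return sum(inverse[sv] / total * sv_density[sv] for sv in inverse)


pos_dist = {"SVp1": F(1, 5), "SVp2": F(1, 2), "SVp3": F(1, 2)}
pos_dens = {"SVp1": 4, "SVp2": 7, "SVp3": 7}
neg_dist = {"SVn1": F(1, 2), "SVn2": F(1, 5), "SVn3": F(1, 10), "SVn4": F(1, 10)}
neg_dens = {"SVn1": 6, "SVn2": 11, "SVn3": 16, "SVn4": 17}

dp = density_at(pos_dist, pos_dens)
dn = density_at(neg_dist, neg_dens)
print(dp, dn)  # 16/3 397/27
```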
[0095]
Finally, the c-value calculating unit 603 calculates the c value by equation (2), using the relative positive example data density calculated in step S805 as dp and the relative negative example data density as dn (step S806). For example, if k = 1 in equation (2),
c = (16/3 - 397/27) / (16/3 + 397/27) = -253/541.
At this point, the process according to the flowchart in FIG. 8 ends, and the process returns to step S706 in FIG. 7.
[0096]
After obtaining the c value as described above, the data classification device according to the first embodiment uses the data classification unit 604 to determine whether the result of equation (2) is positive or negative, that is, whether the classification target data x is a positive example or a negative example (step S706).
[0097]
The determination result here is as shown in FIG. 9 according to the score f (x) of the classification target data x. The threshold value that forms the boundary between positive and negative examples is 0 in the related art, but is lowered to −253/541 in the present invention (in the figure, the position of the separation hyperplane is corrected toward the negative example side). As a result, data whose score lies in the range from −253/541 to 0, which would be judged a negative example in the related art, is judged a positive example in the present invention.
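The effect of the shifted threshold can be sketched as below. The function `classify` and the sample score of −1/4 are hypothetical; the formula for c is the form implied by the earlier numerical example, and treating a score exactly on the threshold as negative is an arbitrary choice made here.

```python
from fractions import Fraction as F


def classify(score, c=0):
    """Return +1 (positive example) if the uncorrected score exceeds c, else -1."""
    return 1 if score - c > 0 else -1


dp, dn = F(16, 3), F(397, 27)     # densities from the first embodiment
c = (dp - dn) / (dp + dn)         # k = 1, gives -253/541

score = F(-1, 4)                  # hypothetical score between c and 0
print(c, classify(score), classify(score, c))  # -253/541 -1 1
```

A score of −1/4 falls below the conventional threshold of 0 but above the corrected threshold, so its label changes from negative to positive.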
[0098]
Returning to FIG. 7, finally, the data classification device according to the first embodiment outputs the processing result of the data classification unit 604 to a file or a display through the data output unit 605 (step S707), and ends the processing according to this flowchart.
[0099]
Here, for convenience of explanation, the relative data densities at the individual SV positions (SVpi and SVni) are calculated each time the c value is calculated (steps S801 to S803 in FIG. 8). Since these values do not depend on the classification target data x, they need not actually be recalculated every time. For example, the values of SVpi and SVni calculated above can be stored in a table, and steps S801 to S803 can be skipped in the second and subsequent classifications by referring to this table.
[0100]
(Embodiment 2)
The data classification device according to the second embodiment has the same hardware configuration, functional configuration, and data classification procedure as those of the first embodiment shown in FIGS. 5, 6, and 7, respectively. Only the procedure for calculating the c value in step S705 of FIG. 7 differs from that of the first embodiment shown in FIG. 8.
[0101]
FIG. 10 is a flowchart showing in detail the procedure of the c-value calculation process (step S705 in FIG. 7) in the data classification device according to the second embodiment of the present invention. First, the c-value calculating unit 603 according to the second embodiment extracts the SVs existing on the margin boundary from the SVM trained in step S703 (this SVM is designated SVM A) (step S1001).
[0102]
Next, the c-value calculation unit 603 removes the training data identical to the SVs extracted above from the training data used to train SVM A in step S703 (step S1002). It then outputs the remaining training data to the SVM learning unit 602 and instructs it to train the SVM again on that data. On receiving the instruction, the SVM learning unit 602 trains the SVM again (step S1003). The SVM obtained by the training in step S1003 is designated SVM B.
[0103]
Next, the c-value calculating unit 603 calculates the scores, in SVM A and SVM B, of the classification target data φ (x) and of another point x ′ on the perpendicular V passing through φ (x) (step S1004). Here, it is assumed that the calculation yields the following scores.
[0104]
A (φ (x)) = 0.2
A (x ') = 2.2
B (φ (x)) = 0.3
B (x ') = 1.3
[0105]
Next, the c-value calculation unit 603 uses the four scores calculated above and equations (12) and (13) to calculate the scores in SVM A of the points P and N, which are the intersections of the perpendicular V with the positive example side and negative example side margin boundaries of SVM B (step S1005).
A(P) = (2.2 - 0.2)(1 - 0.3) / (1.3 - 0.3) + 0.2 = 1.6
A(N) = (2.2 - 0.2)(-1 - 0.3) / (1.3 - 0.3) + 0.2 = -2.4
[0106]
Next, the c-value calculation unit 603 calculates the retreat amounts of the positive example side and negative example side margin boundary surfaces from the scores calculated above (step S1006). In the above example, the margin boundary surface on the positive example side lies at 1 in SVM A and has retreated to 1.6 in SVM B, that is, it has retreated by 0.6. Likewise, the margin boundary surface on the negative example side lies at -1 in SVM A and has retreated to -2.4 in SVM B, that is, it has retreated by 1.4.
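The arithmetic of steps S1005 and S1006 can be sketched as follows. This is only an illustrative sketch of the linear interpolation used in the worked example; the function name `score_in_a_at_b_level` and the variable names are hypothetical and do not appear in the specification.

```python
def score_in_a_at_b_level(a_phi, a_x2, b_phi, b_x2, b_level):
    """Score in SVM A of the point on V whose score in SVM B is b_level.

    Interpolates linearly between phi(x) and x', as in the worked example.
    """
    return (a_x2 - a_phi) * (b_level - b_phi) / (b_x2 - b_phi) + a_phi


# Scores taken from the worked example in the text.
a_phi, a_x2 = 0.2, 2.2   # scores of phi(x) and x' in SVM A
b_phi, b_x2 = 0.3, 1.3   # scores of phi(x) and x' in SVM B

# Points P and N lie on the +1 and -1 margin boundaries of SVM B.
a_p = score_in_a_at_b_level(a_phi, a_x2, b_phi, b_x2, +1.0)
a_n = score_in_a_at_b_level(a_phi, a_x2, b_phi, b_x2, -1.0)

# Retreat amounts relative to the +1 / -1 margin boundaries of SVM A.
retreat_pos = a_p - 1.0
retreat_neg = -1.0 - a_n

print(round(a_p, 6), round(a_n, 6), round(retreat_pos, 6), round(retreat_neg, 6))
```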
[0107]
Finally, the c-value calculation unit 603 sets dp to the reciprocal of the retreat amount of the positive example side margin boundary surface calculated in step S1006 and dn to the reciprocal of the retreat amount of the negative example side margin boundary surface, and calculates the c value by equation (2) (step S1007). For example, if k = 1 in equation (2), the c value becomes 0.4. At this point, the process according to the flowchart in FIG. 10 ends, and the process returns to step S706 in FIG. 7.
[0108]
Thereafter, the determination result in step S706 of FIG. 7 is as shown in FIG. 11, according to the score f(x) of the classification target data x. The threshold at the boundary between the positive example and the negative example is 0 in the prior art, but is raised to 0.4 in the present invention (in the figure, the position of the separating hyperplane is corrected). As a result, data whose score is in the range of 0 to 0.4, which is determined to be a positive example in the related art, is determined to be a negative example in the present invention.
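The threshold correction described above can be sketched as follows. Equation (2) is not reproduced in this passage, so the sketch assumes the form c = k(dp - dn)/(dp + dn); this form is an assumption chosen only because it reproduces the threshold of 0.4 stated above for retreat amounts of 0.6 and 1.4 with k = 1. The function names are hypothetical.

```python
def c_value(retreat_pos, retreat_neg, k=1.0):
    """Corrected threshold c from the margin retreat amounts.

    ASSUMPTION: equation (2) has the form c = k * (dp - dn) / (dp + dn),
    where dp and dn are the reciprocals of the retreat amounts.
    """
    dp = 1.0 / retreat_pos   # positive example training data density
    dn = 1.0 / retreat_neg   # negative example training data density
    return k * (dp - dn) / (dp + dn)


def classify(score, threshold):
    """Positive example if the score exceeds the threshold, else negative."""
    return "positive" if score > threshold else "negative"


c = c_value(0.6, 1.4)        # retreat amounts from the worked example
print(round(c, 6))
print(classify(0.3, 0.0))    # conventional threshold of 0
print(classify(0.3, c))      # corrected threshold
```

A score of 0.3 falls between the two thresholds, so it is the kind of data whose determination changes from positive to negative after the correction.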
[0109]
Here, for convenience of explanation, the re-learning of the SVM is performed each time the c value is calculated (steps S1001 to S1003 in FIG. 10). However, since SVM B after learning does not depend on the classification target data x, there is no need to recalculate it each time. For example, in the second and subsequent classifications, the processes in steps S1001 to S1003 can be skipped by reusing the SVM B obtained in the first classification.
[0110]
According to the first and second embodiments described above, the position of the separating hyperplane that separates the positive and negative examples is corrected according to the density ratio of the local training data at the position of each classification target data. This makes it possible to classify data more accurately than by moving the separating hyperplane in the same direction by the same amount everywhere in the feature space.
[0111]
In the first and second embodiments, only the positive example training data among the plurality of data in the feature space (and only a part of it: specifically, the SVs on the margin boundary surface in the first embodiment and the SVs on the perpendicular V in the second embodiment) is referred to when calculating the positive example training data density dp, and only the negative example training data (likewise) is referred to when calculating the negative example training data density dn. However, the calculation is not limited to this.
[0112]
For example, the positive example training data density dp may be calculated by referring to not only the positive example training data but also the negative example training data, and the negative example training data density dn may be calculated by referring to the positive example training data as well. In addition, unclassified data that is neither a positive nor a negative example may be included in the training data, and the distribution of the unclassified training data may be taken into account, in addition to that of the positive and negative examples, when calculating dp and dn.
[0113]
(Embodiment 3)
The first and second embodiments described above are intended to improve the classification accuracy of SVMs. However, the idea of correcting the classification threshold according to the distribution of the training data is not limited to SVMs; it can be applied widely to other classification methods. In the third embodiment, as an example, the threshold is corrected according to the density ratio of the training data in classification by the kNN (k-nearest neighbor) method. For simplicity, k = 1.
[0114]
The hardware configuration of the data classification device according to the third embodiment is the same as that of the first embodiment shown in FIG. 5. FIG. 12 is an explanatory diagram functionally showing the configuration of the data classification device according to the third embodiment of the present invention. FIG. 13 is a flowchart showing the procedure of the data classification process in the data classification device according to the third embodiment of the present invention. Hereinafter, the function of each unit shown in FIG. 12 will be described in order, following the procedure shown in FIG. 13.
[0115]
First, input information and output information used as training data or classification target data are fetched into the device by the data input unit 1200 (step S1301), and the fetched input information is converted into a predetermined format, specifically vectorized, by the data conversion unit 1201 (step S1302).
[0116]
The vectorized training data is hereinafter referred to as a case. Normally, only data with a known classification is used as training data. In the present invention, however, it is assumed that the cases are a mixture of data with a known classification and data with no classification (that is, some of the cases have not been classified). A case of unknown classification is a case for density complementation, which is taken into account when calculating the relative data density described below. Cases of unknown classification are not essential; a classification may of course be assigned to all cases.
[0117]
Next, for each case to which a classification is given, the distance calculation unit 1202 calculates the distance to every case (regardless of whether a classification is given) (step S1303). Here, it is assumed that the training data consists of five cases, cases 1 to 5, and that the distances between cases 1 to 3, which are classified into classification A or classification B, and the other cases are as shown in FIG. 14.
[0118]
Next, the relative data density calculation unit 1203 calculates the relative data density at the position of each case to which a classification is given (step S1304).
[0119]
This relative data density may be obtained in the same manner as in equations (6) and (7) above. Here, however, for simplicity, the relative data density is taken to be the number of data within a certain distance, for example within a distance of 4. An optimal value for this distance threshold (here, 4) can be obtained by cross-validation or the like. The density can of course also be calculated by other methods.
[0120]
Relative data density at the position of case 1 = 1 (only case 1 is within distance 4)
Relative data density at the position of case 2 = 4 (four cases, cases 2 to 5, are within distance 4)
Relative data density at the position of case 3 = 3 (three cases, cases 2, 3, and 5, are within distance 4)
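The counting in step S1304 can be sketched as follows. FIG. 14 is not reproduced in this text, so the distance values below are hypothetical numbers chosen only so that the counts match the three densities listed above; the function name `relative_density` is likewise an assumption.

```python
def relative_density(distances_to_all_cases, radius=4):
    """Number of cases (including the case itself) within the given radius."""
    return sum(1 for d in distances_to_all_cases if d <= radius)


# HYPOTHETICAL distances from each classified case (1 to 3) to cases 1..5.
# These are illustrative values, not the contents of FIG. 14.
distances = {
    1: [0, 7, 6, 8, 9],
    2: [7, 0, 3, 2, 1],
    3: [6, 3, 0, 5, 2],
}

densities = {case: relative_density(row) for case, row in distances.items()}
print(densities)
```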
[0121]
After that, when the user designates data and instructs the device to determine its classification (step S1305: Yes), the distance calculation unit 1202 calculates the distance between that data and each of the above cases to which a classification is given (step S1306). Here, it is assumed that the distances between the classification target data and the cases with a known classification are as follows.
[0122]
Distance (classification target data to case 1) = 6
Distance (classification target data to case 2) = 2
Distance (classification target data to case 3) = 3
[0123]
According to the kNN method of the related art (here the 1NN method, since k = 1), the classification of the case with the smallest distance from the classification target data is taken as the classification of that data. The classification result would therefore be the classification of case 2, which has the shortest distance, that is, classification B.
[0124]
In the present invention, however, the distance to each case calculated in step S1306 (the distance used in the conventional kNN method) is further corrected by the distance correction unit 1204 based on the relative data density at the position of each case calculated in step S1304 (step S1307). Although various correction methods are possible, here the corrected distance is obtained by multiplying the distance from step S1306 by the relative data density from step S1304.
[0125]
Corrected distance (classification target data to case 1) = 6 × 1 = 6
Corrected distance (classification target data to case 2) = 2 × 4 = 8
Corrected distance (classification target data to case 3) = 3 × 3 = 9
[0126]
Then, the data classification unit 1205 performs classification by the kNN method (here the 1NN method, since k = 1) using the corrected distances calculated in step S1307 (step S1308). The classification of case 1, which has the smallest corrected distance, that is, classification A, is the classification result. Finally, the classification result is output to a file or a display by the data output unit 1206 (step S1309), and the processing according to this flowchart ends.
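Steps S1306 to S1308 can be sketched with the numbers from the worked example. This is a minimal sketch under stated assumptions: the function name `nearest_case` and the dictionary layout are hypothetical, and the classification of case 3 is omitted because the text does not state it.

```python
def nearest_case(distances):
    """Return the case number with the smallest distance (1NN)."""
    return min(distances, key=distances.get)


# Values from the worked example in the text.
distance = {1: 6, 2: 2, 3: 3}   # classification target data to each case
density = {1: 1, 2: 4, 3: 3}    # relative data density at each case
label = {1: "A", 2: "B"}        # case 3's classification is not given in the text

# Correction used in the example: distance multiplied by relative density.
corrected = {case: distance[case] * density[case] for case in distance}

plain_result = label[nearest_case(distance)]       # conventional 1NN
corrected_result = label[nearest_case(corrected)]  # density-corrected 1NN
print(corrected, plain_result, corrected_result)
```

Multiplying by the density penalizes cases in crowded regions, which is what shifts the result from the densely populated neighborhood of case 2 to the sparse neighborhood of case 1.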
[0127]
As described above, the third embodiment takes into account the bias in the distribution, namely that the density of the training data (regardless of whether a classification is given) is low near case 1 and high near case 2, and moves the separation plane between classification A and classification B toward the classification B side, where the training data density is relatively high. This increases the probability that data erroneously classified into classification B because of the bias in the training data distribution is reclassified into the correct classification A.
[0128]
By correcting the position of the separation plane as in Embodiments 1 to 3, the classification accuracy is improved compared with the case without correction; conversely, there is also the advantage that the number of training data required to achieve the same accuracy can be reduced. Further, reducing the number of training data improves the speed of learning and discrimination.
[0129]
Note that the above-described data input unit 600/1200, data conversion unit 601/1201, SVM learning unit 602, c-value calculation unit 603, data classification unit 604/1205, data output unit 605/1206, distance calculation unit 1202, relative data The density calculation unit 1203 and the distance correction unit 1204 are specifically realized by the CPU 501 executing a program read from the HD 505 to the RAM 503. This program can be stored and distributed on various recording media such as the FD 507, the CD-RW 509, and the MO in addition to the HD 505, and can be distributed via a network.
[0130]
[Effects of the invention]
As described above, according to the invention of claim 1, a data classification device that divides a space in which a plurality of data exist into a space for each classification by a separation plane includes first density calculating means for calculating the density of the training data classified into the first classification at the position of the classification target data, second density calculating means for calculating the density of the training data classified into the second classification at the position of the classification target data, and separation plane position correction means for correcting the position of the separation plane based on the density calculated by the first density calculating means and the density calculated by the second density calculating means. The position of the separation plane, which is the boundary between the first classification and the second classification, is therefore corrected according to the density of the training data belonging to each classification. This has the effect of providing a data classification device capable of classifying data accurately even when the distribution of the training data is not uniform.
[0131]
According to the invention of claim 2, in the invention of claim 1, the separation plane position correction means corrects the position of the separation plane toward whichever of the first classification and the second classification has the relatively higher density as calculated by the first density calculating means or the second density calculating means. This corrects the tendency, when the distribution of the training data is not uniform, for data to be classified into the classification with the relatively high training data density, which has the effect of providing a data classification device capable of classifying data accurately even when the distribution of the training data is not uniform.
[0132]
According to the invention of claim 3, in the invention of claim 1 or claim 2, the first density calculating means calculates the density including training data not classified into the first classification. The training data density that determines the position of the corrected separation plane is therefore calculated from the training data to which the first classification is assigned together with training data whose exact classification is unknown but which has a high probability of being assigned the first classification. This has the effect of providing a data classification device capable of classifying data accurately even when the distribution of the training data is not uniform.
[0133]
According to the invention of claim 4, in the invention of any one of claims 1 to 3, the first density calculating means calculates the density using a part of the training data classified into the first classification. The training data density that determines the position of the corrected separation plane is therefore approximated by the density of that part of the training data existing in the space, which has the effect of providing a data classification device capable of classifying data accurately even when the distribution of the training data is not uniform, while suppressing the amount of calculation.
[0134]
According to a fifth aspect of the present invention, in the invention of the fourth aspect, the partial training data is training data whose distance from the separation plane satisfies a predetermined condition. The training data density for determining the position of the plane is approximated by, for example, the density of training data having a constant distance from the separation plane before correction (the training data density on a plane parallel to the separation plane before correction), As a result, there is an effect that a data classification device capable of accurately classifying data even when the distribution of training data is not uniform can be obtained while suppressing the calculation amount.
[0135]
According to the invention of claim 6, in the invention of any one of claims 1 to 5, the first density calculating means calculates the density using a first training data set including training data classified into the first classification and a second training data set obtained by excluding some training data from the first training data set. The training data density that determines the position of the corrected separation plane is therefore approximated by the training data density in the direction perpendicular to the separation plane before correction, which has the effect of providing a data classification device capable of classifying data accurately even when the distribution of the training data is not uniform, while suppressing the amount of calculation.
[0136]
Also, in the invention according to claim 7, in the invention according to any one of claims 1 to 6, the first density calculating means calculates the density based on a distance between the training data. Since the calculation is performed, the calculation of the training data density for determining the position of the separation plane after the correction does not require a complicated calculation, and thus, the calculation amount is suppressed, and the calculation is performed accurately even when the distribution of the training data is not uniform. The data classification device capable of classifying data can be obtained.
[0137]
According to the invention of claim 8, a data classification method for dividing a space in which a plurality of data exist into a space for each classification by a separation plane includes a first density calculation step of calculating the density of the training data classified into the first classification at the position of the classification target data, a second density calculation step of calculating the density of the training data classified into the second classification at the position of the classification target data, and a separation plane position correction step of correcting the position of the separation plane based on the density calculated in the first density calculation step and the density calculated in the second density calculation step. The position of the separation plane, which is the boundary between the first classification and the second classification, is therefore corrected according to the density of the training data belonging to each classification. This has the effect of providing a data classification method capable of classifying data accurately even when the distribution of the training data is not uniform.
[0138]
According to the invention of claim 9, in the invention of claim 8, the separation plane position correction step corrects the position of the separation plane toward whichever of the first classification and the second classification has the relatively higher density as calculated in the first density calculation step or the second density calculation step. This corrects the tendency, when the distribution of the training data is not uniform, for data to be classified into the classification with the relatively high training data density, which has the effect of providing a data classification method capable of classifying data accurately even when the distribution of the training data is not uniform.
[0139]
According to the invention of claim 10, in the invention of claim 8 or claim 9, the first density calculation step calculates the density including training data not classified into the first classification. The training data density that determines the position of the corrected separation plane is therefore calculated from the training data to which the first classification is assigned together with training data whose exact classification is unknown but which has a high probability of being assigned the first classification. This has the effect of providing a data classification method capable of classifying data accurately even when the distribution of the training data is not uniform.
[0140]
According to the invention of claim 11, in the invention of any one of claims 8 to 10, the first density calculation step calculates the density using a part of the training data classified into the first classification. The training data density that determines the position of the corrected separation plane is therefore approximated by the density of that part of the training data existing in the space, which has the effect of providing a data classification method capable of classifying data accurately even when the distribution of the training data is not uniform, while suppressing the amount of calculation.
[0141]
According to a twelfth aspect of the present invention, in the invention of the eleventh aspect, the partial training data is training data whose distance from the separation plane satisfies a predetermined condition. The training data density for determining the position of the plane is approximated by, for example, the density of training data having a constant distance from the separation plane before correction (the training data density on a plane parallel to the separation plane before correction), Thus, there is an effect that a data classification method capable of accurately classifying data even when the distribution of training data is not uniform can be obtained while suppressing the amount of calculation.
[0142]
According to the invention of claim 13, in the invention of any one of claims 8 to 12, the first density calculation step calculates the density using a first training data set including training data classified into the first classification and a second training data set obtained by excluding some training data from the first training data set. The training data density that determines the position of the corrected separation plane is therefore approximated by the training data density in the direction perpendicular to the separation plane before correction, which has the effect of providing a data classification method capable of classifying data accurately even when the distribution of the training data is not uniform, while suppressing the amount of calculation.
[0143]
In the invention according to claim 14, in the invention according to any one of claims 8 to 13, in the first density calculation step, the density is determined based on a distance between the training data. Since the calculation is performed, the calculation of the training data density for determining the position of the separation plane after the correction does not require a complicated calculation. There is an effect that a data classification method capable of classifying data is obtained.
[0144]
According to the invention of claim 15, there is an effect that a program capable of causing a computer to execute the method of any one of claims 8 to 14 is obtained.
[Brief description of the drawings]
FIG. 1 is an explanatory diagram schematically showing the principle of classification by a conventional SVM.
FIG. 2 is an explanatory diagram showing a comparison between the principle of classification by the SVM of the related art and the principle of classification by the SVM of the present invention.
FIG. 3 is an explanatory diagram showing a comparison between the principle of classification by the SVM of the related art and the principle of classification by the SVM of the present invention.
FIG. 4 is an explanatory diagram illustrating a principle of estimating a training data density in the data classification device according to the second embodiment of the present invention.
FIG. 5 is an explanatory diagram showing an example of a hardware configuration of the data classification device according to the first embodiment of the present invention.
FIG. 6 is an explanatory diagram functionally showing the configuration of the data classification device according to the first embodiment of the present invention.
FIG. 7 is a flowchart showing a procedure of a data classification process in the data classification device according to the first embodiment of the present invention.
FIG. 8 is a flowchart showing in detail a procedure of c-value calculation processing (step S705 in FIG. 7) in the data classification device according to the first embodiment of the present invention.
FIG. 9 is an explanatory diagram showing a relationship between a score of classification target data and a classification result in the data classification device according to the first embodiment of the present invention.
FIG. 10 is a flowchart showing in detail a procedure of c-value calculation processing (step S705 in FIG. 7) in the data classification device according to the second embodiment of the present invention.
FIG. 11 is an explanatory diagram showing a relationship between a score of classification target data and a classification result in the data classification device according to the second embodiment of the present invention.
FIG. 12 is an explanatory diagram functionally showing a configuration of a data classification device according to a third embodiment of the present invention.
FIG. 13 is a flowchart illustrating a procedure of a data classification process in the data classification device according to the third embodiment of the present invention.
FIG. 14 is an explanatory diagram showing a list of distances between cases (training data) input to the data classification device according to the third embodiment of the present invention.
[Explanation of symbols]
500 bus
501 CPU
502 ROM
503 RAM
504 HDD
505 HD
506 FDD
507 FD
508 CD-RW drive
509 CD-RW
510 display
511 keyboard
512 mouse
513 Network I / F
514 Communication cable
600, 1200 Data input unit
601, 1201 Data conversion unit
602 SVM learning unit
603 c value calculation unit
604, 1205 Data classification unit
605, 1206 Data output unit
1202 Distance calculation unit
1203 Relative data density calculation unit
1204 Distance correction unit

Claims

A data classification device that divides a space in which a plurality of data exist into a space for each classification by a separation plane, the device comprising:
first density calculation means for calculating the density of the training data classified into the first classification at the position of the classification target data;
second density calculation means for calculating the density of the training data classified into the second classification at the position of the classification target data; and
separation plane position correction means for correcting the position of the separation plane based on the density calculated by the first density calculation means and the density calculated by the second density calculation means.

The data classification device according to claim 1, wherein the separation plane position correction means corrects the position of the separation plane toward whichever of the first classification and the second classification has the relatively higher density as calculated by the first density calculation means or the second density calculation means.

The data classification device according to claim 1 or claim 2, wherein the first density calculation means calculates the density including training data not classified into the first classification.

The data classification device according to any one of claims 1 to 3, wherein the first density calculation means calculates the density using a part of the training data classified into the first classification.

The data classification apparatus according to claim 4, wherein the part of the training data is training data whose distance from the separation plane satisfies a predetermined condition.

The data classification device according to any one of claims 1 to 5, wherein the first density calculation means calculates the density using a first training data set including training data classified into the first classification and a second training data set obtained by removing some training data from the first training data set.

The data classification device according to any one of claims 1 to 6, wherein the first density calculation means calculates the density based on a distance between the training data.

A data classification method of dividing a space in which a plurality of data exist into a space for each classification by a separation plane, the method comprising:
a first density calculation step of calculating the density of the training data classified into the first classification at the position of the classification target data;
a second density calculation step of calculating the density of the training data classified into the second classification at the position of the classification target data; and
a separation plane position correction step of correcting the position of the separation plane based on the density calculated in the first density calculation step and the density calculated in the second density calculation step.

The data classification method according to claim 8, wherein, in the separation plane position correction step, the position of the separation plane is corrected toward whichever of the first classification and the second classification has the relatively higher density as calculated in the first density calculation step or the second density calculation step.

The data classification method according to claim 8 or claim 9, wherein, in the first density calculation step, the density is calculated including training data not classified into the first classification.

The data classification method according to any one of claims 8 to 10, wherein, in the first density calculation step, the density is calculated using a part of the training data classified into the first classification.

The data classification method according to claim 11, wherein the part of the training data is training data whose distance from the separation plane satisfies a predetermined condition.

The data classification method according to any one of claims 8 to 12, wherein, in the first density calculation step, the density is calculated using a first training data set including training data classified into the first classification and a second training data set obtained by removing some training data from the first training data set.

The data classification method according to any one of claims 8 to 13, wherein, in the first density calculation step, the density is calculated based on a distance between the training data.

A program for causing a computer to execute the method according to any one of claims 8 to 14.