JP4729151B2

JP4729151B2 - Classification apparatus, method, and file search method

Info

Publication number: JP4729151B2
Application number: JP13900198A
Authority: JP
Inventors: 嶐一岡; 裕信高橋
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 1998-05-20
Filing date: 1998-05-20
Publication date: 2011-07-20
Anticipated expiration: 2018-05-20
Also published as: JPH11328184A

Description

【０００１】
【発明の属する技術分野】
本発明は、ファイル内の複数のデータを関連の深いもの同士にまとめ分類する分類装置、方法およびファイル検索方法に関する。
【０００２】
【従来の技術】
データの自己組織化すなわち、関連の深いデータを１つにまとめ、複数のグループに分類することはパターン認識においてモデルの自動生成として位置付けられ、重要なテーマの一つである。データの自己組織化については次の文献が知られている。
【０００３】
（１）T. Kohonen : Self-Organization maps : Springer-Verlarg, (1995)
例えば動画画像認識や音声認識に関する自己組織化では、時間的な連続性を相互関係とみて、入力データを自己組織化することが試みられ、次の文献が発表されている。
【０００４】
（２）遠藤隆，他：動画像の自己組織化ネットワークモデル−そのトポロジーと動的特徴の解析−：人工知能学会情報統合研究会ＳＩＧ−ＣＩＩ−９７０７，（１９９７．７）
また文書処理等でも同一文書中に依存している共起性によって自己組織化を試みており、単語やドキュメントの空間配置問題として扱い、検索や分類に用いることも行われていて、以下の文献が発表されている。
【０００５】
（３）豊浦潤，岡隆一：テキスト検索のためのテキストデータの自己組織化について：人工知能学会情報統合研究会ＳＩＧ−ＣＩＩ−９６０３，ｐｐ．１６−２３，（１９９７．３）
（４）本間直人，石川真澄：数量化ＩＩＩ類の逆問題を用いたキーワードと文献の双方向的空間配置：信学会誌Ｄ−ＩＩ，Ｊ８１−ＤＩＩ，３，ｐｐ．５６４−５７３，（１９９８．３）
さらに本発明に関する文献としては、
（５）林知己夫，他：数量化理論とデータ処理：朝倉書店，（１９８７）
が知られている。ここで述べられ、数量化ＩＶ類と呼ばれている分類方法を説明する。
【０００６】
数量化ＩＶ類は有限個の標本が与えられ標本相互の親和度の強さが定義されている時に、親和度の高いものほど有限次元の空間で近傍に配置するようにしている。これによって相互に親和度の高い標本が空間中に集まり自己組織化することが期待できる。例えば音声や画像の時系列データの解析では、一定時間内での各事象を標本とし、連続性を親和性と見なすことができる。文書の理解では、文字や形態素を標本とみて、同一の文書やコンテキストでの共起を親和度と見ることができる。
【０００７】
統計をとる標本をＮ次元空間に配置する問題を考える。各標本に任意の番号付けをしｉとする。その標本の空間中の位置をｘ_i とする。標本ｉとｊの間の親和度が与えられておりＭ_ijとする。Ｍ_ijは正の値をとり、親近度が高いものほど大きな値をとる。
【０００８】
各標本間の距離が親近度に対応するように、標本間の距離の２乗にマイナス１をかけたものを距離関係として定義する。
【０００９】
【数１】
ｄ_ij＝−（ｘ_j −ｘ_i ）²
次のように対応する標本間ごとに親近度と距離関係の積をとり、この総和が最大となるｘ_i を求める。
【００１０】
【数２】

【００１１】
しかしこの条件式だけでは、すべての標本が同一の点に位置した場合に０となり、最大となって条件が満たされてしまう。このため標本の位置ｘ_i ²が一定の分散を持つように次の条件式を加える。
【００１２】
【数３】

【００１３】
上記条件式の数２式，数３式は行列の固有値問題に帰着する解法が知られていて、解析的に解を求めることができる。
【００１４】
【発明が解決しようとする課題】
このような手法をたとえば、文書に適用する場合、上記標本が単語となり、親近度の高い標本ｘ_i が文書の特徴を表す単語として抽出される。しかしながら、この手法を実世界のデータに適用すると、親近度にランダムなノイズ（出現頻度が極端に低いデータ）が加わり、関係の深いデータ（この場合、単語）を分離（抽出）することが困難になるという解決すべき課題があった。
【００１５】
そこで、本発明の目的は、ノイズの影響の少ない分類装置、方法およびファイル検索方法を提供することにある。
【００１６】
【課題を解決するための手段】
このような目的を達成するために、請求項１の発明は、ファイルの中の複数のデータを関連の深いデータ同士にまとめて分類するために、前記データを標本とみなし、２つの標本の間の関連の度合いを示す親近度および２つの標本の間の距離に関する複数の標本全体の分布を統計解析する分類装置において、
前記複数の標本の分布の偏りを原点が分布の中心となるように平行移動させた後、共分散行列を求めて固有地分解を行うことで補正する第１の補正手段と、
前記第１の補正手段により補正された複数の標本の分布の中心から標本までの距離に比例して標本の個数が多くなるようにすべての標本について乱数を用いた配置を繰り返すことで前記複数の標本の距離を補正する第２の補正手段と
を具えたことを特徴とする。
【００１７】
請求項２の発明は、請求項１に記載の分類装置において、前記ファイルは複数の単語を含む文書であり、前記標本を前記単語とすることを特徴とする。
【００１８】
請求項３の発明は、請求項１に記載の分類装置において、前記ファイルは複数の音声要素からなるファイルであり、前記標本を前記音声要素とすることを特徴とする。
【００１９】
請求項４の発明は、請求項１に記載の分類装置において、前記ファイルは複数の静止画を有する動画であり、前記標本を前記静止画とすることを特徴とする。
【００２０】
請求項５の発明は、ファイルの中の複数のデータを関連の深いデータ同士にまとめて分類するために、前記データを標本とみなし、２つの標本の間の関連の度合いを示す親近度および２つの標本の間の距離に関する複数の標本全体の分布をコンピュータにより統計解析する分類方法において、前記コンピュータが
前記複数の標本の分布の偏りを原点が分布の中心となるように平行移動させた後、共分散行列を求めて固有地分解を行うことで前記コンピュータにより補正する第１の補正手段と、
前記第１の補正手段により補正された複数の標本の分布の中心から標本までの距離に比例して標本の個数が多くなるようにすべての標本について乱数を用いた配置を繰り返すことで前記複数の標本の距離を前記コンピュータにより補正する第２の補正手段と
として動作することを特徴とする。
【００２１】
請求項６の発明は、請求項５に記載の分類方法において、前記ファイルは複数の単語を含む文書であり、前記標本を前記単語とすることを特徴とする。
【００２２】
請求項７の発明は、請求項５に記載の分類方法において、前記ファイルは複数の音声要素からなるファイルであり、前記標本を前記音声要素とすることを特徴とする。
【００２３】
請求項８の発明は、請求項５に記載の分類方法において、前記ファイルは複数の静止画を有する動画であり、前記標本を前記静止画とすることを特徴とする。
【００２４】
請求項９の発明は、データベースに登録されたファイルをコンピュータにより検索するファイル検索方法において、前記コンピュータが、前記データベースに登録されたファイルを構成するデータと種類が同一で、検索目的のデータを入力する手段と、前記データベースに登録されたファイルの中に含まれるデータに対して請求項５に記載の分類方法を適用し、当該分類方法により分類されたデータが、前記入力する手段で入力されたデ−タと合致するか否かの判定を、前記データベースに登録されたファイル全てに対して行う手段ととして動作し、合致するの判定が得られたファイルを検索結果とすることを特徴とする。
【００２５】
請求項１０の発明は、請求項９に記載のファイル検索方法において、前記コンピュータが、検索策結果として得られるファイルをそのファイル名でリストアップするステップと、当該リストアップされたファイル名を、合致した前記データの有する親近度の順にソーティングする手段ととしてさらに動作することを特徴とする。
【００２６】
請求項１１の発明は、請求項９に記載のファイル検索方法において、前記コンピュータが、前記ファイルの中に含まれるデータの出現頻度を計数するステップと、検索策結果として得られるファイルをそのファイル名でリストアップする手段と、当該リストアップされたファイル名を、合致した前記データの有する出現頻度の順にソーティングする手段ととしてさらに動作することを特徴とする。
【００２７】
【発明の実施の形態】
以下、図面を参照して本発明の実施形態を詳細に説明する。
【００２８】
最初に、本発明に関わる分類方法を説明する。上述の数量化ＩＶ類による解析では、後のシミュレーションで実験で示すように実世界データにしばしばあるように親近度にランダムなノイズが加わると、関係の深いものだけを分類することが困難になる。他の標本との関係によらず親近度の高い標本は中心付近に集中し、親近度が低い標本が中心から離れることで上記の条件式が満たされるようになる。
【００２９】
この問題について検討した結果、距離の定義式の数１式を変更し、一定距離以上に離れた場合のペナルティを緩和することで上記問題２に対処できることを本願発明者は発見した。本発明の方法を用いると、ノイズ（出現頻度が極端に低い標本）などによって弱い関係を持つ標本間を放して配置しても、数２式の変化が非線形化により緩和されるので、分離しクラスタリングすることが可能となる。
【００３０】
数１式を非線型関数Ｆにより変形し次のように定義する。
【００３１】
【数４】

【００３２】
Ｆは近傍（＜ａ）については２次関数であり、その外側では１次関数となる次の式で定義する。
【００３３】
【数５】

【００３４】
この関数によって閾値ａ以内の近傍では２乗と同じとなり、その外側では対象間の距離が大きくなっても評価値の減少が小さいので目的とする効果が得られる。
【００３５】
上記のように最小２乗項を非線形化すると、解析解を得られない。そこで次の節のように繰り返し法による数値解法が必要となる。さらに与えられた親近度の分布やａの値によっては特定の標本の位置だけを無限遠に置くことで最大化式が満たされてしまい数値計算が収束しない。
【００３６】
また、数量化ＩＶ類に比べて標本の中心付近への集積が緩和されるものの、中心付近で異なるグループに所属する標本間の距離ａ以下になると、非線形化による効果が失われてしまう。
【００３７】
この発散と中心への集中を防ぐために標本の位置を一定の超球内に閉じ込めて、この超球内での標本の分布を一様にとなるように以下の条件を加える。
【００３８】
１．標本の分布の中心は原点である。
【００３９】
２．標本の分布について主成分分析をしても分布にかたよりが見られない。
【００４０】
３．一定半径の球殻内にすべての標本が存在し、中心から半径方向への分布が空間内での球体積に比例した分布となっている。
【００４１】
最大化（最適化）と超球内一様化の条件を単一の式として解くことも可能であるが、データ規模によっては大規模行列計算となってしまう。そのため以下のように最適化と各制約を順次満たすような、繰り返し法で解を求める。
【００４２】
数６式に数４式を代入すると次の式が得られる。
【００４３】
【数６】

【００４４】
特定の標本（ｘ_i ）について偏微分すると次の式が得られ、Ｊが最大値を取るとき恒等的に０となる。
【００４５】
【数７】

【００４６】
【数８】

【００４７】
Ｆ′は数５式の微分なので次のように与えられる。
【００４８】
【数９】

【００４９】
Ｆ′の展開のためにＤを次のように定める。
【００５０】
【数１０】

【００５１】
【数１１】

【００５２】
これを数８式に代入しｘ_i について解き、それを逐次近似法の漸化式として利用する。
【００５３】
【数１２】

【００５４】
これだけでは収束性については保証されないので次の超球内一様化を行う。
【００５５】
原点への移動と特定方向へのかたよりの解消を次に説明する。
【００５６】
まず原点が分布の中心となるように平行移動する。
【００５７】
【数１３】

【００５８】
次に統計における主成分分析と同様に、共分散行列を求めて固有値分解によりどの方向に対する分散も同じ値となるようにする。
【００５９】
【外１】

【００６０】
【数１４】

【００６１】
これを固有値分解する。
【００６２】
【数１５】
Ａ＝Ｕ^t ＢＵ
得られた固有値σ₁ ，σ₂ ，…，σ_N に対して次のような逆変換行列を作る。
【００６３】
【数１６】

【００６４】
以下の変換を行う。
【００６５】
【数１７】

【００６６】
次に球の半径方向の標本の分布について統計をとる。図１に示したように超球の一定半径ｒ内に存在する標本の数を、標本の総数で割って規格化した値を求める。これをｒに対する関数と見てＵ（ｒ）とする。なお数値処理のためにあらかいめ標本の分布している半径の範囲を定めて１００段階に分割し、折線近似関数で代用している。
【００６７】
理想的に標本が一様に分布していれば、半径方向に対して体積に比例することが期待されるので、閉じ込める超球の半径を１とし、空間の次元がＮなのでＵ（ｒ）はｒ^N に一致する。そこで
【００６８】
【外２】

【００６９】
次の変換を行う。
【００７０】
【数１８】

【００７１】
次の手順ですべての標本について繰り返し方によって位置を求める。
【００７２】
１．初期値として標本ｉを一定半径の球内に一様分布となるように乱数を用いて配置する。
【００７３】
【外３】

【００７４】
２．ｔを繰り返し回数とし、すべての標本ｉについて数１２式を計算し
【００７５】
【外４】

【００７６】
３．求められた
【００７７】
【外５】

【００７８】
球内一様化の処理を行う。
【００７９】
４．ｔに１を加えて１に戻る。
【００８０】
シミュレーションによって数量化ＩＶ類と本実施形態の方法の能力を比較する。ここでは標本数を１０００とした。これらを１００のクラスに分割し、各クラスは１０ずつの標本を含む。同じクラスに所属する標本間の親近度は区間［０，１）の一様乱数で与えた。一方クラスの異なる標本間の親近度をノイズとして区間［０，α）の一様乱数で与える。αが１に近づくほど大きなノイズとなる。理想的なクラスタリングでは同一クラスに所属する１０の標本ごとに空間中で集まることになる。
【００８１】
ここではαが０．０１と０．１の場合について示す。この場合の親近度を図２と図３に示した。１０００の標本のうち３つのクラスに属する３０の標本の相互関係を示している。縦横の軸は各標本であり、各交点上で親近度を四角形の大きさで示している。
【００８２】
この標本に対して数量化ＩＶ類と本実施形態による分類を行い比較する。それぞれの手法を用いて１０次元空間中に配置する。なお本実施形態では超球の半径を１とし、ａを０．１とした。その結果５０回の繰り返しによりほぼ収束した。各１０００の標本は１０次元空間中に位置しているので、可視化のためにすべての点を２次元平面上に正照影した結果を図４から図７に示している。
【００８３】
数量化ＩＶ類によってもαが０．０１の場合（図４）には、クラスごとに分離できている。しかし原点付近に位置したクラスでは近傍に集まってしまっている。同じデータで本実施形態によった場合は（図６）クラスごとに明確に分離していることがわかる。
【００８４】
αが０．１になると、図５のように数量化ＩＶ類では大半の標本が超空間中の細い棒状の空間に集中してしまいクラスタに分離することができなくなる。これに対して本実施形態では（図７）個々のクラスの分散が大きくなるが明確に分離できている。
【００８５】
認識や検索への利用を考えた場合には、例えばあるquery に対してその再近傍にある標本によってクラスを判別することをする。そこでアルゴリズムの能力を調べるために、１０００個の各標本について、それぞれのもっとも近傍にある標本が本来想定した同じクラスにあるかどうかを調べた。表１にあるように、αが０．０１以下ではどちらも正しく判別できている。しかしαがその値を超えると数量化ＩＶ類では判別が困難になってしまい、明らかに能力が劣っていることがわかる。
【００８６】
【表１】

【００８７】
なお、このシミュレーションに要したＣＰＵ時間は５０回の繰り返しで２１５秒であった（Gray CS6400, SUN SPARC 85MHz）。
【００８８】
本実施形態を適用したネットニュースの記事検索システムを説明する。
【００８９】
インターネット上でのニュースシステムは日本では１９８５年からｆｊカテゴリーの運用が開始されている。発足以来現在までの約２３５万件の記事を収集しており、これに対する検索を提供することを目的としている。
【００９０】
前処理としてすべての記事の本文について、Chasenを用いて形態素解析をし、単語に分類する。この単語すべてを統計処理すべき標本とみなして本実施形態の方法により超空間に配置する。
【００９１】
各単語間の神話度については、まず収集された記事本文の中で、前後５単語以内に共起した単語の組についてすべてカウントしＮ_ijとする。次に助詞「は」や接尾辞のように出現頻度の高いものに標本が集中することを避けるため、各単語の出現数をＮ_i として規格化し親和度Ｍ_ijとしている。
【００９２】
【数１９】

【００９３】
本実施形態の分類方法によって、標本ｉが座標ｘ_i に配置されるので、次に標本（単語）ごとに、空間内で近くに配置されている単語をあらかじめ検索してある。近傍にある単語は単一の文書中で共起性が高いので、関連の深いものと考えられ、これによって単語や文章の曖昧検索を可能としている。
【００９４】
ユーザは、一般の日本文を与えることで検索できる。与えられた文章はChasenによって単語（形態素）に分割され、その単語とその近傍の単語を含む記事を検索する。
【００９５】
検索された記事は、共有する単語数や単語ごとの出現頻度およびGalaxy空間中での距離から点数が付けられ、関連が深いと考えられるものから順に表示される。
【００９６】
上述のネットニュースの検索を行うための分類装置内蔵のファイル検索システムの構成を図８に示す。ファイル検索システムとしは汎用のコンピュータ、たとえば、パーソナルコンピュータやワークステーションを使用可能であるが、本発明に関わるので、簡単にハード構成を説明しておく。図８において、１はＣＰＵであり、システムメモリ２およびハードディスク記憶装置（ＨＤＤ）４に記憶されたシステムプログラムにしたがって、構成各部のシステム制御を行う。さらにＨＤＤ４に記憶された図９の検索プログラムにしたがって、ネットニュースの検索を行う。この検索プログラムの中の後述の分類処理を実行する時のＣＰＵが分類装置として機能する。
【００９７】
システムメモリ２は上述のシステムプログラムおよびＣＰＵ１の演算に使用する各種のデータを記憶する。入力装置３は、データベースに登録するニュース（文書ファイル）を入力する。本例では、入力装置としてキーボードを使用するが、文書ファイルを入力できるものとしては、インターネットと接続する通信装置、フロッピーディスク等の記録媒体から文書ファイルを読み取る記録媒体読み取り装置を入力装置としても使用することができる。なお、入力装置３からは検索する内容（日本語文）をも入力する。
【００９８】
ＨＤＤ４は上述のシステムプログラムの一部および図９の検索プログラムを保存するとともに、さらには検索の対象となるネットニュースを蓄積しておくデータベースを保存する。さらに、ＨＤＤ４にはデータの分類（関連のあるものを１つのグループにまとめること）処理で使用する単語間の親近度およびそれらの単語がテーブル形態で記憶されている。また、日本語の形態素解析を行うための単語辞書もＨＤＤ４に記憶されている。表示装置５には検索結果として得られるファイル名を表示する。
【００９９】
このようなシステム構成において実行される検索処理を図９のフローチャートを参照して説明する。説明の便宜上、図９のフローチャートは機能表現で記載しているが、実際には、ＣＰＵ１が読み取り実行可能なプログラム言語で記載され、ＨＤＤ４に記憶されている。入力装置３からの起動の指示に応じて、図９のプログラムがＨＤＤ４からシステムメモリ２に読み出され、ＣＰＵ１により実行される。
【０１００】
図９において、ユーザは、たとえば、「自己組織化を行う装置」という日本語文を入力装置３から入力する。入力された日本語文からはＣＰＵ１の周知の形態素分析により、「自己」「組織化」「装置」の単語が抽出され、システムメモリ２に一時記憶される（ステップＳ１０）。
【０１０１】
ＣＰＵ１はＨＤＤ４に格納されたデータベースの中から第１番目のニュース、すなわち、文書ファィルをシステムメモリ２に読み出す。読み出された文書についても形態素解析が行われ、単語が抽出される（ステップＳ２０）。ここで、上述した分類方法に従った分類処理が開始される。より具体的には、単語を標本として、ＣＰＵ１は数２から数４式を満足する標本の分布をシステムメモリ２上に作成する。なお、この標本分布の作成と同時に、数４式の条件が組み込まれる。なお、数４式では、閾値ａより距離が大きい標本と上記距離が小さい標本とでは、異なる距離の算出式を使用するので、２つの標本の距離が離れている場合、分布の中心から距離の離れた標本については評価値を大きくする補正が数５式により行われる。次に、ＣＰＵ１は数６式から数１８式を実行して、標本の分布の偏りを補正する。乱数を使用した標本の再配置の繰り返しにより標本が分布の中心からの距離に比例してそれらの個数が多くなるように標本の距離が補正される（ステップＳ３０）。
【０１０２】
このように補正された標本の分布を使用して、従来と同様に主成分分析を行うと、関連のある標本（この場合）がシステムメモリ２上で１つのグループ（いわゆるクラス）に分類される（ステップＳ４０）。以上の分類処理に平行して、各標本の文書ファイル中の出現頻度もＣＰＵ１により計数され、その計数結果と、上述の分類処理で得られる単語の親近度がこのシステムメモリ２に格納される。
【０１０３】
次にＣＰＵ１はステップＳ１０で抽出された検索目的の単語（いわゆるキーワード）、すなわち、「自己」「組織化」「装置」とステップＳ４０で分類された単語とを比較し、すべて合致する場合には、上記分類の対象となった文書ファイルのファイル名、合致した単語の出現頻度をシステムメモリ２上にリストアップする（ステップ５０→Ｓ６０）。この後、手順はステップＳ７０を経由してステップＳ２０に戻り、ステップＳ２０〜Ｓ４０でデータベースに保存された次の文書ファイルのデータ分類処理が行われる。
【０１０４】
一方、ステップＳ５０の単語の合致判定処理で不一致の判定が得られた場合には手順をステップＳ２０に戻し、データベースに保存された次のファイルについての分類処理が行われる。
【０１０５】
このようにして、データベース上のすべての文書ファイルについて、上述の分類処理および単語の合致判定処理、ファイル名リストアップ処理を行うと（ステップＳ７０のＹＥＳ判定）、ＣＰＵ１はシステムメモリ２上にリストアップされたファイル名をソータティング（並べ替え）する。並べ替えの判断基準は、親近度および出現頻度の高いファイル名が上位に位置する。ソーティングの処理自体は周知であり、詳細な説明を要しないであろう。このようにして、得られたファイル名のリストが表示装置５に可視表示される（ステップＳ８０）。
【０１０６】
上述の実施形態の他に次の形態を実施できる。
【０１０７】
１）上述の実施形態はデータファイルが文書ファイル、すなわち、複数の単語を有する文書（テキストとも称する）であったが、データファイルとしては、音声（人間の声）データ、動画データ、音響データさらには楽譜データ等のファイルについても本発明を適用できる。この場合には、音声データを複数の音声単位、たとえば、音素、音韻、単語等所定の音声長さ単位で区切った音声データを標本と使用すればよい。動画は複数の静止画で構成されているので、静止画を標本として使用する。音楽のような音響データ、楽譜データはたとえば、１小節のような長さの音楽データを標本として使用するとよい。このような、音声データ、動画、音響データファイルを対象とする検索システムでは、検索目的のデータをデータベースに登録されたデータと同一の種類の音声データ、動画データ、音響データで与えることができる。このようなファイル検索の用途としてはたとえば、小説、音楽、楽譜をデータベースに登録しておき、著作権の侵害の有無の判定のために対象のデータを検索にかけるといった用途も考えられる。
【０１０８】
２）上述の実施形態では、検索により取得したファイル名のソーティングについては出現頻度および親近度の双方を並び替えの判断基準として使用したが、いずれか一方のみを使用してもよいこと勿論である。
【０１０９】
３）図９に示すプログラムをフロッピーディスクやＣＤＲＯＭ等の各種の記録媒体に記録して、図８のＨＤＤ４にインストールしてもよいこと勿論である。
【０１１０】
【発明の効果】
以上、説明したように、請求項１、５の発明では、親近度が高く、意味内容の異なる標本の分布上の集中が緩和され、逆に分布上で集中がない標本については、個数が増やされる。これにより、ノイズの影響をなくし、さらには分布上で集中した標本を別のグループに分類することができ、従来よりも分類精度を向上させることができる。
【０１１１】
請求項２、６の発明では、コンピュータが処理する各種の文書ファイルに含まれるデータを精度よく自己組織化することができる。
【０１１２】
請求項３、７の発明では、コンピュータが処理する各種の音声データファイルに含まれるデータを精度よく自己組織化することができる。
【０１１３】
請求項４、８の発明では、コンピュータが処理する各種の動画データファイルに含まれるデータを精度よく自己組織化することができる。
【０１１４】
請求項９〜１１の発明では、検索目的のデータ（文字の場合、キーワード）が複数有る場合には、個々のデータに合致するだけでなく、データの間の最も関連の深い（親近度の高い）ファイルやデータの出現頻度の高いファイルが検索結果の上位として得られる。
【図面の簡単な説明】
【図１】本発明実施形態の標本分布と標本位置の補正を説明するための説明図である。
【図２】本発明実施形態のシミュレーション結果を示す説明図である。
【図３】本発明実施形態のシミュレーション結果を示す説明図である。
【図４】従来の標本分布を示す説明図である。
【図５】従来の標本分布を示す説明図である。
【図６】本発明実施形態の標本分布を示す説明図である。
【図７】本発明実施形態の標本分布を示す説明図である。
【図８】本発明実施形態のシステム構成を示すブロック図である。
【図９】本発明実施形態の処理手順を示すフローチャートである。
【符号の説明】
１ＣＰＵ
２システムメモリ
３入力装置
４ＨＤＤ
５表示装置
[0001]
[Technical field of the invention]
The present invention relates to a classification apparatus, a classification method, and a file search method that group the data items in a file into sets of closely related items.
[0002]
[Prior art]
Self-organization of data, that is, gathering closely related data together and dividing the data into a plurality of groups, is regarded in pattern recognition as a form of automatic model generation and is an important research theme. The following reference is known on data self-organization.
[0003]
(1) T. Kohonen: Self-Organizing Maps: Springer-Verlag, (1995)
In self-organization for moving-image recognition and speech recognition, for example, attempts have been made to self-organize input data by treating temporal continuity as the mutual relationship, and the following reference has been published.
[0004]
(2) Takashi Endo, et al.: Self-organizing network model of moving images: analysis of its topology and dynamic features: Japanese Society for Artificial Intelligence, Information Integration Study Group, SIG-CII-9707, (1997.7)
In document processing as well, self-organization has been attempted on the basis of co-occurrence within the same document. The task is treated as a problem of arranging words and documents in a space, and the arrangement is used for search and classification. The following references have been published.
[0005]
(3) Jun Toyoura, Ryuichi Oka: On the self-organization of text data for text retrieval: Japanese Society for Artificial Intelligence, Information Integration Study Group, SIG-CII-9603, pp. 16-23, (1997.3)
(4) Naoto Honma, Masumi Ishikawa: Bidirectional spatial layout of keywords and documents using the inverse problem of quantification type III: Journal of IEICE, D-II, J81-DII, 3, pp. 564-573, (1998.3)
Furthermore, the following reference is known as related to the present invention:
(5) Tomio Hayashi, et al.: Quantification theory and data processing: Asakura Shoten, (1987)
The classification method described in that reference, called quantification type IV, is explained below.
[0006]
In quantification type IV, a finite number of samples is given together with the strength of the affinity between samples, and samples with higher affinity are placed closer together in a finite-dimensional space. As a result, samples with high mutual affinity can be expected to gather in the space and self-organize. In the analysis of time-series data such as audio or images, for example, each event within a fixed time span can be taken as a sample and continuity can be regarded as the affinity. In document understanding, characters or morphemes can be regarded as samples, and co-occurrence in the same document or context can be regarded as the affinity.
[0007]
Consider the problem of placing the samples to be analyzed in an N-dimensional space. Each sample is given an arbitrary number i, and its position in the space is denoted x_i. The affinity between samples i and j is given and is denoted M_ij. M_ij is positive and takes a larger value the higher the affinity.
[0008]
So that the distance between samples corresponds to the affinity, the distance relationship is defined as the squared distance between the samples multiplied by −1.
[0009]
[Equation 1]
d_ij = −(x_j − x_i)²
Next, the product of the affinity and the distance relationship is taken for each pair of samples, and the positions x_i that maximize the sum of these products are sought.
[0010]
[Equation 2]

[0011]
However, with this condition alone, the sum becomes 0, which is its maximum, when all samples are located at the same point, so the condition is trivially satisfied. The following condition is therefore added so that the sample positions x_i keep a fixed variance.
[0012]
[Equation 3]

[0013]
A method is known for reducing the conditions of Equations 2 and 3 to a matrix eigenvalue problem, so the solution can be obtained analytically.
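The images for Equations 2 and 3 did not survive in this text. From the surrounding description, one plausible reading is sketched below. It is an assumption for illustration, not the patent's exact notation; in particular, the value of the constant in the constraint is unknown.

```latex
J \;=\; \sum_{i}\sum_{j} M_{ij}\, d_{ij}
  \;=\; -\sum_{i}\sum_{j} M_{ij}\,(x_j - x_i)^2 \;\longrightarrow\; \max,
\qquad \text{subject to} \qquad \sum_{i} x_i^{2} = \text{const.}
```

Under the additional assumption that M is symmetric, the identity \sum_{i,j} M_{ij}(x_j - x_i)^2 = 2\,x^{t}(D - M)\,x holds for each coordinate, where D is the diagonal matrix with D_{ii} = \sum_j M_{ij}. Maximizing J under the constraint then amounts to an eigenvalue problem for D − M, which is consistent with the statement that the solution can be obtained analytically.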
[0014]
[Problems to be solved by the invention]
When such a technique is applied to documents, for example, the samples are words, and samples x_i with high affinity are extracted as words that characterize the document. However, when the technique is applied to real-world data, random noise (data with an extremely low frequency of appearance) is added to the affinities, and it becomes difficult to separate (extract) closely related data (in this case, words). This was a problem to be solved.
[0015]
Accordingly, an object of the present invention is to provide a classification apparatus, method, and file search method that are less affected by noise.
[0016]
[Means for Solving the Problems]
To achieve this object, the invention of claim 1 is a classification apparatus that, in order to group a plurality of data items in a file into sets of closely related data, regards the data items as samples and statistically analyzes the distribution of the plurality of samples as a whole with respect to the affinity, which indicates the degree of association between two samples, and the distance between two samples, the apparatus comprising:
first correction means for correcting the bias of the distribution of the plurality of samples by translating the samples so that the origin becomes the center of the distribution and then obtaining a covariance matrix and performing eigenvalue decomposition; and
second correction means for correcting the distances of the plurality of samples by repeating, for all samples, a placement using random numbers so that the number of samples increases in proportion to the distance from the center of the distribution corrected by the first correction means to the sample.
[0017]
According to a second aspect of the present invention, in the classification apparatus according to the first aspect, the file is a document including a plurality of words, and the sample is the word.
[0018]
According to a third aspect of the present invention, in the classification apparatus according to the first aspect, the file is a file including a plurality of sound elements, and the sample is the sound element.
[0019]
According to a fourth aspect of the present invention, in the classification apparatus according to the first aspect, the file is a moving image having a plurality of still images, and the sample is the still image.
[0020]
The invention of claim 5 is a classification method in which, in order to group a plurality of data items in a file into sets of closely related data, the data items are regarded as samples and a computer statistically analyzes the distribution of the plurality of samples as a whole with respect to the affinity, which indicates the degree of association between two samples, and the distance between two samples, wherein the computer operates as:
first correction means for correcting the bias of the distribution of the plurality of samples by translating the samples so that the origin becomes the center of the distribution and then obtaining a covariance matrix and performing eigenvalue decomposition; and
second correction means for correcting the distances of the plurality of samples by repeating, for all samples, a placement using random numbers so that the number of samples increases in proportion to the distance from the center of the distribution corrected by the first correction means to the sample.
[0021]
According to a sixth aspect of the present invention, in the classification method according to the fifth aspect, the file is a document including a plurality of words, and the sample is the word.
[0022]
The invention according to claim 7 is the classification method according to claim 5, wherein the file is a file including a plurality of sound elements, and the sample is the sound element.
[0023]
According to an eighth aspect of the present invention, in the classification method according to the fifth aspect, the file is a moving image having a plurality of still images, and the sample is the still image.
[0024]
The invention of claim 9 is a file search method in which a computer searches files registered in a database, wherein the computer operates as: means for inputting search-target data of the same type as the data constituting the files registered in the database; and means for applying the classification method of claim 5 to the data contained in the files registered in the database and determining, for every file registered in the database, whether the data classified by that method matches the data input by the inputting means; and wherein the files for which a match is determined are taken as the search result.
[0025]
The invention of claim 10 is the file search method according to claim 9, wherein the computer further operates as means for listing the files obtained as the search result by file name, and means for sorting the listed file names in order of the affinity of the matched data.
[0026]
The invention of claim 11 is the file search method according to claim 9, wherein the computer further operates as means for counting the frequency of appearance of the data contained in the files, means for listing the files obtained as the search result by file name, and means for sorting the listed file names in order of the frequency of appearance of the matched data.
[0027]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
[0028]
First, the classification method according to the present invention is described. In the analysis by quantification type IV described above, when random noise is added to the affinities, as is often the case with real-world data and as the simulation experiment below shows, it becomes difficult to group only closely related samples. Regardless of their relationship to other samples, samples with high affinity concentrate near the center while samples with low affinity move away from the center, and the above conditions end up being satisfied in that way.
[0029]
As a result of studying this problem, the inventors found that it can be addressed by changing the distance definition of Equation 1 so as to relax the penalty for samples separated by more than a certain distance. With the method of the present invention, even when samples that are only weakly related, for example through noise (samples with an extremely low frequency of appearance), are placed far apart, the change in Equation 2 is moderated by the nonlinearization, so the samples can be separated and clustered.
[0030]
Equation 1 is modified by the nonlinear function F and defined as follows.
[0031]
[Equation 4]

[0032]
F is defined by the following equation, which is a quadratic function in the neighborhood (< a) and a linear function outside it.
[0033]
[Equation 5]

[0034]
With this function, the value is the same as the square within the threshold a, while outside it the evaluation value decreases only slowly as the distance between samples grows, so the intended effect is obtained.
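The image for Equation 5 is missing from this text, so the exact form of F is not recoverable here. A minimal sketch of one function with the stated properties (quadratic below a, linear above it) follows. The linear branch `2*a*r - a*a` is an assumption chosen so that the value and slope are continuous at r = a; the names `penalty` and `distance_relation` are hypothetical.

```python
import math


def penalty(r: float, a: float) -> float:
    """Assumed form of F: r**2 for r < a, a straight line of slope 2a beyond a."""
    if r < 0 or a <= 0:
        raise ValueError("r must be non-negative and a must be positive")
    if r < a:
        return r * r
    # Linear continuation that matches value and slope at r == a (assumption).
    return 2.0 * a * r - a * a


def distance_relation(xi, xj, a):
    """Nonlinear counterpart of d_ij = -(x_j - x_i)^2: returns -F(|x_j - x_i|)."""
    r = math.sqrt(sum((p - q) ** 2 for p, q in zip(xi, xj)))
    return -penalty(r, a)
```

With a = 0.1, two samples at distance 5 contribute about −0.99 instead of −25, which illustrates how the penalty for distant pairs is relaxed.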
[0035]
When the least-squares term is made nonlinear in this way, an analytical solution can no longer be obtained, so a numerical solution by an iterative method, described next, is required. Moreover, depending on the distribution of the given affinities and the value of a, the maximization can be satisfied simply by placing particular samples at infinity, in which case the numerical calculation does not converge.
[0036]
In addition, although the concentration of samples near the center is reduced compared with quantification type IV, the effect of the nonlinearization is lost once the distance between samples belonging to different groups near the center falls below a.
[0037]
To prevent this divergence and the concentration at the center, the sample positions are confined within a fixed hypersphere, and the following conditions are added so that the distribution of the samples within the hypersphere becomes uniform.
[0038]
1. The center of the sample distribution is the origin.
[0039]
2. Principal component analysis of the sample distribution reveals no bias in the distribution.
[0040]
3. All samples lie within a spherical shell of fixed radius, and the distribution in the radial direction from the center is proportional to the volume of the sphere in the space.
[0041]
It is possible to solve the conditions for maximization (optimization) and uniformization within the hypersphere as a single expression, but depending on the data scale, it becomes a large-scale matrix calculation. Therefore, the solution is obtained by an iterative method that sequentially satisfies the optimization and each constraint as follows.
[0042]
Substituting Equation 4 into Equation 2 yields the following equation.
[0043]
[Equation 6]

[0044]
Partial differentiation with respect to the position x_i of a particular sample gives the following expression, which is 0 when J takes its maximum value.
[0045]
[Equation 7]

[0046]
[Equation 8]

[0047]
Since F′ is the derivative of Equation 5, it is given as follows.
[0048]
[Equation 9]

[0049]
To expand F′, D is defined as follows.
[0050]
[Equation 10]

[0051]
[Equation 11]

[0052]
This is substituted into Equation 8, which is then solved for x_i, and the result is used as the recurrence formula of the successive approximation method.
[0053]
[Equation 12]

[0054]
This alone does not guarantee convergence, so the uniformization within the hypersphere described next is performed.
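The images for Equations 7 to 12 are missing from this text. The sketch below shows one recurrence that is consistent with the description, under stated assumptions: F is taken as r² for r < a and 2ar − a² otherwise, so that F′(r)/r is proportional to a factor D that equals 1 inside the threshold and a/r outside it. Setting the derivative of J to zero and solving for x_i then gives a weighted mean of the other samples. The function name `update_positions` and this exact form of D are hypothetical, not taken from the patent.

```python
import math


def update_positions(x, m, a):
    """One sweep of the assumed recurrence: x_i <- sum_j w_ij x_j / sum_j w_ij,
    with w_ij = M_ij * D_ij, D_ij = 1 if |x_j - x_i| < a else a / |x_j - x_i|."""
    n = len(x)
    dim = len(x[0])
    new_x = []
    for i in range(n):
        num = [0.0] * dim
        den = 0.0
        for j in range(n):
            if j == i:
                continue
            r = math.sqrt(sum((p - q) ** 2 for p, q in zip(x[i], x[j])))
            d = 1.0 if r < a else a / r  # distant samples pull less strongly
            w = m[i][j] * d
            den += w
            for k in range(dim):
                num[k] += w * x[j][k]
        # Leave the sample where it is if it has no weighted neighbours.
        new_x.append([v / den for v in num] if den > 0 else list(x[i]))
    return new_x
```

All new positions are computed from the old ones before any is replaced; whether the original updates samples one at a time or all at once is not stated in the text.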
[0055]
The translation to the origin and the removal of bias in particular directions are described next.
[0056]
First, translation is performed so that the origin is the center of the distribution.
[0057]
[Equation 13]

[0058]
Next, as in principal component analysis in statistics, the covariance matrix is obtained, and eigenvalue decomposition is used to make the variance the same in every direction.
[0059]
[Outside 1]

[0060]
[Equation 14]

[0061]
This matrix is decomposed into its eigenvalues.
[0062]
[Equation 15]
A = U^t B U
For the obtained eigenvalues σ_1, σ_2, ..., σ_N, the following inverse transformation matrix is constructed.
[0063]
[Equation 16]

[0064]
The following transformation is then applied.
[0065]
[Equation 17]
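The images for Equations 13, 14, 16 and 17 are missing from this text. The centering and variance-equalizing steps can be sketched as follows, assuming the standard whitening construction. The symbol n for the number of samples, the matrix name S, and the exponent −1/2 (which treats the σ_k as variances) are assumptions; only A = U^t B U appears in the source.

```latex
\begin{aligned}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad x_i \leftarrow x_i - \bar{x} \\
A &= \frac{1}{n}\sum_{i=1}^{n} x_i\, x_i^{t} \;=\; U^{t} B\, U, \qquad
B = \operatorname{diag}(\sigma_1, \sigma_2, \dots, \sigma_N) \\
S &= \operatorname{diag}\!\left(\sigma_1^{-1/2}, \sigma_2^{-1/2}, \dots, \sigma_N^{-1/2}\right), \qquad
x_i \leftarrow S\, U\, x_i
\end{aligned}
```

With U orthogonal, the covariance of the transformed samples is S U A U^t S = S B S = I, so the variance is the same in every direction. Any further rescaling to fit the confining hypersphere is not shown.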

[0066]
Next, statistics are taken on the distribution of the samples in the radial direction of the sphere. As shown in FIG. 1, the number of samples lying within a given radius r of the hypersphere is divided by the total number of samples to obtain a normalized value. Regarded as a function of r, this value is denoted U(r). For numerical processing, the range of radii over which the samples are distributed is fixed in advance and divided into 100 steps, and U(r) is replaced by a piecewise-linear approximation.
[0067]
If the samples were ideally uniformly distributed, the count would be proportional to the volume in the radial direction. With the radius of the confining hypersphere set to 1 and the dimension of the space N, U(r) would then coincide with r^N. Therefore
[0068]
[Outside 2]

[0069]
The following transformation is applied.
[0070]
[Equation 18]
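The image for Equation 18 is missing from this text. A minimal sketch of a radial correction with the stated goal (making the cumulative fraction U(r) equal r^N inside a unit hypersphere) is given below. It assumes each sample keeps its direction and is moved to radius U(r)^(1/N). It also uses the exact rank of each radius in place of the 100-step piecewise-linear approximation described above. The name `uniformize_radii` is hypothetical.

```python
import math


def uniformize_radii(points):
    """Rescale each point so the empirical radial CDF becomes r**N (unit sphere)."""
    n = len(points)
    dim = len(points[0])
    radii = [math.sqrt(sum(c * c for c in p)) for p in points]
    order = sorted(range(n), key=lambda i: radii[i])
    out = [None] * n
    for rank, i in enumerate(order, start=1):
        u = rank / n                  # empirical U(r): fraction of samples within r
        target = u ** (1.0 / dim)     # radius at which r**N equals that fraction
        # A sample exactly at the origin has no direction and is left there.
        scale = target / radii[i] if radii[i] > 0 else 0.0
        out[i] = [c * scale for c in points[i]]
    return out
```

For four points in two dimensions, the sorted output radii are sqrt(1/4), sqrt(2/4), sqrt(3/4) and 1, whatever the input radii were.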

[0071]
The positions of all samples are obtained iteratively by the following procedure.
[0072]
1. As an initial value, each sample i is placed using random numbers so that the samples are uniformly distributed within a sphere of fixed radius.
[0073]
[Outside 3]

[0074]
2. With t as the iteration count, Equation 12 is calculated for all samples i
[0075]
[Outside 4]

[0076]
3. For the obtained
[0077]
[Outside 5]

[0078]
the uniformization within the sphere is performed.
[0079]
4. Add 1 to t and return to 1.
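Step 1 of the procedure above only says that random numbers are used to obtain a uniform distribution within a sphere. One common way to do this, shown here as an assumption rather than as the patent's method, is to draw a random direction from normalized Gaussian components and a radius of u^(1/N) for uniform u. The name `sample_in_ball` is hypothetical.

```python
import math
import random


def sample_in_ball(n, dim, radius=1.0, rng=None):
    """Draw n points uniformly from a dim-dimensional ball of the given radius."""
    rng = rng or random.Random()
    points = []
    for _ in range(n):
        norm = 0.0
        while norm == 0.0:  # guard against the (vanishingly rare) zero vector
            g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            norm = math.sqrt(sum(v * v for v in g))
        # u**(1/dim) makes the count within radius r proportional to r**dim.
        r = radius * rng.random() ** (1.0 / dim)
        points.append([r * v / norm for v in g])
    return points
```

For the simulation settings reported below, a call such as `sample_in_ball(1000, 10)` would give an initial placement in a unit hypersphere.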
[0080]
The capability of the method of this embodiment is compared with that of quantification type IV by simulation. The number of samples was 1000. The samples were divided into 100 classes of 10 samples each. The affinity between samples belonging to the same class was drawn from a uniform random distribution on the interval [0, 1). The affinity between samples of different classes was given as noise, drawn from a uniform random distribution on the interval [0, α). The closer α is to 1, the larger the noise. In an ideal clustering, the 10 samples belonging to each class gather together in the space.
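The simulated affinities described above can be generated along the following lines. This is a minimal sketch: the function name `make_affinity`, the symmetric matrix, and the zero diagonal are assumptions not stated in the text.

```python
import random


def make_affinity(n_classes, per_class, alpha, seed=0):
    """Build an affinity matrix: same class ~ U[0, 1), different class ~ U[0, alpha)."""
    rng = random.Random(seed)
    n = n_classes * per_class
    labels = [i // per_class for i in range(n)]
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            upper = 1.0 if labels[i] == labels[j] else alpha
            # rng.random() lies in [0, 1), so the product lies in [0, upper).
            m[i][j] = m[j][i] = rng.random() * upper
    return m, labels
```

The setup in the text corresponds to `make_affinity(100, 10, alpha)` with alpha set to 0.01 or 0.1.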
[0081]
The cases where α is 0.01 and 0.1 are shown here. The affinities for these cases are shown in FIGS. 2 and 3, which show the mutual relationships among 30 samples belonging to three classes out of the 1000 samples. The vertical and horizontal axes are the samples, and the affinity at each intersection is indicated by the size of a square.
[0082]
These samples are classified by quantification type IV and by the present embodiment, and the results are compared. Each method places the samples in a 10-dimensional space. In this embodiment the radius of the hypersphere was set to 1 and a was set to 0.1. The calculation had almost converged after 50 iterations. Since the 1000 samples are located in a 10-dimensional space, all points were orthogonally projected onto a two-dimensional plane for visualization, and the results are shown in FIGS. 4 to 7.
[0083]
Even with quantification type IV, the classes can be separated when α is 0.01 (FIG. 4). However, the classes located near the origin crowd together. With the present embodiment on the same data (FIG. 6), the classes are clearly separated.
[0084]
When α is 0.1, with quantification type IV most samples concentrate in a thin rod-shaped region of the hyperspace, as shown in FIG. 5, and can no longer be separated into clusters. With the present embodiment (FIG. 7), in contrast, the variance of each class increases but the classes are still clearly separated.
[0085]
When use for recognition or search is considered, the class of a query is determined, for example, from the sample nearest to it. To examine the capability of the algorithms, it was therefore checked, for each of the 1000 samples, whether its nearest sample belongs to the same class as originally intended. As shown in Table 1, both methods discriminate correctly when α is 0.01 or less. When α exceeds that value, however, discrimination becomes difficult with quantification type IV, and its capability is clearly inferior.
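The nearest-neighbor check described above can be sketched as follows. The function name `nearest_neighbor_accuracy` is hypothetical, and the contents of Table 1 are not reproduced because the table did not survive in this text.

```python
def nearest_neighbor_accuracy(points, labels):
    """Fraction of samples whose nearest other sample carries the same class label."""
    def sqdist(p, q):
        return sum((u - v) ** 2 for u, v in zip(p, q))

    n = len(points)
    hits = 0
    for i in range(n):
        # Nearest sample other than i itself (ties resolved by lowest index).
        nearest = min((j for j in range(n) if j != i),
                      key=lambda j: sqdist(points[i], points[j]))
        if labels[nearest] == labels[i]:
            hits += 1
    return hits / n
```

A value of 1.0 means that every sample's nearest neighbor lies in its own class.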
[0086]
[Table 1]

[0087]
The CPU time required for this simulation was 215 seconds for 50 iterations (Cray CS6400, Sun SPARC 85 MHz).
[0088]
An article search system for net news to which this embodiment is applied will be described.
[0089]
In Japan, the fj category of the news system on the Internet has been in operation since 1985. About 2.35 million articles posted from its inception to the present have been collected, and the aim is to provide search over them.
[0090]
As preprocessing, the body of every article is morphologically analyzed with Chasen and split into words. All of these words are regarded as the samples to be processed statistically and are placed in the hyperspace by the method of this embodiment.
[0091]
To obtain the affinity between words, all pairs of words that co-occur within five words before or after each other in the collected article bodies are first counted, and the count is denoted N_ij. Next, to keep samples from concentrating around items with a high frequency of occurrence, such as the particle "wa" (は) and suffixes, the count is normalized using the number of occurrences N_i of each word to give the affinity M_ij.
[0092]
[Equation 19]
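The image for Equation 19 is missing from this text, so the exact normalization is unknown. The sketch below counts co-occurrences within a window of five words, as described, and then applies a placeholder normalization N_ij / (N_i · N_j). That formula is a hypothetical stand-in chosen for illustration, not the patent's Equation 19. The names `cooccurrence_counts` and `affinity` are also hypothetical, and a token list stands in for Chasen output.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=5):
    """Return (N_i, N_ij): word counts and counts of distinct-word pairs
    that appear within `window` positions of each other."""
    n_i = Counter(tokens)
    n_ij = Counter()
    for p, w in enumerate(tokens):
        for q in range(p + 1, min(p + window + 1, len(tokens))):
            if tokens[q] != w:
                n_ij[tuple(sorted((w, tokens[q])))] += 1
    return n_i, n_ij


def affinity(n_i, n_ij, wi, wj):
    """HYPOTHETICAL normalization N_ij / (N_i * N_j); Equation 19 itself is not available."""
    if n_i[wi] == 0 or n_i[wj] == 0:
        return 0.0
    return n_ij[tuple(sorted((wi, wj)))] / (n_i[wi] * n_i[wj])
```

Any normalization that divides by the word frequencies has the effect described in the text: very frequent words such as particles no longer dominate the affinities.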

[0093]
Since the classification method of this embodiment places sample i at coordinate x_i, the words placed nearby in the space are then looked up in advance for each sample (word). Nearby words co-occur frequently within a single document and are therefore considered closely related, which enables fuzzy search of words and sentences.
[0094]
The user can search by entering an ordinary Japanese sentence. The sentence is split into words (morphemes) by Chasen, and articles containing those words and their neighboring words are retrieved.
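The query flow just described can be sketched as follows. This is a minimal illustration under several assumptions: the query and articles are already split into words (standing in for Chasen output), word coordinates are given as a dictionary, each query word is expanded with its k nearest words, and articles are ranked only by the number of shared words. The names `nearest_words` and `search` are hypothetical.

```python
import math


def nearest_words(word, coords, k):
    """Return the k words placed closest to `word` in the embedding space."""
    if word not in coords:
        return []
    others = [w for w in coords if w != word]
    others.sort(key=lambda w: math.dist(coords[word], coords[w]))
    return others[:k]


def search(query_words, coords, articles, k=1):
    """Rank article names by how many query words or neighboring words they contain."""
    expanded = set(query_words)
    for w in query_words:
        expanded.update(nearest_words(w, coords, k))
    scored = []
    for name, words in articles.items():
        shared = len(expanded & set(words))
        if shared:
            scored.append((-shared, name))  # more shared words first, then by name
    return [name for _, name in sorted(scored)]
```

For example, a query for "cat" also retrieves an article that mentions only "kitten" when the two words are placed close together, which is the fuzzy-search behavior the text describes.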
[0095]
Retrieved articles are scored from the number of words they share with the query, the frequency of occurrence of each word, and the distance in the Galaxy space, and are displayed in order starting from those considered most closely related.
[0096]
FIG. 8 shows the configuration of a file search system with a built-in classification device for searching the net news described above. A general-purpose computer such as a personal computer or a workstation can be used as the file search system; because the hardware relates to the present invention, its configuration is briefly described. In FIG. 8, reference numeral 1 denotes a CPU, which controls each component of the system according to system programs stored in a system memory 2 and a hard disk drive (HDD) 4. The CPU 1 also searches the net news according to the search program of FIG. 9 stored in the HDD 4. When it executes the classification process (described later) within the search program, the CPU 1 functions as the classification device.
[0097]
The system memory 2 stores the above-described system program and various data used for the operation of the CPU 1. The input device 3 inputs news (document files) to be registered in the database. In this example, a keyboard is used as the input device. However, any device capable of inputting a document file can also be used as the input device, such as a communication device connected to the Internet or a recording-medium reader that reads a document file from a recording medium such as a floppy disk. The input device 3 is also used to input the content to be searched for (a Japanese sentence).
[0098]
The HDD 4 stores part of the above-described system program and the search program of FIG. 9, as well as a database that accumulates the net news to be searched. The HDD 4 also stores, in table form, the words used in the data classification process (which gathers related items into one group) and the degree of closeness between those words. A word dictionary for Japanese morphological analysis is also stored in the HDD 4. The display device 5 displays the file names obtained as search results.
[0099]
A search process executed in this system configuration will be described with reference to the flowchart of FIG. 9. For convenience of explanation, the flowchart of FIG. 9 is written in functional terms, but the actual program is written in a programming language that the CPU 1 can read and execute, and is stored in the HDD 4. In response to an activation instruction from the input device 3, the program of FIG. 9 is read from the HDD 4 into the system memory 2 and executed by the CPU 1.
[0100]
In FIG. 9, the user inputs, for example, the Japanese sentence “device for self-organization” from the input device 3. The CPU 1 extracts the words “self”, “organization”, and “device” from the input sentence by well-known morphological analysis and temporarily stores them in the system memory 2 (step S10).
[0101]
The CPU 1 reads the first news article, that is, a document file, from the database stored in the HDD 4 into the system memory 2. Morphological analysis is also performed on the read document, and words are extracted (step S20). The classification process according to the classification method described above then starts. More specifically, using the words as samples, the CPU 1 creates in the system memory 2 a distribution of samples satisfying Equations 2 to 4. The condition of Equation 4 is incorporated while the sample distribution is being created. Note that Equation 4 uses different distance formulas for sample pairs whose distance is larger than the threshold a and for pairs whose distance is smaller. When the distance between two samples is large, the distance from the center of the distribution is calculated, and for a distant sample a correction that increases the evaluation value is applied according to Equation 5. Next, the CPU 1 executes Equations 6 to 18 to correct the bias of the sample distribution. By repeatedly rearranging the samples using random numbers, the sample distances are corrected so that the number of samples increases in proportion to the distance from the center of the distribution (step S30).
[0102]
When principal component analysis is performed on the corrected sample distribution in the same manner as in the prior art, related samples (in this case, words) are classified into one group (a so-called class) in the system memory 2 (step S40). In parallel with this classification process, the CPU 1 also counts the appearance frequency of each sample in the document file, and the count result and the degree of closeness between words obtained by the classification process are stored in the system memory 2.
[0103]
Next, the CPU 1 compares the search words (so-called keywords) extracted in step S10, that is, "self", "organization", and "device", with the words classified in step S40. If they match, the file name of the document file being classified and the appearance frequency of the matched words are listed in the system memory 2 (steps S50 → S60). The procedure then returns to step S20 via step S70, and the data classification process of steps S20 to S40 is performed for the next document file stored in the database.
[0104]
On the other hand, if a mismatch determination is obtained in the word match determination process in step S50, the procedure returns to step S20, and the classification process for the next file stored in the database is performed.
[0105]
When the classification process, word match determination process, and file name listing process described above have been performed for all document files in the database (YES determination in step S70), the CPU 1 sorts the file names listed in the system memory 2. As the sorting criterion, file names with a higher degree of closeness and appearance frequency are ranked higher. The sorting process itself is well known and needs no detailed description. The resulting list of file names is displayed on the display device 5 (step S80).
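
The control flow of steps S20 to S80 can be summarized in a short sketch. It is a simplification under stated assumptions: the classification of steps S30 and S40 is replaced by a precomputed table `pair_closeness` (a hypothetical name), a file's closeness is taken as the sum over its matched keyword pairs, and ties in closeness are broken by appearance frequency. None of these choices is specified in the source.

```python
from itertools import combinations


def search_files(database, keywords, pair_closeness):
    """Loop over files, keep those with a keyword match, sort by closeness then frequency."""
    hits = []
    for file_name, words in database.items():  # S20: read the next document
        matched = [k for k in keywords if k in words]  # S50: word match check
        if not matched:
            continue  # mismatch: move on to the next file
        frequency = sum(words.count(k) for k in matched)
        # ASSUMPTION: closeness of a file = summed closeness of its matched pairs.
        closeness = sum(
            pair_closeness.get(frozenset(pair), 0.0)
            for pair in combinations(matched, 2)
        )
        hits.append((file_name, closeness, frequency))  # S60: list the file name
    hits.sort(key=lambda hit: (hit[1], hit[2]), reverse=True)  # S80: sort
    return [file_name for file_name, _, _ in hits]


database = {
    "a.txt": ["self", "organization", "device"],
    "b.txt": ["device", "device", "device"],
    "c.txt": ["weather"],
}
pair_closeness = {frozenset(("self", "organization")): 0.9}
result = search_files(database, ["self", "organization", "device"], pair_closeness)
```

Both "a.txt" and "b.txt" have an appearance frequency of 3, but "a.txt" ranks first because its matched keywords are also closely related to each other; "c.txt" is dropped at the match check.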
[0106]
In addition to the above embodiment, the following embodiments can be implemented.
[0107]
1) In the above-described embodiment, the data file is a document file, that is, a document (also referred to as text) containing a plurality of words. The present invention can also be applied to data files such as voice (human voice) data, moving image data, acoustic data, and musical score data. For voice data, the sound elements obtained by dividing the data into units of a predetermined length, such as phonemes or words, may be used as samples. Since a moving image is composed of a plurality of still images, each still image is used as a sample. For acoustic data such as music and for musical score data, music data one measure long, for example, may be used as a sample. In a search system for such audio, moving image, and acoustic data files, the search target data can be given as data of the same type as the data registered in the database. For example, such a file search may be performed by registering novels, music, and musical scores in a database and searching with the target data to determine whether there is a copyright infringement.
[0108]
2) In the above-described embodiment, both the appearance frequency and the degree of closeness are used as criteria for sorting the file names obtained by the search. Of course, only one of them may be used.
[0109]
3) Of course, the program shown in FIG. 9 may be recorded on various recording media such as a floppy disk or a CD-ROM and installed in the HDD 4 of FIG. 8.
[0110]
[Effects of the Invention]
As described above, in the inventions of claims 1 and 5, the concentration in the distribution of samples that have a high degree of closeness but different semantic content is alleviated, and conversely, the number of samples that are not concentrated in the distribution is increased. As a result, the influence of noise can be eliminated, the samples concentrated in the distribution can be classified into separate groups, and the classification accuracy can be improved compared with the prior art.
[0111]
In the inventions of claims 2 and 6, data contained in various document files processed by the computer can be self-organized with high accuracy.
[0112]
In the inventions of claims 3 and 7, data contained in various audio data files processed by the computer can be self-organized with high accuracy.
[0113]
In the inventions of claims 4 and 8, data contained in various moving image data files processed by the computer can be self-organized with high accuracy.
[0114]
In the inventions of claims 9 to 11, when there are multiple items of search data (keywords, in the case of text), files that not only match the individual items but also have the strongest relation (highest degree of closeness) between those items, and files in which the data appear with high frequency, are obtained as higher-ranked search results.
[Brief description of the drawings]
FIG. 1 is an explanatory diagram for explaining correction of a sample distribution and a sample position according to an embodiment of the present invention.
FIG. 2 is an explanatory diagram showing a simulation result of an embodiment of the present invention.
FIG. 3 is an explanatory diagram showing a simulation result of an embodiment of the present invention.
FIG. 4 is an explanatory diagram showing a conventional sample distribution.
FIG. 5 is an explanatory diagram showing a conventional sample distribution.
FIG. 6 is an explanatory diagram showing a sample distribution according to the embodiment of the present invention.
FIG. 7 is an explanatory diagram showing a sample distribution according to the embodiment of the present invention.
FIG. 8 is a block diagram showing a system configuration of an embodiment of the present invention.
FIG. 9 is a flowchart showing a processing procedure according to the embodiment of the present invention.
[Explanation of symbols]
1 CPU
2 System memory
3 Input device
4 HDD
5 Display device

Claims

1. A classification apparatus that, in order to group and classify a plurality of data in a file into closely related data, regards each datum as a sample, relates a degree of closeness indicating the degree of association between two samples to the distance between the two samples, and statistically analyzes the distribution of a plurality of samples, the classification apparatus comprising:
first correction means for correcting the bias of the distribution of the plurality of samples by translating the distribution so that the origin is at its center, and then calculating a covariance matrix and performing eigendecomposition; and
second correction means for correcting the distances of the samples by repeating, for all the samples, an arrangement using random numbers so that the number of samples increases in proportion to the distance from the center of the distribution of the plurality of samples corrected by the first correction means.

2. The classification apparatus according to claim 1, wherein the file is a document including a plurality of words, and the sample is the word.

3. The classification apparatus according to claim 1, wherein the file is a file composed of a plurality of sound elements, and the sample is the sound element.

4. The classification apparatus according to claim 1, wherein the file is a moving image having a plurality of still images, and the sample is the still image.

5. A classification method in which, in order to group and classify a plurality of data in a file into closely related data, a computer regards each datum as a sample, relates a degree of closeness indicating the degree of association between two samples to the distance between the two samples, and statistically analyzes the distribution of a plurality of samples, wherein the computer operates as:
first correction means for correcting the bias of the distribution of the plurality of samples by translating the distribution so that the origin is at its center, and then calculating a covariance matrix and performing eigendecomposition; and
second correction means for correcting the distances of the samples by repeating, for all the samples, an arrangement using random numbers so that the number of samples increases in proportion to the distance from the center of the distribution of the plurality of samples corrected by the first correction means.

6. The classification method according to claim 5, wherein the file is a document including a plurality of words, and the sample is the word.

7. The classification method according to claim 5, wherein the file is a file composed of a plurality of sound elements, and the sample is the sound element.

8. The classification method according to claim 5, wherein the file is a moving image having a plurality of still images, and the sample is the still image.

9. A file search method in which a computer searches files registered in a database, wherein the computer operates as:
means for inputting search data of the same type as the data constituting the files registered in the database; and
means for applying the classification method according to claim 5 to the data contained in a file registered in the database and determining, for all the files registered in the database, whether the data classified by the classification method match the data input by the inputting means,
and wherein the files determined to match are taken as the search result.

10. The file search method according to claim 9, wherein the computer further operates as means for listing the files obtained as the search result by file name, and means for sorting the listed file names in order of the degree of closeness of the matched data.

11. The file search method according to claim 9, wherein the computer further operates as means for counting the appearance frequency of the data contained in the files, means for listing the files obtained as the search result by file name, and means for sorting the listed file names in order of the appearance frequency of the matched data.