JP3772401B2

JP3772401B2 - Document classification device

Info

Publication number: JP3772401B2
Application number: JP19954396A
Authority: JP
Inventors: 博増市
Original assignee: Fuji Xerox Co Ltd; Fujifilm Business Innovation Corp
Current assignee: Fujifilm Business Innovation Corp
Priority date: 1996-07-11
Filing date: 1996-07-11
Publication date: 2006-05-10
Anticipated expiration: 2016-07-11
Also published as: JPH1027125A

Description

【０００１】
【発明の属する技術分野】
本発明は、ネットワークシステム上に存在する電子化された多数の文書を分類する文書分類装置に関し、特に、ハイパーテキストのような複雑にリンク付けされた多数の文書を分類する文書分類装置に関するものである。
【０００２】
【従来の技術】
今日、インターネットの普及に伴い、物理的に離れた位置に存在するコンピュータシステム上の電子文書にネットワークを介して容易にアクセスすることができるようになっている。このような電子文書は、文書の中に他の電子文書を参照するためのリンク情報を埋め込むことが可能であり、リンク情報が埋め込まれた電子文書は、そのリンク情報を辿ることによって当該電子文書に関連する他の電子文書に容易に到達することができる。このようなリンク情報が埋め込まれた電子文書の形態を、一般にハイパーテキストと呼んでいる。
【０００３】
インターネットのようなネットワークシステムにおいて、アクセス可能な電子文書の数が大量に増加すると、この大量の電子文書からリンク情報のみにしたがって所望の文書を探し出すことが困難になりつつある。
【０００４】
このような問題を解決するための１つの方法として、インターネット上で公開されている電子文書を対象とした検索サービスを提供するシステムが増えつつある。これらの検索システムでは、大量の文書に対して一括したキーワード検索を行うことができる。すなわち、インターネット上で公開されている電子文書を予め可能な限り漏れなく探索しておき、各文書の内容を取得しておくことにより、このような一括のキーワード検索を行うことができるようにしている。
【０００５】
また、更に、このような検索システムにおいては、各文書をその内容にしたがっていくつかのカテゴリーへと分類しておくことによって、より検索効率の向上を図るものがある。この場合のシステムの利用者は、所望の文書が含まれていると思われるカテゴリーを中心にキーワード検索を行うことが可能となり、検索効率の向上が期待できる。
【０００６】
ところで、文書を分類する方法には、人手によって行う方法と、文書間の距離に基づいた計算によって自動的に行う方法とがある。大量の文書を分類する場合には、効率の点から、後者の方法が有利である。
【０００７】
（従来技術１）
このような文書を分類する手法として、例えば、文献「Luhn, H. P., 'A statistical approach to mechanised encoding and searching of library information', IBM journal of research and development, 1, 309-17 (1957)」において論じられているように、文書中に含まれる各単語の出現頻度を基に単語の重み付けを行なう方法がある。この場合、特に、重みの高い単語は、その文書を代表するキーワードとみなすことができる。
【０００８】
（従来技術２）
また、単語の重みから文書間距離を求める手法が、例えば、文献「Salton, G. and McGill, N. J., ' Introduction to modern information retrieval', New York, McGraw-Hill (1983)」で提案されており、いくつかの文書分類システムにおいて採用されている。
【０００９】
このような文書分類システムにおいては、文書Ｐｉに対して各単語Ｒｕの重みＷiuが設定されているものとすると、文書Ｐｉの文書ベクトルＶpiを以下のように定義する。ただし、文書Ｐｉ中に単語Ｒｕが存在しない場合には、重みＷiuには“０”を設定する。また、単語Ｒｕが存在する場合には、重みＷiuは“０”以上の実数値とする。

ただし、ここでは、単語の異なり総数をｍとしており、また、Ωiu（０≦Ωiu≦１）を文書Ｐｉに対する各単語Ｒｕの重みＷiuとして再定義する。そして、この場合における文書Ｐｉと文書Ｐｊの間の距離ｄ（Ｐｉ，Ｐｊ）｛（０≦ｄ（Ｐｉ，Ｐｊ）≦１）は、
ｄ（Ｐi，Ｐj）＝２（arccos（Ｖpi・Ｖpj））／π ……（１−４）
として、両者の文書ベクトルの角度として定義する。
【００１０】
（従来技術３）
上記のようにして求められた文書間距離に基づき、クラスター分析の手法を用いると、文書の分類が可能となる。クラスター分析の手法については、例えば、文献「田中，垂水，脇本，“統計解析ハンドブックＩＩ多変量解析編”，第２２６頁〜第２５７頁，共立出版（１９８４）」が参照できる。クラスター分析の手法は、よく知られた技術であるのでここでの説明は省略する。
【００１１】
【発明が解決しようとする課題】
ところで、上述した従来の技術による文書分類システムにおいては、更に、解決すべき課題として、次のような問題がある。すなわち、（従来技術１）や（従来技術２）による文書分類システムにおいて、機械的に得られる文書間距離は、文書の意味内容を深く勘案した上で設定されるものではない。したがって、このような文書間距離に基づいた文書分類は、文書の意味内容が充分に反映されたものであるとは言い難い。このため、ユーザにとって、大量の電子文書からは所望の文書を探し出すことが困難な状況にあることにかわりはない。
【００１２】
本発明は、このような問題点を解決するためになされたものであり、本発明の目的は、ハイパーテキストのような複雑にリンク付けされた多数の文書を適切に分類することができる文書分類装置を提供することにある。
【００１３】
【課題を解決するための手段】
上記のような目的を達成するため、本発明による文書分類装置は、電子化された複数の文書を格納する文書格納手段（１１）と、前記文書格納手段に格納された複数の文書の間のリンク関係を格納するリンク関係格納手段（１２）と、前記文書格納手段に格納された各文書に含まれる単語の出現頻度から文書間距離を計算する距離計算手段（１３）と、前記リンク関係格納手段に格納されたリンク関係と前記距離計算手段から得られる文書間距離を基にして、クラスター分析を行い、前記文書格納手段に格納された複数の文書を分類する文書分類手段（１４）と、文書分類手段による分類された結果を出力する出力手段（１５）とを有することを特徴とする。
【００１４】
このような特徴を有する文書分類装置においては、文書格納手段（１１）が、電子化された複数の文書を格納しており、リンク関係格納手段（１２）が、文書格納手段に格納された複数の文書の間のリンク関係を格納している。距離計算手段（１３）が、文書格納手段に格納された各文書に含まれる単語の出現頻度から文書間距離を計算すると、文書分類手段（１４）が、リンク関係格納手段に格納されたリンク関係と前記距離計算手段から得られる文書間距離を基にして、クラスター分析を行い、前記文書格納手段に格納された複数の文書を分類する。そして、出力手段（１５）により、文書分類手段による分類された結果を出力する。
【００１５】
このようにして、本発明の文書分類装置では、ハイパーテキストの形態をとる文書をクラスター分析の手法を用いて分類する際に、文書に記述されたリンク情報を利用する。文書間のリンク関係は、基本的に文書の作成者が自分の作成した文書と意味的に近い（距離が小さい）文書に対して設定されているので、リンク関係情報と、文書間距離の双方を用いてクラスター分析を行う。これにより、文書の作成者の意志を反映した文書分類、つまりは、文書の意味内容に沿った文書分類が実現できる。
【００１６】
【発明の実施の形態】
以下、本発明を実施する場合の一形態について図面を参照して具体的に説明する。図１は、本発明の一実施例の文書分類装置の要部の構成を示すブロック図である。図１において、１１は文書格納部、１２はリンク関係格納部、１３は距離計算処理部、１４は文書分類処理部、１５は出力処理部である。
【００１７】
本実施例の文書分類装置においては、文書格納部１１が、電子化された大量の文書を格納しており、ここに格納された各々の文書に対応して、リンク関係格納部１２が、各々の文書の間のリンク関係情報（参照する文書の存在位置とその文書識別子）を格納している。距離計算処理部１３は、文書格納部１１に格納された各文書を解析し、その文書に含まれる単語の出現頻度から文書間距離を計算する。この文書間距離の計算は、例えば、前述した（従来技術２）の文書分類システムと同様な手法（アルゴリズム）により計算する。
【００１８】
このようにして文書間距離が計算されると、文書分類処理部１４では、リンク関係格納部１２に格納されたリンク関係情報と距離計算処理部１３から得られた文書間距離を基にして、クラスター分析を行う。そして、文書格納部１１に格納された複数の文書を分類する。分類された結果は、出力処理部１５によるグラフィカルユーザインタフェースを介して、見やすい表示形態でユーザに対して表示出力される。これにより、例えば、クラスター分析の結果に応じて、大量の文書の中から同じグループに属する文書のみが表示されるので、ユーザは所望する文書を探しやすくなる。
【００１９】
図２は、本発明の別の実施例である広域ネットワークに結合された文書分類システムの要部の構成を示すブロック図である。図２において、２０は広域ネットワーク、２１は文書取得処理部、２２は文書格納部、２３はリンク関係格納部、２４は自立語抽出処理部、２５は単語重み設定処理部、２６は文書間距離計算処理部、２７は文書分類処理部、２８は出力処理部である。図２に示す文書分類システムでは、広域ネットワーク２０上に分散して存在するハイパーテキストの形態の文書に対して、これらの文書を取得し、その文書中に埋め込まれたリンク情報から、これらの電子文書の内容を対象として文書分類を行う。
【００２０】
広域ネットワーク２０は、例えば、複数のネットワークシステムが互いに結合されたインターネットであり、文書取得処理部２１は、広域ネットワーク２０にアクセス可能に存在する大量の文書を取得するプログラムモジュールにより構成される。このプログラムモジュールは、広域ネットワーク２０に接続されているコンピュータシステム上に格納されている電子文書の１つを指定すると、「指定された電子文書の内容を取得し、この電子文書中に埋め込まれた他の文書を指示するリンク情報を同定し、リンク情報が指示する他の文書を取得する操作」を再帰的に繰り返す処理を実行し、広域ネットワーク２０に接続された複数のコンピュータシステム上に分散して存在する電子文書を取得する。
【００２１】
文書取得処理部２１により取得された大量の文書は、文書格納部２２に格納される。この場合、文書格納部２２では、文書取得処理部２１が取得した文書をその文書を特定するリンク情報と対にして格納する。また、リンク関係格納部２３において、文書格納部２２に格納されている各々の文書間のリンク関係の有無を格納する。
【００２２】
自立語抽出処理部２４は、文書格納部２２に格納されている文書から形態素解析アルゴリズムを用いて自立語（単語）を抽出する。これにより、文書から単語が切り出される。単語重み設定処理部２５は、自立語抽出処理部２４による抽出結果を基にして、各文書毎に全ての自立語に対して重み（重要度）を設定する。そして、文書間距離計算処理部２６において、単語重み設定処理部２５によって設定された重みを基にして、文書格納部２２に格納されている文書の全ての２つの項目の間の距離を計算する。
【００２３】
このようにして、文書間の距離が計算されると、文書分類処理部２７では、リンク関係格納部２３に格納されているリンク関係の有無と、文書間距離計算処理部２６によって計算された文書間距離に基づいて、文書をクラスター分析により分類する。分類された結果は、出力処理部２８により、その文書分類処理部２７の分類結果が表示される。出力処理部２８は、ユーザに対して、グラフィカルユーザインターフェイスを利用して見やすい表示形態により、例えば、同じグループに属する文書がまとめられて、その文書分類結果として出力表示される。
【００２４】
一般的にハイパーテキストの形態をとる電子文書では、文書の内容部分とリンク情報（他の文書のネットワーク上の存在位置および文書識別子）とを区別するため、リンク情報には、リンク情報であることを示すタグ付けがなされている。このため、文書中からタグと一致する文字列を検出することにより、文書取得処理部２１では、文書中からリンク情報を同定する。
【００２５】
図３は、文書取得処理部２１の文書取得処理のアルゴリズムを示すフローチャートである。図３に示すフローチャートを参照して、文書取得処理部の動作を説明する。広域ネットワーク上の１つの文書のリンク情報を初期条件として指定して、文書取得処理を起動すると、ここでの処理が開始され、まず、ステップ３１において、初期条件としてリンク情報（ネットワーク上の存在位置および文書識別子）が指定された文書を文書Ｄとし、次のステップ３２において、リストＳの先頭に文書Ｄのリンク情報を加え、リストＳの先頭をカレントのリスト位置Ｐとする。次に、次のステップ３３において、リストＳのリスト位置Ｐに対応するリンク情報が存在するか否かを判定する。この判定で、リンク情報が存在しない場合は、ここでのリスト操作による文書取得処理が終了したことなので、処理を終了する。
【００２６】
また、ステップ３３の判定処理で、リンク情報が存在する場合は、次のステップ３４に進み、リンク情報を基にして、各リンク情報に対応する文書Ｄの文書内容を取得する。次に、ステップ３５において、文書Ｄのリンク情報とその文書内容とを対にして、文書格納部２２に格納する（図４）。そして、次のステップ３６において、文書Ｄの文書中に記述されているリンク情報（Ｄ１，Ｄ２，…，Ｄｎ）を全て同定する。
【００２７】
次に、ステップ３７において、リンク情報（Ｄ１，Ｄ２，…，Ｄｎ）のうち、リストＳ中に存在しないリンク情報があれば、リストＳに連接する。次にステップ３８において、文書Ｄと各リンク情報（Ｄ１，Ｄ２，…，Ｄｎ）との間の２項間にリンク関係が存在することをリンク情報格納部２３に格納する。そして、次の文書に対する処理のため、ステップ３９において、カレントのリスト位置ＰをリストＳ中のリスト位置Ｐの次の位置とし、ステップ３３に戻る。ステップ３３においては、前述のように、リストＳのリスト位置Ｐに対応するリンク情報が存在するか否かを判定し、この判定処理で、リンク情報が存在する場合には、ステップ３４からの処理を繰り返し、また、リンク情報が存在しない場合は、ここでのリスト操作による文書取得処理が終了したことなので、処理を終了する。
【００２８】
このようにして、文書取得処理部２１の処理によって、文書中でリンク付けされている他の文書が再帰的に取得される。この結果、得られた各文書の内容はその文書のリンク情報と共に文書格納部２２に格納される。また、各文書間のリンク関係の情報は、リンク関係格納部２３に格納される。
【００２９】
図４は、文書格納部２２に格納される文書内容とリンク情報の関係を説明する図である。図４に示すように、文書格納部には、取得された文書の文書内容４２とリンク情報（Ｄ１，Ｄ２，…，Ｄｎ）４１とが対応づけて格納される。
【００３０】
図５は、リンク関係格納部２３に格納されるリンク関係の情報を説明する図である。図５に示すように、リンク関係格納処理部２３には、リンク関係が２次元マトリックスの表の形式で格納される。表中の行見出しおよび列見出しは、文書格納部２２に格納されたリンク情報（Ｄ１，Ｄ２，…，Ｄｎ）に対応し、リンク情報によって特定される文書間にリンク関係がある場合を○印で表記し、リンク関係がない場合を×印で表記している。
【００３１】
前述したように、自立語抽出処理部２４は、文書格納部２２に格納された各文書内容から公知の形態素解析アルゴリズムを用いて単語を切り出し、各文書内容の中の自立語を抽出する。ここで抽出した自立語に対して、単語重み設定処理部２５が、各文書の文書内容の中に含まれる自立語に対して“１”を設定し、文書内容の中に含まれない自立語に対して“０”を設定する。
【００３２】
図６は、単語重み設定処理部２５による重み付け結果の一例を示す図である。前述したように、ここでの文書の各文書内容は、リンク情報（Ｄ１，Ｄ２，…，Ｄｎ）により対応づけられているので、図６に示すように、各文書内容に含まれている自立語（ＷＯＲＤ１，ＷＯＲＤ２，ＷＯＲＤ３，…，ＷＯＲＤｎ）に対して、当該各文書の文書内容の中に含まれる自立語には“１”を設定し、文書内容の中に含まれない自立語は“０”を設定するが、これらは、リンク情報（Ｄ１，Ｄ２，…，Ｄｎ）により各文書内容と対応付けられる。
【００３３】
文書間距離計算処理部２６は、前述した式（１−１）〜式（１−４）に基づいて、文書格納処理部２２に格納された文書の全ての２項間について、その間の距離を計算する。計算された各文書の文書間距離は、各文書内容と対応づけられているリンク情報（Ｄ１，Ｄ２，…，Ｄｎ）の間の距離として格納される。図７は、文書間距離計算処理部２６による文書間距離の計算結果の一例を示している。
【００３４】
このようにして、リンク情報により取得された各文書の文書間距離が算出されると、文書分類処理部２７において、リンク関係の情報と、算出した文書間距離に基づいて、文書分類処理部２７は、初期文書クラスターを生成し、文書間距離に基づいたクラスター分析を行い、文書格納部２２に格納された各文書を分類する。
【００３５】
図８は、文書分類処理部２７による文書分類処理のアルゴリズムを示すフローチャートである。図８を参照して、ここで文書分類処理を説明する。文書分類処理においては、処理を開始すると、ステップ８１において、初期文書クラスターの作成処理を行う。すなわち、リンク関係格納部２３のリンク関係の有無と、文書間距離計算部２６の計算結果を参照し、リンク関係があり、かつ、文書間距離が所定の定数Ｋ（０≦Ｋ≦１）以下である文書の対を１つのクラスターとする。この場合、３つ以上の文書が、この条件を満たして連なる場合には、それらをまとめて１つのクラスターとする。
【００３６】
次に、ステップ８２に進み、得られた前クラスターと、クラスターに属さない全文書の２項間距離を再計算する。次に、ステップ８３において、得られた２項間距離のうち最も小さい値となる２つのクラスターあるいは文書を１つのクラスターとする。そして、次のステップ８４において、クラスター数および文書数の合計値が、所定数Ｎ（１≦Ｎ≦ｎ：文書総数ｎ）以下であるか否かを判定し、合計値が所定数Ｎ以下でない場合、未だ分類されていない文書が存在するので、この場合には、ステップ８２に戻り、ステップ８２およびステップ８３のクラスター分析よる分類処理を繰り返し行う。この結果、ステップ８４の判定処理で、クラスター数および文書数の合計値が所定数Ｎ以下であることが確認できると、ここで文書の分類が終了したので、一連の処理を終了する。そして、次に説明するように、分類した結果を出力処理部２８により表示する。
【００３７】
なお、このステップ８２の処理において、クラスターとクラスターに属さない文書の間の文書間距離の再計算を行うが、この場合の文書と文書との間の文書間距離計算は、前述したように、式（１−１）〜式（１−４）により行う。また、クラスターＣと文書Ｄの間の距離計算は、クラスターＣに属する全ての文書と文書Ｄの距離計算を式（１−１）〜式（１−４）によって行い、その平均値を距離とする。クラスターＣ１とクラスターＣ２の間では、クラスターＣ１とクラスターＣ２に属する各文書の距離計算を行い、その平均値を距離とする。
【００３８】
文書分類処理部２７による文書分類アルゴリズムは、一般のクラスター分析の初期クラスターの設定に文書間距離とリンク関係を併用するものである。すなわち、リンク関係があり、かつ、文書間距離が近い文書をまとめて、初期クラスターとし、更に、文書間距離とリンク関係を併用することにより、意味的関係の深いリンク関係を選択的に利用することが可能となる。また、リンク関係を用いることにより、従来の文書間距離情報のみに基づくクラスター分析と比較して、より信頼性の高い分類が可能となる。これにより、文書の意味内容をより反映したクラスター解析（分類）が可能となる。
【００３９】
具体例で説明すると、前述した図４，図５，図６，および図７の数値例の場合には、Ｋ＝０．６とした場合、文書間距離が最も近いものは、文書Ｄ１と文書Ｄ４との距離“０．０９”であり、次に近い文書間距離は文書Ｄ４と文書Ｄ５との距離“０．１２”であり、その次に近い文書間距離は文書Ｄ２と文書Ｄ３との距離“０．２７”であることから、初期クラスターは（Ｄ１，Ｄ４，Ｄ５）および（Ｄ２，Ｄ３）となる。
【００４０】
次に、出力処理部２８の処理について説明する。前述したように、出力処理部２８は、ユーザに対して、グラフィカルユーザインターフェイスを利用して見やすい表示形態により、例えば、同じグループに属する文書がまとめられて、その文書分類結果として出力表示する。このような出力処理部による表示形態を、具体的な操作例を例示して説明する。図９〜図１３は、ユーザが、ここでの文書分類装置に組み込まれている文書検索装置を起動して、論文検索を行い、更に文書分類を行う場合の操作画面の一連の状態の変化を示している。ここでの文書検索装置を起動すると、図９に示すように、文献検索ウィンドウ画面９０が表示される。この文献検索ウィンドウ画面９０には、検索操作ガイドと共に、検索キーワード入力のためのキーワード入力フィールド９１が設けられている。
【００４１】
この文献検索ウィンドウ画面９０において、例えば、ユーザが論文検索のためのキーワードとして、図１０に示すように、「人工頭脳」，「定性推論」，および「免疫ネットワーク」のキーワードを入力する操作を行うと、文献検索ウィンドウ画面９０は、キーワード入力フィールド９１に検索キーワードが入力された状態となり、この状態において、検索ボタン９２をポインタカーソル９３によりクリックすると、検索処理が開始されて、その検索結果が、検索結果表示フィールド９４に表示される。その結果、図１１に示すように、検索結果表示フィールド９４には、例えば、ヒットした文献の３件の文書のタイトルが表示される。
【００４２】
次に、ユーザが、検索された文書と関連の深い文書を更に表示させるため、本実施例にかかる文書分類装置を起動する。このため、図１２に示すように、検索結果表示フィールド９４に表示された文書の内の１つの文書９５をポインタカーソル９３の操作により指定して（反転表示させて）、図１３に示すように、関連文献表示ボタン９６を操作すると、つまり、マウス操作でポインタカーソル９３によりクリックすると、本実施例にかかる文書分類装置が起動される。そして、指定された文書から、その中に埋め込まれたリンク情報により関連のある文書を取得し、その文書間距離に基づくクラスター分析による文書分類処理を実行し、同じグループに属する文書を関連文書表示フィールド９７に表示する。このようして、ユーザは、文献検索を行う場合に、関連のある文書まで含めて効率よく検索することとができる。
【００４３】
【発明の効果】
以上、説明したように、本発明の文書分類装置によれば、ハイパーテキストの形態をとる文書をクラスター分析する際に、文書に記述されたリンク情報を利用することにより、文書の作成者の意志を反映した文書分類を行うことができる。つまり、文書の意味内容に沿った文書分類ができるようになる。
【図面の簡単な説明】
【図１】図１は本発明の一実施例の文書分類装置の要部の構成を示すブロック図、
【図２】図２は本発明の別の実施例である広域ネットワークに結合された文書分類システムの要部の構成を示すブロック図、
【図３】図３は文書取得処理部２１の文書取得処理のアルゴリズムを示すフローチャート、
【図４】図４は文書格納部２２に格納される文書内容とリンク情報の関係を説明する図、
【図５】図５はリンク関係格納部２３に格納されるリンク関係の情報を説明する図、
【図６】図６は単語重み設定処理部２５による重み付け結果の一例を示す図、
【図７】図７は文書間距離計算処理部２６による文書間距離の計算結果の一例を示す図、
【図８】図８は文書分類処理部２７による文書分類処理のアルゴリズムを示すフローチャート、
【図９】図９は論文検索を行い更に文書分類を行う場合の操作画面の一連の状態の変化の第１の状態を示す図、
【図１０】図１０は論文検索を行い更に文書分類を行う場合の操作画面の一連の状態の変化の第２の状態を示す図、
【図１１】図１１は論文検索を行い更に文書分類を行う場合の操作画面の一連の状態の変化の第３の状態を示す図、
【図１２】図１２は論文検索を行い更に文書分類を行う場合の操作画面の一連の状態の変化の第４の状態を示す図、
【図１３】図１３は論文検索を行い更に文書分類を行う場合の操作画面の一連の状態の変化の第５の状態を示す図である。
【符号の説明】
１１…文書格納部、１２…リンク関係格納部、１３…距離計算処理部、１４…文書分類処理部、１５…出力処理部、２０…広域ネットワーク、２１…文書取得処理部、２２…文書格納部、２３…リンク関係格納部、２４…自立語抽出処理部、２５…単語重み設定処理部、２６…文書間距離計算処理部、２７…文書分類処理部、２８…出力処理部。
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a document classification apparatus that classifies a large number of electronic documents existing on a network system, and more particularly to a document classification apparatus that classifies a large number of documents connected by complex links, such as hypertext.
[0002]
[Prior art]
Today, with the spread of the Internet, electronic documents on computer systems at physically remote locations can be easily accessed via a network. Link information for referring to other electronic documents can be embedded in such an electronic document, and by following that link information a reader can easily reach other electronic documents related to the document. An electronic document with embedded link information of this kind is generally called hypertext.
[0003]
In a network system such as the Internet, as the number of accessible electronic documents grows very large, it is becoming difficult to find a desired document among them by following link information alone.
[0004]
As one way of solving this problem, a growing number of systems provide search services for electronic documents published on the Internet. These search systems can perform a keyword search across a large number of documents at once. That is, they make such a collective keyword search possible by exploring the electronic documents published on the Internet as exhaustively as possible in advance and acquiring the contents of each document.
[0005]
Furthermore, some of these search systems improve search efficiency further by classifying each document into one of several categories according to its content. A user of such a system can center a keyword search on the category that seems likely to contain the desired document, so an improvement in search efficiency can be expected.
[0006]
Documents can be classified either manually or automatically by calculation based on the distance between documents. When classifying a large number of documents, the latter method is advantageous in terms of efficiency.
[0007]
(Prior art 1)
As a method for classifying such documents, there is a method of weighting words based on the appearance frequency of each word in a document, as discussed in, for example, Luhn, H. P., 'A statistical approach to mechanised encoding and searching of library information', IBM Journal of Research and Development, 1, 309-17 (1957). In this case, words with particularly high weights can be regarded as keywords representing the document.
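As a rough illustration of frequency-based word weighting, the sketch below weights each word by its relative frequency in a text. The whitespace tokenizer, the function name `term_weights`, and the choice of relative frequency as the weight are assumptions made for illustration; they are not taken from the cited reference or from this patent.

```python
from collections import Counter


def term_weights(text: str) -> dict[str, float]:
    """Weight each word by its relative frequency in the text (illustrative only)."""
    words = text.lower().split()  # naive tokenizer, assumed for the example
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


weights = term_weights(
    "cluster analysis groups documents and cluster analysis needs distances"
)
# The highest-weight word can be treated as a keyword for the text.
top = max(weights, key=weights.get)
```

For Japanese text, a morphological analyzer would be needed in place of `split()`, as the embodiment described later notes.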
[0008]
(Prior art 2)
A method for obtaining the distance between documents from word weights has been proposed in, for example, Salton, G. and McGill, N. J., 'Introduction to modern information retrieval', New York, McGraw-Hill (1983), and has been adopted in several document classification systems.
[0009]
In such a document classification system, assuming that the weight Wiu of each word Ru is set for the document Pi, the document vector Vpi of the document Pi is defined as follows. However, when the word Ru does not exist in the document Pi, the weight Wiu is set to “0”. If the word Ru exists, the weight Wiu is a real value equal to or greater than “0”.

Here, m is the total number of distinct words, and Ωiu (0 ≦ Ωiu ≦ 1) is redefined as the weight Wiu of each word Ru for the document Pi. In this case, the distance d(Pi, Pj) between the document Pi and the document Pj, where 0 ≦ d(Pi, Pj) ≦ 1, is
d(Pi, Pj) = 2 arccos(Vpi · Vpj) / π ... (1-4)
which defines the distance as the angle between the two document vectors, scaled by 2/π.
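Equation (1-4) can be sketched in a few lines of Python. Equations (1-1) to (1-3) are not reproduced in this text, so the sketch assumes that the document vectors are the weight vectors scaled to unit length, which is what makes the dot product a cosine; the function names `normalize` and `angular_distance` are hypothetical.

```python
import math


def normalize(weights):
    """Scale a weight vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm > 0 else list(weights)


def angular_distance(w_i, w_j):
    """d(Pi, Pj) = 2 * arccos(Vpi . Vpj) / pi, computed on unit-length vectors."""
    v_i, v_j = normalize(w_i), normalize(w_j)
    dot = sum(a * b for a, b in zip(v_i, v_j))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding outside the arccos domain
    return 2.0 * math.acos(dot) / math.pi
```

With non-negative weights the dot product lies between 0 and 1, so the distance is 0 for vectors pointing in the same direction and 1 for documents that share no words.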
[0010]
(Prior art 3)
Based on the inter-document distances obtained as described above, documents can be classified by using a cluster analysis technique. For cluster analysis techniques, see, for example, Tanaka, Tarumi, and Wakimoto, “Statistical Analysis Handbook II: Multivariate Analysis”, pp. 226-257, Kyoritsu Shuppan (1984). Since cluster analysis is a well-known technique, its description is omitted here.
[0011]
[Problems to be solved by the invention]
The conventional document classification systems described above, however, leave the following problem to be solved. In the document classification systems of (Prior Art 1) and (Prior Art 2), the mechanically obtained inter-document distance is not set with deep consideration of the semantic content of the documents. It is therefore hard to say that document classification based on such a distance sufficiently reflects the semantic content of the documents. As a result, it remains difficult for a user to find a desired document among a large number of electronic documents.
[0012]
The present invention has been made to solve this problem, and an object of the present invention is to provide a document classification apparatus that can appropriately classify a large number of documents connected by complex links, such as hypertext.
[0013]
[Means for Solving the Problems]
In order to achieve the above object, a document classification apparatus according to the present invention comprises: document storage means (11) for storing a plurality of digitized documents; link relation storage means (12) for storing link relations among the plurality of documents stored in the document storage means; distance calculation means (13) for calculating inter-document distances from the appearance frequency of words included in each document stored in the document storage means; document classification means (14) for performing cluster analysis based on the link relations stored in the link relation storage means and the inter-document distances obtained from the distance calculation means, and classifying the plurality of documents stored in the document storage means; and output means (15) for outputting the result of classification by the document classification means.
[0014]
In a document classification apparatus having these features, the document storage means (11) stores a plurality of digitized documents, and the link relation storage means (12) stores the link relations among the plurality of documents stored in the document storage means. When the distance calculation means (13) calculates the inter-document distances from the appearance frequency of words included in each document stored in the document storage means, the document classification means (14) performs cluster analysis based on the link relations stored in the link relation storage means and the inter-document distances obtained from the distance calculation means, and classifies the plurality of documents stored in the document storage means. The output means (15) then outputs the result of classification by the document classification means.
[0015]
In this way, the document classification apparatus of the present invention uses the link information described in documents when classifying documents in hypertext form by a cluster analysis technique. Because a document's creator generally sets links to documents that are semantically close (at a small distance) to the creator's own document, cluster analysis is performed using both the link relation information and the inter-document distances. This realizes document classification that reflects the intent of the document creators, that is, classification in accordance with the semantic content of the documents.
[0016]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, an embodiment for carrying out the present invention will be specifically described with reference to the drawings. FIG. 1 is a block diagram showing a configuration of a main part of a document classification apparatus according to an embodiment of the present invention. In FIG. 1, 11 is a document storage unit, 12 is a link relationship storage unit, 13 is a distance calculation processing unit, 14 is a document classification processing unit, and 15 is an output processing unit.
[0017]
In the document classification apparatus of this embodiment, the document storage unit 11 stores a large number of digitized documents, and the link relationship storage unit 12 stores, for each document stored there, the link relationship information between documents (the location of each referenced document and its document identifier). The distance calculation processing unit 13 analyzes each document stored in the document storage unit 11 and calculates inter-document distances from the appearance frequency of the words in the documents. The inter-document distance is calculated by, for example, the same method (algorithm) as the document classification system of (Prior Art 2) described above.
[0018]
When the inter-document distances have been calculated in this manner, the document classification processing unit 14 performs cluster analysis based on the link relationship information stored in the link relationship storage unit 12 and the inter-document distances obtained from the distance calculation processing unit 13, and classifies the plurality of documents stored in the document storage unit 11. The classification result is displayed to the user in an easy-to-view form via the graphical user interface of the output processing unit 15. For example, only the documents belonging to the same group are displayed out of a large number of documents according to the result of the cluster analysis, so that the user can find a desired document more easily.
[0019]
FIG. 2 is a block diagram showing the configuration of the main part of a document classification system coupled to a wide area network according to another embodiment of the present invention. In FIG. 2, 20 is a wide area network, 21 is a document acquisition processing unit, 22 is a document storage unit, 23 is a link relationship storage unit, 24 is an independent word extraction processing unit, 25 is a word weight setting processing unit, 26 is an inter-document distance calculation processing unit, 27 is a document classification processing unit, and 28 is an output processing unit. The document classification system shown in FIG. 2 acquires hypertext documents distributed over the wide area network 20 and classifies them using the link information embedded in the documents together with the contents of the documents.
[0020]
The wide area network 20 is, for example, the Internet, in which a plurality of network systems are coupled to one another, and the document acquisition processing unit 21 is a program module that acquires a large number of documents accessible over the wide area network 20. When one of the electronic documents stored on a computer system connected to the wide area network 20 is specified, this program module recursively repeats the operation of “acquiring the content of the specified electronic document, identifying the link information embedded in it that points to other documents, and acquiring the other documents that the link information points to”, and thereby acquires electronic documents distributed over a plurality of computer systems connected to the wide area network 20.
[0021]
A large amount of documents acquired by the document acquisition processing unit 21 is stored in the document storage unit 22. In this case, the document storage unit 22 stores the document acquired by the document acquisition processing unit 21 in a pair with link information for specifying the document. The link relationship storage unit 23 stores the presence / absence of a link relationship between the documents stored in the document storage unit 22.
[0022]
The independent word extraction processing unit 24 extracts independent words (words) from the documents stored in the document storage unit 22 using a morphological analysis algorithm, thereby segmenting the documents into words. The word weight setting processing unit 25 sets a weight (importance) for every independent word in each document based on the extraction result of the independent word extraction processing unit 24. The inter-document distance calculation processing unit 26 then calculates the distance between every pair of documents stored in the document storage unit 22 based on the weights set by the word weight setting processing unit 25.
[0023]
When the inter-document distances have been calculated in this way, the document classification processing unit 27 classifies the documents by cluster analysis based on the presence or absence of link relationships stored in the link relationship storage unit 23 and the inter-document distances calculated by the inter-document distance calculation processing unit 26. The classification result of the document classification processing unit 27 is displayed by the output processing unit 28, which uses a graphical user interface to present the result to the user in an easy-to-view form, for example with documents belonging to the same group gathered together.
[0024]
In an electronic document in hypertext form, link information (the network location and document identifier of another document) is generally tagged to mark it as link information, so that it can be distinguished from the content portion of the document. The document acquisition processing unit 21 therefore identifies the link information in a document by detecting character strings that match the tag.
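Tag-based link identification might look like the following minimal sketch. The patent does not name a markup language, so the use of HTML `<a href="...">` anchors is an assumption, and the names `LinkExtractor` and `extract_links` are hypothetical. The parser is the one from Python's standard library.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag encountered (HTML is assumed)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return the link targets found in a document, in order of appearance."""
    parser = LinkExtractor()
    parser.feed(html_text)
    return parser.links
```

Calling `extract_links` on a page body returns the list of link targets, which plays the role of the link information identified in the text.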
[0025]
FIG. 3 is a flowchart showing the document acquisition algorithm of the document acquisition processing unit 21. The operation of the document acquisition processing unit is described with reference to this flowchart. The process starts when document acquisition is invoked with the link information of one document on the wide area network specified as the initial condition. First, in step 31, the document whose link information (network location and document identifier) was specified as the initial condition is taken as document D. In the next step 32, the link information of document D is added to the head of a list S, and the head of list S is made the current list position P. In step 33, it is determined whether link information exists at list position P of list S. If no link information exists, document acquisition by this list operation is complete, and the process ends.
[0026]
If link information exists in the determination of step 33, the process proceeds to step 34, where the content of the document D corresponding to that link information is acquired based on the link information. Next, in step 35, the link information of document D and its content are stored as a pair in the document storage unit 22 (FIG. 4). Then, in step 36, all link information (D1, D2, ..., Dn) described in document D is identified.
[0027]
Next, in step 37, any of the link information (D1, D2, ..., Dn) that is not already in list S is appended to list S. In step 38, the fact that a link relationship exists between document D and each of the link information items (D1, D2, ..., Dn) is stored in the link relationship storage unit 23. Then, to process the next document, in step 39 the current list position P is advanced to the next position in list S, and the process returns to step 33. In step 33, as described above, it is determined whether link information exists at list position P of list S. If it does, the processing from step 34 is repeated; if it does not, document acquisition by this list operation is complete, and the process ends.
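Steps 31 to 39 can be sketched as a loop over a growing list. In this sketch the `fetch` and `extract_links` callables, the in-memory `WEB` dictionary standing in for the network, and all variable names are assumptions for illustration; a real implementation would retrieve documents over the network and would need error handling that the flowchart does not describe.

```python
def acquire_documents(start, fetch, extract_links):
    """Follow links outward from `start`, mirroring steps 31-39 of the flowchart."""
    pending = [start]      # list S, seeded with the initial link information (step 32)
    position = 0           # current list position P
    store = {}             # document storage: link information -> document content
    link_pairs = set()     # link relationship storage: (source, target) pairs
    while position < len(pending):            # step 33: is there an entry at P?
        link = pending[position]
        content = fetch(link)                 # step 34: acquire the document content
        store[link] = content                 # step 35: store link/content as a pair
        for target in extract_links(content): # step 36: identify embedded links
            if target not in pending:         # step 37: append links not yet in S
                pending.append(target)
            link_pairs.add((link, target))    # step 38: record the link relationship
        position += 1                         # step 39: advance P
    return store, link_pairs


# Hypothetical in-memory "network": each document body simply lists its links.
WEB = {"D1": "D2 D3", "D2": "D1", "D3": ""}
store, link_pairs = acquire_documents("D1", WEB.__getitem__, str.split)
```

Because every link is checked against list S before being appended, each document is fetched once even when documents link back to each other, as D2 does to D1 here.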
[0028]
In this way, other documents linked in the document are recursively acquired by the processing of the document acquisition processing unit 21. As a result, the content of each obtained document is stored in the document storage unit 22 together with link information of the document. Further, the link relationship information between the documents is stored in the link relationship storage unit 23.
[0029]
FIG. 4 is a diagram for explaining the relationship between the document content stored in the document storage unit 22 and the link information. As shown in FIG. 4, the document storage unit stores the document content 42 of the acquired document and link information (D1, D2,..., Dn) 41 in association with each other.
[0030]
FIG. 5 is a diagram explaining the link relationship information stored in the link relationship storage unit 23. As shown in FIG. 5, the link relationship storage unit 23 stores the link relationships in the form of a two-dimensional matrix table. The row and column headings of the table correspond to the link information (D1, D2, ..., Dn) stored in the document storage unit 22. A circle (○) indicates that a link relationship exists between the documents specified by the link information, and a cross (×) indicates that none exists.
[0031]
As described above, the independent word extraction processing unit 24 extracts words from each document content stored in the document storage unit 22 using a known morphological analysis algorithm, and extracts the independent words in each document content. For the independent word extracted here, the word weight setting processing unit 25 sets “1” for the independent word included in the document content of each document, and the independent word that is not included in the document content. Is set to “0”.
[0032]
FIG. 6 is a diagram showing an example of the weighting result of the word weight setting processing unit 25. As described above, each document's content is associated with its link information (D1, D2, ..., Dn). As shown in FIG. 6, for the independent words (WORD1, WORD2, WORD3, ..., WORDn) found in the document contents, “1” is set for an independent word that is included in the content of a given document and “0” for one that is not, and these values are associated with each document's content through the link information (D1, D2, ..., Dn).
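The binary weighting table can be sketched as follows. The input format (a mapping from document identifier to the set of independent words already extracted from it), the function name `binary_weights`, and the sample words are assumptions for illustration; the values of FIG. 6 are not reproduced in this text.

```python
def binary_weights(doc_words):
    """Map each document id to a 0/1 vector over the sorted vocabulary."""
    vocabulary = sorted(set().union(*doc_words.values()))
    table = {
        doc_id: [1 if word in words else 0 for word in vocabulary]
        for doc_id, words in doc_words.items()
    }
    return vocabulary, table


# Hypothetical extraction results for two documents.
docs = {"D1": {"network", "immune"}, "D2": {"network", "reasoning"}}
vocabulary, table = binary_weights(docs)
```

Each row of `table` is a weight vector of the kind that the inter-document distance calculation takes as input.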
[0033]
The inter-document distance calculation processing unit 26 calculates the distance between every pair of documents stored in the document storage unit 22 based on equations (1-1) to (1-4) described above. Each calculated inter-document distance is stored as the distance between the link information items (D1, D2, ..., Dn) associated with the document contents. FIG. 7 shows an example of the inter-document distances calculated by the inter-document distance calculation processing unit 26.
[0034]
When the inter-document distances of the documents acquired through the link information have been calculated in this way, the document classification processing unit 27 generates initial document clusters based on the link relationship information and the calculated inter-document distances, performs cluster analysis based on the inter-document distances, and classifies the documents stored in the document storage unit 22.
[0035]
FIG. 8 is a flowchart showing the document classification algorithm of the document classification processing unit 27. The document classification process is described with reference to FIG. 8. When the process starts, initial document clusters are created in step 81. That is, referring to the presence or absence of link relationships in the link relationship storage unit 23 and the calculation results of the inter-document distance calculation processing unit 26, each pair of documents that has a link relationship and whose inter-document distance is no greater than a predetermined constant K (0 ≦ K ≦ 1) is made one cluster. When three or more documents are chained together by pairs satisfying this condition, they are combined into a single cluster.
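Step 81 amounts to finding connected groups among the document pairs that are both linked and within distance K. The sketch below does this with a small union-find structure, which is an implementation choice made here and is not stated in the patent. The three distances and K = 0.6 are the values quoted in paragraph 【００３９】 of the Japanese text; which pairs are linked is assumed for the example, links are treated as undirected, and the names are hypothetical.

```python
def initial_clusters(doc_ids, linked, distance, k):
    """Group documents chained by pairs that are linked and at distance <= k."""
    parent = {d: d for d in doc_ids}

    def find(d):
        # Follow parent pointers to the group representative, compressing the path.
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    for a, b in linked:
        if distance[frozenset((a, b))] <= k:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for d in doc_ids:
        groups.setdefault(find(d), []).append(d)
    # Documents with no qualifying partner stay outside the initial clusters.
    return [sorted(g) for g in groups.values() if len(g) > 1]


doc_ids = ["D1", "D2", "D3", "D4", "D5"]
linked = [("D1", "D4"), ("D4", "D5"), ("D2", "D3")]  # assumed link relationships
distance = {
    frozenset(("D1", "D4")): 0.09,
    frozenset(("D4", "D5")): 0.12,
    frozenset(("D2", "D3")): 0.27,
}
clusters = initial_clusters(doc_ids, linked, distance, k=0.6)
```

Under these assumptions the result is the two initial clusters (D1, D4, D5) and (D2, D3), matching the worked example in the original text.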
[0036]
Next, proceeding to step 82, the pairwise distances among the clusters obtained so far and all documents not belonging to any cluster are recalculated. Next, in step 83, the two clusters or documents having the smallest of the obtained pairwise distances are combined into one cluster. Then, in step 84, it is determined whether or not the sum of the number of clusters and the number of unclustered documents is equal to or less than a predetermined number N (1 ≦ N ≦ n, where n is the total number of documents). If the sum exceeds N, documents remain that have not yet been classified, so the process returns to step 82 and the classification by cluster analysis in steps 82 and 83 is repeated. When it is confirmed in step 84 that the sum of the number of clusters and the number of documents is equal to or less than the predetermined number N, the classification of the documents is complete and the series of processes is terminated. Then, as described below, the classification result is displayed by the output processing unit 28.
[0037]
In step 82, the distance between two documents that do not belong to any cluster is calculated by equations (1-1) to (1-4) as described above. The distance between a cluster C and a document D is obtained by calculating the distance between D and every document belonging to C by equations (1-1) to (1-4) and taking the average of these values. For two clusters C1 and C2, the distances between the documents belonging to C1 and the documents belonging to C2 are calculated, and their average is taken as the distance.
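The loop of steps 82 to 84, with the averaged distance between groups, can be sketched as follows. This is a minimal sketch, not the implementation of the embodiment: the function names `group_distance` and `merge_until`, the representation of an unclustered document as a one-element list, and the sample distances are hypothetical. The pairwise document distances are assumed to have been computed beforehand by equations (1-1) to (1-4).

```python
from itertools import combinations


def group_distance(g1, g2, dist):
    """Average of the distances between every document in g1 and every document in g2."""
    pairs = [(a, b) for a in g1 for b in g2]
    return sum(dist[frozenset(p)] for p in pairs) / len(pairs)


def merge_until(groups, dist, n):
    """Merge the two closest groups repeatedly until at most n groups remain.

    groups: list of lists; an unclustered document is a one-element list
    dist:   {frozenset({a, b}): inter-document distance} for every document pair
    n:      predetermined number N (1 <= n <= total number of documents)
    """
    groups = [list(g) for g in groups]
    while len(groups) > n:  # step 84: continue while more than N groups remain
        # Step 82: recompute the distances between all current groups.
        i, j = min(
            combinations(range(len(groups)), 2),
            key=lambda ij: group_distance(groups[ij[0]], groups[ij[1]], dist),
        )
        # Step 83: combine the closest pair into one cluster.
        merged = groups[i] + groups[j]
        groups = [g for idx, g in enumerate(groups) if idx not in (i, j)]
        groups.append(merged)
    return groups


# Hypothetical sample data for illustration only.
dist = {
    frozenset(("A", "B")): 0.1,
    frozenset(("C", "D")): 0.2,
    frozenset(("A", "C")): 0.8,
    frozenset(("A", "D")): 0.9,
    frozenset(("B", "C")): 0.7,
    frozenset(("B", "D")): 0.8,
}
result = merge_until([["A"], ["B"], ["C"], ["D"]], dist, n=2)
```

In practice, the list passed to `merge_until` would contain the initial clusters from step 81 together with the remaining single documents.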
[0038]
The document classification algorithm of the document classification processing unit 27 uses both the inter-document distance and the link relation for the initial cluster setting of general cluster analysis. That is, documents that have a link relation and a short inter-document distance are gathered to form an initial cluster. Using the inter-document distance together with the link relation makes it possible to selectively use only those link relations that reflect a deep semantic relation. Further, by using the link relation, classification with higher reliability is possible compared with conventional cluster analysis based only on inter-document distance information. This enables cluster analysis (classification) that better reflects the semantic content of the documents.
[0039]
More specifically, in the numerical example of FIGS. 4, 5, 6, and 7 described above, when K = 0.6, the linked document pairs with the shortest inter-document distances are documents D1 and D4 and documents D4 and D5, followed by documents D2 and D3, whose inter-document distance is "0.27". The initial clusters are therefore (D1, D4, D5) and (D2, D3).
[0040]
Next, the processing of the output processing unit 28 will be described. As described above, the output processing unit 28 gathers documents belonging to the same group into a display form that is easy for the user to view, for example using a graphical user interface, and outputs and displays them as the result of document classification. The display form of the output processing unit 28 will be described using a specific operation example. FIG. 9 to FIG. 13 show a series of changes in the state of the operation screen when the user activates the document retrieval apparatus incorporated in the document classification apparatus, performs a paper search, and further performs document classification. When the document retrieval apparatus is activated, a document search window screen 90 is displayed. The document search window screen 90 is provided with a search operation guide and a keyword input field 91 for entering search keywords.
[0041]
In the document search window screen 90, the user inputs, for example, "artificial brain", "qualitative reasoning", and "immune network" as keywords for searching for papers. The document search window screen 90 then shows the search keywords entered in the keyword input field 91. When the search button 92 is clicked with the pointer cursor 93 in this state, the search process is started and the search results are displayed in the search result display field 94. As a result, as shown in FIG. 11, the search result display field 94 displays, for example, the titles of three hit documents.
[0042]
Next, the user activates the document classification apparatus according to the present embodiment in order to additionally display documents closely related to a retrieved document. To do so, as shown in FIG. 12, the user designates one document 95 among those displayed in the search result display field 94 (shown in inverted display) by operating the pointer cursor 93. When the related document display button 96 is then operated, that is, clicked with the pointer cursor 93 by operating the mouse, the document classification apparatus according to the present embodiment is activated. Starting from the designated document, related documents are acquired via the link information embedded in the document, document classification is performed by cluster analysis based on the inter-document distances, and the documents belonging to the same group are displayed in field 97. In this way, when performing a document search, the user can search efficiently, including related documents.
[0043]
[Effect of the invention]
As described above, according to the document classification device of the present invention, when cluster analysis is performed on documents in the form of hypertext, document classification that reflects the intention of the creator of the documents can be performed by using the link information described in the documents. That is, document classification can be performed according to the semantic content of the documents.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a main part of a document classification apparatus according to an embodiment of the present invention;
FIG. 2 is a block diagram showing a configuration of a main part of a document classification system coupled to a wide area network according to another embodiment of the present invention;
FIG. 3 is a flowchart showing an algorithm for document acquisition processing of the document acquisition processing unit 21;
FIG. 4 is a diagram for explaining the relationship between document contents stored in the document storage unit 22 and link information;
FIG. 5 is a diagram for explaining link relationship information stored in a link relationship storage unit 23;
FIG. 6 is a diagram illustrating an example of a weighting result by a word weight setting processing unit 25;
FIG. 7 is a diagram showing an example of the calculation result of the inter-document distance by the inter-document distance calculation processing unit 26;
FIG. 8 is a flowchart showing an algorithm of document classification processing by the document classification processing unit 27;
FIG. 9 is a diagram showing a first state of a series of state changes of the operation screen when paper search is performed and document classification is further performed;
FIG. 10 is a diagram showing a second state of a series of state changes of the operation screen when paper search is performed and document classification is further performed;
FIG. 11 is a diagram showing a third state of a series of state changes in the operation screen when paper search is performed and document classification is further performed;
FIG. 12 is a diagram showing a fourth state of a series of state changes on the operation screen when paper search is performed and document classification is further performed;
FIG. 13 is a diagram illustrating a fifth state of a series of state changes on the operation screen when paper search is performed and document classification is performed.
[Explanation of symbols]
11 ... document storage unit, 12 ... link relation storage unit, 13 ... distance calculation processing unit, 14 ... document classification processing unit, 15 ... output processing unit, 20 ... wide area network, 21 ... document acquisition processing unit, 22 ... document storage unit, 23 ... link relation storage unit, 24 ... independent word extraction processing unit, 25 ... word weight setting processing unit, 26 ... inter-document distance calculation processing unit, 27 ... document classification processing unit, 28 ... output processing unit.

Claims

A document classification device comprising:
document storage means for storing a plurality of digitized documents;
link relation storage means for storing link relations between the plurality of documents stored in the document storage means;
distance calculation means for calculating inter-document distances from the frequency of appearance of words contained in each document stored in the document storage means;
document classification means for performing cluster analysis based on the link relations stored in the link relation storage means and the inter-document distances obtained from the distance calculation means, and classifying the plurality of documents stored in the document storage means; and
output means for outputting the result classified by the document classification means.

A document classification method performed by a document classification device, the method comprising:
a step in which document storage means provided in the document classification device stores a plurality of digitized documents;
a step in which link relation storage means provided in the document classification device stores link relations between the plurality of documents stored in the document storage means;
a step in which distance calculation means provided in the document classification device calculates inter-document distances from the frequency of appearance of words contained in each document stored in the document storage means;
a step in which document classification means provided in the document classification device performs cluster analysis based on the link relations stored in the link relation storage means and the inter-document distances obtained from the distance calculation means, and classifies the plurality of documents stored in the document storage means; and
a step in which output means provided in the document classification device outputs the result classified by the document classification means.