JP4017354B2 - Information classification apparatus and information classification program

Info

Publication number: JP4017354B2
Application number: JP2001111942A
Authority: JP
Inventors: 佳則片山; 寛治内野; 憲彦坂本; 竜柴田
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2000-04-17
Filing date: 2001-04-10
Publication date: 2007-12-05
Anticipated expiration: 2021-04-10
Also published as: JP2002007433A

Description

【０００１】
【発明の属する技術分野】
本発明は、大量のテキスト情報等の分類に用いられる情報分類装置および情報分類プログラムに関するものであり、特に、複数の分類方法から最も分類精度が高い分類方法を選択することで、分類精度、効率を高めることができる情報分類装置および情報分類プログラムに関するものである。
【０００２】
近時、インターネットを用いることで、莫大な量のテキスト情報を簡単に入手することが可能である。このことから、これらの大量のテキスト情報の内容を把握し、その中から必要なテキスト情報を効率よく抽出する技術が求められている。これは、決められた分類カテゴリに、これらのテキスト情報が分類されていると、後にテキスト情報を活用する際の検索や、関連テキスト情報を見つける場合等に便利だからである。
【０００３】
従来では、このような大量のテキスト情報は、分類担当者や、テキスト情報の作成者またはテキスト情報の活用者により、新規のテキスト情報の内容が判断され、複数の分類カテゴリからなる分類体系の中の最適な分類カテゴリにそれぞれ手動で分類されていた。また、別の分類方法としては、計算機システムを利用して新規のテキスト情報の内容が解析され、この解析結果に基づいて分類カテゴリに該当するテキスト情報を自動で分類するものがある。前者の分類方法では、非常に高いコストがかかり、後者の分類方法では、実用的な結果を得るための分類カテゴリの数や分類精度に問題がある。このことから、従来よりこのような問題を効果的に解決するための手段、方法が切望されている。
【０００４】
【従来の技術】
電子化された大量のテキスト情報が流通するようになった現在では、テキスト情報の効率的検索／利用の観点から、テキスト情報の意味内容に基づいた分類が重要な課題となっている。従来より、かかる課題を解決するための手段として、テキスト情報の分類作業を自動で実行する情報分類装置が各方面で用いられている。
【０００５】
また、従来では、与えられたテキスト情報の分類事例に基づいてテキスト情報の分類方法を導出した後、この分類方法に基づいて新規のテキスト情報を分類する方法として、特開平１１−３２８２１１号公報、特開平１１−２９６５５２号公報、特開平１１−１６７５８１号公報、特開平１１−１６１６７１号公報等に様々な分類方法が開示されている。ここで、つぎの（１）項〜（３）項に従来の分類方法を列挙する。
（１）確率モデルを基にした統計的な分類方法
（２）学習により自動分類を行う分類方法
（３）それぞれの分類カテゴリにテキスト情報を分類するためのルールを作成し、このルールを用いて自動分類を行う分類方法
【０００６】
（１）項の分類方法は、一般的な分類の傾向を見つけだすことができるが、細かい分類の傾向を見つけだすことができない。（２）項の分類方法は、分類カテゴリ数が数十未満の場合に高い分類精度を得ることができるが、数十以上に増えた場合、分類精度が低くなる。また、（３）項の分類方法は、ルールの作成およびメンテナンスに多大なコストがかかる。このように、（１）項〜（３）項までの分類方法は、それぞれ一長一短がある。
【０００７】
図１８は、従来の情報分類装置の構成を示すブロック図である。この図において、分類サンプルデータ２は、どの分類カテゴリにどのテキストを分類するのかが予め決められた複数のテキストからなる分類に関する正解データである。特徴素抽出部１は、分類サンプルデータ２から、各分類カテゴリの特徴をそれぞれ表す特徴素（単語）を各テキストから抽出する。
【０００８】
ここで、特徴素の抽出においては、各分類カテゴリの弁別能力を高めることができる特徴素を効率的に抽出する必要がある。従って、特徴素抽出部１では、特徴素の出現頻度をベースにして、上記弁別能力を高めるための特徴素抽出方法が用いられる。この特徴素抽出方法としては、従来より複数のものが提案されている。また、特徴素の属性についても品詞を幾つか指定する等の方法が採られる。
【０００９】
分類学習情報生成部３は、特徴素抽出部１により抽出された特徴素から各分類カテゴリの特徴をそれぞれ算出し、この算出結果としての分類学習情報４を生成する。この分類学習情報生成部３における分類学習方法としては、従来より複数のものが提案されている。分類学習情報４は、特徴素の状況と分類カテゴリとの対応関係を表す情報である。自動分類部５は、予め固定的に設定された一つの分類方法により、分類対象である、複数のテキストからなる新規テキスト群６を分類学習情報４に基づいて、分類カテゴリに分類し、分類結果データ７を出力する。
【００１０】
【発明が解決しようとする課題】
ところで、前述したように、従来の情報分類装置（図１８参照）においては、特徴素抽出部１の特徴素抽出方法として複数のものがある旨を述べたが、分類対象となる新規テキスト群６の内容、量に依存して、分類結果データ７における分類精度が変動することから、あらゆる内容、量の新規テキスト群６に対して高い分類精度を維持する万能な抽出方法を一意に規定することが難しい。
【００１１】
同様にして、分類学習情報生成部３においても、分類学習方法として複数のものがある旨を述べたが、新規テキスト群６の内容、量に依存して分類結果データ７における分類精度が変動することから、高い分類精度を維持する万能な分類学習方法を一意に規定することが難しい。このことから、従来の情報分類装置では、やむを得ず、複数の分類方法（特徴素抽出方法、分類学習方法）のうち一つの分類方法が固定的に用いられている。
【００１２】
従って、従来の情報分類装置では、一つの固定的な分類方法により新規テキスト群６の分類を行っているため、新規テキスト群６の内容、量によって分類精度がバラツキ、結果的に分類精度が低くなってしまうという問題があった。
【００１３】
本発明は、上記に鑑みてなされたもので、分類対象の情報の内容、量にかかわらず、分類精度を高めることができる情報分類装置および情報分類プログラムを提供することを目的とする。
【００１４】
【課題を解決するための手段】
上記目的を達成するために、請求項１にかかる発明は、複数のサンプルテキストと複数の分類カテゴリとが予め対応付けられた分類サンプル情報に含まれる複数のサンプルテキストのそれぞれから分類カテゴリ毎に特徴素を抽出する特徴素抽出手段と、前記分類サンプル情報に基づいて、複数の分類方法の中から最も分類精度が高い分類方法を決定する分類方法決定手段と、前記分類方法決定手段により決定された分類方法に従って、前記特徴素抽出手段により抽出された特徴素に基づいて、分類カテゴリ毎の特徴を表す分類学習情報を生成する分類学習情報生成手段と、前記分類方法決定手段により決定された分類方法および前記分類学習情報に従って、分類対象である新規テキスト群を分類カテゴリ毎に分類し、分類結果を記憶手段に記憶させる分類手段と、前記分類手段によって第１の分類カテゴリに分類された新規テキストを、前記分類サンプル情報に含めさせて該新規テキストを第２の分類カテゴリと対応付けさせ、該新規テキストを含む前記分類サンプル情報から特徴素を再抽出する処理を前記特徴素抽出手段に行わせ、該新規テキストを含む前記分類サンプル情報に基づく分類方法の再決定を前記分類方法決定手段に行わせ、再決定された分類方法と再抽出された特徴素による分類学習情報の再生成を前記分類学習情報生成手段に行わせ、再決定された分類方法と再生成された分類学習情報による前記新規テキスト群の再分類を前記分類手段に行わせる再学習手段と、前記記憶手段に記憶された前回の分類結果との相違を示すシンボルを新規テキスト毎に付して、前記新規テキスト群に含まれる各新規テキストの再分類の結果を表示部に表示させる表示手段とを備えることを特徴とする。
【００１５】
この発明によれば、複数の分類方法を使用可能な状態にしておき、分類方法決定手段により、分類サンプル情報に基づいて複数の分類方法の中から最も分類精度が高い分類方法を決定した後、この分類方法に従って新規テキスト群を分類カテゴリ毎に分類するようにしたので、従来に比して、分類対象の情報の内容、量にかかわらず、分類精度を高めることができる。また、新規テキストを分類し直した分類サンプル情報に基づいて分類学習情報を再生成して分類を再実行し、再生成前後の分類結果の相違を表示するようにしたので、分類学習情報の分類精度をさらに高めることができる。
【００１６】
また、請求項２にかかる発明は、請求項１に記載の情報分類装置において、前記特徴素抽出手段は、複数の特徴素抽出方法により特徴素をそれぞれ抽出し、これらの抽出結果に基づいて、複数の特徴素抽出方法の中から分類カテゴリ間の弁別能力が高い特徴素抽出方法を選択し、この選択結果に対応する特徴素を抽出結果とすることを特徴とする。
【００１７】
この発明によれば、特徴素抽出手段で複数の特徴素抽出方法を使用可能な状態にしておき、これらの複数の特徴素抽出方法にそれぞれ対応する特徴素を抽出し、特に、分類カテゴリ間の弁別能力が高い特徴素抽出方法に対応する特徴素を抽出結果とするようにしたので、この特徴素に対応する分類結果の分類精度をさらに高めることができる。
【００１８】
また、請求項３にかかる発明は、請求項１に記載の情報分類装置において、前記特徴素抽出手段により抽出された特徴素を編集する編集手段を備えることを特徴とする。
【００１９】
この発明によれば、編集手段を設けて、抽出された特徴素を編集（削除、追加等）可能としたので、分類カテゴリに対して柔軟な特徴素設定を行うことができる。
【００２０】
また、請求項４にかかる発明は、請求項１〜３のいずれか一つに記載の情報分類装置において、前記分類方法決定手段は、クロスバリデーション方式により、複数の分類方法の中から最も分類精度が高い分類方法を決定することを特徴とする。
【００２１】
この発明によれば、複数の分類方法を使用可能な状態にしておき、分類方法決定手段により、分類サンプル情報に基づいて複数の分類方法の中から最も分類精度が高い分類方法をクロスバリデーション方式により決定した後、この分類方法に従って新規テキスト群を分類カテゴリ毎に分類するようにしたので、従来に比して、分類対象の情報の内容、量にかかわらず、分類精度を高めることができる。
【００２２】
また、請求項５にかかる発明は、請求項１〜４のいずれか一つに記載の情報分類装置において、前記サンプル情報、前記新規テキスト群における分類対象箇所を指定する指定手段を備えることを特徴とする。
【００２３】
この発明によれば、指定手段により、分類サンプル情報、新規テキスト群における分類対象箇所を指定するようにしたので、分類に不要な箇所を排除し、本質的に必要な箇所を対象に分類を行うことができるため、分類精度をさらに向上させることができる。
【００２４】
また、請求項６にかかる発明は、請求項１〜５のいずれか一つに記載の情報分類装置において、複数のサンプルテキストをクラスタリングすることで、前記複数のサンプルテキストと複数の分類カテゴリとが対応付けられた前記分類サンプル情報を生成するクラスタリング手段を備えることを特徴とする。
【００２５】
この発明によれば、クラスタリング手段により分類サンプル情報を生成するようにしたので、複数のサンプルテキストから分類カテゴリを手動で生成する場合に比して、格段に効率を向上させることができるとともに、ユーザの作業負担を軽減させることができる。
【００２６】
また、請求項７にかかる発明は、請求項１〜５のいずれか一つに記載の情報分類装置において、前記分類サンプル情報をクラスタリングするクラスタリング手段と、前記クラスタリング手段のクラスタリング結果と所望のクラスタリング結果とを比較する比較手段と、前記比較手段の比較結果に基づいて、必要に応じて前記分類サンプル情報を変更する変更手段とを備えることを特徴とする。
【００２７】
この発明によれば、クラスタリング手段のクラスタリング結果と所望のクラスタリング結果とを比較し、この比較結果が例えば不一致である場合に、変更手段により分類サンプル情報を変更可能としたので、より完全な分類サンプル情報に基づいて新規テキスト群の分類を行うことができることから、分類精度を極めて高くすることができる。
【００２８】
また、請求項８にかかる発明は、請求項１〜７のいずれか一つに記載の情報分類装置において、前記分類手段の分類結果における新規テキスト群をクラスタリングし、クラスタリング結果を表示するクラスタリング結果表示手段を備えることを特徴とする。
【００２９】
この発明によれば、クラスタリング結果表示手段によりクラスタリング結果を表示するようにしたので、分類結果の分布をユーザが容易に把握することができる。
【００３０】
また、請求項９にかかる発明は、請求項１〜８のいずれか一つに記載の情報分類装置において、前記分類手段の分類結果を最適化する最適化手段を備え、前記分類学習情報生成手段は、最適化された分類結果に基づいて、分類学習情報を再生成し、前記分類手段は、前記分類方法決定手段により決定された分類方法および再生成された前記分類学習情報に従って、分類対象である新規テキスト群を分類カテゴリ毎に分類することを特徴とする。
【００３１】
この発明によれば、最適化手段により最適化された分類結果に基づいて、分類学習情報を再生成し、この分類学習情報に従って、新規テキスト群を再度分類するようにしたので、さらに分類精度を向上させることができる。
【００３２】
また、請求項１０にかかる発明は、請求項９に記載の情報分類装置において、前記最適化前の分類結果と前記最適化後の分類結果との相違を視覚的に認識可能な相違認識情報として表示する相違認識情報表示手段を備えることを特徴とする。
【００３３】
この発明によれば、最適化前後における分類結果の相違を相違認識情報として表示させ、ユーザが一目で相違を認識できるようにしたので、相違に基づくユーザの対応を迅速に行わせることができ、結果的に分類精度を高めることができる。
【００３４】
また、請求項１１にかかる発明は、コンピュータを、複数のサンプルテキストと複数の分類カテゴリとが予め対応付けられた分類サンプル情報に含まれる複数のサンプルテキストのそれぞれから分類カテゴリ毎に特徴素を抽出する特徴素抽出手段と、前記分類サンプル情報に基づいて、複数の分類方法の中から最も分類精度が高い分類方法を決定する分類方法決定手段と、前記分類方法決定手段により決定された分類方法に従って、前記特徴素抽出手段により抽出された特徴素に基づいて、分類カテゴリ毎の特徴を表す分類学習情報を生成する分類学習情報生成手段と、前記分類方法決定手段により決定された分類方法および前記分類学習情報に従って、分類対象である新規テキスト群を分類カテゴリ毎に分類し、分類結果を記憶手段に記憶させる分類手段と、前記分類手段によって第１の分類カテゴリに分類された新規テキストを、前記分類サンプル情報に含めさせて該新規テキストを第２の分類カテゴリと対応付けさせ、該新規テキストを含む前記分類サンプル情報から特徴素を再抽出する処理を前記特徴素抽出手段に行わせ、該新規テキストを含む前記分類サンプル情報に基づく分類方法の再決定を前記分類方法決定手段に行わせ、再決定された分類方法と再抽出された特徴素による分類学習情報の再生成を前記分類学習情報生成手段に行わせ、再決定された分類方法と再生成された分類学習情報による前記新規テキスト群の再分類を前記分類手段に行わせる再学習手段と、前記記憶手段に記憶された前回の分類結果との相違を示すシンボルを新規テキスト毎に付して、前記新規テキスト群に含まれる各新規テキストの再分類の結果を表示部に表示させる表示手段
として動作させることを特徴とする。
【００３５】
この発明によれば、複数の分類方法を使用可能な状態にしておき、分類方法決定工程で、分類サンプル情報に基づいて複数の分類方法の中から最も分類精度が高い分類方法を決定した後、この分類方法に従って新規テキスト群を分類カテゴリ毎に分類するようにしたので、従来に比して、分類対象の情報の内容、量にかかわらず、分類精度を高めることができる。また、新規テキストを分類し直した分類サンプル情報に基づいて分類学習情報を再生成して分類を再実行し、再生成前後の分類結果の相違を表示するようにしたので、分類学習情報の分類精度をさらに高めることができる。
【００４０】
【発明の実施の形態】
以下、図面を参照して本発明にかかる情報分類装置および情報分類プログラムの一実施の形態について詳細に説明する。
【００４１】
図１は、本発明にかかる一実施の形態の構成を示すブロック図である。この図において、サンプルテキスト群１０は、未分類の複数のテキストの集合である。クラスタリング部２０は、サンプルテキスト群１０をクラスタリングし、分類サンプルデータ３０を生成する。この分類サンプルデータ３０は、どの分類カテゴリにどのテキストを分類するのかが予め決められた複数のテキストからなる分類に関する正解データである。
【００４２】
特徴素抽出部４０は、特徴素抽出部１（図１８参照）と同様にして、分類サンプルデータ３０から、各分類カテゴリの特徴をそれぞれ表す特徴素（単語）を各テキストから抽出する。ただし、特徴素抽出部１が一つの特徴素抽出方法に従って特徴素の抽出を行うのに対して、特徴素抽出部４０は、複数の特徴素抽出方法のそれぞれに従って特徴素の抽出を行う点で、特徴素抽出部１と異なる。
【００４３】
分類学習情報生成部６０は、分類学習情報生成部３（図１８参照）と同様にして、特徴素抽出部４０により抽出された特徴素から各分類カテゴリの特徴をそれぞれ算出し、この算出結果としての分類学習情報７０を生成する。ただし、分類学習情報生成部３が一つの分類学習方法に従って特徴を算出するのに対して、分類学習情報生成部６０は、複数の分類学習方法のそれぞれに従って特徴を算出する点で、分類学習情報生成部３と異なる。
【００４４】
分類方法決定部５０は、例えば、周知のクロスバリデーションにより、複数の分類方法の中から最も分類精度が高い分類方法を決定する。この分類方法決定部５０の動作の詳細については後述する。新規テキスト群８０は、図２に示したように、分類対象の複数の新規テキストＴＸ₁（テキスト名ｔｅｘｔ１）〜新規テキストＴＸ₁₀ （テキスト名ｔｅｘｔ１０）、・・・からなる。図１に戻り、自動分類部９０は、分類方法決定部５０により決定された分類方法および分類学習情報７０に基づいて、新規テキスト群８０を分類カテゴリに分類し、これを分類結果データ１００（図３参照）として出力する。
【００４５】
クラスタリング部１１０は、分類結果データ１００をクラスタリングし、クラスタリング結果Ｃ（図４参照）を得る。表示部１２０は、クラスタリング部１１０からのクラスタリング結果Ｃや、各部からの各種データを表示するディスプレイである。図５〜図７には、表示部１２０の表示例が図示されている。入力部１３０は、後述する編集作業や、表示部１２０におけるウィンドウ操作等を行うためのマウス、キーボード等である。
【００４６】
つぎに、上述した一実施の形態の動作について、図８〜図１０に示したフローチャートを参照しつつ説明する。図１に示したクラスタリング部２０にサンプルテキスト群１０が入力されると、図８に示したステップＳＡ１では、クラスタリング部２０は、サンプルテキスト群１０の複数のテキストをクラスタリングする。ステップＳＡ２では、クラスタリング部２０は、各クラスタを分類カテゴリ化する。ステップＳＡ３では、クラスタリング部２０は、どの分類カテゴリにどのテキストを分類するのかが予め決められた複数のテキストからなる分類に関する分類サンプルデータ３０（正解データ）を特徴素抽出部４０へ出力する。
【００４７】
これにより、ステップＳＡ４では、特徴素抽出部４０は、分類サンプルデータ３０における各分類カテゴリの特徴をそれぞれ表す特徴素（単語）を各テキストから抽出する特徴素抽出処理を実行する。すなわち、図９に示したステップＳＢ１では、特徴素抽出部４０は、分類サンプルデータ３０を形態素解析することにより、分類カテゴリの特徴を表す特徴素（単語）の候補を抽出する。ステップＳＢ２では、特徴素抽出部４０は、抽出された特徴素の候補における同義語を統一化するという処理を実行する。
【００４８】
ステップＳＢ３では、特徴素抽出部４０は、抽出された複数の特徴素の候補に関して、分類カテゴリ毎に、同一語の特徴素をカウントする。ステップＳＢ４では、特徴素抽出部４０は、分類カテゴリ毎に複数の特徴素の候補を絞り込むランキング処理を実行する。このランキング処理では、複数の特徴素の候補に対して、出現頻度が高い順に特徴素を分類カテゴリ毎にランキングする方法や、出現確率が高い順に特徴素を分類カテゴリ毎にランキングする方法や、出現頻度の算出に統計的手法（他の分類カテゴリにも出現している特徴素のランキングを下げる手法）を取り入れ、特徴素を分類カテゴリ毎にランキングする方法等が採用される。
【００４９】
ステップＳＢ５では、特徴素抽出部４０は、上述したランキングが高い特徴素を分類カテゴリ毎に上位から所定数抽出し、これらを特徴素として抽出する。ステップＳＢ６では、特徴素抽出部４０は、抽出された特徴素を特徴素抽出結果データとして出力する。図１１は、上述した三つのランキングの方法のうち、出現頻度順にランキングされた特徴素出現頻度順リストＲ₁（特徴素抽出結果データに対応）を示す図である。
【００５０】
同図には、分類カテゴリ（「Ｅｃｏｎｏｍｉｃ」、「Ｆｏｒｅｉｇｎ」、・・・、「Ｓｏｃｉｅｔｙ」および「Ｓｐｏｒｔ」）のフィールドと、当該分類カテゴリにおける特徴素（「市場」、「円高」等）出現頻度を表す度数のフィールドとがある。それぞれの分類カテゴリに対応するレコードには、当該分類カテゴリに分類されたテキストの数が記述されている。ここでいうテキストとは、サンプルテキスト群１０（図１参照）を構成するものをいう。例えば、「Ｅｃｏｎｏｍｉｃ」という分類カテゴリには、２７個のテキストが分類されており、「Ｆｏｒｅｉｇｎ」という分類カテゴリには、４３個のテキストが分類されている。
【００５１】
同図左端のフィールドは、出現頻度が高い順を表すランキングである。例えば、「Ｅｃｏｎｏｍｉｃ」という分類カテゴリにおいては、２７個のテキスト内での出現頻度のランキングが１位の特徴素が「市場」（度数：６１．０）、２位の特徴素が「円高」（度数：４０．０）、以下同様にして、３０位の特徴素が「金融」（度数：１２．０）である。
【００５２】
図１２は、上述した三つのランキングの方法のうち、Ｋｕｌｌｂａｃｋ−Ｌｅｉｂｌｅｒ法と呼ばれる統計的手法を取り入れ、特徴素が分類カテゴリ毎にランキングされた特徴素ランキングリストＲ₂（特徴素抽出結果データに対応）を示す図である。同図に示した特徴素ランキングリストＲ₂ の基本的な構成は、特徴素出現頻度順リストＲ₁（図１１参照）の構成と同一である。
【００５３】
しかしながら、特徴素ランキングリストＲ₂ では、他の分類カテゴリにも出現している特徴素のランキングを下げ、当該分類カテゴリと他の分類カテゴリとの弁別能力を向上させるための統計的手法が採用されている。例えば、図１１に示した「Ｅｃｏｎｏｍｉｃ」という分類カテゴリにおけるランキング３位の「ドル」（特徴素）は、図１２に示した「Ｅｃｏｎｏｍｉｃ」という分類カテゴリで３１位以下（図示略）とされている。
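The two ranking styles described for steps SB3 to SB5 (raw appearance frequency, as in list R₁, and a statistical score that lowers feature elements that also appear in other categories, as in list R₂) can be sketched roughly as follows. The patent names the Kullback-Leibler method but gives no formula, so the score used here, p(w|c)·log(p(w|c)/p(w)), is an assumption chosen only because it has the described effect. The function names and the toy data are also hypothetical, and the texts are assumed to be already tokenized with synonyms unified.

```python
import math
from collections import Counter


def count_features(samples):
    """Count each word per category. samples: {category: [token list, ...]}."""
    return {cat: Counter(tok for text in texts for tok in text)
            for cat, texts in samples.items()}


def top_features(scores, top_n):
    """Return the top_n words by descending score (ties broken alphabetically)."""
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ordered[:top_n]]


def rank_by_frequency(counts, top_n):
    """Rank feature elements per category by raw appearance frequency."""
    return {cat: top_features(c, top_n) for cat, c in counts.items()}


def rank_by_kl_term(counts, top_n):
    """Rank by p(w|c) * log(p(w|c) / p(w)), an assumed KL-style score.

    A word that is equally common in every category scores 0 and drops down.
    """
    total = Counter()
    for c in counts.values():
        total.update(c)
    grand = sum(total.values())
    ranked = {}
    for cat, c in counts.items():
        n_cat = sum(c.values())
        scores = {w: (f / n_cat) * math.log((f / n_cat) / (total[w] / grand))
                  for w, f in c.items()}
        ranked[cat] = top_features(scores, top_n)
    return ranked


# Toy sample data (hypothetical): "dollar" is frequent in both categories.
samples = {
    "Economic": [["dollar", "market", "dollar", "market"],
                 ["dollar", "yen", "market", "dollar"]],
    "Foreign": [["dollar", "treaty", "dollar", "treaty"],
                ["dollar", "summit", "treaty", "dollar"]],
}
counts = count_features(samples)
by_freq = rank_by_frequency(counts, top_n=2)
by_kl = rank_by_kl_term(counts, top_n=2)
print(by_freq["Economic"])  # ['dollar', 'market']
print(by_kl["Economic"])    # ['market', 'yen']
```

In this toy data "dollar" leads the frequency ranking but falls out of the top two under the statistical score, which mirrors the way 「ドル」 drops from rank 3 in FIG. 11 to rank 31 or lower in FIG. 12.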
【００５４】
図８に戻り、ステップＳＡ５では、分類方法決定部５０は、新規テキスト群８０に適用する分類方法を自動的に決定するか否かを判断する。ユーザからの指示が無ければ、分類方法決定部５０は、ステップＳＡ５の判断結果を「Ｙｅｓ」とする。一方、ユーザによりマニュアル操作で分類方法が指示された場合、分類方法決定部５０は、ステップＳＡ５の判断結果を「Ｎｏ」とし、ステップＳＡ７でユーザからの指示に基づいて分類方法を決定する。
【００５５】
この場合、ステップＳＡ６では、分類方法決定部５０は、例えば、クロスバリデーションにより、分類方法を自動的に決定する分類方法決定処理を実行する。すなわち、図１０に示したステップＳＣ１では、分類方法決定部５０は、分類サンプルデータ３０における分類カテゴリ毎に分類サンプル（テキスト）をランダムにＮ個に分ける。ステップＳＣ２では、分類方法決定部５０は、（Ｎ−１）個の分類サンプルに対して、複数の学習アルゴリズム（分類方法）をそれぞれ適用し、それぞれの学習アルゴリズムに対応する特徴素や分類学習情報を作成する。
【００５６】
ステップＳＣ３では、分類方法決定部５０は、ステップＳＣ２で作成された特徴素や分類学習情報を用いて、残り（１／Ｎ）の分類サンプルに対して当該学習アルゴリズムを適用することにより、分類テストを行い分類精度を算出する。この分類精度は、複数の学習アルゴリズムのそれぞれについて個別的に算出される。ステップＳＣ４では、分類方法決定部５０は、上記分類テストをＮ回実行したか否かを判断し、この場合、判断結果を「Ｎｏ」とする。以後、ステップＳＣ２およびステップＳＣ３では、分類サンプルを一つずつ替えることにより、Ｎ個の分類サンプルに関するそれぞれの分類精度が、複数の学習アルゴリズム毎に算出される。
【００５７】
そして、ステップＳＣ４の判断結果が「Ｙｅｓ」になると、ステップＳＣ５では、分類方法決定部５０は、Ｎ個の分類サンプルに関する分類精度の平均値を複数の学習アルゴリズム毎に算出する。ステップＳＣ６では、分類方法決定部５０は、複数の学習アルゴリズム（分類方法）にそれぞれ対応する複数の分類精度の平均値のうち、最も高いものを選択した後、選択された分類精度に対応する学習アルゴリズム（分類方法）を選択する。また、分類方法決定部５０は、分類精度が最も高い学習アルゴリズム（分類方法）を分類学習情報生成部６０および自動分類部９０に通知する。
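Steps SC1 to SC6 amount to N-fold cross-validation over the candidate classification methods. The sketch below is one minimal way to express that selection loop, not the patent's implementation. It assumes each method is a hypothetical callable `fit(train)` that returns a `predict(tokens)` function, and the two toy learners (`fit_majority`, `fit_word_vote`) and the sample data are invented purely to exercise the loop.

```python
import random
from collections import Counter, defaultdict


def split_per_category(samples, n_folds, seed=0):
    """Step SC1: randomly split the samples of each category into n_folds parts."""
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for tokens, cat in samples:
        by_cat[cat].append((tokens, cat))
    folds = [[] for _ in range(n_folds)]
    for items in by_cat.values():
        rng.shuffle(items)
        for i, item in enumerate(items):
            folds[i % n_folds].append(item)
    return folds


def select_method(samples, methods, n_folds=3, seed=0):
    """Steps SC2 to SC6: return (best method name, mean accuracy per method)."""
    folds = split_per_category(samples, n_folds, seed)
    scores = defaultdict(list)
    for k in range(n_folds):
        test = folds[k]                                   # held-out 1/N
        train = [s for j, f in enumerate(folds) if j != k for s in f]
        for name, fit in methods.items():
            predict = fit(train)                          # learn on N-1 parts
            hits = sum(predict(tokens) == cat for tokens, cat in test)
            scores[name].append(hits / len(test))
    mean = {name: sum(v) / len(v) for name, v in scores.items()}
    return max(mean, key=mean.get), mean


def fit_majority(train):
    """Toy learner (assumption): always predict the most common category."""
    top = Counter(cat for _, cat in train).most_common(1)[0][0]
    return lambda tokens: top


def fit_word_vote(train):
    """Toy learner (assumption): pick the category whose words overlap most."""
    words = defaultdict(Counter)
    for tokens, cat in train:
        words[cat].update(tokens)
    return lambda tokens: max(words, key=lambda c: sum(words[c][t] for t in tokens))


samples = [
    (["market", "yen"], "Economic"), (["market", "stock"], "Economic"),
    (["yen", "stock"], "Economic"),
    (["goal", "match"], "Sport"), (["goal", "team"], "Sport"),
    (["match", "team"], "Sport"),
]
methods = {"majority": fit_majority, "word_vote": fit_word_vote}
best, mean = select_method(samples, methods, n_folds=3)
print(best, mean)
```

Splitting within each category, as step SC1 describes, keeps every fold balanced, so a method cannot look good merely because a category happens to be missing from the held-out part.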
【００５８】
図８に戻り、ステップＳＡ８では、分類学習情報生成部６０は、分類方法決定部５０により通知された学習アルゴリズム（分類方法）、および特徴素抽出部４０からの特徴素抽出結果データに基づいて、分類学習情報７０を生成する。ステップＳＡ９では、分類学習情報生成部６０は、分類学習情報７０をデータベース（図示略）に登録する。ステップＳＡ１０では、自動分類部９０は、分類対象である新規テキスト群８０が入力されたか否かを判断し、この場合、判断結果を「Ｎｏ」として同判断を繰り返す。
【００５９】
そして、新規テキスト群８０が自動分類部９０に入力されると、自動分類部９０は、ステップＳＡ１０の判断結果を「Ｙｅｓ」とする。ステップＳＡ１１では、自動分類部９０は、新規テキスト群８０（図２参照）を構成する新規テキストＴＸ₁ 、新規テキストＴＸ₂、・・・新規テキストＴＸ₁₀ 、・・・のすべての自動分類が終了したか否かを判断し、この場合、判断結果を「Ｎｏ」とする。以降、ステップＳＡ１５〜ステップＳＡ２１では、自動分類部９０は、分類方法決定部５０により決定された分類方法に基づいて、自動分類処理を実行する。
【００６０】
以下では、分類方法の一例として、ベクトル空間法に基づいて新規テキスト群８０を分類する場合について説明する。この場合に、分類学習情報７０には、各分類カテゴリ毎に３０個の特徴素が含まれており、全特徴素のベクトル、各分類カテゴリのベクトルが含まれているものとする。この状態で、ステップＳＡ１５では、自動分類部９０は、新規テキスト群８０における新規テキストＴＸ₁ （図２参照）に対して形態素解析を実行し、特徴素（単語）を抽出する。ステップＳＡ１６では、自動分類部９０は、抽出された特徴素における同義語を統一化するという同義語統一化処理を実行する。
【００６１】
ステップＳＡ１７では、自動分類部９０は、抽出された特徴素をカウントする。ステップＳＡ１８では、自動分類部９０は、分類学習情報７０内の特徴素と同一の特徴素を、新規テキストＴＸ₁ に含まれる複数の特徴素から取得する。つぎに、自動分類部９０は、取得した特徴素、すなわち、新規テキストＴＸ₁ に関する文書ベクトルを生成する。
【００６２】
ステップＳＡ１９では、新規テキストＴＸ₁ に関する文書ベクトルと、分類学習情報７０内の各分類カテゴリのベクトルとの類似度（コサイン値）を算出する。この類似度（コサイン値）は、分類カテゴリのベクトルをＡ、新規テキストＴＸ₁ の文書ベクトルをＢとするとつぎの式で表される。
類似度（コサイン値）＝ベクトルＡと文書ベクトルＢとの内積／（ベクトルＡの大きさ×文書ベクトルＢの大きさ）
【００６３】
すなわち、ステップＳＡ１９では、新規テキストＴＸ₁ に関して、分類カテゴリの数分の類似度（コサイン値）が算出される。ステップＳＡ２０では、自動分類部９０は、算出された複数の類似度（コサイン値）を正規化（０〜１００までの値とする）する。ステップＳＡ２１では、自動分類部９０は、複数の類似度（コサイン値）のうち、しきい値（例えば、７０）以上の類似度を選択した後、選択された類似度に対応する分類カテゴリに新規テキストＴＸ₁ を分類する。なお、複数の類似度のすべてがしきい値に満たない場合、自動分類部９０は、当該新規テキストＴＸ₁ を分類できないテキストとする。以後、ステップＳＡ１５〜ステップＳＡ２１までの処理が繰り返されることにより、新規テキストが分類カテゴリに順次分類される。
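Steps SA19 to SA21 (cosine similarity against each category vector, normalization to the range 0 to 100, and a threshold such as 70) can be sketched as below. The patent does not say how the normalization is done, so multiplying the cosine by 100 is an assumption that holds for non-negative weights. Representing vectors as dictionaries and the names `cosine` and `classify` are also hypothetical.

```python
import math


def cosine(a, b):
    """Cosine of two sparse vectors given as {feature element: weight} dicts."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(doc_vector, category_vectors, threshold=70.0):
    """Return [(category, score)] for scores at or above the threshold.

    Scores are cosine * 100 (assumed normalization). An empty list means
    the text could not be classified.
    """
    scores = {cat: 100.0 * cosine(vec, doc_vector)
              for cat, vec in category_vectors.items()}
    hits = [(cat, s) for cat, s in scores.items() if s >= threshold]
    return sorted(hits, key=lambda cs: -cs[1])


# Toy category vectors A and document vectors B (hypothetical values).
category_vectors = {
    "Economic": {"market": 3, "yen": 2},
    "Sport": {"goal": 3, "match": 2},
}
result = classify({"market": 2, "yen": 1}, category_vectors)
unclassified = classify({"weather": 1}, category_vectors)
print(result)
print(unclassified)
```

Returning every category at or above the threshold, rather than only the best one, follows the wording of step SA21, which selects the similarities that reach the threshold.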
【００６４】
そして、すべての新規テキストの分類が終了すると、自動分類部９０は、ステップＳＡ１１の判断結果を「Ｙｅｓ」とする。ステップＳＡ１２では、自動分類部９０は、図３に示した分類結果データ１００を出力する。この図において、テキスト名ｔｅｘｔ１〜テキスト名ｔｅｘｔ２０、・・・は、図２に示したテキスト名ｔｅｘｔ１〜テキスト名ｔｅｘｔ１０、・・・に対応しており、「ＡＵＴＯＭＯＴＩＶＥ＿ＩＮＤＵＳＴＲＹ」等は、分類カテゴリを示し、分類カテゴリの右側の数字は、得点（例えば、類似度）を表す。すなわち、図２に示した新規テキストＴＸ₁ は、「ＡＵＴＯＭＯＴＩＶＥ＿ＩＮＤＵＳＴＲＹ」という分類カテゴリに分類されており、得点（類似度）が「８５」である。
【００６５】
図８に戻り、ステップＳＡ１３では、クラスタリング部１１０は、分類結果データ１００を用いて、新規テキスト群８０をクラスタリングする。図４は、クラスタリング部１１０におけるクラスタリング結果Ｃを示す図である。この図には、１０００個の新規テキストからなる新規テキスト群８０が分類された場合であって、「Ｅｃｏｎｏｍｉｃ」という分類カテゴリに２６個の新規テキストが分類された場合の２６個の新規テキストの内訳（テキストの数、特徴素）が図示されている。
【００６６】
ステップＳＡ１４では、表示部１２０には、例えば、図４に示したクラスタリング結果Ｃが表示される。これにより、ユーザは、分類カテゴリ（この場合、「Ｅｃｏｎｏｍｉｃ」）にどのような内容が分類されているかの確認を行うことができる。
【００６７】
なお、一実施の形態においては、図１２に示した特徴素ランキングリストＲ₂ を表示部１２０に表示させ、ユーザの要求に応じて、特徴素ランキングリストＲ₂ を編集し、図１３に示した特徴素ランキングリストＲ₃ を用いて、分類を行うようにしてもよい。この場合、ユーザは、入力部１３０を用いて、特徴素ランキングリストＲ₂ において不要と判断した特徴素を削除するという編集を行う。これにより、特徴素ランキングリストＲ₃ （図１３参照）が作成され、この特徴素ランキングリストＲ₃ に基づいて、上述した処理が実行される。
【００６８】
なお、一実施の形態では、分類サンプルデータ３０と新規テキスト群８０との構造が予め規定されている場合、分類サンプルデータ３０、新規テキスト群８０における分類対象箇所を入力部１３０により指定するようにしてもよい。
【００６９】
さて、前述では、図１に示したクラスタリング部２０によりクラスタリングされた結果（分類サンプルデータ３０）をそのまま特徴素抽出部４０で用いた例について説明したが、クラスタリングされた結果を検証するようにしてもよい。以下では、この場合を一実施の形態の変形例１として、図１４および図１５を参照して説明する。
【００７０】
図１５に示したステップＳＤ１では、図１に示した分類サンプルデータ３０（正解データ）に含まれるサンプルテキスト群１０に対して、クラスタリング部２０によりクラスタリングが実行される。この場合、分類サンプルデータ３０における分類カテゴリの割付けが無視される。図１４は、クラスタリング部２０によりクラスタリングされた結果（クラスタリング結果分布データＣＢ）を示す図である。この図には、７つの分類カテゴリ（「Ｅｃｏｎｏｍｉｃ」、「Ｆｏｒｅｉｇｎ」、・・・「Ｓｐｏｒｔ」）に割り付けられた２７７のテキストをクラスタリングした結果が図示されている。
【００７１】
この図によれば、Ａレコードの「Ｓｐｏｒｔｓ」、ＣおよびＥレコードの「Ｐｏｌｉｔｉｃｓ」は、きれいに分類カテゴリの割付が行われていることがわかる。これに対して、Ｄレコードの「Ｅｃｏｎｏｍｉｃ」と「Ｉｎｄｕｓｔｒｙ」の区別や、Ｆレコード以降の「Ｆｏｒｅｉｇｎ」、「Ｉｎｄｕｓｔｒｙ」、「Ｐｏｌｉｔｉｃｓ」、「Ｓｃｉｅｎｃｅ」、「Ｓｏｃｉｅｔｙ」の区別が曖昧になっていることがわかる。この場合には、後述するステップＳＤ４の処理が実行される。ステップＳＤ２では、クラスタリングされた結果（分類カテゴリの割付）と、ユーザが当初想定していた分類カテゴリの割付とが比較部（図示略）により比較される。
【００７２】
ステップＳＤ３では、比較部は、ステップＳＤ２の比較結果が同一であるか否かを判断し、この判断結果が「Ｎｏ」である場合、比較結果を表示部１２０に表示させる。これにより、ステップＳＤ４では、ユーザは、入力部１３０を用いて、クラスタリングされた結果（分類カテゴリの割付）を再検討し、分類カテゴリの編集を行う。一方、ステップＳＤ３の判断結果が「Ｙｅｓ」である場合、すなわち、分類サンプルデータ３０における分類カテゴリの割付がユーザが当初想定していたものと同一であるため、ステップＳＤ５では、分類カテゴリおよび分類サンプル（テキスト）が学習情報とされる。ステップＳＤ６では、比較部（図示略）は、分類サンプルデータ３０を特徴素抽出部４０へ出力する。これにより、前述した処理が実行される。
【００７３】
さて、前述では、自動分類部９０により分類された分類結果データ１００をそのまま出力する例について説明したが、自動分類部９０により分類が行われた後に分類結果データ１００が所望のものであるか否かを検証し、この検証結果がＮＧの場合に、この検証結果を分類学習情報７０にフィードバックし、再学習することにより分類精度を向上させるようにしてもよい。以下では、この場合を一実施の形態の変形例２として図１６を参照しつつ説明する。同図において、図１の各部に対応する部分には同一の符号を付ける。この図においては、再学習処理部１４０が新たに設けられている。この再学習処理部１４０は、上述したフィードバックを受けて分類学習情報７０Ａを作成する。
【００７４】
２０個の新規テキストからなる新規テキスト群８０が情報分類装置２００に入力されると、新規テキスト群８０は、前述した動作と同様にして、分類学習情報７０および所定の分類方法に基づいて、自動分類される。これにより、情報分類装置２００からは、分類結果データ１００が出力される。この分類結果データ１００は、表示部１２０に表示される。ここで、分類結果データ１００において、分類カテゴリＢに割り付けられた新規テキスト（５）および（６）が分類カテゴリＡに割り付けられるべきであって、かつ分類カテゴリＣに割り付けられた新規テキスト（９）が分類カテゴリＤに割り付けられるべきであった場合、ユーザは、入力部１３０を用いて、所望の割付に編集する。
【００７５】
これにより、再学習処理部１４０は、編集された分類結果データ１００に基づいて、分類学習情報生成部６０（図１参照）と同様の動作により、再学習処理を実行し、分類学習情報７０Ａを再構築する。この状態で、新規テキスト群８０が情報分類装置２００に入力されると、新規テキスト群８０は、前述した動作と同様にして、再構築された分類学習情報７０Ａおよび所定の分類方法に基づいて、自動分類される。この場合、情報分類装置２００から出力される分類結果データ１００の分類精度は、再学習の効果により、極めて高い。
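The feedback step of modification 2 can be pictured as folding the user's corrections back into the sample data before learning is run again. The sketch below assumes, following claim 1, that only the new texts whose category the user changed are added to the classification sample data. The data layout and the name `add_corrected_texts` are hypothetical, and the toy values reuse the example above in which texts (5) and (6) move from category B to A and text (9) moves from C to D.

```python
def add_corrected_texts(samples, new_texts, results, corrections):
    """Return the samples extended with new texts the user re-assigned.

    samples:     [(tokens, category)] used for the previous learning run
    new_texts:   {text id: tokens}
    results:     {text id: category assigned by the automatic classification}
    corrections: {text id: category the user wants instead}
    """
    extended = list(samples)
    for text_id in sorted(corrections):
        wanted = corrections[text_id]
        if results.get(text_id) != wanted:
            extended.append((new_texts[text_id], wanted))
    return extended


samples = [(["refund", "broken"], "A"), (["manual", "question"], "B")]
new_texts = {5: ["broken", "screen"], 6: ["refund", "late"], 9: ["invoice"]}
results = {5: "B", 6: "B", 9: "C"}
corrections = {5: "A", 6: "A", 9: "D"}
extended = add_corrected_texts(samples, new_texts, results, corrections)
print(len(extended))
```

The extended list would then be passed through feature element extraction, method determination, and learning again, after which the new text group is reclassified.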
【００７６】
なお、一実施の形態では、図１に示した表示部１２０に図５に示した画面Ｇ₁ を表示させ、分類処理で発生する各種情報を表示させるようにしてもよい。画面Ｇ₁ には、「ユーザークレーム分類」という分類カテゴリに対応するフォルダＨ₀ 、この分類カテゴリの配下に属する「初期不良」、・・・・「問い合わせ」および「分類されなかった文書」という分類カテゴリにそれぞれ対応するフォルダＨ₁ 〜Ｈ₇ がそれぞれ表示されている。
【００７７】
また、画面Ｇ₁ には、ウィンドウ制御により、画面Ｇ₂ 〜Ｇ₄ が表示されている。画面Ｇ₂ には、図６に示したように「問い合わせ」という分類カテゴリに対応するサンプル文書（分類サンプルデータ３０に対応）のタイトルＫ₁ およびテキスト内容Ｋ₂ が表示されている。また、図７に示した画面Ｇ₃ には、「問い合わせ」という分類カテゴリに対応するキーワード（特徴素）が表示されている。図５に示した画面Ｇ₄ には、「問い合わせ」という分類カテゴリに分類された新規テキストの一覧画面Ｊ₁ および当該新規テキストの内容に関する内容表示画面Ｊ₂ が表示されている。ここで、新規テキストの一覧画面Ｊ₁ におけるアイコンＩ₁〜Ｉ₄ は、上述した変形例２による再学習前の得点（類似度）に対する、再学習後の得点の変化を表すものである。
【００７８】
すなわち、アイコンＩ₁ は、前回よりも得点（類似度）が高くなったことを意味しており、アイコンＩ₂ は、前回よりも得点（類似度）が低くなったことを意味している。アイコンＩ₃ は、前回、当該分類カテゴリ（この場合「問い合わせ」）に分類されていた新規テキストが、今回、当該分類カテゴリに分類されなかったことを意味している。また、アイコンＩ₄ は、前回、当該分類カテゴリ（この場合「問い合わせ」）に分類されていなかった新規テキストが、今回、当該分類カテゴリに分類されたことを意味している。
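The meaning of icons I₁ to I₄ reduces to a comparison between the previous and the current result for one category. A minimal sketch, assuming that each result is either a score or `None` when the text was not classified into that category; the string labels and the function name are hypothetical:

```python
def change_symbol(previous, current):
    """Return the icon label for one new text in one classification category.

    previous / current: score in this category, or None if the text was not
    classified into the category in that run.
    """
    if previous is not None and current is not None:
        if current > previous:
            return "I1"   # score rose since the previous run
        if current < previous:
            return "I2"   # score fell since the previous run
        return None       # unchanged: no icon
    if previous is not None and current is None:
        return "I3"       # was in this category before, no longer is
    if previous is None and current is not None:
        return "I4"       # newly classified into this category
    return None


# Hypothetical scores for the category "問い合わせ" before and after re-learning.
before = {"text1": 72, "text2": 90, "text3": 75, "text4": None}
after = {"text1": 85, "text2": 80, "text3": None, "text4": 78}
symbols = {name: change_symbol(before[name], after[name]) for name in before}
print(symbols)
```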
【００７９】
以上説明したように、一実施の形態によれば、複数の分類方法を使用可能な状態にしておき、分類方法決定部５０により、分類サンプルデータ３０に基づいて複数の分類方法の中から最も分類精度が高い分類方法を決定した後、この分類方法に従って新規テキスト群８０を分類カテゴリ毎に分類するようにしたので、従来に比して、分類対象の情報の内容、量にかかわらず、分類精度を高めることができる。
【００８０】
また、一実施の形態によれば、特徴素抽出部４０で複数の特徴素抽出方法を使用可能な状態にしておき、これらの複数の特徴素抽出方法にそれぞれ対応する特徴素を抽出し、特に、分類カテゴリ間の弁別能力が高い特徴素抽出方法に対応する特徴素を抽出結果とするようにしたので、この特徴素に対応する分類結果の分類精度をさらに高めることができる。
【００８１】
また、一実施の形態によれば、入力部１３０および表示部１２０（編集手段）を設けて、抽出された特徴素を編集（削除、追加等）可能としたので、分類カテゴリに対して柔軟な特徴素設定を行うことができる。
【００８２】
また、一実施の形態によれば、入力部１３０および表示部１２０（指定手段）により、分類サンプルデータ３０、新規テキスト群８０における分類対象箇所を指定するようにしたので、分類に不要な箇所を排除し、本質的に必要な箇所を対象に分類を行うことができるため、分類精度をさらに向上させることができる。
【００８３】
また、一実施の形態によれば、クラスタリング部２０により分類サンプルデータ３０を生成するようにしたので、複数のサンプルテキストから分類カテゴリを手動で生成する場合に比して、格段に効率を向上させることができるとともに、ユーザの作業負担を軽減させることができる。
【００８４】
また、一実施の形態によれば、クラスタリング部２０のクラスタリング結果と所望のクラスタリング結果とを比較し、この比較結果が例えば不一致である場合に、入力部１３０（変更手段）により分類サンプルデータ３０を変更可能としたので、より完全な分類サンプルデータ３０に基づいて新規テキスト群８０の分類を行うことができることから、分類精度を極めて高くすることができる。
【００８５】
また、一実施の形態によれば、表示部１２０にクラスタリング結果分布データＣＢ（図１４参照）を表示するようにしたので、分類結果の分布をユーザが容易に把握することができる。
【００８６】
また、一実施の形態によれば、変形例２で説明したように、最適化された分類結果に基づいて、分類学習情報７０Ａを再生成し、この分類学習情報７０Ａに従って、新規テキスト群８０を再度分類するようにしたので、さらに分類精度を向上させることができる。
【００８７】
また、一実施の形態によれば、上記最適化前後における分類結果の相違をアイコンＩ₁〜Ｉ₄（相違認識情報）として表示させ、ユーザが一目で相違を認識できるようにしたので、相違に基づくユーザの対応を迅速に行わせることができ、結果的に分類精度を高めることができる。
【００８８】
以上本発明にかかる一実施の形態について図面を参照して詳述してきたが、具体的な構成例はこの一実施の形態に限られるものではなく、本発明の要旨を逸脱しない範囲の設計変更等があっても本発明に含まれる。たとえば、前述した一実施の形態においては、情報分類装置の機能を実現するための情報分類プログラムを図１７に示したコンピュータ読み取り可能な記録媒体４００に記録して、この記録媒体４００に記録された情報分類プログラムを同図に示したコンピュータ３００に読み込ませ、実行することにより情報分類を行うようにしてもよい。
【００８９】
図１７に示したコンピュータ３００は、上記情報分類プログラムを実行するＣＰＵ３０１と、キーボード、マウス等の入力装置３０２と、各種データを記憶するＲＯＭ（Read Only Memory）３０３と、演算パラメータ等を記憶するＲＡＭ（Random Access Memory）３０４と、記録媒体４００から情報分類プログラムを読み取る読取装置３０５と、ディスプレイ、プリンタ等の出力装置３０６と、装置各部を接続するバスＢＵとから構成されている。
【００９０】
ＣＰＵ３０１は、読取装置３０５を経由して記録媒体４００に記録されている情報分類プログラムを読み込んだ後、情報分類プログラムを実行することにより、前述した情報分類を行う。なお、記録媒体４００には、光ディスク、フロッピーディスク、ハードディスク等の可搬型の記録媒体が含まれることはもとより、ネットワークのようにデータを一時的に記録保持するような伝送媒体も含まれる。
【００９１】
また、一実施の形態では、図１に示した分類方法決定部５０で、分類方法の決定方式の一例としてクロスバリデーション方式を採用した場合について説明したが、この方式に限られるものではなく、再現率（結果の中で正解の含まれている割合）や、適合率（結果の中で間違いの少なさ）といった値をキーとして分類方法を決定するようにしてもよい。要は、複数の分類方法が使用可能であること、これらの分類方法の中から最も分類精度が高いものを選択できること、という要件を具備していれば、いかなる方式を採用しても本発明に含まれる。
【００９２】
【発明の効果】
以上説明したように、請求項１にかかる発明によれば、複数の分類方法を使用可能な状態にしておき、分類方法決定手段により、分類サンプル情報に基づいて複数の分類方法の中から最も分類精度が高い分類方法を決定した後、この分類方法に従って新規テキスト群を分類カテゴリ毎に分類するようにしたので、従来に比して、分類対象の情報の内容、量にかかわらず、分類精度を高めることができるという効果を奏する。また、新規テキストを分類し直した分類サンプル情報に基づいて分類学習情報を再生成して分類を再実行し、再生成前後の分類結果の相違を表示するようにしたので、分類学習情報の分類精度をさらに高めることができるという効果を奏する。
【００９３】
また、請求項２にかかる発明によれば、特徴素抽出手段で複数の特徴素抽出方法を使用可能な状態にしておき、これらの複数の特徴素抽出方法にそれぞれ対応する特徴素を抽出し、特に、分類カテゴリ間の弁別能力が高い特徴素抽出方法に対応する特徴素を抽出結果とするようにしたので、この特徴素に対応する分類結果の分類精度をさらに高めることができるという効果を奏する。
【００９４】
また、請求項３にかかる発明によれば、編集手段を設けて、抽出された特徴素を編集（削除、追加等）可能としたので、分類カテゴリに対して柔軟な特徴素設定を行うことができるという効果を奏する。
【００９５】
また、請求項４にかかる発明によれば、複数の分類方法を使用可能な状態にしておき、分類方法決定手段により、分類サンプル情報に基づいて複数の分類方法の中から最も分類精度が高い分類方法をクロスバリデーション方式により決定した後、この分類方法に従って新規テキスト群を分類カテゴリ毎に分類するようにしたので、従来に比して、分類対象の情報の内容、量にかかわらず、分類精度を高めることができるという効果を奏する。
【００９６】
また、請求項５にかかる発明によれば、指定手段により、分類サンプル情報、新規テキスト群における分類対象箇所を指定するようにしたので、分類に不要な箇所を排除し、本質的に必要な箇所を対象に分類を行うことができるため、分類精度をさらに向上させることができるという効果を奏する。
【００９７】
また、請求項６にかかる発明によれば、クラスタリング手段により分類サンプル情報を生成するようにしたので、複数のサンプルテキストから分類カテゴリを手動で生成する場合に比して、格段に効率を向上させることができるとともに、ユーザの作業負担を軽減させることができるという効果を奏する。
【００９８】
また、請求項７にかかる発明によれば、クラスタリング手段のクラスタリング結果と所望のクラスタリング結果とを比較し、この比較結果が例えば不一致である場合に、変更手段により分類サンプル情報を変更可能としたので、より完全な分類サンプル情報に基づいて新規テキスト群の分類を行うことができることから、分類精度を極めて高くすることができるという効果を奏する。
【００９９】
また、請求項８にかかる発明によれば、クラスタリング結果表示手段によりクラスタリング結果を表示するようにしたので、分類結果の分布をユーザが容易に把握することができるという効果を奏する。
【０１００】
また、請求項９にかかる発明によれば、最適化手段により最適化された分類結果に基づいて、分類学習情報を再生成し、この分類学習情報に従って、新規テキスト群を再度分類するようにしたので、さらに分類精度を向上させることができるという効果を奏する。
【０１０１】
また、請求項１０にかかる発明によれば、最適化前後における分類結果の相違を相違認識情報として表示させ、ユーザが一目で相違を認識できるようにしたので、相違に基づくユーザの対応を迅速に行わせることができ、結果的に分類精度を高めることができるという効果を奏する。
【０１０２】
また、請求項１１にかかる発明によれば、複数の分類方法を使用可能な状態にしておき、分類方法決定工程で、分類サンプル情報に基づいて複数の分類方法の中から最も分類精度が高い分類方法を決定した後、この分類方法に従って新規テキスト群を分類カテゴリ毎に分類するようにしたので、従来に比して、分類対象の情報の内容、量にかかわらず、分類精度を高めることができるという効果を奏する。また、新規テキストを分類し直した分類サンプル情報に基づいて分類学習情報を再生成して分類を再実行し、再生成前後の分類結果の相違を表示するようにしたので、分類学習情報の分類精度をさらに高めることができるという効果を奏する。
【図面の簡単な説明】
【図１】本発明にかかる一実施の形態の構成を示すブロック図である。
【図２】図１に示した新規テキスト群８０の一例を示す図である。
【図３】図１に示した分類結果データ１００の一例を示す図である。
【図４】図１に示したクラスタリング部１１０におけるクラスタリング結果Ｃを示す図である。
【図５】図１に示した表示部１２０の表示例を示す図である。
【図６】図１に示した表示部１２０の表示例を示す図である。
【図７】図１に示した表示部１２０の表示例を示す図である。
【図８】同一実施の形態の動作を説明するフローチャートである。
【図９】図８に示した特徴素抽出処理を説明するフローチャートである。
【図１０】図８に示した分類方法決定処理を説明するフローチャートである。
【図１１】同一実施の形態における特徴素出現頻度順リストＲ₁ を示す図である。
【図１２】同一実施の形態における特徴素ランキングリストＲ₂ を示す図である。
【図１３】同一実施の形態における特徴素ランキングリストＲ₃ を示す図である。
【図１４】同一実施の形態におけるクラスタリング結果分布データＣＢを示す図である。
【図１５】同一実施の形態の変形例１を説明するフローチャートである。
【図１６】同一実施の形態の変形例２を説明する図である。
【図１７】同一実施の形態の変形例３を示すブロック図である。
【図１８】従来の情報分類装置の構成を示すブロック図である。
【符号の説明】
２０クラスタリング部
４０特徴素抽出部
５０分類方法決定部
６０分類学習情報生成部
９０自動分類部
１１０クラスタリング部
１２０表示部
１３０入力部
３００コンピュータ
３０１ＣＰＵ
４００記録媒体

[0001]
[Technical field to which the invention belongs]
The present invention relates to an information classification apparatus and an information classification program used for classifying large amounts of text information and the like, and in particular to an information classification apparatus and an information classification program that can improve classification accuracy and efficiency by selecting, from a plurality of classification methods, the classification method with the highest classification accuracy.
[0002]
Recently, the Internet has made it easy to obtain enormous amounts of text information. This has created a need for techniques that grasp the content of such large amounts of text information and efficiently extract the necessary text information from it. When text information has been classified into predetermined classification categories, it is convenient for later searches when the information is used and for finding related text information.
[0003]
Conventionally, for such large amounts of text information, a person in charge of classification, the creator of the text information, or a user of the text information judged the content of each new text and manually classified it into the most suitable classification category within a classification system consisting of a plurality of classification categories. As another classification method, there is one in which the content of new text information is analyzed using a computer system and the text information is automatically classified into the corresponding classification category based on the analysis result. The former method is very costly, and the latter has problems with the number of classification categories and the classification accuracy needed to obtain practical results. For this reason, means and methods for effectively solving these problems have long been sought.
[0004]
[Prior art]
Now that large amounts of digitized text information are in circulation, classification based on the semantic content of text information has become an important issue from the viewpoint of efficient search and use of text information. As a means of addressing this issue, information classification apparatuses that automatically perform the classification of text information have conventionally been used in various fields.
[0005]
Conventionally, various methods that first derive a classification method from given classification examples of text information and then classify new text information based on that method have been disclosed in Japanese Patent Application Laid-Open Nos. 11-328211, 11-296552, 11-167581, 11-161671, and others. Conventional classification methods are listed in items (1) to (3) below.
(1) A statistical classification method based on a probability model
(2) A classification method that performs automatic classification through learning
(3) A classification method that creates rules for classifying text information into each classification category and performs automatic classification using these rules
[0006]
The classification method of item (1) can find general classification tendencies but cannot find fine-grained ones. The classification method of item (2) achieves high classification accuracy when the number of classification categories is below a few dozen, but its accuracy falls when the number rises to a few dozen or more. The classification method of item (3) incurs a large cost for creating and maintaining the rules. Thus, each of the classification methods (1) to (3) has its own advantages and disadvantages.
[0007]
FIG. 18 is a block diagram showing the configuration of a conventional information classification apparatus. In this figure, classification sample data 2 is correct-answer data for classification, consisting of a plurality of texts for which the classification category of each text has been determined in advance. The feature element extraction unit 1 extracts, from each text in the classification sample data 2, feature elements (words) that represent the characteristics of each classification category.
[0008]
In extracting feature elements, it is necessary to efficiently extract feature elements that improve the ability to discriminate between classification categories. The feature element extraction unit 1 therefore uses a feature element extraction method that is based on the appearance frequency of feature elements and is designed to improve this discrimination ability. A number of such feature element extraction methods have been proposed. As for the attributes of feature elements, approaches such as designating certain parts of speech are also used.
[0009]
The classification learning information generation unit 3 calculates the characteristics of each classification category from the feature elements extracted by the feature element extraction unit 1 and generates classification learning information 4 as the result of this calculation. A number of classification learning methods have been proposed for the classification learning information generation unit 3. The classification learning information 4 is information representing the correspondence between the state of the feature elements and the classification categories. The automatic classification unit 5 uses a single classification method that is fixed in advance to classify a new text group 6, which is the classification target and consists of a plurality of texts, into classification categories based on the classification learning information 4, and outputs classification result data 7.
[0010]
[Problems to be solved by the invention]
As described above, a plurality of feature element extraction methods exist for the feature element extraction unit 1 of the conventional information classification apparatus (see FIG. 18). However, because the classification accuracy of the classification result data 7 varies with the content and amount of the new text group 6 to be classified, it is difficult to specify a single universal extraction method that maintains high classification accuracy for new text groups 6 of every content and amount.
[0011]
Similarly, a plurality of classification learning methods exist for the classification learning information generation unit 3, but because the classification accuracy of the classification result data 7 varies with the content and amount of the new text group 6, it is difficult to specify a single universal classification learning method that maintains high classification accuracy. For this reason, the conventional information classification apparatus has no choice but to use, in a fixed manner, one of the plurality of classification methods (feature element extraction methods and classification learning methods).
[0012]
Consequently, because the conventional information classification apparatus classifies the new text group 6 with a single fixed classification method, the classification accuracy varies with the content and amount of the new text group 6, and the classification accuracy ends up being low.
[0013]
The present invention has been made in view of the above, and its object is to provide an information classification apparatus and an information classification program that can improve classification accuracy regardless of the content and amount of the information to be classified.
[0014]
[Means for Solving the Problems]
In order to achieve the above object, the invention according to claim 1 comprises: feature element extraction means for extracting feature elements for each classification category from each of a plurality of sample texts included in classification sample information in which the plurality of sample texts and a plurality of classification categories are associated in advance; classification method determining means for determining, based on the classification sample information, the classification method with the highest classification accuracy from among a plurality of classification methods; classification learning information generation means for generating, in accordance with the classification method determined by the classification method determining means and based on the feature elements extracted by the feature element extraction means, classification learning information representing the features of each classification category; classification means for classifying a new text group to be classified into the classification categories in accordance with the classification method determined by the classification method determining means and the classification learning information, and for storing the classification result in storage means; re-learning means for including a new text that the classification means classified into a first classification category in the classification sample information and associating that new text with a second classification category, causing the feature element extraction means to re-extract feature elements from the classification sample information including the new text, causing the classification method determining means to re-determine the classification method based on the classification sample information including the new text, causing the classification learning information generation means to regenerate the classification learning information from the re-determined classification method and the re-extracted feature elements, and causing the classification means to reclassify the new text group using the re-determined classification method and the regenerated classification learning information; and display means for displaying on a display unit the reclassification result of each new text included in the new text group, with a symbol attached to each new text indicating the difference from the previous classification result stored in the storage means.
[0015]
According to this invention, a plurality of classification methods are made available, the classification method determining means determines the classification method with the highest classification accuracy from among them based on the classification sample information, and the new text group is then classified into the classification categories in accordance with that method. Classification accuracy can therefore be improved over the conventional art regardless of the content and amount of the information to be classified. In addition, the classification learning information is regenerated based on classification sample information in which new texts have been reclassified, classification is executed again, and the differences between the classification results before and after the regeneration are displayed, so the classification accuracy of the classification learning information can be improved further.
[0016]
According to a second aspect of the present invention, in the information classification apparatus according to the first aspect, the feature element extraction unit extracts feature elements by each of a plurality of feature element extraction methods, selects, based on these extraction results, a feature element extraction method having a high discrimination capability between classification categories from among the plurality of feature element extraction methods, and uses the feature elements corresponding to the selected method as the extraction result.
[0017]
According to the present invention, a plurality of feature element extraction methods can be used by the feature element extraction unit, and feature elements respectively corresponding to the plurality of feature element extraction methods are extracted. Since the feature element corresponding to the feature element extraction method having high discrimination ability is used as the extraction result, the classification accuracy of the classification result corresponding to the feature element can be further increased.
[0018]
According to a third aspect of the present invention, the information classification apparatus according to the first aspect further includes an editing unit that edits the feature elements extracted by the feature element extraction unit.
[0019]
According to the present invention, since the editing means is provided and the extracted feature elements can be edited (deleted, added, etc.), flexible feature element setting can be performed for the classification category.
[0020]
According to a fourth aspect of the present invention, in the information classification apparatus according to any one of the first to third aspects, the classification method determining means determines, by a cross-validation method, the classification method having the highest classification accuracy from among the plurality of classification methods.
[0021]
According to the present invention, a plurality of classification methods are left in a usable state, and the classification method determining means determines the classification method having the highest classification accuracy among the plurality of classification methods based on the classification sample information by the cross-validation method. After the determination, the new text group is classified for each classification category in accordance with this classification method. Therefore, the classification accuracy can be improved regardless of the content and amount of information to be classified as compared with the conventional case.
[0022]
The invention according to claim 5 is the information classification apparatus according to any one of claims 1 to 4, further comprising a designation unit that designates a classification target location in the classification sample information and the new text group.
[0023]
According to the present invention, the designation means designates the classification target location in the classification sample information and the new text group. Therefore, the location unnecessary for the classification is excluded, and the classification is performed on the essentially necessary location. Therefore, the classification accuracy can be further improved.
[0024]
According to a sixth aspect of the present invention, the information classification device according to any one of the first to fifth aspects further includes clustering means for clustering a plurality of sample texts to generate the classification sample information in which the plurality of sample texts and the plurality of classification categories are associated.
[0025]
According to this invention, since the classification sample information is generated by the clustering means, efficiency can be significantly improved compared with the case where the classification categories are generated manually from a plurality of sample texts, and the user's workload can be reduced.
[0026]
The invention according to claim 7 is the information classification apparatus according to any one of claims 1 to 5, further comprising clustering means that clusters the classification sample information, comparing means that compares the clustering result of the clustering means with a desired clustering result, and changing means that changes the classification sample information as necessary based on the comparison result of the comparing means.
[0027]
According to the present invention, the clustering result of the clustering means is compared with the desired clustering result, and the changing means can change the classification sample information when, for example, the two do not match. Since the new text group can then be classified based on the changed classification sample information, the classification accuracy can be greatly increased.
[0028]
The invention according to claim 8 is the information classification apparatus according to any one of claims 1 to 7, further comprising clustering result display means that clusters the new text group in the classification result of the classification means and displays the clustering result.
[0029]
According to this invention, since the clustering result is displayed by the clustering result display means, the user can easily grasp the distribution of the classification result.
[0030]
The invention according to claim 9 is the information classification apparatus according to any one of claims 1 to 8, further comprising an optimization unit that optimizes the classification result of the classification unit, wherein the classification learning information generation unit regenerates the classification learning information based on the optimized classification result, and the classification unit classifies the new text group to be classified into the classification categories in accordance with the classification method determined by the classification method determination unit and the regenerated classification learning information.
[0031]
According to the present invention, the classification learning information is regenerated based on the classification result optimized by the optimization means, and the new text group is classified again according to the regenerated classification learning information, so that the classification accuracy can be improved.
[0032]
According to a tenth aspect of the present invention, the information classification device according to the ninth aspect further includes difference recognition information display means for displaying difference recognition information that allows the difference between the classification result before the optimization and the classification result after the optimization to be recognized visually.
[0033]
According to this invention, the difference between the classification results before and after optimization is displayed as difference recognition information, so the user can recognize the difference at a glance and respond quickly based on it. As a result, the classification accuracy can be increased.
[0034]
The invention according to claim 11 is an information classification program characterized by causing a computer to operate as: feature element extracting means for extracting feature elements for each classification category from each of a plurality of sample texts included in classification sample information in which a plurality of sample texts and a plurality of classification categories are associated in advance; classification method determining means for determining, based on the classification sample information, the classification method having the highest classification accuracy from among a plurality of classification methods; classification learning information generation means for generating, in accordance with the classification method determined by the classification method determining means and based on the feature elements extracted by the feature element extracting means, classification learning information representing the features of each classification category; classification means for classifying a new text group to be classified into the classification categories in accordance with the determined classification method and the classification learning information, and for storing the classification result in storage means; re-learning means which, by including in the classification sample information a new text that the classification means classified into a first classification category, associates that new text with a second classification category, causes the feature element extracting means to re-extract feature elements from the classification sample information including the new text, causes the classification method determining means to re-determine the classification method based on the classification sample information including the new text, causes the classification learning information generation means to regenerate the classification learning information from the re-extracted feature elements, and causes the classification means to reclassify the new text group using the re-determined classification method and the regenerated classification learning information; and display means for displaying, on a display unit, the reclassification result of each new text included in the new text group, with a symbol attached to each new text indicating its difference from the previous classification result stored in the storage means.
[0035]
According to the present invention, a plurality of classification methods are kept available, the classification method with the highest classification accuracy is determined from among them in the classification method determination step based on the classification sample information, and the new text group is then classified into the classification categories according to that method. The classification accuracy can therefore be improved over the conventional case regardless of the content and amount of the information to be classified. In addition, the classification learning information is regenerated based on the classification sample information in which the new text has been re-associated, the classification is re-executed, and the difference between the classification results before and after the regeneration is displayed, so that the classification accuracy can be further increased.
[0040]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, an embodiment of the information classification device and the information classification program according to the present invention will be described in detail with reference to the drawings.
[0041]
FIG. 1 is a block diagram showing the configuration of an embodiment according to the present invention. In this figure, a sample text group 10 is a set of a plurality of unclassified texts. The clustering unit 20 clusters the sample text group 10 and generates classification sample data 30. This classification sample data 30 is correct-answer data for classification, consisting of a plurality of texts for which the classification category of each text has been determined.
[0042]
Like the feature element extraction unit 1 (see FIG. 18), the feature element extraction unit 40 extracts, from each text in the classification sample data 30, feature elements (words) representing the features of the respective classification categories. However, it differs from the feature element extraction unit 1 in that the latter extracts feature elements according to a single feature element extraction method, whereas the feature element extraction unit 40 extracts feature elements according to each of a plurality of feature element extraction methods.
[0043]
Like the classification learning information generation unit 3 (see FIG. 18), the classification learning information generation unit 60 calculates the features of each classification category from the feature elements extracted by the feature element extraction unit 40 and generates the classification learning information 70. However, it differs from the classification learning information generation unit 3 in that the latter calculates features according to a single classification learning method, whereas the classification learning information generation unit 60 calculates features according to each of a plurality of classification learning methods.
[0044]
The classification method determination unit 50 determines the classification method with the highest classification accuracy from among a plurality of classification methods by, for example, known cross-validation. Details of the operation of the classification method determination unit 50 will be described later. As shown in FIG. 2, the new text group 80 consists of a plurality of new texts to be classified, namely new text TX₁ (text name text1) to new text TX₁₀ (text name text10). Returning to FIG. 1, the automatic classification unit 90 classifies the new text group 80 into classification categories based on the classification method determined by the classification method determination unit 50 and the classification learning information 70, and outputs the result as classification result data 100 (see FIG. 3).
[0045]
The clustering unit 110 clusters the classification result data 100 to obtain a clustering result C (see FIG. 4). The display unit 120 is a display that shows the clustering result C from the clustering unit 110 and various data from each unit. FIGS. 5 to 7 show display examples of the display unit 120. The input unit 130 is a mouse, a keyboard, or the like for performing editing operations described later, window operations on the display unit 120, and the like.
[0046]
Next, the operation of the above-described embodiment will be described with reference to the flowcharts. When the sample text group 10 is input to the clustering unit 20 shown in FIG. 1, the clustering unit 20 clusters the plurality of texts in the sample text group 10 in step SA1. In step SA2, the clustering unit 20 assigns each cluster to a classification category. In step SA3, the clustering unit 20 outputs to the feature element extraction unit 40 the classification sample data 30 (correct data), which consists of a plurality of texts for which the classification category of each text has been determined.
[0047]
As a result, in step SA4, the feature element extraction unit 40 executes a feature element extraction process for extracting, from each text, feature elements (words) representing the characteristics of each classification category in the classification sample data 30. That is, in step SB1 shown in FIG. 9, the feature element extraction unit 40 performs morphological analysis on the classification sample data 30 to extract feature element (word) candidates representing the characteristics of the classification category. In step SB2, the feature element extraction unit 40 executes a process of unifying synonyms in the extracted feature element candidates.
[0048]
In step SB3, the feature element extraction unit 40 counts, for each classification category, the occurrences of each word among the extracted feature element candidates. In step SB4, the feature element extraction unit 40 executes a ranking process for narrowing down the feature element candidates for each classification category. Three ranking methods are employed in this process: ranking the feature elements of each classification category in descending order of appearance frequency; ranking them in descending order of appearance probability; and ranking them after applying a statistical method to the frequency calculation, for example a method that lowers the ranking of feature elements that also appear in other classification categories.
[0049]
In step SB5, the feature element extraction unit 40 takes, for each classification category, a predetermined number of the highest-ranked candidates and adopts them as the feature elements. In step SB6, the feature element extraction unit 40 outputs the extracted feature elements as feature element extraction result data. FIG. 11 shows a feature element appearance frequency order list R₁ (corresponding to feature element extraction result data), ranked in order of appearance frequency, which is one of the three ranking methods described above.
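Steps SB1 to SB6 with the appearance-frequency ranking can be sketched as follows. This is a minimal illustration and not the patented implementation: the whitespace tokenizer stands in for morphological analysis, and the `SYNONYMS` table, the function names, and the sample texts are all hypothetical.

```python
from collections import Counter

# Hypothetical synonym table: maps variant forms to one canonical word (step SB2).
SYNONYMS = {"mkt": "market", "markets": "market"}

def tokenize(text):
    """Stand-in for morphological analysis (step SB1): lowercase whitespace split."""
    return text.lower().split()

def rank_by_frequency(samples, top_k=30):
    """samples: list of (category, text). Returns {category: [(word, count), ...]}."""
    counts = {}
    for category, text in samples:
        words = [SYNONYMS.get(w, w) for w in tokenize(text)]   # SB1 + SB2
        counts.setdefault(category, Counter()).update(words)    # SB3: count per category
    # SB4 + SB5: rank by descending frequency (ties broken alphabetically), keep the top k.
    return {
        c: sorted(cnt.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
        for c, cnt in counts.items()
    }

samples = [
    ("Economic", "market yen market dollar"),
    ("Economic", "markets finance yen"),
    ("Sport", "match goal match"),
]
ranking = rank_by_frequency(samples, top_k=2)   # SB6: the extraction result
```

With `top_k=30` the result would have the same shape as the 30-row list described for FIG. 11; the small value here only keeps the example short.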
[0050]
The figure has a field for each classification category (“Economic”, “Foreign”, ..., “Society”, and “Sport”), each paired with a frequency field indicating the appearance frequency of the feature elements (“market”, “yen appreciation”, etc.). The record for each classification category gives the number of texts classified into that category. Here, a text is one of the texts that make up the sample text group 10 (see FIG. 1). For example, 27 texts are classified in the classification category “Economic”, and 43 texts are classified in the classification category “Foreign”.
[0051]
The leftmost field in the figure is a ranking representing the order of appearance frequency. For example, in the classification category “Economic”, the feature element ranked first in appearance frequency in the 27 texts is “market” (frequency: 61.0), and the feature element ranked second is “yen appreciation” (frequency: 40.0). Similarly, the 30th feature element is “finance” (frequency: 12.0).
[0052]
FIG. 12 shows a feature element ranking list R₂ (corresponding to feature element extraction result data) in which the feature elements are ranked for each classification category by the statistical method, called the Kullback-Leibler method, among the three ranking methods described above. The basic structure of the feature element ranking list R₂ shown in the figure is the same as that of the feature element appearance frequency order list R₁ (see FIG. 11).
[0053]
However, the feature element ranking list R₂ employs a statistical method that lowers the ranking of feature elements that also appear in other classification categories, thereby improving the ability to discriminate each classification category from the others. For example, “dollar” (feature element), ranked third in the category “Economic” in FIG. 11, is ranked 31st or lower (not shown) in the category “Economic” in FIG. 12.
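The text names the Kullback-Leibler method but does not give its formula, so the following is only one plausible reading, offered as a sketch. It scores each word by its pointwise contribution p(w|c) · log(p(w|c) / p(w)) to the KL divergence between a category's word distribution and the pooled distribution over all categories. The function name and the toy counts are assumptions.

```python
import math
from collections import Counter

def rank_by_kl(counts, top_k=30):
    """counts: {category: Counter of word frequencies}.
    Assumed scoring: p(w|c) * log(p(w|c) / p(w)), where p(w) is pooled over
    all categories. Words spread evenly across categories score near zero."""
    pooled = Counter()
    for cnt in counts.values():
        pooled.update(cnt)
    pooled_total = sum(pooled.values())
    ranking = {}
    for category, cnt in counts.items():
        total = sum(cnt.values())
        scores = {}
        for word, n in cnt.items():
            p = n / total                      # p(w | category)
            q = pooled[word] / pooled_total    # p(w) over all categories
            scores[word] = p * math.log(p / q)
        ranking[category] = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return ranking

counts = {
    "Economic": Counter({"market": 6, "dollar": 5, "yen": 4}),
    "Foreign": Counter({"dollar": 5, "treaty": 6, "summit": 4}),
}
kl_ranking = rank_by_kl(counts)
economic_words = [w for w, _ in kl_ranking["Economic"]]
```

In this toy data “dollar” is the second most frequent word in “Economic”, but because it is equally common in “Foreign” its score drops to zero and it falls to the bottom of the list, which mirrors the demotion of “dollar” described above.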
[0054]
Returning to FIG. 8, in step SA5, the classification method determination unit 50 determines whether to automatically determine the classification method to be applied to the new text group 80. If there is no instruction from the user, the classification method determination unit 50 sets “Yes” as the result of the determination made at step SA5. On the other hand, when the classification method is specified manually by the user, the classification method determination unit 50 sets “No” as the determination result in step SA5, and determines the classification method based on the instruction from the user in step SA7.
[0055]
In this case, in step SA6, the classification method determination unit 50 executes a classification method determination process that automatically determines the classification method by, for example, cross-validation. That is, in step SC1 shown in FIG. 10, the classification method determination unit 50 randomly divides the classification samples (texts) of each classification category in the classification sample data 30 into N groups. In step SC2, the classification method determination unit 50 applies each of a plurality of learning algorithms (classification methods) to (N-1) of the N groups of classification samples, and creates feature elements and classification learning information corresponding to each learning algorithm.
[0056]
In step SC3, the classification method determining unit 50 performs a classification test by applying each learning algorithm, with the feature elements and classification learning information created in step SC2, to the remaining one of the N groups of classification samples, and calculates the classification accuracy. This classification accuracy is calculated individually for each of the plurality of learning algorithms. In step SC4, the classification method determination unit 50 determines whether or not the classification test has been executed N times. In this case, the determination result is “No”. Thereafter, steps SC2 and SC3 are repeated while changing the held-out group one at a time, so that the classification accuracy for each of the N groups of classification samples is calculated for each of the plurality of learning algorithms.
[0057]
When the determination result in step SC4 is “Yes”, in step SC5, the classification method determining unit 50 calculates, for each of the plurality of learning algorithms, the average value of the classification accuracy over the N groups of classification samples. In step SC6, the classification method determination unit 50 picks the highest of the average classification accuracies corresponding to the plurality of learning algorithms (classification methods) and selects the learning algorithm (classification method) corresponding to it. Further, the classification method determination unit 50 notifies the classification learning information generation unit 60 and the automatic classification unit 90 of the learning algorithm (classification method) having the highest classification accuracy.
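The selection loop of steps SC1 to SC6 can be sketched as follows. The two learning algorithms are deliberately trivial stand-ins (the source does not list the actual algorithms), and all function names, the fixed random seed, and the sample texts are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(samples, n, seed=0):
    """SC1: randomly split the samples into n groups, category by category."""
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for s in samples:
        by_cat[s[0]].append(s)
    folds = [[] for _ in range(n)]
    for items in by_cat.values():
        rng.shuffle(items)
        for i, s in enumerate(items):
            folds[i % n].append(s)
    return folds

def train_majority(train):
    """Toy method 1: always predict the most frequent training category."""
    top = Counter(c for c, _ in train).most_common(1)[0][0]
    return lambda text: top

def train_overlap(train):
    """Toy method 2: predict the category whose training vocabulary overlaps most."""
    vocab = defaultdict(set)
    for c, text in train:
        vocab[c].update(text.split())
    return lambda text: max(sorted(vocab), key=lambda c: len(vocab[c] & set(text.split())))

def select_method(samples, methods, n=3):
    folds = stratified_folds(samples, n)
    mean_acc = {}
    for name, train_fn in methods.items():
        accs = []
        for k in range(n):                                                 # SC4: n rounds
            train = [s for i, f in enumerate(folds) if i != k for s in f]  # SC2: n-1 groups
            predict = train_fn(train)
            hits = sum(predict(t) == c for c, t in folds[k])               # SC3: held-out test
            accs.append(hits / len(folds[k]))
        mean_acc[name] = sum(accs) / n                                     # SC5: average
    return max(mean_acc, key=mean_acc.get), mean_acc                       # SC6: best method

samples = (
    [("Economic", t) for t in ["market yen", "market dollar", "yen dollar",
                               "yen market dollar", "dollar yen", "dollar market"]]
    + [("Sport", t) for t in ["match goal", "match team", "goal team",
                              "goal match team", "team goal", "team match"]]
)
methods = {"majority": train_majority, "overlap": train_overlap}
best, mean_acc = select_method(samples, methods, n=3)
```

Splitting per category keeps every group balanced, so the majority baseline lands at chance level while the vocabulary-overlap method separates the two toy categories.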
[0058]
Returning to FIG. 8, in step SA8, the classification learning information generation unit 60, based on the learning algorithm (classification method) notified by the classification method determination unit 50 and the feature element extraction result data from the feature element extraction unit 40, Classification learning information 70 is generated. In step SA9, the classification learning information generation unit 60 registers the classification learning information 70 in a database (not shown). In step SA10, the automatic classification unit 90 determines whether or not the new text group 80 to be classified has been input. In this case, the determination result is “No” and the determination is repeated.
[0059]
When the new text group 80 is input to the automatic classification unit 90, the automatic classification unit 90 sets “Yes” as the result of the determination made at step SA10. In step SA11, the automatic classification unit 90 determines whether all of the new texts TX₁, TX₂, ..., TX₁₀, ... constituting the new text group 80 (see FIG. 2) have been classified; in this case, the determination result is “No”. Thereafter, in steps SA15 to SA21, the automatic classification unit 90 executes automatic classification processing based on the classification method determined by the classification method determination unit 50.
[0060]
Hereinafter, as an example of the classification method, a case where the new text group 80 is classified based on the vector space method will be described. In this case, it is assumed that the classification learning information 70 includes 30 feature elements for each classification category, together with a vector of all feature elements and a vector of each classification category. In this state, in step SA15, the automatic classification unit 90 performs morphological analysis on new text TX₁ in the new text group 80 (see FIG. 2) and extracts feature elements (words). In step SA16, the automatic classification unit 90 executes synonym unification processing for unifying synonyms in the extracted feature elements.
[0061]
In step SA17, the automatic classification unit 90 counts the extracted feature elements. In step SA18, the automatic classification unit 90 acquires, from the plurality of feature elements contained in new text TX₁, those identical to feature elements in the classification learning information 70. Next, the automatic classification unit 90 generates a vector of the acquired feature elements, that is, a document vector for new text TX₁.
[0062]
In step SA19, the degree of similarity (cosine value) between the document vector for new text TX₁ and the vector of each classification category in the classification learning information 70 is calculated. When the vector of a classification category is A and the document vector of new text TX₁ is B, the degree of similarity (cosine value) is expressed by the following equation.
Similarity (cosine value) = (inner product of vector A and document vector B) / (magnitude of vector A × magnitude of document vector B)
[0063]
That is, in step SA19, as many similarities (cosine values) as there are classification categories are calculated for new text TX₁. In step SA20, the automatic classification unit 90 normalizes the calculated similarities (cosine values) to values between 0 and 100. In step SA21, the automatic classification unit 90 selects, from among the plurality of similarities (cosine values), those greater than or equal to a threshold value (for example, 70), and classifies new text TX₁ into the classification category corresponding to each selected similarity. If all of the similarities are less than the threshold value, the automatic classification unit 90 determines that new text TX₁ is a text that cannot be classified. Thereafter, the processes from step SA15 to step SA21 are repeated, whereby the new texts are sequentially classified into classification categories.
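Steps SA15 to SA21 under the vector space method can be sketched as below. The sketch makes several assumptions that the text leaves open: vectors are sparse dictionaries of raw counts, tokenization is a whitespace split, synonym unification (SA16) is omitted, and normalization is taken to mean multiplying the cosine by 100 and rounding. The function names and category vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine value between two sparse vectors stored as {feature: weight}."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify(text, category_vectors, threshold=70):
    """Returns (categories at or above the threshold, all scores)."""
    features = {f for vec in category_vectors.values() for f in vec}
    doc = {}
    for word in text.lower().split():      # SA15, SA17: extract and count words
        if word in features:               # SA18: keep only known feature elements
            doc[word] = doc.get(word, 0) + 1
    # SA19 + SA20: one cosine per category, scaled to 0..100 (assumed scaling)
    scores = {c: round(100 * cosine(vec, doc)) for c, vec in category_vectors.items()}
    selected = {c: s for c, s in scores.items() if s >= threshold}   # SA21
    return selected, scores

category_vectors = {
    "Economic": {"market": 3, "yen": 2},
    "Sport": {"match": 2, "goal": 1},
}
selected, scores = classify("yen market market rally", category_vectors)
unclassified, _ = classify("weather today", category_vectors)
```

An empty `selected` dictionary corresponds to the "cannot be classified" outcome, and a text may land in more than one category if several similarities clear the threshold.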
[0064]
When all new texts have been classified, the automatic classification unit 90 sets the determination result in step SA11 to “Yes”. In step SA12, the automatic classification unit 90 outputs the classification result data 100. In the figure showing this data, text name text1 to text name text20, ... correspond to text name text1 to text name text10, ... shown in FIG. 2, and “AUTOMOTION_INDUSTRRY” indicates a classification category. The number to the right of the classification category represents a score (for example, the similarity). That is, new text TX₁ is classified into the classification category “AUTOMOTION_INDUSTRIY” with a score (similarity) of “85”.
[0065]
Returning to FIG. 8, in step SA13, the clustering unit 110 clusters the new text group 80 using the classification result data 100. FIG. 4 is a diagram illustrating the clustering result C of the clustering unit 110. It shows the breakdown (number of texts, feature elements) of the 26 new texts that were classified into the classification category “Economic” when a new text group 80 composed of 1000 new texts was classified.
[0066]
In step SA14, for example, the clustering result C is displayed on the display unit 120. Thereby, the user can confirm what kind of content has been classified into the classification category (“Economic” in this case).
[0067]
In one embodiment, the feature element ranking list R₂ may be displayed on the display unit 120 and edited at the user's request, and classification may then be performed using the resulting feature element ranking list R₃. In this case, the user uses the input unit 130 to edit the feature element ranking list R₂, deleting the feature elements judged to be unnecessary. As a result, the feature element ranking list R₃ (see FIG. 13) is created, and the processing described above is executed based on this feature element ranking list R₃.
[0068]
In the embodiment, when the structure of the classification sample data 30 and the new text group 80 is defined in advance, the classification target portion in the classification sample data 30 and the new text group 80 may be designated with the input unit 130.
[0069]
In the above description, an example has been described in which the result of clustering by the clustering unit 20 shown in FIG. 1 (the classification sample data 30) is used as it is in the feature element extraction unit 40. However, the clustered result may also be verified. Hereinafter, this case will be described as a first modification of the embodiment with reference to FIGS. 14 and 15.
[0070]
In step SD1 shown in FIG. 15, the clustering unit 20 clusters the sample text group 10 included in the classification sample data 30 (correct data). In this case, the assignment of the classification categories in the classification sample data 30 is ignored. FIG. 14 is a diagram illustrating the result of clustering by the clustering unit 20 (clustering result distribution data CB). This figure shows the result of clustering 277 texts assigned to seven classification categories (“Economic”, “Foreign”, ..., “Sport”).
[0071]
According to this figure, classification categories are clearly assigned for “Sports” in the A record and “Politics” in the C and E records. On the other hand, the distinction between “Economic” and “Industry” in the D record, and among “Foreign”, “Industry”, “Politics”, “Science”, and “Society” in the F and subsequent records, is ambiguous. In this case, the process of step SD4 described later is executed. In step SD2, the comparison unit (not shown) compares the clustered result (classification category assignment) with the classification category assignment initially assumed by the user.
[0072]
In step SD3, the comparison unit determines whether or not the two assignments compared in step SD2 are the same. If the determination result is “No”, the comparison unit causes the display unit 120 to display the comparison result. Then, in step SD4, the user reviews the clustered result (classification category assignment) using the input unit 130 and edits the classification categories. On the other hand, when the determination result in step SD3 is “Yes”, that is, when the assignment of the classification categories in the classification sample data 30 is the same as that initially assumed by the user, the classification categories and classification samples (texts) are used as learning information in step SD5. In step SD6, the comparison unit (not shown) outputs the classification sample data 30 to the feature element extraction unit 40. Thereby, the processing described above is executed.
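The comparison in steps SD2 and SD3 is not specified in detail. One way to picture it, offered purely as an assumption, is a distribution table like the one described for FIG. 14 (one row per cluster, one count per originally assigned category), with a cluster flagged as ambiguous when its dominant category falls below a purity level. The `min_purity` threshold, the function names, and the data are hypothetical.

```python
from collections import Counter, defaultdict

def cluster_distribution(cluster_ids, categories):
    """One row per cluster: how many texts of each assigned category fell into it."""
    table = defaultdict(Counter)
    for cluster, category in zip(cluster_ids, categories):
        table[cluster][category] += 1
    return table

def ambiguous_clusters(table, min_purity=0.8):
    """Clusters whose dominant category covers less than min_purity of their texts."""
    flagged = []
    for cluster, row in sorted(table.items()):
        dominant = row.most_common(1)[0][1]
        if dominant / sum(row.values()) < min_purity:
            flagged.append(cluster)
    return flagged

cluster_ids = ["A", "A", "A", "C", "C", "D", "D", "D", "D"]
categories = ["Sports", "Sports", "Sports", "Politics", "Politics",
              "Economic", "Industry", "Economic", "Industry"]
table = cluster_distribution(cluster_ids, categories)
flagged = ambiguous_clusters(table)
```

An empty `flagged` list would correspond to the “Yes” branch of step SD3, and a non-empty one to the “No” branch, where the user reviews the categories in step SD4.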
[0073]
In the above description, an example has been described in which the classification result data 100 classified by the automatic classification unit 90 is output as it is. However, after the automatic classification unit 90 performs classification, it may be verified whether the classification result data 100 is as desired, and if the verification result is NG, the verification result may be fed back to the classification learning information 70 for relearning to improve the classification accuracy. Hereinafter, this case will be described as a second modification of the embodiment. In this modification, a relearning processing unit 140 is newly provided. The relearning processing unit 140 generates classification learning information 70A in response to the feedback described above.
[0074]
When a new text group 80 composed of 20 new texts is input to the information classification device 200, the new text group 80 is automatically classified based on the classification learning information 70 and a predetermined classification method in the same manner as described above. As a result, the classification result data 100 is output from the information classification device 200 and displayed on the display unit 120. Here, if, in the classification result data 100, the new texts (5) and (6) allocated to the classification category B should have been allocated to the classification category A, and the new text (9) allocated to the classification category C should have been allocated to the classification category D, the user uses the input unit 130 to edit the result into the desired assignment.
[0075]
The re-learning processing unit 140 then executes the re-learning process on the edited classification result data 100, operating in the same way as the classification learning information generation unit 60 (see FIG. 1), and rebuilds the classification learning information as classification learning information 70A. When the new text group 80 is input to the information classification apparatus 200 in this state, it is automatically classified based on the rebuilt classification learning information 70A and a predetermined classification method, in the same manner as described above. In this case, the classification accuracy of the classification result data 100 output from the information classification apparatus 200 is extremely high owing to the effect of relearning.
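The feedback loop of the second modification can be sketched as follows. The sketch uses a deliberately simple word-count classifier as a stand-in, since the text does not fix a particular classification method here; the function names, the sample texts, and the scoring rule are all assumptions made for illustration.

```python
from collections import Counter


def build_learning_info(samples):
    """samples: list of (text, category). Returns {category: Counter of words}."""
    info = {}
    for text, category in samples:
        info.setdefault(category, Counter()).update(text.lower().split())
    return info


def classify(text, info):
    """Pick the category whose word counts overlap most with the text."""
    words = text.lower().split()
    scores = {cat: sum(counts[w] for w in words) for cat, counts in info.items()}
    # sorted() makes tie-breaking deterministic (alphabetical category order).
    return max(sorted(scores), key=scores.get)


def relearn(samples, corrections):
    """Add the user-corrected (text, category) pairs and rebuild the learning info."""
    return build_learning_info(samples + corrections)


# Hypothetical sample data and one new text.
samples = [("printer fails at startup", "A"), ("how do I order toner", "B")]
info = build_learning_info(samples)

new_text = "order toner cartridge fails"
before = classify(new_text, info)        # classified with the original learning info

# The user edits the result: this text belongs in category A.
info_a = relearn(samples, [(new_text, "A")])
after = classify(new_text, info_a)       # classified with the rebuilt learning info

print(before, "->", after)
```

Here `info` plays the role of the classification learning information 70 and `info_a` the rebuilt information 70A; the corrected text changes category once the learning information is regenerated.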
[0076]
In one embodiment, the screen G₁ shown in FIG. 5 may be displayed on the display unit 120 shown in FIG. 1 to present various information generated in the classification process. Screen G₁ displays a folder H₀ corresponding to the classification category “user complaint classification”, and folders H₁ to H₇ corresponding to the classification categories that belong to it, such as “initial failure”, “inquiry”, and “document not classified”.
[0077]
In addition, screens G₂ to G₄ are displayed on screen G₁ under window control. As shown in the drawings, screen G₂ displays the title K₁ and text content K₂ of a sample document (corresponding to the classification sample data 30) for the classification category “inquiry”. Screen G₃ displays the keywords (feature elements) corresponding to the classification category “inquiry”. Screen G₄ displays a list screen J₁ of the new texts classified into the classification category “inquiry” and a contents display screen J₂ showing the contents of a new text. On the new text list screen J₁, the icons I₁ to I₄ represent the change in score (similarity) after re-learning, relative to the score before re-learning, according to the second modification described above.
[0078]
Icon I₁ means that the score (similarity) is higher than the previous time, and icon I₂ means that the score (similarity) is lower than the previous time. Icon I₃ means that a new text previously classified into the classification category (in this case “inquiry”) has not been classified into it this time. Icon I₄ means that a new text not previously classified into the classification category (in this case “inquiry”) has been classified into it this time.
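The four icon meanings reduce to a small decision rule. The sketch below is a minimal illustration, assuming a score of `None` stands for "not classified into this category"; the function name `change_symbol` and the string labels are hypothetical, and the text does not say what is shown when the score is unchanged, so that case returns `None`.

```python
def change_symbol(prev_score, curr_score):
    """Map a text's previous and current scores for one category to an icon label.

    A score of None means the text was not classified into the category.
    """
    if prev_score is None and curr_score is None:
        return None          # never in this category: nothing to display
    if prev_score is None:
        return "I4"          # newly classified into the category this time
    if curr_score is None:
        return "I3"          # previously in the category, not this time
    if curr_score > prev_score:
        return "I1"          # score higher than the previous time
    if curr_score < prev_score:
        return "I2"          # score lower than the previous time
    return None              # unchanged score: no icon defined in the text


# Hypothetical scores for the category "inquiry" before and after re-learning.
previous = {"text5": 0.40, "text6": 0.70, "text9": 0.55, "text12": None}
current = {"text5": 0.65, "text6": 0.50, "text9": None, "text12": 0.60}

symbols = {tid: change_symbol(previous[tid], current[tid]) for tid in previous}
print(symbols)
```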
[0079]
As described above, according to one embodiment, a plurality of classification methods are made available, the classification method determination unit 50 determines, based on the classification sample data 30, the classification method with the highest classification accuracy among them, and the new text group 80 is then classified into the classification categories according to that method. Compared with the conventional case, the classification accuracy can therefore be increased regardless of the content and amount of the information to be classified.
[0080]
Also, according to one embodiment, the feature element extraction unit 40 makes a plurality of feature element extraction methods available, and extracts feature elements respectively corresponding to the plurality of feature element extraction methods. Since the feature element corresponding to the feature element extraction method having high discrimination capability between the classification categories is used as the extraction result, the classification accuracy of the classification result corresponding to the feature element can be further improved.
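As a rough illustration of choosing between feature element extraction methods, the sketch below runs two hypothetical extraction methods and keeps the one whose feature elements separate the categories better. Both extraction methods and the discrimination measure (the share of feature words that belong to exactly one category) are assumptions made for this example; this passage does not define how discrimination capability is measured.

```python
from collections import Counter


def by_frequency(texts_by_cat, n):
    """Hypothetical method 1: the n most frequent words in each category."""
    return {
        cat: {w for w, _ in Counter(" ".join(texts).split()).most_common(n)}
        for cat, texts in texts_by_cat.items()
    }


def by_exclusive_frequency(texts_by_cat, n):
    """Hypothetical method 2: the n most frequent words found in only one category."""
    vocab = {cat: Counter(" ".join(texts).split()) for cat, texts in texts_by_cat.items()}
    result = {}
    for cat, counts in vocab.items():
        others = set().union(*(set(v) for k, v in vocab.items() if k != cat))
        exclusive = Counter({w: c for w, c in counts.items() if w not in others})
        result[cat] = {w for w, _ in exclusive.most_common(n)}
    return result


def discrimination(features):
    """Assumed measure: fraction of feature words that appear in exactly one category."""
    membership = Counter(w for words in features.values() for w in words)
    if not membership:
        return 0.0
    return sum(1 for c in membership.values() if c == 1) / len(membership)


# Hypothetical sample texts grouped by classification category.
texts_by_cat = {
    "failure": ["the printer is broken", "the screen is broken"],
    "inquiry": ["where is the manual", "where is the toner"],
}

methods = {"frequency": by_frequency, "exclusive": by_exclusive_frequency}
scores = {name: discrimination(extract(texts_by_cat, 3)) for name, extract in methods.items()}
best = max(sorted(scores), key=scores.get)
print(best, scores)
```

With this data the frequency-only method picks shared words such as "the" and "is" for both categories, so it scores lower than the method that keeps only category-specific words.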
[0081]
In addition, according to the embodiment, the input unit 130 and the display unit 120 (editing unit) allow the extracted feature elements to be edited (deleted, added, and so on), so feature elements can be set flexibly for each classification category.
[0082]
Moreover, according to one embodiment, the classification target portions of the classification sample data 30 and the new text group 80 are designated through the input unit 130 and the display unit 120 (designating means). Portions unnecessary for classification are thereby excluded and classification is performed on the essentially necessary portions, so the classification accuracy can be further improved.
[0083]
In addition, according to the embodiment, the classification sample data 30 is generated by the clustering unit 20, so efficiency is markedly improved and the burden on the user is reduced compared with generating classification categories manually from a plurality of sample texts.
[0084]
Further, according to the embodiment, the clustering result of the clustering unit 20 is compared with the desired clustering result, and when the two do not match, the classification sample data 30 can be changed through the input unit 130 (changing unit). The new text group 80 can therefore be classified based on more complete classification sample data 30, which makes the classification accuracy extremely high.
[0085]
Further, according to the embodiment, since the clustering result distribution data CB (see FIG. 14) is displayed on the display unit 120, the user can easily grasp the distribution of the classification results.
[0086]
In addition, according to the embodiment, as described in the second modification, based on the optimized classification result, the classification learning information 70A is regenerated, and the new text group 80 is generated according to the classification learning information 70A. Since the classification is performed again, the classification accuracy can be further improved.
[0087]
Further, according to one embodiment, the difference between the classification results before and after the optimization is displayed as icons I₁ to I₄ (difference recognition information), so the user can recognize the difference at a glance and respond to it promptly. As a result, the classification accuracy can be improved.
[0088]
Although one embodiment of the present invention has been described in detail with reference to the drawings, the specific configuration is not limited to this embodiment, and design changes that do not depart from the gist of the present invention are included in the present invention. For example, the information classification program for realizing the functions of the information classification apparatus may be recorded on the computer-readable recording medium 400 shown in FIG. 17, and the program recorded on the recording medium 400 may be read and executed by the computer 300 shown in the same figure to perform information classification.
[0089]
The computer 300 shown in FIG. 17 includes a CPU 301 that executes the information classification program, an input device 302 such as a keyboard and a mouse, a ROM (Read Only Memory) 303 that stores various data, a RAM (Random Access Memory) 304 that stores calculation parameters and the like, a reading device 305 that reads the information classification program from the recording medium 400, an output device 306 such as a display and a printer, and a bus BU that connects the parts of the device.
[0090]
The CPU 301 reads the information classification program recorded on the recording medium 400 via the reading device 305 and then executes it to perform the information classification described above. The recording medium 400 includes not only portable recording media such as optical disks, floppy disks, and hard disks, but also transmission media, such as a network, that hold data temporarily.
[0091]
In the embodiment, the classification method determination unit 50 illustrated in FIG. 1 was described as employing the cross-validation method as an example of how the classification method is determined. However, the present invention is not limited to this method, and the classification method may instead be determined using values such as the recall rate (the proportion of correct answers included in the result) and the precision rate (how few errors the result contains) as keys. In short, the present invention includes any method that meets the requirement that a plurality of classification methods can be used and that the classification method with the highest classification accuracy can be selected from among them.
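Selecting a classification method by cross-validation can be sketched in a few lines of Python. The two candidate methods below (a word-overlap scorer and a majority-category baseline), the fold layout, and the sample data are hypothetical stand-ins, not the classification methods of the embodiment. A precision and recall helper is included to show the alternative keys mentioned above.

```python
from collections import Counter


def train_word_overlap(samples):
    """Hypothetical method 1: score each category by summed word counts."""
    info = {}
    for text, cat in samples:
        info.setdefault(cat, Counter()).update(text.split())

    def predict(text):
        scores = {c: sum(cnt[w] for w in text.split()) for c, cnt in info.items()}
        return max(sorted(scores), key=scores.get)

    return predict


def train_majority(samples):
    """Hypothetical method 2: always predict the most frequent training category."""
    top = Counter(cat for _, cat in samples).most_common(1)[0][0]
    return lambda text: top


def cross_val_accuracy(train, samples, k):
    """Fraction of held-out samples predicted correctly across k folds."""
    correct = 0
    for i in range(k):
        held_out = samples[i::k]  # every k-th sample, starting at index i
        training = [s for j, s in enumerate(samples) if j % k != i]
        predict = train(training)
        correct += sum(predict(text) == cat for text, cat in held_out)
    return correct / len(samples)


def select_method(methods, samples, k=3):
    """Return the name of the method with the highest cross-validated accuracy."""
    scores = {name: cross_val_accuracy(train, samples, k) for name, train in methods.items()}
    return max(sorted(scores), key=scores.get), scores


def precision_recall(predicted, actual, category):
    """Precision and recall for one category, as alternative selection keys."""
    tp = sum(p == category and a == category for p, a in zip(predicted, actual))
    n_pred, n_act = predicted.count(category), actual.count(category)
    return (tp / n_pred if n_pred else 0.0, tp / n_act if n_act else 0.0)


# Hypothetical classification sample data: (sample text, classification category).
samples = [
    ("printer broken on arrival", "failure"),
    ("how to order toner", "inquiry"),
    ("screen broken on delivery", "failure"),
    ("how to change toner", "inquiry"),
    ("unit broken on arrival", "failure"),
    ("how to order paper", "inquiry"),
]
methods = {"word_overlap": train_word_overlap, "majority": train_majority}
best, scores = select_method(methods, samples, k=3)
print(best, scores)
```

To rank methods by recall or precision instead, `cross_val_accuracy` could be replaced by a function that collects the held-out predictions and passes them to `precision_recall`.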
[0092]
[Effects of the Invention]
As described above, according to the invention of claim 1, a plurality of classification methods are made available, the classification method determination means determines, based on the classification sample information, the classification method with the highest classification accuracy among them, and the new text group is then classified into the classification categories according to that method. Compared with the conventional method, the classification accuracy can therefore be increased regardless of the content and amount of the information to be classified. In addition, the classification learning information is regenerated based on classification sample information that includes reclassified new texts, the classification is re-executed, and the difference between the classification results before and after the regeneration is displayed, so the classification accuracy can be increased further.
[0093]
According to the invention of claim 2, the feature element extraction means can use a plurality of feature element extraction methods and extracts feature elements with each of them. Because the feature elements from the extraction method with high discrimination capability between the classification categories are used as the extraction result, the classification accuracy of the resulting classification can be further improved.
[0094]
According to the invention of claim 3, the editing means allows the extracted feature elements to be edited (deleted, added, and so on), so feature elements can be set flexibly for each classification category.
[0095]
According to the invention of claim 4, a plurality of classification methods are made available, the classification method determination means determines, by the cross-validation method and based on the classification sample information, the classification method with the highest classification accuracy among them, and the new text group is then classified into the classification categories according to that method. Compared with the conventional method, the classification accuracy can therefore be increased regardless of the content and amount of the information to be classified.
[0096]
According to the invention of claim 5, the designation means designates the classification target portions of the classification sample information and the new text group, so portions unnecessary for classification are excluded and classification is performed on the essentially necessary portions. The classification accuracy can therefore be further improved.
[0097]
Further, according to the invention of claim 6, since the classification sample information is generated by the clustering means, the efficiency is remarkably improved as compared with the case where the classification category is manually generated from a plurality of sample texts. In addition, the user's workload can be reduced.
[0098]
According to the invention of claim 7, the clustering result of the clustering means is compared with the desired clustering result, and when the two do not match, the classification sample information can be changed by the changing means. The new text group can therefore be classified based on more complete classification sample information, which makes the classification accuracy extremely high.
[0099]
According to the eighth aspect of the invention, since the clustering result is displayed by the clustering result display means, there is an effect that the user can easily grasp the distribution of the classification result.
[0100]
According to the invention of claim 9, the classification learning information is regenerated based on the classification result optimized by the optimization means, and the new text group is classified again according to the classification learning information. Therefore, there is an effect that the classification accuracy can be further improved.
[0101]
According to the invention of claim 10, the difference between the classification results before and after optimization is displayed as the difference recognition information so that the user can recognize the difference at a glance. As a result, the classification accuracy can be improved.
[0102]
According to the invention of claim 11, a plurality of classification methods are made available, the classification method determination step determines, based on the classification sample information, the classification method with the highest classification accuracy among them, and the new text group is then classified into the classification categories according to that method. Compared with the conventional method, the classification accuracy can therefore be increased regardless of the content and amount of the information to be classified. In addition, the classification learning information is regenerated based on classification sample information that includes reclassified new texts, the classification is re-executed, and the difference between the classification results before and after the regeneration is displayed, so the classification accuracy can be increased further.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of an embodiment according to the present invention.
FIG. 2 is a diagram showing an example of the new text group 80.
FIG. 3 is a diagram showing an example of the classification result data 100 shown in FIG. 1.
FIG. 4 is a diagram showing a clustering result C of the clustering unit 110.
FIG. 5 is a diagram showing a display example of the display unit 120 shown in FIG. 1.
FIG. 6 is a diagram showing a display example of the display unit 120 shown in FIG. 1.
FIG. 7 is a diagram showing a display example of the display unit 120 shown in FIG. 1.
FIG. 8 is a flowchart for explaining the operation of the same embodiment.
FIG. 9 is a flowchart for explaining the feature element extraction processing shown in FIG. 8.
FIG. 10 is a flowchart for explaining the classification method determination process shown in FIG. 8.
FIG. 11 is a diagram showing the feature element appearance frequency ranking list R₁ in the same embodiment.
FIG. 12 is a diagram showing the feature element ranking list R₂ in the same embodiment.
FIG. 13 is a diagram showing the feature element ranking list R₃ in the same embodiment.
FIG. 14 is a diagram showing clustering result distribution data CB in the same embodiment.
FIG. 15 is a flowchart illustrating a first modification of the same embodiment.
FIG. 16 is a diagram illustrating a second modification of the same embodiment.
FIG. 17 is a block diagram showing a third modification of the same embodiment.
FIG. 18 is a block diagram showing a configuration of a conventional information classification device.
[Explanation of symbols]
20 Clustering unit
40 Feature element extraction unit
50 Classification method determination unit
60 Classification learning information generation unit
90 Automatic classification unit
110 Clustering unit
120 Display unit
130 Input unit
300 Computer
301 CPU
400 Recording medium

Claims

An information classification apparatus comprising:
feature element extraction means for extracting a feature element for each classification category from each of a plurality of sample texts included in classification sample information in which the plurality of sample texts and a plurality of classification categories are associated in advance;
classification method determination means for determining, based on the classification sample information, the classification method with the highest classification accuracy among a plurality of classification methods;
classification learning information generation means for generating, in accordance with the classification method determined by the classification method determination means, classification learning information representing features of each classification category based on the feature elements extracted by the feature element extraction means;
classification means for classifying a new text group to be classified into the classification categories in accordance with the classification method determined by the classification method determination means and the classification learning information, and for storing the classification result in storage means;
re-learning means for including, in the classification sample information, a new text classified into a first classification category by the classification means with the new text associated with a second classification category, causing the feature element extraction means to re-extract feature elements from the classification sample information including the new text, causing the classification method determination means to re-determine the classification method based on the classification sample information including the new text, causing the classification learning information generation means to regenerate the classification learning information using the re-extracted feature elements, and causing the classification means to reclassify the new text group according to the re-determined classification method and the regenerated classification learning information; and
display means for displaying, on a display unit, the result of reclassifying each new text included in the new text group, with a symbol attached to each new text indicating the difference from the previous classification result stored in the storage means.

The information classification apparatus according to claim 1, wherein the feature element extraction means extracts feature elements by each of a plurality of feature element extraction methods, selects, based on these extraction results, a feature element extraction method having high discrimination capability between classification categories from among the plurality of feature element extraction methods, and uses the feature elements corresponding to the selected method as the extraction result.

The information classification apparatus according to claim 1, further comprising an editing unit that edits the feature element extracted by the feature element extraction unit.

The information classification apparatus according to any one of claims 1 to 3, wherein the classification method determination unit determines a classification method having the highest classification accuracy from among a plurality of classification methods by a cross-validation method.

The information classification apparatus according to claim 1, further comprising a designation unit that designates the classification sample information and a classification target portion in the new text group.

The information classification apparatus according to any one of the preceding claims, further comprising clustering means for generating, by clustering a plurality of sample texts, the classification sample information in which the plurality of sample texts and a plurality of classification categories are associated.

The information classification apparatus according to any one of claims 1 to 5, further comprising: clustering means for clustering the classification sample information; comparison means for comparing the clustering result of the clustering means with a desired clustering result; and changing means for changing the classification sample information as needed based on the comparison result of the comparison means.

The information classification apparatus according to claim 1, further comprising a clustering result display unit configured to cluster a new text group in the classification result of the classification unit and to display the clustering result.

The information classification apparatus further comprising optimization means for optimizing the classification result of the classification means, wherein the classification learning information generation means regenerates the classification learning information based on the optimized classification result, and the classification means classifies the new text group to be classified into the classification categories in accordance with the classification method determined by the classification method determination means and the regenerated classification learning information.

The information classification apparatus according to claim 9, further comprising difference recognition information display means for displaying the difference between the classification result before the optimization and the classification result after the optimization as visually recognizable difference recognition information.

An information classification program that causes a computer to function as:
feature element extraction means for extracting a feature element for each classification category from each of a plurality of sample texts included in classification sample information in which the plurality of sample texts and a plurality of classification categories are associated in advance;
classification method determination means for determining, based on the classification sample information, the classification method with the highest classification accuracy among a plurality of classification methods;
classification learning information generation means for generating, in accordance with the classification method determined by the classification method determination means, classification learning information representing features of each classification category based on the feature elements extracted by the feature element extraction means;
classification means for classifying a new text group to be classified into the classification categories in accordance with the classification method determined by the classification method determination means and the classification learning information, and for storing the classification result in storage means;
re-learning means for including, in the classification sample information, a new text classified into a first classification category by the classification means with the new text associated with a second classification category, causing the feature element extraction means to re-extract feature elements from the classification sample information including the new text, causing the classification method determination means to re-determine the classification method based on the classification sample information including the new text, causing the classification learning information generation means to regenerate the classification learning information using the re-extracted feature elements, and causing the classification means to reclassify the new text group according to the re-determined classification method and the regenerated classification learning information; and
display means for displaying, on a display unit, the result of reclassifying each new text included in the new text group, with a symbol attached to each new text indicating the difference from the previous classification result stored in the storage means.