JP3603392B2

JP3603392B2 - Document classification support method and apparatus

Info

Publication number: JP3603392B2
Application number: JP17068295A
Authority: JP
Inventors: 久雄間瀬; 由起子森本; 洋辻
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1995-07-06
Filing date: 1995-07-06
Publication date: 2004-12-22
Anticipated expiration: 2015-07-06
Also published as: JPH0922414A

Description

【０００１】
【産業上の利用分野】
本発明は、テキスト情報を含む電子化文書を、カテゴリに分類する文書分類方法および装置に関し、特に、計算機による分類結果に対してユーザがチェックする作業を効率良く行うための文書分類支援方法および装置に関する。
【０００２】
【従来の技術】
社会の情報化、および、情報インフラの整備に伴い、大量の情報が氾濫するようになり、必要な情報を効率良く取り出すことが必要不可欠となっている。その解決方法の一つに、予め文書を適当なカテゴリに分類しておくことが挙げられ、計算機による自動分類技術の開発が要求されてきている。
【０００３】
電子化テキスト文書の自動分類技術としては、ＰｒｏｃｅｅｄｉｎｇｓｏｆｓｅｃｏｎｄＡｎｎｕａｌＣｏｎｆｅｒｅｎｃｅｏｎＩｎｎｏｖａｔｉｖｅ（１９９０）や、情報処理学会研究報告ＮＬ−９８−１１や、Ｉｎｆｏ−Ｔｅｃｈ’９４講演論文集ｐｐ．１３８〜ｐｐ．１４６に記載されている技術がある。これらは、テキスト文書中のキーワードの出現傾向に基づいてカテゴリを決定するものである。
【０００４】
【発明が解決しようとする課題】
上記の技術は、計算機によってテキストを全自動で分類するものであり、ユーザと協調的に分類結果を決定する方法については、上記文献の中で言及されていない。また、上記の技術による分類精度は、人間と同等レベルに至っていない。
【０００５】
しかし、人間と同等レベルの分類精度を要求されるような状況では、計算機の分類結果をユーザがチェックする必要がある。従って、計算機とユーザが役割分担し、協調的に分類作業を行うことが、コスト削減につながる。つまり、計算機の分類結果に基づいて、いかに効率良く、少ない作業負担で、分類すべきカテゴリを確定するかが課題となる。
【０００６】
特に、分類処理の対象となる文書の数が大量である場合、１件当たりに要する作業時間をいかに少なくし、作業負担をいかに軽減するかが課題となる。また、カテゴリの数が比較的多い場合や、カテゴリが複雑でその識別が非常に困難である場合、計算機が出力した分類結果が正しいかどうかを判定する作業や、その分類結果が誤りである場合に、真のカテゴリを一から見つける作業は、大変困難となる。従って、これらの作業をいかに効率良く行うかが課題となる。
【０００７】
そこで、本発明の一つの目的は、分類結果が正しいかどうかを判定する作業や、その分類結果が誤りである場合に、真のカテゴリを見つける作業を効率良く行うことにある。
【０００８】
また、大量の文書を順次分類する場合、その順番は、文書の内容に依存していないことが多い。その場合、文書が変わる度に記述された内容が大きく変わるため、チェックするユーザは、内容が変わる毎に、その内容に頭を切り替える必要がある。このため、チェックの効率も悪く、作業負担も増大するという課題がある。
【０００９】
そこで、本発明の他の一つの目的は、分類すべき文書の内容が頻繁に大きく変わることによる作業負担を軽減し、分類作業の効率を向上させることにある。
【００１０】
【課題を解決するための手段】
本発明では、分類処理の対象となる文書が何故そのカテゴリに分類されたかに関するログデータをカテゴリの推定結果出力手段を介してユーザに提示し、提示したログデータをユーザ入力手段を介してユーザに修正させ、修正後のログデータに基づいてカテゴリを再推定し、再推定後のカテゴリを推定結果出力手段を介してユーザに提示することにより、上記課題を解決する。
【００１１】
また、本発明では、テキスト解析手段およびカテゴリ推定手段により複数の文書について分類すべきカテゴリをそれぞれ推定し、推定されたカテゴリが互いに類似しているあるいは同一である文書集合を認定する類似文書認定手段を持ち、類似文書について、推定結果出力手段を介してカテゴリ推定手段によって推定されたカテゴリをユーザに順次提示し、提示された文書についてユーザ入力手段を介してユーザに分類すべきカテゴリを確定させることにより、上記課題を解決する。
【００１２】
【作用】
推定したカテゴリに基づいて、内容の類似している文書をまとめ、ユーザに順次提示してチェックを促すので、内容の大きな変化に伴う、ユーザの頭の切り替えが少なく済み、作業負担が軽減する。また、内容の類似した文書が続くため、以前の文書をチェックしたときのコツ、ノウハウ、教訓、データなどを次回の文書のチェックに活かすことが容易となり、チェック作業時間が少なく済む。
【００１３】
【実施例】
本発明の実施例について、以下、図面を用いて詳細に説明する。
本実施例は、新聞記事をあるカテゴリに分類し、文書データベースに格納するものである。データベースにカテゴリ毎に格納された新聞記事データは、公知の検索システムを用いることにより、検索することが可能である。
【００１４】
図１は、本実施例の概要を示す図である。
まず、分類の対象となる文書を文書入力１で入力する。文書データは、ネットワークを介して外部から取得しても良いし、フロッピーなどの媒体を介して取得しても良いし、音声認識装置、画像認識装置（文字認識を含む）、ペンなどの手書き入力装置などを介して取得しても良い。また、定期的に文書データをまとめて取得しても良いし、流通している文書データを不定期的に逐次取得しても良い。取得した文書データは、文書ファイル１０に一時的に格納する。
【００１５】
次に、ユーザからの分類する文書データの指定およびカテゴリ推定の実行指示により、文書データを解析する。推定されていない文書があるか否かを判別し（１ａ）、ない場合は、ステップ３ａに進む。
【００１６】
ある場合は、まず、テキスト解析２で、テキストから自然言語処理によりその内容を特徴付けるキーワードを自動抽出する。すなわち、単語およびその品詞・活用情報を格納した単語辞書１１を参照して、テキストを単語に分割し、品詞が名詞である単語をキーワードとし、各キーワードの出現頻度とともにキーワードテーブル１２に格納する。
【００１７】
次に、カテゴリ推定３で、予め各カテゴリを特徴付けるキーワードを定義・格納した分類知識１３およびカテゴリの体系を定義したカテゴリ定義テーブル１４を参照して、テキストから抽出したキーワードテーブル１２のキーワードが、どのカテゴリに含まれているかを探索し、含まれている場合には、そのカテゴリに得点を付与する。そして、得点の高いカテゴリがそのテキストの分類すべきカテゴリであると推定する。推定結果は、推定カテゴリテーブル１５に格納する。また、カテゴリを推定する際に用いたキーワード情報や、カテゴリの得点情報などのデータは、ログデータ１７に格納する。
【００１８】
次に、ユーザに推定結果をチェックさせるために、推定結果を出力する。このとき、推定結果をその内容が類似している文書毎に表示するか否かをユーザに指定させ（３ａ）、内容が類似している文書毎に表示しない場合、文書ＩＤの順に推定結果を表示する。
【００１９】
内容が類似している文書毎に表示する場合は、類似文書認定４で、推定カテゴリテーブル１５に格納された各文書のカテゴリ推定結果から、類似している文書を認定し、その結果を類似文書テーブル１６に格納する。
【００２０】
次に、ユーザによってカテゴリが確定されていない文書があれば（４ａ）、カテゴリ推定結果を順次ユーザに提示し（５）、結果のチェックおよび分類すべきカテゴリの確定を促す（５ａ）。このときに、ログデータ１７に格納した解析データもユーザに提示する。
【００２１】
ユーザは、提示されたカテゴリが正しいかをチェックする。そして、正しいのであれば、カテゴリを確定し、文書データベース１８に文書を登録する。
正しくないのであれば、正しいカテゴリを見つけなければならない。そのとき、ユーザが分類すべきカテゴリの推定をしなおすと指示した場合、まず、提示されているログデータについて、ユーザに修正させ（６）、修正後のデータに基づいて、カテゴリを再推定し（７）、新しい推定結果を新しい解析データとともにユーザに提示する。これにより、正しいカテゴリであるとユーザが判断した場合、カテゴリを確定し（８）、文書データベースに登録する（９）。カテゴリの再推定を何度か行っても正しいカテゴリを見つけられない場合、ユーザが人手でカテゴリを確定する。
【００２２】
カテゴリを確定すると、次の文書のチェックに移り（９ａ）、その文書のカテゴリ推定結果およびログデータを出力する。
【００２３】
図２は、本実施例のハードウエアの概要を示す図である。ユーザからの操作指示およびデータを入力するためのキーボード２０、マウス２５、結果を出力する出力モニタ３０、種々の処理を実行する処理装置４０、ファイルやプログラムを格納する記憶装置５０からなる。また、文書データを取得するために、計算機ネットワーク９０と接続されており、ネットワークを介して文書を取得可能となっている。
【００２４】
記憶装置５０は、一時的なデータを格納するワーキングエリア６１、取得した文書データを一時格納する文書ファイル格納エリア６２、単語辞書格納エリア６３、キーワードテーブル格納エリア６４、分類知識格納エリア６５、カテゴリ定義テーブル格納エリア６６、推定カテゴリテーブル格納エリア６７、類似文書テーブル格納エリア６８、ログデータ格納エリア６９、文書データベース格納エリア７０を含んでいる。ワーキングエリア６１以外の上記格納エリアに格納されるのは、データ形式のファイルである。
【００２５】
さらに、記憶装置５０は、テキスト解析処理部格納エリア７１、カテゴリ推定処理部格納エリア７２、類似文書認定処理部格納エリア７３、カテゴリ推定結果表示部格納エリア７４、ログデータ修正部格納エリア７５、カテゴリ再推定処理部格納エリア７６、カテゴリ確定処理部格納エリア７７、文書データベース登録処理部格納エリア７８をも含んでいる。これらの格納エリアに格納されるのは、実行形式のロードモジュールファイルである。
【００２６】
なお、図２に示した（）内の数字は図１に示した各部との対応関係を示す。
【００２７】
図３は、文書に含まれるテキスト情報の一例を示す図である。
本実施例で扱う文書データは、新聞記事であるが、文書データとしては、電子ニュース、電子メール、科学技術論文、特許明細書、クレーム・質問・意見文、会議の議事録など、他の種類のものでも良い。また、本実施例では、文書データには、テキスト情報を含んでいることを前提とし、これらの情報は、テキストコード形式でファイルに格納されていることを前提とする。ただし、静止画、動画、音声情報などがリンクされているものは差し支えない。
【００２８】
図４は、テキスト解析２で参照する単語辞書１１の一例を示す図である。単語辞書は、見出し２０１の他、品詞２０２、活用種２０３、活用行２０４といった単語属性情報を持つ。
【００２９】
図５は、テキスト解析２における、単語分割結果の一例を示す図である。テキスト解析２では、まず、図３のようなテキストに対して、図４の単語辞書１１を参照して、各文を単語毎に分割し、図５のように、単語の見出し２１１および品詞２１２を抽出する。単語分割の具体的な実現方法については、例えば、情報処理学会第４４回全国大会論文集（３）３−１８１に示すように、既に公知であるので、ここでは詳細の記述を省略する。
【００３０】
図６は、テキストから抽出したキーワードを格納するキーワードテーブル１２の一例を示す図である。テキスト解析２では、テキストを単語分割した後、品詞が名詞である単語を抽出してキーワードとし、さらに当該テキストにおける各キーワードの出現頻度を算出し、キーワードの重みとする。もちろん、名詞以外の品詞をキーワードとしても良いし、出現頻度を重みとする以外にも、キーワードの出現位置や、その前後の単語との関係などを考慮して重み付けしても良い。キーワードテーブル１２は、文書を識別する文書ＩＤ２２１、キーワード見出し２２２、その重み２２３からなる。
【００３１】
図７は、カテゴリの体系を定義したカテゴリ定義テーブル１４の一例を示す図である。本実施例では、新聞記事を分類するためのカテゴリとして、大カテゴリ２３１と小カテゴリ２３２という２階層からなるカテゴリを定義している。大カテゴリ２３１のそれぞれには、一つ以上の小カテゴリ２３２が属しており、木構造の体系をしている。カテゴリの階層は、何階層あっても良い。
【００３２】
図８は、分類知識１３の一例を示す図である。本実施例では、キーワードの有無に基づいて分類すべきカテゴリを推定するという手法を用いている。従って、分類知識１３は、カテゴリを特徴付けるキーワードの集合である。すなわち、分類知識１３は、大カテゴリ２４１、小カテゴリ２４２、そのカテゴリを特徴付けるキーワード２４３、およびそのキーワードの重要度に依存する重み２４４からなる。重み２４４は、そのキーワードがそのカテゴリを特徴付ける重要なキーワードであるほど、値が大きい。なお、この分類知識１３は、予め記憶装置５０に格納しておく。また、分類知識は、人手によって作成しても良いし、既にカテゴリの確定しているテキストをカテゴリ別に用意し、カテゴリ毎にキーワードを自動抽出することによって、作成しても良い。
【００３３】
図９は、カテゴリ推定３の処理手順を示す図である。まず、各カテゴリの得点を格納するテーブルを０に初期化する（ステップ５０１）。
【００３４】
次に、キーワードテーブル１２に格納された当該文書のキーワードすべてについて以下の処理を行う（ステップ５０２）。当該キーワードを含む分類知識１３中のカテゴリが存在するか否かを判別し（ステップ５０３）、存在するカテゴリについては、当該文書のキーワードの持つ重みＷｉ（図６の２２３に相当）と、当該カテゴリのキーワードの持つ重みＷｊ（図８の２４４に相当）の積を計算し、当該カテゴリの得点として、加算する（ステップ５０４）。
【００３５】
すべてのキーワードについて上記の処理を行った時点で、各カテゴリの得点が決定されるので、これらの得点から各カテゴリの得点の偏差値を計算する（ステップ５０５）。さらに、偏差値の高い順にカテゴリをソートする（ステップ５０６）。そして、推定カテゴリテーブル１５に、当該文書ＩＤ、カテゴリ、およびその偏差値の値を組にして、偏差値の高い順に格納する（ステップ５０７）。本実施例では、上位３個のカテゴリを格納する。もちろん、上位ｎ個のカテゴリを格納しても良いし、偏差値の値に下限を設けて、下限以上のカテゴリを格納しても良い。最後に、ログデータ１７に、当該文書ＩＤ，当該文書から抽出したキーワード、各キーワードが各カテゴリの持つキーワードに含まれる場合、ステップ５０４の重みＷｉ、重みＷｊ、及びその積の値を格納する（ステップ５０８）。
【００３６】
なお、本実施例は、２階層（大カテゴリ、小カテゴリ）のカテゴリ体系をなしているが、カテゴリ推定３では、小カテゴリについて行い、大カテゴリの推定は、小カテゴリが決まれば一意に決まるので、行っていない。別の推定方法として、まず、大カテゴリについてカテゴリを推定し、上位にランクされた大カテゴリに限定した形で、小カテゴリを推定する方法でも良い。この場合、大カテゴリを特徴付けるキーワードおよびその重みを定義した分類知識１３が必要である。人手により新たに作成しても良いし、小カテゴリに関する分類知識を大カテゴリ毎にまとめあげることで容易に作成することもできる。
【００３７】
図１０は、推定カテゴリテーブル１５の一例を示す図である。推定カテゴリテーブル１５は、文書ＩＤ２５１、推定されたカテゴリの順位２５２、推定された大カテゴリ候補２５３、推定された小カテゴリ候補２５４、そのカテゴリの偏差値２５５からなる。
【００３８】
図１１は、類似文書認定４の処理手順を示す図である。まず、類似文書テーブル１６を初期化する（ステップ５２１）。次に、すべてのカテゴリについて、以下の処理を行う（ステップ５２２）。推定カテゴリテーブル１５を参照して、カテゴリを推定した文書の中で、当該カテゴリに第１位に分類すべきと推定された文書の文書ＩＤを抽出する（ステップ５２３）。
【００３９】
次に、抽出した文書ＩＤについて、第２位に分類すべきと推定されたカテゴリ毎にまとめ、当該カテゴリと対応付けて、類似文書テーブル１６に格納する（ステップ５２４）。
【００４０】
図１２は、類似文書テーブル１６の一例を示す図である。図１１に示すように、本実施例では、第１位に推定されたカテゴリと第２位に推定されたカテゴリが同一の文書毎にまとめられて、類似文書テーブル１６に格納している。すなわち、類似文書テーブル１６は、第１位に推定されたカテゴリ２６１、第２位に推定されたカテゴリ２６２、そして、それらを推定結果としてもつ文書ＩＤ２６３から構成される。
【００４１】
図１３は、カテゴリ推定結果表示の一例を示す図である。ここで、文書指定ボタン４０１は、処理する文書の範囲を指定するものであり、文書の存在するディレクトリを指定する。分類ボタン４０２は、指定された文書について、テキスト解析２およびカテゴリ推定３を実行し、推定結果およびログデータを得る。再分類ボタン４０３は、ユーザによって修正されたデータに基づいてカテゴリの再推定を実行し、再推定結果を出力する。絞込分類ボタン４０４は、後述するように、上位階層のカテゴリをユーザに指定させ、そのカテゴリに属する下位カテゴリに限定した中でカテゴリ推定を実行し、推定結果を出力する。カテゴリ一覧ボタン４０５は、カテゴリ定義テーブル１４の内容を表示する。分類知識参照ボタン４０６は、分類知識１３に格納されているキーワードおよびその重みをカテゴリ別に表示する。終了ボタン４０７は、システムを終了する。
【００４２】
４１１は、テキストの内容を表示するエリアであり、文書テキストのＩＤも表示している。４１２は、当該テキストから抽出したキーワードおよびその重み（出現頻度）を対にして重みの高い順に表示するエリアである。
【００４３】
４１３は、各カテゴリについて、４１２のキーワードのうち、どのキーワードを含んでいるか、また、その得点はどのくらいの大きさかを表示する。カテゴリの指定は、分類結果である４１４のカテゴリのうちのどれか一つを指定することにより行う。図１３の４１３で、例えば、「円」というキーワードは、「国際経済」という小カテゴリのキーワードに含まれており、テキストから抽出したキーワードの持つ重みＷｉが４、分類知識１３の「国際経済」という小カテゴリのキーワード「円」の持つ重みＷｊが８、その結果、得点が４×８＝３２点与えられたことを示している。
【００４４】
４１４は、推定された大カテゴリ、小カテゴリ、およびその偏差値を表示するエリアである。４１５は、ユーザが確定したカテゴリを表示するエリアである。
４１６は、現在チェックしている文書の直前にチェックした文書について、そのカテゴリ推定結果およびログデータ、確定カテゴリを表示するボタンである。これらチェック済みの文書に関するデータは、推定カテゴリテーブルおよびログデータに格納されているので、それらのデータを表示することで容易に実現可能である。
【００４５】
４１７は、現在チェックしている文書についてカテゴリを確定し、次の文書のチェックに移ることを指示するボタンである。この時点で、４１５に記述されたカテゴリを分類すべきカテゴリとして確定し、文書データベース１８に当該文書をカテゴリ情報とともに登録する。
【００４６】
図１４は、カテゴリ推定結果表示の他の一例を示す図である。４２１は、分類知識の一覧であり、分類知識参照ボタン４０６を押した時に、分類知識１３を参照して表示する。４２２は、カテゴリ一覧ボタン４０５を押した時に、カテゴリ定義テーブル１４を参照して表示する。４２３は、カテゴリの範囲を記述した文章であり、カテゴリ一覧４２２において、どれか一つのカテゴリを選択した場合に、表示される。
【００４７】
図１５は、ユーザによりログデータが修正された後の画面の一例を示す図である。４１１、４１２については、ユーザがキーボード２０およびマウス２５を介して表示されたデータを修正できるようになっている。図１５では、４１２について修正がなされている。キーワードに関しては、表示されているキーワードの削除、新しいキーワードの追加、表示されている重みの修正が可能である。修正前の画面である図１３に対し、図１５では、「円」、「為替市場」、「急騰」などのキーワードの重みが修正され、また、「１日」、「一時」などのあまり重要でないキーワードが削除されている。
【００４８】
図１６は、カテゴリ再推定結果の一例を示す図である。キーワードおよびその重みを修正した結果、分類結果４１４として、前回の推定結果として現れなかったカテゴリ「為替」が第１位に新しく現れたことを示している。このように新たに現れたカテゴリについては、星印を付加して、他のカテゴリと区別している。もちろん、区別の仕方は星印の付加以外でも良い。
【００４９】
図１７は、カテゴリ再推定７の処理手順を示す図である。まず、各カテゴリの得点を格納するテーブルを０に初期化する（ステップ５４１）。
【００５０】
次に、当該文書ＩＤ、修正後のテキスト、修正後のキーワードおよびその重みを出力画面から読み取り、ワーキングエリア１６に格納する（ステップ５４２）。次に、テキスト情報が修正されたか否かを判別する（ステップ５４３）。テキスト情報が修正されてしまうと、そこから抽出されるキーワードおよびその重みが大きく変わるため、テキスト解析２からやり直す必要がある。それに対して、テキスト情報が修正されていない場合は、表示画面から読み取ったキーワード情報を使用することができるので、カテゴリ推定３から処理すれば良い。テキスト情報が修正されたか否かについては、テキスト修正フラグを設け、そのオンオフにより判別できる。
【００５１】
ステップ５４３で、テキスト情報が修正された場合、テキスト解析２を実行して、修正後のテキストからキーワードおよび重みを抽出し、結果をワーキングエリア６１に格納する（ステップ５４４）。
【００５２】
次に、ワーキングエリア６１に格納されたすべてのキーワードについて、以下の処理を行う（ステップ５４５）。当該キーワードを含む分類知識中のカテゴリが存在するか否かを判別し（ステップ５４６）、存在するカテゴリについては、当該文書のキーワードの持つ重みＷｉ（図６の２２３に相当）と、当該カテゴリのキーワードの持つ重みＷｊ（図８の２４４に相当）の積を計算し、当該カテゴリの得点として、加算する（ステップ５４７）。
【００５３】
すべてのキーワードについて行った時点で、各カテゴリの得点が決定されるので、これらの得点から各カテゴリの得点の偏差値を計算する（ステップ５４８）。さらに、偏差値の高い順にカテゴリをソートする（ステップ５４９）。そして、推定カテゴリテーブル１５に、当該文書ＩＤ、カテゴリ、およびその偏差値の値を組にして、偏差値の高い順に格納する（ステップ５５０）。
【００５４】
図１８は、ログデータ１７の一例を示す図である。ログデータ１７には、文書ＩＤ、テキストから抽出したキーワードおよびその重み、カテゴリ別の得点の内訳、確定されたカテゴリに関するデータを、システム終了するまで格納、保持する。従って、ある文書のカテゴリ推定結果をチェックしているときに、それまでにチェック済みの文書のデータを参照することもできる。
【００５５】
図１９は、カテゴリ確定８の一例を示す図である。ユーザは、分類結果４１４を参照して、カテゴリを確定する。本実施例では、分類結果４１４において、確定したいカテゴリをマウスでダブルクリックすることにより、選択したカテゴリを確定カテゴリ４１５に表示する。
【００５６】
このように、本実施例によれば、文書を分類したい場合、計算機によってカテゴリの候補を推定させ、その結果を表示させ、それをユーザがチェックするというマンマシン分担型の文書分類支援システムを実現できる。また、分類結果を表示する際に、推定されたカテゴリ別にまとめて順次結果を提示するので、ユーザは効率良くチェックが行える。また、提示された結果が誤りであっても、データを修正し、再分類することによって、正しいカテゴリに分類する精度を向上させることができ、分類すべきカテゴリをユーザが一から見つけるという負担の大きな作業をする割合を極力少なくすることができる。
【００５７】
次に、本実施例の変形例について述べる。
類似文書認定４において、本実施例では、上位２個の推定カテゴリによって認定したが、推定カテゴリの代わりに、テキストから抽出した重みの高いキーワードによって認定しても良い。
【００５８】
図２０は、その処理方法を示す図である。まず、類似文書テーブル１６を初期化する（ステップ５６１）。次に、類似文書としてまだ認定されていない文書の存在する間、以下の処理を実行する（ステップ５６２）。認定されていないある文書について、当該文書から抽出された重みの高いｍ種類のキーワードのうちのｎ種類（ｍ＞＝ｎ）以上のキーワードが、重みの高いｍ種類のキーワードの中に含まれている文書を抽出し、類似文書集合を識別するための集合識別子とともに、類似文書テーブルに格納する（ステップ５６３）。図１１では、集合識別子に相当するものとして、カテゴリの名称を用いていたが、ここでは、それを代用するものとして、集合識別子を定義する。これは、類似文書集合を識別可能であれば、どんな形でも良い。
【００５９】
ステップ５６３の後、類似文書テーブル１６に格納した文書をステップ５６２の処理対象から除く（ステップ５６４）。
以上の処理によって、カテゴリ推定された結果をユーザに提示する際に、重みの高いキーワードをどれだけ共有しているかということに基づいて類似文書毎に提示することが可能となる。
【００６０】
次に、本実施例の拡張例について述べる。
本実施例のように、カテゴリが複数の階層からなる場合、上位カテゴリをユーザに提示して指定させ、指定された上位カテゴリに属する下位カテゴリに限定してカテゴリの推定を行うことにより、分類精度向上が期待できる。これは、特に、下位カテゴリの数が膨大である場合に、有効である。
【００６１】
図２１は、大カテゴリを指定するための画面の一例を示した図である。大カテゴリの指定は、絞込分類ボタン４０４が押された時、指定用画面４２４を表示することによって行われる。大カテゴリの指定は、複数であっても良い。また、指定用画面４２４における大カテゴリの表示順序は、基本的には、カテゴリ定義テーブル１４に定義されている順序であるが、カテゴリ推定３において、まず大カテゴリを推定し、その結果を用いて小カテゴリを推定する手法を採用する場合には、当該文書の大カテゴリに関する推定結果をログデータ１７に格納・保持しておくことにより、大カテゴリの推定結果の順序に基づいて表示することも可能である。
【００６２】
指定用画面４２４によって、大カテゴリを指定した後、再分類ボタン４０３を押すことによって、指定された大カテゴリに限定したカテゴリ再推定７を実行する。図１７に示すカテゴリ再推定７の処理手順のステップ５５０において、推定カテゴリテーブル１５に推定結果を格納する際に、推定されたカテゴリの大カテゴリがユーザによって指定された大カテゴリに含まれている場合に限り、格納することにより、上位カテゴリによる絞り込みが実現できる。図１３の結果表示において、仮に、ユーザが、大カテゴリを「経済」に絞り込んだ場合、分類結果４１４において、２位の「政治：国会」というカテゴリは、除去される。
【００６３】
このように、上位カテゴリが比較的少なく、ユーザが容易に確定できる場合、上位カテゴリで絞り込んでカテゴリを推定することにより、正しいカテゴリを得ることができるようになる。
【００６４】
【発明の効果】
文書の自動分類結果をユーザがチェックする際に、計算機によって分類された結果が類似した文書毎にユーザに順次提示し、チェックを促すので、以前の文書をチェックしたときのコツ、ノウハウ、教訓、データなどを次回の文書のチェックに活かすことが容易となり、チェック作業時間が少なく済む。
【００６５】
また、自動分類結果が誤りであった場合でも、自動分類結果とともに出力するログデータをユーザに修正させ、再推定することにより、正しい分類結果を導くことが可能であるため、最初の自動分類結果が誤りであった場合に、ユーザが一から分類しなおすという負担の重い作業を軽減することができる。
【図面の簡単な説明】
【図１】本実施例の概要を示す図である。
【図２】本実施例のハードウエアの概要を示す図である。
【図３】文書に含まれるテキストの一例を示す図である。
【図４】単語辞書の一例を示す図である。
【図５】テキスト解析における単語分割結果の一例を示す図である。
【図６】キーワードテーブルの一例を示す図である。
【図７】カテゴリ定義テーブルの一例を示す図である。
【図８】分類知識の一例を示す図である。
【図９】カテゴリ推定の処理手順を示す図である。
【図１０】推定カテゴリテーブルの一例を示す図である。
【図１１】類似文書認定の処理手順を示す図である。
【図１２】類似文書テーブルの一例を示す図である。
【図１３】カテゴリ推定結果表示の一例を示す図である。
【図１４】カテゴリ推定結果表示の他の一例を示す図である。
【図１５】ユーザにより修正後の画面の一例を示す図である。
【図１６】カテゴリ再推定結果の一例を示す図である。
【図１７】カテゴリ再推定の処理手順を示す図である。
【図１８】ログデータの一例を示す図である。
【図１９】カテゴリ確定の一例を示す図である。
【図２０】類似文書認定の他の処理手順を示す図である。
【図２１】上位カテゴリの絞り込みの一例を示す図である。
【符号の説明】
１：文書入力、２：テキスト解析、３：カテゴリ推定、４：類似文書認定、
５：カテゴリ推定結果表示、６：ログデータ修正、７：カテゴリ再推定、
８：カテゴリ確定、９：文書データベース登録、１０：文書ファイル、
１１：単語辞書、１２：キーワードテーブル、１３：分類知識、
１４：カテゴリ定義テーブル、１５：推定カテゴリテーブル、
１６：類似文書テーブル、１７：ログデータ、１８：文書データベース
[0001]
[Industrial applications]
The present invention relates to a document classification method and apparatus for classifying digitized documents that include text information into categories, and more particularly to a document classification support method and apparatus that let a user efficiently check the classification results produced by a computer.
[0002]
[Prior art]
With the informatization of society and the development of information infrastructure, large amounts of information now overflow, and extracting the necessary information efficiently has become essential. One solution is to classify documents into appropriate categories in advance, which has created a demand for computer-based automatic classification techniques.
[0003]
Automatic classification techniques for digitized text documents include those described in Proceedings of Second Annual Conference on Innovative (1990), IPSJ Research Report NL-98-11, and the Info-Tech '94 lecture proceedings, pp. 138-146. These techniques determine the category based on the occurrence tendencies of keywords in the text document.
[0004]
[Problems to be solved by the invention]
The above techniques classify text fully automatically by computer, and the cited literature does not mention any method of determining the classification result in cooperation with a user. In addition, the classification accuracy of these techniques has not reached the level of a human.
[0005]
However, where classification accuracy equal to that of a human is required, the user must check the computer's classification results. Dividing the roles between the computer and the user so that they perform the classification work cooperatively therefore leads to cost reduction. In other words, the issue is how to determine the category for each document efficiently and with little work load, based on the computer's classification results.
[0006]
In particular, when the number of documents to be classified is large, the issue is how to reduce the work time required per document and how to lighten the work load. In addition, when the number of categories is relatively large, or when the categories are complex and very hard to tell apart, it becomes very difficult both to judge whether the classification result output by the computer is correct and, when it is incorrect, to find the true category from scratch. How to perform these tasks efficiently is therefore an issue.
[0007]
Therefore, one object of the present invention is to make efficient both the task of judging whether a classification result is correct and the task of finding the true category when the classification result is incorrect.
[0008]
Also, when a large number of documents are classified sequentially, their order often does not depend on their contents. In that case, the content changes greatly from one document to the next, so the user performing the check must mentally switch to new subject matter each time. As a result, checking is inefficient and the work load increases.
[0009]
Therefore, another object of the present invention is to reduce the work load due to frequent and significant changes in the contents of documents to be classified, and to improve the efficiency of the classification work.
[0010]
[Means for Solving the Problems]
In the present invention, log data explaining why a document to be classified was assigned to its category is presented to the user via the category estimation result output means; the user corrects the presented log data via the user input means; the category is re-estimated based on the corrected log data; and the re-estimated category is presented to the user via the estimation result output means. This solves the above problems.
[0011]
Further, the present invention estimates, by text analysis means and category estimation means, the category into which each of a plurality of documents should be classified, and has similar document recognition means for recognizing sets of documents whose estimated categories are similar or identical to each other. For the similar documents, the categories estimated by the category estimation means are presented to the user sequentially via the estimation result output means, and the user determines the category of each presented document via the user input means. This also solves the above problems.
[0012]
[Action]
Documents with similar contents are grouped based on their estimated categories and presented to the user one after another for checking, so the user needs to switch mental context less often because of large changes in content, and the work load is reduced. In addition, because documents with similar contents follow one another, the tips, know-how, lessons, and data gained while checking earlier documents are easy to apply to the next document, which shortens the checking time.
[0013]
[Example]
Embodiments of the present invention will be described below in detail with reference to the drawings.
In this embodiment, newspaper articles are classified into categories and stored in a document database. The newspaper article data stored in the database by category can be searched with a known search system.
[0014]
FIG. 1 is a diagram illustrating an outline of the present embodiment.
First, a document to be classified is input at document input 1. The document data may be obtained from outside via a network, via a medium such as a floppy disk, or via a speech recognition device, an image recognition device (including character recognition), or a handwriting input device such as a pen. The document data may be acquired in batches at regular intervals, or circulating document data may be acquired one by one at irregular intervals. The acquired document data is temporarily stored in the document file 10.
[0015]
Next, the document data is analyzed when the user designates the document data to be classified and instructs execution of category estimation. It is determined whether any document has not yet been estimated (1a); if there is none, the process proceeds to step 3a.
[0016]
If there is such a document, text analysis 2 first uses natural language processing to automatically extract keywords that characterize the content of the text. That is, the text is divided into words with reference to the word dictionary 11, which stores words together with their part-of-speech and conjugation information; words whose part of speech is noun are taken as keywords and stored in the keyword table 12 together with the occurrence frequency of each keyword.
[0017]
Next, category estimation 3 refers to the classification knowledge 13, in which keywords characterizing each category are defined and stored in advance, and to the category definition table 14, which defines the category system. It searches for the categories that contain each keyword of the keyword table 12 extracted from the text and, when a category contains the keyword, assigns a score to that category. Categories with high scores are then estimated to be the categories into which the text should be classified. The estimation result is stored in the estimated category table 15. Data such as the keyword information used in estimating the category and the score information of the categories are stored in the log data 17.
[0018]
Next, the estimation results are output so that the user can check them. At this point, the user specifies whether the estimation results are to be displayed grouped by documents with similar contents (3a). If not, the estimation results are displayed in order of document ID.
[0019]
If the results are to be displayed grouped by similar documents, similar document recognition 4 identifies similar documents from the category estimation results of each document stored in the estimated category table 15 and stores the result in the similar document table 16.
[0020]
Next, if there is a document whose category has not yet been determined by the user (4a), the category estimation results are presented to the user one document at a time (5), and the user is prompted to check the result and determine the category (5a). At this time, the analysis data stored in the log data 17 is also presented to the user.
[0021]
The user checks whether the presented category is correct. If it is correct, the category is determined and the document is registered in the document database 18.
If it is not correct, the correct category must be found. If the user then instructs the system to re-estimate the category, the user first corrects the presented log data (6), the category is re-estimated based on the corrected data (7), and the new estimation result is presented to the user together with the new analysis data. If the user judges the resulting category to be correct, the category is determined (8) and the document is registered in the document database (9). If the correct category cannot be found even after several re-estimations, the user determines the category manually.
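The correct-and-re-estimate cycle described here can be sketched as follows. This is only a minimal illustration, assuming the log data is reduced to a keyword-to-weight mapping and that re-estimation simply reruns weighted keyword scoring. The names `apply_corrections` and `rank_categories`, and all sample keywords and weights, are hypothetical and do not come from the specification.

```python
def apply_corrections(keywords, delete=(), add=None, reweight=None):
    """Return a corrected copy of the keyword -> weight mapping (step 6)."""
    corrected = {k: w for k, w in keywords.items() if k not in delete}
    corrected.update(add or {})
    for keyword, weight in (reweight or {}).items():
        if keyword in corrected:
            corrected[keyword] = weight
    return corrected


def rank_categories(keywords, knowledge):
    """Score each category and return category names, best first (step 7)."""
    scores = {
        category: sum(w * cat_kw.get(k, 0) for k, w in keywords.items())
        for category, cat_kw in knowledge.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical classification knowledge: category -> {keyword: weight}.
knowledge = {
    "international economy": {"yen": 8},
    "foreign exchange": {"yen": 6, "exchange market": 5},
}
extracted = {"yen": 4, "exchange market": 1, "1st": 2}

before = rank_categories(extracted, knowledge)
corrected = apply_corrections(
    extracted, delete={"1st"}, reweight={"exchange market": 3}
)
after = rank_categories(corrected, knowledge)
```

With these sample values, raising the weight of one keyword is enough to change which category ranks first, which is the effect the user relies on when correcting the log data.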
[0022]
When the category is determined, the process proceeds to the next document check (9a), and the category estimation result and log data of the document are output.
[0023]
FIG. 2 is a diagram illustrating an outline of the hardware of the present embodiment. The hardware comprises a keyboard 20 and a mouse 25 for inputting operation instructions and data from the user, an output monitor 30 for outputting results, a processing device 40 for executing various processes, and a storage device 50 for storing files and programs. It is also connected to a computer network 90 so that document data can be acquired via the network.
[0024]
The storage device 50 includes a working area 61 for storing temporary data, a document file storage area 62 for temporarily storing acquired document data, a word dictionary storage area 63, a keyword table storage area 64, a classification knowledge storage area 65, a category definition table storage area 66, an estimated category table storage area 67, a similar document table storage area 68, a log data storage area 69, and a document database storage area 70. The storage areas other than the working area 61 hold data files.
[0025]
Further, the storage device 50 includes a text analysis processing unit storage area 71, a category estimation processing unit storage area 72, a similar document recognition processing unit storage area 73, a category estimation result display unit storage area 74, a log data correction unit storage area 75, a category It also includes a re-estimation processing unit storage area 76, a category determination processing unit storage area 77, and a document database registration processing unit storage area 78. Stored in these storage areas are executable load module files.
[0026]
Note that the numbers in parentheses shown in FIG. 2 indicate the correspondence with the respective parts shown in FIG. 1.
[0027]
FIG. 3 is a diagram illustrating an example of text information included in a document.
The document data handled in this embodiment is newspaper articles, but other types of document data may be used, such as electronic news, e-mail, scientific and technical papers, patent specifications, complaints, questions, and opinions, or minutes of meetings. This embodiment assumes that the document data includes text information and that this information is stored in a file in text code format. Documents to which still images, moving images, audio information, and the like are linked are also acceptable.
[0028]
FIG. 4 is a diagram illustrating an example of the word dictionary 11 referred to in text analysis 2. In addition to the headword 201, the word dictionary holds word attribute information such as the part of speech 202, the conjugation type 203, and the conjugation row 204.
[0029]
FIG. 5 is a diagram illustrating an example of a word segmentation result in text analysis 2. In text analysis 2, each sentence of a text such as that in FIG. 3 is first divided into words with reference to the word dictionary 11 of FIG. 4, and the headword 211 and part of speech 212 of each word are extracted as shown in FIG. 5. Specific methods of word segmentation are already known, as shown for example in the Proceedings of the 44th National Convention of the Information Processing Society of Japan (3), 3-181, so a detailed description is omitted here.
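The specification leaves word segmentation to known methods. Purely as an illustration, the sketch below uses greedy longest-match lookup against a tiny dictionary; this is an assumed stand-in, not the cited method. The dictionary entries, the `segment` function, and the fallback that emits unmatched characters as "unknown" words are all hypothetical.

```python
# Hypothetical miniature word dictionary: headword -> part of speech.
WORD_DICT = {
    "円": "noun",
    "為替": "noun",
    "市場": "noun",
    "為替市場": "noun",
    "が": "particle",
    "急騰": "noun",
    "した": "verb",
}


def segment(text, dictionary):
    """Greedy longest-match segmentation into (headword, part of speech) pairs."""
    max_len = max(len(word) for word in dictionary)
    result = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first and shrink until the dictionary hits.
        for length in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            if candidate in dictionary:
                result.append((candidate, dictionary[candidate]))
                pos += length
                break
        else:
            # No entry matched: emit a single character as an unknown word.
            result.append((text[pos], "unknown"))
            pos += 1
    return result


words = segment("円が為替市場で急騰した", WORD_DICT)
```

Longest match prefers the compound entry "為替市場" over its parts, which is one simple way to keep multi-character nouns intact as keywords.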
[0030]
FIG. 6 is a diagram illustrating an example of the keyword table 12 that stores keywords extracted from text. In text analysis 2, after the text is divided into words, the words whose part of speech is noun are extracted as keywords, and the occurrence frequency of each keyword in the text is calculated and used as the weight of the keyword. Of course, parts of speech other than nouns may also be used as keywords, and instead of using the occurrence frequency as the weight, weighting may take into account the position where the keyword appears, its relationship with the preceding and following words, and the like. The keyword table 12 consists of a document ID 221 identifying the document, a keyword headword 222, and its weight 223.
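A minimal sketch of building keyword table rows from a segmentation result follows, assuming the segmentation is available as (headword, part of speech) pairs. The function name `build_keyword_table`, the `keyword_pos` parameter, and the English sample words are hypothetical.

```python
from collections import Counter


def build_keyword_table(doc_id, segmented_words, keyword_pos=("noun",)):
    """Return rows of (document ID, keyword headword, weight).

    The weight is the keyword's occurrence count in the text.
    """
    counts = Counter(
        headword for headword, pos in segmented_words if pos in keyword_pos
    )
    # most_common() orders by descending count, which suits display.
    return [(doc_id, keyword, weight) for keyword, weight in counts.most_common()]


# Hypothetical segmentation result: (headword, part of speech) pairs.
segmented = [
    ("yen", "noun"), ("rise", "verb"), ("market", "noun"),
    ("yen", "noun"), ("sharply", "adverb"), ("yen", "noun"),
]
rows = build_keyword_table("D001", segmented)
```

Passing a different `keyword_pos` tuple corresponds to the variation in which parts of speech other than nouns are also treated as keywords.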
[0031]
FIG. 7 is a diagram illustrating an example of the category definition table 14, which defines the category system. In this embodiment, the categories for classifying newspaper articles are defined in two levels, large categories 231 and small categories 232. One or more small categories 232 belong to each large category 231, so the system forms a tree structure. The category hierarchy may have any number of levels.
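One way to picture the two-level category definition table is a mapping from each large category to its small categories. The sketch below assumes that shape; the assignment of small categories to large categories and the helper `large_category_of` are hypothetical.

```python
# Hypothetical category definition table: large category -> small categories.
CATEGORY_TABLE = {
    "economy": ["international economy", "foreign exchange"],
    "politics": ["parliament"],
}


def large_category_of(small_category, table):
    """Return the unique large category that a small category belongs to."""
    for large, smalls in table.items():
        if small_category in smalls:
            return large
    raise KeyError(small_category)


parent = large_category_of("foreign exchange", CATEGORY_TABLE)
```

Because the system is a tree, each small category has exactly one parent, so the large category follows directly once the small category is known.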
[0032]
FIG. 8 is a diagram illustrating an example of the classification knowledge 13. This embodiment estimates the category based on the presence or absence of keywords, so the classification knowledge 13 is a set of keywords characterizing each category. That is, the classification knowledge 13 consists of a large category 241, a small category 242, a keyword 243 characterizing that category, and a weight 244 that depends on the importance of the keyword. The weight 244 is larger the more important the keyword is in characterizing the category. The classification knowledge 13 is stored in the storage device 50 in advance. It may be created by hand, or it may be created by preparing, for each category, texts whose categories are already determined and automatically extracting keywords for each category.
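The classification knowledge can be pictured as rows of (large category, small category, keyword, weight). The sketch below assumes that layout and builds a keyword index for fast lookup. Apart from the keyword "yen" with weight 8 under "international economy", which appears in the example of FIG. 13, every row and the helper `index_by_keyword` are hypothetical.

```python
from collections import defaultdict

# Rows of (large category, small category, keyword, weight Wj).
CLASSIFICATION_KNOWLEDGE = [
    ("economy", "international economy", "yen", 8),
    ("economy", "international economy", "trade", 5),
    ("economy", "foreign exchange", "yen", 6),
    ("politics", "parliament", "bill", 7),
]


def index_by_keyword(rows):
    """Map each keyword to the (small category, weight) pairs that contain it."""
    index = defaultdict(list)
    for _large, small, keyword, weight in rows:
        index[keyword].append((small, weight))
    return dict(index)


index = index_by_keyword(CLASSIFICATION_KNOWLEDGE)
```

Indexing by keyword lets the estimation step find every category containing a given keyword with a single lookup.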
[0033]
FIG. 9 is a diagram illustrating the processing procedure of category estimation 3. First, the table that stores the score of each category is initialized to 0 (step 501).
[0034]
Next, the following processing is performed for every keyword of the document stored in the keyword table 12 (step 502). It is determined whether the classification knowledge 13 has any category that contains the keyword (step 503). For each such category, the product of the weight Wi of the document's keyword (corresponding to 223 in FIG. 6) and the weight Wj of the category's keyword (corresponding to 244 in FIG. 8) is calculated and added to the score of that category (step 504).
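Steps 501 to 504 can be sketched as below, assuming the keyword table is a keyword-to-Wi mapping and the classification knowledge is a mapping from category to keyword-to-Wj. The function name `score_categories` and the sample data are hypothetical.

```python
def score_categories(doc_keywords, knowledge):
    """Accumulate Wi * Wj per category for every matching keyword."""
    scores = {category: 0 for category in knowledge}          # step 501
    for keyword, wi in doc_keywords.items():                  # step 502
        for category, category_keywords in knowledge.items():
            if keyword in category_keywords:                  # step 503
                scores[category] += wi * category_keywords[keyword]  # step 504
    return scores


# Hypothetical inputs.
doc_keywords = {"yen": 4, "market": 2}
knowledge = {
    "international economy": {"yen": 8, "trade": 5},
    "foreign exchange": {"yen": 6, "market": 3},
    "parliament": {"bill": 7},
}
scores = score_categories(doc_keywords, knowledge)
```

Categories that share no keyword with the document keep a score of 0, so they remain in the table for the later ranking step.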
[0035]
Once the above processing has been performed for all keywords, the score of each category is fixed, and the deviation value of each category's score is calculated from these scores (step 505). The categories are then sorted in descending order of deviation value (step 506), and the document ID, the category, and its deviation value are stored as a set in the estimated category table 15 in descending order of deviation value (step 507). In this embodiment, the top three categories are stored. Of course, the top n categories may be stored instead, or a lower limit may be set on the deviation value and the categories at or above that limit stored. Finally, the log data 17 stores the document ID, the keywords extracted from the document, and, for each keyword that is included in a category's keywords, the weight Wi, the weight Wj, and their product from step 504 (step 508).
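Steps 505 to 507 might look like the sketch below. The text here does not give the formula for the deviation value, so the code assumes the common standard-score form 50 + 10 × (score − mean) / standard deviation, using the population standard deviation. That formula, the function names, and the sample scores are assumptions.

```python
from statistics import mean, pstdev


def deviation_values(scores):
    """Convert raw category scores to deviation values (assumed formula)."""
    mu = mean(scores.values())
    sigma = pstdev(scores.values())
    if sigma == 0:
        # All scores equal: every category sits exactly at the mean.
        return {category: 50.0 for category in scores}
    return {c: 50 + 10 * (s - mu) / sigma for c, s in scores.items()}


def top_categories(doc_id, scores, n=3):
    """Return (document ID, rank, category, deviation value) for the top n."""
    ranked = sorted(
        deviation_values(scores).items(), key=lambda item: item[1], reverse=True
    )
    return [
        (doc_id, rank, category, round(value, 1))
        for rank, (category, value) in enumerate(ranked[:n], start=1)
    ]


# Hypothetical raw scores for four categories.
rows = top_categories("D001", {"A": 30, "B": 10, "C": 20, "D": 20})
```

Replacing the `ranked[:n]` slice with a filter on the deviation value would give the lower-limit variation mentioned in the text.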
[0036]
In this embodiment the category system has two levels (large category and small category), but category estimation 3 is performed only on small categories; large categories are not estimated, because the large category is uniquely determined once the small category is determined. As another estimation method, the large category may be estimated first, and the small category then estimated only within the top-ranked large categories. In this case, classification knowledge 13 that defines the keywords characterizing each large category and their weights is required. It may be newly created by hand, or it can easily be created by aggregating the classification knowledge for the small categories under each large category.
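The alternative two-stage method can be sketched as follows. The text does not say how small-category knowledge is aggregated, so this sketch assumes the weights of the same keyword are summed across the small categories of a large category. All function names and sample rows are hypothetical.

```python
from collections import defaultdict


def aggregate_to_large(rows):
    """rows: (large, small, keyword, weight) -> large -> {keyword: summed weight}."""
    large_knowledge = defaultdict(lambda: defaultdict(int))
    for large, _small, keyword, weight in rows:
        large_knowledge[large][keyword] += weight
    return {large: dict(kws) for large, kws in large_knowledge.items()}


def score(doc_keywords, knowledge):
    """Sum Wi * Wj over matching keywords for each category."""
    return {
        category: sum(wi * kws.get(kw, 0) for kw, wi in doc_keywords.items())
        for category, kws in knowledge.items()
    }


def two_stage_estimate(doc_keywords, rows, top_large=1):
    """Estimate large categories first, then score only their small categories."""
    large_scores = score(doc_keywords, aggregate_to_large(rows))
    kept = sorted(large_scores, key=large_scores.get, reverse=True)[:top_large]
    small_knowledge = defaultdict(dict)
    for large, small, keyword, weight in rows:
        if large in kept:
            small_knowledge[small][keyword] = weight
    return score(doc_keywords, dict(small_knowledge))


rows = [
    ("economy", "international economy", "yen", 8),
    ("economy", "international economy", "trade", 5),
    ("economy", "foreign exchange", "yen", 6),
    ("politics", "parliament", "bill", 7),
]
result = two_stage_estimate({"yen": 4, "bill": 1}, rows)
```

Small categories under a discarded large category never appear in the result, which is what limits the candidates in the second stage.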
[0037]
FIG. 10 is a diagram illustrating an example of the estimated category table 15. The estimated category table 15 includes a document ID 251, an estimated category rank 252, an estimated large category candidate 253, an estimated small category candidate 254, and a deviation value 255 of the category.
[0038]
FIG. 11 is a diagram showing the processing procedure of similar document recognition 4. First, the similar document table 16 is initialized (step 521). Next, the following processing is performed for every category (step 522). With reference to the estimated category table 15, the document IDs of the documents for which that category was estimated as the first-ranked category are extracted from among the documents whose categories have been estimated (step 523).
[0039]
Next, the extracted document IDs are grouped by the category estimated as second-ranked and stored in the similar document table 16 in association with the first-ranked category (step 524).
[0040]
FIG. 12 is a diagram illustrating an example of the similar document table 16. As shown in FIG. 11, in this embodiment documents whose first-ranked estimated category and second-ranked estimated category are both the same are grouped together and stored in the similar document table 16. That is, the similar document table 16 consists of the first-ranked estimated category 261, the second-ranked estimated category 262, and the document IDs 263 of the documents that have these as their estimation result.
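The grouping into the similar document table can be sketched as below. The procedure of FIG. 11 loops over categories; this sketch instead groups documents directly by their (first, second) category pair, which yields the same groups. The row layout, the function name, and the sample data are hypothetical.

```python
from collections import defaultdict


def build_similar_document_table(estimated_rows):
    """estimated_rows: (document ID, rank, category).

    Returns {(first-ranked category, second-ranked category): [document IDs]}.
    """
    by_doc = defaultdict(dict)
    for doc_id, rank, category in estimated_rows:
        by_doc[doc_id][rank] = category
    table = defaultdict(list)
    for doc_id in sorted(by_doc):
        ranks = by_doc[doc_id]
        # Documents without a second-ranked category are grouped under None.
        table[(ranks.get(1), ranks.get(2))].append(doc_id)
    return dict(table)


estimated = [
    ("D001", 1, "international economy"), ("D001", 2, "foreign exchange"),
    ("D002", 1, "parliament"), ("D002", 2, "international economy"),
    ("D003", 1, "international economy"), ("D003", 2, "foreign exchange"),
]
table = build_similar_document_table(estimated)
```

Presenting the documents group by group, rather than by document ID, is what keeps consecutive documents similar in content during checking.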
[0041]
FIG. 13 is a diagram illustrating an example of the category estimation result display. Here, the document designation button 401 is for designating the range of the document to be processed, and designates the directory where the document exists. The classification button 402 executes the text analysis 2 and the category estimation 3 on the specified document, and obtains the estimation result and log data. The re-classification button 403 executes re-estimation of the category based on the data corrected by the user, and outputs a re-estimation result. The narrowing down classification button 404 allows the user to specify a category in an upper hierarchy, executes category estimation limited to lower categories belonging to the category, and outputs an estimation result, as described later. The category list button 405 displays the contents of the category definition table 14. The classification knowledge reference button 406 displays the keywords and their weights stored in the classification knowledge 13 for each category. An end button 407 ends the system.
[0042]
An area 411 displays the contents of the text, and also displays the ID of the document text. An area 412 displays the keywords extracted from the text and the weights (appearance frequencies) in pairs in descending order of the weight.
[0043]
An area 413 displays, for each category, which of the keywords in 412 are included in that category and how large the resulting score is. The category to be displayed is selected by specifying one of the categories in the classification result 414. In 413 of FIG. 13, for example, the keyword “yen” is included among the keywords of the small category “international economy”, the weight Wi of the keyword extracted from the text is 4, and the weight Wj of the keyword “yen” in the small category “international economy” of the classification knowledge 13 is 8; as a result, 4 × 8 = 32 points are given.
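The per-category breakdown shown in area 413 can be sketched as follows. The function and variable names are hypothetical, and apart from the Wi = 4 and Wj = 8 values for “yen” taken from the example above, the data are illustrative.

```python
def score_breakdown(doc_keywords, classification_knowledge):
    """Return {category: {keyword: Wi * Wj}} for keywords the category contains.

    doc_keywords: {keyword: Wi}, weights extracted from the text.
    classification_knowledge: {category: {keyword: Wj}}.
    """
    breakdown = {}
    for category, category_keywords in classification_knowledge.items():
        hits = {
            keyword: wi * category_keywords[keyword]
            for keyword, wi in doc_keywords.items()
            if keyword in category_keywords
        }
        if hits:  # list only categories that share at least one keyword
            breakdown[category] = hits
    return breakdown


doc_keywords = {"yen": 4, "rapid rise": 2}
knowledge = {
    "international economy": {"yen": 8},
    "parliament": {"bill": 7},
}
breakdown = score_breakdown(doc_keywords, knowledge)
# breakdown["international economy"]["yen"] is 4 * 8 = 32
```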
[0044]
An area 414 displays the estimated large category, small category, and their deviation values. An area 415 displays the category determined by the user.
A button 416 displays the category estimation result, log data, and determined category of the document checked immediately before the currently checked document. Since the data on already-checked documents is stored in the estimated category table 15 and the log data 17, this can be easily realized by displaying those data.
[0045]
A button 417 is used to determine the category of the currently checked document and to proceed to checking the next document. At this time, the category shown in 415 is determined as the category into which the document is classified, and the document is registered in the document database 18 together with the category information.
[0046]
FIG. 14 is a diagram showing another example of the category estimation result display. Reference numeral 421 denotes a list of classification knowledge, which is displayed with reference to the classification knowledge 13 when the classification knowledge reference button 406 is pressed. Reference numeral 422 denotes a category list, which displays the contents of the category definition table 14 when the category list button 405 is pressed. A text 423 describes the range of a category, and is displayed when any one of the categories is selected in the category list 422.
[0047]
FIG. 15 is a diagram illustrating an example of the screen after the log data has been corrected by the user. The user can modify the data displayed in 411 and 412 via the keyboard 20 and the mouse 25. In FIG. 15, a correction has been made to 412. As for keywords, it is possible to delete displayed keywords, add new keywords, and correct displayed weights. Compared with FIG. 13, which is the screen before correction, in FIG. 15 the weights of keywords such as “yen”, “foreign exchange market”, and “rapid rise” have been corrected, and words that are not keywords have been deleted.
[0048]
FIG. 16 is a diagram illustrating an example of the category re-estimation result. As a result of correcting the keywords and their weights, the category "exchange", which did not appear in the previous estimation result, newly appears in the first place of the classification result 414. Categories that have newly appeared in this way are distinguished from other categories by adding a star mark. Of course, a method of distinction other than adding a star mark may be used.
[0049]
FIG. 17 is a diagram illustrating a processing procedure of the category re-estimation 7. First, a table storing scores of each category is initialized to 0 (step 541).
[0050]
Next, the document ID, the corrected text, the corrected keywords, and their weights are read from the output screen and stored in the working area 61 (step 542). Next, it is determined whether the text information has been corrected (step 543). If the text information has been corrected, the keywords extracted from it and their weights may change greatly, so it is necessary to start over from the text analysis 2. On the other hand, when the text information has not been corrected, the keyword information read from the display screen can be used as is, so the processing may start from the category estimation 3. Whether or not the text information has been corrected can be determined by providing a text correction flag and turning it on and off.
[0051]
If the text information is corrected in step 543, text analysis 2 is executed to extract keywords and weights from the corrected text and store the result in the working area 61 (step 544).
[0052]
Next, the following processing is performed for all keywords stored in the working area 61 (step 545). It is determined whether or not a category including the keyword exists in the classification knowledge (step 546). For each such category, the product of the weight Wi of the keyword in the document (corresponding to 223 in FIG. 6) and the weight Wj of the keyword in the category (corresponding to 244 in FIG. 8) is calculated and added to the score of the category (step 547).
[0053]
Since the score of each category is determined once all the keywords have been processed, a deviation value of the score of each category is calculated from these scores (step 548). Further, the categories are sorted in descending order of the deviation value (step 549). Then, the document ID, the category, and the deviation value are stored as a set in the estimated category table 15 in descending order of the deviation value (step 550).
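The procedure of steps 541 to 550 can be sketched as follows. Several points are assumptions rather than statements of the embodiment: the formula for the deviation value is not given in this section, so the common standard-score form 50 + 10 (score - mean) / standard deviation is assumed; `analyze_text` is a hypothetical stand-in for text analysis 2; and the returned rows are a simplified form of the estimated category table 15.

```python
from statistics import mean, pstdev


def estimate_categories(doc_id, doc_keywords, knowledge):
    """Score every category, convert scores to deviation values, and rank."""
    scores = {category: 0 for category in knowledge}  # step 541
    for keyword, wi in doc_keywords.items():  # step 545
        for category, category_keywords in knowledge.items():  # step 546
            if keyword in category_keywords:
                scores[category] += wi * category_keywords[keyword]  # step 547

    mu = mean(scores.values())
    sigma = pstdev(scores.values())

    def deviation(score):
        # Assumed formula (step 548); all categories get 50 if scores are equal.
        return 50.0 if sigma == 0 else 50.0 + 10.0 * (score - mu) / sigma

    ranked = sorted(scores, key=lambda c: deviation(scores[c]), reverse=True)  # step 549
    return [  # step 550
        (doc_id, rank, category, round(deviation(scores[category]), 1))
        for rank, category in enumerate(ranked, start=1)
    ]


def re_estimate(doc_id, text, keywords, text_corrected, knowledge, analyze_text):
    """Re-run text analysis only when the text itself was corrected (steps 543, 544)."""
    if text_corrected:
        keywords = analyze_text(text)
    return estimate_categories(doc_id, keywords, knowledge)


# Illustrative data only.
knowledge = {
    "international economy": {"yen": 8},
    "exchange": {"yen": 6},
    "parliament": {"bill": 7},
}
rows = estimate_categories("D1", {"yen": 4}, knowledge)
```

Passing the text correction flag as a plain boolean keeps the branch of step 543 explicit; how the flag is actually maintained on the screen is outside this sketch.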
[0054]
FIG. 18 is a diagram illustrating an example of the log data 17. The log data 17 stores and holds the document ID, keywords extracted from the text and their weights, the breakdown of the scores for each category, and data related to the determined categories until the system is terminated. Therefore, when checking the category estimation result of a certain document, it is possible to refer to the data of the document which has been checked so far.
[0055]
FIG. 19 is a diagram illustrating an example of the category determination 8. The user determines the category with reference to the classification result 414. In the present embodiment, in the classification result 414, by double-clicking the category to be determined with the mouse, the selected category is displayed as the determined category 415.
[0056]
As described above, according to the present embodiment, when a document is to be classified, the computer estimates category candidates and displays the result, and the user checks the result, whereby a document classification support system in which the work is shared between human and machine can be realized. In addition, when the classification results are displayed, they are presented sequentially, grouped by estimated category, so that the user can check them efficiently. Furthermore, even if the presented result is incorrect, the data can be corrected and the document re-classified, which improves the accuracy of classification into the correct category, so that the proportion of cases requiring burdensome work can be reduced as much as possible.
[0057]
Next, a modified example of the present embodiment will be described.
In the present embodiment, the similar document recognition 4 is performed based on the top two estimated categories. However, instead of the estimated categories, the recognition may be performed using the high-weight keywords extracted from the text.
[0058]
FIG. 20 is a diagram showing the processing method. First, the similar document table 16 is initialized (step 561). Next, the following process is executed while there is a document that has not yet been recognized as a similar document (step 562). For a given document that has not yet been recognized, the m highest-weight keywords are extracted from it; each document whose own m highest-weight keywords include n (m ≥ n) or more of these keywords is extracted and stored in the similar document table 16 together with a set identifier for identifying the similar document set (step 563). In FIG. 11, the name of the category serves as the equivalent of the set identifier, but here a set identifier is defined in its place. The identifier may take any form as long as a similar document set can be identified.
[0059]
After step 563, the documents stored in the similar document table 16 are excluded from the processing target of step 562 (step 564).
By the above processing, when the result of the category estimation is presented to the user, the documents can be presented grouped by similar document set, where similarity is based on how many high-weight keywords the documents share.
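Steps 561 to 564 can be sketched as follows. The function names, the use of consecutive integers as set identifiers, and the sample data are all hypothetical. The seed document is always placed in its own set, which is an assumption made so that the loop terminates even when a document has fewer than n keywords.

```python
def top_keywords(weights, m):
    """Return the set of the m highest-weight keywords of one document."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return {keyword for keyword, _ in ranked[:m]}


def recognize_by_keywords(docs, m, n):
    """Group documents sharing n or more of their top-m keywords (n <= m).

    docs: {doc_id: {keyword: weight}}. Returns {set_id: [doc_id, ...]}.
    """
    similar_table = {}  # step 561
    remaining = list(docs)
    set_id = 0
    while remaining:  # step 562
        seed = remaining[0]
        seed_top = top_keywords(docs[seed], m)
        members = [  # step 563
            doc_id for doc_id in remaining
            if doc_id == seed
            or len(seed_top & top_keywords(docs[doc_id], m)) >= n
        ]
        set_id += 1
        similar_table[set_id] = members
        remaining = [d for d in remaining if d not in members]  # step 564
    return similar_table


# Illustrative data only.
docs = {
    "D1": {"yen": 4, "foreign exchange market": 3, "rapid rise": 2},
    "D2": {"yen": 5, "rapid rise": 3, "bank": 1},
    "D3": {"bill": 4, "parliament": 3, "vote": 2},
}
groups = recognize_by_keywords(docs, m=3, n=2)
```

Because each set is seeded by the first unrecognized document, the resulting grouping can depend on the order in which documents are processed.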
[0060]
Next, an extended example of the present embodiment will be described.
When the categories are composed of a plurality of layers as in the present embodiment, the upper categories can be presented to the user for designation, and the category can be estimated only among the lower categories belonging to the designated upper category, whereby an improvement in classification accuracy can be expected. This is particularly effective when the number of lower categories is enormous.
[0061]
FIG. 21 is a diagram showing an example of a screen for specifying a large category. When the narrowing down classification button 404 is pressed, the designation screen 424 is displayed and the large category is designated on it. A plurality of large categories may be designated. The display order of the large categories on the designation screen 424 is basically the order defined in the category definition table 14. However, when the category estimation 3 adopts the method of first estimating the large category and then using that result to estimate the small category, the estimation result of the large category of the document can be stored in the log data 17, so that the large categories can be displayed in the order of that estimation result.
[0062]
After the large category is designated on the designation screen 424, pressing the re-classification button 403 executes the category re-estimation 7 limited to the designated large category. Narrowing down by upper category can be realized as follows: when the estimation result is stored in the estimated category table 15 in step 550 of the processing procedure of the category re-estimation 7 shown in FIG. 17, an estimated category is stored only if its large category is included in the large categories designated by the user. In the result display of FIG. 13, if the user narrows down the large category to “economy”, the category “politics: parliament” in the second place in the classification result 414 is removed.
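The filtering at step 550 can be sketched as follows. The row layout `(large_category, small_category, deviation_value)`, the function name, and the deviation values are hypothetical; only the removal of “politics: parliament” when narrowing to “economy” follows the example in the text.

```python
def narrow_by_large_category(ranked_rows, selected_large):
    """Keep only rows whose large category the user designated, then re-rank.

    ranked_rows: list of (large_category, small_category, deviation_value),
    already sorted in descending order of deviation value.
    selected_large: set of large categories designated on the screen.
    """
    kept = [row for row in ranked_rows if row[0] in selected_large]
    return [(rank, *row) for rank, row in enumerate(kept, start=1)]


# Illustrative data only.
ranked_rows = [
    ("economy", "international economy", 62.0),
    ("politics", "parliament", 55.0),
    ("economy", "exchange", 51.0),
]
narrowed = narrow_by_large_category(ranked_rows, {"economy"})
```

Accepting a set of large categories reflects the statement that a plurality of large categories may be designated.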
[0063]
As described above, when the number of upper categories is relatively small and the user can easily determine the upper category, the correct category can be obtained by narrowing down by upper category and then estimating the category.
[0064]
[Effect of the Invention]
When the user checks the automatic classification results of documents, the results classified by the computer are presented to the user sequentially for each set of similar documents and the user is prompted to check them, so that the tips, know-how, lessons, data, and the like obtained from one check are easy to apply to the next document check, and the checking time is reduced.
[0065]
Even if the result of the automatic classification is incorrect, the user can correct the log data output together with the result and have the category re-estimated, so that the correct classification result can be derived. This reduces the burdensome work of the user re-classifying from the beginning when the automatic classification result is incorrect.
[Brief description of the drawings]
FIG. 1 is a diagram showing an outline of the present embodiment.
FIG. 2 is a diagram illustrating an outline of hardware of the embodiment.
FIG. 3 is a diagram illustrating an example of a text included in a document.
FIG. 4 is a diagram illustrating an example of a word dictionary.
FIG. 5 is a diagram illustrating an example of a word segmentation result in text analysis.
FIG. 6 is a diagram illustrating an example of a keyword table.
FIG. 7 is a diagram illustrating an example of a category definition table.
FIG. 8 is a diagram illustrating an example of classification knowledge.
FIG. 9 is a diagram showing a processing procedure of category estimation.
FIG. 10 is a diagram illustrating an example of an estimated category table.
FIG. 11 is a diagram showing a processing procedure for similar document recognition.
FIG. 12 illustrates an example of a similar document table.
FIG. 13 is a diagram showing an example of a category estimation result display.
FIG. 14 is a diagram showing another example of a category estimation result display.
FIG. 15 is a diagram showing an example of a screen after correction by a user.
FIG. 16 is a diagram illustrating an example of a category re-estimation result.
FIG. 17 is a diagram showing a procedure for re-estimating a category.
FIG. 18 is a diagram illustrating an example of log data.
FIG. 19 is a diagram showing an example of category determination.
FIG. 20 is a diagram illustrating another processing procedure of similar document recognition.
FIG. 21 is a diagram illustrating an example of narrowing down upper categories.
[Explanation of symbols]
1: Document input, 2: Text analysis, 3: Category estimation, 4: Similar document recognition,
5: Category estimation result display, 6: Log data correction, 7: Category re-estimation,
8: Category determination, 9: Document database registration, 10: Document file,
11: Word dictionary, 12: Keyword table, 13: Classification knowledge,
14: Category definition table, 15: Estimated category table,
16: Similar document table, 17: Log data, 18: Document database

Claims

1. A document classification support device comprising an input device, an output device, and a processing device having a storage device, for classifying a plurality of documents into categories, wherein
the storage device stores a plurality of documents, a dictionary defining keywords, and classification knowledge data defining categories, keywords characterizing each category, and a first weight indicating the importance of each keyword for each category,
analysis means of the processing device refers to the dictionary in the storage device and extracts the keywords included in each document,
estimating means of the processing device calculates a second weight of each keyword based on the frequency of appearance of each keyword included in each document, refers to the classification knowledge data, searches the categories defined by the classification knowledge data for categories including the keywords extracted by the analysis means, calculates a score of each keyword in each found category based on the first weight and the second weight of the keyword, calculates a deviation value of each category by adding up the scores of the keywords for each category, and sorts the categories for each document according to the calculated deviation values,
recognizing means of the processing device recognizes, as similar documents, those documents among the plurality of documents that share a category ranked high by the sorting, and
the output device outputs the estimation result of the estimating means for each of the similar documents recognized by the recognizing means.

2. The document classification support device according to claim 1, wherein determining means of the processing device determines, as the category of the document, a category specified by the user from among the categories in the estimation result of the estimating means.

3. A document classification support device for classifying a plurality of documents into categories, comprising:
a storage device for storing a plurality of documents, a dictionary defining keywords, and classification knowledge data defining categories, keywords characterizing each category, and a first weight indicating the importance of each keyword for each category;
analysis means for extracting the keywords included in each document by referring to the dictionary in the storage device;
estimating means for calculating a second weight of each keyword based on the appearance frequency of each keyword included in each document, referring to the classification knowledge data, searching the categories defined by the classification knowledge data for categories including the keywords extracted by the analysis means, calculating a score of each keyword in each found category based on the first weight and the second weight of the keyword, calculating a deviation value of each category by adding up the scores of the keywords for each category, and sorting the categories for each document according to the calculated deviation values;
recognizing means for recognizing, as similar documents, those documents among the plurality of documents that share a category ranked high by the sorting; and
output means for outputting the estimation result of the estimating means for each similar document recognized by the recognizing means.

4. The document classification support device according to claim 3, further comprising determining means for determining, as the category of the document, a category specified by the user from among the categories in the estimation result of the estimating means.