JP3602084B2 - Database management device

Info

Publication number: JP3602084B2
Application number: JP2001299138A
Authority: JP
Inventors: 勉小林; 茂美中里; 剛松隈; 幸夫中本; 弘山崎
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2001-09-28
Filing date: 2001-09-28
Publication date: 2004-12-15
Anticipated expiration: 2021-09-28
Also published as: JP2003108576A

Description

【０００１】
【発明の属する技術分野】
この発明は、文書検索技術に係わり、検索に用いるデータベースの管理装置及び管理方法に関する。
【０００２】
【従来の技術】
近年、大量の電子化された文書データが流通するようになり、その文書データが、どのような分野に属するかを自動的に分類する技術が実用化されている。一般的な技術としては、データベースに色々な分野を代表する文書を複数登録しておき、入力された文書と登録されている文書間の類似性の度合いを表す値（以下「類似度」と表記）をベクトル空間法などを用いて求め、類似していた文書が属する分野を参考に入力された文書が属すると思われる分野を特定するというものである。
【０００３】
このような文書自動分類システムは、時間が経つにしたがって、分野を代表するような新しい単語が使われるようになったり、逆に、あまり使われなくなる単語があったりし、いつまでも同じデータベースを利用することは、分類の精度低下につながる。
【０００４】
また、自動分類したものは、そのままその分野が付与されることは少なく、自動分類された結果が正しいかどうかを人手によって判断し、その結果を元に正解／不正解と類似度情報からデータベースに登録されている文書に点数を付け、点数の悪い文書をデータベースから削除する方式があった。
【０００５】
このように、従来は、データベースから削除すべき文書であるかの判断の材料として、「分野特定結果が正解であったかどうか」といった情報や、「分野特定の要となったデータベース内の文書と、その文書との類似度」といった情報が用いられてきた。このような手法を用いた例として、特開２００１−１５５０２５公報に開示された文書分類装置がある。
【０００６】
しかし、データベースに登録されている文書の属する分野は、すべて独立している訳ではなく、関連性の深い分野も存在する。例えば「レーザー」という言葉は、印刷機器や医療機器、記憶装置、計測機器などの広い分野で使用される可能性がある。このような、様々な分野で使用される単語を多く含む文書は、分野特定においてある１つの分野に特定することは難しく、複数の可能性のある分野に特定され易くなる。
【０００７】
このような、複数の分野と関連性の高い分野に属する文書については、複数の分野で同じような類似度となる可能性が高く、一つの分野に特定することは困難である。このような文書は、分野特定で失敗する可能性も高くなるが、これは、その分野の特徴によるものであり、データベースに登録されている特定の文書が悪影響している訳ではない。
【０００８】
このような正解しにくい分野の文書を分類した際に、不正解に影響した文書を優先的に削除するこでは精度向上は望めず、逆に、関連のある分野の文書が削除され易くなり、それらの分野の特定精度が低下することにもつながる。
【０００９】
また、分類する文書から抽出した単語に、データベースに登録されている分野の特徴を表す単語がほとんど含まれていない場合なども考えられる。このような場合にも、その文書とデータベースに登録されている文書との間の類似度は全体的に同じような値になるため、このような文書の分類結果から、データベースに登録されている文書の悪影響の度合いを判断しデータベースから削除することは、精度低下を招くことになる。
【００１０】
【発明が解決しようとする課題】
本発明は上記の問題を解決するためになされたものであり、分類の精度を維持しながら、データベースのメンテナンスを行うことのできる文書分類装置および文書分類方法を提供することを目的とする。
【００１１】
【課題を解決するための手段】
この発明は、分野情報を有する登録文書を記録したデータベースを管理する管理装置であって、所定の文書を入力する第一入力手段と、前記データベースに登録された登録文書を読み込む読み込み手段と、前記所定の文書と前記登録文書との間の類似度を算出する類似度算出手段と、前記類似度算出手段により算出された類似度をもとに、確度を算出する確度算出手段と、前記所定の文書の属する分野を入力する第二入力手段と、前記所定の文書の属する分野と、前記登録文書が登録されている分野との一致／不一致を判別する判別手段と、前記判別手段が前記所定の文書の属する分野と、前記登録文書が登録されている分野とが一致していると判別した場合、前記類似度と前記確度をもとに正解影響度を算出する正解影響度算出手段と、前記判別手段が前記所定の文書の属する分野と、前記登録文書が登録されている分野とが一致していないと判別した場合、前記類似度と前記確度をもとに不正解影響度を算出する正解影響度算出手段と、前記正解影響度及び前記不正解影響度から削除文書候補点数を算出する削除文書候補点数算出手段とを具備することを特徴とする。
【００１２】
このような構成によれば、分類の精度を維持しながら、データベースのメンテナンスを行うことが可能となる。
【００１３】
この発明は、分野情報を有する登録文書を記録したデータベースを管理するデータベース管理方法であって、所定の文書を入力する第一入力ステップと、前記データベースに登録された登録文書を読み込む読み込みステップと、前記所定の文書と前記登録文書との間の類似度を算出する類似度算出ステップと、前記類似度算出ステップにより算出された類似度をもとに、確度を算出する確度算出ステップと、前記所定の文書の属する分野を入力する第二入力ステップと、前記第二入力ステップで入力された所定の文書の属する分野と、前記登録文書が登録されている分野との一致／不一致を判別する判別ステップと、前記判別ステップにおいて、手段が前記所定の文書の属する分野と、前記登録文書が登録されている分野とが一致していると判別した場合、前記類似度と前記確度をもとに正解影響度を算出する正解影響度算出ステップと、前記判別手段が前記所定の文書の属する分野と、前記登録文書が登録されている分野とが一致していないと判別した場合、前記類似度と前記確度をもとに不正解影響度を算出する正解影響度算出ステップと、前記正解影響度及び前記不正解影響度から削除文書候補点数を算出する削除文書候補点数算出ステップとを具備することを特徴とする。
【００１４】
このような構成によれば、分類の精度を維持しながら、データベースのメンテナンスを行うことが可能となる。
【００１５】
【発明の実施の形態】
本発明の具体的な構成について説明する前に、発明の理解の一助として、本発明のアウトラインを説明する。本発明においては、データベースのメンテナンスを行うために、ユーザが予め所定の分野に属することが分かっている文書を入力して、文書分類装置に分類動作を行わせる。ここでユーザが入力する文書を「分類文書」とし、この分類文書が属する分野としてユーザが予め認識している分野を「正解分野」とする。
【００１６】
この分類文書を用いて、文書分類を行う際に、この文書分類装置は分類文書とデータベースに登録されている文書（以下「登録文書」と表記する）間の類似度と、その類似度や類似度算出時に参照した単語数などから算出した確度を算出する。
【００１７】
次に、登録文書がそれぞれ分類されている分野（以下「登録分野」と表記する）と、先述の分類文書の正解分野が一致していた場合と異なっていた場合に、その分野の特定に影響した文書との類似度と、分野を特定した際に、比較した文書との類似度や比較に使用した単語数などの情報から、特定結果がどの程度信頼できるかを表す値（以下「確度」と表記）を元に、それぞれ正解影響度と不正解影響度を算出する。すなわち、ある登録文書に関し、第一の分類文書について類似度と確度を求める。つづいてこの登録文書の登録分野と分類文書の正解分野が一致した場合は、正解影響度を蓄積する。また、登録分野と正解分野が一致しなかった場合は、不正解影響度を蓄積する。
【００１８】
この操作を分類文書を複数種用いて繰り返し、各々の登録文書について正解影響度と不正解影響度を蓄積して登録文書毎に格納する。
【００１９】
この蓄積した正解影響度と不正解影響度（以下、両者をまとめて「正解／不正解影響度」と表記）をもとに、登録文書毎の削除文書候補点数を算出する。この削除文書候補点数は不正解影響度を正解影響度で除することで求められ、正解影響度に比して不正解影響度が大きい文書については、削除文書候補点数が大きくなる。データベースのメンテナンスに当たっては、この削除文書候補点数が大きい文書を削除文書候補として抽出するというものである。
【００２０】
以下、図面を参照して本発明の実施形態について以下の通り説明する。図１は本発明に関する一実施形態である類似文書検索装置のハードウェア構成を示すブロック図である。なお、本装置は一般的なアーキテクチャを持つコンピュータ上の一機能として構成されるものである。
【００２１】
図１に示すように、この類似文書検索装置は、ＣＰＵおよびメモリなどから構成される制御装置１、キーボード、ポインティングデバイス、スキャナ、マイクなどの入力装置２、類似文書の検索結果などを表示する表示装置３、および文書データや類似度情報、各種設定値などを格納する外部記憶装置４（ハードディスク、ＭＯ、ＤＶＤ−ＲＡＭなど）から構成される。
【００２２】
図２に本類似文書検索装置における制御装置１の構成を示す。制御装置１はプログラム部２００とバッファ部２５０からなる。プログラム部２００は、初期化部２０１、分類文書入力部２０２、登録文書読み込み部２０３、類似度算出部２０４、確度算出部２０５、分類結果出力部２０６、正解／不正解影響度格納部２０７、削除文書候補点数算出部２０８、削除文書候補出力部２０９の機能を有している。
【００２３】
バッファ部２５０は、分類文書格納バッファ部２５１、登録文書格納バッファ部２５２、類似度算出結果格納バッファ部２５３、正解／不正解影響度格納バッファ部２５４、削除文書候補格納バッファ部２５５の領域を有している。
【００２４】
初期化部２０１は、バッファ部２５０内の各バッファ部をクリアする。分類文書入力部２０２は、ユーザが入力装置２を用いて入力する分類文書データを、分類文書格納バッファ部２５１へ格納する。この時、分類文書ＩＤが発行され、このＩＤも分類文書格納バッファ部２５１へ格納される。
登録文書読み込み部２０３は、外部記憶装置４に格納された登録文書を読み出し、登録文書格納バッファ部２５２へ格納する。
【００２５】
類似度算出部２０４は、分類文書格納バッファ部２５１に格納されている分類文書と、登録文書格納バッファ部２５２に格納されている登録文書を単語に分割し、各単語の出現回数をベクトルの成分とするベクトル空間法などで類似の度合いを算出し、分類文書ＩＤと登録文書ＩＤと類似度と登録文書が属する分野情報を組にして、類似度算出結果格納バッファ部２５３に格納する。類似度はベクトル空間法の代わりに共通単語数により算出するようにしても構わない。
【００２６】
確度算出部２０５は、類似度算出結果格納バッファ部２５３に格納されている類似度の合計値を算出し、各登録文書との類似度が占める割合を確度として算出し、類似度算出結果格納バッファ部２５３に格納する。
【００２７】
分類結果出力部２０６は、類似度算出結果格納バッファ部２５３に格納されているデータを類似度でソートし、類似度の高い登録文書に付与されている分野を出力する。
【００２８】
正解／不正解影響度格納部２０６は、類似度算出結果格納バッファ２５３に格納される、類似度算出結果情報と入力装置２より入力された、分類文書の正解分野情報から、登録文書毎の正解／不正解への影響の度合いとして類似度に確度を掛け合せた値を、正解／不正解影響度格納バッファ部２５４に加算する。正解／不正解への影響の度合いとしては、類似度に確度を掛け合せた値の他に、確度が設定された閾値以上の場合にのみ類似度を加算するようにしても良い。
【００２９】
削除文書候補点数算出部２０８は、正解／不正解影響度格納バッファ部２５４に格納されている正解／不正解影響度から削除文書候補としての点数を算出し、削除文書候補格納バッファ部２５５に格納する。削除文書候補出力部２０９は、削除文書候補格納バッファ部２５５に格納されている削除文書候補を削除文書候補点数でソートし出力する。
【００３０】
次に、本発明の実施形態の一つである文書分類装置の動作について図３及び図４のフローチャート図を参照して以下の通り説明する。
【００３１】
本実施例は、大きく分けて図３に示す第１のステップと、図４に示す第２のステップとからなる。第１のステップは、文書分類装置に登録された文書から、削除すべき文書を選択するために、ユーザが予め正解分野を把握している分類文書を用いて分類処理を行い、その処理結果を蓄積するステップである。第２のステップは、この蓄積された処理結果をもとに、削除すべき文書の候補を出力するステップである。
【００３２】
まず、図３を参照して分類処理結果を蓄積する第１のステップについて説明する。はじめにユーザは、入力装置２を使用して、外部記憶装置４にデータベースのメンテナンスの対象となる登録文書の文書データを格納する（ステップ３０１）。続いて初期化部２０１により全バッファをクリアする（ステップ３０２）。
【００３３】
次に、分類文書入力部２０２が、入力装置２を通じてユーザより分類文書を受け付けて、分類文書格納バッファ部２５１に格納する。（ステップ３０３）。具体例として、図５に示すような「この文書は、計測機器について記述したものです。」というテキスト文書を分類文書の一つとして格納したとする。
【００３４】
続いて登録文書読み出し部２０３が、外部記憶装置４から複数の登録文書を読み出し、登録文書格納バッファ部２５２に登録文書として格納する（ステップ３０４）。検索対象となる登録文書には、文書を識別するための文書ＩＤと、その文書の分類を表す分野（登録分野）の情報が付与されている。具体例として、図６に示すように、文書ＩＤ、分野情報、本文からなるデータを格納したとする。例えば文書ＩＤが「１」の文書は「エンジン」に関する分野であり、本文として「この文書は、エンジンについて記述したものです。」というデータを格納する。もちろん、より長い本文データについても同様に処理する。以下、文書ＩＤ「２」、「３」…と各登録文書について同様の処理を行う。
【００３５】
次に、類似度算出部２０４が、分類文書格納バッファ部２５１に格納された分類文書と、登録文書格納バッファ部２５２に格納された登録文書の本文とを比較し、類似の度合いを示す数値である類似度をベクトル空間法を用いて算出した後、登録文書ＩＤとその文書の分類を表す分野情報とともに類似度算出結果格納バッファ部２５３に格納する（ステップ３０５）。ここで、ベクトル空間法は、特開２０００−３１１１７３公報に記載されたような手法を用いることができる。
【００３６】
この時、類似度が大きいものから一定の件数だけ格納したり、一定の類似度以上のものだけを格納しても構わない。図７の類似度算出結果格納例では、分類文書ＩＤが「１」の文書について、登録文書に関する１番目のデータは、文書ＩＤ＝１０２３、登録分野＝記憶装置、類似度＝０．３７８という内容が格納されていることを示す。以下、２番目、３番目と同様に格納される。
【００３７】
次に、類似度を算出していない登録文書が残っているかを判断し（ステップ３０６）、残っている場合は、ステップ３０４に戻って残りの登録文書に対してステップ３０４、３０５の動作を繰り返す。一方、他に登録文書が無い場合は、ステップ３０７に進む。
【００３８】
次に、ステップ３０５で類似度算出結果格納バッファ部２５３に格納した類似度算出結果の登録文書ごとの類似度の和を算出し、その値に対して各文書の類似度が占める割合を確度として算出し、類似度算出結果格納バッファ部２５３に格納する（ステップ３０７）。なお、確度は分類結果の確からしさを表す値であれば、類似度の合計値に対する占有率以外にも、文書同士を比較した際の共通単語数などから算出したものでも構わない。図８に図７に示した例における確度の算出例を示す。類似度算出結果格納バッファ部２５３に格納された登録文書の類似度の和は２．７８３である。ここで、登録文書ＩＤ「１０２３」の文書は分類文書ＩＤ「１」の分類文書に対する類似度が０．３７８である場合、登録文書ＩＤ「１０２３」の文書の確度は０．３７８÷２．７８３＝０．１３６となり、確度の値として０．１３６が格納される。他の文書についても同様に確度が求められ、格納される。
【００３９】
次に、ステップ３０７までで算出された各情報について出力する（ステップ３０８）。この出力は図８の情報を出力する形が好ましいが、類似度順にソートし、上位の文書から順に付与されている登録分野の分野情報を出力するようにしても構わない。ソートした上で分野情報を出力した例を図９に示す。ここで、各登録分野ごとにその分野に含まれる登録文書の類似度の和を取り、高い順に並べている。この出力はこの後の処理では使用しないが、ユーザにとって分類状況を把握しやすくなるという効果がある。
【００４０】
分類結果出力が済むと、他に使用する分類文書が残っているかを判断し（ステップ３０９）、残っていればステップ３０３に戻り、ステップ３０３から３０８までを繰り返す。図８に相当するデータは分類文書ごとに異なるので、分類文書ごとにそれぞれ格納される。一方、分類文書が残っていなければ分類処理を終了する。
【００４１】
次に、第１のステップで格納された、類似度算出結果を使用して、登録文書から削除する文献の候補を出力する第２のステップについて説明する。図４はその手順を示すフローチャートである。
【００４２】
はじめに初期化部２０１により類似度度算出結果格納バッファ部２５３以外のバッファをクリアする（ステップ３５１）。次に、入力装置２より、第１のステップで用いた分類文書のＩＤとその分類文書の正解分野を入力する（ステップ３５２）。
【００４３】
次に、ステップ３５２で入力された分類文書ＩＤに対応する類似度算出結果と正解分野をもとに、各登録文書について、登録分野と正解分野が一致していれば正解影響度として、一致していなければ不正解影響度として、類似度に確度を掛け合せた値を正解／不正解影響度格納バッファ部２５４に加算する（ステップ３５３）。類似度と確度を掛け合わせることで、一種の重み付けを行うことができる。この正解／不正解影響度は登録文書ごとに管理される。
【００４４】
分類文書ＩＤが１で、正解分野が計測機器であった場合、類似度算出結果が図８の状態であるとすると、登録文書ＩＤ＝１０２３の文書は、その分野が正解分野と異なるので、その類似度に確度を掛け合せた値０．３７８×０．１３６＝０．０５１を不正解影響度に加算して格納する。登録文書ＩＤ＝５９３３の文書は、その分野が正解分野と同じなので、その類似度に確度を掛け合せた値０．１７２×０．０６２＝０．０１１を正解影響度に加算して格納する。
【００４５】
ここでは、正解／不正解影響度に加算する値として類似度に確度を掛け合せた値を利用しているが、確度に閾値を設けて、その閾値よりも確度が大きい場合にのみ類似度を正解／不正解影響度に加算する方式であっても構わない。
【００４６】
続いて、処理中の分類文書の類似度算出結果が残っているか判断し（ステップ３５４）、残っている場合はステップ３５３に戻り、ステップ３５３の処理を繰り返す。この処理の対象となるのはすべての類似度算出結果でも構わないし、類似度の高いものから何件、または類似度や確度が一定の値以上のものでも構わない。一方、処理する類似度算出結果が残っていない場合はステップ３５５に進む。
【００４７】
ステップ３５５では、他に正解情報が残っているかを判断し（ステップ３５５）、残っている場合はステップ３５２に戻り、上述したステップ３５２から３５４までの処理を繰り返し、残っていなければ、ステップ３５６に進む。ここで、正解情報が残っている場合、すなわち別の分類文書による計算結果を用いる場合は、その正解情報に対応する分類文書ＩＤで管理された算出結果を用いることになる。
【００４８】
このようにして、登録文書ごとにいくつかの分類文書についてそれぞれ正解／不正解影響度を求め、登録文書ごとに格納した結果となる、正解／不正解影響度格納バッファ部２５４の例を図１０に示す。例えば登録文書ＩＤ「１」の文書について、登録分野は「エンジン」であり、正解影響度は０．００２４９、不正解影響度は０．２５３８２、となる。正解影響度の大きい登録文書は、所定の分類文書と同じ分野であり、一般的に類似度も高く、確度も高いということができる。一方、不正解影響度の大きい登録文書は、所定の分類文書と異なる分野であるが、類似度や確度が高く、紛らわしい文書であるということができる。
【００４９】
ステップ３５６では、ステップ３５３で正解／不正解影響度格納バッファ部２５５に格納した、正解／不正解影響度をもとに、データベースに登録されている各登録文書について「削除文書候補点数」を算出する。この削除文書候補点数の算出式の例を図１１に示す。ここで、削除文書候補点数は、「不正解影響度÷（正解影響度＋０．００１）」で求められる。分母となる正解影響度に０．００１を加えているのは、正解影響度が０である文書があった場合に０による除算エラーが発生するのを防ぐためである。この式によれば、正解影響度に比して不正解影響度が大きい登録文書が削除文書候補点数が高くなる。削除文書候補点数算出部２０８は算出した結果を削除文書候補格納バッファ部２５５に格納する。
【００５０】
図１２の例は、図１０の類似度算出結果格納バッファ部２５３に格納されている類似度算出結果を図１１に示す削除文書候補点数の算出式を用いて点数を算出した結果を表す。この結果は削除文書候補格納バッファ部２５５に格納される。例えば登録文書ＩＤ「１」の登録文書について、その削除文書候補点数は「０．２５３８２／（０．００２４９＋０．００１」で求められ、その値は７２．７２７７９となる。
【００５１】
次に、正解／不正解影響度データが残っているか、すなわち削除候補点数を算出する登録文書が残っているかを判断し（ステップ３５７）、残っていればステップ３５６に戻り、ステップ３５６を繰り返す。正解／不正解影響度データが残っていない場合はステップ３５８に進む。
【００５２】
ステップ３５８では、ステップ３５６で算出した削除文書候補点数を用いて削除文献候補格納バッファ部２５５の内容をソートし、削除文書候補点数の高い文書順に削除文書候補として出力する。図１３に、削除文書候補の出力例を示す。登録文書ＩＤが９９２４の文書は、計測機器分野の文書で、削除文書候補点数が１２９．１６９７３であることを表す。
【００５３】
以上で、第２のステップである登録文書削除候補出力処理を終了する。ユーザはこの出力を見て、データベースから削除すべき登録文書を選択することができる。この選択まで、自動的に実行させることも可能である。
【００５４】
複数の分野と関連の深い（類似性の高い）分野の文書は、１つの分野に特定することが難しく、特定した分野が正解分野と一致する確率も低くなる傾向にある。本発明では確度が低い場合には不正解への影響度として加算する値を低くするので、そのような分野の文書が優先的に削除されることによる、分類精度の低下を抑えることができる。
【００５５】
また、分類する文書に分野の特徴を表す単語が少ないような分類処理に不向きな文書を分類した場合、正解分野に特定される確率は低くなる。このような文書を分類処理した場合、分野特定結果が不正解であっても、確度が小さければ不正解影響度として類似度が加算されにくくなるため、削除文書候補の抽出時に不適当な候補が抽出されることを少なくできる。
【００５６】
【発明の効果】
以上説明したように、この発明によれば、分類の精度を維持しながら、データベースのメンテナンスを行うことが可能となる。
【図面の簡単な説明】
【図１】本発明の実施形態に係わる類似文書検索装置のハードウェア構成を示すブロック図。
【図２】本発明の実施形態に係わる類似文書検索装置の制御装置の機能ブロック図。
【図３】文書分類処理の流れを示すフローチャート図。
【図４】削除文書の候補を出力する処理の流れを示すフローチャート図。
【図５】分類文書の例を示す図。
【図６】登録文書の例を示す図。
【図７】類似度算出結果の例を示す図。
【図８】確度算出の例を示す図。
【図９】分類結果の出力例を示す図。
【図１０】正解／不正解影響度の例を示す図。
【図１１】削除文書候補点数計算式の例を示す図。
【図１２】削除文書候補格納の例を示す図。
【図１３】削除文書候補出力の例を示す図。
【符号の説明】
１…制御装置、２…入力装置、３…表示装置、４…外部記憶装置、２００…プログラム部、２０１…初期化部、２０２…分類文書入力部、２０３…登録文書読み込み部、２０４…類似度算出部、２０５…確度算出部、２０６…分類結果出力部、２０７…正解／不正解影響度格納部、２０８…削除文書候補点数算出部、２０９…削除文書候補出力部、２５０…バッファ部、２５１…分類文書格納バッファ部、２５２…登録文書格納バッファ部、２５３…類似度算出結果格納バッファ部、２５４…正解／不正解影響度格納バッファ部、２５５…削除文書候補格納バッファ部
[0001]
[Technical field of the invention]
The present invention relates to document search technology, and in particular to a device and a method for managing a database used for search.
[0002]
[Prior art]
In recent years, large volumes of digitized document data have come into circulation, and technology that automatically classifies which field such document data belongs to has been put to practical use. In a typical technique, a plurality of documents representing various fields are registered in a database in advance; a value indicating the degree of similarity between an input document and each registered document (hereinafter referred to as “similarity”) is obtained using a vector space method or the like; and the field to which the input document is thought to belong is identified with reference to the fields of the documents found to be similar.
[0003]
In such an automatic document classification system, new words that are representative of a field come into use over time, while other words fall out of use. Continuing to use the same database indefinitely therefore leads to a decrease in classification accuracy.
[0004]
Moreover, the automatically assigned field is rarely adopted as is. There has been a method in which a person judges whether each automatic classification result is correct, the documents registered in the database are scored from the correct/incorrect judgment and the similarity information, and documents with poor scores are deleted from the database.
[0005]
As described above, the information conventionally used to decide whether a document should be deleted from the database has been, for example, "whether the field identification result was correct" and "the document in the database that was key to the field identification, and the similarity to that document". An example that uses such a method is the document classification device disclosed in JP-A-2001-155025.
[0006]
However, the fields to which the documents registered in the database belong are not all independent of one another; some fields are closely related. For example, the word "laser" may be used in a wide range of fields, such as printing equipment, medical equipment, storage devices, and measuring instruments. A document containing many such words that are used across various fields is difficult to assign to a single field and tends to be assigned to several possible fields.
[0007]
For a document belonging to a field that is highly related to several other fields, the similarity is likely to be about the same across those fields, so it is difficult to identify a single field. Field identification is also more likely to fail for such documents, but this is due to the characteristics of the field and not because a particular document registered in the database is having an adverse effect.
[0008]
When documents in such hard-to-classify fields are classified, preferentially deleting the documents that contributed to incorrect answers cannot be expected to improve accuracy. On the contrary, documents in related fields become more likely to be deleted, which leads to lower identification accuracy for those fields.
[0009]
There may also be cases in which the words extracted from the document to be classified include almost no words that characterize the fields registered in the database. In such cases too, the similarities between that document and the documents registered in the database all take roughly the same value, so judging the adverse influence of registered documents from the classification result of such a document and deleting them from the database leads to a decrease in accuracy.
[0010]
[Problems to be solved by the invention]
The present invention has been made to solve the above problems, and its object is to provide a document classification device and a document classification method capable of performing database maintenance while maintaining classification accuracy.
[0011]
[Means for Solving the Problems]
The present invention is a management device that manages a database in which registered documents having field information are recorded, the device comprising: first input means for inputting a predetermined document; reading means for reading a registered document registered in the database; similarity calculating means for calculating a similarity between the predetermined document and the registered document; certainty calculating means for calculating a certainty based on the similarity calculated by the similarity calculating means; second input means for inputting the field to which the predetermined document belongs; determining means for determining whether the field to which the predetermined document belongs matches the field in which the registered document is registered; correct answer influence calculating means for calculating a correct answer influence degree based on the similarity and the certainty when the determining means determines that the two fields match; influence calculating means for calculating an incorrect answer influence degree based on the similarity and the certainty when the determining means determines that the two fields do not match; and deleted document candidate score calculating means for calculating a deleted document candidate score from the correct answer influence degree and the incorrect answer influence degree.
[0012]
According to such a configuration, it is possible to maintain the database while maintaining the accuracy of the classification.
[0013]
The present invention is also a database management method for managing a database in which registered documents having field information are recorded, the method comprising: a first input step of inputting a predetermined document; a reading step of reading a registered document registered in the database; a similarity calculating step of calculating a similarity between the predetermined document and the registered document; a certainty calculating step of calculating a certainty based on the similarity calculated in the similarity calculating step; a second input step of inputting the field to which the predetermined document belongs; a determining step of determining whether the field input in the second input step matches the field in which the registered document is registered; a correct answer influence calculating step of calculating a correct answer influence degree based on the similarity and the certainty when the two fields are determined to match in the determining step; an influence calculating step of calculating an incorrect answer influence degree based on the similarity and the certainty when the two fields are determined not to match; and a deleted document candidate score calculating step of calculating a deleted document candidate score from the correct answer influence degree and the incorrect answer influence degree.
[0014]
According to such a configuration, it is possible to maintain the database while maintaining the accuracy of the classification.
[0015]
[Embodiments of the invention]
Before describing the specific configuration of the present invention, an outline of the invention is given as an aid to understanding. In the present invention, in order to perform database maintenance, the user inputs documents that are known in advance to belong to a given field and causes the document classification device to perform a classification operation. A document input by the user in this way is referred to as a “classification document”, and the field that the user already knows the classification document to belong to is referred to as the “correct answer field”.
[0016]
When classifying a document using this classification document, the document classification device calculates the similarity between the classification document and each document registered in the database (hereinafter referred to as a “registered document”), and a certainty derived from that similarity, the number of words referred to when calculating the similarity, and the like.
[0017]
Next, depending on whether the field under which each registered document is classified (hereinafter referred to as the “registered field”) matches or differs from the correct answer field of the classification document, a correct answer influence degree or an incorrect answer influence degree is calculated, respectively. Both are based on the similarity to the document that affected the field identification and on a value indicating how reliable the identification result is (hereinafter referred to as “certainty”), which is derived from information such as the similarity to the compared documents and the number of words used in the comparison. That is, for a given registered document, the similarity and certainty are first obtained for the first classification document. Then, if the registered field of this registered document matches the correct answer field of the classification document, the correct answer influence degree is accumulated; if the registered field and the correct answer field do not match, the incorrect answer influence degree is accumulated.
[0018]
This operation is repeated with a plurality of classification documents, and the correct answer influence degree and the incorrect answer influence degree are accumulated and stored for each registered document.
[0019]
Based on the accumulated correct answer influence degree and incorrect answer influence degree (hereinafter collectively referred to as “correct answer / incorrect answer influence degree”), a deleted document candidate score is calculated for each registered document. The deleted document candidate score is obtained by dividing the incorrect answer influence degree by the correct answer influence degree, so a document whose incorrect answer influence degree is large relative to its correct answer influence degree receives a large score. In database maintenance, documents with a large deleted document candidate score are extracted as deleted document candidates.
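The scoring outlined above can be sketched in Python as follows. This is only an illustrative sketch: the function and variable names are hypothetical, and the 0.001 added to the denominator follows the example formula given for FIG. 11 in the Japanese text of paragraph [0049], where it is said to prevent a division-by-zero error for documents whose correct answer influence degree is 0.

```python
# Offset added to the denominator, as in the example formula described for FIG. 11.
EPSILON = 0.001


def deletion_candidate_score(correct_influence: float, incorrect_influence: float) -> float:
    """Incorrect answer influence divided by (correct answer influence + offset)."""
    return incorrect_influence / (correct_influence + EPSILON)


def rank_deletion_candidates(influences: dict) -> list:
    """Return (doc_id, score) pairs sorted from highest to lowest score.

    `influences` maps a registered document ID to a
    (correct_influence, incorrect_influence) tuple (hypothetical layout).
    """
    scored = [
        (doc_id, deletion_candidate_score(correct, incorrect))
        for doc_id, (correct, incorrect) in influences.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With the values the description gives for registered document ID 1 (correct 0.00249, incorrect 0.25382), `deletion_candidate_score` yields about 72.73, in line with the 72.72779 reported there.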
[0020]
Hereinafter, embodiments of the present invention will be described with reference to the drawings. FIG. 1 is a block diagram showing a hardware configuration of a similar document search device according to an embodiment of the present invention. This apparatus is configured as one function on a computer having a general architecture.
[0021]
As shown in FIG. 1, the similar document search device comprises a control device 1 including a CPU and memory; an input device 2 such as a keyboard, pointing device, scanner, or microphone; a display device 3 for displaying similar-document search results and the like; and an external storage device 4 (hard disk, MO, DVD-RAM, etc.) for storing document data, similarity information, various setting values, and the like.
[0022]
FIG. 2 shows the configuration of the control device 1 in the similar document search device. The control device 1 consists of a program unit 200 and a buffer unit 250. The program unit 200 has the functions of an initialization unit 201, a classification document input unit 202, a registered document reading unit 203, a similarity calculation unit 204, a certainty calculation unit 205, a classification result output unit 206, a correct answer / incorrect answer influence storage unit 207, a deleted document candidate score calculation unit 208, and a deleted document candidate output unit 209.
[0023]
The buffer unit 250 has areas for a classification document storage buffer unit 251, a registered document storage buffer unit 252, a similarity calculation result storage buffer unit 253, a correct answer / incorrect answer influence storage buffer unit 254, and a deleted document candidate storage buffer unit 255.
[0024]
The initialization unit 201 clears each buffer unit in the buffer unit 250. The classification document input unit 202 stores the classification document data that the user inputs with the input device 2 in the classification document storage buffer unit 251. At this time, a classification document ID is issued, and this ID is also stored in the classification document storage buffer unit 251.
The registered document reading unit 203 reads the registered document stored in the external storage device 4 and stores it in the registered document storage buffer unit 252.
[0025]
The similarity calculation unit 204 divides the classification document stored in the classification document storage buffer unit 251 and the registered document stored in the registered document storage buffer unit 252 into words, and calculates the degree of similarity by, for example, a vector space method in which the number of occurrences of each word is a vector component. It then stores the classification document ID, the registered document ID, the similarity, and the field information of the registered document as a set in the similarity calculation result storage buffer unit 253. The similarity may instead be calculated from the number of common words rather than by the vector space method.
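As a minimal sketch of a vector space similarity over word occurrence counts, the following Python code uses cosine similarity. Cosine similarity is a common choice but is an assumption here, since the description only says "a vector space method or the like". Whitespace tokenization is also a simplification (Japanese text would need a word segmenter), and all names are hypothetical.

```python
import math
from collections import Counter


def term_vector(text: str) -> Counter:
    """Split a document into words and count occurrences (naive whitespace split)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors; 0.0 if either is empty."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity(classification_text: str, registered_text: str) -> float:
    """Similarity between a classification document and a registered document."""
    return cosine_similarity(term_vector(classification_text), term_vector(registered_text))
```

A common-word-count variant, which the text also allows, could replace `cosine_similarity` with the size of the intersection of the two word sets.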
[0026]
The certainty calculation unit 205 calculates the total of the similarities stored in the similarity calculation result storage buffer unit 253, calculates the share of that total accounted for by the similarity to each registered document as the certainty, and stores it in the similarity calculation result storage buffer unit 253.
[0027]
The classification result output unit 206 sorts the data stored in the similarity calculation result storage buffer unit 253 according to the similarity, and outputs a field assigned to a registered document having a high similarity.
[0028]
The correct answer / incorrect answer influence storage unit 207 uses the similarity calculation result information stored in the similarity calculation result storage buffer unit 253 and the correct answer field information of the classification document input from the input device 2, and adds, for each registered document, the similarity multiplied by the certainty to the correct answer / incorrect answer influence storage buffer unit 254 as the degree of influence on correct or incorrect answers. Instead of the similarity multiplied by the certainty, the similarity itself may be added only when the certainty is equal to or greater than a set threshold.
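The two ways of computing the value to be added, as described above, might look like the following Python sketch. The function name and the convention that `threshold=None` selects the product variant are assumptions made for illustration.

```python
from typing import Optional


def influence_increment(similarity: float, certainty: float,
                        threshold: Optional[float] = None) -> float:
    """Value added to a registered document's correct or incorrect answer influence.

    With no threshold, the similarity is weighted by the certainty.
    With a threshold, the raw similarity is added only when the
    certainty is equal to or greater than the threshold.
    """
    if threshold is None:
        return similarity * certainty
    return similarity if certainty >= threshold else 0.0
```

In the threshold variant, low-certainty results contribute nothing at all, whereas the product variant only scales them down.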
[0029]
The deleted document candidate score calculation unit 208 calculates a score as a deleted document candidate from the correct answer / incorrect answer influence degrees stored in the correct answer / incorrect answer influence storage buffer unit 254, and stores it in the deleted document candidate storage buffer unit 255. The deleted document candidate output unit 209 sorts the deleted document candidates stored in the deleted document candidate storage buffer unit 255 by deleted document candidate score and outputs them.
[0030]
Next, the operation of the document classification device according to one embodiment of the present invention will be described with reference to the flowcharts of FIGS. 3 and 4.
[0031]
This embodiment is broadly divided into a first step shown in FIG. 3 and a second step shown in FIG. 4. In the first step, in order to select documents to be deleted from those registered in the document classification device, classification processing is performed using classification documents whose correct answer fields the user knows in advance, and the processing results are accumulated. In the second step, candidates for documents to be deleted are output based on the accumulated processing results.
[0032]
First, a first step of accumulating the classification processing result will be described with reference to FIG. First, the user uses the input device 2 to store document data of a registered document to be maintained in the database in the external storage device 4 (Step 301). Subsequently, all buffers are cleared by the initialization unit 201 (step 302).
[0033]
Next, the classification document input unit 202 receives a classification document from the user via the input device 2 and stores it in the classification document storage buffer unit 251 (step 303). As a specific example, suppose that a text document reading “This document describes measuring instruments.”, as shown in FIG. 5, is stored as one of the classification documents.
[0034]
Subsequently, the registered document reading unit 203 reads a plurality of registered documents from the external storage device 4 and stores them in the registered document storage buffer unit 252 (step 304). Each registered document to be searched is given a document ID that identifies the document and information on the field (registered field) that indicates its classification. As a specific example, suppose that data consisting of a document ID, field information, and body text is stored as shown in FIG. 6. For example, the document with document ID “1” belongs to the “engine” field, and the data “This document describes an engine.” is stored as its body text. Longer body text is of course processed in the same way. The same processing is then performed for each registered document with document ID “2”, “3”, and so on.
[0035]
Next, the similarity calculation unit 204 compares the classification document stored in the classification document storage buffer unit 251 with the body text of each registered document stored in the registered document storage buffer unit 252, calculates the similarity, a numerical value indicating the degree of similarity, using the vector space method, and stores it in the similarity calculation result storage buffer unit 253 together with the registered document ID and the field information indicating that document's classification (step 305). For the vector space method, a technique such as that described in JP-A-2000-311173 can be used.
[0036]
At this point, only a fixed number of results with the highest similarity, or only results at or above a certain similarity, may be stored. In the storage example of FIG. 7, for the classification document with classification document ID “1”, the first record for a registered document holds document ID = 1023, registered field = storage devices, and similarity = 0.378. The second, third, and subsequent records are stored in the same way.
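The optional filtering mentioned above (keep only the top few results, or only those above a similarity floor) could be sketched as below. The record layout and parameter names are hypothetical; when applying it, only values actually given in the description (such as document 1023 with similarity 0.378) should be treated as source data.

```python
def select_results(results: list, top_n: int = None, min_similarity: float = None) -> list:
    """Return results sorted by descending similarity, optionally filtered.

    `results` is a list of dicts with at least a "similarity" key.
    `min_similarity` keeps only results at or above that value;
    `top_n` then keeps only the first N of what remains.
    """
    kept = sorted(results, key=lambda r: r["similarity"], reverse=True)
    if min_similarity is not None:
        kept = [r for r in kept if r["similarity"] >= min_similarity]
    if top_n is not None:
        kept = kept[:top_n]
    return kept
```

Calling `select_results(results)` with neither option simply sorts, which matches the case where all similarity calculation results are stored.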
[0037]
Next, it is determined whether any registered document remains for which the similarity has not yet been calculated (step 306). If so, the process returns to step 304 and repeats steps 304 and 305 for the remaining registered documents. If no registered documents remain, the process proceeds to step 307.
[0038]
Next, the sum of the similarities of the registered documents in the similarity calculation results stored in the similarity calculation result storage buffer unit 253 in step 305 is calculated, the share of that sum accounted for by each document's similarity is calculated as the certainty, and the certainty is stored in the similarity calculation result storage buffer unit 253 (step 307). As long as the certainty is a value that expresses how reliable the classification result is, it may be calculated from something other than the share of the total similarity, such as the number of common words found when the documents are compared. FIG. 8 shows an example of calculating the certainty for the example of FIG. 7. The sum of the similarities of the registered documents stored in the similarity calculation result storage buffer unit 253 is 2.783. If the document with registered document ID “1023” has a similarity of 0.378 to the classification document with classification document ID “1”, its certainty is 0.378 ÷ 2.783 = 0.136, and 0.136 is stored as the certainty value. The certainty is obtained and stored in the same way for the other documents.
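The share-of-total calculation in step 307 can be illustrated with a short Python sketch. The function name and the dictionary layout are assumptions; the figures 0.378 and 2.783 are the ones quoted above for FIG. 8, whose full table is not reproduced in the text.

```python
def certainties(similarities: dict) -> dict:
    """Map each registered document ID to its share of the total similarity.

    `similarities` maps registered document ID -> similarity to one
    classification document. Returns 0.0 for every document if the
    total is zero, to avoid dividing by zero.
    """
    total = sum(similarities.values())
    if total == 0:
        return {doc_id: 0.0 for doc_id in similarities}
    return {doc_id: sim / total for doc_id, sim in similarities.items()}


# Worked figure from the description: 0.378 out of a total of 2.783.
certainty_1023 = round(0.378 / 2.783, 3)
```

Because every certainty is a share of the same total, the certainties for one classification document sum to 1, so a document matched about equally by many registered documents gives each of them a low certainty.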
[0039]
Next, each information calculated up to step 307 is output (step 308). This output is preferably in the form of outputting the information shown in FIG. 8, but it is also possible to sort in the order of similarity and output the field information of the registered fields assigned in order from the top document. FIG. 9 shows an example in which field information is output after sorting. Here, for each registered field, the sum of the similarities of the registered documents included in the field is calculated and arranged in descending order. This output is not used in the subsequent processing, but has the effect of making it easier for the user to grasp the classification situation.
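The per-field output described for FIG. 9 (sum the similarities of the registered documents in each registered field, then list fields in descending order) might be sketched as follows. The record layout and names are hypothetical.

```python
from collections import defaultdict


def rank_fields(results: list) -> list:
    """Return (field, summed similarity) pairs, highest sum first.

    `results` is a list of dicts with "field" and "similarity" keys,
    one per registered document compared with the classification document.
    """
    totals = defaultdict(float)
    for record in results:
        totals[record["field"]] += record["similarity"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

The first entry of the returned list is the field the classifier would propose for the classification document.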
[0040]
When the classification result has been output, it is determined whether any other classification document to be processed remains (step 309). If one remains, the process returns to step 303, and steps 303 to 308 are repeated. Since the data corresponding to FIG. 8 differs for each classified document, it is stored separately for each classified document. If no classification document remains, the classification processing ends.
[0041]
Next, the second step will be described, in which candidates for documents to be deleted are output from among the registered documents using the similarity calculation results stored in the first step. FIG. 4 is a flowchart showing the procedure.
[0042]
First, buffers other than the similarity calculation result storage buffer unit 253 are cleared by the initialization unit 201 (step 351). Next, the ID of the classification document used in the first step and the correct answer field of the classification document are input from the input device 2 (step 352).
[0043]
Next, based on the similarity calculation result corresponding to the classification document ID input in step 352 and on the correct answer field, the following is done for each registered document: if its registered field matches the correct answer field, the value obtained by multiplying the similarity by the accuracy is added to the correct / incorrect answer influence storage buffer unit 254 as the correct answer influence degree; if the fields do not match, the same product is added as the incorrect answer influence degree (step 353). Multiplying the similarity by the accuracy acts as a kind of weighting. The correct answer / incorrect answer influence degree is managed for each registered document.
[0044]
Suppose the classification document ID is 1, the correct answer field is “measurement device”, and the similarity calculation results are as shown in FIG. 8. The registered field of the document with registered document ID = 1023 differs from the correct answer field, so the value obtained by multiplying the similarity by the accuracy, 0.378 × 0.136 = 0.051, is added to its incorrect answer influence degree and stored. The registered field of the document with registered document ID = 5933 is the same as the correct answer field, so the value obtained by multiplying the similarity by the accuracy, 0.172 × 0.062 = 0.011, is added to its correct answer influence degree and stored.
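Step 353 can be sketched in Python as follows. The function name `accumulate_influence` and the record layout are assumptions for illustration. The similarity and accuracy values are those of the worked example; the text says only that the field of document 1023 differs from the correct answer field, so the field name given to it here is a placeholder.

```python
from collections import defaultdict


def accumulate_influence(results, correct_field, influence):
    """Add similarity x accuracy to the correct or incorrect influence of each registered document."""
    for doc_id, registered_field, similarity, accuracy in results:
        weight = similarity * accuracy
        key = "correct" if registered_field == correct_field else "incorrect"
        influence[doc_id][key] += weight
    return influence


# Stands in for buffer unit 254: one running pair of totals per registered document.
influence = defaultdict(lambda: {"correct": 0.0, "incorrect": 0.0})

# (registered document ID, registered field, similarity, accuracy)
results = [
    ("1023", "some other field", 0.378, 0.136),  # placeholder field name
    ("5933", "measurement device", 0.172, 0.062),
]

accumulate_influence(results, "measurement device", influence)
print(round(influence["1023"]["incorrect"], 3))  # 0.051
print(round(influence["5933"]["correct"], 3))    # 0.011
```

Calling the function again with the results of another classified document keeps adding to the same totals, which corresponds to managing the influence per registered document.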
[0045]
Here, the value obtained by multiplying the similarity by the accuracy is used as the value added to the correct / incorrect answer influence degree, but another method of adding to the incorrect answer influence degree may also be used.
[0046]
Subsequently, it is determined whether any similarity calculation result of the classified document being processed remains (step 354). If one remains, the process returns to step 353, and step 353 is repeated. The processing may cover all the similarity calculation results, or only the top-ranked results by similarity, or only those whose similarity or accuracy is at or above a certain value. If no similarity calculation result remains to be processed, the process proceeds to step 355.
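The options for restricting which similarity calculation results are processed could be expressed as a small filter. This is a sketch only: the function name `select_results` and its parameters `top_n`, `min_similarity` and `min_accuracy` are hypothetical, and the specification does not give concrete cutoff values.

```python
def select_results(results, top_n=None, min_similarity=None, min_accuracy=None):
    """Pick the similarity calculation results to feed into step 353."""
    selected = sorted(results, key=lambda r: r["similarity"], reverse=True)
    if top_n is not None:
        selected = selected[:top_n]  # keep only the top-ranked results
    if min_similarity is not None:
        selected = [r for r in selected if r["similarity"] >= min_similarity]
    if min_accuracy is not None:
        selected = [r for r in selected if r["accuracy"] >= min_accuracy]
    return selected


# Values for the two registered documents discussed in the worked example.
rows = [
    {"doc_id": "5933", "similarity": 0.172, "accuracy": 0.062},
    {"doc_id": "1023", "similarity": 0.378, "accuracy": 0.136},
]

print([r["doc_id"] for r in select_results(rows)])                    # all results
print([r["doc_id"] for r in select_results(rows, top_n=1)])           # top result only
print([r["doc_id"] for r in select_results(rows, min_accuracy=0.1)])  # accuracy cutoff (hypothetical value)
```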
[0047]
In step 355, it is determined whether any other correct answer information remains. If so, the process returns to step 352, and steps 352 to 354 are repeated; if not, the process moves on to step 356. When correct answer information remains, that is, when the calculation results for another classified document are to be used, the calculation results managed under the classified document ID corresponding to that correct answer information are used.
[0048]
FIG. 10 shows an example of the correct answer / incorrect answer influence storage buffer unit 254 after the correct answer / incorrect answer influence has been obtained for several classified documents and accumulated for each registered document. For example, for the document having the registered document ID “1”, the registration field is “engine”, the correct answer influence degree is 0.00249, and the incorrect answer influence degree is 0.25382. A registered document with a high correct answer influence degree is in the same field as the classified documents and generally has high similarity and high accuracy. Conversely, a registered document with a high incorrect answer influence degree is in a different field from the classified documents yet has high similarity and accuracy, and can therefore be regarded as a confusing document.
[0049]
In step 356, the “deleted document candidate score” is calculated for each registered document registered in the database, based on the correct / incorrect answer influence stored in the correct / incorrect answer influence storage buffer unit 254 in step 353. FIG. 11 shows an example of a formula for calculating the deleted document candidate score. Here, the deleted document candidate score is obtained as “incorrect answer influence degree ÷ (correct answer influence degree + 0.001)”. The 0.001 is added to the correct answer influence degree in the denominator to prevent a division-by-zero error when a document's correct answer influence degree is 0. According to this formula, the more a registered document's incorrect answer influence exceeds its correct answer influence, the higher its deleted document candidate score. The deleted document candidate score calculation unit 208 stores the calculated result in the deleted document candidate storage buffer unit 255.
[0050]
FIG. 12 shows the result of applying the deleted document candidate score formula shown in FIG. 11 to the correct / incorrect answer influences of FIG. 10 stored in the correct / incorrect answer influence storage buffer unit 254. This result is stored in the deleted document candidate storage buffer unit 255. For example, for the registered document with the registered document ID “1”, the deleted document candidate score is obtained as “0.25382 / (0.00249 + 0.001)”, and the value is 72.727779.
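The score formula can be checked with a few lines of Python. The function name `deletion_candidate_score` and the constant name `EPSILON` are illustrative assumptions; the 0.001 offset and the influence values for registered document 1 are taken from the text.

```python
EPSILON = 0.001  # offset quoted in the text; keeps the denominator non-zero


def deletion_candidate_score(correct_influence, incorrect_influence, epsilon=EPSILON):
    """Incorrect answer influence divided by (correct answer influence + epsilon)."""
    return incorrect_influence / (correct_influence + epsilon)


# Registered document 1: correct 0.00249, incorrect 0.25382.
score = deletion_candidate_score(0.00249, 0.25382)
print(round(score, 2))  # 72.73

# A document that never contributed to a correct answer still gets a finite score.
never_correct = deletion_candidate_score(0.0, 0.25382)
```

Rounded to two decimal places the result agrees with the value quoted in the text. Note that the offset also caps how strongly a zero correct answer influence can inflate the score: in that case the score is simply the incorrect answer influence multiplied by 1000.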
[0051]
Next, it is determined whether any correct / incorrect answer influence data remains, that is, whether any registered document remains for which the deleted document candidate score has yet to be calculated (step 357). If so, the process returns to step 356, and step 356 is repeated. If not, the process proceeds to step 358.
[0052]
In step 358, the contents of the deleted document candidate storage buffer unit 255 are sorted by the deleted document candidate scores calculated in step 356, and the documents are output as deleted document candidates in descending order of score. FIG. 13 shows an output example of deleted document candidates. It indicates, for example, that the document with the registered document ID 9924 is a document in the field of measurement equipment and that its deleted document candidate score is 129.16973.
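The sort and output of step 358 amounts to ordering the stored scores from highest to lowest, as in the sketch below. The function name `rank_deletion_candidates` and the record layout are assumptions; the two records reuse the scores and fields quoted in the text for registered documents 1 and 9924, and this two-entry list is not a reproduction of FIG. 13.

```python
def rank_deletion_candidates(candidates):
    """Return candidates ordered from highest to lowest deleted document candidate score."""
    return sorted(candidates, key=lambda row: row["score"], reverse=True)


# Stands in for buffer unit 255.
candidates = [
    {"doc_id": "1", "field": "engine", "score": 72.727779},
    {"doc_id": "9924", "field": "measurement equipment", "score": 129.16973},
]

ranked = rank_deletion_candidates(candidates)
for row in ranked:
    print(row["doc_id"], row["field"], row["score"])
```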
[0053]
This is the end of the registered document deletion candidate output process of the second step. The user can view the output and select a registered document to be deleted from the database. It is also possible to automate the process up to and including this selection.
[0054]
Documents belonging to a field that is closely related (highly similar) to a plurality of other fields are difficult to assign to a single field, and the probability that the identified field matches the correct field tends to be low. In the present invention, when the accuracy is low, the value added as the incorrect answer influence degree is small, so that a drop in classification accuracy caused by preferentially deleting documents in such fields can be suppressed.
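This damping effect can be made explicit with a short derivation. The symbols are notation introduced here for illustration and do not appear in the specification: $s_i$ is the similarity of registered document $i$ to the classified document, $a_i$ is its accuracy as defined in step 307, and $n$ is the number of registered documents compared.

```latex
\begin{align*}
a_i &= \frac{s_i}{\sum_{j=1}^{n} s_j},
& s_i\,a_i &= \frac{s_i^{2}}{\sum_{j=1}^{n} s_j}, \\
% special case: every registered document is equally similar, s_j = s
a_i &= \frac{1}{n},
& s_i\,a_i &= \frac{s}{n}.
\end{align*}
```

When the similarity is spread evenly over many registered documents, each document's contribution to the incorrect answer influence therefore shrinks in proportion to $1/n$, whereas a registered document that dominates the total contributes close to its full similarity.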
[0055]
Further, when a document that is unsuitable for classification is classified, for example one whose extracted words include few words representing the characteristics of any field, the probability that it is assigned to the correct answer field decreases. Even if the field identification result for such a document is incorrect, little is added as the incorrect answer influence degree when the accuracy is small, so the extraction of registered documents as deletion candidates on the basis of such results can be reduced.
[0056]
[Effect of the invention]
As described above, according to the present invention, it is possible to perform database maintenance while maintaining the accuracy of classification.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a hardware configuration of a similar document search device according to an embodiment of the present invention.
FIG. 2 is a functional block diagram of a control device of the similar document search device according to the embodiment of the present invention.
FIG. 3 is a flowchart showing the flow of a document classification process.
FIG. 4 is a flowchart showing the flow of processing for outputting a candidate for a deleted document.
FIG. 5 is a diagram showing an example of a classified document.
FIG. 6 is a diagram showing an example of a registered document.
FIG. 7 is a diagram showing an example of a similarity calculation result.
FIG. 8 is a diagram showing an example of accuracy calculation.
FIG. 9 is a diagram showing an output example of a classification result.
FIG. 10 is a diagram showing an example of a correct answer / incorrect answer influence degree.
FIG. 11 is a diagram showing an example of a deleted document candidate point calculation formula.
FIG. 12 is a diagram showing an example of storing deleted document candidates.
FIG. 13 is a view showing an example of output of a deleted document candidate.
[Explanation of symbols]
1 ... control device, 2 ... input device, 3 ... display device, 4 ... external storage device, 200 ... program unit, 201 ... initialization unit, 202 ... classification document input unit, 203 ... registered document reading unit, 204 ... similarity calculation unit, 205 ... accuracy calculation unit, 206 ... classification result output unit, 207 ... correct / incorrect answer influence storage unit, 208 ... deleted document candidate score calculation unit, 209 ... deleted document candidate output unit, 250 ... buffer unit, 251 ... classification document storage buffer unit, 252 ... registered document storage buffer unit, 253 ... similarity calculation result storage buffer unit, 254 ... correct / incorrect answer influence storage buffer unit, 255 ... deleted document candidate storage buffer unit.

Claims

1. A database management device that manages a database in which registered documents having field information are recorded, comprising:
First input means for inputting a predetermined document;
Reading means for reading a registered document registered in the database;
Similarity calculating means for calculating the similarity between the predetermined document and the registered document;
Accuracy calculating means for calculating an accuracy based on the similarity calculated by the similarity calculating means;
Second input means for inputting a field to which the predetermined document belongs;
Determining means for determining a match / mismatch between the field to which the predetermined document belongs and the field in which the registered document is registered;
Correct answer influence calculating means for calculating a correct answer influence based on the similarity and the accuracy when the determining means determines that the field to which the predetermined document belongs and the field in which the registered document is registered match;
Incorrect answer influence calculating means for calculating an incorrect answer influence based on the similarity and the accuracy when the determining means determines that the field to which the predetermined document belongs and the field in which the registered document is registered do not match; and
Deleted document candidate score calculating means for calculating a deleted document candidate score from the correct answer influence and the incorrect answer influence.

2. The database management device according to claim 1, wherein a plurality of predetermined documents can be input, the correct answer influence calculating means and the incorrect answer influence calculating means calculate the correct answer influence or the incorrect answer influence for each of the plurality of predetermined documents, and the correct answer influence and the incorrect answer influence are accumulated for each of the registered documents.

3. The database management device according to claim 1, wherein the accuracy calculating means calculates, as the accuracy, a value obtained by dividing the similarity of the registered document by the sum of the similarities of all the registered documents, including that registered document.

4. The database management device according to claim 1, wherein the correct answer influence calculating means calculates the product of the similarity and the accuracy as the correct answer influence.

5. The database management device according to claim 1, wherein the incorrect answer influence calculating means calculates the product of the similarity and the accuracy as the incorrect answer influence.