JP4671212B2

JP4671212B2 - Document search apparatus, document search method, program, and recording medium

Info

Publication number: JP4671212B2
Application number: JP2001088734A
Authority: JP
Inventors: 博子真野; 泰嗣小川
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2001-03-26
Filing date: 2001-03-26
Publication date: 2011-04-13
Anticipated expiration: 2021-03-26
Also published as: JP2002288215A

Description

【０００１】
【発明の属する技術分野】
本発明は、文書検索装置、文書検索方法、文書検索装置の機能を実行させるためのプログラムおよびそのプログラムを記録したコンピュータ読みとり可能な記録媒体に関し、より詳細には、与えられたキーワードに対して適合する文書を選択し、この適合文書から抽出したキーワードの関連語を付加したキーワードによって適合する文書を検索しなおすことにより、ユーザの所望する文書が検索できる文書検索装置、文書検索方法、文書検索装置の機能を実行させるためのプログラムおよびそのプログラムを記録したコンピュータ読みとり可能な記録媒体に関する。
【０００２】
【従来の技術】
文書を多数集積している文書データベースからユーザの必要とする文書を探しだすには、ユーザが入力したキーワードを用いて一旦検索した後、そのキーワードに適合した文書中に出現する単語の中から入力したキーワードに関連した単語を選出し、はじめに入力したキーワードに追加し、再度、検索することで、よりユーザの求めるものに近いものを得る方法が知られている。
例えば、キーワードの関連語を選出する方法として、適合文書中の各単語について、適合文書の中での出現状況などの統計情報を利用して、キーワードとの関連度を算出し、その値の大きい上位何単語かを選出する方法が提案されている（文献１：Robertson,S.E."On term selection for query expansion,"Journal of Documentation 46,Dec 1990,p359-364）。
次に、この従来の関連語抽出方法を説明する。
ユーザから入力されたキーワード中の各単語に対して単語の重要度に応じた重みを付与する。この単語の重みの計算式には、例えば、確率モデルにもとづくRobertsonの計算式（式１）が知られている（文献２：Robertson,S.E. and Walker,S."On relevance weights with little relevance information,"SIGIR97,ACM Press,pp.16-24）。この文献２の技術においては、キーワード中の各単語の重みは、検索対象文書全体の中での各単語の出現状況Wp、Wqに応じて付与される。
W（重み）＝Wp－Wq ………（式１）
ここで
Wp＝k4+log(N/(N-n))
Wq＝log(n/(N-n))
N:検索対象総文書数
n:単語の出現する文書数
k4:調整パラメータ
【０００３】
次に、キーワード中の各単語の重みをもとに、各文書の文書適合度を計算する。この文書適合度の計算式は、例えば、文献２の計算式（式２）で求まる。
F（適合度）＝Σ(W×tf/(k1+tf)) ………（式２）
ここで
W：（式１）で求めた単語の重み
tf:文書あたりの単語の出現数
k1:調整パラメータ
各文書の文書適合度を求め、適合度の高い順に各文書を順序づけ、上位何件かを適合文書とみなし、下位何件かを非適合文書とみなす。
適合文書の選出後、適合文書中の不要語（たとえば冠詞のaなど）を除いたすべての単語について、適合文書および非適合文書での出現状況、すなわちフィードバック情報を反映させて、それぞれの単語の重みを再計算する。
適合文書選出後の重みは、例えば、文献２の計算式（式３）を用いて、検索対象文書全体での出現状況Wp、Wq（（式１）のコメント参照）と適合文書／非適合文書の中での出現状況WrとWsを比率CpとCqで足し合わせて付与される。
W'（重み）＝(Cp・Wp+(1-Cp)・Wr)-(Cq・Wq+(1-Cq)・Ws)……（式３）
ここで
Wr=log((r+0.5)/(R-r+0.5))
Ws=log((s+0.5)/(S-s+0.5))
Cp＝k5/(k5+√R)
Cq＝k6/(k6+√S)
R:適合文書数
r:適合文書集合の中で単語の出現する文書数
S:非適合文書数
s:非適合文書集合の中で単語の出現する文書数
k5,k6:調整パラメータ
さらに、この重みとフィードバック情報から適合文書中の不要語を除いた各単語について、キーワードとの関連度を求める。
関連度の算出方法としては、たとえば、Boughanemの計算式（式４）がある(文献３：Walker,S.etal.,"Okapi at TREC-6:Automated adhoc,VLC,routing,filtering and QSDR,"The Sixth Text REtrieval Conference(TREC-6),1996,NIST)。
関連度＝(r/R-α・s/S)×W' ………（式４）
ここで
α:調整パラメータ
このようにして、適合文書中の各単語について、キーワードとの関連度を求めて、関連度の高いものから順にキーワード関連語として選出し、入力したキーワードに追加して新しいキーワードを作成する。
この新しいキーワードを用いて、再度、適合文書を選出する。このとき、文書適合度の算出には、上記（式３）で求めた重みが使われる。
【０００４】
【発明が解決しようとする課題】
しかしながら、上記の従来技術では、キーワードとの関連度は、単語の適合文書および非適合文書内での出現回数およびその単語の重みを算出した上で求めていた。しかし、単語の重みを計算するには、検索対象文書中でその単語の出現する文書数を調べる必要があり、そのための処理時間がかかっていた。
一方、従来技術は、検索精度の面でも、単語によっては、再検索時の重みが必要以上に大きくなり検索に影響することがあった。
特に、インターネット上の文書など語彙が統制されていない文書においては、その文書を作成した者しか使用しないような特殊な単語や誤った表記が出現しがちであるが、上記の従来技術では、このような単語に、極端に大きな重みがついてしまい、再検索での検索精度を劣化させるということがある。
本発明は、上述の問題を解決するために、検索に寄与する単語をキーワードの関連語として選出し、その関連語で再検索することによって、ユーザの所望している的確な文書を検索することができる文書検索装置、文書検索方法、文書検索装置の機能を実行させるためのプログラムおよびそのプログラムを記録したコンピュータ読みとり可能な記録媒体を提供することを目的とする。
さらに、関連度を算出する際の単語の重み計算の負荷をなくし、関連語選出にかかる時間を減らすことも目的とする。
また、関連語として検索結果への影響の小さく、且つ、重みが極端に小さい単語を選択しないようにして、無駄な検索時間がかからないようにすることも目的とする。
また、関連語として汎用性が低いにもかかわらず重みの大きい単語を関連語として選択されないようにして、再検索の精度に影響しないようにすることも目的とする。
【０００５】
【課題を解決するための手段】
請求項１の発明は、入力されたキーワードに適合する文書を複数の文書を保持する文書データベースから検索する文書検索装置であって、前記文書データベースから前記キーワードに適合する文書および適合しない文書を選出する文書ランキング部と、前記文書ランキング部で選出された適合文書中に出現する単語の前記キーワードとの関連度を、Rを適合文書数、Sを非適合文書数、rtfを適合文書中の各文書における出現回数、stfを非適合文書中の各文書における出現回数、K、βを調整パラメータとしてΣ(rtf/K+rtf)/R-β×Σ(stf/K+stf)/Sにより算出し、この関連度が高い単語を前記キーワードの関連語として選出する単語ランキング部と、前記単語ランキング部で選出した関連語をもとの前記キーワードに追加して新しいキーワードとするキーワード生成部とを備え、前記キーワード生成部で生成された新しいキーワードに適合する文書を再度、前記文書ランキング部で検索するようにしたことを特徴とする。
請求項２の発明は、請求項１に記載の文書検索装置において、単語辞書を有し、前記単語ランキング部は、前記関連度の高い単語を選出した後、その単語の重みが大きく、かつその単語が前記単語辞書に登録されていない場合、その単語を前記関連語から除外することを特徴とする。
請求項３の発明は、入力されたキーワードに適合する文書を複数の文書を保持する文書データベースから検索する文書検索装置による文書検索方法であって、前記文書データベースから前記キーワードに適合する文書および適合しない文書を選出し、前記選出された適合文書中に出現する単語の前記キーワードとの関連度を、Rを適合文書数、Sを非適合文書数、rtfを適合文書中の各文書における出現回数、stfを非適合文書中の各文書における出現回数、K、βを調整パラメータとしてΣ(rtf/K+rtf)/R-β×Σ(stf/K+stf)/Sにより算出し、この関連度が高い単語を前記キーワードの関連語として選出して、前記キーワードに追加して新しいキーワードとし、この生成された新しいキーワードに適合する文書を再度、前記文書データベースを検索するようにしたことを特徴とする。
請求項４の発明は、請求項３に記載の文書検索装置による文書検索方法において、単語辞書を有し、前記関連度の高い単語を選出した後、その単語の重みが大きく、かつその単語が前記単語辞書に登録されていない場合、その単語を前記関連語から除外することを特徴とする。
請求項５の発明は、入力されたキーワードに適合する文書を複数の文書を保持する文書データベースから検索する文書検索装置のコンピュータを、前記文書データベースから前記キーワードに適合する文書および適合しない文書を選出する文書ランキング部、前記文書ランキング部で選出された適合文書中に出現する単語の前記キーワードとの関連度を、Rを適合文書数、Sを非適合文書数、rtfを適合文書中の各文書における出現回数、stfを非適合文書中の各文書における出現回数、K、βを調整パラメータとしてΣ(rtf/K+rtf)/R-β×Σ(stf/K+stf)/Sにより算出し、この関連度が高い単語を前記キーワードの関連語として選出する単語ランキング部、前記単語ランキング部で選出した関連語をもとの前記キーワードに追加して新しいキーワードとするキーワード生成部、として機能させるプログラムであって、文書ランキング部は、前記キーワード生成部で生成された新しいキーワードに適合する文書を再度検索することを特徴とする。
請求項６の発明は、請求項５に記載のプログラムを記録したコンピュータ読みとり可能な記録媒体である。
【０００８】
【発明の実施の形態】
以下に、図面を用いて本発明の実施例の構成および動作を詳細に述べる。
（１）実施例の構成
図１は、本発明の一実施例である文書検索装置の構成を示すブロック図である。
本実施例の文書検索装置は、キーワード入力部１１０、文書ランキング部１２０、単語ランキング部１３０、キーワード生成部１４０、文書出力部１５０、文書データベース１６０より構成される。
キーワード入力部１１０は、ユーザがキーボード等により、文書データベース１６０中にある文書の特徴をあらわすキーワードとなる文字列を入力する。
この入力された文字列は、必要に応じて、単語辞書１７０をもちいて形態素解析され単語に分解する。
この単語辞書１７０は、少なくとも各単語の表記、品詞等から構成される。
または、単語辞書１７０を使わず、この入力された文字列をｎ−ｇｒａｍに区切って、それを単語としてもよい。
【０００９】
文書ランキング部１２０は、キーワード入力部１１０から渡されたキーワードに対して、文書データベース１６０を検索し、適合する文書と適合しない文書とを選定する。この選定された適合文書は、文書出力部１５０へ渡される。
また、文書ランキング部１２０は、キーワード生成部１４０で生成された新しいキーワードに対してもう一度適合する文書を選定する。
文書データベース１６０は、検索対象となる文書を保持する文書情報と、その文書中に含まれている各単語の単語統計情報から構成される（図２参照）。
例えば、文書情報には、各文書に対して次のような情報が保持される。
文書識別子（ＩＤ）、文書名、書誌事項（作成者、作成日、発行所等）、文書実体へのポインタ等
また、単語統計情報には、単語ごとに次のような統計情報を保持する。
単語の表記、この単語の文書データベース全体での出現頻度、単語出現情報等ここで単語出現情報としては、単語が出現する文書ごとに次の情報を保持する。
この単語が出現する文書の文書識別子、この文書にこの単語が出現する単語出現頻度、この文書にこの単語が出現する出現位置の一覧等
【００１０】
単語ランキング部１３０は、文書ランキング部１２０で選定された適合文書の文書識別子から文書データベース１６０に格納されている文書を取り出し、形態素解析あるいはｎ−ｇｒａｍによって区切って、単語を抽出し、予め用意された不要語表にこの抽出した単語が登録されていれば削除し、残りの単語を関連語候補とする。次の（式５）で計算した値を入力されたキーワードとこの関連語候補との関連度として算出する。
関連度＝Σ_i(rtf_i/K+rtf_i)/R-β×Σ_j(stf_j/K+stf_j)/S ……（式５）
ここで、Rを適合文書数、Sを非適合文書数、
rtf_iを適合文書の文書_iにおける出現回数、
stf_jを非適合文書の文書_jにおける出現回数、
Kおよびβを調整パラメータとする。
また、（式５）の右辺第１項は、適合文書の各文書についての和であり、第２項は、非適合文書の各文書についての和であるとする。
この取り出された単語の中から、所定の件数（例えば、１０個程度）の関連度の高い上位の単語を関連語候補として選出する。
この関連語候補の中から、検索結果への影響が小さく追加するに値しない単語を除外して、残りを関連語とする。
例えば、関連語として採用する単語の重みの下限を予め定め、この下限に満たない重みの単語を、関連語候補から削除する。
また、関連語候補の中から、汎用性の低い単語を除外して、残りを関連語とする。例えば、関連語として採用する単語の重みの上限を予め定め、この上限を越える重みの単語については、単語辞書１７０に登録されていない単語を関連語候補から削除する。
【００１１】
ここで、上述の単語の重みは（式３）で計算する。
このようにして決定された関連語をキーワード生成部１４０へ渡す。
キーワード生成部１４０は、単語ランキング部１３０から渡された関連語をもとのキーワードに追加して新しいキーワードを生成し、文書ランキング部１２０へ渡す。
文書出力部１５０は、文書ランキング部１２０で選出した適合文書をプリンタ、表示装置、記憶装置等へ出力するか、または、ネットワークを介して他のコンピュータ装置へ送信する。
【００１２】
（２）実施例の動作
次に、このように構成された本実施例の文書検索装置の動作について、図３のフローチャートに基いて説明する。
まず、キーボード等の入力装置から、例えば、英語や日本語の単語や単語の組み合わせで構成されるキーワードを文字列として入力し、必要に応じて単語辞書１７０によって形態素解析して、単語に分解する（ステップＳ１００）。
または、単語辞書１７０を使わず、この入力された文字列をｎ−ｇｒａｍに区切って、それを単語としてもよい。
これにより、キーワード入力部１１０を構成する。
この入力されたキーワード中のそれぞれの単語について、文書データベース１６０の単語統計情報を参照し、例えば、上記（式１）を用いて単語の重要度に応じた重みを計算する（ステップＳ１１０）。
次に、検索対象である文書データベース１６０中のそれぞれの文書に対して、文書データベース１６０の単語統計情報とステップＳ１１０で計算されたキーワードの単語の重みとを参照し、その文書にキーワード中の単語がどのくらい含まれているかを示す適合度を、例えば、上記（式２）を用いて計算し、文書一覧表を作成する（ステップＳ１２０）。
この文書一覧表を適合度をキーとして、降順に各文書を順序付け、その上位から所定の件数（例えば、１０件程度）の文書を適合文書とみなし、下位から所定の件数（例えば、５００件程度）の文書を非適合文書とみなす（ステップＳ１３０）。
あるいは、順序づけられた文書の一覧表（適合度、文書名や書誌事項等の一覧）をユーザに提示し、適合しているかどうか指示させ、適合していると指示された文書を適合文書とし、適合しないと指示された文書を非適合文書とするようにしてもよい。
【００１３】
ステップＳ１１０からステップＳ１３０までにより、文書ランキング部１２０を構成する。
ステップＳ１３０で選出した適合文書がユーザの所望した文書であるかどうかをユーザに指示させる（ステップＳ１４０）。
所望した文書でなければ、ステップＳ１５０へ進む。所望した文書であれば、ステップＳ１７０へ進む。
ステップＳ１３０で選出された適合文書を表示装置、プリンタや記憶装置等の出力装置へ、例えば、ランク順に文書名や書誌事項等を一覧として出力したり、また、ネットワークで接続された他のコンピュータ装置へ送信することによってユーザに提示される（ステップＳ１７０）。
これにより、文書出力部１５０を構成する。
ステップＳ１３０で選定された適合文書の文書識別子から文書データベース１６０に格納されている文書を取り出し、その文書を形態素解析やｎ−ｇｒａｍで区切った単語を抽出し、この抽出された単語が予め用意された不要語表に登録されていれば、その単語を削除した残りの単語を関連語候補とする。
この抽出された関連語候補に対して、（式５）で計算した値を入力されたキーワードとの関連度として算出する。
この取り出された関連度の高い関連語候補から順に所定の件数（例えば、１０単語程度）だけ選択する。
【００１４】
この選択された関連語候補の中から、検索結果への影響が小さく追加するに値しない単語を除外して、残りを関連語とする。
例えば、関連語として採用する単語の重みの下限を予め定め、この下限に満たない重みの単語を、関連語候補から削除する。
また、この選択された関連語候補の中から、汎用性の低い単語を除外して、残りを関連語とする。例えば、関連語として採用する単語の重みの上限を予め定め、この上限を越える重みの単語については、単語辞書１７０に登録されていない単語を関連語候補から削除する。
このようにして削除・選択された関連語候補をキーワードの関連語として抽出する（ステップＳ１５０）。
これにより単語ランキング部１３０を構成する。
単語の関連語（ステップＳ１５０）をもとのキーワードに追加して新しいキーワードを作成する（ステップＳ１６０）。
これによりキーワード生成部１４０を構成する。
この新しいキーワードをステップＳ１１０からステップＳ１３０（文書ランキング部１２０）の処理と同様にして、再度、適合文書を選出する。
【００１５】
本実施例の文書検索装置をこのような構成にすることによって、次のような効果を達成すると共に、検索に寄与する単語をキーワードの関連語として選出することができるので、ユーザの所望している的確な文書を検索することができる。
・関連度算出から単語の重み計算を省くことによって、関連語選出にかかる時間が減少する。
・また、検索結果への影響の小さい、重みが極端に小さい単語を選択しないようにして、無駄な検索時間がかからなくなる。
・また、汎用性が低いにもかかわらず重みが大きい単語を排除することにより、再検索に影響しないようになり、検索精度が向上する。
【００１６】
＜コンピュータによる実施例＞
さらに、本発明は上記の実施の形態のみに限定されたものではない。例えば、図１に示した文書検索装置は、図４のようなハードウェア構成を持つコンピュータ装置２００によっても実現が可能である。
即ち、コンピュータ装置２００は、キーボード、マウス、タッチパネル、スキャナ等により構成され、情報の入力に使用される入力装置１と、種々の出力情報や入力装置１からの入力された情報などを表示出力させる表示装置２と、種々のプログラムを動作させるＣＰＵ（Central Processing Unit；中央処理ユニット）３と、プログラム自身を保持し、またそのプログラムがＣＰＵ３によって実行されるときに一時的に作成される情報等を保持するメモリ４と、本発明の文書検索装置で扱う文書データベース１６０、単語辞書１７０およびプログラムやプログラム実行時の一時的な情報等を保持する記憶装置５と、プログラムやデータ等を記憶した記録媒体を装着してそれらを読み込み、メモリ４または記憶装置５へ格納するのに用いられる媒体駆動装置６と、ネットワーク９へ接続するためのインタフェースであるネットワーク接続装置７とから構成され、それらはバス８で接続されている。
また、ネットワーク９は、コンピュータ装置２００と他のコンピュータ装置２００とを結合するための伝送路であって、一般には、ケーブルで実現され、通信プロトコルにはＴＣＰ／ＩＰが使われる。但し、伝送路としてはケーブルだけではなく、それらの間の通信プロトコルが一致するものであれば無線、有線および放送波のいずれでもよく、例えば、ＬＡＮ（Local Area Network）、ＷＡＮ（Wide Area Network）、インターネット、アナログ電話網、デジタル電話網（ＩＳＤＮ：Integral Service Digital Network）、ＰＨＳ（パーソナルハンディホンシステム）、携帯電話網、衛星通信網などを用いることができる。
このようなコンピュータ装置２００の構成において、図１に示した文書検索装置を構成する各機能をそれぞれプログラム化し、予めＣＤ−ＲＯＭ等の記録媒体に書き込んでおき、このＣＤ−ＲＯＭを各サイトのＣＤ−ＲＯＭドライブのような媒体駆動装置６を搭載したコンピュータ装置に装着して、これらのプログラムをそれぞれのコンピュータ装置のメモリ４あるいは記憶装置５に格納し、それを実行することによって、上記の実施の形態と同様な機能を実現することができる。
【００１７】
尚、記録媒体としては半導体媒体（例えば、ＲＯＭ、ＩＣメモリカード等）、光媒体（例えば、ＤＶＤ、ＭＯ、ＭＤ、ＣＤ−Ｒ等）、磁気媒体（例えば、磁気テープ、フレキシブルディスク等）のいずれであってもよい。
また、コンピュータ装置２００のメモリ４へロードしたプログラムを実行することにより前述した実施の形態の機能が実現されるだけでなく、そのプログラムの指示に基づき、オペレーティングシステム等が実際の処理の一部または全部を行い、その処理によって上述した実施の形態の機能が実現される場合も含まれる。
また、上述した実施の形態を実現するプログラムがＲＯＭ等のような半導体の記録媒体である場合には、媒体駆動装置６からではなく、直接、メモリ４へロードして実行される。
【００１８】
＜本発明のネットワーク環境での運用＞
図５は、本発明を有線または無線の通信ネットワークに接続して運用する形態の構成を示している。
例えば、文書検索プログラムを保持するサーバー３００と複数のユーザが利用する端末３１０とをネットワーク９で接続する。
この場合、サーバー３００およびユーザの端末３１０は、図４に示した汎用のコンピュータ装置２００で構成される。
ユーザは、端末３１０からサーバー３００に対してログインしたり、文書検索のためのキーワードを入力装置を用いて入力し、ネットワーク９を介してサーバー３００の文書検索プログラムへ検索の実行を依頼する。
サーバー３００の文書検索プログラムは、指定されたキーワードに適合した検索結果や途中経過をネットワーク９を介して要求元の端末３１０へ戻す。ユーザの端末３１０は、この検索結果や途中経過を出力装置へ出力する。途中経過の出力の時には、その経過如何によっては、サーバー３００への指示も行う。
このように文書検索プログラムをサーバー３００におくことによって、ユーザは常に最新の文書検索プログラムを使えるという利点がある。
また、図５のようにサーバー３００と端末３１０とを有線または無線の通信ネットワークで接続した場合、サーバー３００の磁気ディスク等の記憶装置に本発明の機能を実現する文書検索プログラムを格納しておき、端末３１０に対してダウンロード等の形式で頒布することも可能である。
さらに、本発明の機能を実現する文書検索プログラムを記録媒体や放送波による配布で提供するようにしてもよい。
【００１９】
【発明の効果】
以上説明したように、本発明によれば、関連度算出から単語の重み計算を省くことにより、関連語選出にかかる時間が減少した。
また、検索結果への影響の小さく、重みが極端に小さい単語を選択しないようにして、無駄な検索時間をかからなくした。
また、汎用性が低いにもかかわらず重みの大きい単語を排除したので、再検索に影響しないため検索の精度が向上した。
以上により、検索に寄与する単語をキーワードの関連語として選出できるので、ユーザの所望している的確な文書を検索することができる。
【図面の簡単な説明】
【図１】実施例の文書検索装置の構成を示すブロック図である。
【図２】文書データベースのデータ構造を説明するための図である。
【図３】実施例の文書検索装置の処理の流れを説明するためのフローチャートである。
【図４】本発明の文書検索装置をコンピュータで実現するときのハードウェアの構成を示す図である。
【図５】本発明をネットワーク環境で運用する場合を説明するための図である。
【符号の説明】
１入力装置
２表示装置
３ＣＰＵ
４メモリ
５記憶装置
６媒体駆動装置
７ネットワーク接続装置
８バス
９ネットワーク
１１０キーワード入力部
１２０文書ランキング部
１３０単語ランキング部
１４０キーワード生成部
１５０文書出力部
１６０文書データベース
１７０単語辞書
２００コンピュータ装置
３００サーバー
３１０端末
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a document search apparatus, a document search method, a program for executing the functions of the document search apparatus, and a computer-readable recording medium on which the program is recorded. More specifically, it relates to a document search apparatus, a document search method, a program for executing the functions of the document search apparatus, and a computer-readable recording medium on which the program is recorded, each of which can retrieve the documents a user wants by selecting documents that conform to a given keyword and then searching again for conforming documents with a keyword to which related words of the keyword, extracted from those conforming documents, have been added.
[0002]
[Prior art]
A known way to find the documents a user needs in a document database that holds a large number of documents is to search once with the keyword entered by the user, select words related to that keyword from among the words appearing in the documents that conformed to it, add them to the original keyword, and search again, thereby obtaining results closer to what the user is looking for.
For example, as a method of selecting related words of a keyword, it has been proposed to calculate, for each word in the conforming documents, its degree of relevance to the keyword using statistical information such as how the word occurs in the conforming documents, and to select a number of top-ranked words with the largest values (Reference 1: Robertson, S.E., "On term selection for query expansion," Journal of Documentation 46, Dec 1990, pp. 359-364).
Next, this conventional related word extraction method will be described.
A weight corresponding to the importance of the word is assigned to each word in the keyword entered by the user. A known formula for this word weight is, for example, Robertson's formula based on a probabilistic model (Formula 1) (Reference 2: Robertson, S.E. and Walker, S., "On relevance weights with little relevance information," SIGIR97, ACM Press, pp. 16-24). In the technique of Reference 2, the weight of each word in the keyword is assigned according to Wp and Wq, which reflect how the word occurs across all the documents to be searched.
W (weight) = Wp - Wq ......... (Formula 1)
where
Wp = k4 + log (N / (N - n))
Wq = log (n / (N - n))
N: Total number of documents to be searched
n: Number of documents in which the word appears
k4: Adjustment parameter
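As an illustration, the initial weight can be sketched in Python as follows. The function name, the use of the natural logarithm, and the default value of k4 are assumptions made for this sketch; the text does not specify them. The weight is taken as the difference Wp - Wq, which is the form that the feedback formula (Formula 3) reduces to when there is no feedback.

```python
import math


def initial_weight(N: int, n: int, k4: float = 0.0) -> float:
    """Word weight from collection statistics only (W = Wp - Wq).

    N: total number of documents to be searched
    n: number of documents in which the word appears
    k4: adjustment parameter (default chosen for illustration only)
    """
    if not 0 < n < N:
        # Both logarithms need 0 < n < N to be defined.
        raise ValueError("need 0 < n < N")
    wp = k4 + math.log(N / (N - n))
    wq = math.log(n / (N - n))
    return wp - wq
```

With k4 = 0 the expression simplifies to log(N / n), so words that appear in fewer documents receive larger weights.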
[0003]
Next, the document suitability of each document is calculated from the weights of the words in the keyword. The document suitability is obtained, for example, by the formula of Reference 2 (Formula 2).
F (suitability) = Σ (W × tf / (k1 + tf)) ......... (Formula 2)
where
W: Weight of the word obtained by (Formula 1)
tf: Number of occurrences of the word in the document
k1: Adjustment parameter
The document suitability of each document is obtained, the documents are ordered in descending order of suitability, the top several documents are regarded as conforming documents, and the bottom several are regarded as non-conforming documents.
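A minimal sketch of the suitability calculation of Formula 2 is shown below. The function and parameter names and the default value of k1 are hypothetical; only the shape of the sum comes from the text.

```python
def document_suitability(
    word_weights: dict[str, float],
    term_counts: dict[str, int],
    k1: float = 1.2,
) -> float:
    """F = sum over keyword words of W * tf / (k1 + tf).

    word_weights: weight W of each word in the keyword
    term_counts: occurrences (tf) of each word in one document
    k1: adjustment parameter, assumed positive (default is illustrative)
    """
    total = 0.0
    for word, weight in word_weights.items():
        tf = term_counts.get(word, 0)  # a missing word contributes nothing
        total += weight * tf / (k1 + tf)
    return total
```

Because tf / (k1 + tf) approaches 1 as tf grows, repeated occurrences of a word add less and less to the score.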
After the conforming documents are selected, the weight of every word in the conforming documents, excluding stop words (for example, the article "a"), is recalculated to reflect how the word occurs in the conforming and non-conforming documents, that is, the feedback information.
The weight after selection of the conforming documents is assigned, for example, using the formula of Reference 2 (Formula 3), by adding together Wp and Wq, which reflect how the word occurs across all the documents to be searched (see the notes to (Formula 1)), and Wr and Ws, which reflect how it occurs in the conforming and non-conforming documents, in the ratios Cp and Cq.
W' (weight) = (Cp · Wp + (1 - Cp) · Wr) - (Cq · Wq + (1 - Cq) · Ws) ......... (Formula 3)
here
Wr = log ((r + 0.5) / (R-r + 0.5))
Ws = log ((s + 0.5) / (S-s + 0.5))
Cp = k5 / (k5 + √R)
Cq = k6 / (k6 + √S)
R: Number of conforming documents
r: Number of documents in which words appear in the conforming document set
S: Number of nonconforming documents
s: Number of documents in which words appear in the non-conforming document set
k5, k6: Adjustment parameters
Further, from this weight and the feedback information, the degree of relevance to the keyword is obtained for each word in the conforming documents, excluding stop words.
One method of calculating the degree of relevance is, for example, Boughanem's formula (Formula 4) (Reference 3: Walker, S. et al., "Okapi at TREC-6: Automated adhoc, VLC, routing, filtering and QSDR," The Sixth Text REtrieval Conference (TREC-6), 1996, NIST).
Relevance = (r / R - α · s / S) × W' ......... (Formula 4)
where
α: Adjustment parameter
In this way, the degree of relevance to the keyword is obtained for each word in the conforming documents, words are selected as related words of the keyword in descending order of relevance, and they are added to the entered keyword to create a new keyword.
Conforming documents are then selected again using this new keyword. At this time, the weights obtained by (Formula 3) are used to calculate the document suitability.
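The conventional calculation described above, the feedback weight of Formula 3 followed by the relevance of Formula 4, might be sketched as follows. The function names, the natural logarithm, and the default parameter values are assumptions for illustration and are not taken from the cited references.

```python
import math


def feedback_weight(N, n, R, r, S, s, k4=0.0, k5=1.0, k6=1.0):
    """W' = (Cp*Wp + (1-Cp)*Wr) - (Cq*Wq + (1-Cq)*Ws)."""
    wp = k4 + math.log(N / (N - n))
    wq = math.log(n / (N - n))
    wr = math.log((r + 0.5) / (R - r + 0.5))
    ws = math.log((s + 0.5) / (S - s + 0.5))
    cp = k5 / (k5 + math.sqrt(R))  # share given to collection statistics
    cq = k6 / (k6 + math.sqrt(S))
    return (cp * wp + (1 - cp) * wr) - (cq * wq + (1 - cq) * ws)


def conventional_relevance(r, R, s, S, w_prime, alpha=1.0):
    """Relevance = (r/R - alpha * s/S) * W'; requires R > 0 and S > 0."""
    return (r / R - alpha * s / S) * w_prime
```

Note that the weight depends on n, the number of documents in the whole collection that contain the word, so this relevance cannot be computed from the conforming and non-conforming documents alone. With R = S = 0 the feedback terms vanish and the weight depends only on the collection statistics.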
[0004]
[Problems to be solved by the invention]
However, in the conventional technique described above, the degree of relevance to the keyword was obtained only after calculating both the number of times the word appears in the conforming and non-conforming documents and the weight of the word. Calculating the weight of a word requires looking up the number of documents, among all the documents to be searched, in which the word appears, and this took processing time.
The conventional technique also had a problem in terms of search accuracy: for some words, the weight used in the re-search became larger than necessary and distorted the search.
In particular, documents whose vocabulary is not controlled, such as documents on the Internet, tend to contain idiosyncratic words used only by the author of the document, or misspellings. In the conventional technique, such words can receive extremely large weights, which degrades the accuracy of the re-search.
In order to solve the above-described problem, the present invention selects a word that contributes to a search as a related word of a keyword, and searches for the exact document desired by the user by re-searching with the related word. An object of the present invention is to provide a document search device, a document search method, a program for executing the functions of the document search device, and a computer-readable recording medium storing the program.
It is another object of the present invention to reduce the time required for selecting related words by eliminating the burden of calculating word weights when calculating the degree of association.
A further object is to avoid wasting search time by not selecting, as related words, words that have an extremely small weight and therefore little influence on the search result.
A further object is to keep words that have a large weight despite their low generality from being selected as related words, so that they do not affect the accuracy of the re-search.
[0005]
[Means for Solving the Problems]
The invention of claim 1 is a document search apparatus that searches a document database holding a plurality of documents for documents conforming to an input keyword, the apparatus comprising: a document ranking unit that selects, from the document database, documents that conform to the keyword and documents that do not; a word ranking unit that calculates the degree of relevance to the keyword of each word appearing in the conforming documents selected by the document ranking unit as Σ(rtf/(K+rtf))/R - β×Σ(stf/(K+stf))/S, where R is the number of conforming documents, S is the number of non-conforming documents, rtf is the number of occurrences in each of the conforming documents, stf is the number of occurrences in each of the non-conforming documents, and K and β are adjustment parameters, and that selects words with a high degree of relevance as related words of the keyword; and a keyword generation unit that adds the related words selected by the word ranking unit to the original keyword to form a new keyword, wherein the document ranking unit searches again for documents conforming to the new keyword generated by the keyword generation unit.
The invention of claim 2 is the document search apparatus according to claim 1, further comprising a word dictionary, wherein, after selecting the words with a high degree of relevance, the word ranking unit excludes a word from the related words when the weight of the word is large and the word is not registered in the word dictionary.
The invention of claim 3 is a document search method performed by a document search apparatus that searches a document database holding a plurality of documents for documents conforming to an input keyword, the method comprising: selecting, from the document database, documents that conform to the keyword and documents that do not; calculating the degree of relevance to the keyword of each word appearing in the selected conforming documents as Σ(rtf/(K+rtf))/R - β×Σ(stf/(K+stf))/S, where R is the number of conforming documents, S is the number of non-conforming documents, rtf is the number of occurrences in each of the conforming documents, stf is the number of occurrences in each of the non-conforming documents, and K and β are adjustment parameters; selecting words with a high degree of relevance as related words of the keyword and adding them to the keyword to form a new keyword; and searching the document database again for documents conforming to the generated new keyword.
The invention of claim 4 is the document search method by the document search apparatus according to claim 3, wherein a word dictionary is provided and, after the words with a high degree of relevance are selected, a word is excluded from the related words when the weight of the word is large and the word is not registered in the word dictionary.
The invention of claim 5 is a program that causes the computer of a document search apparatus, which searches a document database holding a plurality of documents for documents conforming to an input keyword, to function as: a document ranking unit that selects, from the document database, documents that conform to the keyword and documents that do not; a word ranking unit that calculates the degree of relevance to the keyword of each word appearing in the conforming documents selected by the document ranking unit as Σ(rtf/(K+rtf))/R - β×Σ(stf/(K+stf))/S, where R is the number of conforming documents, S is the number of non-conforming documents, rtf is the number of occurrences in each of the conforming documents, stf is the number of occurrences in each of the non-conforming documents, and K and β are adjustment parameters, and that selects words with a high degree of relevance as related words of the keyword; and a keyword generation unit that adds the related words selected by the word ranking unit to the original keyword to form a new keyword, wherein the document ranking unit searches again for documents conforming to the new keyword generated by the keyword generation unit.
The invention of claim 6 is a computer-readable recording medium on which the program according to claim 5 is recorded.
[0008]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, the configuration and operation of the embodiment of the present invention will be described in detail with reference to the drawings.
(1) Configuration of Embodiment
FIG. 1 is a block diagram showing the configuration of a document search apparatus according to an embodiment of the present invention.
The document search apparatus according to this embodiment includes a keyword input unit 110, a document ranking unit 120, a word ranking unit 130, a keyword generation unit 140, a document output unit 150, and a document database 160.
The keyword input unit 110 allows a user to input a character string that serves as a keyword representing the characteristics of a document in the document database 160 using a keyboard or the like.
This input character string is morphologically analyzed using the word dictionary 170 and decomposed into words as necessary.
This word dictionary 170 is composed of at least the notation of each word, the part of speech, and the like.
Alternatively, the input character string may be divided into n-grams and used as words without using the word dictionary 170.
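For the dictionary-free alternative, splitting a character string into n-grams could look like the sketch below. The text does not say whether the n-grams overlap or what n is; overlapping character bigrams are assumed here, and the function name is hypothetical.

```python
def char_ngrams(text: str, n: int = 2) -> list[str]:
    """Split a string into overlapping character n-grams."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A string shorter than n yields no n-grams.
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

For example, the four-character string 文書検索 splits into the three bigrams 文書, 書検, and 検索, each of which is then treated as a word.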
[0009]
The document ranking unit 120 searches the document database 160 for the keyword passed from the keyword input unit 110 and selects a document that matches and a document that does not match. The selected conforming document is transferred to the document output unit 150.
In addition, the document ranking unit 120 selects a document that matches the new keyword generated by the keyword generation unit 140 once again.
The document database 160 includes document information that holds a document to be searched, and word statistical information of each word included in the document (see FIG. 2).
For example, in the document information, the following information is held for each document.
Document identifier (ID), document name, bibliographic items (creator, creation date, publisher, etc.), pointer to the document body, etc.
The word statistical information holds the following statistics for each word.
Notation of the word, frequency of the word in the entire document database, word occurrence information, etc.
Here, the word occurrence information holds the following for each document in which the word appears.
Document identifier of the document in which the word appears, frequency of the word in that document, list of the positions at which the word appears in that document, etc.
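One way to picture the document information and word statistical information is as the following Python data classes. All class and field names are hypothetical, and the layout is only an assumption about how the listed items could be grouped, not the structure shown in FIG. 2.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentInfo:
    doc_id: str                    # document identifier (ID)
    name: str                      # document name
    bibliographic: dict[str, str]  # creator, creation date, publisher, ...
    body_pointer: str              # pointer to the document body


@dataclass
class WordOccurrence:
    doc_id: str                    # document in which the word appears
    frequency: int = 0             # occurrences of the word in that document
    positions: list[int] = field(default_factory=list)


@dataclass
class WordStats:
    notation: str                  # written form of the word
    total_frequency: int = 0       # frequency in the entire database
    occurrences: dict[str, WordOccurrence] = field(default_factory=dict)

    def add_occurrence(self, doc_id: str, position: int) -> None:
        """Record one occurrence of the word at a position in a document."""
        entry = self.occurrences.setdefault(doc_id, WordOccurrence(doc_id))
        entry.frequency += 1
        entry.positions.append(position)
        self.total_frequency += 1

    @property
    def document_count(self) -> int:
        """Number of documents containing the word (n in Formula 1)."""
        return len(self.occurrences)
```

Keeping per-document frequencies alongside each word means the occurrence counts needed for ranking can be read directly from the statistics without rescanning document bodies.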
[0010]
The word ranking unit 130 retrieves the documents stored in the document database 160 using the document identifiers of the conforming documents selected by the document ranking unit 120, splits them by morphological analysis or into n-grams to extract words, deletes any extracted word that is registered in a stop word table prepared in advance, and treats the remaining words as related word candidates. The value calculated by the following (Formula 5) is used as the degree of relevance between the input keyword and each related word candidate.
Relevance = Σ_i (rtf_i / (K + rtf_i)) / R - β × Σ_j (stf_j / (K + stf_j)) / S ......... (Formula 5)
where
R: Number of conforming documents
S: Number of non-conforming documents
rtf_i: Number of occurrences of the word in conforming document i
stf_j: Number of occurrences of the word in non-conforming document j
K, β: Adjustment parameters
The first term on the right side of (Formula 5) is a sum over the conforming documents, and the second term is a sum over the non-conforming documents.
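A sketch of the relevance of Formula 5 is given below. It reads each summand as tf / (K + tf), by analogy with Formula 2; that reading, the function name, and the default values of K and β are assumptions for illustration.

```python
def relevance(
    rtf: list[int],
    stf: list[int],
    K: float = 1.0,
    beta: float = 1.0,
) -> float:
    """Relevance of one candidate word to the keyword.

    rtf: occurrences of the word in each conforming document
    stf: occurrences of the word in each non-conforming document
    Each list has one entry per document, including zeros, so that
    len(rtf) == R and len(stf) == S. K is assumed positive.
    """
    R, S = len(rtf), len(stf)
    conforming = sum(tf / (K + tf) for tf in rtf) / R if R else 0.0
    non_conforming = sum(tf / (K + tf) for tf in stf) / S if S else 0.0
    return conforming - beta * non_conforming
```

Only per-document occurrence counts within the conforming and non-conforming sets enter the calculation; no collection-wide document count and no word weight is needed at this stage.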
From the extracted words, a predetermined number (for example, about 10) of the words with the highest degree of relevance are selected as related word candidates.
From these related word candidates, words that have little influence on the search result and are not worth adding are excluded, and the rest are used as related words.
For example, a lower limit on the weight of a word to be adopted as a related word is set in advance, and words whose weight is below this lower limit are deleted from the related word candidates.
In addition, words of low generality are excluded from the related word candidates, and the rest are used as related words. For example, an upper limit on the weight of a word to be adopted as a related word is set in advance, and any word whose weight exceeds this upper limit and that is not registered in the word dictionary 170 is deleted from the related word candidates.
[0011]
Here, the word weight referred to above is calculated by (Formula 3).
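The two exclusion rules, a lower limit on the weight and an upper limit that applies only to words missing from the word dictionary, might be sketched as follows. The function name, argument names, and threshold values in the usage are hypothetical; the weights are assumed to have been computed beforehand for the candidates only.

```python
def filter_candidates(
    candidates: list[str],
    weights: dict[str, float],
    dictionary: set[str],
    lower: float,
    upper: float,
) -> list[str]:
    """Drop candidates that are too light, or heavy and not in the dictionary."""
    related = []
    for word in candidates:
        weight = weights[word]
        if weight < lower:
            continue  # too little influence on the search result
        if weight > upper and word not in dictionary:
            continue  # heavy but unregistered: likely idiosyncratic or misspelled
        related.append(word)
    return related
```

A heavy word that is registered in the dictionary is kept, so the upper limit removes only words that are both rare and unrecognized.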
The related words determined in this way are passed to the keyword generation unit 140.
The keyword generation unit 140 generates a new keyword by adding the related word passed from the word ranking unit 130 to the original keyword, and passes it to the document ranking unit 120.
The document output unit 150 outputs the conforming documents selected by the document ranking unit 120 to a printer, a display device, a storage device, or the like, or transmits them to another computer device via a network.
[0012]
(2) Operation of Embodiment
Next, the operation of the document search apparatus of this embodiment, configured as described above, will be described with reference to the flowchart of FIG. 3.
First, a keyword composed of, for example, an English or Japanese word or a combination of words is input as a character string from an input device such as a keyboard and, if necessary, is morphologically analyzed with the word dictionary 170 and decomposed into words (step S100).
Alternatively, the input character string may be divided into n-grams and used as words without using the word dictionary 170.
This constitutes the keyword input unit 110.
For each word in the input keyword, the word statistical information in the document database 160 is consulted and a weight corresponding to the importance of the word is calculated using, for example, (Formula 1) above (step S110).
Next, for each document in the document database 160 to be searched, the word statistical information in the document database 160 and the keyword word weights calculated in step S110 are consulted, a suitability indicating how much of the keyword's words the document contains is calculated using, for example, (Formula 2) above, and a document list is created (step S120).
The documents in this list are sorted in descending order of suitability; a predetermined number of documents from the top (for example, about 10) are regarded as conforming documents, and a predetermined number from the bottom (for example, about 500) are regarded as non-conforming documents (step S130).
Alternatively, the ordered list of documents (a list of suitability values, document names, bibliographic items, and so on) may be presented to the user, who indicates whether each document conforms; the documents indicated as conforming are then treated as conforming documents and those indicated as not conforming as non-conforming documents.
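The automatic split of step S130 can be illustrated with the short sketch below. The function name is hypothetical, and keeping the two sets disjoint when the collection is small is an assumption of this sketch; the text only gives the example sizes of about 10 and about 500.

```python
def split_by_rank(
    scores: dict[str, float],
    top: int = 10,
    bottom: int = 500,
) -> tuple[list[str], list[str]]:
    """Return (conforming, non_conforming) document IDs by suitability rank."""
    ranked = sorted(scores, key=scores.get, reverse=True)  # best first
    conforming = ranked[:top]
    rest = ranked[top:]  # never reuse a conforming document
    non_conforming = rest[-bottom:] if bottom > 0 else []
    return conforming, non_conforming
```

Called with the suitability of every document in the database, it returns the two ID lists that feed the related-word selection.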
[0013]
The document ranking unit 120 is configured by steps S110 to S130.
The user is instructed whether or not the conforming document selected in step S130 is a document desired by the user (step S140).
If it is not the desired document, the process proceeds to step S150. If it is a desired document, the process proceeds to step S170.
The conforming documents selected in step S130 are presented to the user by outputting them to an output device such as a display device, a printer, or a storage device, for example as a list of document names and bibliographic items in rank order, or by transmitting them to another computer device connected via a network (step S170).
This constitutes the document output unit 150.
The documents stored in the document database 160 are retrieved using the document identifiers of the conforming documents selected in step S130, words are extracted by splitting each document by morphological analysis or into n-grams, any extracted word that is registered in a stop word table prepared in advance is deleted, and the remaining words are treated as related word candidates.
For each extracted related word candidate, the value calculated by (Formula 5) is used as its degree of relevance to the input keyword.
A predetermined number (for example, about 10 words) of related word candidates are then selected in descending order of relevance.
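Removing stop words and keeping the highest-scoring candidates could be sketched as below. The function name and the idea of passing the relevance values in as a precomputed dictionary are assumptions made for illustration.

```python
def top_related_candidates(
    relevance_by_word: dict[str, float],
    stop_words: set[str],
    limit: int = 10,
) -> list[str]:
    """Return up to `limit` non-stop-words in descending order of relevance."""
    scored = [
        (word, score)
        for word, score in relevance_by_word.items()
        if word not in stop_words
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in scored[:limit]]
```

Because only this short list proceeds to the weight-based exclusion, a word weight has to be calculated for at most `limit` words rather than for every word in the conforming documents.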
[0014]
From the selected related word candidates, words that have little influence on the search result and are not worth adding are excluded, and the remaining words are set as related words.
For example, a lower limit on the weight of a word adopted as a related word is determined in advance, and a word whose weight is less than the lower limit is deleted from the related word candidates.
Further, words with low versatility are excluded from the selected related word candidates, and the remaining words are used as related words. For example, an upper limit on the weight of a word adopted as a related word is determined in advance, and a word whose weight exceeds the upper limit is deleted from the related word candidates if it is not registered in the word dictionary 170.
The related word candidates that remain after this selection and deletion are extracted as the related words of the keyword (step S150).
This constitutes the word ranking unit 130.
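The two exclusion rules above can be sketched as a single filtering pass. This is a minimal Python sketch under stated assumptions: the word weights are supplied by a separate weight calculation, the word dictionary 170 is modelled as a set of strings, and the function name, parameter names, and limit values are hypothetical.

```python
def filter_related_words(candidates, weights, word_dictionary,
                         lower_limit, upper_limit):
    """Keep only the candidates that pass both exclusion rules.

    candidates      -- candidate words, already ordered by relevance
    weights         -- mapping from word to its weight (computed elsewhere)
    word_dictionary -- set of registered words (stands in for dictionary 170)
    """
    related = []
    for word in candidates:
        weight = weights[word]
        # Rule 1: a weight below the lower limit has too little influence
        # on the search result to be worth adding.
        if weight < lower_limit:
            continue
        # Rule 2: a weight above the upper limit is tolerated only for
        # words registered in the word dictionary (low-versatility check).
        if weight > upper_limit and word not in word_dictionary:
            continue
        related.append(word)
    return related
```

Because the weights are looked up only for the small number of candidates already selected by relevance, the weight calculation does not have to be run for every word in the relevant documents.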
A new keyword is created by adding the related words extracted in step S150 to the original keyword (step S160).
This constitutes the keyword generation unit 140.
Using this new keyword, relevant documents are selected again in the same manner as in the processing from step S110 to step S130 (document ranking unit 120).
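The overall flow of ranking, checking, related word extraction, keyword generation, and re-search can be sketched as a loop. This is a minimal Python sketch, not the patented implementation: the three callables passed in (`rank_documents`, `extract_related_words`, `is_desired`) are hypothetical stand-ins for the document ranking unit 120, the word ranking unit 130, and the user's judgment of the results, and the `max_rounds` limit is an added assumption to guarantee termination.

```python
def expand_keyword(keyword_terms, related_words):
    """Step S160: append related words to the original keyword, skipping duplicates."""
    seen = set(keyword_terms)
    expanded = list(keyword_terms)
    for word in related_words:
        if word not in seen:
            expanded.append(word)
            seen.add(word)
    return expanded


def search_with_feedback(keyword_terms, rank_documents, extract_related_words,
                         is_desired, max_rounds=3):
    """Repeat search and keyword expansion until the results are accepted.

    rank_documents(terms)          -> (relevant_docs, nonrelevant_docs)
    extract_related_words(rel, non) -> list of related words (step S150)
    is_desired(relevant_docs)      -> True if these are the desired documents
    """
    terms = list(keyword_terms)
    for _ in range(max_rounds):
        relevant, nonrelevant = rank_documents(terms)   # steps S110 to S130
        if is_desired(relevant):
            break                                       # proceed to output (S170)
        related = extract_related_words(relevant, nonrelevant)  # step S150
        terms = expand_keyword(terms, related)          # step S160
    else:
        # Round limit reached without acceptance: search once more with
        # the final keyword so the returned documents match the returned terms.
        relevant, _ = rank_documents(terms)
    return terms, relevant
```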
[0015]
By configuring the document search apparatus according to the present embodiment as described above, the following effects are obtained. Because words that contribute to the search can be selected as related words of the keyword, the documents the user wants can be retrieved accurately.
-By omitting the word weight calculation from the relevance calculation, the time taken to select related words is reduced.
-By not selecting words whose weights are extremely small and which have little influence on the search result, wasted search time is eliminated.
-By excluding words that have a large weight despite low versatility, such words no longer affect the re-search, and the search accuracy is improved.
[0016]
<Implementation by Computer>
Furthermore, the present invention is not limited only to the above-described embodiment. For example, the document search apparatus shown in FIG. 1 can also be realized by a computer apparatus 200 having a hardware configuration as shown in FIG.
That is, the computer apparatus 200 comprises: an input device 1, such as a keyboard, a mouse, a touch panel, or a scanner, used for inputting information; a display device 2 that displays various output information, information input from the input device 1, and the like; a CPU (Central Processing Unit) 3 that runs various programs; a memory 4 that holds the program itself and information temporarily created while the program is executed by the CPU 3; a storage device 5 that holds the document database 160 and the word dictionary 170 handled by the document search apparatus of the present invention, programs, and temporary information at the time of program execution; a medium drive device 6 in which a recording medium storing programs, data, and the like is mounted and which is used to read them and store them in the memory 4 or the storage device 5; and a network connection device 7 serving as an interface for connecting to the network 9. These components are connected by a bus 8.
The network 9 is a transmission path connecting the computer apparatus 200 to other computer apparatuses 200. It is generally realized by a cable, and TCP/IP is used as the communication protocol. However, the transmission path is not limited to a cable and may be wireless, wired, or broadcast waves, as long as both ends use the same communication protocol. For example, a LAN (Local Area Network), a WAN (Wide Area Network), the Internet, an analog telephone network, a digital telephone network (ISDN: Integrated Services Digital Network), PHS (Personal Handyphone System), a mobile phone network, a satellite communication network, and the like can be used.
With the computer apparatus 200 configured in this way, each function constituting the document search apparatus shown in FIG. 1 is written as a program and recorded in advance on a recording medium such as a CD-ROM. The CD-ROM is mounted in a computer equipped with a medium drive device 6 such as a CD-ROM drive, and the programs are stored in the memory 4 or the storage device 5 of each computer apparatus and executed, whereby functions similar to those of the above-described embodiment can be realized.
[0017]
The recording medium may be any of a semiconductor medium (for example, ROM, IC memory card, etc.), an optical medium (for example, DVD, MO, MD, CD-R, etc.), or a magnetic medium (for example, magnetic tape, flexible disk, etc.).
Further, the functions of the above-described embodiment are realized not only when the program loaded into the memory 4 of the computer apparatus 200 is executed. The case where an operating system or the like performs part or all of the actual processing based on instructions of the program, and the functions of the above-described embodiment are realized by that processing, is also included.
When the program for realizing the above-described embodiment is recorded on a semiconductor recording medium such as a ROM, the program is loaded directly into the memory 4 and executed, rather than being read through the medium drive device 6.
[0018]
<Operation in Network Environment of the Present Invention>
FIG. 5 shows a configuration of an embodiment in which the present invention is operated by connecting to a wired or wireless communication network.
For example, a server 300 holding a document search program and a terminal 310 used by a plurality of users are connected via the network 9.
In this case, the server 300 and the user's terminal 310 are constituted by the general-purpose computer apparatus 200 shown in FIG.
A user logs in to the server 300 from the terminal 310, inputs a keyword for document search using the input device, and requests the document search program of the server 300 to execute the search via the network 9.
The document search program of the server 300 returns the search results matching the designated keyword, together with intermediate progress, to the requesting terminal 310 via the network 9. The user's terminal 310 outputs the search results and the intermediate progress to the output device. When intermediate progress is output, the user gives an instruction to the server 300 depending on that progress.
By placing the document search program in the server 300 in this way, there is an advantage that the user can always use the latest document search program.
When the server 300 and the terminal 310 are connected via a wired or wireless communication network as shown in FIG. 5, the document search program for realizing the functions of the present invention, stored in a storage device such as a magnetic disk of the server 300, can also be distributed to the terminal 310 by download or the like.
Furthermore, a document search program for realizing the functions of the present invention may be provided by distribution through a recording medium or broadcast waves.
[0019]
【Effect of the invention】
As described above, according to the present invention, omitting the word weight calculation from the relevance calculation reduces the time required to select related words.
In addition, because words whose weights are extremely small and which have little influence on the search result are not selected, no search time is wasted.
In addition, because words that have a large weight despite low versatility are excluded, they do not affect the re-search, and the search accuracy is improved.
As described above, since words that contribute to the search can be selected as related words of the keyword, the accurate documents desired by the user can be retrieved.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a configuration of a document search apparatus according to an embodiment.
FIG. 2 is a diagram for explaining a data structure of a document database.
FIG. 3 is a flowchart for explaining a processing flow of the document search apparatus according to the embodiment.
FIG. 4 is a diagram illustrating a hardware configuration when the document search apparatus of the present invention is realized by a computer.
FIG. 5 is a diagram for explaining a case where the present invention is operated in a network environment.
[Explanation of symbols]
1 input device
2 display device
3 CPU
4 memory
5 storage device
6 medium drive device
7 network connection device
8 bus
9 network
110 keyword input unit
120 document ranking unit
130 word ranking unit
140 keyword generation unit
150 document output unit
160 document database
170 word dictionary
200 computer apparatus
300 server
310 terminal

Claims

1. A document search apparatus that searches a document database holding a plurality of documents for documents that match an input keyword, comprising: a document ranking unit that selects, from the document database, relevant documents that match the keyword and non-relevant documents that do not match; a word ranking unit that calculates the relevance to the keyword of each word appearing in the relevant documents selected by the document ranking unit, where R is the number of relevant documents, S is the number of non-relevant documents, rtf is the number of occurrences of the word in each relevant document, stf is the number of occurrences of the word in each non-relevant document, and K and β are adjustment parameters, by
Σ(rtf/(K+rtf))/R - β×Σ(stf/(K+stf))/S
and that selects words having high relevance as related words of the keyword; and a keyword generation unit that adds the related words selected by the word ranking unit to the original keyword to form a new keyword, wherein the document ranking unit searches again for documents that match the new keyword generated by the keyword generation unit.

2. The document search apparatus according to claim 1, further comprising a word dictionary, wherein the word ranking unit, after selecting the words having high relevance, excludes a word from the related words if the weight of the word is large and the word is not registered in the word dictionary.

3. A document search method performed by a document search apparatus that searches a document database holding a plurality of documents for documents that match an input keyword, the method comprising: selecting, from the document database, relevant documents that match the keyword and non-relevant documents that do not match; calculating the relevance to the keyword of each word appearing in the selected relevant documents, where R is the number of relevant documents, S is the number of non-relevant documents, rtf is the number of occurrences of the word in each relevant document, stf is the number of occurrences of the word in each non-relevant document, and K and β are adjustment parameters, by
Σ(rtf/(K+rtf))/R - β×Σ(stf/(K+stf))/S
selecting words having high relevance as related words of the keyword; adding the related words to the keyword to form a new keyword; and searching the document database again for documents that match the generated new keyword.

4. The document search method performed by the document search apparatus according to claim 3, wherein a word dictionary is provided, and after the words having high relevance are selected, a word is excluded from the related words if the weight of the word is large and the word is not registered in the word dictionary.

5. A program for causing a computer of a document search apparatus, which searches a document database holding a plurality of documents for documents that match an input keyword, to function as:
a document ranking unit that selects, from the document database, relevant documents that match the keyword and non-relevant documents that do not match;
a word ranking unit that calculates the relevance to the keyword of each word appearing in the relevant documents selected by the document ranking unit, where R is the number of relevant documents, S is the number of non-relevant documents, rtf is the number of occurrences of the word in each relevant document, stf is the number of occurrences of the word in each non-relevant document, and K and β are adjustment parameters, by
Σ(rtf/(K+rtf))/R - β×Σ(stf/(K+stf))/S
and that selects words having high relevance as related words of the keyword; and
a keyword generation unit that adds the related words selected by the word ranking unit to the original keyword to form a new keyword,
wherein the document ranking unit searches again for documents that match the new keyword generated by the keyword generation unit.

6. A computer-readable recording medium on which the program according to claim 5 is recorded.