JP4179858B2

JP4179858B2 - Document search apparatus, document search method, program, and recording medium

Info

Publication number: JP4179858B2
Application number: JP2002345970A
Authority: JP
Inventors: 秀夫伊東
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2002-11-28
Filing date: 2002-11-28
Publication date: 2008-11-12
Anticipated expiration: 2022-11-28
Also published as: JP2004178421A

Description

【０００１】
【発明の属する技術分野】
本発明は、文書検索装置、文書検索方法、プログラムおよび記録媒体に関し、具体的には、ユーザに指定された適合文書を用いて適合性フィードバックを行うときに検索要求を拡張する技術に関する。
【０００２】
【従来の技術】
近年では、作成される文書または参照可能な文書は、今後ともますます増大していくことが見込まれる。このような膨大な文書群の中からユーザの所望する適切な文書を探し出すことが困難な状態となっている。
このため大量の文書群から適切な文書を効率よく、しかも早く取り出すための技術として文書検索技術が広く研究されている。
【０００３】
この検索技術の１つとして、検索要求に対し文書群中の各文書がその検索要求を満たす度合い（以下、適合度という）を求め、適合度が大きい順に文書をランキングして出力する文書ランキング検索システムが提案されている（例えば、特許文献１、非特許文献１参照）。ここで検索要求は、自然言語文や単語や複合語等の語句で表現される場合が多く、また適合度は、文書中で検索要求中に含まれる語句が多く出現するほど大きな値とする等で与えられる。
【０００４】
実際には、検索結果のうち上位にランクされた文書群がユーザの指定した要求を満たす文書（以下、適合文書という）というわけではない。
このため、システム自身による検索結果の分析またはユーザによる検索結果の評価を反映させて、検索結果にフィードバックをかけながら検索を繰り返し、徐々に検索結果をユーザの求めるものに近づけていく（適合性フィードバック）システムが開発されている。
【０００５】
その多くは、ユーザによって検索結果の文書に評価を与えて検索語の重要度を示す重みを操作したり、適合文書から新たな検索語（以下、関連語という）を抽出し、それらを元の検索要求に加えて、再度文書の検索を試みるという手法を用いている。この手法として、例えば、特許文献２や非特許文献２の適合性フィードバックおよび特許文献３のレリバンスフィードバックが提案されている。
【０００６】
また、検索結果の上位文書の中から適合文書を指定するのではなく、予め用意した適合文書そのものをシステムに与え、上記同様に利用することも適合性フィードバックの一種と見なせる。
一方、検索結果の上位文書群（例えば、上位１〜１０の文書群）を適合文書と見なし、上記同様に利用する手法は擬似適合性フィードバックと呼ばれている。しかし、上位文書群の多くが実際には非適合文書で占められていた場合には、不適切な検索語が追加されることになり、再度検索した場合にはさらに不適切な検索結果を増やすことになり、逆効果になってしまう場合が多い。
【０００７】
【特許文献１】
特開平１１−２２４２６４号公報
【特許文献２】
特開２０００−２４２６４６号公報
【特許文献３】
特開平０９−１５３０５１号公報
【非特許文献１】
K.Sparck Jones, S.Walker, and S.E.Robertson,
”A probabilistic model of information retrieval:
Development and status”, TR446, Cambridge
University Computer Laboratory, September 1998.
http://citeseer.nj.nec.com/jones98probabilistic.html
【非特許文献２】
Chris Buckly, Gerard Salton and James Allen,
”The Effect of Adding Relevance Information in a
Relevance Feedback Environment”,
In Proceedings of SIGIR’94, 1994, pp.292-300
【０００８】
【発明が解決しようとする課題】
さて、上述のような適合性フィードバックを行う場合、ユーザは検索結果の文書の内容をいちいち表示させて内容を確かめるという作業をしなければならないため、ユーザに大きな負荷をかけることになる。したがって、ユーザは検索結果から１つ乃至少量の適合文書を与えてくれるのが実情であろう。
【０００９】
また、適合文書から選択した関連語を新たに検索要求に追加してユーザの所望する適切な文書を検索する適合性フィードバックでは、以下の手順で適合文書から関連語候補を抽出する場合が多い。
【００１０】
（１）単語分割などにより適合文書から語句の集合を求める。
（２）各語句に対して関連語としての望ましさ（以下、関連度という）を計算する。
（３）関連度が大きい順に関連語候補として提示する。
（４）この関連語候補の中から関連語をユーザが選択するか、または、システムが自動的に選択する。
【００１１】
ここで、上述のように抽出された関連語候補をユーザに提示したとしても、提示された関連語の中から有効なものを見分けることは困難であるので、多くのシステムでは自動的に選択するようにしている。
また、抽出したすべての語句を関連語としないのは、関連語が多すぎて検索効率あるいは精度の低下につながる場合が多いので、抽出された語句のうち一部の語句を関連語として選択する必要があるためである。
【００１２】
この各語句に対して与えられる関連度は、以下の要因を基に定義される場合が多い。
（Ａ）適合文書内に何回出現したかを表す適合文書内頻度、
（Ｂ）いくつの適合文書に出現したかを表す局所的文書頻度（Ｌ）、
（Ｃ）いくつの検索対象文書に出現したかを表す大局的文書頻度（Ｇ）、
（Ｄ）適合文書の数（Ｒ）、
（Ｅ）検索対象文書の数（Ｎ）。
特に、多くの適合文書に共通に用いられる語句が適切な（検索精度を向上させる）関連語である場合が多いと考えられるため、関連語を適切に選択するためには要因（Ｂ）が不可欠なものである。
【００１３】
例えば、従来、語句ｔの関連度ＴＳＶ（ｔ）として次の式１が提案されている。
TSV(t)=w(t)×(L(t)/R-G(t)/N) ・・・式１
ここで、ｗ（ｔ）は、語句ｔが出現する文書に対して与えるスコアであり、このスコアが大きい順に文書が順序付けられる。
上記の式１で計算される関連度ＴＳＶ（ｔ）は、適合文書に与えられるスコアの期待値と非適合文書に与えられる期待値の差であり、この値の大きい語句ほど適合文書と非適合文書を弁別する効果が高い。
【００１４】
上述のようにユーザが検索結果から選択した適合文書の数が、または予めユーザが用意した適合文書の数が少数、例えば１つであった場合には、上記（Ｂ）を基にした関連度ＴＳＶにおける適合文書のスコアの期待値は、一定値となってしまい、適切な関連語が得られなくなってしまう。
【００１５】
本発明は、上述の実情を考慮してなされたものであって、ユーザが指定あるいは入力した適合文書が少数（特に１つ）の場合でも、適切な関連語が得られるようにする文書検索装置、文書検索方法、文書検索装置の機能を実行するためのプログラムおよびそのプログラムを記録したコンピュータ読み取り可能な記録媒体を提供することを目的とする。
【００１６】
【課題を解決するための手段】
上記課題を解決するために、本発明の請求項１は、検索要求を入力する入力部と、文書を記憶する文書データベースから前記検索要求に適合する文書をランキング検索し、検索結果記憶部に検索結果を記憶するランキング検索部と、前記検索結果記憶部に記憶された検索結果からユーザにより一つの適合文書を指定する文書指定部と、前記指定された適合文書から単語を抽出する抽出部と、前記単語に対して、文書内の文書頻度を、適合文書内頻度として計算し、続いて、該単語に対して、前記検索結果記憶部により記憶された検索結果の上位の文書群の文書頻度を、局所的文書頻度として計算し、さらに、該適合文書内頻度、該局所的文書頻度、及び該検索結果の上位の文書群の数に基づき、前記指定された適合文書から抽出された単語につき、該検索要求との関連度を求め、求めた関連度が高い単語を関連語として自動的に選定し、選定した関連語を前記検索要求に追加して新しい検索要求とする関連語選定部と、を含み、
前記関連度は、前記適合文書内頻度Ｒｗ、前記局所的文書頻度Ｌｗ、前記検索結果の上位の文書群の数Ｒで表された第１式
関連度＝（１ + ｌｏｇ ₂ （Ｒｗ））×Ｌｗ／Ｒ
により計算されることを特徴とする。
また、本発明の請求項２は、入力部により、検索要求を入力する入力ステップと、ランキング検索部により、文書を記憶する文書データベースから前記検索要求に適合する文書をランキング検索して、検索結果記憶部に検索結果を記憶するランキング検索ステップと、文書指定部により、前記検索結果記憶部に記憶された検索結果からユーザにより一つの適合文書を指定する文書指定ステップと、抽出部により、前記指定された適合文書から単語を抽出する抽出ステップと、関連語選定部により、前記単語に対して、文書内の文書頻度を、適合文書内頻度として計算し、続いて、該単語に対して、前記検索結果記憶部により記憶された検索結果の上位の文書群の文書頻度を、局所的文書頻度として計算し、さらに、該適合文書内頻度、該局所的文書頻度、及び該検索結果の上位の文書群の数に基づき、前記指定された適合文書から抽出された単語につき、該検索要求との関連度を求め、求めた関連度が高い単語を関連語として自動的に選定し、選定した関連語を前記検索要求に追加して新しい検索要求とする関連語選定ステップと、を含み、
前記関連度は、前記適合文書内頻度Ｒｗ、前記局所的文書頻度Ｌｗ、前記検索結果の上位の文書群の数Ｒで表された第１式
関連度＝（１ + ｌｏｇ ₂ （Ｒｗ））×Ｌｗ／Ｒ
により計算される
ことを特徴とする。
【００１７】
また、本発明の請求項３は、検索要求を入力する入力部と、文書を記憶する文書データベースから前記検索要求に適合する文書をランキング検索し、検索結果記憶部に検索結果を記憶するランキング検索部と、前記検索結果記憶部に記憶された検索結果からユーザにより一つの適合文書を指定する文書指定部と、前記指定された適合文書から単語を抽出する抽出部と、前記単語に対して、文書内の文書頻度を、適合文書内頻度として計算し、続いて、該単語に対して、前記検索結果記憶部により記憶された検索結果の上位の文書群の文書頻度を、局所的文書頻度として計算し、さらに、該適合文書内頻度、該局所的文書頻度、及び該検索結果の上位の文書群の数に基づき、指定された適合文書から抽出された単語につき、該検索要求との関連度を求め、求めた関連度が高い単語を関連語として自動的に選定し、選定した関連語を前記検索要求に追加して新しい検索要求とする関連語選定部と、を含み、
前記関連度は、前記適合文書内頻度Ｒｗ、前記局所的文書頻度Ｌｗ、前記検索結果の上位の文書群の数Ｒ、大局的文書頻度Ｇｗ、検索対象の文書総数Ｎで表された第２式
関連度＝（１ + ｌｏｇ ₂ （Ｒｗ））×（Ｌｗ／Ｒ−Ｇｗ／Ｎ）
により計算される
ことを特徴とする。
また、本発明の請求項４は、入力部により、検索要求を入力する入力ステップと、ランキング検索部により、文書を記憶する文書データベースから前記検索要求に適合する文書をランキング検索し、検索結果記憶部に検索結果を記憶するランキング検索ステップと、文書指定部により、前記検索結果記憶部に記憶された検索結果からユーザにより一つの適合文書を指定する文書指定ステップと、抽出部により、前記指定された適合文書から単語を抽出する抽出ステップと、関連語選定部により、前記単語に対して、文書内の文書頻度を、適合文書内頻度として計算し、続いて、該単語に対して、前記検索結果記憶部により記憶された検索結果の上位の文書群の文書頻度を、局所的文書頻度として計算し、さらに、該適合文書内頻度、該局所的文書頻度、及び該検索結果の上位の文書群の数に基づき、前記指定された適合文書から抽出された単語につき、該検索要求との関連度を求め、求めた関連度が高い単語を関連語として自動的に選定し、選定した関連語を前記検索要求に追加して新しい検索要求とする関連語選定ステップと、を含み、
前記関連度は、前記適合文書内頻度Ｒｗ、前記局所的文書頻度Ｌｗ、前記検索結果の上位の文書群の数Ｒ、大局的文書頻度Ｇｗ、検索対象の文書総数Ｎで表された第２式
関連度＝（１ + ｌｏｇ ₂ （Ｒｗ））×（Ｌｗ／Ｒ−Ｇｗ／Ｎ）
により計算される
ことを特徴とする。
【００２０】
また、本発明の請求項５は、コンピュータを、請求項１または３に記載の文書検索装置の各部として機能させるためのプログラムである。
また、本発明の請求項６は、請求項５に記載の文書検索プログラムを記録したコンピュータ読み取り可能な記録媒体である。
【００２１】
以上の構成により、ユーザが指定する適合文書が少数（１つとなる場合が多い）であっても、より適切な関連語が得られるので再度の検索結果によってユーザの所望する文書が見つかる度合いが向上する。
【００２２】
【発明の実施の形態】
以下、図面を参照して、本発明の文書検索装置に係る好適な実施形態を説明する。
＜実施形態１＞
図１は、本実施形態１に係る文書検索装置の機能構成を示すブロック図である。
図１において、文書検索装置は、入力部１０、文書検索部２０、文書データベース（ＤＢ）３０、検索結果記憶部４０、文書指定部５０、語句抽出部６０、関連語選定部７０を少なくとも備えている。
【００２３】
入力部１０は、ユーザがキーボード等により、文書データベース３０中からユーザの所望する文書を検索するための文字列からなる検索要求を入力する。
この文字列が文書検索部２０で扱う検索式の形式でなく、自然言語文のような場合には、単語辞書をもちいて形態素解析して単語に分割し、文書検索部２０で扱う検索式へ変換する。この単語辞書は、少なくとも各単語の表記、品詞等から構成されている。
また、入力された文字列が文書の特徴をあらわすキーワードの組み合わせからなる場合も区切り記号や文字種等により分割して、文書検索部２０で扱う検索式へ変換する。
例えば、図２のような入力画面において、検索式を「経済ａｎｄ政治」当のように入力し、検索ボタンを押下する。
【００２４】
文書検索部２０は、入力部１０から渡された検索式を用いて、文書ＤＢ３０をランキング検索し、所定の文書数分の文書情報を検索結果記憶部４０へ出力する。
ランキング検索は、例えば、文書ごとに次のような式２を用いてスコアを計算し、そのスコアが大きい順に文書群をソートすることによって求めることができる。
【００２５】
score = Σ_w score(w) ・・・式２
ここで、Σ_wは、検索式中のすべての検索語ｗについてのスコアｓｃｏｒｅ（ｗ）を加算することを意味している。
score(w)=tf(w)*(1+log₂(N/df(w)))
tf(w)＝検索語ｗがスコアを計算中の文書に出現する出現頻度、
N＝文書ＤＢ３０に登録された文書数、
df(w)＝文書ＤＢ３０中の検索語ｗを含む文書数。
【００２６】
また、文書検索部２０は、関連語選定部７０で生成された新しい検索式に対して再度文書検索を実施する。
【００２７】
文書ＤＢ３０は、検索対象となる文書を保持する文書情報と、その文書中に含まれている各単語の単語統計情報から構成される（図３参照）。
例えば、文書情報には、各文書に対して、文書識別子（ＩＤ）、文書名、書誌事項（作成者、作成日、発行所等）、文書実体へのポインタ等の情報が保持される。
また、単語統計情報には、単語ごとに、単語の表記およびこの単語が文書ＤＢ３０中のいくつの文書に出現したかを示す出現頻度等の統計情報を保持している。
【００２８】
検索結果記憶部４０は、検索結果のうち、スコアの高い文書から順に所定の数の文書に関する情報を記憶する。
例えば、文書に関する情報としてスコアおよび文書ＩＤを記憶する。または、スコアと文書の内容自体を記憶させるようにしてもよい。
【００２９】
文書指定部５０は、検索結果記憶部４０に記憶されている検索結果を一覧としてディスプレイ等の表示装置へ図４に示すように出力する。図４の一覧表には、スコアと文書名とがランク順に表示されている。
ユーザは、この一覧表示から文書の内容を表示させて内容を確認し、所望の文書に近い文書（以下、適合文書という）をチェックボックスへチェックを入れることによって１つ以上指定する（図４では、黒色の四角で選択していることを示した）。
次に、文書指定部５０は、ユーザが「関連語抽出」ボタンを押下すると、選択された適合文書の文書ＩＤを語句抽出部６０へ渡す。
【００３０】
語句抽出部６０は、文書指定部５０から渡された適合文書の文書ＩＤを参照して文書ＤＢ３０から文書の内容を取り出す。
次に、この文書を形態素解析して得た品詞情報に基づき、例えば、名詞・サ変名詞・未登録語等の自立語類を抽出して、検索式に出現した語句以外の語句を求める。形態素解析では、単語辞書に登録されている最短一致した単語に分割する。
【００３１】
また、語句抽出部６０では、語句を抽出する際に、文書内の出現頻度を計数して、頻度表を作成して一時的に記憶する。例えば、適合文書から語句Ａ、Ｂ、Ｃが求められた場合、次のような頻度表を作成する。
【００３２】
【表１】

【００３３】
さらに、適合文書が複数個指定された場合には、計数された適合文書内頻度は各語句に対してそれぞれの文書の適合文書内頻度を総計した値とする。
次に、語句抽出部６０で抽出された語句は、関連語選定部７０へ渡される。
関連語選定部７０は、検索結果記憶部４０に記憶されている検索結果中のランクの上位文書群（例えば、上位１０文書、以下この文書群を擬似適合文書という。この擬似適合文書にはユーザの指定した適合文書は含まないものとする）に関し、語句抽出部６０で抽出された語句がいくつの擬似適合文書に出現するかを計数し、先の頻度表に局所的文書頻度として追加して一時的に記憶する。
【００３４】
【表２】

【００３５】
次に、関連語選定部７０は、上記頻度表を基に語句ごとに式３によって関連度を計算する。
【００３６】
関連度＝(1+log₂(R_w))×L_w/R ・・・式３
R_w＝語句ｗの適合文書内頻度、
L_w＝語句ｗの局所的文書頻度、
R＝擬似適合文書の数。
【００３７】
表２について、関連度を計算して、頻度表を表３のように更新する。
【００３８】
【表３】

【００３９】
最後に、関連語選定部７０は、更新された頻度表の語句の関連度を大きい順にソートし、所定の個数（関連度の上位２０語程度）を関連語として選定し、選定した語句を新たな検索語として検索式へ追加する。この検索式への追加は、論理演算ＯＲによって、元の検索式に追加する。
例えば、上記の場合、元の検索式が「ＸａｎｄＹ」であり、１語だけを関連語とする場合には、新しい検索式は「（ＸａｎｄＹ）ｏｒＢ」となる。
【００４０】
関連語選定部７０は、新しい検索式を検索要求として、文書検索部２０へ渡す。
文書検索部２０は、この新しい検索式で再度ランキング検索することによって、新たな検索結果を検索結果記憶部４０へ記憶する。
以上の操作をユーザの所望する文書が見つかるまで繰り返す。
【００４１】
特に、検索対象文書数が膨大な場合、あるいは、検索要求の表現が不適切な場合は、検索結果の上位には、非常に少数の適合文書しか見つからない場合は多い。
この場合、ユーザが指定する適合文書は少数（１つとなる場合が多い）となるが、以上のように本実施形態を構成することによって、適切な関連語が得られるので再度の検索結果によってユーザの所望する文書が見つかる度合いが向上する。
【００４２】
文書長の短い文書群に対して、本実施形態１によって評価実験を行ったところ、表４のような結果となった。
【００４３】
【表４】

【００４４】
上記の表４を見ると、ユーザが１つの適合文書を与えた場合、適合性フィードバックよりも本実施形態１の方の平均適合率がよいことが分かる。この精度の向上が極僅かであるのは、文書の長さが短いためで、関連語を選択する余地が少ないことに原因があるものと見られる。
また、ユーザが２つの適合文書を与えた場合には、本実施形態１による効果はあまり見られない。
【００４５】
文書長が適度に長い文書群に対して、同様に評価実験を行ったところ、表５のような結果となった。
【００４６】
【表５】

【００４７】
上記の表５を見ると、ユーザが１つの適合文書を与えた場合、適合性フィードバックよりも本実施形態１の方の平均適合率が７％よいことが分かる。また、ユーザが２つの適合文書を与えた場合には、本実施形態１による精度の向上は極僅かである。
【００４８】
次に、このように構成された実施形態１の動作について、図５のフローチャートに基づいて説明する。
まず、図２のような入力画面において、ユーザがキーボード等により、文書データベース３０を検索するための検索要求を入力する（ステップＳ１０）。
この検索要求が自然言語文のような場合には、単語辞書をもちいて形態素解析して単語に分割し、検索式へ変換する。
また、入力された文字列が文書の特徴をあらわすキーワードの組み合わせからなる場合も区切り記号や文字種等により分割して、検索式へ変換する。
【００４９】
入力された検索式を用いて、文書ＤＢ３０をランキング検索し、スコアの高い方から所定の文書数分のスコアおよび文書ＩＤを検索結果記憶部４０へ出力する（ステップＳ２０）。
ランキング検索は、例えば、文書ごとに上述の式２を用いてスコアを計算し、そのスコアの大きい順に文書群をソートすることによって求めることができる。
【００５０】
検索結果記憶部４０に記憶されている検索結果を図４のような一覧としてディスプレイ等の表示装置へ出力し、ユーザがこの一覧から文書の内容を確認して、所望の文書を見つけた場合（ステップＳ３０の「有」）には、処理を終了する。
一方、一覧中に所望の文書がない場合（ステップＳ３０の「無」）には、一覧の中から所望の文書に近い文書のチェックボックスへチェックを入れることによって１つ以上指定して、ユーザが「関連語抽出」ボタンを押下する（ステップＳ４０）。
【００５１】
ユーザから指定された適合文書の文書ＩＤを参照して文書ＤＢ３０から文書の内容を取り出して、形態素解析して得た品詞情報に基づき、例えば、名詞・サ変名詞・未登録語等の自立語類を抽出して、検索式に出現した語句以外の語句を求める（ステップＳ５０）。形態素解析では、単語辞書に登録されている最短一致した単語に分割し、各語句に対して、文書内の出現頻度を計数して、頻度表を作成して一時的に記憶する。
さらに、適合文書が複数個指定された場合には、計数された適合文書内頻度は各語句に対してそれぞれの文書の適合文書内頻度を総計した値とする。
【００５２】
次に、検索結果記憶部４０に記憶されている検索結果中のランクの上位文書群（ユーザが指定した適合文書を含まない擬似適合文書）に関し、ステップＳ５０で抽出された語句がいくつの文書に出現するかを計数し、上述の式３によって各語句の関連度を求めて大きい順にソートし、所定の個数（関連度の上位２０語程度）を関連語として選定し、選定した語句を新たな検索語として検索式へ追加して、ステップＳ３０へ戻り、ユーザの所望する文書が見つかるまで上記の操作を繰り返す（ステップＳ６０）。この検索式への追加は、論理演算ＯＲによって、元の検索式に追加する。
【００５３】
＜実施形態２＞
たくさんの文書に出現するような語句では、文書を弁別する力がないことは明白であるから、このような語句（検索語）を検索式に追加しても、所望の文書を効率よく得ることはできない。
本実施形態２では、このような弁別力のない語句を関連語として選定しないように、上記の式３で表される関連度の精度を向上させるようにした。
【００５４】
いま、擬似適合文書に出現する語句ｗについて考える。この語句ｗがいくつの非適合文書に出現するのかを示す文書頻度の期待値（Ｈ）が大きいということは、語句ｗは検索対象の文書中に偏在することなく存在していると考えられる。
したがって、語句ｗがいくつの擬似適合文書に出現するのかを示す文書頻度の期待値をＴとした場合、（Ｔ−Ｈ）の値が大きいほど語句ｗには弁別力があるといえる。
【００５５】
本実施形態２では、この（Ｔ−Ｈ）を用いて関連語を選定するようにした。
ここで、期待値Ｔは、次の式で近似される。
Ｔ＝（語句ｗが擬似適合文書に出現する文書頻度）／（擬似適合文書の数）
＝（語句ｗの局所的文書頻度）／（擬似適合文書の数）
【００５６】
また、期待値Ｈは、次の式で近似される。
Ｈ＝（語句ｗが非適合文書に出現する文書頻度）／（非適合文書の数）
ここで、非適合文書の数は、擬似適合文書の数と比べて非常に大きいので、大数の法則を当てはめれば、期待値Ｈは更に次のように近似される。
Ｈ≒（語句ｗが検索対象の文書に出現する文書頻度）／（検索対象の文書総数）
＝（語句ｗの大局的文書頻度）／（検索対象の文書総数）
【００５７】
本実施形態２では、上記期待値をスコアへ変換して、語句ｗの関連度を次の式４で定義した。これにより、式４の関連度の値が大きいほど検索結果のランキングにおいて、適合文書と非適合文書とをスコア的に弁別する力を計測できるようになった。
【００５８】
関連度＝(1+log₂(R_w))×(L_w/R-G_w/N) ・・・式４
R_w＝語句ｗの適合文書内頻度、
L_w＝語句ｗの局所的文書頻度、
R＝擬似適合文書の数、
G_w＝語句ｗの大局的文書頻度、
N＝検索対象の文書総数。
【００５９】
図６は、本実施形態２に係る文書検索装置の機能構成を示すブロック図であり、同図において、上述した実施形態１と同一の部分については、同一の符号を付して、その説明を省略する。図６において、実施形態１と異なる点は、関連語選定部７０において出現頻度計算部８０を有するところである。
【００６０】
出現頻度計算部８０は、関連語選定部７０から起動され、文書ＤＢ３０の単語統計情報を参照して、与えられた単語が文書ＤＢ３０のいくつの文書に出現するかを表す出現頻度（大局的文書頻度）を出力する。
または、関連語選定部７０から与えられた単語を含む文書検索を行って、その検索件数を出力するようにしてもよい。
【００６１】
本実施形態２の関連語選定部７０では、各語句に対して出現頻度計算部８０によって大局的文書頻度を計算して、頻度表を表６のように更新する。
例えば、上述の表２に語句Ａ，Ｂ，Ｃの大局的文書頻度を追加すると表６のようになる。
【００６２】
【表６】

【００６３】
次に、各語句の関連度を上記式４によって求め、前述同様、関連語を選択する。
文書総数（Ｎ）を１００００としたときの関連度を式４で求めると表７のように求められる。
【００６４】
【表７】

【００６５】
以上のように本実施形態２を構成することによって、より適切な関連語が得られるので再度の検索結果によってユーザの所望する文書が見つかる度合いが向上する。
【００６６】
＜実施形態１および実施形態２の変形例＞
実施形態１および実施形態２では、検索要求に対する検索結果の中から適合文書を指定していたが、本変形例では予め用意しておいた文書の内容を適合文書のサンプルとして指定できるようにした。
図７と図８は、それぞれ実施形態１と実施形態２に対応する本変形例の機能構成を示すブロック図であり、上述した実施形態１および実施形態２と同一の部分については、同一の符号を付して、その説明を省略する。図７と図８において異なる点は、文書指定部５０の替わりに文書入力部９０とした点である。
【００６７】
文書入力部９０は、検索結果記憶部４０に記憶されている検索結果を一覧としてディスプレイ等の表示装置へ図９に示すように出力する。図９の一覧表には、図４と同様にスコアと文書名とがランク順に表示されている。
ユーザは、この一覧表示から文書の内容を表示させることによって内容を確認し、所望している文書に近い文書がない場合には、予め用意した適合文書のサンプルを画面下方のテキストボックスへ取り込んで、「関連語抽出」ボタンを押下する。文書入力部９０は、このテキストボックスに入力されたテキストを適合文書として語句抽出部６０へ渡す。
【００６８】
または、図１０のような「適合文書指定」ボタンを用意し、ユーザがこのボタンを押下したときに、適合文書が格納されているファイル名等をユーザに指定させて、適合文書を入力するようにしてもよい。
【００６９】
次に、このように構成された本変形例の動作について、図１１のフローチャートに基づいて説明する。
まず、図２のような入力画面において、ユーザがキーボード等により、文書データベース３０を検索するための検索要求を入力する（ステップＳ１１０）。
この検索要求が自然言語文のような場合には、単語辞書をもちいて形態素解析して単語に分割し、検索式へ変換する。
また、入力された文字列が文書の特徴をあらわすキーワードの組み合わせからなる場合も区切り記号や文字種等により分割して、検索式へ変換する。
【００７０】
入力された検索式を用いて、文書ＤＢ３０をランキング検索し、スコアの高い方から所定の文書数分のスコアおよび文書ＩＤを検索結果記憶部４０へ出力する（ステップＳ１２０）。
ランキング検索は、例えば、文書ごとに上述の式２を用いてスコアを計算し、そのスコアの大きい順に文書群をソートすることによって求めることができる。
【００７１】
検索結果記憶部４０に記憶されている検索結果を図９のような一覧としてディスプレイ等の表示装置へ出力し、ユーザがこの一覧から文書の内容を確認して、所望の文書を見つけた場合（ステップＳ１３０の「有」）には、処理を終了する。
一方、一覧表示中に所望の文書がない場合（ステップＳ１３０の「無」）には、適合文書を図９に示したようなテキストボックスへ読み込むか、または、図１０に示したような「適合文書指定」ボタンを押下して適合文書を読み込むかして、「関連語抽出」ボタンを押下する（ステップＳ１４０）。
【００７２】
ユーザから指定された適合文書の内容を取り出して、形態素解析して得た品詞情報に基づき、例えば、名詞・サ変名詞・未登録語等の自立語類を抽出して、検索式に出現した語句以外の語句を求める（ステップＳ１５０）。形態素解析では、単語辞書に登録されている最短一致した単語に分割し、各語句に対して、文書内の出現頻度を計数して、頻度表を作成して一時的に記憶する。
さらに、適合文書が複数個指定された場合には、計数された適合文書内頻度は各語句に対してそれぞれの文書の適合文書内頻度を総計した値とする。
【００７３】
次に、検索結果記憶部４０に記憶されている検索結果中のランクの上位文書群（擬似適合文書）に関し、ステップＳ１５０で抽出された語句がいくつの文書に出現するかを計数し、上述の式３によって関連度を求める。
または、文書ＤＢ３０の単語に関する統計情報を参照することによって、抽出した語句の大局的文書頻度を取り出し、上述の式４によって関連度を求めるようにしてもよい。
【００７４】
求めた各語句の関連度を大きい順にソートし、所定の個数（関連度の上位２０語程度）を関連語として選定し、選定した語句を新たな検索語として検索式へ追加して、ステップＳ１３０へ戻り、再度検索し、ユーザの所望する文書が見つかるまで上記の操作を繰り返す（ステップＳ１６０）。この検索式への追加は、論理演算ＯＲによって、元の検索式に追加する。
【００７５】
以上のように本変形例を構成することによって、適切な適合文書を予め用意しておくことができるので、より適切な関連語を選定することができ、再度の検索結果によってユーザの所望する文書が見つかる度合いが向上する。
【００７６】
＜実施形態３＞
本発明は、上述した実施形態のみに限定されたものではない。上述した実施形態の文書検索装置を構成する各機能をそれぞれプログラム化し、あらかじめＣＤ−ＲＯＭ等の記録媒体に書き込んでおき、コンピュータに搭載したＣＤ−ＲＯＭドライブのような媒体駆動装置にこのＣＤ−ＲＯＭ等を装着して、これらのプログラムをコンピュータのメモリあるいは記憶装置に格納し、それを実行することによって、本発明の目的が達成されることは言うまでもない。
この場合、記録媒体から読み出されたプログラム自体が上述した実施形態の機能を実現することになり、そのプログラムおよびそのプログラムを記録した記録媒体も本発明を構成することになる。
【００７７】
なお、プログラムを格納する記録媒体としては半導体媒体（例えば、ＲＯＭ、不揮発性メモリ等）、光媒体（例えば、ＤＶＤ、ＭＯ、ＭＤ、ＣＤ等）、磁気媒体（例えば、磁気テープ、フレキシブルディスク等）等のいずれであってもよい。
【００７８】
また、ロードしたプログラムを実行することにより上述した実施形態の機能が実現されるだけでなく、そのプログラムの指示に基づき、オペレーティングシステムあるいは他のアプリケーションプログラム等と共同して処理することによって上述した実施形態の機能が実現される場合も含まれる。
【００７９】
市場に流通させる場合には、可搬型の記録媒体にプログラムを格納して流通させたり、インターネット等の通信網を介して接続されたサーバコンピュータの記憶装置に格納しておき、通信網を通じて他のコンピュータに転送することもできる。この場合、このサーバコンピュータの記憶装置も本発明の記録媒体に含まれる。なお、コンピュータでは、可搬型の記録媒体上のプログラム、または転送されてくるプログラムを、コンピュータに接続した記録媒体にインストールし、そのインストールされたプログラムを実行することによって上述した実施形態の機能が実現される。
【００８０】
【発明の効果】
以上説明したように本発明によれば、ユーザが指定する適合文書が少数（１つとなる場合が多い）となった場合でも、より適切な関連語が得られるので再度の検索結果によってユーザの所望する文書が見つかる度合いが向上する。
【図面の簡単な説明】
【図１】実施形態１に係る文書検索装置の機能構成を示すブロック図である。
【図２】検索式の入力画面例である。
【図３】文書データベースのデータ構造例である。
【図４】検索結果の一覧表示および適合文書の指定例である。
【図５】実施形態１の動作を示すフローチャートである。
【図６】実施形態２に係る文書検索装置の機能構成を示すブロック図である。
【図７】実施形態１の変形例の機能構成を示すブロック図である。
【図８】実施形態２の変形例の機能構成を示すブロック図である。
【図９】検索結果の一覧表示および適合文書の入力例である。
【図１０】検索結果の一覧表示および適合文書の指定例である。
【図１１】実施形態１および実施形態２の変形例の動作を示すフローチャートである。
【符号の説明】
１０…入力部、２０…文書検索部、３０…文書データベース（ＤＢ）、４０…検索結果記憶部、５０…文書指定部、６０…語句抽出部、７０…関連語選定部、８０…出現頻度計算部、９０…文書入力部。
[0001]
[Technical Field of the Invention]
The present invention relates to a document search apparatus, a document search method, a program, and a recording medium, and more specifically to a technique for expanding a search request when relevance feedback is performed using relevant documents designated by a user.
[0002]
[Prior art]
The number of documents that are created or that can be referenced is expected to keep growing. Finding the appropriate document a user wants within such an enormous collection has become difficult.
For this reason, document search has been widely studied as a technique for retrieving appropriate documents from a large collection efficiently and quickly.
[0003]
As one such search technique, document ranking search systems have been proposed that compute, for a search request, the degree to which each document in the collection satisfies the request (hereinafter referred to as the degree of relevance) and output the documents ranked in descending order of that degree (see, for example, Patent Document 1 and Non-Patent Document 1). Here, the search request is often expressed as a natural language sentence or as terms such as words and compound words, and the degree of relevance is assigned, for example, a larger value the more often the terms contained in the search request appear in the document.
[0004]
In practice, the documents ranked near the top of the search results are not necessarily documents that satisfy the user's request (hereinafter referred to as relevant documents).
For this reason, systems have been developed that repeat the search while feeding back the system's own analysis of the search results or the user's evaluation of them, gradually bringing the results closer to what the user wants (relevance feedback).
[0005]
Most of these systems have the user rate the retrieved documents and adjust the weights that indicate the importance of the search terms, or extract new search terms (hereinafter referred to as related words) from the relevant documents, add them to the original search request, and search the documents again. Examples of this approach include the relevance feedback described in Patent Document 2 and Non-Patent Document 2 and that described in Patent Document 3.
[0006]
In addition, rather than designating relevant documents from among the top-ranked search results, giving the system relevant documents prepared in advance and using them in the same way can also be regarded as a form of relevance feedback.
On the other hand, the technique of treating the top-ranked documents of the search results (for example, the top 1 to 10 documents) as relevant documents and using them in the same way is called pseudo relevance feedback. However, if most of the top-ranked documents are in fact non-relevant, inappropriate search terms are added and the repeated search returns even more inappropriate results, so the technique is often counterproductive.
[0007]
[Patent Document 1]
JP H11-224264 A
[Patent Document 2]
JP 2000-242646 A
[Patent Document 3]
JP H09-153051 A
[Non-Patent Document 1]
K. Sparck Jones, S. Walker, and S. E. Robertson, "A probabilistic model of information retrieval: Development and status", TR446, Cambridge University Computer Laboratory, September 1998.
http://citeseer.nj.nec.com/jones98probabilistic.html
[Non-Patent Document 2]
Chris Buckley, Gerard Salton, and James Allan, "The Effect of Adding Relevance Information in a Relevance Feedback Environment", in Proceedings of SIGIR '94, 1994, pp. 292-300.
[0008]
[Problems to be solved by the invention]
When relevance feedback is performed as described above, the user must display the retrieved documents one by one and check their contents, which places a heavy burden on the user. In practice, therefore, the user is likely to supply only one or a few relevant documents from the search results.
[0009]
In relevance feedback that adds related words selected from the relevant documents to the search request in order to retrieve the documents the user wants, related word candidates are often extracted from the relevant documents by the following procedure.
[0010]
(1) A set of terms is obtained from the relevant documents by word segmentation or the like.
(2) For each term, its desirability as a related word (hereinafter referred to as relatedness) is calculated.
(3) The terms are presented as related word candidates in descending order of relatedness.
(4) Related words are selected from the candidates either by the user or automatically by the system.
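The four steps above can be sketched as follows. This is only an illustrative sketch: the whitespace split stands in for real word segmentation, and the function name `rank_candidates`, the `score_fn` callback, and the sample documents are assumptions, not part of the patent.

```python
from collections import Counter

def rank_candidates(relevant_docs, score_fn, top_k=3):
    """Steps (1)-(4): split, score, sort, and auto-select related word candidates."""
    # (1) Naive whitespace split stands in for real word segmentation (assumption).
    counts = Counter(term for doc in relevant_docs for term in doc.split())
    # (2) Compute a relatedness value for each term with the supplied function.
    scored = {term: score_fn(term, n) for term, n in counts.items()}
    # (3) Order candidates by descending relatedness (ties broken alphabetically).
    ranked = sorted(scored, key=lambda t: (-scored[t], t))
    # (4) Automatic selection: keep the top_k candidates.
    return ranked[:top_k]

# Hypothetical relevant documents; raw frequency is used as a placeholder score.
docs = ["tax reform budget tax", "budget deficit tax"]
print(rank_candidates(docs, lambda term, n: n, top_k=2))
```

Passing the scoring rule as a callback keeps the sketch independent of any particular relatedness formula.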
[0011]
Here, even if the related word candidates extracted as described above are presented to the user, it is difficult for the user to tell which of them are effective, so many systems select them automatically.
Also, not all extracted terms are used as related words, because too many related words often reduce search efficiency or accuracy; only some of the extracted terms therefore need to be selected as related words.
[0012]
The relatedness assigned to each term is often defined on the basis of the following factors.
(A) The within-relevant-document frequency, indicating how many times the term appears in the relevant documents,
(B) The local document frequency (L), indicating the number of relevant documents in which the term appears,
(C) The global document frequency (G), indicating the number of search target documents in which the term appears,
(D) The number of relevant documents (R),
(E) The number of search target documents (N).
In particular, terms used in common across many relevant documents are often appropriate related words (ones that improve search accuracy), so factor (B) is indispensable for selecting related words appropriately.
[0013]
For example, the following Formula 1 has conventionally been proposed as the relatedness TSV(t) of a term t.
TSV(t) = w(t) × (L(t)/R - G(t)/N) ・・・ Formula 1
Here, w(t) is the score given to a document in which the term t appears, and documents are ordered in descending order of this score.
The relatedness TSV(t) calculated by Formula 1 is the difference between the expected score given to relevant documents and the expected score given to non-relevant documents; the larger this value, the more effective the term is at discriminating relevant documents from non-relevant ones.
[0014]
As described above, when the number of relevant documents that the user selects from the search results, or prepares in advance, is small (for example, one), the expected score of relevant documents in the relatedness TSV, which is based on factor (B), becomes a constant, and appropriate related words can no longer be obtained.
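The degenerate case can be made concrete with a small sketch of Formula 1. The term names, the values of w(t) and G(t), and N = 1000 are made-up assumptions chosen only to illustrate the point.

```python
def tsv(w, local_df, num_relevant, global_df, num_docs):
    """Formula 1: TSV(t) = w(t) * (L(t)/R - G(t)/N)."""
    return w * (local_df / num_relevant - global_df / num_docs)

# Hypothetical terms from ONE relevant document (R = 1), so L(t) = 1 for each.
# Values are (w, G) and are made up for illustration; N is also assumed.
N = 1000
terms = {"budget": (2.0, 50), "tax": (2.0, 400), "the": (2.0, 990)}

relevant_side = {t: 1 / 1 for t in terms}  # L(t)/R is 1.0 for every term
scores = {t: tsv(w, 1, 1, g, N) for t, (w, g) in terms.items()}
print(relevant_side)
print(scores)
```

Because L(t)/R is identical for every term when R = 1, the ordering is driven entirely by w(t) and G(t)/N, so the relevant document itself contributes no discriminating signal.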
[0015]
The present invention has been made in view of the above circumstances. Its object is to provide a document search apparatus and a document search method that yield appropriate related words even when the user designates or inputs only a small number of relevant documents (in particular, one), together with a program for executing the functions of the document search apparatus and a computer-readable recording medium on which the program is recorded.
[0016]
[Means for Solving the Problems]
In order to solve the above problem, claim 1 of the present invention comprises: an input unit that inputs a search request; a ranking search unit that performs a ranking search of a document database storing documents for documents matching the search request and stores the search results in a search result storage unit; a document designating unit with which a user designates one relevant document from the search results stored in the search result storage unit; an extraction unit that extracts words from the designated relevant document; and a related word selection unit that calculates, for each word, its frequency within the document as the within-relevant-document frequency, then calculates, for the word, its document frequency in the top-ranked document group of the search results stored in the search result storage unit as the local document frequency, further obtains, for each word extracted from the designated relevant document, its relatedness to the search request based on the within-relevant-document frequency, the local document frequency, and the number of documents in the top-ranked document group of the search results, automatically selects words with high relatedness as related words, and adds the selected related words to the search request to form a new search request,
wherein the relatedness is calculated by a first formula expressed in terms of the within-relevant-document frequency Rw, the local document frequency Lw, and the number R of documents in the top-ranked document group of the search results:
Relatedness = (1 + log₂(Rw)) × Lw / R
Claim 2 of the present invention comprises: an input step in which an input unit inputs a search request; a ranking search step in which a ranking search unit performs a ranking search of a document database storing documents for documents matching the search request and stores the search results in a search result storage unit; a document designating step in which a user designates, by means of a document designating unit, one relevant document from the search results stored in the search result storage unit; an extraction step in which an extraction unit extracts words from the designated relevant document; and a related word selection step in which a related word selection unit calculates, for each word, its frequency within the document as the within-relevant-document frequency, then calculates, for the word, its document frequency in the top-ranked document group of the search results stored in the search result storage unit as the local document frequency, further obtains, for each word extracted from the designated relevant document, its relatedness to the search request based on the within-relevant-document frequency, the local document frequency, and the number of documents in the top-ranked document group of the search results, automatically selects words with high relatedness as related words, and adds the selected related words to the search request to form a new search request,
wherein the relatedness is calculated by a first formula expressed in terms of the within-relevant-document frequency Rw, the local document frequency Lw, and the number R of documents in the top-ranked document group of the search results:
Relatedness = (1 + log₂(Rw)) × Lw / R
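As a worked illustration of the first formula, the following sketch computes the value for three terms and sorts them. The function name `relatedness_v1` and the counts for terms A, B, and C are hypothetical, chosen only to show how the two frequencies interact.

```python
import math

def relatedness_v1(rw, lw, r):
    """First formula: (1 + log2(Rw)) * Lw / R."""
    if rw < 1 or r < 1:
        raise ValueError("Rw and R must be at least 1")
    return (1 + math.log2(rw)) * lw / r

# Hypothetical counts: term -> (Rw, Lw), with R = 10 top-ranked documents.
counts = {"A": (4, 5), "B": (2, 8), "C": (1, 10)}
R = 10
values = {t: relatedness_v1(rw, lw, R) for t, (rw, lw) in counts.items()}
ranked = sorted(values, key=lambda t: -values[t])
print(values)
print(ranked)
```

Note that the local document frequency Lw still varies from term to term even when only one relevant document is designated, because it is counted over the top-ranked search results rather than over the designated document.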
[0017]
Claim 3 of the present invention comprises: an input unit that inputs a search request; a ranking search unit that performs a ranking search of a document database storing documents for documents matching the search request and stores the search results in a search result storage unit; a document designating unit with which a user designates one relevant document from the search results stored in the search result storage unit; an extraction unit that extracts words from the designated relevant document; and a related word selection unit that calculates, for each word, its frequency within the document as the within-relevant-document frequency, then calculates, for the word, its document frequency in the top-ranked document group of the search results stored in the search result storage unit as the local document frequency, further obtains, for each word extracted from the designated relevant document, its relatedness to the search request based on the within-relevant-document frequency, the local document frequency, and the number of documents in the top-ranked document group of the search results, automatically selects words with high relatedness as related words, and adds the selected related words to the search request to form a new search request,
wherein the relatedness is calculated by a second formula expressed in terms of the within-relevant-document frequency Rw, the local document frequency Lw, the number R of documents in the top-ranked document group of the search results, the global document frequency Gw, and the total number N of documents to be searched:
Relatedness = (1 + log₂(Rw)) × (Lw / R - Gw / N)
Claim 4 of the present invention comprises: an input step in which an input unit inputs a search request; a ranking search step in which a ranking search unit performs a ranking search of a document database storing documents for documents matching the search request and stores the search results in a search result storage unit; a document designating step in which a user designates, by means of a document designating unit, one relevant document from the search results stored in the search result storage unit; an extraction step in which an extraction unit extracts words from the designated relevant document; and a related word selection step in which a related word selection unit calculates, for each word, its frequency within the document as the within-relevant-document frequency, then calculates, for the word, its document frequency in the top-ranked document group of the search results stored in the search result storage unit as the local document frequency, further obtains, for each word extracted from the designated relevant document, its relatedness to the search request based on the within-relevant-document frequency, the local document frequency, and the number of documents in the top-ranked document group of the search results, automatically selects words with high relatedness as related words, and adds the selected related words to the search request to form a new search request,
wherein the relatedness is calculated by a second formula expressed in terms of the within-relevant-document frequency Rw, the local document frequency Lw, the number R of documents in the top-ranked document group of the search results, the global document frequency Gw, and the total number N of documents to be searched:
Relatedness = (1 + log₂(Rw)) × (Lw / R - Gw / N)
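The effect of the extra Gw / N term in the second formula can be seen in a short sketch. The function name `relatedness_v2` and all counts below are hypothetical; they are chosen so that one term is common across the whole collection and the other is not.

```python
import math

def relatedness_v2(rw, lw, r, gw, n):
    """Second formula: (1 + log2(Rw)) * (Lw/R - Gw/N)."""
    if rw < 1 or r < 1 or n < 1:
        raise ValueError("Rw, R and N must be at least 1")
    return (1 + math.log2(rw)) * (lw / r - gw / n)

# Hypothetical counts with R = 10 top-ranked documents and N = 10000 documents.
common = relatedness_v2(rw=4, lw=10, r=10, gw=9000, n=10000)   # appears almost everywhere
specific = relatedness_v2(rw=2, lw=8, r=10, gw=100, n=10000)   # concentrated in the top ranks
print(common, specific)
```

In this sketch the term that appears in nearly every document scores lower even though it is more frequent locally, and a term with Gw / N larger than Lw / R would receive a negative value.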
[0020]
Claim 5 of the present invention is a program for causing a computer to function as each unit of the document search apparatus according to claim 1 or 3.
Claim 6 of the present invention is a computer-readable recording medium on which the document search program according to claim 5 is recorded.
[0021]
With the above configuration, even when the user designates only a small number of relevant documents (often just one), more appropriate related words are obtained, so the repeated search is more likely to find the document the user wants.
[0022]
[Embodiments of the Invention]
Hereinafter, preferred embodiments of the document search apparatus according to the present invention will be described with reference to the drawings.
<Embodiment 1>
FIG. 1 is a block diagram illustrating a functional configuration of the document search apparatus according to the first embodiment.
In FIG. 1, the document search apparatus includes at least an input unit 10, a document search unit 20, a document database (DB) 30, a search result storage unit 40, a document designating unit 50, a phrase extraction unit 60, and a related word selection unit 70.
[0023]
The input unit 10 receives a search request, entered by the user with a keyboard or the like, consisting of a character string for searching the document database 30 for the document the user wants.
When this character string is not in the form of a search expression handled by the document search unit 20 but is, for example, a natural language sentence, it is divided into words by morphological analysis using a word dictionary and converted into a search expression handled by the document search unit 20. The word dictionary contains at least the written form and part of speech of each word.
Likewise, when the input character string consists of a combination of keywords representing the characteristics of a document, it is split on delimiters, character types, and the like and converted into a search expression handled by the document search unit 20.
For example, on an input screen such as that of FIG. 2, a search expression such as "economy and politics" is entered and the search button is pressed.
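A minimal sketch of the keyword case follows, assuming the input is split on whitespace and commas and that "and" and "or" are the only operators. The function name `to_search_expression` and the returned (terms, operators) shape are illustrative assumptions; the patent does not specify the internal form of the search expression, and the morphological analysis path for natural language input is not shown.

```python
import re

OPERATORS = {"and", "or"}  # assumed operator vocabulary

def to_search_expression(text):
    """Split a keyword string on delimiters and separate terms from operators."""
    tokens = [t for t in re.split(r"[\s,、]+", text.strip()) if t]
    terms = [t for t in tokens if t.lower() not in OPERATORS]
    operators = [t.lower() for t in tokens if t.lower() in OPERATORS]
    return terms, operators

print(to_search_expression("economy and politics"))
```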
[0024]
The document search unit 20 performs a ranking search on the document DB 30 using the search expression passed from the input unit 10, and outputs document information for a predetermined number of documents to the search result storage unit 40.
The ranking search can be performed, for example, by calculating a score for each document using Formula 2 below and sorting the documents in descending order of score.
[0025]
score = Σ_w score(w) ... Formula 2
where Σ_w means summing the scores score(w) over all search terms w in the search expression.
score (w) = tf (w) * (1 + log₂(N / df (w)))
tf (w) = frequency of occurrence of the search term w in the document whose score is being calculated,
N = number of documents registered in the document DB 30
df (w) = number of documents including the search word w in the document DB 30.
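The ranking described by Formula 2 can be illustrated with a minimal Python sketch. It assumes that each document is already a list of words, that the search expression has been reduced to a plain list of search terms (the logical operators of the expression are ignored here), and that a term missing from a document contributes nothing. The names `score_document`, `rank`, and the sample documents are hypothetical and do not come from the specification.

```python
import math
from collections import Counter

def score_document(doc_terms, query_terms, df, n_docs):
    """Formula 2: sum over search terms of tf(w) * (1 + log2(N / df(w)))."""
    tf = Counter(doc_terms)
    total = 0.0
    for w in query_terms:
        if tf[w] == 0 or df.get(w, 0) == 0:
            continue  # term absent from this document or from the collection
        total += tf[w] * (1 + math.log2(n_docs / df[w]))
    return total

def rank(docs, query_terms, top_k=10):
    """Return (score, doc_id) pairs for the top_k documents, highest score first."""
    n_docs = len(docs)
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))  # document frequency: count each word once per document
    scored = [(score_document(terms, query_terms, df, n_docs), doc_id)
              for doc_id, terms in docs.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_k]

# Hypothetical four-document collection (assumption, for illustration only).
docs = {
    "d1": ["economy", "politics", "economy"],
    "d2": ["politics", "sports"],
    "d3": ["sports", "weather"],
    "d4": ["weather"],
}
ranking = rank(docs, ["economy", "politics"])
```

In this sketch the document frequencies are recomputed from the collection on every call; the document DB 30 described below keeps them as word statistical information instead.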
[0026]
In addition, the document search unit 20 performs the document search again for the new search expression generated by the related word selection unit 70.
[0027]
The document DB 30 includes document information that holds a document to be searched and word statistical information of each word included in the document (see FIG. 3).
For example, in the document information, information such as a document identifier (ID), a document name, a bibliographic item (creator, creation date, issuing place, etc.), a pointer to a document entity, and the like are held for each document.
Further, the word statistical information holds, for each word, statistical information such as the word notation and an appearance frequency indicating in how many documents in the document DB 30 the word appears.
[0028]
The search result storage unit 40 stores information related to a predetermined number of documents in order from the document with the highest score among the search results.
For example, a score and a document ID are stored as information about the document. Alternatively, the score and the content of the document itself may be stored.
[0029]
The document designating unit 50 outputs the search results stored in the search result storage unit 40 as a list to a display device such as a display, as shown in FIG. 4. In the list of FIG. 4, scores and document names are displayed in rank order.
The user opens documents from this list, confirms their contents, and designates one or more documents close to the desired document (hereinafter referred to as conforming documents) by checking their check boxes (in FIG. 4, a black square indicates a selection).
Next, when the user presses the “extract related words” button, the document specifying unit 50 passes the document ID of the selected conforming document to the phrase extracting unit 60.
[0030]
The phrase extraction unit 60 refers to the document ID of the conforming document passed from the document specification unit 50 and extracts the contents of the document from the document DB 30.
Next, based on the part-of-speech information obtained by morphological analysis of this document, independent words such as nouns, sa-hen (verbal) nouns, and unregistered words are extracted, and phrases other than those appearing in the search expression are obtained. In the morphological analysis, the text is divided into the shortest matching words registered in the word dictionary.
[0031]
In addition, when extracting a phrase, the phrase extraction unit 60 counts the appearance frequency in the document, creates a frequency table, and temporarily stores it. For example, when the words A, B, and C are obtained from the relevant document, the following frequency table is created.
[0032]
[Table 1]

[0033]
Further, when a plurality of conforming documents are designated, the counted frequency within the conforming document is a total value of the frequencies within each conforming document for each word / phrase.
Next, the phrase extracted by the phrase extraction unit 60 is passed to the related word selection unit 70.
For the higher-ranked document group in the search results stored in the search result storage unit 40 (for example, the top 10 documents; hereinafter this document group is referred to as pseudo conforming documents), the related word selection unit 70 counts the number of pseudo conforming documents in which each phrase extracted by the phrase extraction unit 60 appears, adds this count to the previous frequency table as the local document frequency, and temporarily stores the table.
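The construction of the frequency table can be sketched in Python as follows. This is only an illustration under stated assumptions: morphological analysis and part-of-speech filtering are taken as already done, so every document is a list of candidate words; `build_frequency_table` and the sample data are hypothetical names and values, not the contents of Table 1 or Table 2.

```python
from collections import Counter

def build_frequency_table(conforming_docs, pseudo_docs, query_terms):
    """Return {word: {"R_w": frequency within conforming documents,
                      "L_w": local document frequency}}."""
    excluded = set(query_terms)  # words already in the search expression are skipped
    in_doc = Counter()
    for terms in conforming_docs:
        # With several conforming documents, the frequencies are summed per word.
        in_doc.update(t for t in terms if t not in excluded)
    table = {}
    for word, r_w in in_doc.items():
        # Local document frequency: number of pseudo conforming documents containing the word.
        l_w = sum(1 for terms in pseudo_docs if word in terms)
        table[word] = {"R_w": r_w, "L_w": l_w}
    return table

# Hypothetical data (assumption): two conforming documents, four pseudo conforming documents.
conforming = [["A", "B", "A", "X"], ["B", "C"]]
pseudo = [["B", "X"], ["A", "B"], ["C"], ["Y"]]
table = build_frequency_table(conforming, pseudo, query_terms=["X", "Y"])
```

Here the word A occurs twice in the conforming documents and in one pseudo conforming document, so its row holds R_w = 2 and L_w = 1.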
[0034]
[Table 2]

[0035]
Next, the related word selection unit 70 calculates the degree of relevance according to Equation 3 for each word based on the frequency table.
[0036]
Relevance = (1 + log₂(R_w)) × L_w / R ... Formula 3
R_w = Frequency in the relevant document of the word w
L_w = Local document frequency of word w
R = number of pseudo conforming documents.
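As a worked illustration of Formula 3, the following Python sketch computes the relevance for a small frequency table. The function name `relevance_formula3` and the numbers are assumptions chosen for easy arithmetic; they are not the values of Table 2 or Table 3.

```python
import math

def relevance_formula3(r_w, l_w, r):
    """Formula 3: (1 + log2(R_w)) * L_w / R."""
    if r_w < 1 or r < 1:
        raise ValueError("R_w and R must be at least 1")
    return (1 + math.log2(r_w)) * l_w / r

# Hypothetical frequency table (assumption): word -> (R_w, L_w), with R = 10.
table = {"A": (4, 5), "B": (2, 10), "C": (1, 0)}
R = 10
relevance = {w: relevance_formula3(r_w, l_w, R) for w, (r_w, l_w) in table.items()}
```

With these values, A scores (1 + 2) × 5 / 10 = 1.5, B scores (1 + 1) × 10 / 10 = 2.0, and C scores 0 because it appears in no pseudo conforming document.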
[0037]
For Table 2, the relevance is calculated and the frequency table is updated as shown in Table 3.
[0038]
[Table 3]

[0039]
Finally, the related word selection unit 70 sorts the words in the updated frequency table in descending order of relevance, selects a predetermined number (about the top 20 words by relevance) as related words, and adds the selected words to the search expression as new search terms. They are added to the original search expression with the logical operator OR.
For example, in the above case, when the original search expression is “X and Y” and only the word B is selected as a related word, the new search expression is “(X and Y) or B”.
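The selection and the OR-expansion can be sketched as below. The string form of the search expression, the function name `expand_query`, and the relevance values are assumptions for illustration; the specification does not fix a concrete syntax beyond the “(X and Y) or B” example.

```python
def expand_query(original_expr, relevance_by_word, top_k=20):
    """Append the top_k words by relevance to the expression with logical OR."""
    ranked = sorted(relevance_by_word.items(), key=lambda kv: (-kv[1], kv[0]))
    related = [word for word, _ in ranked[:top_k]]
    if not related:
        return original_expr  # nothing to add
    return "(" + original_expr + ") or " + " or ".join(related)

# Hypothetical relevance values (assumption); only one related word is kept.
new_expr = expand_query("X and Y", {"A": 1.5, "B": 2.0, "C": 0.0}, top_k=1)
```

With `top_k=1` this reproduces the example in the text: `new_expr` is `"(X and Y) or B"`.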
[0040]
The related word selection unit 70 passes the new search expression to the document search unit 20 as a search request.
The document search unit 20 stores a new search result in the search result storage unit 40 by performing a ranking search again with this new search formula.
The above operation is repeated until a document desired by the user is found.
[0041]
In particular, when the number of documents to be searched is enormous or when the search request expression is inappropriate, there are many cases where only a very small number of matching documents can be found at the top of the search results.
In this case, the number of relevant documents specified by the user is small (in many cases, only one). However, with the present embodiment configured as described above, appropriate related words can still be obtained, so the re-executed search is more likely to find the document desired by the user.
[0042]
When an evaluation experiment was performed on the document group having a short document length according to the first embodiment, the results shown in Table 4 were obtained.
[0043]
[Table 4]

[0044]
As can be seen from Table 4 above, when the user gives one conforming document, the average relevance ratio of the first embodiment is better than that of relevance feedback. The improvement in accuracy is slight, apparently because the documents are short and leave little room for selecting related words.
Further, when the user gives two conforming documents, the first embodiment shows little effect.
[0045]
When a similar evaluation experiment was performed on a document group having an appropriately long document length, the results shown in Table 5 were obtained.
[0046]
[Table 5]

[0047]
As can be seen from Table 5 above, when the user gives one conforming document, the average relevance ratio of the first embodiment is 7% better than that of relevance feedback. Further, when the user gives two conforming documents, the accuracy improvement of the first embodiment is negligible.
[0048]
Next, the operation of the first embodiment configured as described above will be described based on the flowchart of FIG.
First, on the input screen as shown in FIG. 2, the user inputs a search request for searching the document database 30 by using a keyboard or the like (step S10).
When the search request is a natural language sentence, the word dictionary is used to perform morphological analysis, divide it into words, and convert it into a search expression.
Also, when the input character string is made up of a combination of keywords representing the characteristics of the document, it is divided by a delimiter symbol, character type, etc., and converted into a search expression.
[0049]
Using the input search expression, a ranking search is performed on the document DB 30, and the scores and document IDs of a predetermined number of documents, in descending order of score, are output to the search result storage unit 40 (step S20).
The ranking search can be obtained, for example, by calculating a score using the above-described formula 2 for each document and sorting the document group in descending order of the score.
[0050]
The search results stored in the search result storage unit 40 are output as a list to a display device such as a display, as shown in FIG. 4. When the user confirms the contents of the documents from this list and finds the desired document (“Yes” in step S30), the process is terminated.
On the other hand, when the desired document is not in the list (“No” in step S30), the user designates one or more documents close to the desired document by checking their check boxes in the list and presses the “related word extraction” button (step S40).
[0051]
The contents of the conforming document designated by the user are retrieved from the document DB 30 by referring to its document ID and are morphologically analyzed. Based on the resulting part-of-speech information, independent words such as nouns, sa-hen (verbal) nouns, and unregistered words are extracted, and phrases other than those appearing in the search expression are obtained (step S50). In the morphological analysis, the text is divided into the shortest matching words registered in the word dictionary; the appearance frequency in the document is counted for each word, and a frequency table is created and temporarily stored.
Further, when a plurality of conforming documents are designated, the counted frequency within the conforming document is a total value of the frequencies within each conforming document for each word / phrase.
[0052]
Next, for the higher-ranked document group in the search results stored in the search result storage unit 40 (pseudo conforming documents, not including the conforming documents designated by the user), the number of documents in which each word extracted in step S50 appears is counted. The relevance of each word is obtained by Formula 3 described above, the words are sorted in descending order of relevance, a predetermined number (about the top 20 words by relevance) are selected as related words, and the selected words are added to the search expression as new search terms. The process then returns to step S30, and the above operation is repeated until the document desired by the user is found (step S60). The related words are added to the original search expression with the logical operator OR.
[0053]
<Embodiment 2>
Terms that appear in many documents obviously have no power to discriminate between documents, so even if such terms are added to the search expression as search terms, the desired document cannot be obtained efficiently.
In the second embodiment, the accuracy of the degree of relevance given by Formula 3 above is improved so that such non-discriminating phrases are not selected as related words.
[0054]
Now consider the word w that appears in the pseudo-conforming document. If the expected value (H) of the document frequency indicating how many non-conforming documents this word w appears in is large, it is considered that the word w exists without being unevenly distributed in the search target document.
Therefore, when the expected value of the document frequency indicating in how many pseudo conforming documents the word w appears is T, it can be said that the larger the value of (T − H), the more discriminating power the word w has.
[0055]
In the second embodiment, related words are selected using this (T − H).
Here, the expected value T is approximated by the following equation.
T = (document frequency at which the word w appears in the pseudo conforming document) / (number of pseudo conforming documents)
= (Local document frequency of word w) / (number of pseudo conforming documents)
[0056]
The expected value H is approximated by the following equation.
H = (document frequency at which the word w appears in a non-conforming document) / (number of non-conforming documents)
Here, since the number of non-conforming documents is very large compared to the number of pseudo conforming documents, the expected value H is further approximated as follows if the law of large numbers is applied.
H≈ (document frequency in which the word w appears in the search target document) / (total number of search target documents)
= (Global document frequency of word w) / (Total number of documents to be searched)
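Written with the symbols that Formula 4 uses (L_w for the local document frequency, R for the number of pseudo conforming documents, G_w for the global document frequency, and N for the total number of documents to be searched), the two approximations above combine as follows. This is only a restatement of the text in LaTeX notation, not an additional result.

```latex
\begin{align*}
T &\approx \frac{L_w}{R}, \qquad H \approx \frac{G_w}{N}, \\
T - H &\approx \frac{L_w}{R} - \frac{G_w}{N}.
\end{align*}
```

A word that is as common in the whole collection as among the pseudo conforming documents therefore yields a difference near zero, and a word that is more common in the collection yields a negative one.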
[0057]
In the second embodiment, the expected value is converted into a score, and the relevance of the phrase w is defined by the following expression 4. As a result, the greater the relevance value of Equation 4, the more powerful the ability to discriminate between conforming documents and non-conforming documents in the ranking of search results.
[0058]
Relevance = (1 + log₂(R_w)) × (L_w / R − G_w / N) ... Formula 4
R_w= Frequency in the relevant document of the word w
L_w= Local document frequency of word w
R = number of pseudo conforming documents,
G_w= Global document frequency of word w
N = Total number of documents to be searched.
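The effect of the correction term in Formula 4 can be seen in a short Python sketch. The function name `relevance_formula4` and all frequency values are assumptions for illustration (they are not the values of Table 6 or Table 7); only N = 10000 echoes the total used in the specification's own example.

```python
import math

def relevance_formula4(r_w, l_w, r, g_w, n):
    """Formula 4: (1 + log2(R_w)) * (L_w / R - G_w / N)."""
    if r_w < 1 or r < 1 or n < 1:
        raise ValueError("R_w, R and N must be at least 1")
    return (1 + math.log2(r_w)) * (l_w / r - g_w / n)

# Hypothetical values (assumption): word -> (R_w, L_w, G_w), with R = 10 and N = 10000.
table = {"A": (4, 5, 1000), "B": (2, 10, 9000)}
R, N = 10, 10000
relevance = {w: relevance_formula4(r_w, l_w, R, g_w, N)
             for w, (r_w, l_w, g_w) in table.items()}
best = max(relevance, key=relevance.get)
```

With these numbers, B appears in every pseudo conforming document but also in 90% of the whole collection, so its relevance falls to about 0.2, below A's value of about 1.2; under Formula 3 the same words would rank in the opposite order (2.0 for B against 1.5 for A).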
[0059]
FIG. 6 is a block diagram illustrating a functional configuration of the document search apparatus according to the second embodiment. In FIG. 6, the same parts as those in the first embodiment described above are denoted by the same reference numerals, and their description is omitted. FIG. 6 differs from the first embodiment in that the related word selection unit 70 has an appearance frequency calculation unit 80.
[0060]
The appearance frequency calculation unit 80 is activated by the related word selection unit 70 and, by referring to the word statistical information in the document DB 30, outputs the appearance frequency (global document frequency) indicating in how many documents in the document DB 30 the given word appears.
Alternatively, a search for documents containing the word given by the related word selection unit 70 may be performed, and the number of documents found may be output.
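The two ways of obtaining the global document frequency can be sketched as follows. The data layout (a dictionary of precomputed word statistics and a dictionary of word lists per document) and both function names are hypothetical; the specification does not prescribe how the document DB 30 is stored.

```python
def global_df_from_statistics(word, word_statistics):
    """Look up the precomputed document frequency held as word statistical information."""
    return word_statistics.get(word, 0)

def global_df_by_search(word, documents):
    """Alternative: search for documents containing the word and count the hits."""
    return sum(1 for terms in documents.values() if word in terms)

# Hypothetical collection and its matching statistics (assumption).
documents = {"d1": ["A", "B"], "d2": ["B"], "d3": ["C", "B", "B"]}
word_statistics = {"A": 1, "B": 3, "C": 1}

g_b_lookup = global_df_from_statistics("B", word_statistics)
g_b_search = global_df_by_search("B", documents)
```

Both calls give 3 for the word B here. The lookup avoids scanning the collection, while the search variant needs no separately maintained statistics.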
[0061]
In the related word selection unit 70 of the second embodiment, the global document frequency of each word is calculated by the appearance frequency calculation unit 80, and the frequency table is updated as shown in Table 6.
For example, when the global document frequencies of the words A, B, and C are added to Table 2 above, Table 6 is obtained.
[0062]
[Table 6]

[0063]
Next, the degree of relevance of each word / phrase is obtained by the above equation 4, and the related word is selected as described above.
When the degree of relevance when the total number of documents (N) is 10000 is obtained by Expression 4, it is obtained as shown in Table 7.
[0064]
[Table 7]

[0065]
By configuring the second embodiment as described above, more appropriate related terms can be obtained, so the re-executed search is more likely to find the document desired by the user.
[0066]
<Modification of Embodiment 1 and Embodiment 2>
In the first embodiment and the second embodiment, the conforming document is specified from the search results corresponding to the search request. In this modification, the contents of a document prepared in advance can be specified as a sample of the conforming document.
FIGS. 7 and 8 are block diagrams showing the functional configuration of the present modification corresponding to Embodiments 1 and 2, respectively. The same parts as those in Embodiments 1 and 2 described above are denoted by the same reference numerals, and their description is omitted. FIGS. 7 and 8 differ from the configurations of Embodiments 1 and 2 in that a document input unit 90 is used instead of the document specifying unit 50.
[0067]
The document input unit 90 outputs the search results stored in the search result storage unit 40 as a list to a display device such as a display, as shown in FIG. 9. In the list of FIG. 9, the scores and document names are displayed in rank order as in FIG. 4.
The user opens documents from the list and confirms their contents. If there is no document close to the desired document, the user reads a sample of a conforming document prepared in advance into the text box at the bottom of the screen and presses the “Related word extraction” button. The document input unit 90 passes the text entered in the text box to the phrase extraction unit 60 as a conforming document.
[0068]
Alternatively, a “compatible document designation” button as shown in FIG. 10 may be provided; when the user presses this button, the user specifies a file name or the like under which the conforming document is stored, and the conforming document is thereby input.
[0069]
Next, the operation of the modified example configured as described above will be described based on the flowchart of FIG.
First, on the input screen as shown in FIG. 2, the user inputs a search request for searching the document database 30 using a keyboard or the like (step S110).
When the search request is a natural language sentence, the word dictionary is used to perform morphological analysis, divide it into words, and convert it into a search expression.
Also, when the input character string is made up of a combination of keywords representing the characteristics of the document, it is divided by a delimiter symbol, character type, etc., and converted into a search expression.
[0070]
Using the input search expression, a ranking search is performed on the document DB 30, and the scores and document IDs of a predetermined number of documents, in descending order of score, are output to the search result storage unit 40 (step S120).
The ranking search can be obtained, for example, by calculating a score using the above-described formula 2 for each document and sorting the document group in descending order of the score.
[0071]
The search results stored in the search result storage unit 40 are output as a list to a display device such as a display, as shown in FIG. 9. When the user confirms the contents of the documents from this list and finds the desired document (“Yes” in step S130), the process is terminated.
On the other hand, when the desired document is not in the list (“No” in step S130), the user either reads the conforming document into the text box as shown in FIG. 9 or presses the “compatible document designation” button as shown in FIG. 10 to read in the conforming document, and then presses the “Related word extraction” button (step S140).
[0072]
The contents of the conforming document specified by the user are taken out and morphologically analyzed. Based on the resulting part-of-speech information, independent words such as nouns, sa-hen (verbal) nouns, and unregistered words are extracted, and phrases other than those appearing in the search expression are obtained (step S150). In the morphological analysis, the text is divided into the shortest matching words registered in the word dictionary; the appearance frequency in the document is counted for each word, and a frequency table is created and temporarily stored.
Further, when a plurality of conforming documents are designated, the counted frequency within the conforming document is a total value of the frequencies within each conforming document for each word / phrase.
[0073]
Next, for the higher-ranked document group (pseudo conforming documents) in the search results stored in the search result storage unit 40, the number of documents in which each word or phrase extracted in step S150 appears is counted, and the degree of relevance is obtained by Formula 3 described above.
Alternatively, by referring to statistical information regarding words in the document DB 30, the global document frequency of the extracted word / phrase may be extracted, and the degree of association may be obtained by the above-described Expression 4.
[0074]
The phrases are sorted in descending order of the obtained relevance, a predetermined number (about the top 20 words by relevance) are selected as related words, and the selected words are added to the search expression as new search terms. The process then returns to step S130, the search is performed again, and the above operation is repeated until the document desired by the user is found (step S160). The related words are added to the original search expression with the logical operator OR.
[0075]
By configuring this modification as described above, an appropriate conforming document can be prepared in advance, so more appropriate related words can be selected and the re-executed search is more likely to find the document desired by the user.
[0076]
<Embodiment 3>
The present invention is not limited only to the above-described embodiments. Each function constituting the document retrieval apparatus of the above-described embodiment is programmed, written in advance on a recording medium such as a CD-ROM, and the CD-ROM is loaded on a medium driving apparatus such as a CD-ROM drive mounted on a computer. Needless to say, the object of the present invention is achieved by storing these programs in a memory or storage device of a computer and executing them.
In this case, the program itself read from the recording medium realizes the functions of the above-described embodiment, and the program and the recording medium recording the program also constitute the present invention.
[0077]
As the recording medium for storing the program, any of a semiconductor medium (for example, ROM, nonvolatile memory, etc.), an optical medium (for example, DVD, MO, MD, CD, etc.), and a magnetic medium (for example, magnetic tape, flexible disk, etc.) may be used.
[0078]
Further, the invention covers not only the case where the functions of the above-described embodiments are realized by executing the loaded program, but also the case where those functions are realized in cooperation with the operating system or other application programs based on the instructions of the program.
[0079]
For distribution to the market, the program may be stored and distributed on a portable recording medium, or stored in a storage device of a server computer connected via a communication network such as the Internet and transferred to another computer over that network. In this case, the storage device of the server computer is also included in the recording medium of the present invention. On the computer, the program from the portable recording medium or the transferred program is installed on a recording medium connected to the computer, and the functions of the above-described embodiments are realized by executing the installed program.
[0080]
[Effect of the invention]
As described above, according to the present invention, even when the user specifies only a small number of relevant documents (in many cases, only one), more appropriate related terms can be obtained, so the re-executed search is more likely to find the document desired by the user.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a functional configuration of a document search apparatus according to a first embodiment.
FIG. 2 is an example of a search expression input screen.
FIG. 3 is a data structure example of a document database.
FIG. 4 is a list display example of search results and an example of designation of relevant documents.
FIG. 5 is a flowchart showing the operation of the first embodiment.
FIG. 6 is a block diagram illustrating a functional configuration of a document search apparatus according to a second embodiment.
FIG. 7 is a block diagram showing a functional configuration of a modification of the first embodiment.
FIG. 8 is a block diagram showing a functional configuration of a modified example of the second embodiment.
FIG. 9 is a display example of a list of search results and an example of input of relevant documents.
FIG. 10 is a list display example of search results and an example of designation of relevant documents.
FIG. 11 is a flowchart showing the operation of a modification of the first embodiment and the second embodiment.
[Explanation of symbols]
10 ... Input unit, 20 ... Document search unit, 30 ... Document database (DB), 40 ... Search result storage unit, 50 ... Document designating unit, 60 ... Phrase extraction unit, 70 ... Related word selection unit, 80 ... Appearance frequency calculation unit, 90 ... Document input unit.

Claims

An input unit for inputting a search request; a ranking search unit that performs a ranking search of a document database storing documents for documents matching the search request and stores the search results in a search result storage unit; a document designating unit with which a user designates one conforming document from the search results stored in the search result storage unit; an extraction unit that extracts words from the designated conforming document; and a related word selection unit that calculates, for each word, its frequency within the designated document as the frequency within the conforming document, then calculates, for each word, its document frequency in the higher-ranked document group of the search results stored in the search result storage unit as the local document frequency, obtains, for each word extracted from the designated conforming document, a degree of relevance to the search request based on the frequency within the conforming document, the local document frequency, and the number of documents in the higher-ranked document group of the search results, automatically selects words with a high degree of relevance as related words, and adds the selected related words to the search request to form a new search request,
wherein the relevance is calculated by a first expression expressed in terms of the frequency Rw within the conforming document, the local document frequency Lw, and the number R of documents in the higher-ranked document group of the search results:
Relevance = (1 + log₂(Rw)) × Lw / R
A document search apparatus characterized by the above.

An input step of inputting a search request by an input unit; a ranking search step in which a ranking search unit searches a document database storing documents for documents matching the search request and stores the search results in a search result storage unit; a document designating step in which a user designates, by a document designating unit, one conforming document from the search results stored in the search result storage unit; an extraction step of extracting words from the designated conforming document by an extraction unit; and a related word selection step in which a related word selection unit calculates, for each word, its frequency within the designated document as the frequency within the conforming document, then calculates, for each word, its document frequency in the higher-ranked document group of the search results stored in the search result storage unit as the local document frequency, obtains, for each word extracted from the designated conforming document, a degree of relevance to the search request based on the frequency within the conforming document, the local document frequency, and the number of documents in the higher-ranked document group of the search results, automatically selects words with a high degree of relevance as related words, and adds the selected related words to the search request to form a new search request,
wherein the relevance is calculated by a first expression expressed in terms of the frequency Rw within the conforming document, the local document frequency Lw, and the number R of documents in the higher-ranked document group of the search results:
Relevance = (1 + log₂(Rw)) × Lw / R
A document search method characterized by the above.

An input unit for inputting a search request; a ranking search unit that searches a document database storing documents for documents matching the search request and stores the search results in a search result storage unit; a document designating unit with which a user designates one conforming document from the search results stored in the search result storage unit; an extraction unit that extracts words from the designated conforming document; and a related word selection unit that calculates, for each word, its frequency within the designated document as the frequency within the conforming document, then calculates, for each word, its document frequency in the higher-ranked document group of the search results stored in the search result storage unit as the local document frequency, obtains, for each word extracted from the designated conforming document, a degree of relevance to the search request based on the frequency within the conforming document, the local document frequency, and the number of documents in the higher-ranked document group of the search results, automatically selects words with a high degree of relevance as related words, and adds the selected related words to the search request to form a new search request,
wherein the relevance is calculated by a second expression expressed in terms of the frequency Rw within the conforming document, the local document frequency Lw, the number R of documents in the higher-ranked document group of the search results, the global document frequency Gw, and the total number N of documents to be searched:
Relevance = (1 + log₂(Rw)) × (Lw / R − Gw / N)
A document search apparatus characterized by the above.

An input step of inputting a search request by an input unit; a ranking search step in which a ranking search unit searches a document database storing documents for documents matching the search request and stores the search results in a search result storage unit; a document designating step in which a user designates, by a document designating unit, one conforming document from the search results stored in the search result storage unit; an extraction step of extracting words from the designated conforming document by an extraction unit; and a related word selection step in which a related word selection unit calculates, for each word, its frequency within the designated document as the frequency within the conforming document, then calculates, for each word, its document frequency in the higher-ranked document group of the search results stored in the search result storage unit as the local document frequency, obtains, for each word extracted from the designated conforming document, a degree of relevance to the search request based on the frequency within the conforming document, the local document frequency, and the number of documents in the higher-ranked document group of the search results, automatically selects words with a high degree of relevance as related words, and adds the selected related words to the search request to form a new search request,
wherein the relevance is calculated by a second expression expressed in terms of the frequency Rw within the conforming document, the local document frequency Lw, the number R of documents in the higher-ranked document group of the search results, the global document frequency Gw, and the total number N of documents to be searched:
Relevance = (1 + log₂(Rw)) × (Lw / R − Gw / N)
A document search method characterized by the above.

A computer program for causing a computer to function as each unit of the document search apparatus according to claim 1 or 3.

A computer-readable recording medium on which the document search program according to claim 5 is recorded.