JP2004192374A

JP2004192374A - Document search system, program and recording medium

Info

Publication number: JP2004192374A
Application number: JP2002360158A
Authority: JP
Inventors: Hiroko Mano; 博子真野; Yasutsugu Ogawa; 泰嗣小川
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2002-12-12
Filing date: 2002-12-12
Publication date: 2004-07-08
Anticipated expiration: 2022-12-12
Also published as: JP4212347B2

Abstract

PROBLEM TO BE SOLVED: To provide a document search device that retrieves the appropriate documents a user wants by selecting related terms from document groups based on various viewpoints.

SOLUTION: This document search device selects, from a document database 160 storing a plurality of documents, documents that match and documents that do not match the input keywords; selects, as related terms of the keywords, words that appear in the selected matching documents and are highly relevant to the input keywords; and searches the document database 160 again. The selected matching documents are divided into a plurality of sets, related terms are found for each set, and the documents are searched with keywords formed by adding the union of those related terms to the original keywords.

COPYRIGHT: (C)2004,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、文書検索装置、文書検索装置の機能を実行させるためのプログラムおよびそのプログラムを記録したコンピュータ読み取り可能な記録媒体に関し、より詳細には、与えられたキーワードに対して適合する文書を選択し、この適合文書から抽出したキーワードの関連語を付加したキーワードによって適合する文書を検索しなおすことにより、ユーザの所望する文書が検索できる文書検索装置、文書検索装置の機能を実行させるためのプログラムおよびそのプログラムを記録したコンピュータ読み取り可能な記録媒体に関する。
【０００２】
【従来の技術】
文書を多数集積している文書データベースからユーザの必要とする文書を探しだすには、ユーザが入力したキーワードを用いて一旦検索した後、そのキーワードに適合した文書中に出現する単語の中から入力したキーワードに関連した単語を選出し、はじめに入力したキーワードに追加し、再度、検索することで、よりユーザの求めるものに近いものを得る方法が知られている。
【０００３】
例えば、キーワードの関連語を選出する方法として、適合文書中の各単語について、適合文書の中での出現状況などの統計情報を利用して、キーワードとの関連度を算出し、その値の大きい上位何単語かを選出する方法が提案されている（非特許文献１参照）。
【０００４】
次に、この従来の関連語抽出方法について説明する。ユーザから入力されたキーワード中の各単語に対して単語の重要度に応じた重みを付与する。この単語の重みを計算する計算式には、例えば、確率モデルにもとづくRobertsonの計算式（式１）が知られている（非特許文献２参照）。この非特許文献２の技術においては、キーワード中の各単語の重みは、検索対象文書全体の中での各単語の出現状況Wp、Wqに応じて付与される。
【０００５】
W（重み）＝Wp-Wq ………（式１）
ここで
Wp＝k4+log(N/(N-n))，
Wq＝log(n/(N-n))，
N:検索対象総文書数，
n:単語の出現する文書数，
k4:調整パラメータ
【０００６】
次に、キーワード中の各単語の重みをもとに、各文書の文書適合度を計算する。この文書適合度の計算式は、例えば、次に示すような非特許文献２の計算式（式２）で求める。
【０００７】
F（適合度）＝Σ(W×tf/(k1+tf)) ………（式２）
ここで
W：（式１）で求めた単語の重み，
tf:文書あたりの単語の出現数，
k1:調整パラメータ
【０００８】
各文書の文書適合度を求め、適合度の高い順に各文書を順序づけ、上位何件かを適合文書とみなし、下位何件かを非適合文書とみなす。
適合文書の選出後、適合文書中の不要語（たとえば冠詞のａなど）を除いたすべての単語について、適合文書および非適合文書での出現状況、すなわちフィードバック情報を反映させて、それぞれの単語の重みを再計算する。
【０００９】
適合文書選出後の重みは、例えば、非特許文献２の計算式（式３）を用いて、検索対象文書全体での出現状況Wp、Wq（上記（式１）のコメント参照）と適合文書／非適合文書の中での出現状況WrとWsを比率CpとCqで足し合わせて付与される。
【００１０】
W'（重み）＝(Cp・Wp+(1-Cp)・Wr)-(Cq・Wq+(1-Cq)・Ws)……（式３）
ここで
Wr=log((r+0.5)/(R-r+0.5)),
Ws=log((s+0.5)/(S-s+0.5)),
Cp＝k5/(k5+√R),
Cq＝k6/(k6+√S),
R:適合文書数，
r:適合文書集合の中で単語の出現する文書数，
S:非適合文書数，
s:非適合文書集合の中で単語の出現する文書数，
k5,k6:調整パラメータ
【００１１】
さらに、この重みとフィードバック情報から適合文書中の不要語を除いた各単語について、キーワードとの関連度を求める。関連度の算出方法としては、例えば、Boughanemの計算式（式４）がある（非特許文献３参照）。
【００１２】
関連度＝(r/R-α・s/S)×W' ………（式４）
ここで α:調整パラメータ
【００１３】
このようにして、適合文書中の各単語について、キーワードとの関連度を求めて、関連度の高いものから順にキーワード関連語として選出し、入力したキーワードに追加して新しいキーワードを作成する。
この新しいキーワードを用いて、再度、適合文書を選出する。このとき、文書適合度の算出には、上記（式３）で求めた重みが使われる。
【００１４】
一方、特許文献１の技術は、任意の検索語に対して、異なる分類方法で分類した概念の類義語集合と、その上位概念を示す上位語を用意し、ユーザから与えられた検索語を含む類義語集合と上位語とを提示する。
これにより多様な分類方法で分類した概念の検索語を使って、検索することによって効率よく所望の情報を得ることができる。
【００１５】
また、特許文献２の技術は、ユーザから与えられた検索キーワード中の主要なキーワードとなる主要キーワードを選択し、主要キーワードとその関連語や連想語によって得た検索結果を複数の分類に分類する。この分類された分類群の内から検索キーワードに対応する分類を決定し、その分類に属する情報に対して、先に与えられた検索キーワードとその関連語および連想語によって絞り込むことによって効率的に検索することができる。
【００１６】
【特許文献１】
特開平１０−２１２６６号公報
【特許文献２】
特開２００１−５８３０号公報
【非特許文献１】
Robertson,S.E. "On term selection for query expansion,"Journal of Documentation 46,Dec 1990,p359-364
【非特許文献２】
Robertson,S.E. and Walker,S. "On relevance weights withlittle relevance information," SIGIR97,ACM Press,pp.16-24
【非特許文献３】
Walker,S.etal.,"Okapi at TREC-6:Automated adhoc,VLC,routing,filtering and QSDR,"The Sixth Text RetrievalConference(TREC-6),1996,NIST
【００１７】
【発明が解決しようとする課題】
しかしながら、上述した従来の技術では、適合文書群全体をひとつの集合とみなし、この集合全体から関連語を求めているため、ひとつの尺度からのみ関連語を選ぶことになり、以下のような問題点があった。
【００１８】
（１）適合文書とみなした文書が実際には適合文書でなかった場合に、関連語として適さないものを選出してしまうリスクが高い。
（２）適合文書とみなした文書が実際に適合文書であった場合でも、似た傾向の関連語が多く選ばれる可能性があり、同じテーマを別の観点から論じた文書を得たい等の要求には応えられない。
【００１９】
上述の特許文献１および特許文献２の技術は、あらかじめ検索語に対する関連語を定義しておき、ユーザに指定された検索語に関連語を追加して拡張した検索式を作成することができるが、観点を拡張した新たなキーワードを効果的にしかも自動的に得るというものではない。
【００２０】
本発明は、上述した実情を考慮してなされたものであり、文書群から多様な観点に基づいた関連語を選出することによって、ユーザの所望している的確な文書を検索することができる文書検索装置、文書検索装置の機能を実行させるためのプログラムおよびそのプログラムを記録したコンピュータ読み取り可能な記録媒体を提供することを目的とする。
【００２１】
【課題を解決するための手段】
上記の課題を解決するために、本発明の請求項１は、複数の文書を保持する文書データベースと、前記文書データベースから入力されたキーワードに適合する文書および適合しない文書を選出する文書ランキング部と、前記文書ランキング部で選出された適合文書中に出現する単語と前記キーワードとの関連度が高い単語を前記キーワードの関連語として選出する単語ランキング部と、前記単語ランキング部で選出した関連語を前記キーワードに追加するキーワード生成部とを備えて、前記キーワード生成部で生成された新しいキーワードに適合する文書を再度、前記文書ランキング部で検索する文書検索装置において、前記単語ランキング部は、前記文書ランキング部で選出された適合文書群を複数の集合に分割し、それぞれの集合ごとに関連語をもとめてから、それら関連語の和集合をもとのキーワードに追加する関連語とすることを特徴とする文書検索装置。
【００２２】
また、本発明の請求項２は、請求項１に記載の文書検索装置において、前記適合文書群の分割は、適合文書に対する特定の項目が共通あるいは近似の値を持つ適合文書に分割することを特徴とする。
【００２３】
また、本発明の請求項３は、請求項２に記載の文書検索装置において、前記特定の項目は、前記適合文書の書誌事項であることを特徴とする。
【００２４】
また、本発明の請求項４は、請求項２に記載の文書検索装置において、前記文書データベースの文書が特許公報である場合、前記特定の項目は、出願人であることを特徴とする。
また、本発明の請求項５は、請求項２に記載の文書検索装置において、前記文書データベースの文書が特許公報である場合、前記特定の項目は、出願日、公開日、特許登録日等の日付情報であることを特徴とする。
また、本発明の請求項６は、請求項２に記載の文書検索装置において、前記文書データベースの文書が特許公報である場合、前記特定の項目は、国際特許分類、ファセット分類、Ｆターム等の特許分類であることを特徴とする。
また、本発明の請求項７は、コンピュータに、請求項１乃至６のいずれかに記載の文書検索装置の機能を実行させるためのプログラムである。
また、本発明の請求項８は、請求項７に記載のプログラムを記録したコンピュータ読み取り可能な記録媒体である。
【００２５】
したがって、適合文書とみなした文書が、実際には適合文書でなかった場合でも、リスク分散がはかられ、関連語として適さない関連語ばかりを選出する危険性が小さくなる。
また、適合文書とみなした文書群を異なった性質を持つ複数の文書集合に分割し、その文書集合ごとに関連語を選出することで、似た傾向の関連語ばかりが選出されることを防ぐことができる。
これにより、多様な単語をキーワードの関連語として選出できるので、ユーザの所望している的確な文書を検索できる可能性が高くなる。
【００２６】
また、特許公報のような文書集合を検索する場合に、出願人、出願日・公開日・登録日等の日付情報あるいは特許分類を同じくする適合文書ごとに求めた関連語の和集合をもって関連語とするので、多様な観点から関連語を選出することができる。
【００２７】
【発明の実施の形態】
以下に、図面を参照して本発明に係る文書検索装置の好適な実施形態を説明する。
【００２８】
＜実施形態１＞
図１は、本発明に係る文書検索装置の構成を示すブロック図であり、同図において、文書検索装置は、キーワード入力部１１０、文書ランキング部１２０、単語ランキング部１３０、キーワード生成部１４０、文書出力部１５０、文書データベース１６０より構成される。
【００２９】
キーワード入力部１１０は、ユーザがキーボード等により、文書データベース１６０中にある文書の特徴をあらわすキーワードを組み合わせた文字列を入力する。
この入力された文字列は、必要に応じて、単語辞書１７０を用いて形態素解析して単語に分解する。この単語辞書１７０は、少なくとも各単語の表記、品詞等から構成される。
または、単語辞書１７０を使わず、この入力された文字列をｎ−ｇｒａｍに区切って、それを単語としてもよい。
【００３０】
文書ランキング部１２０は、キーワード入力部１１０から渡されたキーワードに対して、文書データベース１６０を検索し、適合する文書と適合しない文書とを選定する。この選定された適合文書は、単語ランキング部１３０へ渡され、関連語の候補となる単語の抽出源となる。
【００３１】
文書データベース１６０は、検索対象となる文書を保持する文書情報と、その文書中に含まれている各単語の単語統計情報から構成される（図２参照）。例えば、文書情報には、各文書に対して次のような情報が保持される。
【００３２】
文書識別子（ＩＤ）、文書名、書誌事項（作成者、作成日、発行所等）、
文書実体へのポインタ等
【００３３】
また、単語統計情報には、単語ごとに次のような統計情報を保持する。
【００３４】
単語の表記、この単語の文書データベース全体での出現頻度、
単語出現情報等
【００３５】
ここで単語出現情報には、単語が出現する文書ごとに次の情報を保持する。
【００３６】
この単語が出現する文書の文書識別子、この文書に出現する単語出現頻度、
この文書にこの単語が出現する出現位置の一覧等
【００３７】
単語ランキング部１３０は、適合文書群全体を複数の適合文書集合に分割し、分割された適合文書集合ごとに関連語候補を抽出し、これらのすべての関連語候補から関連語を選定する。
【００３８】
各適合文書の文書識別子から文書データベース１６０に格納された書誌事項を取り出し、着目した属性（例えば、作成者や作成日等）について、共通の値または近接する値を持つ文書を集めて複数の集合に分けることによって、適合文書を分割する。
このように属性の値を同じくする適合文書の集合を作成することにより、多様な観点からの関連語を選出することができる。
また、文書データベース１６０が特許公報からなる場合には、「出願人」、「出願日・公開日・登録日等の日付情報」あるいは「国際特許分類・ファセット分類・Ｆターム等の特許分類」等を属性として、適合文書をさらに適切な観点から分類することができ、より適切な関連語を抽出することが可能となる。
【００３９】
次に、分割された適合文書集合ごとに、適合文書の文書識別子から文書データベース１６０に格納されている文書を取り出し、形態素解析あるいはｎ−ｇｒａｍによって区切って、単語を抽出し、予め用意された不要語表にこの抽出した単語が登録されていれば削除し、残りの単語を関連語候補とし、入力されたキーワードとこの関連語候補との関連度を、例えば、次の（式５）で算出する。
【００４０】
関連度＝Σ_ｉ(rtf_ｉ/K+rtf_ｉ)/R-β×Σ_ｊ(stf_ｊ/K+stf_ｊ)/S ……（式５）
ここで、
R：適合文書数、
S：非適合文書数、
rtf_ｉ：適合文書の文書ｉにおける出現回数、
stf_ｊ：非適合文書の文書ｊにおける出現回数、
Kおよびβ：調整パラメータ。
また、（式５）の右辺第１項は、適合文書の各文書についての和
であり、第２項は、非適合文書の各文書についての和である。
【００４１】
次に、分割された適合文書集合ごとに抽出された、関連語候補をすべてひとまとめにした中から、所定の件数（例えば、１０個程度）の関連度の高い上位の関連語候補を関連語として選出する。このようにして選定された関連語をキーワード生成部１４０へ渡す。
キーワード生成部１４０は、これら関連語をすべて、あるいは、ユーザに提示した関連語から選択されたものを追加して生成した新しいキーワードを文書ランキング部１２０へ渡す。
文書ランキング部１２０は、キーワード生成部１４０で生成された新しいキーワードに対してもう一度適合する文書を選定し、この選定された適合文書を文書出力部１５０へ渡す。
文書出力部１５０は、文書ランキング部１２０で選出した適合文書一覧を表示装置へ表示し、ユーザが所望の文書がないときには、単語ランキング部１３０を呼び出す。また、所望の文書があった場合には、その文書をプリンタ、表示装置、記憶装置等へ出力するか、または、ネットワークを介して他のコンピュータ装置へ送信する。
【００４２】
次に、このように構成された本実施形態の文書検索装置の動作について、図３のフローチャートに基づいて説明する。
まず、キーボード等の入力装置から、例えば、英語や日本語の単語や単語の組み合わせで構成されるキーワードを文字列として入力し、必要に応じて単語辞書１７０によって形態素解析して、単語に分解する（ステップＳ１００）。
または、単語辞書１７０を使わず、この入力された文字列をｎ−ｇｒａｍに区切って、それを単語としてもよい。
これにより、キーワード入力部１１０を構成する。
【００４３】
この入力されたキーワード中のそれぞれの単語について、文書データベース１６０の単語統計情報を参照し、例えば、上記（式１）を用いて単語の重要度に応じた重みを計算する（ステップＳ１１０）。
【００４４】
次に、検索対象である文書データベース１６０中のそれぞれの文書に対して、文書データベース１６０の単語統計情報とステップＳ１１０で計算されたキーワードの単語の重みとを参照し、その文書にキーワード中の単語がどのくらい含まれているかを示す適合度を、例えば、上記（式２）を用いて計算し、文書一覧表を作成する（ステップＳ１２０）。
【００４５】
適合度をキーとして、この文書一覧表中の各文書を降順に順序付け、その上位から所定の件数（例えば、１０件程度）の文書を適合文書とみなし、下位から所定の件数（例えば、５００件程度）の文書を非適合文書とみなす（ステップＳ１３０）。
あるいは、順序づけられた文書の一覧表（適合度、文書名や書誌事項等の一覧）をユーザに提示し、適合しているかどうか指示させ、適合していると指示された文書を適合文書とし、適合しないと指示された文書を非適合文書とするようにしてもよい。
ステップＳ１１０からステップＳ１３０までにより、文書ランキング部１２０を構成する。
【００４６】
ステップＳ１３０で選出した適合文書一覧にユーザの所望した文書があるかどうかをユーザに指示させる（ステップＳ１４０）。
所望した文書がなければ、ステップＳ１５０へ進む。所望した文書があれば、ステップＳ１９０へ進む。
【００４７】
所望の文書内容等を表示装置、プリンタや記憶装置等の出力装置へ、または、ネットワークで接続された他のコンピュータ装置へ送信することによってユーザに提示される（ステップＳ１９０）。
【００４８】
ステップＳ１３０で求めた各適合文書の文書識別子から文書データベース１６０に格納された書誌事項を取り出し、着目した属性（例えば、作成者や作成日等）について、共通の値または近接する値を持つ文書を集めて複数の集合に分けることによって、適合文書を分割する（ステップＳ１５０）。
また、文書データベース１６０が特許公報からなる場合には、「出願人」、「出願日・公開日・登録日等の日付情報」あるいは「国際特許分類・ファセット分類・Ｆターム等の特許分類」等を属性とする。
【００４９】
次に、分割された適合文書集合ごとに、適合文書の文書識別子から文書データベース１６０に格納されている文書を取り出し、形態素解析あるいはｎ−ｇｒａｍによって区切って、単語を抽出し、予め用意された不要語表にこの抽出した単語が登録されていれば削除し、残りの単語を関連語候補とし、入力されたキーワードとこの関連語候補との関連度を、例えば、次の（式５）で算出する（ステップＳ１６０）。
【００５０】
次に、分割された適合文書集合ごとに抽出された、関連語候補をすべてひとまとめにした中から、所定の件数（例えば、１０個程度）の関連度の高い上位の関連語候補を関連語として選出する（ステップＳ１７０）。
ステップＳ１５０からＳ１７０で単語ランキング部１３０を構成する。
【００５１】
ステップＳ１７０で抽出された関連語、またはこの関連語の中からユーザに選択された関連語をもとのキーワードに追加して新しいキーワードを作成する（ステップＳ１８０）。
これによりキーワード生成部１４０を構成する。
【００５２】
この新しいキーワードをステップＳ１１０からステップＳ１３０（文書ランキング部１２０）の処理と同様にして、再度、適合文書を選出する。
【００５３】
本実施形態を以上のように構成すると、適合文書群全体から関連語をもとめるのでなく、適合文書群を複数の集合に分割し、それぞれの集合ごとに関連語をもとめてから、それら関連語の和集合をもって関連語とするので、適合文書とみなした文書が、実際には適合文書でなかった場合でも、リスク分散がはかられ、関連語として適さない関連語ばかりを選出する危険性が小さくなる。
また、適合文書とみなした文書群を異なった性質を持つ複数の文書集合に分割し、その文書集合ごとに関連語を選出することで、似た傾向の関連語ばかりが選出されることを防ぐことができる。
これにより、多様な単語をキーワードの関連語として選出できるので、ユーザの所望している的確な文書を検索できる可能性が高くなる。
【００５４】
また、特許公報のような文書集合を検索する場合に、出願人、出願日・公開日・登録日等の日付情報あるいは特許分類を同じくする適合文書ごとに求めた関連語の和集合をもって関連語とするので、多様な観点から関連語を選出することができる。
【００５５】
＜実施形態２＞
さらに、本発明は上記の実施の形態のみに限定されたものではない。例えば、図１に示した文書検索装置は、図４のようなハードウェア構成を持つコンピュータ装置２００によっても実現が可能である。
即ち、コンピュータ装置２００は、キーボード、マウス、タッチパネル、スキャナ等により構成され、情報の入力に使用される入力装置１と、種々の出力情報や入力装置１からの入力された情報などを表示出力させる表示装置２と、種々のプログラムを動作させるＣＰＵ（Central Processing Unit；中央処理ユニット）３と、プログラム自身を保持し、またそのプログラムがＣＰＵ３によって実行されるときに一時的に作成される情報等を保持するメモリ４と、本発明の文書検索装置で扱う文書データベース１６０、単語辞書１７０およびプログラムやプログラム実行時の一時的な情報等を保持する記憶装置５と、プログラムやデータ等を記憶した記録媒体を装着してそれらを読み込み、メモリ４または記憶装置５へ格納するのに用いられる媒体駆動装置６と、ネットワーク９へ接続するためのインタフェースであるネットワーク接続装置７とから構成され、それらはバス８で接続されている。
また、ネットワーク９は、コンピュータ装置２００と他のコンピュータ装置２００とを結合するための伝送路であって、一般には、ケーブルで実現され、通信プロトコルにはＴＣＰ／ＩＰが使われる。但し、伝送路としてはケーブルだけではなく、それらの間の通信プロトコルが一致するものであれば無線、有線のいずれでもよく、例えば、ＬＡＮ（Local Area Network）、ＷＡＮ（Wide Area Network）、インターネットなどを用いることができる。
【００５６】
このようなコンピュータ装置２００の構成において、上述した実施形態の文書検索装置を構成する各機能をそれぞれプログラム化し、予めＣＤ−ＲＯＭ等の記録媒体に書き込んでおき、コンピュータに搭載したＣＤ−ＲＯＭドライブのような媒体駆動装置６にこのＣＤ−ＲＯＭ等を装着して、これらのプログラムをコンピュータのメモリ４あるいは記憶装置５に格納し、それを実行することによって、本発明の目的が達成されることは言うまでもない。
この場合、記録媒体から読み出されたプログラム自体が上述した実施形態の機能を実現することになり、そのプログラムおよびそのプログラムを記録した記録媒体も本発明を構成することになる。
【００５７】
尚、プログラムを格納する記録媒体としては半導体媒体（例えば、ＲＯＭ、不揮発性メモリ等）、光媒体（例えば、ＤＶＤ、ＭＯ、ＭＤ、ＣＤ等）、磁気媒体（例えば、磁気テープ、フレキシブルディスク等）等のいずれであってもよい。
【００５８】
また、コンピュータ装置２００のメモリ４へロードしたプログラムを実行することにより上述した実施形態の機能が実現されるだけでなく、そのプログラムの指示に基づき、オペレーティングシステムあるいは他のアプリケーションプログラム等と共同して処理することによって上述した実施形態の機能が実現される場合も含まれる。
【００５９】
市場に流通させる場合には、可搬型の記録媒体にプログラムを格納して流通させたり、インターネット等の通信網を介して接続されたサーバコンピュータの記憶装置に格納しておき、通信網を通じて他のコンピュータに転送することもできる。この場合、このサーバコンピュータの記憶装置も本発明の記録媒体に含まれる。なお、コンピュータでは、可搬型の記録媒体上のプログラム、または転送されてくるプログラムを、コンピュータに接続した記憶装置にインストールし、そのインストールされたプログラムを実行することによって上述した実施形態の機能が実現される。
【００６０】
＜ネットワーク環境での運用＞
図５は、本発明を有線または無線の通信ネットワークに接続して運用する形態の構成を示している。
例えば、文書検索プログラムを保持するサーバ３００と複数のユーザが利用する端末３１０とをネットワーク９で接続する。
この場合、サーバ３００およびユーザの端末３１０は、図４に示した汎用のコンピュータ装置２００で構成される。
ユーザは、端末３１０からサーバ３００に対してログインし、文書検索のためのキーワードを入力装置を用いて入力し、ネットワーク９を介してサーバ３００の文書検索プログラムへ検索の実行を依頼する。
サーバ３００の文書検索プログラムは、ネットワーク９を介して、指定されたキーワードに適合した検索結果や途中経過を要求元の端末３１０へ戻す。ユーザの端末３１０は、この検索結果や途中経過を出力装置へ出力する。途中経過の出力の時には、その経過如何によっては、サーバ３００への指示も行う。
このように文書検索プログラムをサーバ３００におくことによって、ユーザは常に最新の文書検索プログラムを使えるという利点がある。
【００６１】
【発明の効果】
以上説明したように本発明によれば、適合文書とみなした文書が、実際には適合文書でなかった場合でも、リスク分散がはかられ、関連語として適さない関連語ばかりを選出する危険性が小さくなる。
また、適合文書とみなした文書群を異なった性質を持つ複数の文書集合に分割し、その文書集合ごとに関連語を選出することで、似た傾向の関連語ばかりが選出されることを防ぐことができる。
これにより、多様な単語をキーワードの関連語として選出できるので、ユーザの所望している的確な文書を検索できる可能性が高くなる。
【００６２】
また、特許公報のような文書集合を検索する場合に、出願人、出願日・公開日・登録日等の日付情報あるいは特許分類を同じくする適合文書ごとに求めた関連語の和集合をもって関連語とするので、多様な観点から関連語を選出することができる。
【図面の簡単な説明】
【図１】本発明に係る文書検索装置の構成を示すブロック図である。
【図２】文書データベースのデータ構造を説明するための図である。
【図３】本発明に係る文書検索装置の処理の流れを説明するためのフローチャートである。
【図４】本発明に係る文書検索装置をコンピュータで実現するときのハードウェアの構成を示す図である。
【図５】本発明に係る文書検索装置をネットワーク環境で運用する場合を説明するための図である。
【符号の説明】
１…入力装置、２…表示装置、３…ＣＰＵ、４…メモリ、５…記憶装置、６…媒体駆動装置、７…ネットワーク接続装置、８…バス、９…ネットワーク、１１０…キーワード入力部、１２０…文書ランキング部、１３０…単語ランキング部、１４０…キーワード生成部、１５０…文書出力部、１６０…文書データベース、１７０…単語辞書、２００…コンピュータ装置、３００…サーバ、３１０…端末。
[0001]
[Technical field of the invention]
The present invention relates to a document search device, a program for executing the functions of the document search device, and a computer-readable recording medium on which the program is recorded. More specifically, it relates to a document search device that selects documents matching a given keyword and then searches again with a keyword to which related words of the keyword, extracted from those matching documents, have been added, so that the documents the user wants can be retrieved, and to a program for executing the functions of that device and a computer-readable recording medium on which the program is recorded.
[0002]
[Prior art]
To find the documents a user needs in a document database holding a large number of documents, a known method first searches with the keyword entered by the user, then selects, from the words appearing in the documents that matched, words related to the entered keyword, adds them to the original keyword, and searches again, obtaining results closer to what the user wants.
[0003]
For example, as a method of selecting related words of a keyword, it has been proposed to calculate, for each word in the relevant documents, its degree of relevance to the keyword using statistical information such as its appearance status in the relevant documents, and to select the several words with the highest values (see Non-Patent Document 1).
[0004]
Next, this conventional related-word extraction method will be described. Each word in the keyword input by the user is given a weight according to its importance. One known formula for this weight is Robertson's formula (Equation 1), which is based on a probabilistic model (see Non-Patent Document 2). In the technique of Non-Patent Document 2, the weight of each word in the keyword is assigned according to the appearance status Wp and Wq of the word across all the documents to be searched.
[0005]
W (weight) = Wp - Wq   (Equation 1)
where
Wp = k4 + log(N / (N - n)),
Wq = log(n / (N - n)),
N: total number of documents to be searched,
n: number of documents in which the word appears,
k4: adjustment parameter
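As an illustration only, (Equation 1) might be sketched in Python as follows. The function name term_weight, the use of the natural logarithm, and the default k4 = 0 are assumptions; the source does not specify a logarithm base or parameter values.

```python
import math


def term_weight(N: int, n: int, k4: float = 0.0) -> float:
    """Weight W = Wp - Wq of a keyword word, following (Equation 1).

    N  : total number of documents to be searched
    n  : number of documents in which the word appears
    k4 : adjustment parameter (default 0.0 is an assumption)
    """
    if not 0 < n < N:
        # log(N / (N - n)) and log(n / (N - n)) are undefined otherwise
        raise ValueError("expected 0 < n < N")
    wp = k4 + math.log(N / (N - n))  # natural log is an assumption
    wq = math.log(n / (N - n))
    return wp - wq
```

Because Wp - Wq simplifies algebraically to k4 + log(N / n), a word that appears in few documents receives a larger weight than one that appears in many.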
[0006]
Next, the document relevance of each document is calculated from the weights of the words in the keyword, for example with the following formula (Equation 2) from Non-Patent Document 2.
[0007]
F (relevance) = Σ (W × tf / (k1 + tf))   (Equation 2)
where
W: weight of the word obtained by (Equation 1),
tf: number of occurrences of the word in the document,
k1: adjustment parameter
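A minimal sketch of (Equation 2), assuming that the sum runs over the words of the keyword and that word weights and per-document occurrence counts are available as dictionaries. The name document_relevance, the dictionary layout, and the default k1 = 1.0 are hypothetical.

```python
def document_relevance(
    query_weights: dict[str, float],
    term_freqs: dict[str, int],
    k1: float = 1.0,
) -> float:
    """Relevance F of one document, following (Equation 2).

    query_weights : word -> weight W from (Equation 1)
    term_freqs    : word -> occurrences of the word in this document
    k1            : adjustment parameter, must be > 0 (default is an assumption)
    """
    score = 0.0
    for word, weight in query_weights.items():
        tf = term_freqs.get(word, 0)  # words absent from the document add 0
        score += weight * tf / (k1 + tf)
    return score
```

The factor tf / (k1 + tf) grows with tf but stays below 1, so repeated occurrences of one word contribute less and less to the score.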
[0008]
The document relevance of each document is computed, the documents are ordered from highest to lowest relevance, a number of the top-ranked documents are regarded as relevant documents, and a number of the bottom-ranked documents are regarded as non-relevant documents.
After the relevant documents are selected, the weight of every word in the relevant documents except stop words (for example, the article “a”) is recalculated to reflect its appearance status in the relevant and non-relevant documents, that is, the feedback information.
[0009]
The weight after selection of the relevant documents is assigned, for example, using formula (Equation 3) from Non-Patent Document 2, by combining the appearance status Wp and Wq over all the documents to be searched (see the definitions under (Equation 1) above) with the appearance status Wr and Ws within the relevant and non-relevant documents, mixed in the ratios Cp and Cq.
[0010]
W' (weight) = (Cp · Wp + (1 - Cp) · Wr) - (Cq · Wq + (1 - Cq) · Ws)   (Equation 3)
where
Wr = log((r + 0.5) / (R - r + 0.5)),
Ws = log((s + 0.5) / (S - s + 0.5)),
Cp = k5 / (k5 + √R),
Cq = k6 / (k6 + √S),
R: number of relevant documents,
r: number of documents in the relevant document set in which the word appears,
S: number of non-relevant documents,
s: number of documents in the non-relevant document set in which the word appears,
k5, k6: adjustment parameters
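The recalculated weight of (Equation 3) could be sketched as below. The function name feedback_weight, the natural logarithm, and the default values of k4, k5, and k6 are assumptions made for illustration.

```python
import math


def feedback_weight(
    N: int, n: int, R: int, r: int, S: int, s: int,
    k4: float = 0.0, k5: float = 1.0, k6: float = 1.0,
) -> float:
    """Weight W' after relevance feedback, following (Equation 3).

    N, n : collection size and document frequency of the word
    R, r : relevant documents, and those among them containing the word
    S, s : non-relevant documents, and those among them containing the word
    k4, k5, k6 : adjustment parameters (defaults are assumptions)
    """
    wp = k4 + math.log(N / (N - n))
    wq = math.log(n / (N - n))
    wr = math.log((r + 0.5) / (R - r + 0.5))
    ws = math.log((s + 0.5) / (S - s + 0.5))
    cp = k5 / (k5 + math.sqrt(R))  # share given to collection statistics
    cq = k6 / (k6 + math.sqrt(S))
    return (cp * wp + (1 - cp) * wr) - (cq * wq + (1 - cq) * ws)
```

With R = S = 0 the mixing ratios Cp and Cq both equal 1, so W' reduces to Wp - Wq of (Equation 1). As R and S grow, the feedback terms Wr and Ws take over.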
[0011]
Further, from this weight and the feedback information, the degree of relevance to the keyword is determined for each word in the relevant documents, excluding stop words. One way to calculate the degree of relevance is, for example, Boughanem's formula (Equation 4) (see Non-Patent Document 3).
[0012]
Relevance = (r / R - α · s / S) × W'   (Equation 4)
where α: adjustment parameter
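For illustration, (Equation 4) translates almost directly into code. The name related_word_score and the default α = 1.0 are hypothetical; r, R, s, S, and W' are as defined under (Equation 3).

```python
def related_word_score(
    r: int, R: int, s: int, S: int, w_prime: float, alpha: float = 1.0
) -> float:
    """Degree of relevance of a candidate word, following (Equation 4).

    r / R   : fraction of relevant documents containing the word
    s / S   : fraction of non-relevant documents containing the word
    w_prime : weight W' from (Equation 3)
    alpha   : adjustment parameter (default 1.0 is an assumption)
    """
    if R <= 0 or S <= 0:
        raise ValueError("R and S must be positive")
    return (r / R - alpha * s / S) * w_prime
```

A word found in half of the relevant documents and a tenth of the non-relevant ones, with W' = 2.0 and α = 1.0, scores (0.5 - 0.1) × 2.0 = 0.8.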
[0013]
In this way, the degree of relevance to the keyword is obtained for each word in the relevant documents, words are selected as related words of the keyword in descending order of relevance, and they are added to the input keyword to create a new keyword.
Relevant documents are then selected again using this new keyword. At this time, the weight obtained by (Equation 3) above is used to calculate the document relevance.
[0014]
On the other hand, the technique of Patent Document 1 prepares, for an arbitrary search term, synonym sets of concepts classified by different classification methods together with broader terms indicating their superordinate concepts, and presents the synonym sets and broader terms that contain the search term given by the user.
As a result, the desired information can be obtained efficiently by searching with search terms for concepts classified by a variety of classification methods.
[0015]
Further, the technique of Patent Document 2 selects the main keyword from among the search keywords given by the user and classifies the search results obtained with the main keyword and its related and associated words into a plurality of categories. The category corresponding to the search keywords is then determined from among these categories, and the information belonging to that category is narrowed down using the previously given search keywords and their related and associated words, enabling an efficient search.
[0016]
[Patent Document 1]
JP-A-10-21266
[Patent Document 2]
JP 2001-5830 A
[Non-Patent Document 1]
Robertson, S.E., "On term selection for query expansion," Journal of Documentation 46, Dec 1990, pp. 359-364
[Non-Patent Document 2]
Robertson, S.E. and Walker, S., "On relevance weights with little relevance information," SIGIR97, ACM Press, pp. 16-24
[Non-Patent Document 3]
Walker, S. et al., "Okapi at TREC-6: Automated ad hoc, VLC, routing, filtering and QSDR," The Sixth Text Retrieval Conference (TREC-6), 1996, NIST
[0017]
[Problems to be solved by the invention]
However, in the conventional technology described above, the whole group of relevant documents is regarded as a single set and the related words are obtained from that entire set, so related words are chosen by only one measure. This causes the following problems.
[0018]
(1) When a document regarded as relevant is not actually relevant, there is a high risk of selecting unsuitable related words.
(2) Even when a document regarded as relevant is actually relevant, many related words of a similar tendency may be selected, so requests such as obtaining documents that discuss the same theme from a different viewpoint cannot be met.
[0019]
The techniques of Patent Document 1 and Patent Document 2 described above define related words for search terms in advance and can create a search expression expanded by adding related words to the search term specified by the user. However, they do not effectively and automatically obtain new keywords with a broadened viewpoint.
[0020]
The present invention has been made in view of the above circumstances, and its object is to provide a document search device that can retrieve the documents the user actually wants by selecting related words based on diverse viewpoints from a group of documents, a program for executing the functions of the document search device, and a computer-readable recording medium on which the program is recorded.
[0021]
[Means for Solving the Problems]
In order to solve the above problem, a first aspect of the present invention provides a document search device comprising: a document database that holds a plurality of documents; a document ranking unit that selects, from the document database, documents that match an input keyword and documents that do not; a word ranking unit that selects, as related words of the keyword, words that appear in the relevant documents selected by the document ranking unit and have a high degree of relevance to the keyword; and a keyword generation unit that adds the related words selected by the word ranking unit to the keyword, the document ranking unit searching again for documents that match the new keyword generated by the keyword generation unit, wherein the word ranking unit divides the group of relevant documents selected by the document ranking unit into a plurality of sets, obtains related words for each set, and uses the union of those related words as the related words to be added to the original keyword.
[0022]
According to a second aspect of the present invention, in the document search device according to the first aspect, the group of relevant documents is divided into sets of relevant documents that share a common or similar value for a specific item of the relevant documents.
[0023]
According to a third aspect of the present invention, in the document search device according to the second aspect, the specific item is a bibliographic item of the relevant document.
[0024]
According to a fourth aspect of the present invention, in the document search device according to the second aspect, when the document in the document database is a patent gazette, the specific item is an applicant.
According to a fifth aspect of the present invention, in the document search device according to the second aspect, when the documents in the document database are patent gazettes, the specific item is date information such as an application date, a publication date, or a patent registration date.
According to a sixth aspect of the present invention, in the document search device according to the second aspect, when the documents in the document database are patent gazettes, the specific item is a patent classification such as the International Patent Classification, facet classification, or F-term.
A seventh aspect of the present invention is a program for causing a computer to execute the functions of the document search device according to any one of the first to sixth aspects.
An eighth aspect of the present invention is a computer-readable recording medium on which the program according to the seventh aspect is recorded.
[0025]
Therefore, even if a document regarded as relevant is not actually relevant, the risk is spread, and the danger of selecting only unsuitable related words is reduced.
In addition, dividing the group of documents regarded as relevant into a plurality of document sets with different properties and selecting related words for each set prevents only related words of a similar tendency from being selected.
As a result, diverse words can be selected as related words of the keyword, which raises the likelihood of retrieving the documents the user actually wants.
[0026]
When searching a document collection such as patent gazettes, the related words are the union of the related words obtained for each group of relevant documents sharing the same applicant, the same date information (such as filing date, publication date, or registration date), or the same patent classification, so related words can be selected from various viewpoints.
[0027]
[Best mode for carrying out the invention]
Hereinafter, a preferred embodiment of a document search device according to the present invention will be described with reference to the drawings.
[0028]
<First embodiment>
FIG. 1 is a block diagram showing the configuration of a document search device according to the present invention. In the figure, the document search device comprises a keyword input unit 110, a document ranking unit 120, a word ranking unit 130, a keyword generation unit 140, a document output unit 150, and a document database 160.
[0029]
The keyword input unit 110 allows a user to input, using a keyboard or the like, a character string obtained by combining keywords representing the characteristics of documents in the document database 160.
As necessary, the input character string is morphologically analyzed using the word dictionary 170 and decomposed into words. The word dictionary 170 contains at least the surface form and the part of speech of each word.
Alternatively, instead of using the word dictionary 170, the input character string may be divided into n-grams, each of which is treated as a word.
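The n-gram alternative mentioned above can be sketched as follows. The source does not say how the n-grams are formed, so overlapping character n-grams, whitespace removal, and the default n = 2 are all assumptions, as is the function name char_ngrams.

```python
def char_ngrams(text: str, n: int = 2) -> list[str]:
    """Split a keyword string into overlapping character n-grams.

    Whitespace is dropped first; a string shorter than n is returned as a
    single unit. Both choices are assumptions made for this sketch.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    compact = "".join(text.split())
    if len(compact) < n:
        return [compact] if compact else []
    return [compact[i:i + n] for i in range(len(compact) - n + 1)]
```

For the four-character input 文書検索 with n = 2 this yields the three units 文書, 書検, and 検索, so no dictionary is needed to obtain searchable units.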
[0030]
The document ranking unit 120 searches the document database 160 with the keyword passed from the keyword input unit 110 and selects relevant and non-relevant documents. The selected relevant documents are passed to the word ranking unit 130, where they serve as the source from which related-word candidates are extracted.
[0031]
The document database 160 includes document information holding a document to be searched, and word statistical information of each word included in the document (see FIG. 2). For example, the document information holds the following information for each document.
[0032]
Document identifier (ID), document name, bibliographic information (creator, creation date, publishing office, etc.),
Pointer to document entity, etc.
[0033]
The word statistical information holds the following statistical information for each word.
[0034]
Word representation, frequency of occurrence of this word in the entire document database,
Word appearance information, etc.
[0035]
Here, the word appearance information holds the following information for each document in which the word appears.
[0036]
The document identifier of the document in which this word appears, the frequency of word occurrences in this document,
List of locations where this word appears in this document
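The data held by the document database 160, as listed in paragraphs [0032] to [0036], might be modeled as below. All class and field names are hypothetical, and the in-memory dictionary layout is only one possible arrangement; FIG. 2 is not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentInfo:
    doc_id: str                    # document identifier (ID)
    name: str                      # document name
    bibliographic: dict[str, str]  # creator, creation date, publisher, ...
    body_ref: str                  # pointer to the document entity


@dataclass
class Posting:
    doc_id: str           # document in which the word appears
    frequency: int        # occurrences of the word in that document
    positions: list[int]  # positions of the word in that document


@dataclass
class WordStats:
    surface: str              # word representation
    total_frequency: int = 0  # occurrences in the whole database
    postings: dict[str, Posting] = field(default_factory=dict)


def index_document(stats: dict[str, WordStats], doc_id: str, words: list[str]) -> None:
    """Add the words of one document to the word statistical information."""
    for position, word in enumerate(words):
        entry = stats.setdefault(word, WordStats(surface=word))
        entry.total_frequency += 1
        posting = entry.postings.setdefault(doc_id, Posting(doc_id, 0, []))
        posting.frequency += 1
        posting.positions.append(position)
```

With this layout, the document frequency n used in (Equation 1) is the number of postings of a word, and tf in (Equation 2) is the frequency stored in one posting.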
[0037]
The word ranking unit 130 divides the whole group of relevant documents into a plurality of relevant-document sets, extracts related-word candidates for each of the divided sets, and selects the related words from all of these candidates.
[0038]
The bibliographic items stored in the document database 160 are retrieved using the document identifier of each relevant document, and the relevant documents are divided by grouping documents that have a common or close value for the attribute of interest (for example, creator or creation date) into a plurality of sets.
Creating sets of relevant documents that share an attribute value in this way makes it possible to select related words from various viewpoints.
When the document database 160 consists of patent gazettes, using attributes such as “applicant”, “date information such as filing date, publication date, or registration date”, or “patent classification such as International Patent Classification, facet classification, or F-term” allows the relevant documents to be classified from more appropriate viewpoints and more appropriate related words to be extracted.
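The division of the relevant documents by attribute could look like the following sketch. The dictionary layout of a document, the attribute names applicant and filing_date, and the bucket parameter (used here to treat dates in the same year as "close" values) are illustrative assumptions; the source does not define how closeness is measured.

```python
from collections import defaultdict
from typing import Callable


def partition_by_attribute(
    relevant_docs: list[dict],
    attribute: str,
    bucket: Callable[[str], str] = lambda value: value,
) -> dict[str, list[str]]:
    """Group relevant documents whose attribute values fall in the same bucket.

    With the default bucket, only identical values are grouped; a coarser
    bucket function groups nearby values together.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for doc in relevant_docs:
        value = doc["bibliographic"].get(attribute, "")
        groups[bucket(value)].append(doc["doc_id"])
    return dict(groups)


# Hypothetical relevant documents taken from a patent gazette database.
docs = [
    {"doc_id": "d1", "bibliographic": {"applicant": "A", "filing_date": "2001-03-01"}},
    {"doc_id": "d2", "bibliographic": {"applicant": "B", "filing_date": "2001-11-20"}},
    {"doc_id": "d3", "bibliographic": {"applicant": "A", "filing_date": "2002-05-07"}},
]

by_applicant = partition_by_attribute(docs, "applicant")
by_year = partition_by_attribute(docs, "filing_date", bucket=lambda d: d[:4])
```

Partitioning the same three documents by applicant and by filing year gives different groupings, which is what lets each attribute contribute its own viewpoint.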
[0039]
Next, for each of the divided relevant-document sets, the documents stored in the document database 160 are retrieved using the document identifiers of the relevant documents and segmented by morphological analysis or into n-grams to extract words. Any extracted word registered in a stop-word table prepared in advance is deleted, the remaining words become related-word candidates, and the degree of relevance between the input keyword and each related-word candidate is calculated, for example, by the following (Equation 5).
[0040]
Relevance = Σ_i (rtf_i / (K + rtf_i)) / R - β × Σ_j (stf_j / (K + stf_j)) / S   (Equation 5)
where
R: number of relevant documents,
S: number of non-relevant documents,
rtf_i: number of occurrences of the word in relevant document i,
stf_j: number of occurrences of the word in non-relevant document j,
K and β: adjustment parameters.
The first term on the right side of (Equation 5) is a sum over the relevant documents, and the second term is a sum over the non-relevant documents.
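A minimal sketch of (Equation 5), reading each summand as tf / (K + tf) by analogy with (Equation 2). The function name relevance_eq5, the list-based inputs, and the defaults K = 1.0 and β = 1.0 are assumptions, as is returning 0 for an empty document set.

```python
def relevance_eq5(
    rtf: list[int], stf: list[int], K: float = 1.0, beta: float = 1.0
) -> float:
    """Degree of relevance of one candidate word, following (Equation 5).

    rtf : occurrences of the word in each relevant document (length R)
    stf : occurrences of the word in each non-relevant document (length S)
    K, beta : adjustment parameters (defaults are assumptions)
    """
    R, S = len(rtf), len(stf)
    # Average saturated frequency over the relevant documents
    positive = sum(tf / (K + tf) for tf in rtf) / R if R else 0.0
    # Average saturated frequency over the non-relevant documents
    negative = sum(tf / (K + tf) for tf in stf) / S if S else 0.0
    return positive - beta * negative
```

For a word occurring once in each of two relevant documents and once in one of four non-relevant documents, with K = β = 1, the result is 0.5 - 0.125 = 0.375.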
[0041]
Next, all the related-word candidates extracted for each of the divided relevant-document sets are pooled, and a predetermined number (for example, about 10) of the candidates with the highest degree of relevance are selected as related words. The related words selected in this way are passed to the keyword generation unit 140.
The keyword generation unit 140 generates a new keyword by adding either all of these related words or those the user selects from the related words presented, and passes the new keyword to the document ranking unit 120.
The document ranking unit 120 again selects documents that match the new keyword generated by the keyword generation unit 140 and passes the selected relevant documents to the document output unit 150.
The document output unit 150 displays the list of relevant documents selected by the document ranking unit 120 on the display device and, when the list does not contain the document the user wants, calls the word ranking unit 130. When the desired document is found, it is output to a printer, a display device, a storage device, or the like, or transmitted to another computer device via a network.
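The pooling of related-word candidates described in paragraph [0041] might be sketched as follows. The source does not say how to treat a word that is a candidate in more than one set, so keeping its highest score is an assumption of this sketch, as are the function name and the alphabetical tie-break.

```python
def select_related_words(
    per_set_scores: list[dict[str, float]], top_n: int = 10
) -> list[str]:
    """Pool candidates from all relevant-document sets and keep the top N.

    per_set_scores : one dict per set, mapping candidate word -> relevance
    top_n          : number of related words to select (about 10 in the text)
    """
    pooled: dict[str, float] = {}
    for scores in per_set_scores:
        for word, score in scores.items():
            # Assumption: a word seen in several sets keeps its best score.
            if score > pooled.get(word, float("-inf")):
                pooled[word] = score
    ranked = sorted(pooled.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_n]]


def expand_keyword(original: list[str], related: list[str]) -> list[str]:
    """Append related words not already present in the original keyword."""
    return original + [word for word in related if word not in original]
```

Because each set is scored separately before pooling, a word that stands out in only one set can still reach the final list.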
[0042]
Next, the operation of the thus configured document search device of the present embodiment will be described with reference to the flowchart of FIG.
First, a keyword composed of, for example, English or Japanese words or a combination of words is input as a character string from an input device such as a keyboard, and is subjected to morphological analysis by the word dictionary 170 as necessary to be decomposed into words. (Step S100).
Alternatively, instead of using the word dictionary 170, the input character string may be divided into n-grams and used as words.
These operations constitute the keyword input unit 110.
[0043]
For each word in the input keyword, the word statistical information of the document database 160 is referred to, and for example, a weight corresponding to the importance of the word is calculated using (Equation 1) (step S110).
[0044]
Next, for each document in the document database 160 to be searched, the word statistical information of the document database 160 and the keyword word weights calculated in step S110 are consulted, a relevance score indicating how much of the keyword's words the document contains is calculated using, for example, (Equation 2) above, and a document list is created (step S120).
[0045]
Using the degree of relevance as a key, the documents in the document list are sorted in descending order; a predetermined number of documents from the top (for example, about 10) are regarded as conforming documents, and a predetermined number of documents (for example, 500) are regarded as non-conforming documents (step S130).
Alternatively, the ordered document list (a list of degrees of relevance, document names, bibliographic items, and the like) may be presented to the user, and the user asked to indicate whether each document is conforming. A document designated as non-conforming may then be treated as a non-conforming document.
Steps S110 to S130 constitute the document ranking unit 120.
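Steps S110 to S130 can be pictured with the minimal Python sketch below. Because (Equation 1) and (Equation 2) are not reproduced in this part of the text, the sketch substitutes a smoothed inverse-document-frequency weight and a weighted term-frequency sum; these substitutes, the function names, and the representation of the database as a dictionary of word lists are all assumptions made for illustration. The selection of non-conforming documents is omitted.

```python
import math
from collections import Counter


def word_weights(keyword_words, docs):
    """S110: weight each keyword word (stand-in for Equation 1: smoothed IDF)."""
    n_docs = len(docs)
    weights = {}
    for w in set(keyword_words):
        df = sum(1 for words in docs.values() if w in words)
        weights[w] = math.log((n_docs + 1) / (df + 0.5))
    return weights


def rank_documents(keyword_words, docs, top_n=10):
    """S120-S130: score each document, sort in descending order of score,
    and regard the top_n documents as conforming."""
    weights = word_weights(keyword_words, docs)
    scored = []
    for doc_id, words in docs.items():
        tf = Counter(words)
        # Stand-in for Equation 2: sum of weight x term frequency.
        score = sum(weight * tf[w] for w, weight in weights.items())
        scored.append((score, doc_id))
    # Ties are broken by document identifier so the order is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    ranked = [doc_id for _, doc_id in scored]
    return ranked[:top_n], ranked


docs = {
    "d1": ["search", "document", "search"],
    "d2": ["document", "printer"],
    "d3": ["network", "printer"],
}
conforming, ranked = rank_documents(["search"], docs, top_n=1)
print(conforming)  # ['d1']
```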
[0046]
The user is asked to indicate whether the list of conforming documents selected in step S130 contains a desired document (step S140).
If there is no desired document, the process proceeds to step S150. If there is a desired document, the process proceeds to step S190.
[0047]
The content of the desired document or the like is presented to the user by sending it to an output device such as a display device, a printer, or a storage device, or to another computer device connected via a network (step S190).
[0048]
The bibliographic items stored in the document database 160 are retrieved using the document identifier of each conforming document obtained in step S130, and the conforming documents are divided into a plurality of sets by grouping together documents that have a common or close value for the attribute of interest (for example, creator or creation date) (step S150).
When the document database 160 consists of patent gazettes, the attribute may be, for example, the applicant; date information such as the filing date, publication date, or registration date; or a patent classification such as the International Patent Classification, facet classification, or F-term.
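The division in step S150 might look like the following Python sketch. The function name `split_by_attribute`, the dictionary layout of the bibliographic items, and the optional `key` function (used here to treat dates in the same year as "close" values) are hypothetical choices made for this illustration only.

```python
from collections import defaultdict


def split_by_attribute(conforming_ids, bibliography, attribute, key=lambda v: v):
    """S150: divide conforming documents into sets sharing a common (or close)
    value of one bibliographic attribute. `key` maps a raw value to the value
    used for grouping, e.g. a date to its year."""
    groups = defaultdict(list)
    for doc_id in conforming_ids:
        groups[key(bibliography[doc_id][attribute])].append(doc_id)
    return dict(groups)


bibliography = {
    "d1": {"applicant": "A", "filing_date": "2001-03-01"},
    "d2": {"applicant": "B", "filing_date": "2001-11-20"},
    "d3": {"applicant": "A", "filing_date": "2002-05-05"},
}
ids = ["d1", "d2", "d3"]
print(split_by_attribute(ids, bibliography, "applicant"))
# {'A': ['d1', 'd3'], 'B': ['d2']}
print(split_by_attribute(ids, bibliography, "filing_date", key=lambda d: d[:4]))
# {'2001': ['d1', 'd2'], '2002': ['d3']}
```

Grouping by a different attribute yields a different partition of the same conforming documents, which is what allows related words to be drawn from different viewpoints.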
[0049]
Next, for each of the divided conforming document sets, the documents stored in the document database 160 are retrieved using the document identifiers of the conforming documents, and words are extracted from them by morphological analysis or n-grams. Any extracted word that is registered in an unnecessary-word table prepared in advance is deleted, the remaining words are taken as related word candidates, and the degree of relevance between the input keyword and each related word candidate is calculated using, for example, (Equation 5) (step S160).
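As a rough illustration of step S160, the Python sketch below collects the words of one conforming document set, removes unnecessary words, and scores the rest. (Equation 5) is not reproduced in this part of the text, so the score used here, the fraction of documents in the set that contain the word, is only a stand-in chosen for the sketch; the function name and data layout are likewise assumptions.

```python
def related_word_candidates(doc_set, docs, stop_words):
    """S160: extract candidate words from one conforming document set,
    delete unnecessary (stop) words, and score the remaining words."""
    doc_freq = {}
    for doc_id in doc_set:
        # Count each word at most once per document.
        for w in set(docs[doc_id]) - stop_words:
            doc_freq[w] = doc_freq.get(w, 0) + 1
    # Stand-in for Equation 5: share of documents in the set containing the word.
    return {w: count / len(doc_set) for w, count in doc_freq.items()}


docs = {
    "d1": ["the", "ranking", "query"],
    "d3": ["the", "ranking", "index"],
}
scores = related_word_candidates(["d1", "d3"], docs, stop_words={"the"})
print(scores["ranking"])  # 1.0
```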
[0050]
Next, for each of the divided conforming document sets, a predetermined number (for example, about 10) of related word candidates with a high degree of relevance are selected as related words from among all the related word candidates extracted (step S170).
Steps S150 to S170 constitute the word ranking unit 130.
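The per-set selection of step S170, followed by the union of the selected words, can be sketched in Python as shown below. The function name `select_related_words`, the parameter `per_set`, and the alphabetical tie-break are assumptions of the sketch.

```python
def select_related_words(candidates_per_set, per_set=10):
    """S170: pick the highest-scoring candidates in each document set,
    then take the union of the picks as the related words."""
    related = set()
    for scores in candidates_per_set:
        # Sort by descending score; ties are broken alphabetically.
        best = sorted(scores, key=lambda w: (-scores[w], w))[:per_set]
        related.update(best)
    return related


candidates_per_set = [
    {"ranking": 1.0, "query": 0.5},
    {"printer": 0.9, "ranking": 0.8, "toner": 0.1},
]
print(sorted(select_related_words(candidates_per_set, per_set=1)))
# ['printer', 'ranking']
```

In this example "printer" is selected only because the second set is ranked separately, which illustrates how selecting per set keeps words that characterize one set from being crowded out by another.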
[0051]
A new keyword is created by adding to the original keyword either the related words selected in step S170 or those chosen by the user from among them (step S180).
This step constitutes the keyword generation unit 140.
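Step S180 amounts to appending words to the original keyword, as in the following Python sketch. The function name `make_new_keyword`, the `chosen` argument representing the user's selection, and the suppression of duplicate words are assumptions introduced for illustration.

```python
def make_new_keyword(original_words, related_words, chosen=None):
    """S180: add related words (all of them, or only those the user chose)
    to the original keyword."""
    if chosen is None:
        additions = list(related_words)
    else:
        additions = [w for w in related_words if w in chosen]
    new_keyword = list(original_words)
    for w in additions:
        if w not in new_keyword:  # do not repeat a word already in the keyword
            new_keyword.append(w)
    return new_keyword


print(make_new_keyword(["document", "search"], ["ranking", "search", "index"]))
# ['document', 'search', 'ranking', 'index']
print(make_new_keyword(["document", "search"], ["ranking", "index"], chosen={"index"}))
# ['document', 'search', 'index']
```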
[0052]
Using the new keyword, conforming documents are selected again in the same manner as in the processing from step S110 to step S130 (document ranking unit 120).
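The overall flow from step S110 to step S190 can be summarized by the Python sketch below. The three callables (`rank`, `expand`, `user_accepts`) are hypothetical placeholders for the document ranking unit 120, the word ranking unit 130 together with the keyword generation unit 140, and the user's judgment in step S140; the limit `max_rounds` is an assumption of the sketch and is not stated in the specification.

```python
def search_with_feedback(keyword, rank, expand, user_accepts, max_rounds=5):
    """Repeat ranking and keyword expansion until the user finds a desired
    document or max_rounds is reached (the limit is an assumption)."""
    for _ in range(max_rounds):
        conforming = rank(keyword)             # S110-S130
        if user_accepts(conforming):           # S140
            return keyword, conforming         # proceed to output (S190)
        keyword = expand(keyword, conforming)  # S150-S180
    return keyword, rank(keyword)


# Toy stand-ins so the sketch can be run as is.
def rank(kw):
    return ["d%d" % len(kw)]

def expand(kw, conforming):
    return kw + ["x"]

def user_accepts(conforming):
    return conforming == ["d3"]

print(search_with_feedback(["a"], rank, expand, user_accepts))
# (['a', 'x', 'x'], ['d3'])
```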
[0053]
With the present embodiment configured as described above, related words are not found from the entire set of conforming documents; instead, the group of conforming documents is divided into a plurality of sets, related words are found for each set, and their union is used as the related words. Consequently, even if some documents regarded as conforming are not actually conforming, the risk is spread across the sets, and the danger of selecting only words unsuitable as related words is reduced.
In addition, dividing the group of documents regarded as conforming into a plurality of document sets with different properties and selecting related words for each set prevents the selection from being limited to related words with a similar tendency.
As a result, a variety of words can be selected as related words of the keyword, which increases the likelihood of retrieving the appropriate document desired by the user.
[0054]
When searching a document collection such as patent gazettes, the related words are the union of the related words found for each set of conforming documents sharing the same applicant, the same date information (such as filing date, publication date, or registration date), or the same patent classification, so related words can be selected from various viewpoints.
[0055]
<Embodiment 2>
Furthermore, the present invention is not limited to the above embodiment. For example, the document search device shown in FIG. 1 can be realized by a computer device 200 having the hardware configuration shown in FIG. 4.
That is, the computer device 200 comprises: an input device 1, such as a keyboard, mouse, touch panel, or scanner, used for inputting information; a display device 2 that displays various output information, information input from the input device 1, and the like; a CPU (Central Processing Unit) 3 that runs various programs; a memory 4 that holds the programs themselves and information temporarily created while the CPU 3 executes them; a storage device 5 that holds the document database 160, the word dictionary 170, programs, temporary information generated during program execution, and the like handled by the document search device of the present invention; a medium drive device 6 in which a recording medium storing programs, data, and the like is mounted and which is used to read them into the memory 4 or the storage device 5; and a network connection device 7 serving as an interface for connecting to the network 9. These components are connected by a bus 8.
The network 9 is a transmission path that connects the computer device 200 to other computer devices 200; it is generally realized by cables and uses TCP/IP as the communication protocol. However, the transmission path is not limited to cables and may be wired or wireless as long as both ends use the same communication protocol. For example, a LAN (Local Area Network), a WAN (Wide Area Network), or the Internet can be used.
[0056]
In the computer device 200 configured in this way, it goes without saying that the object of the present invention can be achieved by programming each function of the document search device according to the above-described embodiment, writing the programs in advance on a recording medium such as a CD-ROM, mounting the recording medium in a medium drive device 6 such as a CD-ROM drive installed in the computer, storing the programs in the memory 4 or the storage device 5 of the computer, and executing them.
In this case, the program itself read from the recording medium implements the functions of the above-described embodiment, and the program and the recording medium on which the program is recorded also constitute the present invention.
[0057]
Note that the recording medium for storing the program may be a semiconductor medium (for example, ROM or non-volatile memory), an optical medium (for example, DVD, MO, MD, or CD), a magnetic medium (for example, magnetic tape or flexible disk), or the like.
[0058]
The functions of the above-described embodiments are realized not only when the computer device 200 executes the program loaded into the memory 4. The present invention also covers the case where those functions are realized by processing performed in cooperation with an operating system or other application programs based on the instructions of the program.
[0059]
For distribution to the market, the program may be stored in a portable recording medium and distributed, or it may be stored in a storage device of a server computer connected via a communication network such as the Internet and transferred to another computer through the communication network. In the latter case, the storage device of the server computer is also included in the recording medium of the present invention. On the receiving computer, the functions of the above-described embodiments are realized by installing the program from the portable recording medium, or the transferred program, in a storage device connected to the computer and executing the installed program.
[0060]
<Operation in network environment>
FIG. 5 shows a configuration in which the present invention is operated while connected to a wired or wireless communication network.
For example, the server 300 holding the document search program and the terminals 310 used by a plurality of users are connected via the network 9.
In this case, the server 300 and the user terminals 310 are each configured by the general-purpose computer device 200 shown in FIG. 4.
The user logs in to the server 300 from a terminal 310, inputs a keyword for document search using the input device, and requests the document search program of the server 300 via the network 9 to execute the search.
The document search program of the server 300 returns, via the network 9, the search results that match the specified keyword and the intermediate progress of the search to the requesting terminal 310. The user terminal 310 outputs the search results and the intermediate progress to its output device. When the progress is output, the terminal also sends instructions to the server 300 as the progress requires.
Placing the document search program on the server 300 in this way has the advantage that users can always use the latest document search program.
[0061]
[Effects of the invention]
As described above, according to the present invention, even if some documents regarded as conforming are not actually conforming, the risk is spread, and the danger of selecting only words unsuitable as related words is reduced.
In addition, dividing the group of documents regarded as conforming into a plurality of document sets with different properties and selecting related words for each set prevents the selection from being limited to related words with a similar tendency.
As a result, a variety of words can be selected as related words of the keyword, which increases the likelihood of retrieving the appropriate document desired by the user.
[0062]
When searching a document collection such as patent gazettes, the related words are the union of the related words found for each set of conforming documents sharing the same applicant, the same date information (such as filing date, publication date, or registration date), or the same patent classification, so related words can be selected from various viewpoints.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a document search device according to the present invention.
FIG. 2 is a diagram illustrating a data structure of a document database.
FIG. 3 is a flowchart for explaining the flow of processing of the document search device according to the present invention.
FIG. 4 is a diagram illustrating a hardware configuration when the document search device according to the present invention is implemented by a computer.
FIG. 5 is a diagram for explaining a case where the document search device according to the present invention is operated in a network environment.
[Explanation of symbols]
1 ... Input device, 2 ... Display device, 3 ... CPU, 4 ... Memory, 5 ... Storage device, 6 ... Medium drive device, 7 ... Network connection device, 8 ... Bus, 9 ... Network, 110 ... Keyword input unit, 120 ... Document ranking unit, 130 ... Word ranking unit, 140 ... Keyword generation unit, 150 ... Document output unit, 160 ... Document database, 170 ... Word dictionary, 200 ... Computer device, 300 ... Server, 310 ... Terminal.

Claims

1. A document search device comprising: a document database that holds a plurality of documents; a document ranking unit that selects, from the document database, documents that conform to an input keyword and documents that do not conform to it; a word ranking unit that selects, from among the words appearing in the conforming documents selected by the document ranking unit, words having a high degree of relevance to the keyword as related words of the keyword; and a keyword generation unit that adds the related words selected by the word ranking unit to the keyword, the document ranking unit searching again for documents that conform to the new keyword generated by the keyword generation unit, wherein the word ranking unit divides the group of conforming documents selected by the document ranking unit into a plurality of sets, finds related words for each set, and then uses the union of those related words as the related words to be added to the keyword.

2. The document search device according to claim 1, wherein the group of conforming documents is divided into sets of conforming documents that have common or approximate values for a specific item of the conforming documents.

3. The document search device according to claim 2, wherein the specific item is a bibliographic item of the conforming document.

4. The document search device according to claim 2, wherein, when the documents in the document database are patent gazettes, the specific item is the applicant.

5. The document search device according to claim 2, wherein, when the documents in the document database are patent gazettes, the specific item is date information such as a filing date, a publication date, or a registration date.

6. The document search device according to claim 2, wherein, when the documents in the document database are patent gazettes, the specific item is a patent classification such as the International Patent Classification, a facet classification, or an F-term.

7. A program for causing a computer to execute the functions of the document search device according to claim 1.

8. A computer-readable recording medium on which the program according to claim 7 is recorded.