JP4154118B2

JP4154118B2 - Related Word Selection Device, Method and Recording Medium, and Document Retrieval Device, Method and Recording Medium

Info

Publication number: JP4154118B2
Application number: JP2000333509A
Authority: JP
Inventors: 博子真野; 泰嗣小川
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2000-10-31
Filing date: 2000-10-31
Publication date: 2008-09-24
Anticipated expiration: 2020-10-31
Also published as: JP2002140366A

Description

【０００１】
【発明の属する技術分野】
本発明は、関連語選出装置、その方法および記録媒体並びに文書検索装置、その方法および記録媒体に関し、より詳細には、文書データベースから与えられたキーワードに適合する文書を検索し、その適合文書中からキーワードに関連のある関連語を選出する技術並びに与えられたキーワードの関連語を付加して再検索するための技術に関する。
【０００２】
【従来の技術】
一般に、文書の特徴をあらわすものにキーワードがある。また、その文書に対して別の面から特徴をあらわすためにシソーラスのような関連語をもつ方法もある。
また、この関連語の応用面として、文書を多数集積している文書データベースからユーザの必要とする文書を探しだすときに、ユーザが入力したキーワードを用いて一旦検索した後、そのキーワードに適合した文書中に出現する単語の中から入力したキーワードに関連した単語を選出し、はじめに入力したキーワードに追加し、再度、検索することで、よりユーザの求めるものに近いものを得る方法が知られている。
たとえば、キーワードの関連語を選出する方法として、適合文書中の各単語について、適合文書の中での出現状況などの統計情報を利用して、キーワードとの関連度を算出し、その値の大きい上位何単語かを選出する方法が提案されている（文献１：Robertson, S.E. "On term selection for query expansion" Journal of Documentation 46, Dec 1990,p359-364）。
【０００３】
次に、この従来の関連語の抽出方法についてより詳細に説明する。
まず、ユーザから入力されたキーワード中の各単語に対して単語の重要度に応じた重みを付与する。この単語の重みの計算式には、たとえば、確率モデルにもとづく Robertson の計算式（式１）が知られている（文献２：Robertson, S.E. and Walker, S. "On relevance weights with little relevance information," SIGIR 97, ACM Press, pp.16-24）。この文献２の技術においては、キーワード中の各単語の重みは、検索対象文書全体の中での各単語の出現状況 Wp、Wq に応じて付与される。
W（重み）＝ Wp - Wq ……… （式１）
ここで Wp ＝ k4 + log(N / (N - n))
Wq ＝ log(n / (N - n))
N: 検索対象総文書数
n: 単語の出現する文書数
k4: 調整パラメータ
次に、キーワード中の各単語の重みをもとに、各文書の文書適合度を計算する。この文書適合度の計算式は、たとえば、文献２の計算式（式２）で求まる。
F（適合度）＝ Σ(W × tf /(k1 + tf)) ……… （式２）
ここで W ：（式１）で計算された重み
tf: 文書あたりの単語の出現数
k1: 調整パラメータ
各文書の文書適合度を求め、適合度の高い順に各文書を順序づけ、上位何件かを適合文書とみなし、下位何件かを非適合文書とみなす。
適合文書の選出後、適合文書中の不要語（たとえば、冠詞の a など）を除いたすべての単語について、適合文書および非適合文書での出現状況、すなわちフィードバック情報を反映させて、それぞれの単語の重みを再計算する。
適合文書選出後の重みは、たとえば、文献２の計算式（式３）を用いて、検索対象文書全体での出現状況 Wp、Wq （上記の（式１）のコメント参照）と適合文書／非適合文書の中での出現状況 WrとWs を比率 CpとCq で足し合わせて付与される。
W'（重み）＝(Cp・Wp+(1-Cp)・Wr)-(Cq・Wq+(1-Cq)・Ws) ……（式３）
ここで Wr = log((r + 0.5) / (R - r + 0.5))
Ws = log((s + 0.5) / (S - s + 0.5))
Cp ＝ k5 / (k5 + √R)
Cq ＝ k6 / (k6 + √S)
R: 適合文書数
r: 適合文書集合の中で単語の出現する文書数
S: 非適合文書数
s: 非適合文書集合の中で単語の出現する文書数
k5, k6: 調整パラメータ
さらに、この重みとフィードバック情報から適合文書中の不要語を除いた各単語について、キーワードとの関連度を求める。
【０００４】
関連度の算出方法としては、たとえば、Boughanem の計算式（式４）がある(文献３：Walker, S. et al., "Okapi at TREC-6:Automated ad hoc, VLC, routing, filtering and QSDR," The Sixth Test REtrieval Conference (TREC-6), 1996, NIST)。
関連度＝ (r / R - α・s / S) × W' ……… （式４）
ここで α: 調整パラメータ
このようにして、適合文書中の各単語について、キーワードとの関連度を求めて、関連度の高いものから順にキーワード関連語として選出する。
文書検索装置に利用するときには、入力したキーワードにこの関連語を追加して新しいキーワードを作成し、この新しいキーワードを用いて、再度、適合文書を選出する。
【０００５】
【発明が解決しようとする課題】
上記のような関連語を抽出するためには、検索対象の文書の内容が異なるものとしてデータベースを検索するのが通例であった。
しかし、たとえば、インターネット上で公開されている文書というのは、複数のサーバー（ミラーサイト）上で同一の文書を公開することもめずらしくない。
これらは、同一文書であっても、文書の識別子であるインターネットのアドレスは異なっているので、データベースから見れば、別文書とみなされてしまう。
また、もともと同一文書だった複数の文書（以下、重複文書と呼ぶ）に、個別に、異なったタイミングで小さな修正が加えられた結果、厳密には同じでないものの、ほぼ同一といえる複数の文書（以下、準重複文書と呼ぶ）がデータベース上に混在することになる場合もある。
従って、このような文書データベースには、重複文書や準重複文書の複数の文書が存在する可能性があり、従来の技術で提案されてきたキーワードの関連語選出方法では、次のような問題が出てくることになった。
キーワードと単語の関連度は、適合文書中でその単語が出現する文書数等をもとに計算されるため、当然、重複文書および準重複文書中の単語は、出現文書数が多くなり、その結果、キーワード関連度が高いとみなされてしまうことになる。
適切な関連語を得るには、中身の異なる複数の文書から広く共通に出現する単語を選出するのが望ましく、中身の似通ったいくつかの文書に出現しているからといって高い関連度を付与してしまうと、汎用性のとぼしい偏った単語が選ばれてしまうおそれがある。
本発明は、上述の問題を解決するためのものであり、文書データベースの中から与えられたキーワードの関連語を偏ることなく選出し、検索に寄与できる関連語選出装置、その方法および記録媒体を提供することを目的とする。
また、このような関連語を用いてユーザの所望する的確な文書を検索することができる文書検索装置、その方法および記録媒体を提供することを目的とする。
【０００６】
【課題を解決するための手段】
上記の課題を解決するために、請求項１記載の発明は、複数の文書を保持する文書データベースから、入力したキーワードに関連する関連語を選出する関連語選出装置において、各文書について、前記キーワード中の単語が出現する数及び各単語の重みをもとに前記キーワードに対する適合度を計算し、前記複数の文書中の、所定の規則に基づいて重複しているとみなされる複数の重複文書から、前記適合度の最も高い文書を適合文書として抽出する適合文書抽出部と、前記適合文書抽出部により抽出された各適合文書から前記キーワードに関連する関連語を抽出する関連語抽出部と、を備えたことを特徴とする。
また請求項２に記載の発明は、請求項１に記載の関連語選出装置において、前記適合文書抽出部は、前記複数の文書のうち、単語数が同じ文書を前記重複文書とすることを特徴とする。
また、請求項３に記載の発明は、請求項１に記載の関連語選出装置において、前記重複文書削除部は、前記複数の文書のうち、前記キーワード中の各単語の出現頻度が同じ文書を前記重複文書とすることを特徴とする。
また、請求項４に記載の発明は、請求項１に記載の関連語選出装置において、前記適合文書抽出部は、前記複数の文書のうち、キーワード中の各単語の出現位置および出現間隔が同じ文書を前記重複文書とすることを特徴とする。
また、請求項５に記載の発明は、複数の文書を保持する文書データベースから、入力したキーワードに関連する関連語を選出する関連語選出装置において、各文書について、前記キーワード中の単語が出現する数及び各単語の重みをもとに前記キーワードに対する適合度を計算し、前記適合度が同一の文書のうちの一つを適合文書として抽出する適合文書抽出部と、前記適合文書抽出部により抽出された前記適合文書から前記キーワードに関連する関連語を抽出する関連語抽出部と、を備えたことを特徴とする。
また、請求項６に記載の発明は、請求項１乃至５の何れか一項に記載の関連語選出装置において、前記関連語抽出部で抽出された前記関連語を前記キーワードに付加して新しいキーワードを生成するキーワード生成部を有し、前記適合文書抽出部は、前記キーワード生成部によって生成された新しいキーワード中の単語に基づく新たな適合文書を抽出し、前記関連語抽出部は、前記新たな適合文書から関連語を抽出することを特徴とする。
また、請求項７に記載の発明は、請求項１乃至６の何れか一項に記載の関連語選出装置において、前記関連語抽出部は、すでに関連語として選出された単語と同じ適合文書中に低頻度に出現する単語であって、前記選出された単語よりも関連度が低い単語は関連語として登録しないようにしたことを特徴とする。
また、請求項８に記載の発明は、複数の文書を保持する文書データベースから、入力したキーワードに関連する関連語を選出する関連語選出方法において、適合文書抽出部が、各文書について、前記キーワード中の単語が出現する数及び各単語の重みをもとに前記キーワードに対する適合度を計算し、前記複数の文書中の、所定の規則に基づいて重複しているとみなされる複数の重複文書から、前記適合度の最も高い文書を適合文書として抽出するステップと、関連語抽出部、前記適合文書抽出部により抽出された各適合文書から前記キーワードに関連する関連語を抽出するステップと、から構成されることを特徴とする。
また、請求項９に記載の発明は、複数の文書を保持する文書データベースから、入力したキーワードに関連する関連語を選出する関連語選出方法において、適合文書抽出部が、各文書について、前記キーワード中の単語が出現する数及び各単語の重みをもとに前記キーワードに対する適合度を計算し、前記適合度が同一の文書のうちの一つを適合文書として抽出するステップと、関連語抽出部が、前記適合文書抽出部により抽出された前記適合文書から前記キーワードに関連する関連語を抽出するステップと、から構成されることを特徴とする。
【０００７】
また、請求項１０に記載の発明は、請求項８又は９に記載の関連語選出方法において、キーワード生成部が、前記関連語抽出部で抽出された前記関連語を前記キーワードに付加して新しいキーワードを生成するステップと、前記適合文書抽出部が、前記キーワード生成部によって生成された新しいキーワード中の単語に基づく新たな適合文書を抽出するステップと、前記関連語抽出部が、前記新たな適合文書から関連語を抽出するステップと、を有することを特徴とする。
また、請求項１１に記載の発明は、請求項８乃至１０の何れか一項に記載の関連語選出方法をコンピュータに実現させるためのプログラムを記録したコンピュータ読み取り可能な記録媒体を特徴とする。
また、請求項１２に記載の発明は、複数の文書を保持する文書データベースから入力したキーワードに適合する文書を検索する文書検索装置において、各文書について、前記キーワード中の単語が出現する数及び各単語の重みをもとに前記キーワードに対する適合度を計算し、前記複数の文書中の、所定の規則に基づいて重複しているとみなされる複数の重複文書から、前記適合度の最も高い文書を適合文書として抽出する適合文書抽出部と、前記適合文書抽出部により抽出された各適合文書から前記キーワードに関連する関連語を抽出する関連語抽出部と、前記関連語抽出部で抽出された前記関連語を前記キーワードに付加して新しいキーワードを生成するキーワード生成部と、を備え、前記適合文書抽出部は、前記キーワード生成部によって生成された新しいキーワード中の単語を含む新たな適合文書を抽出し、前記関連語抽出部は、前記新たな適合文書から関連語を抽出することを特徴とする。
また、請求項１３に記載の発明は、複数の文書を保持する文書データベースから入力したキーワードに適合する文書を検索する文書検索装置において、各文書について、前記キーワード中の単語が出現する数及び各単語の重みをもとに前記キーワードに対する適合度を計算し、前記適合度が同一の文書のうちの一つを適合文書として抽出する適合文書抽出部と、前記適合文書抽出部により抽出された前記適合文書から前記キーワードに関連する関連語を抽出する関連語抽出部と、前記関連語抽出部で抽出された前記関連語を前記キーワードに付加して新しいキーワードを生成するキーワード生成部と、を備え、前記適合文書抽出部は、前記キーワード生成部によって生成された新しいキーワード中の単語を含む新たな適合文書を抽出し、前記関連語抽出部は、前記新たな適合文書から関連語を抽出することを特徴とする。
また、請求項１４に記載の発明は、複数の文書を保持する文書データベースから入力したキーワードに適合する文書を検索する文書検索方法において、適合文書抽出部が、各文書について、前記キーワード中の単語が出現する数及び各単語の重みをもとに前記キーワードに対する適合度を計算し、前記複数の文書中の重複文書から、前記適合度の最も高い文書を抽出するステップと、関連語抽出部が、前記適合文書抽出部により抽出された各適合文書から前記キーワードに関連する関連語を抽出するステップと、キーワード生成部が、前記関連語抽出部で抽出された前記関連語を前記キーワードに付加して新しいキーワードを生成するステップと、前記適合文書抽出部が、前記キーワード生成部によって生成された新しいキーワード中の単語に基づく新たな適合文書を抽出するステップと、前記関連語抽出部が、前記新たな適合文書から関連語を抽出するステップと、
から構成されることを特徴とする。
また、請求項１５に記載の発明は、複数の文書を保持する文書データベースから入力したキーワードに適合する文書を検索する文書検索方法において、適合文書抽出部が、各文書について、前記キーワード中の単語が出現する数及び各単語の重みをもとに前記キーワードに対する適合度を計算し、前記適合度が同一の文書のうちの一つを適合文書として抽出するステップと、関連語抽出部が、前記適合文書抽出部により抽出された前記適合文書から前記キーワードに関連する関連語を抽出するステップと、キーワード生成部が、前記関連語抽出部で抽出された前記関連語を前記キーワードに付加して新しいキーワードを生成するステップと、前記適合文書抽出部が、前記キーワード生成部によって生成された新しいキーワード中の単語を含む新たな適合文書を抽出するステップと、前記関連語抽出部が、前記新たな適合文書から関連語を抽出するステップと、から構成されることを特徴とする。
また、請求項１６に記載の発明は、請求項１４又は１５に記載の文書検索方法をコンピュータに実現させるためのプログラムを記録したコンピュータ読取可能な記録媒体を特徴とする。
【０００８】
また、本発明の請求項１０の文書検索装置は、複数の文書を保持する文書データベースから入力したキーワードに適合する文書を検索する文書検索装置において、前記キーワードによって前記文書データベースから検索された適合度の高い文書のうち文書内容が同一か、またはほぼ同一の文書を削除し、削除されずに残った文書から適合度の高い文書を抽出する適合文書抽出部と、前記適合文書抽出部で抽出された適合文書から前記キーワードに関連する関連語を抽出する関連語抽出部と、前記関連語抽出部で抽出された関連語を前記キーワードに付加して新しいキーワードを生成するキーワード生成部とを備え、前記適合文書抽出部は、再度、前記キーワード生成部で生成した新しいキーワードによって検索し、ユーザの所望する的確な文書を得るようにしたことを特徴とする。
また、本発明の請求項１１の文書検索方法は、複数の文書を保持する文書データベースから入力したキーワードに適合する文書を検索する文書検索方法において、前記キーワードによって前記文書データベースから検索された文書のうち文書内容が同一か、またはほぼ同一の文書を削除し、その削除されずに残った文書から適合度の高い文書を抽出し、この抽出された文書から前記キーワードと関連のある関連語を抽出し、この関連語を前記キーワードに付加して新しいキーワードを生成し、この新しいキーワードを用いて、再度、前記文書データベースを検索することによってユーザの所望する文書を検索することを特徴とする。
また、本発明の請求項１２の記録媒体は、コンピュータを、複数の文書を保持する文書データベースから入力したキーワードに適合する文書を検索する文書検索装置として機能させるためのプログラムを記録したコンピュータ読み取り可能な記録媒体であって、前記キーワードによって前記文書データベースから検索された適合度の高い文書のうち文書内容が同一か、またはほぼ同一の文書を削除し、削除されずに残った文書から適合度の高い文書を抽出する適合文書抽出部と、前記適合文書抽出部で抽出された適合文書から前記キーワードに関連する関連語を抽出する関連語抽出部と、前記関連語抽出部で抽出された関連語を前記キーワードに付加して新しいキーワードを生成するキーワード生成部とを備え、前記適合文書抽出部は、再度、前記キーワード生成部で生成した新しいキーワードによって検索し、ユーザの所望する的確な文書を得る機能を実現するための文書検索プログラムを記録した。
【０００９】
【発明の実施の形態】
以下に、図面を用いて本発明の実施の形態の構成および動作を詳細に述べる。
（１）第１の実施の形態の構成
図１は、本発明に係る関連語選出装置の第１の実施の形態の機能ブロック図である。
第１の実施の形態の関連語選出装置は、キーワード入力部１１０、適合文書抽出部１２０、関連語抽出部１３０、キーワード生成部１４０、出力部１５０、文書データベース１６０より構成される。
キーワード入力部１１０は、ユーザがキーボード等により、文書データベース１６０中にある文書の特徴をあらわすキーワードとなる文字列を入力する。
適合文書抽出部１２０は、キーワード入力部１１０から渡されたキーワードに対して、文書データベース１６０を検索し、適合する文書と適合しない文書とを選定する。このとき、適合度の高い検索対象文書のうち文書内容が同一の文書（重複文書）か、またはほぼ同一の文書（準重複文書）（ここでは、これら重複文書と準重複文書をまとめて重複文書と呼ぶことにする）を一つだけ残して他の重複文書を削除する。この削除作業は、予め決められた適合文書数になるまで繰り返す。これにより適合度の高い文書には重複が少なくなる。
関連語抽出部１３０は、適合文書抽出部１２０で抽出された適合文書の中から取り出された単語と入力されたキーワードとの間で計算される関連度に応じて関連語を抽出し、キーワード生成部１４０へ渡す。
キーワード生成部１４０は、関連語抽出部１３０から渡された関連語をもとのキーワードに追加して新しいキーワードを生成する。この新しいキーワードは、さらにもとのキーワードと関連のある関連語を抽出するために適合文書抽出部１２０へ渡すようにしてもよいし、そのまま出力部１５０からユーザに提示させてもよい。
出力部１５０は、関連語抽出部１３０で抽出された関連語およびその元となったキーワードとをプリンタ、表示装置、記憶装置等へ出力するか、または、ネットワークを介して他のコンピュータ装置へ送信する。
文書データベース１６０は、検索対象となる文書を保持する文書情報と、その文書中に含まれている各単語の単語統計情報から構成される（図２参照）。
たとえば、文書情報には、各文書に対して次のような情報が保持される。
文書識別子（ＩＤ）、文書名、書誌事項（作成者、作成日、発行所等）、
文書実体へのポインタ等
また、単語統計情報には、単語ごとに次のような統計情報を保持する。
単語の表記、この単語の文書データベース全体での出現頻度、単語出現情報等
ここで単語出現情報としては、単語が出現する文書ごとに次の情報を保持する。
この単語が出現する文書の文書識別子、この文書に出現する単語出現頻度、この文書にこの単語が出現する出現位置の一覧等
【００１０】
（２）第１の実施の形態の動作
次に、このように構成された第１の実施の形態の関連語選出装置の動作について、図３のフローチャートに基いて説明する。
まず、キーボード等の入力装置からキーワードの文字列を入力する（ステップＳ１００）。
これにより、キーワード入力部１１０を構成する。
このキーワードは、たとえば、英語や日本語の単語や単語の組み合わせで構成し、必要に応じて単語の組み合わせは、単単語へ分解する。
この入力されたキーワード中のそれぞれの単語について、文書データベース１６０の単語統計情報を参照し、たとえば、上記（式１）を用いて単語の重要度に応じた重みを計算する（ステップＳ１１０）。
次に、検索対象である文書データベース１６０中のそれぞれの文書に対して、次の情報を計算し、文書一覧表を作成する（ステップＳ１２０）。
【００１１】
（Ａ）文書ごとに適合度の計算
文書データベース１６０の単語統計情報とステップＳ１１０で計算されたキーワードの単語の重みとを参照し、その文書にキーワード中の単語がどのくらい含まれているかを示す適合度を、たとえば、上記（式２）を用いて計算する。
（Ｂ）文書ごとに含まれる単語数
文書データベース１６０の単語統計情報から文書に含まれる単語数を計算する。
この単語統計情報で、同じ文書識別子をもつ単語の出現頻度を総計することによって計算できる。
この文書一覧表を文書の適合度を第１キー、文書に含まれる単語数を第２キーとして、降順に各文書を順序づける（ステップＳ１３０）。
この順序付けられた文書一覧表で、同じ単語数の文書があった場合、そのうちの適合度のもっとも高い文書のみを残して、残りの同じ単語数の文書を削除する。この操作を適合度の高いほうから所定の数（たとえば、１０文書数程度）になるまで繰り返す。ここで選定されたものが適合文書として抽出される。
これは、中身の異なる二つの文書が同じ単語数となることは、極めてまれであることから、単語数が同じ文書は、中身も同じ可能性が高いと本発明ではみなしている。
さらに、文書一覧表の下位から所定の件数（たとえば、５００件程度）の文書を非適合文書とみなす。この非適合文書に対しても適合文書と同じように重複文書を削除する（ステップＳ１４０）。
この適合文書か非適合文書かは、順序づけられた文書の一覧表（適合度、文書名や書誌事項等の一覧）をユーザに提示し、ユーザに指示させて決定するようにしてもよい。
【００１２】
ステップＳ１１０からステップ１４０までにより、適合文書抽出部１２０を構成する。
ステップＳ１４０で求めた適合文書中の単語を入力キーワードの関連語の候補となる関連語単語表として作成する。これは文書データベース１６０の単語統計情報に保持された適合文書に含まれる単語を取り出して作成される。このとき、予め用意された不要語表を参照して、これに登録されている単語は関連語単語表へは登録しない。
さらに、この関連語単語表に登録された単語ごとに、適合文書および非適合文書での出現状況を文書データベース１６０の単語統計情報から取り出し、たとえば、（式３）および（式４）を使って、キーワードとの関連度を計算する。
この関連度の高いものから順に所定の数（たとえば、１０単語程度）だけ選択し、これをキーワード関連語として抽出する（ステップＳ１５０）。
ステップＳ１５０により関連語抽出部１３０を構成する。
ステップS１５０で抽出したキーワード関連語を入力されたときの元のキーワードへ追加し、新しいキーワードを作成する（ステップＳ１６０）。
これにより、キーワード生成部１４０を構成する。
さらに、関連語を抽出するかをユーザに指定させ（ステップS１７０）、抽出を行うという指示の時には、この新しいキーワードを用いてステップＳ１１０から繰り返す。抽出を終了するときには、ステップS１８０へ進む。
なお、この関連語の抽出は、繰り返さずともよいし、所定の回数これを繰り返すようにしてもよいし、また、ユーザに適合文書を検索するたびに出力して繰り返すように構成してもよい。
ステップＳ１６０で生成された新しいキーワードを表示装置、プリンタや記憶装置等の出力装置へ出力することによってユーザに提示される（ステップＳ１８０）。これにより、出力部１５０を構成する。
また、出力は、適合文書をネットワークで接続された他のコンピュータ装置へ送信するようにしてもよい。
関連語選出装置を第１の実施の形態のような構成にすることによって、中身の異なる複数の文書から広く共通に出現する単語をキーワードの関連語として選出することができるようになり、ユーザの所望する的確な文書を検索することに寄与できる。
【００１３】
＜第１の実施の形態の変形例（１）＞
第１の実施の形態の変形例の関連語選出装置について説明する。
本変形例の構成は、図１に示される第１の実施の形態の構成と同じである。
しかし、適合文書抽出部１２０の動作が次の点で第１の実施の形態とは相違している。
第１の実施の形態では、重複文書と見なす基準として、文書に含まれている単語の数に注目したが、本変形例では、適合度が同じ文書の場合に重複文書または準重複文書と見なしている。
この場合、図３のフローチャートのステップＳ１２０の文書一覧表を作成するときに、ステップＳ１３０では適合度だけで降順に順序づける。
その上で、ステップＳ１４０では、文書一覧表で同じ適合度を持つ文書のうちのひとつのみを残し、他の同じ適合度をもつ文書を削除する。
これは、中身の異なる二つの文書が同じ適合度となることは、極めてまれであることから、適合度が同じ文書は、中身も同じ可能性が高いとみなしている。
関連語選出装置を本変形例のような構成にすることによって、中身の異なる複数の文書から広く共通に出現する単語をキーワードの関連語として選出することができるようになり、ユーザの所望する的確な文書を検索することに寄与できる。
【００１４】
＜第１の実施の形態の変形例（２）＞
第１の実施の形態の別の変形例の関連語選出装置について説明する。
本変形例の構成は、図１に示される第１の実施の形態の構成と同じである。
しかし、適合文書抽出部１２０の動作が次の点で第１の実施の形態とは相違している。
第１の実施の形態では、重複文書と見なす基準として、文書に含まれている単語の数に注目したが、本変形例では、キーワード中の各単語の出現頻度が同じ文書の場合に重複文書または準重複文書と見なしている。
この場合、図３のフローチャートのステップＳ１２０の文書一覧表を作成するときに、各文書中に出現するキーワード中の各単語の出現頻度を計算する。
ステップＳ１３０では、適合度とこのキーワード中の各単語の出現頻度とをキーとして降順に順序づける。
その上で、ステップＳ１４０では、文書一覧表でキーワード中の各単語の出現頻度が同じ文書のうちの適合度のもっとも高い文書のうちのひとつのみを残し、各単語の出現頻度が同じ他の文書を削除する。
たとえば、キーワードが「情報検索」であった場合に、文書１は、「情報」が２回、「検索」が1回出現し、適合度が0.87である。さらに、文書２は、「情報」が２回、「検索」が1回出現し、適合度が0.85である。
このような場合には、文書一覧表から文書２を削除し、文書１のみを残すようにする。
これは、中身の異なる二つの文書にキーワード中の各単語が同じ頻度で出現することは、極めてまれであることから、キーワード中の各単語の出現頻度が同じ文書は、中身も同じ可能性が高いとみなしている。
関連語選出装置を本変形例のような構成にすることによって、中身の異なる複数の文書から広く共通に出現する単語をキーワードの関連語として選出することができるようになり、ユーザの所望する的確な文書を検索することに寄与できる。
【００１５】
＜第１の実施の形態の変形例（３）＞
第１の実施の形態の更に別の変形例の関連語選出装置について説明する。
本変形例の構成は、図１に示される第１の実施の形態の構成と同じである。
しかし、適合文書抽出部１２０の動作が次の点で第１の実施の形態とは相違している。
第１の実施の形態では、重複文書と見なす基準として、文書に含まれている単語の数に注目したが、本変形例では、キーワード中の各単語の出現位置および出現間隔が同じ文書の場合に重複文書または準重複文書と見なしている。
この場合、図３のフローチャートのステップＳ１２０の文書一覧表を作成するときに、各文書中に出現するキーワード中の各単語の出現位置（文書の前・後から何語目に）および出現間隔（何語離れて）を抽出する。
ステップＳ１３０では、適合度とこのキーワード中の各単語の出現位置および出現間隔とをキーとして降順に順序づける。
ステップＳ１４０では、文書一覧表でキーワード中の各単語の出現位置および出現間隔が同じ文書のうちの適合度のもっとも高い文書のうちのひとつのみを残し、他の文書を削除する。
【００１６】
たとえば、キーワードが「情報検索」であった場合に、文書１は、10語目と20語目に「情報」が、11語目に「検索」が出現し、適合度が0.87である。
さらに、文書２は、10語目と20語目に「情報」が、11語目に「検索」が出現し適合度が0.85である。
このような場合には、文書一覧表から文書２を削除し、文書１のみを残すようにする。
これは、中身の異なる二つの文書にキーワード中の各単語が同じ出現位置および出現間隔で出現することは、極めてまれであることから、キーワード中の各単語が同じ出現位置および出現間隔が同じ文書は、中身も同じ可能性が高いとみなしている。
本変形例の関連語選出装置をこのような構成にすることによって、中身の異なる複数の文書から広く共通に出現する単語をキーワードの関連語として選出することができるようになり、ユーザの所望する的確な文書を検索することに寄与できる。
なお、本発明の関連語選出装置は、適合文書抽出部１２０として、上記した第１の実施の形態、変形例（１）、（２）または（３）を単独に使用して構成するだけでなく、適宜、組合せた構成をとってもよい。たとえば、最初の重複文書の削除で残った文書に対して、他の方法を適用してさらに重複文書を削除するように構成してもよい。
【００１７】
＜第１の実施の形態の変形例（４）＞
第１の実施の形態の更に別の変形例の関連語選出装置について説明する。
本変形例の構成は、図１に示される第１の実施の形態の構成と同じである。
しかし、関連語抽出部１３０の動作が次の点で上述した第１の実施の形態および変形例（１）〜（３）とは相違している。
第１の実施の形態の関連語抽出部１３０では、単にキーワードと関連度の高い単語を関連語と見なしている。しかし、本変形例では、関連語として一旦、抽出されたもののうち低頻度語を削除してから、関連度の高い関連語を残すようにしている点が相違している。
この低頻度語の削除は、適合文書中に出現する各単語に対してキーワードとの関連度を算出し、適合文書での出現頻度が低く、すでに関連語として選出した単語と同じ文書群に出現する、より関連度の低い単語を関連語として採用しないようにする。
たとえば、キーワード「情報検索」に対して、適合文書抽出部１２０で適合文書として１０文書を選出し、それらの文書から関連語として文書２と文書７に出現しており、関連度が0.68である単語「Okapi」が選出されているとする。
この場合、関連語の候補の単語「BM25」が、文書２と文書７に出現しており関連度が0.62であったときには、単語「BM25」を関連語とはせず、単語「Okapi」のみを関連語として残すようにする。
一般に、適合文書中で単語の出現頻度が高く、検索対象文書全体でその単語の出現頻度が低いほど、単語とキーワードとの関連度は高くなることは、（式３）および（式４）から導出することができる。
逆に、適合文書中で出現頻度が低く、関連度が高い単語は、検索対象文書全体での出現頻度が低いと考えられる。このような検索対象文書全体での出現頻度が低い単語（低頻度語）同士が同じ文書に同時に出現することは、極めてまれであり、複数の低頻度語がともに出現する文書群は、たがいに中身が非常に近い可能性が高いと判断する。
従って、適合文書中で出現頻度が低く、すでに関連語として選出した単語と同じ文書群に出現する、より関連度の低い単語は、関連語として採用しないことで、中身がほぼ同一である可能性の高い文書にある単語を省いて、より関連度の高いキーワード関連語を選択することができるようになる。
【００１８】
本変形例の場合、図３のフローチャートのステップＳ１６０のキーワードの関連語を抽出して新しいキーワードを生成するときに、次のような手順で低頻度の単語を削除して関連語を抽出する。
先ず、適合文書中に出現する単語を抽出し、たとえば、（式３）および（式４）を用いて、これらの単語に対してキーワードとの関連度を算出する。また、これらの単語の適合文書中での出現頻度と、この単語がどの文書に出現するかも抽出する。
この出現頻度が所定の数より低い単語に対して、すでに関連語として選出した単語と同じ文書群にも出現する場合、単語とその関連語の関連度を比較し、単語の方の関連度が低いときは、関連語として採用しない。
この操作によって、残された単語をキーワードの関連語として採用し、新しいキーワードを生成する。
本変形例の関連語選出装置をこのような構成にすることによって、中身の異なる複数の文書からより関連度の強い単語をキーワードの関連語として選出することができるようになり、ユーザの所望する的確な文書を検索することに寄与できる。
【００１９】
＜第２の実施の形態＞
（１）第２の実施の形態の構成
図４は、本発明に係る文書検索装置の実施の形態の機能構成例を示すブロック図である。
この実施の形態の文書検索装置は、キーワード入力部１１０、適合文書抽出部１２０、関連語抽出部１３０、キーワード生成部１４０、文書出力部１７０、文書データベース１６０より構成される。
第１の実施の形態の関連語選出装置と同様の機能をもつブロックには同じ符号を付けてあり、以下では、相違点についてのみ説明する。
適合文書抽出部１２０では、キーワード入力部１１０またはキーワード生成部１４０から受け取るキーワードに対して、文書データベース１６０を検索し、適合度を計算した時点（図３のステップＳ１２０に相当）で、文書出力部を介して、ユーザに検索結果を出力し、その当否を判断させる。
ここでユーザが否とした場合、キーワードを再入力するかまたは関連語を選出して、キーワードに付加し、再度、検索のやり直しをさせるようにする。
関連語を抽出するように指示された場合、適合文書抽出部１２０は、重複文書を削除する（図３のステップＳ１３０からＳ１５０までに相当）。この削除法には、上記した第１の実施の形態（変形例（１）〜（３）も含まれる）が使われる。
キーワード生成部１４０では、生成された関連語を付加した新しいキーワードを生成し、この新キーワードを適合文書抽出部１２０へ直接渡すようにする。
文書出力部１７０は、適合文書抽出部１２０で検索された適合度の高い文書の一覧（たとえば、適合度、文書名、書誌事項等による一覧表）をプリンタ、表示装置またはフ記憶装置等の出力装置へ出力したり、または、ネットワークを介して他のコンピュータへ送信することによって、ユーザへキーワードに適合する文書の一覧を提供できる。
文書検索装置を第２の実施の形態のような構成にすることによって、中身の異なる複数の文書からより関連度の強い単語をキーワードの関連語として選出することができるようになり、ユーザの所望する的確な文書を検索することができる。
【００２０】
＜コンピュータによる実施の形態＞
さらに、本発明は上記の実施の形態のみに限定されたものではない。たとえば、図１または図４に示した関連語選出装置や文書検索装置は、図５のようなハードウェア構成を持つコンピュータ装置２００によっても実現が可能である。
すなわち、コンピュータ装置２００は、キーボード、マウス、タッチパネル、スキャナ等により構成され、情報の入力に使用される入力装置１と、種々の出力情報や入力装置１からの入力された情報などを表示出力させる表示装置２と、種々のプログラムを動作させるＣＰＵ（ＣｅｎｔｒａｌＰｒｏｃｅｓｓｉｎｇＵｎｉｔ；中央処理ユニット）３と、プログラム自身を保持し、またそのプログラムがＣＰＵ３によって実行されるときに一時的に作成される情報等を保持するメモリ４と、本発明の関連語選出装置や文書検索装置の文書データベース１６０およびプログラムやプログラム実行時の一時的な情報等を保持する記憶装置５と、プログラムやデータ等を記憶した記録媒体を装着してそれらを読み込み、メモリ４または記憶装置５へ格納するのに用いられる媒体駆動装置６と、ネットワーク９へ接続するためのインタフェースであるネットワーク接続装置７とから構成され、それらはバス８で接続されている。
また、ネットワーク９は、コンピュータ装置２００と他のコンピュータ装置２００とを結合するための伝送路であって、一般には、ケーブルで実現され、通信プロトコルにはＴＣＰ／ＩＰが使われる。但し、伝送路としてはケーブルだけではなく、それらの間の通信プロトコルが一致するものであれば無線、有線および放送波のいずれでもよく、たとえば、ＬＡＮ（ＬｏｃａｌＡｒｅａＮｅｔｗｏｒｋ）、ＷＡＮ（ＷｉｄｅＡｒｅａＮｅｔｗｏｒｋ）、インターネット、アナログ電話網、デジタル電話網（ＩＳＤＮ：ＩｎｔｅｇｒａｌＳｅｒｖｉｃｅＤｉｇｉｔａｌＮｅｔｗｏｒｋ）、ＰＨＳ（パーソナルハンディシステム）、携帯電話網、衛星通信網などを用いることができる。
このようなコンピュータ装置２００の構成において、図１または図４に示した関連語選出装置や文書検索装置を構成する各機能をそれぞれプログラム化し、予めＣＤ−ＲＯＭ等の記録媒体に書き込んでおき、このＣＤ−ＲＯＭを各サイトのＣＤ−ＲＯＭドライブのような媒体駆動装置６を搭載したコンピュータ装置に装着して、これらのプログラムをそれぞれのコンピュータ装置のメモリ４あるいは記憶装置５に格納し、それを実行することによって、上記の実施の形態と同様な機能を実現することができる。
【００２１】
なお、記録媒体としては半導体媒体（たとえば、ＲＯＭ、ＩＣメモリカード等）、光媒体（たとえば、ＤＶＤ、ＭＯ、ＭＤ、ＣＤ−Ｒ等）、磁気媒体（たとえば、磁気テープ、フレキシブルディスク等）のいずれであってもよい。
また、コンピュータ装置２００のメモリ４へロードしたプログラムを実行することにより上記した実施の形態の機能が実現されるだけでなく、そのプログラムの指示に基づき、オペレーティングシステム等が実際の処理の一部または全部を行い、その処理によって上記した実施の形態の機能が実現される場合も含まれる。
また、上記した実施の形態を実現するプログラムがＲＯＭ等のような半導体の記録媒体である場合には、媒体駆動装置６からではなく、直接、メモリ４へロードして実行される。
【００２２】
＜本発明のネットワーク環境での運用＞
図６は、本発明を有線または無線の通信ネットワークに接続して運用する形態の構成を示している。
たとえば、関連語選出プログラムや文書検索プログラムを保持するサーバー３００と複数のユーザが利用する端末３１０とをネットワーク９で接続する。
この場合、サーバー３００およびユーザの端末３１０は、図５に示した汎用のコンピュータ装置２００で構成される。
ユーザは、端末３１０からサーバー３００に対してログインしたり、文書検索のためのキーワードを入力し、サーバー３００の文書検索プログラムへ検索の実行を依頼する。サーバー３００の文書検索プログラムは指定されたキーワードに適合した検索結果を要求もとの端末３１０へ戻す。ユーザの端末３１０は、この検索結果を出力する。
このようにすることで、常に最新の文書検索プログラムを使えるという利点がある。
ユーザは、関連語選出プログラムに対しても文書検索プログラム同様にして実行することにより、キーワードに関連した関連語を得ることができる。
また、図６のようにサーバー３００と端末３１０とを有線または無線の通信ネットワークで接続した場合、サーバー３００の磁気ディスク等の記憶装置に本発明の機能を実現する関連語選出プログラムや文書検索プログラムを格納しておき、端末３１０に対してダウンロード等の形式で頒布することも可能である。
さらに、本発明の機能を実現する関連語選出プログラムや文書検索プログラムを媒体や放送波による配布で提供するようにしてもよい。
【００２３】
【発明の効果】
以上説明したように、本発明によれば、キーワード関連語選出の際に、適合文書中に中身が同一あるいは、ほぼ同一である文書が複数含まれていても、それによって関連語が偏ることなく、検索に寄与できる適切な関連語を選ぶことができる。
これによって、ユーザの所望する的確な文書を検索することができる。
【図面の簡単な説明】
【図１】第１の実施の形態の関連語選出装置の構成を示すブロック図である。
【図２】文書データベースのデータ構造を説明するための図である。
【図３】第１の実施の形態の関連語選出装置の処理の流れを説明するためのフローチャートである。
【図４】第２の実施の形態の文書検索装置の構成を示すブロック図である。
【図５】本発明をコンピュータで実現するときのハードウェアの構成を示す図である。
【図６】本発明をネットワーク環境で運用する場合を説明するための図である。
【符号の説明】
１１０ …… キーワード入力部
１２０ …… 適合文書抽出部
１３０ …… 関連語抽出部
１４０ …… キーワード生成部
１５０ …… 出力部
１６０ …… 文書データベース
１７０ …… 文書出力部
２００ …… コンピュータ装置
３００ …… サーバー
３１０ …… 端末
１ …… 入力装置
２ …… 表示装置
３ …… ＣＰＵ
４ …… メモリ
５ …… 記憶装置
６ …… 媒体駆動装置
７ …… ネットワーク接続装置
８ …… バス
９ …… ネットワーク
[0001]
[Technical Field of the Invention]
The present invention relates to a related word selection device, method, and recording medium, and to a document retrieval device, method, and recording medium. More specifically, it relates to a technique for searching a document database for documents that match a given keyword and selecting, from those matching documents, related words that are relevant to the keyword, and to a technique for searching again after adding the related words to the given keyword.
[0002]
[Prior art]
In general, keywords are used to represent the characteristics of a document. Another approach characterizes a document from a different angle by attaching related words, as in a thesaurus.
One application of related words arises when a user looks for needed documents in a document database that holds a large number of documents. A known method first runs a search with the keyword entered by the user, selects words related to that keyword from among the words that appear in the matching documents, adds them to the original keyword, and searches again, which yields results closer to what the user wants.
For example, one proposed method for selecting words related to a keyword computes, for each word in the matching documents, a degree of association with the keyword from statistical information such as how the word occurs in the matching documents, and selects the top-ranked words by that value (Reference 1: Robertson, S.E., "On term selection for query expansion," Journal of Documentation 46, Dec. 1990, pp. 359-364).
[0003]
Next, this conventional method of extracting related words is described in more detail.
First, each word in the keyword entered by the user is given a weight according to the importance of the word. One known formula for this word weight is Robertson's formula based on a probabilistic model (Formula 1) (Reference 2: Robertson, S.E. and Walker, S., "On relevance weights with little relevance information," SIGIR 97, ACM Press, pp. 16-24). In the technique of Reference 2, the weight of each word in the keyword is assigned from Wp and Wq, which reflect how the word occurs across all of the documents to be searched.
W (weight) = Wp - Wq ……… (Formula 1)
where Wp = k4 + log(N / (N - n))
Wq = log(n / (N - n))
N: total number of documents to be searched
n: number of documents in which the word appears
k4: adjustment parameter
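As an illustration only, the weight of Formula 1 can be sketched in Python as follows. The sketch reads Formula 1 as the difference W = Wp - Wq, which is the form that Formula 3 reduces to when no feedback is available. The function name `word_weight`, the use of the natural logarithm, and the default `k4 = 0.0` are assumptions made for this example and are not specified in the description.

```python
import math


def word_weight(N: int, n: int, k4: float = 0.0) -> float:
    """Weight of one keyword word, read as W = Wp - Wq (assumed form of Formula 1).

    N  : total number of documents to be searched
    n  : number of documents in which the word appears
    k4 : adjustment parameter (the default 0.0 is an illustrative assumption)
    """
    if not 0 < n < N:
        raise ValueError("need 0 < n < N so that both logarithms are defined")
    wp = k4 + math.log(N / (N - n))  # Wp
    wq = math.log(n / (N - n))       # Wq
    return wp - wq
```

With `k4 = 0` the expression simplifies to log(N / n), so a word that appears in fewer documents receives a larger weight.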
Next, the degree of match of each document is calculated from the weights of the words in the keyword. It is obtained with, for example, the formula of Reference 2 (Formula 2).
F (degree of match) = Σ(W × tf / (k1 + tf)) ……… (Formula 2)
where W: weight calculated with Formula 1
tf: number of occurrences of the word in the document
k1: adjustment parameter
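A minimal sketch of Formula 2, assuming the sum runs over the words of the keyword and that a word absent from the document has tf = 0. The function name `degree_of_match`, the dictionary-based inputs, and the default `k1 = 1.0` are hypothetical choices for illustration.

```python
def degree_of_match(weights: dict[str, float], tf: dict[str, int], k1: float = 1.0) -> float:
    """Document score F of Formula 2 for one document.

    weights : keyword word -> weight W from Formula 1
    tf      : word -> number of occurrences in this document
    k1      : adjustment parameter; must be positive so that tf = 0 gives 0
    """
    total = 0.0
    for word, w in weights.items():
        count = tf.get(word, 0)  # a keyword word missing from the document contributes nothing
        total += w * count / (k1 + count)
    return total
```

Because tf / (k1 + tf) approaches 1 as tf grows, repeated occurrences of a word add less and less to the score.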
The degree of match is obtained for each document, the documents are ranked in descending order of degree of match, a certain number at the top are regarded as matching documents, and a certain number at the bottom are regarded as non-matching documents.
After the matching documents are selected, the weight of every word in the matching documents, excluding stop words (for example, the article "a"), is recalculated to reflect how the word occurs in the matching and non-matching documents, that is, the feedback information.
The weight after selection of the matching documents is assigned with, for example, the formula of Reference 2 (Formula 3), which combines the occurrence statistics over all documents to be searched, Wp and Wq (see the definitions under Formula 1), with the occurrence statistics within the matching and non-matching documents, Wr and Ws, in the proportions Cp and Cq.
W' (weight) = (Cp・Wp + (1 - Cp)・Wr) - (Cq・Wq + (1 - Cq)・Ws) …… (Formula 3)
where Wr = log((r + 0.5) / (R - r + 0.5))
Ws = log((s + 0.5) / (S - s + 0.5))
Cp = k5 / (k5 + √R)
Cq = k6 / (k6 + √S)
R: number of matching documents
r: number of documents in the matching document set in which the word appears
S: number of non-matching documents
s: number of documents in the non-matching document set in which the word appears
k5, k6: adjustment parameters
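The recalculated weight of Formula 3 could be sketched as below. The function name `feedback_weight`, the natural logarithm, and the default values of `k4`, `k5`, and `k6` are assumptions for illustration; the description does not give concrete parameter values.

```python
import math


def feedback_weight(N: int, n: int, R: int, r: int, S: int, s: int,
                    k4: float = 0.0, k5: float = 1.0, k6: float = 1.0) -> float:
    """Weight W' of Formula 3 for one word.

    N, n : total documents / documents containing the word (0 < n < N)
    R, r : matching documents / matching documents containing the word
    S, s : non-matching documents / non-matching documents containing the word
    k4, k5, k6 : adjustment parameters (defaults are illustrative assumptions)
    """
    wp = k4 + math.log(N / (N - n))
    wq = math.log(n / (N - n))
    wr = math.log((r + 0.5) / (R - r + 0.5))
    ws = math.log((s + 0.5) / (S - s + 0.5))
    cp = k5 / (k5 + math.sqrt(R))  # weight kept on collection-wide evidence
    cq = k6 / (k6 + math.sqrt(S))
    return (cp * wp + (1 - cp) * wr) - (cq * wq + (1 - cq) * ws)
```

When R and S are both 0, Cp and Cq equal 1 and the result falls back to Wp - Wq. As R and S grow, the feedback terms Wr and Ws carry more of the weight.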
Further, from this weight and the feedback information, the degree of association with the keyword is obtained for each word in the matching documents, excluding stop words.
[0004]
One method of calculating the degree of association is, for example, Boughanem's formula (Formula 4) (Reference 3: Walker, S. et al., "Okapi at TREC-6: Automated ad hoc, VLC, routing, filtering and QSDR," The Sixth Text REtrieval Conference (TREC-6), 1996, NIST).
Degree of association = (r / R - α・s / S) × W' ……… (Formula 4)
where α: adjustment parameter
In this way, the degree of association with the keyword is obtained for each word in the matching documents, and words are selected as keyword-related words in descending order of degree of association.
When this is used in a document retrieval device, the related words are added to the entered keyword to create a new keyword, and matching documents are selected again with the new keyword.
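The scoring of Formula 4, the ranking step, and the creation of the new keyword can be sketched together as follows. All function names, the tuple layout of `stats`, and the defaults `alpha = 1.0` and `top_k = 10` are hypothetical; the sample numbers in the usage note are invented for illustration and do not come from the description.

```python
def degree_of_association(r: int, R: int, s: int, S: int,
                          w_prime: float, alpha: float = 1.0) -> float:
    """Formula 4: (r / R - alpha * s / S) * W'. Requires R > 0 and S > 0."""
    return (r / R - alpha * s / S) * w_prime


def select_related_words(stats: dict[str, tuple[int, int, float]],
                         R: int, S: int, top_k: int = 10, alpha: float = 1.0) -> list[str]:
    """Rank candidate words by Formula 4 and return the top_k words.

    stats maps each candidate word to (r, s, W'), where W' comes from Formula 3.
    """
    scored = {word: degree_of_association(r, R, s, S, w, alpha)
              for word, (r, s, w) in stats.items()}
    return sorted(scored, key=lambda word: scored[word], reverse=True)[:top_k]


def expand_keyword(keyword_words: list[str], related: list[str]) -> list[str]:
    """Append related words that are not already part of the keyword."""
    return keyword_words + [w for w in related if w not in keyword_words]
```

For example, with `R = 10`, `S = 500`, and `stats = {"okapi": (8, 5, 3.0), "the": (10, 480, 0.1), "bm25": (6, 2, 3.5)}`, the word "the" scores close to zero because it is almost as common in the non-matching documents as in the matching ones.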
[0005]
[Problems to be solved by the invention]
Extracting related words as described above has conventionally assumed, when the database is searched, that the documents to be searched all differ in content.
However, for documents published on the Internet, for example, it is not unusual for the same document to be published on multiple servers (mirror sites).
Even though these are the same document, their Internet addresses, which serve as document identifiers, differ, so from the database's point of view they are regarded as separate documents.
In addition, multiple documents that were originally the same document (hereinafter, duplicate documents) may each receive small modifications at different times, so that multiple documents that are not strictly identical but are nearly identical (hereinafter, near-duplicate documents) come to coexist in the database.
Such a document database may therefore contain multiple duplicate and near-duplicate documents, and the keyword-related word selection methods proposed in the prior art then run into the following problem.
Because the degree of association between a keyword and a word is calculated from, among other things, the number of matching documents in which the word appears, words in duplicate and near-duplicate documents naturally appear in more documents and are consequently regarded as highly associated with the keyword.
To obtain appropriate related words, it is desirable to select words that appear widely and in common across multiple documents with different content. If a word is given a high degree of association merely because it appears in several documents with similar content, biased words with little general applicability may be selected.
The present invention solves the above problem. Its object is to provide a related word selection device, method, and recording medium that select related words for a given keyword from a document database without bias and thereby contribute to retrieval.
A further object is to provide a document retrieval device, method, and recording medium that can use such related words to retrieve precisely the documents the user wants.
[0006]
[Means for Solving the Problems]
To solve the above problem, the invention according to claim 1 is a related word selection device that selects related words related to an input keyword from a document database holding a plurality of documents, comprising: a matching document extraction unit that calculates, for each document, a degree of match with the keyword based on the number of occurrences of the words in the keyword and the weight of each word, and that extracts as a matching document the document with the highest degree of match from among a plurality of duplicate documents, in the plurality of documents, that are regarded as duplicates under a predetermined rule; and a related word extraction unit that extracts related words related to the keyword from each matching document extracted by the matching document extraction unit.
The invention according to claim 2 is the related word selection device according to claim 1, wherein the matching document extraction unit treats, among the plurality of documents, documents having the same number of words as the duplicate documents.
The invention according to claim 3 is the related word selection device according to claim 1, wherein the duplicate document deletion unit treats, among the plurality of documents, documents in which each word in the keyword has the same frequency of occurrence as the duplicate documents.
The invention according to claim 4 is the related word selection device according to claim 1, wherein the matching document extraction unit treats, among the plurality of documents, documents in which each word in the keyword has the same positions of occurrence and the same intervals between occurrences as the duplicate documents.
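The extraction described in claims 1 to 4 could be sketched as follows: documents are visited in descending order of score, and a document is skipped when an earlier, higher-scoring document has the same signature under the chosen rule. The record layout (`id`, `score`, `word_count`, `keyword_tf`, `keyword_positions`), the function names, and the default `limit = 10` are assumptions for illustration, not the patented implementation.

```python
from typing import Callable, Hashable


def extract_matching_documents(docs: list[dict],
                               signature: Callable[[dict], Hashable],
                               limit: int = 10) -> list[dict]:
    """Keep only the highest-scoring document of each group of presumed duplicates."""
    seen: set[Hashable] = set()
    kept: list[dict] = []
    for doc in sorted(docs, key=lambda d: d["score"], reverse=True):
        key = signature(doc)
        if key in seen:  # a higher-scoring document with the same signature was already kept
            continue
        seen.add(key)
        kept.append(doc)
        if len(kept) == limit:
            break
    return kept


def by_word_count(doc: dict) -> Hashable:
    """Rule of claim 2: same total number of words."""
    return doc["word_count"]


def by_keyword_frequencies(doc: dict) -> Hashable:
    """Rule of claim 3: same frequency for every keyword word."""
    return tuple(sorted(doc["keyword_tf"].items()))


def by_keyword_positions(doc: dict) -> Hashable:
    """Rule of claim 4: same positions (and therefore the same intervals)."""
    return tuple(sorted((w, tuple(p)) for w, p in doc["keyword_positions"].items()))
```

Passing the rule as a function keeps the three criteria interchangeable, and the rules can be chained by running the result of one pass through another.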
The invention according to claim 5 is a related word selection device that selects related words related to an input keyword from a document database holding a plurality of documents, comprising: a matching document extraction unit that calculates, for each document, a degree of match with the keyword based on the number of occurrences of the words in the keyword and the weight of each word, and that extracts as a matching document one of the documents having the same degree of match; and a related word extraction unit that extracts related words related to the keyword from the matching documents extracted by the matching document extraction unit.
The invention according to claim 6 is the related word selection device according to any one of claims 1 to 5, further comprising a keyword generation unit that generates a new keyword by adding the related words extracted by the related word extraction unit to the keyword, wherein the matching document extraction unit extracts new matching documents based on the words in the new keyword generated by the keyword generation unit, and the related word extraction unit extracts related words from the new matching documents.
The invention according to claim 7 is the related word selection device according to any one of claims 1 to 6, wherein the related word extraction unit does not register as a related word a word that appears at low frequency in the same matching documents as a word already selected as a related word and that has a lower degree of association than the selected word.
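One possible reading of the rule in claim 7 is sketched below. It assumes that "low frequency" means appearing in fewer than `min_docs` matching documents and that "the same matching documents" means an identical set of document identifiers; both interpretations, the function name, and the default thresholds are assumptions for illustration.

```python
def filter_low_frequency(candidates: list[tuple[str, float, set[int]]],
                         min_docs: int = 3, top_k: int = 10) -> list[str]:
    """Select related words, skipping low-frequency words shadowed by a stronger word.

    candidates : (word, degree of association, ids of matching documents containing the word)
    """
    selected: list[tuple[str, float, set[int]]] = []
    for word, score, doc_ids in sorted(candidates, key=lambda c: c[1], reverse=True):
        low_frequency = len(doc_ids) < min_docs
        shadowed = low_frequency and any(
            doc_ids == sel_docs and sel_score > score
            for _, sel_score, sel_docs in selected
        )
        if shadowed:  # same documents as a stronger related word: likely near-duplicate content
            continue
        selected.append((word, score, doc_ids))
        if len(selected) == top_k:
            break
    return [word for word, _, _ in selected]
```

For instance, if "Okapi" (0.68) and "BM25" (0.62) both appear only in documents 2 and 7, only "Okapi" is kept, while a word spread over many matching documents is unaffected by the rule.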
The invention according to claim 8 is a related word selection method for selecting related words related to an input keyword from a document database holding a plurality of documents, comprising: a step in which a matching document extraction unit calculates, for each document, a degree of match with the keyword based on the number of occurrences of the words in the keyword and the weight of each word, and extracts as a matching document the document with the highest degree of match from among a plurality of duplicate documents, in the plurality of documents, that are regarded as duplicates under a predetermined rule; and a step in which a related word extraction unit extracts related words related to the keyword from each matching document extracted by the matching document extraction unit.
The invention according to claim 9 is a related word selection method for selecting related words related to an input keyword from a document database holding a plurality of documents, comprising: a step in which a matching document extraction unit calculates, for each document, a degree of match with the keyword based on the number of occurrences of the words in the keyword and the weight of each word, and extracts as a matching document one of the documents having the same degree of match; and a step in which a related word extraction unit extracts related words related to the keyword from the matching documents extracted by the matching document extraction unit.
[0007]
The invention according to claim 10 is the related word selection method according to claim 8 or 9, further comprising: a step in which a keyword generation unit generates a new keyword by adding the related words extracted by the related word extraction unit to the keyword; a step in which the matching document extraction unit extracts new matching documents based on the words in the new keyword generated by the keyword generation unit; and a step in which the related word extraction unit extracts related words from the new matching documents.
The invention according to claim 11 is a computer-readable recording medium on which is recorded a program for causing a computer to carry out the related word selection method according to any one of claims 8 to 10.
The invention according to claim 12 is a document retrieval device that retrieves documents matching an input keyword from a document database holding a plurality of documents, comprising: a matching document extraction unit that calculates, for each document, a degree of match with the keyword based on the number of occurrences of the words in the keyword and the weight of each word, and that extracts as a matching document the document with the highest degree of match from among a plurality of duplicate documents, in the plurality of documents, that are regarded as duplicates under a predetermined rule; a related word extraction unit that extracts related words related to the keyword from each matching document extracted by the matching document extraction unit; and a keyword generation unit that generates a new keyword by adding the related words extracted by the related word extraction unit to the keyword, wherein the matching document extraction unit extracts new matching documents that contain the words in the new keyword generated by the keyword generation unit, and the related word extraction unit extracts related words from the new matching documents.
The invention according to claim 13 is a document retrieval device that retrieves documents matching an input keyword from a document database holding a plurality of documents, comprising: a matching document extraction unit that calculates, for each document, a degree of match with the keyword based on the number of occurrences of the words in the keyword and the weight of each word, and that extracts as a matching document one of the documents having the same degree of match; a related word extraction unit that extracts related words related to the keyword from the matching documents extracted by the matching document extraction unit; and a keyword generation unit that generates a new keyword by adding the related words extracted by the related word extraction unit to the keyword, wherein the matching document extraction unit extracts new matching documents that contain the words in the new keyword generated by the keyword generation unit, and the related word extraction unit extracts related words from the new matching documents.
The invention according to claim 14 is a document search method for searching for a document that matches a keyword input from a document database that holds a plurality of documents. Calculating a fitness for the keyword based on the number of occurrences and the weight of each word, extracting the document with the highest fitness from duplicate documents in the plurality of documents, and a related word extracting unit A step of extracting a related word related to the keyword from each compatible document extracted by the compatible document extracting unit; and a keyword generating unit adds the related word extracted by the related word extracting unit to the keyword. Generating a new keyword, and the conforming document extracting unit includes a word in the new keyword generated by the keyword generating unit. Extracting a new relevant documents based, the related word extraction unit, extracting the related word from the new relevant documents,
It is comprised from these.
The invention according to claim 15 is the document search method for searching for a document that matches a keyword input from a document database that holds a plurality of documents. Calculating the degree of matching for the keyword based on the number of occurrences and the weight of each word, extracting one of the documents with the same degree of matching as a matching document, and a related word extracting unit, A step of extracting a related word related to the keyword from the compatible document extracted by the compatible document extracting unit; and a keyword generating unit adding the related word extracted by the related word extracting unit to the keyword A keyword generating step, and the conforming document extracting unit generates a new keyword generated by the keyword generating unit. A new matching document including a word in a new keyword is extracted, and the related word extraction unit is configured to extract a related word from the new matching document.
According to a sixteenth aspect of the present invention, there is provided a computer-readable recording medium on which a program for causing a computer to implement the document search method according to the fourteenth or fifteenth aspect is recorded.
[0008]
The document search apparatus according to claim 10 of the present invention is a document search apparatus that searches a document database holding a plurality of documents for documents that match an input keyword, comprising: a matching document extraction unit that deletes documents whose content is the same or almost the same from among the documents with a high degree of matching retrieved from the document database by the keyword, and extracts documents with a high degree of matching from the documents that remain without being deleted; a related word extraction unit that extracts a related word related to the keyword from the matching documents extracted by the matching document extraction unit; and a keyword generation unit that generates a new keyword by adding the related word extracted by the related word extraction unit to the keyword, wherein the matching document extraction unit searches again with the new keyword generated by the keyword generation unit so as to obtain the accurate documents desired by the user.
The document search method according to claim 11 of the present invention is a document search method for searching a document database holding a plurality of documents for documents that match an input keyword, wherein documents whose content is the same or almost the same are deleted from among the documents with a high degree of matching retrieved from the document database by the keyword, documents with a high degree of matching are extracted from the documents that remain without being deleted, related words related to the keyword are extracted from the extracted documents, the related words are added to the keyword to generate a new keyword, and the document database is searched again using the new keyword to find the documents desired by the user.
According to a twelfth aspect of the present invention, there is provided a computer-readable recording medium storing a document search program for causing a computer to function as a document search device that searches a document database holding a plurality of documents for documents that match an input keyword, the program realizing the functions of: a matching document extraction unit that deletes documents whose content is the same or almost the same from among the documents with a high degree of matching retrieved from the document database by the keyword, and extracts documents with a high degree of matching from the documents that remain without being deleted; a related word extraction unit that extracts a related word related to the keyword from the matching documents extracted by the matching document extraction unit; and a keyword generation unit that generates a new keyword by adding the related word extracted by the related word extraction unit to the keyword, wherein the matching document extraction unit searches again with the new keyword generated by the keyword generation unit so as to obtain the accurate documents desired by the user.
[0009]
DETAILED DESCRIPTION OF THE INVENTION
The configuration and operation of the embodiment of the present invention will be described below in detail with reference to the drawings.
(1) Configuration of the first embodiment
FIG. 1 is a functional block diagram of a first embodiment of a related word selection device according to the present invention.
The related word selection apparatus according to the first embodiment includes a keyword input unit 110, a compatible document extraction unit 120, a related word extraction unit 130, a keyword generation unit 140, an output unit 150, and a document database 160.
The keyword input unit 110 allows a user to input a character string that serves as a keyword representing the characteristics of a document in the document database 160 using a keyboard or the like.
The matching document extraction unit 120 searches the document database 160 for the keyword passed from the keyword input unit 110 and selects matching documents and non-matching documents. At this time, among the retrieved documents with a high fitness, documents whose content is the same (duplicate documents) or almost the same (quasi-duplicate documents) are reduced to a single document: one is kept and the others are deleted. Hereinafter, duplicate documents and quasi-duplicate documents are collectively referred to as duplicate documents. This deletion operation is repeated until the predetermined number of matching documents is reached. This reduces duplication among the documents with a high fitness.
The related word extraction unit 130 extracts related words according to the degree of relevance calculated between each word taken from the matching documents extracted by the matching document extraction unit 120 and the input keyword, and passes them to the keyword generation unit 140.
The keyword generation unit 140 generates a new keyword by adding the related word passed from the related word extraction unit 130 to the original keyword. This new keyword may be passed to the relevant document extracting unit 120 in order to extract a related word related to the original keyword, or may be presented to the user as it is from the output unit 150.
The output unit 150 outputs the related words extracted by the related word extraction unit 130 and the keyword on which they are based to a printer, a display device, a storage device, or the like, or transmits them to another computer device via a network.
The document database 160 includes document information that holds a document to be searched, and word statistical information of each word included in the document (see FIG. 2).
For example, in the document information, the following information is held for each document.
Document identifier (ID), document name, bibliographic items (creator, creation date, issuing place, etc.),
Pointer to document entity, etc.
The word statistical information holds the following statistical information for each word.
Word notation, appearance frequency of this word in the entire document database, word appearance information, etc. Here, as word appearance information, the following information is held for each document in which the word appears.
Document identifier of the document in which this word appears, frequency of appearance of the word in this document, list of occurrence positions in which this word appears, etc.
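The two kinds of information held in the document database 160 can be pictured as the following minimal sketch. The class names, field names, and sample values (including the file path) are hypothetical and chosen only for illustration; the source lists the items but does not prescribe a concrete layout.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentInfo:
    doc_id: str                                        # document identifier (ID)
    name: str                                          # document name
    bibliographic: dict = field(default_factory=dict)  # creator, creation date, etc.
    entity_pointer: str = ""                           # pointer to the document entity


@dataclass
class WordAppearance:
    doc_id: str                                    # document in which the word appears
    frequency: int                                 # appearance frequency in that document
    positions: list = field(default_factory=list)  # occurrence positions in that document


@dataclass
class WordStatistics:
    notation: str                                    # word notation
    total_frequency: int                             # frequency in the entire database
    appearances: list = field(default_factory=list)  # one WordAppearance per document


# Hypothetical sample entries.
doc = DocumentInfo("D1", "Sample report", {"creator": "unknown"}, "/docs/D1.txt")
stats = WordStatistics("search", 3, [WordAppearance("D1", 3, [4, 11, 27])])
```

Keeping the per-document appearance list inside each word entry makes it possible to derive both per-document frequencies and occurrence positions from the word statistical information alone.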
[0010]
(2) Operation of the first embodiment
Next, the operation of the related word selection device of the first embodiment configured as described above will be described based on the flowchart of FIG.
First, a keyword character string is input from an input device such as a keyboard (step S100).
This step corresponds to the keyword input unit 110.
This keyword is composed of, for example, English or Japanese words or word combinations, and the word combinations are decomposed into single words as necessary.
For each word in the input keyword, the word statistical information in the document database 160 is referred to, and the weight corresponding to the importance of the word is calculated using, for example, (Equation 1) (step S110).
Next, the following information is calculated for each document in the document database 160 to be searched, and a document list is created (step S120).
[0011]
(A) Calculation of fitness for each document
By referring to the word statistical information in the document database 160 and the weights of the keyword words calculated in step S110, the fitness (degree of matching), which indicates how strongly the words in the keyword are represented in the document, is calculated using, for example, (Equation 2) above.
(B) Number of words included in each document
The number of words included in the document is calculated from the word statistical information in the document database 160.
This number can be calculated by summing the appearance frequencies of all words that have the same document identifier.
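Items (A) and (B) can be sketched as follows. This is an illustrative sketch only: the word weights are taken as given (they stand in for the output of step S110), the value k1 = 1.0 is an arbitrary assumption, and the function and variable names are hypothetical.

```python
def fitness(weights, term_freqs, k1=1.0):
    """(Equation 2): F = sum of W * tf / (k1 + tf) over the words in the keyword."""
    total = 0.0
    for word, weight in weights.items():
        tf = term_freqs.get(word, 0)  # occurrences of the keyword word in this document
        total += weight * tf / (k1 + tf)
    return total


def word_count(term_freqs):
    """Number of words in a document: sum of the frequencies of all its words."""
    return sum(term_freqs.values())


# Assumed inputs: weights from step S110 and one document's word frequencies.
weights = {"information": 2.0, "search": 1.0}
doc_tf = {"information": 2, "search": 1, "system": 4}

score = fitness(weights, doc_tf)   # 2.0 * 2/3 + 1.0 * 1/2
length = word_count(doc_tf)        # 2 + 1 + 4
```

Because tf / (k1 + tf) saturates as tf grows, repeated occurrences of one keyword word contribute less and less to the fitness.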
Using this document list, each document is ordered in descending order using the document matching level as the first key and the number of words contained in the document as the second key (step S130).
If there are documents with the same number of words in this ordered document list, only the document with the highest fitness is left, and the remaining documents with the same number of words are deleted. This operation is repeated until a predetermined number (for example, about 10 documents) is reached from the one with the highest fitness. What is selected here is extracted as a relevant document.
This is because it is extremely rare for two documents having different contents to have the same number of words, and therefore, the present invention considers that documents having the same number of words are likely to have the same contents.
Further, a predetermined number (for example, about 500) of documents from the lower part of the document list are regarded as non-conforming documents. The duplicate document is deleted from the non-conforming document as well as the conforming document (step S140).
Alternatively, whether a document is a matching document or a non-matching document may be determined by presenting the ordered document list (a list of fitness, document name, bibliographic items, etc.) to the user and having the user specify it.
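The automatic selection in steps S130 to S140 can be sketched as below. This is a simplified sketch under stated assumptions: documents are given as hypothetical (identifier, fitness, word count) tuples, duplicates are removed in a single pass over the whole ordered list, and the non-matching documents are then taken from the bottom of what remains.

```python
def select_documents(docs, n_matching=10, n_non_matching=500):
    """docs: list of (doc_id, fitness, word_count) tuples."""
    # Step S130: order by fitness (first key) and word count (second key), descending.
    ordered = sorted(docs, key=lambda d: (d[1], d[2]), reverse=True)

    # Step S140: among documents with the same word count keep only the first one,
    # which is the one with the highest fitness because of the ordering above.
    seen_counts = set()
    unique = []
    for doc in ordered:
        if doc[2] in seen_counts:
            continue
        seen_counts.add(doc[2])
        unique.append(doc)

    matching = unique[:n_matching]
    rest = unique[n_matching:]
    non_matching = rest[-n_non_matching:] if n_non_matching > 0 else []
    return matching, non_matching


docs = [("d1", 0.9, 100), ("d2", 0.8, 100), ("d3", 0.7, 50),
        ("d4", 0.1, 30), ("d5", 0.05, 30)]
matching, non_matching = select_documents(docs, n_matching=2, n_non_matching=1)
```

In this example d2 is dropped because it has the same word count as the higher-ranked d1, and d5 is dropped for the same reason relative to d4.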
[0012]
Steps S110 to S140 correspond to the matching document extraction unit 120.
The words in the matching documents obtained in step S140 are registered in a related word table, which holds the candidates for related words of the input keyword. The table is created by extracting, from the word statistical information of the document database 160, the words contained in the matching documents. At this time, an unnecessary word table prepared in advance is consulted, and words registered in it are not registered in the related word table.
Further, for each word registered in this related word table, its appearance status in the matching documents and the non-matching documents is extracted from the word statistical information of the document database 160, and its degree of relevance to the keyword is calculated using, for example, (Equation 3) and (Equation 4).
A predetermined number of words (for example, about 10 words) are selected in descending order of the degree of relevance, and these are extracted as keyword-related words (step S150).
Step S150 corresponds to the related word extraction unit 130.
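Step S150 can be sketched as follows. The sketch assumes the degree of relevance of each candidate has already been computed (for example with (Equation 3) and (Equation 4)) and is passed in as a dictionary; the function names and the sample relevance values are hypothetical.

```python
def build_related_word_table(matching_docs, unnecessary_words):
    """Collect candidate words from the matching documents, skipping unnecessary words."""
    table = set()
    for words in matching_docs:
        table.update(w for w in words if w not in unnecessary_words)
    return table


def top_related_words(table, relevance, n=10):
    """Pick the n candidates with the highest degree of relevance."""
    return sorted(table, key=lambda w: relevance[w], reverse=True)[:n]


matching_docs = [["the", "okapi", "search"], ["a", "bm25", "okapi"]]
unnecessary = {"the", "a"}
table = build_related_word_table(matching_docs, unnecessary)

# Placeholder relevance values, assumed to come from (Equation 3) and (Equation 4).
relevance = {"okapi": 0.68, "bm25": 0.62, "search": 0.40}
related = top_related_words(table, relevance, n=2)
```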
The keyword-related words extracted in step S150 are added to the originally input keyword, and a new keyword is created (step S160).
This step corresponds to the keyword generation unit 140.
Further, the user specifies whether or not to extract related words again (step S170). When instructed to perform extraction, the process repeats from step S110 using this new keyword. When the extraction ends, the process proceeds to step S180.
The extraction of related words may be performed without repetition, may be repeated a predetermined number of times, or may be configured so that the result is output and the extraction is repeated each time the user searches for matching documents.
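The repetition of steps S110 to S160 a predetermined number of times can be sketched as below. The extraction itself is passed in as a function; the stand-in `fake_extract` and its lookup table are purely hypothetical, and skipping words already present in the keyword is an assumption not stated in the source.

```python
def expand_keyword(keyword_words, extract_related_words, rounds=1):
    """Repeat related word extraction and keyword generation `rounds` times."""
    words = list(keyword_words)
    for _ in range(rounds):
        for related in extract_related_words(words):  # stands in for steps S110 to S150
            if related not in words:                  # step S160: add to the keyword
                words.append(related)
    return words


def fake_extract(words):
    """Hypothetical stand-in for the related word extraction unit 130."""
    table = {"information": ["retrieval"], "retrieval": ["index"]}
    found = []
    for w in words:
        found.extend(table.get(w, []))
    return found


one_round = expand_keyword(["information", "search"], fake_extract, rounds=1)
two_rounds = expand_keyword(["information", "search"], fake_extract, rounds=2)
```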
The new keyword generated in step S160 is presented to the user by outputting it to an output device such as a display device, a printer, or a storage device (step S180). This step corresponds to the output unit 150.
Further, the output may be such that the conforming document is transmitted to another computer device connected via a network.
By configuring the related word selection device as in the first embodiment, it becomes possible to select, as related words of the keyword, words that appear widely and commonly across a plurality of documents having different contents. This can contribute to searching for the accurate documents desired by the user.
[0013]
<Modification (1) of the first embodiment>
A related word selection device according to a modification of the first embodiment will be described.
The configuration of this modification is the same as that of the first embodiment shown in FIG.
However, the operation of the conforming document extraction unit 120 is different from that of the first embodiment in the following points.
In the first embodiment, attention is paid to the number of words included in a document as the criterion for regarding documents as duplicates. In this modification, however, documents having the same fitness are regarded as duplicate or quasi-duplicate documents.
In this case, after the document list in step S120 of the flowchart of FIG. 3 is created, the document list is ordered in step S130 in descending order based only on the fitness.
In step S140, only one of the documents having the same fitness is left in the document list, and the other documents having the same fitness are deleted.
This is because it is extremely rare for two documents having different contents to have the same fitness, and therefore documents having the same fitness are considered likely to have the same contents.
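The deletion rule of this modification can be sketched as follows. Documents are given as hypothetical (identifier, fitness) tuples, and "the same fitness" is taken here to mean exact numerical equality; whether a tolerance should be allowed is not stated in the source.

```python
def drop_same_fitness(docs):
    """docs: list of (doc_id, fitness). Keep one document per distinct fitness value."""
    # Step S130: order by fitness only, descending (the sort is stable for ties).
    ordered = sorted(docs, key=lambda d: d[1], reverse=True)

    # Step S140: equal fitness values are adjacent after sorting, so it is
    # enough to compare each document with the last one kept.
    kept = []
    for doc in ordered:
        if kept and kept[-1][1] == doc[1]:
            continue
        kept.append(doc)
    return kept


result = drop_same_fitness([("a", 0.87), ("b", 0.87), ("c", 0.50)])
```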
By configuring the related word selection device as in the present modification, it becomes possible to select, as related words of the keyword, words that appear widely and commonly across a plurality of documents having different contents. This can contribute to searching for the accurate documents desired by the user.
[0014]
<Modification (2) of the first embodiment>
A related word selection device according to another modification of the first embodiment will be described.
The configuration of this modification is the same as that of the first embodiment shown in FIG.
However, the operation of the conforming document extraction unit 120 is different from that of the first embodiment in the following points.
In the first embodiment, attention is paid to the number of words included in a document as the criterion for regarding documents as duplicates. In this modification, however, documents in which the appearance frequency of each word in the keyword is the same are regarded as duplicate or quasi-duplicate documents.
In this case, when the document list in step S120 in the flowchart of FIG. 3 is created, the appearance frequency of each word in the keyword appearing in each document is calculated.
In step S130, the degree of matching and the appearance frequency of each word in the keyword are used as keys to order them in descending order.
After that, in step S140, among the documents in the document list that have the same appearance frequency for each word in the keyword, only the one with the highest fitness is left, and the other documents with the same appearance frequencies are deleted.
For example, when the keyword is “information search”, in document 1, “information” appears twice and “search” appears once, and the fitness is 0.87. Further, in document 2, “information” appears twice and “search” appears once, and the fitness is 0.85.
In such a case, the document 2 is deleted from the document list, and only the document 1 is left.
This is because it is very rare for each word in a keyword to appear with the same frequency in two documents with different contents, so documents in which each word in the keyword has the same appearance frequency are regarded as likely to have the same contents.
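The example with documents 1 and 2 can be reproduced with the following sketch. The dictionary layout and function name are hypothetical, and document 3 is an added illustrative entry that is not in the source.

```python
def drop_same_keyword_frequencies(docs, keyword_words):
    """Keep, per keyword-frequency pattern, only the document with the highest fitness."""
    ordered = sorted(docs, key=lambda d: d["fitness"], reverse=True)
    seen = set()
    kept = []
    for doc in ordered:
        # Frequency of each keyword word in this document, in keyword order.
        signature = tuple(doc["tf"].get(w, 0) for w in keyword_words)
        if signature in seen:
            continue  # a document with higher fitness has the same pattern
        seen.add(signature)
        kept.append(doc)
    return kept


docs = [
    {"id": 1, "fitness": 0.87, "tf": {"information": 2, "search": 1}},
    {"id": 2, "fitness": 0.85, "tf": {"information": 2, "search": 1}},
    {"id": 3, "fitness": 0.60, "tf": {"information": 1, "search": 1}},  # illustrative
]
kept_ids = [d["id"] for d in drop_same_keyword_frequencies(docs, ["information", "search"])]
```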
By configuring the related word selection device as in the present modification, it becomes possible to select, as related words of the keyword, words that appear widely and commonly across a plurality of documents having different contents. This can contribute to searching for the accurate documents desired by the user.
[0015]
<Modification (3) of the first embodiment>
A related word selection device according to still another modification of the first embodiment will be described.
The configuration of this modification is the same as that of the first embodiment shown in FIG.
However, the operation of the conforming document extraction unit 120 is different from that of the first embodiment in the following points.
In the first embodiment, attention is paid to the number of words included in a document as the criterion for regarding documents as duplicates. In this modification, however, documents in which the appearance position and the appearance interval of each word in the keyword are the same are regarded as duplicate or quasi-duplicate documents.
In this case, when the document list in step S120 of the flowchart of FIG. 3 is created, the appearance position of each word in the keyword in each document (counted in words from the beginning or the end of the document) and the appearance interval (in number of words) are extracted.
In step S130, the fitness and the appearance position and appearance interval of each word in the keyword are used as keys, and the documents are ordered in descending order.
In step S140, only one document having the highest relevance among the documents having the same appearance position and appearance interval of each word in the keyword in the document list is left, and the other documents are deleted.
[0016]
For example, when the keyword is “information search”, in document 1, “information” appears in the 10th and 20th words, “search” appears in the 11th word, and the fitness is 0.87.
Further, in document 2, “information” appears in the 10th and 20th words, “search” appears in the 11th word, and the fitness is 0.85.
In such a case, the document 2 is deleted from the document list, and only the document 1 is left.
This is because it is extremely rare for each word in a keyword to appear at the same positions and intervals in two documents with different contents, so documents in which each word in the keyword has the same appearance positions and intervals are regarded as likely to have the same contents.
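The example with documents 1 and 2 can be sketched as below. The sketch assumes positions are counted in words from the beginning of the document and that the appearance interval is the gap between consecutive occurrences of the same word; both readings, the data layout, and document 3 are assumptions made for illustration.

```python
def position_signature(positions, keyword_words):
    """Appearance positions and intervals of each keyword word in one document."""
    signature = []
    for word in keyword_words:
        pos = tuple(sorted(positions.get(word, ())))
        gaps = tuple(b - a for a, b in zip(pos, pos[1:]))  # intervals between occurrences
        signature.append((pos, gaps))
    return tuple(signature)


def drop_same_positions(docs, keyword_words):
    """Keep, per position pattern, only the document with the highest fitness."""
    ordered = sorted(docs, key=lambda d: d["fitness"], reverse=True)
    seen = set()
    kept = []
    for doc in ordered:
        sig = position_signature(doc["positions"], keyword_words)
        if sig not in seen:
            seen.add(sig)
            kept.append(doc)
    return kept


docs = [
    {"id": 1, "fitness": 0.87, "positions": {"information": [10, 20], "search": [11]}},
    {"id": 2, "fitness": 0.85, "positions": {"information": [10, 20], "search": [11]}},
    {"id": 3, "fitness": 0.60, "positions": {"information": [5], "search": [40]}},  # illustrative
]
kept_ids = [d["id"] for d in drop_same_positions(docs, ["information", "search"])]
```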
By configuring the related word selection device as in this modification, it becomes possible to select, as related words of the keyword, words that appear commonly across a plurality of documents having different contents. This can contribute to searching for the accurate documents desired by the user.
Note that the matching document extraction unit 120 of the related word selection device of the present invention may use the first embodiment or any one of modifications (1), (2), and (3) alone, or may use an appropriate combination of them. For example, another of these criteria may be applied to the documents remaining after the first duplicate document deletion in order to delete further duplicate documents.
[0017]
<Modification (4) of the first embodiment>
A related word selection device according to still another modification of the first embodiment will be described.
The configuration of this modification is the same as that of the first embodiment shown in FIG.
However, the operation of the related word extraction unit 130 is different from the first embodiment and the modifications (1) to (3) described above in the following points.
In the related word extraction unit 130 of the first embodiment, a word having a high degree of relevance to the keyword is regarded as a related word. This modification differs in that low-frequency words are first removed from the related word candidates, and words with a high degree of relevance among the remaining candidates are kept as related words.
In this low-frequency word deletion, the degree of relevance to the keyword is calculated for each word that appears in the matching documents. A word that appears infrequently in the matching documents, appears in the same document group as a word already selected as a related word, and has a lower degree of relevance than that word is not adopted as a related word.
For example, suppose that for the keyword “information search”, the matching document extraction unit 120 selects 10 documents as matching documents, and that from these documents the word “Okapi”, which appears in documents 2 and 7 and has a degree of relevance of 0.68, is selected as a related word.
In this case, when the related word candidate “BM25” also appears in document 2 and document 7 and its degree of relevance is 0.62, the word “BM25” is not regarded as a related word, and only the word “Okapi” is left as a related word.
In general, it can be derived from (Equation 3) and (Equation 4) that the higher the appearance frequency of a word in the matching documents and the lower its appearance frequency in the entire set of search target documents, the higher the degree of relevance between the word and the keyword.
Conversely, words that have a low appearance frequency in the matching documents but a high degree of relevance are considered to have a low appearance frequency in the entire set of search target documents. It is extremely rare for several words with a low appearance frequency in the entire set of search target documents (low-frequency words) to appear together in the same documents, so a group of documents in which multiple low-frequency words appear together is judged very likely to have very similar contents.
Therefore, by not adopting as related words those words that appear infrequently in the matching documents and appear in the same document group as a word already selected as a related word, words from documents whose contents are likely to be almost the same are omitted, and keyword-related words with a higher degree of relevance can be selected.
[0018]
In this modification, when related words are extracted and a new keyword is generated in step S160 of the flowchart of FIG. 3, the related words are extracted while deleting low-frequency words by the following procedure.
First, the words appearing in the matching documents are extracted, and the degree of relevance to the keyword is calculated for these words using, for example, (Equation 3) and (Equation 4). The appearance frequency of each of these words in the matching documents and the documents in which it appears are also extracted.
For a word whose appearance frequency is lower than a predetermined number, if it appears in the same document group as a word that has already been selected as a related word, its degree of relevance is compared with that of the related word, and if it is lower, the word is not adopted as a related word.
The words remaining after this operation are adopted as keyword-related words, and a new keyword is generated.
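The procedure above can be sketched with the “Okapi” and “BM25” example as follows. The degree of relevance is taken as precomputed; the threshold of 3, the appearance frequencies, and the third candidate are hypothetical values added for illustration, and candidates are processed in descending order of relevance so that any previously selected word has an equal or higher degree of relevance.

```python
def select_related_words(candidates, low_freq_threshold=3, top_n=10):
    """candidates: dicts with 'word', 'relevance', 'frequency', and 'docs' (doc ids)."""
    ordered = sorted(candidates, key=lambda c: c["relevance"], reverse=True)
    selected = []
    for cand in ordered:
        docs = frozenset(cand["docs"])
        is_low_freq = cand["frequency"] < low_freq_threshold
        # True when an already selected word appears in the same document group.
        same_group = any(docs == frozenset(s["docs"]) for s in selected)
        if is_low_freq and same_group:
            continue  # low-frequency word, same documents, not higher relevance: drop
        selected.append(cand)
        if len(selected) == top_n:
            break
    return [c["word"] for c in selected]


candidates = [
    {"word": "Okapi", "relevance": 0.68, "frequency": 2, "docs": {2, 7}},
    {"word": "BM25", "relevance": 0.62, "frequency": 2, "docs": {2, 7}},
    {"word": "retrieval", "relevance": 0.55, "frequency": 9, "docs": {1, 2, 3, 4, 7}},
]
related = select_related_words(candidates)
```

Here “BM25” is dropped because it is a low-frequency word that appears in exactly the same documents as the already selected “Okapi”.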
By configuring the related word selection device as in this modification, it becomes possible to select, from a plurality of documents having different contents, words with a higher degree of relevance as related words of the keyword. This can contribute to searching for the accurate documents desired by the user.
[0019]
<Second Embodiment>
(1) Configuration of the second embodiment
FIG. 4 is a block diagram showing a functional configuration example of the embodiment of the document search device according to the present invention.
The document search apparatus according to this embodiment includes a keyword input unit 110, a compatible document extraction unit 120, a related word extraction unit 130, a keyword generation unit 140, a document output unit 170, and a document database 160.
Blocks having the same functions as those of the related word selection device of the first embodiment are denoted by the same reference numerals, and only differences will be described below.
The matching document extraction unit 120 searches the document database 160 for the keyword received from the keyword input unit 110 or the keyword generation unit 140 and calculates the fitness (corresponding to step S120 in FIG. 3). The search result is output to the user via the document output unit 170, and the user judges whether or not the search result is satisfactory.
If the user judges that it is not, the user either re-enters the keyword or has related words selected and added to the keyword, and the search is performed again.
When instructed to extract related words, the matching document extraction unit 120 deletes duplicate documents (corresponding to steps S130 to S150 in FIG. 3). For this deletion, the method of the above-described first embodiment (including modifications (1) to (3)) is used.
The keyword generation unit 140 generates a new keyword to which the generated related word is added, and directly passes this new keyword to the matching document extraction unit 120.
The document output unit 170 outputs a list of the documents with a high fitness retrieved by the matching document extraction unit 120 (for example, a list of fitness, document name, bibliographic items, and so on) to a printer, display device, storage device, or the like, or transmits it to another computer via a network, thereby providing the user with a list of documents that match the keyword.
By configuring the document search apparatus as in the second embodiment, it becomes possible to select, from a plurality of documents having different contents, words with a higher degree of relevance as related words of the keyword, and the accurate documents desired by the user can be searched for.
[0020]
<Embodiment by computer>
Furthermore, the present invention is not limited only to the above-described embodiment. For example, the related word selection device and the document search device shown in FIG. 1 or FIG. 4 can also be realized by a computer device 200 having a hardware configuration as shown in FIG.
That is, the computer device 200 comprises the following components connected by a bus 8: an input device 1, such as a keyboard, mouse, touch panel, or scanner, used for inputting information; a display device 2 that displays various output information, information input from the input device 1, and the like; a CPU (Central Processing Unit) 3 that runs various programs; a memory 4 that holds the programs and the information temporarily generated while the CPU 3 executes them; a storage device 5 that holds the document database 160 of the related word selection device and document search device of the present invention, programs, and temporary information at the time of program execution; a medium drive device 6 used to read a recording medium storing programs, data, and the like and load them into the memory 4 or the storage device 5; and a network connection device 7 that is an interface for connecting to the network 9.
The network 9 is a transmission path for connecting the computer device 200 to other computer devices 200. It is generally realized by a cable, with TCP/IP used as the communication protocol. However, the transmission path is not limited to a cable; any of wireless, wired, or broadcast-wave transmission may be used as long as both ends use the same communication protocol. For example, a LAN (Local Area Network), a WAN (Wide Area Network), the Internet, an analog telephone network, a digital telephone network (ISDN: Integrated Services Digital Network), PHS (Personal Handy-phone System), a mobile phone network, a satellite communication network, and the like can be used.
With the computer device 200 configured in this way, each function constituting the related word selection device and the document search device shown in FIG. 1 or FIG. 4 is implemented as a program and written in advance to a recording medium such as a CD-ROM. The CD-ROM is mounted on each computer device equipped with a medium drive device 6 such as a CD-ROM drive, and the programs are stored in the memory 4 or the storage device 5 of that computer device and executed, whereby the same functions as in the above-described embodiments can be realized.
[0021]
As a recording medium, any of a semiconductor medium (for example, ROM, IC memory card, etc.), an optical medium (for example, DVD, MO, MD, CD-R, etc.), and a magnetic medium (for example, magnetic tape, flexible disk, etc.) It may be.
Further, the present invention covers not only the case where the functions of the above-described embodiments are realized by executing a program loaded into the memory 4 of the computer device 200, but also the case where an operating system or the like performs part or all of the actual processing and the functions of the above-described embodiments are realized by that processing.
When the program for realizing the above-described embodiments is held on a semiconductor recording medium such as a ROM, the program is loaded directly into the memory 4 and executed without going through the medium drive device 6.
[0022]
<Operation in Network Environment of the Present Invention>
FIG. 6 shows a configuration of an embodiment in which the present invention is operated by connecting to a wired or wireless communication network.
For example, a server 300 holding a related word selection program and a document search program and a terminal 310 used by a plurality of users are connected via the network 9.
In this case, the server 300 and the user terminal 310 are configured by the general-purpose computer apparatus 200 shown in FIG.
The user logs in to the server 300 from the terminal 310, inputs a keyword for document search, and requests the document search program of the server 300 to execute the search. The document search program of the server 300 returns the search results that match the specified keyword to the requesting terminal 310. The user terminal 310 outputs the search results.
This has the advantage that the latest document search program can always be used.
The user can obtain a related word related to the keyword by executing the related word selection program in the same manner as the document search program.
In addition, when the server 300 and the terminal 310 are connected via a wired or wireless communication network as shown in FIG. 6, a related word selection program or a document search program that realizes the functions of the present invention can be stored in a storage device of the server 300, such as a magnetic disk, and distributed to the terminal 310 by download or a similar means.
Furthermore, a related word selection program or a document search program that implements the functions of the present invention may be provided by distribution through a medium or broadcast wave.
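The request and response exchange between the terminal 310 and the server 300 described above can be sketched roughly as follows. This is only an illustrative in-process model: the JSON message layout, the class name `Server`, the function `terminal_request`, and the toy matching logic are all assumptions made for this sketch and are not taken from the embodiment.

```python
import json


class Server:
    """Stand-in for server 300; holds the documents and both programs."""

    def __init__(self, docs):
        self.docs = docs  # doc_id -> list of words (hypothetical layout)

    def handle(self, request_json):
        request = json.loads(request_json)
        words = request["keyword"]
        # Toy matching: a document matches if it contains any keyword word.
        hits = sorted(doc_id for doc_id, tokens in self.docs.items()
                      if any(w in tokens for w in words))
        if request["program"] == "document_search":
            result = hits
        elif request["program"] == "related_word_selection":
            # Toy related words: every other word in the matching documents.
            result = sorted({t for d in hits for t in self.docs[d]} - set(words))
        else:
            return json.dumps({"error": "unknown program"})
        return json.dumps({"result": result})


def terminal_request(server, program, keyword):
    """Stand-in for terminal 310: send one request and decode the reply."""
    request_json = json.dumps({"program": program, "keyword": keyword})
    return json.loads(server.handle(request_json))


server = Server({
    "d1": ["printer", "toner", "recall"],
    "d2": ["scanner", "driver"],
    "d3": ["printer", "driver"],
})
search_reply = terminal_request(server, "document_search", ["printer"])
related_reply = terminal_request(server, "related_word_selection", ["printer"])
```

Because both programs live only on the server side in this sketch, replacing them there changes what every terminal receives, which corresponds to the advantage noted above that the latest program can always be used.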
[0023]
[Effect of the invention]
As described above, according to the present invention, when related words for a keyword are selected, even if the relevant documents include a plurality of documents with the same or almost the same content, the related words are not biased by those documents, and related words that can contribute to the search can be selected.
As a result, the documents desired by the user can be retrieved accurately.
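The effect described above can be illustrated with a minimal sketch. It assumes that the word weights W are already given, uses the document relevance of Formula 2 cited in the prior-art discussion (F = Σ W × tf / (k1 + tf)), and treats documents as duplicates when a caller-supplied key function returns the same value for them. The function names, the `dup_key` parameter, and the sample documents are hypothetical.

```python
from collections import defaultdict


def relevance(tokens, weights, k1=1.0):
    """Formula 2: sum over keyword words of W * tf / (k1 + tf)."""
    score = 0.0
    for word, w in weights.items():
        tf = tokens.count(word)
        score += w * tf / (k1 + tf)
    return score


def extract_relevant(docs, weights, dup_key, top_n=10, k1=1.0):
    """Keep the best-scoring document of each duplicate group, then rank."""
    best = {}  # duplicate key -> (score, doc_id)
    for doc_id, tokens in docs.items():
        score = relevance(tokens, weights, k1)
        if score <= 0:
            continue  # no keyword word occurs in this document
        key = dup_key(tokens)
        if key not in best or score > best[key][0]:
            best[key] = (score, doc_id)
    ranked = sorted(best.values(), key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in ranked[:top_n]]


def document_frequency(docs, doc_ids):
    """Count, for each word, how many of the given documents contain it."""
    df = defaultdict(int)
    for doc_id in doc_ids:
        for word in set(docs[doc_id]):
            df[word] += 1
    return dict(df)


docs = {
    "d1": "printer toner cartridge recall notice".split(),
    "d2": "printer toner cartridge recall notice".split(),  # copy of d1
    "d3": "printer driver install guide".split(),
}
weights = {"printer": 1.0}  # assumed weight, not computed from Formula 1

# Assumed duplicate rule for this sketch: identical word sequences.
relevant = extract_relevant(docs, weights, dup_key=tuple)
```

In this sample, the copy `d2` is dropped, so a word such as "toner" is counted in one relevant document rather than two and does not gain an advantage over words from `d3` merely because its source document was stored twice.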
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a configuration of a related word selection device according to a first embodiment.
FIG. 2 is a diagram for explaining a data structure of a document database.
FIG. 3 is a flowchart for explaining a processing flow of the related word selection device according to the first embodiment.
FIG. 4 is a block diagram illustrating a configuration of a document search apparatus according to a second embodiment.
FIG. 5 is a diagram illustrating a hardware configuration when the present invention is realized by a computer.
FIG. 6 is a diagram for explaining a case where the present invention is operated in a network environment.
[Explanation of symbols]
110 …… Keyword input section
120 …… Relevant document extraction unit
130 …… Related word extraction unit
140 …… Keyword generation unit
150 …… Output section
160 …… Document database
170 …… Document output section
200 …… Computer device
300 …… Server
310 …… Terminal
1 …… Input device
2 …… Display device
3 …… CPU
4 …… Memory
5 …… Storage device
6 …… Medium drive device
7 …… Network connection device
8 …… Bus
9 …… Network

Claims

A related word selection device for selecting related words related to an input keyword from a document database holding a plurality of documents, comprising:
a relevant document extraction unit that calculates, for each document, the degree of relevance to the keyword based on the number of occurrences of the words in the keyword and the weight of each word, and that extracts, as a relevant document, the document with the highest degree of relevance from among a plurality of duplicate documents regarded as duplicates under a predetermined rule within the plurality of documents; and
a related word extraction unit that extracts related words related to the keyword from each of the relevant documents extracted by the relevant document extraction unit.

The related word selection device according to claim 1,
wherein the relevant document extraction unit treats, as the duplicate documents, documents among the plurality of documents that have the same number of words.

The related word selection device according to claim 1,
wherein the duplicate document deletion unit treats, as the duplicate documents, documents among the plurality of documents in which each word in the keyword has the same appearance frequency.

The related word selection device according to claim 1,
wherein the relevant document extraction unit treats, as the duplicate documents, documents among the plurality of documents in which each word in the keyword has the same appearance position and appearance interval.

A related word selection device for selecting related words related to an input keyword from a document database holding a plurality of documents, comprising:
a relevant document extraction unit that calculates, for each document, the degree of relevance to the keyword based on the number of occurrences of the words in the keyword and the weight of each word, and that extracts, as a relevant document, one of the documents having the same degree of relevance; and
a related word extraction unit that extracts related words related to the keyword from the relevant documents extracted by the relevant document extraction unit.
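The duplicate rules named in the claims above (same number of words, same appearance frequency of each keyword word, same appearance position and interval, and same degree of relevance) can each be expressed as a key function, with documents that share a key treated as duplicates. The following is a minimal sketch under that assumption; all function names and the sample data are hypothetical, and the relevance used for the last rule is Formula 2 from the prior-art discussion with assumed weights.

```python
def key_word_count(tokens, keyword_words):
    """Rule: documents with the same number of words."""
    return len(tokens)


def key_keyword_frequency(tokens, keyword_words):
    """Rule: same appearance frequency of each word in the keyword."""
    return tuple(tokens.count(w) for w in keyword_words)


def key_positions(tokens, keyword_words):
    """Rule: same appearance positions and intervals of the keyword words."""
    positions = tuple((i, t) for i, t in enumerate(tokens) if t in keyword_words)
    indices = [i for i, _ in positions]
    intervals = tuple(b - a for a, b in zip(indices, indices[1:]))
    return positions, intervals


def make_key_score(weights, k1=1.0):
    """Rule: same degree of relevance (Formula 2, rounded against float noise)."""
    def key_score(tokens, keyword_words):
        total = 0.0
        for w in keyword_words:
            tf = tokens.count(w)
            total += weights.get(w, 0.0) * tf / (k1 + tf)
        return round(total, 9)
    return key_score


def group_duplicates(docs, keyword_words, key_fn):
    """Group document ids whose key values are equal, in insertion order."""
    groups = {}
    for doc_id, tokens in docs.items():
        groups.setdefault(key_fn(tokens, keyword_words), []).append(doc_id)
    return list(groups.values())


docs = {
    "d1": "printer toner recall notice".split(),
    "d2": "printer toner recall update".split(),
    "d3": "new printer and new toner".split(),
}
keyword_words = ["printer", "toner"]

by_length = group_duplicates(docs, keyword_words, key_word_count)
by_frequency = group_duplicates(docs, keyword_words, key_keyword_frequency)
by_position = group_duplicates(docs, keyword_words, key_positions)
by_score = group_duplicates(
    docs, keyword_words, make_key_score({"printer": 1.0, "toner": 2.0}))
```

The rules differ in strictness: in this sample the frequency and relevance rules place all three documents in one group, while the word-count and position rules separate `d3` from the other two.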

The related word selection device according to any one of claims 1 to 5, further comprising:
  a keyword generation unit that generates a new keyword by adding the related words extracted by the related word extraction unit to the keyword,
  wherein the relevant document extraction unit extracts new relevant documents based on the words in the new keyword generated by the keyword generation unit, and
  the related word extraction unit extracts related words from the new relevant documents.
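The feedback arrangement of this claim (extract relevant documents, extract related words, add them to the keyword, then repeat the extraction with the new keyword) can be sketched as a single function that takes the two extraction steps as parameters. The function `select_with_feedback` and the two toy extraction functions below are hypothetical stand-ins chosen for brevity; they are not the extraction methods of the embodiment.

```python
def select_with_feedback(keyword, extract_relevant, extract_related):
    """One feedback pass: expand the keyword, then extract again."""
    first_docs = extract_relevant(keyword)
    first_related = extract_related(keyword, first_docs)
    # Keyword generation: add the related words to the original keyword.
    new_keyword = list(keyword) + [w for w in first_related if w not in keyword]
    new_docs = extract_relevant(new_keyword)
    return new_keyword, extract_related(new_keyword, new_docs)


docs = {
    "d1": {"printer", "toner", "cartridge"},
    "d2": {"printer", "toner", "driver"},
    "d3": {"toner", "cartridge", "refill"},
    "d4": {"scanner", "driver"},
}


def toy_extract_relevant(keyword):
    """Toy rule: a document is relevant if it contains any keyword word."""
    return sorted(d for d, words in docs.items() if words & set(keyword))


def toy_extract_related(keyword, doc_ids):
    """Toy rule: words found in at least two relevant documents."""
    counts = {}
    for d in doc_ids:
        for w in docs[d] - set(keyword):
            counts[w] = counts.get(w, 0) + 1
    return sorted(w for w, c in counts.items() if c >= 2)


new_keyword, new_related = select_with_feedback(
    ["printer"], toy_extract_relevant, toy_extract_related)
```

In this sample, the first pass adds "toner" to the keyword, which brings `d3` into the relevant documents, and the second pass then yields "cartridge" as a related word that the original keyword alone would not have produced.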

The related word selection device according to any one of claims 1 to 6,
wherein the related word extraction unit does not register, as a related word, a word that appears less frequently in the same relevant document as a word already selected as a related word and that has a lower degree of relevance than the selected word.

A related word selection method for selecting related words related to an input keyword from a document database holding a plurality of documents, comprising:
a step in which a relevant document extraction unit calculates, for each document, the degree of relevance to the keyword based on the number of occurrences of the words in the keyword and the weight of each word, and extracts, as a relevant document, the document with the highest degree of relevance from among a plurality of duplicate documents regarded as duplicates under a predetermined rule within the plurality of documents; and
a step in which a related word extraction unit extracts related words related to the keyword from each relevant document extracted by the relevant document extraction unit.

A related word selection method for selecting related words related to an input keyword from a document database holding a plurality of documents, comprising:
a step in which a relevant document extraction unit calculates, for each document, the degree of relevance to the keyword based on the number of occurrences of the words in the keyword and the weight of each word, and extracts, as a relevant document, one of the documents having the same degree of relevance; and
a step in which a related word extraction unit extracts related words related to the keyword from each relevant document extracted by the relevant document extraction unit.

The related word selection method according to claim 8 or 9, further comprising:
  a step in which a keyword generation unit generates a new keyword by adding the related words extracted by the related word extraction unit to the keyword;
  a step in which the relevant document extraction unit extracts new relevant documents based on the words in the new keyword generated by the keyword generation unit; and
  a step in which the related word extraction unit extracts related words from the new relevant documents.

11. A computer-readable recording medium on which a program for causing a computer to realize the related word selection method according to claim 8 is recorded.

A document search apparatus for searching a document database holding a plurality of documents for documents that match an input keyword, comprising:
  a relevant document extraction unit that calculates, for each document, the degree of relevance to the keyword based on the number of occurrences of the words in the keyword and the weight of each word, and that extracts, as a relevant document, the document with the highest degree of relevance from among a plurality of duplicate documents regarded as duplicates under a predetermined rule within the plurality of documents;
  a related word extraction unit that extracts related words related to the keyword from each of the relevant documents extracted by the relevant document extraction unit; and
  a keyword generation unit that generates a new keyword by adding the related words extracted by the related word extraction unit to the keyword,
  wherein the relevant document extraction unit extracts new relevant documents including the words in the new keyword generated by the keyword generation unit, and the related word extraction unit extracts related words from the new relevant documents.

A document search apparatus for searching a document database holding a plurality of documents for documents that match an input keyword, comprising:
  a relevant document extraction unit that calculates, for each document, the degree of relevance to the keyword based on the number of occurrences of the words in the keyword and the weight of each word, and that extracts, as a relevant document, one of the documents having the same degree of relevance;
  a related word extraction unit that extracts related words related to the keyword from the relevant documents extracted by the relevant document extraction unit; and
  a keyword generation unit that generates a new keyword by adding the related words extracted by the related word extraction unit to the keyword,
  wherein the relevant document extraction unit extracts new relevant documents including the words in the new keyword generated by the keyword generation unit, and the related word extraction unit extracts related words from the new relevant documents.

A document search method for searching a document database holding a plurality of documents for documents that match an input keyword, comprising:
  a step in which a relevant document extraction unit calculates, for each document, the degree of relevance to the keyword based on the number of occurrences of the words in the keyword and the weight of each word, and extracts the document with the highest degree of relevance from among duplicate documents in the plurality of documents;
  a step in which a related word extraction unit extracts related words related to the keyword from each relevant document extracted by the relevant document extraction unit;
  a step in which a keyword generation unit generates a new keyword by adding the related words extracted by the related word extraction unit to the keyword;
  a step in which the relevant document extraction unit extracts new relevant documents based on the words in the new keyword generated by the keyword generation unit; and
  a step in which the related word extraction unit extracts related words from the new relevant documents.

A document search method for searching a document database holding a plurality of documents for documents that match an input keyword, comprising:
  a step in which a relevant document extraction unit calculates, for each document, the degree of relevance to the keyword based on the number of occurrences of the words in the keyword and the weight of each word, and extracts, as a relevant document, one of the documents having the same degree of relevance;
  a step in which a related word extraction unit extracts related words related to the keyword from the relevant documents extracted by the relevant document extraction unit;
  a step in which a keyword generation unit generates a new keyword by adding the related words extracted by the related word extraction unit to the keyword;
  a step in which the relevant document extraction unit extracts new relevant documents including the words in the new keyword generated by the keyword generation unit; and
  a step in which the related word extraction unit extracts related words from the new relevant documents.

A computer-readable recording medium on which a program for causing a computer to implement the document search method according to claim 14 is recorded.