JP3637756B2 - Information search device, information search method, and recording medium

Info

Publication number: JP3637756B2
Application number: JP34768197A
Authority: JP
Inventors: 雄大中山
Original assignee: Fuji Xerox Co Ltd; Fujifilm Business Innovation Corp
Current assignee: Fujifilm Business Innovation Corp
Priority date: 1997-12-17
Filing date: 1997-12-17
Publication date: 2005-04-13
Anticipated expiration: 2017-12-17
Also published as: JPH11175558A

Description

【０００１】
【発明の属する技術分野】
本発明は情報検索装置、情報検索方法および記録媒体に関し、特に、情報の単位であるノードとノード間のリンクによって構成されたハイパードキュメントシステムを対象にして情報を検索する情報検索装置、情報検索方法および記録媒体に関する。
【０００２】
【従来の技術】
ハイパードキュメントシステムでは、取り扱われる情報をいくつかの小さな単位（ノード）に分割し、それらを関連付けて整理している（このような関連付けを示す情報を、リンクと呼ぶ）。例えば、インターネット上では、ＷＷＷ(World Wide Web)により、ハイパードキュメントシステムが構築されている。ＷＷＷの情報は、ＨＴＭＬ(Hyper Text Markup Language)で記述されている。このＨＴＭＬは、ノード間のリンクに意味的制約がない。このようにノード間のリンクに意味的制約を持たないシステムには、ドキュメントオーサ（作者）が意のままにコンテンツとリンク構造を決めることができるという利点がある。そして、このようなハイパードキュメントシステムにより、ドキュメントリーダ（読者）は、ドキュメントオーサの構築したリンク構造をたどり、そのドキュメントオーサが提供する全ての情報に対してアクセスできる。
【０００３】
ところで、インターネットなどのハイパードキュメントシステムの情報量は膨大である。そのため、ドキュメントリーダが必要な情報を見つけ出すには、情報検索を支援するシステムが必要である。そのような検索を支援する従来技術としては、以下の２つがある。
【０００４】
第１の従来技術は、予めできるだけ大量のノードを（ランダムに）スキャンして各ノードの検索インデックスを用意しておき、ドキュメントリーダからのクエリー（キーワードの組み合わせ）に対してマッチするものを提示するものである。なお、検索インデックス作成およびクエリーとのマッチングに関する要素技術として、統計的言語処理手法であるベクタースペースモデル（G. Salton & J. Allan, Text Retrieval Using the Vector Processing Model, in Proc. of SDAIR94 ）が考案されている。
【０００５】
第２の従来技術は、予めできるだけ大量のノードを（ランダムに）スキャンして、それらをトピックにより分類した木構造のディレクトリに割り当てておくものである。ドキュメントリーダは、欲する情報が含まれると考えられるトピックをディレクトリ上から探し、そこから目指す情報にアクセスする。なお、この技術を実現するための要素技術として、自然言語処理を応用した自動文書分類手法（例えば、P. Jacobs, Joining Statistics with NLP for Text Categorization, in Proc. of Applied-ACL92 ）が提案されている。さらに、メディアを画像に拡張した自動文書分類手法（United States Patent: 5526443, T. Nakayama (FXPAL), Method and apparatus for highlighting and categorizing documents using coded word tokens, issue date:1996.6.11）も考案されている。
【０００６】
【発明が解決しようとする課題】
しかし、これらの従来技術では、１つのノード（例えば、１つのＨＴＭＬ文書）を１つの検索対象単位とするため、ノードとリンクによる構造で概念を表現するというハイパードキュメントシステムの本質を捉えることができず、以下に示すような問題が生じている。
【０００７】
ある情報をいくつのノードに分割してどのように構造化するかは、ドキュメントオーサの嗜好によるものであるにもかかわらず、ノードを一単位とするような検索では、ハイパーネットワーク上に構造化されたノード群を、大局的にある１つの意味的まとまりを持つ情報として捉えることができない。つまり、従来技術による検索では、意味的に不完全な情報断片だけを検索対象とすることになり、コンテキストが検索に反映されない。
【０００８】
例えば、一人のドキュメントオーサが作成した１つの意味的まとまりをもった情報が、複数のＨＴＭＬ文書に分割されて表現されている場合、従来技術で文書検索を行うと、各ＨＴＭＬ文書が個別の検索対象となる。ここで、ドキュメントリーダが「概念Ａ」に類似する情報を検索すると、当該ドキュメントオーサが作成した情報が全体として「概念Ａ」に類似していても、分割された個々のノードが「概念Ａ」に類似していなければ、この情報（若しくは一部のノード）が検出されることはない。
【０００９】
しかも、１つのノードを検索対象単位とすると、検索要求を表す概念をハイパーネットワーク上の構造で表現することができないという問題点もある。
さらには、ドキュメントオーサが、ある１つの意味的まとまりを持つ情報を複数のノードに分割して構造化した場合、従来の検索ではそれぞれのノードが個別に出力され、冗長性が生じるという問題点もある。一人のドキュメントオーサが１つの意味的まとまりを持つ情報として作成した一連のＨＴＭＬ文書が個別に出力されると、検索結果の量が膨れ上がってしまい、目的に合致した文書を探し出すためのドキュメントリーダの労力が増加してしまう。
【００１０】
そこで、あるノードを起点として、そのノードにリンクされているＮ次のノード（Ｎ＝２，３，・・・）の特徴を比較してその類似性を判定し、類似であると判定されたＮ次ノードを、起点となるノードと合成することによって、意味的まとまりを持つ情報を１単位とする検索が考えられる。
【００１１】
しかしながら、このような方法によれば、ハイパードキュメントシステムの本質を捉えた情報検索が可能となる一方で、「意味的なまとまり」の間に包含関係が発生する場合があるため、検索結果が冗長になるという問題がある。
【００１２】
そのような場合の一例を図１０に示す。この図において、平行四辺形は、それぞれノードを示しており、また、矢印はリンクを示している。この例では、ノードＡを起点とする意味的まとまり（実線で囲繞された部分）は、ノードＢを起点とする意味的まとまり（点線で囲繞された部分）を包含している。従って、ある検索要求に対して、双方が検索結果として与えられた場合、前者を先に読んだ（ブラウジングした）ドキュメントリーダにとって、後者を読みにいく行為は冗長なものとなる。しかしながら、実際にアクセスしてみないと、冗長であるか否かは判定できないという問題点もあった。
【００１３】
本発明はこのような点に鑑みてなされたものであり、意味的まとまりを持つ情報を一単位として効率よく情報を検索できる情報検索装置を提供することを目的とする。
【００１４】
更に、本発明の別の目的は、意味的まとまりを持つ情報を一単位として効率よく情報を検索できる情報検索方法を提供することである。
更にまた、本発明の他の目的は、意味的まとまりを持つ情報を一単位として効率よく情報を検索するための情報検索プログラムを記録した記録媒体を提供することである。
【００１５】
【課題を解決するための手段】
本発明では上記課題を解決するために、情報の単位であるノードとノード間のリンクによって構成されたハイパードキュメントシステムを対象にして情報を検索する情報検索装置において、リンクによる結合と内容の類似性に基づいてノード群を構成するノード群構成手段と、前記ノード群を構成する個々のノードを記憶する構成ノード記憶手段と、所定の検索要求に対応して前記ハイパードキュメントシステムから情報を検索する情報検索手段と、前記情報検索手段の検索結果として得られた複数のノード群の相互の包含関係を、前記構成ノード記憶手段に記憶されている情報に基づいて解析する包含関係解析手段と、前記包含関係解析手段の解析結果に応じて、他のノード群に包含されているノード群を排除するノード群排除手段と、排除されずに残ったノード群を検索結果として出力する検索結果出力手段と、を有することを特徴とする情報検索装置が提供される。
【００１６】
ここで、ノード群構成手段は、リンクによる結合と内容の類似性に基づいてノード群を構成する。構成ノード記憶手段は、ノード群を構成する個々のノードを記憶する。情報検索手段は、所定の検索要求に対応してハイパードキュメントシステムから情報を検索する。包含関係解析手段は、情報検索手段の検索結果として得られた複数のノード群の相互の包含関係を、構成ノード記憶手段に記憶されている情報に基づいて解析する。ノード群排除手段は、包含関係解析手段の解析結果に応じて、他のノード群に包含されているノード群を排除する。検索結果出力手段は、排除されずに残ったノード群を検索結果として出力する。
【００１７】
また、情報の単位であるノードとノード間のリンクによって構成されたハイパードキュメントシステムを対象にして情報を検索する情報検索装置において、リンクによる結合と内容の類似性に基づいて検索の対象となるノード群を構成するノード群構成手段と、前記ノード群構成手段によって構成されたノード群のそれぞれに属する個々のノードを記憶する構成ノード記憶手段と、前記ノード群の相互の包含関係を、前記構成ノード記憶手段に記憶されている情報に基づいて解析する包含関係解析手段と、前記包含関係解析手段によって包含関係が認められた１組のノード群の間の類似度を算定する類似度算定手段と、前記類似度算定手段によって算定された類似度が所定の閾値以上である場合には、包含される方のノード群を排除し、排除されずに残ったノード群を検索の対象として絞り込む検索対象絞り込み手段と、を有することを特徴とする情報検索装置が提供される。
【００１８】
ここで、ノード群構成手段は、リンクによる結合と内容の類似性に基づいて検索の対象となるノード群を構成する。構成ノード記憶手段は、ノード群構成手段によって構成されたノード群のそれぞれに属する個々のノードを記憶する。包含関係解析手段は、ノード群の相互の包含関係を、構成ノード記憶手段に記憶されている情報に基づいて解析する。類似度算定手段は、包含関係解析手段によって包含関係が認められた１組のノード群の間の類似度を算定する。検索対象絞り込み手段は、類似度算定手段によって算定された類似度が所定の閾値以上である場合には、包含される方のノード群を排除し、排除されずに残ったノード群を検索の対象として絞り込む。
【００１９】
【発明の実施の形態】
以下、本発明の実施の形態を図面を参照して説明する。
図１は、本発明の情報検索装置の実施の形態の構成例を示す図である。この図において、ノード群構成手段１は、ハイパードキュメントシステム１０に分散して配置されている情報から、内容の類似性に応じてノード群（例えば、図１０のノードＡまたはＢを起点とするノード群）を構成する。
【００２０】
構成ノード記憶手段２は、ノード群構成手段１によって構成されたノード群に属する（異なり）ノードのパス名を記憶する。なお、ここで使用するパス名とは、ノードに対してユニークに与えられる絶対アドレスのことで、例えば、ＷＷＷ（World Wide Web）であればＵＲＬ（Uniform Resource Locator）を指す。
【００２１】
情報検索手段３は、ハイパードキュメントシステム１０に配置されているノードから、検索要求に対応するノードを取得し、検索結果の候補として出力する。包含関係解析手段４は、情報検索の結果として得られたノードを入力し、そのノードを起点とするノード群を構成ノード記憶手段２から取得し、得られたノード群の相互の包含関係を解析する。
【００２２】
ノード群排除手段５は、包含関係解析手段４による解析の結果に応じて、他のノード群に包含されているノード群を排除する。
検索結果出力手段６は、ノード群排除手段５によって排除されずに残ったノード群を、最終検索結果として出力する。
【００２３】
次に、図２を参照してノード群構成手段１の構成例について説明する。図２に示すように、ノード群構成手段１は、起点ノード特徴抽出部１ａ、関連ノード取得部１ｂ、関連ノード特徴抽出部１ｃ、類似性判定部１ｄ、ノード合成部１ｅにより構成されている。
【００２４】
起点ノード特徴抽出部１ａは、起点ノードが入力されると、起点ノードの内容を解析し、その特徴を抽出する。抽出した特徴は、起点ノード特徴プロファイルとして類似性判定部１ｄに渡される。ここで、ノードの特徴に関する情報とは、そのノードの内容を特徴付ける単語とその重要度を示す値の対の集合を指す。例えば、起点ノードに出現する各単語に関する出現頻度、出現位置及び品詞の情報に基づいて重み付けすることにより、起点ノード特徴プロファイルを作成する。
【００２５】
関連ノード取得部１ｂは、起点ノードが入力された際に、そのノードからリンクが張られている２次ノードを取得するとともに、取得したノードからさらにリンクが張られているノード（関連ノード）を順次取得する。そして、他のノードへのリンクがなくなるまで同様の処理を行う。この時に取得される２次ノード以降の各ノードを、Ｎ次ノードとする（Ｎ＝２，３，４，・・・）。
【００２６】
関連ノード特徴抽出部１ｃは、関連ノード取得部１ｂが抽出したＮ次の各ノードの特徴を抽出し、関連ノード特徴プロファイルを作成する。作成した関連ノード特徴プロファイルは、類似性判定部１ｄに渡される。
【００２７】
類似性判定部１ｄは、関連ノード特徴抽出部１ｃが作成した関連ノード特徴プロファイルに基づいて、起点ノードからリンクを辿ることによりアクセス可能な全てのノードの起点ノードに対する類似性の判断処理を行う。そして、類似しているノードの内容を、ノード合成部１ｅに渡す。ノード合成部１ｅは、類似性判定部１ｄによって抽出された全てのノードを起点ノードに合成し、合成ノード（ノード群）を出力する。その結果、ノード群構成手段１からは、同一の概念により包括される合成ノード（ノード群）に対応する複数のパス名が出力される。
【００２８】
次に、図１に示す実施の形態の動作について説明する。
ノード群構成手段１は、ハイパードキュメントシステム１０に配置されている情報からノード群を生成して出力する。即ち、ノード群構成手段１は、ハイパードキュメントシステム１０に配置されているノードをランダムにスキャンし、取得したノードを起点ノードとして以下の処理を実行する。
【００２９】
関連ノード取得部１ｂは、与えられた起点ノードに関連する（リンクされている）Ｎ次のノードを順次取得する。そして、起点ノード特徴抽出部１ａと関連ノード特徴抽出部１ｃは、図３に示す処理に従って、特徴抽出処理を行う。なお、起点ノード特徴抽出部１ａと関連ノード特徴抽出部１ｃで行われる処理は同様であるので、以下の説明では起点ノード特徴抽出部１ａで行われる処理についてのみ言及する。
〔Ｓ１〕起点ノードが与えられ、その情報ソースが起点ノード特徴抽出部１ａに入力される。
〔Ｓ２〕情報ソースから、ハイパードキュメントシステム記述言語（例えば、ＨＴＭＬ）で定義されたタグを除去する。
〔Ｓ３〕既知の形態素解析技術を用いて、残されたテキストから単語を抽出する。
〔Ｓ４〕ステップＳ３で得られた単語の集合から重要単語だけを抽出する。ここで、重要単語とは情報ソースの内容を特徴付けている単語のことであり、例えば、名詞だけを重要単語とするといった方法で抽出する。
〔Ｓ５〕ステップＳ４で得られた重要単語に対して、出現頻度や出現位置を考慮して、重み付けをする。すなわち、出現頻度の高い単語ほど重要度を高くする。また、出現位置が文書の先頭に近いほど重要度を高くする。
〔Ｓ６〕最後に、重要単語とその重みとの組からなるリストを作成し、これを起点ノード特徴プロファイルとする。
【００３０】
このようにして得られた、起点ノードの特徴プロファイル（単数）は、類似性判定部１ｄに渡される。
また、前述のように、起点ノードは、関連ノード取得部１ｂにも渡されており、関連ノード取得部１ｂは、受け取ったノードの情報ソースに含まれるリンク情報を検索し、そのリンク先のノードを２次ノードとして取得する。例えば、起点ノードがＨＴＭＬで作成されていれば、アンカータグ（＜Ａ＞・・・＜／Ａ＞）で囲まれた領域内のＵＲＬ(Uniform Resource Locator)を抽出し、そのＵＲＬで指定された文書（２次ノード）を取得する。
【００３１】
そして、同様の処理を繰り返すことにより、Ｎ次のノード（関連ノード）を順次取得していき、関連ノードの集合として、関連ノード特徴抽出部１ｃに渡す。関連ノード特徴抽出部１ｃは、前述の場合と同様の処理を実行することにより、各関連ノードに対する関連ノード特徴プロファイルを作成する。その関連ノードの特徴プロファイル（一般に複数）は、類似性判定部１ｄに渡される。これにより、類似性判定部１ｄには、起点ノード特徴プロファイルと複数の関連ノード特徴プロファイルとが渡されることになる。
類似性判定部１ｄでは、図４に示す処理が実行されることになる。
〔Ｓ２１〕Ｎ＝２という初期化を行う。
〔Ｓ２２〕Ｎ次ノードが存在するか否かが判定される。存在すればステップＳ２３に進み、そうでなければ処理を終了する。
〔Ｓ２３〕ｐ＝１という初期化を行う。また、Ｎ次ノードの個数をｍとする。
〔Ｓ２４〕ｐとｍの大小を比較して、ｐ＞ｍであればステップＳ２９に進み、そうでなければステップＳ２５に進む。
〔Ｓ２５〕起点ノードとｐ番目のＮ次ノードの類似度を前述の方法（既知のベクター内積演算手法）で計算する。
〔Ｓ２６〕ステップＳ２５で得られた類似度の値と閾値を比較して、類似度＞閾値であれば、ステップＳ２７に進み、そうでなければ、ステップＳ２８に進む。ここで、閾値は予め設定された値であり、その大小で類似性の許容範囲を調整する。
〔Ｓ２７〕ｐ番目のＮ次ノードを、起点ノードに合成するノードの候補として記憶する。
〔Ｓ２８〕ｐの値に１を加算して、ステップＳ２４に進む。
〔Ｓ２９〕Ｎの値に１を加算して、ステップＳ２２に進む。
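Steps S21 to S29 (Fig. 4) walk the related nodes level by level (N = 2, 3, ...), score each one against the origin node with a vector inner product, and remember every node whose score exceeds the threshold as a candidate to merge into the origin node. The following Python sketch illustrates that loop under stated assumptions: a feature profile is represented as a dict from word to weight, and the names `inner_product` and `select_merge_candidates`, the sample path names, and the threshold value are all hypothetical and not taken from the patent.

```python
from typing import Dict, List, Tuple

Profile = Dict[str, float]  # important word -> weight (assumed representation)


def inner_product(a: Profile, b: Profile) -> float:
    """Vector inner product over the words that both profiles contain."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller profile
    return sum(weight * b[word] for word, weight in a.items() if word in b)


def select_merge_candidates(
    origin: Profile,
    levels: List[List[Tuple[str, Profile]]],
    threshold: float,
) -> List[str]:
    """levels[0] holds the 2nd-order nodes, levels[1] the 3rd-order nodes, ...

    Each entry is (path name, feature profile). Returns the path names of the
    nodes whose similarity to the origin node exceeds the threshold.
    """
    candidates: List[str] = []
    for nodes in levels:                 # S22 / S29: next N while nodes exist
        for path, profile in nodes:      # S23 / S24 / S28: p = 1 .. m
            if inner_product(origin, profile) > threshold:  # S25 / S26
                candidates.append(path)                     # S27
    return candidates


# Hypothetical data for illustration only.
origin = {"search": 0.8, "node": 0.6}
levels = [
    [("/b.html", {"search": 0.5, "link": 0.5}), ("/c.html", {"weather": 1.0})],
    [("/d.html", {"node": 0.9, "search": 0.1})],
]
print(select_merge_candidates(origin, levels, threshold=0.3))
```

Raising the threshold narrows the merged node group to nodes that are closer in content to the origin node; lowering it widens the group.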
【００３２】
これにより、起点ノードからリンクを辿ることによりアクセス可能な全てのノードの中から、起点ノードに類似した内容を有するものが抽出される。
ここで、ノード１１を起点ノードとして入力する場合を考える（図２参照）。なお、ノード１１からは、２つのノード１２，１３へリンクが張られている。これらのノード１２，１３が２次ノードとなる。ノード１２，１３からも他のノードへリンクが張られており、最終的にノード１４〜１６までリンクが張られている。
【００３３】
ノード１１がノード群構成手段１に入力されると、起点ノード特徴抽出部１ａによって、その内容が解析され、ノード１１の特徴が起点ノード特徴プロファイルとして類似性判定部１ｄに渡される。また、関連ノード取得部１ｂによって、ノード１１からリンクが張られているノード１２，１３のノードパス名を抽出し、ノード１２，１３を取得する。
【００３４】
さらに、ノード１２，１３からリンクを辿ることによりアクセスできるノードをＮ次のノード１４〜１６までを全て取得する。取得したノードは、関連ノード特徴抽出部１ｃに渡される。そして、関連ノード特徴抽出部１ｃによって各ノードの内容の特徴が抽出され、関連ノード特徴プロファイルが作成される。すると、類似性判定部１ｄにより、ノード１１に類似する内容を有している関連ノードが全て抽出される。そして、抽出された全てのノードが、ノード合成部１ｅにより起点ノードに合成され、合成ノードが生成される。
【００３５】
以上の処理は、異なるノードを起点として繰り返され、その結果として、類似性を有する複数のノードから構成される合成ノード（ノード群）が複数生成されて出力される。
【００３６】
以上のようにして生成された合成ノードは、構成ノード記憶手段２に供給される。
構成ノード記憶手段２は、以上のようにして生成された合成ノードを構成する各ノードに対応するパス名を記憶する。
【００３７】
情報検索手段３は、ハイパードキュメントシステム１０から、入力された検索要求に対応するノードを検索する。この検索方法としては、例えば、ハイパードキュメントシステム１０からできるだけ大量のノードをスキャンして、各ノードの検索インデックスを用意しておき、入力された検索要求（キーワードの組み合わせ）に該当するものを出力するようにすればよい。
【００３８】
包含関係解析手段４は、情報検索手段３から供給された検索結果の候補であるノードを入力し、そのノードを起点とする合成ノード｛Ｇ₁，Ｇ₂，Ｇ₃，・・・，Ｇ_N｝を構成ノード記憶手段２から検索する。そして、得られた合成ノードの包含関係を図５に示すフローチャートに従って解析する。なお、以下の処理はすべて包含関係解析手段４によって実行される。
［Ｓ４１］情報検索手段３から供給されたノードを含む合成ノードを、構成ノード記憶手段２から検索し、得られた合成ノードのノード群列｛Ｇ₁，Ｇ₂，Ｇ₃，・・・，Ｇ_N｝を入力する。
［Ｓ４２］変数ｉを値“１”に初期設定し、また、排除ノード群リストの内容を空（＝ＮＵＬＬ）の状態に初期設定する。
［Ｓ４３］第ｉ番目のノード群（合成ノード）が、排除ノード群リストに含まれているか否かを判定し、含まれている場合にはステップＳ４９に進み、含まれていない場合にはステップＳ４４に進む。
［Ｓ４４］変数ｋに値（ｉ＋１）を代入する。
［Ｓ４５］第ｉ番目の合成ノードＧ_iを構成する全てのノードが、第ｋ（＝ｉ＋１）番目の合成ノードＧ_kを構成する全てのノードに包含されているか否かを判定し、包含されている場合にはステップＳ４７に進み、包含されていないか、または、Ｇ_iとＧ_kが等しい場合にはステップＳ４６に進む。
［Ｓ４６］第ｉ番目の合成ノードＧ_iを構成する全てのノードが、第ｋ（＝ｉ＋１）番目の合成ノードＧ_kを構成する全てのノードを包含しているか、または、これらが等しいか否かを判定し、包含しているか、または、等しい場合にはステップＳ４８に進み、包含していない場合にはステップＳ５０に進む。
［Ｓ４７］Ｇ_iを排除ノード群リストに追加する。
［Ｓ４８］Ｇ_kを排除ノード群リストに追加する。
［Ｓ４９］変数ｉの値を１だけインクリメントする。
［Ｓ５０］変数ｋの値を１だけインクリメントする。
［Ｓ５１］変数ｉの値が（Ｎ−１）の値と等しいか否かを判定し、これらが等しい場合にはステップＳ５３に進み、また、等しくない場合にはステップＳ４３に戻る。
［Ｓ５２］変数ｋの値がＮの値よりも大きいか否かを判定し、大きいと判定した場合にはステップＳ４９に進み、また、そうでない場合にはステップＳ４５に戻る。
［Ｓ５３］排除ノード群リストを出力する。
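Steps S41 to S53 (Fig. 5) compare the candidate node groups pairwise and put every group that is contained in another group on an exclusion list; when two groups are equal, the later one (G_k) is excluded. The Python sketch below expresses that intent with set comparisons. It assumes each node group is a frozenset of node path names; the function name `find_excluded_groups` and the sample data are hypothetical. The loop bounds follow the intent of the flowchart rather than reproducing its counter tests literally.

```python
from typing import FrozenSet, List, Set


def find_excluded_groups(groups: List[FrozenSet[str]]) -> Set[int]:
    """Return the indices of node groups that are contained in another group."""
    excluded: Set[int] = set()
    n = len(groups)
    for i in range(n - 1):
        if i in excluded:                # S43: G_i is already excluded
            continue
        for k in range(i + 1, n):        # S44 / S50 / S52: k = i+1 .. N
            if groups[i] < groups[k]:    # S45: G_i is strictly inside G_k
                excluded.add(i)          # S47
                break                    # S49: move on to the next i
            if groups[i] >= groups[k]:   # S46: G_i contains or equals G_k
                excluded.add(k)          # S48
    return excluded


# Hypothetical groups: the second is contained in the first (as in Fig. 10).
groups = [frozenset("abcd"), frozenset("bc"), frozenset("xy")]
excluded = find_excluded_groups(groups)
remaining = [g for idx, g in enumerate(groups) if idx not in excluded]
print(sorted(excluded), len(remaining))
```

Because only the stored path names are compared, the check needs no access to the node contents themselves.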
【００３９】
以上の処理により、他のノードに包含されている合成ノードが排除の対象として出力される。
ノード群排除手段５は、包含関係解析手段４から出力された排除ノード群リストを参照して、検索結果の合成ノードから不要なものを排除する。
【００４０】
検索結果出力手段６は、不要なノードが排除されたノード群を検索結果として出力する。
以上の実施の形態によれば、検索結果として、他のノードに包含されていない合成ノード（ノード群）だけを得ることができる。例えば、図１０に示すノードＡとノードＢとが検索の結果として得られた場合には、ノードＡを起点とするノード群のみが選択されて出力されることになる。
【００４１】
なお、以上の実施の形態においては、情報検索手段３は、ハイパードキュメントシステム１０から情報を検索するようにしたが、構成ノード記憶手段２に対してノード名とともに検索に必要な情報（例えば、各ノードの検索インデックス）を記憶させることにより、構成ノード記憶手段２から検索するようにすることも可能である。
【００４２】
次に、図６を参照して本発明の情報検索装置の第２の実施の形態について説明する。
図６に示す第２の実施の形態は、検索要求がリンクによる結合と内容の類似性をもとに構成されたノード群である場合について、検索要求に包含されるノード群を検索結果から排除する機能を付加した１実施例を示すものである。
【００４３】
同図において、図１の場合と対応する部分には同一の符号を付してあるのでその部分の説明は省略する。
検索要求構成ノード記憶手段２１は、検索要求を構成するノードのパス名を記憶して保持する。対検索要求包含関係解析手段２２は、検索結果の候補（ノード群）から検索要求のノード群に包含されるものを排除する。即ち、対検索要求包含関係解析手段２２は、検索要求構成ノード記憶手段２１に保持される情報と、構成ノード記憶手段２に保持される情報と、情報検索手段３の検索結果とを参照して、検索要求と検索結果の候補の相互の包含関係を解析する。対検索要求ノード群排除手段２３は、対検索要求包含関係解析手段２２の解析結果を参照して、検索結果の候補の中から、検索要求に含まれているノード群を排除する。
【００４４】
検索結果出力手段２５は、対検索要求ノード群排除手段２３とノード群排除手段５のいずれによっても排除されなかったノード群を最終検索結果として出力する。
【００４５】
次に、以上の実施の形態の動作について説明する。なお、ノード群構成手段１、構成ノード記憶手段２、包含関係解析手段４、および、ノード群排除手段５の動作は、図１の場合と同様であるので、その説明は省略する。
【００４６】
ドキュメントリーダが検索要求を入力すると、検索要求構成ノード記憶手段２１は、検索要求を構成する個々のノードのパス名を記憶する。対検索要求包含関係解析手段２２は、図７に示す処理に従って、検索結果の候補の中から、検索要求に包含されるものを解析により抽出する。
［Ｓ７１］情報検索手段３によって得られたノードを起点とする合成ノードを、構成ノード記憶手段２から取得することによって得られたノード群列｛Ｇ₁，Ｇ₂，Ｇ₃，・・・，Ｇ_N｝と、検索要求のノード群Ｇ_Qとを入力する。
［Ｓ７２］変数ｉを値“１”に初期設定し、排除ノード群リストの内容を空（ＮＵＬＬ）にする。
［Ｓ７３］第ｉ番目のノード群Ｇ_iを構成する全ての（異なり）ノードが、Ｇ_Qを構成するノードの集合に含まれるか否かを判定し、含まれる場合にはステップＳ７４に進み、含まれない場合にはステップＳ７５に進む。
［Ｓ７４］ノード群列Ｇ_iを排除ノード群リストに加える。
［Ｓ７５］変数ｉの値を“１”だけインクリメントする。
［Ｓ７６］変数ｉの値がＮと等しいか否かを判定し、等しい場合にはステップＳ７７に進み、等しくない場合にはステップＳ７３に戻る。
［Ｓ７７］排除ノード群リストを出力する。
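Steps S71 to S77 (Fig. 7) test each candidate node group G_i against the node group G_Q that makes up the search request, and list every candidate whose nodes all belong to G_Q. A minimal Python sketch of that test follows. It assumes node groups are frozensets of path names; the function name `groups_inside_query` and the sample paths are hypothetical. The sketch checks every candidate from G_1 to G_N.

```python
from typing import FrozenSet, List


def groups_inside_query(
    groups: List[FrozenSet[str]], query_group: FrozenSet[str]
) -> List[int]:
    """Indices of candidate groups whose nodes all belong to the query group."""
    # S73 / S74: G_i is listed when it is a subset of G_Q.
    return [i for i, group in enumerate(groups) if group <= query_group]


# Hypothetical search request and candidates.
query = frozenset({"/q/top.html", "/q/a.html", "/q/b.html"})
candidates = [
    frozenset({"/q/a.html", "/q/b.html"}),    # entirely inside the request
    frozenset({"/x/top.html", "/x/1.html"}),  # unrelated to the request
    frozenset({"/q/top.html", "/y/2.html"}),  # overlaps but is not contained
]
excluded = groups_inside_query(candidates, query)
remaining = [g for i, g in enumerate(candidates) if i not in excluded]
print(excluded, len(remaining))
```

A candidate that merely overlaps the request is kept; only full containment removes it.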
【００４７】
以上の処理によれば、検索結果の候補であるノード群列｛Ｇ₁，Ｇ₂，Ｇ₃，・・・，Ｇ_N｝のうち、検索要求Ｇ_Qに包含されているものが選択され、排除ノード群リストとして出力される。
【００４８】
対検索要求ノード群排除手段２３は、検索結果の候補から、排除ノード群リストに含まれているものを排除し、その結果を出力する。
検索結果出力手段２５は、対検索要求ノード群排除手段２３から出力されたノード群（検索要求のノード群に包含されていないノード群）と、ノード群排除手段５から出力されたノード群（他のノード群に包含されていないノード群）とを比較し、双方に含まれているノード群を検索結果として出力する。
【００４９】
以上のような実施の形態によれば、複数のノード群からなる検索要求が入力された場合に、得られた検索結果の候補の中から、他のノード群に包含されているノード群と、検索要求に包含されているノード群とが排除され、最終的な検索結果として出力されることになる。従って、入力されたノード群自体が検索結果として出力されることを防止するとともに、相互に重複した情報が検索結果として出力されることを防止することができるので、効率的な情報の検索を行うことが可能となる。
【００５０】
なお、以上の実施の形態においては、検索要求に包含されるノード群の排除と、他のノード群に包含されるノード群の排除とを並行して行うようにしたが、これらを順次実行するようにしてもよい。例えば、検索要求に含まれているノード群を排除した後、他のノード群に含まれているノード群を排除するといった具合である。
【００５１】
次に、図８を参照して本発明の第３の実施の形態の構成例について説明する。この実施の形態は、ドキュメントリーダから検索要求が入力される以前に、予め、検索対象となるノード群の包含関係を解析しておき、他のノード群に包含されるノード群を検索対象から排除することで、検索要求入力後の動作時間（ランタイム）を減少させるものである。
【００５２】
この図において、ノード群構成手段１は、ハイパードキュメントシステム１０に配置されているノードから、その内容の類似性とリンクに基づいて、検索の単位となるノード群を生成する。
【００５３】
構成ノード記憶手段２は、ノード群構成手段１によって構成されたノード群に含まれている個々のノードに対応するパス名を記憶する。
包含関係解析手段３は、構成ノード記憶手段２に記憶されているノード群の相互の包含関係を解析する。
【００５４】
類似度算定手段４１は、包含関係にある２組のノード群の内容の類似度を算定し、その結果を包含関係解析手段３に対して出力する。
検索対象絞り込み手段４２は、他のノード群に包含されるとともに、類似度が閾値を超えるノード群を排除して出力する。
【００５５】
次に、以上の実施の形態の動作について説明する。
ノード群構成手段１によって構成されたノード群は、構成ノード記憶手段２に記憶される。包含関係解析手段３と類似度算定手段４１は、図９を参照して後述する処理を連携して実行し、他のノード群に包含されるとともに、相互の類似度が高いノード群を排除するための排除ノード群リストを生成する。
【００５６】
即ち、単純に包含関係だけを解析したのでは、本来、検索結果として返されるべきノード群であっても、それが他の（検索結果として返されない）ノード群に包含されている場合には検索結果として出力されないという不都合が生じてしまう。そこで、類似度算定手段４１において、包含関係にあるノード群間の内容の類似度を算定して、内容が類似であると判定された場合にのみ排除の対象とする。
【００５７】
図９に示すフローチャートが開始されると、以下のような処理が実行されることになる。
［Ｓ９１］包含関係解析手段３は、構成ノード記憶手段２に記憶されているノード群列｛Ｇ₁，Ｇ₂，Ｇ₃，・・・，Ｇ_N｝を入力する。
［Ｓ９２］変数ｉを値“１”に初期設定し、また、排除ノード群リストの内容を空（＝ＮＵＬＬ）の状態に初期設定する。
［Ｓ９３］第ｉ番目のノード群（合成ノード）Ｇ_iが、排除ノード群リストに含まれている（Ｇ_iが排除ノード群リストの要素である）か否かを判定し、含まれている場合にはステップＳ１０１に進み、含まれていない場合にはステップＳ９４に進む。
［Ｓ９４］変数ｋに値（ｉ＋１）を代入する。
［Ｓ９５］第ｉ番目の合成ノードＧ_iを構成する全てのノードが、第ｋ（＝ｉ＋１）番目の合成ノードＧ_kを構成する全てのノードに包含されているか否かを判定し、包含されている場合にはステップＳ９７に進み、包含されていないか、または、Ｇ_iとＧ_kが等しい場合にはステップＳ９６に進む。
［Ｓ９６］第ｉ番目の合成ノードＧ_iを構成する全てのノードが、第ｋ（＝ｉ＋１）番目の合成ノードＧ_kを構成する全てのノードを包含しているか、または、これらが等しいか否かを判定し、包含しているか、または、等しい場合にはステップＳ９８に進み、包含していない場合にはステップＳ１０２に進む。
［Ｓ９７］類似度算定手段４１は、Ｇ_iとＧ_kの類似度を算定する。ここでの類似度の算定方法としては、まず、Ｇ_iとＧ_kを構成する全てのノードをそれぞれ合成する。次に、図３に示す処理を実行し、各合成ノードの内容の特徴を表すプロファイルを作成してそれらの類似度を算定する。そして、包含関係解析手段３は、類似度が予め定められた閾値より大であれば類似であると判定し、ステップＳ９９に進み、また、小であれば類似でないと判定してステップＳ１０１に進む。
［Ｓ９８］ステップＳ９７と同様の処理により、Ｇ_kとＧ_iの類似度を算定し、これらが類似であると判定した場合には、ステップＳ１００に進み、また、類似ではないと判定した場合にはステップＳ１０２に進む。
［Ｓ９９］Ｇ_iを排除ノード群リストに追加する。
［Ｓ１００］Ｇ_kを排除ノード群リストに追加する。
［Ｓ１０１］変数ｉの値を１だけインクリメントする。
［Ｓ１０２］変数ｋの値を１だけインクリメントする。
［Ｓ１０３］変数ｉの値が（Ｎ−１）と等しいか否かを判定し、これらが等しい場合にはステップＳ１０５に進み、また、等しくない場合にはステップＳ９３に戻る。
［Ｓ１０４］変数ｋの値がＮの値よりも大きいか否かを判定し、大きいと判定した場合にはステップＳ１０１に進み、また、そうでない場合にはステップＳ９５に戻る。
［Ｓ１０５］排除ノード群リストを出力する。
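Steps S91 to S105 (Fig. 9) run before any search request arrives. A node group is put on the exclusion list only when it is contained in another group and the two groups are also similar in content beyond a threshold. The Python sketch below shows this two-part condition. The similarity measure is passed in as a function because the patent computes it from content profiles of the merged groups; the Jaccard measure used in the example is only a stand-in chosen to keep the sketch self-contained, and all names and values are hypothetical.

```python
from typing import Callable, FrozenSet, List, Set

Group = FrozenSet[str]


def narrow_search_targets(
    groups: List[Group],
    similarity: Callable[[Group, Group], float],
    threshold: float,
) -> Set[int]:
    """Indices of groups that are contained in, and similar to, another group."""
    excluded: Set[int] = set()
    n = len(groups)
    for i in range(n - 1):
        if i in excluded:                          # S93
            continue
        for k in range(i + 1, n):                  # S94 / S102 / S104
            gi, gk = groups[i], groups[k]
            if gi < gk:                            # S95: G_i inside G_k
                if similarity(gi, gk) > threshold:  # S97
                    excluded.add(i)                # S99
                break                              # S101: next i either way
            if gi >= gk and similarity(gk, gi) > threshold:  # S96 / S98
                excluded.add(k)                    # S100
    return excluded


def jaccard(a: Group, b: Group) -> float:
    """Stand-in similarity (shared nodes / all nodes), NOT the patent's measure."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


groups = [frozenset("abcd"), frozenset("abc"), frozenset("a")]
print(sorted(narrow_search_targets(groups, jaccard, threshold=0.5)))
```

In this example the third group is contained in the first but scores below the threshold, so it stays searchable, which is the behaviour the similarity check is meant to preserve.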
【００５８】
以上の処理により、検索対象となるノード群のうち、他のノードに包含されるとともに、類似度が閾値を超えるノード群が排除ノード群リストとして出力される。
【００５９】
このようにして、他のノード群に包含されるとともに、類似度が所定の閾値を超えるノード群は、検索対象絞り込み手段４２によって排除される。そして、排除されなかったノード群が新たな検索対象となる。
【００６０】
以上の実施の形態によれば、ドキュメントリーダは、排除されなかったノード群を対象とすることにより、効率的な検索を行うことが可能となる。
なお、上記の処理機能は、コンピュータによって実現することができる。その場合、情報検索装置が有すべき機能の処理内容は、コンピュータで読み取り可能な記録媒体に記録されたプログラムに記述されており、このプログラムをコンピュータで実行することにより、上記処理がコンピュータで実現される。
【００６１】
コンピュータで読み取り可能な記録媒体としては、磁気記録装置や半導体メモリ等がある。市場に流通させる場合には、ＣＤ−ＲＯＭ(Compact Disk Read Only Memory) やフロッピーディスク等の可搬型記録媒体にプログラムを格納して流通させたり、ネットワークを介して接続されたコンピュータの記憶装置に格納しておき、ネットワークを通じて他のコンピュータに転送することもできる。コンピュータで実行する際には、コンピュータ内のハードディスク装置等にプログラムを格納しておき、メインメモリにロードして実行する。
【００６２】
【発明の効果】
以上説明したように本発明に係わる情報検索装置では、リンクによる結合と内容の類似性に応じてノード群を構成し、所定の検索要求があった場合には、該当するノード群を取得し、取得されたノード群の中から他のノード群に包含されているものを排除して出力するようにしたので、重複した情報を閲覧することを避けることができる。
【００６３】
また、本発明に係わる情報検索装置では、リンクによる結合と内容の類似性に応じてノード群を構成し、構成されたノード群から包含関係を有するノード群を抽出するとともに、それぞれの類似度を算定し、類似度が所定の閾値よりも大きいノード群に対しては、包含される側のノード群を排除するようにしたので、情報の検索を行う場合にランタイムを短縮することが可能となる。
【図面の簡単な説明】
【図１】本発明の実施の形態の構成例である。
【図２】図１に示すノード群構成手段の構成例を示す図である。
【図３】図２に示す起点ノード特徴抽出部および関連ノード特徴抽出部において実行される処理の一例を説明するフローチャートである。
【図４】図２に示す類似性判定部において実行される処理の１例を説明するフローチャートである。
【図５】図１に示す包含関係解析手段において実行される処理の１例を説明するフローチャートである。
【図６】本発明の第２の実施の形態の構成例を示す図である。
【図７】図６に示す対検索要求包含関係解析手段において実行される処理の１例を説明する図である。
【図８】本発明の第３の実施の形態の構成例を示す図である。
【図９】図８に示す包含関係解析手段と類似度算定手段によって実行される処理の１例を説明するフローチャートである。
【図１０】包含関係にあるノード群を説明する図である。
【符号の説明】
１ノード群構成手段
２構成ノード記憶手段
３情報検索手段
４包含関係解析手段
５ノード群排除手段
６検索結果出力手段
２１検索要求構成ノード記憶手段
２２対検索要求包含関係解析手段
２３対検索要求ノード群排除手段
２５検索結果出力手段
４１類似度算定手段
４２検索対象絞り込み手段
[0001]
[Technical field of the invention]
The present invention relates to an information search apparatus, an information search method, and a recording medium, and in particular to an information search apparatus, an information search method, and a recording medium for searching for information in a hyper document system composed of nodes, which are units of information, and links between the nodes.
[0002]
[Prior art]
In the hyper document system, information to be handled is divided into several small units (nodes), and they are associated and organized (information indicating such association is called a link). For example, on the Internet, a hyper document system is constructed by WWW (World Wide Web). WWW information is described in HTML (Hyper Text Markup Language). This HTML has no semantic restrictions on links between nodes. As described above, a system that does not have a semantic constraint on the link between nodes has an advantage that the document author (author) can determine the content and the link structure at will. With such a hyper document system, the document reader (reader) can follow the link structure constructed by the document author and access all information provided by the document author.
[0003]
The amount of information in hyper document systems such as the Internet is, however, enormous. For the document reader to find the necessary information, a system that supports information retrieval is therefore required. The following two conventional techniques support such searches.
[0004]
The first conventional technique scans as many nodes as possible in advance (at random), prepares a search index for each node, and presents the nodes that match a query (a combination of keywords) from the document reader. As an elemental technology for creating the search index and matching it against queries, the vector space model, a statistical language processing method, has been devised (G. Salton & J. Allan, Text Retrieval Using the Vector Processing Model, in Proc. of SDAIR94).
[0005]
The second conventional technique scans as many nodes as possible in advance (at random) and assigns them to a tree-structured directory classified by topic. The document reader searches the directory for a topic considered to contain the desired information and accesses the target information from there. As an elemental technology for realizing this technique, automatic document classification methods that apply natural language processing have been proposed (for example, P. Jacobs, Joining Statistics with NLP for Text Categorization, in Proc. of Applied-ACL92). An automatic document classification method that extends the medium to images has also been devised (United States Patent: 5526443, T. Nakayama (FXPAL), Method and apparatus for highlighting and categorizing documents using coded word tokens, issue date: 1996.6.11).
[0006]
[Problems to be solved by the invention]
However, because these conventional techniques treat one node (for example, one HTML document) as one search target unit, they cannot capture the essence of a hyper document system, in which a concept is expressed by a structure of nodes and links, and the following problems arise.
[0007]
How a piece of information is divided into nodes and how it is structured depends on the preference of the document author. Nevertheless, a search that treats a node as the unit cannot regard a group of nodes structured on the hypernetwork as information that, taken as a whole, forms a single semantic unit. In other words, a search according to the conventional techniques targets only semantically incomplete fragments of information, and the context is not reflected in the search.
[0008]
For example, suppose that information forming one semantic unit, created by one document author, is divided into a plurality of HTML documents. When a document search is performed with the conventional techniques, each HTML document becomes an individual search target. If the document reader then searches for information similar to “concept A”, the information (or some of its nodes) is not detected unless the individual divided nodes are themselves similar to “concept A”, even though the information created by that document author is similar to “concept A” as a whole.
[0009]
In addition, if one node is a search target unit, there is a problem that a concept representing a search request cannot be expressed by a structure on a hyper network.
Furthermore, when a document author divides information forming one semantic unit into a plurality of nodes and structures them, the conventional search outputs each node individually, which causes redundancy. When a series of HTML documents that one document author created as information forming one semantic unit is output individually, the volume of search results swells, and the document reader must spend more effort finding the documents that match the purpose.
[0010]
Therefore, a search that treats information forming a semantic unit as one unit is conceivable: starting from a certain node, the features of the Nth-order nodes (N = 2, 3, ...) linked to that node are compared to determine their similarity, and the Nth-order nodes determined to be similar are combined with the starting node.
[0011]
However, although such a method enables information retrieval that captures the essence of the hyper document system, an inclusion relationship may arise between “semantic units”, so the search results become redundant.
[0012]
An example of such a case is shown in FIG. 10. In this figure, each parallelogram indicates a node, and each arrow indicates a link. In this example, the semantic unit starting from node A (the part enclosed by a solid line) includes the semantic unit starting from node B (the part enclosed by a dotted line). Therefore, when both are returned as search results for a certain search request, reading the latter is redundant for a document reader who has already read (browsed) the former. There is also the problem that the reader cannot tell whether a result is redundant without actually accessing it.
[0013]
The present invention has been made in view of these points, and an object of the present invention is to provide an information search apparatus that can search for information efficiently by treating information that forms a semantic unit as one unit.
[0014]
Another object of the present invention is to provide an information search method that can search for information efficiently by treating information that forms a semantic unit as one unit.
Yet another object of the present invention is to provide a recording medium on which is recorded an information search program for searching for information efficiently by treating information that forms a semantic unit as one unit.
[0015]
[Means for Solving the Problems]
To solve the above problems, the present invention provides an information search apparatus that searches for information in a hyper document system composed of nodes, which are units of information, and links between the nodes, the apparatus comprising: node group configuring means for configuring node groups based on connection by links and similarity of contents; constituent node storage means for storing the individual nodes that constitute each node group; information search means for searching the hyper document system for information in response to a predetermined search request; inclusion relation analyzing means for analyzing, based on the information stored in the constituent node storage means, the mutual inclusion relations among a plurality of node groups obtained as search results of the information search means; node group exclusion means for excluding, according to the analysis result of the inclusion relation analyzing means, any node group that is included in another node group; and search result output means for outputting the node groups that remain without being excluded as the search result.
[0016]
Here, the node group configuring means configures the node group based on the link connection and the similarity of contents. The configuration node storage means stores individual nodes constituting the node group. The information search means searches for information from the hyper document system in response to a predetermined search request. The inclusion relation analyzing means analyzes the mutual inclusion relation of the plurality of node groups obtained as a search result of the information search means based on the information stored in the constituent node storage means. The node group exclusion unit excludes a node group included in another node group according to the analysis result of the inclusion relationship analysis unit. The search result output means outputs the node group remaining without being excluded as a search result.
[0017]
The present invention also provides an information search apparatus that searches for information in a hyper document system composed of nodes, which are units of information, and links between the nodes, the apparatus comprising: node group configuring means for configuring the node groups to be searched based on connection by links and similarity of contents; constituent node storage means for storing the individual nodes belonging to each of the node groups configured by the node group configuring means; inclusion relation analyzing means for analyzing the mutual inclusion relations of the node groups based on the information stored in the constituent node storage means; similarity calculation means for calculating the similarity between a pair of node groups for which the inclusion relation analyzing means has found an inclusion relation; and search target narrowing means for excluding the included node group when the similarity calculated by the similarity calculation means is greater than or equal to a predetermined threshold, and narrowing the search targets to the node groups that remain without being excluded.
[0018]
Here, the node group configuring means configures the node groups to be searched based on connection by links and similarity of contents. The constituent node storage means stores the individual nodes belonging to each of the node groups configured by the node group configuring means. The inclusion relation analyzing means analyzes the mutual inclusion relations of the node groups based on the information stored in the constituent node storage means. The similarity calculation means calculates the similarity between a pair of node groups for which the inclusion relation analyzing means has found an inclusion relation. When the similarity calculated by the similarity calculation means is greater than or equal to a predetermined threshold, the search target narrowing means excludes the included node group and narrows the search targets to the node groups that remain without being excluded.
[0019]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
FIG. 1 is a diagram showing a configuration example of an embodiment of the information search apparatus of the present invention. In this figure, the node group configuring means 1 configures node groups (for example, the node groups starting from node A or node B in FIG. 10) from the information distributed across the hyper document system 10, according to the similarity of contents.
[0020]
The constituent node storage means 2 stores the path names of the (distinct) nodes belonging to each node group configured by the node group configuring means 1. The path name used here is an absolute address given uniquely to a node; in the case of the WWW (World Wide Web), for example, it is the URL (Uniform Resource Locator).
[0021]
The information search means 3 acquires the nodes corresponding to the search request from the nodes arranged in the hyper document system 10 and outputs them as search result candidates. The inclusion relation analyzing means 4 receives the nodes obtained as a result of the information search, acquires from the constituent node storage means 2 the node groups starting from those nodes, and analyzes the mutual inclusion relations of the obtained node groups.
[0022]
The node group exclusion unit 5 excludes a node group included in another node group according to the analysis result by the inclusion relationship analysis unit 4.
The search result output means 6 outputs the node group remaining without being excluded by the node group exclusion means 5 as the final search result.
[0023]
Next, a configuration example of the node group configuring unit 1 will be described with reference to FIG. As shown in FIG. 2, the node group constituting unit 1 includes a starting node feature extracting unit 1a, a related node acquiring unit 1b, a related node feature extracting unit 1c, a similarity determining unit 1d, and a node combining unit 1e.
[0024]
When the start node is input, the start node feature extraction unit 1a analyzes the content of the start node and extracts the feature. The extracted feature is passed to the similarity determination unit 1d as a starting node feature profile. Here, the information related to the characteristics of a node refers to a set of a pair of a word that characterizes the contents of the node and a value indicating its importance. For example, the origin node feature profile is created by weighting based on the appearance frequency, appearance position, and part-of-speech information regarding each word appearing at the origin node.
[0025]
When the origin node is input, the related node acquisition unit 1b acquires the secondary nodes linked from that node, and then sequentially acquires the nodes (related nodes) linked from each acquired node. It repeats the same processing until there are no more links to other nodes. Each node from the secondary nodes onward acquired in this way is called an Nth-order node (N = 2, 3, 4, ...).
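The level-by-level acquisition described above can be sketched as a breadth-first traversal of the link graph. The Python example below assumes the links are already available as an in-memory mapping from a node's path name to the path names it links to. The function name `collect_related_nodes` and the sample graph are hypothetical, and the `seen` set that prevents a node from being acquired twice (and stops the loop when links form a cycle) is an addition of this sketch that the text does not spell out.

```python
from typing import Dict, List


def collect_related_nodes(
    links: Dict[str, List[str]], origin: str
) -> List[List[str]]:
    """Return related nodes grouped by order: [2nd-order, 3rd-order, ...]."""
    seen = {origin}          # assumption: each node is acquired only once
    frontier = [origin]
    levels: List[List[str]] = []
    while frontier:
        next_level: List[str] = []
        for node in frontier:
            for target in links.get(node, []):
                if target not in seen:
                    seen.add(target)
                    next_level.append(target)
        if next_level:
            levels.append(next_level)
        frontier = next_level  # stop when no unvisited link targets remain
    return levels


# Hypothetical link structure, including a link back to the origin.
links = {
    "a": ["b", "c"],
    "b": ["d", "e"],
    "c": ["e", "f", "a"],
}
print(collect_related_nodes(links, "a"))
```

In a real hyper document system the mapping would be filled in by fetching each node and reading its outgoing links, so a practical crawler would usually also cap the depth or the number of nodes.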
[0026]
The related node feature extraction unit 1c extracts the features of the Nth-order nodes extracted by the related node acquisition unit 1b, and creates a related node feature profile. The created related node feature profile is passed to the similarity determination unit 1d.
[0027]
Based on the related node feature profiles created by the related node feature extraction unit 1c, the similarity determination unit 1d judges, for every node reachable by following links from the origin node, how similar it is to the origin node. It then passes the contents of the similar nodes to the node synthesis unit 1e. The node synthesis unit 1e merges all the nodes extracted by the similarity determination unit 1d into the origin node and outputs a composite node (node group). As a result, the node group configuring means 1 outputs a plurality of path names corresponding to a composite node (node group) whose members fall under the same concept.
[0028]
Next, the operation of the embodiment shown in FIG. 1 will be described.
The node group constituting unit 1 generates and outputs a node group from information arranged in the hyper document system 10. That is, the node group constituting unit 1 scans the nodes arranged in the hyper document system 10 at random, and executes the following processing using the acquired node as a starting node.
[0029]
The related node acquisition unit 1b sequentially acquires the Nth-order nodes related to (linked to) the given origin node. Then, the origin node feature extraction unit 1a and the related node feature extraction unit 1c perform feature extraction according to the processing shown in FIG. 3. Since the processing performed by the origin node feature extraction unit 1a and the related node feature extraction unit 1c is the same, only the processing performed by the origin node feature extraction unit 1a is described below.
[S1] A starting node is given, and the information source is input to the starting node feature extraction unit 1a.
[S2] The tag defined in the hyper document system description language (for example, HTML) is removed from the information source.
[S3] A word is extracted from the remaining text using a known morphological analysis technique.
[S4] Only important words are extracted from the set of words obtained in step S3. Here, the important word is a word characterizing the contents of the information source, and is extracted by, for example, a method in which only a noun is regarded as an important word.
[S5] The important word obtained in step S4 is weighted in consideration of the appearance frequency and the appearance position. That is, the higher the frequency of appearance, the higher the importance. Also, the importance is increased as the appearance position is closer to the head of the document.
[S6] Finally, a list composed of pairs of important words and their weights is created, and this is used as a starting node feature profile.
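Steps S1 to S6 can be illustrated with a minimal Python sketch. It is not the implementation described here: a regular-expression tokenizer stands in for morphological analysis, a stopword filter stands in for the noun-only rule, and the weighting formula, the function name `feature_profile`, and the `stopwords` parameter are all assumptions made for illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text content and drops the markup tags (step S2)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def feature_profile(html_source, stopwords=frozenset()):
    """Return a list of (important word, weight) pairs, heaviest first."""
    # S2: remove the tags of the description language (HTML here).
    parser = _TextOnly()
    parser.feed(html_source)
    parser.close()
    text = " ".join(parser.chunks)

    # S3: word extraction. A regex tokenizer replaces morphological
    # analysis in this sketch (assumption).
    words = re.findall(r"[A-Za-z]+", text.lower())

    # S4: keep important words. The text suggests nouns only; without a
    # part-of-speech tagger, a stopword filter is used instead (assumption).
    important = [(pos, w) for pos, w in enumerate(words) if w not in stopwords]

    # S5: weight by frequency and position. Each occurrence adds 1 plus a
    # bonus that shrinks as the word appears later (hypothetical formula).
    total = max(len(words), 1)
    weights = Counter()
    for pos, w in important:
        weights[w] += 1.0 + (1.0 - pos / total)

    # S6: the profile is the list of (word, weight) pairs.
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
```

For the input `<h1>Search search</h1><p>the index</p>` with `stopwords={"the"}`, the word "search" receives the largest weight because it occurs twice and appears first.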
[0030]
The feature profile (single) of the starting node obtained in this way is passed to the similarity determination unit 1d.
As described above, the origin node is also passed to the related node acquisition unit 1b. The related node acquisition unit 1b searches for the link information contained in the information source of the received node and acquires the link destination nodes as secondary nodes. For example, if the origin node is written in HTML, the URL (Uniform Resource Locator) in the region enclosed by the anchor tag (<A>...</A>) is extracted, and the document specified by that URL (a secondary node) is acquired.
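The extraction of link destinations from anchor tags might look like the following Python sketch, which uses only the standard library. The names `AnchorCollector` and `secondary_node_urls` and the `base_url` parameter are assumptions; fetching the documents behind the URLs is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collects the href targets of anchor tags in document order."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF=...>
        # and <a href=...> are handled alike.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                url = urljoin(self.base_url, value)  # resolve relative links
                if url not in self.links:            # skip repeated targets
                    self.links.append(url)


def secondary_node_urls(html_source, base_url):
    """Return the URLs of the nodes linked from the given HTML source."""
    collector = AnchorCollector(base_url)
    collector.feed(html_source)
    collector.close()
    return collector.links
```

Resolving each link against the URL of the containing node keeps relative and absolute references comparable, which matters when the same node is reached through different paths.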
[0031]
Then, by repeating the same processing, Nth-order nodes (related nodes) are sequentially acquired and passed to the related node feature extraction unit 1c as a set of related nodes. The related node feature extraction unit 1c creates a related node feature profile for each related node by executing the same process as described above. The characteristic profiles (generally a plurality) of the related nodes are passed to the similarity determination unit 1d. Thereby, the origin node feature profile and a plurality of related node feature profiles are passed to the similarity determination unit 1d.
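The repeated acquisition of N-th order nodes can be pictured as a level-by-level traversal of the link structure. The Python sketch below assumes that the starting node counts as the first-order node, that a caller-supplied function `get_links` returns the link destinations of a node, and that a `max_order` bound is wanted; none of these details is specified in the text.

```python
def nodes_by_order(start, get_links, max_order):
    """Group the nodes reachable from `start` by their order N (N >= 2)."""
    seen = {start}
    frontier = [start]            # the starting node is treated as order 1
    levels = {}
    for order in range(2, max_order + 1):
        next_frontier = []
        for node in frontier:
            for target in get_links(node):
                if target not in seen:      # each node is acquired once
                    seen.add(target)
                    next_frontier.append(target)
        if not next_frontier:               # no node of this order exists
            break
        levels[order] = next_frontier
        frontier = next_frontier
    return levels
```

With links 11 → 12, 13; 12 → 14, 15; 13 → 15, 16, the call returns the secondary nodes 12 and 13 under order 2 and the nodes 14 to 16 under order 3.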
The similarity determination unit 1d executes the process shown in FIG. 4.
[S21] An initialization of N = 2 is performed.
[S22] It is determined whether or not an Nth order node exists. If it exists, the process proceeds to step S23, and if not, the process ends.
[S23] The initialization p = 1 is performed. The number of Nth order nodes is m.
[S24] The magnitudes of p and m are compared. If p > m, the process proceeds to step S29. Otherwise, the process proceeds to step S25.
[S25] The similarity between the starting node and the p-th Nth-order node is calculated by the above-described method (known vector inner product calculation method).
[S26] The similarity value obtained in step S25 is compared with a threshold value. If similarity > threshold value, the process proceeds to step S27. Otherwise, the process proceeds to step S28. Here, the threshold value is a preset value, and the allowable range of similarity is adjusted according to its magnitude.
[S27] The pth Nth order node is stored as a candidate node to be combined with the starting node.
[S28] Add 1 to the value of p, and proceed to step S24.
[S29] 1 is added to the value of N, and the process proceeds to step S22.
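The loop of steps S21 to S29 can be sketched in Python as follows. Profiles are assumed to be dictionaries mapping words to weights, and the similarity is computed as a normalized inner product (cosine); the text only refers to a known vector inner product method, so the normalization is an assumption, as are the names `similarity` and `candidate_nodes`.

```python
import math


def similarity(profile_a, profile_b):
    """Normalized inner product of two word -> weight profiles."""
    dot = sum(w * profile_b.get(word, 0.0) for word, w in profile_a.items())
    norm_a = math.sqrt(sum(w * w for w in profile_a.values()))
    norm_b = math.sqrt(sum(w * w for w in profile_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def candidate_nodes(start_profile, profiles_by_order, threshold):
    """Return the ids of related nodes whose similarity exceeds the threshold.

    `profiles_by_order` maps an order N to a list of (node id, profile).
    """
    candidates = []
    n = 2                                            # S21
    while profiles_by_order.get(n):                  # S22: end if none exist
        for node_id, profile in profiles_by_order[n]:        # S23, S24, S28
            if similarity(start_profile, profile) > threshold:  # S25, S26
                candidates.append(node_id)           # S27
        n += 1                                       # S29
    return candidates
```

Raising the threshold narrows the node group to closely related nodes, while lowering it admits nodes that share only a few weighted words with the starting node.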
[0032]
As a result, from all the nodes that can be accessed by following the link from the origin node, those having contents similar to the origin node are extracted.
Here, consider a case where the node 11 is input as a starting node (see FIG. 2). The node 11 is linked to two nodes 12 and 13. These nodes 12 and 13 become secondary nodes. Links are also established from the nodes 12 and 13 to other nodes, and finally links to the nodes 14 to 16 are established.
[0033]
When the node 11 is input to the node group constituting unit 1, the content is analyzed by the starting node feature extracting unit 1a, and the feature of the node 11 is passed to the similarity determining unit 1d as the starting node feature profile. The related node acquisition unit 1b extracts the node path names of the nodes 12 and 13 linked from the node 11 and acquires the nodes 12 and 13.
[0034]
Further, all the N-th nodes 14 to 16 are obtained as accessible nodes by following the links from the nodes 12 and 13. The acquired node is passed to the related node feature extraction unit 1c. Then, the related node feature extraction unit 1c extracts the features of the contents of each node, and creates a related node feature profile. Then, all the related nodes having contents similar to the node 11 are extracted by the similarity determination unit 1d. Then, all the extracted nodes are combined with the starting node by the node combining unit 1e, and a combined node is generated.
[0035]
The above process is repeated starting from different nodes, and as a result, a plurality of composite nodes (node groups) composed of a plurality of nodes having similarity are generated and output.
[0036]
The composite node generated as described above is supplied to the configuration node storage unit 2.
The configuration node storage unit 2 stores a path name corresponding to each node constituting the composite node generated as described above.
[0037]
The information search means 3 searches the hyper document system 10 for nodes corresponding to the input search request. One possible search method is to scan as many nodes as possible from the hyper document system 10 in advance, prepare a search index for each node, and output the nodes that match the input search request (a combination of keywords).
[0038]
The inclusion relation analysis unit 4 receives the nodes that are search result candidates supplied from the information search unit 3, and retrieves from the constituent node storage means 2 the composite nodes {G_1, G_2, G_3, ..., G_N} that include those nodes. It then analyzes the inclusion relationships among the retrieved composite nodes according to the flowchart shown in FIG. 5. The following steps are all executed by the inclusion relation analyzing unit 4.
[S41] The composite nodes that include the nodes supplied from the information search means 3 are retrieved from the constituent node storage means 2, and the node group sequence {G_1, G_2, G_3, ..., G_N} is obtained.
[S42] The variable i is initialized to the value “1”, and the contents of the excluded node group list are initialized to an empty (= NULL) state.
[S43] It is determined whether or not the i-th node group (composite node) is included in the excluded node group list. If it is included, the process proceeds to step S49. If it is not included, the process proceeds to step S44.
[S44] A value (i + 1) is substituted into the variable k.
[S45] It is determined whether all the nodes constituting the i-th composite node G_i are included among the nodes constituting the k-th (initially k = i + 1) composite node G_k. If they are included, the process proceeds to step S47. If they are not included, or if G_i and G_k are equal, the process proceeds to step S46.
[S46] It is determined whether all the nodes constituting the k-th composite node G_k are included among the nodes constituting the i-th composite node G_i, or whether the two are equal. If they are included or equal, the process proceeds to step S48. If not, the process proceeds to step S50.
[S47] G_i is added to the excluded node group list.
[S48] G_k is added to the excluded node group list.
[S49] The value of variable i is incremented by one.
[S50] The variable k is incremented by 1.
[S51] It is determined whether or not the value of the variable i is equal to the value of (N-1). If they are equal, the process proceeds to step S53, and if they are not equal, the process returns to step S43.
[S52] It is determined whether or not the value of the variable k is greater than the value of N. If it is determined that the value is larger, the process proceeds to step S49. If not, the process returns to step S45.
[S53] An excluded node group list is output.
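The intent of steps S41 to S53 can be sketched in Python with set comparisons. This is a simplification under stated assumptions rather than a transcription of the flowchart: each node group is a set of node path names, every pair of groups is compared, the early-exit branches are omitted, and the function name `excluded_groups` is hypothetical. Indices are 0-based here, whereas the text counts from 1.

```python
def excluded_groups(groups):
    """Return the indices of node groups contained in another node group.

    `groups` is a list of sets of node path names. Of two equal groups,
    the later one is excluded so that one copy survives.
    """
    excluded = set()
    for i, g_i in enumerate(groups):
        for k in range(i + 1, len(groups)):
            g_k = groups[k]
            if g_i < g_k:            # S45: G_i is strictly inside G_k
                excluded.add(i)      # S47
            elif g_k <= g_i:         # S46: G_k is inside or equal to G_i
                excluded.add(k)      # S48
    return sorted(excluded)
```

For the groups `{a, b, c}`, `{a, b}`, `{d}`, `{a, b, c}`, `{d, e}` the function returns the indices 1, 2 and 3: the second group lies inside the first, the third lies inside the fifth, and the fourth duplicates the first.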
[0039]
Through the above processing, composite nodes that are included in other composite nodes are output as objects to be excluded.
The node group exclusion unit 5 refers to the excluded node group list output from the inclusion relationship analysis unit 4 and excludes unnecessary items from the search result synthesis node.
[0040]
The search result output means 6 outputs a node group from which unnecessary nodes are excluded as a search result.
According to the above-described embodiment, only synthesized nodes (node groups) that are not included in other nodes can be obtained as search results. For example, when the node A and the node B shown in FIG. 10 are obtained as a search result, only the node group starting from the node A is selected and output.
[0041]
In the above embodiment, the information retrieval unit 3 retrieves information from the hyper document system 10. Alternatively, the information needed for retrieval (for example, a search index for each node) may be stored in the constituent node storage unit 2 together with the node names, so that the search is performed against the constituent node storage unit 2.
[0042]
Next, a second embodiment of the information retrieval apparatus of the present invention will be described with reference to FIG. 6.
The second embodiment shown in FIG. 6 adds a function to the embodiment shown in FIG. 1: when the search request is itself a node group configured based on the combination of links and the similarity of contents, any node group included in the search request is excluded from the search results.
[0043]
In the figure, parts corresponding to those in FIG. 1 are denoted by the same reference numerals, and description thereof is omitted.
The search request constituent node storage unit 21 stores and holds the path names of the nodes constituting the search request. The pair search request inclusion relation analyzing means 22 identifies the search result candidates (node groups) that are included in the node group of the search request. That is, it refers to the information held in the search request constituent node storage means 21, the information held in the constituent node storage means 2, and the search result of the information search means 3, and analyzes the mutual inclusion relationship between the search request and the search result candidates. The pair search request node group exclusion unit 23 refers to the analysis result of the pair search request inclusion relation analysis unit 22 and excludes the node groups included in the search request from the search result candidates.
[0044]
The search result output means 25 outputs a node group that has not been excluded by either the pair search request node group exclusion means 23 or the node group exclusion means 5 as a final search result.
[0045]
Next, the operation of the above embodiment will be described. The operations of the node group configuration unit 1, the configuration node storage unit 2, the inclusion relationship analysis unit 4, and the node group exclusion unit 5 are the same as those in FIG. 1.
[0046]
When the document reader inputs a search request, the search request constituent node storage unit 21 stores the path names of the individual nodes constituting the search request. The pair search request inclusion relation analyzing means 22 analyzes the search result candidates and extracts those included in the search request, according to the processing shown in FIG. 7.
[S71] The node group sequence {G_1, G_2, G_3, ..., G_N}, obtained by retrieving from the constituent node storage unit 2 the composite nodes whose starting nodes were found by the information search unit 3, and the node group G_Q of the search request are input.
[S72] The variable i is initialized to the value “1”, and the contents of the excluded node group list are made empty (NULL).
[S73] It is determined whether all the (distinct) nodes constituting the i-th node group G_i are included in the set of nodes constituting G_Q. If they are included, the process proceeds to step S74. If they are not included, the process proceeds to step S75.
[S74] The node group G_i is added to the excluded node group list.
[S75] The value of the variable i is incremented by “1”.
[S76] It is determined whether or not the value of the variable i is equal to N. If equal, the process proceeds to step S77. If not equal, the process returns to step S73.
[S77] An excluded node group list is output.
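Steps S71 to S77 amount to a subset test of each candidate node group against the node group of the search request. The Python sketch below assumes that node groups are given as lists of path names and checks every group in the sequence; the function name `groups_inside_request` is hypothetical and the indices are 0-based.

```python
def groups_inside_request(groups, request_group):
    """Return the indices of node groups fully contained in the request G_Q."""
    request_nodes = set(request_group)
    excluded = []                            # S72: start with an empty list
    for i, group in enumerate(groups):       # S75, S76: visit each group
        # S73: set() keeps only the distinct nodes of G_i before the test.
        if set(group) <= request_nodes:
            excluded.append(i)               # S74
    return excluded                          # S77
```

A candidate group that repeats a node, such as `["p/b", "p/b"]`, is treated like the group `{"p/b"}`, which matches the reference to distinct nodes in step S73.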
[0047]
According to the above processing, the node groups in the sequence {G_1, G_2, G_3, ..., G_N} that are included in the search request G_Q are output as the excluded node group list.
[0048]
The pair search request node group exclusion means 23 excludes the node groups contained in the excluded node group list from the search result candidates and outputs the result.
The search result output means 25 receives the node groups output from the pair search request node group exclusion means 23 (node groups not included in the node group of the search request) and the node groups output from the node group exclusion means 5 (node groups not included in any other node group), and outputs the node groups present in both as the search results.
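The combination performed by the search result output means 25 is an intersection of the two surviving sets. A minimal Python sketch follows, assuming node groups are referred to by identifiers and the two exclusion results are given as collections of identifiers; the function and parameter names are hypothetical.

```python
def final_search_results(candidates, excluded_vs_request, excluded_vs_others):
    """Keep the candidates that survive both exclusion stages, in order."""
    # Output of the exclusion against the search request (means 23).
    kept_vs_request = [g for g in candidates if g not in excluded_vs_request]
    # Output of the exclusion against other node groups (means 5).
    kept_vs_others = {g for g in candidates if g not in excluded_vs_others}
    # Final result: the node groups present in both outputs.
    return [g for g in kept_vs_request if g in kept_vs_others]
```

Because the result is an intersection, the two exclusion stages can be run in either order or independently without changing the outcome.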
[0049]
According to the embodiment described above, when a search request in the form of a node group is input, the node groups included in other node groups and the node groups included in the search request are excluded from the obtained search result candidates, and the final search result is output. This prevents the input node group itself from being output as a search result and prevents mutually overlapping information from being output, so that efficient information search becomes possible.
[0050]
In the above embodiment, the exclusion of node groups included in the search request and the exclusion of node groups included in other node groups are performed in parallel, but they may instead be executed sequentially. For example, the node groups included in the search request may be excluded first, and the node groups included in other node groups excluded afterwards.
[0051]
Next, a configuration example of the third embodiment of the present invention will be described with reference to FIG. 8. In this embodiment, before a search request is input by the document reader, the inclusion relations of the node groups to be searched are analyzed in advance, and node groups included in other node groups are excluded from the search targets. This reduces the processing time (runtime) after a search request is input.
[0052]
In this figure, the node group constituting unit 1 generates a node group as a unit of search from the nodes arranged in the hyper document system 10 based on the similarity of the contents and the link.
[0053]
The configuration node storage unit 2 stores path names corresponding to individual nodes included in the node group configured by the node group configuration unit 1.
The inclusion relationship analysis unit 3 analyzes the mutual inclusion relationship of the node groups stored in the configuration node storage unit 2.
[0054]
The similarity calculation means 41 calculates the similarity of the contents of the two sets of nodes in the inclusion relationship and outputs the result to the inclusion relationship analysis means 3.
The search target narrowing means 42 excludes the node groups that are included in another node group and whose similarity to it exceeds a threshold value, and outputs the remaining node groups.
[0055]
Next, the operation of the above embodiment will be described.
The node group configured by the node group configuration unit 1 is stored in the configuration node storage unit 2. The inclusion relation analysis unit 3 and the similarity calculation unit 41 cooperate to execute the processing described later with reference to FIG. 9, and generate an excluded node group list used to exclude the node groups that are included in other node groups and are highly similar to them.
[0056]
In other words, if only the inclusion relation were analyzed, a node group that should be returned as a search result would never be output whenever it is included in another node group, even when that other node group is not itself returned as a search result. Therefore, the similarity calculation unit 41 calculates the similarity of contents between node groups in an inclusion relationship, and a node group is excluded only when the contents are judged to be similar.
[0057]
When the flowchart shown in FIG. 9 is started, the following processing is executed.
[S91] The node group sequence {G_1, G_2, G_3, ..., G_N} is input to the inclusion relation analyzing unit 3.
[S92] The variable i is initialized to the value “1”, and the contents of the excluded node group list are initialized to an empty (= NULL) state.
[S93] It is determined whether or not the i-th node group (composite node) G_i is included in the excluded node group list. If it is included, the process proceeds to step S101. If it is not included, the process proceeds to step S94.
[S94] A value (i + 1) is substituted into the variable k.
[S95] It is determined whether all the nodes constituting the i-th composite node G_i are included among the nodes constituting the k-th (initially k = i + 1) composite node G_k. If they are included, the process proceeds to step S97. If they are not included, or if G_i and G_k are equal, the process proceeds to step S96.
[S96] It is determined whether all the nodes constituting the k-th composite node G_k are included among the nodes constituting the i-th composite node G_i, or whether the two are equal. If they are included or equal, the process proceeds to step S98. If not, the process proceeds to step S102.
[S97] The similarity calculation means 41 calculates the similarity between G_i and G_k. To do so, all the nodes constituting G_i and all the nodes constituting G_k are first combined, respectively. Next, the processing shown in FIG. 3 is executed to create a profile representing the content features of each composite node, and the similarity of the two profiles is calculated. If the similarity is greater than a predetermined threshold, the inclusion relationship analyzing unit 3 judges the two to be similar and proceeds to step S99. Otherwise, it judges them not to be similar and proceeds to step S101.
[S98] By the same processing as step S97, the similarity between G_k and G_i is calculated. If they are judged to be similar, the process proceeds to step S100. If they are judged not to be similar, the process proceeds to step S102.
[S99] G_i is added to the excluded node group list.
[S100] G_k is added to the excluded node group list.
[S101] The value of variable i is incremented by one.
[S102] The value of the variable k is incremented by 1.
[S103] It is determined whether or not the value of the variable i is equal to (N-1). If they are equal, the process proceeds to step S105, and if they are not equal, the process returns to step S93.
[S104] It is determined whether or not the value of the variable k is larger than the value of N. If it is determined that the value is larger, the process proceeds to step S101. If not, the process returns to step S95.
[S105] An excluded node group list is output.
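The combined inclusion and similarity test of steps S91 to S105 can be sketched in Python as below. Several points are assumptions: every pair of groups is compared rather than following the flowchart branches literally, the profile of a node group is formed by summing the word weights of its member nodes instead of re-running the feature extraction on the combined text, and the similarity is a normalized inner product. All function names are hypothetical, and indices are 0-based.

```python
import math


def merged_profile(group, node_profiles):
    """Sum the word weights of all member nodes of a node group."""
    merged = {}
    for node in group:
        for word, weight in node_profiles.get(node, {}).items():
            merged[word] = merged.get(word, 0.0) + weight
    return merged


def cosine(profile_a, profile_b):
    """Normalized inner product of two word -> weight profiles."""
    dot = sum(w * profile_b.get(word, 0.0) for word, w in profile_a.items())
    norm_a = math.sqrt(sum(w * w for w in profile_a.values()))
    norm_b = math.sqrt(sum(w * w for w in profile_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def excluded_similar_groups(groups, node_profiles, threshold):
    """Indices of groups that are contained in, and similar to, another group."""
    excluded = set()
    for i in range(len(groups)):
        for k in range(i + 1, len(groups)):
            if groups[i] < groups[k]:        # S95: G_i strictly inside G_k
                inner = i
            elif groups[k] <= groups[i]:     # S96: G_k inside or equal to G_i
                inner = k
            else:
                continue                     # no inclusion relationship
            sim = cosine(merged_profile(groups[i], node_profiles),
                         merged_profile(groups[k], node_profiles))
            if sim > threshold:              # S97, S98
                excluded.add(inner)          # S99, S100
    return sorted(excluded)
```

A contained group whose content differs markedly from the enclosing group stays searchable, which is the behavior the similarity check is meant to preserve.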
[0058]
Through the above processing, among the node groups to be searched, those that are included in other node groups and whose similarity exceeds a threshold value are output as an excluded node group list.
[0059]
In this way, node groups that are included in other node groups and whose similarity exceeds a predetermined threshold are excluded by the search target narrowing means 42. A node group that is not excluded becomes a new search target.
[0060]
According to the above embodiment, the document reader can perform an efficient search by targeting the node group that has not been excluded.
The above processing functions can be realized by a computer. In that case, the processing contents of the functions that the information retrieval apparatus should have are described in a program recorded on a computer-readable recording medium, and the above processing is realized on the computer by executing this program.
[0061]
Examples of the computer-readable recording medium include a magnetic recording device and a semiconductor memory. For distribution on the market, the program can be stored and distributed on a portable recording medium such as a CD-ROM (Compact Disk Read Only Memory) or a floppy disk, or it can be stored in the storage device of a computer connected via a network and transferred to other computers through the network. When executed by a computer, the program is stored in a hard disk device or the like in the computer and loaded into the main memory for execution.
[0062]
[Effects of the invention]
As described above, in the information search apparatus according to the present invention, node groups are configured according to links and the similarity of contents. When a predetermined search request is made, the corresponding node groups are acquired, and the node groups included in other node groups are excluded from the acquired node groups before output. This makes it possible to avoid browsing duplicate information.
[0063]
Further, in the information search apparatus according to the present invention, node groups are configured according to the combination of links and the similarity of contents, node groups in an inclusion relationship are extracted from the configured node groups, and their similarity is calculated. For node groups whose similarity is greater than a predetermined threshold, the included node group is excluded. This makes it possible to shorten the runtime when searching for information.
[Brief description of the drawings]
FIG. 1 is a diagram illustrating a configuration example of an embodiment of the present invention.
FIG. 2 is a diagram illustrating a configuration example of the node group configuration unit shown in FIG. 1.
FIG. 3 is a flowchart illustrating an example of processing executed by the origin node feature extraction unit and the related node feature extraction unit shown in FIG. 2.
FIG. 4 is a flowchart illustrating an example of processing executed by the similarity determination unit shown in FIG. 2.
FIG. 5 is a flowchart illustrating an example of processing executed by the inclusion relationship analysis unit shown in FIG. 1.
FIG. 6 is a diagram illustrating a configuration example of a second embodiment of the present invention.
FIG. 7 is a diagram illustrating an example of processing executed by the pair search request inclusion relation analyzing unit shown in FIG. 6.
FIG. 8 is a diagram illustrating a configuration example of a third embodiment of the present invention.
FIG. 9 is a flowchart illustrating an example of processing executed by the inclusion relationship analyzing unit and the similarity calculating unit shown in FIG. 8.
FIG. 10 is a diagram illustrating node groups that are in an inclusion relationship.
[Explanation of symbols]
1 Node group configuration means
2 Configuration node storage means
3 Information retrieval means
4 Inclusion relationship analysis means
5 Node group exclusion means
6 Search result output means
21 Search request component node storage means
22 Pair search request inclusion relation analysis means
23 Pair search request node group exclusion means
25 Search result output means
41 Similarity calculation means
42 Search target narrowing means

Claims

In an information retrieval apparatus for retrieving information for a hyper document system configured by nodes and links between nodes as a unit of information,
A node group configuration means for configuring a node group based on the link and the similarity of contents;
Configuration node storage means for storing individual nodes constituting the node group;
Information retrieval means for retrieving information from the hyperdocument system in response to a predetermined retrieval request;
Inclusion relation analysis means for analyzing the mutual inclusion relation of a plurality of node groups obtained as a search result of the information search means based on information stored in the component node storage means;
In accordance with the analysis result of the inclusion relation analysis unit, a node group exclusion unit that excludes a node group included in another node group;
Search result output means for outputting a node group remaining without being excluded as a search result;
An information retrieval apparatus comprising:

The information search apparatus according to claim 1, further comprising:
Search request configuration node storage means for storing, when a node group configured based on the link and the similarity of contents is given as a search request, the individual nodes constituting that node group;
Pair search request inclusion relation analyzing means for analyzing, with reference to information stored in the search request configuration node storage means, the inclusion relation between a node group obtained as a search result candidate and the node group of the search request; and
Pair search request node group exclusion means for excluding a node group included in the node group of the search request.

In an information search method for searching information for a hyper document system configured by nodes and links between nodes, which are units of information,
A node group configuration step for configuring a node group based on the link and the similarity of contents;
A configuration node storage step of storing individual nodes constituting the node group;
An information search step of searching for information from the hyper document system in response to a predetermined search request;
An inclusion relationship analysis step of analyzing the mutual inclusion relationship of the plurality of node groups obtained as a search result of the information search step based on information stored in the component node storage step;
In accordance with the analysis result of the inclusion relationship analysis step, a node group exclusion step of eliminating a node group included in another node group;
A search result output step for outputting a node group remaining without being excluded as a search result;
A method for retrieving information, comprising:

In a computer-readable recording medium in which an information search program for causing a computer to search for information of a hyper document system configured by nodes and links between nodes, which are units of information, is recorded.
Node group configuration means for configuring a node group based on the link and the similarity of contents,
Configuration node storage means for storing individual nodes constituting the node group,
Information retrieval means for retrieving information from the hyperdocument system in response to a predetermined retrieval request;
Inclusion relation analysis means for analyzing mutual inclusion relations of a plurality of node groups obtained as a search result of the information search means based on information stored in the component node storage means;
Node group exclusion means for eliminating a node group included in another node group according to the analysis result of the inclusion relation analysis means;
A search result output means for outputting the node group remaining without being excluded as a search result;
A computer-readable recording medium on which is recorded a program for causing a computer to function as the above means.

In an information retrieval apparatus for retrieving information for a hyper document system configured by nodes and links between nodes as a unit of information,
A node group configuring means for configuring a node group to be searched based on the link and the similarity of contents;
Configuration node storage means for storing individual nodes belonging to each of the node groups configured by the node group configuration means;
Inclusion relation analysis means for analyzing the mutual inclusion relation of the node group based on information stored in the component node storage means;
Similarity calculating means for calculating the similarity between a set of nodes whose inclusion relation is recognized by the inclusion relation analyzing means;
Search target narrowing means for excluding the included node group when the similarity calculated by the similarity calculation means is equal to or greater than a predetermined threshold, and for narrowing the search targets down to the remaining node groups that are not excluded;
An information retrieval apparatus comprising:

In an information search method for searching information for a hyper document system configured by nodes and links between nodes, which are units of information,
A node group configuration step that configures a node group to be searched based on the combination of links and the similarity of contents;
A configuration node storage step of storing individual nodes belonging to each of the node groups configured by the node group configuration step;
An inclusion relationship analysis step of analyzing mutual inclusion relationships of the node groups based on information stored in the configuration node storage step;
A similarity calculation step for calculating a similarity between a set of nodes whose inclusion relationship is recognized by the inclusion relationship analysis step;
A search target narrowing step of excluding the included node group when the similarity calculated by the similarity calculation step is equal to or greater than a predetermined threshold, and of narrowing the search targets down to the remaining node groups that are not excluded;
A method for retrieving information, comprising:

In a computer-readable recording medium in which an information search program for causing a computer to search information of a hyper document system configured by nodes and links between nodes, which are units of information, is recorded.
A node group configuration means for configuring a node group to be searched based on the combination of links and the similarity of contents;
Configuration node storage means for storing individual nodes belonging to each of the node groups configured by the node group configuration means,
Inclusion relation analyzing means for analyzing mutual inclusion relation of the node group based on information stored in the component node storage means;
Similarity calculation means for calculating the similarity between a set of nodes whose inclusion relation is recognized by the inclusion relation analysis means;
Search target narrowing means for excluding the included node group when the similarity calculated by the similarity calculation means is equal to or greater than a predetermined threshold, and for narrowing the search targets down to the remaining node groups that are not excluded;
A computer-readable recording medium on which is recorded a search program for causing a computer to function as the above means.