JP4557513B2

JP4557513B2 - Information search apparatus, information search method and program

Info

Publication number: JP4557513B2
Application number: JP2003195809A
Authority: JP
Inventors: 浩司前川
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2003-07-11
Filing date: 2003-07-11
Publication date: 2010-10-06
Anticipated expiration: 2023-07-11
Also published as: JP2005031950A

Description

【０００１】
【発明の属する技術分野】
本発明は、複数の情報から目的の情報を検索する情報検索装置、情報検索方法およびプログラムに関する。
【０００２】
【従来の技術】
近年、インターネットの普及などを通じて大量の文書情報がインターネット上に存在するようになった。そこで、大量の文書から関係する情報を速やかに収集するために、検索技術は欠かせないものとなっている。
【０００３】
大量の文書の中から必要な文書を取得する方法として、現在、キーワード間の近傍関係を利用した近傍検索や、自然文検索条件の構文情報を利用した検索などが存在する。
【０００４】
近傍検索とは、複数の検索語間の出現距離を指定して検索結果を絞り込むための技術である。検索条件として、たとえば次のような条件を指定した場合、
検索条件：｛日本，文化，近傍３｝
「日本」と「文化」が３文字以内に出現する文書の検索を行なう。
【０００５】
したがって、下記文書（１）は、キーワード間の距離が２文字なので、検索結果となり、下記文書（２）は、キーワード間の距離が１文字なので、検索結果となる。しかし、下記文書（３）は、キーワード間の距離が５文字なので、検索結果とはならず、同様に、下記文書（４）も、キーワード間の距離が７文字であるので、検索結果とはならない。
（１）日本的な文化（○）
（２）日本の文化遺産（○）
（３）日本の伝統的な文化（×）
（４）日本の古代の歴史を文化から（×）
上記文書（１）から上記文書（４）までは、単なる文字列マッチングではすべて検索結果となる文書であるが、このように複数のキーワード間の距離を検索条件に加えることによって、単なる文字列マッチングによる検索結果より、検索結果を絞り込んだ出力をすることができる。
【０００６】
一方、構文情報を利用した検索は、検索条件として入力された自然文を構成する文字列を解析し、その構文情報と一致した文書を検索結果とすることで、検索結果を絞り込むための技術である。検索条件として、たとえば「日本の文化」を入力した場合、キーワード「日本」がキーワード「文化」に連体修飾する構文情報を持つと解析される。
【０００７】
下記文書（５）では、「日本（的）：文化」の連体修飾関係が認められる。下記文書（６）では、「日本」は「文化遺産」に連体修飾する。しかし、複合語「文化遺産」は、「文化」が「遺産」に連体修飾した複合語と考えられる。実際には「日本：遺産」の連体修飾関係と、「文化：遺産」の連体修飾関係の構造を持つ。下記文書（７）では、「日本」は「伝統的」とは関係なく、「日本：文化」の連体修飾と「伝統的：文化」の連体修飾の構造をもつ。下記文書（８）では、「日本：古代」の連体修飾関係を持つと考えられる。したがって、「日本：文化」の関係は存在しない。
（５）日本的な文化（○）
（６）日本の文化遺産（×）
（７）日本の伝統的な文化（○）
（８）日本の古代の歴史を文化から（×）
上記文書（５）から上記文書（８）までは、単なる単語マッチングではすべて検索結果となる文書であるが、このように構文解析を利用することによって、単なる単語マッチングよりも木目の細かい、検索意図に合った検索結果を出力することができる（たとえば、特許文献１参照）。
【０００８】
【特許文献１】
特開平５−３４２２５５号公報
【０００９】
【発明が解決しようとする課題】
しかし、上記従来の情報検索装置では、大量の文書の中から指定した検索条件に合う文書を検索した場合、検索条件によっては、十分に絞り込まれないことが多く、検索者が望む検索結果を速やかに取得することが難しかった。
【００１０】
１つ目の問題点として、検索条件に単語が一文字しか存在しない場合、検索語間の近傍条件や構文情報を利用した検索を実現する上記従来の情報検索装置では、最低でも２単語以上の検索語が検索条件に含まれていないとその機能を十分に発揮させることはできないということである。
【００１１】
検索条件として、たとえば「北海道」を入力した場合、北海道の歴史、北海道の面積、北海道の産業、北海道の経済、北海道は寒い、北海道は広い、北海道南部の地震、北海道から出馬・・・など、「北海道」が含まれるすべての文書が検索対象となる。すなわち、この場合、近傍検索や構文利用の検索を行なえず、大量の文書を対象とすればするほど膨大な量の検索結果が得られてしまう。
【００１２】
検索者は、検索結果となった膨大な文書の中から自分が望む検索結果であるかどうかを１つずつチェックしていくか、あるいは別の検索条件によって再び検索する必要があった。１つずつチェックしていくためには多大な労力を要し、情報検索の機能として十分な役割を果たしているとは言えない。
【００１３】
さらに、絞込条件や、新しい検索条件を与えて再検索した場合、速やかに望む検索結果を得ることができるとは言い難く、問題がある。また、２つめの問題点で説明する問題も同時に抱えることになる。
【００１４】
２つめの問題点として、検索条件に単語が複数存在する場合、つまり本来の近傍検索や構文利用の検索が十分に活用できる条件で検索を行なう場合でも、検索者が指定した検索条件が、十分な絞り込みを行われる条件を満たしていないときには、速やかに望む結果が得られないということである。
【００１５】
近傍検索の検索条件として、たとえば｛日本，文化，近傍５｝
あるいは、構文情報を利用した検索の検索条件として、たとえば「日本の文化」を入力した場合の問題点について説明する。
【００１６】
近傍検索では、「日本」と「文化」が５文字以内に出現する文書を取得するので、日本の文化、日本の食文化、日本の伝統文化、日本の文化遺産、日本から文化の輸出、日本人の文化貢献、日本と韓国の文化・・・・など、「日本」と「文化」が近傍にある情報はすべて検索結果となってしまい、システムで管理している文書が大量であればあるほど、大量の検索結果が出力されてしまう。
【００１７】
同様に、構文情報を利用した検索でも、「日本：文化（連体修飾）」の関係を検索するので、検索結果として、日本の文化を考える、日本の新しい文化、日本人の文化、日本の文化に関するレポート、・・・など「日本」と「文化」が連体修飾関係にある情報はすべて検索結果となってしまい、近傍検索を行なったときと同様に大量の検索結果が出力されていた。
【００１８】
１つ目の問題点であげた対応と同様に、検索者は検索結果となった膨大な文書の中から自分が望む検索結果であるかどうかを１つずつチェックしていくか、あるいは別の検索条件によって再び検索する必要があった。
【００１９】
このように、単純な検索条件では十分に絞り込まれた検索が出来ずに、検索条件に関しても検索者の技能によるところが大きかった。
【００２０】
３つめの問題点として、ユーザが複雑な検索条件を入力した場合、たとえば、「日本の文化の発展を報告したメモ」と入力した場合、検索結果が得られない可能性が大きいということである。
【００２１】
以上のように、上記従来の情報検索装置では、１つの検索語を検索条件としたときには、効果がまったく得られず、また、複数の検索語を検索条件としたときも、十分な絞り込みが行われず、さらに、検索条件を複雑にすると、検索結果が得られないという状況が発生した。
【００２２】
このような検索機能では、何度も試行錯誤を繰り返しながら検索条件を変更して検索する必要があったため、検索者にとっての操作性は非常に悪く、速やかに目的の文書を見つけ出すためには労力や経験が必要となっていた。
【００２３】
本発明は、この点に着目してなされたものであり、操作性を向上させつつ、速やかに目的の情報を検索することが可能となる情報検索装置、情報検索方法およびプログラムを提供することを目的とする。
【００２４】
【課題を解決するための手段】
上記目的を達成するため、請求項１に記載の情報検索装置は、入力された検索条件に基づいて記憶装置に格納された文書データを検索する情報検索装置であって、前記検索条件から複数の検索キーワードを抽出する第１抽出手段と、前記複数の検索キーワードの中からユーザによる指定を受け付ける第１受付手段と、前記第１受付手段で指定を受け付けた検索キーワードに共起する共起キーワードを前記記憶装置に格納された文書データから抽出して提示する提示手段と、前記提示手段により提示された共起キーワードについてユーザによる指定を受け付ける第２受付手段と、前記第１受付手段で指定を受け付けた検索キーワードと前記第２受付手段で指定を受け付けた共起キーワードとの間の係り受け関係を示す係り受け情報を抽出する第２抽出手段と、前記複数の検索キーワードと前記共起キーワードと、に基づいて文書データを検索する際に、前記第２抽出手段が抽出した係り受け情報を優先して検索する検索手段と、有することを特徴とする。
【００２５】
請求項２に記載の情報検索装置は、請求項１の情報検索装置において、前記提示手段は、前記第１受付手段で受け付けた検索キーワードについて、係り側の共起キーワードと受け側の共起キーワードとを識別可能に提示することを特徴とする。
【００２６】
請求項３に記載の情報検索装置は、請求項２の情報検索装置において、前記提示手段によって提示される複数の共起キーワードのそれぞれに、所定の基準に従って序列を付ける序列付け手段をさらに有し、前記提示手段は、前記序列付け手段が付けた序列に基づいて、前記複数の共起情報キーワードを提示することを特徴とする。
【００２７】
請求項４に記載の情報検索装置は、請求項３の情報検索装置において、前記所定の基準は、前記各共起キーワードの重要度であることを特徴とする。
【００２８】
請求項５に記載の情報検索装置は、請求項４の情報検索装置において、前記複数の共起キーワードに対して、前記序列付け手段によって付けられた序列のうち、前記受付手段が受け付けた共起キーワードに対する序列以外の序列を変更する変更手段をさらに有し、前記提示手段は、前記変更手段によって変更された序列に基づいて、前記選択された共起キーワードを除く、前記複数の共起キーワードを提示することを特徴とする。
【００２９】
請求項６に記載の情報検索装置は、請求項２乃至５のいずれか１項に記載の情報検索装置において、前記受付手段が受け付けた共起キーワードを前記記憶装置に記憶させる登録手段をさらに有することを特徴とする。
【００３０】
上記目的を達成するため、請求項７に記載の情報検索方法は、入力された検索条件に基づいて記憶装置に格納された文書データを検索する情報検索装置による情報検索方法であって、第１抽出手段が、前記検索条件から複数の検索キーワードを抽出する第１抽出工程と、第１受付手段が、前記複数の検索キーワードの中からユーザによる指定を受け付ける第１受付工程と、提示手段が、前記第１受付工程で指定を受け付けた検索キーワードに共起する共起キーワードを前記記憶装置に格納された文書データから抽出して提示する提示工程と、第２受付手段が、前記提示工程で提示された共起キーワードについてユーザによる指定を受け付ける第２受付工程と、第２抽出手段が、前記第１受付工程で指定を受け付けた検索キーワードと前記第２受付工程で指定を受け付けた共起キーワードとの間の係り受け関係を示す係り受け情報を抽出する第２抽出工程と、検索手段が、前記複数の検索キーワードと前記共起キーワードと、に基づいて文書データを検索する際に、前記第２抽出手段が抽出した係り受け情報を優先して検索する検索工程と、を有することを特徴とする。
【００３１】
上記目的を達成するため、請求項８に記載のプログラムは、コンピュータを、入力された検索条件に基づいて記憶装置に格納された文書データを検索する情報検索装置として機能させるプログラムであって、前記コンピュータを、前記検索条件から複数の検索キーワードを抽出する第１抽出手段と、前記複数の検索キーワードの中からユーザによる指定を受け付ける第１受付手段と、前記第１受付手段で指定を受け付けた検索キーワードに共起する共起キーワードを前記記憶装置に格納された文書データから抽出して提示する提示手段と、前記提示手段により提示された共起キーワードについてユーザによる指定を受け付ける第２受付手段と、前記第１受付手段で指定を受け付けた検索キーワードと前記第２受付手段で指定を受け付けた共起キーワードとの間の係り受け関係を示す係り受け情報を抽出する第２抽出手段と、前記複数の検索キーワードと前記共起キーワードと、に基づいて文書データを検索する際に、前記第２抽出手段が抽出した係り受け情報を優先して検索する検索手段を備える前記情報検索装置として機能させることを特徴とする。
【００３２】
【発明の実施の形態】
以下、本発明の実施の形態を図面に基づいて詳細に説明する。
【００３３】
（第１の実施の形態）
図１は、本発明の第１の実施の形態に係る情報検索装置の概略構成を示すブロック図である。
【００３４】
同図に示すように、本実施の形態の情報検索装置は、キーボードなどの入力装置１と、装置全体の制御を司るＣＰＵ２と、ディスプレイなどの出力装置３と、メモリやハードディスクなどの記憶装置４とによって構成されている。
【００３５】
入力装置１から入力された検索条件は、記憶装置４上に展開された処理プログラム４１によって、ＣＰＵ２で処理される。記憶装置４上に展開された処理プログラム４１は、入力装置１より入力された検索条件を基に、文書データ４２を検索して、検索結果の判定を行なう。検索結果は、出力装置３に出力される。
【００３６】
なお、本実施の形態の情報検索装置は、図１のような単体のコンピュータ上に構築する以外にも、図２のようなローカルなネットワーク環境上および図３のようなインターネット環境上にも構築することができる。
【００３７】
以下、以上のように構成された情報検索装置が実行する動作処理を説明する。
【００３８】
まず、入力された検索条件に対して、係り受け関係処理などを行うことにより、内部的な検索条件を生成する。たとえば「日本の文化」が入力されたときには、検索キーワードとして、「日本：固有名詞」と「文化：一般名詞」が抽出され、係り受け関係として、格助詞「の」による連体修飾関係がキーワード間の距離１として抽出される。以後、抽出結果のキーワードを「トークン」と言い、係り受け関係の情報を「リレーション」と言う。
【００３９】
入力された「日本の文化」という自然文は、トークン１＝［日本：固有名詞］，トークン２＝［文化：一般名詞］、リレーション１＝［１、連体、の］という内部的な検索条件となり、以降の検索処理は、この内部的な検索条件に基づいてなされる。
【００４０】
文書データは、トークンを見出しとして、文書ＩＤと係り受け関係の情報によって構成されている。
【００４１】
図４は、文書データの構造の一例を示す図である。
【００４２】
同図に示すように、見出しとなるトークンに対して、そのトークンが出現する文書の情報列が格納されている。また、文書情報ｎとしては、文書に出現するトークンの詳細な情報列が格納されている。
【００４３】
トークン情報としては、出現したトークンの品詞や活用などが格納され、係り側の情報と複数の受け側の情報が格納される。一般的な係り受け構造のルールとして、複数の係り情報を受けることができ、１つの係り情報を作成する。
【００４４】
係り側情報としては、先に説明したリレーションの情報が格納され、受け側情報としては、どのようなトークンの関係を受けたか分かるように、受け側の情報が格納されている。たとえば、文書番号１の文書として「日本の新しい文化を形成する。」という文字列が登録されている場合、日本−文化、新しい−文化、文化−形成の関係があり、文書データは次のようになる。
日本［１：｛固有名詞（受：文化、２、連体、の）、（係：−）｝］
新しい［１：｛形容連体（受：文化、０、−）、（係：−）｝］
文化［１：｛一般名詞（受：形成、１、目的、を）、（係：日本）、（係：新しい）｝］
形成［１：｛サ動終止（受：−）、（係：文化）｝］
今回の検索条件のトークンは、「日本」と「文化」なので、「日本」と「文化」に該当する文書データをそれぞれ取得して、先の検索条件と文書中とのリレーションの一致度を計算する。トークンの見出しとして、たとえば次の情報を取り出した場合、
日本［１()，３()，５()，７()，９()，１０()，１１()，１３()，・・・・］
文化［１()，３()，５()，８()，９()，１１()，１４()，・・・］
リレーションを構成する両方のトークンが同じ文書に存在する文書に対して、すなわち、１，３，５，８，９，１１の文書に対して詳細な検討を行なう。７，８，１０，１３，１４などの、リレーションを構成するトークンが片方しかない文書に関しては、その単語の重要度のみが加算され、リレーションの一致度は加算されない。
【００４５】
まず、文書番号１の文書では、検索条件である受け側トークン「文化」とのリレーション［１、連体、の］と「日本」の文書情報に含まれている受け側のリレーションを比較する。
日本［１：｛固有名詞（受：文化、２、連体、の）、（係：−）｝］
文化［１：｛一般名詞（受：形成、１、目的、を）、（係：日本）、（係：新しい）｝］
検索条件と文書番号１の文書とでは、トークンの品詞およびトークンの係り受け関係は同じ関係があることが確認できる。しかし、トークンの距離関係は、検索条件が１単語であるのに対して、文書番号１の文書では２単語で出現していることが分かる。
一致度＝｛トークン情報一致度、リレーション一致度｝
トークン情報一致度＝｛品詞一致度、単語重要度｝
リレーション一致度＝｛係り受け関係一致度、距離関係一致度｝
一致度の計算方法は、トークン情報一致度とリレーション一致度を要素とし、トークン情報一致度は、トークン品詞の一致度と単語重要度から計算され、リレーション一致度は、係り受け関係の一致度とトークン間の距離関係の一致度で表わされる。
【００４６】
たとえば、トークン情報一致度とリレーション一致度はそれぞれ、５０：５０の重みを持つ。品詞一致度は、品詞が一致していると“２０”、同じ品詞でない場合、品詞の違いによって一致度は異なる。単語重要度は、すべての単語がそろっている場合、“３０”とし、該文書に出現するトークンの種類数（ｍ）を検索条件のトークンの数（ｎ）で割った値に“３０”をかけたものを単語重要度とする。
【００４７】
係り受け関係一致度は、“４０”を最大値とし、同じ係り受け関係ではない場合、係り受け関係の一致度により異なる。距離関係一致度は、最大１０とし、距離関係が１つ異なる毎に半減する。
【００４８】
したがって、文書番号１の文書の場合、距離関係が１つ異なるので“１０”の半分の“５”になり、一致度は“９５.０”となる。
【００４９】
同様に、文書番号３の文書には、「日本の文化は飛鳥時代から始まった。」という文章があり、この部分の一致度は“１００.０”であり、文書番号５の文書では、「日本の伝統的な文化には歌舞伎や・・・」の部分の一致度は“９２.５”であり、文書番号９の文書では、「日本の文化は和食の文化につながる」の部分の一致度は“１００.０”であり、文書番号１１の文書では、「日本の伝統文化として歌舞伎や能があげられる。」の部分の一致度は“９５.０”である。最終的に、図５のような検索結果となる。
【００５０】
この検索結果に対し、本実施の形態では、さらに詳細条件を加えたいトークンを指定することにより、その係り受け関係を基とする詳細情報を取得することが出来る。ここでは、「文化」を選択すると、先ほど取り出したトークン見出しの情報を基に、「文化」に関する係り受け関係を全て取得し、指定トークンの情報として、係り側共起情報と受け側共起情報を表示する。
【００５１】
図６は、リレーション情報を取り出すトークンとして「文化」を選んだときの例である。
【００５２】
同図に示すように、指定トークン「文化」への係り側共起情報として、「新しい」、「伝統」、「食」・・・などの共起トークンが取り出される。また、受け側共起情報として、「形成」、「発展」、「あげる」・・・などの共起トークン情報が取り出される。
【００５３】
ここで、検索者は、指定トークン「文化」の係り側共起情報および受け側共起情報の中から、自分の望んでいる検索条件に近い共起トークン情報を選択する。
このとき、「伝統」という共起トークンを選択したとする。
【００５４】
そこで、「伝統」、「品詞（全品詞同列）」というトークン情報と、「伝統」と「文化」のリレーションとして、「全距離同列：全関係同列：指定なし」という条件が新たに加えられた検索条件に基づいて再検索され、その検索結果が再表示される。
【００５５】
図７は、共起トークンとして「伝統」を指定したときの検索結果の一例を示す図である。
文書番号１１：「日本の伝統文化として歌舞伎や能があげられる。」
文書番号５：「日本の伝統的な文化には歌舞伎や・・・」
文書番号１０４：「日本の文化の歴史は伝統を重んじ・・・」
上記文書番号１１の文書は、「伝統」、「文化」というリレーションが認められるために、一致度が若干あがる。上記文書番号５の文書でも同様である。しかし、上記文書番号１０４の文書は、「伝統」というトークンは存在するが、「伝統」と「文化」の間に関係が認められないため、一致度は下がるものの、文書番号３の文書などのように共起トークン「伝統」が文書中に存在しない文書と比較するとその下がり方は穏やかである。
【００５６】
さらに、詳細情報を取得したい場合には、文化の受け側の情報や日本の係り側の情報を指定することにより、詳細検索が可能となる。
【００５７】
続いて、検索条件に１つの検索語しか含まれなかった場合の動作処理を説明する。ここでは、検索条件として「北海道」を入力したときについて説明する。この場合、トークン情報としては「北海道：固有名詞」が作られるが、リレーションは作られないので、「北海道」を含む単語がすべて検索結果になる。次に、トークン「北海道」で文書データを取得する。
【００５８】
図８のように、
北海道［３()，８()，１０()，２１()，３０()，・・・，１００()，１３１()，・・・・］
というデータを取得することができ、これらに含まれる文書番号は一致度の計算をすることなく検索結果として出力できる。
【００５９】
次に、さらに詳細な検索をするためにトークン情報「北海道」を指定する。
【００６０】
図９のように、「北海道」についてのリレーション情報の取得を行なう。受け側共起情報には、「産業」、「経済」、「味覚」、・・・などの共起トークンがあることが分かる。一方、係り側共起情報には、「昨年」、「秋」、「夏」・・・などの共起トークンがあることが分かる。
【００６１】
検索者は、「北海道の味覚」について検索したい場合、受け側共起情報にある共起トークン「味覚」を指定する。これにより、図１０のように、北海道−味覚というリレーションを加味した検索結果が表示される。
【００６２】
さらに、係り側共起情報として、たとえば共起トークン「秋」を指定することによって、図１１のように、「秋の北海道の味覚」についての検索を行うことができる。
【００６３】
（第２の実施の形態）
本実施の形態では、上記第１の実施の形態に加えて、指定された検索語に関する共起関係の情報を関係の重要性を加味し、重要性に基づいて出力する。
【００６４】
共起関係の重要度を求める計算方法には、めずらしい共起関係を重要とする特異性を求める方法や、よく使われる共起関係を重要とする一般性を求める方法などが考えられるが、本実施の形態では一般性を重視した方法について説明する。
【００６５】
上記第１の実施の形態における「日本の文化」を例にとって考える。
【００６６】
「文化」についての文書データを基に、係り受け情報を取得する。検索条件の指定により、「日本」との係り受け関係は指定されているために、この条件を満たす共起関係が最優先となる。
【００６７】
たとえば、次の例文をトークン「文化」のデータについてみてみると、
「日本の新しい文化の形成」（１）
「日本は中国の文化を取り入れた」（２）
（１）の文書では、
係り側共起情報日本、新しい
受け側共起情報形成、距離＝１、連体修飾、の
一方、（２）の文書では、
係り側共起情報中国
受け側共起情報取り入れる、距離＝１、目的、を
となり、検索条件にある「日本」と「文化」の係り受け関係を持つ、（１）の文書に現れる「文化」の共起情報のほうが優先される。
係り側情報新しい＞中国
受け側情報形成＞取り入れる
次に、係り受け関係の出現率によって重みを加える。
【００６８】
たとえば、係り側共起情報として検索条件によって優先された共起関係にある共起トークンが、「食」、「独自」、「伝統」、「珍しい」、「新しい」、「韓国」、・・・、であった場合において、それぞれの出現数が、食（５）、独自（８）、伝統（６）、珍しい（２）、新しい（９）、韓国（３）だった場合、出力順は、新しい、独自、伝統、食、韓国、珍しい・・・となる。また、検索条件の履歴が残っているシステムの場合には、履歴に残る共起トークンを優先することも考えられる。
【００６９】
同様に、受け側共起情報として検索条件によって優先された共起関係にある共起トークンが、発展（１５）、受け入れる（３）、取り入れる（２）、形成（１０）、歴史（８）、・・・、であった場合、出力順は、「発展」、「形成」、「歴史」、「受け入れる」、「取り入れる」・・・となる。また、検索条件の履歴が残っているシステムの場合には、履歴に残る共起関係を優先することも考えられる。
【００７０】
最終的なリレーション情報の取り出し結果として、図１２のような結果を出力することになる。
【００７１】
（第３の実施の形態）
本実施の形態では、上記第２の実施の形態に加えて、指定されたトークンの共起関係を共起トークンの重要度を基に表示する処理において、共起トークンを指定したときの処理の例を説明する。
【００７２】
上記第２の実施の形態では、指定トークンと共起関係にある共起トークンが指定されたとき、指定トークンと共起トークンの間にリレーション情報が作成され、その情報を検索条件で指定された検索条件に追加して再検索を実行し、検索結果を再表示していた。
【００７３】
本実施の形態では、さらに共起トークンが指定されることにより、他の共起情報に表示されている共起トークンの重要度が変更される検索処理を説明する。
【００７４】
「日本の文化」を検索条件とし、トークン「日本」およびトークン「文化」に関する共起情報として、図１３のような共起情報が表示されたことを前提に説明する。
【００７５】
検索語の「日本」に関しては、係り側共起情報に、共起トークンとして、「新しい」、「昔」、「現代」、「最近」、「韓国」などが重要度に従って出力される。受け側共起情報に関しては、既に「日本」と「文化」の間にリレーションが存在するので、優先される受け側の共起トークンは存在しない。しかし、トークン「日本」に対する共起トークンは存在するので、「歴史」、「経済」、「政治」など「日本の文化」を無視した共起トークンを重要度に従って出力することができる。「文化」に関しても同様に、「新しい」、「独自」、「伝統」、「食」、「韓国」、「珍しい」などの係り側共起情報の共起トークンが重要度に従って出力される。同様に受け側共起情報には、「発展」、「形成」、「歴史」、「受け入れる」、「取り入れる」、・・・、などの共起トークンが重要度に従って出力される。
【００７６】
ここで、トークン「文化」の係り側共起情報の共起トークン「伝統」を指定した場合について説明する。
【００７７】
共起トークン「伝統」を指定したことにより、検索条件のトークン間のリレーション「日本−文化」に加えて、指定した共起トークンとトークンの間のリレーション「伝統−文化」も検索条件に加えられて再検索される。その結果、新しい検索条件による検索結果が表示されるが、同時に新しいリレーションである「伝統−文化」をも同時に満たす条件が共起トークンの重要度を計算するときに加えられ、重要度を再計算して、新しい重要度を再計算する。
【００７８】
また、検索条件であるトークン間の関係（「日本」と「文化」）を無視した共起トークンを指定することも可能である。たとえば、文化の共起トークンである「アメリカ」を指定した場合、「日本」と「文化」の間のリレーション関係は壊れる。この場合は、リレーションによる優先関係を無視した共起トークンの重要度に従って共起情報内の共起トークンが出力される。
【００７９】
（第４の実施の形態）
本実施の形態では、指定した共起トークンに対する共起情報の取り出しおよび共起トークンの表示を行なう。
【００８０】
前記図７のように、共起トークンとして「伝統」を指定した場合、「伝統」と「文化」に関するリレーションを追加して検索結果を出力していた。
【００８１】
本実施の形態では、ここで指定した共起トークン「伝統」に対して、共起情報の表示を指定できるようにしたものである。
【００８２】
図１４は、共起トークンに対する共起情報の取り出し例である。
【００８３】
このように、共起トークンに対する指定をすることによって、複雑な係り受け関係にある文書を視覚的に取得することが可能となる。
【００８４】
なお、上述した各実施の形態の機能を実現するソフトウェアのプログラムコードを記録した記憶媒体を、システムまたは装置に供給し、そのシステムまたは装置のコンピュータ（またはＣＰＵやＭＰＵ）が記憶媒体に格納されたプログラムコードを読出し実行することによっても、本発明の目的が達成されることは言うまでもない。
【００８５】
この場合、記憶媒体から読出されたプログラムコード自体が本発明の新規な機能を実現することになり、そのプログラムコードを記憶した記憶媒体は本発明を構成することになる。
【００８６】
プログラムコードを供給するための記憶媒体としては、たとえば、フレキシブルディスク、ハードディスク、光磁気ディスク、ＣＤ−ＲＯＭ、ＣＤ−Ｒ、ＣＤ−ＲＷ、ＤＶＤ−ＲＯＭ、ＤＶＤ−ＲＡＭ、ＤＶＤ−ＲＷ、ＤＶＤ＋ＲＷ、磁気テープ、不揮発性のメモリカード、ＲＯＭなどを用いることができる。また、通信ネットワークを介してサーバコンピュータからプログラムコードが供給されるようにしてもよい。
【００８７】
また、コンピュータが読出したプログラムコードを実行することにより、上述した各実施の形態の機能が実現されるだけでなく、そのプログラムコードの指示に基づき、コンピュータ上で稼働しているＯＳなどが実際の処理の一部または全部を行い、その処理によって上述した各実施の形態の機能が実現される場合も含まれることは言うまでもない。
【００８８】
さらに、記憶媒体から読出されたプログラムコードが、コンピュータに挿入された機能拡張ボードやコンピュータに接続された機能拡張ユニットに備わるメモリに書込まれた後、そのプログラムコードの指示に基づき、その機能拡張ボードや機能拡張ユニットに備わるＣＰＵなどが実際の処理の一部または全部を行い、その処理によって上述した各実施の形態の機能が実現される場合も含まれることは言うまでもない。
【００８９】
【発明の効果】
以上、説明したように、本発明によれば、まず、検索条件の検索語が１単語であった場合においても、その単語に係り受け関係にある共起情報を表示することにより、効率の良い絞り込みが可能になる。次に、複数の検索語によって検索された場合において、大量の検索結果が得られた場合においても同様に、係り受け関係を意識した検索条件を提示することになり、複雑な検索条件を検索者は意識することなく指定することが可能となる。さらに、絞り込み作業によって検索結果が０件になることがなくなる。
【００９０】
したがって、検索者は意識することなく複雑な検索条件を指定することが可能となり、検索の操作性は大幅に向上し、また共起関係が表示されることから次の検索語に追加する単語を考える必要はなく、リストから選べることにより、検索者の検索に対するスキルを必要とせず、誰でも簡単に検索結果を得ることができる。また、検索結果がなくなると言うことがないために、最終的な検索結果に速くたどり着くことが可能になる。また、共起データに対する情報はメモリ上に保持しておくことにより、絞込検索時に再び文書データへのアクセスが不要であるために高速な検索が実現される。
【図面の簡単な説明】
【図１】本発明の第１の実施の形態に係る情報検索装置の概略構成を表すブロック図である。
【図２】図１の情報検索装置を構築する他の環境の一例として挙げた、ローカルなネットワーク環境を示す図である。
【図３】図１の情報検索装置を構築する他の環境の一例として挙げた、インターネット環境を示す図である。
【図４】図１の記憶装置内の文書データの構造の一例を示す図である。
【図５】図１の情報検索装置による検索結果の一例を示す図である。
【図６】共起トークンとして「文化」を指定したときの検索結果の一例を示す図である。
【図７】共起トークンとして「伝統」を指定したときの検索結果の一例を示す図である。
【図８】検索条件として「北海道」の一語を入力したときの検索結果の一例を示す図である。
【図９】「北海道」についてのリレーション情報を取得する様子の一例を示す図である。
【図１０】「北海道」に「味覚」というリレーション情報を加味したときの検索結果の一例を示す図である。
【図１１】図１０の検索条件に、さらに、係り側共起情報として「秋」を加味したときの検索結果の一例を示す図である。
【図１２】本発明の第２の実施の形態に係る情報検索装置による検索結果の一例を示す図である。
【図１３】本発明の第３の実施の形態に係る情報検索装置が実行する検索処理を説明するための図である。
【図１４】本発明の第４の実施の形態に係る情報検索装置が実行する検索処理を説明するための図である。
【符号の説明】
１入力装置
２ＣＰＵ
３出力装置
４記憶装置
４１処理プログラム
４２文書データ
[0001]
[Technical Field of the Invention]
The present invention relates to an information search apparatus, an information search method, and a program for searching for target information from a plurality of pieces of information.
[0002]
[Prior art]
In recent years, with the spread of the Internet and similar developments, a large amount of document information has come to exist on the Internet. Search technology has therefore become indispensable for quickly collecting relevant information from a large number of documents.
[0003]
Methods currently available for retrieving the necessary documents from a large number of documents include neighborhood search, which uses the neighborhood relationship between keywords, and search that uses the syntactic information of a natural-sentence search condition.
[0004]
Neighborhood search is a technique for narrowing down search results by specifying the appearance distance between multiple search terms. For example, when the following is specified as the search condition,
Search condition: {Japan (日本), culture (文化), neighborhood 3}
documents in which “Japan” and “culture” appear within 3 characters of each other are retrieved.
[0005]
Therefore, document (1) below is a search result because the distance between the keywords is two characters, and document (2) below is a search result because the distance is one character. Document (3) below, however, is not a search result because the distance between the keywords is five characters, and document (4) below is likewise not a search result because the distance is seven characters.
(1) 日本的な文化 (Japanese-style culture) (○)
(2) 日本の文化遺産 (Japanese cultural heritage) (○)
(3) 日本の伝統的な文化 (Japanese traditional culture) (×)
(4) 日本の古代の歴史を文化から (Japan's ancient history, from culture) (×)
Documents (1) to (4) would all be search results under simple character string matching. By adding the distance between multiple keywords to the search condition in this way, the output can be narrowed down further than the results of simple character string matching.
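The neighborhood condition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`find_all`, `min_gap`, `neighborhood_search`) are hypothetical, distance is taken to be the number of characters lying between the end of the first keyword and the start of the second, and only the order "first keyword, then second keyword" is considered. The four sample strings are the original Japanese of documents (1) to (4), since the distances are counted in Japanese characters.

```python
def find_all(text, word):
    """Return every start index of `word` in `text`."""
    starts, i = [], text.find(word)
    while i != -1:
        starts.append(i)
        i = text.find(word, i + 1)
    return starts


def min_gap(text, first, second):
    """Fewest characters between `first` and a later `second`; None if no such pair."""
    gaps = [
        s - (f + len(first))
        for f in find_all(text, first)
        for s in find_all(text, second)
        if s >= f + len(first)
    ]
    return min(gaps) if gaps else None


def neighborhood_search(docs, first, second, limit):
    """Keep the documents in which the two keywords occur within `limit` characters."""
    hits = []
    for doc in docs:
        gap = min_gap(doc, first, second)
        if gap is not None and gap <= limit:
            hits.append(doc)
    return hits


docs = ["日本的な文化", "日本の文化遺産", "日本の伝統的な文化", "日本の古代の歴史を文化から"]
hits = neighborhood_search(docs, "日本", "文化", 3)
# Gaps are 2, 1, 5 and 7 characters, so only the first two documents are kept.
```

A production system would more likely compute the gap from positional postings in an index rather than by rescanning each document; the scan is used here only to keep the sketch self-contained.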
[0006]
On the other hand, search using syntactic information is a technique for narrowing down search results by analyzing the character string that makes up a natural sentence entered as the search condition and treating documents that match its syntactic information as the search results. For example, when “Japanese culture” (日本の文化) is entered as the search condition, the keyword “Japan” is analyzed as having syntactic information in which it adnominally modifies the keyword “culture”.
[0007]
In document (5) below, the adnominal modification relationship “Japan(-like): culture” is recognized. In document (6) below, “Japan” adnominally modifies “cultural heritage”. However, the compound word “cultural heritage” is considered to be a compound in which “culture” adnominally modifies “heritage”, so the actual structure consists of the adnominal modification relationships “Japan: heritage” and “culture: heritage”. In document (7) below, “Japan” is unrelated to “traditional”, and the structure consists of the adnominal modifications “Japan: culture” and “traditional: culture”. In document (8) below, the adnominal modification relationship “Japan: ancient times” is considered to hold, so no “Japan: culture” relationship exists.
(5) 日本的な文化 (Japanese-style culture) (○)
(6) 日本の文化遺産 (Japanese cultural heritage) (×)
(7) 日本の伝統的な文化 (Japanese traditional culture) (○)
(8) 日本の古代の歴史を文化から (Japan's ancient history, from culture) (×)
Documents (5) to (8) would all be search results under simple word matching. By using syntactic analysis in this way, finer-grained search results that better match the search intent can be output than with simple word matching (see, for example, Patent Document 1).
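The filtering effect of the syntactic condition can be illustrated with a small Python sketch. It does not perform any parsing: the (modifier, head) pairs below are hand-transcribed from the analysis given in the text for documents (5) to (8), using English glosses, and the names `DOCS` and `syntax_search` are hypothetical.

```python
# (modifier, head) pairs per document, transcribed by hand from the text above.
# Only the relations that the text states explicitly are listed.
DOCS = {
    5: {("Japan", "culture")},
    6: {("Japan", "heritage"), ("culture", "heritage")},
    7: {("Japan", "culture"), ("traditional", "culture")},
    8: {("Japan", "ancient")},
}


def syntax_search(docs, modifier, head):
    """Return the ids of the documents that contain the (modifier, head) relation."""
    return sorted(doc_id for doc_id, pairs in docs.items() if (modifier, head) in pairs)


hits = syntax_search(DOCS, "Japan", "culture")
# Documents 5 and 7 match; 6 and 8 contain both words but not the relation.
```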
[0008]
[Patent Document 1]
JP-A-5-342255
[0009]
[Problems to be solved by the invention]
However, with the conventional information search apparatus described above, when documents matching a specified search condition are searched for among a large number of documents, the results are often not narrowed down sufficiently, depending on the search condition, and it was difficult for the searcher to obtain the desired search results quickly.
[0010]
The first problem is that, when the search condition contains only a single word, the conventional information search apparatus described above, which realizes searches using neighborhood conditions between search terms or syntactic information, cannot fully exhibit its function, because at least two search terms must be included in the search condition.
[0011]
For example, when “Hokkaido” is entered as the search condition, every document containing “Hokkaido” becomes a search target: the history of Hokkaido, the area of Hokkaido, the industry of Hokkaido, the economy of Hokkaido, Hokkaido is cold, Hokkaido is large, an earthquake in southern Hokkaido, running for office from Hokkaido, and so on. That is, in this case neither neighborhood search nor syntax-based search can be performed, and the larger the document collection, the more enormous the set of search results becomes.
[0012]
The searcher had to either check the enormous number of resulting documents one by one to see whether each was the desired result, or search again with a different search condition. Checking the documents one by one requires a great deal of labor, and such an apparatus cannot be said to fulfil its role as an information search function.
[0013]
Furthermore, even when a narrowing condition or a new search condition is given and the search is repeated, it is hard to say that the desired search results can be obtained quickly, which is a problem. The issue described below as the second problem also arises at the same time.
[0014]
The second problem is that, even when the search condition contains multiple words, that is, even when the search is performed under conditions in which neighborhood search and syntax-based search can be fully utilized, the desired result cannot be obtained quickly if the search condition specified by the searcher is not one that narrows the results down sufficiently.
[0015]
The problems that arise when, for example, {Japan, culture, neighborhood 5} is entered as the search condition for a neighborhood search, or when “Japanese culture” is entered as the search condition for a search using syntactic information, are described below.
[0016]
In the neighborhood search, documents in which “Japan” and “culture” appear within 5 characters of each other are retrieved, so all information in which “Japan” and “culture” are near each other becomes a search result: Japanese culture, Japanese food culture, Japanese traditional culture, Japanese cultural heritage, the export of culture from Japan, the cultural contributions of Japanese people, the cultures of Japan and Korea, and so on. The more documents the system manages, the larger the number of search results that is output.
[0017]
Similarly, in the search using syntactic information, the relationship “Japan: culture (adnominal modification)” is searched for, so all information in which “Japan” and “culture” are in an adnominal modification relationship becomes a search result, such as “thinking about Japanese culture”, “Japan's new culture”, “the culture of the Japanese people”, “a report on Japanese culture”, and so on. As with neighborhood search, a large number of search results was output.
[0018]
As with the response described for the first problem, the searcher had to either check the enormous number of resulting documents one by one to see whether each was the desired result, or search again with a different search condition.
[0019]
Thus, a simple search condition does not allow a sufficiently narrowed search, and formulating an effective search condition depended heavily on the skill of the searcher.
[0020]
The third problem is that, when the user enters a complicated search condition, for example “a memo reporting on the development of Japanese culture”, there is a high possibility that no search results will be obtained.
[0021]
As described above, in the conventional information search apparatus, no effect at all is obtained when a single search term is used as the search condition; sufficient narrowing is not performed even when multiple search terms are used as the search condition; and when the search condition is made complicated, situations arose in which no search result was obtained.
[0022]
With such a search function, the search had to be repeated with modified search conditions through much trial and error, so operability for the searcher was very poor, and finding the target document quickly required effort and experience.
[0023]
The present invention has been made in view of this point, and its object is to provide an information search apparatus, an information search method, and a program that make it possible to search for target information quickly while improving operability.
[0024]
[Means for Solving the Problems]
In order to achieve the above object, the information search apparatus according to claim 1 is an information search apparatus that searches document data stored in a storage device based on an input search condition, comprising: first extraction means for extracting a plurality of search keywords from the search condition; first accepting means for accepting a designation by a user from among the plurality of search keywords; presentation means for extracting, from the document data stored in the storage device, co-occurrence keywords that co-occur with the search keyword whose designation was accepted by the first accepting means, and presenting them; second accepting means for accepting a designation by the user of a co-occurrence keyword presented by the presentation means; second extraction means for extracting dependency information indicating the dependency relationship between the search keyword whose designation was accepted by the first accepting means and the co-occurrence keyword whose designation was accepted by the second accepting means; and search means for searching with priority given to the dependency information extracted by the second extraction means when searching the document data based on the plurality of search keywords and the co-occurrence keyword.
[0025]
The information search apparatus according to claim 2 is the information search apparatus of claim 1, wherein the presentation means presents, for the search keyword accepted by the first accepting means, the dependent-side co-occurrence keywords and the receiving-side co-occurrence keywords in a distinguishable manner.
[0026]
The information search apparatus according to claim 3 is the information search apparatus of claim 2, further comprising ordering means for ranking each of the plurality of co-occurrence keywords presented by the presentation means according to a predetermined criterion, wherein the presentation means presents the plurality of co-occurrence keywords based on the ranking assigned by the ordering means.
[0027]
The information search apparatus according to claim 4 is the information search apparatus of claim 3, wherein the predetermined criterion is the importance of each co-occurrence keyword.
[0028]
The information search apparatus according to claim 5 is the information search apparatus of claim 4, further comprising changing means for changing, among the ranks assigned by the ordering means to the plurality of co-occurrence keywords, the ranks other than the rank of the co-occurrence keyword accepted by the accepting means, wherein the presentation means presents the plurality of co-occurrence keywords, excluding the selected co-occurrence keyword, based on the ranks changed by the changing means.
[0029]
The information search apparatus according to claim 6 is the information search apparatus according to any one of claims 2 to 5, further comprising registration means for storing the co-occurrence keyword accepted by the accepting means in the storage device.
[0030]
In order to achieve the above object, the information search method according to claim 7 is an information search method performed by an information search apparatus that searches document data stored in a storage device based on an input search condition, the method comprising: a first extraction step in which first extraction means extracts a plurality of search keywords from the search condition; a first accepting step in which first accepting means accepts a designation by a user from among the plurality of search keywords; a presentation step in which presentation means extracts, from the document data stored in the storage device, co-occurrence keywords that co-occur with the search keyword whose designation was accepted in the first accepting step, and presents them; a second accepting step in which second accepting means accepts a designation by the user of a co-occurrence keyword presented in the presentation step; a second extraction step in which second extraction means extracts dependency information indicating the dependency relationship between the search keyword whose designation was accepted in the first accepting step and the co-occurrence keyword whose designation was accepted in the second accepting step; and a search step in which search means searches with priority given to the dependency information extracted by the second extraction means when searching the document data based on the plurality of search keywords and the co-occurrence keyword.
[0031]
In order to achieve the above object, the program according to claim 8 is a program that causes a computer to function as an information search apparatus that searches document data stored in a storage device based on an input search condition, the program causing the computer to function as the information search apparatus comprising: first extraction means for extracting a plurality of search keywords from the search condition; first accepting means for accepting a designation by a user from among the plurality of search keywords; presentation means for extracting, from the document data stored in the storage device, co-occurrence keywords that co-occur with the search keyword whose designation was accepted by the first accepting means, and presenting them; second accepting means for accepting a designation by the user of a co-occurrence keyword presented by the presentation means; second extraction means for extracting dependency information indicating the dependency relationship between the search keyword whose designation was accepted by the first accepting means and the co-occurrence keyword whose designation was accepted by the second accepting means; and search means for searching with priority given to the dependency information extracted by the second extraction means when searching the document data based on the plurality of search keywords and the co-occurrence keyword.
[0032]
[Embodiments of the Invention]
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
[0033]
(First embodiment)
FIG. 1 is a block diagram showing a schematic configuration of an information search apparatus according to the first embodiment of the present invention.
[0034]
As shown in the figure, the information search apparatus of the present embodiment comprises an input device 1 such as a keyboard, a CPU 2 that controls the entire apparatus, an output device 3 such as a display, and a storage device 4 such as a memory or a hard disk.
[0035]
The search condition input from the input device 1 is processed by the CPU 2 in accordance with the processing program 41 loaded onto the storage device 4. The processing program 41 searches the document data 42 based on the search condition input from the input device 1 and evaluates the search results. The search results are output to the output device 3.
[0036]
The information search apparatus of the present embodiment can be built not only on a single computer as shown in FIG. 1, but also in a local network environment as shown in FIG. 2 or in an Internet environment as shown in FIG. 3.
[0037]
Hereinafter, operation processing executed by the information search apparatus configured as described above will be described.
[0038]
First, an internal search condition is generated by performing dependency relation processing and the like on the input search condition. For example, when “Japanese culture” (日本の文化) is input, “Japan: proper noun” and “culture: common noun” are extracted as search keywords, and an adnominal modification relationship via the case particle “の” (no), with an inter-keyword distance of 1, is extracted as the dependency relationship. Hereinafter, an extracted keyword is called a “token”, and the dependency relationship information is called a “relation”.
[0039]
The input natural sentence “Japanese culture” becomes the internal search condition token 1 = [Japan: proper noun], token 2 = [culture: common noun], relation 1 = [1, adnominal, の], and the subsequent search processing is performed based on this internal search condition.
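One way to picture the internal search condition is as a small set of records. The Python sketch below is an assumption-laden illustration, not the format used by the apparatus: the class and field names are hypothetical, and the `dependent` and `head` fields are added here so that the relation can stand on its own, whereas the text lists only distance, relation type, and particle. The condition is written out by hand; no morphological or dependency analysis is performed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    surface: str  # the keyword as written, e.g. 日本
    pos: str      # part of speech


@dataclass(frozen=True)
class Relation:
    dependent: str  # token that modifies (assumed field)
    head: str       # token that is modified (assumed field)
    distance: int   # distance between the two tokens
    kind: str       # type of dependency
    particle: str   # particle that realizes the dependency


@dataclass(frozen=True)
class SearchCondition:
    tokens: tuple
    relations: tuple


# Hand-written internal condition for the input 日本の文化.
condition = SearchCondition(
    tokens=(Token("日本", "proper noun"), Token("文化", "common noun")),
    relations=(Relation("日本", "文化", 1, "adnominal", "の"),),
)
```

A single-word query would simply yield one token and an empty `relations` tuple.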
[0040]
The document data is organized with tokens as index entries, each of which is associated with document IDs and dependency relationship information.
[0041]
FIG. 4 is a diagram illustrating an example of the structure of document data.
[0042]
As shown in the figure, for each token serving as an index entry, an information sequence for the documents in which that token appears is stored. As document information n, a detailed information sequence for the token as it appears in that document is stored.
[0043]
As token information, the part of speech and conjugation of the token as it appeared are stored, together with dependency-side information and information on a plurality of receiving sides. Under the general rule of dependency structure, a token can receive a plurality of dependencies but originates only one.
[0044]
The relation described above is stored as the dependency-side information, and the receiver-side information records which tokens' relations the token receives. For example, when document number 1 is registered with the character string “form a new culture of Japan”, the relations Japan-culture, new-culture, and culture-formation exist, and the document data is as follows.
Japan [1: {proper noun (receiver: culture, 2, union), (dependent: -)}]
New [1: {adjective (receiver: culture, 0, -), (dependent: -)}]
Culture [1: {general noun (receiver: formation, 1, purpose), (dependent: Japan), (dependent: new)}]
Formation [1: {End of service (receiver: -), (dependent: culture)}]
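The four records for document number 1 can be held in a token-keyed index. The sketch below is one hypothetical in-memory layout, not the stored format of FIG. 4: the dictionary keys `pos`, `receiver`, and `dependents` and the helper `relations_in` are names invented here, and the part-of-speech strings are copied from the records as they appear in the text.

```python
# Hypothetical layout: token -> {document ID -> token information}.
# "receiver" holds (receiving token, distance, relation label) or None.
document_data = {
    "Japan": {1: {"pos": "proper noun", "receiver": ("culture", 2, "union"), "dependents": []}},
    "new": {1: {"pos": "adjective", "receiver": ("culture", 0, None), "dependents": []}},
    "culture": {1: {"pos": "general noun", "receiver": ("formation", 1, "purpose"),
                    "dependents": ["Japan", "new"]}},
    "formation": {1: {"pos": "End of service", "receiver": None, "dependents": ["culture"]}},
}


def relations_in(doc_id):
    """List the (dependent, receiver) token pairs recorded for one document."""
    return sorted(
        (token, info["receiver"][0])
        for token, docs in document_data.items()
        for d, info in docs.items()
        if d == doc_id and info["receiver"] is not None
    )


pairs = relations_in(1)
```

Here `pairs` holds the three relations named in the text: Japan-culture, new-culture, and culture-formation.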
Since the tokens of this search condition are “Japan” and “Culture”, the document data corresponding to “Japan” and “Culture” is acquired, and the degree of coincidence between the search condition and each document is calculated. For example, suppose the following information is retrieved under the token headings:
Japan [1 (), 3 (), 5 (), 7 (), 9 (), 10 (), 11 (), 13 (), ...]
Culture [1 (), 3 (), 5 (), 8 (), 9 (), 11 (), 14 (), ...]
A detailed examination is performed on documents in which both tokens constituting the relation exist, that is, documents 1, 3, 5, 9, and 11. For documents such as 7, 8, 10, 13, and 14 that contain only one of the tokens constituting the relation, only the word importance is added, and the relation coincidence is not added.
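The split between the two groups of documents is a set intersection over the two token headings. The following minimal Python sketch uses only the document numbers visible in the lists above (the lists in the text are truncated with “...”, so later numbers are not included); the variable names are illustrative.

```python
# Document numbers listed under each token heading (truncated in the text).
japan = {1, 3, 5, 7, 9, 10, 11, 13}
culture = {1, 3, 5, 8, 9, 11, 14}

# Documents containing both tokens: candidates for relation matching.
both = sorted(japan & culture)

# Documents containing exactly one token: only word importance is scored.
one_only = sorted(japan ^ culture)
```

With these lists, document 8 appears only under “Culture”, so it belongs to the single-token group together with 7, 10, 13, and 14.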
[0045]
First, for the document with document number 1, relation 1 of the search condition, whose receiving token is “culture”, is compared with the receiver relation included in the document information of “Japan”.
Japan [1: {proper noun (receiver: culture, 2, union), (dependent: -)}]
Culture [1: {general noun (receiver: formation, 1, purpose), (dependent: Japan), (dependent: new)}]
It can be confirmed that the search condition and the document with document number 1 agree in the parts of speech of the tokens and in the dependency relationship between the tokens. However, the distance between the tokens is two in the document with document number 1, whereas it is one in the search condition.
Match level = {Token information match level, relation match level}
Token information coincidence = {Part of speech coincidence, word importance}
Relation coincidence = {Dependency relation coincidence, distance relation coincidence}
The degree of coincidence is calculated from two elements, the token information coincidence and the relation coincidence. The token information coincidence is calculated from the part-of-speech coincidence of the tokens and the word importance. The relation coincidence is represented by the dependency relation coincidence and the distance relation coincidence between the tokens.
[0046]
For example, the token information coincidence and the relation coincidence are weighted 50:50. The part-of-speech coincidence is “20” when the parts of speech coincide. The word importance is “30” when all the words are present; otherwise, “30” is multiplied by the number of types of tokens (m) appearing in the document divided by the number of tokens (n) in the search condition.
[0047]
The dependency relation coincidence has a maximum value of “40”; when the dependency relationships are not identical, it varies according to how closely they coincide. The distance relation coincidence has a maximum value of “10” and is halved for each unit of difference in distance.
[0048]
Therefore, in the case of the document with the document number 1, since the distance relationship is different by one, it becomes “5” which is half of “10”, and the matching degree becomes “95.0”.
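The arithmetic of paragraphs [0046] to [0048] can be checked with a short sketch. This is a hedged Python illustration under stated assumptions: the function name and parameters are invented here, a part-of-speech mismatch is assumed to score 0 (the text only gives the value for a match), and the dependency relation coincidence is passed in as a number because the text does not say how it is reduced for partial matches.

```python
def match_score(pos_match, m, n, dependency_score, query_distance, doc_distance):
    """Degree of coincidence from the weights given in the text.

    pos_match        -- True when the parts of speech coincide (20 points;
                        0 for a mismatch is an assumption)
    m, n             -- token types found in the document / tokens in the query
    dependency_score -- dependency relation coincidence, at most 40
    """
    part_of_speech = 20.0 if pos_match else 0.0
    word_importance = 30.0 * m / n
    # Maximum 10, halved for each unit of difference in distance.
    distance = 10.0 * 0.5 ** abs(doc_distance - query_distance)
    return part_of_speech + word_importance + dependency_score + distance


# Document number 1: both tokens present, same dependency, distance 2 vs 1.
score_doc1 = match_score(True, 2, 2, 40.0, query_distance=1, doc_distance=2)
```

For document number 1 this gives 20 + 30 + 40 + 5 = 95.0. A distance difference of two would give 92.5, and an exact match 100.0.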
[0049]
Similarly, the document with document number 3 contains the sentence “Japanese culture began in the Asuka era”, and the degree of coincidence of this portion is “100.0”. In the document with document number 5, the degree of coincidence of the portion “Japanese traditional culture is Kabuki and ...” is “92.5”. In the document with document number 9, the degree of coincidence of the portion “Japanese culture leads to Japanese culture” is “100.0”, and in the document with document number 11, the degree of coincidence of the portion “Kabuki and Noh are given as traditional Japanese culture” is “95.0”. Finally, the search result is as shown in FIG. 5.
[0050]
In the present embodiment, detailed information based on the dependency relationships can be acquired by designating the token to which more detailed conditions are to be added. Here, when “Culture” is selected, all the dependency relationships related to “Culture” are acquired based on the token heading information extracted earlier, and the dependency-side co-occurrence information and the receiver co-occurrence information are displayed as the information for the designated token.
[0051]
FIG. 6 is an example when “culture” is selected as a token for extracting relation information.
[0052]
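As shown in the figure, co-occurrence tokens such as “new”, “tradition”, “food”, ... are displayed as the dependency-side co-occurrence information. Likewise, co-occurrence tokens such as “formation”, “development”, “raise”, ... are displayed as the receiver co-occurrence information.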
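Collecting the two kinds of co-occurrence information for a designated token amounts to walking that token's entries and gathering the tokens on each side. The following Python sketch assumes a hypothetical index layout (token, then document ID, then a record with `receiver` and `dependents`); the entry for document 1 follows the records given earlier, while the entry for document 2 is invented purely for illustration.

```python
def cooccurrence(token, document_data):
    """Return (dependency-side tokens, receiver-side tokens) for `token`."""
    dependency_side = set()  # tokens that depend on `token`
    receiver_side = set()    # tokens that receive a dependency from `token`
    for info in document_data.get(token, {}).values():
        dependency_side.update(info["dependents"])
        if info["receiver"] is not None:
            receiver_side.add(info["receiver"])
    return sorted(dependency_side), sorted(receiver_side)


# Document 1 follows the text; document 2 is a made-up entry.
document_data = {
    "culture": {
        1: {"receiver": "formation", "dependents": ["Japan", "new"]},
        2: {"receiver": "development", "dependents": ["tradition"]},
    }
}

dependency_side, receiver_side = cooccurrence("culture", document_data)
```

The two returned lists correspond to the two groups of co-occurrence tokens presented for the designated token.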
[0053]
Here, the searcher selects, from the dependency-side co-occurrence information and the receiver co-occurrence information of the designated token “culture”, the co-occurrence token closest to the desired search condition.
In this example, it is assumed that the co-occurrence token “tradition” is selected.
[0054]
Accordingly, the token “tradition” (with any part of speech treated as the same part of speech) and a relation between “tradition” and “culture” with the condition “all distances treated as the same: all relations treated as the same: not specified” are newly added to the search condition. The search is performed again based on the updated search condition, and the search result is displayed again.
[0055]
FIG. 7 is a diagram illustrating an example of a search result when “tradition” is designated as the co-occurrence token.
Document No. 11: “Kabuki and Noh are given as traditional Japanese culture.”
Document No. 5: “Japanese traditional culture is Kabuki and ...”
Document No. 104: “The history of Japanese culture respects tradition ...”
The document with document number 11 has a slightly higher degree of coincidence because the relation between “tradition” and “culture” is recognized. The same applies to the document with document number 5. The document with document number 104, however, contains the token “tradition”, but no relation between “tradition” and “culture” is recognized. Even so, its degree of coincidence falls more gently than that of a document in which the co-occurrence token “tradition” does not appear at all.
[0056]
Furthermore, if still more detailed information is to be acquired, a detailed search can be performed by designating the receiver-side information of “culture” or the co-occurrence information of “Japan”.
[0057]
Next, the operation processing when the search condition includes only one search word will be described. Here, a case where “Hokkaido” is input as the search condition will be described. In this case, “Hokkaido: proper noun” is created as token information, but no relation is created, so all documents containing “Hokkaido” become search results. The document data is then acquired for the token “Hokkaido”.
[0058]
As shown in FIG. 8, suppose the following document data is acquired:
Hokkaido [3 (), 8 (), 10 (), 21 (), 30 (), ..., 100 (), 131 (), ...]
The document numbers contained therein can be output as search results without calculating the degree of coincidence.
[0059]
Next, the token “Hokkaido” is designated in order to perform a more detailed search.
[0060]
As shown in FIG. 9, the relation information for “Hokkaido” is acquired. The receiver co-occurrence information includes co-occurrence tokens such as “industry”, “economy”, “taste”, .... The dependency-side co-occurrence information, on the other hand, includes co-occurrence tokens such as “last year”, “autumn”, “summer”, ....
[0061]
When the searcher wants to search for “taste of Hokkaido”, the searcher designates the co-occurrence token “taste” in the receiver co-occurrence information. As a result, as shown in FIG. 10, the search result including the relation of Hokkaido-taste is displayed.
[0062]
Further, for example, by specifying the co-occurrence token “autumn” as the dependency-side co-occurrence information, a search for “taste of Hokkaido in autumn” can be performed as shown in FIG.
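The one-word narrowing flow can be sketched as successive filters over the entries for “Hokkaido”. This Python illustration is a simplification under stated assumptions: the document numbers are taken from the list above, but the relations attached to each document are invented, and a hard filter is used here even though the embodiment ranks documents by degree of coincidence rather than discarding them outright.

```python
# Hypothetical entries for the token "Hokkaido" (relations are made up).
hokkaido_entries = {
    3: {"receiver": "taste", "dependents": ["autumn"]},
    8: {"receiver": "industry", "dependents": []},
    10: {"receiver": "taste", "dependents": ["summer"]},
    21: {"receiver": "economy", "dependents": ["last year"]},
}


def narrow(entries, receiver=None, dependent=None):
    """Keep documents whose relations include the designated co-occurrence tokens."""
    return sorted(
        doc_id
        for doc_id, info in entries.items()
        if (receiver is None or info["receiver"] == receiver)
        and (dependent is None or dependent in info["dependents"])
    )


all_docs = narrow(hokkaido_entries)                                    # "Hokkaido"
taste_docs = narrow(hokkaido_entries, receiver="taste")                # "taste of Hokkaido"
autumn_taste_docs = narrow(hokkaido_entries, "taste", "autumn")        # "taste of Hokkaido in autumn"
```

Each designation shrinks the candidate set without requiring the searcher to type an additional search word.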
[0063]
(Second Embodiment)
In the present embodiment, in addition to the processing of the first embodiment, the co-occurrence information for the designated search word is ranked according to the importance of each co-occurrence relationship and output in order of importance.
[0064]
There are two methods for calculating the importance of a co-occurrence relationship: one obtains specificity, treating rare co-occurrence relationships as important, and the other obtains generality, treating common co-occurrence relationships as important. In the present embodiment, the method that emphasizes generality will be described.
[0065]
Consider “Japanese culture” in the first embodiment as an example.
[0066]
Dependency information is acquired based on document data about “culture”. Since the dependency relationship with “Japan” is specified by specifying the search condition, the co-occurrence relationship satisfying this condition has the highest priority.
[0067]
For example, consider the following example sentences in the data for the token “culture”:
“Formation of new Japanese culture” (1)
“Japan adopted Chinese culture” (2)
In the document (1),
Dependency-side co-occurrence information: Japan, new
Receiver co-occurrence information: formation, distance = 1
On the other hand, in the document (2),
Dependency-side co-occurrence information: China
Receiver co-occurrence information: take in, distance = 1, purpose
Therefore, the co-occurrence information of “culture” appearing in the document (1), which contains the dependency relationship between “Japan” and “culture” given in the search condition, is given priority:
Dependency-side information: new > China
Receiver information: formation > take in
Next, a weight is added according to the appearance rate of the dependency relationship.
[0068]
For example, suppose the co-occurrence tokens whose co-occurrence relationships are prioritized by the search condition as dependency-side co-occurrence information are “food”, “unique”, “tradition”, “rare”, “new”, “Korea”, ..., with occurrence counts of food (5), unique (8), tradition (6), rare (2), new (9), and Korea (3). The output order is then “new”, “unique”, “tradition”, “food”, “Korea”, “rare”, .... In a system that keeps a history of search conditions, priority may be given to co-occurrence tokens that remain in the history.
[0069]
Similarly, suppose the co-occurrence tokens whose co-occurrence relationships are prioritized by the search condition as receiver co-occurrence information are development (15), accept (3), take in (2), formation (10), history (8), .... In this case, the output order is “development”, “formation”, “history”, “accept”, “take in”, .... In a system that keeps a history of search conditions, priority may be given to co-occurrence relationships that remain in the history.
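Ranking by generality reduces to sorting the co-occurrence tokens by occurrence count in descending order. The Python sketch below reproduces the two orderings given above from the stated counts; the function name is illustrative, and the way history is applied (history tokens first, then by count) is an assumption, since the text only says that priority may be given to them.

```python
def rank_by_generality(counts, history=()):
    """Order tokens by descending occurrence count.

    Tokens found in `history` are moved to the front (assumed policy).
    """
    return sorted(counts, key=lambda t: (t not in history, -counts[t]))


dependency_counts = {"food": 5, "unique": 8, "tradition": 6,
                     "rare": 2, "new": 9, "Korea": 3}
receiver_counts = {"development": 15, "accept": 3, "take in": 2,
                   "formation": 10, "history": 8}

dependency_order = rank_by_generality(dependency_counts)
receiver_order = rank_by_generality(receiver_counts)
```

A specificity-oriented variant would sort in ascending order of count instead, so that rare co-occurrence relationships come first.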
[0070]
As a final relation information extraction result, a result as shown in FIG. 12 is output.
[0071]
(Third embodiment)
In the present embodiment, in addition to the processing of the second embodiment, which displays the co-occurrence relationships of the designated token based on the importance of the co-occurrence tokens, an example of the processing performed when a co-occurrence token is designated will be described.
[0072]
In the second embodiment, when a co-occurrence token having a co-occurrence relationship with the designated token was designated, relation information was created between the designated token and the co-occurrence token and added to the search condition already specified, and the search was performed again and the search result redisplayed.
[0073]
In the present embodiment, a search process will be described in which the importance of a co-occurrence token displayed in other co-occurrence information is changed by further specifying a co-occurrence token.
[0074]
Description will be made on the assumption that the co-occurrence information as shown in FIG. 13 is displayed as the co-occurrence information regarding the token “Japan” and the token “culture”, with “Japanese culture” as a search condition.
[0075]
For the search term “Japan”, co-occurrence tokens such as “new”, “old”, “modern”, “recent”, and “Korea” are output as the dependency-side co-occurrence information in order of importance. As for the receiver co-occurrence information, a relation between “Japan” and “culture” already exists, so there is no prioritized receiver co-occurrence token. However, since co-occurrence tokens do exist for the token “Japan”, co-occurrence tokens that disregard “Japanese culture”, such as “history”, “economy”, and “politics”, can be output in order of importance. Similarly, for “culture”, co-occurrence tokens such as “new”, “unique”, “tradition”, “food”, “Korea”, and “rare” are output as the dependency-side co-occurrence information in order of importance, and co-occurrence tokens such as “development”, “formation”, “history”, “accept”, “take in”, ... are output as the receiver co-occurrence information.
[0076]
Here, a case where the co-occurrence token “tradition” of the dependency side co-occurrence information of the token “culture” is specified will be described.
[0077]
When the co-occurrence token “tradition” is designated, the relation “tradition-culture” between the designated co-occurrence token and the token is added to the search condition alongside the relation “Japan-culture” between the tokens of the search condition, and the search is performed again. As a result, search results based on the new search condition are displayed. At the same time, the requirement that the new relation “tradition-culture” also be satisfied is added to the calculation of co-occurrence token importance, and the importance is recalculated.
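Recalculating importance under the added relation can be sketched as counting co-occurrence tokens only over occurrences that satisfy every required relation. The Python illustration below is an assumption-laden sketch: the occurrence data is invented, each occurrence of “culture” is reduced to the list of tokens depending on it, and importance is taken to be a plain occurrence count in line with the generality method.

```python
from collections import Counter


def importance(occurrences, required=frozenset()):
    """Count dependency-side tokens over occurrences containing all `required` tokens."""
    counts = Counter()
    for dependents in occurrences:
        if required <= set(dependents):
            counts.update(d for d in dependents if d not in required)
    return counts.most_common()


# Made-up dependents of "culture", one list per occurrence.
occurrences = [
    ["Japan", "new"],
    ["Japan", "tradition"],
    ["Japan", "tradition", "unique"],
    ["Korea", "tradition"],
    ["Japan", "unique"],
]

before = importance(occurrences, {"Japan"})              # only Japan-culture required
after = importance(occurrences, {"Japan", "tradition"})  # tradition-culture added
```

After “tradition” is designated, only occurrences that also contain the “tradition-culture” relation contribute, so the displayed order of the remaining co-occurrence tokens can change.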
[0078]
It is also possible to specify a co-occurrence token that ignores the relationship (“Japan” and “culture”) between tokens as a search condition. For example, if “America”, which is a co-occurrence token for culture, is specified, the relationship between “Japan” and “culture” is broken. In this case, the co-occurrence token in the co-occurrence information is output according to the importance of the co-occurrence token ignoring the priority relationship by the relation.
[0079]
(Fourth embodiment)
In the present embodiment, the co-occurrence information for the designated co-occurrence token is extracted and the co-occurrence token is displayed.
[0080]
As shown in FIG. 7, when “tradition” is designated as the co-occurrence token, a relation related to “tradition” and “culture” is added and a search result is output.
[0081]
In the present embodiment, display of co-occurrence information can further be requested for the co-occurrence token “tradition” designated here.
[0082]
FIG. 14 is an example of extracting co-occurrence information for a co-occurrence token.
[0083]
In this way, by specifying the co-occurrence token, it is possible to visually acquire a document having a complicated dependency relationship.
[0084]
It goes without saying that the object of the present invention can also be achieved by supplying a storage medium storing software program code that realizes the functions of the above-described embodiments to a system or apparatus, and having the computer (or CPU or MPU) of the system or apparatus read and execute the program code stored in the storage medium.
[0085]
In this case, the program code itself read from the storage medium realizes the novel function of the present invention, and the storage medium storing the program code constitutes the present invention.
[0086]
As a storage medium for supplying the program code, for example, a flexible disk, hard disk, magneto-optical disk, CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD+RW, magnetic tape, non-volatile memory card, ROM, or the like can be used. Further, the program code may be supplied from a server computer via a communication network.
[0087]
Needless to say, the present invention also covers the case where the functions of the above-described embodiments are realized not only by the computer executing the read program code, but also by the OS or the like running on the computer performing part or all of the actual processing based on the instructions of the program code.
[0088]
Further, it goes without saying that the present invention also covers the case where, after the program code read from the storage medium is written into a memory provided in a function expansion board inserted into the computer or in a function expansion unit connected to the computer, a CPU or the like provided in the function expansion board or function expansion unit performs part or all of the actual processing based on the instructions of the program code, and the functions of the above-described embodiments are realized by that processing.
[0089]
[Effects of the invention]
As described above, according to the present invention, first, even when the search condition contains only one search word, co-occurrence information having a dependency relationship with that word is displayed, so that the results can be narrowed down efficiently. Next, when a search with a plurality of search terms yields a large number of results, search conditions that take dependency relationships into account are likewise presented, so that the searcher can specify complicated search conditions without being conscious of them. Furthermore, the number of search results never becomes zero.
[0090]
Therefore, the searcher can specify complicated search conditions without being conscious of it, and search operability is greatly improved. Because the co-occurrence relationships are displayed, there is no need to think of a word to add as the next search term; by selecting from the list, anyone can easily obtain search results regardless of search skill. In addition, since a search never ends with zero results, the final search results can be reached quickly. Further, by storing the co-occurrence data in memory, the document data need not be accessed again at the time of a narrowing search, so that a high-speed search is realized.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a schematic configuration of an information search apparatus according to a first embodiment of the present invention.
FIG. 2 is a diagram showing a local network environment given as an example of another environment for constructing the information search apparatus of FIG. 1.
FIG. 3 is a diagram showing an Internet environment given as an example of another environment for constructing the information search apparatus of FIG. 1.
FIG. 4 is a diagram showing an example of the structure of document data in the storage device of FIG. 1.
FIG. 5 is a diagram illustrating an example of a search result obtained by the information search apparatus in FIG. 1.
FIG. 6 is a diagram showing an example of a search result when “culture” is designated as a co-occurrence token.
FIG. 7 is a diagram illustrating an example of a search result when “tradition” is designated as a co-occurrence token.
FIG. 8 is a diagram illustrating an example of a search result when a word of “Hokkaido” is input as a search condition.
FIG. 9 is a diagram illustrating an example of how relation information regarding “Hokkaido” is acquired.
FIG. 10 is a diagram illustrating an example of a search result when relation information “taste” is added to “Hokkaido”.
FIG. 11 is a diagram illustrating an example of a search result when “autumn” is added to the search condition of FIG. 10.
FIG. 12 is a diagram showing an example of a search result by the information search device according to the second embodiment of the present invention.
FIG. 13 is a diagram for explaining search processing executed by an information search device according to a third embodiment of the present invention.
FIG. 14 is a diagram for explaining a search process executed by an information search device according to a fourth embodiment of the present invention.
[Explanation of symbols]
1 Input device
2 CPU
3 Output device
4 Storage device
41 Processing program
42 Document data

Claims

An information search device for searching document data stored in a storage device based on an input search condition,
First extraction means for extracting a plurality of search keywords from the search condition;
First accepting means for accepting designation by a user from among the plurality of search keywords ;
Presenting means for extracting and presenting a co-occurrence keyword co-occurring with a search keyword accepted by the first accepting means from document data stored in the storage device ;
Second accepting means for accepting designation by the user for the co-occurrence keyword presented by the presenting means;
Second extraction means for extracting dependency information indicating a dependency relationship between the search keyword accepted by the first accepting means and the co-occurrence keyword accepted by the second accepting means ;
Search means for, when searching the document data based on the plurality of search keywords and the co-occurrence keyword, searching with priority given to the dependency information extracted by the second extraction means,
the information search device comprising the above means.

2. The information search apparatus according to claim 1, wherein the presenting means presents the co-occurrence keyword on the dependency side and the co-occurrence keyword on the reception side in a distinguishable manner for the search keyword accepted by the first accepting means.

3. The information search apparatus according to claim 2, further comprising ordering means for ranking each of the plurality of co-occurrence keywords presented by the presenting means according to a predetermined criterion, wherein the presenting means presents the plurality of co-occurrence keywords based on the order assigned by the ordering means.

The information retrieval apparatus according to claim 3, wherein the predetermined criterion is importance of each co-occurrence keyword.

5. The information search apparatus according to claim 4, further comprising changing means for changing, within the order assigned by the ordering means to the plurality of co-occurrence keywords, the order of the co-occurrence keywords other than the co-occurrence keyword accepted by the accepting means, wherein the presenting means presents the plurality of co-occurrence keywords excluding the accepted co-occurrence keyword based on the order changed by the changing means.

The information retrieval apparatus according to any one of claims 2 to 5, further comprising registering means for storing, in the storage device, the co-occurrence keyword accepted by the accepting means.

An information search method by an information search device for searching document data stored in a storage device based on an input search condition,
A first extraction step in which a first extraction means extracts a plurality of search keywords from the search condition;
A first accepting step for accepting designation by the user from the plurality of search keywords;
A presentation step in which the presenting means extracts and presents the co-occurrence keyword co-occurring with the search keyword received in the first reception step from the document data stored in the storage device;
A second accepting step in which a second accepting unit accepts designation by the user for the co-occurrence keyword presented in the presenting step;
A second extraction step in which a second extraction means extracts dependency information indicating a dependency relationship between the search keyword received in the first receiving step and the co-occurrence keyword received in the second receiving step;
A search step in which search means, when searching the document data based on the plurality of search keywords and the co-occurrence keyword, searches with priority given to the dependency information extracted by the second extraction means, the information search method comprising the above steps.

A program that causes a computer to function as an information search device that searches document data stored in a storage device based on an input search condition,
The computer,
First extraction means for extracting a plurality of search keywords from the search condition;
First accepting means for accepting designation by a user from among the plurality of search keywords;
Presenting means for extracting and presenting a co-occurrence keyword co-occurring with a search keyword accepted by the first accepting means from document data stored in the storage device;
Second accepting means for accepting designation by the user for the co-occurrence keyword presented by the presenting means;
Second extraction means for extracting dependency information indicating a dependency relationship between the search keyword accepted by the first accepting means and the co-occurrence keyword accepted by the second accepting means ;
Search means for, when searching the document data based on the plurality of search keywords and the co-occurrence keyword, searching with priority given to the dependency information extracted by the second extraction means, the program causing the computer to function as the information search device comprising the above means.