JP2004206608A

JP2004206608A - Document retrieval method, its device, and its program

Info

Publication number: JP2004206608A
Application number: JP2002377649A
Authority: JP
Inventors: Takashi Horikoshi; 崇堀越; Masaru Miyamoto; 勝宮本; Teruo Hamano; 輝夫濱野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2002-12-26
Filing date: 2002-12-26
Publication date: 2004-07-22

Abstract

PROBLEM TO BE SOLVED: To provide a document retrieval method that improves the precision of document retrieval from retrieval keywords entered by a user, and improves convenience by automatically recommending candidates for the next narrowing retrieval.
SOLUTION: In this document retrieval method, whose retrieval condition is that a document contains all of the keywords designated by a user, a document is retrieved when it satisfies the condition that both the character strings of the designated keywords and the order in which the keywords appear match.
COPYRIGHT: (C)2004, JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、大量の文書群から利用者が指定する条件に一致する文書を検索する文書検索方法及び装置並びにプログラムに関する。
【０００２】
【従来の技術】
従来のキーワード検索では、まず検索の前処理として、検索対象の文書を１文書ごとに、１文書中に出現する文字列全てをキーワードとして抽出し、このキーワードをこの文書を検索するキーワードとして、文書群単位にインデックスデータとして保持しておく。そして、インデックスデータに対し、利用者が投入したキーワードがどの文書に含まれるかを照合し、該当する文書を、その文書群に対する検索結果として表示する、という方法が採られる。
【０００３】
しかし、利用者がキーワードを複数個投入した場合のそれらの順序と、文書中で出現するキーワードの出現順序の一致を検索条件とする検索方式は無い。
【０００４】
また、構造記述言語（例えばＨＴＭＬなど）では、文書中で文書が複数個のブロックに分かれていて、ブロックごとに記述されている内容が大きく異なることがあるが、これらのうち１ブロック内に、利用者が投入した複数キーワードが全て含まれるという条件を指定する検索方法も無い。
【０００５】
利用者に次の検索候補を推薦する機能として、従来は特許文献１の「次検索候補単語提示方法および装置と次検索候補単語提示プログラムを記録した記録媒体」で方法が公開されているように、過去利用者が投入したキーワードで出現頻度の高いものを検索キーワード候補として推薦する方法があった。
【０００６】
また、非特許文献１の「アンケートを対象としたテキスト自動分類システムの検討」で公開されている方法を用いて、検索結果の文書群を文字列により分類しその分類結果を利用者に見せて検索候補を推薦する方法もあった。
【０００７】
しかし、上記のいずれも検索対象の文書に存在する文書構造を反映していないため、精度が悪いという問題点があった。
【０００８】
【特許文献１】
特開２００２−９２０３２号公報
【非特許文献１】
杉崎正之、大久保雅且、田中一男著「アンケートを対象としたテキスト自動分類システムの検討」社団法人情報処理学会、第５９回（平成１１年後期）全国大会、４Ｎ-3、分冊2、pp.377-378(1999)。
【０００９】
【発明が解決しようとする課題】
本発明は上記の事情に鑑みてなされたもので、利用者が投入した検索キーワードからの文書検索の検索精度向上と、次の絞り込み検索候補の自動推薦による利便性向上が得られる文書検索方法及び装置並びにプログラムを提供することを目的とする。
【００１０】
【課題を解決するための手段】
上記目的を達成するために本発明は、利用者が指定した複数個のキーワードを全て含むことを検索条件とした文書検索方法であって、利用者が指定したキーワードの文字列とその出現順序の両方の要素が一致するという条件を満たす文書を検索するステップを有することを特徴とする。
【００１１】
また、本発明の文書検索方法は、利用者が指定したキーワードを含む文書を検索するステップと、検索結果の各文書中で前記キーワード以降に出現する文字列を利用して検索結果を絞り込むキーワードを自動的に利用者に推薦するステップとを有することを特徴とする。
【００１２】
また、本発明の文書検索方法は、検索対象文書内に複数個存在する文書構造ブロックを認識するステップと、前記ブロック内に利用者が指定した複数個のキーワード文字列が全て含まれるという条件を満たす文書を検索するステップとを有することを特徴とする。
【００１３】
また本発明は、前記文書検索方法において、利用者が指定した複数個のキーワード文字列が全て含まれる文書構造ブロック以降のブロックに出現する文字列を利用して、文書検索結果をさらに絞り込むキーワードを自動的に利用者に推薦するステップを有することを特徴とする。
【００１４】
また本発明は、利用者が指定した複数個のキーワードを全て含むことを検索条件とした文書検索装置であって、利用者が指定したキーワードの文字列とその出現順序の両方の要素が一致するという条件を満たす文書を検索する手段を備えたことを特徴とするものである。
【００１５】
また、本発明の文書検索装置は、利用者が指定したキーワードを含む文書を検索する手段と、検索結果の各文書中で前記キーワード以降に出現する文字列を利用して検索結果を絞り込むキーワードを自動的に利用者に推薦する手段とを備えたことを特徴とするものである。
【００１６】
また、本発明の文書検索装置は、検索対象文書内に複数個存在する文書構造ブロックを認識する手段と、前記ブロック内に利用者が指定した複数個のキーワード文字列が全て含まれるという条件を満たす文書を検索する手段とを備えたことを特徴とするものである。
【００１７】
また本発明は、前記文書検索装置において、利用者が指定した複数個のキーワード文字列が全て含まれる文書構造ブロック以降のブロックに出現する文字列を利用して、文書検索結果をさらに絞り込むキーワードを自動的に利用者に推薦する手段を備えたことを特徴とするものである。
【００１８】
また本発明は、利用者が指定した複数個のキーワードを全て含むことを検索条件とした文書検索プログラムであって、利用者が指定したキーワードの文字列とその出現順序の両方の要素が一致するという条件を満たす文書を検索する手順をコンピュータに実行させるためのものである。
【００１９】
また本発明の文書検索プログラムは、利用者が指定したキーワードを含む文書を検索する手順、検索結果の各文書中で前記キーワード以降に出現する文字列を利用して検索結果を絞り込むキーワードを自動的に利用者に推薦する手順をコンピュータに実行させるためのものである。
【００２０】
また本発明の文書検索プログラムは、検索対象文書内に複数個存在する文書構造ブロックを認識する手順、前記ブロック内に利用者が指定した複数個のキーワード文字列が全て含まれるという条件を満たす文書を検索する手順をコンピュータに実行させるためのものである。
【００２１】
また本発明は、前記文書検索プログラムにおいて、利用者が指定した複数個のキーワード文字列が全て含まれる文書構造ブロック以降のブロックに出現する文字列を利用して、文書検索結果をさらに絞り込むキーワードを自動的に利用者に推薦する手順をコンピュータに実行させるためのものである。
【００２２】
すなわち、検索対象の文書内の文書構造としての、キーワードの出現順序と、文書ブロックを、利用者からの検索条件とともに、検索条件として考慮する。
【００２３】
キーワードの出現順序に関しては、インデックスデータ作成時の文書ごとに文書中に出現するキーワードを抽出する際に、文書中でのそのキーワードの出現順序の情報も、そのキーワードに関連づけて、キーワードごとにインデックスデータに保持する。そして、利用者が文書を検索する際に、利用者が投入した複数キーワードの文字列と順序を、インデックスデータと照合し、キーワードが全て一致して、順序の大小関係が正しいという条件に一致するものを検索結果とする。
【００２４】
文書ブロックに関しては、インデックスデータ作成時の文書ごとに文書中に出現するキーワードを抽出する際に、まずその文書の構造を解析しブロック化を行ってから、それぞれにＩＤ（ｉｄｅｎｔｉｆｉｃａｔｉｏｎ）を付け、そのＩＤをキーワードに関連づけて、キーワードごとにインデックスデータに保持する。そして、利用者が文書を検索する際に、利用者が投入した複数キーワードの文字列とブロックＩＤを、インデックスデータと照合し、キーワードが全て一致して、ブロックＩＤが同一という条件に一致するものを検索結果とする。
【００２５】
また、文書ブロックを利用し、検索結果文書内の、上記キーワードが属するブロックＩＤより文書内の位置として後方にあるブロックＩＤに含まれるキーワード文字列をインデックスファイルから抽出し、利用者に検索結果の絞り込み検索キーワード候補として推薦する。
【００２６】
【発明の実施の形態】
以下図面を参照して本発明の実施の形態例を詳細に説明する。
【００２７】
本発明の実施形態例に係る文書検索装置は、例えば図１の構成で実現できる。図１は「（１）文書検索装置（インデックスデータ生成）」と「（２）文書検索装置（文書検索）」の２つに、文書検索装置を分けて示しているが、同一装置内で実現する方法も可能である。図１の構成のそれぞれについて説明する。
【００２８】
（１）文書検索装置（インデックスデータ生成）
Ａ、検索対象文書群
検索の対象とする文書群を保存したデータベースである。データベースはきちんとした検索機能が実現されている必要はなく、単に検索対象の文書が文書ごとに分けて取り出せる構成で保存されていればよいので、コンピュータのファイルシステムのような、比較的単純な保存機能でも実現可能である。
【００２９】
Ｂ、文書選択
Ａの検索対象文書群データベースにアクセスし、データベースから順番に１つずつ文書データを取り出し、以降で実施する処理に文書データを渡す動作をする。
【００３０】
Ｃ、キーワード抽出
Ｂの文書選択で取り出した文書データに対し、自然言語処理等の既存の手法によって、文書から検索キーワードを複数個抽出する処理を行う。
【００３１】
Ｄ、文書構造抽出
文書の構造を抽出する動作を行う。文書の構造を抽出する動作については、後ほど詳しく説明する。
【００３２】
Ｅ、インデックスデータ
Ｃのキーワード抽出とＤの文書構造抽出で抽出したキーワードと文書構造を、文書ごとに、検索時に検索しやすい形式でデータベースに保存する。
【００３３】
（２）文書検索装置（文書検索）
ａ、キーワード抽出
ｇの入力装置で利用者が入力した文字列から、キーワードを抽出する。キーワードの抽出では利用者が文字列入力した順序に応じ、キーワードに順序づけをする。
【００３４】
利用者に提供するユーザインタフェースとしては、例えば図２に示すような利用者がキーワードを入力する画面を用いる。利用者には文字列入力スペースに、ｇの入力装置から直接文字列を入力する。利用者がキーワードを複数個入力する要望がある場合には、文字列をスペースで区切って入力する。
【００３５】
それぞれの文字列に対して、上記Ｃのキーワード抽出技術と同様に自然言語処理を適用し、文字列から複数のキーワードを抽出する。図２のように「今日の天気が」では一般的な自然言語処理によると「今日」と「天気」という２つのキーワードを抽出することができる。
【００３６】
キーワードを抽出するときに、抽出した順序、すなわちそれぞれのキーワードの先頭からの順序も合わせて抽出して、キーワードごとに順序を関連づける。
【００３７】
ｂ、検索条件生成
ａのキーワード抽出で得られた複数キーワードと、それぞれのキーワードの順序をもとに、ｅのインデックスデータを検索する検索条件を生成する。検索条件については後ほど詳しく説明する。
【００３８】
ｃ、検索実行
ｅのインデックスデータにアクセスし、ｂの検索条件生成で生成した検索条件で検索する。
【００３９】
ｄ、検索結果表示
ｃの検索実行で得られた検索結果を、利用者に提示する。また、得られた検索結果に対しさらに絞り込み検索を行うときに必要なキーワードを自動的に提示する機能もここで合わせて実現する。絞り込み検索キーワードの推薦については後ほど詳しく説明する。
【００４０】
ｅ、インデックスデータ
Ｅのインデックスデータで得られた、文書ごとにキーワードと文書構造が、検索時に検索しやすい形式で保存されたデータベースである。
【００４１】
ｆ、表示装置
ｄの検索結果表示で得られた情報を利用者に提示する表示装置であり、コンピュータのディスプレイ装置により実現できる。
【００４２】
ｇ、入力装置
利用者が検索条件を文字列で入力するための、入力装置である。コンピュータのキーボード装置により実現できる。
【００４３】
［文書構造の抽出］
Ｄの文書構造抽出について詳細に説明する。
【００４４】
文書構造としては、以下の２つがある。
【００４５】
Ｄ−１）文書中におけるキーワードの出現順序
Ｄ−２）文書中における文書構造ブロック
Ｄ−１は、文書中のキーワードの出現の順序を利用する。文書を自然言語処理によりキーワード抽出する際に、キーワードを抽出した順序、すなわち抽出したキーワードの文書先頭からの順序も算出し、抽出したキーワードに関連づけて、Ｅのインデックスデータに保持する。
【００４６】
Ｄ−２は、まず文書中から文書構造ブロックを認識する。文書構造ブロックを説明するために図３を用いる。図３は一般的なｗｅｂページで利用されているＨＴＭＬ（ｈｙｐｅｒｔｅｘｔｍａｒｋｕｐｌａｎｇｕａｇｅ）である。
【００４７】
このＨＴＭＬでは、<td>から</td>までの間に１つの情報が入っていて、ＨＴＭＬブラウザで表示すると図４のように表の形式で表示される。図４は図３のｗｅｂページをｗｅｂブラウザで表示した例である。本実施形態例では、このような<td>から</td>までを、文書構造ブロックとして検索に利用する。すなわち、<td>から</td>までを１つのブロックとして認識してそれらにＩＤを振り、文書からキーワードを抽出するときに、この文書構造ブロックＩＤを抽出したキーワードに関連づけてＥのインデックスデータに保持する。
【００４８】
ＨＴＭＬは構造記述言語であり様々なＨＴＭＬタグが文書中に存在する。よって、<td>から</td>までだけでなく、他のＨＴＭＬタグでも同様に文書構造ブロックを認識することが可能である。また、ＨＴＭＬ以外にも、日本語の一般的な文書では、文の末尾に「。」を付けるのが普通であるが、これを利用して、「。」の直後から次の「。」までを文書構造ブロックと認識することも可能である。
【００４９】
［検索条件の生成］
ｂの検索条件生成について詳細に説明する。検索条件の生成方法には、Ｄの文書構造の抽出に依存して、以下の２つがある。
【００５０】
ｂ−１）キーワードの出現順序の一致。文書構造抽出がＤ−１の場合
ｂ−２）文書構造ブロックの一致。文書構造抽出がＤ−２の場合
ｂ−１のキーワードの出現順序の一致に関しては、利用者が検索時に投入した文字列から抽出したキーワードの出現順序と、検索対象の文書から抽出した文書ごとのキーワードの出現順序が、一致していることを、検索条件とする。例えば、利用者が「今日」と「天気」とキーワードを投入した場合には、「今日」「天気」の順にキーワードが出現する文書だけを検索する。「天気」「今日」の順にキーワードが出現する文書は検索しない。
【００５１】
ｂ−２の文書構造ブロックの一致に関しては、利用者が検索時に投入した文字列から抽出したキーワードが、検索対象の文書で、同一の文書構造ブロックに属している、ということを検索条件として検索する。例えば、利用者が「今日」と「天気」とキーワードを投入した場合には、「今日」「天気」が同一の文書構造ブロックＩＤが振られている文書だけを検索する。
【００５２】
［絞り込み検索キーワードの推薦］
ｄの検索結果表示について詳細に説明する。検索結果を表示する際に、その検索結果に対し、さらに絞り込み検索を行い検索結果の候補を絞り込んで表示したいという利用者のニーズがある。よって、自動的に絞り込みキーワードの候補を提示する機能が有用である。本実施形態例による文書構造の活用により、適切な絞り込みキーワードが提示できる。
【００５３】
絞り込み検索キーワードの提示方法には、Ｄの文書構造の抽出に依存して、以下の２つがある。
【００５４】
ｄ−１）キーワードの出現順序の活用。文書構造抽出がＤ−１の場合
ｄ−２）文書構造ブロックの活用。文書構造抽出がＤ−２の場合
ｄ−１のキーワードの出現順序の活用は、事前の文書構造抽出が文書中でのキーワードの出現の順序を利用したものである場合に実現できる。
【００５５】
検索時、まず利用者が検索に利用したキーワードにヒットした文字列が、検索対象のそれぞれの文書中のどの位置に現れるかを特定する。そして、その位置よりも文書中で後半に出現する文字列から絞り込みキーワード候補を抽出する。
【００５６】
検索対象の文書中で、検索にヒットした文字列に対し後半の文字列から絞り込みキーワードを抽出するか、逆に前半から抽出するかは、利用者が任意に選ぶことができるようにする、といった構成も可能である。
【００５７】
ｄ−２の文書構造ブロックの活用は、事前の文書構造抽出が文書中での文書構造ブロックを利用したものである場合に実現できる。
【００５８】
検索時に、上記ｂ−２のように利用者が複数個投入したキーワードが、文書中で同じブロックに属している文書を利用者に提示する。そして、利用者に対する絞り込みキーワード候補の推薦は、利用者が複数個投入したキーワードと同じブロックからさらに抽出することによって、実現できる。
【００５９】
さて、上記実施形態例は、文書検索装置上に保存された文書群に対し、本実施形態例の文書構造抽出方式を活かして検索する方法についてである。
【００６０】
しかし、本実施形態例は装置上に保存された文書群に対してだけでなく、様々な装置に分散して保存されている文書群の検索に対しても有効である。この実施形態例を以下に述べる。
【００６１】
図５の、（１）インデックス装置は、インターネット上の文書群から、文書構造とキーワードを抽出し、インデックスファイルに保存し、文書検索装置からの検索要求に備える機能を持つ。また、（２）文書検索装置は、利用者が入力装置から入力したキーワードを基にインデックス装置に対し検索要求を行い、結果を利用者に提示する機能を持つ。
【００６２】
（１）インデックス装置
インターネット上の複数のサーバに対し、検索対象文書（Ａ）を連続的に読み出す。検索対象文書は、文書選択（Ｂ）機能によってリスト化され、このリストに基づき連続的に文書を読み出す。リストは、初期文書を設定しておくと、その初期文書中にある文書間の関連を利用して他の文書の存在を知り、その文書をリストに付加しながら、さらにそれら文書中に含まれる他の文書への関連を基に、他の文書をリストに付加するという再帰的な手法による方法がある。また、リストを全て手動で作成し、決まった文書のみを検索対象とする場合もある。これら２方法を混ぜた、ある一定範囲の文書に対し再帰的にリストを作成する方法もある。
【００６３】
文書選択（Ｂ）機能によって読み出された検索対象文書は、キーワード抽出（Ｃ）機能によって、自然言語処理等の既存の手法によって、文書から検索キーワードを複数個抽出する処理を行う。
【００６４】
さらに、文書構造抽出（Ｄ）機能によって、文書の構造を抽出する動作を行う。文書構造の抽出方法は、文書中におけるキーワードの出現順序によるもの（１の方法とする）、文書中における文書構造ブロックによるもの（２の方法とする）の２種類がある。
【００６５】
前者は、文書中のキーワードの出現の順序を利用する。文書を自然言語処理によりキーワード抽出する際に、キーワードを抽出した順序、すなわち抽出したキーワードの文書先頭からの順序も算出し、抽出したキーワードに関連づけて、インデックスデータ（Ｅ）に保持する。
【００６６】
後者は、まず文書中から文書構造ブロックを認識する。インターネット上に存在する文書の多くはhtmlやxmlのような構造記述言語であり、構造記述言語はタグによって文書が構造化されている。この構造を１つのブロックとして認識し、それらにＩＤを振って区別し、文書からキーワードを抽出するときに、この文書構造ブロックＩＤを抽出したキーワードに関連づけてインデックスデータ（Ｅ）に保持する。
【００６７】
また、htmlやxml以外にも、日本語の一般的な文書では、文の末尾に「。」を付けるのが普通であり、これを利用して、「。」の直後から次の「。」までを文書構造ブロックと認識することも可能である。他にも、「例えば」や「しかし」など日本語で文書の意味が変化するときに用いられる接続語により、前後の意味を分割し、文書構造とする方法も可能である。
【００６８】
上記、キーワード抽出（Ｃ）機能と文書構造抽出（Ｄ）機能で抽出したキーワードと文書構造を、それぞれ文書ごとに、検索時に読み出しやすい形式でデータベースに保存する。
【００６９】
（２）文書検索装置
まず、利用者が入力装置から入力したキーワードから、キーワード抽出（ａ）機能により、キーワードを抽出する。利用者が入力したそれぞれの文字列に対し、キーワードを抽出する。「今日の天気が」と利用者が入力した場合には、例えば「今日」と「天気」という２つのキーワードを抽出する。キーワードだけでなく、キーワードの順序も保持しておく。
【００７０】
次に、キーワード抽出（ａ）機能で得られた複数個のキーワードと、それぞれのキーワードの順序をもとに、検索実行（ｃ）機能で、インデックスデータを検索する検索条件を生成する。方法には、（１）のインデックスデータの生成方法に依存して、次の２つがある。
【００７１】
キーワードの出現順序の一致。文書構造抽出が１の方法の場合、文書構造ブロックの一致。文書構造抽出が２の方法の場合。
【００７２】
キーワードの出現順序の一致に関しては、利用者が検索時に投入した文字列から抽出したキーワードの出現順序と、検索対象の文書から抽出した文書ごとのキーワードの出現順序が、一致していることを、検索条件とする。例えば、利用者が「今日」と「天気」とキーワードを投入した場合には、「今日」「天気」の順にキーワードが出現する文書だけを検索する。「天気」「今日」の順にキーワードが出現する文書は検索しない。
【００７３】
文書構造ブロックの一致に関しては、利用者が検索時に投入した文字列から抽出したキーワードが、検索対象の文書で、同一の文書構造ブロックに属している、ということを検索条件として検索する。例えば、利用者が「今日」と「天気」とキーワードを投入した場合には、「今日」「天気」が同一の文書構造ブロックＩＤが振られている文書だけを検索する。
【００７４】
作成した検索条件によって、（１）のインデックス装置のインデックスデータ（Ｅ）に対して検索要求を出す。（１）のインデックス装置としては、検索要求にあった検索結果を、（２）の文書検索装置に返す。
【００７５】
この検索結果を、検索結果表示（ｄ）機能で、利用者に提示する。
【００７６】
このとき、得られた検索結果に対しさらに絞り込み検索を行うときに必要なキーワードを自動的に提示する機能もここで合わせて実現する。
【００７７】
絞り込み検索のキーワードの自動での提示は、（１）のインデックス装置での文書構造抽出方法に依存して、２つの方法がある。キーワードの出現順序の活用、及び文書構造ブロックの活用である。
【００７８】
キーワードの出現順序の活用は、（１）での文書構造抽出方法が１の場合である。事前の文書構造抽出が文書中でのキーワードの出現の順序を利用したものである場合に実現できる。
【００７９】
検索時、まず利用者が検索に利用したキーワードにヒットした文字列が、検索対象のそれぞれの文書中のどの位置に現れるかを特定する。そして、その位置よりも文書中で後半に出現する文字列から絞り込みキーワード候補を抽出する。
【００８０】
検索対象の文書中で、検索にヒットした文字列に対し後半の文字列から絞り込みキーワードを抽出するか、逆に前半から抽出するかは、利用者が任意に選ぶことができるようにする、といった構成も可能である。
【００８１】
文書構造ブロックの活用は、（１）での文書構造抽出方法が２の場合である。事前の文書構造抽出が文書中での文書構造ブロックを利用したものである場合に実現できる。
【００８２】
検索時に、上記ｂ−２のように利用者が複数個投入したキーワードが、文書中で同じブロックに属している文書を利用者に提示する。そして、利用者に対する絞り込みキーワード候補の推薦は、利用者が複数個投入したキーワードと同じブロックからさらに抽出することによって、実現できる。
【００８３】
尚、前記実施形態例における文書検索方法は、具体的にはパソコン等のコンピュータにより、予め所定の文書検索プログラムに基づいて実行される。前記文書検索プログラムは例えばＣＤ等の所定のコンピュータ読み取り可能な記録媒体に記録することができる。
【００８４】
【発明の効果】
以上述べたように本発明によれば、キーワード検索で文書を絞り込む場合に、文書中の意味の連携の情報を用いることができるので、▲１▼という事象のあとで▲２▼という事象が起こったことと、▲２▼という事象のあとで▲１▼という事象が起こったことを区別して検索できるため、利用者が望む検索結果が得やすい。
【００８５】
また、従来は次の絞り込み候補を利用者に勧める場合、文書全体から絞り込みが可能な文字列を抽出するので候補が曖昧になりかつ候補数も膨大となり利用が難しいが、本発明の手法を用いると、候補が限定され量も限定されるため、利便性の高いサービスが可能である。
【図面の簡単な説明】
【図１】本発明の実施形態例に係る文書検索装置を示す構成説明図である。
【図２】本発明の実施形態例に係る利用者がキーワードを入力する画面の例を示す説明図である。
【図３】一般的なｗｅｂページで利用されているＨＴＭＬの例を示す説明図である。
【図４】図３のｗｅｂページをｗｅｂブラウザで表示した例を示す説明図である。
【図５】本発明の実施形態例に係るインターネット上の文書検索の構成の例を示す説明図である。
【符号の説明】
Ａ…検索対象文書群、Ｂ…文書選択、Ｃ…キーワード抽出、Ｄ…文書構造抽出、Ｅ…インデックスデータ、ａ…キーワード抽出、ｂ…検索条件生成、ｃ…検索実行、ｄ…検索結果表示、ｅ…インデックスデータ、ｆ…表示装置、ｇ…入力装置。

[0001]
[Technical Field of the Invention]
The present invention relates to a document search method, apparatus, and program for searching a large number of documents for documents that match conditions specified by a user.
[0002]
[Prior art]
In conventional keyword search, as preprocessing, every character string that appears in each document to be searched is extracted as a keyword for retrieving that document, and these keywords are held as index data for each document group. The keywords entered by the user are then checked against the index data to determine which documents contain them, and the matching documents are displayed as the search results for that document group.
[0003]
However, there is no search method in which a search condition is a match between the order in which a user inputs a plurality of keywords and the order in which keywords appear in a document.
[0004]
Further, a document written in a structure description language (e.g., HTML) may be divided into a plurality of blocks whose contents differ greatly from one another, but there is also no search method that can specify the condition that all of the keywords entered by the user are contained within a single block.
[0005]
As a function for recommending the next search candidates to the user, there has conventionally been a method, as disclosed in Patent Document 1, “Next search candidate word presentation method and apparatus and recording medium recording next search candidate word presentation program”, of recommending as search keyword candidates those keywords that users entered most frequently in the past.
[0006]
There has also been a method that uses the technique disclosed in Non-Patent Document 1, “Study of Automatic Text Classification System for Questionnaire”, to classify the documents in the search results by character string and to recommend search candidates by showing the classification result to the user.
[0007]
However, there is a problem that accuracy is poor because none of the above reflects the document structure existing in the search target document.
[0008]
[Patent Document 1]
JP-A-2002-92032
[Non-Patent Document 1]
Masayuki Sugizaki, Masakatsu Okubo, Kazuo Tanaka, "Study of Automatic Text Classification System for Questionnaire", Information Processing Society of Japan, 59th (Late 1999) National Convention, 4N-3, Volume 2, pp. 377-378 (1999).
[0009]
[Problems to be solved by the invention]
The present invention has been made in view of the above circumstances, and its object is to provide a document search method, apparatus, and program that improve the accuracy of document search from search keywords entered by a user and improve convenience by automatically recommending candidates for the next narrowing search.
[0010]
[Means for Solving the Problems]
To achieve the above object, the present invention is a document search method whose search condition is that a document contains all of a plurality of keywords specified by a user, characterized by having a step of searching for documents that satisfy the condition that both the character strings of the keywords specified by the user and their order of appearance match.
[0011]
The document search method of the present invention is further characterized by having a step of searching for documents that contain a keyword specified by a user, and a step of automatically recommending to the user keywords for narrowing the search results, using character strings that appear after the keyword in each document of the search results.
[0012]
The document search method of the present invention is further characterized by having a step of recognizing the plurality of document structure blocks present in a search target document, and a step of searching for documents that satisfy the condition that all of a plurality of keyword character strings specified by a user are contained within such a block.
[0013]
Further, in the above document search method, the present invention is characterized by having a step of automatically recommending to the user keywords for further narrowing the document search results, using character strings that appear in blocks after the document structure block that contains all of the plurality of keyword character strings specified by the user.
[0014]
Further, the present invention is a document search apparatus whose search condition is that a document contains all of a plurality of keywords specified by a user, characterized by comprising means for searching for documents that satisfy the condition that both the character strings of the keywords specified by the user and their order of appearance match.
[0015]
The document search apparatus of the present invention is further characterized by comprising means for searching for documents that contain a keyword specified by a user, and means for automatically recommending to the user keywords for narrowing the search results, using character strings that appear after the keyword in each document of the search results.
[0016]
The document search apparatus of the present invention is further characterized by comprising means for recognizing the plurality of document structure blocks present in a search target document, and means for searching for documents that satisfy the condition that all of a plurality of keyword character strings specified by a user are contained within such a block.
[0017]
Further, in the above document search apparatus, the present invention is characterized by comprising means for automatically recommending to the user keywords for further narrowing the document search results, using character strings that appear in blocks after the document structure block that contains all of the plurality of keyword character strings specified by the user.
[0018]
The present invention is also a document search program whose search condition is that a document contains all of a plurality of keywords specified by a user, the program causing a computer to execute a procedure of searching for documents that satisfy the condition that both the character strings of the keywords specified by the user and their order of appearance match.
[0019]
The document search program of the present invention also causes a computer to execute a procedure of searching for documents that contain a keyword specified by a user, and a procedure of automatically recommending to the user keywords for narrowing the search results, using character strings that appear after the keyword in each document of the search results.
[0020]
The document search program of the present invention also causes a computer to execute a procedure of recognizing the plurality of document structure blocks present in a search target document, and a procedure of searching for documents that satisfy the condition that all of a plurality of keyword character strings specified by a user are contained within such a block.
[0021]
Further, in the above document search program, the present invention causes a computer to execute a procedure of automatically recommending to the user keywords for further narrowing the document search results, using character strings that appear in blocks after the document structure block that contains all of the plurality of keyword character strings specified by the user.
[0022]
That is, the order of appearance of keywords and the document blocks, as the document structure within the documents to be searched, are taken into account as search conditions together with the conditions supplied by the user.
[0023]
Regarding the order of appearance of keywords: when the keywords appearing in each document are extracted at index data creation time, information on the order in which each keyword appears in the document is also associated with the keyword and held in the index data for each keyword. When the user searches for documents, the character strings and the order of the plurality of keywords entered by the user are checked against the index data, and the documents in which all of the keywords match and the order relation between them is correct are returned as the search results.
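The order-based index and lookup described in this paragraph can be sketched roughly as follows. This is a minimal illustration, not the implementation of the invention: it assumes the keywords of each document have already been extracted in order of appearance, and it assumes that a keyword occurring several times may match at any occurrence. The names `build_position_index`, `matches_in_order`, and `search_ordered` are hypothetical.

```python
from collections import defaultdict


def build_position_index(docs):
    """Map doc_id -> keyword -> list of positions (0-based order of appearance).

    `docs` maps a document ID to its keywords, already extracted in order.
    """
    index = {}
    for doc_id, keywords in docs.items():
        positions = defaultdict(list)
        for pos, kw in enumerate(keywords):
            positions[kw].append(pos)
        index[doc_id] = dict(positions)
    return index


def matches_in_order(positions, query):
    """True if every query keyword occurs, at strictly increasing positions."""
    last = -1
    for kw in query:
        # Assumption: take the earliest occurrence that lies after the previous match.
        nxt = next((p for p in positions.get(kw, []) if p > last), None)
        if nxt is None:
            return False
        last = nxt
    return True


def search_ordered(index, query):
    """Return the IDs of documents that contain all query keywords in query order."""
    return sorted(doc_id for doc_id, positions in index.items()
                  if matches_in_order(positions, query))


docs = {
    "d1": ["today", "weather", "rain"],
    "d2": ["weather", "today"],
    "d3": ["today", "rain"],
}
index = build_position_index(docs)
print(search_ordered(index, ["today", "weather"]))
```

With these toy documents, only `d1` contains "today" followed by "weather"; `d2` contains both keywords but in the reverse order and is excluded.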
[0024]
Regarding document blocks: when the keywords appearing in each document are extracted at index data creation time, the structure of the document is first analyzed and divided into blocks, an ID (identification) is assigned to each block, and the ID is associated with each keyword and held in the index data for each keyword. When the user searches for documents, the character strings of the plurality of keywords entered by the user and the block IDs are checked against the index data, and the documents in which all of the keywords match and have the same block ID are returned as the search results.
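As a rough sketch of the block-based index described here, the following assumes each document has already been split into blocks and each block into keywords, and uses the 0-based block number as the block ID. The function names are hypothetical and the data is a toy example.

```python
from collections import defaultdict


def build_block_index(docs):
    """Map doc_id -> keyword -> set of block IDs in which the keyword occurs.

    `docs` maps a document ID to a list of blocks; each block is a list of keywords.
    The block ID is simply the block's 0-based position in the document (an assumption).
    """
    index = {}
    for doc_id, blocks in docs.items():
        kw_blocks = defaultdict(set)
        for block_id, keywords in enumerate(blocks):
            for kw in keywords:
                kw_blocks[kw].add(block_id)
        index[doc_id] = dict(kw_blocks)
    return index


def shared_blocks(kw_blocks, query):
    """Block IDs that contain every query keyword (empty set if there are none)."""
    if not query:
        return set()
    return set.intersection(*[kw_blocks.get(kw, set()) for kw in query])


def search_same_block(index, query):
    """Return the IDs of documents in which all query keywords share a block."""
    return sorted(doc_id for doc_id, kw_blocks in index.items()
                  if shared_blocks(kw_blocks, query))


docs = {
    "d1": [["today", "weather"], ["sports"]],
    "d2": [["today"], ["weather"]],
}
index = build_block_index(docs)
print(search_same_block(index, ["today", "weather"]))
```

Here `d2` contains both keywords but in different blocks, so only `d1` satisfies the same-block condition.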
[0025]
Also, using the document blocks, keyword character strings contained in blocks whose IDs lie later in the document than the block ID to which the above keywords belong are extracted from the index file for each search result document, and are recommended to the user as candidate keywords for narrowing the search results.
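One possible reading of this recommendation step is sketched below. The text does not say how candidates are ranked or which block counts as the hit when several qualify, so two assumptions are made and labeled in the code: the first block containing all query keywords is taken as the hit, and candidates are ranked by how often they occur in later blocks. The function name is hypothetical.

```python
from collections import Counter


def recommend_from_later_blocks(doc_blocks, query, limit=5):
    """Suggest narrowing keywords from blocks located after the matching block.

    `doc_blocks` is the list of blocks of one search-result document,
    each block being a list of keywords in document order.
    """
    # Assumption: the first block that contains every query keyword is the hit.
    hit = next((i for i, kws in enumerate(doc_blocks)
                if all(q in kws for q in query)), None)
    if hit is None:
        return []
    # Assumption: rank candidates by frequency; the source does not specify a ranking.
    counts = Counter(kw for kws in doc_blocks[hit + 1:] for kw in kws
                     if kw not in query)
    return [kw for kw, _ in counts.most_common(limit)]


doc_blocks = [
    ["news"],
    ["today", "weather"],
    ["rain", "umbrella"],
    ["rain", "today"],
]
print(recommend_from_later_blocks(doc_blocks, ["today", "weather"]))
```

In this toy document the keywords of block 0 ("news") are never suggested, because that block precedes the block that matched the query.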
[0026]
[Best Mode for Carrying Out the Invention]
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
[0027]
The document search device according to the embodiment of the present invention can be realized by, for example, the configuration of FIG. 1. FIG. 1 shows the document search device divided into two parts, “(1) document search device (index data generation)” and “(2) document search device (document search)”, but the two can also be implemented within a single device. Each part of the configuration in FIG. 1 is described below.
[0028]
(1) Document search device (index data generation)
A. Search target document group
A database that stores the group of documents to be searched. The database does not need a full-fledged search function; it is enough that the search target documents are stored so that they can be retrieved one document at a time, so a relatively simple storage facility such as a computer file system will do.
[0029]
B. Document selection
Accesses search target document group database A, takes out the document data one document at a time in order, and passes it to the subsequent processing.
[0030]
C. Keyword extraction
Extracts a plurality of search keywords from the document data taken out by document selection B, using an existing technique such as natural language processing.
[0031]
D. Document structure extraction
Extracts the structure of the document. This operation is described in detail later.
[0032]
E. Index data
The keywords and document structure extracted by keyword extraction C and document structure extraction D are stored in a database, document by document, in a format that is easy to search at search time.
[0033]
(2) Document search device (document search)
a. Keyword extraction
Extracts keywords from the character string the user entered with input device g. During extraction, the keywords are ordered according to the order in which the user entered the character strings.
[0034]
As the user interface, a screen on which the user enters keywords, such as the one shown in FIG. 2, is used. The user types a character string directly into the character string input field from input device g. To enter a plurality of keywords, the user separates the character strings with spaces.
[0035]
Natural language processing is applied to each character string, in the same way as in keyword extraction C above, to extract a plurality of keywords from it. For “today's weather” as in FIG. 2, ordinary natural language processing can extract the two keywords “today” and “weather”.
[0036]
When keywords are extracted, the order of extraction, that is, each keyword's position counted from the beginning, is also extracted and associated with that keyword.
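The query-side extraction can be illustrated with a deliberately simplified sketch. The source only says that ordinary natural language processing is applied; here a toy splitter that cuts on a handful of Japanese particles stands in for real morphological analysis, purely as an assumption for illustration. `PARTICLES` and `extract_query_keywords` are hypothetical names.

```python
import re

# Toy stand-in for morphological analysis (assumption): cut on a few particles.
# A real system would use a proper Japanese tokenizer instead.
PARTICLES = "のがをはに"


def extract_query_keywords(text):
    """Return (keyword, order) pairs, with order counted from the start of the input."""
    keywords = []
    for chunk in text.split():  # the user separates character strings with spaces
        for piece in re.split("[" + PARTICLES + "]", chunk):
            if piece:
                keywords.append(piece)
    return [(kw, order) for order, kw in enumerate(keywords)]


print(extract_query_keywords("今日の天気が"))
```

For the input shown in the original Japanese text, 「今日の天気が」, this yields 「今日」 ("today") with order 0 and 「天気」 ("weather") with order 1, which is the pair of ordered keywords that the search condition is then built from.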
[0037]
b. Search condition generation
Generates a search condition for searching index data e, based on the plurality of keywords obtained by keyword extraction a and the order of each keyword. The search conditions are described in detail later.
[0038]
c. Search execution
Accesses index data e and searches it with the search condition generated by search condition generation b.
[0039]
d. Search result display
Presents the search results obtained by search execution c to the user. A function that automatically presents the keywords needed to run a further narrowing search on the obtained results is also realized here. The recommendation of narrowing search keywords is described in detail later.
[0040]
e. Index data
The database obtained as index data E, in which the keywords and document structure of each document are stored in a format that is easy to search at search time.
[0041]
f. Display device
A display device that presents the information obtained by search result display d to the user; it can be realized by a computer display.
[0042]
g. Input device
An input device with which the user enters search conditions as a character string; it can be realized by a computer keyboard.
[0043]
[Extraction of document structure]
The document structure extraction of D will be described in detail.
[0044]
There are the following two document structures.
[0045]
D-1) Order of appearance of keywords in the document
D-2) Document structure blocks in the document
D-1 uses the order in which keywords appear in the document. When keywords are extracted from a document by natural language processing, the order in which they are extracted, that is, each extracted keyword's position counted from the head of the document, is also calculated, associated with the keyword, and held in index data E.
[0046]
D-2 first recognizes a document structure block from the document. FIG. 3 is used to explain the document structure block. FIG. 3 shows HTML (hypertext markup language) used in a general web page.
[0047]
In this HTML, one piece of information is placed between each <td> and </td>, and when displayed by an HTML browser it appears in table form as shown in FIG. 4. FIG. 4 shows an example in which the web page of FIG. 3 is displayed by a web browser. In the present embodiment, each span from <td> to </td> is used for searching as a document structure block. That is, each span from <td> to </td> is recognized as one block and assigned an ID, and when keywords are extracted from the document, this document structure block ID is associated with each extracted keyword and held in index data E.
[0048]
HTML is a structure description language, and a variety of HTML tags appear in a document. Document structure blocks can therefore be recognized not only from <td> to </td> but likewise with other HTML tags. Beyond HTML, ordinary Japanese text normally ends each sentence with the full stop “。”, so the span from immediately after one “。” to the next “。” can also be recognized as a document structure block.
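Recognizing <td> spans as blocks can be sketched with Python's standard `html.parser` module. This is only an illustration under assumptions: the class and function names are hypothetical, block IDs are assigned as 0-based sequence numbers, and a table nested inside a cell is folded into the enclosing cell's block rather than given its own ID.

```python
from html.parser import HTMLParser


class TdBlockParser(HTMLParser):
    """Collect the text of each outermost <td>...</td> span as one block."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # text of each completed block, in document order
        self._depth = 0    # nesting depth of <td> elements currently open
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._depth += 1
            if self._depth == 1:
                self._buf = []

    def handle_endtag(self, tag):
        if tag == "td" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.blocks.append("".join(self._buf).strip())

    def handle_data(self, data):
        if self._depth > 0:
            self._buf.append(data)


def td_blocks(html):
    """Return {block_id: text}, with block IDs assigned in document order."""
    parser = TdBlockParser()
    parser.feed(html)
    parser.close()
    return dict(enumerate(parser.blocks))


page = "<table><tr><td>today weather</td><td>sunny</td></tr></table>"
print(td_blocks(page))
```

Keyword extraction would then run on each block's text separately, so that every extracted keyword can be stored together with the ID of the block it came from.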
[0049]
[Generation of search conditions]
Search condition generation b will be described in detail. There are the following two methods of generating search conditions, depending on which document structure extraction D was used.
[0050]
b-1) Matching of the order of appearance of keywords, when the document structure extraction is D-1
b-2) Matching of document structure blocks, when the document structure extraction is D-2
For b-1, matching of the order of appearance of keywords, the search condition is that the order of appearance of the keywords extracted from the character string the user entered at search time matches the order of appearance of the keywords extracted from each search target document. For example, when the user enters the keywords “today” and “weather”, only documents in which the keywords appear in the order “today”, “weather” are retrieved. Documents in which the keywords appear in the order “weather”, “today” are not retrieved.
[0051]
For b-2, matching of document structure blocks, the search condition is that the keywords extracted from the character string the user entered at search time belong to the same document structure block in the search target document. For example, when the user enters the keywords “today” and “weather”, only documents in which “today” and “weather” are assigned the same document structure block ID are retrieved.
[0052]
[Recommendation of narrowing search keywords]
Search result display d will be described in detail. When search results are displayed, users often want to run a further narrowing search on them and see a reduced set of candidates. A function that automatically presents candidate narrowing keywords is therefore useful. By making use of the document structure as in the present embodiment, appropriate narrowing keywords can be presented.
[0053]
There are the following two methods of presenting the refined search keyword depending on the extraction of the document structure of D.
[0054]
d-1) Use of the order of appearance of keywords, when the document structure extraction is D-1
d-2) Use of document structure blocks, when the document structure extraction is D-2
d-1, use of the order of appearance of keywords, can be realized when the document structure extraction performed in advance used the order in which keywords appear in the document.
[0055]
At search time, the positions at which the character strings that hit the user's search keywords appear in each search target document are first identified. Candidate narrowing keywords are then extracted from the character strings that appear later in the document than those positions.
[0056]
A configuration is also possible in which the user can freely choose whether narrowing keywords are extracted from the text after the character string that hit the search or, conversely, from the text before it.
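The position-based recommendation of d-1, including the user's choice of direction, might look like the following sketch. The source does not define "the position" when several query keywords hit, so this assumes "after" means after the last hit and "before" means before the first hit; it also returns candidates in order of appearance, which is another assumption. The function name and the `direction` parameter are hypothetical.

```python
def recommend_by_position(doc_keywords, query, direction="after", limit=5):
    """Suggest narrowing keywords from one result document.

    `doc_keywords` lists the document's keywords in order of appearance.
    `direction` selects the text after the hits ("after") or before them ("before").
    """
    hits = [i for i, kw in enumerate(doc_keywords) if kw in query]
    if not hits:
        return []
    if direction == "after":
        # Assumption: start after the last position at which a query keyword hit.
        candidates = doc_keywords[max(hits) + 1:]
    elif direction == "before":
        # Assumption: stop before the first position at which a query keyword hit.
        candidates = doc_keywords[:min(hits)]
    else:
        raise ValueError("direction must be 'after' or 'before'")
    seen, suggestions = set(query), []
    for kw in candidates:
        if kw not in seen:  # drop duplicates and the query keywords themselves
            seen.add(kw)
            suggestions.append(kw)
    return suggestions[:limit]


doc = ["tokyo", "today", "weather", "rain", "umbrella", "rain"]
print(recommend_by_position(doc, ["today", "weather"]))
print(recommend_by_position(doc, ["today", "weather"], direction="before"))
```

Restricting candidates to one side of the hit keeps the suggestion list much shorter than drawing candidates from the whole document.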
[0057]
d-2, use of document structure blocks, can be realized when the document structure extraction performed in advance used the document structure blocks in the document.
[0058]
At the time of retrieval, a document in which a plurality of keywords entered by the user as in b-2 above belongs to the same block in the document is presented to the user. Then, recommendation of the narrowed-down keyword candidates to the user can be realized by further extracting from the same block as the keyword inputted by the user.
[0059]
The above-described embodiment relates to a method of searching a document group stored on the document search apparatus by utilizing the document structure extraction method of the embodiment.
[0060]
However, the present embodiment is effective not only for a document group stored on a device, but also for a search for a document group distributed and stored in various devices. This embodiment will be described below.
[0061]
The (1) index device shown in FIG. 5 has a function of extracting a document structure and a keyword from a group of documents on the Internet, storing them in an index file, and preparing for a search request from the document search device. (2) The document search device has a function of making a search request to the index device based on a keyword input by the user from the input device, and presenting the result to the user.
[0062]
(1) Index device
The search target documents (A) are read continuously from a plurality of servers on the Internet. The search target documents are compiled into a list by the document selection (B) function, and documents are read continuously according to this list. One way to build the list is recursive: an initial document is set, other documents are discovered through the relations between documents found in that initial document and are added to the list, and further documents are then added on the basis of the relations to other documents contained in those documents. Alternatively, the entire list may be created manually so that only fixed documents are searched. The two methods can also be combined, building the list recursively over a certain limited range of documents.
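The three ways of building the document list can be sketched with one small function. This is an illustration only: it uses a queue instead of literal recursion, takes the "relations between documents" as a caller-supplied `get_links` function (for web pages this would presumably be hyperlinks, which is an assumption), and models the limited range as an `in_scope` predicate. No network access is performed; the link graph below is an in-memory toy.

```python
from collections import deque


def build_document_list(seeds, get_links, in_scope=lambda doc: True):
    """Build the list of documents to index, starting from the initial documents.

    `get_links(doc)` returns the documents that `doc` refers to.
    `in_scope(doc)` limits list building to a certain range of documents.
    """
    listed, seen, queue = [], set(), deque()
    for seed in seeds:
        if seed not in seen and in_scope(seed):
            seen.add(seed)
            queue.append(seed)
    while queue:
        doc = queue.popleft()
        listed.append(doc)
        for linked in get_links(doc):
            if linked not in seen and in_scope(linked):
                seen.add(linked)
                queue.append(linked)
    return listed


# Toy link graph standing in for documents on several servers.
links = {"a": ["b", "c"], "b": ["a", "d"], "c": [], "d": ["x/e"], "x/e": []}


def toy_links(doc):
    return links.get(doc, [])


print(build_document_list(["a"], toy_links))
print(build_document_list(["a"], toy_links, in_scope=lambda d: not d.startswith("x/")))
```

A purely manual list corresponds to passing every document as a seed together with a `get_links` that returns nothing; the combined method corresponds to the second call, where discovery follows relations but stops at the edge of the allowed range.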
[0063]
From each search target document read by the document selection (B) function, the keyword extraction (C) function extracts a plurality of search keywords using an existing method such as natural language processing.
[0064]
Further, the document structure extraction (D) function extracts the document structure. There are two document structure extraction methods: one based on the order of appearance of keywords in a document (method 1) and one based on document structure blocks in a document (method 2).
[0065]
Method 1 uses the order in which keywords appear in a document. When keywords are extracted from a document by natural language processing, the order in which they are extracted, that is, the position of each keyword counted from the head of the document, is also computed and stored in the index data (E) in association with the extracted keyword.
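As a rough illustration of method 1, the following sketch stores each keyword together with its position from the head of the document. The tokenizer is a deliberately naive stand-in for natural language processing, and the names `extract_keywords`, `build_order_index`, and the index layout are assumptions made for this example, not something the source specifies.

```python
import re
from collections import defaultdict

def extract_keywords(text):
    # Naive placeholder for natural language processing:
    # lowercase alphanumeric tokens in document order.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_order_index(documents):
    """Return {keyword: {doc_id: [positions from the head of the document]}}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, keyword in enumerate(extract_keywords(text)):
            index[keyword].setdefault(doc_id, []).append(position)
    return index

documents = {
    "d1": "Today the weather is fine",
    "d2": "Weather report for today",
}
index = build_order_index(documents)
print(index["today"])    # {'d1': [0], 'd2': [3]}
print(index["weather"])  # {'d1': [2], 'd2': [0]}
```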
[0066]
Method 2 first recognizes document structure blocks in the document. Most documents on the Internet are written in structured description languages such as HTML and XML, which are structured by tags. Each structure delimited by tags is recognized as one block, and an ID is assigned to each block to distinguish it. When a keyword is extracted from a document, the ID of the document structure block it belongs to is stored in the index data (E) in association with the keyword.
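One possible way to realize method 2 for HTML is sketched below with Python's standard `html.parser` module. The choice of which tags open a block (`BLOCK_TAGS`), the class and function names, and the simplification that nested blocks are not tracked on close are all assumptions of this example; the source only says that tag-delimited structures are treated as blocks and given IDs.

```python
import re
from html.parser import HTMLParser

# Assumed set of tags that start a new document structure block.
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3"}

class BlockKeywordExtractor(HTMLParser):
    """Collect (keyword, block_id) pairs while parsing HTML."""

    def __init__(self):
        super().__init__()
        self.block_id = 0   # 0 means text outside any recognized block
        self.entries = []   # list of (keyword, block_id)

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.block_id += 1  # assign a fresh ID to each new block

    def handle_data(self, data):
        # Naive tokenization standing in for natural language processing.
        for keyword in re.findall(r"[a-z0-9]+", data.lower()):
            self.entries.append((keyword, self.block_id))

def extract_block_keywords(html):
    parser = BlockKeywordExtractor()
    parser.feed(html)
    parser.close()
    return parser.entries

html = "<p>Today weather is fine</p><p>Stock prices rose today</p>"
for keyword, block_id in extract_block_keywords(html):
    print(block_id, keyword)
```

In this sample, "today" is recorded with block ID 1 and again with block ID 2, while "weather" is recorded only with block ID 1.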
[0067]
Besides HTML and XML documents, ordinary Japanese text usually ends each sentence with a full stop ("。"), so the text from immediately after one "。" up to the next "。" can be recognized as a document structure block. It is also possible to form a document structure by splitting the text before and after a connective that is used in Japanese where the meaning of the text changes, such as "for example" or "but".
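A minimal sketch of this sentence-based block recognition follows. It assumes the Japanese full stop "。" as the sentence terminator and uses "しかし" and "例えば" as assumed Japanese renderings of "but" and "for example"; the function name and the block numbering are hypothetical. Splitting on a zero-width pattern with `re.split` requires Python 3.7 or later.

```python
import re

# Assumed connectives; the source names only "for example" and "but".
CONNECTIVES = ("しかし", "例えば")

def split_into_blocks(text):
    """Split plain Japanese text into numbered document structure blocks."""
    # Zero-width lookahead: split just before each connective.
    connective_pattern = "(?=" + "|".join(CONNECTIVES) + ")"
    blocks = []
    # Each run of text up to and including "。" is one sentence.
    for sentence in re.findall(r"[^。]+。?", text):
        for piece in re.split(connective_pattern, sentence):
            piece = piece.strip()
            if piece:
                blocks.append(piece)
    return list(enumerate(blocks, start=1))

text = "今日は晴れです。明日は雨ですしかし暖かいです。"
for block_id, block in split_into_blocks(text):
    print(block_id, block)
# 1 今日は晴れです。
# 2 明日は雨です
# 3 しかし暖かいです。
```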
[0068]
The keywords and the document structure extracted by the keyword extraction (C) function and the document structure extraction (D) function are stored in a database for each document in a format that is easy to read at the time of retrieval.
[0069]
(2) Document search device
First, the keyword extraction (a) function extracts keywords from the character string entered by the user from the input device. Keywords are extracted for each character string entered by the user. For example, when the user enters “Today's weather”, the two keywords “today” and “weather” are extracted. The order of the keywords is retained along with the keywords themselves.
[0070]
Next, based on the plurality of keywords obtained by the keyword extraction (a) function and their order, the search execution (c) function generates a search condition for searching the index data. Depending on how the index data was generated in (1), one of the following two methods is used.
[0071]
Matching of the order of keyword appearance: used when the document structure extraction is method 1.
Matching of document structure blocks: used when the document structure extraction is method 2.
[0072]
In matching of the order of keyword appearance, the search condition is that the order of the keywords extracted from the character string entered by the user at search time matches the order in which those keywords appear in each search target document. For example, when the user enters the keywords “today” and “weather”, only documents in which the keywords appear in the order “today”, “weather” are retrieved. Documents in which the keywords appear in the order “weather”, “today” are not retrieved.
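The order-matching condition can be illustrated with a small subsequence test. This sketch assumes that "appearing in order" means each query keyword occurs somewhere after the previous one, not necessarily adjacent to it; the source does not state whether adjacency is required. The function names and sample data are hypothetical.

```python
def appears_in_order(doc_keywords, query_keywords):
    """True if the query keywords occur in the document in the same order."""
    remaining = iter(doc_keywords)
    # `keyword in remaining` consumes the iterator up to the first match,
    # so each later keyword must be found after the earlier ones.
    return all(keyword in remaining for keyword in query_keywords)

def search_in_order(documents, query_keywords):
    """documents: {doc_id: [keywords in order of appearance]}."""
    return [
        doc_id
        for doc_id, doc_keywords in documents.items()
        if appears_in_order(doc_keywords, query_keywords)
    ]

DOCUMENTS = {
    "d1": ["today", "the", "weather", "is", "fine"],
    "d2": ["weather", "report", "for", "today"],
}
print(search_in_order(DOCUMENTS, ["today", "weather"]))  # ['d1']
print(search_in_order(DOCUMENTS, ["weather", "today"]))  # ['d2']
```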
[0073]
In matching of document structure blocks, the search condition is that the keywords extracted from the character string entered by the user at search time belong to the same document structure block in the search target document. For example, when the user enters the keywords “today” and “weather”, only documents in which “today” and “weather” have the same document structure block ID are retrieved.
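The block-matching condition amounts to intersecting the sets of block IDs recorded for each query keyword. The sketch below assumes an index shaped as `{doc_id: {keyword: set of block IDs}}`; that layout and the function name are illustrative choices, not taken from the source.

```python
def search_same_block(block_index, query_keywords):
    """Return documents in which all query keywords share a block ID.

    block_index: {doc_id: {keyword: set of block IDs}} (assumed layout).
    """
    hits = []
    for doc_id, keyword_blocks in block_index.items():
        common = None
        for keyword in query_keywords:
            block_ids = keyword_blocks.get(keyword, set())
            common = set(block_ids) if common is None else common & block_ids
        if common:  # at least one block contains every keyword
            hits.append(doc_id)
    return hits

BLOCK_INDEX = {
    "d1": {"today": {1}, "weather": {1}, "stock": {2}},
    "d2": {"today": {2}, "weather": {1}},
}
print(search_same_block(BLOCK_INDEX, ["today", "weather"]))  # ['d1']
```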
[0074]
According to the created search condition, a search request is issued for the index data (E) of the index device of (1). The index device of (1) returns a search result corresponding to the search request to the document search device of (2).
[0075]
This search result is presented to the user by a search result display (d) function.
[0076]
At this point, a function that automatically presents keywords for further narrowing down the obtained search results is also provided.
[0077]
There are two methods for automatically presenting keywords for a narrowing search, depending on the document structure extraction method used by the index device of (1): use of the order of keyword appearance and use of document structure blocks.
[0078]
The order of keyword appearance is used when the document structure extraction method in (1) is method 1, that is, when the document structure extracted in advance is based on the order of appearance of keywords in the document.
[0079]
At search time, the position in each search target document of the character string that matched the search keyword is first identified. Narrowing keyword candidates are then extracted from the character strings that appear later in the document than that position.
[0080]
A configuration is also possible in which the user can freely choose whether narrowing keywords are extracted from the part of the search target document after the character string that matched the search or from the part before it.
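The position-based candidate extraction might look like the following sketch. Several details are assumptions not fixed by the source: candidates are taken after the last matching keyword (or before the first one when `side="before"`), they are ranked by how often they occur across the hit documents, and the names `narrowing_candidates`, `side`, and `top_n` are hypothetical.

```python
from collections import Counter

def narrowing_candidates(hit_documents, query_keywords, side="after", top_n=5):
    """Suggest narrowing keywords from text after (or before) the hits.

    hit_documents: list of keyword lists, one per document in the result.
    side: "after" uses keywords following the last hit position,
          "before" uses keywords preceding the first hit position.
    """
    query = set(query_keywords)
    counts = Counter()
    for keywords in hit_documents:
        positions = [i for i, k in enumerate(keywords) if k in query]
        if not positions:
            continue
        if side == "after":
            span = keywords[max(positions) + 1:]
        else:
            span = keywords[:min(positions)]
        counts.update(k for k in span if k not in query)
    # Assumed ranking: most frequent candidates first.
    return [keyword for keyword, _ in counts.most_common(top_n)]

HITS = [
    ["today", "weather", "tokyo", "rain"],
    ["news", "today", "weather", "tokyo"],
]
print(narrowing_candidates(HITS, ["today", "weather"]))            # ['tokyo', 'rain']
print(narrowing_candidates(HITS, ["today", "weather"], "before"))  # ['news']
```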
[0081]
Document structure blocks are used when the document structure extraction method in (1) is method 2, that is, when the document structure extracted in advance is based on document structure blocks in the document.
[0082]
At the time of retrieval, documents in which the plurality of keywords entered by the user all belong to the same block in the document, as in b-2 above, are presented to the user. Keyword candidates for narrowing down can then be recommended to the user by extracting further keywords from the same block as the keywords entered by the user.
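Block-based recommendation can be sketched in the same spirit: for every block that contains all of the query keywords, the other keywords in that block become candidates. The data layout `{doc_id: {block_id: [keywords]}}`, the frequency-based ranking, and the names used here are assumptions made for illustration.

```python
from collections import Counter

def same_block_candidates(doc_blocks, query_keywords, top_n=5):
    """Suggest narrowing keywords from blocks that contain every query keyword.

    doc_blocks: {doc_id: {block_id: [keywords in that block]}} (assumed layout).
    """
    query = set(query_keywords)
    counts = Counter()
    for blocks in doc_blocks.values():
        for keywords in blocks.values():
            if query <= set(keywords):  # block holds all query keywords
                counts.update(k for k in keywords if k not in query)
    # Assumed ranking: most frequent candidates first.
    return [keyword for keyword, _ in counts.most_common(top_n)]

DOC_BLOCKS = {
    "d1": {1: ["today", "weather", "tokyo", "rain"], 2: ["stock", "prices"]},
    "d2": {1: ["today", "news"], 2: ["today", "weather", "tokyo"]},
}
print(same_block_candidates(DOC_BLOCKS, ["today", "weather"]))  # ['tokyo', 'rain']
```

Keywords such as "stock" or "news" are not suggested here because their blocks do not contain both query keywords, which is what keeps the candidate list small.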
[0083]
The document search method of the embodiment is executed by a computer such as a personal computer in accordance with a document search program prepared in advance. The document search program can be recorded on a computer-readable recording medium such as a CD.
[0084]
[Effects of the invention]
As described above, according to the present invention, information on how meaning is linked within a document can be used when narrowing down documents by keyword search. A search can therefore distinguish the case where event (2) occurred after event (1) from the case where event (1) occurred after event (2), so the search result the user wants is easy to obtain.
[0085]
Conventionally, when recommending the next narrowing candidates to the user, candidate character strings were extracted from the entire document, so the candidates were ambiguous and their number was enormous, which made them hard to use. In the present invention, the candidates are restricted and their number is limited, so a highly convenient service is possible.
[Brief description of the drawings]
FIG. 1 is a configuration explanatory diagram showing a document search device according to an embodiment of the present invention.
FIG. 2 is an explanatory diagram showing an example of a screen on which a user inputs a keyword according to the embodiment of the present invention.
FIG. 3 is an explanatory diagram showing an example of HTML used in a general web page.
FIG. 4 is an explanatory diagram showing an example in which the web page of FIG. 3 is displayed on a web browser.
FIG. 5 is an explanatory diagram showing an example of a configuration of a document search on the Internet according to the embodiment of the present invention.
[Explanation of symbols]
A: document group to be searched, B: document selection, C: keyword extraction, D: document structure extraction, E: index data, a: keyword extraction, b: search condition generation, c: search execution, d: search result display, e: index data, f: display device, g: input device.

Claims

A document search method that uses, as a search condition, inclusion of all of a plurality of keywords specified by a user,
the method comprising a step of searching for a document that satisfies a condition that both elements, namely the character strings of the keywords specified by the user and the order of appearance of the keywords, match.

A document search method comprising: a step of retrieving documents containing a keyword specified by a user; and
a step of automatically recommending to the user a keyword for narrowing down the search result, using a character string appearing after the keyword in each document of the search result.

A document search method comprising: a step of recognizing a plurality of document structure blocks in a search target document; and
a step of searching for a document that satisfies a condition that all of a plurality of keyword character strings specified by a user are included in one of the blocks.

The document search method according to claim 3,
further comprising a step of automatically recommending to the user a keyword for further narrowing down the document search result, using a character string appearing in a block after the document structure block that includes all of the plurality of keyword character strings specified by the user.

A document search device that uses, as a search condition, inclusion of all of a plurality of keywords specified by a user,
the device comprising means for searching for a document that satisfies a condition that both elements, namely the character strings of the keywords specified by the user and the order of appearance of the keywords, match.

A document search device comprising: means for searching for a document containing a keyword specified by a user; and means for automatically recommending to the user a keyword for narrowing down the search result, using a character string appearing after the keyword in each document of the search result.

A document search device comprising: means for recognizing a plurality of document structure blocks in a search target document; and
means for searching for a document that satisfies a condition that all of a plurality of keyword character strings specified by a user are included in one of the blocks.

The document search device according to claim 7,
further comprising means for automatically recommending to the user a keyword for further narrowing down the document search result, using a character string appearing in a block after the document structure block that includes all of the plurality of keyword character strings specified by the user.

A document search program that uses, as a search condition, inclusion of all of a plurality of keywords specified by a user,
the program causing a computer to execute a procedure for searching for a document that satisfies a condition that both elements, namely the character strings of the keywords specified by the user and the order of appearance of the keywords, match.

A document search program for causing a computer to execute: a procedure for searching for a document containing a keyword specified by a user; and a procedure for automatically recommending to the user a keyword for narrowing down the search result, using a character string appearing after the keyword in each document of the search result.

A document search program for causing a computer to execute: a procedure for recognizing a plurality of document structure blocks in a search target document; and
a procedure for searching for a document that satisfies a condition that all of a plurality of keyword character strings specified by a user are included in one of the blocks.

The document search program according to claim 11,
further causing the computer to execute a procedure for automatically recommending to the user a keyword for further narrowing down the document search result, using a character string appearing in a block after the document structure block that includes all of the plurality of keyword character strings specified by the user.