JP4156225B2 - Document search apparatus, document search method, and program for causing computer to execute the method

Info

Publication number: JP4156225B2
Application number: JP2001335612A
Authority: JP
Inventors: 勝古川; 禎秀足立; 慎太郎天野; 隆行千葉; 礼奈有働; 裕矢部; 勝彦水戸部
Original assignee: 株式会社ジャストシステム
Priority date: 2001-10-31
Filing date: 2001-10-31
Publication date: 2008-09-24
Anticipated expiration: 2021-10-31
Also published as: JP2003141162A

Description

【０００１】
【発明の属する技術分野】
この発明は、入力された自然文と意味的に類似する文書を検索する文書検索装置、文書検索方法およびその方法をコンピュータに実行させるプログラムに関する。
【０００２】
【従来の技術】
コンピュータやネットワークの急速な普及により、種々の分野の種々の文書が紙媒体に代えて電子媒体で提供されるようになってきているが、中でもコンピュータ（やその周辺機器）のハードウェア・ソフトウェアに関するマニュアルは、比較的早期から電子化の進んだ文書の一つである。
【０００３】
紙のマニュアルに比べ、電子化されたマニュアル（以下「ヘルプ」という）は必要箇所の頭出しが容易である、複数人による共有が容易である、破損や汚損、あるいは紛失などが発生しないなどの多くの利点を有している。
【０００４】
こうしたヘルプをどのように編成するかは作成者の方針にもよるが、ある程度以上の分量になると、単一の文書（ファイル）とはせず複数の文書の集合体として実現するのが一般的である。したがって、ヘルプの検索は一種の文書検索となる。
【０００５】
たとえば本出願人が製造・販売する、写真画像をパソコンに取り込んだりプリンタで印刷したりするためのソフトウェア（製品名「美写楽」）のヘルプは、「デジカメの画像をパソコンで、美写楽でみれるの？」などの質問からなる文書（以下「Ｑ文書」と総称する）と、「閲覧できます。世界で一番美しくあなたの画像を再生します。」など、上記質問に対する回答からなる文書（以下「Ａ文書」と総称する）との、２種類の文書から構成される。
【０００６】
なお、１つのＱ文書には１つの質問が、１つのＡ文書には１つの回答がそれぞれ格納され、どのＱ文書とＡ文書とが対応するのかが文書ＩＤなどによって分かるようになっている。
【０００７】
そして、上記ソフトウェアには上記ヘルプを検索するための検索エンジン、すなわち、ヘルプ内のＱ文書のうち操作者の入力した自然文（以下「問い合わせ文」という）に最も近い内容のものを特定するためのモジュールが搭載されている。本出願人の場合、このエンジンは「ベクトル空間法」と呼ばれる検索手法を採用している。
【０００８】
「ベクトル空間法」とは、上記問い合わせ文の特徴ベクトルと、検索対象となる文書群内の各文書（問い合わせ文と比較される各文書、と言ってもよい）の各特徴ベクトルとの距離を計算し、この距離が小さい文書ほど問い合わせ文との類似度が高い、すなわち操作者の検索要求に対する合致度が高いとするものである。
【０００９】
なお、特徴ベクトルとはｎ個のキーワード（語彙）に対応するｎ個の要素値からなる、ｎ次元のベクトルであって、個々の要素値は最も単純には、対応するキーワードの出現頻度によって決定される。たとえば単一のキーワードのみからなる問い合わせ文（１つのキーワードが１回だけ出現する問い合わせ文、と言ってもよい）の特徴ベクトルは、（０、１、０、０、・・・）のように、当該キーワードに対応する要素の値だけが１で、残りｎ−１個の要素値がすべて０となるようなベクトルである。
【００１０】
もっとも実際にはこれほど単純なものでなく、各要素値は対応するキーワードの出現頻度のほか、たとえば文書群内でもある特定の文書に集中して出現している、文書内でもある特定の部分に集中して出現しているなど、出現箇所の特徴にも配慮して決定される。この特徴ベクトルの作成方法については他にも種々の手法がありうるが、本発明とは直接の関係がないのでここでは立ち入らない。
【００１１】
上記のベクトル空間法においては、本文内に出現するキーワードの全体としての傾向が、問い合わせ文と類似するような文書が検索結果として取り出されるので、問い合わせ文中のキーワードが含まれるか否かにより単純に文書を選別するブーリアン検索（一般のキーワード検索）に比べ、検索結果中のノイズを減少させることが可能である。
【００１２】
【発明が解決しようとする課題】
しかしながら、ベクトル空間法といえども万能というわけではない。特に、ヘルプを必要とするような初心者は文書検索についても素人であることが多く、どのように質問をすればよいか、どのように回答を絞り込んでゆけばよいかなどのノウハウを持たないことが多い。そのため、たとえ必要な情報の記載された文書がヘルプ内に存在したとしても、操作者は多数の文書の中から当該文書を探し出せないことがあるという問題点があった。
【００１３】
ところで、一般に文書検索においては、適合率すなわち拾い出した文書がどれだけ検索者の要求に沿ったものであるかと、再現率すなわち検索者の要求に沿う文書を文書群内からどれだけ漏れなく拾い出せたかとの双方の向上が課題であるが、その両立は容易でないことが多い。すなわち、検索条件を厳しくすると適合率が高くなる反面再現率は低くなり、逆に検索条件を緩くすると再現率は高くなるものの適合率が低くなるという関係にある。
【００１４】
この発明は上記従来技術による問題点に鑑みてなされたものであって、文書検索における適合率と再現率との両立をはかるとともに、検索結果中の多数の文書を、目的の文書の発見がしやすい順序や表示形態で操作者に提示することが可能な文書検索装置、文書検索方法およびその方法をコンピュータに実行させるプログラムを提供することを目的とする。
【００１５】
【課題を解決するための手段】
上述した課題を解決し、目的を達成するため、この発明にかかる文書検索装置は、入力された自然文と意味的に類似する文書を検索する第１の検索手段と、前記自然文と意味的に類似する文書を検索する第２の検索手段と、前記第１の検索手段により検索された文書を特定できる情報と前記第２の検索手段により検索された文書を特定できる情報との両方に、同一の情報が重複して含まれているか否かを判定する判定手段と、前記判定手段により、前記第１の検索手段により検索された文書を特定できる情報と前記第２の検索手段により検索された文書を特定できる情報との両方に、同一の情報が重複して含まれていると判定された場合に、前記第２の検索手段により検索された文書を特定できる情報の中から前記同一の情報を削除する削除手段と、前記第１の検索手段により検索された文書を特定できる情報に続けて、前記削除手段により前記同一の情報を削除された、前記第２の検索手段により検索された文書を特定できる情報を結合する結合手段と、前記結合手段により前記情報が結合された順序で、前記各情報により特定される各文書の本文を表示する表示手段と、を備えたことを特徴とする。
【００１６】
この発明によれば、適合率や再現率の異なる各種の手法で検索された文書が、相対的に適合率の高い手法で検索されたものを上位、相対的に再現率の高い手法で検索されたものを下位として一覧表示される。
【００１７】
また、この発明にかかる文書検索装置は、上記の発明において、前記表示手段が前記第１の検索手段により検索された文書の本文と前記第２の検索手段により検索された文書の本文との表示形態を異ならせて表示することを特徴とする。
【００１８】
この発明によれば、適合率や再現率の異なる各種の手法で検索された文書が、どの手法で検索されたものであるかが表示色の区別などにより明示の上で一覧表示される。
【００１９】
また、この発明にかかる文書検索装置は、上記の発明において、前記第１の検索手段が、その本文内に出現する語彙が前記自然文と共通する文書にあらかじめ対応づけられた文書を、前記自然文と意味的に類似する文書として検索するとともに、前記第２の検索手段が、その本文内に出現する語彙の傾向が前記自然文と類似する文書を、前記自然文と意味的に類似する文書として検索することを特徴とする。
【００２０】
この発明によれば、検索結果一覧では相対的に適合率の高い第１の検索手段により検索された文書が上位、相対的に再現率の高い第２の検索手段により検索された文書が下位に表示されるとともに、いずれの手段により検索された文書であるかが表示色の区別などにより明示される。
【００２１】
また、この発明にかかる文書検索装置は、上記の発明において、前記第１の検索手段が、その本文内に出現する語彙の傾向が前記自然文と類似する文書を、前記自然文と意味的に類似する文書として検索するとともに、前記第２の検索手段が、その本文内に出現する語彙の傾向が前記自然文と類似する文書、およびあらかじめ対応づけられた文書の本文内に出現する語彙の傾向が前記自然文と類似する文書を、前記自然文と意味的に類似する文書として検索することを特徴とする。
【００２２】
この発明によれば、検索結果一覧では相対的に適合率の高い第１の検索手段により検索された文書が下位、相対的に再現率の高い第２の検索手段により検索された文書が上位に表示されるとともに、いずれの手段により検索された文書であるかが表示色の区別などにより明示される。
【００２３】
また、この発明にかかる文書検索装置は、上記の発明において、さらに、前記自然文と意味的に類似する文書が分類されるカテゴリを検索する第３の検索手段と、前記第３の検索手段により検索されたカテゴリの名称を表示する第２の表示手段と、を備えたことを特徴とする。
【００２４】
この発明によれば、適合率や再現率の異なる各種の手法で検索された文書が、相対的に適合率の高い手法で検索されたものを上位、相対的に再現率の高い手法で検索されたものを下位として一覧表示されるとともに、いずれの手段により検索された文書であるかが表示色の区別などにより明示される。
【００２５】
また、この発明にかかる文書検索方法は、入力された自然文と意味的に類似する文書を検索する第１の検索工程と、前記自然文と意味的に類似する文書を検索する第２の検索工程と、前記第１の検索工程で検索された文書を特定できる情報と前記第２の検索工程で検索された文書を特定できる情報との両方に、同一の情報が重複して含まれているか否かを判定する判定工程と、前記判定工程で、前記第１の検索工程で検索された文書を特定できる情報と前記第２の検索工程で検索された文書を特定できる情報との両方に、同一の情報が重複して含まれていると判定された場合に、前記第２の検索工程で検索された文書を特定できる情報の中から前記同一の情報を削除する削除工程と、前記第１の検索工程で検索された文書を特定できる情報に続けて、前記削除工程で前記同一の情報を削除された、前記第２の検索工程で検索された文書を特定できる情報を結合する結合工程と、前記結合工程で前記情報が結合された順序で、前記各情報により特定される各文書の本文を表示する表示工程と、を含んだことを特徴とする。
【００２６】
この発明によれば、適合率や再現率の異なる各種の手法で検索された文書が、相対的に適合率の高い手法で検索されたものを上位、相対的に再現率の高い手法で検索されたものを下位として一覧表示される。
【００２７】
また、この発明にかかるプログラムは、上記に記載された方法をコンピュータに実行させることを特徴とする。
【００２８】
この発明によれば、上記に記載された方法がコンピュータにより実行される。
【００２９】
【発明の実施の形態】
以下に添付図面を参照して、この発明による文書検索装置、文書検索方法およびその方法をコンピュータに実行させるプログラムの好適な実施の形態を詳細に説明する。
【００３０】
図１は、この発明の実施の形態による文書検索装置のハードウェア構成を示す説明図である。同図において、１０１は装置全体を制御するＣＰＵを、１０２は基本入出力プログラムを記憶したＲＯＭを、１０３はＣＰＵ１０１のワークエリアとして使用されるＲＡＭを、それぞれ示している。
【００３１】
また、１０４はＣＰＵ１０１の制御にしたがってＨＤ（ハードディスク）１０５に対するデータのリード／ライトを制御するＨＤＤ（ハードディスクドライブ）を、１０５はＨＤＤ１０４の制御にしたがって書き込まれたデータを記憶するＨＤを、それぞれ示している。
【００３２】
また、１０６はＣＰＵ１０１の制御にしたがってＦＤ（フロッピーディスク）１０７に対するデータのリード／ライトを制御するＦＤＤ（フロッピーディスクドライブ）を、１０７はＦＤＤ１０６の制御にしたがって書き込まれたデータを記憶する着脱自在のＦＤを、それぞれ示している。
【００３３】
また、１０８はカーソル、メニュー、ウィンドウ、あるいは文字や画像などの各種データを表示するディスプレイを、１０９は通信ケーブル１１０を介してＬＡＮなどのネットワークに接続され、当該ネットワークとＣＰＵ１０１とのインターフェースとして機能するネットワークＩ／Ｆを、それぞれ示している。
【００３４】
また、１１１は文字、数値、各種指示などの入力のための複数のキーを備えたキーボードを、１１２は各種指示の選択や実行、処理対象の選択、カーソルの移動などをおこなうマウスを、それぞれ示している。また、１１３は着脱可能な記録媒体であるＣＤ−ＲＯＭを、１１４はＣＤ−ＲＯＭ１１３に対するデータのリードを制御するＣＤ−ＲＯＭドライブを、１００は上記各部を接続するためのバスまたはケーブルを、それぞれ示している。
【００３５】
つぎに、図２はこの発明の実施の形態による文書検索装置の機能的構成を示す説明図である。この発明の実施の形態による文書検索装置は、問い合わせ文解析部２００、ヘルプ文書管理部２０１、第１検索部２０２、第２検索部２０３、第３検索部２０４、第４検索部２０５およびヘルプ画面表示部２０６を含む構成である。
【００３６】
個々の機能部の説明に入る前に、この発明の概略を説明する。本発明においては、ヘルプを構成する文書群について適合率は高いが再現率は低い検索と、逆に適合率は低いが再現率は高い検索とを平行して実施する。後述する第１検索部２０２〜第４検索部２０５は、符号が小さいほど適合率の高い（その代わり再現率は低い）検索手法、符号が大きいほど再現率の高い（その代わり適合率は低い）検索手法によって、それぞれヘルプ文書管理部２０１内の文書を検索する。
【００３７】
そして、上記各部により得られたそれぞれの検索結果をヘルプ画面表示部２０６で併合の上一覧表示するが、この際の各文書の順位づけ、すなわち操作者への提示の優先度は、相対的に適合率の高い検索で出た文書ほど高くするようにする。
【００３８】
したがって、検索結果一覧では第１検索部２０２により検索された文書群が最上位に、第４検索部２０５により検索された文書群が最下位に、それぞれ表示され、その間を第２検索部２０３により検索された文書群、第３検索部２０４により検索された文書群が埋めることになる。
【００３９】
各種の手法で検索された文書をただ単に羅列したのでは、重複分を除くとしても検索結果中の文書数が多くなりすぎ、操作者の目的とする文書が見つけにくくなってしまう。ところで操作者の目的とする文書は、再現率の高い検索、いわゆる「緩い」検索により初めて拾い出せることもあるが、通常は適合率の高い検索、いわゆる「絞り込んだ」検索でも拾えていることが多い。
【００４０】
そこで本発明の検索結果一覧では、上記文書を含む可能性の比較的高い、適合率の高い検索でヒットした文書群を上位に、上記可能性の比較的低い、再現率の高い検索でヒットした文書群を下位に、それぞれ表示するわけである。
【００４１】
以下、図２に示す各部の機能について詳細に説明する。まず、２００は問い合わせ文解析部であり、後述するヘルプ画面表示部２０６から入力した問い合わせ文（任意の自然文）を解析して、後述する特徴ベクトルの基礎となるキーワードの切り出しなどをおこなう。たとえば、「パソコンに画像を取り込むには？」という問い合わせ文からは、上記解析により「パソコン」「画像」「取り込む」の３つのキーワードが切り出される。
【００４２】
つぎに、２０１はヘルプ文書管理部であり、以下で説明するＱ文書ＤＢ（データベース）２０１ａ、Ａ文書ＤＢ（データベース）２０１ｂ、Ｑ＋Ａ文書ＤＢ（データベース）２０１ｃの３つのデータベースを含む構成である。
【００４３】
図３は、ヘルプ文書管理部２０１内の各データベースに保持されるデータの構造を模式的に示す説明図である。図示するように、ヘルプ文書管理部２０１はそのＱ文書ＤＢ２０１ａに、ヘルプを構成するすべてのＱ文書、そのＡ文書ＤＢ２０１ｂに、ヘルプを構成するすべてのＡ文書を保持している。
【００４４】
Ｑ文書には「ＤＪ−１」「ＤＪ−２」・・・のように「ＤＪ−＊」の形式の通し番号が、Ａ文書には「ＺＵ−１」「ＺＵ−２」・・・のように「ＺＵ−＊」の形式の通し番号が、それぞれ固有の文書ＩＤとして付与されている。また、対応するＱ文書とＡ文書はＩＤ内に同じ数字を含んでおり、たとえば「ＺＵ−１」のＡ文書の本文は「ＤＪ−１」のＱ文書の本文である質問に対する回答、「ＺＵ−２」のＡ文書の本文は「ＤＪ−２」のＱ文書の本文である質問に対する回答である。
【００４５】
また、ヘルプ文書管理部２０１はそのＱ＋Ａ文書ＤＢ２０１ｃに、Ｑ文書ＤＢ２０１ａ内のＱ文書とＡ文書ＤＢ２０１ｂ内のＡ文書とを、それぞれ対応するもの同士結合したＱ＋Ａ文書を保持している。Ｑ＋Ａ文書のＩＤとしては、その元となったＱ文書のＩＤをそのまま引き継ぐものとする。たとえば、ＩＤが「ＤＪ−１」のＱ文書と、「ＺＵ−１」のＡ文書とから作成されたＱ＋Ａ文書のＩＤは「ＤＪ−１」である。
【００４６】
ヘルプ文書管理部２０１が保持するデータベースのうち、Ｑ文書ＤＢ２０１ａは後述する第２検索部２０３、Ｑ＋Ａ文書ＤＢ２０１ｃは後述する第３検索部２０４による検索の対象となるデータベースである。また、ヘルプ文書管理部２０１は後述するヘルプ画面表示部２０６から引き渡されたＩＤで特定されるＱ文書やＡ文書を検索するとともに、その本文をヘルプ画面表示部２０６に対して出力する。
【００４７】
図２に戻り、つぎに２０２は第１検索部であり、後述する第２検索部２０３〜第４検索部２０５に比較して最も適合率の高い、すなわち最も絞り込まれた検索をおこなう機能部である。この第１検索部２０２は、以下に説明するＫＤＢ（キーワードデータベース）２０２ａを保持している。
【００４８】
図４は、第１検索部２０２内のＫＤＢ２０２ａに保持されるデータの構造を模式的に示す説明図である。図示するように、ＫＤＢ２０２ａ内には複数の文書が保持され、各文書の本文は１つのキーワードのみにより構成される。そして、各文書にはその属性情報（付属情報）として、当該文書に対応づけられた１〜数個のＱ文書のＩＤが設定されている。
【００４９】
このＫＤＢ２０２ａはあらかじめ、主に人手によって作成されるものである。まず、ヘルプを構成するＱ文書とＡ文書とから、特徴的なキーワードをいくつか抽出する。特徴的なキーワードとしては、たとえばある分野に特有の専門用語であってＩＤＦ値の大きいキーワードや、Ｑ文書・Ａ文書を複数のカテゴリに分類したときに、あるカテゴリの文書にのみ含まれるようなキーワードなどが考えられる。
【００５０】
そして、上記で抽出した個々のキーワードにつき、当該キーワードに対応づけるのに最も適切なＱ文書を１〜数個選定する。何をもって最適とするかは任意であるが、たとえばヘルプであれば、比較的多くの操作者に参照される事項とあまり参照されることのない事項とが経験的に分かるので、当該キーワードでヘルプを引く操作者が、典型的に有している質問のＱ文書を最適として対応づけるようにする。
【００５１】
たとえば、上述の「美写楽」のヘルプにおいて「パソコン」という特徴的なキーワード（ありふれたキーワードのようであるが、特定のヘルプに範囲を限れば稀なキーワードとなることもある）でヘルプを引く操作者は、デジタルカメラで撮影した写真をパソコンでどう見ればよいのかや、撮影した写真をパソコンにどう取り込めばよいのかを問い合わせている場合が多い。
【００５２】
そこで「パソコン」というキーワードには、「カメラ画像をパソコンでみる方法を教えてください。」という質問からなるＱ文書（図３よりそのＩＤは「ＤＪ−１」である）、「ＤｉｇｉＪｕｓｔ−１で画像をパソコンに取り込む方法を教えてください。」という質問からなるＱ文書（同「ＤＪ−２」）、および「デジカメの画像をパソコンで、美写楽でみれるの？」という質問からなるＱ文書（同「ＤＪ−３」）の３つを対応づける。
【００５３】
具体的には、ＫＤＢ２０２ａ内に「パソコン」という１つのキーワードのみを本文とする文書を作成し、当該文書の属性情報（付属情報）として、上記で対応づけた各Ｑ文書のＩＤを設定する。
【００５４】
なお、同じキーワードに対応づけられた文書間の相対的な適切さは、そのＩＤの登録の順序で表されるものとする。たとえば上記の例では、末尾の「ＤＪ−３」よりは先頭の「ＤＪ−１」のほうが、キーワード「パソコン」に対応づけるのにより適切である（「ＤＪ−３」よりは「ＤＪ−１」のほうがよくある質問である、と言ってもよい）。
【００５５】
なお、上記のほかにも「パソコン」というキーワードを含むＱ文書や、当該キーワードから想起・連想される質問を格納したＱ文書などはあろうが、ＫＤＢ２０２ａで対応づけるＱ文書は１キーワードにつき高々数個（上記の例では３個）までである。逆に言えば、あまりに多くのＱ文書が対応づけられてしまうような、ありふれたキーワードにつき１文書を作成してＫＤＢ２０２ａに登録するのは望ましくない。
【００５６】
なお、図４は特徴的なキーワードとして、「パソコン」「接続」「印刷」の３つのキーワードが抽出された場合のＫＤＢ２０２ａの保持内容を示すものである。「接続」「印刷」についても、各キーワードを本文とする文書がＫＤＢ２０２ａに作成されている。このように、ＫＤＢ２０２ａは「キーワードのデータベース」ではなく、「キーワードからなる文書のデータベース」、すなわちそれぞれ１個のキーワードのみを保持する複数の文書からなる文書データベースである。
【００５７】
なお、ここでは説明の便宜上、ＫＤＢ２０２ａ内の各文書にはそれぞれ１個のキーワードが格納されているものとするが、文書内のキーワードは複数であってもよい。すなわち、ＫＤＢ２０２ａ内の文書は、キーワードごとでなく複数のキーワードの組み合わせについて作成するようにしてもよい。また、１個のキーワードのみを保持する文書と、複数のキーワードの組を保持する文書とがＫＤＢ２０２ａ内に混在しているのであってもよい。
【００５８】
図２に示した第１検索部２０２は、問い合わせ文解析部２００から入力した解析結果にもとづいて問い合わせ文の特徴ベクトルを作成し、この特徴ベクトルとＫＤＢ２０２ａ内の各文書の特徴ベクトルとの内積、ひいては当該内積から把握される特徴ベクトル間の距離を順次算出する。
【００５９】
ここで問い合わせ文の特徴ベクトルは、最も単純には当該文章に含まれるキーワードに対応する要素値だけが正の値で、残りの要素値はすべて０となるようなベクトルである。また、ＫＤＢ２０２ａ内の各文書の特徴ベクトルも、当該文書に含まれるキーワードに対応する要素値だけが１で、残りの要素値はすべて０となるようなベクトルである。そのため算出される内積は、問い合わせ文に含まれるのと同一のキーワードを含む文書では何らかの正の値、含まない文書では一律に０となる。
【００６０】
第１検索部２０２は、この内積値が閾値の０を上回った文書、すなわち問い合わせ文と同一のキーワードを含むために、問い合わせ文と特徴ベクトル間の内積が大きくなっている文書（問い合わせ文と特徴ベクトル間の距離が小さくなっている文書、と言ってもよい）を参照して、その属性情報として設定されているＱ文書のＩＤを取得する。そして、これらのＩＤを後述するヘルプ画面表示部２０６に出力する。なお、以下では第１検索部２０２による検索結果中のＱ文書、すなわち上記各ＩＤにより特定されるＱ文書を「Ｒ１」と総称する。
【００６１】
たとえば、上述の「パソコンに画像を取り込むには？」という問い合わせ文の特徴ベクトルでは、「パソコン」「画像」「取り込む」に対応する要素値がそれぞれ１になっている。そのため、同じ位置の要素値が１であるような特徴ベクトルを有する文書、図４の例ではキーワード「パソコン」を本文とする文書についてのみ内積値が０を超え（その他の文書では０）、当該文書に設定されたＱ文書のＩＤ「ＤＪ−１」「ＤＪ−２」および「ＤＪ−３」がヘルプ画面表示部２０６に出力されることになる。
【００６２】
図２に戻り、つぎに第２検索部２０３は上述の第１検索部２０２に比較してやや緩めの検索をおこなう機能部である。この第２検索部２０３は、ヘルプ文書管理部２０１のＱ文書ＤＢ２０１ａに保持されたＱ文書を、通常のベクトル空間法により検索する。
【００６３】
すなわち、問い合わせ文解析部２００の解析結果にもとづいて作成した問い合わせ文の特徴ベクトルと、Ｑ文書ＤＢ２０１ａ内の各Ｑ文書の特徴ベクトルとの内積値を順次算出し、この値が所定の閾値を上回ったＱ文書を特定して、当該値の大きい順にそのＩＤを後述するヘルプ画面表示部２０６に出力する。なお、以下では第２検索部２０３による検索結果中のＱ文書を「Ｒ２」と総称する。
【００６４】
この第２検索部２０３による検索では、Ｒ２内のＱ文書の個数に制限がない。すなわち、問い合わせ文との内積が所定の閾値を上回るＱ文書であれば何個でもＲ２に含まれる。この点、上述の第１検索部２０２による検索では、ＫＤＢ２０２ａ内の各文書に対応づけられたＱ文書が数個に限定されている結果、Ｒ１中の文書数もその個数に制限されるのと異なっている。したがって、一般に第２検索部２０３による検索では、第１検索部２０２による検索と比較して、検索結果中の文書数が多くなる傾向にある（もちろん例外もある）。
【００６５】
つぎに、第３検索部２０４は上述の第２検索部２０３に比較してさらに緩めの検索をおこなう機能部である。この第３検索部２０４は、ヘルプ文書管理部２０１のＱ＋Ａ文書ＤＢ２０１ｃに保持されたＱ＋Ａ文書を、通常のベクトル空間法により検索し、検索した文書のＩＤを後述するヘルプ画面表示部２０６に出力する。なお、以下では第３検索部２０４による検索結果中のＱ＋Ａ文書を「Ｒ３」と総称する。
【００６６】
なお、ここでは対応するＱ文書とＡ文書とを結合したものをＱ＋Ａ文書ＤＢ２０１ｃとしてあらかじめ用意しておき、第３検索部２０４による検索はこの結合後の各文書についておこなうようにしたが、これはもっぱら既存の検索エンジン（具体的には本出願人が製造・販売する「ＣｏｎｃｅｐｔＢａｓｅ」）を転用する場合の便宜をはかったものであり、このようにしなければならないというものではない。
【００６７】
要するに、それ自体が問い合わせ文と類似するＱ文書のほかに、対応するＡ文書が問い合わせ文と類似するようなＱ文書も検索できればそれでよいので、たとえばＱ＋Ａ文書ＤＢ２０１ｃは設けない構成とし、第３検索部２０４はＱ文書ＤＢ２０１ａとＡ文書ＤＢ２０１ｂとを平行して検索して、前者については検索されたＱ文書のＩＤをそのまま、後者については検索されたＡ文書に対応するＱ文書のＩＤを、それぞれヘルプ画面表示部２０６に出力するようにしてもよい。
【００６８】
このように、検索対象をＱ文書に限定している第２検索部２０３と異なり、第３検索部２０４では検索対象が実質的にＱ文書およびＡ文書に拡大されているため、さらに多くの文書を拾い出せる可能性が高い。
【００６９】
つぎに、第４検索部２０５は上述の第１検索部２０２〜第３検索部２０４に比較して最も緩い、すなわち最も適合率の低い（再現率の高い）検索をおこなう機能部である。この第４検索部２０５は、以下に説明するＣＤＢ（カテゴリデータベース）２０５ａを保持している。
【００７０】
あらかじめ図５に示すような、Ｑ文書ＤＢ２０１ａ内のＱ文書を分類するための多階層の分類体系を作成しておく。この分類体系は人手により作成するのであっても、既存の文書分類技術により機械的に作成するのであっても、あるいは機械的に作成されたものを人手により修正するのであってもよい。図示する分類体系では、「パソコン−接続」「パソコン−取り込む」「画像−サイズ」「画像−印刷」の、４つのカテゴリが定義されている。
【００７１】
そして、各カテゴリについてＣＤＢ２０５ａ内に１文書を作成し、その本文には各カテゴリの名称に含まれるキーワードを、その属性情報には各カテゴリに分類されるＱ文書のＩＤを、それぞれ格納する。たとえば、「パソコン−接続」カテゴリについては「パソコン」および「接続」の２つのキーワードを本文とする文書がＣＤＢ２０５ａ内に作成され、その属性情報には当該カテゴリに分類されるすべてのＱ文書のＩＤが設定される。
【００７２】
そして、第４検索部２０５は問い合わせ文中のキーワードから作成したその特徴ベクトルと、ＣＤＢ２０５ａ内の各文書の特徴ベクトルとの内積値を順次算出し、この値が最も高くなった文書の本文、すなわち当該文書に対応するカテゴリの名称を後述するヘルプ画面表示部２０６に出力する。
【００７３】
なお、ＣＤＢ２０５ａ内の各文書の特徴ベクトルは、いわば各カテゴリの特徴ベクトルであって、ここでは単純にカテゴリの名称から作成するようにしたが、たとえば各カテゴリに分類される全Ｑ文書の特徴ベクトルの平均をとり、この平均をＣＤＢ２０５ａ内の各文書の特徴ベクトルとするのであってもよい。
【００７４】
このように、第４検索部２０５は文書を直接検索するのでなく、問い合わせ文に類似するカテゴリを検索することで、当該カテゴリに分類される複数の文書を間接的に検索するのであると言ってもよい。なお、以下では第４検索部２０５により検索されたカテゴリ内のＱ文書を「Ｒ４」と総称する。
【００７５】
図２に戻り、つぎにヘルプ画面表示部２０６は後述するヘルプ画面をディスプレイ１０８に表示する機能部である。ヘルプ画面表示部２０６は、このヘルプ画面によって操作者からの問い合わせ文の入力を受け付けるとともに、第１検索部２０２、第２検索部２０３、第３検索部２０４から入力したそれぞれのＩＤで特定されるＱ文書の本文、および第４検索部２０５から入力したカテゴリの名称を画面表示する。このヘルプ画面表示部２０６の機能については、後述するフローチャートで具体例に則して詳細に説明する。
【００７６】
つぎに、図６はこの発明の実施の形態による文書検索装置における、文書検索のための前準備の手順を示すフローチャートである。まず、ヘルプを構成するすべてのＱ文書・Ａ文書から、上述した特徴的なキーワードを抽出し（ステップＳ６０１）、さらにキーワードごとに適切なＱ文書をいくつか選定する（ステップＳ６０２）。そして、ＫＤＢ２０２ａ内に上記キーワードを本文とする文書を作成し、その属性情報として上記Ｑ文書のＩＤを設定する（ステップＳ６０３）。
【００７７】
さらに、上述のＱ文書ＤＢ２０１ａ、Ａ文書ＤＢ２０１ｂ、Ｑ＋Ａ文書ＤＢ２０１ｃをそれぞれ作成するとともに（ステップＳ６０４）、Ｑ文書ＤＢ２０１ａ内のすべてのＱ文書を分類するための分類体系を作成し（ステップＳ６０５）、これにもとづいて上述のＣＤＢ２０５ａを作成する（ステップＳ６０６）。
【００７８】
つぎに、図７はこの発明の実施の形態による文書検索装置における、文書検索および検索結果の表示の手順を示すフローチャートである。図８に示すようなヘルプ画面において、操作者が問い合わせ文を入力して「検索」ボタン８００をクリックすると（ステップＳ７０１：Ｙｅｓ）、まず問い合わせ文解析部２００による当該文章の解析がおこなわれる（ステップＳ７０２）。
【００７９】
そして、上記による解析結果を供給された第１検索部２０２は、上記結果にもとづいて作成した問い合わせ文の特徴ベクトルと、上述のＫＤＢ２０２ａ内の各文書の特徴ベクトルとを比較して、両者の内積が閾値を超える文書を特定し、当該文書に設定されているＱ文書のＩＤをヘルプ画面表示部２０６に出力する（ステップＳ７０３）。
【００８０】
つぎに、上記解析結果は第２検索部２０３に供給され、第２検索部２０３は上述のＱ文書ＤＢ２０１ａを検索して、ベクトル間の内積が閾値を超えるＱ文書を特定し、当該文書のＩＤをヘルプ画面表示部２０６に出力する（ステップＳ７０４）。
【００８１】
つぎに、上記解析結果は第３検索部２０４に供給され、第３検索部２０４は上述のＱ＋Ａ文書ＤＢ２０１ｃを検索して、ベクトル間の内積が閾値を超えるＱ＋Ａ文書を特定し、当該文書のＩＤをヘルプ画面表示部２０６に出力する（ステップＳ７０５）。
【００８２】
さらに、上記解析結果は第４検索部２０５に供給され、第４検索部２０５は上述のＣＤＢ２０５ａを検索して、ベクトル間の内積が最大となる文書を特定し、当該文書の本文、すなわち当該文書に対応するカテゴリの名称をヘルプ画面表示部２０６に出力する（ステップＳ７０６）。
【００８３】
第１検索部２０２〜第４検索部２０５のそれぞれの検索結果を受け取ったヘルプ画面表示部２０６は、つぎにこれらを併合して一覧表示するが、その前に併合により重複することになる文書の削除をおこなう（ステップＳ７０７）。たとえば、第１検索部２０２〜第３検索部２０４のそれぞれにより検索された文書群Ｒ１〜Ｒ３が、具体的には図９に示すようなものであったとする。
【００８４】
この場合、ＩＤが「ＤＪ−２」の文書はＲ１とＲ２の双方に含まれているが、ヘルプ画面表示部２０６は適合率の相対的に高い検索で出たもの、すなわち第１検索部２０２による検索結果Ｒ１中の「ＤＪ−２」のみを残して、適合率の相対的に低い検索で出たもの、すなわち第２検索部２０３による検索結果Ｒ２中の「ＤＪ−２」を削除する。なお、３つの文書群にまたがって含まれているＩＤは、適合率の最も高い１つのみを残して残り２つを削除する。
【００８５】
そして、上記による重複解消後のＲ１〜Ｒ３を併合して１つの文書群とする（ステップＳ７０８）が、このとき適合率の高い検索で出た文書群ほど、併合後の文書群内での順位が高くなるようにする。
【００８６】
ここでは上述のように、第１検索部２０２、第２検索部２０３、第３検索部２０４の順で適合率が低下してゆくので、ヘルプ画面表示部２０６はＲ１〜Ｒ３をＲ１、Ｒ２、Ｒ３の順に結合する。そのため併合後の文書群（Ｒ１＋Ｒ２＋Ｒ３）では、図９に示すようにＲ１内の「ＤＪ−１」「ＤＪ−２」「ＤＪ−３」が相対的に上位に位置し、逆にＲ３内の「ＤＪ−１０」以下は下位に位置することになる。
【００８７】
このようにして各部により検索された文書の最終的な順序が決まると、つぎにヘルプ画面表示部２０６は上記各文書のＩＤをヘルプ文書管理部２０１に出力する。これを受けたヘルプ文書管理部２０１は、その保持するＱ文書ＤＢ２０１ａから上記ＩＤで特定されるＱ文書を検索し、検索したＱ文書中の本文をヘルプ画面表示部２０６に出力する。そして、ヘルプ画面表示部２０６はこれらの本文を、上記で決定した順序にしたがって一覧表示する。また、同時に第４検索部２０５により検索されたカテゴリの名称もあわせて表示する（ステップＳ７０９）。
【００８８】
図１０は、ヘルプ画面表示部２０６により表示されるヘルプ画面の一例を示す説明図である。図示する検索結果一覧には、第１検索部２０２〜第３検索部２０４により検索された各Ｑ文書の本文が図９に示した順序で表示されるとともに、第４検索部２０５により検索されたカテゴリ名（同図では「パソコン−取り込む」）があわせて示されている。
【００８９】
なお、たとえばＲ１に含まれていたＱ文書の本文は赤、Ｒ２に含まれていたＱ文書の本文は青、Ｒ３に含まれていたＱ文書の本文は緑というように、どの質問がどの検索で引っかかってきたのかを文字の色分けで明示するようにしてもよい。あるいは、本文の背景色を変えるなどの識別表示でもよい。
【００９０】
また、ここではＲ４については、その分類先となるカテゴリの名称を示すのみであるが、Ｒ１〜Ｒ３と同様、Ｒ４を構成する個々のＱ文書の本文を上記一覧中の末尾に（すなわち、Ｒ３に続けて）あわせて表示するようにしてもよい。この場合、Ｒ１〜Ｒ３と重複するＩＤがあれば、当該ＩＤをあらかじめＲ４から削除しておくことは言うまでもない。
【００９１】
図１０に示すヘルプ画面において、操作者が一覧中のいずれかの質問を選択して「表示」ボタン１０００をクリックすると（ステップＳ７１０：Ｙｅｓ）、ヘルプ画面表示部２０６は当該質問を格納するＱ文書のＩＤを参照して、当該質問に対する回答を格納するＡ文書のＩＤを生成し、ヘルプ文書管理部２０１に出力する。
【００９２】
これを受けたヘルプ文書管理部２０１は、その保持するＡ文書ＤＢ２０１ｂから上記ＩＤで特定されるＡ文書を検索し、その本文をヘルプ画面表示部２０６に引き渡す。そして、上記Ａ文書の本文がヘルプ画面表示部２０６により、図１１に示すように画面表示される（ステップＳ７１１）。なお、同図において前面のウィンドウに表示されている、「閲覧できます。世界で一番美しくあなたの画像を再生・・・」という文章は、背面のウィンドウで選択されている「デジカメの画像をパソコンで、美写楽でみれるの？」という質問に対する回答である。
【００９３】
以上説明した実施の形態によれば、同一文書群につき適合率・再現率のレベルのそれぞれ異なる複数の検索が重畳的に実施され、しかも最終的な検索結果一覧では、適合率の最も高い検索で出た文書を筆頭に各手法による検索結果が併合して表示されるので、検索結果を絞り込んで見たいときは上記一覧の最初のほうだけを、漏れなく見たいときは最後のほうまで、それぞれ見ることで操作者は自己の目的を達することができる（従来のように、絞り込みの程度を変えて再び検索をやり直すなどの作業は不要である）。
【００９４】
なお、上述した実施の形態では、第１検索部２０２〜第４検索部２０５による検索はすべてベクトル空間法を基礎としているが、これは既存の検索エンジン（上述の「ＣｏｎｃｅｐｔＢａｓｅ」）を第１検索部２０２〜第４検索部２０５として機能させることを想定しているためであって、原理的にはベクトル空間法によらなければならないというものではない。
【００９５】
たとえば、ＫＤＢ２０２ａのような文書データベースの代わりに、少なくともキーワードとそれに対応するＱ文書のＩＤなどからなるＲＤＢや単なるリストを設けて、第１検索部２０２は問い合わせ文中のキーワードで当該ＲＤＢやリストを検索することにより、Ｑ文書のＩＤを取得するように構成してもよい（上述した第１検索部２０２による検索は、そもそもがブーリアン検索と結果的に変わらないものであって、もっぱら既存の検索エンジンの仕様に合わせてＫＤＢ２０２ａのような仕組みを設けているに過ぎない）。
【００９６】
さらに、第２検索部２０３による検索をＱ文書についてのブーリアン検索に、第３検索部２０４による検索をＱ＋Ａ文書についてのブーリアン検索に、第４検索部２０５による検索をカテゴリ名についてのブーリアン検索に、それぞれ差し替えても、第１検索部２０２から第４検索部２０５にかけて適合率は順次低下し、再現率は順次向上してゆくので、上述した実施の形態による発明と同等の効果を得ることができる。
【００９７】
なお、第１検索部２０２〜第４検索部２０５の検索手法が同種のものである必要もなく、ある検索部はベクトル空間法による検索、ある検索部はブーリアン検索というように、異種の検索手法が混在していてもよい。要するに、適合率や再現率において相互に異なるのであれば、各部の検索手法はどのようなものであってもよい。
【００９８】
また、組み合わせる検索手法は上述の実施の形態の４つに限らず、これより多くても少なくてもよい。特に、第１検索部２０２による検索は第２検索部２０３による検索よりも絞り込んだ結果を得るところにあるが、問い合わせ文中のキーワードとＫＤＢ２０２ａ内の文書のキーワードとに共通するものが多いと、ヒットする文書の数ひいては当該文書に対応づけられたＱ文書の数も多くなるため、Ｒ１のほうがＲ２よりも逆に多くの文書を含んでしまうことがある。
【００９９】
そこで、ＫＤＢ２０２ａから検索された文書が複数ある場合には、第１検索部２０２による検索結果はなしとする（すなわち、ヘルプ画面表示部２０６には何も出力しない）のが望ましい。この場合、実施される検索は実質的に３種類となる。
【０１００】
なお、上述した実施の形態では検索対象はヘルプ文書としたが、これに限るものではなく、たとえば各文書の要約文を格納したデータベースとその全文を格納したデータベースとがあり、要約文と全文との対応づけがなされている場合に、要約文のみを検索、要約文＋全文を検索など、適合率・再現率が異なる検索を複数組み合わせて実施することが可能である。
【０１０１】
なお、上述した問い合わせ文解析部２００、ヘルプ文書管理部２０１、第１検索部２０２、第２検索部２０３、第３検索部２０４、第４検索部２０５およびヘルプ画面表示部２０６は、それぞれＨＤ１０５などからＲＡＭ１０３に読み出されたプログラムの命令にしたがってＣＰＵ１０１が命令処理を実行することにより、各部の機能を実現するものである。このプログラムはＨＤ１０５のほか、ＦＤ１０７、ＣＤ−ＲＯＭ１１３あるいはＭＯなどの各種記録媒体に格納することができ、あるいはネットワークを介して配布することもできる。
【０１０２】
【発明の効果】
以上説明したようにこの発明は、入力された自然文と意味的に類似する文書を検索する第１の検索手段と、前記自然文と意味的に類似する文書を検索する第２の検索手段と、前記第１の検索手段により検索された文書を特定できる情報と前記第２の検索手段により検索された文書を特定できる情報との両方に、同一の情報が重複して含まれているか否かを判定する判定手段と、前記判定手段により、前記第１の検索手段により検索された文書を特定できる情報と前記第２の検索手段により検索された文書を特定できる情報との両方に、同一の情報が重複して含まれていると判定された場合に、前記第２の検索手段により検索された文書を特定できる情報の中から前記同一の情報を削除する削除手段と、前記第１の検索手段により検索された文書を特定できる情報に続けて、前記削除手段により前記同一の情報を削除された、前記第２の検索手段により検索された文書を特定できる情報を結合する結合手段と、前記結合手段により前記情報が結合された順序で、前記各情報により特定される各文書の本文を表示する表示手段と、を備えたので、適合率や再現率の異なる各種の手法で検索された文書が、相対的に適合率の高い手法で検索されたものを上位、相対的に再現率の高い手法で検索されたものを下位として一覧表示され、これによって、文書検索における適合率と再現率との両立をはかるとともに、検索結果中の多数の文書を、目的の文書の発見がしやすい順序で操作者に提示することが可能な文書検索装置が得られるという効果を奏する。
【０１０３】
また、この発明は、上記の発明において、前記表示手段が前記第１の検索手段により検索された文書の本文と前記第２の検索手段により検索された文書の本文との表示形態を異ならせて表示するので、適合率や再現率の異なる各種の手法で検索された文書が、どの手法で検索されたものであるかが表示色の区別などにより明示の上で一覧表示され、これによって、文書検索における適合率と再現率との両立をはかるとともに、検索結果中の多数の文書を、目的の文書の発見がしやすい表示形態で操作者に提示することが可能な文書検索装置が得られるという効果を奏する。
【０１０４】
また、この発明は、上記の発明において、前記第１の検索手段が、その本文内に出現する語彙が前記自然文と共通する文書にあらかじめ対応づけられた文書を、前記自然文と意味的に類似する文書として検索するとともに、前記第２の検索手段が、その本文内に出現する語彙の傾向が前記自然文と類似する文書を、前記自然文と意味的に類似する文書として検索するので、検索結果一覧では相対的に適合率の高い第１の検索手段により検索された文書が上位、相対的に再現率の高い第２の検索手段により検索された文書が下位に表示されるとともに、いずれの手段により検索された文書であるかが表示色の区別などにより明示され、これによって、文書検索における適合率と再現率との両立をはかるとともに、検索結果中の多数の文書を、目的の文書の発見がしやすい順序や表示形態で操作者に提示することが可能な文書検索装置が得られるという効果を奏する。
【０１０５】
また、この発明は、上記の発明において、前記第１の検索手段が、その本文内に出現する語彙の傾向が前記自然文と類似する文書を、前記自然文と意味的に類似する文書として検索するとともに、前記第２の検索手段が、その本文内に出現する語彙の傾向が前記自然文と類似する文書、およびあらかじめ対応づけられた文書の本文内に出現する語彙の傾向が前記自然文と類似する文書を、前記自然文と意味的に類似する文書として検索するので、検索結果一覧では相対的に適合率の高い第１の検索手段により検索された文書が下位、相対的に再現率の高い第２の検索手段により検索された文書が上位に表示されるとともに、いずれの手段により検索された文書であるかが表示色の区別などにより明示され、これによって、文書検索における適合率と再現率との両立をはかるとともに、検索結果中の多数の文書を、目的の文書の発見がしやすい順序や表示形態で操作者に提示することが可能な文書検索装置が得られるという効果を奏する。
【０１０６】
また、この発明は、上記の発明において、さらに、前記自然文と意味的に類似する文書が分類されるカテゴリを検索する第３の検索手段と、前記第３の検索手段により検索されたカテゴリの名称を表示する第２の表示手段と、を備えたので、適合率や再現率の異なる各種の手法で検索された文書が、相対的に適合率の高い手法で検索されたものを上位、相対的に再現率の高い手法で検索されたものを下位として一覧表示されるとともに、いずれの手段により検索された文書であるかが表示色の区別などにより明示され、これによって、文書検索における適合率と再現率との両立をはかるとともに、検索結果中の多数の文書を、目的の文書の発見がしやすい順序や表示形態で操作者に提示することが可能な文書検索装置が得られるという効果を奏する。
【０１０７】
また、この発明は、入力された自然文と意味的に類似する文書を検索する第１の検索工程と、前記自然文と意味的に類似する文書を検索する第２の検索工程と、前記第１の検索工程で検索された文書を特定できる情報と前記第２の検索工程で検索された文書を特定できる情報との両方に、同一の情報が重複して含まれているか否かを判定する判定工程と、前記判定工程で、前記第１の検索工程で検索された文書を特定できる情報と前記第２の検索工程で検索された文書を特定できる情報との両方に、同一の情報が重複して含まれていると判定された場合に、前記第２の検索工程で検索された文書を特定できる情報の中から前記同一の情報を削除する削除工程と、前記第１の検索工程で検索された文書を特定できる情報に続けて、前記削除工程で前記同一の情報を削除された、前記第２の検索工程で検索された文書を特定できる情報を結合する結合工程と、前記結合工程で前記情報が結合された順序で、前記各情報により特定される各文書の本文を表示する表示工程と、を含んだので、適合率や再現率の異なる各種の手法で検索された文書が、相対的に適合率の高い手法で検索されたものを上位、相対的に再現率の高い手法で検索されたものを下位として一覧表示され、これによって、文書検索における適合率と再現率との両立をはかるとともに、検索結果中の多数の文書を、目的の文書の発見がしやすい順序で操作者に提示することが可能な文書検索方法が得られるという効果を奏する。
【０１０８】
また、この発明によれば、上記に記載された方法をコンピュータに実行させることが可能なプログラムが得られるという効果を奏する。
【図面の簡単な説明】
【図１】この発明の実施の形態による文書検索装置のハードウェア構成を示す説明図である。
【図２】この発明の実施の形態による文書検索装置の機能的構成を示す説明図である。
【図３】ヘルプ文書管理部２０１内の各データベースに保持されるデータの構造を模式的に示す説明図である。
【図４】第１検索部２０２内のＫＤＢ２０２ａに保持されるデータの構造を模式的に示す説明図である。
【図５】Ｑ文書ＤＢ２０１ａ内のＱ文書が分類される分類体系を模式的に示す説明図である。
【図６】この発明の実施の形態による文書検索装置における、文書検索のための前準備の手順を示すフローチャートである。
【図７】この発明の実施の形態による文書検索装置における、文書検索および検索結果の表示の手順を示すフローチャートである。
【図８】ヘルプ画面表示部２０６により表示されるヘルプ画面の一例（Ｑ文書検索前）を示す説明図である。
【図９】第１検索部２０２〜第３検索部２０４による各検索結果とそれらの併合結果の具体例を示す説明図である。
【図１０】ヘルプ画面表示部２０６により表示されるヘルプ画面の他の一例（Ｑ文書検索後）を示す説明図である。
【図１１】ヘルプ画面表示部２０６により表示されるヘルプ画面の他の一例（Ａ文書表示時）を示す説明図である。
【符号の説明】
１００バスまたはケーブル
１０１ＣＰＵ
１０２ＲＯＭ
１０３ＲＡＭ
１０４ＨＤＤ
１０５ＨＤ
１０６ＦＤＤ
１０７ＦＤ
１０８ディスプレイ
１０９ネットワークＩ／Ｆ
１１０通信ケーブル
１１１キーボード
１１２マウス
１１３ＣＤ−ＲＯＭ
１１４ＣＤ−ＲＯＭドライブ
２００問い合わせ文解析部
２０１ヘルプ文書管理部
２０１ａＱ文書ＤＢ
２０１ｂＡ文書ＤＢ
２０１ｃＱ＋Ａ文書ＤＢ
２０２第１検索部
２０２ａＫＤＢ
２０３第２検索部
２０４第３検索部
２０５第４検索部
２０５ａＣＤＢ
２０６ヘルプ画面表示部

[0001]
[Technical field of the invention]
The present invention relates to a document search apparatus that searches for a document that is semantically similar to an input natural sentence, a document search method, and a program that causes a computer to execute the method.
[0002]
[Prior art]
With the rapid spread of computers and networks, documents of many kinds in many fields are increasingly provided on electronic media instead of paper. Among them, manuals for the hardware and software of computers (and their peripheral devices) were among the first documents to be digitized.
[0003]
Compared with paper manuals, digitized manuals (hereinafter referred to as "help") have many advantages: the needed passage is easy to locate, they are easy to share among several people, and they are not damaged, soiled, or lost.
[0004]
How such help is organized depends on the author's policy, but once it exceeds a certain size it is generally implemented as a collection of multiple documents rather than as a single document (file). Searching the help is therefore a kind of document search.
[0005]
For example, the help for software manufactured and sold by the present applicant for importing photographic images into a personal computer and printing them on a printer (product name "美写楽") consists of two kinds of documents: documents each consisting of a question such as "Can I view digital camera images on my PC with 美写楽?" (hereinafter collectively referred to as "Q documents"), and documents each consisting of an answer to such a question, such as "Yes, you can view them. It reproduces your images more beautifully than anything else in the world." (hereinafter collectively referred to as "A documents").
[0006]
Each Q document stores one question and each A document stores one answer, and which Q document corresponds to which A document can be determined from the document IDs or the like.
[0007]
The software is equipped with a search engine for searching the help, that is, a module that identifies, among the Q documents in the help, the one whose content is closest to the natural sentence entered by the operator (hereinafter referred to as the "query sentence"). In the case of the present applicant, this engine employs a search technique called the "vector space method".
[0008]
The "vector space method" calculates the distance between the feature vector of the query sentence and the feature vector of each document in the document group being searched (in other words, each document that is compared with the query sentence). The smaller this distance, the higher the document's similarity to the query sentence, that is, the better it matches the operator's search request.
[0009]
A feature vector is an n-dimensional vector consisting of n element values corresponding to n keywords (vocabulary items). In the simplest case, each element value is determined by the frequency of occurrence of the corresponding keyword. For example, the feature vector of a query sentence consisting of only a single keyword (in other words, a query sentence in which one keyword appears exactly once) is a vector such as (0, 1, 0, 0, ...), in which only the element corresponding to that keyword is 1 and the remaining n-1 elements are all 0.
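The two paragraphs above can be illustrated with a minimal sketch. The patent does not say which distance measure the engine uses, so cosine distance is an assumption here, and the vocabulary, the document IDs `doc-a` and `doc-b`, and their keyword lists are made up for illustration.

```python
import math
from collections import Counter

def feature_vector(keywords, vocabulary):
    """Term-frequency vector of `keywords` over the ordered `vocabulary`."""
    counts = Counter(keywords)
    return [counts[term] for term in vocabulary]

def cosine_distance(u, v):
    """1 - cosine similarity (assumed measure); 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 if norm == 0 else 1.0 - dot / norm

# Hypothetical vocabulary and documents.
vocabulary = ["image", "pc", "print", "connect"]
query = feature_vector(["pc"], vocabulary)  # single keyword, appearing once
documents = {
    "doc-a": feature_vector(["image", "pc", "pc"], vocabulary),
    "doc-b": feature_vector(["print", "image"], vocabulary),
}
# A smaller distance means a higher similarity to the query sentence.
ranked = sorted(documents, key=lambda d: cosine_distance(query, documents[d]))
```

Here `query` is the single-keyword vector of the form (0, 1, 0, 0) described above, and `doc-a`, which shares the keyword, ranks ahead of `doc-b`.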
[0010]
In practice it is not this simple. Besides the frequency of occurrence of the corresponding keyword, each element value is determined with regard to where the keyword appears, for example whether its occurrences are concentrated in a particular document within the document group, or in a particular part of a document. Various other methods of creating feature vectors are possible, but since they are not directly related to the present invention they are not discussed here.
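One widely used weighting that rewards keywords concentrated in few documents is TF-IDF. The patent does not state which weighting the engine applies, so the following is only a sketch of that general idea; the function names and the toy documents are hypothetical.

```python
import math

def idf(term, documents):
    """log(N / df): larger when `term` occurs in fewer of the documents."""
    df = sum(1 for words in documents if term in words)
    return math.log(len(documents) / df) if df else 0.0

def tfidf_vector(words, vocabulary, documents):
    """Weight each term frequency by the term's inverse document frequency."""
    return [words.count(term) * idf(term, documents) for term in vocabulary]

# Hypothetical document group: "image" is everywhere, "pc" is in one document.
documents = [["pc", "image"], ["image", "print"], ["image", "size"]]
vocabulary = ["image", "pc", "print", "size"]
vec = tfidf_vector(documents[0], vocabulary, documents)
```

In `vec`, the ubiquitous keyword "image" receives weight 0 while "pc", which occurs in only one document, receives a positive weight.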
[0011]
In the vector space method described above, documents in which the overall tendency of the keywords appearing in the text resembles that of the query sentence are retrieved as search results. Compared with Boolean search (ordinary keyword search), which simply selects documents according to whether they contain the keywords in the query sentence, it can therefore reduce noise in the search results.
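A small sketch may make the contrast concrete. It assumes binary keyword vectors, cosine similarity, and an arbitrary threshold of 0.5, none of which come from the patent; the documents `d1` and `d2` are invented.

```python
import math

def boolean_search(query, docs):
    """Select every document that contains at least one query keyword."""
    return [doc_id for doc_id, words in docs.items() if set(query) & set(words)]

def cosine(query, words):
    """Cosine similarity of binary keyword vectors (1 if present, else 0)."""
    q, w = set(query), set(words)
    return len(q & w) / math.sqrt(len(q) * len(w)) if q and w else 0.0

def vector_search(query, docs, threshold):
    """Keep documents whose similarity exceeds `threshold`, best first."""
    hits = [(cosine(query, words), doc_id) for doc_id, words in docs.items()]
    return [doc_id for score, doc_id in sorted(hits, reverse=True) if score > threshold]

# Hypothetical documents: d2 shares only the single keyword "pc" with the query.
docs = {
    "d1": ["pc", "image", "import"],
    "d2": ["pc", "connect", "cable", "printer", "usb"],
}
query = ["pc", "image", "import"]
boolean_hits = boolean_search(query, docs)
vector_hits = vector_search(query, docs, threshold=0.5)
```

The Boolean search returns both documents because each contains "pc", whereas the vector search drops `d2`, whose overall keyword profile differs from the query.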
[0012]
[Problems to be solved by the invention]
However, even the vector space method is not a cure-all. In particular, beginners who need help are often novices at document search as well, and often lack know-how about how to phrase a question or how to narrow down the answers. Consequently, even when a document containing the needed information exists in the help, the operator may fail to find it among the many documents.
[0013]
In document search generally, the challenge is to improve both precision, that is, how well the retrieved documents match the searcher's request, and recall, that is, how completely the documents matching the searcher's request are retrieved from the document group. Achieving both is often difficult: tightening the search conditions raises precision but lowers recall, and conversely loosening them raises recall but lowers precision.
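The trade-off can be made concrete with the standard definitions of the two measures. The document IDs and the "relevant" set below are invented for illustration only.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved list against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ground truth: four documents actually answer the request.
relevant = {"DJ-1", "DJ-2", "DJ-3", "DJ-4"}

# A strict search returns few documents, all of them relevant.
strict = precision_recall(["DJ-1", "DJ-2"], relevant)

# A loose search returns every relevant document plus four irrelevant ones.
loose = precision_recall(
    ["DJ-1", "DJ-2", "DJ-3", "DJ-4", "DJ-7", "DJ-8", "DJ-9", "DJ-10"], relevant
)
```

With these made-up numbers the strict search has precision 1.0 and recall 0.5, and the loose search has precision 0.5 and recall 1.0.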
[0014]
The present invention has been made in view of the above problems of the prior art. Its object is to provide a document search apparatus, a document search method, and a program for causing a computer to execute the method that reconcile precision and recall in document search and that can present the many documents in a search result to the operator in an order and display form that make the target document easy to find.
[0015]
[Means for Solving the Problems]
In order to solve the above problems and achieve the object, the document search apparatus according to this invention comprises: first search means for searching for documents semantically similar to an input natural sentence; second search means for searching for documents semantically similar to the natural sentence; determination means for determining whether the same information is included in duplicate in both the information identifying the documents retrieved by the first search means and the information identifying the documents retrieved by the second search means; deletion means for deleting, when the determination means determines that the same information is included in duplicate in both, that same information from the information identifying the documents retrieved by the second search means; combining means for appending, after the information identifying the documents retrieved by the first search means, the information identifying the documents retrieved by the second search means from which the same information has been deleted by the deletion means; and display means for displaying the text of each document identified by each piece of information, in the order in which the information was combined by the combining means.
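The determination, deletion, and combining means can be sketched as a short function over lists of document IDs. The function name is hypothetical. Of the sample data, the first list and the duplicated "DJ-2" follow the embodiment in the original text, while "DJ-5" is a made-up placeholder because the contents of Fig. 9 are not reproduced here.

```python
def merge_results(first_ids, second_ids):
    """Append the second result list to the first, dropping IDs already in the first."""
    first = set(first_ids)
    # Determination + deletion: remove from the second list any ID also in the first.
    remaining = [doc_id for doc_id in second_ids if doc_id not in first]
    # Combining: first-search results come first, so they are displayed higher.
    return list(first_ids) + remaining

r1 = ["DJ-1", "DJ-2", "DJ-3"]   # result of the higher-precision search
r2 = ["DJ-2", "DJ-5"]           # "DJ-5" is a hypothetical ID
merged = merge_results(r1, r2)
```

The order of `merged` is the display order: the duplicate "DJ-2" keeps its position from the first list and appears only once.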
[0016]
According to this invention, documents retrieved by various techniques that differ in precision and recall are displayed in a list, with those retrieved by a technique of relatively high precision ranked higher and those retrieved by a technique of relatively high recall ranked lower.
[0017]
Further, in the document search apparatus according to this invention, in the above invention, the display means displays the text of the documents retrieved by the first search means and the text of the documents retrieved by the second search means in different display forms.
[0018]
According to this invention, documents retrieved by various techniques that differ in precision and recall are displayed in a list that makes explicit, by distinguishing display colors or the like, which technique retrieved each document.
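As a sketch of the differing display forms, each merged entry can carry a color chosen by the search that produced it. The embodiment in the original text mentions red for R1 and blue for R2 as one possibility. The function name, the tuple layout, and the ID "DJ-5" are assumptions for illustration.

```python
# Display color per originating search (example colors from the embodiment).
COLORS = {"first": "red", "second": "blue"}

def label_results(first_ids, second_ids):
    """Return (doc_id, color) rows in display order, with duplicates removed."""
    first = set(first_ids)
    rows = [(doc_id, COLORS["first"]) for doc_id in first_ids]
    rows += [(doc_id, COLORS["second"]) for doc_id in second_ids if doc_id not in first]
    return rows

rows = label_results(["DJ-1", "DJ-2"], ["DJ-2", "DJ-5"])  # "DJ-5" is hypothetical
```

A renderer could then draw each row's text, or its background, in the attached color.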
[0019]
Further, in the document search apparatus according to this invention, in the above invention, the first search means retrieves, as documents semantically similar to the natural sentence, documents associated in advance with a document whose text contains vocabulary in common with the natural sentence, and the second search means retrieves, as documents semantically similar to the natural sentence, documents in which the tendency of the vocabulary appearing in the text is similar to that of the natural sentence.
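The first search means of this paragraph can be sketched as a lookup table from a keyword to the Q documents associated with it in advance. The original text notes that a plain list or relational table could stand in for the keyword document database (KDB). The entry below follows the embodiment's example for the keyword 「パソコン」 (glossed here as "PC"); the English keyword glosses and the function name are assumptions.

```python
# Keyword -> IDs of Q documents associated in advance, most apt first.
KDB = {"PC": ["DJ-1", "DJ-2", "DJ-3"]}

def first_search(query_keywords, kdb):
    """Collect the pre-associated Q document IDs for every matching keyword."""
    ids = []
    for keyword in query_keywords:
        for doc_id in kdb.get(keyword, []):
            if doc_id not in ids:
                ids.append(doc_id)
    return ids

# Keywords cut out of the query "How do I import images into my PC?"
r1 = first_search(["PC", "image", "import"], KDB)
```

Only "PC" matches an entry, so `r1` holds the three associated IDs in their registered order, as in the embodiment.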
[0020]
According to this invention, in the search result list, documents retrieved by the first search means, which has relatively high precision, are displayed higher and documents retrieved by the second search means, which has relatively high recall, are displayed lower, and which means retrieved each document is made explicit by distinguishing display colors or the like.
[0021]
Further, in the document search apparatus according to this invention, in the above invention, the first search means retrieves, as documents semantically similar to the natural sentence, documents in which the tendency of the vocabulary appearing in the text is similar to that of the natural sentence, and the second search means retrieves, as documents semantically similar to the natural sentence, both documents in which the tendency of the vocabulary appearing in the text is similar to that of the natural sentence and documents for which the tendency of the vocabulary appearing in the text of a document associated with them in advance is similar to that of the natural sentence.
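The difference between the two search means of this paragraph can be sketched with a toy overlap score standing in for the vector comparison. The ID scheme ("DJ-n" for a question, "ZU-n" for its answer) comes from the original text; the keyword lists, the scoring function, and the zero threshold are assumptions for illustration.

```python
def overlap(query, words):
    """Toy similarity: number of distinct keywords shared with the query."""
    return len(set(query) & set(words))

def search_q(query, q_docs, threshold=0):
    """First search means here: compare the query with each Q document only."""
    return [qid for qid, words in q_docs.items() if overlap(query, words) > threshold]

def search_q_plus_a(query, q_docs, a_docs, threshold=0):
    """Second search means here: compare with each Q document plus its paired A document."""
    hits = []
    for qid, q_words in q_docs.items():
        a_words = a_docs.get(qid.replace("DJ-", "ZU-"), [])
        if overlap(query, q_words + a_words) > threshold:
            hits.append(qid)  # the Q document's ID is reported either way
    return hits

# Hypothetical keyword lists for two questions and one answer.
q_docs = {"DJ-1": ["camera", "image", "PC", "view"],
          "DJ-3": ["digital camera", "image", "PC", "view"]}
a_docs = {"ZU-3": ["view", "image", "play"]}
q_only = search_q(["play"], q_docs)
q_and_a = search_q_plus_a(["play"], q_docs, a_docs)
```

The keyword "play" appears only in an answer, so the Q-only search finds nothing while the broader search returns the question paired with that answer.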
[0022]
According to this invention, in the search result list, documents retrieved by the first search means, which has relatively high precision, are displayed lower and documents retrieved by the second search means, which has relatively high recall, are displayed higher, and which means retrieved each document is made explicit by distinguishing display colors or the like.
[0023]
Further, the document search apparatus according to this invention, in the above invention, further comprises third search means for searching for a category into which documents semantically similar to the natural sentence are classified, and second display means for displaying the name of the category retrieved by the third search means.
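The category search can be sketched by scoring each category's name keywords against the query keywords and returning the best match. The four categories follow the classification example in the embodiment, with keywords glossed into English. The binary inner product and the function name are assumptions.

```python
# Category name -> keywords contained in that name (English glosses).
CATEGORIES = {
    "PC-connect": ["PC", "connect"],
    "PC-import": ["PC", "import"],
    "image-size": ["image", "size"],
    "image-print": ["image", "print"],
}

def best_category(query_keywords, categories):
    """Return the category whose keywords share the most terms with the query."""
    def inner_product(name):
        # Inner product of binary keyword vectors = number of shared keywords.
        return len(set(query_keywords) & set(categories[name]))
    return max(categories, key=inner_product)

category = best_category(["PC", "image", "import"], CATEGORIES)
```

For the query keywords "PC", "image", and "import", the "PC-import" category scores 2 and every other category scores 1, which matches the category name shown in the embodiment's result screen.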
[0024]
According to this invention, documents retrieved by various techniques that differ in precision and recall are displayed in a list, with those retrieved by a technique of relatively high precision ranked higher and those retrieved by a technique of relatively high recall ranked lower, and which means retrieved each document is made explicit by distinguishing display colors or the like.
[0025]
Also, the document search method according to this invention includes: a first search step for searching for a document that is semantically similar to an input natural sentence; a second search step for searching for a document that is semantically similar to the natural sentence; a determination step for determining whether the same information is duplicated in both the information that can specify the documents searched in the first search step and the information that can specify the documents searched in the second search step; a deletion step for deleting, when the determination step determines that the same information is duplicated in both, the same information from the information that can specify the documents searched in the second search step; a combination step for combining, after the information that can specify the documents searched in the first search step, the information that can specify the documents searched in the second search step from which the same information has been deleted in the deletion step; and a display step for displaying the text of each document specified by each piece of information, in the order in which the information was combined in the combination step.
[0026]
According to this invention, documents retrieved by various methods with different relevance ratios and recalls are listed with those retrieved by a method having a relatively high relevance ratio ranked higher and those retrieved by a method having a relatively high recall ranked lower.
[0027]
Also, the program according to this invention causes a computer to execute the method described above.
[0028]
According to this invention, the method described above is executed by a computer.
[0029]
DETAILED DESCRIPTION OF THE INVENTION
Exemplary embodiments of a document search apparatus, a document search method, and a program for causing a computer to execute the method according to the present invention will be explained below in detail with reference to the accompanying drawings.
[0030]
FIG. 1 is an explanatory diagram showing a hardware configuration of a document search apparatus according to an embodiment of the present invention. In the figure, 101 indicates a CPU that controls the entire apparatus, 102 indicates a ROM that stores basic input / output programs, and 103 indicates a RAM that is used as a work area of the CPU 101.
[0031]
Reference numeral 104 denotes an HDD (hard disk drive) that controls reading and writing of data with respect to the HD (hard disk) 105 under the control of the CPU 101, and 105 denotes an HD that stores data written under the control of the HDD 104.
[0032]
Reference numeral 106 denotes an FDD (floppy disk drive) that controls reading and writing of data with respect to the FD (floppy disk) 107 under the control of the CPU 101, and 107 denotes a removable FD that stores data written under the control of the FDD 106.
[0033]
Reference numeral 108 denotes a display for displaying a cursor, menus, windows, and various data such as characters and images, and 109 denotes a network I/F that is connected to a network such as a LAN via a communication cable 110 and functions as an interface between the network and the CPU 101.
[0034]
Reference numeral 111 denotes a keyboard having a plurality of keys for inputting characters, numerical values, and various instructions, and 112 denotes a mouse for selecting and executing various instructions, selecting a processing target, moving the cursor, and the like. Reference numeral 113 denotes a CD-ROM which is a detachable recording medium, 114 denotes a CD-ROM drive that controls reading of data from the CD-ROM 113, and 100 denotes a bus or cable that connects the above components.
[0035]
Next, FIG. 2 is an explanatory diagram showing a functional configuration of the document search apparatus according to the embodiment of the present invention. The document search apparatus according to the embodiment of the present invention includes an inquiry sentence analysis unit 200, a help document management unit 201, a first search unit 202, a second search unit 203, a third search unit 204, a fourth search unit 205, and a help screen display unit 206.
[0036]
Prior to the description of the individual functional units, the outline of the present invention will be described. In the present invention, a search that has a high relevance ratio but a low recall is performed in parallel with a search that has a low relevance ratio but a high recall. The first search unit 202 to the fourth search unit 205 described later use search methods whose relevance ratio is higher (and whose recall is lower) the smaller the reference numeral, and each unit searches the documents in the help document management unit 201 by its own search method.
[0037]
The search results obtained by the above units are merged and displayed as a list by the help screen display unit 206. The rank of each document at this time, that is, its priority of presentation to the operator, is higher the higher the relevance ratio of the search that extracted it.
[0038]
Therefore, in the search result list, the document group retrieved by the first search unit 202 is displayed at the top and the document group retrieved by the fourth search unit 205 is displayed at the bottom, and the document groups retrieved by the second search unit 203 and the third search unit 204 fill the space in between.
[0039]
If the documents retrieved by various methods are simply listed, the number of documents in the search result increases even after duplicates are removed, making it difficult for the operator to find the target document. At the same time, in many cases the document the operator wants is picked up only by a search with a high recall, that is, a so-called "loose" search.
[0040]
Therefore, in the search result list of the present invention, a document group hit by a search with a high relevance ratio, which is relatively likely to include the target document, is ranked higher, and a document group hit by a search with a high recall, which is relatively less likely to include it, is ranked lower.
[0041]
Hereinafter, the function of each unit shown in FIG. 2 will be described in detail. First, the inquiry sentence analysis unit 200 analyzes an inquiry sentence (an arbitrary natural sentence) input from the help screen display unit 206 described later, and extracts the keywords that serve as the basis for a feature vector described later. For example, from the inquiry sentence "How to import an image into a personal computer?", the three keywords "PC", "image", and "capture" are extracted by this analysis.
[0042]
Next, reference numeral 201 denotes a help document management unit, which includes three databases, a Q document DB (database) 201a, an A document DB (database) 201b, and a Q + A document DB (database) 201c described below.
[0043]
FIG. 3 is an explanatory diagram schematically showing the structure of data held in each database in the help document management unit 201. As shown in the figure, the help document management unit 201 holds all Q documents constituting help in the Q document DB 201a and all A documents constituting help in the A document DB 201b.
[0044]
Q documents are assigned serial numbers in the format "DJ-*", such as "DJ-1" and "DJ-2", and A documents are assigned serial numbers in the format "ZU-*", such as "ZU-1" and "ZU-2", as unique document IDs. A corresponding Q document and A document have the same number in their IDs. For example, the text of the A document "ZU-1" is the answer to the question that is the text of the Q document "DJ-1", and the text of the A document "ZU-2" is the answer to the question that is the text of the Q document "DJ-2".
[0045]
Further, the help document management unit 201 holds a Q + A document in which the corresponding Q document in the Q document DB 201a and the A document in the A document DB 201b are combined with each other in the Q + A document DB 201c. As the ID of the Q + A document, the ID of the Q document that is the original is assumed to be taken over as it is. For example, the ID of a Q + A document created from a Q document with an ID “DJ-1” and an A document with “ZU-1” is “DJ-1”.
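The ID convention above can be sketched as follows. This is only an illustration under stated assumptions: the function names `answer_id` and `build_qa_documents`, the dictionary layout, and the sample texts are hypothetical and do not come from the embodiment; only the "DJ-*" / "ZU-*" numbering and the rule that a Q + A document inherits the ID of its Q document are taken from the description.

```python
def answer_id(q_id):
    """Derive the A document ID that shares its number with a Q document ID."""
    prefix, number = q_id.split("-", 1)
    if prefix != "DJ":
        raise ValueError(f"not a Q document ID: {q_id}")
    return "ZU-" + number


def build_qa_documents(q_docs, a_docs):
    """Combine each Q document with its A document; the Q + A ID is the Q ID."""
    qa_docs = {}
    for q_id, question in q_docs.items():
        answer = a_docs.get(answer_id(q_id))
        if answer is not None:
            qa_docs[q_id] = question + "\n" + answer
    return qa_docs


# Hypothetical sample data standing in for the Q document DB and A document DB.
q_docs = {
    "DJ-1": "How do I view camera images on a PC?",
    "DJ-2": "How do I capture images on a PC?",
}
a_docs = {
    "ZU-1": "(answer text for DJ-1)",
    "ZU-2": "(answer text for DJ-2)",
}
qa_docs = build_qa_documents(q_docs, a_docs)
```

The same `answer_id` conversion would also serve when the answer to a selected question has to be looked up in the A document DB.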
[0046]
Of the databases held by the help document management unit 201, the Q document DB 201a is a database to be searched by a second search unit 203 described later, and the Q + A document DB 201c is a database to be searched by a third search unit 204 described later. Further, the help document management unit 201 searches for a Q document or an A document specified by an ID delivered from a help screen display unit 206 (to be described later), and outputs the text to the help screen display unit 206.
[0047]
Returning to FIG. 2, reference numeral 202 denotes a first search unit, which is a functional unit that performs the search with the highest relevance ratio, that is, the most narrowed-down search, compared with the second search unit 203 to the fourth search unit 205 described later. The first search unit 202 holds a KDB (keyword database) 202a described below.
[0048]
FIG. 4 is an explanatory diagram schematically showing the structure of data held in the KDB 202a in the first search unit 202. As shown in the figure, a plurality of documents are held in the KDB 202a, and the body of each document consists of only one keyword. In each document, the IDs of one to several Q documents associated with the document are set as attribute information (attached information).
[0049]
The KDB 202a is created in advance, mainly by hand. First, some characteristic keywords are extracted from the Q documents and A documents constituting the help. Candidates for characteristic keywords include, for example, technical terms specific to a certain field that have a large IDF value, and keywords that appear only in documents of a certain category when the Q documents and A documents are classified into a plurality of categories.
[0050]
Then, for each keyword extracted as described above, the one to several Q documents most appropriate to associate with the keyword are selected. What counts as most appropriate is arbitrary. In the case of help, for example, it is empirically known that some matters are referred to by a relatively large number of operators and others are rarely referred to, so the Q documents holding the questions typically asked by operators who consult the help with that keyword are assigned as the most appropriate.
[0051]
For example, in the above-mentioned help for "美写楽", operators who consult the help with the characteristic keyword "PC" (it may seem a common keyword, but it can be a rare one within a specific help) in many cases want to know how to view photographs taken with a digital camera on a PC, or how to capture such photographs on a PC.
[0052]
Therefore, the keyword "PC" is associated with the Q document consisting of the question "Please tell me how to view camera images on a PC." (its ID is "DJ-1" according to FIG. 3), the Q document consisting of the question "Tell me how to capture images on a PC." ("DJ-2"), and the Q document consisting of the question "Can I view digital camera images on a PC with 美写楽?" ("DJ-3").
[0053]
Specifically, a document having only one keyword “PC” as a body is created in the KDB 202a, and the ID of each Q document associated as described above is set as attribute information (attached information) of the document.
[0054]
Note that the relative appropriateness of the documents associated with the same keyword is expressed by the order in which their IDs are registered. In the above example, the first ID "DJ-1" is more appropriate to associate with the keyword "PC" than the last ID "DJ-3" (one may say that "DJ-1" is the more common question than "DJ-3").
[0055]
In addition to the above, there may be other Q documents that contain the keyword "PC" or that store questions recalled or associated from the keyword, but the number of Q documents associated in the KDB 202a is limited to about three at most (three in the above example). In other words, for a common keyword that would be associated with too many Q documents, it is not desirable to create a document and register it in the KDB 202a.
[0056]
FIG. 4 shows the contents held in the KDB 202a when the three keywords "PC", "connection", and "print" are extracted as characteristic keywords. For "connection" and "print" as well, a document having the keyword as its body is created in the KDB 202a. As described above, the KDB 202a is not a "database of keywords" but a "database of documents containing keywords", that is, a document database consisting of a plurality of documents each holding only one keyword.
[0057]
Here, for convenience of explanation, it is assumed that one keyword is stored in each document in the KDB 202a, but there may be a plurality of keywords in the document. That is, the document in the KDB 202a may be created not for each keyword but for a combination of a plurality of keywords. Further, a document that holds only one keyword and a document that holds a plurality of keyword sets may be mixed in the KDB 202a.
[0058]
The first search unit 202 shown in FIG. 2 creates a feature vector of the inquiry sentence based on the analysis result input from the inquiry sentence analysis unit 200, and sequentially calculates the inner product of this feature vector and the feature vector of each document in the KDB 202a, and thereby the distance between the feature vectors as grasped from the inner product.
[0059]
Here, the feature vector of the inquiry sentence is, in the simplest case, a vector in which only the element values corresponding to the keywords included in the sentence are positive and the remaining element values are all 0. The feature vector of each document in the KDB 202a is likewise a vector in which only the element value corresponding to the keyword included in the document is 1 and all the remaining element values are 0. For this reason, the calculated inner product is some positive value for a document containing a keyword that also appears in the inquiry sentence, and is uniformly 0 for a document that contains no such keyword.
[0060]
The first search unit 202 refers to each document whose inner product value exceeds the threshold value 0, that is, each document whose inner product with the feature vector of the inquiry sentence is large because it contains the same keyword as the inquiry sentence (a document whose feature vector is at a small distance from that of the inquiry sentence), and acquires the IDs of the Q documents set as its attribute information. These IDs are output to the help screen display unit 206 described later. Hereinafter, the Q documents in the search result of the first search unit 202, that is, the Q documents specified by these IDs, are collectively referred to as "R1".
[0061]
For example, in the feature vector of the inquiry sentence "How to import an image into a personal computer?" described above, the element values corresponding to "PC", "image", and "capture" are each 1. Therefore, the inner product value exceeds 0 only for a document whose feature vector has an element value of 1 at one of the same positions, which in the example of FIG. 4 is only the document having the keyword "PC" as its body (the value is 0 for the other documents). The IDs "DJ-1", "DJ-2", and "DJ-3" of the Q documents set in that document are output to the help screen display unit 206.
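The search by the first search unit 202 described above can be sketched roughly as follows. This is a minimal illustration, not the actual engine: the names `VOCABULARY`, `KDB`, `feature_vector`, and `first_search` are hypothetical, the vocabulary is cut down to five terms, and the Q document IDs attached to "connection" and "print" are invented placeholders because the text does not state them. Only the "PC" entry and the threshold of 0 follow the description.

```python
# Reduced vocabulary; each position of a feature vector corresponds to one term.
VOCABULARY = ["PC", "image", "capture", "connection", "print"]

# Each KDB document has one keyword as its body and Q document IDs as attributes.
# The IDs for "connection" and "print" are hypothetical placeholders.
KDB = [
    {"body": ["PC"], "q_ids": ["DJ-1", "DJ-2", "DJ-3"]},
    {"body": ["connection"], "q_ids": ["DJ-4"]},
    {"body": ["print"], "q_ids": ["DJ-7"]},
]


def feature_vector(keywords):
    """Binary vector: 1 where the vocabulary term is among the keywords."""
    return [1 if term in keywords else 0 for term in VOCABULARY]


def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))


def first_search(query_keywords, threshold=0):
    """Return Q document IDs of KDB documents whose inner product exceeds the threshold."""
    query_vector = feature_vector(query_keywords)
    r1 = []
    for document in KDB:
        if inner_product(query_vector, feature_vector(document["body"])) > threshold:
            for q_id in document["q_ids"]:  # registration order is preserved
                if q_id not in r1:
                    r1.append(q_id)
    return r1


r1 = first_search(["PC", "image", "capture"])
```

With the three keywords of the example inquiry sentence, only the "PC" document has a non-zero inner product, so `r1` holds the three IDs registered for it, in registration order.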
[0062]
Returning to FIG. 2, the second search unit 203 is a functional unit that performs a slightly looser search than the first search unit 202 described above. The second search unit 203 searches the Q document held in the Q document DB 201a of the help document management unit 201 by a normal vector space method.
[0063]
That is, the inner product value of the feature vector of the query statement created based on the analysis result of the query statement analysis unit 200 and the feature vector of each Q document in the Q document DB 201a is sequentially calculated, and this value exceeds a predetermined threshold value. The Q document is specified, and the IDs are output to the help screen display unit 206 described later in descending order of the value. Hereinafter, the Q documents in the search result by the second search unit 203 are collectively referred to as “R2”.
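As a rough sketch of this thresholded vector space search, assuming plain term counts as element values (the actual weighting is not specified here), the following may help. The function names, the sample keyword lists for each Q document, and the threshold value of 1 are all hypothetical.

```python
from collections import Counter


def inner_product(query_counts, doc_counts):
    """Inner product of two sparse term-count vectors."""
    return sum(weight * doc_counts.get(term, 0) for term, weight in query_counts.items())


def second_search(query_keywords, q_doc_keywords, threshold):
    """Return IDs of Q documents scoring above the threshold, highest score first."""
    query_counts = Counter(query_keywords)
    scored = []
    for doc_id, keywords in q_doc_keywords.items():
        score = inner_product(query_counts, Counter(keywords))
        if score > threshold:
            scored.append((score, doc_id))
    scored.sort(key=lambda pair: -pair[0])  # stable sort keeps ties in DB order
    return [doc_id for _, doc_id in scored]


# Hypothetical keyword lists standing in for the Q document DB.
Q_DOC_KEYWORDS = {
    "DJ-1": ["camera", "image", "PC", "view"],
    "DJ-2": ["image", "PC", "capture"],
    "DJ-5": ["image", "print"],
    "DJ-6": ["printer", "connection"],
}

r2 = second_search(["PC", "image", "capture"], Q_DOC_KEYWORDS, threshold=1)
```

Unlike the KDB lookup, nothing caps the length of the result here: every Q document that clears the threshold is returned.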
[0064]
In the search by the second search unit 203, the number of Q documents in R2 is not limited. That is, however many Q documents have an inner product with the inquiry sentence exceeding the predetermined threshold value, all of them are included in R2. In this respect it differs from the search by the first search unit 202 described above, in which the number of Q documents associated with each document in the KDB 202a is limited to a few and, as a result, the number of documents in R1 is also limited to about that number. Therefore, in general, the search by the second search unit 203 tends to yield more documents in the search result than the search by the first search unit 202 (of course, there are exceptions).
[0065]
Next, the third search unit 204 is a functional unit that performs a looser search than the second search unit 203 described above. The third search unit 204 searches the Q + A documents held in the Q + A document DB 201c of the help document management unit 201 by a normal vector space method, and outputs the IDs of the retrieved documents to the help screen display unit 206 described later. Hereinafter, the Q + A documents in the search result of the third search unit 204 are collectively referred to as "R3".
[0066]
Here, a combination of the corresponding Q document and A document is prepared in advance as the Q + A document DB 201c, and the search by the third search unit 204 is performed for each document after the combination. This is only for the convenience of diverting an existing search engine (specifically, “Concept Base” manufactured and sold by the present applicant), and does not have to be done in this way.
[0067]
In short, it is sufficient if, in addition to Q documents that are themselves similar to the inquiry sentence, Q documents whose corresponding A documents are similar to the inquiry sentence can be retrieved. For example, the Q + A document DB 201c may be omitted, and the third search unit 204 may search the Q document DB 201a and the A document DB 201b in parallel, outputting to the help screen display unit 206 the ID of each retrieved Q document as it is for the former, and the ID of the Q document corresponding to each retrieved A document for the latter.
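The alternative of searching the Q documents and the A documents separately can be sketched as below. It is an illustration under assumptions: similarity is simplified to the number of shared keywords (the inner product of binary vectors), a Q document reached both directly and through its answer keeps the higher of the two scores (a choice made for this sketch, not stated in the text), and all names and sample data are hypothetical.

```python
def similarity(query_keywords, doc_keywords):
    """Inner product of binary vectors, i.e. the number of shared keywords."""
    return len(set(query_keywords) & set(doc_keywords))


def question_id(a_id):
    """Map an A document ID such as "ZU-8" to its Q document ID "DJ-8"."""
    return "DJ-" + a_id.split("-", 1)[1]


def third_search(query_keywords, q_docs, a_docs, threshold):
    """Search Q and A documents separately and report everything as Q document IDs."""
    best = {}
    for q_id, keywords in q_docs.items():
        score = similarity(query_keywords, keywords)
        if score > threshold:
            best[q_id] = max(best.get(q_id, 0), score)
    for a_id, keywords in a_docs.items():
        score = similarity(query_keywords, keywords)
        if score > threshold:
            q_id = question_id(a_id)
            best[q_id] = max(best.get(q_id, 0), score)
    return sorted(best, key=lambda q_id: -best[q_id])


# Hypothetical keyword lists for two Q documents and their A documents.
q_docs = {"DJ-1": ["camera", "image", "PC", "view"], "DJ-8": ["photo", "size", "change"]}
a_docs = {"ZU-1": ["view", "play", "image"], "ZU-8": ["image", "capture", "PC", "resize"]}

r3 = third_search(["PC", "image", "capture"], q_docs, a_docs, threshold=1)
```

In this sample, "DJ-8" shares no keyword with the inquiry sentence itself and is picked up only because its answer "ZU-8" does, which is the kind of hit a search limited to Q documents would miss.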
[0068]
In this way, unlike the second search unit 203, which limits the search target to the Q documents, the third search unit 204 substantially expands the search target to both the Q documents and the A documents, so a wider range of documents is likely to be picked up.
[0069]
Next, the fourth search unit 205 is a functional unit that performs a search that is the loosest compared to the first search unit 202 to the third search unit 204 described above, that is, the search with the lowest precision (high recall). The fourth search unit 205 holds a CDB (category database) 205a described below.
[0070]
A multi-level classification system for classifying Q documents in the Q document DB 201a as shown in FIG. 5 is created in advance. This classification system may be created manually, mechanically created by existing document classification techniques, or mechanically created one may be manually modified. In the classification system shown in the figure, four categories of “PC-connection”, “PC-capture”, “image-size”, and “image-print” are defined.
[0071]
Then, for each category, one document is created in the CDB 205a, the keywords included in the name of the category are stored in its body, and the IDs of the Q documents classified into the category are stored in its attribute information. For example, for the "PC-connection" category, a document having the two keywords "PC" and "connection" in its body is created in the CDB 205a, and the IDs of all Q documents classified into that category are set in its attribute information.
[0072]
Then, the fourth search unit 205 sequentially calculates the inner product value of the feature vector created from the keywords in the inquiry sentence and the feature vector of each document in the CDB 205a, and outputs the body of the document having the highest value, that is, the name of the category corresponding to that document, to the help screen display unit 206 described later.
[0073]
Note that the feature vector of each document in the CDB 205a is the feature vector of the corresponding category, and here it is simply created from the name of the category. Alternatively, for example, a feature vector obtained from all the Q documents classified into each category may be used as the feature vector of each document in the CDB 205a.
[0074]
As described above, the fourth search unit 205 does not search for documents directly. However, by searching for a category similar to the inquiry sentence, it can also be said to search for the plurality of documents classified into that category. Hereinafter, the Q documents in the category retrieved by the fourth search unit 205 are collectively referred to as "R4".
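The category search can be sketched as follows, assuming the simple case in which each category's feature vector is built from the keywords in its name. The four category names follow the classification system described above; the Q document IDs assigned to each category, the name `fourth_search`, and the handling of the case where no category matches are hypothetical.

```python
# Category name -> IDs of the Q documents classified into it (IDs are hypothetical).
CDB = {
    "PC-connection": ["DJ-4", "DJ-6"],
    "PC-capture": ["DJ-2", "DJ-9"],
    "image-size": ["DJ-8"],
    "image-print": ["DJ-5", "DJ-7"],
}


def category_score(query_keywords, category_name):
    """Inner product of binary vectors built from the query and the category name."""
    return len(set(query_keywords) & set(category_name.split("-")))


def fourth_search(query_keywords):
    """Return the best-matching category name and its Q document IDs (R4)."""
    best = max(CDB, key=lambda name: category_score(query_keywords, name))
    if category_score(query_keywords, best) == 0:
        return None, []  # assumption: report nothing when no category matches
    return best, list(CDB[best])


category, r4 = fourth_search(["PC", "image", "capture"])
```

For the example keywords, "PC-capture" shares two terms with the query and every other category shares one, so it is the single category returned.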
[0075]
Returning to FIG. 2, the help screen display unit 206 is a functional unit that displays a later-described help screen on the display 108. The help screen display unit 206 receives input of an inquiry sentence from the operator through this help screen, and is specified by each ID input from the first search unit 202, the second search unit 203, and the third search unit 204. The text of the Q document and the category name input from the fourth search unit 205 are displayed on the screen. The function of the help screen display unit 206 will be described in detail with reference to a specific example in a flowchart described later.
[0076]
Next, FIG. 6 is a flowchart showing a preparation procedure for document search in the document search apparatus according to the embodiment of the present invention. First, the characteristic keywords described above are extracted from all the Q documents / A documents constituting the help (step S601), and some appropriate Q documents are selected for each keyword (step S602). Then, a document having the keyword as the text is created in the KDB 202a, and the ID of the Q document is set as attribute information (step S603).
[0077]
Further, the Q document DB 201a, the A document DB 201b, and the Q + A document DB 201c are respectively created (step S604), and a classification system for classifying all the Q documents in the Q document DB 201a is created (step S605). Based on this, the above-mentioned CDB 205a is created (step S606).
[0078]
Next, FIG. 7 is a flowchart showing the procedure of document search and search result display in the document search apparatus according to the embodiment of the present invention. When the operator inputs an inquiry sentence on the help screen shown in FIG. 8 and clicks the "search" button 800 (step S701: Yes), the inquiry sentence analysis unit 200 first analyzes the sentence (step S702).
[0079]
The first search unit 202, supplied with the above analysis result, compares the feature vector of the inquiry sentence created from that result with the feature vector of each document in the KDB 202a, identifies the documents whose inner product exceeds the threshold, and outputs the IDs of the Q documents set in those documents to the help screen display unit 206 (step S703).
[0080]
Next, the analysis result is supplied to the second search unit 203. The second search unit 203 searches the above-described Q document DB 201a, identifies the Q documents whose inner product between vectors exceeds the threshold, and outputs the IDs of those documents to the help screen display unit 206 (step S704).
[0081]
Next, the analysis result is supplied to the third search unit 204. The third search unit 204 searches the above-described Q + A document DB 201c, identifies the Q + A documents whose inner product between vectors exceeds the threshold, and outputs the IDs of those documents to the help screen display unit 206 (step S705).
[0082]
Further, the analysis result is supplied to the fourth search unit 205. The fourth search unit 205 searches the above-described CDB 205a, identifies the document having the maximum inner product between vectors, and outputs the body of that document, that is, the name of the category corresponding to it, to the help screen display unit 206 (step S706).
[0083]
The help screen display unit 206, having received the search results of the first search unit 202 to the fourth search unit 205, next merges them and displays them as a list, but before that it deletes the documents that would be duplicated by the merge (step S707). For example, assume that the document groups R1 to R3 retrieved by the first search unit 202 to the third search unit 204 are specifically as shown in FIG. 9.
[0084]
In this case, the document whose ID is "DJ-2" is included in both R1 and R2. The help screen display unit 206 keeps only the "DJ-2" in R1, the result of the search with the relatively higher relevance ratio, that is, the search by the first search unit 202, and deletes the "DJ-2" in R2, the search result of the second search unit 203. An ID included in all three document groups is likewise deleted from all but the group obtained with the highest relevance ratio.
[0085]
Then, R1 to R3 after deduplication are merged into one document group (step S708). At this time, a document group obtained by a search with a higher relevance ratio is ranked higher in the merged document group.
[0086]
Here, as described above, since the relevance ratio decreases in the order of the first search unit 202, the second search unit 203, and the third search unit 204, the help screen display unit 206 combines R1 to R3 in the order R1, R2, R3. Therefore, in the merged document group (R1 + R2 + R3), as shown in FIG. 9, "DJ-1", "DJ-2", and "DJ-3" in R1 are positioned relatively high, and conversely "DJ-10" and the documents after it are positioned lower.
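The deduplication and merge of steps S707 and S708 can be sketched as below. The function name `merge_results` and the contents of R2 and R3 are hypothetical (the actual groups are those of FIG. 9, which is not reproduced here); the rule itself, keeping each ID only in the group obtained by the search with the highest relevance ratio and concatenating in the order R1, R2, R3, follows the description. The sketch also records which group each ID came from, which is the information a color-coded display would need.

```python
def merge_results(*result_lists):
    """Concatenate result lists in the given order, dropping IDs already seen.

    Lists must be passed from the highest relevance ratio to the lowest.
    Returns (doc_id, source) pairs, where source is 1 for the first list, etc.
    """
    seen = set()
    merged = []
    for source, results in enumerate(result_lists, start=1):
        for doc_id in results:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append((doc_id, source))
    return merged


# Hypothetical search results; "DJ-2" is duplicated between R1 and R2.
r1 = ["DJ-1", "DJ-2", "DJ-3"]
r2 = ["DJ-2", "DJ-5"]
r3 = ["DJ-5", "DJ-10", "DJ-1"]

merged = merge_results(r1, r2, r3)
ordered_ids = [doc_id for doc_id, _ in merged]
```

Because each ID is kept at its first occurrence, a document found by several searches always appears at the rank given by the most selective of them.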
[0087]
When the final order of the documents retrieved by each unit is determined in this way, the help screen display unit 206 then outputs the ID of each document to the help document management unit 201. Receiving this, the help document management unit 201 searches the stored Q document DB 201a for the Q document specified by the ID, and outputs the text in the searched Q document to the help screen display unit 206. Then, the help screen display unit 206 displays these texts in a list according to the order determined above. At the same time, the name of the category searched by the fourth search unit 205 is also displayed (step S709).
[0088]
FIG. 10 is an explanatory diagram illustrating an example of a help screen displayed by the help screen display unit 206. In the illustrated search result list, the texts of the Q documents retrieved by the first search unit 202 to the third search unit 204 are displayed in the order shown in FIG. 9, and the name of the category retrieved by the fourth search unit 205 ("PC-capture" in the figure) is also shown.
[0089]
For example, color coding of the characters may be used to indicate clearly which search picked up which question, such as displaying the text of the Q documents included in R1 in red, that of the Q documents included in R2 in blue, and that of the Q documents included in R3 in green. Alternatively, another identifying display may be used, such as changing the background color of the text.
[0090]
Here, for R4, only the name of the category is shown. However, as with R1 to R3, the text of each Q document constituting R4 may also be displayed at the end of the list (that is, following R3). In this case, it goes without saying that any ID overlapping with R1 to R3 is deleted from R4 in advance.
[0091]
When the operator selects any question in the list and clicks the "display" button 1000 on the help screen shown in FIG. 10 (step S710: Yes), the help screen display unit 206 generates, from the ID of the Q document storing the question, the ID of the A document storing the answer to that question, and outputs it to the help document management unit 201.
[0092]
In response to this, the help document management unit 201 searches the A document DB 201b that it holds for the A document specified by the ID, and delivers its text to the help screen display unit 206. Then, the text of the A document is displayed on the screen by the help screen display unit 206 as shown in FIG. 11 (step S711). In the figure, the text "You can view them. It plays back your images more beautifully than anything else in the world ..." displayed in the front window is the answer to the question "Can I view digital camera images on a PC with 美写楽?" selected in the back window.
[0093]
According to the embodiment described above, a plurality of searches with different relevance ratios and recalls are performed in layers on the same document group, and the search results of the respective methods are merged so that, in the final search result list, documents found by searches with a higher relevance ratio appear higher. Therefore, the operator can achieve the purpose simply by looking at only the first part of the list when wishing to see narrowed-down search results, and by reading through to the last part when wishing to see everything without omission (work such as changing the degree of narrowing down and running the search again is unnecessary).
[0094]
In the above-described embodiment, all the searches by the first search unit 202 to the fourth search unit 205 are based on the vector space method. However, this is only because an existing search engine (the above "Concept Base") is assumed to be made to function as the first search unit 202 to the fourth search unit 205, and in principle the searches do not have to be based on the vector space method.
[0095]
For example, instead of a document database such as the KDB 202a, an RDB or a simple list containing at least keywords and the IDs of the Q documents corresponding to each keyword may be provided, and the first search unit 202 may acquire the IDs of the Q documents by searching the RDB or list with the keywords in the inquiry sentence (the search by the first search unit 202 described above is essentially the same as a Boolean search, and a mechanism such as the KDB 202a is provided solely to conform to the specifications of the existing search engine).
[0096]
Further, even if the search by the second search unit 203 is replaced with a Boolean search of the Q documents, the search by the third search unit 204 with a Boolean search of the Q + A documents, and the search by the fourth search unit 205 with a Boolean search of the category names, the relevance ratio still decreases and the recall still improves in order from the first search unit 202 to the fourth search unit 205, so the same effect as the invention according to the above-described embodiment can be obtained.
[0097]
Note that the search methods of the first search unit 202 to the fourth search unit 205 do not have to be of the same type, and different search methods may be mixed, such as a search by the vector space method for one search unit and a Boolean search for another. In short, as long as the relevance ratios and recalls differ from one another, any search method may be used for each unit.
[0098]
Further, the search methods to be combined are not limited to the four in the above-described embodiment, and may be more or fewer. In particular, the search by the first search unit 202 is intended to obtain a narrower result than the search by the second search unit 203. However, if many keywords are common to the inquiry sentence and the documents in the KDB 202a, the number of documents retrieved and the number of Q documents associated with those documents also increase, so R1 may contain more documents than R2.
[0099]
Therefore, when a plurality of documents are retrieved from the KDB 202a, it is desirable to treat the first search unit 202 as having no search result (that is, to output nothing to the help screen display unit 206). In this case, substantially three types of searches are performed.
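The rule in the preceding paragraph can be illustrated with a short sketch. The function name `first_search_output` and the mapping `q_ids_by_kdb_doc` (from the ID of a document in the KDB to the IDs of its associated Q documents) are assumptions introduced only for this illustration.

```python
def first_search_output(kdb_hits, q_ids_by_kdb_doc):
    """Return the Q document IDs to pass on as R1, or [] when R1 is suppressed.

    kdb_hits: IDs of the documents retrieved from the KDB for the inquiry sentence.
    q_ids_by_kdb_doc: hypothetical mapping from a KDB document ID to the IDs
        of the Q documents associated with it.
    """
    # With zero hits there is nothing to output; with two or more hits the
    # result is treated as empty so that R1 does not grow larger than R2.
    if len(kdb_hits) != 1:
        return []
    return list(q_ids_by_kdb_doc.get(kdb_hits[0], []))


mapping = {"K1": ["Q001", "Q004"], "K2": ["Q002"]}
single = first_search_output(["K1"], mapping)
multiple = first_search_output(["K1", "K2"], mapping)
```

When `multiple` is empty, only the remaining search units contribute to the merged list, which matches the statement that substantially three types of searches are performed.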
[0100]
In the embodiment described above, the search target is the help document, but the search target is not limited to this. For example, where there are a database storing a summary of each document and a database storing the full text, the invention can be carried out by combining a plurality of searches having different relevance rates and recall rates, such as a search over the summaries only and a search over the summaries plus the full text.
[0101]
Note that the functions of the inquiry sentence analysis unit 200, the help document management unit 201, the first search unit 202, the second search unit 203, the third search unit 204, the fourth search unit 205, and the help screen display unit 206 described above are each realized by the CPU 101 executing processing according to the instructions of a program read from the HD 105 or the like into the RAM 103. In addition to the HD 105, this program can be stored in various recording media such as the FD 107, the CD-ROM 113, and an MO, or can be distributed via a network.
[0102]
[Effects of the invention]
As explained above, the invention includes: first search means for searching for a document semantically similar to an input natural sentence; second search means for searching for a document semantically similar to the natural sentence; determination means for determining whether the same information is included in duplicate in both the information that can identify the documents retrieved by the first search means and the information that can identify the documents retrieved by the second search means; deletion means for deleting, when the determination means determines that the same information is included in duplicate, that information from the information that can identify the documents retrieved by the second search means; combining means for combining, following the information that can identify the documents retrieved by the first search means, the information that can identify the documents retrieved by the second search means from which the duplicate information has been deleted by the deletion means; and display means for displaying the text of each document identified by each piece of information in the order in which the information was combined by the combining means. Consequently, documents retrieved by various methods with different relevance rates and recall rates are listed with those retrieved by a method with a relatively high relevance rate higher in the list and those retrieved by a method with a relatively high recall rate lower in the list. This makes it possible to achieve both a high relevance rate and a high recall rate in document search, and provides a document search apparatus that can present the many documents in a search result to the operator in an order in which the target document is easy to find.
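The determination, deletion, and combining described above can be sketched as follows. This is only an illustration under stated assumptions: document IDs are taken as the "information that can identify the document", and the function names `combine` and `merge_all` are hypothetical.

```python
def combine(first_ids, second_ids):
    """Append the second result to the first, dropping IDs already in the first.

    Determination: check whether an ID of the second result also occurs in the first.
    Deletion: remove such IDs from the second result only.
    Combining: place what remains of the second result after the first result.
    """
    seen = set(first_ids)
    deduped_second = [doc_id for doc_id in second_ids if doc_id not in seen]
    return list(first_ids) + deduped_second


def merge_all(result_lists):
    """Fold any number of results, given in descending order of relevance rate."""
    merged = []
    for ids in result_lists:
        merged = combine(merged, ids)
    return merged


r1 = ["Q3"]
r2 = ["Q1", "Q3", "Q5"]
r3 = ["Q5", "Q7", "Q1"]
two_way = combine(r1, r2)
three_way = merge_all([r1, r2, r3])
```

Because duplicates are always deleted from the later result, each document keeps the position given by the search with the highest relevance rate that found it.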
[0103]
Also, in the above invention, the display means displays the text of the documents retrieved by the first search means and the text of the documents retrieved by the second search means in different display forms. Documents retrieved by various methods with different relevance rates and recall rates are therefore listed with a clear indication, by a distinction such as display color, of which method retrieved each document. This provides a document search apparatus that can present the many documents in a search result to the operator in a display form in which the target document is easy to find.
[0104]
Also, in the above invention, the first search means retrieves, as documents semantically similar to the natural sentence, documents associated in advance with documents whose text contains vocabulary in common with the natural sentence, and the second search means retrieves, as documents semantically similar to the natural sentence, documents whose tendency of vocabulary appearing in the text is similar to that of the natural sentence. Consequently, in the search result list, documents retrieved by the first search means, which has a relatively high relevance rate, are displayed at the top, documents retrieved by the second search means, which has a relatively high recall rate, are displayed at the bottom, and which means retrieved each document is clearly indicated by a distinction such as display color. This makes it possible to achieve both a high relevance rate and a high recall rate in document search, and provides a document search apparatus that can present the many documents in a search result to the operator in an order and display form in which the target document is easy to find.
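A search of the second kind, which compares the tendency of vocabulary in each document with that of the natural sentence, might be sketched as below. The sketch assumes the simplest feature vector mentioned in the prior-art discussion (raw keyword frequency) and uses cosine similarity as the measure; the choice of cosine similarity, the threshold value `0.3`, and the names `feature_vector`, `cosine_similarity`, and `second_search` are assumptions for illustration, not details taken from the embodiment.

```python
import math
from collections import Counter


def feature_vector(words):
    """Simplest feature vector: the frequency of each keyword."""
    return Counter(words)


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse frequency vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def second_search(query_words, documents, threshold=0.3):
    """Return IDs of documents whose similarity to the query is at least
    the threshold (a hypothetical value), most similar first."""
    query_vec = feature_vector(query_words)
    scored = []
    for doc_id, words in documents.items():
        score = cosine_similarity(query_vec, feature_vector(words))
        if score >= threshold:
            scored.append((score, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


docs = {
    "Q1": ["print", "photo"],
    "Q2": ["import", "camera", "photo"],
    "Q3": ["install", "driver"],
}
hits = second_search(["photo", "print"], docs)
```

Unlike a keyword lookup, a document sharing only part of the query vocabulary (here `Q2`) is still returned as long as its overall similarity clears the threshold, which is why a search of this kind tends toward a higher recall rate.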
[0105]
Also, in the above invention, the first search means retrieves, as documents semantically similar to the natural sentence, documents whose tendency of vocabulary appearing in the text is similar to that of the natural sentence, and the second search means retrieves, as documents semantically similar to the natural sentence, both documents whose tendency of vocabulary appearing in the text is similar to that of the natural sentence and documents for which the tendency of vocabulary appearing in the text of a document associated with them in advance is similar to that of the natural sentence. Consequently, in the search result list, documents retrieved by the first search means, which has a relatively high relevance rate, are displayed at the top, documents retrieved by the second search means, which has a relatively high recall rate, are displayed at the bottom, and which means retrieved each document is clearly indicated by a distinction such as display color. This makes it possible to achieve both a high relevance rate and a high recall rate in document search, and provides a document search apparatus that can present the many documents in a search result to the operator in an order and display form in which the target document is easy to find.
[0106]
Also, the above invention further includes third search means for searching for categories into which documents semantically similar to the natural sentence are classified, and second display means for displaying the names of the categories retrieved by the third search means. Consequently, documents retrieved by various methods with different relevance rates and recall rates are listed with those retrieved by a method with a relatively high relevance rate higher in the list and those retrieved by a method with a relatively high recall rate lower in the list, and which means retrieved each document is clearly indicated by a distinction such as display color. This makes it possible to achieve both a high relevance rate and a high recall rate in document search, and provides a document search apparatus that can present the many documents in a search result to the operator in an order and display form in which the target document is easy to find.
[0107]
Also, the invention includes: a first search step of searching for a document semantically similar to an input natural sentence; a second search step of searching for a document semantically similar to the natural sentence; a determination step of determining whether the same information is included in duplicate in both the information that can identify the documents retrieved in the first search step and the information that can identify the documents retrieved in the second search step; a deletion step of deleting, when it is determined in the determination step that the same information is included in duplicate, that information from the information that can identify the documents retrieved in the second search step; a combining step of combining, following the information that can identify the documents retrieved in the first search step, the information that can identify the documents retrieved in the second search step from which the duplicate information has been deleted in the deletion step; and a display step of displaying the text of each document identified by each piece of information in the order in which the information was combined in the combining step. Consequently, documents retrieved by various methods with different relevance rates and recall rates are listed with those retrieved by a method with a relatively high relevance rate higher in the list and those retrieved by a method with a relatively high recall rate lower in the list. This makes it possible to achieve both a high relevance rate and a high recall rate in document search, and provides a document search method that can present the many documents in a search result to the operator in an order in which the target document is easy to find.
[0108]
Also, according to the invention, it is possible to obtain a program capable of causing a computer to execute the above method.
[Brief description of the drawings]
FIG. 1 is an explanatory diagram showing a hardware configuration of a document search apparatus according to an embodiment of the present invention.
FIG. 2 is an explanatory diagram showing a functional configuration of a document search apparatus according to an embodiment of the present invention.
FIG. 3 is an explanatory diagram schematically showing the structure of data held in each database in the help document management unit 201.
FIG. 4 is an explanatory diagram schematically showing the structure of data held in the KDB 202a in the first search unit 202.
FIG. 5 is an explanatory diagram schematically showing a classification system in which Q documents in the Q document DB 201a are classified.
FIG. 6 is a flowchart showing a preparation procedure for document search in the document search apparatus according to the embodiment of the present invention.
FIG. 7 is a flowchart showing a procedure of document search and search result display in the document search apparatus according to the embodiment of the present invention.
FIG. 8 is an explanatory diagram showing an example of a help screen (before Q document search) displayed by the help screen display unit 206.
FIG. 9 is an explanatory diagram showing a specific example of each search result by the first search unit 202 to the third search unit 204 and a merged result thereof.
FIG. 10 is an explanatory diagram showing another example of the help screen displayed by the help screen display unit 206 (after Q document search).
FIG. 11 is an explanatory diagram showing another example of the help screen displayed by the help screen display unit 206 (when A document is displayed).
[Explanation of symbols]
100 bus or cable
101 CPU
102 ROM
103 RAM
104 HDD
105 HD
106 FDD
107 FD
108 display
109 Network I/F
110 Communication cable
111 keyboard
112 mouse
113 CD-ROM
114 CD-ROM drive
200 Inquiry sentence analysis unit
201 Help document management unit
201a Q document DB
201b A document DB
201c Q + A document DB
202 First search unit
202a KDB
203 Second search unit
204 Third search unit
205 Fourth search unit
205a CDB
206 Help screen display unit

Claims

A document search device comprising:
storage means for storing documents;
first search means for searching the storage means for a document semantically similar to an input natural sentence;
second search means for searching the storage means for a document semantically similar to the natural sentence;
determination means for determining whether the same information is included in duplicate in both the information that can identify the document retrieved by the first search means and the information that can identify the document retrieved by the second search means;
deletion means for deleting, when the determination means determines that the same information is included in duplicate in both the information that can identify the document retrieved by the first search means and the information that can identify the document retrieved by the second search means, the same information from the information that can identify the document retrieved by the second search means;
combining means for combining, following the information that can identify the document retrieved by the first search means, the information that can identify the document retrieved by the second search means from which the same information has been deleted by the deletion means; and
display means for displaying the text of each document identified by each piece of information in the order in which the information is combined by the combining means,
wherein the documents include a first document and a second document associated with the first document,
the first search means searches for a first document whose vocabulary is in common with the natural sentence, and retrieves the second document associated with the retrieved first document as a document semantically similar to the natural sentence, and
the second search means calculates the similarity between the vocabulary of the second document and the vocabulary of the natural sentence, and retrieves a second document whose similarity is greater than or equal to a predetermined value as a document semantically similar to the natural sentence.

The document search device according to claim 1, wherein the display means displays the text of the document retrieved by the first search means and the text of the document retrieved by the second search means in different display forms.

The document search device according to claim 1 or 2, wherein the documents are classified in advance by category and stored in the storage means, the device further comprising:
third search means for searching for a category into which documents semantically similar to the natural sentence are classified; and
second display means for displaying the name of the category retrieved by the third search means.

A document search method using a computer that comprises storage means for storing a document composed of a first document and a second document associated with the first document, first search means, second search means, determination means, deletion means, combining means, and display means, the method including:
a first search step in which the first search means searches the storage means for a document semantically similar to an input natural sentence;
a second search step in which the second search means searches the storage means for a document semantically similar to the natural sentence;
a determination step in which the determination means determines whether the same information is included in duplicate in both the information that can identify the document retrieved in the first search step and the information that can identify the document retrieved in the second search step;
a deletion step of deleting, when it is determined in the determination step that the same information is included in duplicate in both the information that can identify the document retrieved in the first search step and the information that can identify the document retrieved in the second search step, the same information from the information that can identify the document retrieved in the second search step;
a combining step in which the combining means combines, following the information that can identify the document retrieved in the first search step, the information that can identify the document retrieved in the second search step from which the same information has been deleted in the deletion step; and
a display step in which the display means displays the text of each document identified by each piece of information in the order in which the information is combined in the combining step,
wherein the first search step searches for a first document whose vocabulary is in common with the natural sentence, and retrieves the second document associated with the retrieved first document as a document semantically similar to the natural sentence, and
the second search step calculates the similarity between the vocabulary of the second document and the vocabulary of the natural sentence, and retrieves a second document whose similarity is greater than or equal to a predetermined value as a document semantically similar to the natural sentence.

A document search program to be executed by a computer that comprises storage means for storing a document composed of a first document and a second document associated with the first document, first search means, second search means, determination means, deletion means, combining means, and display means, the program causing the computer to execute:
a first search step in which the first search means searches the storage means for a document semantically similar to an input natural sentence;
a second search step in which the second search means searches the storage means for a document semantically similar to the natural sentence;
a determination step in which the determination means determines whether the same information is included in duplicate in both the information that can identify the document retrieved in the first search step and the information that can identify the document retrieved in the second search step;
a deletion step of deleting, when it is determined in the determination step that the same information is included in duplicate in both the information that can identify the document retrieved in the first search step and the information that can identify the document retrieved in the second search step, the same information from the information that can identify the document retrieved in the second search step;
a combining step in which the combining means combines, following the information that can identify the document retrieved in the first search step, the information that can identify the document retrieved in the second search step from which the same information has been deleted in the deletion step; and
a display step in which the display means displays the text of each document identified by each piece of information in the order in which the information is combined in the combining step,
wherein the first search step searches for a first document whose vocabulary is in common with the natural sentence, and retrieves the second document associated with the retrieved first document as a document semantically similar to the natural sentence, and
the second search step calculates the similarity between the vocabulary of the second document and the vocabulary of the natural sentence, and retrieves a second document whose similarity is greater than or equal to a predetermined value as a document semantically similar to the natural sentence.