JP2004054588A

JP2004054588A - Document retrieval device and method and program for making computer execute the same method

Info

Publication number: JP2004054588A
Application number: JP2002211112A
Authority: JP
Inventors: Tatsuo Kato; 加藤　竜雄; Sumio Fujita; 藤田　澄男
Original assignee: JustSystems Corp
Current assignee: JustSystems Corp
Priority date: 2002-07-19
Filing date: 2002-07-19
Publication date: 2004-02-19

Abstract

PROBLEM TO BE SOLVED: To preferentially present to a searcher the documents that are likely to be important to that searcher. SOLUTION: A matching level calculating part 204a in a retrieval executing part 204 uses the vector space method to calculate how well each document (home page) stored in a collected document storing part 202 matches a retrieval condition received from a request input accepting part 203. A matching level correcting part 204b then corrects the matching level calculated by the matching level calculating part 204a, taking into account attributes of each document such as its URL, the hierarchy level at which it is located, its title and the emphasized character strings in its text, the frequency with which it is referenced from external pages, the frequency with which it references internal pages, and the anchor text at the source of links to it. For example, the calculated matching level is raised at a higher rate for a document that links to more of the other documents on the same server, or for a document whose link-source anchor text contains a keyword identical or similar to the retrieval condition. COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
この発明は、複数の電子文書を各文書の検索条件に対する合致度にもとづいて順位づけする文書検索装置、文書検索方法およびその方法をコンピュータに実行させるプログラムに関する。
【０００２】
【従来の技術】
インターネットに存在する大量の文書の中から、何らかのテーマに関連する文書を取り出したい場合、検索者は「Ｙａｈｏｏ！」「Ｇｏｏｇｌｅ」などといった汎用のサーチエンジン、あるいは特定のテーマに特化した専門のサーチエンジンなどを利用するのが通例であった。これらのサーチエンジンにアクセスして、任意のキーワードを入力すると、当該キーワードを含む文書のタイトルなどを漏れなく画面表示させることができる。
【０００３】
【発明が解決しようとする課題】
しかしながら、上記のようにして得られる検索結果一覧は非常に網羅的であるため、ヒットした文書数が多いと、一覧中から目的の文書を検索するにも時間や労力を要してしまうという問題点があった。結果一覧中での文書の配列（順序）が検索者にとっての各文書の重要度を反映していないために、目的の文書に到達するのに手間がかかってしまうと言ってもよい。
【０００４】
この発明は上記従来技術による問題を解決するため、検索者にとって重要である可能性の高い文書を優先的に検索者に提示することが可能な文書検索装置、文書検索方法およびその方法をコンピュータに実行させるプログラムを提供することを目的とする。
【０００５】
【課題を解決するための手段】
上述した課題を解決し、目的を達成するため、請求項１に記載の発明にかかる文書検索装置は、複数の電子文書を各文書の検索条件に対する合致度にもとづいて順位づけする文書検索装置において、前記各文書の属性情報を抽出する属性情報抽出手段と、前記各文書の本文の前記検索条件に対する合致度を算出する合致度算出手段と、前記合致度算出手段により算出された合致度を、前記属性情報抽出手段により抽出された各文書の属性情報にもとづいて補正する合致度補正手段と、を備えたことを特徴とする。
【０００６】
この請求項１に記載の発明によれば、本文だけでは優劣のつかない文書でも、その属性情報にもとづいていずれかが相対的に上位、いずれかが相対的に下位に順位づけされる。
【０００７】
また、請求項２に記載の発明にかかる文書検索装置は、前記請求項１に記載の発明において、前記属性情報抽出手段が、前記各文書のＵＲＬを抽出するとともに、前記合致度補正手段が、前記合致度算出手段により算出された合致度を、前記属性情報抽出手段により抽出された当該文書のＵＲＬの前記検索条件に対する合致度にもとづいて補正することを特徴とする。
【０００８】
この請求項２に記載の発明によれば、本文だけでは優劣のつかない文書でも、そのＵＲＬがより検索条件に類似する文書のほうが、相対的に上位に順位づけされる。
【０００９】
また、請求項３に記載の発明にかかる文書検索装置は、前記請求項１に記載の発明において、前記属性情報抽出手段が、前記各文書を保持する情報処理装置内での当該文書の位置する階層を抽出するとともに、前記合致度補正手段が、前記合致度算出手段により算出された合致度を、前記属性情報抽出手段により抽出された当該文書の位置する階層にもとづいて補正することを特徴とする。
【００１０】
この請求項３に記載の発明によれば、本文だけでは優劣のつかない文書でも、サーバ上で位置する階層のより浅い（ルートに近い）文書のほうが、相対的に上位に順位づけされる。
【００１１】
また、請求項４に記載の発明にかかる文書検索装置は、前記請求項１に記載の発明において、前記属性情報抽出手段が、前記各文書のタイトルまたは前記各文書中の強調文字列を抽出するとともに、前記合致度補正手段が、前記合致度算出手段により算出された合致度を、前記属性情報抽出手段により抽出された当該文書のタイトルまたは当該文書中の強調文字列の前記検索条件に対する合致度にもとづいて補正することを特徴とする。
【００１２】
この請求項４に記載の発明によれば、本文だけでは優劣のつかない文書でも、そのタイトルまたは文中の強調文字列がより検索条件に類似する文書のほうが、相対的に上位に順位づけされる。
【００１３】
また、請求項５に記載の発明にかかる文書検索装置は、前記請求項１に記載の発明において、前記属性情報抽出手段が、前記各文書が当該文書を保持する情報処理装置内の他の文書にリンクする頻度を抽出するとともに、前記合致度補正手段が、前記合致度算出手段により算出された合致度を、前記属性情報抽出手段により抽出された頻度にもとづいて補正することを特徴とする。
【００１４】
この請求項５に記載の発明によれば、本文だけでは優劣のつかない文書でも、同一サーバ内の他の文書により多くリンクする文書のほうが、相対的に上位に順位づけされる。
【００１５】
また、請求項６に記載の発明にかかる文書検索装置は、前記請求項１に記載の発明において、前記属性情報抽出手段は、前記各文書にリンクする他の文書中で当該リンクが埋め込まれている文字列を抽出するとともに、前記合致度補正手段は、前記合致度算出手段により算出された合致度を、前記属性情報抽出手段により抽出された文字列の前記検索条件に対する合致度にもとづいて補正することを特徴とする。
【００１６】
この請求項６に記載の発明によれば、本文だけでは優劣のつかない文書でも、他の文書に埋め込まれているアンカーテキストがより検索条件に類似する文書のほうが、相対的に上位に順位づけされる。
【００１７】
また、請求項７に記載の発明にかかる文書検索方法は、複数の電子文書を各文書の検索条件に対する合致度にもとづいて順位づけする文書検索方法において、前記各文書の属性情報を抽出する属性情報抽出工程と、前記各文書の本文の前記検索条件に対する合致度を算出する合致度算出工程と、前記合致度算出工程で算出された合致度を、前記属性情報抽出工程で抽出された各文書の属性情報にもとづいて補正する合致度補正工程と、を含んだことを特徴とする。
【００１８】
この請求項７に記載の発明によれば、本文だけでは優劣のつかない文書でも、その属性情報にもとづいていずれかが相対的に上位、いずれかが相対的に下位に順位づけされる。
【００１９】
また、請求項８に記載の発明にかかるプログラムによれば、前記請求項７に記載された方法がコンピュータによって実行される。
【００２０】
【発明の実施の形態】
以下に添付図面を参照して、この発明による文書検索装置、文書検索方法およびその方法をコンピュータに実行させるプログラムの好適な実施の形態を詳細に説明する。
【００２１】
図１は、この発明の実施の形態による文書検索装置のハードウエア構成を示す説明図である。同図において、１０１は装置全体を制御するＣＰＵを、１０２は基本入出力プログラムを記憶したＲＯＭを、１０３はＣＰＵ１０１のワークエリアとして使用されるＲＡＭを、それぞれ示している。
【００２２】
また、１０４はＣＰＵ１０１の制御にしたがってＨＤ（ハードディスク）１０５に対するデータのリード／ライトを制御するＨＤＤ（ハードディスクドライブ）を、１０５はＨＤＤ１０４の制御にしたがって書き込まれたデータを記憶するＨＤを、それぞれ示している。
【００２３】
また、１０６はＣＰＵ１０１の制御にしたがってＦＤ（フレキシブルディスク）１０７に対するデータのリード／ライトを制御するＦＤＤ（フレキシブルディスクドライブ）を、１０７はＦＤＤ１０６の制御にしたがって書き込まれたデータを記憶する着脱自在のＦＤを、それぞれ示している。
【００２４】
また、１０８はカーソル、メニュー、ウィンドウ、あるいは文字や画像などの各種データを表示するディスプレイを、１０９は通信回線１１０を介してインターネットに接続され、当該ネットワークとＣＰＵ１０１とのインターフェースとして機能するネットワークＩ／Ｆを、それぞれ示している。
【００２５】
また、１１１は文字、数値、各種指示などの入力のための複数のキーを備えたキーボードを、１１２は各種指示の選択や実行、処理対象の選択、マウスポインタの移動などをおこなうマウスを、それぞれ示している。また、１１３は着脱可能な記録媒体であるＣＤ−ＲＷを、１１４はＣＤ−ＲＷ１１３に対するデータのリードを制御するＣＤ−ＲＷドライブを、１００は上記各部を接続するためのバスまたはケーブルを、それぞれ示している。
【００２６】
つぎに、図２はこの発明の実施の形態による文書検索装置の構成を機能的に示す説明図である。まず、２００は文書収集部であり、インターネットを定期的に巡回して、そこで公開されているホームページ（厳密にはホームページを構成する、ＨＴＭＬファイルやＧＩＦファイルなどの各種ファイル）を収集する。そして、収集した文書を後述する収集文書解析部２０１に引き渡す。
【００２７】
２０１は収集文書解析部であり、文書収集部２００から引き渡された個々の文書を解析して、まず、キーワードから文書を検索するための転置ファイルを作成する。この転置ファイルとは、概念的には収集された全文書を行、当該文書群に出現する全キーワードを列とし、行と列の交点に各文書における各キーワードの出現有無あるいは出現頻度などを記録したテーブルである。
【００２８】
また、収集文書解析部２０１は転置ファイルの作成とともに、個々の文書につき以下に掲げるような属性情報を抽出して、後述する収集文書保存部２０２に引き渡す。
【００２９】
（１）ＵＲＬＴｅｘｔ
文書を一意に特定するＵＲＬ、たとえば「ｈｔｔｐ：／／ｗｗｗ．ｊｕｓｔｓｙｓｔｅｍ．ｃｏ．ｊｐ／ｉｎｄｅｘ．ｈｔｍｌ」などである。
【００３０】
（２）ＵＲＬＬｅｎｇｔｈ
文書がサーバ上で位置する階層の深さである（階層の深さはＵＲＬの長さとして表れる）。たとえばＵＲＬが「ｈｔｔｐ：／／ｗｗｗ．ｊｕｓｔｓｙｓｔｅｍ．ｃｏ．ｊｐ／ｉｎｄｅｘ．ｈｔｍｌ」である文書ＡのＵＲＬＬｅｎｇｔｈ＝０（ルート）、「ｈｔｔｐ：／／ｗｗｗ．ｊｕｓｔｓｙｓｔｅｍ．ｃｏ．ｊｐ／ｎｅｗｓ／２００２０６０１．ｈｔｍｌ」である文書ＢのＵＲＬＬｅｎｇｔｈ＝１である。
【００３１】
（３）Ｔｉｔｌｅ
文書のタイトル、すなわちその＜ｔｉｔｌｅ＞タグ内に記述された全文字列である。
【００３２】
（４）ＬａｒｇｅＦｏｎｔｓ
文書内で強調されているすべての文字列である。どのような態様を「強調」とみなすかは任意であるが、たとえば文書内の他の文字列より大きい文字列（＜ｆｏｎｔ　ｓｉｚｅ＝”＋１”＞タグ内などに記述された文字列）、他の文字列と色の異なる文字列（＜ｆｏｎｔ　ｃｏｌｏｒ＝”＃ＦＦ００００”＞タグ内などの文字列）、太字や斜字など他の文字列と字体の異なる文字列（＜ｂ＞タグ内や＜ｉ＞タグ内の文字列）などを強調文字列とみなして、ＬａｒｇｅＦｏｎｔｓ内に格納する。
【００３３】
なお、ＬａｒｇｅＦｏｎｔｓには文字列そのものでなく、文書内の何文字目から何文字目までというような強調文字列の位置情報を格納するようにしてもよい。
【００３４】
（５）ＩｎｔｅｒＳｅｒｖｅｒＬｉｎｋｅｄ
文書が、当該文書の存在するサーバ以外の他のサーバの文書（外部ページ）からリンク（参照）されている頻度である。たとえば図３に示すように、サーバ１上の文書Ａへジャンプするハイパーリンクがサーバ１上の文書Ｂ、サーバ２上の文書Ｃおよびサーバ３上の文書Ｄにそれぞれ埋め込まれており、文書Ｂ〜Ｄ以外に文書Ａを参照する文書がなかったとすると、文書ＡのＩｎｔｅｒＳｅｒｖｅｒＬｉｎｋｅｄ＝２（＝１＋１）である。
【００３５】
（６）ＩｎｎｅｒＳｅｒｖｅｒＬｉｎｋｅｒ
文書が、当該文書の存在するサーバ内の他の文書（内部ページ）にリンクしている頻度である。たとえば図３に示すように、サーバ１上の文書Ａがサーバ１上の文書Ｂ、サーバ２上の文書Ｃにジャンプするハイパーリンクをそれぞれ一つずつ含んでいたとすると、文書ＡのＩｎｎｅｒＳｅｒｖｅｒＬｉｎｋｅｒ＝１である。
【００３６】
（７）ＩｎｔｅｒＳｅｒｖｅｒＡｎｃｈｏｒ
文書が、当該文書の存在するサーバ以外の他のサーバの文書からリンクされている場合に、当該リンクの埋め込まれている文字列である。たとえばサーバ１上の文書Ａに対するハイパーリンクが、サーバ２上の文書Ｃに「＜ａ　ｈｒｅｆ＝”ｈｔｔｐ：／／ｗｗｗ．ｊｕｓｔｓｙｓｔｅｍ．ｃｏ．ｊｐ／ｉｎｄｅｘ．ｈｔｍｌ”＞株式会社ジャストシステムのホームページへ＜／ａ＞」のような形で埋め込まれ、またサーバ３上の文書Ｄに「＜ａ　ｈｒｅｆ＝”ｈｔｔｐ：／／ｗｗｗ．ｊｕｓｔｓｙｓｔｅｍ．ｃｏ．ｊｐ／ｉｎｄｅｘ．ｈｔｍｌ”＞一太郎についてはこちら＜／ａ＞」のような形で埋め込まれており、文書Ｃ・Ｄ以外に文書Ａを参照する外部ページがない場合に、文書Ａ（文書Ｃ・Ｄではない）のＩｎｔｅｒＳｅｒｖｅｒＡｎｃｈｏｒ＝”株式会社ジャストシステムのホームページへ””一太郎についてはこちら”、である。
【００３７】
（８）ＩｎｎｅｒＳｅｒｖｅｒＡｎｃｈｏｒ
文書が、当該文書の存在するサーバ内の他の文書からリンクされている場合に、当該リンクの埋め込まれている文字列である。たとえばサーバ１上の文書Ａに対するハイパーリンクが、同じサーバ１上の文書Ｂに「＜ａ　ｈｒｅｆ＝”ｈｔｔｐ：／／ｗｗｗ．ｊｕｓｔｓｙｓｔｅｍ．ｃｏ．ｊｐ／ｉｎｄｅｘ．ｈｔｍｌ”＞ホームへ戻る＜／ａ＞」のような形で埋め込まれており、サーバ１上に文書Ｂ以外に文書Ａを参照する文書がない場合に、文書Ａ（文書Ｂではない）のＩｎｎｅｒＳｅｒｖｅｒＡｎｃｈｏｒ＝”ホームへ戻る”、である。
【００３８】
図２の機能部の説明に戻り、つぎに２０２は収集文書保存部であり、収集文書解析部２０１から引き渡された各文書の本体と、各文書の属性情報（上述の（１）〜（８）の属性値）および上述の転置ファイルを保持している。
【００３９】
２０３は要求入力受付部であり、操作者からの検索要求の入力を受け付けて、後述する検索実行部２０４に引き渡す機能部である。検索要求には少なくとも一つのキーワードと、キーワード間を結合するＡＮＤやＯＲなどの検索条件が含まれている。なお、自然文により検索要求を入力させ、そこから形態素解析や構文解析によりキーワードを抽出して検索実行部２０４に引き渡すようにしてもよい。
【００４０】
２０４は検索実行部であり、要求入力受付部２０３から引き渡された検索条件に対する、収集文書保存部２０２内の各文書の合致度（検索条件と各文書との類似度、と言ってもよい）を順次算出する合致度算出部２０４ａ、および合致度算出部２０４ａにより算出された合致度を、上述した各文書の属性情報に鑑みて補正する合致度補正部２０４ｂを含む構成である。
【００４１】
合致度算出部２０４ａは、一般に「ベクトル空間法」と呼ばれる手法により各文書の合致度を算出する。ベクトル空間法では、検索条件中に含まれるキーワードの出現有無あるいは出現頻度などを要素値とするベクトル（クエリーベクトル）を作成するとともに、上述の転置ファイル中の各レコードにより各文書の文書ベクトルを作成する。そして、クエリーベクトルと各文書の文書ベクトルとの距離（コサイン距離）を順次算出し、当該距離が小さいほど大きく、当該距離が大きいほど小さくなるように合致度のスコアを算出する。このスコアにより、各文書を検索条件との合致度の順に順位づけすることができる。
【００４２】
ただし、本発明では上記のようにして算出された各文書の合致度を、各文書について抽出された属性情報にもとづいて補正する。具体的には下記の通りである。
【００４３】
（１）「ＵＲＬＴｅｘｔ」すなわちＵＲＬの検索条件に対する合致度が高い文書ほど、検索条件に対するその合致度を高い割合で水増しする。ＵＲＬはそのホームページに関係する企業その他の団体の名称、ページの主題などを含むことが多く、これらのキーワードは検索条件としても使用される頻度が高い。そこで、たとえばＵＲＬの合致度が８０％である文書Ａの検索条件に対する合致度は、合致度算出部２０４ａにより算出された合致度の１８０％（８０％を加算）、ＵＲＬの合致度が２０％である文書Ｂの検索条件に対する合致度は、合致度算出部２０４ａにより算出された合致度の１２０％（２０％を加算）、というように、ＵＲＬの合致度に応じた補正をおこなう。
【００４４】
なお、ここでは一例として「補正後の合致度＝補正前の合致度＋（補正前の合致度×ＵＲＬの合致度）」という計算式を使用したが、ＵＲＬに検索条件と同一または類似の文字列が多く含まれるほど文書の合致度が高めに補正されるのであれば、計算式は任意のものであってよい。
【００４５】
（２）「ＵＲＬＬｅｎｇｔｈ」すなわち文書の位置する階層によって、より上位の階層（よりルートに近い階層）の文書ほど、検索条件に対するその合致度を高い割合で水増しする。検索者が探しているのはトップページや、トップページに近いページであることが多く、これらのページは通常は上位の階層に置かれている。そこで、たとえばＵＲＬＬｅｎｇｔｈ＝０の文書Ａの検索条件に対する合致度は、合致度算出部２０４ａにより算出された合致度の２００％、ＵＲＬＬｅｎｇｔｈ＝１の文書Ｂの検索条件に対する合致度は、合致度算出部２０４ａにより算出された合致度の１９０％、というように、文書の位置する階層に応じた補正をおこなう。
【００４６】
なお、ここでは一例として「補正後の合致度＝補正前の合致度＋｛補正前の合致度×（１−ＵＲＬＬｅｎｇｔｈ×０．１）｝」という計算式を使用したが、位置する階層の浅い文書ほどその合致度が高めに補正されるのであれば、計算式は任意のものであってよい。
【００４７】
（３）「Ｔｉｔｌｅ」すなわち文書のタイトルの、検索条件に対する合致度が高い文書ほど、検索条件に対するその合致度を高い割合で水増しする。タイトルはＵＲＬと同じく、検索条件として使用されるキーワードを含む可能性が高い。そこで、たとえばタイトルの検索条件に対する合致度が８０％である文書Ａの、検索条件に対する合致度は、合致度算出部２０４ａにより算出された合致度の１８０％、タイトルの合致度が２０％である文書Ｂの検索条件に対する合致度は、合致度算出部２０４ａにより算出された合致度の１２０％、というように、タイトルの合致度に応じた補正をおこなう。
【００４８】
なお、ここでは一例として「補正後の合致度＝補正前の合致度＋（補正前の合致度×タイトルの合致度）」という計算式を使用したが、タイトルに検索条件と同一または類似の文字列が多く含まれるほど文書の合致度が高めに補正されるのであれば、計算式は任意のものであってよい。また、上述のＵＲＬによる補正と同一の計算式である必要もない。
【００４９】
（４）「ＬａｒｇｅＦｏｎｔｓ」すなわち文書内の強調文字列の、検索条件に対する合致度が高い文書ほど、検索条件に対するその合致度を高い割合で水増しする。強調文字列はＵＲＬやタイトルと同じく、検索条件として使用されるキーワードを含む可能性が高い。そこで、たとえば強調文字列部分の検索条件に対する合致度が８０％である文書Ａの、検索条件に対する合致度は、合致度算出部２０４ａにより算出された合致度の１８０％、強調文字列部分の合致度が２０％である文書Ｂの検索条件に対する合致度は、合致度算出部２０４ａにより算出された合致度の１２０％、というように、強調文字列の合致度に応じた補正をおこなう。
【００５０】
なお、ここでは一例として「補正後の合致度＝補正前の合致度＋（補正前の合致度×強調文字列の合致度）」という計算式を使用したが、強調部分に検索条件と同一または類似の文字列が多く含まれるほど文書の合致度が高めに補正されるのであれば、計算式は任意のものであってよい。また、上述のＵＲＬやタイトルによる補正と同一の計算式である必要もない。
【００５１】
（５）「ＩｎｔｅｒＳｅｒｖｅｒＬｉｎｋｅｄ」すなわち他のサーバの文書からリンクされる頻度が高い文書ほど、検索条件に対するその合致度を高い割合で水増しする。多くの外部ページからリンクされている文書はそれだけ内容の充実した、客観的な評価の高い文書であって、検索者にとっても相対的に重要なものである可能性が高い。そこで、たとえばＩｎｔｅｒＳｅｒｖｅｒＬｉｎｋｅｄ＝２の文書Ａの検索条件に対する合致度は、合致度算出部２０４ａにより算出された合致度の１２０％というように、他文書からのリンク回数に応じた補正をおこなう。
【００５２】
なお、ここでは一例として「補正後の合致度＝補正前の合致度＋（補正前の合致度×被参照頻度×０．１）」という計算式を使用したが、多くの文書からリンクされるほど文書の合致度が高めに補正されるのであれば、計算式は任意のものであってよい。
【００５３】
（６）「ＩｎｎｅｒＳｅｒｖｅｒＬｉｎｋｅｒ」すなわち同一サーバ内の他の文書にリンクする頻度が高い文書ほど、検索条件に対するその合致度を高い割合で水増しする。多くの内部ページにリンクする文書はトップページや、少なくとも飛び先の内容を束ねたようなページであって、検索者にとっても相対的に重要なものである可能性が高い。そこで、たとえばＩｎｎｅｒＳｅｒｖｅｒＬｉｎｋｅｒ＝１の文書Ａの検索条件に対する合致度は、合致度算出部２０４ａにより算出された合致度の１１０％というように、他文書へのリンク頻度に応じた補正をおこなう。
【００５４】
なお、ここでは一例として「補正後の合致度＝補正前の合致度＋（補正前の合致度×参照頻度×０．１）」という計算式を使用したが、同一サーバ内の多くの文書にリンクするほど文書の合致度が高めに補正されるのであれば、計算式は任意のものであってよい。
【００５５】
（７）「ＩｎｔｅｒＳｅｒｖｅｒＡｎｃｈｏｒ」すなわちその文書へのリンクが埋め込まれた外部ページ中の文字列の、検索条件に対する合致度が高い文書ほど、検索条件に対するその合致度を高い割合で水増しする。飛び先のページの内容や性質を端的に表現するアンカーテキストは、上述のＵＲＬ、タイトルおよび強調文字列と同じく、検索条件として使用されるキーワードを含む可能性が高い。
そこで、たとえば外部ページ中のアンカーテキストの検索条件に対する合致度が８０％である文書Ａの、検索条件に対する合致度は、合致度算出部２０４ａにより算出された合致度の１８０％というように、アンカーテキストの合致度に応じた補正をおこなう。
【００５６】
なお、ここでは一例として「補正後の合致度＝補正前の合致度＋（補正前の合致度×ＩｎｔｅｒＳｅｒｖｅｒＡｎｃｈｏｒの合致度）」という計算式を使用したが、アンカーテキストに検索条件と同一または類似の文字列が多く含まれるほど文書の合致度が高めに補正されるのであれば、計算式は任意のものであってよい。また、上述のＵＲＬやタイトル、あるいは強調文字列による補正と同一の計算式である必要もない。
【００５７】
（８）「ＩｎｎｅｒＳｅｒｖｅｒＡｎｃｈｏｒ」すなわちその文書へのリンクが埋め込まれた内部ページ中の文字列の、検索条件に対する合致度が高い文書ほど、検索条件に対するその合致度を高い割合で水増しする。その理由と手順は上述のＩｎｔｅｒＳｅｒｖｅｒＡｎｃｈｏｒと同様であって、たとえば内部ページ中のアンカーテキストの検索条件に対する合致度が８０％である文書Ａの、検索条件に対する合致度は、合致度算出部２０４ａにより算出された合致度の１８０％というように、アンカーテキストの合致度に応じた補正をおこなう。
【００５８】
なお、ここでは一例として「補正後の合致度＝補正前の合致度＋（補正前の合致度×ＩｎｎｅｒＳｅｒｖｅｒＡｎｃｈｏｒの合致度）」という計算式を使用したが、アンカーテキストに検索条件と同一または類似の文字列が多く含まれるほど文書の合致度が高めに補正されるのであれば、計算式は任意のものであってよい。また、上述のＵＲＬ、タイトル、強調文字列あるいはＩｎｔｅｒＳｅｒｖｅｒＡｎｃｈｏｒによる補正と同一の計算式である必要もない。
【００５９】
合致度補正部２０４ｂは各文書につき、上記各属性を勘案した補正後の合致度を算出し、たとえばその平均値（単純平均あるいは加重平均）を取ることで、各文書の補正後の合致度を算出する。または、各属性による補正後の合致度を単純に、あるいは所定の重み付けのもとに合計して、各文書の補正後の合致度とするのであってもよい。その後、合致度補正部２０４ｂは合致度の順に文書を順位づけするとともに、合致度が閾値を超えた文書のタイトルおよび当該文書の合致度を後述する結果出力部２０５に引き渡す。
【００６０】
図２に戻り、つぎに２０５は結果出力部であり、検索実行部２０４から引き渡された文書、すなわち検索結果を検索結果一覧などとしてディスプレイ１０８に表示する。
【００６１】
つぎに、図４はこの発明の実施の形態による文書検索装置における、属性情報抽出処理の手順を示すフローチャートである。この処理は後述する文書検索処理の前準備として、あらかじめ指定された時期に定期的に実行されるものである。
【００６２】
文書収集部２００は指定された時期になると、インターネットを巡回して文書収集をおこない（ステップＳ４０１）、所定の終了条件、たとえばＮ個先のリンクまで辿り切ったなどの条件が満たされると（ステップＳ４０２：Ｙｅｓ）、収集した文書を収集文書解析部２０１に引き渡す。
【００６３】
収集文書解析部２０１は引き渡された文書から、上述の各属性情報を抽出のうえ（ステップＳ４０３）、文書と各属性情報との対応表、および文書とキーワードとの対応表である転置ファイルなどの各種データベースを作成する（ステップＳ４０４）。
【００６４】
つぎに、図５はこの発明の実施の形態による文書検索装置における、文書検索処理の手順を示すフローチャートである。
【００６５】
要求入力受付部２０３は、検索者による検索要求の入力を待ち受けて（ステップＳ５０１）、その入力があると（ステップＳ５０２：Ｙｅｓ）、当該検索要求を検索実行部２０４に引き渡す。
【００６６】
検索実行部２０４はまず入力した検索要求中で、絞り込み検索が指定されているかどうかを判定する（ステップＳ５０３）。絞り込み検索とは上述したベクトル空間法による文書検索の前に、検索対象となる文書をあらかじめ絞り込む処理であって、たとえば収集文書保存部２０２内の文書のうちある一定期間内に作成・更新された文書や、ある特定の筆者により作成された文書などに限って、合致度算出部２０４ａや合致度補正部２０４ｂによる合致度の計算をおこなわせることができる（それ以外の文書について計算を省略できるので処理速度が向上する）。
【００６７】
そして、絞り込み検索が指定されている場合は（ステップＳ５０３：Ｙｅｓ）、収集文書保存部２０２内の文書から指定された文書だけを抽出（絞り込み検索）のうえ（ステップＳ５０４）、それらの文書について合致度の算出および補正をおこなう（ステップＳ５０５）。一方、絞り込み検索が指定されていなければ（ステップＳ５０３：Ｎｏ）、収集文書保存部２０２内の全文書につき合致度の算出および補正をおこなう（ステップＳ５０５）。
【００６８】
その後、検索結果を引き渡された結果出力部２０５が検索結果一覧の表示をおこない（ステップＳ５０６）、検索者から検索終了が指示されない限り（ステップＳ５０７：Ｎｏ）、ステップＳ５０１に戻ってつぎの検索条件の入力を受け付ける。
【００６９】
以上説明した実施の形態によれば、検索対象文書の順位づけはその本文と検索条件との類似度だけでなく、ＵＲＬ、位置する階層、タイトルや本文中の強調文字列、外部ページからの被参照頻度や内部ページへの参照頻度、参照元のアンカーテキストなどといった種々の属性に鑑みて総合的に実施される。
【００７０】
実際には、上記で示した属性情報を基礎に上記のような方針で合致度の水増しをおこなうと、一つのサイトを構成する複数のホームページのうち、入り口ページあるいは代表ページとして機能する中核のホームページが、相対的に検索結果中の上位に現れやすくなる。そして、検索者が探しているのはサイト内のマイナーなページではなく、こうした中心的なページであることが多いので、上述のような合致度の補正により、本文だけでは合致度に優劣のつかない類似文書の中から、検索者にとって相対的に重要である可能性の高い文書を上位に抽出することができる。
【００７１】
なお、上述した文書収集部２００、収集文書解析部２０１、要求入力受付部２０３、検索実行部２０４および結果出力部２０５は、具体的にはＨＤ１０５からＲＡＭ１０３に読み出されたプログラムをＣＰＵ１０１が実行することにより実現されるものである。このプログラムはＨＤ１０５のほか、ＦＤ１０７、ＣＤ−ＲＷ１１３、ＭＯなどの各種の記録媒体に格納して配布することができ、ネットワークを介して配布することも可能である。また、収集文書保存部２０２はＨＤ１０５により実現される。
【００７２】
【発明の効果】
以上説明したように請求項１に記載の発明は、複数の電子文書を各文書の検索条件に対する合致度にもとづいて順位づけする文書検索装置において、前記各文書の属性情報を抽出する属性情報抽出手段と、前記各文書の本文の前記検索条件に対する合致度を算出する合致度算出手段と、前記合致度算出手段により算出された合致度を、前記属性情報抽出手段により抽出された各文書の属性情報にもとづいて補正する合致度補正手段と、を備えたので、本文だけでは優劣のつかない文書でも、その属性情報にもとづいていずれかが相対的に上位、いずれかが相対的に下位に順位づけされ、これによって、検索者にとって重要である可能性の高い文書を優先的に検索者に提示することが可能な文書検索装置が得られるという効果を奏する。
【００７３】
また、請求項２に記載の発明は、前記請求項１に記載の発明において、前記属性情報抽出手段が、前記各文書のＵＲＬを抽出するとともに、前記合致度補正手段が、前記合致度算出手段により算出された合致度を、前記属性情報抽出手段により抽出された当該文書のＵＲＬの前記検索条件に対する合致度にもとづいて補正するので、本文だけでは優劣のつかない文書でも、そのＵＲＬがより検索条件に類似する文書のほうが、相対的に上位に順位づけされ、これによって、検索者にとって重要である可能性の高い文書を優先的に検索者に提示することが可能な文書検索装置が得られるという効果を奏する。
【００７４】
また、請求項３に記載の発明は、前記請求項１に記載の発明において、前記属性情報抽出手段が、前記各文書を保持する情報処理装置内での当該文書の位置する階層を抽出するとともに、前記合致度補正手段が、前記合致度算出手段により算出された合致度を、前記属性情報抽出手段により抽出された当該文書の位置する階層にもとづいて補正するので、本文だけでは優劣のつかない文書でも、サーバ上で位置する階層のより浅い（ルートに近い）文書のほうが、相対的に上位に順位づけされ、これによって、検索者にとって重要である可能性の高い文書を優先的に検索者に提示することが可能な文書検索装置が得られるという効果を奏する。
【００７５】
また、請求項４に記載の発明は、前記請求項１に記載の発明において、前記属性情報抽出手段が、前記各文書のタイトルまたは前記各文書中の強調文字列を抽出するとともに、前記合致度補正手段が、前記合致度算出手段により算出された合致度を、前記属性情報抽出手段により抽出された当該文書のタイトルまたは当該文書中の強調文字列の前記検索条件に対する合致度にもとづいて補正するので、本文だけでは優劣のつかない文書でも、そのタイトルまたは文中の強調文字列がより検索条件に類似する文書のほうが、相対的に上位に順位づけされ、これによって、検索者にとって重要である可能性の高い文書を優先的に検索者に提示することが可能な文書検索装置が得られるという効果を奏する。
【００７６】
また、請求項５に記載の発明は、前記請求項１に記載の発明において、前記属性情報抽出手段が、前記各文書が当該文書を保持する情報処理装置内の他の文書にリンクする頻度を抽出するとともに、前記合致度補正手段が、前記合致度算出手段により算出された合致度を、前記属性情報抽出手段により抽出された頻度にもとづいて補正するので、本文だけでは優劣のつかない文書でも、同一サーバ内の他の文書により多くリンクする文書のほうが、相対的に上位に順位づけされ、これによって、検索者にとって重要である可能性の高い文書を優先的に検索者に提示することが可能な文書検索装置が得られるという効果を奏する。
【００７７】
また、請求項６に記載の発明は、前記請求項１に記載の発明において、前記属性情報抽出手段は、前記各文書にリンクする他の文書中で当該リンクが埋め込まれている文字列を抽出するとともに、前記合致度補正手段は、前記合致度算出手段により算出された合致度を、前記属性情報抽出手段により抽出された文字列の前記検索条件に対する合致度にもとづいて補正するので、本文だけでは優劣のつかない文書でも、他の文書に埋め込まれているアンカーテキストがより検索条件に類似する文書のほうが、相対的に上位に順位づけされ、これによって、検索者にとって重要である可能性の高い文書を優先的に検索者に提示することが可能な文書検索装置が得られるという効果を奏する。
【００７８】
また、請求項７に記載の発明は、複数の電子文書を各文書の検索条件に対する合致度にもとづいて順位づけする文書検索方法において、前記各文書の属性情報を抽出する属性情報抽出工程と、前記各文書の本文の前記検索条件に対する合致度を算出する合致度算出工程と、前記合致度算出工程で算出された合致度を、前記属性情報抽出工程で抽出された各文書の属性情報にもとづいて補正する合致度補正工程と、を含んだので、本文だけでは優劣のつかない文書でも、その属性情報にもとづいていずれかが相対的に上位、いずれかが相対的に下位に順位づけされ、これによって、検索者にとって重要である可能性の高い文書を優先的に検索者に提示することが可能な文書検索方法が得られるという効果を奏する。
【００７９】
また、請求項８に記載の発明によれば、前記請求項７に記載された方法をコンピュータに実行させることが可能なプログラムが得られるという効果を奏する。
【図面の簡単な説明】
【図１】この発明の実施の形態による文書検索装置のハードウエア構成を示す説明図である。
【図２】この発明の実施の形態による文書検索装置の機能的構成を示す説明図である。
【図３】この発明の実施の形態による文書検索装置により収集された、複数の文書間の参照関係の一例を示す説明図である。
【図４】この発明の実施の形態による文書検索装置における、属性情報抽出処理の手順を示すフローチャートである。
【図５】この発明の実施の形態による文書検索装置における、文書検索処理の手順を示すフローチャートである。
【符号の説明】
１０１　ＣＰＵ
１０２　ＲＯＭ
１０３　ＲＡＭ
１０４　ＨＤＤ
１０５　ＨＤ
１０６　ＦＤＤ
１０７　ＦＤ
１０８　ディスプレイ
１０９　ネットワークＩ／Ｆ
１１０　通信回線
１１１　キーボード
１１２　マウス
１１３　ＣＤ−ＲＷ
１１４　ＣＤ−ＲＷドライブ
２００　文書収集部
２０１　収集文書解析部
２０２　収集文書保存部
２０３　要求入力受付部
２０４　検索実行部
２０５　結果出力部
[0001]
[Technical field of the invention]
The present invention relates to a document search device and a document search method that rank a plurality of electronic documents based on the degree to which each document matches a search condition, and to a program that causes a computer to execute the method.
[0002]
[Prior art]
A searcher who wanted to retrieve documents related to some theme from the large number of documents on the Internet customarily used a general-purpose search engine such as "Yahoo!" or "Google", or a specialized search engine dedicated to a particular theme. Accessing such a search engine and entering an arbitrary keyword displays on screen, without omission, the titles and other details of the documents that contain the keyword.
[0003]
[Problems to be solved by the invention]
However, because the search result list obtained in this way is very exhaustive, there was the problem that, when many documents are hit, finding the target document in the list takes time and effort. Put another way, reaching the target document is laborious because the arrangement (order) of the documents in the result list does not reflect the importance of each document to the searcher.
[0004]
To solve the above problem of the prior art, an object of the present invention is to provide a document search device, a document search method, and a program for causing a computer to execute the method, each capable of preferentially presenting to the searcher the documents that are likely to be important to the searcher.
[0005]
[Means for Solving the Problems]
To solve the above problem and achieve the object, the document search device according to the invention of claim 1 is a document search device that ranks a plurality of electronic documents based on the degree to which each document matches a search condition, comprising: attribute information extracting means for extracting attribute information of each document; matching degree calculating means for calculating the degree to which the body text of each document matches the search condition; and matching degree correcting means for correcting the matching degree calculated by the matching degree calculating means based on the attribute information of each document extracted by the attribute information extracting means.
[0006]
According to the invention of claim 1, even documents that cannot be ranked against one another by body text alone are ranked relatively higher or relatively lower based on their attribute information.
[0007]
In the document search device according to the invention of claim 2, in the invention of claim 1, the attribute information extracting means extracts the URL of each document, and the matching degree correcting means corrects the matching degree calculated by the matching degree calculating means based on the degree to which the URL of the document, extracted by the attribute information extracting means, matches the search condition.
[0008]
According to the invention of claim 2, even among documents that cannot be ranked against one another by body text alone, a document whose URL is more similar to the search condition is ranked relatively higher.
[0009]
In the document search device according to the invention of claim 3, in the invention of claim 1, the attribute information extracting means extracts the hierarchy level at which each document is located within the information processing apparatus that holds the document, and the matching degree correcting means corrects the matching degree calculated by the matching degree calculating means based on the hierarchy level of the document extracted by the attribute information extracting means.
[0010]
According to the invention of claim 3, even among documents that cannot be ranked against one another by body text alone, a document located at a shallower hierarchy level on the server (closer to the root) is ranked relatively higher.
[0011]
In the document search device according to the invention of claim 4, in the invention of claim 1, the attribute information extracting means extracts the title of each document or the emphasized character strings in each document, and the matching degree correcting means corrects the matching degree calculated by the matching degree calculating means based on the degree to which the title of the document, or the emphasized character strings in the document, extracted by the attribute information extracting means, match the search condition.
[0012]
According to the invention of claim 4, even among documents that cannot be ranked against one another by body text alone, a document whose title or emphasized character strings are more similar to the search condition is ranked relatively higher.
[0013]
In the document search device according to the invention of claim 5, in the invention of claim 1, the attribute information extracting means extracts the frequency with which each document links to other documents within the information processing apparatus that holds the document, and the matching degree correcting means corrects the matching degree calculated by the matching degree calculating means based on the frequency extracted by the attribute information extracting means.
[0014]
According to the invention of claim 5, even among documents that cannot be ranked against one another by body text alone, a document that links to more of the other documents on the same server is ranked relatively higher.
[0015]
In the document search device according to the invention of claim 6, in the invention of claim 1, the attribute information extracting means extracts, from other documents that link to each document, the character string in which the link is embedded, and the matching degree correcting means corrects the matching degree calculated by the matching degree calculating means based on the degree to which the character string extracted by the attribute information extracting means matches the search condition.
[0016]
According to the invention of claim 6, even among documents that cannot be ranked against one another by body text alone, a document for which the anchor text embedded in other documents is more similar to the search condition is ranked relatively higher.
[0017]
The document search method according to the invention of claim 7 is a document search method that ranks a plurality of electronic documents based on the degree to which each document matches a search condition, comprising: an attribute information extracting step of extracting attribute information of each document; a matching degree calculating step of calculating the degree to which the body text of each document matches the search condition; and a matching degree correcting step of correcting the matching degree calculated in the matching degree calculating step based on the attribute information of each document extracted in the attribute information extracting step.
[0018]
According to the invention of claim 7, even documents that cannot be ranked against one another by body text alone are ranked relatively higher or relatively lower based on their attribute information.
[0019]
According to the program of the invention of claim 8, the method of claim 7 is executed by a computer.
[0020]
[Embodiments of the invention]
Preferred embodiments of a document search device, a document search method, and a program for causing a computer to execute the method according to the present invention will be described in detail below with reference to the accompanying drawings.
[0021]
FIG. 1 is an explanatory diagram showing the hardware configuration of a document search device according to an embodiment of the present invention. In the figure, reference numeral 101 denotes a CPU that controls the entire device, 102 denotes a ROM that stores a basic input/output program, and 103 denotes a RAM used as a work area of the CPU 101.
[0022]
Reference numeral 104 denotes an HDD (hard disk drive) that controls reading and writing of data on an HD (hard disk) 105 under the control of the CPU 101, and 105 denotes the HD, which stores data written under the control of the HDD 104.
[0023]
Reference numeral 106 denotes an FDD (flexible disk drive) that controls reading and writing of data on an FD (flexible disk) 107 under the control of the CPU 101, and 107 denotes the removable FD, which stores data written under the control of the FDD 106.
[0024]
Reference numeral 108 denotes a display that shows a cursor, menus, windows, and various data such as text and images, and 109 denotes a network I/F that is connected to the Internet via a communication line 110 and functions as the interface between that network and the CPU 101.
[0025]
Reference numeral 111 denotes a keyboard with a plurality of keys for entering characters, numerical values, various instructions, and the like, and 112 denotes a mouse used to select and execute instructions, select processing targets, move the mouse pointer, and so on. Reference numeral 113 denotes a CD-RW, which is a removable recording medium, 114 denotes a CD-RW drive that controls reading of data from the CD-RW 113, and 100 denotes a bus or cable that connects the above units.
[0026]
Next, FIG. 2 is an explanatory diagram showing the functional configuration of the document search device according to the embodiment of the present invention. Reference numeral 200 denotes a document collection unit, which periodically crawls the Internet and collects the home pages published there (strictly speaking, the various files, such as HTML files and GIF files, that make up those home pages). It then passes the collected documents to a collected document analysis unit 201, described later.
[0027]
Reference numeral 201 denotes the collected document analysis unit, which analyzes each document passed from the document collection unit 200 and first creates an inverted file for retrieving documents by keyword. Conceptually, the inverted file is a table whose rows are all the collected documents and whose columns are all the keywords that appear in that document set; the cell at each row-column intersection records whether the keyword appears in the document, or how often it appears.
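The table described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `build_term_table` and `documents_containing` are hypothetical, and whitespace tokenization stands in for whatever keyword extraction the device actually uses (Japanese text would need morphological analysis instead).

```python
from collections import Counter


def build_term_table(docs):
    """Build a document-by-keyword table from {doc_id: text}.

    Each row is a Counter mapping keyword -> frequency in that document.
    Whitespace tokenization is an assumption made for this sketch.
    """
    table = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    # The columns of the conceptual table: every keyword seen in any document.
    vocabulary = sorted(set().union(*table.values()))
    return table, vocabulary


def documents_containing(table, keyword):
    """Read one column of the table: the documents in which the keyword occurs."""
    return sorted(doc_id for doc_id, row in table.items() if row[keyword] > 0)


docs = {"A": "ichitaro word processor", "B": "ichitaro ichitaro news"}
table, vocabulary = build_term_table(docs)
```

Storing each row as a sparse counter rather than a dense matrix keeps the sketch small; a production index would more likely store the columns (keyword to document list) directly.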
[0028]
Along with creating the inverted file, the collected document analysis unit 201 extracts the following attribute information for each document and passes it to a collected document storage unit 202, described later.
[0029]
(1) URLText
The URL that uniquely identifies the document, for example "http://www.justsystem.co.jp/index.html".
[0030]
(2) URLLength
The depth of the hierarchy level at which the document is located on the server (this depth shows up as the length of the URL). For example, document A, whose URL is "http://www.justsystem.co.jp/index.html", has URLLength = 0 (root), and document B, whose URL is "http://www.justsystem.co.jp/news/20020601.html", has URLLength = 1.
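One way to obtain this depth is to count the directory segments in the URL path. The sketch below assumes that reading of URLLength; the function name `url_length` is hypothetical, and the rule that a path without a trailing slash ends in a file name is an assumption of this example, not something the text specifies.

```python
from urllib.parse import urlsplit


def url_length(url):
    """Return the directory depth of a URL: 0 for a document at the server root."""
    path = urlsplit(url).path
    segments = [segment for segment in path.split("/") if segment]
    # Assumption: unless the path ends with "/", its last segment is a file name
    # and does not count as a hierarchy level.
    if segments and not path.endswith("/"):
        segments = segments[:-1]
    return len(segments)


depth_a = url_length("http://www.justsystem.co.jp/index.html")
depth_b = url_length("http://www.justsystem.co.jp/news/20020601.html")
```

With the two URLs from the text, `depth_a` is 0 and `depth_b` is 1, matching the values given for documents A and B.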
[0031]
(3) Title
This is the title of the document, that is, the entire character string described in the <title> tag.
[0032]
(4) LargeFonts
All character strings that are emphasized in the document. What counts as "emphasis" is arbitrary, but, for example, character strings larger than the other character strings in the document (such as those inside a <font size="+1"> tag), character strings whose color differs from the others (such as those inside a <font color="#FF0000"> tag), and character strings whose typeface differs from the others, such as bold or italic (those inside a <b> or <i> tag), are regarded as emphasized character strings and stored in LargeFonts.
[0033]
Note that LargeFonts may store, instead of the character strings themselves, position information for the emphasized character strings, such as the character offsets in the document at which each begins and ends.
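The Title and LargeFonts attributes could be pulled out of an HTML file roughly as follows. This is a minimal sketch using Python's standard `html.parser`; the class name `AttributeExtractor` is hypothetical, and treating every `<font>`, `<b>`, and `<i>` element as emphasis is an assumption chosen for brevity (the text leaves the definition of emphasis open).

```python
from html.parser import HTMLParser


class AttributeExtractor(HTMLParser):
    """Collect the <title> text and the text inside emphasis tags."""

    EMPHASIS_TAGS = {"b", "i", "font"}  # assumption: any of these marks emphasis

    def __init__(self):
        super().__init__()
        self.title = ""
        self.large_fonts = []
        self._in_title = False
        self._emphasis_depth = 0  # > 0 while inside at least one emphasis tag

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.EMPHASIS_TAGS:
            self._emphasis_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.EMPHASIS_TAGS and self._emphasis_depth > 0:
            self._emphasis_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.title += text
        elif self._emphasis_depth > 0:
            self.large_fonts.append(text)


page = ('<html><head><title>JustSystems</title></head><body>'
        '<p>plain <b>Ichitaro</b> and <font size="+1">ATOK</font></p></body></html>')
extractor = AttributeExtractor()
extractor.feed(page)
```

After `feed`, `extractor.title` holds the title text and `extractor.large_fonts` holds the emphasized strings in document order.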
[0034]
(5) InterServerLinked
The frequency with which the document is linked (referenced) from documents on servers other than the server where it resides (external pages). For example, as shown in FIG. 3, suppose hyperlinks that jump to document A on server 1 are embedded in document B on server 1, document C on server 2, and document D on server 3, and no documents other than B to D reference document A. Then InterServerLinked of document A = 2 (= 1 + 1); the link from document B is not counted because B resides on the same server as A.
[0035]
(6) InnerServerLinker
The frequency with which the document links to other documents on the server where it resides (internal pages). For example, as shown in FIG. 3, if document A on server 1 contains one hyperlink that jumps to document B on server 1 and one that jumps to document C on server 2, then InnerServerLinker of document A = 1.
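The two link counts can be reproduced from a list of (source URL, target URL) pairs. The following sketch assumes that "same server" means "same host name in the URL"; the function names and the `example` host names are hypothetical and only mirror the arrangement described for FIG. 3.

```python
from urllib.parse import urlsplit


def host(url):
    """Host name part of a URL, used here as the identity of the server."""
    return urlsplit(url).netloc.lower()


def inter_server_linked(doc_url, links):
    """Count links pointing to doc_url from documents on other servers."""
    return sum(1 for source, target in links
               if target == doc_url and host(source) != host(doc_url))


def inner_server_linker(doc_url, links):
    """Count links from doc_url to other documents on its own server."""
    return sum(1 for source, target in links
               if source == doc_url and target != doc_url
               and host(target) == host(doc_url))


A = "http://server1.example/a.html"
B = "http://server1.example/b.html"
C = "http://server2.example/c.html"
D = "http://server3.example/d.html"
# B, C and D link to A; A links to B and C.
links = [(B, A), (C, A), (D, A), (A, B), (A, C)]
```

For this link list, `inter_server_linked(A, links)` gives 2 and `inner_server_linker(A, links)` gives 1, the same values as in the examples for items (5) and (6).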
[0036]
(7) InterServerAnchor
When the document is linked from a document on a server other than the server where it resides, the character string in which that link is embedded. For example, suppose a hyperlink to document A on server 1 is embedded in document C on server 2 in the form "<a href="http://www.justsystem.co.jp/index.html">To the JustSystems Corporation home page</a>" and in document D on server 3 in the form "<a href="http://www.justsystem.co.jp/index.html">Click here for Ichitaro</a>", and no external pages other than documents C and D reference document A. Then InterServerAnchor of document A (not of documents C and D) = "To the JustSystems Corporation home page", "Click here for Ichitaro".
[0037]
(8) InnerServerAnchor
When the document is linked from another document on the server where it resides, the character string in which that link is embedded. For example, suppose a hyperlink to document A on server 1 is embedded in document B on the same server 1 in the form "<a href="http://www.justsystem.co.jp/index.html">Return to home</a>", and no document on server 1 other than document B references document A. Then InnerServerAnchor of document A (not of document B) = "Return to home".
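Collecting the two anchor-text attributes might look like the sketch below, which scans the HTML of each linking page for `<a>` elements that point at the target document. The names `AnchorCollector` and `anchor_attributes` and the `example` hosts are hypothetical, and only absolute `href` values that equal the target URL exactly are matched, which is a simplification.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def anchor_attributes(target_url, pages):
    """Split anchor texts pointing at target_url into (inter, inner) lists.

    pages maps the URL of each linking page to its HTML source.
    """
    target_host = urlsplit(target_url).netloc
    inter, inner = [], []
    for source_url, html in pages.items():
        collector = AnchorCollector()
        collector.feed(html)
        for href, text in collector.anchors:
            if href != target_url:
                continue
            same_server = urlsplit(source_url).netloc == target_host
            (inner if same_server else inter).append(text)
    return inter, inner


target = "http://www.justsystem.co.jp/index.html"
pages = {
    "http://www.justsystem.co.jp/news/20020601.html":
        '<a href="http://www.justsystem.co.jp/index.html">Return to home</a>',
    "http://server2.example/c.html":
        '<a href="http://www.justsystem.co.jp/index.html">To the JustSystems Corporation home page</a>',
    "http://server3.example/d.html":
        '<a href="http://www.justsystem.co.jp/index.html">Click here for Ichitaro</a>',
}
inter_anchor, inner_anchor = anchor_attributes(target, pages)
```

The anchor texts are attached to the link target (document A), not to the pages that contain the links, which is the point stressed in items (7) and (8).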
[0038]
Returning to the functional units of FIG. 2, reference numeral 202 denotes the collected document storage unit, which holds the body of each document passed from the collected document analysis unit 201, the attribute information of each document (the values of attributes (1) to (8) above), and the inverted file described above.
[0039]
Reference numeral 203 denotes a request input receiving unit, a functional unit that accepts a search request entered by the operator and passes it to a search execution unit 204, described later. The search request contains at least one keyword and search conditions, such as AND and OR, that combine the keywords. Alternatively, the search request may be entered as a natural-language sentence, from which keywords are extracted by morphological or syntactic analysis and passed to the search execution unit 204.
[0040]
Reference numeral 204 denotes the search execution unit, which comprises a matching degree calculation unit 204a and a matching degree correction unit 204b. The matching degree calculation unit 204a sequentially calculates the degree to which each document in the collected document storage unit 202 matches the search condition passed from the request input receiving unit 203 (this may also be described as the similarity between the search condition and each document). The matching degree correction unit 204b corrects the matching degree calculated by the matching degree calculation unit 204a in light of the attribute information of each document described above.
[0041]
The matching degree calculation unit 204a calculates the matching degree of each document by a method generally called the “vector space method”. In the vector space method, a vector (query vector) whose element values are, for example, the presence or absence or the appearance frequency of each keyword included in the search condition is created, and a document vector of each document is created from the records in the above-described inverted file. Then, the distance (cosine distance) between the query vector and the document vector of each document is sequentially calculated, and a matching degree score is calculated such that the smaller the distance, the larger the score, and the larger the distance, the smaller the score. With this score, the documents can be ranked in order of their matching degree with the search condition.
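As an illustration only, the vector space method can be sketched as follows. The sketch assumes raw term frequencies as vector elements and uses the cosine of the angle between the vectors as the score (larger means closer); the description leaves the exact weighting open, and the names term_vector, cosine_similarity and rank_documents are hypothetical.

```python
import math
from collections import Counter

def term_vector(words):
    """Term-frequency vector (assumption: raw counts, no further weighting)."""
    return Counter(words)

def cosine_similarity(query_vec, doc_vec):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * doc_vec.get(term, 0) for term, weight in query_vec.items())
    norm_q = math.sqrt(sum(w * w for w in query_vec.values()))
    norm_d = math.sqrt(sum(w * w for w in doc_vec.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

def rank_documents(query_words, docs):
    """Return (doc_id, score) pairs sorted by descending matching degree."""
    query_vec = term_vector(query_words)
    scores = {doc_id: cosine_similarity(query_vec, term_vector(words))
              for doc_id, words in docs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

docs = {
    "A": ["ichitaro", "word", "processor"],
    "B": ["weather", "news"],
}
ranking = rank_documents(["ichitaro"], docs)
```

Here document A shares a keyword with the query and is ranked first, while document B receives a score of zero.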
[0042]
However, in the present invention, the matching degree of each document calculated as described above is corrected based on the attribute information extracted for each document. Specifically, it is as follows.
[0043]
(1) The higher the matching degree of “URLText”, that is, the URL of the document, with the search condition, the higher the rate at which the document's matching degree with the search condition is increased. The URL often includes the names of companies and other organizations related to the home page, the subject of the page, and the like, and these keywords are frequently used as search conditions. Therefore, a correction according to the matching degree of the URL is performed. For example, the matching degree of document A, whose URL matching degree is 80%, is set to 180% (80% added) of the matching degree calculated by the matching degree calculation unit 204a, and the matching degree of document B, whose URL matching degree is 20%, is set to 120% (20% added) of the calculated matching degree.
[0044]
Here, as an example, the calculation formula “matching degree after correction = matching degree before correction + (matching degree before correction × matching degree of URL)” is used, but any calculation formula may be used as long as the matching degree of the document is corrected to be higher the more character strings identical or similar to the search condition the URL contains.
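The example formula for the URL-based correction can be sketched as below. How the matching degree of the URL itself is measured is not stated in the description; the helper keyword_match_ratio, which takes the fraction of keywords that occur in the URL, is purely an assumption for illustration, as are both function names.

```python
def keyword_match_ratio(keywords, text):
    """Assumed measure: fraction of keywords that occur in text (case-insensitive)."""
    if not keywords:
        return 0.0
    lowered = text.lower()
    hits = sum(1 for keyword in keywords if keyword.lower() in lowered)
    return hits / len(keywords)

def correct_by_text_attribute(base_score, attribute_match):
    """after = before + (before x matching degree of the attribute)."""
    return base_score + base_score * attribute_match

url_match = keyword_match_ratio(
    ["justsystem", "ichitaro"], "http://www.justsystem.co.jp/index.html"
)
corrected = correct_by_text_attribute(0.5, url_match)
```

An attribute matching degree of 0.8 multiplies the original score by 1.8 and one of 0.2 by 1.2, matching the 180% and 120% figures in the example.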
[0045]
(2) The smaller the “URLLength”, that is, the higher the hierarchy level at which the document is located (the closer to the root), the higher the rate at which the document's matching degree with the search condition is increased. Searchers are often looking for the top page or pages close to the top page, and these pages are usually located at higher levels. Therefore, a correction according to the hierarchy level at which the document is located is performed. For example, the matching degree of document A with URLLength = 0 is set to 200% of the matching degree calculated by the matching degree calculation unit 204a, and the matching degree of document B with URLLength = 1 is set to 190% of the calculated matching degree.
[0046]
Here, as an example, the calculation formula “matching degree after correction = matching degree before correction + {matching degree before correction × (1 − URLLength × 0.1)}” is used, but any calculation formula may be used as long as the matching degree of the document is corrected to be higher the higher the hierarchy level at which the document is located.
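A minimal sketch of the hierarchy-based example formula follows; the function name correct_by_url_length is hypothetical. Note that, taken literally, the bonus term becomes negative once URLLength exceeds 10, a case the description does not address and the sketch does not handle specially.

```python
def correct_by_url_length(base_score, url_length):
    """after = before + before x (1 - URLLength x 0.1).

    url_length is the hierarchy depth of the document (0 = root level).
    """
    return base_score + base_score * (1 - url_length * 0.1)

score_root = correct_by_url_length(1.0, 0)   # document A, URLLength = 0
score_depth1 = correct_by_url_length(1.0, 1)  # document B, URLLength = 1
```

These two calls reproduce the 200% and 190% figures given for documents A and B.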
[0047]
(3) The higher the matching degree of “Title”, that is, the title of the document, with the search condition, the higher the rate at which the document's matching degree with the search condition is increased. Like the URL, the title is likely to include a keyword used as a search condition. Therefore, a correction according to the matching degree of the title is performed. For example, the matching degree of document A, whose title has a matching degree of 80% with the search condition, is set to 180% of the matching degree calculated by the matching degree calculation unit 204a, and the matching degree of document B, whose title has a matching degree of 20%, is set to 120% of the calculated matching degree.
[0048]
Here, as an example, the calculation formula “matching degree after correction = matching degree before correction + (matching degree before correction × matching degree of title)” is used, but any calculation formula may be used as long as the matching degree of the document is corrected to be higher the more character strings identical or similar to the search condition the title contains. Further, it is not necessary to use the same calculation formula as for the above-described correction based on the URL.
[0049]
(4) The higher the matching degree of “LargeFonts”, that is, the emphasized character strings in the document, with the search condition, the higher the rate at which the document's matching degree with the search condition is increased. Like the URL and the title, an emphasized character string is likely to include a keyword used as a search condition. Therefore, a correction according to the matching degree of the emphasized character strings is performed. For example, the matching degree of document A, whose emphasized character strings have a matching degree of 80% with the search condition, is set to 180% of the matching degree calculated by the matching degree calculation unit 204a, and the matching degree of document B, whose emphasized character strings have a matching degree of 20%, is set to 120% of the calculated matching degree.
[0050]
Here, as an example, the calculation formula “matching degree after correction = matching degree before correction + (matching degree before correction × matching degree of emphasized character string)” is used. The calculation formula may be arbitrary as long as the matching degree of the document is corrected to be higher as more similar character strings are included. Further, it is not necessary to use the same calculation formula as the correction based on the URL or the title described above.
[0051]
(5) The larger the “InterServerLinked”, that is, the more frequently the document is linked from documents on other servers, the higher the rate at which the document's matching degree with the search condition is increased. Documents linked from many external pages are rich in content and highly evaluated objectively, and are likely to be relatively important for searchers. Therefore, a correction according to the number of links from other documents is performed. For example, the matching degree of document A with InterServerLinked = 2 is set to 120% of the matching degree calculated by the matching degree calculation unit 204a.
[0052]
Here, as an example, the calculation formula “matching degree after correction = matching degree before correction + (matching degree before correction × referenced frequency × 0.1)” is used, but any calculation formula may be used as long as the matching degree of the document is corrected to be higher the more documents link to it.
[0053]
(6) The larger the “InnerServerLinker”, that is, the more frequently the document links to other documents on the same server, the higher the rate at which the document's matching degree with the search condition is increased. A document that links to many internal pages is a top page, or at least a page that bundles the contents of its link destinations, and is likely to be relatively important for a searcher. Therefore, a correction according to the frequency of links to other documents is performed. For example, the matching degree of document A with InnerServerLinker = 1 is set to 110% of the matching degree calculated by the matching degree calculation unit 204a.
[0054]
Here, as an example, the calculation formula “matching degree after correction = matching degree before correction + (matching degree before correction × reference frequency × 0.1)” is used, but any calculation formula may be used as long as the matching degree of the document is corrected to be higher the more documents on the same server it links to.
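The example formulas in (5) and (6) have the same shape and differ only in which frequency is supplied. A minimal sketch, with the hypothetical function name correct_by_link_frequency:

```python
def correct_by_link_frequency(base_score, frequency):
    """after = before + (before x frequency x 0.1).

    frequency is InterServerLinked (links received from other servers)
    or InnerServerLinker (links made to documents on the same server).
    """
    return base_score + base_score * frequency * 0.1

inter_corrected = correct_by_link_frequency(1.0, 2)  # InterServerLinked = 2
inner_corrected = correct_by_link_frequency(1.0, 1)  # InnerServerLinker = 1
```

The two calls correspond to the 120% and 110% figures in the examples above.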
[0055]
(7) The higher the matching degree of “InterServerAnchor”, that is, the character string in an external page in which a link to the document is embedded, with the search condition, the higher the rate at which the document's matching degree with the search condition is increased. Like the URL, title, and emphasized character strings described above, anchor text, which concisely expresses the contents and nature of the destination page, is likely to include a keyword used as a search condition.
Therefore, a correction according to the matching degree of the anchor text is performed. For example, the matching degree of document A, whose anchor text in external pages has a matching degree of 80% with the search condition, is set to 180% of the matching degree calculated by the matching degree calculation unit 204a.
[0056]
Here, as an example, the calculation formula “matching degree after correction = matching degree before correction + (matching degree before correction × matching degree of InterServerAnchor)” is used, but any calculation formula may be used as long as the matching degree of the document is corrected to be higher the more character strings identical or similar to the search condition the anchor text contains. Further, it is not necessary to use the same calculation formula as for the above-described correction based on the URL, the title, or the emphasized character strings.
[0057]
(8) The higher the matching degree of “InnerServerAnchor”, that is, the character string in an internal page in which a link to the document is embedded, with the search condition, the higher the rate at which the document's matching degree with the search condition is increased. The reason and the procedure are the same as for the above-mentioned InterServerAnchor. A correction according to the matching degree of the anchor text is performed. For example, the matching degree of document A, whose anchor text in internal pages has a matching degree of 80% with the search condition, is set to 180% of the matching degree calculated by the matching degree calculation unit 204a.
[0058]
Here, as an example, the calculation formula “matching degree after correction = matching degree before correction + (matching degree before correction × matching degree of InnerServerAnchor)” is used, but any calculation formula may be used as long as the matching degree of the document is corrected to be higher the more character strings identical or similar to the search condition the anchor text contains. Further, it is not necessary to use the same calculation formula as for the above-described correction based on the URL, the title, the emphasized character strings, or InterServerAnchor.
[0059]
The matching degree correction unit 204b calculates, for each document, the corrected matching degree for each of the above attributes, and then obtains the final corrected matching degree of the document by, for example, taking their average (simple average or weighted average). Alternatively, the matching degrees corrected by the individual attributes may be summed, either simply or with predetermined weights, to obtain the corrected matching degree of each document. Thereafter, the matching degree correction unit 204b ranks the documents in order of matching degree, and passes the titles and matching degrees of the documents whose matching degree exceeds a threshold value to the result output unit 205 described later.
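The combination step can be sketched as follows, under the assumption that the per-attribute corrected scores are held in a dictionary keyed by attribute name. The function names, the weights, and the threshold value are hypothetical; the description does not give concrete values for them.

```python
def combine_corrected_scores(corrected, weights=None):
    """Simple average when weights is None, otherwise weighted average."""
    if weights is None:
        return sum(corrected.values()) / len(corrected)
    total_weight = sum(weights[name] for name in corrected)
    weighted_sum = sum(corrected[name] * weights[name] for name in corrected)
    return weighted_sum / total_weight

def rank_above_threshold(final_scores, threshold):
    """Rank documents by final score and keep those exceeding the threshold."""
    ranked = sorted(final_scores.items(), key=lambda item: item[1], reverse=True)
    return [(doc_id, score) for doc_id, score in ranked if score > threshold]

per_attribute = {"URLText": 1.8, "Title": 1.2}
simple = combine_corrected_scores(per_attribute)
weighted = combine_corrected_scores(per_attribute, {"URLText": 3, "Title": 1})
results = rank_above_threshold({"A": 1.5, "B": 0.4, "C": 0.9}, threshold=0.5)
```

A weighted average lets one attribute, here the URL, influence the final ranking more strongly than the others.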
[0060]
Referring back to FIG. 2, reference numeral 205 denotes a result output unit which displays the document delivered from the search execution unit 204, that is, the search result, on the display 108 as a search result list or the like.
[0061]
FIG. 4 is a flowchart showing a procedure of attribute information extraction processing in the document search device according to the embodiment of the present invention. This process is periodically executed at a time designated in advance as preparation for a document search process described later.
[0062]
At a designated time, the document collection unit 200 collects documents by crawling the Internet (step S401), and when a predetermined end condition, for example, having followed links up to N links ahead, is satisfied (step S402: Yes), the collected documents are delivered to the collected document analysis unit 201.
[0063]
The collected document analysis unit 201 extracts each item of the above-described attribute information from the delivered documents (step S403), and creates various databases, such as a correspondence table between the documents and their attribute information and an inverted file, which is a correspondence table between the documents and keywords (step S404).
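As a rough illustration of the inverted file created in step S404, the following sketch maps each keyword to the set of documents containing it. It assumes the documents have already been split into keywords; the function name build_inverted_file and the sample data are hypothetical.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Map each keyword to the set of document ids that contain it."""
    inverted = defaultdict(set)
    for doc_id, words in docs.items():
        for word in words:
            inverted[word].add(doc_id)
    return dict(inverted)

inverted_file = build_inverted_file({
    "A": ["ichitaro", "homepage"],
    "B": ["ichitaro", "manual"],
})
```

Looking up a keyword in such a table yields the candidate documents directly, without scanning every document body.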
[0064]
FIG. 5 is a flowchart showing a procedure of a document search process in the document search device according to the embodiment of the present invention.
[0065]
The request input receiving unit 203 waits for an input of a search request by the searcher (step S501), and when there is an input (step S502: Yes), passes the search request to the search execution unit 204.
[0066]
The search execution unit 204 first determines whether or not a narrowing search is specified in the input search request (step S503). The narrowing search is a process of narrowing down the documents to be searched in advance, before the above-described document search by the vector space method. For example, the matching degree can be calculated by the matching degree calculation unit 204a and corrected by the matching degree correction unit 204b only for those documents in the collected document storage unit 202 that were created or updated within a certain period, or that were created by a specific author (because the calculation can be omitted for the other documents, the processing speed is improved).
[0067]
If the narrowing search is specified (step S503: Yes), only the specified documents are extracted from the documents in the collected document storage unit 202 (step S504), and the matching degree is calculated and corrected for those documents (step S505). On the other hand, if the narrowing search is not specified (step S503: No), the matching degree is calculated and corrected for all the documents in the collected document storage unit 202 (step S505).
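The branch in steps S503 to S505 can be sketched as below. The sketch assumes that the narrowing condition is given as a predicate over document records and that scoring (calculation plus correction) is supplied as a single callable; the function name search and the record fields are hypothetical.

```python
def search(documents, score, narrow=None):
    """Score all documents, or only those passing the narrowing predicate.

    documents: dict mapping doc_id to a document record
    score: callable returning the corrected matching degree of a record
    narrow: optional predicate; None means no narrowing search is specified
    """
    if narrow is not None:  # step S503: Yes, then step S504
        candidates = {doc_id: rec for doc_id, rec in documents.items() if narrow(rec)}
    else:                   # step S503: No
        candidates = documents
    # step S505: calculate and correct the matching degree of each candidate
    return {doc_id: score(rec) for doc_id, rec in candidates.items()}

documents = {
    "A": {"author": "alice", "score": 0.9},
    "B": {"author": "bob", "score": 0.4},
}
all_scores = search(documents, lambda rec: rec["score"])
narrowed = search(documents, lambda rec: rec["score"],
                  narrow=lambda rec: rec["author"] == "alice")
```

Because the predicate is applied before scoring, documents excluded by the narrowing search incur no matching degree calculation at all.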
[0068]
After that, the result output unit 205, to which the search results have been delivered, displays the search result list (step S506). Unless the searcher instructs the end of the search (step S507: No), the process returns to step S501 and accepts the input of the next search condition.
[0069]
According to the embodiment described above, the ranking of the documents to be searched is determined not only by the similarity between the text and the search condition, but comprehensively, in view of various attributes such as the URL, the hierarchy level at which the document is located, the title and emphasized character strings in the text, the frequency of being referenced from external pages, the frequency of referring to internal pages, and the anchor text at the reference source.
[0070]
In practice, when the matching degree is increased based on the above-mentioned attribute information, the core home pages that function as an entrance page or a representative page among the plurality of home pages constituting one site are relatively likely to appear higher in the search results. Since searchers are often looking for these core pages rather than the minor pages in the site, the above-mentioned correction of the matching degree makes it possible to extract, from among similar documents whose matching degrees cannot be distinguished on the basis of the text alone, the documents that are likely to be relatively important for the searcher.
[0071]
The above-described document collection unit 200, collected document analysis unit 201, request input receiving unit 203, search execution unit 204, and result output unit 205 are each realized, specifically, by the CPU 101 executing a program read from the HD 105 into the RAM 103. This program can be stored and distributed on various recording media such as the FD 107, the CD-RW 113, and an MO in addition to the HD 105, and can also be distributed via a network. The collected document storage unit 202 is realized by the HD 105.
[0072]
[Effects of the Invention]
As described above, according to the first aspect of the present invention, in a document search apparatus for ranking a plurality of electronic documents based on the degree of matching of each document with a search condition, attribute information extraction for extracting attribute information of each document Means, a matching degree calculating means for calculating a matching degree of the text of each document with respect to the search condition, and an attribute of each document extracted by the attribute information extracting means, the matching degree calculated by the matching degree calculating means. And a matching level correction unit that corrects based on the information, so that even for documents that are not superior or inferior based on the text alone, based on the attribute information, one of them is ranked relatively higher and the other is ranked relatively lower. As a result, an effect is obtained that a document search device capable of preferentially presenting a document that is likely to be important to the searcher to the searcher can be obtained.
[0073]
According to a second aspect of the present invention, in the first aspect of the present invention, the attribute information extracting means extracts the URL of each document, and the matching degree correcting means corrects the matching degree calculated by the matching degree calculating means based on the matching degree, with the search condition, of the URL of the document extracted by the attribute information extracting means. Therefore, even among documents that cannot be distinguished on the basis of the text alone, documents whose URL is more similar to the search condition are ranked relatively higher. This provides the effect that a document search device capable of preferentially presenting to the searcher a document that is likely to be important to the searcher is obtained.
[0074]
According to a third aspect of the present invention, in the first aspect of the present invention, the attribute information extracting means extracts the hierarchy level at which each document is located in the information processing apparatus holding the document, and the matching degree correcting means corrects the matching degree calculated by the matching degree calculating means based on the hierarchy level extracted by the attribute information extracting means. Therefore, even among documents that cannot be distinguished on the basis of the text alone, documents located at a shallower hierarchy level on the server (closer to the root) are ranked relatively higher. This provides the effect that a document search device capable of preferentially presenting to the searcher a document that is likely to be important to the searcher is obtained.
[0075]
According to a fourth aspect of the present invention, in the first aspect of the present invention, the attribute information extracting means extracts the title of each document or the emphasized character strings in each document, and the matching degree correcting means corrects the matching degree calculated by the matching degree calculating means based on the matching degree, with the search condition, of the title or the emphasized character strings extracted by the attribute information extracting means. Therefore, even among documents that cannot be distinguished on the basis of the text alone, documents whose title or emphasized character strings in the text are more similar to the search condition are ranked relatively higher. This provides the effect that a document search device capable of preferentially presenting to the searcher a document that is likely to be important to the searcher is obtained.
[0076]
According to a fifth aspect of the present invention, in the first aspect of the present invention, the attribute information extracting means extracts the frequency at which each document links to other documents in the information processing apparatus holding the document, and the matching degree correcting means corrects the matching degree calculated by the matching degree calculating means based on the frequency extracted by the attribute information extracting means. Therefore, even among documents that cannot be distinguished on the basis of the text alone, documents that link more often to other documents on the same server are ranked relatively higher. This provides the effect that a document search device capable of preferentially presenting to the searcher a document that is likely to be important to the searcher is obtained.
[0077]
According to a sixth aspect of the present invention, in the first aspect of the invention, the attribute information extracting means extracts a character string in which the link is embedded in another document linked to each of the documents. And the matching degree correcting means corrects the matching degree calculated by the matching degree calculating means based on the matching degree of the character string extracted by the attribute information extracting means with the search condition. Therefore, even if a document does not compare favorably, a document whose anchor text embedded in another document is more similar to the search condition is ranked relatively higher, which may be important for the searcher. There is an effect that a document search device capable of preferentially presenting a high document to a searcher is obtained.
[0078]
Further, according to a seventh aspect of the present invention, a document search method for ranking a plurality of electronic documents based on the matching degree of each document with a search condition includes an attribute information extracting step of extracting attribute information of each document, a matching degree calculating step of calculating the matching degree of the text of each document with the search condition, and a matching degree correcting step of correcting the matching degree calculated in the matching degree calculating step based on the attribute information of each document extracted in the attribute information extracting step. Therefore, even among documents that cannot be distinguished on the basis of the text alone, one is ranked relatively higher and another relatively lower based on the attribute information. As a result, there is an effect that a document search method capable of preferentially presenting to the searcher a document that is likely to be important to the searcher is obtained.
[0079]
According to the invention described in claim 8, there is an effect that a program capable of causing a computer to execute the method described in claim 7 is obtained.
[Brief description of the drawings]
FIG. 1 is an explanatory diagram showing a hardware configuration of a document search device according to an embodiment of the present invention.
FIG. 2 is an explanatory diagram showing a functional configuration of the document search device according to the embodiment of the present invention.
FIG. 3 is an explanatory diagram showing an example of a reference relationship between a plurality of documents collected by the document search device according to the embodiment of the present invention.
FIG. 4 is a flowchart showing a procedure of attribute information extraction processing in the document search device according to the embodiment of the present invention.
FIG. 5 is a flowchart showing a procedure of a document search process in the document search device according to the embodiment of the present invention.
[Explanation of symbols]
101 CPU
102 ROM
103 RAM
104 HDD
105 HD
106 FDD
107 FD
108 Display
109 Network I / F
110 communication line
111 keyboard
112 mouse
113 CD-RW
114 CD-RW drive
200 Document collection unit
201 Collected document analysis unit
202 Collected document storage unit
203 Request input receiving unit
204 Search execution unit
205 Result output unit

Claims

In a document search apparatus that ranks a plurality of electronic documents based on the degree of matching of each document with search conditions,
Attribute information extracting means for extracting attribute information of each document;
A matching degree calculating means for calculating a matching degree of the text of each document with respect to the search condition;
A matching degree correcting means for correcting the matching degree calculated by the matching degree calculating means based on the attribute information of each document extracted by the attribute information extracting means;
A document search device comprising:

The document search device according to claim 1, wherein the attribute information extracting means extracts a URL of each document, and
the matching degree correcting means corrects the matching degree calculated by the matching degree calculating means based on the matching degree, with the search condition, of the URL of the document extracted by the attribute information extracting means.

The document search device according to claim 1, wherein the attribute information extracting means extracts the hierarchy level at which each document is located in the information processing apparatus holding the document, and
the matching degree correcting means corrects the matching degree calculated by the matching degree calculating means based on the hierarchy level of the document extracted by the attribute information extracting means.

The document search device according to claim 1, wherein the attribute information extracting means extracts a title of each document or an emphasized character string in each document, and
the matching degree correcting means corrects the matching degree calculated by the matching degree calculating means based on the matching degree, with the search condition, of the title of the document or the emphasized character string in the document extracted by the attribute information extracting means.

The document search device according to claim 1, wherein the attribute information extracting means extracts the frequency at which each document links to other documents in the information processing apparatus holding the document, and
the matching degree correcting means corrects the matching degree calculated by the matching degree calculating means based on the frequency extracted by the attribute information extracting means.

The document search device according to claim 1, wherein the attribute information extracting means extracts, from another document that links to each document, the character string in which the link is embedded, and
the matching degree correcting means corrects the matching degree calculated by the matching degree calculating means based on the matching degree, with the search condition, of the character string extracted by the attribute information extracting means.

In a document search method for ranking a plurality of electronic documents based on the degree of matching of each document with search conditions,
An attribute information extracting step of extracting attribute information of each document;
A matching degree calculating step of calculating a matching degree of the text of each document with respect to the search condition;
A matching degree correcting step of correcting the matching degree calculated in the matching degree calculating step based on the attribute information of each document extracted in the attribute information extracting step;
A document search method comprising:

A program for causing a computer to execute the method according to claim 7.