JP2004192546A

JP2004192546A - Information retrieval method, device, program, and recording medium

Info

Publication number: JP2004192546A
Application number: JP2002362603A
Authority: JP
Inventors: Masanori Harada; 昌紀原田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2002-12-13
Filing date: 2002-12-13
Publication date: 2004-07-08

Abstract

PROBLEM TO BE SOLVED: To perform ranking search over a large-scale document set at high speed and low cost.

SOLUTION: A high-frequency word inverted index 4 stores, for each index term, a list of the appearance positions in documents in which the frequency of that index term is equal to or greater than a threshold F. When a search term consists of only one index term, a search processing unit 7 reads the appearance position list of that index term from the high-frequency word inverted index 4. When the search term is a sequence of multiple index terms, it reads the appearance position list of each index term from the high-frequency word inverted index 4 and computes the list of positions at which all the index terms appear adjacently. It passes the resulting list to a relevance calculation unit 6 and receives a relevant document list. If the relevant document list refers to at least the T documents needed to display the search results, and the document ranked T-th by relevance has a relevance greater than the maximum relevance attainable by any low-frequency document not stored in the high-frequency word inverted index 4, the top T documents are output.

COPYRIGHT: (C)2004, JPO&NCIPI

Description

[0001]
[Technical field of the invention]
The present invention relates to an information search method and apparatus that index a set of documents to be searched and retrieve documents matching search conditions entered by a user.
[0002]
[Prior art]
An information search apparatus searches a document set for documents that match a search query given by the user and presents them to the user. Many of today's information search apparatuses therefore do not simply list the documents that contain a search term; they calculate the relevance of each document to the search query and present only the documents with high relevance, in descending order of relevance. This is called ranking search.
[0003]
The concept of index term weight is used to calculate relevance (Non-Patent Documents 3 and 4). Here, an index term is a word that characterizes the content of a document, but today's full-text search apparatuses basically treat every word in a document as an index term. In Asian languages such as Japanese, word boundaries are not explicit, so morphemes or character N-grams are generally used as index terms. The weight of an index term is a numerical value indicating how important the index term is in representing the content of a document. In general, the weight $d_{i,j}$ of index term $w_i$ in document $D_j$ is characterized, as in Expression (1), by three measures: a local weight $l_{i,j}$, a global weight $g_i$, and a document normalization coefficient $n_j$ (Non-Patent Document 3).
[0004]
(Equation 1)
$$d_{i,j} = \frac{l_{i,j} \, g_i}{n_j} \qquad (1)$$

[0005]
For example, in the TF-IDF method, known as the most basic weighting method, the local weight $l_{i,j}$ is the index term frequency (TF: term frequency), that is, the number of times the index term appears in the document, and the global weight $g_i$ is the inverse document frequency (IDF), that is, the reciprocal of the proportion of documents in which the index term appears. The simplest choice for the document normalization coefficient $n_j$ is the document size (the sum of the index term frequencies). The weight $d_{i,j}$ of an index term under the TF-IDF method is therefore given by Expression (2).
[0006]
(Equation 2)
$$d_{i,j} = \frac{f_{i,j} \log (N / F_i)}{\sum_i f_{i,j}} \qquad (2)$$

[0007]
In methods other than the TF-IDF method as well, $l_{i,j}$ is often derived from the index term frequency, $g_i$ from the document frequency, and $n_j$ from the document size.
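The TF-IDF weight described above can be sketched in a few lines of Python. This is only an illustration: the function name `tfidf_weight`, its parameter names, and the sample numbers are assumptions made here, and the symbols follow the definitions given later in paragraph [0021].

```python
import math

def tfidf_weight(tf, doc_size, num_docs, doc_freq):
    """Weight of one index term in one document.

    tf        -- f_{i,j}: occurrences of the term in the document
    doc_size  -- sum_i f_{i,j}: total index term occurrences in the document
    num_docs  -- N: total number of documents
    doc_freq  -- F_i: number of documents containing the term
    """
    return tf * math.log(num_docs / doc_freq) / doc_size

# Hypothetical numbers: the term occurs 3 times in a 100-term document
# and appears in 10 of 1000 documents.
w = tfidf_weight(tf=3, doc_size=100, num_docs=1000, doc_freq=10)
```

Because the document size is the denominator, a short document that mentions the term a few times can outweigh a long document that mentions it more often.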
[0008]
An information search apparatus that handles a large number of documents stores index terms and information on their appearance positions in advance in a kind of database called an index, so that search processing can be performed at high speed. This is called the index search method (Non-Patent Documents 3 and 4). The inverted index is a typical implementation of index search (Non-Patent Documents 3 and 4). An inverted index is a database in which the index terms are listed in dictionary order and serve as keys for looking up the appearance position list of each index term (FIG. 5). By storing not only the document number but also the position within the document as position information, the inverted index method can determine the appearance positions of a search term consisting of a sequence of multiple index terms without referring to the documents themselves. In addition, the basic parameters needed to calculate relevance, such as the number of occurrences of the search term in each document, the number of documents in which the search term appears, and the total number of occurrences of the search term, are obtained at the same time, so the inverted index is an index search method well suited to ranking search.
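A minimal Python sketch of such a positional inverted index is given below. It assumes the documents are already split into index terms; the names `build_inverted_index` and `phrase_positions` and the sample documents are hypothetical and serve only to illustrate how adjacency can be resolved from positions alone.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each index term to a list of (doc_id, position) pairs.

    `docs` maps doc_id -> list of index terms (already tokenized).
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for pos, term in enumerate(docs[doc_id]):
            index[term].append((doc_id, pos))
    return dict(index)

def phrase_positions(index, terms):
    """Start positions where all terms occur adjacently, in the given order."""
    if not terms or any(t not in index for t in terms):
        return []
    later = [set(index[t]) for t in terms[1:]]
    return [(d, p) for (d, p) in index[terms[0]]
            if all((d, p + k + 1) in s for k, s in enumerate(later))]

docs = {
    1: ["information", "search", "device"],
    2: ["search", "information", "search"],
}
index = build_inverted_index(docs)
hits = phrase_positions(index, ["information", "search"])
```

Only the appearance position lists are consulted when resolving the phrase; the document text itself is never read again.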
[0009]
[Non-Patent Document 1]
Michael Persin, Justin Zobel, Ron Sacks-Davis: "Filtered Document Retrieval with Frequency-Sorted Indexes", Journal of the American Society for Information Science, Vol. 47, No. 10, pp. 749-764, 1996.
[Non-Patent Document 2]
Kenji Hayami, Hiroshi Takeno, Tomoya Nagase, Noriyuki Fujimoto, Kenichi Hagiwara: "Proposal and Evaluation of a Scalable WWW Parallel Full-Text Search System Construction Method", IPSJ Database Research Group Report, Vol. 123, No. 7, pp. 45-52, 2001.
[Non-Patent Document 3]
Kenji Kita, Kazuhiko Tsuda, Masamiki Shishibori: "Information Search Algorithm", Kyoritsu Shuppan, 2002.
[Non-Patent Document 4]
Takenobu Tokunaga: "Information Search and Language Processing", University of Tokyo Press, 1999.
[0010]
[Problems to be solved by the invention]
In an information search apparatus that employs an inverted index, most of the processing time of a ranking search is taken up by reading the appearance position lists from the inverted index into main memory and by calculating the relevance of the documents in the search result. A ranking search therefore requires time roughly proportional to the total number of occurrences of the search terms contained in the search query.
[0011]
In general, the number of occurrences of a word or phrase grows in proportion to the amount of text in the document set being searched. In a large-scale information search system such as a WWW search engine, search processing with a conventional inverted index can therefore take a very long time. To cope with this problem, a parallel search method is sometimes used in which the document set is divided, the partial document sets are searched in parallel by multiple computer systems, and the search results are merged (Non-Patent Document 2). However, because a large amount of hardware is required, the cost of introducing and maintaining such a system is high.
[0012]
An object of the present invention is to provide an information search method, apparatus, program, and recording medium capable of performing ranking search over a large-scale document set at high speed and low cost.
[0013]
[Means for Solving the Problems]
An information search apparatus according to the present invention includes index creation means, an inverted index, high-frequency word extraction means, a high-frequency word inverted index, relevance calculation means, and search processing means.
[0014]
The inverted index stores a list of pairs of index terms and appearance positions, created by the index creation means. The high-frequency word inverted index stores a list of pairs of index terms and appearance positions for which the index term frequency is equal to or greater than a predetermined threshold, created by the high-frequency word extraction means. The search processing means receives a search term from the user. When the search term consists of only one index term, it reads the appearance position list of that index term from the high-frequency word inverted index. When the search term is a sequence of multiple index terms, it reads the appearance position list of each index term from the high-frequency word inverted index and obtains the list of positions at which all the index terms appear adjacently. If the appearance position list obtained from the high-frequency word inverted index refers to at least the T documents needed to display the search results (T is an integer of 1 or more), the search processing means passes the list to the relevance calculation means and receives a relevant document list. If the document ranked T-th by relevance has a relevance greater than the maximum relevance attainable by a low-frequency document that is not stored in the high-frequency word inverted index, the top T documents are output. If no appearance position list referring to T or more documents is obtained from the high-frequency word inverted index, or if the top T documents by relevance may not have been retrieved correctly, the search processing means obtains the appearance position list of the search term using the inverted index, outputs it to the relevance calculation means, receives a relevant document list from the relevance calculation means, and outputs at most the top T relevant documents in descending order of relevance.
[0015]
In addition to the ordinary inverted index, the information search apparatus of the present invention prepares in advance a high-frequency word inverted index that stores only the appearance position lists with a large index term frequency. When the search query consists of only one search term, documents with high relevance are retrieved using only the relatively small high-frequency word inverted index; when the search query consists of multiple search terms, the conventional search method is used.
[0016]
[Embodiments of the invention]
Next, embodiments of the present invention will be described with reference to the drawings.
[0017]
Referring to FIG. 1, an information search apparatus according to an embodiment of the present invention comprises an index creation unit 1, an inverted index 2, a high-frequency word extraction unit 3, a high-frequency word inverted index 4, a search reception unit 5, a relevance calculation unit 6, a search processing unit 7, a document set database 8, and a document size database 9.
[0018]
The index creation unit 1 reads the documents to be searched from the document set database 8, extracts the index terms that characterize each document, and stores the index terms and the lists of their appearance positions in the inverted index 2.
[0019]
As shown in FIG. 2, the high-frequency word extraction unit 3 sequentially reads pairs of an index term and its appearance positions from the inverted index 2 (step 11) and extracts the appearance positions belonging to documents in which the index term frequency is equal to or greater than a predetermined threshold (step 12). If the number of such documents is equal to or greater than T, the number of documents displayed as the top search results (T is an integer of 1 or more), it stores the extracted index term and its appearance position list in the high-frequency word inverted index 4 (step 13).
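Steps 11 to 13 can be sketched as follows. The function name `extract_high_frequency`, the dictionary-based index layout, and the parameter names `threshold_f` and `top_t` are assumptions made for this illustration; the sample postings are the list for "search" from FIG. 5.

```python
from collections import Counter

def extract_high_frequency(index, threshold_f, top_t):
    """Build a high-frequency word inverted index from a positional inverted index.

    `index` maps an index term to a list of (doc_id, position) pairs.
    """
    high = {}
    for term, postings in index.items():                    # step 11
        tf = Counter(doc_id for doc_id, _ in postings)      # term frequency per document
        kept = [(d, p) for d, p in postings
                if tf[d] >= threshold_f]                    # step 12
        if len({d for d, _ in kept}) >= top_t:              # step 13: at least T documents
            high[term] = kept
    return high

index = {"search": [(2, 100), (2, 121), (2, 207), (3, 24), (19, 31)]}
high_t1 = extract_high_frequency(index, threshold_f=2, top_t=1)
high_t2 = extract_high_frequency(index, threshold_f=2, top_t=2)
```

With F = 2 only document 2 qualifies, so the term is stored when T = 1 and omitted when T = 2, in which case a query for it would fall through to the ordinary inverted index.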
[0020]
The search reception unit 5 accepts a search query from the user and passes it to the search processing unit 7. It then receives a relevant document list from the search processing unit 7, reads from the document set database 8 the information of the documents corresponding to the document numbers in the relevant document list, and presents it to the user.
[0021]
As shown in FIG. 3, the relevance calculation unit 6 receives the appearance position list of the search term from the search processing unit 7 (step 21), receives the document sizes from the document size database 9 (step 22), calculates the relevance of each document from the frequency of the search term in the document and the document size according to a formula such as the TF-IDF method (step 23), and outputs the result to the search processing unit 7 as a relevant document list (step 24). When the search query consists of only one search term (simply a search for a single word, not an AND search or OR search), the document frequency (the number of documents in which the search term appears) can be regarded as a constant and need not be used in calculating relevance. In other words, in this case "how well the search query matches the document" is considered equivalent to "how strongly the index term contained in the document characterizes the document", and the relevance is calculated with a formula for the weight of an index term in a document, such as the TF-IDF method. For example, when searching with the search term "mobile phone" and calculating relevance by the TF-IDF method,
relevance of a document to "mobile phone" = $f_{i,j} \log(N / F_i) / \sum_i f_{i,j}$
Here, $f_{i,j}$ is the number of occurrences of "mobile phone" in the document, $N$ is the total number of documents, $F_i$ is the number of documents in which "mobile phone" appears, and $\sum_i f_{i,j}$ is the sum of the index term frequencies in the document.
By calculating and comparing this relevance for all documents that contain the term "mobile phone" at least a certain number of times, it is possible to determine which documents match well (that is, which are strongly related to the topic "mobile phone"). Since $\sum_i f_{i,j}$ is the sum of the index term frequencies of all index terms in document $j$, it is simply the size of document $j$. It could be calculated from the inverted index 2, but that would require reading and aggregating the entire index, which is inefficient, so it is prepared separately, as in the document size database 9.
[0022]
For the word "search" in FIG. 5, the appearance position list is (2,100), (2,121), (2,207), (3,24), (19,31). In other words, the word "search" appears five times in total: three times in the document with document number 2, once in the document with document number 3, and once in the document with document number 19.
[0023]
In this way, the index term frequency of each document is obtained from the appearance position list, and the relevance of each document is calculated as explained above (treating the $\log(N / F_i)$ factor as a constant).
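The calculation for the FIG. 5 example can be sketched as below. The appearance position list is taken from the text; the document sizes are hypothetical values chosen for illustration, and `relevance_list` is a name introduced here. Since the $\log(N / F_i)$ factor is treated as a constant, the sketch ranks by frequency divided by document size.

```python
from collections import Counter

def relevance_list(postings, doc_sizes):
    """Rank documents by term frequency / document size (log(N / F_i) dropped as a constant)."""
    tf = Counter(doc_id for doc_id, _ in postings)           # index term frequency per document
    scored = [(doc_id, f / doc_sizes[doc_id]) for doc_id, f in tf.items()]
    return sorted(scored, key=lambda item: (-item[1], item[0]))

postings = [(2, 100), (2, 121), (2, 207), (3, 24), (19, 31)]   # "search" in FIG. 5
doc_sizes = {2: 300, 3: 50, 19: 200}                             # hypothetical sizes
frequencies = Counter(doc_id for doc_id, _ in postings)
ranked = relevance_list(postings, doc_sizes)
```

With these assumed sizes, document 3 ranks first even though document 2 contains the word three times, because document 3 is much shorter.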
[0024]
As shown in FIG. 4, the search processing unit 7 first receives the search term passed from the user via the search reception unit 5 and obtains the appearance position list of the search term using the high-frequency word inverted index 4 (step 31). That is, when the search term consists of only one index term, it reads the appearance position list of that index term from the high-frequency word inverted index 4; when the search term is a sequence of multiple index terms, it reads the appearance position list of each index term from the high-frequency word inverted index 4 and obtains the list of positions at which all the index terms appear adjacently. Next, if the appearance position list obtained from the high-frequency word inverted index 4 refers to at least the T documents needed to display the search results, the search processing unit 7 passes the list to the relevance calculation unit 6 and receives a relevant document list (steps 32, 36, 37). If the document ranked T-th by relevance has a relevance greater than the maximum relevance attainable by a low-frequency document that is not stored in the high-frequency word inverted index 4, the top T documents by relevance have been retrieved correctly, so they are output (steps 38 and 39). For example, when the TF-IDF method is used to calculate relevance, let $F_T$ be the index term frequency and $S_T$ the document size of the document $D_T$ ranked T-th in descending order of relevance. If $F_T / S_T$ is greater than $(F-1) / S_{min}$, it is guaranteed that the documents stored in the inverted index 2 with an index term frequency of $(F-1)$ or less cannot enter the top T by relevance. Here, $S_{min}$ is the smallest document size in the document set. If no appearance position list referring to T or more documents is obtained from the high-frequency word inverted index 4, or if the top T documents by relevance may not have been retrieved correctly, ranking search is performed using the ordinary inverted index 2. That is, the appearance position list of the search term is obtained using the inverted index 2 (step 33) and output to the relevance calculation unit 6 (step 34), a relevant document list is received from the relevance calculation unit 6 (step 35), and at most the top T documents are output in descending order of relevance (step 39).
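The flow of FIG. 4 for a single-term query can be sketched as follows. This is a simplified illustration, not the patented implementation: phrase queries are omitted, relevance is taken as frequency divided by document size, and the names `rank` and `search`, the returned "high"/"full" label, and the sample data are assumptions introduced here.

```python
from collections import Counter

def rank(postings, doc_sizes):
    """Return (doc_id, relevance) pairs in descending order of relevance."""
    tf = Counter(doc_id for doc_id, _ in postings)
    scored = [(d, f / doc_sizes[d]) for d, f in tf.items()]
    return sorted(scored, key=lambda item: (-item[1], item[0]))

def search(term, high_index, full_index, doc_sizes, threshold_f, top_t):
    """Try the high-frequency word inverted index first, then fall back."""
    s_min = min(doc_sizes.values())
    ranked = rank(high_index.get(term, []), doc_sizes)          # steps 31, 36, 37
    if len(ranked) >= top_t:                                     # step 32
        score_t = ranked[top_t - 1][1]                           # F_T / S_T
        if score_t > (threshold_f - 1) / s_min:                  # step 38
            return [d for d, _ in ranked[:top_t]], "high"        # step 39
    ranked = rank(full_index.get(term, []), doc_sizes)           # steps 33 to 35
    return [d for d, _ in ranked[:top_t]], "full"                # step 39

full_index = {"search": [(1, 0), (1, 1), (1, 2), (1, 3), (2, 0), (3, 0)]}
high_index = {"search": [(1, 0), (1, 1), (1, 2), (1, 3)]}   # F = 3 keeps only document 1

easy = search("search", high_index, full_index, {1: 4, 2: 10, 3: 10}, 3, 1)
hard = search("search", high_index, full_index, {1: 40, 2: 10, 3: 2}, 3, 1)
```

In the second call, document 3 is so short that a single occurrence outranks document 1. The bound $(F-1)/S_{min}$ detects that this is possible, so the sketch falls back to the ordinary inverted index instead of returning a wrong top result.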
[0025]
When a search query using two or more search terms is given, or when ranking search based on relevance is not performed, conventional search processing using the ordinary inverted index may be performed. The method of the present invention can be introduced into an existing information search apparatus as an add-on.
[0026]
When Japanese documents are indexed, the index terms may be morphemes, N-grams, or a combination of both, and the method of the present invention is applicable in every case.
[0027]
The method of the present invention uses a threshold F that is fixed in advance, but a different threshold may also be set for each index term. In that case, letting $F_{max}$ be the largest of the thresholds of the index terms contained in the search phrase, it can be determined that the search was completed using only the high-frequency word inverted index when $F_T / S_T$ is greater than $(F_{max} - 1) / S_{min}$.
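The reason the largest threshold suffices can be written out as a short bound. This is a sketch under assumptions not spelled out in the text: relevance is taken as phrase frequency divided by document size, $f_k$ and $S_k$ denote the phrase frequency and size of a document $D_k$ that the high-frequency lookup misses, and $F_w$ is the threshold of a constituent index term $w$. These symbols are introduced here only for illustration.

```latex
% D_k is missed only if some constituent term w occurs fewer than F_w times in it,
% and a phrase cannot occur more often than any of its constituent terms.
f_k \le F_w - 1 \le F_{\max} - 1, \qquad S_k \ge S_{\min}
\;\Longrightarrow\;
\frac{f_k}{S_k} \le \frac{F_{\max} - 1}{S_{\min}}
% Hence, whenever the test (F_max - 1) / S_min < F_T / S_T holds:
\frac{f_k}{S_k} < \frac{F_T}{S_T}
```

So no missed document can displace any of the top T documents found through the high-frequency word inverted index.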
[0028]
The method of the present invention is applicable not only to the TF-IDF method but also to the many relevance calculation methods that use index term frequency and document size as their main parameters. When relevance does not need to be calculated strictly, the top T documents may be used as the search result as soon as T or more documents have been retrieved from the high-frequency word inverted index.
[0029]
Instead of creating a separate high-frequency word inverted index, the appearance position lists stored in the inverted index may be arranged in descending order of index term frequency, so that the head of each appearance position list can be regarded as the high-frequency word inverted index and search processing similar to that of the present invention can be performed (Non-Patent Document 1). With an inverted index organized in this way, however, ordinary search processing that does not use the method of the present invention becomes slower.
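A rough sketch of this alternative layout is shown below. It is not taken from Non-Patent Document 1; the functions `frequency_sorted` and `high_frequency_prefix` and the sample postings are hypothetical and only show how the head of a frequency-sorted list can stand in for the high-frequency word inverted index.

```python
from collections import Counter
from itertools import takewhile

def frequency_sorted(postings):
    """Sort (doc_id, position) pairs by descending per-document term frequency."""
    tf = Counter(doc_id for doc_id, _ in postings)
    ordered = sorted(postings, key=lambda dp: (-tf[dp[0]], dp[0], dp[1]))
    return ordered, tf

def high_frequency_prefix(ordered, tf, threshold_f):
    """Read the list only while the term frequency stays at or above the threshold."""
    return list(takewhile(lambda dp: tf[dp[0]] >= threshold_f, ordered))

postings = [(1, 5), (2, 100), (2, 121), (2, 207), (3, 24)]   # hypothetical list
ordered, tf = frequency_sorted(postings)
head = high_frequency_prefix(ordered, tf, threshold_f=2)
```

Because the list is no longer ordered by document number, operations that merge lists by document number lose their sequential access pattern, which is consistent with the slowdown of ordinary search noted above.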
[0030]
As a method similar to the present invention, it is also conceivable to perform ranking search in advance for every index term and to store the top T entries of each relevant document list (Non-Patent Document 2). Unlike the method of the present invention, however, that method cannot search for a phrase consisting of a sequence of multiple index terms. The method of the present invention can also be used in combination with such a method.
[0031]
The present invention may be realized by dedicated hardware, or alternatively by recording a program for realizing its functions on a computer-readable recording medium and having a computer system read and execute the program recorded on that medium. The computer-readable recording medium refers to a recording medium such as a floppy disk, a magneto-optical disk, or a CD-ROM, or a storage device such as a hard disk device built into a computer system. The computer-readable recording medium further includes media that hold the program dynamically for a short time (a transmission medium or transmission wave), as when the program is transmitted over the Internet, and media that hold the program for a certain period of time, such as volatile memory inside the computer system serving as the server in that case.
[0032]
[Effect of the invention]
As described above, according to the present invention, when a single search term is searched for, documents with high relevance are retrieved using only the high-frequency word inverted index. Compared with conventional search methods, this greatly reduces the amount of data read from the database into main memory and the CPU processing time required to calculate relevance, so that search processing is accelerated. For example, about 70% of the search queries submitted to WWW search engines consist of only a single search term, so the speed-up provided by the present invention is expected to be large.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a configuration of an information search device according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the processing of the high-frequency word extraction unit of the information search apparatus of FIG. 1.
FIG. 3 is a flowchart showing the processing of the relevance calculation unit of the information search apparatus of FIG. 1.
FIG. 4 is a flowchart showing the processing of the search processing unit of the information search apparatus of FIG. 1.
FIG. 5 is a diagram showing a general configuration of an inverted index.
[Explanation of symbols]
1 index creation unit
2 inverted index
3 high-frequency word extraction unit
4 high-frequency word inverted index
5 search reception unit
6 relevance calculation unit
7 search processing unit
8 document set database
9 document size database
11 to 14, 21 to 24, 31 to 39 steps

Claims

An information search method comprising:
an index creation step of extracting index terms from documents to be searched and storing a list of pairs of the index terms and their appearance positions in an inverted index;
a high-frequency word extraction step of extracting, from the inverted index, a list of pairs of index terms and appearance positions for which the index term frequency is equal to or greater than a predetermined threshold, and storing the list in a high-frequency word inverted index;
a relevance calculation step of receiving an appearance position list of a search term, calculating a relevance for each document from the frequency of the search term in each document and the document size, and outputting the result as a relevant document list; and
a search processing step of: receiving a search term from a user; when the search term consists of only one index term, reading the appearance position list of that index term from the high-frequency word inverted index; when the search term is a sequence of multiple index terms, reading the appearance position list of each index term from the high-frequency word inverted index and obtaining a list of positions at which all the index terms appear adjacently; when the appearance position list obtained from the high-frequency word inverted index refers to T or more documents (T is an integer of 1 or more) needed to display the search results, passing the appearance position list to the relevance calculation step and receiving a relevant document list; when the document ranked T-th by relevance has a relevance greater than the maximum relevance attainable by a low-frequency document that is not stored in the high-frequency word inverted index, outputting the top T documents; and when no appearance position list referring to T or more documents is obtained from the high-frequency word inverted index, or when the top T documents by relevance may not have been retrieved correctly, obtaining the appearance position list of the search term using the inverted index, outputting the list to the relevance calculation step, receiving a relevant document list from the relevance calculation step, and outputting at most the top T relevant documents in descending order of relevance.

An information search apparatus comprising:
an inverted index that stores pairs of an index term and its appearance positions;
index creation means for extracting index terms from documents to be searched and storing pairs of the index terms and their appearance positions in the inverted index;
a high-frequency word inverted index that stores only pairs of high-frequency index terms and their appearance positions;
high-frequency word extraction means for extracting, from the inverted index, a list of pairs of index terms and appearance positions for which the index term frequency is equal to or greater than a predetermined threshold, and storing the list in the high-frequency word inverted index;
relevance calculation means for receiving an appearance position list of a search term, calculating a relevance for each document from the frequency of the search term in each document and the document size, and outputting the result as a relevant document list; and
search processing means for: receiving a search term from a user; when the search term consists of only one index term, reading the appearance position list of that index term from the high-frequency word inverted index; when the search term is a sequence of multiple index terms, reading the appearance position list of each index term from the high-frequency word inverted index and obtaining a list of positions at which all the index terms appear adjacently; when the appearance position list obtained from the high-frequency word inverted index refers to T or more documents (T is an integer of 1 or more) needed to display the search results, passing the appearance position list to the relevance calculation means and receiving a relevant document list; when the document ranked T-th by relevance has a relevance greater than the maximum relevance attainable by a low-frequency document that is not stored in the high-frequency word inverted index, outputting the top T documents; and when no appearance position list referring to T or more documents is obtained from the high-frequency word inverted index, or when the top T documents by relevance may not have been retrieved correctly, obtaining the appearance position list of the search term using the inverted index, outputting the list to the relevance calculation means, receiving a relevant document list from the relevance calculation means, and outputting at most the top T relevant documents in descending order of relevance.

An information search program for causing a computer to execute the information search method according to claim 1.

A computer-readable recording medium on which the information search program according to claim 3 is recorded.