JP2010277542A - Document search device and document search program

Info

Publication number: JP2010277542A
Application number: JP2009132378A
Authority: JP
Inventors: Yoshihito Yasuda; 宜仁安田; Takashi Inoue; 孝史井上; Yukio Uematsu; 幸生植松; Ryoji Kataoka; 良治片岡
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2009-06-01
Filing date: 2009-06-01
Publication date: 2010-12-09
Anticipated expiration: 2029-06-01
Also published as: JP5193952B2

Abstract

PROBLEM TO BE SOLVED: To provide a document search device capable of retrieving documents that satisfy a search request at high speed while reducing the computational load of search processing.

SOLUTION: When searching for electronic documents containing a phrase specified by a user terminal, the document search device uses a word inverted index storage means 108, which stores information relating each word to electronic documents, and a phrase inverted index storage means 107, which stores information relating each phrase composed of a plurality of words to electronic documents. The device includes a processing time estimation means 103, which extracts the phrases contained in search history information and estimates the processing time required to search for electronic documents containing each extracted phrase using the word inverted index in the word inverted index storage means 108, and a stored phrase determination means 104, which determines the phrases to be stored in the phrase inverted index storage means 107 based on the processing time estimated for each phrase by the estimation means 103 and on the frequency with which the phrase appears in the search history information.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to an apparatus that stores documents and, in response to a user's request, retrieves at high speed the stored documents that match the request.

With the spread of electronic documents and the explosive growth of the Internet, users on networks and within companies can now browse large volumes of documents. Search systems (full-text search systems) that can quickly find, among such volumes, the documents satisfying a search request expressed by the user are in wide use. A search request is commonly expressed by specifying a sequence of words (a keyword sequence) that the desired documents should contain.

Because the documents containing a particular keyword make up only a fraction of the whole collection, checking every stored document for the keyword each time a search request is entered is inefficient: most of the work is spent on documents that do not contain the keyword at all. For this reason, a widely used speed-up is an index called a word inverted index, which uses each word as an index term and records the group of documents containing that word together with the positions at which the term appears in each document (see, for example, Non-Patent Document 1).

When a large number of documents is targeted and several keywords are entered, it is common to run a search for documents that contain each keyword at least once (an AND search). There is also demand, however, for searches that do not treat the keywords individually but instead find documents containing a phrase, that is, the keywords with the adjacency and order they have in the search request preserved.

However, searching for documents in which a phrase appears using only a simple word inverted index imposes a heavy computational load. To confirm that the words of the phrase appear adjacently and in order, the inverted index values (inverted lists) obtained with each word of the phrase as a key must be merged, and the resulting list must be scanned entry by entry to check whether the words appear adjacently in the requested order.
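
The merge-and-check procedure described above can be sketched as follows. This is a minimal illustration, not the implementation used in the patent: the index layout (`word -> {doc_id: positions}`), the use of word offsets as positions, and the names `PositionalIndex`, `phrase_search`, and `toy_index` are all assumptions made for the example.

```python
from typing import Dict, List

# Assumed layout: word -> {doc_id: sorted word offsets of that word in the document}
PositionalIndex = Dict[str, Dict[int, List[int]]]


def phrase_search(index: PositionalIndex, phrase: List[str]) -> List[int]:
    """Return the IDs of documents in which the words of `phrase` occur consecutively."""
    if not phrase:
        return []
    postings = [index.get(word, {}) for word in phrase]
    # Merge step: only documents present in every inverted list can match.
    candidates = set(postings[0])
    for plist in postings[1:]:
        candidates &= set(plist)
    hits = []
    for doc_id in sorted(candidates):
        later = [set(plist[doc_id]) for plist in postings[1:]]
        # Adjacency step: the k-th following word must sit k positions after the first word.
        if any(all(start + k + 1 in pos for k, pos in enumerate(later))
               for start in postings[0][doc_id]):
            hits.append(doc_id)
    return hits


toy_index: PositionalIndex = {
    "full": {1: [0], 2: [3]},
    "text": {1: [1], 2: [0]},
    "search": {1: [2], 2: [4]},
}
print(phrase_search(toy_index, ["full", "text", "search"]))
```

Both documents contain all three words, so an AND search would return both, but only document 1 contains them adjacently and in order. The per-document position checks are the part whose cost grows with the length of the inverted lists.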

Non-Patent Document 1: Hiroshi Takeno, Takashi Inoue, "Distributed high-speed information collection / full-text search system InfoBee/Evangelist", NTT R&D, vol. 52, no. 2, 2003, pp. 78-84.
Non-Patent Document 2: Hugh E. Williams, Justin Zobel and Dirk Bahle, "Fast phrase querying with combined indexes", ACM TOIS, vol. 22, no. 4, 2004, pp. 1-17.
Non-Patent Document 3: Takashi Inoue, Yukio Uematsu, Yoshihito Yasuda, Ryoji Kataoka, "Phrase index retention strategy in full-text search system", IEICE Technical Report, DEIM Forum 2009, 2009, pp. 1-5.

To address this problem, a known method reduces the computational load of searches involving phrases by maintaining, in addition to the index keyed by individual words, an index that treats an entire phrase as if it were a single word (hereinafter, a phrase inverted index).

However, adding every possible phrase to the phrase inverted index would require an enormous amount of storage and is not realistic.

It is therefore necessary to select, according to some criterion, which phrases to store in a phrase inverted index of limited size.

As one such selection criterion, a method that uses the past search history and stores the phrases appearing most frequently in it has long been known (see, for example, Non-Patent Document 2).

While this method is known to be effective to a degree, the load incurred when a phrase search is handled with the ordinary inverted index alone varies from phrase to phrase, and a frequent phrase is not necessarily an expensive one. The method therefore could not select the phrases that are optimal in terms of computational load. In response, Non-Patent Document 3 and others have reported a technique that selects more suitable phrases from the standpoint of computational load by using the measured time taken to process each phrase with inverted lists alone.

That technique is effective when the measured time for processing a phrase with inverted lists alone can be obtained. Such measured times, however, are not always available. In a system that has already introduced a phrase inverted index, the ordinary inverted index is not used for phrases that are in that index, so no measured time for processing them with ordinary inverted lists alone can be obtained. Likewise, measured times are unavailable in environments where the overhead of adding a time measurement mechanism to the system cannot be ignored.

An object of the present invention is to solve the above problems of the prior art and to provide a document search device and a document search program that reduce the computational load of search processing and can retrieve documents satisfying a search request at high speed.

To achieve the above object, the document search device of the present invention is a document search device that, when searching for electronic documents containing a phrase specified by a user terminal, uses a word inverted index storage means that stores information relating words to electronic documents and a phrase inverted index storage means that stores information relating phrases composed of a plurality of words to electronic documents. The device comprises an estimation means that extracts the phrases contained in search history information and estimates the processing time required to search for electronic documents containing each extracted phrase using the word inverted index in the word inverted index storage means, and a stored phrase determination means that determines the phrases to be stored in the phrase inverted index storage means based on the processing time of each phrase estimated by the estimation means and the frequency of the phrase in the search history information.

According to the present invention, even when a mechanism for measuring actual processing time is not available, the processing time required can be estimated and the phrases to be stored in the phrase inverted index can be selected appropriately. As a result, the computational load of search processing is reduced and processing time is shortened.

FIG. 1 is a block diagram showing an embodiment of the present invention.
FIG. 2 is a flowchart of the index generation function in the embodiment.
FIG. 3 is an explanatory diagram of the search history storage database in the embodiment.
FIG. 4 is an explanatory diagram of the word inverted index in the embodiment.
FIG. 5 is an explanatory diagram of the measurement result database in the embodiment.

Embodiments of the present invention are described below with reference to the drawings; the present invention is not limited to the following embodiments.

FIG. 1 shows a configuration example of a document search device 100 according to an embodiment of the present invention. In FIG. 1, reference numeral 101 denotes a search history storage database, which stores the history of search requests that users have entered into the search device 100.

Reference numeral 102 denotes a phrase/frequency extraction means, which refers to the search history in the search history storage database 101 and extracts the phrases used as search terms in the past together with their frequencies.

Reference numeral 103 denotes a processing time estimation means, which serves as the estimation means of the present invention. It takes a phrase from the phrase/frequency extraction means 102 as input and estimates the processing time that would be required to process that phrase using the word inverted index alone.

Reference numeral 104 denotes a stored phrase determination means, which takes the set of phrases obtained by the phrase/frequency extraction means 102 as input and determines the phrases to be stored in the phrase inverted index storage means 107 based on the frequency of each phrase and the processing time estimated by the processing time estimation means 103.

Reference numeral 105 denotes an inverted index generation means, which generates the word inverted index and the phrase inverted index for the documents stored in the document database 106.

Reference numeral 107 denotes a phrase inverted index storage means, which stores information relating phrases composed of a plurality of words to electronic documents.

Reference numeral 108 denotes a word inverted index storage means, which stores information relating words to electronic documents.

Reference numeral 109 denotes a search execution means, which performs searches using the documents in the document database 106, the phrase inverted index in the phrase inverted index storage means 107, and the word inverted index in the word inverted index storage means 108.

Reference numeral 110 denotes a measurement result database, which stores, for past phrase searches, the correspondence between statistics of the inverted index entries of the words making up each phrase and the measured processing time.

Reference numeral 111 denotes a coefficient learning means, which uses the information in the measurement result database 110 to learn the coefficients used in the processing of the processing time estimation means 103.

The functions of the phrase/frequency extraction means 102, processing time estimation means 103, stored phrase determination means 104, inverted index generation means 105, search execution means 109, and coefficient learning means 111 are implemented by, for example, a computer.

The document search device 100 configured as above has an index generation function, which generates the word inverted index and the phrase inverted index in advance, and a search function, which searches for documents using the generated indexes.

The index generation function is carried out by the phrase/frequency extraction means 102, processing time estimation means 103, coefficient learning means 111, stored phrase determination means 104, and inverted index generation means 105 in FIG. 1. The search function is carried out by the search execution means 109.

FIG. 2 shows a flowchart of the index generation function. The description below follows the flowchart of FIG. 2.

Step S01: The index generation function uses the phrase/frequency extraction means 102 to refer to the search history stored in the search history storage database 101 and extract the phrases used as search terms in the past together with their frequencies.

As shown in FIG. 3, for example, the search history storage database 101 stores the search requests that users have entered into the search system so far, together with time information.

The phrase/frequency extraction means 102 extracts the phrases that appeared in the search history within a fixed past period (for example, one month) and counts the occurrences of each. Specifically, for search history entries containing quotation marks (「...」 or “...”), the text inside the quotation marks is recognized as a phrase. To associate each phrase with its number of occurrences, an associative array is created with the phrase as the key and the count as the value.
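
Step S01 can be sketched as below. The history format (a list of `(timestamp, query)` pairs), the 30-day window standing in for "one month", the set of quotation marks matched, and the names `QUOTED`, `extract_phrase_counts`, and `history` are assumptions for illustration; a `Counter` plays the role of the associative array.

```python
import re
from collections import Counter
from datetime import datetime, timedelta
from typing import Iterable, Tuple

# Text inside straight double quotes, curly double quotes, or Japanese corner brackets.
QUOTED = re.compile(r'"([^"]+)"|“([^”]+)”|「([^」]+)」')


def extract_phrase_counts(history: Iterable[Tuple[datetime, str]],
                          now: datetime,
                          window: timedelta = timedelta(days=30)) -> Counter:
    """Count quoted phrases in queries issued within `window` before `now`."""
    counts: Counter = Counter()
    for issued_at, query in history:
        if not (now - window <= issued_at <= now):
            continue  # outside the fixed past period
        for groups in QUOTED.findall(query):
            # Exactly one alternative of the pattern matched; take its text.
            phrase = next(g for g in groups if g).strip()
            if phrase:
                counts[phrase] += 1
    return counts


history = [
    (datetime(2009, 5, 20, 10, 0), '"new york" hotel'),
    (datetime(2009, 5, 25, 9, 30), 'cheap “new york” flights'),
    (datetime(2009, 5, 28, 8, 0), '「full text」 intro'),
    (datetime(2009, 1, 1, 0, 0), '"old query"'),  # too old, ignored
]
counts = extract_phrase_counts(history, now=datetime(2009, 6, 1))
print(counts.most_common())
```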

This embodiment has been described for queries containing explicit quotation marks, but existing techniques that extract from the search terms the parts that can implicitly be regarded as phrases may also be used.

When such implicit phrases are extracted, a word sequence that is not actually a phrase may occasionally be mistaken for one. It is preferable to avoid such errors, but the present invention remains applicable even when they occur, so an extraction method that may produce errors can still be used.

Step S02: The index generation function uses the inverted index generation means 105 to create the word inverted index of the word inverted index storage means 108.

As shown in FIG. 4, for example, the word inverted index uses a word as the index key and holds as its value the numbers of the documents containing that word and the start positions at which the word appears in each document. Existing methods widely used in information retrieval can be used to build it.
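
A minimal sketch of such an index follows. Whitespace tokenization and word offsets as "start positions" are simplifying assumptions (Japanese text would need a morphological analyzer, and the patent does not say whether positions are word or character offsets); `build_word_index` and `docs` are illustrative names.

```python
from collections import defaultdict
from typing import Dict, List


def build_word_index(docs: Dict[int, str]) -> Dict[str, Dict[int, List[int]]]:
    """Map each word to {document number: positions at which the word occurs}."""
    index: Dict[str, Dict[int, List[int]]] = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(position)
    return dict(index)


docs = {
    1: "full text search system",
    2: "text search with full text index",
}
word_index = build_word_index(docs)
print(word_index["text"])
```

Each key's value is the inverted list for that word: the documents it occurs in and, for each document, where it occurs.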

Step S03: The index generation function uses the coefficient learning means 111 to determine the parameters used by the processing time estimation means 103.

The coefficient learning means 111 learns the coefficients used by the processing time estimation function described later from the correspondence, in past phrase searches, between statistics of the inverted index entries of the words making up each phrase and the measured processing time. This correspondence is taken from the information stored in the measurement result database 110, as shown for example in FIG. 5.

The coefficients are learned by multiple regression analysis with the processing time as the objective variable and the statistics as explanatory variables. Specifically, ridge regression or the like can be used.

The following values are used as the statistics.
1. The number of words making up the phrase
2. The minimum length of the inverted lists
3. The maximum length of the inverted lists
4. The minimum number of documents in the inverted lists
5. The maximum number of documents in the inverted lists
6. The minimum, over the inverted lists, of the average number of occurrences obtained from the occurrence positions
7. The maximum, over the inverted lists, of the average number of occurrences obtained from the occurrence positions
The coefficients corresponding to these seven statistics are denoted $\alpha_i$ ($i = 1, \dots, 7$).
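
The coefficient learning of step S03 can be illustrated with a small ridge regression solved through the normal equations. This is a sketch under assumptions: the patent names ridge regression only as one option and gives no regularization strength, so `lam` is hypothetical; no intercept term is fitted, because equation (1) has none; and the toy data uses two statistics instead of seven to stay readable. The function names `solve_linear` and `fit_ridge` are illustrative.

```python
from typing import List, Sequence


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_ridge(rows: Sequence[Sequence[float]], times: Sequence[float],
              lam: float = 1.0) -> List[float]:
    """Return the coefficients minimising sum((t - alpha . s)^2) + lam * |alpha|^2."""
    k = len(rows[0])
    # Normal equations: (S^T S + lam I) alpha = S^T t
    sts = [[sum(r[i] * r[j] for r in rows) + (lam if i == j else 0.0)
            for j in range(k)] for i in range(k)]
    stt = [sum(r[i] * t for r, t in zip(rows, times)) for i in range(k)]
    return solve_linear(sts, stt)


# Toy measurement results: each row holds the statistics of one past phrase search,
# and `measured` holds the corresponding measured processing times.
stats_rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
measured = [2.0, 3.0, 5.0]
alphas = fit_ridge(stats_rows, measured, lam=1.0)
print(alphas)
```

With `lam` close to zero the fit approaches ordinary least squares; a larger `lam` shrinks the coefficients, which helps when statistics such as the minimum and maximum list lengths are strongly correlated.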

Step S04: The index generation function uses the processing time estimation means 103 to estimate, for each phrase, the processing time that a search using the word inverted index alone would take.

The processing time estimation means 103 takes a phrase as input and estimates the processing time required to process it using the word inverted index alone.

Specifically, the processing time estimation means 103 first looks up, in the word inverted index storage means 108, each word of a phrase extracted from the search history storage database 101 by the phrase/frequency extraction means 102, obtains the inverted list of each word, and computes statistics of those inverted lists. As with the coefficient learning means 111, the following statistics are used.
1. The number of words making up the phrase
2. The minimum length of the inverted lists
3. The maximum length of the inverted lists
4. The minimum number of documents in the inverted lists
5. The maximum number of documents in the inverted lists
6. The minimum, over the inverted lists, of the average number of occurrences obtained from the occurrence positions
7. The maximum, over the inverted lists, of the average number of occurrences obtained from the occurrence positions
From these statistics $s_1, \dots, s_7$ and the coefficients $\alpha_1, \dots, \alpha_7$ learned by the coefficient learning means 111, the processing time required to process the input phrase using the word inverted index alone is computed by the following regression equation and output.

$$\alpha_1 s_1 + \alpha_2 s_2 + \cdots + \alpha_7 s_7 \qquad (1)$$
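
The computation of the seven statistics and of equation (1) can be sketched as follows. The patent does not define "length of the inverted list" precisely; this sketch assumes it means the total number of position entries for a word, and takes the "average number of occurrences" to be occurrences per document. The index layout, the coefficient values, and the names `phrase_statistics` and `estimate_time` are likewise assumptions.

```python
from typing import Dict, List, Sequence

Postings = Dict[int, List[int]]  # document number -> positions of one word


def phrase_statistics(index: Dict[str, Postings], phrase: Sequence[str]) -> List[float]:
    """Return the seven statistics s_1..s_7 for the words of `phrase`."""
    if not phrase:
        raise ValueError("phrase must contain at least one word")
    lists = [index.get(word, {}) for word in phrase]
    lengths = [sum(len(pos) for pos in pl.values()) for pl in lists]  # assumed meaning of "length"
    doc_counts = [len(pl) for pl in lists]
    means = [n / d if d else 0.0 for n, d in zip(lengths, doc_counts)]
    return [
        float(len(phrase)),
        float(min(lengths)), float(max(lengths)),
        float(min(doc_counts)), float(max(doc_counts)),
        min(means), max(means),
    ]


def estimate_time(stats: Sequence[float], alphas: Sequence[float]) -> float:
    """Equation (1): the weighted sum of the statistics."""
    return sum(a * s for a, s in zip(alphas, stats))


toy_index = {
    "full": {1: [0], 2: [3]},
    "text": {1: [1], 2: [0, 4]},
}
hypothetical_alphas = [0.5, 0.1, 0.1, 0.2, 0.2, 1.0, 1.0]  # made-up values
stats = phrase_statistics(toy_index, ["full", "text"])
print(stats, estimate_time(stats, hypothetical_alphas))
```

Only the inverted lists are inspected here; the phrase search itself is never run, which is what lets the estimate stand in for a measured time.
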

Step S05: The index generation function uses the stored phrase determination means 104 to determine the phrases to be stored in the phrase inverted index storage means 107.

The stored phrase determination means 104 takes the set of phrases obtained by the phrase/frequency extraction means 102 as input and determines the set of phrases to be stored in the phrase inverted index storage means 107.

For each phrase $P_i$ in the set, the phrase storage score $S_i$ is calculated from the frequency $F_i$ of $P_i$ in the search history over the fixed period and the estimated processing time $T_i$ of $P_i$ calculated by the processing time estimation function, using the following equation.

$$S_i = F_i \times T_i \qquad (2)$$

Phrases are then selected for storage in descending order of $S_i$, as long as the predetermined size of the phrase index is not exceeded.
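
The selection rule of equation (2) can be sketched as a greedy pass over the phrases in descending score order. The patent does not say how the size of a phrase's index entry is measured or what happens once a phrase no longer fits; this sketch assumes a per-phrase size is given and stops at the first phrase that would exceed the capacity. The name `select_phrases` and all numbers are hypothetical.

```python
from typing import Dict, List


def select_phrases(freq: Dict[str, int],
                   est_time: Dict[str, float],
                   size: Dict[str, int],
                   capacity: int) -> List[str]:
    """Pick phrases in descending order of S_i = F_i * T_i within `capacity`."""
    ranked = sorted(freq, key=lambda p: freq[p] * est_time[p], reverse=True)
    chosen: List[str] = []
    used = 0
    for phrase in ranked:
        if used + size[phrase] > capacity:
            break  # assumed policy: stop at the first phrase that does not fit
        chosen.append(phrase)
        used += size[phrase]
    return chosen


freq = {"new york": 50, "full text search": 5, "to be or not to be": 2}
est_time = {"new york": 2.0, "full text search": 40.0, "to be or not to be": 300.0}
size = {"new york": 30, "full text search": 20, "to be or not to be": 60}
print(select_phrases(freq, est_time, size, capacity=90))
```

In this toy data the most frequent phrase has the lowest score because it is cheap to process with the word inverted index, so it is the one left out; a purely frequency-based criterion would have stored it first.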

Step S06: The index generation function uses the inverted index generation means 105 to create the phrase inverted index of the phrase inverted index storage means 107.

The phrase inverted index of the phrase inverted index storage means 107 uses a phrase as the index key and holds as its value the numbers of the documents containing that phrase and the start positions of its occurrences in each document. It has the same structure as the word inverted index of FIG. 4.
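
One way to build such a phrase inverted index is to derive it from a positional word index, as sketched below. The patent does not specify how the inverted index generation means 105 produces the phrase entries, so this derivation, the word-offset positions, and the names `phrase_postings` and `build_phrase_index` are assumptions.

```python
from typing import Dict, List, Sequence

Postings = Dict[int, List[int]]  # document number -> start positions


def phrase_postings(word_index: Dict[str, Postings], words: Sequence[str]) -> Postings:
    """Return {document number: start positions of the phrase} for one phrase."""
    if not words:
        return {}
    lists = [word_index.get(w, {}) for w in words]
    result: Postings = {}
    for doc_id, starts in lists[0].items():
        if not all(doc_id in pl for pl in lists[1:]):
            continue
        later = [set(pl[doc_id]) for pl in lists[1:]]
        # Keep the start positions at which every following word is adjacent and in order.
        hits = [s for s in starts if all(s + k + 1 in pos for k, pos in enumerate(later))]
        if hits:
            result[doc_id] = hits
    return result


def build_phrase_index(word_index: Dict[str, Postings],
                       phrases: Sequence[str]) -> Dict[str, Postings]:
    """Precompute the inverted list of every selected phrase."""
    return {p: phrase_postings(word_index, p.split()) for p in phrases}


word_index = {
    "full": {1: [0], 2: [3]},
    "text": {1: [1], 2: [0, 4]},
    "search": {1: [2], 2: [1]},
}
phrase_index = build_phrase_index(word_index, ["full text", "text search"])
print(phrase_index["text search"])
```

Once built, a stored phrase is answered with a single key lookup such as `phrase_index["full text"]`, with no merging or adjacency checks at query time.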

Next, the search function searches for documents using the search execution means 109.

It goes without saying that some or all of the functions of the means in the document search device of this embodiment can be implemented as a computer program and that the present invention can be realized by executing that program on a computer. The program can be recorded on a computer-readable recording medium, such as an FD (Floppy (registered trademark) Disk), MO (Magneto-Optical disk), ROM (Read Only Memory), memory card, CD (Compact Disk)-ROM, DVD (Digital Versatile Disk)-ROM, CD-R, CD-RW, HDD, or removable disk, and stored or distributed in that form. The program can also be provided over a network, such as the Internet or by e-mail.

DESCRIPTION OF SYMBOLS
100 ... Document search device
101 ... Search history storage database
102 ... Phrase/frequency extraction means
103 ... Processing time estimation means
104 ... Stored phrase determination means
105 ... Inverted index generation means
106 ... Document database
107 ... Phrase inverted index storage means
108 ... Word inverted index storage means
109 ... Search execution means
110 ... Measurement result database
111 ... Coefficient learning means

Claims

1. A document search device that, when searching for electronic documents containing a phrase specified by a user terminal, uses a word inverted index storage means for storing information relating words to electronic documents and a phrase inverted index storage means for storing information relating phrases composed of a plurality of words to electronic documents, the device comprising:
an estimation means for extracting the phrases contained in search history information and estimating the processing time required to search for electronic documents containing each extracted phrase using the word inverted index in the word inverted index storage means; and
a stored phrase determination means for determining the phrases to be stored in the phrase inverted index storage means based on the processing time of each phrase estimated by the estimation means and the appearance frequency of the phrase in the search history information.

2. The document search device according to claim 1, wherein the estimation means looks up, in the word inverted index in the word inverted index storage means, each word making up a phrase extracted from the search history information, obtains the inverted list of each word as the related information, and obtains the processing time by regression analysis using statistics of the obtained inverted lists.

3. The document search device, wherein the stored phrase determination means calculates a score for each phrase using the estimated processing time and the appearance frequency, and determines the phrases to be stored according to the score.

4. A document search program for causing a computer to function as each of the means according to any one of claims 1 to 3.