JP2006172363A - Document retrieval device, index reconfiguration method and program

Info

Publication number: JP2006172363A
Application number: JP2004367358A
Authority: JP
Inventors: Yasutsugu Morimoto; 康嗣森本; Naoto Akira; 直人秋良
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2004-12-20
Filing date: 2004-12-20
Publication date: 2006-06-29

Abstract

PROBLEM TO BE SOLVED: To reduce the size of an index while preventing retrieval omissions of valuable documents. SOLUTION: A document retrieval device creates an index from acquired documents and identifies the documents that satisfy an input retrieval condition by referring to the index based on that condition. The device comprises a CPU and a storage device that stores the index and an index compression policy file indicating rules for reconfiguring the index, and the CPU reconfigures the index based on time information attached to each document. COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to a document retrieval apparatus that retrieves desired documents from digitized documents, and more particularly to a technique for reducing the amount of index data.

With the spread of computers, the volume of stored digitized documents is increasing. A technique for retrieving desired documents from digitized documents is therefore required. Various information retrieval techniques, such as full-text search and similar-document search, have been developed for this purpose, and various information retrieval algorithms are known (see, for example, Non-Patent Document 1).

A conventional information retrieval system uses an index to retrieve desired documents at high speed. An index is secondary data for searching, created from the documents to be searched. As a technique for creating such an index, a technique for creating an index for full-text search is known (see, for example, Patent Document 1).

Unlike an index for image retrieval, an index for document retrieval contains a large amount of data. Consequently, when the volume of documents to be searched is large, the storage capacity needed to hold the index grows and the information retrieval system becomes costly.

The amount of index data depends on the method used to create the index. For example, with the morphological analysis method the index is roughly the same size as the searched documents, whereas with the N-gram method it is several times that size.

The amount of data differs between index creation methods because it depends on how far each method guards against search omissions. In other words, reducing search omissions inevitably increases the amount of index data.

For example, consider a document to be searched that contains the text 「・・・の文字列を・・・」 ("... the character string ...").

In the morphological analysis method, an index is created using the words obtained by morphological analysis as index words. Suppose that morphological analysis recognizes 「文字列」 ("character string") as a single word. In this case, even if the user enters 「文字」 ("character") as the search condition, the information retrieval system cannot find the document, and a search omission occurs.

The N-gram method is a technique for preventing such search omissions. In the N-gram method, runs of consecutive characters such as 「文字」, 「文字列」 and 「字列」 are extracted from the document. These runs include strings, such as 「字列」, that are not proper words. An information retrieval system that creates its index with the N-gram method can therefore prevent search omissions, but its index becomes very large.
Patent Document 1: JP-A-5-61910
Non-Patent Document 1: Kenji Kita, Kazuhiko Tsuda, Masami Shishibori, "Information Retrieval Algorithms", Kyoritsu Shuppan, January 2002
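The difference between the two extraction methods described above can be illustrated with a minimal Python sketch. The function name `char_ngrams` and the choice of 2- and 3-character runs are assumptions made for illustration; the source does not fix the value of N.

```python
def char_ngrams(text, n_values=(2, 3)):
    """Return every run of n consecutive characters in text, for each n in n_values."""
    grams = set()
    for n in n_values:
        for i in range(len(text) - n + 1):
            grams.add(text[i:i + n])
    return grams


# Index words when morphological analysis recognizes "文字列" as one word.
morphological_index_words = {"文字列"}

# Index words when 2- and 3-character runs are extracted from the same text.
ngram_index_words = char_ngrams("文字列")

query = "文字"
print(query in morphological_index_words)  # False: the document is missed
print(query in ngram_index_words)          # True: the document is found
```

The N-gram set also contains 「字列」, which is not a proper word; entries of this kind are what make the N-gram index larger.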

A conventional information retrieval system treats all documents as equally valuable and creates the index for every document in the same way. In practice, however, documents differ in value: some cause a problem if a search misses them, while others do not.

Accordingly, an object of the present invention is to prevent search omissions of high-value documents while reducing the index size for the stored documents as a whole.

The present invention provides a document retrieval device that creates an index from acquired documents and identifies the documents satisfying an input search condition by referring to the index based on that condition. The device comprises a CPU and a storage device that stores the index and an index compression policy file indicating rules for reconfiguring the index, and the CPU reconfigures the index based on time information attached to each document.

According to the present invention, search omissions of high-value documents can be prevented while the index size for the stored documents as a whole is reduced.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.

(First embodiment)
FIG. 1 is a block diagram of the configuration of a document search system 10 according to the first embodiment of this invention.

The document search system 10 includes a central processing unit (CPU) 101, a main memory 102, a display device 103, an input device 104, and a storage device 110.

The CPU 101 performs various processes by loading programs from the main memory 102 and executing them. The main memory 102 temporarily stores the programs and other data held in the storage device 110 as required by the processing of the CPU 101.

The display device 103 is, for example, a display such as a CRT, and displays information. The input device 104 is, for example, a keyboard, and accepts input from the user.

The storage device 110 stores an operating system (OS) 111, document data 112, an index 113, an index compression policy file 114, a registration date management table 115, a document usage frequency table 116, an index compression program 117, a search program 118, an index creation program 119, and a document collection program 120.

The OS 111 controls the document search system 10 as a whole. The document data 112 holds the documents searched by the document search system 10. The index 113, described later with reference to FIG. 2, is the search data used by the search program 118. The index compression policy file 114, described later with reference to FIG. 3, indicates the rules for compressing the index 113.

The registration date management table 115 manages the date on which each search target document was registered in the document data 112. The document usage frequency table 116 indicates how often each document stored in the document data 112 is used. The document usage frequency table 116 is used in a fifth embodiment described later and need not be stored in the storage device 110 in the other embodiments.

The index compression program 117 compresses the index 113 based on the index compression policy file 114. The search program 118 searches the document data 112 for documents containing the query entered from the input device 104.

The index creation program 119 extracts words from the documents stored in the document data 112 and creates the index 113 from the extracted words using the inverted file method. In this embodiment the index creation program 119 is described as extracting words by morphological analysis, but it may extract them by another method such as the N-gram method. The inverted file method is a known technique described in Non-Patent Document 1, and its description is omitted.

The document collection program 120 is stored only when the document search system 10 is connected to a document server 130.

The document search system 10 may be connected to one or more document servers 130 via a network. Each document server 130 stores document data similar to the document data 112 stored in the storage device 110. In this case, the storage device 110 need not store the document data 112.

The document collection program 120 is, for example, a crawler; it collects the document data stored in the document server 130 and stores it in the main memory 102. The index creation program 119 then creates the index 113 from the document data held in the main memory 102.

FIG. 2 is a configuration diagram of the index 113 stored in the storage device 110 according to the first embodiment of this invention.

The index 113 includes an index word 1131, a document number 1132, and a term weight 1133.

The index word 1131 is a character string contained in a document stored in the document data 112. The document number 1132 is a unique identification number identifying a document that contains the index word 1131 of the record.

The term weight 1133 indicates the degree to which a document matches a query entered from the input device 104, with 1 denoting the highest degree of matching. Specifically, the term weight 1133 is a value characterizing how well the content of the document as a whole matches the query; for example, a tf-idf value or the frequency of occurrence of the word is used. The tf-idf value is a known technique described in Non-Patent Document 1, and its description is omitted.

In the present embodiment, the index word 「システム」 ("system") is contained in the documents with document numbers "3", "8", "56", ..., "895". For this index word, the term weight of the document with document number "3" is "0.5", that of document number "8" is "0.2", that of document number "56" is "0.8", and that of document number "895" is "0.4".
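The index layout of FIG. 2 can be pictured as a mapping from each index word to its document numbers and term weights. The following Python sketch is only one possible in-memory representation; the dictionary layout and the helper names `postings` and `best_match` are assumptions, and only the four document numbers stated explicitly in the text are included.

```python
# index word 1131 -> {document number 1132: term weight 1133}
# Only the four documents named in the text appear; the elided ones are omitted.
index = {
    "システム": {3: 0.5, 8: 0.2, 56: 0.8, 895: 0.4},
}


def postings(index, index_word):
    """Return the {document number: term weight} entries for an index word."""
    return index.get(index_word, {})


def best_match(index, index_word):
    """Return the document number with the highest term weight, or None."""
    entries = postings(index, index_word)
    return max(entries, key=entries.get) if entries else None


print(sorted(postings(index, "システム")))  # [3, 8, 56, 895]
print(best_match(index, "システム"))        # 56
```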

FIG. 3 is a configuration diagram of the index compression policy file 114 stored in the storage device 110 according to the first embodiment of this invention.

The index compression policy file 114 includes a period start 1141, a period end 1142, and a document weight 1143.

The index compression policy file 114 defines a document weight according to the number of years (elapsed years) from the date on which a document was registered in the document data 112 (registration date) to the present.

That is, the period start 1141 indicates the beginning, and the period end 1142 the end, of the range of elapsed years to which the document weight applies. A document whose elapsed years fall between the period start 1141 and the period end 1142 matches the record and has the document weight defined there.

The document weight 1143 indicates the data compression rate applied to the index 113 for the matching documents, taking the index 113 as created on the registration date to be "1".

In the present embodiment, the document weight 1143 is "1.0" for documents with up to 3 elapsed years, "0.8" for documents with 4 to 5 elapsed years, "0.5" for documents with 6 to 7 elapsed years, and "0.2" for documents with 8 to 100 elapsed years.

A document weight 1143 of "0.8" means that index words 1131 are deleted from the index 113 until the number of words adopted as index words 1131 falls to 80% of that in the index 113 created on the registration date.
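The policy lookup described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the tuple layout of `POLICY`, the function names, the period start of 0 for the first record, and the counting of elapsed years as completed whole years are all assumptions.

```python
from datetime import date

# (period start 1141, period end 1142, document weight 1143), in elapsed years.
# The period start of the first record is assumed to be 0.
POLICY = [(0, 3, 1.0), (4, 5, 0.8), (6, 7, 0.5), (8, 100, 0.2)]


def elapsed_years(registered, today):
    """Completed whole years between the registration date and today (assumed rule)."""
    years = today.year - registered.year
    if (today.month, today.day) < (registered.month, registered.day):
        years -= 1
    return years


def document_weight(years, policy=POLICY):
    """Return the document weight of the record whose period contains years."""
    for start, end, weight in policy:
        if start <= years <= end:
            return weight
    raise ValueError(f"no policy record covers {years} elapsed years")


years = elapsed_years(date(2000, 6, 15), date(2004, 12, 20))
print(years, document_weight(years))  # 4 0.8
```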

FIG. 4 is a configuration diagram of the registration date management table 115 stored in the storage device 110 according to the first embodiment of this invention.

The registration date management table 115 includes a document number 1151, a registration date 1152, and a document weight 1153.

The document number 1151 is a unique identification number identifying a document registered in the document data 112. The registration date 1152 is the date on which the document was registered in the document data 112. The document weight 1153 indicates the current data compression rate of the index 113 for the document.

Note that the registration date 1152 may instead be other time information attached to the document, such as its creation date or update date.

FIG. 5 is a flowchart of the processing of the index compression program 117 according to the first embodiment of this invention.

First, it is determined whether all the documents in the document data 112 have been processed (S301).

If all the documents have been processed, no document remains whose index 113 entries need compressing, and the processing ends. If unprocessed documents remain, the unprocessed document with the smallest document number is selected.

Next, the registration date 1152 and the document weight 1153 of the record whose document number 1151 matches the document number of the selected document are acquired from the registration date management table 115 (S302).

Next, the elapsed years of the document are calculated by subtracting the acquired registration date 1152 from the current date. The document weight 1143 of the record for which the calculated elapsed years fall between the period start 1141 and the period end 1142 is then acquired from the index compression policy file 114 (S303).

Next, it is determined whether the acquired document weight 1143 is equal to the document weight 1153 acquired in step S302 (S304).

If the document weights 1143 and 1153 are equal, the index 113 entries for the document need not be compressed, and the process returns to step S301.

If the document weights 1143 and 1153 differ, the index 113 entries for the document need to be compressed, so the document is acquired from the document data 112. The acquired document is morphologically analyzed to extract the words in it, and the tf-idf value of each extracted word is calculated. The extracted words are then ranked in descending order of tf-idf value, and a prioritized word list associating each extracted word with its priority is created (S305). The priority may be derived from the frequency of occurrence of the words or the like instead of the tf-idf value.

Next, the words in the prioritized word list are counted, and the count is multiplied by the document weight 1143 acquired in step S303 to obtain the number of words to keep in the index 113. That number of words is selected from the prioritized word list in descending order of priority. The words not selected become the words to be deleted from the index 113, and a deleted word list is created from them (S306).
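Steps S305 and S306 can be sketched as follows. The function name `split_by_priority` and the input format (a mapping from word to an already computed score such as tf-idf) are assumptions, and the source does not say how a fractional word count is rounded, so truncation is assumed here.

```python
def split_by_priority(scores, document_weight):
    """Split words into (kept, deleted) lists.

    scores maps each extracted word to its priority score (for example tf-idf).
    The number of kept words is the word count times the document weight,
    truncated to an integer (the rounding rule is an assumption).
    """
    # S305: prioritized word list, highest score first.
    prioritized = sorted(scores, key=lambda word: scores[word], reverse=True)
    # S306: keep the top words, put the rest on the deleted word list.
    keep_count = int(len(prioritized) * document_weight)
    return prioritized[:keep_count], prioritized[keep_count:]


scores = {"index": 0.9, "the": 0.1, "policy": 0.5, "compression": 0.7, "file": 0.3}
kept, deleted = split_by_priority(scores, 0.8)
print(kept)     # ['index', 'compression', 'policy', 'file']
print(deleted)  # ['the']
```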

Next, one word is selected from the deleted word list. The record whose index word 1131 matches the selected word is selected from the index 113, and the document number 1132 of the document is deleted from that record (S307). The term weight 1133 corresponding to the deleted document number 1132 is deleted as well. The document number 1132 and the term weight 1133 are deleted in the same way for every word in the deleted word list, and the process returns to step S301.

As described above, the index compression program 117 compresses the index 113 according to the index compression policy file 114.
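The deletion in step S307 can be sketched against a dictionary-based index. The layout (index word mapped to a dictionary of document number and term weight) and the function name `remove_postings` are assumptions; dropping a record that no longer lists any document is also an assumption, since the source does not say what happens to such a record.

```python
def remove_postings(index, doc_number, deleted_words):
    """S307: remove doc_number and its term weight from each deleted word's record."""
    for word in deleted_words:
        record = index.get(word)
        if record is None:
            continue
        # Deleting the dictionary entry removes the document number and its weight together.
        record.pop(doc_number, None)
        if not record:
            # Assumption: a record left without any document is removed entirely.
            del index[word]


index = {
    "system": {3: 0.5, 8: 0.2},
    "the": {3: 0.1},
}
remove_postings(index, 3, ["the", "system"])
print(index)  # {'system': {8: 0.2}}
```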

The storage device 110 of the present embodiment may include a term vector table.

FIG. 6 is a configuration diagram of the term vector table 200 stored in the storage device 110 according to the first embodiment of this invention.

The term vector table 200 includes a document number 2001 and an index word 2002.

The document number 2001 is a unique identifier identifying a document stored in the document data 112. The index word 2002 is a word contained in that document.

In step S305 of the processing of the index compression program 117 (FIG. 5), the words in a document are obtained by morphologically analyzing the document in order to create the prioritized word list.

However, if the index compression program 117 performs morphological analysis every time it compresses the index 113, the processing cannot be fast. The term vector table 200, a list of the words in each document, is therefore stored in the storage device 110 in advance. The index compression program 117 can then obtain the words in a document without morphological analysis and can compress the index 113 quickly.
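The role of the term vector table can be sketched as a precomputed mapping from document number to word list. In this illustration `analyze` is a whitespace splitter standing in for a real morphological analyzer, and the function names and sample documents are hypothetical.

```python
def analyze(text):
    """Stand-in for morphological analysis: split the text on whitespace."""
    return text.split()


def build_term_vectors(documents, analyze):
    """Run the analyzer once per document and store the resulting word lists."""
    return {doc_number: analyze(text) for doc_number, text in documents.items()}


def words_of(doc_number, term_vectors):
    """Look up a document's words without analyzing the document again."""
    return term_vectors[doc_number]


documents = {3: "index compression system", 8: "document search system"}
term_vectors = build_term_vectors(documents, analyze)  # done in advance
print(words_of(8, term_vectors))  # ['document', 'search', 'system']
```

With this table in place, step S305 would read the word list from `term_vectors` instead of re-analyzing the document on every compression pass.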

(Second embodiment)
In the second embodiment of the present invention, the index 113 is structured so as to reduce fragmentation of the storage device 110.

In the first embodiment, document numbers 1132 and term weights 1133 are deleted from the index 113 when the index 113 is compressed, so the storage area of the storage device 110 tends to become fragmented. In the second embodiment, the whole of the index 113 is rebuilt when the index 113 is compressed, which prevents fragmentation of the storage device 110.

FIG. 7 is a configuration diagram of the index 113 stored in the storage device 110 according to the second embodiment of this invention.

The index 113 of the second embodiment consists of an index word 1135, a pointer 1136, a document number 1138, and a term weight 1139.

The index word 1135 is a character string contained in a document stored in the document data 112. The pointer 1136 stores link information to the document number 1138 and the term weight 1139.

The document number 1138 is a unique identification number identifying a document that contains the index word of the record. The term weight 1139 indicates the degree to which a document matches a query entered from the input device 104, with 1 denoting the highest degree of matching.

The index 113 of the second embodiment is divided into parts by document registration date, in units of half a year. The index 113 may instead be divided by a predetermined period other than half a year.

Because the index 113 of the second embodiment is divided into half-year units, the index compression program 117 can rebuild (defragment) the index one half-year unit at a time. Compared with the case where everything is held in a single index, the whole of the index 113 can therefore be rebuilt in a short time, and fragmentation of the storage device 110 can be prevented.
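The half-year partitioning can be sketched by deriving a partition key from each registration date. The key format (year, half) and the use of calendar halves (January to June, July to December) are assumptions; the source only states that the index is divided in half-year units.

```python
from collections import defaultdict
from datetime import date


def partition_key(registered):
    """Return (year, half) for a registration date; calendar halves are assumed."""
    return (registered.year, 1 if registered.month <= 6 else 2)


def group_by_partition(registration_dates):
    """Group document numbers by the half-year partition of their registration date."""
    partitions = defaultdict(list)
    for doc_number, registered in registration_dates.items():
        partitions[partition_key(registered)].append(doc_number)
    return dict(partitions)


registration_dates = {
    3: date(2001, 2, 10),
    8: date(2001, 5, 1),
    56: date(2001, 9, 30),
}
print(group_by_partition(registration_dates))
# {(2001, 1): [3, 8], (2001, 2): [56]}
```

Each partition can then be rebuilt on its own, so only the documents in one half-year unit have to be rewritten at a time.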

(Third embodiment)
In the third embodiment of the present invention, the index 113 is compressed by changing the index word extraction method according to the elapsed years.

FIG. 8 is a configuration diagram of the index compression policy file 114 stored in the storage device 110 according to the third embodiment of this invention.

The index compression policy file 114 of the third embodiment consists of a period start 1145, a period end 1146, and an index word assigning method 1147.

The index compression policy file 114 defines the index word assigning method used when creating the index, according to the elapsed years of a document.

That is, the period start 1145 indicates the beginning, and the period end 1146 the end, of the range of elapsed years of the documents indexed by the index word assigning method. A document whose elapsed years fall between the period start 1145 and the period end 1146 matches the record and is indexed by the index word assigning method defined there.

The index word assigning method 1147 is the method of extracting index words when the index 113 is created.

In the present embodiment, the index word assigning method 1147 is "N-gram" for documents with up to 5 elapsed years and "morphological analysis" for documents with 6 to 100 elapsed years.

Extracting index words by N-gram reduces search omissions but enlarges the index. Extracting index words by morphological analysis, on the other hand, makes the index smaller but increases the number of index words that are missed.

That is, a document with 5 or fewer elapsed years is judged to be of high value because little time has passed since its registration, and its index words are extracted by N-gram. A document with 6 to 100 elapsed years is judged to be of low value because a long time has passed since its registration, and its index words are extracted by morphological analysis.

As a result, the document search system 10 of the third embodiment can avoid search omissions for recently registered documents while reducing the size of the index 113 for documents registered long ago.
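The method selection of the third embodiment can be sketched as a lookup followed by a dispatch. Everything below is illustrative: `bigrams` is a 2-character N-gram extractor, `whitespace_words` is a stand-in for morphological analysis, and the period start of 0 for the first record is an assumption.

```python
# (period start 1145, period end 1146, index word assigning method 1147)
METHOD_POLICY = [(0, 5, "N-gram"), (6, 100, "morphological analysis")]


def bigrams(text):
    """2-character N-grams of the text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def whitespace_words(text):
    """Stand-in for morphological analysis: split the text on whitespace."""
    return set(text.split())


EXTRACTORS = {"N-gram": bigrams, "morphological analysis": whitespace_words}


def indexing_method(years, policy=METHOD_POLICY):
    """Return the index word assigning method for a document of the given age."""
    for start, end, method in policy:
        if start <= years <= end:
            return method
    raise ValueError(f"no policy record covers {years} elapsed years")


def index_words(text, years):
    """Extract index words with the method the policy selects for this age."""
    return EXTRACTORS[indexing_method(years)](text)


print(sorted(index_words("ab cd", 2)))  # [' c', 'ab', 'b ', 'cd']
print(sorted(index_words("ab cd", 6)))  # ['ab', 'cd']
```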

(Fourth embodiment)
In the fourth embodiment of the present invention, a compressed index 113 is restored. Here, restoration means re-creating the compressed index 113 to return it to the index 113 as it was before compression.

FIG. 9 is a flowchart of the processing of the search program 118 according to the fourth embodiment of this invention.

First, when a query is entered from the input device 104, the search processing starts. A query contains one or more word strings.

Cases such as a query given as a Boolean expression of word strings, or scores calculated with a vector model, can be handled by the techniques described in Non-Patent Document 1, and their description is omitted.

First, it is determined whether all the word strings in the query have been processed (S201).

If not all the word strings have been processed, unprocessed word strings remain in the query, so one of them is selected. All the document numbers 1132 (the document number set) in the record whose index word 1131 matches the selected word string are acquired from the index 113 (S202).

The acquired document number set is ANDed (intersected) with the result document set obtained in the previous execution of step S203, the result becomes the new result document set (S203), and the process returns to step S201. When the first word string in the query is being processed, however, no result document set from a previous step S203 exists, so the document number set acquired in step S202 is used as the result document set.

Once all the word strings have been processed, it is determined in step S201 that no unprocessed word string remains in the query, and information on the documents corresponding to the document numbers in the result document set is displayed on the display device 103 as the search result (S204). For example, the document weight 1153 of each record whose document number 1151 matches a document number in the result document set is acquired from the registration date management table 115. The document name for each document number in the result document set is also acquired from the document data 112. The acquired document weights 1153 and document names are then displayed on the display device 103 as the search result.
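The loop of steps S201 to S203 can be sketched as a running set intersection. The dictionary-based index layout and the function name `search` are assumptions, as is returning an empty set for an empty query.

```python
def search(index, query_words):
    """Return the document numbers that contain every word in query_words."""
    result = None
    for word in query_words:                 # S201: loop until all words are processed
        doc_numbers = set(index.get(word, {}))  # S202: document number set for the word
        if result is None:
            result = doc_numbers             # first word: no previous result set exists
        else:
            result = result & doc_numbers    # S203: AND with the previous result set
    return result if result is not None else set()


index = {
    "index": {3: 0.5, 8: 0.2, 56: 0.8},
    "compression": {8: 0.6, 56: 0.1, 895: 0.4},
}
print(sorted(search(index, ["index", "compression"])))  # [8, 56]
```

The document numbers returned here are what step S204 would look up in the registration date management table 115 and the document data 112 for display.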

FIG. 10 is an explanatory diagram of search results displayed on the display device 103 according to the fourth embodiment of the present invention.

The display device 103 displays the document name and compression rate of each document that matched the query input from the input device 104. The display device 103 also displays a "Restore index" button.

The user can return (restore) the index 113 for a displayed document to its state before compression. When the user judges that a compressed document (one whose compression rate is not "1.0") is useful, the user selects that document and operates the "Restore index" button.

When an instruction to restore the index 113 is input from the input device 104, the index creation program 119 re-creates, and thereby restores, the index 113 for the selected document.

As described above, the document search system 10 according to the fourth embodiment can restore the index for a useful document even when a long time has passed since its registration, and can therefore prevent useful documents from being missed in searches.

In the present embodiment, the document search system 10 restores the index 113 in response to a request from the user, but it may instead judge the usefulness of a document automatically and restore the index 113.

For example, when a document has been used a predetermined number of times, the index creation program 119 judges the document to be useful and restores the index 113 for that document. Because the document search system 10 can thereby judge the value of a document from how users actually use it, this approach is well suited to a system used by multiple users.
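One way to picture this automatic restoration is the sketch below. It is not the patented implementation: the class `AutoRestorer`, the callback `restore`, and the threshold of 3 are hypothetical, and the sketch assumes every tracked document currently has a compressed index.

```python
from collections import Counter
from typing import Callable, Set


class AutoRestorer:
    """Counts document uses and triggers index restoration at a threshold."""

    def __init__(self, threshold: int, restore: Callable[[int], None]) -> None:
        self.threshold = threshold          # the "predetermined number of times"
        self.restore = restore              # hypothetical hook that re-creates the index
        self.use_counts: Counter = Counter()
        self.restored: Set[int] = set()     # documents already restored

    def record_use(self, doc_no: int) -> None:
        """Record one use and restore the index once the threshold is reached."""
        self.use_counts[doc_no] += 1
        if self.use_counts[doc_no] >= self.threshold and doc_no not in self.restored:
            self.restore(doc_no)
            self.restored.add(doc_no)


restored_log = []
restorer = AutoRestorer(threshold=3, restore=restored_log.append)

for _ in range(5):
    restorer.record_use(1362)   # used five times: restored once, at the third use
restorer.record_use(5062)       # used once: below the threshold

print(restored_log)  # [1362]
```

Tracking the already-restored documents keeps the index from being re-created on every later use.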

(Fifth embodiment)
In the fifth embodiment of the present invention, the index 113 is compressed according to the usage status of each document.

FIG. 11 is a configuration diagram of the document usage frequency table 116 stored in the storage device 110 according to the fifth embodiment of the present invention.

The document usage frequency table 116 includes a document number 1161, a usage frequency 1162 immediately after registration, and a most recent usage frequency 1163.

The document number 1161 is a unique identification number that identifies a document stored in the document data 112.

The usage frequency 1162 immediately after registration is the number of times the document was used during a predetermined period after it was registered in the document data 112. Using a document means, for example, viewing its contents or downloading it. The forms of use that count may be restricted, or different forms of use may be weighted differently. The most recent usage frequency 1163 is the number of times the document was used during a predetermined period ending at the present. The predetermined period has the same length for the usage frequency 1162 immediately after registration and the most recent usage frequency 1163.

For example, the document with document number "1362" was used 50 times immediately after registration but only once recently. Its usage frequency has therefore dropped sharply since registration. By contrast, the document with document number "5062" was used 26 times immediately after registration and 12 times recently. Its usage frequency has therefore not dropped nearly as sharply.

FIG. 12 is a flowchart of the processing of the index compression program 117 according to the fifth embodiment of the present invention.

First, it is determined whether all documents included in the document data 112 have been processed (S401).

If all documents have been processed, no document remains whose index 113 is to be compressed, and the processing ends.

On the other hand, if an unprocessed document exists, a document whose index 113 may need to be compressed remains, so the document with the smallest document number is selected from among the unprocessed documents. Next, the usage frequency 1162 immediately after registration and the most recent usage frequency 1163 of the record whose document number 1161 matches the document number of the selected document are acquired from the document usage frequency table 116 (S402). The acquired most recent usage frequency 1163 is then divided by the usage frequency 1162 immediately after registration to obtain the rate of decrease in the number of uses. Because this value is the ratio of recent usage to initial usage, a smaller value means a steeper decline.

Next, it is determined whether the obtained rate of decrease in the number of uses is equal to or greater than a threshold (S403). The determination may instead be based on another value from which the utility value of the document can be judged.

If the rate of decrease in the number of uses is equal to or greater than the threshold, the value of the document is judged not to have decreased much, and the process returns to step S401.

On the other hand, if the rate of decrease in the number of uses is smaller than the threshold, the value of the document is judged to have decreased significantly, and the processing of steps S302 to S307 is performed. Steps S302 to S307 are the same as in the processing of the index compression program 117 according to the first embodiment of the present invention (FIG. 5), so their description is omitted.
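The selection logic of steps S401 to S403 can be sketched in Python as follows. The sketch uses the two example records from the document usage frequency table 116; the function name `select_documents_to_compress`, the threshold value of 0.1, and the handling of a zero initial usage count are assumptions made only for illustration. The compression itself (steps S302 to S307) is not shown.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for the document usage frequency table 116:
# document number 1161 -> (usage frequency 1162 immediately after registration,
#                          most recent usage frequency 1163)
UsageTable = Dict[int, Tuple[int, int]]


def select_documents_to_compress(table: UsageTable, threshold: float) -> List[int]:
    """Return the document numbers whose index would be passed to S302 to S307."""
    to_compress = []
    for doc_no in sorted(table):          # S401/S402: smallest document number first
        initial, recent = table[doc_no]
        if initial == 0:
            # Assumption: the ratio is undefined, so the document is left as is.
            continue
        ratio = recent / initial          # recent usage divided by initial usage
        if ratio < threshold:             # S403: below the threshold -> compress
            to_compress.append(doc_no)
    return to_compress


usage_table = {1362: (50, 1), 5062: (26, 12)}

# 1362: 1 / 50 = 0.02 (below 0.1); 5062: 12 / 26 is about 0.46 (not below 0.1)
print(select_documents_to_compress(usage_table, threshold=0.1))  # [1362]
```

With this assumed threshold, only document "1362" is selected, which matches the description of it as the document whose usage has dropped sharply.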

In the fifth embodiment, the index can be compressed according to how frequently each document is used.

According to the present invention, the amount of index data can be reduced. The storage capacity needed to hold the index can therefore be reduced, so a search device can be realized at low cost.

The present invention can be used in a search device that searches for documents using an index. Because the amount of index data can be reduced, the invention is useful for low-cost search devices.

Block diagram of the configuration of the document search system according to the first embodiment of the present invention.
Configuration diagram of the index stored in the storage device according to the first embodiment of the present invention.
Configuration diagram of the index compression policy file stored in the storage device according to the first embodiment of the present invention.
Configuration diagram of the registration date management table stored in the storage device according to the first embodiment of the present invention.
Flowchart of the processing of the index compression program according to the first embodiment of the present invention.
Configuration diagram of the term vector table stored in the storage device according to the first embodiment of the present invention.
Configuration diagram of the index stored in the storage device according to the second embodiment of the present invention.
Configuration diagram of the index compression policy file stored in the storage device according to the third embodiment of the present invention.
Flowchart of the processing of the search program according to the fourth embodiment of the present invention.
Explanatory diagram of search results displayed on the display device according to the fourth embodiment of the present invention.
Configuration diagram of the document usage frequency table stored in the storage device according to the fifth embodiment of the present invention.
Flowchart of the processing of the index compression program according to the fifth embodiment of the present invention.

Explanation of symbols

10 Document search system
101 CPU
102 Main memory
103 Display device
104 Input device
110 Storage device
111 OS
112 Document data
113 Index
114 Index compression policy file
115 Registration date management table
116 Document usage frequency table
117 Index compression program
118 Search program
119 Index creation program
120 Document collection program
130 Document server

Claims

A document search apparatus that creates an index from an acquired document and identifies, by referring to the index based on an input search condition, the document that satisfies the search condition, the apparatus comprising:
a storage unit that stores the index and an index compression policy file indicating a rule for reconfiguring the index; and
a control unit,
wherein the control unit reconfigures the index based on time information attached to the document and the rule described in the index compression policy file.

The document search apparatus according to claim 1, wherein the control unit reconfigures the index based on an input value determination result.

The document search apparatus according to claim 1, wherein the control unit reconfigures the index based on a use frequency of the document.

The document search apparatus according to claim 1, wherein the control unit:
determines a priority for the index words of each document; and
reconfigures the index by deleting the index words of each document from the index in order of the determined priority.

The document search apparatus according to claim 1, wherein the control unit:
selects one of a plurality of index generation algorithms based on the time information attached to the document; and
reconfigures the index using the selected index generation algorithm.

An index reconfiguration method in a document search apparatus that creates an index from an acquired document and identifies, by referring to the index based on an input search condition, the document that satisfies the search condition,
wherein the document search apparatus includes a storage unit that stores the index and an index compression policy file indicating a rule for reconfiguring the index, and a control unit, and
wherein the control unit reconfigures the index based on time information attached to the document and the rule described in the index compression policy file.

The index reconfiguration method according to claim 6, wherein the control unit reconfigures the index based on an input value determination result.

The index reconfiguration method according to claim 6, wherein the control unit reconfigures the index based on the frequency of use of the document.

The index reconfiguration method according to claim 6, wherein the control unit:
determines a priority for the index words of each document; and
reconfigures the index by deleting the index words of each document from the index in descending order of the determined priority.

The index reconfiguration method according to claim 6, wherein the control unit:
selects one of a plurality of index generation algorithms based on the time information attached to the document; and
reconfigures the index using the selected index generation algorithm.

A program for reconfiguring an index in a document search apparatus that creates the index from an acquired document and identifies, by referring to the index based on an input search condition, the document that satisfies the search condition,
wherein the document search apparatus includes a storage unit that stores the index and an index compression policy file indicating a rule for reconfiguring the index, and a control unit, and
wherein the program comprises a procedure for reconfiguring the index based on time information attached to the document and the rule described in the index compression policy file.

The program according to claim 11, further comprising a procedure for reconfiguring the index based on an input value determination result.

The program according to claim 11, further comprising a procedure for reconfiguring the index based on the frequency of use of the document.

The program according to claim 11, further comprising:
a procedure for determining a priority for the index words of each document; and
a procedure for reconfiguring the index by deleting the index words of each document from the index in descending order of the determined priority.

The program according to claim 11, further comprising:
a procedure for selecting one of a plurality of index generation algorithms based on the time information attached to the document; and
a procedure for reconfiguring the index using the selected index generation algorithm.