JP5317418B2

JP5317418B2 - Program and inverted index storage method

Info

Publication number: JP5317418B2
Application number: JP2007050209A
Authority: JP
Inventors: 知弘安田
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2007-02-28
Filing date: 2007-02-28
Publication date: 2013-10-16
Anticipated expiration: 2027-02-28
Also published as: JP2008217122A

Description

The present invention relates to a method of compressing an inverted index that enables fast full-text search and incremental addition of document data for large document collections.

Today, electronic documents are indispensable in every aspect of economic activity, and enormous numbers of them are created every day. The growth of the Internet has further caused the volume of electronic documents to explode. To make full use of these documents, document retrieval technology that finds a desired document in a short time is essential.

A typical form of document retrieval is to output, as quickly as possible, the documents in a given document set that contain a specified word. A data structure called an inverted index is used for this purpose. FIG. 1 is a schematic diagram of a document set 101 to be searched and an inverted index 102 built from it. When an index term 103 appears in a document 104, the information consisting of the identification number 105 of that document and the number of occurrences 106 of the index term in that document is called a posting 107, and the list obtained by collecting all postings for an index term is called an inverted list 109. In FIG. 1, 108 is a pointer to the inverted list 109, assigned to each index term. The inverted index 102 is the data structure that stores the inverted lists of all index terms in the documents to be searched.
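As a minimal sketch of the structure described above, the following Python code builds an inverted index in which each index term maps to its inverted list of (document identification number, occurrence count) postings. The function name `build_inverted_index` and the whitespace tokenization are illustrative assumptions and are not taken from the specification.

```python
from collections import Counter, defaultdict

def build_inverted_index(documents):
    """Map each index term to its inverted list of (doc_id, count) postings."""
    index = defaultdict(list)
    # Document identification numbers start at 1 (illustrative choice).
    for doc_id, text in enumerate(documents, start=1):
        # Whitespace tokenization is an assumption made for this sketch.
        for term, count in sorted(Counter(text.split()).items()):
            index[term].append((doc_id, count))
    return dict(index)

index = build_inverted_index(["a b a", "b c"])
print(index["b"])
```

Because documents are processed in increasing order of identification number, each inverted list is naturally sorted by document.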

Unlike languages such as English, French, and Spanish, in which words are separated by spaces, Japanese, Korean, and Chinese text is difficult to split accurately into words. For such languages an inverted index is sometimes built using n-grams, that is, arbitrary substrings of n consecutive characters in a document, in place of words, and this is known to be useful in practice (Kenji Kita et al., Information Retrieval Algorithms, Kyoritsu Shuppan). n is typically an integer from 1 to 10. In this specification, words and substrings of length n are collectively called index terms.
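A short sketch of n-gram extraction as described above; the helper name `ngrams` is hypothetical.

```python
def ngrams(text, n):
    """Return every substring of n consecutive characters, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Character bigrams of a four-character Japanese string.
print(ngrams("全文検索", 2))
```

A string shorter than n yields no n-grams, so very short documents contribute no index terms for that n.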

As the number or size of the documents to be searched grows, the inverted index also becomes very large, so keeping it compact is important. Techniques that compress postings at a high compression ratio while still allowing fast access have been proposed (I. H. Witten, A. Moffat, and T. C. Bell, Managing Gigabytes 2nd Ed.: Compressing and Indexing Documents and Images, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1998; and F. Scholer, H. E. Williams, J. Yiannis, and J. Zobel, Compression of Inverted Indexes for Fast Query Evaluation, In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 222-229, 2002).

One important problem in document retrieval is incremental update: when a new document is given, the inverted index is updated to reflect it so that the new document also becomes searchable. Incremental update techniques have mainly been developed for inverted indexes on disk (J. Zobel and A. Moffat, Inverted files for text search engines, ACM Comput. Surv., 38(2), 2006). In recent years, however, the main memory installed in computers has grown, and for some document sets it has become possible to distribute the inverted index over several computers and handle a large inverted index entirely in memory. Even when the inverted index resides on disk, the updated portion is often kept in main memory as long as capacity permits, and writing to disk is deferred until memory runs short, which reduces the number of disk accesses and shortens the time needed to update the index. Developing an in-memory incremental update technique for inverted indexes is therefore extremely important for providing a responsive document retrieval system. In incremental update techniques for on-disk inverted indexes, the main concern is to store each inverted list in a highly contiguous area in order to shorten seek time during search. For an in-memory inverted index, by contrast, contiguity may be sacrificed to some extent. However, because memory has less capacity than disk, the amount of additional data attached to the inverted lists themselves, such as pointers, must be kept small.

When the search target is a static document set that never needs updating, a two-pass method that reads the original document set twice is an effective way to build an index whose size is kept to the minimum necessary. In the two-pass method, the first reading of the document set computes the size of the area needed to store the inverted list of each index term. The required memory is allocated on the basis of this result, and the document set is then read a second time to actually build the inverted lists. The two-pass method yields inverted lists with no wasted memory, but reading the document set twice is costly. Moreover, the complete document set must be available in advance, and a new document cannot be added without rebuilding the entire index.
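The two-pass idea can be sketched as follows. This is an illustrative reading of the description above, not a reference implementation: `two_pass_build` is a hypothetical name, whitespace tokenization is assumed, and Python lists stand in for exactly sized memory areas.

```python
from collections import Counter

def two_pass_build(documents):
    """Build inverted lists with exactly sized storage by reading the input twice."""
    # Pass 1: count how many postings each index term will need.
    sizes = Counter()
    for text in documents:
        sizes.update(set(text.split()))
    # Allocate each inverted list at its final size (no slack, no pointers).
    lists = {term: [None] * n for term, n in sizes.items()}
    fill = dict.fromkeys(sizes, 0)
    # Pass 2: read the documents again and fill the preallocated lists.
    for doc_id, text in enumerate(documents, start=1):
        for term, count in Counter(text.split()).items():
            lists[term][fill[term]] = (doc_id, count)
            fill[term] += 1
    return lists
```

The sketch also makes the limitation visible: adding one more document would change `sizes`, so every preallocated list would have to be rebuilt.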

The most basic data structure for incrementally updating an inverted index in memory is a linked list in which, as shown in FIG. 2, each posting 107 carries a pointer 201 indicating the position of the next posting. With a linked list, the inverted list of each index term can be extended whenever necessary, to any length that memory permits. However, if integers and pointers are represented with the same number of bits and a posting consists of two integers (the document identification number and the occurrence frequency of the index term in that document), pointers account for 33% or more of the entire data structure. The linked list therefore consumes 1.5 times the memory of the actual data, and the overhead of storing pointers is very large.
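The 33% and 1.5x figures follow from simple arithmetic, sketched below. The function name and the 32-bit word size are assumptions for illustration; the ratios do not depend on the word size as long as integers and pointers use the same width.

```python
def pointer_overhead(integers_per_posting, word_bits=32):
    """Return (pointer share of total, total size relative to payload) per posting."""
    payload = integers_per_posting * word_bits  # doc ID + occurrence count
    pointer = word_bits                         # one next-posting pointer
    share = pointer / (payload + pointer)
    blowup = (payload + pointer) / payload
    return share, blowup

share, blowup = pointer_overhead(2)
print(f"pointer share = {share:.1%}, memory = {blowup}x the payload")
```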

If an inverted list is placed in a contiguous memory area, the pointers linking postings to one another become unnecessary. A known technique in this case is to allocate extra memory so that free space remains at the end of each inverted list, allowing new postings to be appended. A method has also been proposed that predicts, from past updates, how much an inverted list will grow in the future and acquires correspondingly more memory (Non-Patent Document 1).

Buttcher et al., on the other hand, proposed a method that starts from a linked-list inverted list and reduces the pointer overhead (Non-Patent Document 2). In their method, multiple postings are stored in each contiguous memory area and these areas are linked by pointers, which reduces the pointer overhead per posting. In addition, reflecting the expectation that an inverted list to which postings are added frequently will continue to receive many additions, the size of each newly acquired memory area is increased exponentially. An upper limit is placed on the size of an acquired area, however, because allocating too much memory may leave a large amount of it unused. It has been confirmed experimentally that, by combining these techniques, the method of Buttcher et al. keeps the increase in data volume relative to the two-pass method to 5% or less.

Non-Patent Document 1: W.-Y. Shieh, C.-P. Chung, A statistics-based approach to incrementally update inverted files, Inf. Processing and Management, 41(2):275-288, 2005.
Non-Patent Document 2: S. Buttcher and C.L.A. Clarke, Memory management strategies for single-pass index construction in text retrieval systems, University of Waterloo Technical Report CS-2005032/Wumpus Technical Report 2005-02, 2005.

The data structure of an inverted index must keep the amount of additional data, other than the inverted lists themselves, as small as possible while remaining updatable in a short time whatever documents are given. With the method of Non-Patent Document 1, when many documents are added, the acquired memory eventually runs out and new memory must be allocated. Acquiring extra memory can delay the moment at which the data structure must be updated, but a data structure that can extend an inverted list to any length that memory capacity permits is still needed in the end. The method of Non-Patent Document 2 achieves, with a relatively simple data structure, an inverted index whose overhead is sufficiently small compared with the two-pass method. However, the head of each inverted list is split across many memory areas of different sizes, which slows down search; the way inverted lists are accessed also becomes more complicated, which increases processing time. There is also room to reduce the memory required for pointers.

An object of the present invention is to provide, for fast document search and fast inverted index update, an inverted index data structure that is compact and can nevertheless accommodate any request to add new documents.

The present invention provides three types of memory areas for storing inverted lists and selects among them according to the length of each inverted list, yielding an inverted index data structure with little pointer-induced memory overhead and without fragmentation of the memory areas (FIG. 3). These three types of memory areas are referred to in this specification as type 1, type 2, and type 3. Unless otherwise noted, area sizes are expressed in bytes. In the following,
b is a user parameter, an integer giving the size of the fixed-size memory area allocated to each index term,
B is a user parameter, an integer giving the size of the smallest memory area additionally allocated to each index term,
r is a user parameter, an integer that controls the size of the memory area newly allocated when an additionally allocated memory area becomes insufficient,
M is a user parameter, an integer giving the size of the largest memory area that is additionally allocated to each index term with guaranteed contiguity,
W is the size of the area needed to represent the location of an acquired memory area,
W' is the size of the area needed to represent the bit length of the data in an acquired memory area,
N is the smallest integer such that B*r^N≧M,
[x] is the smallest integer not less than x, and
c(x)=[(x+7)/8]. Here c(x) is the size, in bytes, of the area needed to store a bit string of length x.
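The two derived quantities can be sketched in Python as follows. This is an interpretation rather than the specification's own code: for an integer bit length x, c(x) is computed here with the integer division (x + 7) // 8, which gives the number of whole bytes needed to hold x bits. The function names and the example parameter values are assumptions.

```python
def c(x):
    """Bytes needed to hold a bit string of x bits (x a non-negative integer)."""
    return (x + 7) // 8

def smallest_N(B, r, M):
    """Smallest non-negative integer N with B * r**N >= M (requires r >= 2)."""
    if r < 2:
        raise ValueError("r must be at least 2 for the sizes to grow")
    N, size = 0, B
    while size < M:
        size *= r
        N += 1
    return N

# Hypothetical parameters: B = 16 bytes, r = 2, M = 128 bytes.
print(c(9), smallest_N(16, 2, 128))
```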

In this specification the base of the logarithm log is 2. In addition, b<B<M is assumed in order to guarantee that a newly allocated memory area is larger than the area it replaces.

The type 1 to type 3 memory areas are selected as follows according to the length λ, in bits, of the inverted list to be stored.

(1) Type 1 memory area
Used to store inverted lists with c(λ)≦W. In the prior art, an inverted list is stored in the area pointed to by the pointer provided for each index term. In the present invention, the type 1 memory area is allocated in the area 301 that would otherwise hold this pointer. A short inverted list can thus be stored without using any pointer at all. The type 1 memory area 301 is effective for very rare index terms that occur only once or twice in the document set, because for an inverted list with c(λ)≦W a single pointer alone would already account for 50% or more of the storage.

(2) Type 2 memory area
Used to store inverted lists with W<c(λ)≦B*r^(N-1). The type 2 memory area 302 is allocated in the area pointed to by the pointer provided for each index term. For an inverted list of length λ, a memory area of size B*r^(n-1) is allocated and the inverted list is recorded in it, where n is the smallest integer such that c(λ)≦B*r^(n-1).

(3) Type 3 memory area
Used to store inverted lists with c(λ)>B*r^(N-1). A type 3 memory area is a linked list in which multiple memory areas 303 of size M (hereinafter called segments) are connected by pointers. The i-th segment in this list is written A(i) below. Let M(i) be the size of the area in A(i) that is available for recording the inverted list, and define M'(k)=M(1)+...+M(k) for k≧1 and M'(0)=0. With this definition, M'(k) is the number of bytes of the longest inverted list that can be recorded in a type 3 memory area consisting of k segments. The specific values of M(i) depend on how the method of the present invention is implemented; one example is given later. Because pointers and other information needed to form the linked list are also recorded, the number of bytes of inverted list that each segment can hold is smaller than M, that is, M(i)<M.

The number of segments allocated for an inverted list of length λ is the smallest integer k that satisfies c(λ)≦M'(k). A type 3 memory area is a linked list and so requires pointers, but if M is made large so that M(i) is sufficiently large, pointers take up a smaller share than when every posting has its own pointer, and pointer dereferences while reading an inverted list are also relatively infrequent. The access time and memory consumption attributable to pointers can therefore be kept low. Implementation is also easy because every allocated segment has the same size M.
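The three-way choice described in (1) to (3) can be sketched as below. The function name `select_area`, the returned tuples, and the example parameters are illustrative assumptions; c(λ) is computed as (λ + 7) // 8 for an integer bit length.

```python
def select_area(lam, W, B, r, N):
    """Pick the memory-area type for an inverted list of lam bits.

    Returns ("type1", None), ("type2", area_size_in_bytes) or ("type3", None).
    """
    nbytes = (lam + 7) // 8            # c(lam)
    if nbytes <= W:
        return ("type1", None)         # fits where the pointer would be stored
    if nbytes <= B * r ** (N - 1):
        n = 1
        while B * r ** (n - 1) < nbytes:
            n += 1                     # smallest n with c(lam) <= B*r^(n-1)
        return ("type2", B * r ** (n - 1))
    return ("type3", None)             # linked list of size-M segments

# Hypothetical parameters: W = 4, B = 16, r = 2, N = 3 (largest type 2 area: 64 bytes).
for bits in (24, 40, 136, 520):
    print(bits, select_area(bits, W=4, B=16, r=2, N=3))
```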

If r is an integer and M is a multiple of B, then in type 2 and type 3 memory areas the size of every contiguous memory area is always an integer multiple of B. Therefore, by allocating in advance an array containing many B-byte memory areas and using, in place of a pointer, an integer that gives the position of an area within that array, the location of a memory area can be specified with fewer bits than a raw pointer would need, and W can in some cases be made smaller.
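A minimal sketch of addressing by slot number instead of by pointer, assuming a single preallocated byte buffer. The class name `SlotArena` and its methods are hypothetical and serve only to illustrate the idea.

```python
class SlotArena:
    """A preallocated array of B-byte slots addressed by slot number."""

    def __init__(self, B, num_slots):
        self.B = B
        self.buf = bytearray(B * num_slots)

    def region(self, slot_number, num_slots=1):
        """Writable view of num_slots consecutive slots starting at slot_number."""
        start = slot_number * self.B
        return memoryview(self.buf)[start:start + num_slots * self.B]

arena = SlotArena(B=16, num_slots=8)
area = arena.region(2, num_slots=2)   # a 32-byte contiguous area
area[0] = 0xAB
print(len(area), arena.buf[32])
```

Since every area starts on a slot boundary, a small integer identifies it completely, which is what allows W to be smaller than a full machine pointer.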

The present invention provides an inverted index data structure that has little pointer-induced memory overhead, records postings in highly contiguous memory areas, and allows new documents to be added.

One embodiment of an apparatus that creates and updates the inverted index provided by the present invention is described below. FIG. 4 shows an overview of the entire apparatus. A memory (main memory) 402 is connected to a central processing unit (CPU) 401. An auxiliary storage 403, a rewritable removable medium 404 such as a DVD-RAM, a network 405, and a user terminal 406 are connected as needed. The apparatus of this embodiment comprises, as programs in the memory 402 executed by the CPU 401, index term extraction means S501, index term occurrence count acquisition means S502, document identification number assignment means S503, posting compression means S504, and inverted index update means S505.

FIG. 5 shows the flow of data and processing in this apparatus. First, the CPU 401 obtains a new document 104 to be searched as input from the memory 402, the auxiliary storage 403, the removable medium 404, or the network 405, as needed. Two kinds of processing are applied to the document 104. First, the index term extraction means S501 extracts the index terms 103 from the document 104. As mentioned above, a known morphological analysis method or n-grams can be used as the extraction means (Kenji Kita et al., Information Retrieval Algorithms, Kyoritsu Shuppan). Next, the occurrence count acquisition means S502 obtains the number of times each index term occurs in the document 104. To obtain the count, it suffices to prepare one integer variable per index term, initialize it to 0, and increment it by 1 each time an occurrence of that index term is detected in the document 104. In addition, the document identification number assignment means S503 assigns a document identification number to the document 104.

In one preferred embodiment of this means, a single integer variable initialized to 1 is maintained; whenever a new document identification number is needed, the current value of the variable is assigned, and the variable is incremented by 1 immediately afterward. S501, S502, and S503 together yield, for each index term, a posting 107 consisting of the document identification number 105 and the index term occurrence count 106. This posting is preferably compressed by the posting compression means S504. The compressed posting 506 is appended to the inverted list corresponding to the index term in the inverted index 407 built in the memory 402. The location of the inverted list corresponding to the index term is recorded in the pointer for that index term. If necessary, the inverted index built in the memory 402 may be transferred to and saved in the auxiliary storage 403, the removable medium 404, or the network 405. This processing is repeated each time a new document 104 to be searched is given.
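The identification-number counter and the per-document occurrence counting described above can be sketched as follows. The class and function names are hypothetical, the index terms are assumed to have been extracted already (the role of S501), and posting compression (S504) is omitted.

```python
from collections import Counter

class DocumentIdAllocator:
    """Hands out document identification numbers 1, 2, 3, ... (role of S503)."""

    def __init__(self):
        self.next_id = 1  # integer variable initialized to 1

    def allocate(self):
        doc_id = self.next_id
        self.next_id += 1  # incremented immediately after assignment
        return doc_id

def postings_for_document(allocator, terms):
    """Return {index term: (doc_id, occurrence count)} for one document."""
    doc_id = allocator.allocate()
    counts = Counter(terms)  # one counter per index term, starting from 0 (role of S502)
    return {term: (doc_id, n) for term, n in counts.items()}

ids = DocumentIdAllocator()
print(postings_for_document(ids, ["a", "b", "a"]))
print(postings_for_document(ids, ["b", "c"]))
```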

An example of an inverted index according to the method of the present invention is described with reference to FIG. 6. This data structure consists of a basic area 601, an extension area 602, and a free area list recording area 603. The basic area is a variable-length array that can hold multiple memory areas of size b. Likewise, the extension area is a variable-length array that can hold multiple memory areas of size B. An example embodiment of the variable-length array is described later. The free area list recording area 603 is a fixed-length array of N elements 605, each of size W. The n-th element empty[n] 605 of this area is initialized to a constant NULL, whose definition is given later. The individual memory areas of size b in the basic area 601 are called basic slots below. Similarly, the individual memory areas 604 of size B in the extension area 602 are called slots below. The slot number of the k-th slot in the extension area is k. When several consecutive slots are used as one contiguous memory area, the slot number of the first slot is taken as the slot number of that memory area.

To ensure that each newly acquired memory area is an integer multiple of the previous one while preventing its size from growing too abruptly, r=2 is assumed. M is fixed at M=B*r^N so that the size of every memory area allocated in the extension area 602 can be expressed as B*r^n (n an integer). Furthermore, to guarantee that the longest inverted list storable in a type 2 memory area is longer than that storable in a type 1 memory area, and that the longest inverted list storable in a type 3 memory area is always longer than that storable in a type 2 memory area, it is assumed that parameters satisfying b≦B≦B*r^(N-1)≦M(1) are given.

From the basic area, one basic slot is allocated to each index term. The following information is recorded in a basic slot.
(1) Which of types 1 to 3 the memory area holding the inverted list of the index term belongs to.
(2)-1 For type 1, the bit length λ of the inverted list and the inverted list itself.
(2)-2 For type 2, the bit length λ of the inverted list and a pointer to the memory area.
(2)-3 For type 3, a pointer to the first element of the list of memory areas.

The data structure of the basic slot is described with reference to FIG. 7. The basic slot 701 consists of three partial regions: R 702, L 703, and C 704.

● Partial region R 702
Records an integer n, whose meaning is as follows.
n=0: The inverted list of this index term is stored in a type 1 memory area.
1≦n≦N: The inverted list of this index term is stored in a type 2 memory area. The size of that memory area is B*r^(n-1).
n=N+1: The inverted list of this index term is stored in a type 3 memory area.
● Partial region L 703
Records the bit length of the inverted list when a type 1 or type 2 memory area is used. For type 3, partial region L has no meaning.
● Partial region C 704
When a type 1 memory area is used, records the inverted list itself. For type 2, records the slot number of the memory area in which the inverted list is recorded. For type 3, records the slot number of the first segment A(1) of the list of memory areas in which the inverted list is recorded.
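How the integer held in partial region R might be interpreted can be sketched as below. The function name `decode_R` and the returned tuples are assumptions made for illustration; only the mapping from n to area type and size comes from the description above.

```python
def decode_R(n, B, r, N):
    """Interpret the integer n stored in partial region R of a basic slot.

    Returns ("type1", None), ("type2", area_size_in_bytes) or ("type3", None).
    """
    if n == 0:
        return ("type1", None)            # list stored inside the basic slot
    if 1 <= n <= N:
        return ("type2", B * r ** (n - 1))
    if n == N + 1:
        return ("type3", None)            # C holds the slot number of A(1)
    raise ValueError(f"R value {n} is outside 0..{N + 1}")

# Hypothetical parameters: B = 16, r = 2, N = 3.
print([decode_R(n, 16, 2, 3) for n in range(5)])
```

R therefore needs to distinguish only N+2 values, so a few bits are enough to hold it.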

In the following, a partial region y within a memory area x is written x{y}.

Next, the data structure of the memory areas acquired on slots of the extension area 602 is described.
■ For a type 2 memory area (801)
The entire area is used to store the inverted list (FIG. 8).
■ For a type 3 memory area (901, 902, or 903)
The data structure is described with reference to FIG. 9.

● Partial region P3 904
Records the slot number of the next segment in the linked list shown in FIG. 9. If the segment containing this partial region P3 is the last segment of the linked list, a special value NULL indicating the end is recorded instead. Two preferred choices for the special value NULL are as follows.
1. Number the slots starting from 1 and let NULL be 0. Alternatively, number the slots starting from 0, leave slot 0 unused, and let NULL be 0.
2. Let NULL be the largest value that can be recorded in W bytes.
In FIGS. 6, 9, and 11, the constant NULL is shown as hatching 606.
● Partial region L3 905
Records the bit length of the portion of the inverted list recorded in the segment that contains this partial region L3.
● Partial region P'3 906
Partial region P'3 906 is provided only in the first segment A(1) 901 of the linked list. It records the slot number of the last segment of the linked list. With this information, the end of the inverted list can be reached in constant time regardless of the length of the list, so new elements can be appended to the inverted list quickly.
● Partial region C3 907
Stores the inverted list.

In the above data structure, the sizes of the partial regions P3, L3, and P'3 are preferably W, W', and W, respectively. In this case, M(1)=M-(2W+W') and M(i)=M-(W+W') for i≧2.
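Under the preferred sizes given above, the segment capacities M(i) and the number of segments k needed for a list of λ bits (the smallest k with c(λ)≦M'(k)) can be sketched as follows. The function names and the example values M=128, W=3, W'=2 are assumptions for illustration.

```python
def segment_capacity(i, M, W, W_len):
    """M(i): bytes of inverted list that segment A(i) can hold (i >= 1)."""
    if i < 1:
        raise ValueError("segments are numbered from 1")
    # A(1) holds P3, L3 and P'3; later segments hold only P3 and L3.
    header = 2 * W + W_len if i == 1 else W + W_len
    return M - header

def segments_needed(lam, M, W, W_len):
    """Smallest k such that c(lam) <= M'(k) = M(1) + ... + M(k)."""
    if M <= 2 * W + W_len:
        raise ValueError("M is too small to hold any list data")
    nbytes = (lam + 7) // 8   # c(lam)
    k, total = 0, 0           # M'(0) = 0
    while total < nbytes:
        k += 1
        total += segment_capacity(k, M, W, W_len)
    return k

print(segment_capacity(1, 128, 3, 2), segment_capacity(2, 128, 3, 2))
print(segments_needed(8 * 121, 128, 3, 2))
```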

The criteria for choosing each parameter value and the size of each partial area are as follows. First, W need only be the number of bytes required to represent a slot number. On a 64-bit architecture W is at most 8 bytes, but if the slot numbers, including NULL, are known to be at most n, then W = c(log(n)+1) bytes suffices. So that an inverted index of at least megabyte scale can be built, n is preferably larger than 2^20; W is therefore preferably at least 3 bytes. W' may likewise be 8 bytes, but because the sizes of type 1 and type 2 memory areas and of segments are bounded by M, W' = c(log(8*M)+1) bytes suffices. With W' = 1 byte, the length of the inverted list within a segment can still be represented if M < 32 bytes, but then W+W' is at least 4 bytes, which exceeds 10% of M, so this choice does not adequately address the original goal of limiting the amount of data other than the inverted list itself. With W' = 2, it suffices that M < 2^13 = 8192, so M can be made large and the proportion taken up by the additional data can be kept small. As stated earlier, r should be 2. The basic slot 701 contains the partial area C 704, whose size is at least W, so b ≥ W ≥ 3. To speed up address calculation by using shifts and logical operations, the basic slot size b is preferably a power of 2; at the same time, because a basic slot is allocated to every index word, b should be as small as possible.
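The byte widths discussed above can be checked numerically. The following is a minimal sketch, assuming that c(x) denotes the number of bytes needed to hold x bits and that log(x)+1 means the number of binary digits of x; the function names are illustrative and do not appear in the specification.

```python
def bytes_for_bits(bits: int) -> int:
    """Assumed meaning of c(x): the smallest number of bytes holding `bits` bits."""
    return (bits + 7) // 8


def min_slot_number_width(n: int) -> int:
    """W: bytes needed for slot numbers up to n (floor(log2 n) + 1 == n.bit_length())."""
    return bytes_for_bits(n.bit_length())


def min_length_field_width(m: int) -> int:
    """W': bytes needed to record a bit length of up to 8*M bits."""
    return bytes_for_bits((8 * m).bit_length())


print(min_slot_number_width(2**20))    # width of W for about a million slots
print(min_length_field_width(31))      # largest M for which one byte is enough
print(min_length_field_width(8191))    # largest M for which two bytes are enough
```

Under these assumptions the thresholds quoted in the text (M < 32 for W' = 1 and M < 8192 for W' = 2) fall out of the bit-length calculation directly.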

Accordingly, given that W is at least 3, b is preferably at least 4. Like b, B is preferably a power of 2. In addition, it is preferable that b < B, so that a type 2 memory area is larger than a type 1 memory area, and that B < M, so that a type 3 memory area is larger than a type 2 memory area. From b < B, B is preferably at least 4*2 = 8. Since M = B*r^N and r = 2, M is also a power of 2; M ≤ 4096 = 8192/2 is therefore preferable, and consequently B ≤ 2048 = 4096/2 is preferable. Furthermore, from M > B, M ≥ 16 = 8*2 is preferable. Since log(M/B) ≤ log(8192/8) = 10, the value recorded in the partial area R 703 is an integer no greater than 10+1 = 11. Four bits, which can represent the integers from 0 to 11, are therefore sufficient for the partial area R 703, and fewer bits may suffice depending on the choice of M and B. The partial area L 704 need only be [log|W|+1] bits or more.
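The preferred ranges above can be collected into a small validity check. This is only a sketch with r fixed at 2; the function name, the returned dictionary, and the choice to reject out-of-range values with an exception are assumptions made for illustration.

```python
def is_power_of_two(x: int) -> bool:
    return x > 0 and (x & (x - 1)) == 0


def check_parameters(W: int, b: int, B: int, M: int) -> dict:
    """Check the preferred ranges of b, B and M, then derive N and the bit width of R."""
    if not (W >= 3 and b >= max(4, W) and is_power_of_two(b)):
        raise ValueError("b should be a power of 2 with b >= 4 and b >= W >= 3")
    if not (is_power_of_two(B) and b < B and 8 <= B <= 2048):
        raise ValueError("B should be a power of 2 with b < B and 8 <= B <= 2048")
    if not (is_power_of_two(M) and B < M and 16 <= M <= 4096):
        raise ValueError("M should be a power of 2 with B < M and 16 <= M <= 4096")
    N = (M // B).bit_length() - 1      # M = B * 2**N
    r_bits = (N + 1).bit_length()      # R takes the values 0 .. N+1
    return {"N": N, "R_bits": r_bits}


print(check_parameters(W=3, b=4, B=64, M=4096))
print(check_parameters(W=3, b=4, B=8, M=4096))
```

With the smallest B and the largest M in the preferred ranges, the R field needs 4 bits, which agrees with the bound stated in the text.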

Next, the method of building an inverted list with this data structure is described in detail.

1. When a new posting of bit length λ is added as the first posting of an index word, one new basic slot a 701 is acquired from the head of the unused part of the basic area. The type of memory area is then selected from types 1 to 3 according to λ.
1-1. If c(λ) ≤ W, type 1 is selected, as described above. In this case, within the basic slot a, 0 is recorded in a{R}, λ in a{L}, and the posting in a{C}.
1-2. If W < c(λ) ≤ B*r^(N-1), type 2 is selected, as described above. In this case, for the smallest integer n satisfying B*r^(n-1) ≥ c(λ), r^(n-1) consecutive slots are acquired from the extended area, and the posting is recorded in these r^(n-1) consecutive slots. Then n is recorded in a{R}, λ in a{L}, and the slot number of the first of the consecutive slots in a{C}.
1-3. If B*r^(N-1) < c(λ), type 3 is selected, as described above. In this case, r^N consecutive slots are first acquired from the extended area and used as A(1). The size of this area is M = B*r^N bytes. N+1 is recorded in a{R} and the slot number of A(1) in a{C}. Then, depending on λ, either step 1-3-1 or step 1-3-2 below is performed.
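The selection in steps 1-1 to 1-3 can be sketched as a single function. This is an illustration only: it assumes c(λ) is the byte length of λ bits, and the function name and the returned tuple (memory type, value for a{R}, number of slots to acquire from the extended area) are hypothetical.

```python
def bytes_for_bits(bits: int) -> int:
    """Assumed meaning of c(x): bytes needed to hold `bits` bits."""
    return (bits + 7) // 8


def select_memory_type(lam: int, W: int, B: int, N: int, r: int = 2):
    """Return (memory type, value for a{R}, slots to acquire) for a first posting of lam bits."""
    size = bytes_for_bits(lam)
    if size <= W:                          # step 1-1: fits inside the basic slot
        return 1, 0, 0
    if size <= B * r ** (N - 1):           # step 1-2: one contiguous type 2 area
        n = 1
        while B * r ** (n - 1) < size:     # smallest n with B*r^(n-1) >= c(lam)
            n += 1
        return 2, n, r ** (n - 1)
    return 3, N + 1, r ** N                # step 1-3: first segment A(1) of M bytes


# Example with W = 3, B = 8, N = 9 (so M = 8 * 2**9 = 4096)
for bits in (24, 25, 800, 8 * 2049):
    print(bits, select_memory_type(bits, W=3, B=8, N=9))
```

Because a{R} is 0 for type 1, between 1 and N for type 2, and N+1 for type 3, the single field is enough to tell the three cases apart when the list is read back.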

1-3-1. When c(λ) ≤ M(1)
NULL is recorded in A(1){P3}, λ in A(1){L3}, and NULL in A(1){P'3}, and the posting is recorded in A(1){C3}, packed from the beginning.

1-3-2. When c(λ) > M(1)
Let k be the smallest integer satisfying c(λ) ≤ M'(k). Set the variable i to 2 and, while i ≤ k, perform steps 1-3-2-1 to 1-3-2-4 below.
1-3-2-1. Acquire r^N new consecutive slots from the extended area and call them A(i).
1-3-2-2. Record the slot number of A(i) in A(i-1){P3}, M(i-1)*8 in A(i-1){L3}, and bytes M'(i-2)+1 through M'(i-1) of the posting in A(i-1){C3}.
1-3-2-3. Add 1 to i.
1-3-2-4. If i ≤ k, repeat from step 1-3-2-1.
When steps 1-3-2-1 to 1-3-2-4 are complete, record NULL in A(k){P3} and the integer λ-M'(k-1)*8 in A(k){L3}, and record the bytes of the posting from byte M'(k-1)+1 onward in A(k){C3}, packed from the beginning. Finally, record the slot number of A(k) in A(1){P'3}.
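The way a long posting is spread over segments in steps 1-3-1 and 1-3-2 can be sketched as follows. The sketch assumes that M'(k) is the total payload capacity of the first k segments, that is, M(1) + ... + M(k); the function names are illustrative, and slot acquisition and the P3 and P'3 links are left out so that only the L3 and C3 contents are shown.

```python
def segment_capacities(M: int, W: int, W_len: int):
    """Payload bytes M(1) of the first segment and M(i), i >= 2, of later segments."""
    return M - (2 * W + W_len), M - (W + W_len)


def split_into_segments(posting: bytes, lam: int, M: int, W: int, W_len: int):
    """Return one (L3 bit length, C3 bytes) pair per segment for a posting of lam bits."""
    first, rest = segment_capacities(M, W, W_len)
    chunks, pos, cap = [], 0, first
    while len(posting) - pos > cap:            # segment is filled completely
        chunks.append((cap * 8, posting[pos:pos + cap]))
        pos += cap
        cap = rest
    chunks.append((lam - pos * 8, posting[pos:]))   # last segment: remaining bits
    return chunks


# 236 bits occupy 30 bytes; with M = 16, W = 3, W' = 2 the capacities are 8 and 11 bytes
parts = split_into_segments(bytes(30), lam=236, M=16, W=3, W_len=2)
print([(bits, len(data)) for bits, data in parts])
```

The bit lengths recorded in L3 add up to λ, and only the last segment may be partially filled.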

Next, the method of adding a new posting to the existing inverted list of an index word is described. Let λ be the bit length of the existing inverted list and λ' the bit length of the new posting.

2. Refer to the basic slot a' corresponding to the index word and, from a'{R}, determine the type of memory area that holds the existing inverted list.
2-1. When a'{R} = 0, that is, when the existing inverted list is of type 1, obtain the length λ of the existing inverted list from a'{L}.
2-1-1. When c(λ+λ') ≤ W, the updated inverted list is also stored in a type 1 memory area. Simply append the new posting to the end of the existing inverted list and replace a'{L} with λ+λ'.
2-1-2. When W < c(λ+λ') ≤ B*r^(N-1), the updated inverted list is stored in a type 2 memory area. First, acquire r^(n-1) consecutive slots from the extended area, where n is the smallest integer satisfying B*r^(n-1) ≥ c(λ+λ'). Copy the existing inverted list into these r^(n-1) consecutive slots, packed from the beginning, and record the new posting immediately after it. Then record n in a'{R}, λ+λ' in a'{L}, and the slot number of the newly acquired memory area in a'{C}.
2-1-3. When B*r^(N-1) < c(λ+λ'), the updated inverted list is stored in a type 3 memory area. First, execute step 1-3, reading "new posting of bit length λ" as "existing inverted list of bit length λ" and "a" as "a'". Let A(k) be the last segment acquired in doing so. The integer k is used below for convenience of explanation; in an implementation, the value of k may be unknown as long as the slot number of the memory area A(k) is known. Next, execute step 2-1-3' below.
2-1-3'. If A(k){L3}+λ' ≤ M(k)*8, append the new posting to the end of A(k){C3} and add λ' to A(k){L3}; the addition is then complete. If A(k){L3}+λ' > M(k)*8, first set the variable D to M(k)*8-A(k){L3}, so that D is the number of bits of the inverted list that can still be appended to A(k). Next, append the first D bits of the new posting to A(k){C3} and record M(k)*8 in A(k){L3}. Then, while D < λ' holds, repeat steps 2-1-3-1 to 2-1-3-7 below.
2-1-3-1. Acquire r^N new consecutive slots from the extended area and call them A(k+1).
2-1-3-2. Record the slot number of A(k+1) in A(k){P3}.
2-1-3-3. Set the variable d to the smaller of D+M(k+1)*8 and λ'.
2-1-3-4. Record bits D+1 through d of the new posting in A(k+1){C3}, packed from the beginning.
2-1-3-5. Record d-D in A(k+1){L3}.
2-1-3-6. Replace k with k+1 and D with d.
2-1-3-7. If D < λ' holds, repeat from step 2-1-3-1.

Let A(k') be the last segment acquired. Record NULL in A(k'){P3}, and record the slot number of A(k') in A(1){P'3}. The slot number of A(1) can be obtained from a'{C}.
2-2. When 1 ≤ a'{R} ≤ N, that is, when the existing inverted list is stored in a type 2 memory area, let n be the integer recorded in a'{R}. If B*r^(n-1) ≥ c(λ+λ'), append the new posting immediately after the existing inverted list within the memory area that holds it, add λ' to a'{L}, and finish. If B*r^(n-1) < c(λ+λ'), first perform the same processing as in 2-1 to record the existing inverted list and the new posting in a new memory area. The memory area that previously held the existing inverted list is then discarded and made available for reuse. The method of discarding and reusing memory areas is described later.
2-3. When a'{R} > N, that is, when the existing inverted list is stored in a type 3 memory area, refer to A(1){P'3} of the segment A(1) located at the slot recorded in a'{C}. If A(1){P'3} is NULL, A(1) is the last segment of the list structure. Otherwise, the segment at the slot number recorded in A(1){P'3} is the last segment. For convenience of explanation, this last segment is assumed to be the k-th in the list; in an implementation, only the slot number of the last segment needs to be known, and the value of k need not be obtained. At this point, executing step 2-1-3' adds the new posting.
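The append path for a type 3 list (step 2-3 followed by step 2-1-3') can be sketched as a small in-memory simulation. Several simplifications are assumptions of this sketch and not part of the specification: bit strings are modelled as Python strings of '0' and '1', each segment is given a single slot number rather than r^N consecutive slots, NULL is taken to be 0, and the class and attribute names are hypothetical.

```python
NULL = 0  # option 1 for NULL: slot numbers are counted from 1


class Segment:
    def __init__(self, capacity_bits: int):
        self.capacity_bits = capacity_bits
        self.next = NULL   # P3: slot number of the next segment
        self.tail = NULL   # P'3: slot of the last segment (first segment only)
        self.bits = ""     # C3 payload; L3 corresponds to len(self.bits)


class SegmentChain:
    def __init__(self, M: int, W: int, W_len: int):
        self.first_cap = (M - (2 * W + W_len)) * 8   # M(1) in bits
        self.rest_cap = (M - (W + W_len)) * 8        # M(i), i >= 2, in bits
        self.head = 1
        self.segments = {self.head: Segment(self.first_cap)}
        self.next_unused = 2

    def append(self, posting_bits: str) -> None:
        head = self.segments[self.head]
        # Constant-time access to the last segment through P'3.
        last_slot = head.tail if head.tail != NULL else self.head
        last = self.segments[last_slot]
        room = last.capacity_bits - len(last.bits)   # D in step 2-1-3'
        last.bits += posting_bits[:room]
        done = min(room, len(posting_bits))
        while done < len(posting_bits):              # steps 2-1-3-1 to 2-1-3-7
            slot = self.next_unused
            self.next_unused += 1
            seg = Segment(self.rest_cap)
            seg.bits = posting_bits[done:done + self.rest_cap]
            self.segments[slot] = seg
            last.next = slot
            done += len(seg.bits)
            last, last_slot = seg, slot
        if last_slot != self.head:
            head.tail = last_slot

    def read(self) -> str:
        out, slot = [], self.head
        while slot != NULL:
            out.append(self.segments[slot].bits)
            slot = self.segments[slot].next
        return "".join(out)


chain = SegmentChain(M=16, W=3, W_len=2)
for posting in ("1" * 60, "0" * 10, "1" * 100):
    chain.append(posting)
print(len(chain.segments), chain.segments[1].tail)
```

The point of the design is visible in `append`: the chain is never walked when adding data, so the cost of an append does not depend on how many segments the list already has.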

When new postings are added repeatedly by the procedure above, the postings stored first are copied repeatedly, from a type 1 or type 2 memory area to a type 2 or type 3 memory area. However, the average number of copies m per byte of an inverted list is bounded as shown below, where n is the largest integer such that c(λ) > B*r^(n-1) for the length λ of the inverted list.

\[
\begin{aligned}
m &\le \frac{b + B + Br + Br^{2} + Br^{3} + \cdots + Br^{n-1}}{c(\lambda)} \\
  &\le \frac{b + B + Br + Br^{2} + Br^{3} + \cdots + Br^{n-1}}{Br^{n-1}} \\
  &= \frac{b}{Br^{n-1}} + \left(r^{-(n-1)} + r^{-(n-2)} + \cdots + r^{-2} + r^{-1} + 1\right) \\
  &\le 1 + \frac{r}{r-1} = 2 + \frac{1}{r-1} \le 3
\end{aligned}
\]
Therefore, the time cost of copying is not large.
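The bound can be checked numerically. The sketch below evaluates the second line of the derivation, (b + B + B*r + ... + B*r^(n-1)) / (B*r^(n-1)), for r = 2; the function name and the sample parameter values are chosen for illustration only.

```python
def copy_bound(b: int, B: int, n: int, r: int = 2) -> float:
    """Upper bound on the average copies per byte for a list that has outgrown B*r^(n-1)."""
    total = b + sum(B * r ** i for i in range(n))   # b + B + B*r + ... + B*r^(n-1)
    return total / (B * r ** (n - 1))


for n in (1, 2, 5, 10):
    print(n, copy_bound(b=4, B=8, n=n))
```

For r = 2 the value stays below 3 for every n, since b < B keeps the first term below 1 and the geometric series stays below 2.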

The method of discarding a memory area and making it available for reuse is now described. In this method, discarded memory areas are managed with the list structure shown in FIG. 11. N such lists are provided, one for each size of discarded memory area, and the n-th list manages free areas of size B*r^(n-1). On reuse, a memory area is allocated from the head of the list for the appropriate size. As described above, the free area list recording area 603 consists of N elements 605, each of size W. The n-th element 605 (n = 1, 2, ..., N) records the slot number of the first free area in the list that links the free areas of size B*r^(n-1). Each element 605 is initialized to NULL at the start of processing. When a memory area A of size B*r^(n-1) is discarded, the value previously held in the n-th element 605 is copied into the first W bytes 1102 of A, and the slot number of the newly discarded area A is recorded in the n-th element 605. The copied value is NULL when A becomes the first memory area in the list. Conversely, when adding a posting requires a memory area of r^(n-1) consecutive slots, allocation from the n-th free area list is attempted first, and new unused slots are allocated only if that fails. In more detail, the procedure is as described in (1) and (2) below.

(1) If the n-th element 605 of the free area list recording area 603 is not NULL, the memory area A at the slot number recorded in the n-th element 605 is allocated, and the slot number recorded in the first W bytes 1102 of the allocated memory area A is copied into the n-th element 605.

(2) If the n-th element 605 of the free area list recording area 603 is NULL, r^(n-1) consecutive slots are allocated from the head of the part of the extended area that has never been used.
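The discard and reuse procedure amounts to one singly linked free list per size class. The following sketch models it in memory; taking NULL as 0, keeping the link stored in the first W bytes of a freed area in a dictionary, and the class and method names are all assumptions made for illustration.

```python
NULL = 0  # slot numbers are counted from 1


class FreeLists:
    def __init__(self, N: int, r: int = 2):
        self.r = r
        self.heads = [NULL] * (N + 1)   # heads[n] plays the role of the n-th element 605
        self.link = {}                  # slot -> value kept in the first W bytes (1102)
        self.next_unused = 1            # head of the never-used part of the extended area

    def discard(self, slot: int, n: int) -> None:
        """Push the area of r^(n-1) slots starting at `slot` onto the n-th list."""
        self.link[slot] = self.heads[n]
        self.heads[n] = slot

    def allocate(self, n: int) -> int:
        """Return the first slot of an area of r^(n-1) consecutive slots."""
        slot = self.heads[n]
        if slot != NULL:                              # case (1): reuse a discarded area
            self.heads[n] = self.link.pop(slot)
            return slot
        slot = self.next_unused                       # case (2): take fresh slots
        self.next_unused += self.r ** (n - 1)
        return slot


fl = FreeLists(N=3)
a = fl.allocate(2)
b = fl.allocate(2)
fl.discard(a, 2)
fl.discard(b, 2)
print(fl.allocate(2), fl.allocate(2), fl.allocate(2))
```

Both operations touch only the list head, so discarding and reallocating an area take constant time, and the link costs no extra memory because it lives inside the freed area itself.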

Besides this method, any known garbage collection method may be used to reuse discarded memory.

A variable-length array is commonly implemented so that, when the memory already acquired runs out, a new memory area whose size is the size of the existing area multiplied by some ratio P is acquired and all elements of the existing array are copied into it. In the method of the present invention, however, the variable-length array as a whole need not occupy a contiguous memory area. The variable-length array is therefore divided into a plurality of blocks 1002, pointers to the blocks 1002 are stored in a separate variable-length array 1001, and the blocks are accessed through these pointers (FIG. 10). When new consecutive slots are needed, allocation is first attempted within the existing blocks, but the requested number of consecutive slots may be unavailable because of a shortage of free slots or slot fragmentation. In that case, a new block is added to the variable-length array 1003 and the requested number of consecutive slots is allocated from it. So that any element of the variable-length array 1003 can be accessed in constant time regardless of its length, the variable-length array 1001 of pointers is preferably kept in a contiguous memory area. For this reason, the pointer array 1001 is preferably implemented by the method described above, in which all elements of the existing array are copied into a new memory area.
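The two-level arrangement of FIG. 10 can be sketched as follows. The sketch assumes a power-of-two block size so that an index splits into a block number and an offset by a shift and a mask; that choice, and the class and attribute names, are illustrative and not taken from the specification. The Python list `blocks` stands in for the pointer array 1001.

```python
class BlockedArray:
    def __init__(self, block_shift: int = 4):
        self.shift = block_shift
        self.mask = (1 << block_shift) - 1
        self.blocks = []    # pointer array 1001: one entry per block 1002
        self.length = 0

    def append(self, value) -> None:
        if (self.length >> self.shift) == len(self.blocks):
            # All existing blocks are full: add a block without moving old data.
            self.blocks.append([None] * (1 << self.shift))
        self.blocks[self.length >> self.shift][self.length & self.mask] = value
        self.length += 1

    def __getitem__(self, i: int):
        if not 0 <= i < self.length:
            raise IndexError(i)
        return self.blocks[i >> self.shift][i & self.mask]


arr = BlockedArray(block_shift=2)   # blocks of 4 elements
for v in range(10):
    arr.append(v * v)
print(arr[9], len(arr.blocks))
```

Growing the array only ever copies the small pointer array, never the stored data, while element access remains two indexing operations.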

The sizes of the areas P, C, P3, L3, and P'3 have been given as W or W' bytes, but they need not be aligned to byte boundaries and may be trimmed to a number of bits as needed.

When postings are compressed with the variable-byte method, in which the boundary of each posting falls on a byte boundary (F. Scholer, H.E. Williams, J. Zobel, Compression of Inverted Indexes for Fast Query Evaluation, Proc. 25th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 222-229, 2002), the length of the inverted list can be expressed in bytes rather than bits, so W' can in some cases be shortened by one byte.
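For reference, a variable-byte code stores an integer in 7-bit groups, one group per byte, with the remaining bit marking the final byte. The sketch below uses one common convention (least significant group first, top bit set on the last byte); the exact bit layout used in the cited paper or in any particular implementation may differ, and the function names are illustrative.

```python
def vbyte_encode(n: int) -> bytes:
    """Encode a non-negative integer in 7-bit groups; the last byte has its top bit set."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)    # low 7 bits, top bit clear: more bytes follow
        n >>= 7
    out.append(n | 0x80)        # final byte of this integer
    return bytes(out)


def vbyte_decode(data: bytes) -> list:
    """Decode a concatenation of variable-byte codes back into integers."""
    values, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:
            values.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return values


encoded = vbyte_encode(5) + vbyte_encode(300)
print(encoded.hex(), vbyte_decode(encoded))
```

Since every code ends on a byte boundary, the length of a list of such codes is a whole number of bytes, which is why the length fields can count bytes instead of bits.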

The data structure of the present invention is also applicable to cases in which a plurality of variable-length byte sequences are managed, as in a hash table.

The present invention provides a data structure for an inverted index to which documents to be searched can be added incrementally. Because this data structure uses few pointers, its memory overhead is small; in addition, because short inverted lists up to a certain length are placed in contiguous memory areas, it allows fast access and is easy to implement.

FIG. 1 is a schematic diagram of a document set to be searched and the inverted index built from it.
FIG. 2 is an explanatory diagram of an inverted list formed by linking postings in a list structure.
FIG. 3 shows the three types of memory areas allocated for inverted lists.
FIG. 4 is a schematic diagram of one form of an apparatus for carrying out the present invention.
FIG. 5 shows an example of the flow of data and processing according to the present invention.
FIG. 6 is a schematic diagram of an example of the data structure of the entire inverted index.
FIG. 7 shows the data structure of a basic slot.
FIG. 8 shows the data structure of a type 2 memory area.
FIG. 9 shows the data structure of a type 3 memory area.
FIG. 10 is a schematic diagram of the variable-length array.
FIG. 11 is an explanatory diagram of the list structure that manages free slots.

Explanation of symbols

101: Set of documents to be searched
102: Inverted index
103: Index word
104: Document to be searched
105: Document identification number
106: Number of occurrences of an index word in a document
107: Posting
108: Pointer to an inverted list, assigned to each index word
109: Inverted list
301: Type 1 memory area
302: Type 2 memory area
303: One element of a type 3 memory area, that is, a segment
401: Central processing unit (CPU)
402: Memory (main storage device)
403: Auxiliary storage device
404: Removable media
405: Network
406: User terminal
407: Inverted index to which documents can be added incrementally
601: Basic area in the data structure of the inverted index of the present invention
602: Extended area in the data structure of the inverted index of the present invention
603: Free area list recording area in the data structure of the inverted index of the present invention
604: Slot in the extended area
605: One element of the free area list recording area
606: Hatching representing the constant NULL
701: Basic slot
702: Area in the basic slot that stores the type of memory area and, for type 2, the size of the acquired memory area
703: Area in the basic slot that stores the length of the region occupied by the inverted list in a type 1 or type 2 memory area
703: Area, within the region of the basic area corresponding to one index word, that records the inverted list when a type 1 memory area is used, or the location of the memory area when a type 2 or type 3 memory area is used
801: Area of a type 2 memory area that stores the inverted list
802: Basic slot corresponding to a type 2 memory area
901: First segment of the list structure forming a type 3 memory area
902: Segment other than the first and last segments of the list structure forming a type 3 memory area
903: Last segment of the list structure forming a type 3 memory area
908: Basic slot corresponding to a type 3 memory area
1001: Array storing pointers to the individual blocks of the variable-length array
1002: One of the blocks of the variable-length array that store the actual data
1003: One form of the variable-length array used in the present invention
1101: Memory area that has been discarded and is now unused
1102: Area of size W that stores the slot number of the next memory area in the list structure linking discarded, unused memory areas

Claims

1. A program for causing a computer to execute a process of storing, in a computer-readable recording medium, an inverted index that stores, for every index word of a document set consisting of a plurality of electronic documents each assigned an identification number, an inverted list of postings, each posting being a pair of the identification number of an electronic document in which the index word appears and the number of times the index word appears in that electronic document, the program causing the computer to execute:
using a first type of memory area and a second type of memory area to store the inverted list;
when the inverted list has a data size of W bytes or less, storing the inverted list in the first type of memory area as a contiguous memory area; and
when the data size is larger than W bytes, storing the inverted list in the second type of memory area, in which one or more contiguous memory areas are linked by a list structure, recording in the first memory area the position of the last memory area linked by the list structure, and further storing in the first type of memory area a pointer to the first area of the second type of memory area.

2. A program for causing a computer to execute a process of storing, in a computer-readable recording medium, an inverted index that stores, for every index word of a document set consisting of a plurality of electronic documents each assigned an identification number, an inverted list of postings, each posting being a pair of the identification number of an electronic document in which the index word appears and the number of times the index word appears in that electronic document, the program causing the computer to execute:
using a first type of memory area, a second type of memory area, and a third type of memory area to store the inverted list;
when the inverted list has a data size of W bytes or less, storing the inverted list in the first type of memory area as a contiguous memory area;
when the data size is larger than W bytes and no larger than M bytes, M being a value specified by the user, storing the inverted list in the second type of memory area as a contiguous memory area, and storing in the first type of memory area a pointer to the first area of the second type of memory area; and
when the data size is larger than M bytes, storing the inverted list in the third type of memory area, in which one or more contiguous memory areas are linked by a list structure, recording in the first memory area the position of the last memory area linked by the list structure, and further storing in the first type of memory area a pointer to the first area of the third type of memory area.

3. The program according to claim 2, wherein the size of the second type of memory area and the size of each memory element of the third type of memory area are integer multiples of B bytes, B being a specified integer, and a new memory area is secured by acquiring one or more elements from a variable-length array consisting of memory areas of size B.

4. The program according to claim 2 or 3, wherein the value of M is a power of 2 and is not less than 16 and not more than 4096.

5. The program according to claim 3, wherein the value of B is a power of 2 and is not less than 8 and not more than 2048.

6. The program according to claim 5, wherein the value of M is a power of 2 and is not less than 16 and not more than 4096.

7. An inverted index storage method for storing, in a computer-readable recording medium through information processing by a computer, an inverted index that stores, for every index word of a document set consisting of a plurality of electronic documents each assigned an identification number, an inverted list of postings, each posting being a pair of the identification number of an electronic document in which the index word appears and the number of times the index word appears in that electronic document, the method comprising:
the computer using a first type of memory area and a second type of memory area to store the inverted list;
when the inverted list has a data size of W bytes or less, the computer storing the inverted list in the first type of memory area as a contiguous memory area; and
when the data size is larger than W bytes, the computer storing the inverted list in the second type of memory area, in which one or more contiguous memory areas are linked by a list structure, recording in the first memory area the position of the last memory area linked by the list structure, and storing in the first type of memory area a pointer to the first area of the second type of memory area.

8. An inverted index storage method for storing, in a computer-readable recording medium through information processing by a computer, an inverted index that stores, for every index word of a document set consisting of a plurality of electronic documents each assigned an identification number, an inverted list of postings, each posting being a pair of the identification number of an electronic document in which the index word appears and the number of times the index word appears in that electronic document, the method comprising:
the computer using a first type of memory area, a second type of memory area, and a third type of memory area to store the inverted list;
when the inverted list has a data size of W bytes or less, the computer storing the inverted list in the first type of memory area as a contiguous memory area;
when the data size is larger than W bytes and no larger than M bytes, M being a value specified by the user, the computer storing the inverted list in the second type of memory area as a contiguous memory area, and storing in the first type of memory area a pointer to the first area of the second type of memory area; and
when the data size is larger than M bytes, the computer storing the inverted list in the third type of memory area, in which one or more contiguous memory areas are linked by a list structure, recording in the first memory area the position of the last memory area linked by the list structure, and further storing in the first type of memory area a pointer to the first area of the third type of memory area.