CN103020299B

CN103020299B - Method and storage device for storing an inverted index and its appended data in full-text retrieval

Info

Publication number: CN103020299B
Application number: CN201210591989.9A
Authority: CN
Inventors: 范振勇; 吴震; 张学; 崔维力; 武新; 赵伟
Original assignee: TIANJIN NANKAI UNIVERSITY GENERAL DATA TECHNOLOGIES Co Ltd; National Computer Network and Information Security Management Center
Current assignee: Tianjin NanKai University General Data Technologies Co., Ltd.; National Computer Network and Information Security Management Center
Priority date: 2012-12-29
Filing date: 2012-12-29
Publication date: 2016-01-13
Anticipated expiration: 2032-12-29
Also published as: CN103020299A

Abstract

The invention provides a method for efficiently storing an inverted index in a full-text retrieval system, comprising: detecting whether the data length of an index unit exceeds a threshold K; if the index unit data is longer than n*K and shorter than (n+1)*K (n being a natural number), storing the part from the beginning up to n*K in index unit data blocks and storing the remaining index unit data in a B-tree; if the length equals n*K, storing the part from the beginning up to n*K in index unit data blocks; if the length is less than K, storing all of the index unit data in the B-tree. The invention effectively improves the storage efficiency of the inverted entries of a full-text index and the data read rate, and makes it easy to implement a copy-on-write (Copy-On-Write) mechanism, which in turn improves data safety and the concurrency of indexing and reading data.

Description

Method and storage device for storing an inverted index and its appended data in full-text retrieval

Technical field

The invention belongs to the field of data storage, and in particular relates to a method and storage device for storing an inverted index and its appended data in full-text retrieval.

Background technology

In relational database systems, full-text indexing is one of the most efficient ways to search document data. In today's network environment, both the volume of information and the number of users are growing explosively, and full-text indexing has become one of the main means used by information retrieval systems. The inverted index is the core of a full-text retrieval system, and its storage structure has a great influence on the system's performance.

An inverted index, also often called a reverse index, postings file or inverted file, is an indexing method used in full-text search to store a mapping from a word to its locations in a document or a set of documents. It is the most commonly used data structure in document retrieval systems. With an inverted index, the list of documents containing a word can be obtained quickly from that word.

The inverted entry data (posting data) of a full-text index consists of a TermID together with the set of document numbers and in-document offsets that correspond to it, in the form TermID --> {<docID, {offset}>}. A TermID is a minimum index unit produced by the tokenizer; in Chinese full-text retrieval it is generally a character, a word, an English string, a digit string or a combination of several of these. Such data has the following characteristics:

1. The posting data lengths of different Terms differ greatly: common words occur very frequently, while some special strings occur only once.

2. The vocabulary of a full-text index is huge: a single full-text search library often has tens of millions of terms, and each term, as a retrieval unit, can take up a large amount of storage space.
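As an illustration of the posting format TermID --> {<docID, {offset}>} and of how unevenly posting lengths are distributed, the following minimal Python sketch builds such a mapping in memory. It rests on assumptions that are not in the source: terms come from whitespace splitting instead of a real tokenizer, the term string itself stands in for the TermID, and offsets are token positions. The name `build_postings` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def build_postings(docs: Dict[int, str]) -> Dict[str, Dict[int, List[int]]]:
    """Map each term to {docID: [offsets]} (assumed in-memory form of a posting)."""
    postings: Dict[str, Dict[int, List[int]]] = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        # Assumption: whitespace splitting replaces the tokenizer of the source.
        for offset, term in enumerate(text.split()):
            postings[term][doc_id].append(offset)
    # Convert nested defaultdicts to plain dicts for predictable behavior.
    return {term: dict(per_doc) for term, per_doc in postings.items()}


def posting_size(per_doc: Dict[int, List[int]]) -> int:
    """Count the offsets stored for one term, a rough proxy for posting length."""
    return sum(len(offsets) for offsets in per_doc.values())
```

On a small corpus, a frequent term accumulates many offsets across documents while a rare term has a single entry, which is the length disparity described in characteristic 1.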

Posting data of a full-text index is generally stored in one of two ways. In the first, empirical data is used to divide a known dictionary into several grades by word frequency (high, medium and low), and each grade is stored with a different data block size: blocks for high-frequency words are the largest and blocks for the lowest-frequency words are the smallest. The benefit is that little disk space is wasted and read efficiency is assured. The drawback is that once the real data departs from the empirical data, or new vocabulary appears and word frequencies change substantially, space is wasted on high-frequency words and the overlong block chains of low-frequency words make reads inefficient. In the second, data is divided into large units, for example one unit per 1 GB of data; within each unit all inverted entry data is sorted by term number, document number and offset (TermID, DocID, Offset), and before indexing completes all units are merged into one final unit. The advantage is that no space is wasted. The drawback is that the final sort needs free disk space equal to the size of the data, so disk space utilization is only 50%, and the merge takes a long time.

Summary of the invention

The problem to be solved by the invention is to provide a method for efficiently storing an index in a full-text retrieval system that substantially improves disk space utilization.

To solve the above technical problem, the invention adopts the following technical solution: a method for efficiently storing an index in a full-text retrieval system, comprising:

1) comparing the data length of the index unit with a preset threshold K;

2) if the data length is less than K, storing all of the index unit data in a B-tree;

3) if the data length equals K, storing the part of the index unit data from the beginning up to K in an index unit data block;

4) if the data length is greater than K, comparing it with n*K (n = 2, 3, ...) and storing the data as follows:

a. if the data length is greater than (n-1)*K and less than n*K, storing the part from the beginning up to (n-1)*K in index unit data blocks and storing the remainder in the B-tree;

b. if the data length equals n*K, storing all of the index unit data in order in index unit data blocks.
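The cases above reduce to one rule: the largest multiple of K that fits goes to data blocks, and the tail shorter than K goes to the B-tree. A minimal Python sketch of that rule follows, assuming posting data is a byte string and modeling the block file and the B-tree only through the returned pieces; `Placement` and `split_index_unit` are hypothetical names, not taken from the source.

```python
from typing import List, NamedTuple


class Placement(NamedTuple):
    blocks: List[bytes]   # full K-byte units destined for the block file
    remainder: bytes      # tail shorter than K destined for the B-tree


def split_index_unit(data: bytes, k: int) -> Placement:
    """Split index unit data into full K-sized blocks and a tail shorter than K."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Largest multiple of K that does not exceed the data length.
    full = (len(data) // k) * k
    blocks = [data[i:i + k] for i in range(0, full, k)]
    # Length < K: no blocks, everything is the remainder (step 2).
    # Length == n*K: only blocks, empty remainder (steps 3 and 4b).
    # Otherwise: (n-1)*K in blocks, the rest in the remainder (step 4a).
    return Placement(blocks, data[full:])
```

With K = 4, ten bytes of data yield two blocks and a two-byte remainder, three bytes yield only a remainder, and eight bytes yield two blocks and an empty remainder.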

Further, the B-tree can store at least two items, each being either index unit data whose length is less than K or the remaining data left after index unit data longer than K has been stored in the block device.

Further, the B-tree is a variant of the B+ tree in which the pointers to sibling nodes stored in the leaf nodes are removed.

According to another aspect, the invention also provides a storage device for the efficient index storage method, comprising:

a data block storage unit, used to store the part of the index unit data that is an integer multiple of K;

a B-tree storage page unit, used to store the remainder of the index unit data that is shorter than K.

A method for storing appended full-text index data with the storage device provided by the invention:

1) calculating the sum of the length of the index unit data in the original B-tree and the length of the appended index unit data;

2) if the sum is greater than n*K and less than (n+1)*K (n = 1, 2, 3, ...), taking out the part stored in the B-tree, storing the data from the start of that B-tree part up to n*K in index unit data blocks, and storing the remaining index unit data in the B-tree;

3) if the sum equals n*K, taking out the part stored in the B-tree and storing the data from the start of that B-tree part up to n*K in index unit data blocks;

4) if the sum is less than K, appending the added part in order to the original B-tree storage area.
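The append steps can be sketched the same way: the fragment already held in the B-tree is concatenated with the appended data, whole multiples of K are moved to data blocks, and whatever is left returns to the B-tree. This is a minimal sketch under the assumption that data is a byte string and that the caller applies the result to the real block file and B-tree; `append_index_unit` is a hypothetical name.

```python
from typing import List, Tuple


def append_index_unit(tree_part: bytes, appended: bytes, k: int) -> Tuple[List[bytes], bytes]:
    """Return (new blocks to add to the block file, new B-tree fragment)."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Step 1: combine the existing B-tree fragment with the appended data.
    combined = tree_part + appended
    # Steps 2 and 3: everything up to the largest multiple of K becomes blocks.
    full = (len(combined) // k) * k
    new_blocks = [combined[i:i + k] for i in range(0, full, k)]
    # Step 4 falls out naturally: if the sum is below K, no block is produced
    # and the whole combined data stays in the B-tree.
    return new_blocks, combined[full:]
```

An empty returned fragment corresponds to the case where the sum equals n*K, in which the key's entry can be removed from the B-tree.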

With the above technical solution, the storage efficiency of the inverted entries of a full-text index is effectively improved, wasted disk space is reduced, and the data read rate is effectively improved. A copy-on-write (CopyOnWrite) mechanism can also be implemented conveniently, which in turn improves data safety and the concurrency of indexing and reading data.

Description of the drawings

Fig. 1 is a schematic flowchart of the method of the invention for efficiently storing an index and its appended data.

Fig. 2 is a storage schematic of an example of the invention.

Fig. 3 is a schematic of the copy-on-write mechanism in an example of the invention.

Embodiment

In one implementation scenario of the invention, full-text index inverted entry data is built from the full text, in the form TermID (term number) --> {<docID (document number), {offset}>}. In this example, Key-Value data is stored with a B-tree combined with a block file: the Key is the term number and the Value is the inverted entry data, which is variable-length. While the index is being built, data is continually appended to the Value. When its length exceeds the preset threshold n*K (n = 1, 2, 3, ...), the first n*K of the Value is extracted and stored in block units, and the remaining index data, which is shorter than K, is kept in the B-tree. Block units hold the inverted entry data of the corresponding TermID as fixed-length data, and each data block forms a chain with the corresponding key-value part in the B-tree. As a result, the block file holds the whole-block part of each inverted entry and the B-tree holds its fragment.

The threshold K is set with reference to the performance of the storage that holds the inverted entry data. Generally a basic computer storage unit size such as 32k or 64k is used, which suits storing data on disk arrays.

As shown in Fig. 2, suppose the tokenizer delimits three minimum index units in the full text, numbered TermID(1), TermID(2) and TermID(3). TermID(1) is a string that occurs only a few times in the text. When its inverted entry index is stored, as shown at reference numeral 21 in Fig. 2, the data length of the inverted entry index of TermID(1) is less than K, i.e. TermID(1)-value < K, so its index data is stored in a B-tree storage page.

TermID(2) is another string, which occurs more often in the text. When its inverted entry index is stored, as shown at reference numeral 22 in Fig. 2, the data length of the inverted entry index of TermID(2) equals a multiple of the preset threshold K, i.e. TermID(2)-value = n*K (n = 1, 2, 3, ...), so its index data is stored in storage block units.

TermID(3) is another string in the full-text index, which occurs frequently in the text. When its inverted entry index is stored, as shown at reference numeral 23 in Fig. 2, the data length of the inverted entry index of TermID(3) is greater than n*K and less than (n+1)*K, i.e. (n+1)*K > TermID(3)-value > n*K. The first n*K of the Value is extracted and stored in storage block units, the remainder is stored in the B-tree, and the data blocks form a chain with the corresponding key-value part in the B-tree.

In practice, data is often appended to inverted entries, and the storage method again depends on the resulting data length. Take TermID(1) from the example above: TermID(1) receives a small amount of appended data, and the length of its index data in the B-tree storage page is added to the length of the appended data. If after the addition TermID(1)-value = n*K, the index data is taken out of the original B-tree storage page, the index data from that page and the appended data are stored in order in data blocks, and the data whose key is this TermID is deleted from the B-tree. If after the addition (n+1)*K > TermID(1)-value > n*K (n = 1, 2, 3, ...), the index data is taken out of the original B-tree storage page, the first n*K of the combined data (the index data from that page followed by the appended data) is stored in order in storage blocks, and the remaining appended data is written back under the same TermID in the B-tree. If the combined length of the appended index unit data and the index unit data already stored in the B-tree is still less than K, the appended data is stored in order in the B-tree storage page.

According to the long-tail effect of big data, most terms in the inverted entries of a full-text index have very little data, not enough to fill one block file unit. Because B-tree pages store variable-length Key-Value data, one page can hold the fragments of the inverted entries of several Keys. The default disk utilization of the B-tree is 75%, so the total disk utilization is (B-tree file size × 75% + block file size) / (B-tree file size + block file size). When the block file is large, the total disk utilization approaches 100% (99.5% on test data of 10 million news articles).
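The utilization formula above is easy to evaluate directly. The short Python sketch below does so; the function name `total_disk_utilization` and the sizes used in the usage note are illustrative assumptions, not figures from the source.

```python
def total_disk_utilization(btree_bytes: float, block_bytes: float,
                           btree_fill: float = 0.75) -> float:
    """(B-tree size x fill ratio + block file size) / (B-tree size + block file size)."""
    total = btree_bytes + block_bytes
    if total <= 0:
        raise ValueError("total size must be positive")
    # The block file is assumed fully used; only the B-tree has slack space.
    return (btree_bytes * btree_fill + block_bytes) / total
```

For example, with no block file at all the result is the B-tree fill ratio of 0.75, and as the block file grows relative to the B-tree file the result moves toward 1.0, which is the trend the text describes.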

B-tree index pages and block file block units are both fixed-length, which makes direct addressing and caching algorithms convenient; here the B-tree page size = 4 × the block unit size. In a real project, the memory available for the cache, disk read/write efficiency and other factors can be weighed to decide the B-tree page size and the size of each block unit of the block file. The block file can be reorganized during system idle time so that the posting data of the same Term is placed in contiguous disk space.

For a term whose posting data is smaller than one block unit, each read costs exactly one B-tree read, and the number of disk reads equals the B-tree depth L. For a term whose posting data exceeds one block unit, the cost of each complete read is a B-tree read plus reading the block data. Because logical operations such as AND and OR are common in full-text indexes, posting data is generally read at random, so the fixed length of each block in the block file can be fully exploited, which effectively improves data read efficiency.
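Because every block has the same length K, a block can be located by multiplying its number by K, with no scanning. The Python sketch below shows one way a complete posting might be reassembled from its block chain and its B-tree fragment. The representation of the chain as a list of block numbers, and the names `read_block` and `read_posting`, are assumptions made for illustration; the source does not specify how the chain is encoded.

```python
from typing import BinaryIO, Iterable


def read_block(block_file: BinaryIO, block_no: int, k: int) -> bytes:
    """Read one fixed-length block by direct offset arithmetic."""
    block_file.seek(block_no * k)
    return block_file.read(k)


def read_posting(block_file: BinaryIO, chain: Iterable[int],
                 tree_fragment: bytes, k: int) -> bytes:
    """Concatenate the chained blocks, then the fragment kept in the B-tree."""
    # Assumption: `chain` lists the term's block numbers in logical order.
    blocks = b"".join(read_block(block_file, n, k) for n in chain)
    return blocks + tree_fragment
```

Any seekable binary file object works here, including an `io.BytesIO` buffer in tests. Blocks of one term need not be adjacent on disk, which is why the idle-time reorganization mentioned earlier can help sequential reads.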

When a full-text index is built in real time, posting data needs to be appended. Owing to the hierarchical structure of the B-tree, a copy-on-write mechanism is easy to implement: because the B-tree is a tree, each page is copied before it is modified, and a copy is also made of every storage page that directly or indirectly references that page. All modifications are made on the copies, which yields two versions, one before and one after the modification. Reads use the pre-modification version and writes use the post-modification version, so the two do not interfere and concurrency improves. Block storage works the same way: data is only appended at the end of the file, so reads and writes are likewise separated and do not interfere, which improves concurrent access. When the data is committed successfully, the write version becomes the read version; if the commit fails, the write version is discarded, which guarantees data integrity. As shown in Fig. 3, the two read/write versions allow data to be appended in real time and improve the concurrency of searching while the index is being built. To make the COW mechanism convenient to implement, the B-tree is a variant of the B+ tree that differs in that the pointers to sibling nodes stored in leaf nodes are removed; otherwise a modification would require copying the entire B-tree.
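The copy-on-write behavior can be illustrated with path copying on a small tree whose leaves hold no sibling pointers. The following Python sketch is a simplification under stated assumptions: pages are in-memory objects, page splits and disk I/O are omitted, and the names `Leaf`, `Inner` and `cow_put` are hypothetical. It is not the patented implementation, only an instance of the general technique.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Dict, Tuple, Union


@dataclass(frozen=True)
class Leaf:
    entries: Dict[int, bytes]        # TermID -> posting fragment; no sibling pointer


@dataclass(frozen=True)
class Inner:
    separators: Tuple[int, ...]      # child i holds keys below separators[i]; last child holds the rest
    children: Tuple["Node", ...]


Node = Union[Leaf, Inner]


def cow_put(node: Node, term_id: int, fragment: bytes) -> Node:
    """Return a new root; only pages on the root-to-leaf path are copied."""
    if isinstance(node, Leaf):
        entries = dict(node.entries)  # copy the page before modifying it
        entries[term_id] = fragment
        return Leaf(entries)
    i = bisect_right(node.separators, term_id)
    children = list(node.children)
    # Only the child on the path is replaced; the other children are shared.
    children[i] = cow_put(children[i], term_id, fragment)
    return Inner(node.separators, tuple(children))
```

Readers keep using the old root while a writer works on the new one. Committing amounts to publishing the new root, and aborting amounts to dropping it. If leaves pointed to their siblings, copying one leaf would force its neighbor to be copied to update that pointer, and so on along the leaf level, which is why the variant removes those pointers.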

An embodiment of the invention has been described in detail above, but the description covers only a preferred embodiment and must not be taken to limit the scope of the invention. All equivalent changes and improvements made within the scope of this application shall remain covered by this patent.

Claims

1. A method for efficiently storing an inverted index in a full-text retrieval system, comprising:

3) if the data length of the index unit equals K, storing the part of the index unit data from the beginning up to K in an index unit data block;

2. The method for efficiently storing an inverted index in a full-text retrieval system according to claim 1, characterized in that: the B-tree is a variant of the B+ tree in which the pointers to sibling nodes stored in the leaf nodes are removed.

3. A storage device for the method for efficiently storing an inverted index in a full-text retrieval system according to claim 1, comprising:

4. The storage device for the method for efficiently storing an inverted index in a full-text retrieval system according to claim 3, characterized in that: the B-tree storage page unit can store at least two items, each being either index unit data whose length is less than K or the remaining data left after index unit data longer than K has been stored in the block device.

5. The storage device according to claim 4, characterized in that: the B-tree is a variant of the B+ tree in which the pointers to sibling nodes stored in the leaf nodes are removed.

6. A data appending storage method for the storage device of the method for efficiently storing an inverted index in a full-text retrieval system as claimed in claim 4, comprising:

1) calculating the sum of the length of the index unit data in the original B-tree and the length of the appended index unit data;

3) if the sum equals n*K, taking out the part stored in the B-tree and storing the data from the start of that B-tree part up to n*K in index unit data blocks;