CN103020299A

CN103020299A - Storage method and device for inverted indexes and appended data in full-text search

Info

Publication number: CN103020299A
Application number: CN2012105919899A
Authority: CN
Inventors: 张学; 范振勇; 崔维力; 武新; 赵伟
Original assignee: TIANJIN NANDA GENERAL DATA TECHNOLOGY Co Ltd
Current assignee: Tianjin NanKai University General Data Technologies Co., Ltd.; National Computer Network and Information Security Management Center
Priority date: 2012-12-29
Filing date: 2012-12-29
Publication date: 2013-04-03
Anticipated expiration: 2032-12-29
Also published as: CN103020299B

Abstract

The invention provides a method for efficiently storing inverted indexes in a full-text search system. The method comprises the following steps: detecting whether the length of the index unit data is larger than a threshold K; if the length is larger than n*K and less than (n+1)*K (n is a natural number), storing the index unit data from the beginning up to n*K in index unit data blocks and storing the rest in a B-tree; if the length is equal to n*K, storing the index unit data from the beginning up to n*K in the index unit data blocks; and if the length is less than K, storing all of the index unit data in the B-tree. The method has the following advantages: the storage efficiency of the full-text index of an inverted file is effectively increased, the data reading speed is increased, a copy-on-write mechanism can be conveniently implemented, data safety is enhanced, and the concurrency of indexing and reading data is further increased.

Description

Storage method and storage device for inverted indexes and appended data thereof in full-text search

Technical field

The invention belongs to the field of data storage, and in particular relates to a storage method and a storage device for inverted indexes and their appended data in full-text search.

Background art

In relational database systems, full-text indexing is one of the most efficient ways to search document data. In the current network environment, both the amount of information and the number of users are growing explosively, and full-text indexing has become one of the main means of information retrieval systems. The inverted index is the core of a full-text retrieval system, and its storage structure has a great impact on the system's performance.

An inverted index, also often called a reverse index, postings file or inverted file, is an indexing method used in full-text search to store the mapping from a word to its locations in a document or a set of documents. It is the most commonly used data structure in document retrieval systems. With an inverted index, the list of documents containing a given word can be obtained quickly from that word.

The inverted entry data of a full-text index consists of a Term ID together with its corresponding set of document numbers and the offsets within each document. It takes the form Term ID -> {<doc ID, {offset}>}, where a Term ID identifies a minimum index unit produced by the tokenizer (word segmenter). In Chinese full-text retrieval a term is generally a character, a word, an English string, a numeric string, or a combination of these forms. Inverted entry data has the following characteristics:

1. The inverted entry data of different Terms varies greatly in length: common words such as " " and " " occur very frequently, while some special strings occur only once.

2. The vocabulary of a full-text index is huge: a single full-text search library can often hold tens of millions of terms, and each term, as a retrieval unit, can take up a large amount of storage space.
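As an illustration of the Term ID -> {<doc ID, {offset}>} form described above, the following is a minimal sketch rather than the patent's implementation. It assumes a whitespace tokenizer, uses the term string itself in place of a numeric Term ID, and records offsets as token positions; `build_inverted_index` is a hypothetical name.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: [offsets]}.

    Assumptions for illustration only: terms are whitespace-separated
    tokens, and an offset is the token's position in the document.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for offset, term in enumerate(text.split()):
            # Each term accumulates one offset list per document.
            index[term].setdefault(doc_id, []).append(offset)
    return dict(index)


docs = {1: "the cat saw the dog", 2: "the bird"}
index = build_inverted_index(docs)
```

In this toy corpus the frequent term "the" already has a longer entry than "bird", which is the length skew noted in characteristic 1.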

Inverted entry data for a full-text index is generally stored in one of two ways. First, empirical data is used to divide a known dictionary into high, medium and low grades by word frequency, and each grade uses a different data block size: blocks storing high-frequency words are the largest, and blocks for the lowest-frequency words are the smallest. The benefit is that little disk space is wasted and read efficiency is assured. The drawback is that once the real data departs from the empirical data, or new vocabulary appears and word frequencies change substantially, space allocated to high-frequency words is wasted and the block chains of low-frequency words become too long, which lowers read efficiency. Second, the data is divided into large units, for example one unit per 1 GB of data. Within each unit all inverted entry data is sorted by term number, document number and offset (Term ID, Doc ID, Offset), and before indexing finishes all data units are merged into one final unit. The advantage is that no space is wasted. The drawback is that the final sort needs free disk space equal to the size of the data, so disk space utilization is only 50%, and the merge takes a long time.

Summary of the invention

The problem to be solved by the present invention is to provide a method for efficiently storing an index in a full-text retrieval system that makes full use of disk space.

To solve the above technical problem, the technical solution adopted by the present invention is a method for efficiently storing an index in a full-text retrieval system, comprising:

1) comparing the length of the index unit data with a preset threshold K;

2) if the length of the index unit data is less than K, storing all of the index unit data in a B-tree;

3) if the length of the index unit data equals K, storing the index unit data, from the beginning up to K, in an index unit data block;

4) if the length is greater than K, comparing the length of the index unit data with n*K (n = 2, 3, ...) and storing it as follows:

a) if the length of the index unit data is greater than (n-1)*K and less than n*K, storing the index unit data from the beginning up to (n-1)*K in index unit data blocks and storing the remainder in the B-tree;

b) if the length of the index unit data equals n*K, storing all of the index unit data, in order, in index unit data blocks.
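The four cases above reduce to one rule: whole multiples of K go to data blocks and the remainder, which is always shorter than K, goes to the B-tree. A minimal sketch of that rule follows; the function name `split_index_unit` and the byte-count interface are assumptions made for illustration, not part of the patent.

```python
def split_index_unit(length, k):
    """Return (bytes_for_blocks, bytes_for_btree) for one index unit.

    `length` is the index unit data length and `k` the preset
    threshold K; both are assumed to be in the same unit (bytes).
    """
    if k <= 0:
        raise ValueError("threshold K must be positive")
    if length < 0:
        raise ValueError("length must be non-negative")
    full_blocks = length // k           # number of complete K-sized blocks
    to_blocks = full_blocks * k         # stored in index unit data blocks
    to_btree = length - to_blocks       # remainder (< K) stored in the B-tree
    return to_blocks, to_btree
```

For example, with K = 64 a unit of length 150 places 128 in two blocks and 22 in the B-tree, while a unit of length 10 goes entirely to the B-tree.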

Further, the B-tree can store the index unit data of at least two index units, each being either index unit data shorter than K or the remainder left over after index unit data longer than K has been stored in data blocks.

Further, the B-tree is a variant of the B+ tree, namely a B+ tree from which the pointers to sibling nodes held in the leaf nodes have been removed.

According to another aspect, the present invention also provides a storage device for the method of efficiently storing an index, comprising:

a storage data block unit, for storing the index unit data in integer multiples of a fixed length;

a B-tree storage page unit, for storing the remainder of the index unit data that is shorter than K.

A method for storing appended full-text index data using the storage device provided by the present invention comprises:

1) calculating the sum of the length of the index unit data in the original B-tree and the length of the appended index unit data;

2) if the sum is greater than n*K and less than (n+1)*K (n = 1, 2, 3, ...), taking out the part stored in the B-tree, storing the index unit data from the part stored in the B-tree up to n*K in index unit data blocks, and storing the remaining index unit data in the B-tree;

3) if the sum equals n*K, taking out the part stored in the B-tree and storing the index unit data from the part stored in the B-tree up to n*K in index unit data blocks;

4) if the sum is less than K, storing the appended part, in order, in the original B-tree storage area.
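The append steps above can be sketched as a pure function over byte strings. This is an illustrative sketch under stated assumptions: `append_index_unit` is a hypothetical name, the B-tree fragment is passed in as `bytes`, and the caller is assumed to write the returned blocks to the block file and replace (or delete, if empty) the B-tree entry.

```python
def append_index_unit(btree_fragment, appended, k):
    """Return (new_blocks, new_fragment) after appending data.

    btree_fragment: bytes currently stored in the B-tree (length < K)
    appended:       newly appended index unit data
    k:              the preset threshold K
    """
    if k <= 0:
        raise ValueError("threshold K must be positive")
    if len(btree_fragment) >= k:
        raise ValueError("B-tree fragment must be shorter than K")
    # Step 1: the sum of the stored fragment and the appended data.
    combined = btree_fragment + appended
    # Steps 2 and 3: everything up to n*K moves to fixed-length blocks.
    cut = (len(combined) // k) * k
    new_blocks = [combined[i:i + k] for i in range(0, cut, k)]
    # Step 2 remainder, or step 4 when the sum is still below K.
    new_fragment = combined[cut:]
    return new_blocks, new_fragment
```

An empty `new_fragment` corresponds to the case where the sum equals n*K and nothing needs to remain in the B-tree.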

With the above technical solution, the storage efficiency of a full-text index of inverted entry data is effectively improved, waste of disk space is reduced, and the data read rate is effectively increased. A copy-on-write (Copy On Write) mechanism is also easy to implement, which in turn improves data safety and the concurrency of indexing and reading data.

Brief description of the drawings

Fig. 1 is a schematic flowchart of the method of the present invention for efficiently storing an index and appended data.

Fig. 2 is a schematic diagram of storage in an example of the present invention.

Fig. 3 is a schematic diagram of implementing the copy-on-write mechanism in an example of the present invention.

Detailed description

In one application scenario of the present invention, full-text index inverted entry data is built from the full text, represented as Term ID (term number) -> {<doc ID (document number), {offset}>}. In this example, Key-Value data is stored using a B-tree combined with a block file: the Key is the term number and the Value is the inverted entry data, which is variable-length. While the index is being created, data is continually appended to the Value. When its length exceeds a preset threshold n*K (n = 1, 2, 3, ...), the first n*K of the Value is extracted and stored in block units, and the remaining index data, shorter than K, is stored in the B-tree. The block units hold the inverted entry data of the corresponding Term ID as fixed-length data, and the data blocks form a chain with the value of the corresponding key in the B-tree. The block file thus holds the whole-block part of the inverted entry data, and the B-tree holds its fragments.

The threshold K is set with reference to the performance of the device that stores the inverted entry data. It is generally a minimum storage unit of the computer, such as 32k or 64k, which is convenient for storing data on a disk array.

As shown in Fig. 2, suppose the full text contains three minimum index units delimited by the tokenizer, numbered Term ID(1), Term ID(2) and Term ID(3). Term ID(1) is a string that occurs only a few times in the text. When its inverted entry index is stored, as shown at 21 in the figure, the data length of the inverted entry index of Term ID(1) does not exceed K, i.e. Term ID(1)-value < K, so its index data is stored in a B-tree storage page.

Term ID(2) is another string that occurs more often in the text. When its inverted entry index is stored, as shown at 22 in the figure, the data length of the inverted entry index of Term ID(2) equals a multiple of the preset threshold K, i.e. Term ID(2)-value = nK (n = 1, 2, 3, ...), so its index data is stored in the storage block units.

Term ID(3) is another string in the full-text index that occurs often in the text. When its inverted entry index is stored, as shown at 23 in the figure, the data length of the inverted entry index of Term ID(3) exceeds a multiple of the preset threshold K, i.e. (n+1)*K > Term ID(3)-value > n*K. The first n*K of the Value is extracted and stored in the storage block units, the remainder is stored in the B-tree, and the data blocks form a chain with the value of the corresponding key in the B-tree.

In practice, data is often appended to inverted entry data, and the storage method again depends on the resulting data length. Take Term ID(1) from the example: a small amount of data is appended to Term ID(1), and the length of its index data in the B-tree storage page is added to the length of the appended data. If after the addition Term ID(1)-value = nK, the index data is taken out of the original B-tree storage page, the index data from the B-tree storage page and the appended data are stored in order in data blocks, and the B-tree entry whose key is this Term ID is deleted. If after the addition (n+1)K > Term ID(1)-value > nK (n = 1, 2, 3, ...), the index data is taken out of the original B-tree storage page, the original index data together with as much of the appended data as brings the total to nK is stored in order in storage blocks, and the remaining appended data is written back under the same Term ID in the B-tree. If the combined length of the appended index unit data and the index unit data already stored in the B-tree is still less than K, the appended data is stored in order in the B-tree storage page.

According to the long-tail effect of big data, most terms in a full-text search inverted index have very little inverted entry data, not enough to fill one block file unit. Because B-tree pages hold variable-length Key-Value data, one page can hold the fragmentary inverted entry data of several Keys. The default disk utilization of the B-tree is 75%, so the total disk utilization is (B-tree file size × 75% + block file size) / (B-tree file size + block file size). When the block file is large, total disk utilization approaches 100% (99.5% on test data of 10 million news-type documents).
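The utilization formula above can be checked numerically with a short sketch. The function name `total_disk_utilization` is hypothetical, and the file sizes used in the example are illustrative values chosen for the arithmetic, not figures from the patent's test data.

```python
def total_disk_utilization(btree_size, block_size, btree_fill=0.75):
    """(B-tree size * fill + block file size) / (B-tree size + block file size).

    btree_fill defaults to the 75% B-tree utilization stated in the text;
    the block file is treated as fully used.
    """
    total = btree_size + block_size
    if btree_size < 0 or block_size < 0 or total == 0:
        raise ValueError("sizes must be non-negative and not both zero")
    return (btree_size * btree_fill + block_size) / total


# Hypothetical sizes: the B-tree file is 2% of the total on-disk data.
example = total_disk_utilization(btree_size=2, block_size=98)
```

With only a B-tree file the result is the 75% floor; as the block file's share grows, the result approaches 1.0.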

Both the pages of the B-tree index and the block units of the block file are fixed-length, which makes direct addressing and the use of cache algorithms convenient; here the B-tree page size = 4 × block unit size. In a real project, the memory available for the cache, disk read/write efficiency and other factors can be weighed to finally determine the B-tree page size and the size of each block unit of the block file. The block file can be reorganized when the system is idle, so that the inverted entry data of the same Term is placed in contiguous disk space.
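Because pages and block units are fixed-length, the position of any unit follows from its number alone. The sketch below illustrates this; the 64 KiB block size and the zero-based numbering are assumptions, while the factor of 4 comes from the text.

```python
BLOCK_SIZE = 64 * 1024        # assumed value of K, in bytes
PAGE_SIZE = 4 * BLOCK_SIZE    # B-tree page size = 4 x block unit size


def block_offset(block_no, block_size=BLOCK_SIZE):
    """Byte offset of a zero-based block number within the block file."""
    if block_no < 0:
        raise ValueError("block number must be non-negative")
    return block_no * block_size


def page_offset(page_no, page_size=PAGE_SIZE):
    """Byte offset of a zero-based page number within the B-tree file."""
    if page_no < 0:
        raise ValueError("page number must be non-negative")
    return page_no * page_size
```

No per-unit length table is needed, which also makes a fixed-size cache keyed by block or page number straightforward.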

For a term whose inverted entry data is smaller than one block unit, each read costs only a B-tree read, and the number of disk reads equals the B-tree depth L. For a term whose inverted entry data exceeds one block unit, the cost of each complete read is one B-tree read plus one read of block data. Logical operations such as AND and OR are common in full-text indexing, so inverted entry data is generally read randomly; the fixed length of each block in the block file can then be fully exploited, which effectively improves data read efficiency.

When a full-text index is built in real time, inverted entry data needs to be appended. Because the B-tree is inherently hierarchical, a copy-on-write mechanism is easy to implement. Since the B-tree is a tree structure, each page is first copied before it is modified, and a copy is also created for every storage page that directly or indirectly references that page; all modifications are made on the copies. This yields two versions, one before and one after the modification: read operations use the pre-modification version and write operations use the post-modification version, so the two do not interfere with each other and concurrency improves. Block storage works the same way: data is only ever appended at the end of the file, so reads and writes can likewise be separated on access without interfering, which improves concurrency. When the data is committed successfully, the write version becomes the read version; if the commit fails, the write version is discarded, which guarantees data integrity. As shown in Fig. 3, the separate read and write versions allow data to be appended in real time and improve the concurrency of searching while the index is being built. To make the COW mechanism easy to implement, the B-tree uses a variant of the B+ tree. It differs from the B+ tree in that the pointers to sibling nodes held in the leaf nodes are removed; otherwise a modification could require copying the whole B-tree.
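The copy-on-write behaviour described above can be illustrated with a path-copying sketch. This is a simplified stand-in, not the patent's B-tree: the `Leaf` and `Inner` classes, `cow_put` and `get` are hypothetical names, and node splitting, paging and commit handling are omitted.

```python
from bisect import bisect_right


class Leaf:
    def __init__(self, items):
        self.items = dict(items)          # term_id -> fragment bytes


class Inner:
    def __init__(self, seps, children):
        self.seps = list(seps)            # sorted separator keys
        self.children = list(children)    # len(children) == len(seps) + 1


def cow_put(node, key, value):
    """Return a new root; only nodes on the root-to-leaf path are copied."""
    if isinstance(node, Leaf):
        new_items = dict(node.items)
        new_items[key] = value
        return Leaf(new_items)
    i = bisect_right(node.seps, key)
    new_children = list(node.children)    # untouched subtrees stay shared
    new_children[i] = cow_put(node.children[i], key, value)
    return Inner(node.seps, new_children)


def get(node, key):
    while isinstance(node, Inner):
        node = node.children[bisect_right(node.seps, key)]
    return node.items.get(key)


left, right = Leaf({1: b"a"}), Leaf({5: b"b"})
read_root = Inner([5], [left, right])           # version used by readers
write_root = cow_put(read_root, 1, b"a+")       # version used by the writer
```

Readers that hold `read_root` still see the old value, and the untouched leaf is shared between both versions. That sharing is only safe because leaves carry no sibling pointers: with sibling links, copying one leaf would force its neighbours to be copied as well.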

An embodiment of the present invention has been described in detail above, but this content is only a preferred embodiment of the invention and shall not be regarded as limiting its scope of implementation. All equivalent changes and improvements made within the scope of the present application shall remain within the scope covered by this patent.

Claims

1. A method for efficiently storing an inverted index in a full-text retrieval system, comprising:

1) comparing the length of the index unit data with a preset threshold K; if the length of the index unit data is less than K, storing all of the index unit data in a B-tree;

2. the method for efficient preservation index according to claim 1 is characterized in that: described B tree is a kind of distortion of B+ tree, and a kind of B+ that is deformed into the pointer that removes the sensing brotgher of node of preserving in the leaf node of described B+ tree sets.

3. memory storage of described efficient preservation indexing means according to claim 1 comprises:

4. The storage device according to claim 4, characterized in that: the B-tree storage page unit can store the index unit data of at least two index units, each being either index unit data shorter than K or the remainder left over after index unit data longer than K has been stored in data blocks.

5. The storage device according to claim 4, characterized in that: the B-tree is a variant of the B+ tree, namely a B+ tree from which the pointers to sibling nodes held in the leaf nodes have been removed.

6. A method for appending and storing data in the storage device for the method for efficiently storing an index according to claim 4, comprising:

1) calculating the sum of the length of the index unit data in the original B-tree and the length of the appended index unit data;