CN108932738B

CN108932738B - Bit slice index compression method based on dictionary

Info

Publication number: CN108932738B
Application number: CN201810716805.4A
Authority: CN
Inventors: 刘晓光; 刘欣瑀; 王刚; 张瞾华
Original assignee: Nankai University
Current assignee: Nankai University
Priority date: 2018-07-03
Filing date: 2018-07-03
Publication date: 2022-08-16
Anticipated expiration: 2038-07-03
Also published as: CN108932738A

Abstract

A dictionary-based bit slice index compression method and its optimization strategies, suitable for 0/1 bit slice index structures represented by BitFunnel. The method comprises the following steps: 1. Document rearrangement: documents are rearranged at block-size intervals according to the density of 1 bits in the index columns, in order to increase the degree of repetition between blocks. 2. Partial compression: only rows that queries access with low frequency are selected for compression. 3. Dictionary compression: the index is divided into blocks, and the all-1 block and the most frequently occurring blocks are stored in a dictionary. Blocks that appear in the dictionary are replaced with block numbers of fewer bits; blocks that do not appear in the dictionary are replaced with the number of the most similar block in the dictionary (this can introduce false positives into query results, but guarantees that no true result is missed). The method is suited to bit slice index compression in the field of information retrieval. It noticeably improves the compression of the index without causing a large decompression delay, which is significant for optimizing search engine systems.

Description

Bit slice index compression method based on dictionary

[ technical field ]

The invention belongs to the technical field of compression of a bit slice index represented by BitFunnel in a search engine, and particularly relates to a dictionary-based compression method for the bit slice index and optimization strategies for partial compression, document rearrangement and the like. The invention is also suitable for compressing bit slice index structures of other similar bitmaps in the field of information retrieval.

[ background of the invention ]

In an age of rapid scientific and technological development, the Internet has become an essential part of people's lives, and search engines have become the most important Internet portal. A search engine uses tools such as web crawlers to periodically fetch web page content from the Internet, and provides retrieval services to users after reorganizing and storing it. A modern commercial search engine system mainly comprises three parts: a web server, an index server and a document server. The web server receives the user's query and submits it to the index server. After receiving the query, the index server performs operations such as access, caching and intersection on the indexes related to the query terms to obtain the numbers of the documents containing the query terms, then ranks these documents by query relevance and returns the numbers of the K highest-scoring documents (topK) to the web server. The web server then sends the document numbers and the query to the document server, which generates snippets containing the query terms and returns them to the user.

In existing search engine systems, compression, decompression and intersection of indexes are among the most important parts and directly influence query efficiency and user experience. Compressing the index saves storage space and reduces the cost of storage devices. During query processing, because a file of the same size holds more information when compressed, index compression also increases disk-to-memory throughput and the cache hit rate, thereby improving query efficiency. An index compression scheme has two main evaluation metrics: the compression rate, which directly reflects the compression effect, and the decompression time, which affects query efficiency. A search engine selects a compression scheme that balances compression rate and decompression time according to its actual requirements.

Microsoft proposed a new bit slice index structure, BitFunnel, in 2017. It is a bitmap-like bit slice index structure based on Bloom filters. Each column represents a document and each row represents the mapping of one or more words onto the documents: a bit is 1 if a word mapped to that row appears in the document, and 0 otherwise. Because it applies Bloom filters, the query process may return results that are not true matches, i.e., false positives. Because the comparisons and branches of the traditional inverted index are replaced by bitwise operations, the probability of branch misprediction is reduced and query efficiency is high, so the structure has great potential for future applications. The specific index structure is shown in fig. 2. However, the bitmap-like structure results in a storage overhead much larger than that of an inverted index.
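The row and column layout described above can be illustrated with a toy sketch. It is not BitFunnel's implementation: the hash-to-row mapping `term_rows`, the constants `NUM_ROWS` and `HASHES_PER_TERM`, and the use of Python integers as bit rows (bit j stands for document j) are all assumptions made for illustration only.

```python
import hashlib

NUM_ROWS = 8          # hypothetical number of rows in the toy matrix
HASHES_PER_TERM = 2   # hypothetical number of rows each term maps to


def term_rows(term):
    """Map a term to a few row numbers (toy stand-in for the Bloom-filter hashes)."""
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    return sorted({digest[i] % NUM_ROWS for i in range(HASHES_PER_TERM)})


def build_index(docs):
    """Return rows, where bit j of rows[r] is 1 if document j set row r."""
    rows = [0] * NUM_ROWS
    for j, text in enumerate(docs):
        for term in set(text.split()):
            for r in term_rows(term):
                rows[r] |= 1 << j
    return rows


def query(rows, terms, num_docs):
    """AND the rows of all query terms; the surviving bits are candidate documents."""
    result = (1 << num_docs) - 1
    for term in terms:
        for r in term_rows(term):
            result &= rows[r]
    return [j for j in range(num_docs) if (result >> j) & 1]


docs = ["bit slice index", "dictionary compression", "bit funnel index compression"]
rows = build_index(docs)
candidates = query(rows, ["index", "bit"], len(docs))
```

Every document that really contains all query terms is guaranteed to be among `candidates`; additional documents may appear because several terms can share a row, which is the false positive behavior mentioned in the text.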

For the above application scenario, there are currently two kinds of solutions: arithmetic coding and bitmap compression. Existing arithmetic coding compression schemes designed for the inverted index focus on integer compression and expect the sequence to be compressed to be monotonic or to consist of small values, so they are not suitable for the BitFunnel index structure. Traditional bitmap compression methods mainly adopt run-length coding, which requires long runs of consecutive 0s or consecutive 1s in the data, and the BitFunnel structure can hardly guarantee this condition.

[ summary of the invention ]

In order to solve the above problems, the invention provides a dictionary-based bit slice index compression method. Unlike traditional compression methods based on run-length coding, it achieves a good compression effect without requiring long runs of consecutive 0s or 1s, and is therefore better suited to the BitFunnel index. Specifically, blocks with a high degree of repetition in the BitFunnel index are replaced by numbers with fewer bits to achieve compression, and the dictionary stores the mapping from repeated blocks to their numbers.

Referring to fig. 1, the dictionary-based bit slice index compression method, together with its optimization strategies, mainly includes:

step 1 (S1), rearranging the documents at block-size intervals according to the density of 1 bits in the index columns;

step 2 (S2), selecting, according to a custom threshold based on the characteristics of the query data set, a subset of rows that queries access with low frequency for compression;

step 3 (S3), performing dictionary compression on the rows selected for compression in S2.

S1, a strategy for reordering documents at block size intervals according to the density of bit 1 in the index column. Specifically, the method comprises the following steps:

assuming that the number of index columns is d, the block size is k bits;

step 1.1, counting the density of 1 in each index column, and arranging the columns according to the density descending order;

step 1.2, initializing an empty index; placing the $d/k$ columns with the highest density of 1 bits into columns $1, k+1, 2k+1, \dots, (d/k-1)k+1$ of the index; placing the $d/k$ columns with the next-highest density into columns $2, k+2, 2k+2, \dots, (d/k-1)k+2$; and so on, until finally the $d/k$ columns with the lowest density of 1 bits are placed into columns $k, 2k, 3k, \dots, d$.

S2, selecting a subset of rows that queries access with low frequency for compression, according to a custom threshold based on the characteristics of the query data set. Specifically, the method comprises the following steps:

step 2.1, customizing a threshold α, in the range [0, 1], for the access frequency of high-frequency access rows in the query process;

step 2.2, sorting the rows in the matrix from high to low by query access frequency, and selecting the smallest number q of most frequently accessed rows as the high-frequency access rows such that

$$\frac{1}{N}\sum_{i=1}^{q} f(i) \ge \alpha$$

where N is the sum of the access frequencies of all rows and f(i) denotes the access frequency of the i-th row in the sorted order; according to a part of the query data set and the custom threshold, the rows are thus classified into high-frequency access rows and low-frequency access rows;

step 2.3, creating a row compression flag file that records whether each row is a low-frequency access row, i.e., whether the row needs to be compressed.

S3, performing dictionary compression on the rows selected for compression in S2, specifically:

step 3.1, dividing the rearranged BitFunnel bit slice index into blocks row by row, establishing a dictionary that stores the all-1 block and the blocks that occur with high frequency in the rows selected for compression in S2, and recording the mapping between each high-frequency block and its block number;

step 3.2, traversing the index and replacing the blocks that appear in the dictionary with block numbers of fewer bits; blocks that are not in the dictionary are replaced with the number of the most similar block in the dictionary, which may introduce false positives into query results but guarantees that no true result is lost.

For convenience of description, we assume a block size of k bits, a block number size of b bits, and b < k.

After steps S1 and S2 are completed, each row selected for compression is divided into aligned, fixed-length blocks of k bits, and repeated blocks are mined in units of these k-bit blocks. Starting from the first row, the k-bit blocks are scanned one by one to find repeated blocks and record their occurrence frequencies, until the last row has been processed. The mined blocks are then sorted in descending order of frequency, and the $2^b - 1$ most frequent blocks together with one all-1 block are placed into the dictionary. The matrix is then traversed again:

A. If a block exists in the dictionary, the block number of b bits is substituted.

B. If a block x does not exist in the dictionary, the dictionary is searched for the most similar block y, i.e., the block that contains x (x AND y = x) and has the fewest 1 bits; since the all-1 block is stored in the dictionary, such a y can always be found. The number of y is used to replace x.
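Rules A and B, together with the dictionary construction that precedes them, can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: blocks are held as Python integers, the function names are hypothetical, and the all-1 block is excluded from the frequency count so that it is not stored twice (the text does not say how that case is handled).

```python
from collections import Counter


def build_dictionary(rows, k, b):
    """Return up to 2**b blocks: the all-1 block plus the 2**b - 1 most frequent blocks."""
    all_ones = (1 << k) - 1
    counts = Counter(block for row in rows for block in row)
    counts.pop(all_ones, None)  # assumption: the all-1 block is stored once, separately
    frequent = [blk for blk, _ in counts.most_common(2 ** b - 1)]
    return [all_ones] + frequent


def encode_block(x, dictionary, index_of):
    """Rule A: exact dictionary hit. Rule B: superset of x with the fewest 1 bits."""
    if x in index_of:
        return index_of[x]
    supersets = (y for y in dictionary if x & y == x)  # never empty: all-1 qualifies
    best = min(supersets, key=lambda y: bin(y).count("1"))
    return index_of[best]


def compress(rows, k, b):
    """Return (dictionary, codes), where codes mirrors rows with b-bit block numbers."""
    dictionary = build_dictionary(rows, k, b)
    index_of = {blk: i for i, blk in enumerate(dictionary)}
    codes = [[encode_block(x, dictionary, index_of) for x in row] for row in rows]
    return dictionary, codes


# Toy input with k = 4 and b = 2; each row is a list of 4-bit blocks.
rows = [
    [0b1110, 0b0010, 0b1010],
    [0b1110, 0b1010, 0b1100],
    [0b1100, 0b1110, 0b0110],
]
dictionary, codes = compress(rows, k=4, b=2)
```

In this toy input the block 0010 is not frequent enough to enter the dictionary, so it is replaced by the number of 1010, the dictionary block that contains it with the fewest 1 bits. Every substituted block is a superset of the original, which is why bits can only be added and no true result is lost.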

Dictionary-based compression exploits the sparsity of the matrix and the uneven distribution of 1 bits. The overall density of 1 bits is low, blocks with a high density of 1 bits are few, the proportion of repeated blocks is high, and approximate blocks differ little from the blocks they replace; these characteristics allow the dictionary-based compression of the invention to avoid a large increase in the false positive rate.

The invention has the advantages and beneficial effects that:

Document rearrangement increases the block repetition degree of the BitFunnel index; partial compression keeps the false positive rate introduced by compression at about 10%; and dictionary-based compression of the bitmap index improves the compression effect on the BitFunnel index. The method can be widely applied in the fields of search engine performance optimization and index compression.

[ description of the drawings ]

FIG. 1 is a flow chart of a dictionary-based bit slice index compression method of the present invention;

FIG. 2 is a schematic diagram of the BitFunnel index structure;

FIG. 3 is an exemplary diagram of a process and results for reordering documents at block size intervals based on bit 1 density in the index column;

FIG. 4 is an exemplary diagram of a dictionary compression process and results;

[ detailed description of embodiments ]

For the purpose of promoting an understanding of the above-described objects, features and advantages of the invention, reference will now be made in detail to the present embodiments of the invention illustrated in the accompanying drawings.

Example 1 document rearrangement

Since the mapping matrix of BitFunnel is determined by the hash functions (Bloom filter) and the document contents, we can change the density of 1 bits in the matrix by adjusting the word-to-row mapping and the number of rows in the matrix. Although the repetition degree of the 0/1 matrix is high, the distribution of 1 bits in the matrix is very uneven owing to characteristics such as the clustering of documents (BitFunnel only guarantees the global density of 1 bits). We therefore propose a density-based document rearrangement scheme that increases the similarity between blocks in the matrix and thereby improves the compression effect. The higher the block repetition, the more favorable it is for dictionary compression, in two respects: the probability that each block appears in the dictionary becomes higher, and the proportion of extra 1 bits introduced when blocks are replaced by dictionary blocks decreases, so that the false positive rate caused by dictionary compression is greatly reduced.

For convenience of description, we assume that the mapping matrix has d columns and that each block contains k bits. Fig. 3 describes the document rearrangement process by taking d = 16 and k = 4 as an example. The density of 1 bits in each column is first counted and the columns are sorted from highest to lowest density; for ease of illustration, each group of d/k columns is marked with a different background. The four columns with the highest density, O, J, M and G, are rearranged to columns 1, 5, 9 and 13 (hatched with diagonal lines), and so on, until the four columns with the lowest density, K, L, P and F, are rearranged to columns 4, 8, 12 and 16 (not hatched). After rearrangement by column density, the density within each group of k adjacent columns decreases from its 1st column to its k-th column, which increases the repetition among blocks and is therefore more favorable to dictionary compression.
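The placement rule of this example can be written as a short sketch. It assumes that d is a multiple of k and that ties in density keep the original column order; neither point is specified in the text, and the function name `rearrange_columns` is hypothetical. Positions are 0-based in the code, whereas the text counts columns from 1.

```python
def rearrange_columns(densities, k):
    """Return new_order, where new_order[pos] is the original column placed at pos.

    Assumes len(densities) is a multiple of the block size k.
    """
    d = len(densities)
    if d % k != 0:
        raise ValueError("this sketch assumes d is a multiple of k")
    group = d // k  # size of each density group (also the number of blocks per row)
    # Columns sorted by density of 1 bits, densest first; ties keep original order.
    by_density = sorted(range(d), key=lambda c: -densities[c])
    new_order = [None] * d
    for s, col in enumerate(by_density):
        g, p = divmod(s, group)      # g: density group, p: rank inside the group
        new_order[p * k + g] = col   # group g fills offset g of every block
    return new_order


# d = 16, k = 4, with column 0 the densest and column 15 the sparsest.
densities = list(range(16, 0, -1))
new_order = rearrange_columns(densities, k=4)
```

With these inputs the four densest columns land at 0-based positions 0, 4, 8 and 12 (columns 1, 5, 9 and 13 in the text's numbering), and the four sparsest at positions 3, 7, 11 and 15.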

EXAMPLE 2 partial compression

By analyzing actual data sets, we find that during query processing only a few rows are accessed with high frequency, while most rows are accessed with low frequency. Compressing high-frequency access rows leads to a higher false positive rate, whereas compressing low-frequency access rows has less impact on it. Based on this finding, we adopt a partial compression scheme: low-frequency access rows are compressed, while high-frequency access rows, which affect the false positive rate more, are left uncompressed.

The rows in the matrix are sorted from high to low by access frequency, and the smallest number q of most frequently accessed rows is selected as the high-frequency access rows such that

$$\frac{1}{N}\sum_{i=1}^{q} f(i) \ge \alpha$$

where N is the sum of the access frequencies of all rows and f(i) denotes the access frequency of the i-th row in the sorted order. According to a part of the query data set and the custom threshold, the rows are thus classified into high-frequency access rows and low-frequency access rows. Only low-frequency access rows are compressed.

For convenience of description, we assume that the matrix has 100 rows and that the access frequency of each row equals its row number, i.e., the row numbered i has access frequency i. We first sort the rows from high to low by frequency, so that the access frequencies are 100, 99, 98, …, 1, and N = 100 + 99 + 98 + … + 1 = 5050. If we define the threshold as α = 0.9, then the smallest set of most frequently accessed rows that satisfies the condition consists of 69 rows (row numbers 100 to 32), since

$$\frac{100 + 99 + \dots + 32}{5050} = \frac{4554}{5050} \approx 0.902 \ge 0.9,$$

and these rows are the high-frequency access rows. The remaining 31 rows (row numbers 31 to 1) are compressed. Owing to the characteristics of actual data sets, in which only a few rows are accessed with high frequency and most rows are accessed with low frequency during real queries, partial compression can achieve a good compression effect while keeping the false positive rate under control.
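The selection in this example can be reproduced with a short sketch. The function name `split_rows` and the representation of access frequencies as a dictionary from row number to count are assumptions made for illustration.

```python
def split_rows(freq, alpha):
    """Return (high, low): rows left uncompressed and rows to compress.

    freq maps row number -> access frequency; alpha is the threshold in [0, 1].
    high is the smallest prefix of the frequency-sorted rows whose accesses
    cover at least a fraction alpha of all accesses.
    """
    total = sum(freq.values())
    ordered = sorted(freq, key=lambda r: -freq[r])  # most frequently accessed first
    high, covered = [], 0
    for r in ordered:
        if covered >= alpha * total:
            break
        high.append(r)
        covered += freq[r]
    low = ordered[len(high):]
    return high, low


# The example from the text: 100 rows, row i accessed i times, alpha = 0.9.
freq = {i: i for i in range(1, 101)}
high, low = split_rows(freq, 0.9)
```

Here `high` holds rows 100 down to 32 (69 rows, covering 4554 of 5050 accesses) and `low` holds rows 31 down to 1; with only 68 rows the coverage would be 4522/5050, which falls just short of 0.9.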

To select the rows, the whole query data set is randomly divided into two equal parts. One part is used as a training set: whether each row is compressed is determined by the row access frequencies f(i) obtained from the training set and the threshold α. The other half is used as a test set for measuring query time, false positive rate and other performance metrics. Thus we do not simply choose fixed parameters for compression, but learn suitable parameters from the query history, exploiting the fact that query requests from the same source are similar, so that the resulting compression performs better on future query requests. The specific value of the custom threshold α depends on the original false positive rate of the index and on the block repetition degree.

Example 3 dictionary compression

Fig. 4 describes the dictionary compression process and the compressed result, taking k = 4 and b = 2 as an example. First, the occurrence frequency of each block in the original index is recorded, and the $2^2 - 1 = 3$ most frequent blocks together with one all-1 block are selected to generate the dictionary (bottom left of Fig. 4). The first block of the fourth row, 1100, appears in the dictionary and is therefore replaced with its number, 10. The second block of the first row, 0010 (inside the dashed box), does not appear in the dictionary; its third bit is 1. The dictionary blocks 1111, 1110 and 1010 each have a 1 in the third bit, so each of them contains 0010 and satisfies the approximation condition; among them 1010, which has the fewest 1 bits, is selected. This strategy ensures that no query result is missed and keeps the possible increase in false positives as small as possible. The hatched portions in Fig. 4 mark blocks that are not in the dictionary and are replaced with an approximate block. The compressed data consists of two parts: the dictionary and the compressed index.

Assuming the matrix has m rows and d columns, the compression ratio of the dictionary compression method is:

$$\text{compression ratio} = \frac{k \cdot 2^{b} + mdb/k}{md}$$

Here the dictionary size is $k \times 2^{b}$ bits and the size of the mapping matrix after compression is $mdb/k$ bits. Since the dictionary is small and can be ignored in practice, the compression ratio is approximately $b/k$.
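As a worked instance of the dictionary size $k \times 2^b$ and the compressed matrix size $mdb/k$ quoted above, take the setting reported for the experiments in Example 4 (k = 32, b = 16) and consider only the rows that are actually compressed. The row count m and column count d are left symbolic, and the dictionary term is neglected in the last line, which is the same approximation the text makes.

```latex
\begin{align*}
\text{dictionary size} &= k \cdot 2^{b} = 32 \cdot 2^{16} = 2{,}097{,}152 \text{ bits} = 256 \text{ KiB},\\
\text{compressed rows} &= \frac{mdb}{k} = \frac{16}{32}\,md = 0.5\,md \text{ bits},\\
\frac{\text{compressed size}}{\text{original size}} &\approx \frac{b}{k} = 0.5 .
\end{align*}
```

Because partial compression leaves the high-frequency access rows uncompressed, the ratio for a whole index would be expected to lie between this value and 1.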

During intersection, the original block is simply fetched from the dictionary by using the b-bit block number as an index, and is then used in the AND operation. Because the dictionary is generally small and is accessed frequently during intersection (decompression), most of the dictionary data can stay resident in the CPU cache, which reduces memory access overhead and improves intersection performance.
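The lookup-then-AND step can be sketched as follows. The dictionary contents follow the Fig. 4 example (1111, 1110, 1100, 1010, with 1100 at number 10 in binary); the numbers given to the other three blocks, the function name `intersect`, and the two sample rows are assumptions made for illustration.

```python
def intersect(compressed_rows, plain_rows, dictionary, num_blocks, k):
    """AND the rows of a query block by block.

    compressed_rows: rows stored as lists of block numbers (dictionary indexes).
    plain_rows: uncompressed rows stored as lists of k-bit blocks.
    """
    all_ones = (1 << k) - 1
    result = [all_ones] * num_blocks
    for codes in compressed_rows:
        for j, code in enumerate(codes):
            result[j] &= dictionary[code]  # one table lookup replaces decompression
    for blocks in plain_rows:
        for j, block in enumerate(blocks):
            result[j] &= block
    return result


# Hypothetical numbering: 0 -> 1111, 1 -> 1110, 2 -> 1100, 3 -> 1010.
dictionary = [0b1111, 0b1110, 0b1100, 0b1010]

row_a_original = [0b0010, 0b1100]   # low-frequency row before compression
row_a_codes = [3, 2]                # 0010 was approximated by 1010; 1100 is exact
row_b = [0b1010, 0b1110]            # high-frequency row, left uncompressed

result = intersect([row_a_codes], [row_b], dictionary, num_blocks=2, k=4)
exact = [a & b for a, b in zip(row_a_original, row_b)]
```

In this sample the first result block is 1010 while the exact intersection is 0010: the approximation adds one candidate document (a false positive) but every bit of the exact answer is still set.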

Example 4 results of the experiment

We tested the effect of the dictionary-based bit-slice index compression method of the present invention on the TREC GOV2 data set. Where Pri, Opt are indices generated under the two policies PrivateSharedRank0 and Optimal of BitFunnel. The block size in the experiment is set to 32 bits, and the block number size is set to 16 bits.

The data sets used are as follows:

(1) TREC GOV2, a data set crawled from the .gov domain in 2004 and containing 25 million web pages, is used as the document set for generating the indexes; HTML tags are removed from all document contents, and the indexes are built after the title and body text are extracted;

(2) we used MillionSet as query set 1, containing a total of 60,000 queries from the 2007, 2008 and 2009 TREC Million Query Track;

(3) we used TerabyteSet as query set 2, containing a total of 100,000 queries from the 2006 TREC Terabyte Track.

TABLE 1

Table 1 lists the compression rates of the dictionary-based bit slice index compression method. According to Table 1, the overall compression rates for MillionSet and TerabyteSet in Pri mode are 0.73 and 0.72, respectively; in Opt mode they are 0.71 and 0.70, respectively. The results for MillionSet and TerabyteSet differ only slightly, by 0.01 to 0.02, which shows that the choice of query set has little influence on the compression rate and that the partial compression scheme generalizes across different query data sets.

We also tried treating every 32 bits as an unsigned integer (unsigned int type) and compressing the result with arithmetic coding and bitmap compression methods. We found that Pfloor, VByte and EWAH compression reduce the file size by only -1%, 11.3% and 4.35%, respectively. This is because the distribution of 1 bits in the BitFunnel matrix is irregular and uneven, so the values in the integer sequence are large, which is unfavorable for d-gap and other value transformations; the arithmetic coding compression effect is therefore poor. Bitmap compression schemes rely on run-length coding, which requires long runs of consecutive 0s or 1s, whereas the construction of the BitFunnel index depends entirely on document characteristics and cannot guarantee this condition. In conclusion, traditional compression algorithms compress the BitFunnel index poorly, and the dictionary-based compression scheme adopted by the invention effectively improves the compression rate.

TABLE 2

Table 2 shows the average intersection time per query and the false positive rate under the two strategies. From Table 2 we can see that, for the matrix generated with the Pri strategy, decompression increases the intersection time on the MillionSet and TerabyteSet data sets by 21% and 16%, respectively; for the matrix generated with the Opt strategy, it increases the intersection time by 48% and 37%, respectively. The intersection time increases for two reasons: on the one hand, partial compression leaves some high-frequency access rows uncompressed, so intersection must check whether each row is compressed; on the other hand, compression adds the latency of accessing the dictionary. We can also see that dictionary compression increases the false positive rate by about 7 to 10 percentage points.

As for intersection, because the index needs to be decompressed, the compressed index spends 16% to 48% more time on it. However, thanks to the characteristics of BitFunnel, relevant documents are obtained with simple bitwise operations, so the intersection time is very short and accounts for a very small proportion of the total query time (a complete query includes intersection, topK document computation, snippet generation and so on); an intersection time of about 5 milliseconds has little effect on the user. In addition, compression introduces extra false positives, which we keep within a range of 7 to 10 percentage points by means of the threshold α. Because the search engine ranks the documents obtained by intersection by relevance and returns only the K highest-scoring ones, false-positive documents introduced by compression receive low scores and are quickly filtered out. We therefore expect that in an actual search engine system these false-positive documents will not have a large impact.

Considering compression rate, intersection time and false positive rate together, the compression scheme provided by the invention compresses BitFunnel bit slice indexes well.

The dictionary-based compression method of the present invention is described in detail above, and the principle and the implementation of the present invention are explained by applying specific examples herein, and the above description of the examples is only used to help understanding the method of the present invention and the core idea thereof; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims

1. A bit slice index compression method based on a dictionary is characterized by comprising the following steps:

step 1, rearranging the document according to the column density of bit 1 in the index column by taking the block size as an interval;

assuming that the number of index columns is d, the block size is k bits;

step 1.1, counting the density of bit 1 in each column of the index, and arranging the columns in a descending order according to the density;

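step 1.2, initializing an empty index; placing the $d/k$ columns with the highest density of 1 bits into columns $1, k+1, 2k+1, \dots, (d/k-1)k+1$ of the index; placing the $d/k$ columns with the next-highest density into columns $2, k+2, 2k+2, \dots, (d/k-1)k+2$; and so on, until finally the $d/k$ columns with the lowest density of 1 bits are placed into columns $k, 2k, 3k, \dots, d$;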

step 2, selecting, according to a custom threshold based on the characteristics of the query data set, a subset of rows that queries access with low frequency for compression;

step 2.1, customizing a threshold for the access frequency of high-frequency access rows in the query process;

step 2.2, classifying the rows, according to a part of the query data set and the custom threshold, into high-frequency access rows and low-frequency access rows;

step 2.3, creating a row compression flag file that records whether each row is a low-frequency access row, i.e., whether the row needs to be compressed;

step 3, performing dictionary compression on the rows selected for compression in step 2;

step 3.1, dividing the rearranged BitFunnel bit slice index into blocks row by row, establishing a dictionary that stores the all-1 block and the blocks that occur with high frequency in the rows selected for compression in step 2, and recording the mapping between each high-frequency block and its block number;

step 3.2, traversing the index and replacing the blocks that appear in the dictionary with block numbers of fewer bits; blocks that do not appear in the dictionary are replaced with the number of the most similar block in the dictionary, so that, although false positives may be introduced into query results, no true result is lost.

2. The method of claim 1, wherein the method application domain comprises:

bitmaps, Bloom filters, and Bloom-filter-based bit slice index structures represented by BitFunnel.

3. The method according to claim 1, wherein step 2.1 is to customize the threshold of access frequency of high frequency access rows in the query process:

the threshold separating high-frequency from low-frequency access rows in the query process is set according to the block repetition degree of the BitFunnel index and the original false positive rate of the BitFunnel index.

4. The method according to claim 1, wherein the decompression method for restoring the original data after compression is specifically:

in the decompression process, the original block definition is simply extracted from the dictionary by taking the block number as an index to restore the original data.