CN108932738B - Bit slice index compression method based on dictionary - Google Patents

Publication number
CN108932738B
Authority
CN
China
Prior art keywords
index
dictionary
compression
block
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810716805.4A
Other languages
Chinese (zh)
Other versions
CN108932738A (en)
Inventor
刘晓光
刘欣瑀
王刚
张瞾华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nankai University
Original Assignee
Nankai University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nankai University
Priority to CN201810716805.4A
Publication of CN108932738A
Application granted
Publication of CN108932738B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00 Image coding
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

A dictionary-based bit slice index compression method and its optimization strategies, suitable for 0/1 bit slice index structures such as BitFunnel. The method comprises the following steps: 1. Document rearrangement: documents are rearranged at block-size intervals according to the density of 1 bits in the index columns, in order to increase the degree of repetition between blocks. 2. Partial compression: only a subset of rows, those accessed with low frequency by queries, is selected for compression. 3. Dictionary compression: the index is divided into blocks, and the all-1 block and the blocks that occur with high frequency in the index are stored in a dictionary. Blocks that appear in the dictionary are replaced with block numbers of fewer bits; blocks that do not appear in the dictionary are replaced with the number of the most similar block in the dictionary (this can introduce false positives for a query but guarantees that no true result is missed). The method is suited to bit slice index compression in the field of information retrieval. It markedly improves index compression without causing a large decompression delay, which is significant for the optimization of search engine systems.

Description

Bit slice index compression method based on dictionary
[ technical field ]
The invention belongs to the technical field of compressing the bit slice indexes used in search engines, of which BitFunnel is representative. It relates in particular to a dictionary-based compression method for bit slice indexes and to optimization strategies such as partial compression and document rearrangement. The invention is also suitable for compressing other bitmap-like bit slice index structures in the field of information retrieval.
[ background of the invention ]
With the rapid development of science and technology, the internet has become an essential part of people's lives, and search engines have become the most important gateway to it. A search engine uses tools such as web crawlers to fetch web content from the internet periodically and, after reorganizing and storing it, provides a retrieval service to users. A modern commercial search engine system mainly consists of three parts: a web server, an index server and a document server. The web server receives the user's query and submits it to the index server. On receiving the query, the index server accesses, caches and intersects the indexes related to the query terms to obtain the numbers of the documents that contain them, ranks these documents by relevance to the query, and returns the numbers of the K highest-scoring documents (top-K) to the web server. The web server then sends the document numbers and the query to the document server, which generates summaries containing the query terms and returns them to the user.
In existing search engine systems, index compression, decompression and intersection are among the most important components and directly affect query efficiency and user experience. Compressing the index saves storage space and reduces the cost of storage devices. In addition, because a file of the same size holds more information once compressed, index compression also increases disk-to-memory throughput and the cache hit rate during query processing, which improves query efficiency. An index compression scheme has two main evaluation metrics: the compression ratio, which directly reflects the compression effect, and the decompression time, which determines query efficiency. A search engine selects a compression scheme that balances compression ratio against decompression time according to its actual requirements.
In 2017 Microsoft proposed a new bit slice index structure, BitFunnel. It is a bitmap-like bit slice index based on Bloom filters, in which each column represents a document and each row represents the mapping of one or more words onto the documents: a bit is 1 if the word appears in the document and 0 otherwise. Because it uses Bloom filters, a query may return results other than the true results, that is, false positives. Because the comparisons and branches of a traditional inverted index are replaced by bitwise operations, branch mispredictions become less likely and query efficiency is high, so the structure has great potential for future application. The index structure is shown in Fig. 2. However, a bitmap-like structure incurs a storage overhead much larger than that of an inverted index.
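As a rough illustration of the bit-sliced layout described above, the following sketch stores each row as a Python integer whose bit j corresponds to document j and answers a conjunctive query by ANDing the rows of the query terms. The hash-based `rows_for_term` mapping, the row count and the toy documents are assumptions made for this example; it is a simplified model, not BitFunnel's actual implementation.

```python
import hashlib

NUM_ROWS = 8          # assumed toy number of rows
HASHES_PER_TERM = 2   # assumed number of hash functions per term


def rows_for_term(term):
    """Map a term to a few row indices (hypothetical Bloom-filter-style hashing)."""
    rows = set()
    for seed in range(HASHES_PER_TERM):
        digest = hashlib.sha256(f"{seed}:{term}".encode()).digest()
        rows.add(digest[0] % NUM_ROWS)
    return rows


def build_index(docs):
    """Build the rows; bit `col` of rows[r] is 1 if document `col` maps to row r."""
    rows = [0] * NUM_ROWS
    for col, words in enumerate(docs):
        for word in words:
            for r in rows_for_term(word):
                rows[r] |= 1 << col
    return rows


def query(rows, terms, num_docs):
    """AND together all rows of all query terms; surviving bits are candidates."""
    result = (1 << num_docs) - 1
    for term in terms:
        for r in rows_for_term(term):
            result &= rows[r]
    return [col for col in range(num_docs) if (result >> col) & 1]


docs = [{"bit", "slice"}, {"index", "slice"}, {"bit", "index"}]
rows = build_index(docs)
candidates = query(rows, ["bit", "slice"], len(docs))
# Document 0 contains both terms, so it is always among the candidates;
# other documents may appear as false positives because rows are shared.
```

The property that matters for the rest of the method is visible here: a document that truly matches can never be dropped, while extra 1 bits can only add candidates.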
For this scenario there are currently two kinds of candidate solutions: arithmetic coding and bitmap compression. Existing arithmetic coding schemes for inverted indexes focus on integer compression and expect the sequence being compressed to be monotonic or to consist of small values, so they are not suitable for the BitFunnel index structure. Traditional bitmap compression methods mainly use run-length coding, which requires long runs of consecutive 0s or 1s in the data, a condition that the BitFunnel structure can hardly guarantee.
[ summary of the invention ]
To solve the above problems, the invention provides a dictionary-based bit slice index compression method. Unlike traditional run-length-based compression, it achieves a good compression effect without requiring long runs of consecutive 0s or 1s, and is therefore better suited to the BitFunnel index. Specifically, blocks with a high degree of repetition in the BitFunnel index are replaced by numbers with fewer bits to achieve compression, and the dictionary stores the mapping from repeated blocks to numbers.
Referring to Fig. 1, the dictionary-based bit slice index compression method (together with its optimization strategies) mainly comprises:
Step 1 (S1): rearranging the documents at block-size intervals according to the density of 1 bits in the index columns;
Step 2 (S2): selecting, according to a custom threshold based on the characteristics of the query data set, a subset of rows accessed with low frequency by queries for compression;
Step 3 (S3): performing dictionary compression on the rows selected for compression in S2.
S1: a strategy for rearranging documents at block-size intervals according to the density of 1 bits in the index columns. Specifically:
Assume that the index has d columns and that the block size is k bits.
Step 1.1: count the density of 1 bits in each index column and sort the columns in descending order of density;
Step 1.2: initialize an empty index. Place the $d/k$ columns with the highest density of 1 bits into columns $1, k+1, 2k+1, \ldots, (d/k-1)k+1$ of the new index; place the $d/k$ columns with the next-highest density into columns $2, k+2, 2k+2, \ldots, (d/k-1)k+2$; and so on, until finally the $d/k$ columns with the lowest density of 1 bits are placed into columns $k, 2k, 3k, \ldots, d$.
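The placement rule of step 1.2 can be sketched as a permutation of column indices. This is a minimal sketch that assumes k divides d evenly (as in the d = 16, k = 4 illustration of Fig. 3) and uses 0-based positions; the function name `rearranged_order` and the sample densities are hypothetical.

```python
def rearranged_order(densities, k):
    """Return new_order, where new_order[p] is the original column placed at
    0-based position p of the rearranged index."""
    d = len(densities)
    if d % k != 0:
        raise ValueError("this sketch assumes that k divides d")
    group = d // k  # number of columns per density group, i.e. number of blocks per row
    # Step 1.1: rank columns by descending density (stable for equal densities).
    ranked = sorted(range(d), key=lambda col: -densities[col])
    # Step 1.2: density group g (0 = densest) fills offset g of every block.
    new_order = [None] * d
    for rank, col in enumerate(ranked):
        g, j = divmod(rank, group)  # g: density group, j: index of the block
        new_order[j * k + g] = col
    return new_order


densities = [0.1, 0.9, 0.3, 0.7, 0.5, 0.2, 0.8, 0.4]
order = rearranged_order(densities, k=4)
# order == [1, 3, 7, 5, 6, 4, 2, 0]: within each block of 4 columns the
# densities now decrease (0.9, 0.7, 0.4, 0.2 and 0.8, 0.5, 0.3, 0.1).
```

Because every block receives one column from each density group at the same offset, blocks in the same row tend to have similar bit patterns, which is what the dictionary step relies on.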
S2: selecting, according to a custom threshold based on the characteristics of the query data set, a subset of rows accessed with low frequency by queries for compression. Specifically:
Step 2.1: define a custom threshold α, in the range [0, 1], for the share of query accesses covered by the high-frequency access rows.
Step 2.2: sort the rows of the matrix by query access frequency from high to low, and select as high-frequency access rows the smallest number q of most frequently accessed rows such that
$\frac{1}{N}\sum_{i=1}^{q} f(i) \ge \alpha$
where N is the total access frequency over all rows and f(i) is the access frequency of the i-th row in the sorted order. Using part of the query data set and the custom threshold, the rows are thus classified into high-frequency access rows and low-frequency access rows;
Step 2.3: create a row compression flag file that records, for each row, whether it is a low-frequency access row, i.e. whether it needs to be compressed.
S3: performing dictionary compression on the rows selected for compression in S2. Specifically:
Step 3.1: divide the rearranged BitFunnel bit slice index into blocks row by row, build a dictionary that stores the all-1 block and the blocks that occur with high frequency in the rows selected for compression in S2, and record the mapping between high-frequency blocks and block numbers;
Step 3.2: traverse the index and replace the blocks that appear in the dictionary with block numbers of fewer bits; blocks that are not in the dictionary are replaced with the number of the most similar block in the dictionary. This may introduce false positives for a query but guarantees that no true result is lost.
For convenience of description, we assume a block size of k bits, a block number size of b bits, and b < k.
After steps S1 and S2 are completed, each row selected for compression is divided into aligned fixed-length blocks of k bits, and repeated blocks are mined in units of these k-bit blocks. Starting from the first row, the k-bit blocks are scanned one by one, repeated blocks are identified and their occurrence frequencies are recorded, until the last row has been processed. The mined repeated blocks are sorted in descending order of frequency. The $2^b - 1$ most frequent blocks and one all-1 block are placed into the dictionary. The matrix is then traversed again:
A. If a block exists in the dictionary, it is replaced by its b-bit block number.
B. If a block x does not exist in the dictionary, the dictionary is searched for the most similar block y, namely the block that covers x (x AND y = x) and has the fewest 1 bits (since the all-1 block is stored in the dictionary, such a y can always be found), and x is replaced by the number of y.
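The two passes described above (mining frequent blocks, then replacing every block by a b-bit number) can be sketched as follows. This is an illustrative sketch rather than the patented implementation: each row is modelled as a list of k-bit integers, the function names are hypothetical, and excluding the all-1 block from the frequency count (so that the dictionary has exactly 2^b distinct entries) is an assumption of the sketch.

```python
from collections import Counter


def build_dictionary(blocks, k, b):
    """Keep the 2**b - 1 most frequent blocks plus the all-1 block.
    The position of a block in the returned list is its b-bit block number."""
    all_ones = (1 << k) - 1
    # Assumption: the all-1 block is always added separately, so skip it here.
    counts = Counter(x for x in blocks if x != all_ones)
    frequent = [x for x, _ in counts.most_common(2**b - 1)]
    return frequent + [all_ones]


def encode_block(x, dictionary):
    """Case A: x is in the dictionary. Case B: use the superset with fewest 1 bits."""
    if x in dictionary:
        return dictionary.index(x)
    supersets = [i for i, y in enumerate(dictionary) if x & y == x]
    return min(supersets, key=lambda i: bin(dictionary[i]).count("1"))


def compress_rows(rows, k, b):
    """rows: list of rows, each a list of k-bit integers (one per block)."""
    blocks = [x for row in rows for x in row]
    dictionary = build_dictionary(blocks, k, b)
    encoded = [[encode_block(x, dictionary) for x in row] for row in rows]
    return dictionary, encoded


rows = [
    [0b1100, 0b1100, 0b1010],
    [0b1010, 0b1110, 0b1110],
    [0b1100, 0b0010, 0b1010],
]
dictionary, encoded = compress_rows(rows, k=4, b=2)
```

In this toy input the block 0010 is not frequent enough to enter the dictionary, so it is encoded as the number of 1010, its superset with the fewest 1 bits; every decoded block covers the original one, which is why no true result can be lost.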
The dictionary-based compression method takes advantage of the sparsity of the matrix and the uneven distribution of 1s. The overall density of 1s is low and blocks with a high density of 1s are few, so the proportion of repeated blocks is high and approximate blocks differ little from the blocks they replace. These characteristics allow the dictionary-based compression of the invention to avoid a large increase in false positives.
The invention has the advantages and beneficial effects that:
Document rearrangement increases the block repetition of the BitFunnel index; partial compression keeps the false-positive rate caused by compression to about 10%; and dictionary-based compression of the bitmap index ultimately improves the compression of the BitFunnel index. The method can be widely applied in the fields of search engine performance optimization and index compression.
[ description of the drawings ]
FIG. 1 is a flow chart of a dictionary-based bit slice index compression method of the present invention;
FIG. 2 is a schematic diagram of the BitFunnel index structure;
FIG. 3 is an exemplary diagram of a process and results for reordering documents at block size intervals based on bit 1 density in the index column;
FIG. 4 is an exemplary diagram of a dictionary compression process and results;
[ detailed description ]
For the purpose of promoting an understanding of the above-described objects, features and advantages of the invention, reference will now be made in detail to the present embodiments of the invention illustrated in the accompanying drawings.
Example 1 document rearrangement
Since the mapping matrix of BitFunnel is determined by the hash functions (Bloom filter) and the document contents, we can change the density of 1s in the matrix by adjusting the word-to-row mapping and the number of rows in the matrix. Although the block repetition in the 0/1 matrix is high, the distribution of 1s in the matrix is very uneven because of characteristics such as clustering in the document distribution (BitFunnel only guarantees the global density of 1s). We therefore propose a density-based document rearrangement scheme that increases the similarity between blocks in the matrix and thereby improves the compression effect. The higher the block repetition, the better dictionary compression works, in two respects: the probability that each block appears in the dictionary increases; and the proportion of extra 1 bits introduced when a block is replaced by an approximate dictionary block decreases, so the false-positive rate caused by dictionary compression is greatly reduced.
For convenience of description, we assume that the mapping matrix has d columns and that each block contains k bits. Fig. 3 illustrates the document rearrangement process for d = 16 and k = 4. The density of 1s in each column is first counted and the columns are sorted in descending order of density; for ease of illustration, each group of d/k columns is marked with a different background. The four densest columns, O, J, M and G, are placed first, into columns 1, 5, 9 and 13 (diagonal hatching); the remaining groups follow in the same way, and the four sparsest columns, K, L, P and F, are placed into columns 4, 8, 12 and 16 (no hatching). After rearrangement by column density, within every k columns the density decreases from the 1st column to the k-th column, which increases the repetition between blocks and so benefits dictionary compression.
EXAMPLE 2 partial compression
By analyzing real data sets, we found that during query processing only a few rows are accessed with high frequency, while most rows are accessed with low frequency. Compressing high-frequency access rows leads to a higher false-positive rate, whereas compressing low-frequency access rows has little effect on it. Based on this finding, we adopt a partial compression scheme: low-frequency access rows are compressed, and high-frequency access rows, which affect the false-positive rate more, are left uncompressed.
The rows of the matrix are sorted by access frequency from high to low, and the smallest number q of most frequently accessed rows is selected as the high-frequency access rows such that
$\frac{1}{N}\sum_{i=1}^{q} f(i) \ge \alpha$
where N is the total access frequency over all rows and f(i) is the access frequency of the i-th row in the sorted order. Using part of the query data set and the custom threshold, the rows are classified into high-frequency access rows and low-frequency access rows. Only the low-frequency access rows are compressed.
For convenience of description, we assume that the matrix has 100 rows and that the access frequency of each row equals its row number, i.e. f(i) = i. We first sort the rows of the matrix by frequency from high to low, so the access frequencies are 100, 99, 98, …, 1 and N = 100 + 99 + 98 + … + 1 = 5050. If we set the threshold to α = 0.9, we select the 69 most frequently accessed rows (row numbers 100 to 32), the smallest number of rows that satisfies
$\frac{100 + 99 + \cdots + 32}{5050} = \frac{4554}{5050} \approx 0.902 \ge 0.9$
as the high-frequency access rows (with only 68 rows the share would be 4522/5050 ≈ 0.895, below the threshold). The remaining 31 rows (row numbers 31 to 1) are compressed. In real query workloads only a few rows are accessed with high frequency and most rows are accessed rarely, so partial compression can achieve a good compression effect while keeping the false-positive rate under control.
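The selection of the smallest set of high-frequency rows that covers a share α of all accesses can be reproduced with a few lines of code. This is a minimal sketch using the 100-row illustration above; `split_rows` is a hypothetical helper name, and the access frequencies are passed in as a plain dictionary.

```python
def split_rows(freq, alpha):
    """freq maps row id -> access frequency.
    Returns (high_rows, low_rows); only low_rows would be compressed."""
    total = sum(freq.values())
    ranked = sorted(freq, key=lambda r: -freq[r])  # most frequently accessed first
    high, covered = [], 0
    for r in ranked:
        if covered >= alpha * total:  # threshold already reached: stop adding rows
            break
        high.append(r)
        covered += freq[r]
    return high, ranked[len(high):]


freq = {i: i for i in range(1, 101)}  # f(i) = i for rows 1..100
high, low = split_rows(freq, alpha=0.9)
# high holds 69 rows (100 down to 32), low holds the remaining 31 rows (31 down to 1).
```

With these inputs the loop stops after row 32, when the covered frequency reaches 4554 of 5050, matching the worked numbers in the text.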
To select the rows, the whole query data set is randomly divided into two equal halves. One half is used as a training set: whether each row is compressed is determined from the row access frequencies f(i) obtained on the training set and the threshold α. The other half is used as a test set for measuring performance, such as query time and false-positive rate. In other words, we do not simply choose fixed parameters for compression; instead we learn suitable parameters from the query history, exploiting the fact that query requests from the same source are similar, so that the resulting compressed index performs well on future queries. The specific value of the custom threshold α depends on the original false-positive rate of the index and on its block repetition.
Example 3 dictionary compression
For convenience of description, we assume a block size of k bits, a block number size of b bits, and b < k.
Fig. 4 illustrates the dictionary compression process and its result, taking k = 4 and b = 2 as an example. First, the occurrence frequency of each block in the original index is recorded, and the $2^2 - 1 = 3$ most frequent blocks together with the all-1 block form the dictionary (Fig. 4, bottom left). The first block of the fourth row, 1100, appears in the dictionary and is therefore replaced with its number, 10. The second block of the first row, 0010 (inside the dashed box), does not appear in the dictionary; its third bit is 1. The dictionary blocks 1111, 1110 and 1010 also have their third bit set to 1, so each of them covers 0010, and the one with the fewest 1 bits, 1010, is selected. This strategy ensures that no query result is missed and that any increase in false positives is kept as small as possible. The hatched portions in Fig. 4 mark blocks that are not in the dictionary and are replaced with an approximate block. The compressed data consists of two parts: the dictionary and the compressed index.
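The choice of the approximate block in this example can be checked directly. The snippet below is a small sketch that uses the four dictionary blocks named in the text (1111, 1110, 1010 and 1100); the variable names are illustrative only.

```python
dictionary = [0b1111, 0b1110, 0b1010, 0b1100]  # blocks named in the Fig. 4 example
x = 0b0010                                     # block that is not in the dictionary

# A dictionary block y covers x when every 1 bit of x is also set in y.
supersets = [y for y in dictionary if x & y == x]

# Among the covering blocks, pick the one with the fewest 1 bits.
best = min(supersets, key=lambda y: bin(y).count("1"))

print(format(best, "04b"))  # prints 1010
```

Block 1100 is excluded because its third bit is 0, and 1010 wins over 1110 and 1111 because it adds only one extra 1 bit.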
Assuming that the matrix has m rows and d columns, the compression ratio of the dictionary compression method is:
$\text{compression ratio} = \frac{\text{dictionary size} + \text{compressed matrix size}}{\text{original matrix size}} = \frac{k \cdot 2^{b} + mdb/k}{md}$
Here the dictionary size is $k \times 2^{b}$ bits and the size of the mapping matrix after compression is $mdb/k$ bits. The dictionary is small and can be ignored in practice, so the compression ratio is approximately equal to
$\frac{b}{k}$
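To get a feel for the magnitudes, the sketch below evaluates the sizes stated in the text (dictionary of k·2^b bits, compressed matrix of mdb/k bits, original matrix of md bits). Reading the compression ratio as compressed size divided by original size is an assumption of this sketch, since the formula itself is an image in the source; the matrix dimensions m and d are hypothetical, while k = 32 and b = 16 are the values used in the experiments.

```python
def compression_ratio(m, d, k, b):
    """Ratio of (dictionary + compressed matrix) to the original m x d bit matrix."""
    original_bits = m * d
    dictionary_bits = k * 2**b      # 2**b entries of k bits each
    compressed_bits = m * d * b / k # each k-bit block becomes a b-bit number
    return (dictionary_bits + compressed_bits) / original_bits


# Hypothetical matrix: 1,000 compressed rows over 1,000,000 documents.
ratio = compression_ratio(m=1_000, d=1_000_000, k=32, b=16)
print(round(ratio, 4))  # prints 0.5021
```

The dictionary accounts for about 2 million bits (256 KiB) against 500 million bits of encoded blocks, which illustrates why it can be neglected and the ratio for the compressed rows is close to b/k = 0.5.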
During intersection, the original block is retrieved from the dictionary simply by using the b-bit block number as an index, and the AND operation is then performed on it. Because the dictionary is generally small and is accessed frequently during intersection (decompression), most of the dictionary can stay resident in the CPU cache, which reduces memory access overhead and improves intersection performance.
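Intersection over compressed rows therefore amounts to one table lookup per block followed by a bitwise AND. The following sketch assumes that every row involved is compressed, that each row is a list of block numbers, and that the most significant bit of a block corresponds to its first column; the function name and the sample data are hypothetical.

```python
def intersect(encoded_rows, dictionary, k):
    """AND the given compressed rows block by block and return the 0-based
    indices of the columns (documents) whose bit is 1 in every row."""
    num_blocks = len(encoded_rows[0])
    hits = []
    for j in range(num_blocks):
        acc = (1 << k) - 1                # start from the all-1 block
        for row in encoded_rows:
            acc &= dictionary[row[j]]     # block number -> original k-bit block
        for bit in range(k):
            # Assumed convention: the most significant bit is the first column.
            if (acc >> (k - 1 - bit)) & 1:
                hits.append(j * k + bit)
    return hits


dictionary = [0b1100, 0b1010, 0b1110, 0b1111]
row_a = [0, 3]   # decodes to 1100 1111
row_b = [1, 2]   # decodes to 1010 1110
matches = intersect([row_a, row_b], dictionary, k=4)
# matches == [0, 4, 5, 6]
```

In a mixed index, uncompressed high-frequency rows would skip the lookup and be ANDed directly, which is the per-row check the experiments identify as one source of the extra intersection time.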
Example 4 results of the experiment
We tested the dictionary-based bit slice index compression method of the invention on the TREC GOV2 data set. Pri and Opt denote the indexes generated under BitFunnel's two policies, PrivateSharedRank0 and Optimal. In the experiments the block size is set to 32 bits and the block number size to 16 bits.
The data sets used are as follows:
(1) TREC GOV2, a data set crawled from the .gov domain in 2004 and containing 25 million web pages, is used as the document set for generating the index. HTML tags are removed from all documents, and the index is built after the title and body text are extracted;
(2) MillionSet is used as query set 1; it contains the 60,000 queries of the 2007, 2008 and 2009 TREC Million Query Track;
(3) TerabyteSet is used as query set 2; it contains the 100,000 queries of the 2006 TREC Terabyte Track.
TABLE 1
[Table 1: compression ratios of the dictionary-based bit slice index compression method; provided as an image in the original]
Table 1 gives the compression ratios of the dictionary-based bit slice index compression method. According to Table 1, the overall compression ratios for MillionSet and TerabyteSet in Pri mode are 0.73 and 0.72, respectively; in Opt mode they are 0.71 and 0.70, respectively. MillionSet is slightly better than TerabyteSet, but the difference is only 0.01 to 0.02, which shows that the choice of query set has little influence on the compression ratio and that the partial compression approach generalizes across different query data sets.
We also tried treating every 32 bits as an unsigned integer (unsigned int type) and compressing the result with arithmetic coding and bitmap compression methods. We found that Pfloor, VByte and EWAH reduce the file size by only -1% (that is, a slight increase), 11.3% and 4.35%, respectively. The distribution of 1s in the BitFunnel matrix is irregular and uneven, so the values in the integer sequence are large, which is unfavorable for d-gap and similar value transformations; arithmetic coding therefore compresses poorly. Bitmap compression schemes rely on run lengths and need long runs of consecutive 0s or 1s, but the construction of the BitFunnel index depends entirely on the characteristics of the documents and cannot guarantee this condition. In conclusion, traditional compression algorithms compress the BitFunnel index poorly, whereas the dictionary-based compression scheme of the invention improves the compression ratio effectively.
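TABLE 2
[Table 2: average intersection time and false-positive rate per query under the two strategies; provided as an image in the original]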
Table 2 shows the average intersection time and the false-positive rate per query under the two strategies. For the matrix generated with the Pri strategy, decompression increases the intersection time by 21% on MillionSet and 16% on TerabyteSet. For the matrix generated with the Opt strategy, the increases are 48% and 37%, respectively. The intersection time increases for two reasons: first, partial compression leaves some high-frequency access rows uncompressed, so intersection must check for each row whether it is compressed; second, compression introduces the latency of accessing the dictionary. We can also see that dictionary compression increases the false-positive rate by about 7 to 10 percentage points.
Regarding intersection, because the index must be decompressed, queries on the compressed index spend 16% to 48% more time on intersection. However, thanks to the characteristics of BitFunnel, the relevant documents are obtained with simple bitwise operations, so the intersection time is very small and accounts for only a small share of the total query time (a complete query includes intersection, top-K document computation, summary generation and other steps); an intersection time of about 5 milliseconds has little effect on the user. Compression also introduces additional false positives, which we use the threshold α to keep within a range of 7 to 10 percentage points. Because the search engine ranks the documents obtained by intersection by relevance and returns only the K highest-scoring ones, the false-positive documents caused by compression are quickly filtered out thanks to their low scores. We therefore expect that in a real search engine system these false-positive documents will not have a large impact.
Considering compression ratio, intersection time and false-positive rate together, the compression scheme provided by the invention compresses BitFunnel bit slice indexes well.
The dictionary-based compression method of the invention has been described in detail above, and specific examples have been used to explain its principle and implementation. The description of the examples is intended only to help in understanding the method and its core idea. A person skilled in the art may, following the idea of the invention, vary the specific embodiments and the scope of application; accordingly, the content of this specification should not be construed as limiting the invention.

Claims (4)

1. A bit slice index compression method based on a dictionary is characterized by comprising the following steps:
step 1, rearranging the document according to the column density of bit 1 in the index column by taking the block size as an interval;
assuming that the number of index columns is d, the block size is k bits;
step 1.1, counting the density of bit 1 in each column of the index, and arranging the columns in a descending order according to the density;
step 1.2, initializing an empty index and placing the $d/k$ columns with the highest density of bit 1 into columns $1, k+1, 2k+1, \ldots, (d/k-1)k+1$ of the index; placing the $d/k$ columns with the next-highest density into columns $2, k+2, 2k+2, \ldots, (d/k-1)k+2$ of the index, and so on, until finally the $d/k$ columns with the lowest density of bit 1 are placed into columns $k, 2k, 3k, \ldots, d$;
step 2, selecting, according to a custom threshold based on the characteristics of the query data set, a subset of rows accessed with low frequency by queries for compression;
step 2.1, defining a custom threshold for the access frequency of the high-frequency access rows in the query process;
step 2.2, classifying the rows, according to part of the query data set and the custom threshold, into high-frequency access rows and low-frequency access rows;
step 2.3, creating a row compression flag file that records, for each row, whether it is a low-frequency access row, i.e. whether the row needs to be compressed;
step 3, performing dictionary compression on the rows selected for compression in step 2;
step 3.1, dividing the rearranged BitFunnel bit slice index into blocks row by row, building a dictionary that stores the all-1 block and the blocks that occur with high frequency in the rows selected for compression in step 2, and recording the mapping between high-frequency blocks and block numbers;
step 3.2, traversing the index and replacing the blocks that appear in the dictionary with block numbers of fewer bits; blocks that do not appear in the dictionary are replaced with the number of the most similar block in the dictionary, which may introduce false positives for a query but guarantees that no true result is lost.
2. The method of claim 1, wherein the application domain of the method comprises:
bitmaps, Bloom filters, and Bloom-filter-based bit slice index structures of which BitFunnel is representative.
3. The method according to claim 1, wherein defining the custom threshold for the access frequency of the high-frequency access rows in step 2.1 comprises:
setting the threshold that separates high-frequency from low-frequency access rows according to the block repetition of the BitFunnel index and the original false-positive rate of the BitFunnel index.
4. The method according to claim 1, wherein the corresponding decompression method for restoring the data after compression is specifically:
during decompression, the original block is retrieved from the dictionary simply by using the block number as an index, thereby restoring the original data.
CN201810716805.4A 2018-07-03 2018-07-03 Bit slice index compression method based on dictionary Active CN108932738B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810716805.4A CN108932738B (en) 2018-07-03 2018-07-03 Bit slice index compression method based on dictionary

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810716805.4A CN108932738B (en) 2018-07-03 2018-07-03 Bit slice index compression method based on dictionary

Publications (2)

Publication Number Publication Date
CN108932738A CN108932738A (en) 2018-12-04
CN108932738B true CN108932738B (en) 2022-08-16

Family

ID=64446607

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810716805.4A Active CN108932738B (en) 2018-07-03 2018-07-03 Bit slice index compression method based on dictionary

Country Status (1)

Country Link
CN (1) CN108932738B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109802684B (en) * 2018-12-26 2022-03-25 华为技术有限公司 Method and device for data compression
CN111680035B (en) * 2020-05-07 2023-09-08 中国工业互联网研究院 Compression coding and decoding method for network stream data and bitmap index thereof
CN114979094A (en) * 2022-05-13 2022-08-30 深圳智慧林网络科技有限公司 Data transmission method, device, equipment and medium based on RTP

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6067540A (en) * 1997-02-28 2000-05-23 Oracle Corporation Bitmap segmentation
CN101075261A (en) * 2007-06-12 2007-11-21 腾讯科技(深圳)有限公司 Method and device for compressing index
CN102760165A (en) * 2012-06-12 2012-10-31 上海方正数字出版技术有限公司 Full text retrieval method using bitmap index and device
CN106815875A (en) * 2016-12-06 2017-06-09 腾讯科技(深圳)有限公司 The coding of information bitmap, coding/decoding method and device
KR101872241B1 (en) * 2017-03-24 2018-06-28 경희대학교 산학협력단 Method, apparatus and computer program for information compression

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
倒排索引中的文档序号重排技术综述;史亮等;《中文信息学报》;20150315;第29卷(第02期);24-32 *

Also Published As

Publication number Publication date
CN108932738A (en) 2018-12-04

Similar Documents

Publication Publication Date Title
US11263215B2 (en) Methods for enhancing rapid data analysis
US10521441B2 (en) System and method for approximate searching very large data
Ding et al. Using graphics processors for high performance IR query processing
US10210280B2 (en) In-memory database search optimization using graph community structure
CN108932738B (en) Bit slice index compression method based on dictionary
Stabno et al. RLH: Bitmap compression technique based on run-length and Huffman encoding
Ding et al. Scalable techniques for document identifier assignment in inverted indexes
CN106095951B (en) Data space multi-dimensional indexing method based on load balancing and inquiry log
Zhang et al. TARDIS: Distributed indexing framework for big time series data
US11294816B2 (en) Evaluating SQL expressions on dictionary encoded vectors
CN100476824C (en) Method and system for storing element and method and system for searching element
Jiang et al. xLightFM: Extremely memory-efficient factorization machine
CN111611250A (en) Data storage device, data query method, data query device, server and storage medium
Lin et al. Cowic: A column-wise independent compression for log stream analysis
Abu-Libdeh et al. Learned indexes for a google-scale disk-based database
Peng et al. Parallelization of massive textstream compression based on compressed sensing
Sun et al. Handling multi-dimensional complex queries in key-value data stores
Foufoulas et al. Adaptive compression for fast scans on string columns
KR101052220B1 (en) Skyline Query Execution Device and Method Including Search Terms
Wang et al. Mlb+-tree: A multi-level b+-tree index for multidimensional range query on seismic data
Edjah Zone Map Layout Optimization
Xiao et al. Highly efficient string similarity search and join over compressed indexes
KR20010109945A (en) RS-tree for k-nearest neighbor queries with non spatial selection predicates and method for using it
Jayanth Optimizations and Heuristics to improve Compression in Columnar Database Systems
Xiaokang et al. An efficient LSH indexing on discriminative short codes for high-dimensional nearest neighbors

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant