CN108460074A

CN108460074A - Method for creating and using a BloomFilter-based multi-column index in a column-store database

Info

Publication number: CN108460074A
Application number: CN201711470231.9A
Authority: CN
Inventors: 武新; 赵伟; 史大义; 姚建华; 崔维力; 郑黎辉
Original assignee: TIANJIN NANKAI UNIVERSITY GENERAL DATA TECHNOLOGIES Co Ltd
Current assignee: TIANJIN NANKAI UNIVERSITY GENERAL DATA TECHNOLOGIES Co Ltd
Priority date: 2017-12-29
Filing date: 2017-12-29
Publication date: 2018-08-28

Abstract

The present invention provides a method for implementing a multi-column index based on the BloomFilter principle in a column-store database, comprising a method for creating the multi-column index in the column-store database and a method for using it in column-store database queries. The multi-column index realized by the invention has the following advantages: actual values are not stored, so space usage is small; index lookup is fast and takes a fixed amount of time; a single index supports queries on any combination of columns; and the false positive rate of the BloomFilter is controllable. The beneficial effect of the invention is that records that cannot match are excluded to the greatest possible extent, which reduces the disk accesses needed for scanning and thereby improves database performance.

Description

Method for creating and using a BloomFilter-based multi-column index in a column-store database

Technical field

The invention belongs to the field of databases, and more particularly relates to a method for creating and using a BloomFilter-based multi-column index in a column-store database.

Background art

The Bloom Filter was proposed in 1970 by Burton Howard Bloom. It consists of a very long binary vector and a series of random mapping functions, and can be used to test whether an element is in a set. Its advantage is that its space efficiency and query time far exceed those of ordinary algorithms. Its disadvantages are a certain false positive rate (false positives: if the Bloom Filter reports that an element is in a set, the element may in fact not be in the set) and the difficulty of deletion. It never produces false negatives, however: if the Bloom Filter reports that an element is not in a set, the element is definitely not in the set.

Usage example: in its initial state, a Bloom Filter is a bit array of m bits, each of which is set to 0, as shown in Fig. 4.

To represent a set S={x1, x2, ..., xn} of n elements, the Bloom Filter uses k mutually independent hash functions, each of which maps every element of the set into the range {1, ..., m}. For any element x, the position h_i(x) produced by the i-th hash function is set to 1 (1≤i≤k). Note that if a position is set to 1 more than once, only the first setting has any effect; later ones change nothing. In Fig. 5, k=3, and two hash functions select the same position (the 5th from the left).

To decide whether y belongs to the set, the k hash functions are applied to y. If every position h_i(y) is 1 (1≤i≤k), y is considered an element of the set; otherwise y is considered not to be an element of the set. In Fig. 6, y1 is not an element of the set, while y2 either belongs to the set or is a false positive.
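The insertion and membership test described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the class name `BloomFilter`, the use of salted SHA-256 as the k hash functions, and the 0-based positions in [0, m) instead of {1, ..., m} are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: an m-bit array and k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = 0  # a Python int used as an m-bit array, all bits 0

    def _positions(self, item: str):
        # Assumption: the k hash functions are SHA-256 salted with the index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos  # setting an already-set bit changes nothing

    def might_contain(self, item: str) -> bool:
        # True means "possibly in the set"; False means "definitely not".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


bf = BloomFilter(m=64, k=3)
for x in ("x1", "x2", "x3"):
    bf.add(x)
```

Every inserted element is always reported as present, while an element that was never inserted is usually, but not always, reported as absent.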

Advantages: compared with other data structures, the Bloom Filter has a large advantage in both space and time. Its storage space and its insertion and query times are all constant. In addition, the hash functions are independent of one another, which makes parallel implementation in hardware convenient. The Bloom Filter does not need to store the elements themselves, which is an advantage in settings with very strict confidentiality requirements. The Bloom Filter can represent the universal set, which no other data structure can. When k and m are the same, the intersection, union, and difference of two Bloom Filters that use the same set of hash functions can be computed with bit operations.
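As an illustration of the bit-operation property, the sketch below shows only the union case, which is exact: OR-ing two filters built with the same m, k, and hash functions gives the same bits as a filter built from the union of the two sets. The helper names `positions` and `build` and the salted SHA-256 hashing are assumptions for the example.

```python
import hashlib
from typing import Iterable, List


def positions(item: str, m: int, k: int) -> List[int]:
    # Assumption: k hash functions derived from SHA-256 salted with the index i.
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()[:8], "big") % m
        for i in range(k)
    ]


def build(items: Iterable[str], m: int, k: int) -> int:
    """Return the filter bits (as an int) for the given items."""
    bits = 0
    for item in items:
        for pos in positions(item, m, k):
            bits |= 1 << pos
    return bits


M, K = 64, 3
filter_a = build(["x1", "x2"], M, K)
filter_b = build(["x3"], M, K)
union = filter_a | filter_b  # one bitwise OR merges the two filters
```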

Disadvantages: the shortcomings of the Bloom Filter are as obvious as its advantages. The false positive rate is one of them. As the number of stored elements grows, the false positive rate grows with it. If the number of elements is very small, however, a hash table is sufficient.

In addition, elements normally cannot be deleted from a Bloom Filter. An obvious idea is to turn the bit array into an array of integers, incrementing the corresponding counters by 1 whenever an element is inserted and decrementing them when an element is deleted. Deleting an element safely is not that simple, however. First, it must be certain that the element to be deleted really is in the Bloom Filter, and the filter alone cannot guarantee this. Counter wraparound also causes problems.
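The counter-based variant mentioned above can be sketched as follows. This is an illustrative sketch only: the class name `CountingBloomFilter` and the salted SHA-256 hashing are assumptions, and the zero-counter check in `remove` catches only some invalid deletions, since the filter cannot prove that an element was really inserted.

```python
import hashlib
from typing import List


class CountingBloomFilter:
    """Bloom filter whose bit array is replaced by an array of counters."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.counters = [0] * m

    def _positions(self, item: str) -> List[int]:
        # Assumption: k hash functions derived from SHA-256 salted with the index i.
        return [
            int.from_bytes(hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()[:8], "big") % self.m
            for i in range(self.k)
        ]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.counters[pos] += 1

    def remove(self, item: str) -> None:
        pos_list = self._positions(item)
        # A zero counter proves the item was never added; a non-zero counter
        # proves nothing, so the caller must still know the item is present.
        if any(self.counters[pos] == 0 for pos in pos_list):
            raise ValueError("item is definitely not in the filter")
        for pos in pos_list:
            self.counters[pos] -= 1

    def might_contain(self, item: str) -> bool:
        return all(self.counters[pos] > 0 for pos in self._positions(item))


cbf = CountingBloomFilter(m=64, k=3)
cbf.add("a")
cbf.add("b")
cbf.remove("a")
```

Python integers do not overflow, so the wraparound problem does not show up here; it would with fixed-width counters (for example 4 bits per position).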

Summary of the invention

In view of this, the present invention aims to create a multi-column index based on the Bloom Filter and to apply it in a column-store database.

To achieve this goal, the false positive rate problem inherent in the Bloom Filter must be solved. The technical solution adopted by the invention is to build one Bloom Filter for each record, specifically:

A method for creating a Bloom Filter-based multi-column index in a column-store database comprises the following steps:

1) select the column objects on which the index is to be created;

2) take record values out of the column objects and encode them;

3) map the encoded values using K hash functions;

4) set bits in the binary array of the Bloom Filter according to the mapped values to obtain the multi-column index.

Preferably, in step 2), the data in the column objects is read by DataCell data block.

Preferably, the number K of hash functions, the number n of columns on which the index is created, and the number m of bits in the Bloom Filter array satisfy the following relationship: K=9, m/n=20.

A method for using the multi-column index created above in column-store database queries comprises the following steps:

1) take out the query condition and create a BloomFilter for it;

2) perform a bitwise AND between the created BloomFilter and every BloomFilter in the multi-column index; if the result differs from the created BloomFilter, discard the index record directly; if it is the same, put the index record into the recheck set;

3) after the scan of the multi-column index ends, fetch the actual values and check them again for verification.

A query method that combines the multi-column index created above with the smart index of the column-store database comprises the following steps:

1) query through the smart index and retrieve the table;

2) query through the multi-column index; the index entries corresponding to DataCell data blocks that are a full miss are skipped directly and not scanned.

The multi-column index realized by the invention has the following advantages: actual values are not stored, so space usage is small; index lookup is fast and takes a fixed amount of time; and a single index supports queries on any combination of columns.

The creation method of the invention makes the false positive rate controllable: the false positive rate of the Bloom Filter does not rise as the number of records grows. Moreover, even if a false positive may occur during index matching (the index judges that the record may exist), such records are rechecked during use, so the judgment of the index as a whole does not fail.

Like the data itself, the index content can be organized by data block (DC, DataCell). During index matching, the Bloom Filter index of a specific data block can be loaded on demand, instead of loading all indexes for matching.

Bloom Filter index matching can make use of the scan result of the smart index that is specific to the column-store database: data blocks already filtered out by the smart index do not need to be matched against the index again.

With the above technical solution, records that cannot match are excluded to the greatest possible extent, which reduces the disk accesses needed for scanning and thereby improves database performance.

Description of the drawings

The accompanying drawings, which form a part of the present invention, are provided for further understanding of the invention. The exemplary embodiments of the invention and their descriptions are used to explain the invention and do not unduly limit it. In the drawings:

Fig. 1 is a flow diagram of index creation according to an embodiment of the invention;

Fig. 2 is a flow diagram of index use according to an embodiment of the invention;

Fig. 3 is a schematic diagram of combining the index of an embodiment of the invention with the smart index of the column-store database;

Fig. 4 shows the initial state of a Bloom Filter;

Fig. 5 shows the state of a Bloom Filter when k=3 and two hash functions select the same position;

Fig. 6 shows the state of a Bloom Filter when y1 is not an element of the set.

Detailed description of the embodiments

It should be noted that, where no conflict arises, the embodiments of the invention and the features of the embodiments may be combined with one another.

The invention is described in detail below with reference to the accompanying drawings and embodiments.

A Bloom Filter index can be created on a single column or on multiple columns.

Creating the index:

As shown in Fig. 1, creation comprises the following steps:

Step 1: prepare the column objects on which the index is to be created and initialize the memory required by the index;

Step 2: take record values out of the column objects and encode them; specifically, the data in the column objects is read DataCell by DataCell in sequence;

Step 3: map the encoded values using K hash functions;

Step 4: set bits in the binary array of the Bloom Filter according to the mapped values to obtain the multi-column index.
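The four creation steps can be sketched as follows, building one Bloom Filter per record as the technical solution describes. The sketch rests on several assumptions that the text does not specify: the "column=value" string encoding, the salted SHA-256 hash functions, the helper names `bit_positions`, `record_filter`, and `build_index`, and the in-memory dictionary standing in for the column objects.

```python
import hashlib
from typing import Dict, List, Sequence


def bit_positions(token: str, m: int, k: int) -> List[int]:
    # Step 3: map an encoded value with k hash functions (assumed: salted SHA-256).
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{token}".encode("utf-8")).digest()[:8], "big") % m
        for i in range(k)
    ]


def record_filter(record: Dict[str, object], m: int, k: int) -> int:
    """Build the m-bit filter of one record over the given columns."""
    bits = 0
    for column, value in record.items():
        token = f"{column}={value}"  # Step 2: hypothetical encoding of a value
        for pos in bit_positions(token, m, k):
            bits |= 1 << pos  # Step 4: set the mapped bit
    return bits


def build_index(columns: Dict[str, Sequence[object]], m: int, k: int) -> List[int]:
    """Return one filter per record; `columns` stands in for the column objects."""
    names = list(columns)  # Step 1: the columns chosen for the index
    row_count = len(columns[names[0]])
    return [
        record_filter({name: columns[name][row] for name in names}, m, k)
        for row in range(row_count)
    ]


columns = {"col1": [10, 11, 10], "col2": [7, 7, 7], "col3": [30, 31, 30]}
n = len(columns)
index = build_index(columns, m=20 * n, k=9)
```

Because each filter holds only the n column values of a single record, its size and fill level stay fixed no matter how many records the table contains.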

To keep the false positive rate controllable, suitable values must be chosen for k and for the ratio of m to n. In this embodiment, the number K of hash functions, the number n of columns on which the index is created, and the number m of bits in the Bloom Filter array satisfy K=9 and m/n=20, which keeps the false positive rate at about one in ten thousand.
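The one-in-ten-thousand figure can be checked against the standard Bloom filter approximation p ≈ (1 − e^(−k·n/m))^k. That formula is not given in the text, and reading n as the number of values inserted into each per-record filter (one per indexed column) is an interpretation. A short sketch under those assumptions:

```python
import math


def false_positive_rate(k: int, bits_per_element: float) -> float:
    """Standard approximation (1 - e^(-k*n/m))^k, with m/n = bits_per_element."""
    return (1.0 - math.exp(-k / bits_per_element)) ** k


rate = false_positive_rate(k=9, bits_per_element=20)
print(f"{rate:.2e}")  # roughly 1e-4 for K=9, m/n=20
```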

Using the index:

As shown in Fig. 2, use comprises the following steps:

Step 1: take out the query condition (for example: col1=10 and col3=30);

Step 2: create a BloomFilter for the query condition (00100100010010100100);

Step 3: take records one by one from the created multi-column index (for example: 00110101010010101100) and perform a bitwise AND between the BloomFilter created for the query condition and each BloomFilter in the multi-column index, for example: (00100100010010100100) & (00110101010010101100).

If the result differs from the BloomFilter value of the query condition, the row does not satisfy the query condition, and the index record is discarded directly.

If the result is identical to the BloomFilter value of the query condition, the record may satisfy the query condition, or it may not (a false positive has occurred), so the index record must be put into the recheck set.

After the scan of the multi-column index ends, the actual values are fetched and checked again to ensure that the result is correct.
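The matching and recheck steps can be sketched with the bit strings from the example. The first index record is the one quoted in the text; the second record, the row values, and the helper names `candidate_rows` and `recheck` are hypothetical and serve only to make the sketch runnable.

```python
from typing import Dict, List


def candidate_rows(query_filter: int, index: List[int]) -> List[int]:
    """Keep the rows whose filter has every bit of the query filter set."""
    return [
        row for row, record_bits in enumerate(index)
        if record_bits & query_filter == query_filter
    ]


def recheck(rows: List[Dict[str, int]], candidates: List[int],
            condition: Dict[str, int]) -> List[int]:
    """Verify candidates against actual values to remove false positives."""
    return [
        row for row in candidates
        if all(rows[row].get(col) == val for col, val in condition.items())
    ]


query_filter = int("00100100010010100100", 2)  # filter of col1=10 and col3=30
index = [
    int("00110101010010101100", 2),  # record from the text: AND equals the query
    int("00000101010010101100", 2),  # hypothetical record missing one query bit
]
rows = [  # hypothetical actual values behind the two index records
    {"col1": 10, "col2": 7, "col3": 30},
    {"col1": 11, "col2": 7, "col3": 31},
]

recheck_set = candidate_rows(query_filter, index)
result = recheck(rows, recheck_set, {"col1": 10, "col3": 30})
```

Only rows that survive the AND test need their actual values read, which is where the saving in disk access comes from.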

Combining with the smart index:

A column-store database has a smart index that can quickly classify the DataCells of a table into three states: full hit, partial hit, and full miss. A DataCell that is a full miss does not need an index scan.

To make use of the filtering result of the upstream smart index, the multi-column index must also be organized by DataCell. The index entries corresponding to full-miss DataCells can then be skipped directly and not scanned.

Specifically, the query method that combines the multi-column index with the smart index of the column-store database, as shown in Fig. 3, comprises the following steps:

Step 1: query through the smart index and retrieve the table;

Step 2: query through the multi-column index; the index entries corresponding to DataCell data blocks that are a full miss are skipped directly and not scanned.
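The combination can be sketched as follows. How the smart index classifies DataCells is not described in the text, so the classification is taken here as a given input; the `CellState` enum, the function `scan_with_smart_index`, and the 4-bit filters are hypothetical and chosen only to keep the example small.

```python
from enum import Enum
from typing import List, Tuple


class CellState(Enum):
    FULL_HIT = "full hit"
    PARTIAL_HIT = "partial hit"
    FULL_MISS = "full miss"


def scan_with_smart_index(
    query_filter: int,
    cell_states: List[CellState],   # assumed output of the smart index
    cell_indexes: List[List[int]],  # per-DataCell lists of record filters
) -> List[Tuple[int, int]]:
    """Return (DataCell number, row offset) pairs that need a recheck."""
    candidates = []
    for cell_no, (state, filters) in enumerate(zip(cell_states, cell_indexes)):
        if state is CellState.FULL_MISS:
            continue  # the index of this DataCell is skipped without scanning
        for offset, record_bits in enumerate(filters):
            if record_bits & query_filter == query_filter:
                candidates.append((cell_no, offset))
    return candidates


states = [CellState.PARTIAL_HIT, CellState.FULL_MISS, CellState.FULL_HIT]
indexes = [[0b0111, 0b0001], [0b0101], [0b1101]]
candidates = scan_with_smart_index(0b0101, states, indexes)
```

The second DataCell contains a filter that would match, but it is never examined because the smart index already ruled the whole block out.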

The foregoing describes only preferred embodiments of the invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A method for creating a BloomFilter-based multi-column index in a column-store database, characterized by comprising the following steps:

1) select the column objects on which the index is to be created;

2) take record values out of the column objects and encode them;

3) map the encoded values using K hash functions;

2. The creation method according to claim 1, characterized in that: in step 2), the data in the column objects is read by DataCell data block.

3. The creation method according to claim 1, characterized in that: the number K of hash functions, the number n of columns on which the index is created, and the number m of bits in the Bloom Filter array satisfy the following relationship: K=9, m/n=20.

4. A method for using, in column-store database queries, the multi-column index created by the creation method according to claim 1, characterized by comprising the following steps:

1) take out the query condition and create a BloomFilter for it;

2) perform a bitwise AND between the created BloomFilter and every BloomFilter in the multi-column index; if the result differs from the created BloomFilter, discard the index record directly; if it is the same, put the index record into the recheck set.

5. The method of use according to claim 4, characterized by further comprising step 3): after the scan of the multi-column index ends, fetch the actual values and check them again for verification.

6. A query method combining the multi-column index created by the creation method according to claim 2 with the smart index of the column-store database, characterized by comprising the following steps:

1) query through the smart index and retrieve the table;

2) query through the multi-column index; the index entries corresponding to DataCell data blocks that are a full miss are skipped directly and not scanned.