CN108460074A - Method for creating and using a Bloom Filter-based multi-column index in a column-store database - Google Patents

Method for creating and using a Bloom Filter-based multi-column index in a column-store database

Info

Publication number
CN108460074A
Authority
CN
China
Prior art keywords
column
index
Bloom Filter
multi-column
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711470231.9A
Other languages
Chinese (zh)
Inventor
武新
赵伟
史大义
姚建华
崔维力
郑黎辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TIANJIN NANKAI UNIVERSITY GENERAL DATA TECHNOLOGIES Co Ltd
Original Assignee
TIANJIN NANKAI UNIVERSITY GENERAL DATA TECHNOLOGIES Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by TIANJIN NANKAI UNIVERSITY GENERAL DATA TECHNOLOGIES Co Ltd
Priority to CN201711470231.9A
Publication of CN108460074A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/221 Column-oriented storage; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2255 Hash tables

Abstract

The present invention provides a method for implementing a multi-column index based on the Bloom Filter principle in a column-store database, comprising a method for creating the Bloom Filter-based multi-column index in a column-store database and a method for using it in column-store database queries. The multi-column index realized by the invention has the following advantages: it does not store actual values, so its space footprint is small; index lookups are fast and take a fixed amount of time; a single index supports queries on any combination of columns; and the false positive rate of the Bloom Filter is controllable. The beneficial effect of the invention is that records that cannot match are excluded to the greatest possible extent, which reduces the disk accesses needed for scanning and thereby improves database performance.

Description

Method for creating and using a Bloom Filter-based multi-column index in a column-store database
Technical field
The invention belongs to the database field, and in particular relates to a method for creating and using a Bloom Filter-based multi-column index in a column-store database.
Background art
The Bloom Filter was proposed in 1970 by Burton Howard Bloom. It consists of a very long binary vector and a series of random mapping functions, and can be used to test whether an element is in a set. Its advantage is that its space efficiency and query time far exceed those of general algorithms. Its disadvantages are a certain false positive rate (false positives: if the Bloom Filter reports that an element is in a set, the element may in fact not be in that set) and the difficulty of deletion. It has no false negatives, however: if the Bloom Filter reports that an element is not in a set, the element is definitely not in that set.
Usage example: in its initial state, a Bloom Filter is a bit array of m bits, each set to 0, as shown in Fig. 4.
To represent a set S = {x1, x2, ..., xn} of n elements, the Bloom Filter uses k mutually independent hash functions, each of which maps every element of the set into the range {1, ..., m}. For any element x, the position hi(x) given by the i-th hash function is set to 1 (1 ≤ i ≤ k). Note that if a position is set to 1 more than once, only the first setting has any effect; later ones change nothing. In Fig. 5, k = 3, and two hash functions select the same position (the fifth from the left).
To decide whether y belongs to the set, we apply the k hash functions to y. If all positions hi(y) are 1 (1 ≤ i ≤ k), y is considered an element of the set; otherwise y is considered not to be an element of the set. In Fig. 6, y1 is not an element of the set, while y2 either belongs to the set or is a false positive.
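The insertion and membership test described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the class name `BloomFilter`, the use of salted BLAKE2b digests as the k hash functions, and zero-based bit positions (0 to m-1 rather than 1 to m) are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: an m-bit array and k salted hash functions."""

    def __init__(self, m, k):
        self.m = m      # number of bits in the array
        self.k = k      # number of hash functions
        self.bits = 0   # a Python int serves as the bit array, initially all 0

    def _positions(self, item):
        # Assumption: k "independent" hash functions are simulated by salting
        # BLAKE2b with the function number i.
        data = str(item).encode("utf-8")
        for i in range(self.k):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=i.to_bytes(2, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.m

    def add(self, item):
        # Setting a bit that is already 1 changes nothing.
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # True means "possibly in the set"; False means "definitely not".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


bf = BloomFilter(m=64, k=3)
bf.add("x1")
bf.add("x2")
print("x1" in bf)  # True: an added element is never reported as absent
```

A `False` answer is always reliable, while a `True` answer still has to be confirmed against the real data if exactness matters.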
Advantages: compared with other data structures, the Bloom Filter has a large advantage in both space and time. For fixed m and k, its storage space and its insertion/query time are constant. The hash functions are independent of one another, which makes a parallel hardware implementation convenient. A Bloom Filter does not need to store the elements themselves, which is an advantage in settings with strict confidentiality requirements. A Bloom Filter can represent the universal set, which other data structures cannot. For two Bloom Filters with the same k and m that use the same set of hash functions, intersection, union, and difference operations can be carried out with bit operations.
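As a small illustration of combining filters with bit operations, the sketch below covers the union case only, where a bitwise OR of two filters gives exactly the filter of the combined set. The helper names `positions` and `make_filter`, the parameters M and K, and the salted BLAKE2b hashing are assumptions for the example.

```python
import hashlib

M, K = 64, 3  # both filters must share m, k and the same hash functions


def positions(item, m=M, k=K):
    # Assumption: salted BLAKE2b stands in for k independent hash functions.
    data = str(item).encode("utf-8")
    for i in range(k):
        digest = hashlib.blake2b(
            data, digest_size=8, salt=i.to_bytes(2, "little")
        ).digest()
        yield int.from_bytes(digest, "little") % m


def make_filter(items):
    bits = 0
    for item in items:
        for pos in positions(item):
            bits |= 1 << pos
    return bits


a = make_filter(["x1", "x2"])
b = make_filter(["x3"])

# A bitwise OR of the two bit arrays is the filter of the union of the sets.
union = a | b
print(union == make_filter(["x1", "x2", "x3"]))  # True
```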
Disadvantages: the disadvantages of the Bloom Filter are as obvious as its advantages. The false positive rate is one of them: as the number of stored elements increases, the false positive rate increases with it. If the number of elements is very small, however, a hash table is sufficient.
In addition, elements generally cannot be deleted from a Bloom Filter. It is natural to think of turning the bit array into an integer array: each insertion increments the corresponding counters by 1, and a deletion decrements them. Deleting elements safely is not that simple, however. First, we must be sure that the element being deleted is really in the Bloom Filter, which the filter alone cannot guarantee. Counter wraparound also causes problems.
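The counter-based variant mentioned above is commonly called a counting Bloom filter. The following is a rough sketch under stated assumptions: the class name, the 4-bit counter limit `max_count=15`, the choice to leave saturated counters untouched to sidestep wraparound, and the salted BLAKE2b hashing are all illustrative and not taken from the patent.

```python
import hashlib


class CountingBloomFilter:
    """Sketch of a Bloom filter whose bits are replaced by small counters."""

    def __init__(self, m, k, max_count=15):
        self.m, self.k, self.max_count = m, k, max_count
        self.counters = [0] * m

    def _positions(self, item):
        # Distinct positions only, so add and remove stay symmetric when two
        # hash functions collide on the same slot.
        data = str(item).encode("utf-8")
        return {
            int.from_bytes(
                hashlib.blake2b(
                    data, digest_size=8, salt=i.to_bytes(2, "little")
                ).digest(),
                "little",
            ) % self.m
            for i in range(self.k)
        }

    def add(self, item):
        for pos in self._positions(item):
            if self.counters[pos] < self.max_count:  # saturate, never wrap
                self.counters[pos] += 1

    def remove(self, item):
        positions = self._positions(item)
        if any(self.counters[pos] == 0 for pos in positions):
            raise KeyError(item)  # definitely never added
        # Caveat from the text: passing this check does not prove the item was
        # added; removing a false positive would corrupt other entries.
        for pos in positions:
            if self.counters[pos] < self.max_count:  # saturated counters stay
                self.counters[pos] -= 1

    def __contains__(self, item):
        return all(self.counters[pos] > 0 for pos in self._positions(item))


cbf = CountingBloomFilter(m=64, k=3)
cbf.add("a")
cbf.add("b")
cbf.remove("a")
print("b" in cbf)  # True: removing "a" does not disturb "b"
```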
Summary of the invention
In view of this, the present invention aims to create a multi-column index based on the Bloom Filter and to apply it in a column-store database.
To achieve this goal, the problem of the false positive rate inherent in the Bloom Filter must be solved. The technical solution adopted by the invention is to build one Bloom Filter for each record, specifically:
A method for creating a Bloom Filter-based multi-column index in a column-store database comprises the following steps:
1) select the column objects on which the index is to be created;
2) read the record values from the column objects and encode them;
3) map the encoded values using K hash functions;
4) set bits in the binary array of the Bloom Filter according to the mapped values, obtaining the multi-column index.
Preferably, in step 2), the data in the column objects is read one DataCell data block at a time.
Preferably, the number K of hash functions, the number n of indexed columns, and the number m of bits in the Bloom Filter array satisfy the following relationship: K = 9, m/n = 20.
A method for using the multi-column index created above in column-store database queries comprises the following steps:
1) take the query condition and create a Bloom Filter for it;
2) AND the created Bloom Filter with each Bloom Filter in the multi-column index; if the result differs from the query's Bloom Filter, discard the index record directly; if it is identical, put the index record into the recheck set;
3) after the scan of the multi-column index is finished, fetch the actual values and check them again for verification.
A query method that combines the multi-column index created above with the smart index of the column-store database comprises the following steps:
1) query through the smart index and retrieve the table;
2) query through the multi-column index; the index entries corresponding to DataCell data blocks that are full misses are skipped directly and not scanned.
The multi-column index realized by the invention has the following advantages: it does not store actual values, so its space footprint is small; index lookups are fast and take a fixed amount of time; and a single index supports queries on any combination of columns.
The creation method of the invention keeps the false positive rate controllable: the false positive rate of the Bloom Filter does not rise as the number of records grows. Moreover, even if a possible false positive occurs during index matching (the index judges that a record may exist), such records are rechecked during use, so the index as a whole never produces a wrong result.
Like the data itself, the index content can be organized by data block DC (DataCell). During index matching, the Bloom Filter index of a specific data block can be loaded and matched on demand, rather than loading the entire index.
Bloom Filter index matching can make use of the scan results of the smart index specific to the column-store database: data blocks already filtered out by the smart index need no further index matching.
With the above technical solution, records that cannot match are excluded to the greatest possible extent, which reduces the disk accesses needed for scanning and thereby improves database performance.
Description of the drawings
The accompanying drawings, which form a part of the invention, are provided for further understanding of the invention. The illustrative embodiments of the invention and their descriptions serve to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a flow diagram of index creation according to an embodiment of the invention;
Fig. 2 is a flow diagram of index use according to an embodiment of the invention;
Fig. 3 is a schematic diagram of combining the index of an embodiment of the invention with the smart index of the column-store database;
Fig. 4 shows the initial state of a Bloom Filter;
Fig. 5 shows the state of a Bloom Filter when k = 3 and two hash functions select the same position;
Fig. 6 shows the state of a Bloom Filter when y1 is not an element of the set.
Detailed description of the embodiments
It should be noted that, where no conflict arises, the embodiments of the invention and the features of the embodiments may be combined with one another.
The invention is described in detail below with reference to the drawings and in conjunction with the embodiments.
A Bloom Filter index can be created on a single column or on multiple columns.
Creating the index:
As shown in Fig. 1, the method includes the following steps:
Step 1, prepare the column objects on which the index is to be created, and initialize the memory required by the index;
Step 2, read the record values from the column objects and encode them; specifically, the data in the column objects is read one DataCell data block at a time, in order;
Step 3, map the encoded values using K hash functions;
Step 4, set bits in the binary array of the Bloom Filter according to the mapped values, obtaining the multi-column index.
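The four creation steps can be sketched as follows. This is only an illustration under stated assumptions: the patent does not specify the encoding or the hash functions, so `encode` (which prefixes each value with its column name so that equal values in different columns map differently), the salted BLAKE2b hashing, the list-of-dicts representation of a DataCell, and all function names are hypothetical.

```python
import hashlib

K = 9                  # number of hash functions (the preferred value)
BITS_PER_COLUMN = 20   # m / n = 20 (the preferred ratio)


def encode(column, value):
    # Hypothetical encoding: tag the value with its column name.
    return f"{column}={value!r}".encode("utf-8")


def bit_positions(encoded, m, k=K):
    # Assumption: salted BLAKE2b stands in for the K hash functions.
    for i in range(k):
        digest = hashlib.blake2b(
            encoded, digest_size=8, salt=i.to_bytes(2, "little")
        ).digest()
        yield int.from_bytes(digest, "little") % m


def record_filter(record, columns, m):
    """Steps 2 to 4 for one record: encode, map, set bits."""
    bits = 0
    for column in columns:
        for pos in bit_positions(encode(column, record[column]), m):
            bits |= 1 << pos
    return bits


def build_index(datacells, columns):
    """One Bloom filter per record, grouped by DataCell like the data itself."""
    m = BITS_PER_COLUMN * len(columns)
    return [[record_filter(r, columns, m) for r in cell] for cell in datacells]


# Step 1: choose the columns to index (toy data, two DataCells).
columns = ["col1", "col2", "col3"]
datacells = [
    [{"col1": 10, "col2": 20, "col3": 30}, {"col1": 11, "col2": 21, "col3": 31}],
    [{"col1": 12, "col2": 22, "col3": 32}],
]
index = build_index(datacells, columns)
```

Because every column value sets its own bits in the record's filter, a filter built from any subset of column conditions can later be tested against the same index, which is what lets one index serve arbitrary column combinations.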
To keep the false positive rate controllable, suitable values must be chosen for k and for the ratio of m to n. In this embodiment, the number K of hash functions, the number n of indexed columns, and the number m of bits in the Bloom Filter array satisfy the following relationship: K = 9, m/n = 20, which keeps the false positive rate at about one in ten thousand.
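The figure of about one in ten thousand is consistent with the standard approximation for the false positive rate p of a Bloom filter holding n elements in m bits with k hash functions. The formula is not stated in the patent and assumes independent, uniformly distributed hash functions:

```latex
p \approx \left(1 - e^{-kn/m}\right)^{k}
  = \left(1 - e^{-9/20}\right)^{9}
  \approx 0.3624^{9}
  \approx 1.1 \times 10^{-4}
```

Since each record has its own filter holding only that record's n column values, p depends on k and m/n alone and not on the number of records in the table.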
Using the index:
As shown in Fig. 2, the method includes the following steps:
Step 1, take the query condition (for example: col1=10 and col3=30);
Step 2, create a Bloom Filter for the query condition (00100100010010100100);
Step 3, take records one by one from the multi-column index (for example: 00110101010010101100) and AND the created Bloom Filter with each Bloom Filter in the multi-column index, for example: (00100100010010100100) & (00110101010010101100).
If the result differs from the Bloom Filter value of the query condition, the row does not satisfy the query condition, and the index record is discarded directly;
If the result is identical to the Bloom Filter value of the query condition, the record may satisfy the query condition or may not (a false positive has occurred), so the index record must be put into the recheck set.
After the scan of the multi-column index is finished, the actual values are fetched and checked again to ensure that the result is correct.
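The matching and recheck steps can be traced with the bit strings from the example above. In this sketch only the query filter and the first index entry come from the text; the other two index entries, the row values, and the function names are hypothetical, chosen so that one entry is discarded and one turns out to be a false positive.

```python
# Query filter for "col1=10 and col3=30" (bit string from the text).
query_bf = int("00100100010010100100", 2)

# Per-record filters: the first is the example record from the text, the
# other two are made up for illustration.
index = [
    int("00110101010010101100", 2),
    int("01000001000100001001", 2),
    int("00101100010010100110", 2),
]

# Actual column values, normally fetched only for rows in the recheck set.
rows = [
    {"col1": 10, "col2": 20, "col3": 30},
    {"col1": 11, "col2": 21, "col3": 31},
    {"col1": 10, "col2": 5, "col3": 31},
]


def recheck_set(query_bf, index):
    # A record is a candidate only if ANDing leaves the query filter intact,
    # i.e. every bit of the query filter is also set in the record's filter.
    return [i for i, bf in enumerate(index) if (bf & query_bf) == query_bf]


def verify(candidates, rows, predicate):
    # Final check against the real values removes any false positives.
    return [i for i in candidates if predicate(rows[i])]


candidates = recheck_set(query_bf, index)
result = verify(candidates, rows, lambda r: r["col1"] == 10 and r["col3"] == 30)
print(candidates)  # [0, 2]
print(result)      # [0]
```

Row 2 passes the bit test but fails the value check, which is exactly the false positive case the recheck set exists to catch.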
Combining with the smart index:
A column-store database has a smart index that can quickly classify the DataCells of a table into three states: full hit, partial hit, and full miss. DataCells that are full misses need no index scan.
To make use of the filtering result of the smart index, the multi-column index must also be organized by DataCell; the index entries corresponding to full-miss DataCells can be skipped directly and not scanned.
Specifically, the query method that combines the multi-column index with the smart index of the column-store database, as shown in Fig. 3, includes the following steps:
Step 1, query through the smart index and retrieve the table;
Step 2, query through the multi-column index; the index entries corresponding to DataCell data blocks that are full misses are skipped directly and not scanned.
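A rough sketch of the combined scan follows. The `Hit` enum, the `scan` function, and the toy data are hypothetical; the text only states that full-miss DataCells are skipped, so full-hit and partial-hit DataCells are both scanned here, which is an assumption.

```python
from enum import Enum


class Hit(Enum):
    FULL = "full hit"
    PARTIAL = "partial hit"
    NONE = "full miss"


def scan(datacell_states, datacell_filters, query_bf):
    """Return (datacell, row) pairs that still need a recheck.

    datacell_states[i] is the smart-index verdict for DataCell i, and
    datacell_filters[i] holds the per-record Bloom filters of that DataCell.
    """
    recheck = []
    for cell_id, state in enumerate(datacell_states):
        if state is Hit.NONE:
            continue  # full miss: its index entries are never loaded or scanned
        for row_id, bf in enumerate(datacell_filters[cell_id]):
            if (bf & query_bf) == query_bf:
                recheck.append((cell_id, row_id))
    return recheck


states = [Hit.NONE, Hit.PARTIAL]
filters = [[0b1111], [0b0101, 0b0010]]
print(scan(states, filters, query_bf=0b0101))  # [(1, 0)]
```

DataCell 0 is skipped even though its only filter would have passed the bit test, which shows where the savings come from: its index entries are never read at all.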
The above are merely preferred embodiments of the invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims (6)

1. A method for creating a Bloom Filter-based multi-column index in a column-store database, characterized by comprising the following steps:
1) select the column objects on which the index is to be created;
2) read the record values from the column objects and encode them;
3) map the encoded values using K hash functions;
4) set bits in the binary array of the Bloom Filter according to the mapped values, obtaining the multi-column index.
2. The creation method according to claim 1, characterized in that: in step 2), the data in the column objects is read one DataCell data block at a time.
3. The creation method according to claim 1, characterized in that: the number K of hash functions, the number n of indexed columns, and the number m of bits in the Bloom Filter array satisfy the following relationship: K = 9, m/n = 20.
4. A method for using, in column-store database queries, the multi-column index created by the creation method according to claim 1, characterized by comprising the following steps:
1) take the query condition and create a Bloom Filter for it;
2) AND the created Bloom Filter with each Bloom Filter in the multi-column index; if the result differs from the query's Bloom Filter, discard the index record directly; if it is identical, put the index record into the recheck set.
5. The method of use according to claim 4, characterized in that it further comprises step 3): after the scan of the multi-column index is finished, fetch the actual values and check them again for verification.
6. A query method combining the multi-column index created by the creation method according to claim 2 with the smart index of the column-store database, characterized by comprising the following steps:
1) query through the smart index and retrieve the table;
2) query through the multi-column index; the index entries corresponding to DataCell data blocks that are full misses are skipped directly and not scanned.
CN201711470231.9A 2017-12-29 2017-12-29 Method for creating and using a Bloom Filter-based multi-column index in a column-store database Pending CN108460074A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711470231.9A CN108460074A (en) 2017-12-29 2017-12-29 Method for creating and using a Bloom Filter-based multi-column index in a column-store database

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711470231.9A CN108460074A (en) 2017-12-29 2017-12-29 Method for creating and using a Bloom Filter-based multi-column index in a column-store database

Publications (1)

Publication Number Publication Date
CN108460074A (en) 2018-08-28

Family

ID=63221171

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711470231.9A Pending CN108460074A (en) 2017-12-29 2017-12-29 Method for creating and using a Bloom Filter-based multi-column index in a column-store database

Country Status (1)

Country Link
CN (1) CN108460074A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111538984A (en) * 2020-04-17 2020-08-14 南京东科优信网络安全技术研究院有限公司 Fast matching device and method for credible white list
CN117076466A (en) * 2023-10-18 2023-11-17 河北因朵科技有限公司 Rapid data indexing method for large archive database

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8352490B2 (en) * 2009-10-22 2013-01-08 Vmware, Inc. Method and system for locating update operations in a virtual machine disk image
CN104751055A (en) * 2013-12-31 2015-07-01 北京启明星辰信息安全技术有限公司 Method, device and system for detecting distributed malicious codes on basis of textures
CN105354323A (en) * 2015-11-16 2016-02-24 天津南大通用数据技术股份有限公司 Method and device for increasing precise inquiry speed of columnar storage database by using two-stage filtration
CN107491487A (en) * 2017-07-17 2017-12-19 中国科学院信息工程研究所 A kind of full-text database framework and bitmap index establishment, data query method, server and medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111538984A (en) * 2020-04-17 2020-08-14 南京东科优信网络安全技术研究院有限公司 Fast matching device and method for credible white list
CN117076466A (en) * 2023-10-18 2023-11-17 河北因朵科技有限公司 Rapid data indexing method for large archive database
CN117076466B (en) * 2023-10-18 2023-12-29 河北因朵科技有限公司 Rapid data indexing method for large archive database

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180828