CN108140050A

CN108140050A - Method and device for filtering files using a Bloom filter

Info

Publication number: CN108140050A
Application number: CN201680059828.1A
Authority: CN
Inventors: 李勇
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2016-04-25
Filing date: 2016-04-25
Publication date: 2018-06-08
Anticipated expiration: 2036-04-25
Also published as: CN108140050B; WO2017185210A1

Abstract

An embodiment of the invention discloses a method for filtering files using a Bloom filter. The method is applied to a non-relational storage system and addresses the prior-art problem that filtering files with a Bloom filter consumes a large amount of computing resources. The method includes: determining a first row key count corresponding to a first Bloom filter; obtaining a first map, where the first map includes the correspondence between a coarse-grained row key count and the size of the Bloom filter corresponding to that count, and the correspondence between the size of the Bloom filter and the hash value set corresponding to that size; querying the first map, according to the first row key count, for whether the size of the first Bloom filter exists; and, when the size of the first Bloom filter exists in the first map, filtering files according to the hash value set corresponding to the size of the first Bloom filter.

Description

Method and device for filtering files using a Bloom filter

Technical field

The present invention relates to the field of data query, and in particular to a method and device for filtering files using a Bloom filter.

Background technique

Non-relational storage systems based on the log-structured merge-tree (LSM-Tree) are widely used on the Internet. As shown in Fig. 1-a, data sent by a client is first recorded temporarily in memory; when the amount of recorded data reaches a certain threshold, the data is written to disk sequentially in a single pass, which reduces the time the disk head spends positioning and relieves pressure on the disk. In addition, LSM-Tree-based non-relational storage systems generally record data using a key-value data model. As shown in Fig. 1-b, the data of all fields belonging to one row key (rowkey) is scattered across multiple files. A Bloom filter is an optimization that quickly filters out the files that cannot contain the rowkey being queried.

Before data is written to a file, each rowkey is passed through k different hash functions to produce k different hash values, and the k hash values are inserted into an empty Bloom filter (bloomfilter).

As shown in Fig. 1-c, the number of elements used for an empty Bloom filter is the actual number of elements, so the calculated hash values are essentially all different and the results of the hash functions cannot be reused. As a result, when files are filtered using Bloom filters, k hash values must be calculated for each rowkey, which consumes a large amount of computing resources.

Summary of the invention

Embodiments of the invention provide a method and device for filtering files using a Bloom filter, to solve the prior-art problem that filtering files with a Bloom filter consumes a large amount of computing resources.

A first aspect provides a method for filtering files using a Bloom filter. The method is applied to a non-relational storage system and includes: determining a first row key count corresponding to a first Bloom filter, where the first Bloom filter is the Bloom filter corresponding to the file to be filtered and the first row key count is the coarse-grained row key count of the first Bloom filter; obtaining a first map, where the first map includes the correspondence between a coarse-grained row key count and the size of the Bloom filter corresponding to that count, and the correspondence between the size of the Bloom filter and the hash value set corresponding to that size; querying the first map, according to the first row key count, for whether the size of the first Bloom filter exists; and, when the size of the first Bloom filter exists in the first map, filtering files according to the hash value set corresponding to the size of the first Bloom filter.

It can be seen that, when a file is filtered using a Bloom filter, the first row key count corresponding to the first Bloom filter (that is, the Bloom filter corresponding to the file to be filtered) is determined first. The first row key count is the coarse-grained row key count of the first Bloom filter: the actual row key count of the first Bloom filter is assigned to an interval according to a target rule, and the right boundary value of that interval is used as the first row key count. The first map is then queried, according to the first row key count, for whether the size of the first Bloom filter exists. When it exists, files are filtered according to the hash value set corresponding to the size of the first Bloom filter. Using the coarse-grained row key count effectively increases the number of Bloom filters that share the same size, so the hash value set corresponding to that size is reused and the overhead of filtering files is reduced.

In some possible implementations, when the size of the first Bloom filter does not exist in the first map, the method further includes: determining the size of the first Bloom filter according to the first row key count; determining the hash value set corresponding to the size of the first Bloom filter according to each row key corresponding to the first row key count; and saving the correspondence between the first row key count and the size of the first Bloom filter, and the correspondence between the size of the first Bloom filter and the hash value set corresponding to that size.

It can be seen that, when the size of the first Bloom filter does not exist in the first map, the size of the first Bloom filter can be determined according to the first row key count. The hash value of each row key corresponding to the first row key count is then determined to generate the hash value set corresponding to that size, and the correspondences among the first row key count, the size of the first Bloom filter, and the hash value set are saved. The saved entries can be used the next time a file whose Bloom filter has the same size is filtered, which reduces the computing resources consumed by filtering files.

In other possible implementations, determining the first row key count corresponding to the first Bloom filter includes: obtaining the actual row key count of the first Bloom filter; assigning the actual row key count of the first Bloom filter to a corresponding first interval according to a target rule; and taking the right boundary value of the first interval as the first row key count.

It can be seen that the actual row key count of the first Bloom filter is assigned to a corresponding first interval according to the target rule, and the right boundary value of the first interval is taken as the first row key count. For example, assume that the actual row key count of the first Bloom filter is 230 and that the target rule divides row key counts into intervals of 300, that is, 1 to 300 is the first interval, 301 to 600 is the second interval, 601 to 900 is the third interval, and so on. The right boundary value 300 of the first interval, which contains 230, is then taken as the first row key count.

In other possible implementations, before obtaining the first map, the method further includes: determining the coarse-grained row key count of a Bloom filter; determining the size of the Bloom filter according to its coarse-grained row key count; determining the hash value set corresponding to the size of the Bloom filter according to each row key corresponding to the coarse-grained row key count; and saving the correspondence between the coarse-grained row key count and the size of the corresponding Bloom filter, and the correspondence between the size of the Bloom filter and the hash value set corresponding to that size.

In practical applications, the first map must be generated before it can be obtained. The first map is generated as follows: the coarse-grained row key count of each current Bloom filter is determined first; the size of each Bloom filter is determined according to its coarse-grained row key count; the hash value of each row key corresponding to the coarse-grained row key count is then determined, and all of these hash values form the hash value set; finally, the correspondences among the coarse-grained row key count, the size of the Bloom filter, and the hash value set are saved. When a file is later filtered and a Bloom filter of the same size is found in the first map, the hash value set is reused, which reduces the computing resources consumed by filtering files.

In other possible implementations, determining the coarse-grained row key count of the Bloom filter includes: obtaining the actual row key count of the Bloom filter; assigning the actual row key count of the Bloom filter to the corresponding interval according to the target rule; and taking the right boundary value of that interval as the coarse-grained row key count of the Bloom filter.

It can be seen that the coarse-grained row key count of each current Bloom filter is determined as follows: the actual row key count of the Bloom filter is obtained and assigned to the corresponding interval according to the target rule, and the right boundary value of that interval is taken as the coarse-grained row key count of the Bloom filter. For example, suppose there are currently five Bloom filters whose actual row key counts are 211, 340, 532, 160, and 832. Assuming that the target rule places every 300 row keys in one interval, the corresponding intervals are 0 to 300, 301 to 600, and 601 to 900, and the coarse-grained row key counts corresponding to the five actual row key counts are 300, 600, 600, 300, and 900, respectively. Of course, in practical applications, the granularity of the target rule can be chosen according to the actual row key counts of the Bloom filters; for example, when the actual row key counts of the Bloom filters are close to one another, the interval width can be reduced. This is not specifically limited here.
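The interval rule in the example above can be sketched in a few lines of Python. This is only an illustration under the stated target rule (intervals of 300 row keys); the function name `coarse_row_key_count` and the `interval` parameter are hypothetical and do not appear in the original text.

```python
def coarse_row_key_count(actual_count: int, interval: int = 300) -> int:
    """Return the right boundary of the interval that contains actual_count."""
    if actual_count <= 0:
        raise ValueError("actual_count must be positive")
    # Ceiling division selects the interval; multiplying gives its right boundary.
    return -(-actual_count // interval) * interval


# The five actual row key counts from the example in the text.
actual_counts = [211, 340, 532, 160, 832]
coarse_counts = [coarse_row_key_count(n) for n in actual_counts]
print(coarse_counts)
```

A count that already sits on a boundary, such as 300, maps to itself, which matches intervals that are closed on the right.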

In other possible implementations, after the hash value set corresponding to the size of the Bloom filter is determined according to each row key corresponding to the coarse-grained row key count, the method further includes: inserting each hash value in the hash value set into the corresponding Bloom filter, and setting the position corresponding to each hash value to 1.

In practical applications, after the hash values corresponding to each row key are determined, each hash value is inserted into the corresponding Bloom filter and the position corresponding to each hash value is set to 1, which facilitates subsequent file filtering with the Bloom filter. For example, the hash values corresponding to the row key being queried are determined, and the corresponding positions of the Bloom filter are then checked. If the value at any of these positions is 0, the row key is determined not to be in the file, so the file does not need to be scanned, which reduces the overhead of filtering the file.
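The position check described above can be illustrated as follows. This is a minimal sketch, assuming the Bloom filter is held as a plain sequence of 0/1 values and that the positions for the queried row key have already been computed; the name `may_contain` is hypothetical.

```python
from typing import Iterable, Sequence


def may_contain(bits: Sequence[int], positions: Iterable[int]) -> bool:
    """Return False if the row key is definitely absent, True if it may be present."""
    # A single 0 at any checked position proves the key was never inserted.
    return all(bits[p] == 1 for p in positions)


bits = [0] * 16
for p in (3, 7, 12):          # positions set to 1 when some row key was inserted
    bits[p] = 1

hit = may_contain(bits, [3, 7, 12])    # every position is 1: the file must be scanned
miss = may_contain(bits, [3, 8, 12])   # position 8 is 0: the file can be skipped
```

A result of `True` only means the file may contain the row key, since Bloom filters allow false positives but never false negatives.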

A second aspect of the present invention provides a Bloom filter management device. The Bloom filter management device is applied to a non-relational storage system and includes:

a determining module, configured to determine a first row key count corresponding to a first Bloom filter, where the first Bloom filter is the Bloom filter corresponding to the file to be filtered and the first row key count is the coarse-grained row key count of the first Bloom filter;

an obtaining module, configured to obtain a first map, where the first map includes the correspondence between a coarse-grained row key count and the size of the Bloom filter corresponding to that count, and the correspondence between the size of the Bloom filter and the hash value set corresponding to that size;

a query module, configured to query, according to the first row key count determined by the determining module, the first map obtained by the obtaining module for whether the size of the first Bloom filter exists; and

a filtering module, configured to filter files according to the hash value set corresponding to the size of the first Bloom filter when the query module finds that the size of the first Bloom filter exists in the first map.

It can be seen that the determining module first determines the first row key count corresponding to the first Bloom filter (that is, the Bloom filter corresponding to the file to be filtered). The first row key count is the coarse-grained row key count of the first Bloom filter: the actual row key count of the first Bloom filter is assigned to an interval according to a target rule, and the right boundary value of that interval is used as the first row key count. The query module then queries the first map, according to the first row key count, for whether the size of the first Bloom filter exists. When it exists, the filtering module filters files according to the hash value set corresponding to the size of the first Bloom filter. Using the coarse-grained row key count effectively increases the number of Bloom filters that share the same size, so the hash value set corresponding to that size is reused and the overhead of filtering files is reduced.

In some possible implementations, when the size of the first Bloom filter does not exist in the first map, the Bloom filter management device further includes the following:

the determining module is further configured to determine the size of the first Bloom filter according to the first row key count, and to determine the hash value set corresponding to the size of the first Bloom filter according to each row key corresponding to the first row key count; and

a saving module, configured to save the correspondence between the first row key count and the size of the first Bloom filter, and the correspondence between the size of the first Bloom filter and the hash value set corresponding to that size.

It can be seen that, when the size of the first Bloom filter does not exist in the first map, the determining module can determine the size of the first Bloom filter according to the first row key count and then determine the hash value of each row key corresponding to the first row key count, generating the hash value set corresponding to that size. The saving module then saves the correspondences among the first row key count, the size of the first Bloom filter, and the hash value set. The saved entries can be used the next time a file whose Bloom filter has the same size is filtered, which reduces the computing resources consumed by filtering files.

In other possible implementations, the determining module is specifically configured to: obtain the actual row key count of the first Bloom filter; assign the actual row key count of the first Bloom filter to a corresponding first interval according to a target rule; and take the right boundary value of the first interval as the first row key count.

It can be seen that the actual row key count of the first Bloom filter is assigned to a corresponding first interval according to the target rule, and the right boundary value of the first interval is taken as the first row key count. For example, assume that the actual row key count of the first Bloom filter is 230 and that the target rule divides row key counts into intervals of 300, that is, 1 to 300 is the first interval, 301 to 600 is the second interval, 601 to 900 is the third interval, and so on. The right boundary value 300 of the first interval, which contains 230, is then taken as the first row key count.

In other possible implementations, the determining module is further configured to, before the obtaining module obtains the first map: determine the coarse-grained row key count of a Bloom filter; determine the size of the Bloom filter according to its coarse-grained row key count; and determine the hash value set corresponding to the size of the Bloom filter according to each row key corresponding to the coarse-grained row key count; and

the saving module is further configured to save the correspondence between the coarse-grained row key count and the size of the corresponding Bloom filter, and the correspondence between the size of the Bloom filter and the hash value set corresponding to that size.

In practical applications, the first map must be generated before it can be obtained. The first map is generated as follows: the determining module first determines the coarse-grained row key count of each current Bloom filter, determines the size of each Bloom filter according to its coarse-grained row key count, and then determines the hash value of each row key corresponding to the coarse-grained row key count, with all of these hash values forming the hash value set. The saving module then saves the correspondences among the coarse-grained row key count, the size of the Bloom filter, and the hash value set. When a file is later filtered and a Bloom filter of the same size is found in the first map, the hash value set is reused, which reduces the computing resources consumed by filtering files.

In other possible implementations, the determining module is specifically configured to: obtain the actual row key count of the Bloom filter; assign the actual row key count of the Bloom filter to the corresponding interval according to the target rule; and take the right boundary value of that interval as the coarse-grained row key count of the Bloom filter.

It can be seen that the coarse-grained row key count of each current Bloom filter is determined as follows: the actual row key count of the Bloom filter is obtained and assigned to the corresponding interval according to the target rule, and the right boundary value of that interval is taken as the coarse-grained row key count of the Bloom filter. For example, suppose there are currently five Bloom filters whose actual row key counts are 211, 340, 532, 160, and 832. Assuming that the target rule places every 300 row keys in one interval, the corresponding intervals are 0 to 300, 301 to 600, and 601 to 900, and the coarse-grained row key counts corresponding to the five actual row key counts are 300, 600, 600, 300, and 900, respectively. Of course, in practical applications, the granularity of the target rule can be chosen according to the actual row key counts of the Bloom filters; for example, when the actual row key counts of the Bloom filters are close to one another, the interval width can be reduced. This is not specifically limited here.

In other possible implementations, the Bloom filter management device further includes an insertion module, configured to, after the determining module determines the hash value set corresponding to the size of the Bloom filter according to each row key corresponding to the coarse-grained row key count, insert each hash value in the hash value set into the corresponding Bloom filter and set the position corresponding to each hash value to 1.

In practical applications, after the hash values corresponding to each row key are determined, each hash value is inserted into the corresponding Bloom filter and the position corresponding to each hash value is set to 1, which facilitates subsequent file filtering with the Bloom filter. For example, the hash values corresponding to the row key being queried are determined, and the corresponding positions of the Bloom filter are then checked. If the value at any of these positions is 0, the row key is determined not to be in the file, so the file does not need to be scanned, which reduces the overhead of filtering the file.

A third aspect of the present invention provides a Bloom filter management device, including:

one or more processors, a memory, a bus system, and a transceiver, where the processor, the memory, and the transceiver are connected through the bus system;

where one or more programs are stored in the memory, the one or more programs include instructions, and the instructions, when executed by the Bloom filter management device, cause the Bloom filter management device to perform the method according to the first aspect or any possible implementation of the first aspect.

Brief description of the drawings

Fig. 1-a is a schematic diagram of an embodiment of the data flow in a prior-art non-relational storage system;

Fig. 1-b is a structural schematic diagram of a prior-art key-value storage structure;

Fig. 1-c is a schematic diagram of an embodiment of generating a Bloom filter in the prior art;

Fig. 2-a is a schematic diagram of an embodiment of the data flow in a non-relational storage system in an embodiment of the present invention;

Fig. 2-b is a flow diagram of generating a Bloom filter in an embodiment of the present invention;

Fig. 2-c is a flow diagram of filtering files in an embodiment of the present invention;

Fig. 3 is a schematic diagram of an embodiment of filtering files using a Bloom filter in an embodiment of the present invention;

Fig. 4 is a schematic diagram of an embodiment of generating a Bloom filter in an embodiment of the present invention;

Fig. 5 is a schematic diagram of an embodiment of inserting elements into a Bloom filter in an embodiment of the present invention;

Fig. 6 is a schematic diagram of another embodiment of filtering files using a Bloom filter in an embodiment of the present invention;

Fig. 7 is a structural schematic diagram of a Bloom filter management device in an embodiment of the present invention;

Fig. 8 is another structural schematic diagram of a Bloom filter management device in an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.

The terms "first", "second", "third", "fourth", and so on (if present) in the description, the claims, and the above drawings are used to distinguish similar objects and are not used to describe a particular order or sequence. It should be understood that data used in this way are interchangeable under appropriate circumstances, so that the embodiments described herein can be implemented in an order other than that illustrated or described herein. In addition, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.

Before the embodiments of the present invention are introduced, the system to which the technical solution applies is described. The technical solution of the present invention is applied to an LSM-Tree-based non-relational storage system. An LSM-Tree is a tree data structure based on a log: data is written to the log before it is written to the memory table, which prevents data loss if the machine goes down. A non-relational storage system has no strict schema definition, and the memory table structure can be extended freely according to service characteristics. Data is generally stored in key-value form, where the rowkey identifies a row of data in key-value storage and is used for data routing and file index lookup. The key-value structure is flexible: fields can be added, or data already stored in the non-relational storage system can be modified, at any time. However, this means that the data of all fields belonging to one rowkey may be scattered across multiple files. When a rowkey must be looked up and it is not known in advance which files contain it, a very large number of files may have to be searched. To reduce the number of files queried, the non-relational storage system generally uses a merge mechanism, that is, multiple data files are merged so that the records of one rowkey are concentrated in as few files as possible.

Fig. 2-a is a structural schematic diagram of the LSM-Tree-based non-relational storage system in an embodiment of the present invention. In this system, data read and write operations are all initiated by the client. In the data write process, when the amount of data in the in-memory memory table (Memtable) exceeds the Memtable threshold, the Bloom filter management device is triggered. The Bloom filter management device generates a Bloom filter, writes the data to a file, and stores the file on disk. When files later need to be filtered, they are filtered through the Bloom filter. It can be seen that the main functions of the Bloom filter management device include generating Bloom filters, filtering files, and managing files on disk.

The main modules managed by the Bloom filter management device are described below, namely the Bloom filter generation module and the Bloom filter filtering module.

A. Bloom filter generation module

This module resides in the memory of the server and is managed by the Bloom filter management device. Its main task is to generate the Bloom filter and insert elements (that is, the hash values of rowkeys) when a Memtable is written to a file. Different Memtables contain different numbers of rowkeys. The Bloom filter generation module first assigns the rowkey count of each Memtable to the corresponding interval according to the rowkey intervals divided in advance, then takes the right boundary value of the interval that contains the actual rowkey count, calculates the size of the Bloom filter from the right boundary value, and inserts the elements.

In practical applications, Fig. 2-b shows a flow diagram of generating a Bloom filter. The detailed process is as follows:

101. Data update operation: the server receives a data update request (including update, insertion, deletion, modification, and so on) sent by the client.

102. Write-to-memory operation: the data corresponding to the update request is temporarily cached in the in-memory memory table (Memtable).

103. Determine whether the amount of data in the memory table has reached the threshold, where the memory table has a preset threshold.

104. When the amount of data in the memory table reaches the threshold, start the write-to-file operation, that is, write the data in the memory table to a file.

105. Determine the right boundary value of the interval that contains the actual rowkey count of the memory table. That is, a Bloom filter must be generated before the file is generated: rowkey counts are divided into intervals, the actual rowkey count of each memory table falls into a corresponding interval, and the right boundary value of that interval is taken.

106. Calculate the size of the Bloom filter from the right boundary value, and generate an empty Bloom filter.

107. Insert elements into the Bloom filter, that is, insert the hash values corresponding to the rowkeys into the Bloom filter.
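Steps 101 to 107 can be sketched as a toy memory table in Python. This is an illustration under several assumptions that are not in the original text: the threshold is expressed as a row count rather than a data volume, the Bloom filter is sized at a hypothetical 10 bits per row key, and the k positions are derived from a SHA-256 digest by double hashing. The class name `MemTable` and all parameter names are invented for the example.

```python
import hashlib
from typing import Dict, List, Tuple


class MemTable:
    """Toy memory table that flushes to a 'file' once it holds `threshold` rows."""

    def __init__(self, threshold: int, interval: int = 300,
                 bits_per_key: int = 10, k: int = 3) -> None:
        self.threshold = threshold
        self.interval = interval          # target rule: width of each interval
        self.bits_per_key = bits_per_key  # assumed sizing rule
        self.k = k                        # number of hash functions
        self.rows: Dict[bytes, bytes] = {}
        # Each flushed "file" is kept as (rows, Bloom filter bits).
        self.files: List[Tuple[Dict[bytes, bytes], bytearray]] = []

    def put(self, row_key: bytes, value: bytes) -> None:
        self.rows[row_key] = value                 # steps 101-102: cache the update
        if len(self.rows) >= self.threshold:       # step 103: threshold check
            self._flush()                          # step 104: write to file

    def _flush(self) -> None:
        count = len(self.rows)
        boundary = -(-count // self.interval) * self.interval   # step 105
        bits = bytearray(boundary * self.bits_per_key)          # step 106: empty filter
        for row_key in self.rows:                               # step 107: insert
            for pos in self._positions(row_key, len(bits)):
                bits[pos] = 1
        self.files.append((dict(self.rows), bits))
        self.rows.clear()

    def _positions(self, row_key: bytes, size: int) -> List[int]:
        digest = hashlib.sha256(row_key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % size for i in range(self.k)]


table = MemTable(threshold=5)
for i in range(5):
    table.put(b"row-%d" % i, b"value")
```

With five rows the right boundary is 300, so the flushed file carries a filter of 3000 positions even though it holds far fewer row keys. That is the trade the scheme makes: a somewhat larger filter in exchange for sizes that repeat across files.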

B. Bloom filter filtering module

This module also resides in the memory of the server and is likewise managed by the Bloom filter management device. It is mainly used to filter files with Bloom filters, reducing the number of files that need to be scanned. After generation, a Bloom filter is generally stored on disk as a file and is loaded into memory when it is used.

In practical applications, as shown in Fig. 2-c, the detailed process of filtering files with a Bloom filter is as follows:

201. Data query operation: receive a data query request and obtain the data file to be filtered.

202. Obtain the right boundary value of the interval that contains the rowkey count of the file.

203. Query the map according to the right boundary value. The map stores reusable hash values and has the structure map<BloomfilterSize, list<hash_code>>, where BloomfilterSize is the size of the coarse-grained Bloom filter (that is, the size of the Bloom filter calculated from the right boundary value) and list<hash_code> is the set of reusable hash values corresponding to that coarse-grained Bloom filter size.

204. Determine, by querying the map, whether reusable hash values exist for the rowkey.

205. If reusable hash values exist for the rowkey, read them directly and filter the file to be filtered.

206. If no reusable hash values exist for the rowkey, calculate the hash values from the rowkey and store them in the map. The absence of reusable hash values means that the hash values for this file have not yet been calculated; the hash values calculated from the rowkey are stored in the map for use in subsequent map queries.
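Steps 201 to 206 can be sketched as follows. This is one possible reading, not the patent's implementation: the map is kept per queried rowkey, the hash values are treated as bit positions, filters are sized at a hypothetical 10 bits per row key, and positions come from a SHA-256 digest via double hashing. All names (`DataFile`, `files_to_scan`, `reuse_map`, and so on) are invented for the example.

```python
import hashlib
from typing import Dict, List, NamedTuple, Tuple

K = 3               # number of hash functions (assumed)
INTERVAL = 300      # target rule: intervals of 300 row keys
BITS_PER_KEY = 10   # assumed sizing rule, not taken from the text


class DataFile(NamedTuple):
    name: str
    row_key_count: int   # actual rowkey count recorded with the file
    bits: bytearray      # Bloom filter, one byte per position for readability


def filter_size(count: int) -> int:
    """Size of the Bloom filter calculated from the right boundary value."""
    right_boundary = -(-count // INTERVAL) * INTERVAL
    return right_boundary * BITS_PER_KEY


def positions(row_key: bytes, size: int) -> List[int]:
    digest = hashlib.sha256(row_key).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % size for i in range(K)]


def make_file(name: str, row_keys: List[bytes]) -> DataFile:
    bits = bytearray(filter_size(len(row_keys)))
    for row_key in row_keys:
        for p in positions(row_key, len(bits)):
            bits[p] = 1
    return DataFile(name, len(row_keys), bits)


def files_to_scan(row_key: bytes, files: List[DataFile]) -> Tuple[List[str], int]:
    reuse_map: Dict[int, List[int]] = {}   # map<BloomfilterSize, list<hash_code>>
    computed = 0                           # how many times hashing was really done
    survivors = []
    for f in files:
        size = filter_size(f.row_key_count)        # step 202
        hash_codes = reuse_map.get(size)           # steps 203-204
        if hash_codes is None:                     # step 206: compute and store
            hash_codes = positions(row_key, size)
            reuse_map[size] = hash_codes
            computed += 1
        if all(f.bits[p] for p in hash_codes):     # step 205: filter the file
            survivors.append(f.name)
    return survivors, computed


files = [
    make_file("f1", [b"a%d" % i for i in range(234)]),
    make_file("f2", [b"b%d" % i for i in range(189)]),
    make_file("f3", [b"c%d" % i for i in range(404)]),
]
survivors, computed = files_to_scan(b"a7", files)
```

Files `f1` and `f2` both fall into the interval whose right boundary is 300, so their filters have the same size and the positions for `b"a7"` are calculated once and reused. Only two hash calculations are needed for three files.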

Referring to Fig. 3, which is a schematic diagram of an embodiment of the method for filtering files using a Bloom filter in an embodiment of the present invention, the detailed process of this embodiment is as follows:

Step 301: Determine the first map.

In practical applications, the detailed process of determining the first map includes: determining the coarse-grained row key count of a Bloom filter; determining the size of the Bloom filter according to its coarse-grained row key count; determining the hash value set corresponding to the size of the Bloom filter according to each row key corresponding to the coarse-grained row key count; and saving the correspondence between the coarse-grained row key count and the size of the corresponding Bloom filter, and the correspondence between the size of the Bloom filter and the hash value set corresponding to that size.

Determining the coarse-grained row key count of the Bloom filter includes: obtaining the actual row key count of the Bloom filter; assigning the actual row key count of the Bloom filter to the corresponding interval according to the target rule; and taking the right boundary value of that interval as the coarse-grained row key count of the Bloom filter.

After the hash value set corresponding to the size of the Bloom filter is determined according to each row key corresponding to the coarse-grained row key count, each hash value in the hash value set is inserted into the corresponding Bloom filter, and the position corresponding to each hash value is set to 1.

In practical applications, suppose the actual rowkey count (rowkey number) is divided into interval ranges of 300. If there are 5 memory tables (Memtables) whose rowkey counts are 234, 404, 453, 189 and 708, the corresponding right boundary values are 300, 600, 600, 300 and 900, and the empty Bloom filters generated are BF300, BF600, BF600, BF300 and BF900, respectively.
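The interval mapping in this example can be sketched as a ceiling division. The function name `right_boundary` is hypothetical, and the sketch assumes intervals of the form (0, 300], (300, 600], and so on, which the original does not state explicitly.

```python
def right_boundary(actual_rowkeys, step=300):
    """Round the actual rowkey count up to the right edge of its interval.

    Assumes intervals (0, step], (step, 2*step], ... so that a count
    equal to a boundary stays in the lower interval.
    """
    intervals = max(1, -(-actual_rowkeys // step))  # ceiling division
    return intervals * step

memtable_rowkeys = [234, 404, 453, 189, 708]
boundaries = [right_boundary(n) for n in memtable_rowkeys]
print(boundaries)  # [300, 600, 600, 300, 900]
```

Five distinct actual counts collapse to three coarse-grained counts, which is what makes Bloom filters of identical size more common.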

Further, the hash values corresponding to each rowkey are inserted into the empty Bloom filter generated from the right boundary value. The detailed process is shown in Fig. 5. Before a memory table (MemTable) writes its data to a file, an empty Bloom filter is generated, that is, a block of memory in which every bit is 0. The rowkey of each row recorded in the memory table is first passed through k different hash functions, producing k different hash values (positive integers); the positions specified by these hash values are then set to 1 in the empty Bloom filter.
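The insertion process can be illustrated with a short sketch. The hash construction (`hash_positions`, based on SHA-256 with a per-function constant c) is a hypothetical stand-in for the k hash functions, and one byte per position is used for readability rather than a packed bit array.

```python
import hashlib

def hash_positions(rowkey, m, k):
    """Hypothetical stand-in for the k hash functions: k positions in [0, m)."""
    return [
        int.from_bytes(hashlib.sha256(f"{c}:{rowkey}".encode()).digest()[:8], "big") % m
        for c in range(k)
    ]

def build_filter(rowkeys, m, k):
    """Create an all-zero filter of m positions and insert every rowkey."""
    bits = bytearray(m)  # the "empty Bloom filter": every position is 0
    for rowkey in rowkeys:
        for pos in hash_positions(rowkey, m, k):
            bits[pos] = 1  # set the position specified by the hash value
    return bits

bits = build_filter(["user001", "user002"], m=64, k=3)
print(sum(bits), "positions set out of", len(bits))
```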

Step 302: Determine the first rowkey count corresponding to the first Bloom filter.

The first Bloom filter is the Bloom filter corresponding to the file to be filtered, and the first rowkey count is the coarse-grained rowkey count of the first Bloom filter.

In practical applications, determining the first rowkey count corresponding to the first Bloom filter comprises: obtaining the actual rowkey count of the first Bloom filter; assigning it to the corresponding first interval according to the target rule; and taking the right boundary value of the first interval as the first rowkey count.

Step 303: Obtain the first map.

The first map includes the correspondence between a coarse-grained rowkey count and the size of the Bloom filter corresponding to that count, and the correspondence between the size of a Bloom filter and the hash value set corresponding to that size.

Step 304: Query, according to the first rowkey count, whether the size of the first Bloom filter exists in the first map. If so, perform step 305; if not, perform steps 306 to 308.

Step 305: Filter the file according to the hash value set corresponding to the size of the first Bloom filter.

As shown in Fig. 6, in practical applications, the same k hash calculations are first applied to the queried rowkey to obtain k hash values, and the position corresponding to each hash value is then checked in the Bloom filter. If the value at any one of these positions is 0, the rowkey is definitely not present in this file, so the query of this file can be skipped. If all positions specified by the hash values are 1, the rowkey may be present in the file; however, because those positions may have been set to 1 by other rowkeys, an index query or file scan is still needed to look up the rowkey's value. A Bloom filter therefore has a certain probability of misjudgment (a false positive).
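The membership check can be sketched as follows. As before, `hash_positions` is a hypothetical SHA-256 based stand-in for the k hash functions, and the filter size of 64 positions is chosen only for illustration.

```python
import hashlib

def hash_positions(rowkey, m, k):
    """Hypothetical stand-in for the k hash functions: k positions in [0, m)."""
    return [
        int.from_bytes(hashlib.sha256(f"{c}:{rowkey}".encode()).digest()[:8], "big") % m
        for c in range(k)
    ]

def might_contain(bits, rowkey, k):
    """False means definitely absent; True means possibly present."""
    return all(bits[pos] for pos in hash_positions(rowkey, len(bits), k))

bits = bytearray(64)
for pos in hash_positions("user001", 64, 3):  # insert one rowkey
    bits[pos] = 1

print(might_contain(bits, "user001", 3))           # True
print(might_contain(bytearray(64), "user001", 3))  # False
```

A `True` result only permits the file to be queried further; it does not prove the rowkey is stored there.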

The false positive rate (misjudgment rate) of a Bloom filter can be calculated with the following formula:

$$f = \left(1 - e^{-kn/m}\right)^{k}$$

where f denotes the false positive rate, k the number of hash functions, n the number of rowkeys inserted into the Bloom filter, and m the size of the Bloom filter. The false positive rate is a configurable item. Once the expected false positive rate f_exp has been configured, the size m of the Bloom filter can be derived in reverse:

$$m = n \cdot \log_2 e \cdot \log_2\!\left(1/f_{exp}\right) \approx 1.44 \cdot n \cdot \log_2\!\left(1/f_{exp}\right)$$
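Both formulas can be evaluated directly. The sketch below uses the right boundary value 300 as n and an expected false positive rate of 0.01; these values, the function names, and k = 7 (close to the usual optimum of (m/n)·ln 2) are assumptions for illustration.

```python
import math

def filter_size(n, f_exp):
    """Size m needed for n rowkeys at expected false positive rate f_exp."""
    return math.ceil(n * math.log2(math.e) * math.log2(1 / f_exp))

def false_positive_rate(k, n, m):
    """f = (1 - e^(-kn/m))^k."""
    return (1 - math.exp(-k * n / m)) ** k

m = filter_size(300, 0.01)
print(m)
print(round(false_positive_rate(7, 300, m), 4))
```

Because m depends only on n and f_exp, two files whose actual rowkey counts fall into the same interval receive Bloom filters of exactly the same size.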

In addition, the calculation of a hash function can generally be expressed by the following formula:

$$h = f_{hash}(m, c)$$

When hash values are calculated, the k hash values are obtained by using different values of c (a preset constant). For a given rowkey, if m and c remain unchanged, the same hash value is obtained.
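The determinism that makes reuse possible can be shown with a small sketch. The function `f_hash` below is a hypothetical implementation built on SHA-256; the original does not specify the hash function, and the rowkey is passed explicitly here for clarity.

```python
import hashlib

def f_hash(rowkey, m, c):
    """Hypothetical f_hash: a position in [0, m) for the preset constant c."""
    digest = hashlib.sha256(f"{c}:{rowkey}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % m

k = 3
values_a = [f_hash("user001", 2876, c) for c in range(k)]
values_b = [f_hash("user001", 2876, c) for c in range(k)]  # same m and c
print(values_a == values_b)  # True
```

Since the result depends on m, hash values calculated for one Bloom filter size cannot in general be reused for a filter of a different size, which is why the map is keyed by size.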

In the data query stage, as rowkey queries continue, reusable hash values are generated and stored in the map. Assume the map currently has the following structure:

It can be seen that the query for user001 has produced a map containing 3 sets of reusable hash values. During subsequent file filtering, if a file whose right boundary value is 300, 600 or 900 is encountered, the corresponding reusable hash values can be used directly to filter the file without being recalculated, which reduces the computation overhead.
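The saving can be made concrete with a hypothetical stand-in for such a map. The sketch below reuses the five right boundary values from the Memtable example and counts how often hash values must actually be calculated for user001; the hashing scheme, k = 3 and the false positive rate 0.01 are assumptions, and the resulting hash values are not those of the original.

```python
import hashlib
import math

def filter_size(boundary, f_exp=0.01):
    """Bloom filter size derived from the right boundary value."""
    return math.ceil(boundary * math.log2(math.e) * math.log2(1 / f_exp))

def compute_hashes(rowkey, m, k=3):
    """Hypothetical hash step: k positions in [0, m)."""
    return [
        int.from_bytes(hashlib.sha256(f"{c}:{rowkey}".encode()).digest()[:8], "big") % m
        for c in range(k)
    ]

file_boundaries = [300, 600, 600, 300, 900]  # right boundary value per file
reuse_map = {}    # Bloom filter size -> reusable hash values for "user001"
computations = 0
for boundary in file_boundaries:
    m = filter_size(boundary)
    if m not in reuse_map:           # not yet calculated for this size
        reuse_map[m] = compute_hashes("user001", m)
        computations += 1
print(computations, len(file_boundaries))  # 3 5
```

Only three of the five files trigger a hash calculation; the other two reuse stored values.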

Step 306: Determine the size of the first Bloom filter according to the first rowkey count.

Step 307: Determine, according to each rowkey corresponding to the first rowkey count, the hash value set corresponding to the size of the first Bloom filter.

Step 308: Save the correspondence between the first rowkey count and the size of the first Bloom filter corresponding to it, and the correspondence between the size of the first Bloom filter and the hash value set corresponding to that size.
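Steps 302 to 308 can be combined into one end-to-end sketch. Everything named here (`build_file`, `candidate_files`, the constants STEP, F_EXP and K, the SHA-256 based hashing, and the dictionary layout of a file) is a hypothetical illustration under stated assumptions, not the implementation described in the original.

```python
import hashlib
import math

STEP, F_EXP, K = 300, 0.01, 7  # assumed interval width, false positive rate, hash count

def right_boundary(n):
    return max(1, -(-n // STEP)) * STEP

def filter_size(n):
    return math.ceil(n * math.log2(math.e) * math.log2(1 / F_EXP))

def positions(rowkey, m):
    return [
        int.from_bytes(hashlib.sha256(f"{c}:{rowkey}".encode()).digest()[:8], "big") % m
        for c in range(K)
    ]

def build_file(rowkeys):
    """Build a file record whose Bloom filter is sized from the right boundary."""
    m = filter_size(right_boundary(len(rowkeys)))
    bits = bytearray(m)
    for rowkey in rowkeys:
        for p in positions(rowkey, m):
            bits[p] = 1
    return {"count": len(rowkeys), "bits": bits}

def candidate_files(files, rowkey):
    """Return indexes of files that may contain rowkey."""
    size_by_count = {}   # coarse-grained rowkey count -> Bloom filter size
    hashes_by_size = {}  # Bloom filter size -> hash value set
    hits = []
    for idx, f in enumerate(files):
        first_count = right_boundary(f["count"])      # step 302
        if first_count not in size_by_count:          # step 304: not found
            m = filter_size(first_count)              # step 306
            hashes_by_size[m] = positions(rowkey, m)  # step 307
            size_by_count[first_count] = m            # step 308
        m = size_by_count[first_count]
        if all(f["bits"][p] for p in hashes_by_size[m]):  # step 305
            hits.append(idx)
    return hits

files = [
    build_file([f"user{i:03d}" for i in range(1, 235)]),  # 234 rowkeys
    build_file([f"other{i}" for i in range(404)]),        # 404 rowkeys
]
print(candidate_files(files, "user001"))
```

The first file always appears in the result because it contains user001; the second appears only in the case of a false positive.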

To facilitate better implementation of the above methods of the embodiments of the present invention, related apparatuses that cooperate with the above methods are provided below.

Referring to Fig. 7, which is a schematic structural diagram of a Bloom filter managing device 700 in an embodiment of the present invention, the Bloom filter managing device 700 is applied to a non-relational storage system and includes:

A determining module 701, configured to determine the first rowkey count corresponding to the first Bloom filter, where the first Bloom filter is the Bloom filter corresponding to the file to be filtered, and the first rowkey count is the coarse-grained rowkey count of the first Bloom filter;

An obtaining module 702, configured to obtain the first map, where the first map includes the correspondence between a coarse-grained rowkey count and the size of the Bloom filter corresponding to that count, and the correspondence between the size of a Bloom filter and the hash value set corresponding to that size;

An enquiry module 703, configured to query, according to the first rowkey count determined by the determining module 701, whether the size of the first Bloom filter exists in the first map obtained by the obtaining module 702;

A filtering module 704, configured to filter the file according to the hash value set corresponding to the size of the first Bloom filter when the enquiry module 703 finds that the size of the first Bloom filter exists in the first map.

In some possible implementations, the Bloom filter managing device 700 is further configured as follows:

The determining module 701 is further configured to, when the size of the first Bloom filter does not exist in the first map, determine the size of the first Bloom filter according to the first rowkey count, and determine, according to each rowkey corresponding to the first rowkey count, the hash value set corresponding to the size of the first Bloom filter;

A preserving module 705, configured to save the correspondence between the first rowkey count and the size of the first Bloom filter corresponding to it, and the correspondence between the size of the first Bloom filter and the hash value set corresponding to that size.

In other possible implementations, the determining module 701 is specifically configured to: obtain the actual rowkey count of the first Bloom filter; assign it to the corresponding first interval according to the target rule; and take the right boundary value of the first interval as the first rowkey count.

In other possible implementations, the determining module 701 is further configured to, before the obtaining module 702 obtains the first map: determine the coarse-grained rowkey count of a Bloom filter; determine the size of the Bloom filter according to its coarse-grained rowkey count; and determine, according to each rowkey corresponding to the coarse-grained rowkey count, the hash value set corresponding to the size of the Bloom filter;

The preserving module 705 is further configured to save the correspondence between the coarse-grained rowkey count and the size of the Bloom filter corresponding to that count, and the correspondence between the size of the Bloom filter and the hash value set corresponding to that size.

In other possible implementations, the determining module 701 is specifically configured to: obtain the actual rowkey count of the Bloom filter; assign it to the corresponding interval according to the target rule; and take the right boundary value of that interval as the coarse-grained rowkey count of the Bloom filter.

In other possible implementations, the Bloom filter managing device 700 further includes an insertion module 706, configured to, after the determining module 701 determines the hash value set corresponding to the size of the Bloom filter according to each rowkey corresponding to the coarse-grained rowkey count, insert each hash value in the hash value set into the corresponding Bloom filter by setting the position corresponding to each hash value to 1.

It can be seen that the determining module first determines the first rowkey count corresponding to the first Bloom filter (that is, the Bloom filter corresponding to the file to be filtered). The first rowkey count is the coarse-grained rowkey count of the first Bloom filter: the interval containing the actual rowkey count of the first Bloom filter is identified according to the target rule, and the right boundary value of that interval is taken as the first rowkey count. The enquiry module then queries, according to the first rowkey count, whether the size of the first Bloom filter exists in the first map. When it does, the filtering module filters the file according to the hash value set corresponding to the size of the first Bloom filter. Using the coarse-grained rowkey count thus effectively increases the number of first Bloom filters that share the same size, and the hash value set corresponding to that size is reused, which reduces the overhead of filtering files.

The embodiment shown in Fig. 7 describes the structure of the Bloom filter managing device from the perspective of functional modules. The structure is described below from the hardware perspective with reference to the embodiment of Fig. 8:

The present invention further provides a Bloom filter managing device 800, comprising:

One or more processors 801, a memory 802, a bus system 803 and a transceiver 804, where the processor 801, the memory 802 and the transceiver 804 are connected through the bus system 803;

One or more programs 805 are stored in the memory 802. The one or more programs 805 include instructions which, when executed by the Bloom filter managing device 800, cause the Bloom filter managing device 800 to perform the method shown in Fig. 3.

It should be noted that the processor 801 may be a CPU, or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor. During implementation, the steps of the above method may be completed by an integrated logic circuit of hardware in the processor 801 or by instructions in the form of software, and may be performed directly by a hardware processor or by a combination of hardware and software modules in the processor. The software module may be located in a storage medium that is mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory or a register. The storage medium is located in the memory 802; the processor 801 reads the information in the memory 802 and completes the steps of the above method in combination with its hardware. To avoid repetition, details are not described here.

In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.

It is apparent to those skilled in the art that, for convenience and brevity of description, the specific working processes of the portable electronic device, computer-readable storage medium and units described above may refer to the corresponding processes in the foregoing method embodiments, and details are not repeated here.

In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in actual implementation. For instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses or units, and may be electrical, mechanical or in other forms.

Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to implement the technical solution provided in this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform all or some of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.

The present invention has been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. Meanwhile, those skilled in the art may, following the idea of the present invention, make changes to the specific implementation and the scope of application. In summary, the content of this specification shall not be construed as limiting the present invention.

Claims

1. A method for filtering files using a Bloom filter, wherein the method is applied to a non-relational storage system and comprises:

determining a first rowkey count corresponding to a first Bloom filter, wherein the first Bloom filter is the Bloom filter corresponding to a file to be filtered, and the first rowkey count is the coarse-grained rowkey count of the first Bloom filter;

obtaining a first map, wherein the first map includes a correspondence between a coarse-grained rowkey count and the size of the Bloom filter corresponding to that coarse-grained rowkey count, and a correspondence between the size of a Bloom filter and the hash value set corresponding to that size;

querying, according to the first rowkey count, whether the size of the first Bloom filter exists in the first map; and

when the size of the first Bloom filter exists in the first map, filtering the file according to the hash value set corresponding to the size of the first Bloom filter.
2. The method according to claim 1, wherein, when the size of the first Bloom filter does not exist in the first map, the method further comprises:

determining the size of the first Bloom filter according to the first rowkey count;

determining, according to each rowkey corresponding to the first rowkey count, the hash value set corresponding to the size of the first Bloom filter; and

saving the correspondence between the first rowkey count and the size of the first Bloom filter corresponding to it, and the correspondence between the size of the first Bloom filter and the hash value set corresponding to that size.
3. The method according to claim 1 or 2, wherein determining the first rowkey count corresponding to the first Bloom filter comprises:

obtaining the actual rowkey count of the first Bloom filter;

assigning the actual rowkey count of the first Bloom filter to a corresponding first interval according to a target rule; and

taking the right boundary value of the first interval as the first rowkey count.
4. The method according to any one of claims 1 to 3, wherein, before obtaining the first map, the method further comprises:

determining the coarse-grained rowkey count of a Bloom filter;

determining the size of the Bloom filter according to the coarse-grained rowkey count of the Bloom filter;

determining, according to each rowkey corresponding to the coarse-grained rowkey count, the hash value set corresponding to the size of the Bloom filter; and

saving the correspondence between the coarse-grained rowkey count and the size of the Bloom filter corresponding to that count, and the correspondence between the size of the Bloom filter and the hash value set corresponding to that size.
5. The method according to claim 4, wherein determining the coarse-grained rowkey count of the Bloom filter comprises:

obtaining the actual rowkey count of the Bloom filter;

assigning the actual rowkey count of the Bloom filter to a corresponding interval according to the target rule; and

taking the right boundary value of that interval as the coarse-grained rowkey count of the Bloom filter.
6. The method according to claim 4, wherein, after the hash value set corresponding to the size of the Bloom filter is determined according to each rowkey corresponding to the coarse-grained rowkey count, the method further comprises:

inserting each hash value in the hash value set into the corresponding Bloom filter by setting the position corresponding to each hash value to 1.
7. A Bloom filter managing device, wherein the Bloom filter managing device is applied to a non-relational storage system and comprises:

a determining module, configured to determine a first rowkey count corresponding to a first Bloom filter, wherein the first Bloom filter is the Bloom filter corresponding to a file to be filtered, and the first rowkey count is the coarse-grained rowkey count of the first Bloom filter;

an obtaining module, configured to obtain a first map, wherein the first map includes a correspondence between a coarse-grained rowkey count and the size of the Bloom filter corresponding to that coarse-grained rowkey count, and a correspondence between the size of a Bloom filter and the hash value set corresponding to that size;

an enquiry module, configured to query, according to the first rowkey count determined by the determining module, whether the size of the first Bloom filter exists in the first map obtained by the obtaining module; and

a filtering module, configured to filter the file according to the hash value set corresponding to the size of the first Bloom filter when the enquiry module finds that the size of the first Bloom filter exists in the first map.
8. The Bloom filter managing device according to claim 7, wherein, when the size of the first Bloom filter does not exist in the first map:

the determining module is further configured to determine the size of the first Bloom filter according to the first rowkey count, and to determine, according to each rowkey corresponding to the first rowkey count, the hash value set corresponding to the size of the first Bloom filter; and

the Bloom filter managing device further comprises a preserving module, configured to save the correspondence between the first rowkey count and the size of the first Bloom filter corresponding to it, and the correspondence between the size of the first Bloom filter and the hash value set corresponding to that size.
9. The Bloom filter managing device according to claim 7 or 8, wherein the determining module is specifically configured to: obtain the actual rowkey count of the first Bloom filter; assign the actual rowkey count of the first Bloom filter to a corresponding first interval according to a target rule; and take the right boundary value of the first interval as the first rowkey count.
10. The Bloom filter managing device according to any one of claims 7 to 9, wherein the determining module is further configured to, before the obtaining module obtains the first map: determine the coarse-grained rowkey count of a Bloom filter; determine the size of the Bloom filter according to the coarse-grained rowkey count of the Bloom filter; and determine, according to each rowkey corresponding to the coarse-grained rowkey count, the hash value set corresponding to the size of the Bloom filter; and

the preserving module is further configured to save the correspondence between the coarse-grained rowkey count and the size of the Bloom filter corresponding to that count, and the correspondence between the size of the Bloom filter and the hash value set corresponding to that size.
11. The Bloom filter managing device according to claim 10, wherein the determining module is specifically configured to: obtain the actual rowkey count of the Bloom filter; assign the actual rowkey count of the Bloom filter to a corresponding interval according to the target rule; and take the right boundary value of that interval as the coarse-grained rowkey count of the Bloom filter.
12. The Bloom filter managing device according to claim 10, wherein the Bloom filter managing device further comprises an insertion module, configured to, after the determining module determines the hash value set corresponding to the size of the Bloom filter according to each rowkey corresponding to the coarse-grained rowkey count, insert each hash value in the hash value set into the corresponding Bloom filter by setting the position corresponding to each hash value to 1.
13. A Bloom filter managing device, comprising:

one or more processors, a memory, a bus system and a transceiver, wherein the processor, the memory and the transceiver are connected through the bus system;

wherein one or more programs are stored in the memory, the one or more programs include instructions, and the instructions, when executed by the Bloom filter managing device, cause the Bloom filter managing device to perform the method according to any one of claims 1 to 6.