CN111930923B

CN111930923B - Bloom filter system and filtering method

Info

Publication number: CN111930923B
Application number: CN202010628830.4A
Authority: CN
Inventors: 方贤斌; 旷黎明; 师文庆
Original assignee: Shanghai Weiyi Intelligent Manufacturing Technology Co ltd; Changzhou Weiyizhi Technology Co Ltd
Current assignee: Shanghai Weiyi Intelligent Manufacturing Technology Co ltd; Changzhou Weiyizhi Technology Co Ltd
Priority date: 2020-07-02
Filing date: 2020-07-02
Publication date: 2021-07-30
Anticipated expiration: 2040-07-02
Also published as: CN111930923A

Abstract

The invention provides a bloom filter system and a filtering method, comprising the following steps: initializing the filter module: when each filter is initialized, two parameters, namely p and n, are required to be transmitted, wherein n is the number of inserted elements, p is the false alarm rate, and the numHashFunctions and the bloom filter length bitSize are generated. The invention greatly saves the single processing time and reduces the actual filtering misjudgment probability, and the filter can be more stable and reliable. And the use of other program services cannot be burdened by long filter processing time after the amount of the post data is large. The overall performance of the improved filter is about 100 times faster than that of the existing filter in the aspect of quantification effect.

Description

Bloom filter system and filtering method

Technical Field

The invention relates to the technical field of data filtering, in particular to a bloom filter system and a filtering method.

Background

With the development of technology and the increase of data volume, the accuracy and the availability of the maximum data volume become a fast development direction, and the application scenarios are more and more related on internet big data and in industrial big data scenarios.

At present, a plurality of filter plug-ins exist in the related field, but a plurality of disadvantages and bottlenecks are found in the use process, for example, the filtering speed is reduced in a curve along with the increase of data quantity in the filtering speed and the filtering error rate, the error rate is also increased in a curve, and a plurality of companies adopt more hardware resources to cluster when solving problems, but the effect is also poor and the cost is relatively high.

If two parameters of p and n are not processed, numHashFunctions become smaller and bitSize is too large at the beginning, and the smaller numHashFunctions and the larger bitSize cause low efficiency of bloom filters, each bit is in the same interval, the probability of generating the same bit value by subsequently generating value values becomes larger and larger after the data volume is increased, and when the repetition rate of the single bit value of different value values is increased, the probability of generating the same bit array by the filters is synchronously increased, the probability increase is fatal to the filters, the accuracy and the usability of processing a large amount of data are realized for the purpose of filtering, and if the filter loses the property, the actual effect of the filters is reduced. The larger bitSize results in more space stored in the filter, larger resource occupation, and lower filter update/storage/reading performance when more values are stored.

If the performance is compromised when the filter is used in an actual service due to the huge memory, the performance is very dangerous in a high-concurrency and high-availability application scenario, and the processing capability of other services is affected. In situations where filters are important and performance is also important, improving the performance and high availability of filters becomes the only option.

Disclosure of Invention

In view of the defects in the prior art, the invention aims to provide a bloom filter system and a filtering method.

According to the present invention there is provided a bloom filter system comprising:

initializing the filter module: when each filter is initialized, two parameters of p and n are required to be transmitted, wherein n is the number of inserted elements, p is the false alarm rate, and the numHashFunctions and the bloom filter length bitSize are generated;

and a filter bit array calculating module: generating a value according to numHashFunctions and bitSize generated by an initialization filter and the three data of the transmitted calculated value key, generating a plurality of long type values according to the numHashFunctions, generating a plurality of int type bit arrays according to the long type values and finally returning the bit arrays;

a weight judging module: circularly obtaining each bit value according to the returned bit array, and inquiring the bit value of each bit value in the database Redis: if the value is 0, the mapping of any value to the bit is not shown, and the value does not exist, and an updating or storing filter module is called; if the returned value is 1, the value already exists, and a deletion filter module is called;

update or store filter module: if the value is judged not to exist in the filter, the returned bit arrays can be respectively pointed to the bit positions and stored in Redis;

and (3) deleting the filter module: if the value is judged to already exist in the filter, the incoming calculated value key is the key value to be deleted, and the value of the Counter of the delete filter is decremented by 1 according to the single bit, and processed one by one.

Preferably, the numHashFunctions and bitSize determine the length of the filter bit;

the length of the filter bit affects the efficiency and error rate of the final query and update values of the filter.

Preferably, the generating a value includes:

initializing a value array value, wherein the array length is numHashFunctions, a value key needing to be verified is transmitted when a user uses a filter, then generating a long type value with hash 64 bits by the key, generating an int type value with 10-bit length by the long type value as hash1, and generating an int type value with 10-bit length as hash2 by the long type value without symbols right shift > > >32 bits;

then, circulating according to the length of the numHashFunctions, generating a nextHash value according to a calculation method of hash1+ i hash2, wherein i is a circulation index, judging whether the nextHash value is less than 0, and performing inverse operation on the nextHash value when the nextHash value is less than 0;

and then value is assigned, and the value generation method comprises the following steps: bitSize for nextHash remainder operation: and (4) bit is nextHash% bitSize, and the size of the bit value does not exceed the value of the bitSize.

Preferably, the incoming calculated value key refers to user incoming data;

one data corresponds to one key value, and whether the corresponding data exists in the filter is inquired through the key value.

Preferably, the specific algorithm for generating the hash function number numHashFunctions and the bloom filter length bitSize is as follows:

if the calculated numHashFunctions value is less than or equal to 3, the numHashFunctions takes a fixed value of 3, and is prohibited from being less than 3.

Preferably, the two parameters of p and n are dynamic;

the countValue value is a value for recording actual filtering or storage of items, is initialized to 0, is increased progressively according to the actual use number, and is stored in Redis;

the value n represents the number of filter values to be reserved, and when the value of the countValue is equal to n, the value n is reinitialized by: multiplying n by 2, i.e. multiplying by a factor of n

The p value represents an acceptable data filtering false positive probability, and is reinitialized when the countValue value is equal to n, and the reinitialized value is: p is divided by 2.

The invention provides a filtration method of a bloom filter, which comprises the following steps:

initializing a filter: when each filter is initialized, two parameters of p and n are required to be transmitted, wherein n is the number of inserted elements, p is the false alarm rate, and the numHashFunctions and the bloom filter length bitSize are generated;

calculating a filter bit array: generating a value according to numHashFunctions and bitSize generated by an initialization filter and the three data of the transmitted calculated value key, generating a plurality of long type values according to the numHashFunctions, generating a plurality of int type bit arrays according to the long type values and finally returning the bit arrays;

and (3) judging the weight: circularly obtaining each bit value according to the returned bit array, and inquiring the bit value of each bit value in the database Redis: if the value is 0, the mapping of any value to the bit is not shown, and the updating or storing filter step is called if the value does not exist; if the returned value is 1, the value is already existed, and the step of deleting the filter is called;

updating or storing the filter step: if the value is judged not to exist in the filter, the returned bit arrays can be respectively pointed to the bit positions and stored in Redis;

and a filter deleting step: if the value is judged to already exist in the filter, the incoming calculated value key is the key value to be deleted, and the value of the Counter of the delete filter is decremented by 1 according to the single bit, and processed one by one.

the length of the filter bit can affect the efficiency and error rate of the final query and update value of the filter;

the incoming computed value key refers to user incoming data;

Preferably, the generating a value includes:

if the calculated numHashFunctions value is less than or equal to 3, the numHashFunctions takes a fixed value of 3, and is prohibited to be less than 3;

the two parameters of p and n are dynamic;

Compared with the prior art, the invention has the following beneficial effects:

according to the method, after initialized n and p values are transformed, the generation sizes of the values of bitSize and numHashFunctions can be controlled, so that the values newly added to Redis are divided, and the value of each section is limited by length so as to achieve the purpose of not influencing each other. The data size of 200000000 can be divided into 100 parts after partitioning, so that the advantage of processing large data size can be achieved by only judging the small data size of 2000000, single processing time is greatly saved, the actual misjudgment probability is reduced, and the filter is more stable and reliable. And the use of other program services cannot be burdened by long filter processing time after the amount of the post data is large. The overall performance of the improved filter is about 100 times faster than that of the existing filter in the aspect of quantification effect.

Drawings

Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:

FIG. 1 is a schematic diagram of a filter for processing repeatability of filtered data according to the present invention.

Detailed Description

The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that it would be obvious to those skilled in the art that various changes and modifications can be made without departing from the spirit of the invention. All falling within the scope of the present invention.

Specifically, the numHashFunctions and bitSize determine the length of the filter bit;

Specifically, the process of generating a value includes:

In particular, the incoming computed value key refers to user incoming data;

Specifically, the specific algorithm for generating the hash function number numHashFunctions and the bloom filter length bitSize is as follows:

Specifically, the two parameters of p and n are dynamic;

the incoming computed value key refers to user incoming data;

Specifically, the process of generating a value includes:

the two parameters of p and n are dynamic;

The present invention will be described more specifically below with reference to preferred examples.

Preferred example 1:

the invention solves the aim of ensuring non-repeated data uniqueness and high availability under a large amount of data, and comprises a filter initialization stage, a filter bit array calculation stage, a filter value updating/storing stage and a filter value deleting stage after data deletion, wherein the filter initialization stage and the filter bit array calculation stage are used for judging whether the Redis comprises the filter bit array, and the filter value is updated/stored and deleted after the data is deleted.

As shown in fig. 1, a filter for dealing with repeatability of filtered data in a context of large data volumes, includes:

initializing the filter module: when each filter is initialized, two parameters of p and n (n is the number of inserted elements, and p is the false alarm rate) need to be introduced, and the numHashFunctions and the bloom filter length (bitSize) are generated. Initializing the filter is an important step, which is the basis for generating numHashFunctions and bitSize, which determine the length of the filter bit. The length of the filter bit affects the efficiency and error rate of the final query and update values of the filter;

and a filter bit array calculating module: generating a value according to numHashFunctions, bitSize and an incoming calculated value key (the value is data incoming by a user, namely a key value, generally, one data corresponds to one key value, and if the user wants to know that a certain data exists in the filter, the user needs to query the filter by using the key value, so that a true value is considered to exist, and if the data is false, the value is considered to not exist) generated by initializing the filter: the value generates a plurality of Long type values according to numHashFunctions (Long is a data type, a Long keyword represents Long integer data, is a basic data type in a programming language, is an abbreviation of Long int and is a signed Long integer by default), then generates a plurality of int type bit arrays according to bitSize by the plurality of Long type values, and finally returns the bit arrays;

a weight judging module: inquiring the value of the bit inside each bit value in Redis according to the returned bit array, if the value is 0, indicating that no value is mapped to the bit, so that the value does not exist, and if the returned values are all 1, the value may already exist; the Redis is a database for storing/querying data (also referred to as https:// baike. baidu. com/item/Redis/6549233fr ═ adadin)

Update/store filter module: if the value is judged not to exist in the filter, the returned bit arrays can be respectively pointed to the bit positions and stored in Redis;

and (3) deleting the filter module: when a value to be deleted is transmitted, the value of the Counter (Counter) is decremented by 1 according to a single bit, and the values are processed one by one.

The invention comprises the following steps:

(1) by adopting a modified numHashFunctions and bitSize generation algorithm, the method comprises an initialization stage, wherein two parameters, namely p and n, are transmitted in the initialization stage to generate numHashFunctions and bitSize, and the generation rule of the numHashFunctions is as follows according to the previous algorithm:

it can be seen that as bitSize increases slowly, the numHashFunctions value decreases until it is less than 1, but if the numHashFunctions value approaches 1 or equals 1, the filter bit array will be smaller and smaller, or even 1 in length. After the numHashFunctions and bitSize generation algorithm are modified, the numHashFunctions can never be close to 1, the minimum value is controlled to be close to or equal to 3, and the error rate of the filter can not be reduced.

(2) The method changes two parameters of p and n into dynamic states, and reforms a calculation filter bit array, and comprises an initialization filter, a calculation filter bit array and an updating/storing filter. Create a new key when updating/storing the filter: the countValue is used for storing and counting the number of values stored by the filter, the total number of the values stored can be known through the countValue, n is divided according to the fact that n is equal to 2 hundred million data volume, n is divided into equal parts or divided into 100 parts according to proportion, 100 intervals are divided, and according to the size of the countValue, when the countValue reaches an equal part value, two parameters of p and n are updated, so that the parameters are changed according to the increase of the actual data volume, and therefore numHashFunctions and bitSize are changed. The size of bitSize is multiplied by the value of 100 bins each, which changes the bin size of each bit in the filter bit array. When each bit size is in 100 intervals, the mutual interference is very little when all the bits of the filter are divided into 100 parts of relatively independent space for storage. Therefore, the problem that when the company has 2 hundred million data volumes or even more data volumes, two parameters of p and n need to be set greatly, and the speed and the error rate of the actually used filter are greatly influenced is solved, the speed and the error rate of only processing 200 ten thousand data volumes can be realized after the improvement, the processing time of using the filter is greatly increased, the actual error rate is greatly reduced, and the capacity of realizing high available processing high data volume is achieved.

countValue: recording the actual filtering or storing value of the project, initializing to 0, increasing the number according to the actual use, storing in Redis (newly added parameters for modification)

The value of n is as follows: the number of filter values representing the reservation is initialized to 2000000, the company target value of 200000000 is divided into 100 segments by 2000000 units in the project, and the single segment is fixed during the project operation. However, when the value of countValue is equal to n, the value of n is reinitialized, and the reinitialized value is: multiplying n by 2, i.e. multiplying by a factor of n

p value: and the data filtering misjudgment probability is acceptable, the value is generated by the same method as the value n and is initialized to 0.0000001, and a single section is fixed in the process of operating the project. However, when the value of countValue is equal to n, the value of p is reinitialized by: p divided by 2

(3) Through the Counter size setting that adopts the transformation to delete the filter, the size location problem of Counter is also a dilemma: in view of the problem of space utilization, it is desirable from the viewpoint of use that the larger the Counter, the better the Counter, and the larger the Counter, the more information can be recorded. But a larger Counter would take up more resources and would in many cases create a significant waste of space. However, after the two parameters p and n are changed into dynamic states, the value of the Counter can be changed into an interval reached according to the countValue value, and the size of the Counter is increased in an equal proportion mode, so that the Counter can be owned at each stage of the filter, the contradiction between space utilization rate and resource occupation can be balanced, and the use efficiency of the filter is ensured.

In the description of the present application, it is to be understood that the terms "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like indicate orientations or positional relationships based on those shown in the drawings, and are only for convenience in describing the present application and simplifying the description, but do not indicate or imply that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and thus, should not be construed as limiting the present application.

Those skilled in the art will appreciate that, in addition to implementing the systems, apparatus, and various modules thereof provided by the present invention in purely computer readable program code, the same procedures can be implemented entirely by logically programming method steps such that the systems, apparatus, and various modules thereof are provided in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system, the device and the modules thereof provided by the present invention can be considered as a hardware component, and the modules included in the system, the device and the modules thereof for implementing various programs can also be considered as structures in the hardware component; modules for performing various functions may also be considered to be both software programs for performing the methods and structures within hardware components.

The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims

1. A bloom filter system, comprising:

update or store filter module: if the value is judged not to exist in the filter, the returned bit arrays point to the bit positions respectively and are stored in Redis;

and (3) deleting the filter module: if the value is judged to exist in the filter, the transmitted calculated value key is the key value needing to be deleted, the value of the Counter of the deletion filter is reduced by 1 according to a single bit, and the processing is carried out one by one;

the generation size of the value of bitSize and numHashFunctions can be controlled after the initialized n and p values, so that the value added to Redis is divided, and the value of each section is limited by the length so as to achieve the purpose of not influencing each other;

the numHashFunctions and bitSize determine the length of the filter bit;

the process of generating a value includes:

and then value is assigned, and the value generation method comprises the following steps: bitSize for nextHash remainder operation: the bit value does not exceed the value of the nextHash% bitSize;

changing two parameters of p and n into dynamic states, and modifying a calculation filter bit array, wherein the filter bit array comprises an initialization filter, a calculation filter bit array and an updating/storing filter; create a new key when updating/storing the filter: the countValue is used for storing and counting the number of values stored by the filter, knowing the total number of the values stored by the countValue, dividing n according to the number of 2 hundred million data volumes, dividing n into 100 parts equally or proportionally, which is equivalent to dividing 100 intervals, and updating two parameters, namely p and n, when the countValue reaches an equal part value according to the size of the countValue, so that the parameters are changed according to the increase of the actual data volume to change numHashFunctions and bitSize; the value of bitSize in each of the 100 intervals is in a multiple relationship, and the multiple relationship changes the interval of the size of each bit in the filter bit array; when the size of each bit is in 100 intervals, all the bits of the filter are divided into 100 parts of relatively independent space for storage;

countValue: recording the actual filtering or storage value of the project, initializing to 0, increasing according to the actual use number, and storing in Redis;

the value of n is as follows: representing the number of the reserved filtering values, initializing to 2000000, dividing the target value of 200000000 into 100 sections by 2000000 as a unit, and keeping the single section constant in the process of project operation; however, when the value of countValue is equal to n, the value of n is reinitialized, and the reinitialized value is: multiplying n by 2, namely multiplying by using n as a base;

p value: representing the acceptable data filtering misjudgment probability, the generation method of the value is the same as the value n, the value is initialized to 0.0000001, and a single section is fixed and unchanged in the operation process; however, when the value of countValue is equal to n, the value of p is reinitialized by: p is divided by 2;

changing the p and n parameters into dynamic state, deleting the Counter value of the filter, and increasing the Counter size in equal proportion according to the arrival interval of the countValue value, so that each stage of the filter has the Counter, thereby balancing the contradiction between space utilization rate and resource occupation and ensuring the use efficiency of the filter;

after the value to be deleted is transmitted, subtracting 1 from the Counter value according to a single bit, and processing one by one;

the numHashFunctions can never be close to 1, and the control is carried out until the minimum value is close to or equal to 3, so that the error rate of the filter can not be reduced.

2. The bloom filter system of claim 1, wherein the incoming computed values key refer to user incoming data;

3. The bloom filter system according to claim 1, wherein the specific algorithm for generating the hash function number numHashFunctions and the bloom filter length bitSize is as follows:

4. A method of filtering a bloom filter, comprising:

and a filter deleting step: if the value is judged to exist in the filter, the transmitted calculated value key is the key value needing to be deleted, the value of the Counter of the deletion filter is reduced by 1 according to a single bit, and the processing is carried out one by one;

the numHashFunctions and bitSize determine the length of the filter bit;

the incoming computed value key refers to user incoming data;

one data corresponds to one key value, and whether the corresponding data exists in the filter is inquired through the key value;

the process of generating a value includes:

p value: representing the acceptable data filtering misjudgment probability, the generation method of the value is the same as the value n, the value is initialized to 0.0000001, and a single section is fixed and unchanged in the operation process; however, when the countValue is equal to n, the p value is reinitialized by: p is divided by 2;

5. The filtering method of the bloom filter according to claim 4, wherein the specific algorithm for generating the hash function number numHashFunctions and the bloom filter length bitSize is as follows: