CN111930923B - Bloom filter system and filtering method - Google Patents

Info

Publication number
CN111930923B
Authority
CN
China
Prior art keywords
value
filter
bit
numhashfunctions
bitsize
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010628830.4A
Other languages
Chinese (zh)
Other versions
CN111930923A (en)
Inventor
方贤斌
旷黎明
师文庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Weiyi Intelligent Manufacturing Technology Co ltd
Changzhou Weiyizhi Technology Co Ltd
Original Assignee
Shanghai Weiyi Intelligent Manufacturing Technology Co ltd
Changzhou Weiyizhi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Weiyi Intelligent Manufacturing Technology Co ltd and Changzhou Weiyizhi Technology Co Ltd
Priority to CN202010628830.4A
Publication of CN111930923A
Application granted
Publication of CN111930923B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/335: Filtering based on additional data, e.g. user or group profiles

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Telephone Function (AREA)

Abstract

The invention provides a bloom filter system and a filtering method. The filter module is initialized as follows: each time a filter is initialized, two parameters, p and n, are passed in, where n is the number of inserted elements and p is the false positive rate, and from them the number of hash functions numHashFunctions and the bloom filter length bitSize are generated. The invention greatly reduces the time of a single filtering operation and lowers the actual probability of a false positive, making the filter more stable and reliable. Long filter processing times therefore do not burden other program services once the data volume becomes large. In quantitative terms, the overall performance of the improved filter is about 100 times faster than that of the existing filter.

Description

Bloom filter system and filtering method
Technical Field
The invention relates to the technical field of data filtering, in particular to a bloom filter system and a filtering method.
Background
With the development of technology and the growth of data volumes, accuracy and availability when processing very large data sets have become a rapidly developing area, with more and more application scenarios in Internet big data and industrial big data.
At present, many filter plug-ins exist in this field, but many shortcomings and bottlenecks appear in use. For example, as the data volume grows, the filtering speed falls off along a curve and the filtering error rate rises along a curve. Many companies try to solve the problem by clustering more hardware resources, but the effect is poor and the cost is relatively high.
If the two parameters p and n are not handled, numHashFunctions becomes small and bitSize is too large from the start, and a small numHashFunctions combined with a large bitSize makes the bloom filter inefficient. Because every bit falls in the same interval, the probability that a newly generated value produces a bit value that already exists grows as the data volume increases. As the repetition rate of individual bit values across different values rises, so does the probability that the filter generates the same bit array for different values. This increase is fatal to a filter: the purpose of filtering is accuracy and availability when processing large amounts of data, and a filter that loses these properties loses its practical value. A larger bitSize also means that the filter occupies more storage space and more resources, and that update, storage and read performance degrades as more values are stored.
If the filter's large memory footprint degrades performance in an actual service, this is very dangerous in high-concurrency, high-availability application scenarios, because the processing capacity of other services is affected. Where both the filter and performance matter, improving the performance and availability of the filter becomes the only option.
Disclosure of Invention
In view of the defects in the prior art, the invention aims to provide a bloom filter system and a filtering method.
According to the present invention there is provided a bloom filter system comprising:
a filter initialization module: each time a filter is initialized, two parameters, p and n, are passed in, where n is the number of inserted elements and p is the false positive rate, and numHashFunctions and the bloom filter length bitSize are generated;
a filter bit array calculation module: a value is generated from three inputs, namely the numHashFunctions and bitSize produced when the filter was initialized and the key passed in for calculation; a plurality of long-type values are generated according to numHashFunctions, a plurality of int-type bit values forming the bit array are generated from the long-type values, and the bit array is returned;
a duplicate judgment module: each bit value in the returned bit array is taken in turn and the bit stored at that position is queried in the Redis database; if the returned value is 0, no value has been mapped to that bit, so the value does not exist and the update or store filter module is called; if the returned value is 1, the value already exists and the delete filter module is called;
an update or store filter module: if the value is judged not to exist in the filter, the bit positions given by the returned bit array are set and stored in Redis;
a delete filter module: if the value is judged to already exist in the filter, the key passed in for calculation is the key to be deleted, and the Counter of the delete filter is decremented by 1 for each single bit, the bits being processed one by one.
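The query-then-store flow of the duplicate judgment and update/store modules can be sketched as follows. This is a minimal illustration, not the patented implementation: `InMemoryBits` is a hypothetical stand-in for the Redis bit storage so that the example runs without a server, and `bit_positions` uses a SHA-256 based placeholder hash that the patent does not specify.

```python
import hashlib


class InMemoryBits:
    """Hypothetical stand-in for Redis bit storage (one bit per offset)."""

    def __init__(self):
        self._set_offsets = set()

    def getbit(self, offset):
        # Returns 1 if the bit at this offset has been set, else 0.
        return 1 if offset in self._set_offsets else 0

    def setbit(self, offset, value):
        if value:
            self._set_offsets.add(offset)
        else:
            self._set_offsets.discard(offset)


def bit_positions(key, num_hash_functions, bit_size):
    """Placeholder hashing (an assumption): derive positions from SHA-256."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big")
    return [(h1 + i * h2) % bit_size for i in range(num_hash_functions)]


def might_contain(store, positions):
    # A single 0 bit proves the value was never stored; all 1s means "may exist".
    return all(store.getbit(pos) == 1 for pos in positions)


def add_if_absent(store, key, num_hash_functions, bit_size):
    """Store the key's bits if the filter says it is absent; return True if stored."""
    positions = bit_positions(key, num_hash_functions, bit_size)
    if might_contain(store, positions):
        return False
    for pos in positions:
        store.setbit(pos, 1)
    return True
```

With a real Redis deployment, the two methods of the stand-in would map onto per-bit read and write commands; the control flow stays the same.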
Preferably, numHashFunctions and bitSize determine the length of the filter bit array;
this length affects the efficiency and error rate of the filter's final query and update operations.
Preferably, generating a value includes:
initializing a value array whose length is numHashFunctions; when a user uses the filter, the key to be verified is passed in, a 64-bit long-type hash value is generated from the key, an int-type value with 10-bit length is generated from the long-type value as hash1, and an int-type value with 10-bit length is generated as hash2 from the long-type value after an unsigned right shift (>>>) of 32 bits;
then looping numHashFunctions times and generating a nextHash value as hash1 + i * hash2, where i is the loop index; whether nextHash is less than 0 is checked, and when it is, nextHash is inverted;
and then assigning the value, which is generated by taking the remainder of nextHash with respect to bitSize: bit = nextHash % bitSize, so that the bit value does not exceed bitSize.
Preferably, the key passed in for calculation is the data passed in by the user;
one piece of data corresponds to one key, and the key is used to query whether the corresponding data exists in the filter.
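The value generation steps above can be sketched as follows. Several details are assumptions rather than statements of the patent: the 64-bit hash is taken from `hashlib.blake2b` (the patent does not name a hash function), hash1 is read as the low 32 bits of the hash, the "inverse operation" is read as bitwise inversion, the loop index starts at 0, and integer arithmetic wraps to 32 bits as it would for an int in a language such as Java.

```python
import hashlib


def to_int32(x):
    """Wrap an integer to a signed 32-bit value."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


def bit_array(key, num_hash_functions, bit_size):
    """Return numHashFunctions bit positions for key, each in [0, bit_size)."""
    # 64-bit hash of the key (hash function choice is an assumption).
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest()
    hash64 = int.from_bytes(digest, "big")

    hash1 = to_int32(hash64)        # low 32 bits as a signed int
    hash2 = to_int32(hash64 >> 32)  # hash64 is non-negative, so >> acts as unsigned shift

    bits = []
    for i in range(num_hash_functions):
        next_hash = to_int32(hash1 + i * hash2)
        if next_hash < 0:
            next_hash = ~next_hash  # bitwise inversion maps negatives to non-negatives
        bits.append(next_hash % bit_size)
    return bits
```

Deriving every position from only two 32-bit halves means one hash computation per key regardless of numHashFunctions, which is the usual motivation for this double-hashing pattern.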
Preferably, the specific algorithm for generating the number of hash functions numHashFunctions and the bloom filter length bitSize is as follows:
[Formula images BDA0002567651940000031 and BDA0002567651940000032 are not reproduced in this text.]
if the calculated numHashFunctions value is less than or equal to 3, numHashFunctions takes the fixed value 3 and is never allowed to be less than 3.
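The formula images are not available in this text, so the sketch below assumes the textbook bloom filter sizing formulas, bitSize = -n ln p / (ln 2)^2 and numHashFunctions = (bitSize / n) ln 2, which may differ from the formulas in the patent figures. Only the lower bound of 3 is taken from the text. The function names are hypothetical.

```python
import math

MIN_HASH_FUNCTIONS = 3  # lower bound stated in the text


def optimal_bit_size(n, p):
    """Textbook bloom filter length for n elements and false positive rate p (assumed formula)."""
    return math.ceil(-n * math.log(p) / (math.log(2) ** 2))


def clamped_num_hash_functions(n, bit_size):
    """Textbook hash function count (assumed formula), never allowed below 3."""
    k = round(bit_size / n * math.log(2))
    return max(MIN_HASH_FUNCTIONS, k)
```

For example, a loose target such as p = 0.5 yields a textbook count of 1, which the clamp raises to 3, while a strict target such as p = 0.0000001 is unaffected by the clamp.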
Preferably, the two parameters p and n are dynamic;
countValue records the number of values actually filtered or stored by the project; it is initialized to 0, incremented according to actual use, and stored in Redis;
the value n represents the number of filter values reserved; when countValue equals n, n is reinitialized to n multiplied by 2, that is, n is doubled;
the value p represents the acceptable probability of a false positive in data filtering; when countValue equals n, p is reinitialized to p divided by 2.
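The dynamic update of n and p can be sketched as a small state object. The class and method names are hypothetical, and the counter is kept in memory here, whereas the text stores countValue in Redis.

```python
from dataclasses import dataclass


@dataclass
class FilterParams:
    n: int                # number of filter values reserved
    p: float              # acceptable false positive probability
    count_value: int = 0  # number of values actually stored so far

    def record_insert(self):
        """Count one stored value; re-initialize n and p when count_value reaches n.

        Returns True when the parameters were re-initialized, which is the
        point at which numHashFunctions and bitSize would be regenerated.
        """
        self.count_value += 1
        if self.count_value == self.n:
            self.n *= 2  # n is doubled
            self.p /= 2  # p is halved
            return True
        return False
```

Because n doubles at each trigger, the next re-initialization happens only after count_value has doubled as well, so re-initializations become progressively rarer as the data grows.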
The invention provides a filtering method for a bloom filter, which comprises the following steps:
initializing a filter: each time a filter is initialized, two parameters, p and n, are passed in, where n is the number of inserted elements and p is the false positive rate, and numHashFunctions and the bloom filter length bitSize are generated;
calculating a filter bit array: a value is generated from three inputs, namely the numHashFunctions and bitSize produced when the filter was initialized and the key passed in for calculation; a plurality of long-type values are generated according to numHashFunctions, a plurality of int-type bit values forming the bit array are generated from the long-type values, and the bit array is returned;
judging duplicates: each bit value in the returned bit array is taken in turn and the bit stored at that position is queried in the Redis database; if the returned value is 0, no value has been mapped to that bit, so the value does not exist and the update or store filter step is called; if the returned value is 1, the value already exists and the delete filter step is called;
updating or storing the filter: if the value is judged not to exist in the filter, the bit positions given by the returned bit array are set and stored in Redis;
deleting from the filter: if the value is judged to already exist in the filter, the key passed in for calculation is the key to be deleted, and the Counter of the delete filter is decremented by 1 for each single bit, the bits being processed one by one.
Preferably, numHashFunctions and bitSize determine the length of the filter bit array;
this length affects the efficiency and error rate of the filter's final query and update operations;
the key passed in for calculation is the data passed in by the user;
one piece of data corresponds to one key, and the key is used to query whether the corresponding data exists in the filter.
Preferably, generating a value includes:
initializing a value array whose length is numHashFunctions; when a user uses the filter, the key to be verified is passed in, a 64-bit long-type hash value is generated from the key, an int-type value with 10-bit length is generated from the long-type value as hash1, and an int-type value with 10-bit length is generated as hash2 from the long-type value after an unsigned right shift (>>>) of 32 bits;
then looping numHashFunctions times and generating a nextHash value as hash1 + i * hash2, where i is the loop index; whether nextHash is less than 0 is checked, and when it is, nextHash is inverted;
and then assigning the value, which is generated by taking the remainder of nextHash with respect to bitSize: bit = nextHash % bitSize, so that the bit value does not exceed bitSize.
Preferably, the specific algorithm for generating the number of hash functions numHashFunctions and the bloom filter length bitSize is as follows:
[Formula images BDA0002567651940000041 and BDA0002567651940000042 are not reproduced in this text.]
if the calculated numHashFunctions value is less than or equal to 3, numHashFunctions takes the fixed value 3 and is never allowed to be less than 3;
the two parameters p and n are dynamic;
countValue records the number of values actually filtered or stored by the project; it is initialized to 0, incremented according to actual use, and stored in Redis;
the value n represents the number of filter values reserved; when countValue equals n, n is reinitialized to n multiplied by 2, that is, n is doubled;
the value p represents the acceptable probability of a false positive in data filtering; when countValue equals n, p is reinitialized to p divided by 2.
Compared with the prior art, the invention has the following beneficial effects:
According to the method, once the initialized n and p values are made variable, the sizes of the generated bitSize and numHashFunctions values can be controlled, so that values newly added to Redis are partitioned into segments, and the values of each segment are bounded in length so that the segments do not affect one another. A data volume of 200000000 can be partitioned into 100 parts, so that processing the large data volume only requires judging against a small data volume of 2000000. This greatly reduces the time of a single operation and lowers the actual probability of a false positive, making the filter more stable and reliable. Long filter processing times therefore do not burden other program services once the data volume becomes large. In quantitative terms, the overall performance of the improved filter is about 100 times faster than that of the existing filter.
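The partitioning arithmetic in this paragraph can be checked with a short sketch. The two constants are the figures given in the text; the helper names, and the idea of mapping a running count to a zero-based segment index, are illustrative assumptions.

```python
TARGET_TOTAL = 200_000_000  # total data volume named in the text
SEGMENT_SIZE = 2_000_000    # size of one segment named in the text


def segment_count(total=TARGET_TOTAL, segment_size=SEGMENT_SIZE):
    """Number of equal segments the total data volume is divided into."""
    return total // segment_size


def segment_index(count_value, total=TARGET_TOTAL, segment_size=SEGMENT_SIZE):
    """Zero-based index of the segment a running count falls in (hypothetical helper)."""
    last = segment_count(total, segment_size) - 1
    return min(count_value // segment_size, last)
```

Here `segment_count()` evaluates to 100, matching the 100 parts described above.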
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a schematic diagram of a filter for handling duplicates in filtered data according to the present invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that those skilled in the art can make various changes and modifications without departing from the spirit of the invention, all of which fall within the scope of protection of the present invention.
According to the present invention there is provided a bloom filter system comprising:
a filter initialization module: each time a filter is initialized, two parameters, p and n, are passed in, where n is the number of inserted elements and p is the false positive rate, and numHashFunctions and the bloom filter length bitSize are generated;
a filter bit array calculation module: a value is generated from three inputs, namely the numHashFunctions and bitSize produced when the filter was initialized and the key passed in for calculation; a plurality of long-type values are generated according to numHashFunctions, a plurality of int-type bit values forming the bit array are generated from the long-type values, and the bit array is returned;
a duplicate judgment module: each bit value in the returned bit array is taken in turn and the bit stored at that position is queried in the Redis database; if the returned value is 0, no value has been mapped to that bit, so the value does not exist and the update or store filter module is called; if the returned value is 1, the value already exists and the delete filter module is called;
an update or store filter module: if the value is judged not to exist in the filter, the bit positions given by the returned bit array are set and stored in Redis;
a delete filter module: if the value is judged to already exist in the filter, the key passed in for calculation is the key to be deleted, and the Counter of the delete filter is decremented by 1 for each single bit, the bits being processed one by one.
Specifically, numHashFunctions and bitSize determine the length of the filter bit array;
this length affects the efficiency and error rate of the filter's final query and update operations.
Specifically, the process of generating a value includes:
initializing a value array whose length is numHashFunctions; when a user uses the filter, the key to be verified is passed in, a 64-bit long-type hash value is generated from the key, an int-type value with 10-bit length is generated from the long-type value as hash1, and an int-type value with 10-bit length is generated as hash2 from the long-type value after an unsigned right shift (>>>) of 32 bits;
then looping numHashFunctions times and generating a nextHash value as hash1 + i * hash2, where i is the loop index; whether nextHash is less than 0 is checked, and when it is, nextHash is inverted;
and then assigning the value, which is generated by taking the remainder of nextHash with respect to bitSize: bit = nextHash % bitSize, so that the bit value does not exceed bitSize.
Specifically, the key passed in for calculation is the data passed in by the user;
one piece of data corresponds to one key, and the key is used to query whether the corresponding data exists in the filter.
Specifically, the algorithm for generating the number of hash functions numHashFunctions and the bloom filter length bitSize is as follows:
[Formula images BDA0002567651940000061 and BDA0002567651940000062 are not reproduced in this text.]
if the calculated numHashFunctions value is less than or equal to 3, numHashFunctions takes the fixed value 3 and is never allowed to be less than 3.
Specifically, the two parameters p and n are dynamic;
countValue records the number of values actually filtered or stored by the project; it is initialized to 0, incremented according to actual use, and stored in Redis;
the value n represents the number of filter values reserved; when countValue equals n, n is reinitialized to n multiplied by 2, that is, n is doubled;
the value p represents the acceptable probability of a false positive in data filtering; when countValue equals n, p is reinitialized to p divided by 2.
The invention provides a filtering method for a bloom filter, which comprises the following steps:
initializing a filter: each time a filter is initialized, two parameters, p and n, are passed in, where n is the number of inserted elements and p is the false positive rate, and numHashFunctions and the bloom filter length bitSize are generated;
calculating a filter bit array: a value is generated from three inputs, namely the numHashFunctions and bitSize produced when the filter was initialized and the key passed in for calculation; a plurality of long-type values are generated according to numHashFunctions, a plurality of int-type bit values forming the bit array are generated from the long-type values, and the bit array is returned;
judging duplicates: each bit value in the returned bit array is taken in turn and the bit stored at that position is queried in the Redis database; if the returned value is 0, no value has been mapped to that bit, so the value does not exist and the update or store filter step is called; if the returned value is 1, the value already exists and the delete filter step is called;
updating or storing the filter: if the value is judged not to exist in the filter, the bit positions given by the returned bit array are set and stored in Redis;
deleting from the filter: if the value is judged to already exist in the filter, the key passed in for calculation is the key to be deleted, and the Counter of the delete filter is decremented by 1 for each single bit, the bits being processed one by one.
Specifically, numHashFunctions and bitSize determine the length of the filter bit array;
this length affects the efficiency and error rate of the filter's final query and update operations;
the key passed in for calculation is the data passed in by the user;
one piece of data corresponds to one key, and the key is used to query whether the corresponding data exists in the filter.
Specifically, the process of generating a value includes:
initializing a value array whose length is numHashFunctions; when a user uses the filter, the key to be verified is passed in, a 64-bit long-type hash value is generated from the key, an int-type value with 10-bit length is generated from the long-type value as hash1, and an int-type value with 10-bit length is generated as hash2 from the long-type value after an unsigned right shift (>>>) of 32 bits;
then looping numHashFunctions times and generating a nextHash value as hash1 + i * hash2, where i is the loop index; whether nextHash is less than 0 is checked, and when it is, nextHash is inverted;
and then assigning the value, which is generated by taking the remainder of nextHash with respect to bitSize: bit = nextHash % bitSize, so that the bit value does not exceed bitSize.
Specifically, the algorithm for generating the number of hash functions numHashFunctions and the bloom filter length bitSize is as follows:
[Formula images BDA0002567651940000081 and BDA0002567651940000082 are not reproduced in this text.]
if the calculated numHashFunctions value is less than or equal to 3, numHashFunctions takes the fixed value 3 and is never allowed to be less than 3;
the two parameters p and n are dynamic;
countValue records the number of values actually filtered or stored by the project; it is initialized to 0, incremented according to actual use, and stored in Redis;
the value n represents the number of filter values reserved; when countValue equals n, n is reinitialized to n multiplied by 2, that is, n is doubled;
the value p represents the acceptable probability of a false positive in data filtering; when countValue equals n, p is reinitialized to p divided by 2.
The present invention will be described more specifically below with reference to preferred examples.
Preferred example 1:
The invention aims to guarantee the uniqueness (non-duplication) of data and high availability under large data volumes. It comprises a filter initialization stage, a filter bit array calculation stage, a stage for judging whether Redis already contains the filter bit array, a filter value update/store stage, and a stage for deleting the filter value after the data is deleted.
As shown in fig. 1, a filter for handling duplicates in filtered data under large data volumes includes:
a filter initialization module: each time a filter is initialized, two parameters, p and n (n is the number of inserted elements and p is the false positive rate), are passed in, and numHashFunctions and the bloom filter length (bitSize) are generated. Initializing the filter is an important step: it is the basis for generating numHashFunctions and bitSize, which determine the length of the filter bit array. This length affects the efficiency and error rate of the filter's final query and update operations;
a filter bit array calculation module: a value is generated from the numHashFunctions and bitSize produced when the filter was initialized and from the key passed in for calculation. (The key is the data passed in by the user. Generally one piece of data corresponds to one key, and a user who wants to know whether a piece of data exists in the filter queries the filter with that key: a result of true means the value is considered to exist, and false means it is considered not to exist.) A plurality of Long-type values are generated according to numHashFunctions (Long is a data type; the long keyword denotes long integer data, a basic data type in programming languages, short for long int and signed by default), a plurality of int-type bit values forming the bit array are then generated from the Long-type values according to bitSize, and finally the bit array is returned;
a duplicate judgment module: for each bit value in the returned bit array, the bit at that position is queried in Redis. If a returned value is 0, no value has been mapped to that bit, so the value does not exist; if the returned values are all 1, the value may already exist. Redis is a database for storing and querying data (see https://baike.baidu.com/item/Redis/6549233?fr=adadin);
an update/store filter module: if the value is judged not to exist in the filter, the bit positions given by the returned bit array are set and stored in Redis;
a delete filter module: when a value to be deleted is passed in, the Counter is decremented by 1 for each single bit, the bits being processed one by one.
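Deletion by decrementing a Counter per bit position is the idea behind a counting bloom filter, and can be sketched as below. The class is a hypothetical in-memory illustration; the text keeps its state in Redis and does not give this interface.

```python
class CountingStore:
    """In-memory sketch of per-position counters that support deletion."""

    def __init__(self):
        self.counters = {}  # bit position -> number of stored values mapped to it

    def increment(self, positions):
        # Called when a value is stored: bump the counter at each position.
        for pos in positions:
            self.counters[pos] = self.counters.get(pos, 0) + 1

    def decrement(self, positions):
        # Called when a value is deleted: lower each counter by 1, one by one,
        # never going below zero.
        for pos in positions:
            current = self.counters.get(pos, 0)
            if current > 0:
                self.counters[pos] = current - 1

    def might_contain(self, positions):
        # A position with counter 0 proves the value is not stored.
        return all(self.counters.get(pos, 0) > 0 for pos in positions)
```

Because positions shared with other values keep a counter above zero, deleting one value does not erase the evidence for another value that maps to some of the same positions.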
The invention comprises the following steps:
(1) A modified algorithm for generating numHashFunctions and bitSize is adopted. In the initialization stage, the two parameters p and n are passed in to generate numHashFunctions and bitSize. Under the previous algorithm, the generation rule for numHashFunctions is as follows:
[Formula images BDA0002567651940000091 and BDA0002567651940000092 are not reproduced in this text.]
It can be seen that, as bitSize slowly increases, the numHashFunctions value decreases until it is less than 1. If numHashFunctions approaches or equals 1, the filter bit array becomes shorter and shorter, down to a length of 1. After the generation algorithm for numHashFunctions and bitSize is modified, numHashFunctions can never approach 1: its minimum value is held at 3, so the accuracy of the filter is not degraded.
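As a worked illustration of when the lower bound of 3 takes effect, assume the textbook bloom filter formulas (an assumption, since the formula images are not available in this text), with m standing for bitSize and k for numHashFunctions:

```latex
\[
m = -\frac{n \ln p}{(\ln 2)^2},
\qquad
k = \frac{m}{n}\,\ln 2 = -\frac{\ln p}{\ln 2} = \log_2 \frac{1}{p}
\]
\[
k < 3 \iff \log_2 \frac{1}{p} < 3 \iff p > \frac{1}{8}
\]
```

Under these assumed formulas, the unclamped hash function count falls below 3 only when the accepted false positive rate exceeds 0.125, and for any stricter p the clamp leaves the computed value unchanged.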
(2) The two parameters p and n are made dynamic and the filter bit array calculation is reworked; this involves initializing the filter, calculating the filter bit array, and updating/storing the filter. When the filter is updated or stored, a new key, countValue, is created to count the number of values stored in the filter, so that the total number of stored values is always known. Taking a data volume of n = 200 million as an example, n is divided, either equally or proportionally, into 100 parts, giving 100 intervals. Each time countValue reaches the boundary of a part, the two parameters p and n are updated, so that they change as the actual data volume grows, and numHashFunctions and bitSize change with them. The size of bitSize is scaled in step with each of the 100 intervals, which changes the interval in which each bit of the filter bit array falls. With the bits spread over 100 intervals, the filter's bits are in effect stored in 100 relatively independent spaces with very little mutual interference. This solves the problem that, when the company has 200 million records or more, p and n have to be set very large, which severely affects the speed and error rate of the filter in actual use. After the improvement, the filter achieves the speed and error rate of processing only 2 million records: the processing time of the filter is greatly reduced, the actual error rate is greatly reduced, and large data volumes can be processed with high availability.
countValue: records the number of values actually filtered or stored by the project; it is initialized to 0, incremented according to actual use, and stored in Redis (a parameter newly added by the modification).
n value: the number of filter values reserved. It is initialized to 2000000: in the project, the company's target value of 200000000 is divided into 100 segments of 2000000 each, and n is fixed within a single segment while the project runs. When countValue equals n, however, n is reinitialized to n multiplied by 2, that is, n is doubled.
p value: the acceptable probability of a false positive in data filtering. It is generated in the same way as n and is initialized to 0.0000001; it is fixed within a single segment while the project runs. When countValue equals n, however, p is reinitialized to p divided by 2.
(3) The Counter size of the delete filter is set by a modified scheme. Choosing the Counter size is a dilemma: from the point of view of use, the larger the Counter the better, because a larger Counter can record more information; but a larger Counter occupies more resources and in many cases wastes a great deal of space. Once p and n are dynamic, the Counter size can follow the interval that countValue has reached and grow in equal proportion. Each stage of the filter then has a Counter of suitable size, which balances space utilization against resource occupation and preserves the efficiency of the filter.
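Starting from the initial values given here (n = 2000000 and p = 0.0000001), the successive re-initializations can be listed with a short generator. The function name is hypothetical, and the sketch only applies the stated rule that n is multiplied by 2 and p is divided by 2 at each re-initialization.

```python
def parameter_stages(initial_n, initial_p, stage_count):
    """Yield (n, p) for each stage: n doubles and p halves at every re-initialization."""
    n, p = initial_n, initial_p
    for _ in range(stage_count):
        yield n, p
        n, p = n * 2, p / 2


# The first three stages for the initial values in the text.
stages = list(parameter_stages(2_000_000, 0.0000001, 3))
```

Each stage's n is also the countValue at which the next re-initialization is triggered.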
In the description of the present application, it is to be understood that the terms "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like indicate orientations or positional relationships based on those shown in the drawings, and are only for convenience in describing the present application and simplifying the description, but do not indicate or imply that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and thus, should not be construed as limiting the present application.
Those skilled in the art will appreciate that, in addition to implementing the systems, apparatus, and various modules thereof provided by the present invention in purely computer readable program code, the same procedures can be implemented entirely by logically programming method steps such that the systems, apparatus, and various modules thereof are provided in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system, the device and the modules thereof provided by the present invention can be considered as a hardware component, and the modules included in the system, the device and the modules thereof for implementing various programs can also be considered as structures in the hardware component; modules for performing various functions may also be considered to be both software programs for performing the methods and structures within hardware components.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (5)

1. A bloom filter system, comprising:
an initialization filter module: when each filter is initialized, two parameters, p and n, must be passed in, where n is the number of elements to be inserted and p is the false-positive rate; from these, the number of hash functions numHashFunctions and the bloom filter length bitSize are generated;
a filter bit array calculation module: generates values from three inputs, namely the numHashFunctions and bitSize generated by the initialization filter module and the passed-in value to be computed, key; a plurality of long-type values are generated according to numHashFunctions, a plurality of int-type bit array values are generated from the long-type values, and the bit array is finally returned;
a duplicate judging module: loops over the returned bit array to obtain each bit position and queries the value at each bit position in the Redis database: if the value is 0, no value has been mapped to that bit, the value does not exist, and the update or store filter module is called; if the returned value is 1, the value already exists, and the deletion filter module is called;
an update or store filter module: if the value is judged not to exist in the filter, the bit positions pointed to by the returned bit array are stored in Redis;
a deletion filter module: if the value is judged to exist in the filter, the passed-in value key is the key to be deleted, and the Counter value of the deletion filter is decremented by 1 for each bit, one bit at a time;
once n and p are initialized, the magnitudes of the generated bitSize and numHashFunctions values can be controlled, so that the values added to Redis are divided into segments, and the values in each segment are bounded in length so that the segments do not affect one another;
the numHashFunctions and bitSize determine the length of the filter bit;
the length of the filter bit can affect the efficiency and error rate of the final query and update value of the filter;
the process of generating a value includes:
initializing a value array, value, whose length is numHashFunctions; when a user uses the filter, the key to be verified is passed in; the key is then hashed to a 64-bit long-type value; an int-type value with 10-bit length is generated from the long-type value as hash1, and an int-type value with 10-bit length is generated as hash2 by an unsigned right shift (>>>) of the long-type value by 32 bits;
then looping numHashFunctions times, generating a nextHash value as hash1 + i * hash2, where i is the loop index; judging whether nextHash is less than 0, and performing an inversion operation on nextHash when it is less than 0;
and then assigning value, where each entry is generated by taking the remainder of nextHash with respect to bitSize, so that the bit position does not exceed bitSize: nextHash % bitSize;
changing the two parameters p and n to be dynamic, and modifying the filter accordingly, including the initialization filter, the calculation of the filter bit array, and the update/store filter; creating a new key when updating/storing the filter: countValue, which stores and counts the number of values stored by the filter, so that the total number of stored values is known; dividing n, based on a data volume of 200 million, into 100 parts equally or proportionally, which is equivalent to dividing it into 100 intervals; and, according to the size of countValue, updating the two parameters p and n whenever countValue reaches a part boundary, so that the parameters change as the actual data volume grows and in turn change numHashFunctions and bitSize; the bitSize values of the 100 intervals are related by multiples, and this multiple relationship changes the interval in which each bit of the filter bit array falls; with the bits falling into 100 intervals, all bits of the filter are divided into 100 relatively independent spaces for storage;
countValue: recording the actual filtering or storage value of the project, initializing to 0, increasing according to the actual use number, and storing in Redis;
n value: represents the reserved number of filter values; it is initialized to 2000000, the target value of 200000000 being divided into 100 segments in units of 2000000, and it stays constant within a single segment while the project runs; however, when countValue is equal to n, n is reinitialized to n multiplied by 2, i.e. multiplied with n as the base;
p value: represents the acceptable probability of misjudgment in data filtering; it is generated in the same way as n, is initialized to 0.0000001, and stays constant within a single segment during operation; however, when countValue is equal to n, p is reinitialized to p divided by 2;
with the p and n parameters made dynamic, the Counter of the deletion filter is sized according to the interval that countValue has reached, the Counter size being increased in equal proportion, so that the filter has a Counter at each stage, thereby balancing space utilization against resource occupation and ensuring the efficiency of the filter;
after the value to be deleted is passed in, the Counter value is decremented by 1 for each bit, one bit at a time;
numHashFunctions is never allowed to approach 1 and is controlled so that its minimum is close to or equal to 3, so that the error rate of the filter does not get worse.
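The value-generation process recited in claim 1 (a 64-bit hash split into hash1 and hash2, combined as hash1 + i * hash2, inverted when negative, then reduced modulo bitSize) can be sketched as follows. Several details are assumptions made only for this illustration: the claim does not name the 64-bit hash function, so BLAKE2b truncated to 8 bytes stands in for it; the loop index is taken to run from 1 to numHashFunctions; the "inversion" is taken to be bitwise NOT; and the helper names `to_int32` and `bit_indexes` are hypothetical. Signed 32-bit wraparound is emulated because the claim speaks of int-type values.

```python
import hashlib
from typing import List


def to_int32(x: int) -> int:
    """Wrap an integer to a signed 32-bit value, like a fixed-width int."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


def bit_indexes(key: str, num_hash_functions: int, bit_size: int) -> List[int]:
    """Return num_hash_functions bit positions in [0, bit_size) for key."""
    # Stand-in 64-bit hash (assumption): the source does not specify one.
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest()
    hash64 = int.from_bytes(digest, "big")

    hash1 = to_int32(hash64)        # low 32 bits
    hash2 = to_int32(hash64 >> 32)  # hash64 is non-negative, so >> behaves as an unsigned shift

    value = []
    for i in range(1, num_hash_functions + 1):
        next_hash = to_int32(hash1 + i * hash2)
        if next_hash < 0:
            next_hash = ~next_hash  # bitwise NOT maps a negative int to a non-negative one
        value.append(next_hash % bit_size)
    return value
```

Deriving every position from just two 32-bit halves means the key is hashed once regardless of how many hash functions the filter uses.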
2. The bloom filter system of claim 1, wherein the incoming computed values key refer to user incoming data;
one data corresponds to one key value, and whether the corresponding data exists in the filter is inquired through the key value.
3. The bloom filter system according to claim 1, wherein the specific algorithm for generating the hash function number numHashFunctions and the bloom filter length bitSize is as follows:
Figure FDA0003031322640000031
Figure FDA0003031322640000032
if the calculated numHashFunctions value is less than or equal to 3, the numHashFunctions takes a fixed value of 3, and is prohibited from being less than 3.
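The two formulas referenced in claim 3 appear only as figure images on this page. As a rough illustration, the sketch below assumes they are the standard bloom filter sizing formulas, bitSize = -n ln(p) / (ln 2)^2 and numHashFunctions = (bitSize / n) ln 2, and then applies the floor of 3 stated in the claim. The function names, the constant `MIN_HASH_FUNCTIONS`, and the rounding choices (ceiling for bitSize, nearest integer for numHashFunctions) are assumptions, not taken from the source.

```python
import math

MIN_HASH_FUNCTIONS = 3  # the claim forbids numHashFunctions below 3


def optimal_bit_size(n: int, p: float) -> int:
    """Bits needed to hold n elements at false-positive rate p (standard formula)."""
    return math.ceil(-n * math.log(p) / (math.log(2) ** 2))


def optimal_num_hash_functions(n: int, bit_size: int) -> int:
    """Number of hash functions for the given capacity and size, clamped to at least 3."""
    k = round(bit_size / n * math.log(2))
    return max(MIN_HASH_FUNCTIONS, k)
```

With the initial values given in the description (n = 2000000, p = 0.0000001), these formulas yield roughly 67 million bits and 23 hash functions, so the floor of 3 only comes into play for much looser false-positive targets.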
4. A method of filtering a bloom filter, comprising:
initializing a filter: when each filter is initialized, two parameters, p and n, must be passed in, where n is the number of elements to be inserted and p is the false-positive rate; from these, the number of hash functions numHashFunctions and the bloom filter length bitSize are generated;
calculating a filter bit array: generating values from three inputs, namely the numHashFunctions and bitSize generated by initializing the filter and the passed-in value to be computed, key; a plurality of long-type values are generated according to numHashFunctions, a plurality of int-type bit array values are generated from the long-type values, and the bit array is finally returned;
a duplicate judging step: looping over the returned bit array to obtain each bit position and querying the value at each bit position in the Redis database: if the value is 0, no value has been mapped to that bit, the value does not exist, and the update or store filter step is called; if the returned value is 1, the value already exists, and the filter deletion step is called;
an update or store filter step: if the value is judged not to exist in the filter, the bit positions pointed to by the returned bit array are stored in Redis;
a filter deletion step: if the value is judged to exist in the filter, the passed-in value key is the key to be deleted, and the Counter value of the deletion filter is decremented by 1 for each bit, one bit at a time;
once n and p are initialized, the magnitudes of the generated bitSize and numHashFunctions values can be controlled, so that the values added to Redis are divided into segments, and the values in each segment are bounded in length so that the segments do not affect one another;
the numHashFunctions and bitSize determine the length of the filter bit;
the length of the filter bit can affect the efficiency and error rate of the final query and update value of the filter;
the incoming computed value key refers to user incoming data;
one data corresponds to one key value, and whether the corresponding data exists in the filter is inquired through the key value;
the process of generating a value includes:
initializing a value array, value, whose length is numHashFunctions; when a user uses the filter, the key to be verified is passed in; the key is then hashed to a 64-bit long-type value; an int-type value with 10-bit length is generated from the long-type value as hash1, and an int-type value with 10-bit length is generated as hash2 by an unsigned right shift (>>>) of the long-type value by 32 bits;
then looping numHashFunctions times, generating a nextHash value as hash1 + i * hash2, where i is the loop index; judging whether nextHash is less than 0, and performing an inversion operation on nextHash when it is less than 0;
and then assigning value, where each entry is generated by taking the remainder of nextHash with respect to bitSize, so that the bit position does not exceed bitSize: nextHash % bitSize;
changing the two parameters p and n to be dynamic, and modifying the filter accordingly, including the initialization filter, the calculation of the filter bit array, and the update/store filter; creating a new key when updating/storing the filter: countValue, which stores and counts the number of values stored by the filter, so that the total number of stored values is known; dividing n, based on a data volume of 200 million, into 100 parts equally or proportionally, which is equivalent to dividing it into 100 intervals; and, according to the size of countValue, updating the two parameters p and n whenever countValue reaches a part boundary, so that the parameters change as the actual data volume grows and in turn change numHashFunctions and bitSize; the bitSize values of the 100 intervals are related by multiples, and this multiple relationship changes the interval in which each bit of the filter bit array falls; with the bits falling into 100 intervals, all bits of the filter are divided into 100 relatively independent spaces for storage;
countValue: recording the actual filtering or storage value of the project, initializing to 0, increasing according to the actual use number, and storing in Redis;
n value: represents the reserved number of filter values; it is initialized to 2000000, the target value of 200000000 being divided into 100 segments in units of 2000000, and it stays constant within a single segment while the project runs; however, when countValue is equal to n, n is reinitialized to n multiplied by 2, i.e. multiplied with n as the base;
p value: represents the acceptable probability of misjudgment in data filtering; it is generated in the same way as n, is initialized to 0.0000001, and stays constant within a single segment during operation; however, when countValue is equal to n, p is reinitialized to p divided by 2;
with the p and n parameters made dynamic, the Counter of the deletion filter is sized according to the interval that countValue has reached, the Counter size being increased in equal proportion, so that the filter has a Counter at each stage, thereby balancing space utilization against resource occupation and ensuring the efficiency of the filter;
after the value to be deleted is passed in, the Counter value is decremented by 1 for each bit, one bit at a time;
numHashFunctions is never allowed to approach 1 and is controlled so that its minimum is close to or equal to 3, so that the error rate of the filter does not get worse.
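The query, store, and delete steps of the method can be illustrated with a small counting filter in which each bit position holds a Counter instead of a single bit. This is a sketch under stated assumptions, not the claimed implementation: a plain in-memory dictionary stands in for Redis, the class name `CountingFilter` and its method names are hypothetical, and the bit positions are passed in directly (in the method they would come from the bit array calculation step).

```python
from typing import Dict, Iterable, List


class CountingFilter:
    """Hypothetical counting filter; a dict stands in for the Redis store."""

    def __init__(self) -> None:
        self.counters: Dict[int, int] = {}  # bit position -> Counter value

    def might_contain(self, indexes: Iterable[int]) -> bool:
        """A value may exist only if every one of its positions is non-zero."""
        return all(self.counters.get(i, 0) > 0 for i in indexes)

    def add(self, indexes: Iterable[int]) -> None:
        """Store a value by incrementing the Counter at each of its positions."""
        for i in indexes:
            self.counters[i] = self.counters.get(i, 0) + 1

    def remove(self, indexes: List[int]) -> bool:
        """Delete a value by decrementing each Counter by 1, one position at a time.

        Returns False, changing nothing, when the value is not present.
        """
        if not self.might_contain(indexes):
            return False
        for i in indexes:
            self.counters[i] -= 1
        return True
```

Keeping a count per position is what makes deletion safe: two values that share a position each contribute 1 to its Counter, so removing one of them leaves the other still detectable.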
5. The filtering method of the bloom filter according to claim 4, wherein the specific algorithm for generating the hash function number numHashFunctions and the bloom filter length bitSize is as follows:
Figure FDA0003031322640000051
Figure FDA0003031322640000052
if the calculated numHashFunctions value is less than or equal to 3, the numHashFunctions takes a fixed value of 3, and is prohibited from being less than 3.
CN202010628830.4A 2020-07-02 2020-07-02 Bloom filter system and filtering method Active CN111930923B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010628830.4A CN111930923B (en) 2020-07-02 2020-07-02 Bloom filter system and filtering method

Publications (2)

Publication Number Publication Date
CN111930923A CN111930923A (en) 2020-11-13
CN111930923B true CN111930923B (en) 2021-07-30

Family

ID=73316986

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010628830.4A Active CN111930923B (en) 2020-07-02 2020-07-02 Bloom filter system and filtering method

Country Status (1)

Country Link
CN (1) CN111930923B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112199378A (en) * 2020-12-01 2021-01-08 北京快成科技股份公司 IP address matching method and device
CN112528685B (en) * 2020-12-10 2022-04-08 南京航空航天大学 RFID data redundancy processing method based on dynamic additional bloom filter
CN115203150A (en) * 2022-05-13 2022-10-18 浪潮卓数大数据产业发展有限公司 Bloom filter-based massive file backup data synchronization method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005098863A2 (en) * 2004-03-30 2005-10-20 Centerboard, Inc. Method and apparatus achieving memory and transmission overhead reductions in a content routing network
CN102799617A (en) * 2012-06-19 2012-11-28 华中科技大学 Construction and query optimization methods for multiple layers of Bloom Filters
CN103078754A (en) * 2012-12-29 2013-05-01 大连环宇移动科技有限公司 Network data stream statistical method on basis of counting bloom filter
CN104809182A (en) * 2015-04-17 2015-07-29 东南大学 Method for web crawler URL (uniform resource locator) deduplicating based on DSBF (dynamic splitting Bloom Filter)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101923568B (en) * 2010-06-23 2013-06-19 北京星网锐捷网络技术有限公司 Method for increasing and canceling elements of Bloom filter and Bloom filter

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
布隆过滤器的那点事;浪;《SegmentFault思否》;20200603;第1-5页 *
快速判重——布隆过滤器(Bloom Filter);谁说大象不能跳舞;《CSDN博客》;20190331;第1-7页 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant