CN107330094B

CN107330094B - Bloom filter tree structure for dynamically storing key value pairs and key value pair storage method

Info

Publication number: CN107330094B
Application number: CN201710542207.5A
Authority: CN
Inventors: 潘海娜; 凌纯清; 谢鲲
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2017-07-05
Filing date: 2017-07-05
Publication date: 2020-06-16
Anticipated expiration: 2037-07-05
Also published as: CN107330094A

Abstract

The invention discloses a Bloom filter tree structure for dynamically storing key-value pairs and a key-value pair storage method. The Bloom filter tree structure comprises a complete d-ary tree in which each node is a Bloom filter and each leaf node represents a value. The storage size of each node is half that of its parent node, and the root node comprises d × k different hash functions, that is, d hash groups of k hash functions each. In application fields that generate large amounts of data and require key-value pair queries, such as interactive database queries, resource location in high-speed networks, and computer network monitoring, the invention can greatly reduce set query time, reduce resource consumption, process dynamically arriving data, and adapt to the network environment.

Description

Bloom filter tree structure for dynamically storing key value pairs and key value pair storage method

Technical Field

The invention relates to the field of computer networks and computer system storage, in particular to the application field of interactive query with high performance and high throughput, and specifically relates to a storage structure and a storage method of an expandable bloom tree for key value pairs.

Background

In recent years, with the rapid development of computers, the size of the data sets held in databases, networks and other applications has grown geometrically. Storing and querying key-value pairs (key, value) are common tasks in computer systems, so a key-value pair storage data structure that supports fast key-value pair queries is needed. Key-value pair operations occur frequently in network and storage systems, such as the key-value databases MongoDB and CouchDB. Each unique key placed in a key-value pair storage system corresponds to a value. For example, (3, 5) is a key-value pair with key 3 and value 5; after (3, 5) is stored in the system, the value 5 can be obtained by querying the key 3.

Designing efficient key-value pair storage and query structures is a significant challenge. In a layer 2 switch, each MAC address is associated with a unique port. When a frame is to be forwarded, the search engine looks up the destination address of the frame in the MAC table, so the problem of mapping a MAC address to a port becomes a key-value pair query problem in which the MAC address is the key and the port number to be found is the value. Since MAC addresses are continuously added to the table, the number of elements is not known in advance. If the key-value pairs are stored in a cell structure, a large amount of space is consumed and looking up the value of a given key takes a long time; if a static Bloom filter structure is used to store the key-value pairs, only static data can be processed, which is impractical in real applications. Therefore, in a high-speed computer network, efficiently storing this information and quickly querying the corresponding key-value pairs is a challenge.

The Bloom filter is a space-efficient data structure with efficient queries. It can represent a data set compactly and meets current requirements for efficient resource interaction and search. The Bloom filter was proposed by B. Bloom in 1970 and is widely used in computer systems to represent very large data sets and improve query efficiency. In essence, a Bloom filter maps the elements of a set into a bit vector using k hash functions. While it represents the set efficiently, an element query has a certain false positive rate (an element that does not belong to the set is mistakenly judged to belong to it) but no false negatives (an element that belongs to the set is never judged not to belong to it), and both query and storage efficiency are high.
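As a minimal sketch of the conventional Bloom filter described above, the following Python class maps each element into an m-bit vector with k hash functions. The class name `BloomFilter`, the one-byte-per-bit storage, and the use of salted SHA-256 as the k hash functions are illustrative assumptions, not part of the invention.

```python
import hashlib


class BloomFilter:
    """Conventional Bloom filter: k hash functions over an m-bit vector."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, for readability only

    def _positions(self, item: str):
        # Assumption: the k hash functions are SHA-256 salted with an index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # Insertion sets every position selected by the k hash functions.
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item: str) -> bool:
        # All k positions set: "possibly in the set" (false positives can occur).
        # Any position clear: "definitely not in the set" (no false negatives).
        return all(self.bits[p] for p in self._positions(item))


bf = BloomFilter(m=1024, k=3)
bf.add("00:1a:2b:3c:4d:5e")
print("00:1a:2b:3c:4d:5e" in bf)
```

Because insertion only ever sets bits, an inserted element is always reported as present; the filter answers membership only and has nowhere to store an associated value.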

However, a conventional Bloom filter can only support membership queries, that is, queries of whether an element exists in a set. If the element is a key, only the query of whether the key exists in the set is supported, and no (key, value) operation is supported. Because a Bloom filter cannot store values directly, a conventional Bloom filter cannot operate on key-value pairs. To make the Bloom filter support the basic operations on key-value pairs, the conventional Bloom filter must be improved and a new Bloom filter structure designed.

Disclosure of Invention

The technical problem to be solved by the present invention is to provide a bloom filter tree structure for dynamically storing key-value pairs and a key-value pair storage method, aiming at the defects of the prior art.

In order to solve the technical problems, the technical scheme adopted by the invention is as follows: a bloom filter tree structure for dynamically storing key-value pairs, comprising a full d-ary tree; each node of each complete d-ary tree is a bloom filter; each leaf node of each full d-ary tree represents a value; the storage unit size of each node is half of that of the parent node of the node, and the root node comprises d × k different hash functions, that is, the root node comprises d hash groups, and each group comprises k hash functions.

Correspondingly, the invention also provides a method for storing key value pairs of the bloom filter tree structure, which comprises the following steps:

Insertion operation: when a key-value pair (key, value) needs to be inserted, first check whether the value has already been inserted into the Bloom filter tree; if not, first perform the operation of adding a new value. If the value exists in the Bloom filter tree, insert the key-value pair directly: search for the leaf node corresponding to the value, obtain the position code of that leaf node, and determine the unique path from the root node to the leaf node. Compute the two groups of hash functions of the root node, that is, apply the k hash functions h_(i,1), h_(i,2), ..., h_(i,k) to the key to obtain h_(i,1)(key), h_(i,2)(key), ..., h_(i,k)(key), where i is the group number of the selected hash functions and i = 1 or 2; the hash values computed for i = 1 are stored in array A, and those computed for i = 2 are stored in array B. Then, according to the first bit of the leaf node code, select one of the arrays A and B for insertion at the root node, which amounts to right-shifting array A or B by zero bits; the same first bit determines the Bloom filter in which the next insertion is performed. Next, according to the second bit of the leaf node code, select one of the arrays A and B and right-shift its values by one bit to obtain the insertion positions, and insert. Continue in this way, inserting the key into the Bloom filter at each level, until it has been inserted into the leaf node corresponding to the value;

Query operation: when the value corresponding to a key needs to be queried, first compute the two groups of hash functions of the root node, h_(i,1)(key), h_(i,2)(key), ..., h_(i,k)(key) (i = 1 or 2), and store their values in the two arrays A and B respectively. At the root node, query the position units corresponding to the values of arrays A and B to find the first bit of the code. According to the code value obtained, move to the next node and continue the query; at this point the arrays A and B are right-shifted by one bit and the corresponding position units are queried to obtain the next code value. Continue in this way until a leaf node is reached, which yields the complete code, and decode the leaf node code to obtain the corresponding value;

Operation of adding a new value: adding a new value to the Bloom filter tree falls into one of two cases. If the original Bloom filter tree is a binary tree that is not full, a new leaf node is added directly at the end of the tree, and this leaf node represents the newly added value. If the original Bloom filter tree is a full binary tree, a new Bloom filter is added above the root node; the size of the new Bloom filter is 2 times that of the root node of the original Bloom filter tree. The new Bloom filter becomes the root node of the new Bloom filter tree, the original Bloom filter tree becomes the left subtree of the new root node, and a full binary tree with one layer fewer than the original Bloom filter tree is created as the right subtree of the new root node. All positions set to 1 in the original root node are shifted left by one bit and inserted into the new root node. The two groups of H₃ hash functions of the new root node then have one more layer than the original two groups of H₃ hash functions, that is, one more row is selected from the base matrix, and new leaf nodes continue to be added at the end of the Bloom filter tree.

In the insertion operation, the method for selecting the array to be inserted from the root node comprises the following steps: if the code value is 0, the A array is selected, and if the code value is 1, the B array is selected.

In the insertion operation, the method for obtaining the next bloom filter to be inserted according to the first bit encoding of the leaf node comprises the following steps: if the encoded value is 0, the left node is operated, and if the encoded value is 1, the right node is operated.
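The two selection rules above can be sketched as a small planning function that turns a leaf code into the sequence of (node, array, right-shift) steps an insertion would perform. The function name `insertion_plan` and the representation of a node by its code prefix are hypothetical; the final leaf step (reuse the array chosen by the last bit, shifted by the full code length) follows the worked example for code 01 in the Detailed Description and is otherwise an assumption.

```python
def insertion_plan(leaf_code: str):
    """List (node_prefix, array_name, right_shift) steps for one insertion.

    node_prefix is the part of the leaf code that leads to the node;
    the empty string denotes the root node.
    """
    if not leaf_code or set(leaf_code) - {"0", "1"}:
        raise ValueError("leaf_code must be a non-empty string of 0s and 1s")

    steps = []
    prefix = ""
    array = "A"
    for level, bit in enumerate(leaf_code):
        # Code value 0 selects array A, code value 1 selects array B.
        array = "A" if bit == "0" else "B"
        steps.append((prefix, array, level))
        # Code value 0 moves to the left node, 1 to the right node.
        prefix += bit
    # Leaf node: same array as the last bit, shifted by the code length.
    steps.append((prefix, array, len(leaf_code)))
    return steps


for node, array, shift in insertion_plan("01"):
    print(f"node '{node or 'root'}': insert {array} >> {shift}")
```

For the code 01 this yields A >> 0 at the root, B >> 1 at the left node, and B >> 2 at the leaf, matching the example given for FIG. 1.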

Compared with the prior art, the invention has the following beneficial effects: in application fields that generate large amounts of data and require key-value pair queries, such as interactive database queries, resource location in high-speed networks, and computer network monitoring, the invention can greatly reduce set query time, reduce resource consumption, process dynamically arriving data, and adapt to the network environment.

Drawings

FIG. 1 is a structural diagram of a Bloom tree, which combines Bloom filters with a binary tree structure so that each node is a Bloom filter. The root node has two groups of H₃ hash functions, each group having k hash functions. A leaf node represents a value, and the unique path from the root node to a leaf node can be obtained from the code of the leaf node. According to the leaf node code, if the code value is 0, a shift operation is performed on the first group of hash functions; if the code value is 1, a shift operation is performed on the second group of hash functions. [The formula images that label the hash groups in FIG. 1 and state how each is obtained are missing from this text.]

FIG. 2 is a diagram illustrating the operation of adding a new value to an underfilled binary tree. When a value which does not exist in a leaf node needs to be inserted, if the binary tree is not a full binary tree, the leaf node can be directly added at the tail of the tree to indicate the inserted value. The leaf node encoded as 11 in the figure is the newly added value.

FIG. 3 is a diagram illustrating the operation of adding a new value to a full binary tree. When a value that does not yet exist among the leaf nodes needs to be inserted and the binary tree is full, no leaf node can be added to the original tree, so the tree must be expanded upward by one layer, such as the newly added level 3 in FIG. 3. The original binary tree becomes the left subtree of level 3, and a full binary tree with one layer fewer than the original binary tree is constructed as the right subtree of level 3. Level 3 also has two groups of H₃ hash functions [formula images missing from this text]: all positions set in the root node of the original binary tree are shifted left by one bit and inserted into level 3, while the new hash functions select one more row from the base matrix than the original ones. A new binary tree that is not full has thus been constructed, and a leaf node can be added directly at the end of the tree to represent the inserted value. The leaf node encoded as 100 in the figure is the newly added value.

FIG. 4 shows details of the data used in the implementation environment, including data source, data time, and packet size. The invention is compared with two results of recent research. Bloom Tree, a search tree based on Bloom filters for multiple-set membership testing, combines a tree with Bloom filter structures to perform key-value pair queries; the algorithm uses d groups of hash functions at each Bloom filter node and computes from the root node until the leaf node corresponding to a value is found, and it can only process static data. COMB is a Bloom-filter-based key-value query algorithm proposed in the paper "Fast dynamic multiple-set membership testing using combinatorial bloom filters" in IEEE/ACM Transactions on Networking, 2012; after encoding the values, the algorithm chooses, from the known number of value groups, a suitable number of Bloom filters and the number of Bloom filters into which each key must be inserted, and all Bloom filters must be queried at query time to obtain the corresponding code.

Fig. 5(a) to 5(f) compare the average processing time of three algorithms, SBFT, Bloom Tree and COMB, when processing 1024 groups of data with a Bloom filter tree of fixed size m = 2²⁰ bits under different numbers of hash functions. FIG. 5(a) shows the results of a simulation experiment using the data set MAWI 1, FIG. 5(b) the data set MAWI 2, FIG. 5(c) the data set MAWI 3, FIG. 5(d) the data set ClarkNet-HTTP, FIG. 5(e) the data set UMass, and FIG. 5(f) the data set TKN. All results in FIG. 5(a) to 5(f) show that the proposed algorithm, SBFT, has the lowest average processing time on every data set. Compared with Bloom Tree, SBFT computes the d groups of hash functions only at the root node and then performs shift operations, whereas Bloom Tree must compute d groups of hash functions at every node, which is time-consuming. COMB, like Bloom Tree, must compute several hash functions for each Bloom filter, which is also time-consuming, but COMB operates on fewer Bloom filters than Bloom Tree, so it takes less time than Bloom Tree.

FIG. 6 is a diagram illustrating memory consumption when the three algorithms are used to process static data. The results show that the three algorithms consume the same amount of memory; that is, the proposed algorithm uses no extra memory while taking the least time to process the data.

Fig. 7(a) to 7(f) are schematic diagrams of the time the three algorithms take to process data packets as the number of data packets increases. Each group uses 3 hash functions. Since both Bloom Tree and COMB operate on a known data volume, their structures must be built before any data packet is inserted. The proposed algorithm can operate on an unknown data volume and processes data as the packets arrive, which is a dynamic data processing procedure. FIG. 7(a) shows the results of a simulation experiment using the data set MAWI 1, FIG. 7(b) the data set MAWI 2, FIG. 7(c) the data set MAWI 3, FIG. 7(d) the data set ClarkNet-HTTP, FIG. 7(e) the data set UMass, and FIG. 7(f) the data set TKN. All results in FIG. 7 show that, on every data set, the proposed algorithm processes packets dynamically and takes less time, whereas Bloom Tree and COMB process packets statically and take more time.

Fig. 8(a) to 8(f) are schematic diagrams of the memory the three algorithms consume when processing data packets as the number of data packets increases. Each group uses 3 hash functions. FIG. 8(a) shows the results of a simulation experiment using the data set MAWI 1, FIG. 8(b) the data set MAWI 2, FIG. 8(c) the data set MAWI 3, FIG. 8(d) the data set ClarkNet-HTTP, FIG. 8(e) the data set UMass, and FIG. 8(f) the data set TKN. All results in FIG. 8(a) to 8(f) show that, on every data set, the proposed algorithm processes packets dynamically, building the structure while it processes the data, whereas Bloom Tree and COMB process packets statically, building the structure before any data is processed, which also shows that their structures can only handle static data.

Detailed Description

In this embodiment, when processing static data, the selected memory size is m = 2²⁰ bits, and the root node occupies 95325 bits. An H₃ hash function base matrix of 18 rows and 8 columns is selected, of which the first 16 rows are used at the root node. Each group uses k = 3 hash functions, g = 1024 groups of data are processed, and the tree height is h = 10. FIG. 1 is a diagram of a static Bloom tree structure, from which the key-value pair insertion and query processes can be analyzed.
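The figures in this embodiment are consistent with the following reading, offered here as an assumption rather than something the text states: in a binary tree where each node is half the size of its parent, every level occupies as many bits in total as the root, so a tree of height h = 10 has 11 levels and the root receives about m / 11 bits. The helper names below are hypothetical.

```python
def root_bits(total_bits: int, height: int) -> int:
    """Root size if every one of the (height + 1) levels uses root-many bits.

    Level j holds 2**j nodes of size root / 2**j, so each level totals
    `root` bits and the whole tree totals (height + 1) * root bits.
    """
    return total_bits // (height + 1)


def leaves(height: int) -> int:
    """Number of leaf nodes (distinct values) in a full binary tree."""
    return 2 ** height


m = 2 ** 20  # memory size from the embodiment, in bits
h = 10       # tree height from the embodiment

print(root_bits(m, h))  # 95325, the root size quoted in the embodiment
print(leaves(h))        # 1024, the number of data groups g
```

Under this reading the 1024 leaves correspond to the g = 1024 groups of data, one leaf per value.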

Insert operation Insert(key, value): search for the leaf node corresponding to the value to obtain the position code of that leaf node, and determine the unique path from the root node to the leaf node. Compute the two groups of hash functions of the root node: apply the k hash functions h_(i,1), h_(i,2), ..., h_(i,k) (where i is the number of the selected hash function group) to the key to obtain h_(i,1)(key), h_(i,2)(key), ..., h_(i,k)(key), and store the values in the two arrays A and B. Then, according to the first bit of the leaf node code, select one of the arrays A and B for insertion at the root node (array A if the code value is 0, array B if it is 1), which is equivalent to right-shifting the array by zero bits (A >> 0 or B >> 0). The same first bit gives the next Bloom filter in which to insert (the left node if the code value is 0, the right node if it is 1). Then, according to the second bit of the leaf node code, select one of the arrays A and B and right-shift it by one bit (A >> 1 or B >> 1) to obtain the insertion positions. Similar operations continue, inserting the key into the Bloom filter at each level, until the key has been inserted into the leaf node corresponding to the value. For example, in FIG. 1, assume that the value in the inserted key-value pair is encoded as 01. The two groups of hash functions of the root node are computed and their values stored in the arrays A and B. Since the first bit of the code is 0, array A is selected at the root node, shifted right by 0 bits (A >> 0), and inserted into the root node; the next node is the left node. Since the second bit is 1, array B is selected at that node, shifted right by one bit (B >> 1), and inserted; the next node is the right node. Finally, array B is shifted right by two bits (B >> 2) and inserted into that node, which completes the insertion.

Query operation Query(key): first, compute the two groups of hash functions of the root node, h_(i,1)(key), h_(i,2)(key), ..., h_(i,k)(key) (i = 1 or 2), where i is the group number of the selected hash functions; the hash values computed for i = 1 are stored in array A, and those computed for i = 2 are stored in array B. At the root node, query the position units corresponding to the values of arrays A and B. An array is satisfied if the values at all of its corresponding positions are 1, and this gives the first bit of the code: if array A is satisfied, the code is 0; if array B is satisfied, the code is 1; if neither is satisfied, the key does not exist and the query ends. According to the code value obtained, move to the next node and continue the query; at this point the arrays A and B must be right-shifted (A >> 1, B >> 1) and the corresponding position units queried to obtain the next code value. Similar operations continue until a leaf node is reached. Finally, a complete code is obtained, and the corresponding value follows from the leaf node code. As an illustration using FIG. 1, assume a key is queried. First compute the two groups of hash functions of the root node and store their values in the arrays A and B. Query the values of both arrays in the root node; the position units corresponding to the values in array A are all 1, so the first bit of the code is recorded as 0 and the next node is the left node. Shift both arrays right by one bit (A >> 1, B >> 1) and query each in that node; the position units corresponding to the values in array B are all 1, so the second bit is recorded as 1 and the next node is the right node. Finally, shift array B right by two bits (B >> 2) and query; this step is the final verification and records no code value. If the verification passes, the queried code is 01, and the query is completed by looking up the corresponding value in the table.
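The insert and query walks for a static binary tree can be sketched together as follows. This is an illustration under stated assumptions, not the patented implementation: the class name `BloomTreeSketch` is hypothetical, salted SHA-256 stands in for the H₃ hash functions, the root holds a power-of-two number of positions so that each right shift halves the range, each node is stored as a set of positions that are 1, and when both arrays are satisfied at a node the sketch tries both branches, a case the text does not specify.

```python
import hashlib


class BloomTreeSketch:
    """Binary tree of Bloom filters; leaf codes are bit strings of length h."""

    def __init__(self, height: int, root_bits_log2: int = 16, k: int = 3) -> None:
        if height < 1 or root_bits_log2 <= height:
            raise ValueError("need 1 <= height < root_bits_log2")
        self.h = height
        self.b = root_bits_log2
        self.k = k
        self.nodes = {}  # code prefix of a node -> set of positions equal to 1

    def _arrays(self, key: str):
        # Arrays A and B are computed once, at the root (SHA-256 is a stand-in).
        def group(i):
            return [
                int.from_bytes(
                    hashlib.sha256(f"{i}:{j}:{key}".encode()).digest()[:8], "big"
                ) % (1 << self.b)
                for j in range(self.k)
            ]
        return {"0": group(1), "1": group(2)}  # "0" -> A, "1" -> B

    def _hit(self, prefix: str, values, shift: int) -> bool:
        ones = self.nodes.get(prefix, set())
        return all((v >> shift) in ones for v in values)

    def insert(self, key: str, code: str) -> None:
        if len(code) != self.h or set(code) - {"0", "1"}:
            raise ValueError("code must be h bits of 0/1")
        arrays = self._arrays(key)
        steps = [(code[:lvl], code[lvl], lvl) for lvl in range(self.h)]
        steps.append((code, code[-1], self.h))  # leaf reuses the last array
        for prefix, bit, shift in steps:
            ones = self.nodes.setdefault(prefix, set())
            ones.update(v >> shift for v in arrays[bit])

    def query(self, key: str):
        arrays = self._arrays(key)

        def walk(prefix: str):
            level = len(prefix)
            if level == self.h:  # final verification at the leaf
                ok = self._hit(prefix, arrays[prefix[-1]], self.h)
                return prefix if ok else None
            for bit in "01":  # A satisfied -> code 0, B satisfied -> code 1
                if self._hit(prefix, arrays[bit], level):
                    found = walk(prefix + bit)
                    if found is not None:
                        return found
            return None  # neither array satisfied: the key is not stored

        return walk("")


tree = BloomTreeSketch(height=2)
table = {"01": "port 7"}  # leaf code -> value, an illustrative lookup table
tree.insert("00:1a:2b:3c:4d:5e", "01")
print(table.get(tree.query("00:1a:2b:3c:4d:5e")))
```

Hashing happens only once per operation; every lower level reuses the same two arrays with a larger right shift, which is the property the description credits for the lower processing time.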

In this embodiment, when processing dynamic data, the operation of adding a new value falls into one of two cases. If the original Bloom tree is a binary tree that is not full, a new leaf node can be added directly at the end of the tree when a new value is needed, and this leaf node represents the newly added value, such as the new leaf node encoded as 11 in FIG. 2. If the original Bloom tree is a full binary tree, a new Bloom filter must be added above the root node; its size is 2 times that of the original root node, and it becomes the new root node. The original Bloom tree becomes the left subtree of the new root node, and a full binary tree with one layer fewer than the original Bloom tree is created as the right subtree of the new root node. All positions set to 1 in the original root node are shifted left by one bit and inserted into the new root node. The two groups of H₃ hash functions of the new root node then have one more layer than the original two groups of H₃ hash functions, that is, one more row is selected from the base matrix. A new leaf node can then be added at the end of the tree, such as the new leaf node encoded as 100 in FIG. 3.
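Two parts of this growth step lend themselves to a short sketch: choosing the code of the next leaf, and building the new root's bit vector from the old one. The function names are hypothetical, and the sketch assumes leaves are numbered in order so that the next code is the binary form of the current leaf count, which matches the codes 11 and 100 in FIG. 2 and FIG. 3. Creating the right subtree and selecting the extra base matrix row are not modelled.

```python
def next_leaf(num_leaves: int, height: int):
    """Return (new_height, code) for the leaf added after num_leaves leaves.

    If the tree is full (num_leaves == 2**height), it first grows by one
    level, so existing codes gain a leading 0 (they sit in the left subtree).
    """
    if num_leaves == 2 ** height:
        height += 1
    return height, format(num_leaves, f"0{height}b")


def grow_root(old_root: list) -> list:
    """Build a root twice as large; each 1 at position p moves to p << 1."""
    new_root = [0] * (2 * len(old_root))
    for p, bit in enumerate(old_root):
        if bit:
            new_root[p << 1] = 1
    return new_root


print(next_leaf(3, 2))          # tree not full: (2, '11') as in FIG. 2
print(next_leaf(4, 2))          # tree full: (3, '100') as in FIG. 3
print(grow_root([1, 0, 1, 1]))  # [1, 0, 0, 0, 1, 0, 1, 0]
```

The left shift is the inverse of the right shift used when descending: a position p << 1 in the new root maps back to p in the old root after a shift of one bit.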

Claims

1. A method for key-value pair storage of a Bloom filter tree structure, the Bloom filter tree structure comprising complete d-ary trees, each node of each complete d-ary tree being a Bloom filter; each leaf node of each complete d-ary tree represents a value; the storage unit size of each node is half of that of the parent node of the node, and a root node comprises d × k different hash functions, namely the root node comprises d hash groups, and each group comprises k hash functions; characterized by comprising:

an insertion operation: when a key-value pair (key, value) needs to be inserted, first checking whether the value has already been inserted into the Bloom filter tree, and if not, first performing the operation of adding a new value; if the value exists in the Bloom filter tree, inserting the key-value pair directly: searching for the leaf node corresponding to the value, obtaining the position code of the leaf node, and determining the unique path from the root node to the leaf node; computing the two groups of hash functions of the root node, namely applying the k hash functions h_(i,1), h_(i,2), ..., h_(i,k) to the key to compute h_(i,1)(key), h_(i,2)(key), ..., h_(i,k)(key), wherein i represents the group number of the selected hash functions and i = 1 or 2; the hash values computed for i = 1 are stored in array A, and the hash values computed for i = 2 are stored in array B; then, according to the first bit of the leaf node code, selecting one of the arrays A and B for insertion at the root node, namely right-shifting array A or B by zero bits; then obtaining, according to the first bit of the leaf node code, the Bloom filter in which the next insertion is to be performed; then, according to the second bit of the leaf node code, selecting one of the arrays A and B and right-shifting it by one bit to obtain the insertion positions, and inserting; continuing the above operation, inserting the key into the Bloom filter at each level until the key has been inserted into the leaf node corresponding to the value;

a query operation: when the value corresponding to a key needs to be queried, first computing the two groups of hash functions of the root node, h_(i,1)(key), h_(i,2)(key), ..., h_(i,k)(key), i = 1 or 2, and storing their values in the two arrays A and B respectively; at the root node, querying the position units corresponding to the values of arrays A and B to find the first bit of the code; according to the code value obtained, moving to the next node to continue the query, at which point the arrays A and B are right-shifted and the corresponding position units are queried to obtain a code value; continuing the above operation until a leaf node is found, finally obtaining a complete code, and decoding the leaf node code to obtain the corresponding value;

2. The method of claim 1, wherein in the inserting operation, the method for selecting the array to be inserted at the root node is as follows: if the code value is 0, the A array is selected, and if the code value is 1, the B array is selected.

3. The method according to claim 1, wherein in the inserting operation, the method for obtaining the next bloom filter to be inserted according to the first bit encoding of the leaf node is: if the encoded value is 0, the left node is operated, and if the encoded value is 1, the right node is operated.