CN108287840B

CN108287840B - Data storage and query method based on matrix hash

Info

Publication number: CN108287840B
Application number: CN201710014205.9A
Authority: CN
Inventors: 杨仝; 张梦瑜; 李晓明
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2017-01-09
Filing date: 2017-01-09
Publication date: 2022-05-03
Anticipated expiration: 2037-01-09
Also published as: CN108287840A

Abstract

The invention relates to a data storage and query method based on matrix hash. The method comprises the following steps: 1) establishing a hash table data structure comprising z sub-tables, where z is an even number and the sub-table sizes decrease in arithmetic progression; for 1 ≤ i ≤ z/2, combining the i-th sub-table with the (z-i+1)-th sub-table to obtain z/2 sub-tables of equal size; 2) establishing an auxiliary data structure comprising z bloom filters corresponding to the z sub-tables, where the bloom filter sizes decrease in arithmetic progression; for 1 ≤ i ≤ z/2, combining the i-th bloom filter with the (z-i+1)-th bloom filter to obtain z/2 bloom filters of equal size, then grouping the corresponding bits of these z/2 bloom filters together to form one multi-bit bloom filter; 3) inserting key-value pairs using the hash table data structure and the auxiliary data structure to store the data. The invention achieves fast updates and fast queries.

Description

Data storage and query method based on matrix hash

Technical Field

The invention belongs to the technical field of memory databases, and particularly relates to a data organization, indexing and storage method based on a matrix hash algorithm.

Background

Compared with a disk database, an in-memory database offers higher flexibility and usability. By paradigm, in-memory databases can be divided into relational in-memory databases and key-value in-memory databases. Key-value in-memory databases (key-value stores) are flexible, compact, memory-efficient and fast to query, which gives them unique advantages over relational in-memory databases, so they are widely used by large internet companies such as Amazon, Facebook, YouTube, Baidu, Sina and Sohu. Data in a key-value storage system exists as key-value pairs and is stored in a hash table, so the hash algorithm is the core technology of a key-value storage system and a key factor that directly affects system performance and website efficiency.

The practical problem at present is that, with the rapid development of the internet, many internet companies have accumulated large amounts of data. Because the number of key-value pairs is huge and the available memory is limited, collisions become more frequent when new key-value pairs are inserted. Such collisions can cause the insertion of a new key-value pair to fail, or updates and lookups of existing key-value pairs to fail, which greatly degrades the performance of the key-value storage system and causes great economic loss to the internet companies that use it.

Meanwhile, the demands and requirements of clients on data operation are higher and higher, and the query results of data need to be obtained quickly, so that higher requirements are provided for the response capability of the internet company, and if the internet company cannot respond instantly, the user experience is greatly influenced.

Both problems are widespread among internet companies that use key-value storage systems, and hash table designs keep trying new ideas to solve them better. First, to address collisions, existing hash table designs commonly reduce the collision probability with auxiliary data structures such as bloom filters. Typical designs are the fast hash table (H. Song, S. Dharmapurikar, J. Turner, and J. Lockwood. Fast hash table lookup using extended Bloom filter: an aid to network processing. ACM SIGCOMM Computer Communication Review, 35(4):181-192, 2005), segmented hash (S. Kumar and P. Crowley. Segmented hash: an efficient hash table implementation for high performance networking subsystems. In ACM ANCS, pages 91-103, 2005), and peacock hash (S. Kumar, J. Turner, and P. Crowley. Peacock hashing: deterministic and updatable hashing for high performance networking). For a new key-value pair to be inserted, these designs all use a bloom filter to determine the sub-table to insert into. Colliding key-value pairs are either hung on a linked list with pointers or discarded. Although these designs use multiple sub-tables to reduce collisions, they still suffer from drawbacks such as a low loading rate, and there is still considerable room to reduce their collision rate.

Second, regarding query time, typical hash designs include perfect hashing (Z. J. Czech, G. Havas, and B. S. Majewski. An optimal algorithm for generating minimal perfect hash functions. Information Processing Letters, 43(5):257-264, 1992) and cuckoo hashing (B. Fan, D. G. Andersen, and M. Kaminsky. MemC3: Compact and concurrent MemCache with dumber caching and smarter hashing. In NSDI, volume 13, pages 385-398, 2013). However, these designs are very inefficient to update and require many hash computations and memory accesses. For example, cuckoo hashing requires approximately 500 hash computations and memory accesses when updating the hash table, and even then the update may fail. For these designs, if multiple updates fail, the entire hash table has to be rebuilt. Rebuilding takes a significant amount of time, which is unacceptable for real-world applications.

Disclosure of Invention

In order to solve the problems of hash table conflict and query time and overcome the defects of high conflict rate, low memory use efficiency, low loading rate and the like of the conventional hash table, the invention provides a novel hash table design scheme, namely 'matrix hash', which combines multi-sub-table hash, a bloom filter and a bitmap.

The technical scheme adopted by the invention is as follows:

A data storage method based on matrix hash, characterized by comprising the following steps:

1) establishing a hash table data structure comprising z sub-tables, where z is an even number and the sub-table sizes decrease in arithmetic progression; for 1 ≤ i ≤ z/2, combining the i-th sub-table with the (z-i+1)-th sub-table to obtain z/2 sub-tables of equal size;

2) establishing an auxiliary data structure comprising z bloom filters corresponding to the z sub-tables, where the bloom filter sizes decrease in arithmetic progression; for 1 ≤ i ≤ z/2, combining the i-th bloom filter with the (z-i+1)-th bloom filter to obtain z/2 bloom filters of equal size; then grouping the corresponding bits of these z/2 bloom filters together to form one multi-bit bloom filter;

3) inserting key-value pairs using the hash table data structure and the auxiliary data structure to store the data.

Further, each time a new key-value pair is inserted, it is inserted into the sub-table with the smallest loading rate.

Further, a linked list is hung on the last sub-table, i.e. the z-th sub-table; if no empty bucket can be found for the key-value pair to be inserted, the key-value pair is hung on the linked list using a pointer.

Further, each sub-table corresponds to a bitmap, and each bit in the bitmap corresponds to a bucket in the corresponding sub-table; the bit corresponding to an empty bucket is 0, and the bit corresponding to a non-empty bucket is 1.

Further, an additional bloom filter F_half is added, which is responsible for recording the second half of the sub-tables, i.e. T_{z/2+1}, ..., T_z, to reduce the number of sub-tables queried.

Further, the key-value pairs are inserted as follows:

a) for a given key-value pair, checking through the bitmaps whether the z candidate buckets are empty, and then inserting the key-value pair into the sub-table with the lowest loading rate to balance the loading rates of all sub-tables; assuming the index of the sub-table to be inserted into is i, if i ≤ z/2, updating the bloom filter F_i to indicate that key x is in sub-table T_i, and updating the corresponding bitmap; if i > z/2, updating the bloom filter F_{z-i+1} to indicate that x is in sub-table T_i, and updating F_half and the corresponding bitmap;

b) if the bitmaps show that the z buckets into which key x could be inserted are all full, using the kick mechanism to insert the key-value pair.

Further, key-value pairs are queried as follows: when querying x, first look x up in the multi-bit bloom filter F_m and in F_half; if F_i (1 ≤ i ≤ z/2) returns true and F_half returns false, check sub-table T_i; otherwise, check sub-table T_{z-i+1} first and, if there is no match, then check sub-table T_i; if x cannot be found in the z sub-tables, search the linked list of the last sub-table; if it is still not found, x is not in the hash table.

Further, the key-value pair deletion mode is as follows: when x is deleted, the bucket where x is located is found according to the query operation, then the key-value pair is removed from the bucket, and the corresponding bit of the bucket where x is located in the bitmap is reset.

The beneficial effects of the invention are: 1) High loading rate with few pointers: a large number of key-value pairs are stored in a small memory space, and few pointers are used. 2) Low collision rate. 3) Fast updates: the hash table can be updated with few memory accesses. 4) Zero update failures. 5) Fast queries: a key-value pair can be found with few memory accesses, and for a key-value pair that does not exist, a negative result is returned quickly. 6) Practicality: the method is easy to implement in a hardware system.

Drawings

Fig. 1 is an algorithm diagram of matrix hashing.

Fig. 2 is a structural diagram of a multi-bit bloom filter.

Detailed Description

The invention is further illustrated by the following specific examples and the accompanying drawings.

Data structure

The data structure of the matrix hash of the invention comprehensively uses a multi-level sub-table, a bloom filter and a bitmap. The data structure is composed of a hash table data structure and an auxiliary data structure.

1. Hash table data structure

The size of each sub-table, i.e. the maximum number of elements it can store, decreases in arithmetic progression, and so do the sizes of the bloom filters corresponding to the sub-tables. A simple balancing strategy is used when inserting elements: whenever a new key-value pair is inserted, it goes into the sub-table with the smallest loading rate, which ensures that the numbers of elements in the sub-tables also form a decreasing arithmetic progression.

Assume there are z sub-tables in total, where z is an even number. For 1 ≤ i ≤ z/2, matrix hash combines the i-th sub-table with the (z-i+1)-th sub-table, finally obtaining z/2 sub-tables of equal size. Because the combined sub-tables form a shape similar to a matrix, we name this algorithm matrix hash. To avoid insertion failures, the last sub-table is allowed to carry a linked list: if the key-value pair to be inserted cannot find an empty bucket in the end, it is hung on a linked list attached to the z-th sub-table. Because the z-th sub-table is the smallest sub-table, the pointers occupy the least memory.

Fig. 1 is a schematic of the matrix hash algorithm: on the left are 6 sub-tables and 6 bloom filters whose sizes decrease in arithmetic progression, and in the middle are the 3 sub-tables and 3 bloom filters of equal size obtained after combination. At the upper right is the multi-bit bloom filter formed by combining the three standard bloom filters BF1, BF2 and BF3.
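The pairing can be checked with a short calculation. The sketch below is illustrative only: the function names `subtable_sizes` and `combine_pairs` are hypothetical, and the numbers (z = 8, a size difference of 5000, and 1.05 × 500,000 = 525,000 buckets in total) are borrowed from the experimental settings given later in this description.

```python
def subtable_sizes(total_buckets, z, diff):
    """Return z sub-table sizes that decrease by `diff` and sum to total_buckets."""
    if z % 2 != 0:
        raise ValueError("z must be even")
    # sum = z * first - diff * z * (z - 1) / 2, solved for the largest size
    numerator = total_buckets + diff * z * (z - 1) // 2
    if numerator % z != 0:
        raise ValueError("total_buckets does not split evenly for this z and diff")
    first = numerator // z
    sizes = [first - i * diff for i in range(z)]
    if sizes[-1] <= 0:
        raise ValueError("diff is too large for this total")
    return sizes


def combine_pairs(sizes):
    """Pair sub-table i with sub-table z-i+1 (1-based) and return the combined sizes."""
    z = len(sizes)
    # With 0-based indexes, sub-table i pairs with sub-table z-1-i.
    return [sizes[i] + sizes[z - 1 - i] for i in range(z // 2)]


sizes = subtable_sizes(525_000, z=8, diff=5_000)
combined = combine_pairs(sizes)
print(sizes)     # largest sub-table first
print(combined)  # z/2 combined sub-tables, all the same size
```

Because the sizes form an arithmetic progression, the i-th and (z-i+1)-th terms always add up to the same value, which is why the combined sub-tables come out equal.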

2. Auxiliary data structure

As with the combination of sub-tables, for 1 ≤ i ≤ z/2 matrix hash combines the i-th bloom filter with the (z-i+1)-th bloom filter, finally obtaining z/2 standard bloom filters of equal size. The corresponding bits of these z/2 bloom filters are then grouped together to form one bloom filter, in which each cell consists of z/2 bits. We call this bloom filter a multi-bit bloom filter and denote it F_m. In this way, the original z bloom filters of arithmetically decreasing size are combined into one multi-bit bloom filter.

Fig. 2 is a structural diagram of the multi-bit bloom filter. As shown in the figure, the three bits in a cell come from three standard bloom filters of equal size, F1, F2 and F3, respectively. Note that the combination of bloom filters is physical and is performed in on-chip memory, whereas the combination of sub-tables is only conceptual. The multi-bit bloom filter is implemented as follows:

Suppose that F1, F2 and F3 all have m bits. For F1, first take the most significant bit (AND the m bits of F1 with 2^(m-1)) and shift the result left by 2 × m bits (multiply it by 2^(2m)). Then take the next most significant bit (AND the m bits of F1 with 2^(m-2)), shift the result left by 2 × (m-1) bits (multiply it by 2^(2(m-1))), and add it to the value obtained for the most significant bit. Proceed in the same way for each bit, accumulating until the last bit; call the accumulated value f1. Perform the same operation on F2 and F3 to obtain the accumulated values f2 and f3. The multi-bit bloom filter is obtained by OR-ing f1, f2 shifted right by one bit, and f3 shifted right by two bits, i.e. F_m = f1 | (f2 >> 1) | (f3 >> 2).
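The shift-and-accumulate construction can be illustrated with Python integers standing in for the bit arrays. This is a sketch under assumptions: the helper names `spread`, `interleave` and `cell` are hypothetical, and the right shifts follow the reading in which f1 is left in place, f2 is shifted right by one bit and f3 by two bits, so that the three filters never overlap.

```python
def spread(bits, m):
    """Move bit j of an m-bit filter to position 3*j + 2, accumulating the result."""
    acc = 0
    for j in range(m - 1, -1, -1):      # most significant bit first
        bit = bits & (1 << j)           # isolate bit j (AND with 2^j)
        acc += bit << (2 * (j + 1))     # the top bit (j = m-1) moves left by 2*m
    return acc


def interleave(F1, F2, F3, m):
    """Combine three m-bit filters into one integer made of 3-bit cells."""
    f1, f2, f3 = spread(F1, m), spread(F2, m), spread(F3, m)
    return f1 | (f2 >> 1) | (f3 >> 2)


def cell(fm, j):
    """Return the (F1, F2, F3) bits stored in cell j."""
    return ((fm >> (3 * j + 2)) & 1, (fm >> (3 * j + 1)) & 1, (fm >> (3 * j)) & 1)


fm = interleave(0b1010, 0b0110, 0b0001, m=4)
print(bin(fm))
print([cell(fm, j) for j in range(4)])
```

Each 3-bit group of the result holds the j-th bit of F1, F2 and F3, so a single memory read of one cell answers the membership test for all three filters at that position.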

This bloom filter design raises one problem: when a bloom filter returns true, the two corresponding sub-tables both have to be queried. For example, if x is in the i-th bloom filter, it must be looked up in sub-table T_i or T_{z-i+1}. To reduce the number of sub-tables queried, an additional bloom filter called F_half is added, which records the second half of the sub-tables, i.e. T_{z/2+1}, ..., T_z.

Matrix hash also uses on-chip bitmaps: each sub-table has one bitmap, and each bit in the bitmap corresponds to a bucket in that sub-table. The bit corresponding to an empty bucket is 0, and the bit corresponding to a non-empty bucket is 1.

Matrix hash false positive rate derivation

Matrix hash has two bloom filters: F_m and F_half. Let n be the number of key-value pairs; the z sub-tables are recombined into z/2 sub-tables. Suppose F_m has m cells, each cell has z/2 bits, and these z/2 bits correspond to the z/2 combined sub-tables. Suppose F_m has k hash functions, the same as each of the z/2 bloom filters that compose it. The false positive rate of F_m is denoted f(F_m).

If the number of bloom filters returning true is u+1, the false positive rate is:

f(F_m,u)＝0.5^k*u*(1-0.5^k(z-u-1))

F_half also has k hash functions, so the false positive rate of F_half is f(F_half) = 0.5^k. If only F_m returns true and the key-value pair is reported in only one sub-table, there is no false positive; the probability of this event is (1-f(F_m))*(1-f(F_half)). If only F_m returns true and it reports the key-value pair in u+1 sub-tables, there are u false positives; the probability of this event is f(F_m,u)*(1-f(F_half)). If only F_half reports a false positive, the probability is (1-f(F_m))*f(F_half). If F_m has u false positives and F_half also has a false positive, the probability of this event is f(F_m,u)*f(F_half).

For example, when z = 8 and k = 16, the false positive rate of matrix hash is 1-(1-f(F_m))*(1-f(F_half)) ≈ 6.1*10^-5, which is very small.
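The quoted figure can be reproduced numerically. The closed form of f(F_m) is not legible in this text, so the sketch below assumes f(F_m) = 1 - (1 - 0.5^k)^(z/2 - 1), i.e. that each of the other z/2 - 1 component filters independently errs with probability 0.5^k. That form is an assumption chosen because it matches the 6.1*10^-5 value, not a formula taken from the source; the function name is hypothetical.

```python
def overall_false_positive(z, k):
    """Overall rate 1 - (1 - f(F_m)) * (1 - f(F_half)) under the assumed f(F_m)."""
    p = 0.5 ** k                         # f(F_half) = 0.5^k, as stated in the text
    f_half = p
    f_m = 1 - (1 - p) ** (z // 2 - 1)    # ASSUMED form, not taken from the source
    return 1 - (1 - f_m) * (1 - f_half)


rate = overall_false_positive(z=8, k=16)
print(f"{rate:.3e}")
```

Under this assumption the overall rate is close to (z/2) × 0.5^k, so it grows roughly linearly with the number of combined sub-tables and falls exponentially with k.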

Insertion, query and deletion of key-value pairs

In a key-value storage system, the matrix hash algorithm inserts, queries and deletes key-value pairs as follows:

1. Insertion of key-value pairs

For a given key-value pair with key x, first check through the bitmaps whether the z candidate buckets are empty. Then insert the key-value pair into the sub-table with the lowest loading rate, to balance the loading rates of all sub-tables. Assume the index of the sub-table chosen is i. If i ≤ z/2, update F_i to indicate that x is in sub-table T_i, and update the corresponding bitmap; if i > z/2, update F_{z-i+1} to indicate that x is in sub-table T_i, and update F_half and the corresponding bitmap. During insertion, to set the bit of a cell that corresponds to F_i to 1, it suffices to OR the z/2 bits stored in this cell with 2^(i-1).

If the bitmaps show that all z buckets into which x could be inserted are full, the kick mechanism of cuckoo hashing is used, with the bitmaps deciding which key-value pair to kick. The bitmaps are used to check the z candidate buckets of x in turn, to determine whether an element y already stored in one of those buckets has an empty bucket available in one of the remaining z-1 sub-tables. If so, y is kicked out, x is inserted in its place, and y is inserted into its new location. If no such y can be found, a blind kick is performed and the above insertion procedure is repeated. The number of blind kicks is limited to θ; if it exceeds θ, the key-value pair is hung on the linked list of the last sub-table. By varying θ, matrix hash can trade off loading rate against insertion speed. Because the bitmaps keep a global record of empty and non-empty buckets in the sub-tables, using on-chip bitmaps significantly reduces the number of kicks.
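The insertion procedure can be sketched as a toy class. Everything here is an illustrative assumption rather than the patented implementation: the class name `MatrixHashSketch`, the SHA-256 based bucket hash, the choice of the least-loaded sub-table among those with an empty candidate bucket, and the rule for picking the blind-kick victim are all hypothetical. Bloom filter updates are left out to keep the focus on the bitmaps, and duplicate keys are not handled.

```python
import hashlib


class MatrixHashSketch:
    """Toy sub-tables plus bitmaps; bloom filters are left out on purpose."""

    def __init__(self, sizes, theta=0):
        self.sizes = sizes
        self.tables = [[None] * s for s in sizes]   # each bucket holds (key, value)
        self.bitmaps = [[0] * s for s in sizes]     # 0 = empty bucket, 1 = non-empty
        self.counts = [0] * len(sizes)
        self.theta = theta                          # limit on blind kicks
        self.overflow = []                          # stands in for the linked list on T_z

    def _bucket(self, key, t):
        # Candidate bucket of `key` in sub-table t (hash choice is an assumption).
        digest = hashlib.sha256(f"{t}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.sizes[t]

    def _free_tables(self, key):
        # Consult only the bitmaps, never the buckets themselves.
        return [t for t in range(len(self.sizes))
                if self.bitmaps[t][self._bucket(key, t)] == 0]

    def _least_loaded(self, candidates):
        return min(candidates, key=lambda t: self.counts[t] / self.sizes[t])

    def _place(self, t, key, value):
        b = self._bucket(key, t)
        self.tables[t][b] = (key, value)
        self.bitmaps[t][b] = 1
        self.counts[t] += 1

    def insert(self, key, value, blind_kicks=0):
        free = self._free_tables(key)
        if free:                                    # an empty candidate bucket exists
            self._place(self._least_loaded(free), key, value)
            return
        for t in range(len(self.sizes)):            # bitmap-guided kick
            b = self._bucket(key, t)
            y_key, y_value = self.tables[t][b]
            free_for_y = self._free_tables(y_key)
            if free_for_y:
                self.tables[t][b] = (key, value)    # x takes over y's bucket
                self._place(self._least_loaded(free_for_y), y_key, y_value)
                return
        if blind_kicks < self.theta:                # blind kick, at most theta times
            t = blind_kicks % len(self.sizes)       # victim choice is an assumption
            b = self._bucket(key, t)
            victim = self.tables[t][b]
            self.tables[t][b] = (key, value)
            self.insert(victim[0], victim[1], blind_kicks + 1)
            return
        self.overflow.append((key, value))          # give up: hang on the linked list


mh = MatrixHashSketch([7, 5, 3, 1], theta=2)
for n in range(12):
    mh.insert(f"key{n}", n)
print(mh.counts, len(mh.overflow))
```

The guided kick only reads the bitmaps to decide whether y can move, so a failed attempt costs no bucket writes; setting `theta=0` disables blind kicks entirely, as in the experiments described later.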

2. Query of key-value pairs

To query x, first look x up in F_m and F_half. If F_i (1 ≤ i ≤ z/2) returns true and F_half returns false, check sub-table T_i. Otherwise, check sub-table T_{z-i+1} first and, if there is no match, then check sub-table T_i. If x cannot be found in the z sub-tables, search the linked list of the last sub-table. If it is still not found, x is not in the hash table.
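The index bookkeeping between sub-tables and bloom filters can be written as two small functions. This is a sketch with hypothetical names (`filter_for_subtable`, `probe_order`); it takes the answers of F_m and F_half as inputs instead of computing them.

```python
def filter_for_subtable(i, z):
    """Map sub-table index i (1-based) to (filter index, whether F_half is updated)."""
    if i <= z // 2:
        return i, False          # first half: record in F_i only
    return z - i + 1, True       # second half: record in F_{z-i+1} and in F_half


def probe_order(fm_hits, in_half, z):
    """Sub-tables to check, given the indexes i for which F_m is true and the F_half answer."""
    order = []
    for i in fm_hits:
        if in_half:
            order += [z - i + 1, i]   # try T_{z-i+1} first, then fall back to T_i
        else:
            order.append(i)           # F_half says the key is in the first half
    return order


print(filter_for_subtable(2, 8), filter_for_subtable(7, 8))
print(probe_order([2], False, 8), probe_order([2], True, 8))
```

If none of the listed sub-tables holds the key, the search continues on the linked list of the last sub-table, as described above.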

Note that only k hash functions need to be computed during a query, not z × k, because the z/2 bloom filters that make up F_m have identical parameters. To read the bit of a cell that corresponds to F_i, AND the z/2 bits of this cell with 2^(i-1). If the result is 0, the bit corresponding to F_i in this cell is 0; otherwise it is 1.
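The per-cell OR and AND operations with 2^(i-1) can be sketched as follows. The class name `MultiBitBloomFilter` and the SHA-256 based hash functions are assumptions for illustration, and the cells are kept as a list of small integers rather than as one packed bit string.

```python
import hashlib


class MultiBitBloomFilter:
    """m cells, each holding one bit per component filter F_1..F_g."""

    def __init__(self, m, groups, k):
        self.m, self.groups, self.k = m, groups, k
        self.cells = [0] * m

    def _positions(self, key):
        # k cell indexes shared by all component filters (hash choice is an assumption).
        positions = []
        for j in range(self.k):
            digest = hashlib.sha256(f"{j}:{key}".encode()).digest()
            positions.append(int.from_bytes(digest[:8], "big") % self.m)
        return positions

    def add(self, key, i):
        """Record key in F_i (1-based) by OR-ing 2^(i-1) into its k cells."""
        for p in self._positions(key):
            self.cells[p] |= 1 << (i - 1)

    def query(self, key):
        """Return every i whose component filter F_i reports the key."""
        mask = (1 << self.groups) - 1
        for p in self._positions(key):
            mask &= self.cells[p]        # a bit survives only if set in all k cells
        return [i for i in range(1, self.groups + 1) if mask & (1 << (i - 1))]


fm = MultiBitBloomFilter(m=1024, groups=3, k=4)
fm.add("10.0.0.0/8", 2)
print(fm.query("10.0.0.0/8"))
```

A single pass over the k cells answers the membership question for all component filters at once, which is the reason only k hash computations are needed per query.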

3. Deletion of key-value pairs

To delete x, matrix hash first finds the bucket where x is located using the query operation above, then removes the key-value pair from the bucket and resets the bit corresponding to that bucket in the bitmap.
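A minimal sketch of deletion, assuming list-based sub-tables and bitmaps: `delete_key` and `candidate_bucket` are hypothetical names, and the plain scan over the sub-tables stands in for the bloom-filter-guided query.

```python
def delete_key(tables, bitmaps, key, candidate_bucket):
    """Remove `key` from its bucket and reset the bitmap bit; return True if found."""
    for t, table in enumerate(tables):
        b = candidate_bucket(key, t)     # bucket the key would occupy in sub-table t
        entry = table[b]
        if entry is not None and entry[0] == key:
            table[b] = None
            bitmaps[t][b] = 0            # the bucket is empty again
            return True
    return False


# Toy data: two sub-tables with hand-picked candidate buckets.
tables = [[None, ("a", 1)], [("b", 2)]]
bitmaps = [[0, 1], [1]]
buckets = {"a": [1, 0], "b": [0, 0]}
first = delete_key(tables, bitmaps, "b", lambda key, t: buckets[key][t])
second = delete_key(tables, bitmaps, "b", lambda key, t: buckets[key][t])
print(first, second, bitmaps)
```

Resetting the bitmap bit is what makes the freed bucket visible to later insertions and bitmap-guided kicks.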

Experimental data

To better evaluate the matrix hash of the invention against existing hash designs, we used data from a real application. We obtained 12 forwarding information bases (FIBs) from the website www.ripe.net at 8 a.m. on 2014-07-08, and for each FIB we synthetically generated a traffic trace uniformly over its prefixes. We use the part of the FIB that is relevant to us, namely each prefix and its next hop, with the prefix as the key and the next hop as the value. We use β to denote the ratio of the total number of buckets of the hash table to the total number of elements, where 1.05 ≤ β ≤ 10, and θ to denote the threshold on blind kicks. The number of key-value pairs in the FIB was n = 500,000. We created 8 sub-tables whose sizes differ by 5000 from one sub-table to the next and whose total size is βn. Setting θ = 0 means that blind kicks are not allowed and only bitmap-guided kicks are used. In that case each insertion of a key-value pair needs at most 8 + 1 memory accesses; if an element finds no empty candidate position in the 8 sub-tables, it is inserted into the linked list of the last sub-table. The collision rate is the ratio of the number of elements on the linked list of the last sub-table to the total number of elements. The bloom filter has 16 hash functions. The experimental results are as follows:

1. Matrix hash experiments:

1) Loading rate and collision rate

The experiment sets β = 1.05 and θ = 0. The results show that matrix hash achieves a very high loading rate with only 1.05 × n buckets of memory: the loading rates of the 8 sub-tables are well balanced and the total loading rate is 95.19%. The collision rate is about 0.05%, and only a few FIBs have a collision rate above 0.06%.

2) Insertion and query time

The experiment sets β = 1.05 and θ = 0 and inserts all elements of each FIB into the matrix hash. The results show that the more elements have been inserted, the more memory accesses an insertion requires. Most elements require fewer than 6 memory accesses per insertion. The number of memory accesses per query is between 1 and 1.0019, with an average of 1.00059.

3) Bitmap kicking and blind kicking

The experiment sets β = 1.05. The results show that when θ = 5, inserting a key-value pair takes at most 8 × (5 + 1) + 1 = 49 memory accesses and querying a key-value pair takes fewer than 8 memory accesses, and in that case the linked list of the last sub-table has no elements. When θ = 0, blind kicks are not allowed, yet only a few elements (0.56%) end up on the linked list of the last sub-table, and the worst case for an insertion is 8 + 1 memory accesses.

4) Collision rate vs beta

The experiment sets θ = 0. The results show that the larger β is, the smaller the collision rate, and when β ≥ 1.18 the collision rate approaches 0.

2. Matrix hash is compared to other hashes:

The experiments compare matrix hash with six well-known hash designs: chained hashing, linear probing, double hashing, cuckoo hashing, d-left hashing and peacock hashing. Insertion failure is defined first. For linear probing, double hashing and cuckoo hashing, when a collision occurs another bucket is probed, and this probing is repeated. The number of repeated probes is limited to 500, so for these three designs the maximum number of memory accesses per insertion is 500; if a collision still exists after 500 probes, the design gives up and the element is discarded, which constitutes an insertion failure. For peacock hashing and matrix hash, the bloom filter has 16 hash functions.

Experiment one (β = 1.05, different FIBs)

1) Loading rate

The experimental results show that: the loading rate of matrix hash is always the highest.

2) Insertion time

The experimental results show that: matrix hashing, the number of copies required for insertion is minimal in all but chain hashing. This is because the chain hash requires only one or two accesses during insertion, so the access time is short, but the chain hash has significant disadvantages in other aspects. And the matrix hash can achieve fast insertion due to the existence of the bloom filter and the bitmap.

3) Finding time

The experimental results show that: the matrix hah has the shortest search time because the matrix hah has this higher loading rate and a smaller false positive rate.

Experiment two (different beta, FIB rrc00)

1) Loading rate

The experimental results show that the loading rate of matrix hash is always the highest, that the loading rates of chained hashing and double hashing differ little, and that peacock hashing reaches a high loading rate only when β is large.

2) Insertion time

The experimental results show that matrix hash needs the fewest memory accesses per insertion of all the designs except chained hashing.

3) Finding time

The experimental results show that matrix hash has the shortest lookup time.

The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit the same, and a person skilled in the art can modify the technical solution of the present invention or substitute the same without departing from the spirit and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims

1. A data storage method based on matrix hash, characterized by comprising the following steps:

1) establishing a hash table data structure comprising z sub-tables, where z is an even number and the sub-table sizes decrease in arithmetic progression; for 1 ≤ i ≤ z/2, combining the i-th sub-table with the (z-i+1)-th sub-table to obtain z/2 sub-tables of equal size; each of the z sub-tables corresponds to a bitmap, and each bit in the bitmap corresponds to a bucket in the corresponding sub-table; the bit in the bitmap corresponding to an empty bucket is 0, and the bit in the bitmap corresponding to a non-empty bucket is 1;

2) establishing an auxiliary data structure comprising z bloom filters corresponding to the z sub-tables, where the bloom filter sizes decrease in arithmetic progression; for 1 ≤ i ≤ z/2, combining the i-th bloom filter with the (z-i+1)-th bloom filter to obtain z/2 bloom filters of equal size; then grouping the corresponding bits of the z/2 bloom filters together to form one multi-bit bloom filter F_m; adding an additional bloom filter F_half responsible for recording the second half of the z sub-tables, from the (z/2+1)-th sub-table T_{z/2+1} to the z-th sub-table T_z, i.e. T_{z/2+1}, ..., T_z, to reduce the number of sub-tables queried;

the multi-bit bloom filter F_mThe implementation mode of (2) is as follows:

assume that the standard bloom filters F1, F2 and F3 all have m bits; for F1, first take the most significant bit and shift the result left by 2 × m bits, then take the next most significant bit, shift the result left by 2 × (m-1) bits and add it to the value obtained for the most significant bit, and so on, performing the same operation on each bit and accumulating until the last bit, the accumulated value being f1; perform the same operation on F2 and F3 respectively to obtain the accumulated values f2 and f3; OR together f1, the result of shifting f2 right by one bit, and the result of shifting f3 right by two bits to obtain the multi-bit bloom filter;

3) inserting key value pairs by using the hash table data structure and the auxiliary data structure to realize data storage;

wherein, step 3) includes:

inserting a new key value pair into the sub-table with the minimum loading rate in the z sub-tables every time the key value pair is inserted;

hanging a linked list on the last sub-table, namely the z-th sub-table, in the z sub-tables;

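the key-value pairs are inserted as follows:

a) for a given key-value pair with key x, checking through the bitmaps whether the z candidate buckets are empty, and then inserting the key-value pair into the sub-table with the lowest loading rate to balance the loading rates of all sub-tables; assuming the index of the sub-table to be inserted into is i, if i ≤ z/2, updating the i-th bloom filter F_i of the z bloom filters to indicate that the key-value pair is in the i-th sub-table T_i of the z sub-tables, and updating the corresponding bitmap; if i > z/2, updating the (z-i+1)-th bloom filter F_{z-i+1} of the z bloom filters to indicate that the key-value pair is in the i-th sub-table T_i of the z sub-tables, and updating F_half and the corresponding bitmap;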

b) if the bitmap shows that z buckets into which the key x should be inserted are full, the insertion of the key-value pair is realized by using a kick mechanism;

wherein, the implementation mode of the step b) is as follows: using the bitmap to sequentially check z candidate buckets corresponding to x to determine whether an original element y in the candidate buckets has an empty bucket in the remaining z-1 subtables into which y can be inserted; if yes, kicking out y, inserting x, and inserting y into a new position; and if y cannot be found, executing blind kicking, repeating the above insertion process, limiting the number of times of blind kicking to theta, and if the number of times of blind kicking exceeds theta, hanging the key value pair on the linked list of the last sub-table.

2. The method of claim 1, wherein key-value pairs are queried as follows: when looking up the key x, first query key x in the multi-bit bloom filter F_m and in F_half; for 1 ≤ i ≤ z/2, if F_i returns true and F_half returns false, check the i-th sub-table T_i of the z sub-tables; otherwise, check the (z-i+1)-th sub-table T_{z-i+1} of the z sub-tables first and, if there is no match, then check the i-th sub-table T_i of the z sub-tables; if the key x cannot be found in the z sub-tables, search the linked list of the last sub-table; if it is still not found, key x is not in the hash table.

3. The method of claim 1, wherein key-value pairs are deleted by: when deleting the key x, firstly finding the bucket where the key x is located according to the query operation, then removing the key value pair from the bucket, and resetting the corresponding bit of the bucket where the key x is located in the bitmap.