CN108287840B - Data storage and query method based on matrix hash - Google Patents

Info

Publication number
CN108287840B
CN108287840B CN201710014205.9A CN201710014205A CN108287840B CN 108287840 B CN108287840 B CN 108287840B CN 201710014205 A CN201710014205 A CN 201710014205A CN 108287840 B CN108287840 B CN 108287840B
Authority
CN
China
Prior art keywords
sub
key
tables
bit
bloom filter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710014205.9A
Other languages
Chinese (zh)
Other versions
CN108287840A (en)
Inventor
杨仝
张梦瑜
李晓明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University
Priority to CN201710014205.9A
Publication of CN108287840A
Application granted
Publication of CN108287840B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G06F16/2228 Indexing structures
    • G06F16/2255 Hash tables

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a data storage and query method based on matrix hash. The method comprises the following steps: 1) establishing a hash table data structure comprising z sub-tables, where z is an even number and the sub-table sizes decrease in arithmetic progression; for $1 \le i \le z/2$, combining the i-th sub-table with the (z-i+1)-th sub-table to obtain $z/2$ sub-tables of equal size; 2) establishing an auxiliary data structure comprising z Bloom filters corresponding to the z sub-tables, where the Bloom filter sizes decrease in arithmetic progression; for $1 \le i \le z/2$, combining the i-th Bloom filter with the (z-i+1)-th Bloom filter to obtain $z/2$ Bloom filters of equal size; then placing the corresponding bits of these $z/2$ Bloom filters together to form one multi-bit Bloom filter; 3) inserting key-value pairs using the hash table data structure and the auxiliary data structure to store data. The invention enables fast updates and fast queries.

Description

Data storage and query method based on matrix hash
Technical Field
The invention belongs to the technical field of in-memory databases, and in particular relates to a data organization, indexing and storage method based on a matrix hash algorithm.
Background
Compared with disk databases, in-memory databases offer higher flexibility and usability, and they can be divided into relational in-memory databases and key-value in-memory databases. Key-value in-memory databases (key-value stores) are flexible, compact, memory-efficient and fast to query, which gives them distinct advantages over relational in-memory databases, so they are widely used by large internet companies such as Amazon, Facebook, YouTube, Baidu, Sina and Sohu. A key-value storage system holds its data as key-value pairs stored in a hash table, so the hash algorithm is the core technology of such a system and a key factor that directly affects system performance and website efficiency.
The practical problem today is that, with the rapid development of the internet, many internet companies have accumulated large amounts of data. Because the number of key-value pairs is huge and the available memory is limited, inserting a new key-value pair leads to more and more collisions. Such collisions can cause a new key-value pair to fail to insert, or updates and lookups of existing key-value pairs to fail, which seriously degrades the performance of the key-value storage system and can cause substantial economic loss to the companies that rely on it.
At the same time, users' demands on data operations keep rising: query results must be returned quickly, which places higher requirements on the responsiveness of internet companies. If a company cannot respond promptly, the user experience suffers considerably.
Both problems are widespread among internet companies that use key-value storage systems, and hash table designs keep trying new ideas to address them. First, to address collisions, existing hash table designs commonly use auxiliary data structures (such as Bloom filters) to reduce the collision probability. Typical designs include the fast hash table (H. Song, S. Dharmapurikar, J. Turner, and J. Lockwood. Fast hash table lookup using extended Bloom filter: an aid to network processing. ACM SIGCOMM Computer Communication Review, 35(4):181-192, 2005.), segmented hash (S. Kumar and P. Crowley. Segmented hash: an efficient hash table implementation for high performance networking subsystems. In ACM ANCS, pages 91-103, 2005.), and peacock hash (S. Kumar, J. Turner, and P. Crowley. Peacock hashing: deterministic and updatable hashing for high performance networking.). For a new key-value pair to be inserted, these designs all use Bloom filters to determine which sub-table to insert into. Conflicting key-value pairs are either attached to a linked list via pointers or discarded. Although these designs use multiple sub-tables to reduce collisions, they still have drawbacks such as low loading rates, and their collision rates leave considerable room for reduction.
Second, regarding query time, typical hash designs include perfect hashing (Z. J. Czech, G. Havas, and B. S. Majewski. An optimal algorithm for generating minimal perfect hash functions. Information Processing Letters, 43(5):257-264, 1992.) and cuckoo hashing (B. Fan, D. G. Andersen, and M. Kaminsky. MemC3: Compact and concurrent MemCache with dumber caching and smarter hashing. In NSDI, volume 13, pages 385-398, 2013.). However, these designs are very inefficient when updating and require many hash computations and memory accesses. For example, cuckoo hashing may require approximately 500 hash computations and memory accesses to update the hash table, and even then the update may fail. For these designs, if multiple updates fail, the entire hash table has to be rebuilt, and rebuilding takes a significant amount of time, which is unacceptable in real-world applications.
Disclosure of Invention
In order to solve the problems of hash table conflict and query time and overcome the defects of high conflict rate, low memory use efficiency, low loading rate and the like of the conventional hash table, the invention provides a novel hash table design scheme, namely 'matrix hash', which combines multi-sub-table hash, a bloom filter and a bitmap.
The technical scheme adopted by the invention is as follows:
A data storage method based on matrix hash, characterized by comprising the following steps:
1) establishing a hash table data structure comprising z sub-tables, where z is an even number and the sub-table sizes decrease in arithmetic progression; for $1 \le i \le z/2$, combining the i-th sub-table with the (z-i+1)-th sub-table to obtain $z/2$ sub-tables of equal size;
2) establishing an auxiliary data structure comprising z Bloom filters corresponding to the z sub-tables, where the Bloom filter sizes decrease in arithmetic progression; for $1 \le i \le z/2$, combining the i-th Bloom filter with the (z-i+1)-th Bloom filter to obtain $z/2$ Bloom filters of equal size; then placing the corresponding bits of these $z/2$ Bloom filters together to form one multi-bit Bloom filter;
3) inserting key-value pairs using the hash table data structure and the auxiliary data structure to store data.
Further, each time a new key-value pair is inserted, it is inserted into the sub-table with the smallest load rate.
Further, a linked list is attached to the last sub-table, i.e. the z-th sub-table; if no empty bucket can be found for the key-value pair to be inserted, the key-value pair is attached to the linked list using a pointer.
Further, each sub-table has a corresponding bitmap, and each bit in the bitmap corresponds to a bucket in that sub-table; the bit for an empty bucket is 0, and the bit for a non-empty bucket is 1.
Further, an additional Bloom filter $F_{half}$ is added, which records the second half of the sub-tables, i.e. $T_{z/2+1}, \ldots, T_z$, to reduce the number of sub-tables queried.
Further, the key-value pairs are inserted as follows:
a) for a given key-value pair with key x, checking through the bitmaps whether the z candidate buckets are empty, and then inserting the key-value pair into the sub-table with the lowest loading rate, to balance the loading rates of all sub-tables; suppose the index of the sub-table to insert into is i; if $1 \le i \le z/2$, updating the Bloom filter $F_i$ to indicate that key x is in sub-table $T_i$, and updating the corresponding bitmap; if $z/2 < i \le z$, updating the Bloom filter $F_{z-i+1}$ to indicate that x is in sub-table $T_i$, and updating $F_{half}$ and the corresponding bitmap;
b) if the bitmaps show that the z candidate buckets for key x are all full, using a kick-out mechanism to insert the key-value pair.
Further, key-value pairs are queried as follows: when querying x, first query x in the multi-bit Bloom filter $F_m$ and in $F_{half}$; if $F_i$ ($1 \le i \le z/2$) returns true and $F_{half}$ returns false, check sub-table $T_i$; otherwise, check sub-table $T_{z-i+1}$ first and, if there is no match, then check sub-table $T_i$; if x cannot be found in the z sub-tables, search the linked list of the last sub-table; if x is still not found, x is not in the hash table.
Further, the key-value pair deletion mode is as follows: when x is deleted, the bucket where x is located is found according to the query operation, then the key-value pair is removed from the bucket, and the corresponding bit of the bucket where x is located in the bitmap is reset.
The beneficial effects of the invention are: 1) High loading rate with few pointers: a large number of key-value pairs are stored in a small memory space, and few pointers are used. 2) Low collision rate. 3) Fast updates: the hash table can be updated with few memory accesses. 4) Zero update failures. 5) Fast queries: a key-value pair can be found with few memory accesses, and for a key-value pair that does not exist, a negative result is returned quickly. 6) Practicality: the scheme is easy to implement in hardware systems.
Drawings
Fig. 1 is an algorithm diagram of matrix hashing.
Fig. 2 is a structural diagram of a multi-bit bloom filter.
Detailed Description
The invention is further illustrated by the following specific examples and the accompanying drawings.
Data structure
The data structure of the matrix hash of the invention comprehensively uses a multi-level sub-table, a bloom filter and a bitmap. The data structure is composed of a hash table data structure and an auxiliary data structure.
1. Hash table data structure
The size of each sub-table, i.e. the maximum number of elements it can store, decreases in arithmetic progression, and so do the sizes of the Bloom filters corresponding to the sub-tables. A simple balancing strategy is used when inserting elements: whenever a new key-value pair is inserted, it goes into the sub-table with the lowest loading rate, which ensures that the number of elements in the sub-tables also forms a decreasing arithmetic progression.
Assume there are z sub-tables in total, where z is an even number. For $1 \le i \le z/2$, matrix hash combines the i-th sub-table with the (z-i+1)-th sub-table, finally obtaining $z/2$ sub-tables of equal size. Because the combined sub-tables are shaped like a matrix, the algorithm is named matrix hash. To avoid insertion failures, a linked list may be attached to the last sub-table: if no empty bucket can be found for the key-value pair to be inserted, it is attached to the linked list of the z-th sub-table. Because the z-th sub-table is the smallest sub-table, the pointers occupy the least memory.
Fig. 1 is a schematic of the matrix hash algorithm. On the left are 6 sub-tables and 6 Bloom filters whose sizes decrease in arithmetic progression; in the middle are the 3 sub-tables and 3 Bloom filters of equal size obtained after combination. At the upper right is the multi-bit Bloom filter formed by combining the three standard Bloom filters BF1, BF2 and BF3.
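The pairing of sub-tables can be checked with a few lines of arithmetic. The following is a minimal sketch, not the patent's implementation: the function names `subtable_sizes` and `paired_sizes` are hypothetical, and the numbers in the usage lines are taken from the experimental settings described later in this document (500,000 keys, β = 1.05, 8 sub-tables, size difference 5000).

```python
def subtable_sizes(total_buckets, z, diff):
    """Return z bucket counts in decreasing arithmetic progression.

    The sizes sum to total_buckets and adjacent sizes differ by diff.
    (Hypothetical helper, for illustration only.)
    """
    if z % 2 != 0:
        raise ValueError("z must be even")
    # sum = z*first - diff*z*(z-1)/2, solved for the largest size `first`
    numerator = total_buckets + diff * z * (z - 1) // 2
    if numerator % z != 0:
        raise ValueError("parameters do not give integer sizes")
    first = numerator // z
    return [first - diff * j for j in range(z)]


def paired_sizes(sizes):
    """Combine sub-table i with sub-table z-i+1 (1-based) and return the sums."""
    z = len(sizes)
    return [sizes[i] + sizes[z - 1 - i] for i in range(z // 2)]


sizes = subtable_sizes(total_buckets=525_000, z=8, diff=5_000)
print(sizes)                # eight decreasing sizes
print(paired_sizes(sizes))  # four equal combined sizes
```

Because the sizes form an arithmetic progression, the largest and smallest, second largest and second smallest, and so on, always add up to the same value, which is why the combined sub-tables have equal size.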
2. Auxiliary data structure
As with the combination of sub-tables, for $1 \le i \le z/2$, matrix hash combines the i-th Bloom filter with the (z-i+1)-th Bloom filter, finally obtaining $z/2$ standard Bloom filters of equal size. The corresponding bits of these $z/2$ Bloom filters are then placed together to form a single Bloom filter, in which each box consists of $z/2$ bits. We call this Bloom filter a multi-bit Bloom filter and denote it by $F_m$. In this way, the original z Bloom filters of arithmetically decreasing size are combined into one multi-bit Bloom filter.
Fig. 2 is a structural diagram of the multi-bit Bloom filter. As shown in the figure, the three bits in a box come from three standard Bloom filters of equal size, namely F1, F2 and F3. Note that the combination of Bloom filters is carried out physically, in on-chip memory, whereas the combination of sub-tables is only conceptual. The multi-bit Bloom filter is implemented as follows:
Suppose that F1, F2 and F3 each have m bits. For F1, first take the most significant bit (AND the m bits of F1 with $2^{m-1}$) and shift the result left by 2·m bits (multiply it by $2^{2m}$). Then take the next most significant bit (AND the m bits of F1 with $2^{m-2}$), shift the result left by 2·(m-1) bits (multiply it by $2^{2(m-1)}$), and accumulate it with the value obtained for the most significant bit. Continue in the same way for each bit down to the last one, and call the accumulated value f1. Perform the same operation on F2 and F3 to obtain the accumulated values f2 and f3. Finally, OR together f1, the result of shifting f2 right by one bit, and the result of shifting f3 right by two bits to obtain the multi-bit Bloom filter (that is, $F_m = f_1 \mid (f_2 \gg 1) \mid (f_3 \gg 2)$).
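The shift-and-accumulate procedure above amounts to interleaving the bits of the three filters. The sketch below follows the described steps literally, treating each filter as a Python integer; the names `spread_bits`, `combine3` and `box` are hypothetical, and the resulting layout (F1 in the high bit of each 3-bit box, F3 in the low bit) is what these steps produce rather than a layout stated explicitly in the text.

```python
def spread_bits(value, m, stride=3):
    """Spread the m bits of `value` so that bit j lands at position stride*j + (stride-1)."""
    out = 0
    for j in range(m - 1, -1, -1):               # most significant bit first
        bit = value & (1 << j)                   # isolate bit j (AND with 2^j)
        out |= bit << ((stride - 1) * (j + 1))   # shift left by 2*(j+1) when stride is 3
    return out


def combine3(F1, F2, F3, m):
    """Interleave three m-bit filters into one filter of m three-bit boxes."""
    f1, f2, f3 = (spread_bits(F, m) for F in (F1, F2, F3))
    return f1 | (f2 >> 1) | (f3 >> 2)


def box(Fm, j):
    """Return the three bits of box j (box 0 holds the least significant bits)."""
    return (Fm >> (3 * j)) & 0b111


Fm = combine3(0b1010, 0b0110, 0b0001, m=4)
print([format(box(Fm, j), "03b") for j in range(4)])
```

Each printed box shows the F1, F2 and F3 bits for one position side by side, so one memory read of a box returns the membership bit of all three component filters at once.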
This Bloom filter design raises one problem: when a Bloom filter returns true, the two corresponding sub-tables both need to be queried. For example, if x is reported by the i-th Bloom filter, it must be looked up in sub-table $T_i$ or $T_{z-i+1}$. To reduce the number of sub-tables queried, an additional Bloom filter, called $F_{half}$, is added; it records the second half of the sub-tables, i.e. $T_{z/2+1}, \ldots, T_z$.
Matrix hash also uses on-chip bitmaps: each sub-table has a corresponding bitmap, and each bit in the bitmap corresponds to a bucket in that sub-table. The bit for an empty bucket is 0, and the bit for a non-empty bucket is 1.
Matrix hash false positive rate derivation
Matrix hash has two Bloom filters: $F_m$ and $F_{half}$. Let n be the number of key-value pairs; the z sub-tables are recombined into $z/2$ sub-tables. Suppose $F_m$ has m boxes and each box has $z/2$ bits, with these $z/2$ bits corresponding to the $z/2$ combined sub-tables. Suppose $F_m$ has k hash functions, where k is given by [formula image not reproduced], and $F_m$ is equivalent to its $z/2$ component Bloom filters. The false positive rate of $F_m$ is denoted $f(F_m)$ and is given by:
[formula image not reproduced]
If the number of Bloom filters returning true is u+1, the false positive rate is:
$f(F_m, u) = 0.5^{k u} \, (1 - 0.5^{k(z-u-1)})$
$F_{half}$ also has k hash functions, and its false positive rate is $f(F_{half}) = 0.5^k$. If only $F_m$ returns true and the key-value pair is reported in exactly one sub-table, there is no false positive; this event occurs with probability $(1 - f(F_m)) \cdot (1 - f(F_{half}))$. If only $F_m$ returns true and the key-value pair is reported in u+1 sub-tables, there are u false positives; this event occurs with probability $f(F_m, u) \cdot (1 - f(F_{half}))$. If only $F_{half}$ reports a false positive, the probability is $(1 - f(F_m)) \cdot f(F_{half})$. If $F_m$ has u false positives and $F_{half}$ has one false positive, the probability of this event is $f(F_m, u) \cdot f(F_{half})$.
For example, when z = 8 and k = 16, the false positive rate of matrix hash is $1 - (1 - f(F_m)) \cdot (1 - f(F_{half})) \approx 6.1 \times 10^{-5}$, which is very small.
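The numerical example can be reproduced under one explicit assumption. The closed form of $f(F_m)$ is an image in the source and is not available here, so the sketch below assumes that each of the $z/2 - 1$ component filters not holding the key independently gives a false positive with probability $0.5^k$, i.e. $f(F_m) = 1 - (1 - 0.5^k)^{z/2 - 1}$. This assumed form is not confirmed by the text; it is used only because it agrees with the quoted figure. The function name `matrix_hash_fpr` is hypothetical.

```python
def matrix_hash_fpr(z, k):
    """Overall false positive rate 1 - (1 - f(F_m)) * (1 - f(F_half)).

    ASSUMPTION: f(F_m) = 1 - (1 - 0.5**k) ** (z/2 - 1); the source gives
    this formula only as an image, so this form is a guess that happens
    to match the worked example in the text.
    """
    p = 0.5 ** k                          # per-filter false positive rate
    f_half = p                            # stated in the text: f(F_half) = 0.5^k
    f_m = 1 - (1 - p) ** (z // 2 - 1)     # assumed form, see docstring
    return 1 - (1 - f_m) * (1 - f_half)


rate = matrix_hash_fpr(z=8, k=16)
print(f"{rate:.2e}")
```

With z = 8 and k = 16 this evaluates to about 6.1 × 10^-5, in line with the value quoted above.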
Insertion, query and deletion of key-value pairs
In a key-value storage system, the matrix hash algorithm inserts, queries and deletes key-value pairs as follows:
1. Insertion of key-value pairs
To insert a given key-value pair with key x, first check through the bitmaps whether the z candidate buckets are empty. Then insert the key-value pair into the sub-table with the lowest loading rate, to balance the loading rates of all sub-tables. Suppose the index of the sub-table to insert into is i. If $1 \le i \le z/2$, update $F_i$ to indicate that x is in sub-table $T_i$, and update the corresponding bitmap; if $z/2 < i \le z$, update $F_{z-i+1}$ to indicate that x is in sub-table $T_i$, and update $F_{half}$ and the corresponding bitmap. During insertion, to set the bit of a box that corresponds to $F_i$ to 1, it suffices to OR the $z/2$ bits of that box with $2^{i-1}$.
If the bitmaps show that the z candidate buckets for x are all full, the kick-out mechanism of cuckoo hashing is used, with the bitmaps deciding which key-value pair to kick. The bitmaps are used to check the z candidate buckets of x in turn, to determine whether an element y already stored in one of those buckets has an empty bucket among its candidates in the remaining z-1 sub-tables. If so, y is kicked out, x is inserted in its place, and y is inserted into its new location. If no such y can be found, a blind kick is performed and the insertion procedure above is repeated. The number of blind kicks is limited to θ; if it exceeds θ, the key-value pair is attached to the linked list of the last sub-table. By varying θ, matrix hash can trade off loading rate against insertion speed. Because the bitmaps keep a global record of empty and non-empty buckets in the sub-tables, using on-chip bitmaps significantly reduces the number of kicks.
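The insertion procedure can be sketched as follows for the case θ = 0 (bitmap-guided kicks only, no blind kicks). This is a simplified illustration under several assumptions: the class name `MatrixHashInsertSketch`, the BLAKE2-based bucket hash, and the plain Python list standing in for the linked list are all hypothetical; the Bloom filter updates are omitted; the least-loaded choice is made among sub-tables whose candidate bucket is empty; and updating an existing key is not handled.

```python
import hashlib


class MatrixHashInsertSketch:
    """Sub-tables plus bitmaps; illustrates insertion with bitmap-guided kicks."""

    def __init__(self, sizes):
        self.sizes = list(sizes)
        self.tables = [[None] * s for s in self.sizes]   # buckets hold (key, value)
        self.bitmaps = [[0] * s for s in self.sizes]     # 0 = empty, 1 = occupied
        self.counts = [0] * len(self.sizes)
        self.overflow = []   # stands in for the linked list on the last sub-table

    def _bucket(self, i, key):
        # One hash function per sub-table (an assumption of this sketch).
        digest = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.sizes[i]

    def _place(self, i, b, key, value):
        self.tables[i][b] = (key, value)
        self.bitmaps[i][b] = 1
        self.counts[i] += 1

    def insert(self, key, value):
        cands = [(i, self._bucket(i, key)) for i in range(len(self.sizes))]
        # Step 1: consult the bitmaps, then pick the least-loaded sub-table.
        empty = [(i, b) for i, b in cands if self.bitmaps[i][b] == 0]
        if empty:
            i, b = min(empty, key=lambda c: self.counts[c[0]] / self.sizes[c[0]])
            self._place(i, b, key, value)
            return ("stored", i)
        # Step 2: bitmap-guided kick. Look for an occupant y that has an
        # empty candidate bucket in one of the other sub-tables.
        for i, b in cands:
            y_key, y_val = self.tables[i][b]
            for j in range(len(self.sizes)):
                if j == i:
                    continue
                yb = self._bucket(j, y_key)
                if self.bitmaps[j][yb] == 0:
                    self._place(j, yb, y_key, y_val)   # move y to its new bucket
                    self.tables[i][b] = (key, value)   # x takes y's old bucket
                    return ("kicked", i)
        # Step 3 (theta = 0): no blind kicks, so fall back to the overflow list.
        self.overflow.append((key, value))
        return ("overflow", None)


table = MatrixHashInsertSketch([8, 6, 4, 2])
outcomes = [table.insert(f"k{n}", n)[0] for n in range(25)]
print(outcomes.count("stored"), outcomes.count("kicked"), outcomes.count("overflow"))
```

The kick step only reads bitmaps and the stored keys of the z candidate buckets, which is the reason the text gives for bitmaps reducing the number of kicks compared with blind displacement.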
2. Query of key-value pairs
To query x, first query x in $F_m$ and $F_{half}$. If $F_i$ ($1 \le i \le z/2$) returns true and $F_{half}$ returns false, check sub-table $T_i$. Otherwise, check sub-table $T_{z-i+1}$ first and, if there is no match, then check sub-table $T_i$. If x cannot be found in the z sub-tables, search the linked list of the last sub-table. If it is still not found, x is not in the hash table.
Note that a query needs to compute only k hash functions rather than z × k, because the $z/2$ Bloom filters that make up $F_m$ have identical parameters. To read the bit of a box that corresponds to $F_i$, AND the $z/2$ bits of that box with $2^{i-1}$. If the result is 0, the bit for $F_i$ in that box is 0; otherwise it is 1.
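The mask operations and the probe order can be illustrated with a small sketch. It assumes the $2^{i-1}$ mask convention described for insertion and query, stores each box as a Python integer, and derives the k positions from BLAKE2 digests; the names `MultiBitBloom` and `probe_order` are hypothetical, and the sketch is not the patent's implementation.

```python
import hashlib


class MultiBitBloom:
    """m boxes of c = z/2 bits each; bit i-1 of every box belongs to F_i."""

    def __init__(self, m, c, k):
        self.m, self.c, self.k = m, c, k
        self.boxes = [0] * m

    def _positions(self, key):
        # k hash values, computed once and shared by all c component filters.
        return [
            int.from_bytes(
                hashlib.blake2b(f"{j}:{key}".encode(), digest_size=8).digest(),
                "big",
            ) % self.m
            for j in range(self.k)
        ]

    def add(self, key, i):
        mask = 1 << (i - 1)            # 2^(i-1) selects the bit of F_i
        for p in self._positions(key):
            self.boxes[p] |= mask      # logical OR sets the bit

    def query(self, key):
        """Return the 1-based indices i whose component filter F_i reports key."""
        acc = (1 << self.c) - 1
        for p in self._positions(key):
            acc &= self.boxes[p]       # a filter reports key only if all k bits are set
        return [i for i in range(1, self.c + 1) if acc & (1 << (i - 1))]


def probe_order(i, z, in_half):
    """Sub-tables (1-based) to check when F_i reports the key."""
    return [i] if not in_half else [z - i + 1, i]


bf = MultiBitBloom(m=1024, c=4, k=16)
bf.add("10.0.0.0/8", 2)
print(bf.query("10.0.0.0/8"))
print(probe_order(2, z=8, in_half=False), probe_order(2, z=8, in_half=True))
```

Because the AND over the k boxes is done on whole boxes, a single pass over k memory locations yields the answer for all $z/2$ component filters, which is the saving the paragraph above describes.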
3. Deletion of key-value pairs
To delete x, matrix hash first finds the bucket where x is located using the query operation above, then removes the key-value pair from the bucket and resets the bit corresponding to that bucket in the bitmap.
Experimental data
To better evaluate the matrix hash of the present invention against existing hash designs, we used data from a real application. We obtained 12 forwarding information bases (FIBs) from the website www.ripe.net, taken at 8 a.m. on July 8, 2014, and for each FIB a synthetic traffic trace was generated uniformly over its prefixes. We use the part of each FIB that is relevant here, namely the prefixes and their next hops: the prefix serves as the key and the next hop as the value. We use β to denote the ratio of the total number of buckets in the hash table to the total number of elements, with 1.05 ≤ β ≤ 10, and θ to denote the threshold on blind kicks. The number of key-value pairs in the FIB is 500,000. Eight sub-tables were created whose sizes differ by 5000, and the total size of the sub-tables is βn. Setting θ = 0 means that blind kicks are not allowed and only bitmap-guided kicks are used. Each insertion of a key-value pair then takes at most 8+1 memory accesses; if an element finds no empty candidate bucket in the 8 sub-tables, it is inserted into the linked list of the last sub-table. The collision rate is the ratio of the number of elements on the linked list of the last sub-table to the total number of elements. The Bloom filters use 16 hash functions. The experimental results are as follows:
1. Matrix hash experiments:
1) Loading rate and collision rate
With β = 1.05 and θ = 0, the results show that matrix hash achieves a very high loading rate using only 1.05 × n buckets: the loading rates of the 8 sub-tables are well balanced, and the total loading rate is 95.19%. The collision rate is about 0.05%, and only a few FIBs have a collision rate above 0.06%.
2) Insertion and query time
With β = 1.05 and θ = 0, all elements of each FIB are inserted into the matrix hash. The results show that the more elements have been inserted, the more memory accesses an insertion requires. Most elements need fewer than 6 memory accesses per insertion, and the number of memory accesses per query lies between 1 and 1.0019, with an average of 1.00059.
3) Bitmap kicking and blind kicking
With β = 1.05, the results show that when θ = 5, inserting a key-value pair takes at most 8 × (5+1) + 1 = 49 memory accesses, querying a key-value pair takes fewer than 8 memory accesses, and the linked list of the last sub-table contains no elements. When θ = 0, only a few elements (0.56%) are on the linked list of the last sub-table even though blind kicks are not allowed, and the worst case for an insertion is 8+1 memory accesses.
4) Collision rate vs. β
With θ = 0, the results show that the larger β is, the lower the collision rate; when β ≥ 1.18, the collision rate approaches 0.
2. Matrix hash is compared to other hashes:
The experiments compare matrix hash with six well-known hash designs: chained hashing, linear probing, double hashing, cuckoo hashing, d-left hashing and peacock hashing. We first define insertion failure. For linear probing, double hashing and cuckoo hashing, when a collision occurs another bucket is probed, and this probing is repeated. The number of probes is limited to 500, so for these three designs each insertion takes at most 500 memory accesses; if a collision still occurs after 500 probes, the insertion is abandoned and the element is discarded, which constitutes an insertion failure. For peacock hashing and matrix hash, the Bloom filters use 16 hash functions.
Experiment 1 (β = 1.05, different FIBs)
1) Loading rate
The experimental results show that the loading rate of matrix hash is always the highest.
2) Insertion time
The experimental results show that matrix hash requires the fewest memory accesses per insertion of all designs except chained hashing. Chained hashing needs only one or two memory accesses per insertion, so its insertion time is short, but it has significant disadvantages in other respects. Matrix hash achieves fast insertion thanks to its Bloom filters and bitmaps.
3) Lookup time
The experimental results show that matrix hash has the shortest lookup time, because it has a higher loading rate and a lower false positive rate.
Experiment 2 (different β, FIB rrc00)
1) Loading rate
The experimental results show that the loading rate of matrix hash is always the highest; the loading rates of chained hashing and double hashing differ little; and peacock hashing reaches a high loading rate only when β is large.
2) Insertion time
The experimental results show that matrix hash requires the fewest memory accesses per insertion of all designs except chained hashing.
3) Lookup time
The experimental results show that matrix hash has the shortest lookup time.
The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit the same, and a person skilled in the art can modify the technical solution of the present invention or substitute the same without departing from the spirit and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims (3)

1. A data storage method based on matrix hash is characterized by comprising the following steps:
1) establishing a hash table data structure comprising z sub-tables, where z is an even number and the sub-table sizes decrease in arithmetic progression; for $1 \le i \le z/2$, combining the i-th sub-table with the (z-i+1)-th sub-table to obtain $z/2$ sub-tables of equal size; each of the z sub-tables corresponds to a bitmap, and each bit in the bitmap corresponds to a bucket in the corresponding sub-table; the bit in the bitmap for an empty bucket is 0, and the bit in the bitmap for a non-empty bucket is 1;
2) establishing an auxiliary data structure comprising z Bloom filters corresponding to the z sub-tables, where the Bloom filter sizes decrease in arithmetic progression; for $1 \le i \le z/2$, combining the i-th Bloom filter with the (z-i+1)-th Bloom filter to obtain $z/2$ Bloom filters of equal size; then placing the corresponding bits of the $z/2$ Bloom filters together to form one multi-bit Bloom filter $F_m$; adding an additional Bloom filter $F_{half}$, which records the $(z/2+1)$-th sub-table $T_{z/2+1}$ through the z-th sub-table $T_z$ of the z sub-tables, i.e. $T_{z/2+1}, \ldots, T_z$, to reduce the number of sub-tables queried;
the multi-bit Bloom filter $F_m$ is implemented as follows:
assume that the standard Bloom filters F1, F2 and F3 each have m bits; for F1, first take the most significant bit and shift the result left by 2·m bits, then take the next most significant bit, shift the result left by 2·(m-1) bits, and accumulate it with the value obtained for the most significant bit; continue in the same way for each bit down to the last one, and let the accumulated value be f1; perform the same operation on F2 and F3 to obtain the accumulated values f2 and f3; OR together f1, the result of shifting f2 right by one bit, and the result of shifting f3 right by two bits to obtain the multi-bit Bloom filter;
3) inserting key value pairs by using the hash table data structure and the auxiliary data structure to realize data storage;
wherein, step 3) includes:
inserting a new key value pair into the sub-table with the minimum loading rate in the z sub-tables every time the key value pair is inserted;
hanging a linked list on the last sub-table, namely the z-th sub-table, in the z sub-tables;
the key-value pairs are inserted as follows:
a) for a given key-value pair, checking through the bitmaps whether the z candidate buckets are empty, and then inserting the key-value pair into the sub-table with the lowest loading rate, to balance the loading rates of all sub-tables; suppose the index of the sub-table to insert into is i; if $1 \le i \le z/2$, updating the i-th Bloom filter $F_i$ of the z Bloom filters to indicate that the key-value pair is in the i-th sub-table $T_i$ of the z sub-tables, and updating the corresponding bitmap; if $z/2 < i \le z$, updating the (z-i+1)-th Bloom filter $F_{z-i+1}$ of the z Bloom filters to indicate that the key-value pair is in the i-th sub-table $T_i$ of the z sub-tables, and updating $F_{half}$ and the corresponding bitmap;
b) if the bitmap shows that all z candidate buckets for the key x are full, the key-value pair is inserted by using a kick mechanism;
wherein step b) is implemented as follows: using the bitmap, the z candidate buckets corresponding to x are checked in turn to determine whether the element y currently stored in a candidate bucket has an empty bucket in the remaining z-1 sub-tables into which y can be moved; if so, y is kicked out, x is inserted in its place, and y is inserted into its new position; if no such y can be found, a blind kick is performed and the above insertion process is repeated; the number of blind kicks is limited to theta, and if it exceeds theta, the key-value pair is hung on the linked list of the last sub-table.
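The insertion flow of steps a) and b) can be sketched in Python as follows. This is an illustrative sketch only, not the patented implementation: the class name `KickInsertSketch`, the per-sub-table hash in `slot`, the one-slot buckets, the random choice of the blind-kick victim and the Python list standing in for the linked list are all assumptions. The bloom filter updates are left out because the conditions that select which filter to update appear only as formula images in the claim, and duplicate keys are not handled.

```python
import random
from typing import Hashable, List, Optional, Tuple

Pair = Tuple[Hashable, object]


class KickInsertSketch:
    """z sub-tables of one-slot buckets, a bitmap, and an overflow list."""

    def __init__(self, z: int = 3, n: int = 8, theta: int = 4, seed: int = 0):
        self.z, self.n, self.theta = z, n, theta
        self.tables: List[List[Optional[Pair]]] = [[None] * n for _ in range(z)]
        self.bitmap = [[False] * n for _ in range(z)]  # True = bucket occupied
        self.load = [0] * z                            # items per sub-table
        self.overflow: List[Pair] = []                 # stand-in for the linked list
        self.rng = random.Random(seed)

    def slot(self, key: Hashable, i: int) -> int:
        # Hypothetical per-sub-table hash function.
        return hash((i, key)) % self.n

    def free_tables(self, key: Hashable, skip: int = -1) -> List[int]:
        # Sub-tables (other than `skip`) whose candidate bucket is empty per the bitmap.
        return [i for i in range(self.z)
                if i != skip and not self.bitmap[i][self.slot(key, i)]]

    def put(self, i: int, pair: Pair) -> None:
        j = self.slot(pair[0], i)
        if not self.bitmap[i][j]:
            self.load[i] += 1
            self.bitmap[i][j] = True
        self.tables[i][j] = pair

    def insert(self, key: Hashable, value: object) -> None:
        pair: Pair = (key, value)
        for kicks in range(self.theta + 1):
            free = self.free_tables(pair[0])
            if free:  # step a): emptiest sub-table among those with a free bucket
                self.put(min(free, key=lambda i: self.load[i]), pair)
                return
            for i in range(self.z):  # step b): look for an occupant y that can move
                y = self.tables[i][self.slot(pair[0], i)]
                y_free = self.free_tables(y[0], skip=i)
                if y_free:
                    self.put(y_free[0], y)  # y goes to its new position
                    self.put(i, pair)       # x takes y's old bucket
                    return
            if kicks == self.theta:  # blind-kick budget used up
                break
            i = self.rng.randrange(self.z)  # blind kick: evict a random occupant
            victim = self.tables[i][self.slot(pair[0], i)]
            self.put(i, pair)
            pair = victim                   # retry with the evicted pair
        self.overflow.append(pair)

    def contains(self, key: Hashable) -> bool:
        in_tables = any(self.bitmap[i][self.slot(key, i)]
                        and self.tables[i][self.slot(key, i)][0] == key
                        for i in range(self.z))
        return in_tables or any(k == key for k, _ in self.overflow)
```

Because the bitmap is consulted before any bucket is read, the check for a free candidate bucket touches only the bitmap; buckets are read only when a kick is actually needed.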
2. The method of claim 1, wherein key-value pairs are queried as follows: when looking up the key x, first query the key x in the multi-bit bloom filter F_m and in F_half; for
(formula image: Figure FDA0003554138030000021)
if F_i returns true and F_half returns false, the i-th sub-table T_i of the z sub-tables is checked; otherwise, the (z-i+1)-th sub-table T_{z-i+1} of the z sub-tables is checked first and, if there is no match, the i-th sub-table T_i is then checked; if the key x cannot be found in the z sub-tables, the linked list of the last sub-table is searched; if it is still not found, the key x is not in the hash table.
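The multi-bit bloom filter queried in claim 2 is built in claim 1 by spreading the bits of F1, F2 and F3 and OR-ing the shifted results. The Python sketch below shows one reading of that construction: bits are numbered 1 (least significant) to m (most significant), and "taking" a bit is interpreted as masking it in place before shifting it left by twice its number. The function names `spread`, `build_multibit` and `group_at` are hypothetical, and `group_at` is an added helper that illustrates why the packing helps a query; it is not taken from the claims.

```python
def spread(filter_bits: int, m: int) -> int:
    """Shift bit k (1 = least significant, m = most significant) left by 2*k."""
    acc = 0
    for k in range(m, 0, -1):                    # start from the most significant bit
        bit_in_place = filter_bits & (1 << (k - 1))
        acc += bit_in_place << (2 * k)           # accumulate the shifted bit
    return acc


def build_multibit(F1: int, F2: int, F3: int, m: int) -> int:
    """Combine three m-bit filters into one 3*m-bit word: f1 | f2 >> 1 | f3 >> 2."""
    f1, f2, f3 = (spread(F, m) for F in (F1, F2, F3))
    return f1 | (f2 >> 1) | (f3 >> 2)


def group_at(multibit: int, k: int) -> tuple:
    """Return the (F1, F2, F3) bits stored for position k (1-based)."""
    g = (multibit >> (3 * (k - 1))) & 0b111
    return (g >> 2) & 1, (g >> 1) & 1, g & 1


# Example with m = 2: F1 = 10, F2 = 01, F3 = 11 (binary).
packed = build_multibit(0b10, 0b01, 0b11, 2)
print(bin(packed))          # 0b101011
print(group_at(packed, 2))  # (1, 0, 1)
```

Under this reading, the bits that F1, F2 and F3 hold for the same position end up adjacent in the packed word, so a single read returns the answers of all three filters for that position.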
3. The method of claim 1, wherein key-value pairs are deleted as follows: when deleting the key x, first locate the bucket containing the key x by means of the query operation, then remove the key-value pair from that bucket and reset the bit corresponding to that bucket in the bitmap.
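A minimal Python sketch of the deletion in claim 3 is given below, under stated assumptions: the sub-tables are lists of one-slot buckets, `slot` is a hypothetical per-sub-table hash, and the bucket is located by probing every candidate bucket rather than by the bloom-filter-guided query of claim 2. The bloom filters are left untouched because the claim mentions only the bucket and the bitmap.

```python
from typing import Hashable, List, Optional, Tuple

Pair = Tuple[Hashable, object]


def slot(key: Hashable, i: int, n: int) -> int:
    # Hypothetical hash function for sub-table i with n buckets.
    return hash((i, key)) % n


def delete(tables: List[List[Optional[Pair]]],
           bitmap: List[List[bool]],
           key: Hashable) -> bool:
    """Remove `key` from its bucket and reset the bucket's bitmap bit."""
    n = len(tables[0])
    for i, table in enumerate(tables):
        j = slot(key, i, n)
        if bitmap[i][j] and table[j][0] == key:
            table[j] = None        # remove the key-value pair from the bucket
            bitmap[i][j] = False   # reset the corresponding bitmap bit
            return True
    return False                   # key was not stored in any candidate bucket


# Usage: two sub-tables of four buckets, with key 7 stored in sub-table 1.
tables: List[List[Optional[Pair]]] = [[None] * 4 for _ in range(2)]
bitmap = [[False] * 4 for _ in range(2)]
j = slot(7, 1, 4)
tables[1][j] = (7, "seven")
bitmap[1][j] = True
removed = delete(tables, bitmap, 7)
```

Resetting the bitmap bit is what makes the freed bucket visible to later insertions, since those consult the bitmap rather than the bucket itself.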
CN201710014205.9A 2017-01-09 2017-01-09 Data storage and query method based on matrix hash Active CN108287840B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710014205.9A CN108287840B (en) 2017-01-09 2017-01-09 Data storage and query method based on matrix hash

Publications (2)

Publication Number Publication Date
CN108287840A (en) 2018-07-17
CN108287840B (en) 2022-05-03

Family

ID=62819334

Family Applications (1)

Application Number Status Publication Priority Date Filing Date Title
CN201710014205.9A Active CN108287840B (en) 2017-01-09 2017-01-09 Data storage and query method based on matrix hash

Country Status (1)

Country Link
CN (1) CN108287840B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108989452A (en) * 2018-08-07 2018-12-11 佛山市苔藓云链科技有限公司 A kind of data transmission of internet of things device
CN109471635B (en) * 2018-09-03 2021-09-17 中新网络信息安全股份有限公司 Algorithm optimization method based on Java Set implementation
CN109597807A (en) * 2018-10-25 2019-04-09 阿里巴巴集团控股有限公司 Number storehouse list processing method and apparatus
CN109766341B (en) * 2018-12-27 2022-04-22 厦门市美亚柏科信息股份有限公司 Method, device and storage medium for establishing Hash mapping
CN109800228B (en) * 2018-12-28 2023-03-10 深圳竹云科技有限公司 Method for efficiently and quickly solving hash conflict
CN111563199B (en) * 2020-04-26 2023-10-10 北京奇艺世纪科技有限公司 Data processing method and device
CN111552692B (en) * 2020-04-30 2023-04-07 南方科技大学 Plus-minus cuckoo filter
CN112416933B (en) * 2020-11-19 2022-09-23 重庆邮电大学 High-performance hash table implementation method based on-chip and off-chip memories
CN112699323A (en) * 2021-01-07 2021-04-23 西藏宁算科技集团有限公司 Cloud caching system and cloud caching method based on double bloom filters
CN113342828A (en) * 2021-07-02 2021-09-03 广东唯审信息科技有限公司 Hash table conflict resolution method based on d-dimensional mapping

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104317795A (en) * 2014-08-28 2015-01-28 华为技术有限公司 Two-dimensional filter generation method, query method and device
CN105027527A (en) * 2012-12-31 2015-11-04 华为技术有限公司 Scalable storage systems with longest prefix matching switches
CN105468298A (en) * 2015-11-19 2016-04-06 中国科学院信息工程研究所 Key value storage method based on log-structured merged tree

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10810200B2 (en) * 2015-01-07 2020-10-20 International Business Machines Corporation Technology for join processing

Also Published As

Publication number Publication date
CN108287840A (en) 2018-07-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant