CN113535706A - Two-stage cuckoo filter and data deduplication method based on a two-stage cuckoo filter - Google Patents

Two-stage cuckoo filter and data deduplication method based on a two-stage cuckoo filter

Info

Publication number
CN113535706A
CN113535706A, CN202110885281.3A, CN202110885281A
Authority
CN
China
Prior art keywords
fingerprint
stage
candidate
algorithm
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110885281.3A
Other languages
Chinese (zh)
Other versions
CN113535706B (en)
Inventor
李挥
刘涛
王博辉
崔凯
蒋傅礼
张华宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Foshan Saisichen Technology Co ltd
Shenzhen Cestbon Technology Co ltd
Original Assignee
Foshan Saisichen Technology Co ltd
Shenzhen Cestbon Technology Co ltd
Chongqing Saiyushen Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Foshan Saisichen Technology Co ltd, Shenzhen Cestbon Technology Co ltd, Chongqing Saiyushen Technology Co ltd filed Critical Foshan Saisichen Technology Co ltd
Priority to CN202110885281.3A priority Critical patent/CN113535706B/en
Publication of CN113535706A publication Critical patent/CN113535706A/en
Application granted granted Critical
Publication of CN113535706B publication Critical patent/CN113535706B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/21 - Design, administration or maintenance of databases
    • G06F 16/215 - Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F 16/22 - Indexing; Data structures therefor; Storage structures
    • G06F 16/2228 - Indexing structures
    • G06F 16/24 - Querying
    • G06F 16/245 - Query processing
    • G06F 16/2455 - Query execution
    • G06F 16/24568 - Data stream processing; Continuous queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to improvements in data processing technology and provides a two-stage cuckoo filter comprising a plurality of buckets. Each bucket comprises a plurality of cells, each cell can store a data fingerprint, and the cells together form a two-dimensional fingerprint matrix. Each element to be inserted is associated with two hash functions, and the insertion algorithm is divided into two stages according to the load factor. By actively relocating fingerprints in the first stage, while the load factor is still low, the algorithm alleviates uneven data load, effectively reduces the insertion latency of the cuckoo filter, and increases the efficiency and throughput of the data deduplication system.

Description

Two-stage cuckoo filter and data deduplication method based on a two-stage cuckoo filter
Technical Field
The invention belongs to the field of data processing technology and in particular relates to a two-stage cuckoo filter and a data deduplication method based on the two-stage cuckoo filter.
Background
With the advent of the information age, data on the internet has grown explosively: IDC reported a global data volume of 33 ZB in 2018 and predicts 175 ZB by 2025. To relieve the cost of building and maintaining local storage, more and more individuals, companies, and organizations migrate their storage workloads to cloud service providers. The explosive growth of data, however, poses serious challenges to the storage capacity and network bandwidth of those providers. To address this data explosion, redundant-data-elimination techniques have been developed over many years, evolving from lossless data compression through lossy data compression to data deduplication.
In the early days of redundancy elimination, coding methods were widely studied and applied. Huffman coding builds a code with the shortest average length according to the probability of each symbol. The later LZ family of encodings builds a dictionary over the data; if both sender and receiver hold the same dictionary, the actual data can be replaced by dictionary indices, compressing the amount of data transferred.
For multimedia data, lossy compression is widely used: it raises the compression ratio at the cost of discarding unimportant information. For example, audio with a very full spectrum can have frequencies above 20 kHz (the upper limit of human hearing) cut off without affecting perceived quality, which is the idea behind MP3 lossy compression. For pictures, JPEG and PNG are two of the more common compression formats.
After the turn of the century, data deduplication emerged as a technique in its own right. It supports deduplication at multiple granularities, scales well, and can be extended from local storage to large-scale distributed storage systems. Deduplication detects duplicate data and stores only a single instance of each piece of data in a collection of digital files, thereby eliminating redundancy. Hash-identification-based deduplication, which is cheap to implement and removes duplicates effectively, is widely used in storage systems. The identification of each data block or file is computed and stored in a database; during deduplication the system computes the identification of the incoming block and compares it with the identifications in the database. If a match is found, the same block has already been stored, so the system skips storing it again but keeps the index information linking the file to the unique data block, ensuring that the file can later be reconstructed correctly.
As data volumes grow, the space required to store block identifications also grows, and the main memory of the storage system can no longer hold them, so slow external devices such as disks take over the task of storing the identifications. This introduces a disk-lookup bottleneck that limits the efficiency of the whole deduplication system and increases its response time. More and more deduplication systems therefore use additional techniques to mitigate the performance degradation caused by the disk bottleneck.
The DDFS system proposed by Data Domain uses a classical approximate set-membership data structure, the bloom filter, to avoid the disk bottleneck. A bloom filter trades a small loss of accuracy for a large saving in memory, answering set-membership queries (whether an element exists in a set) with minimal space overhead. It does not store the original data itself, only summary information about it: its main structure is a bit vector together with several hash functions that map each element to bit positions in the vector.
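For reference, the membership test just described can be sketched in a few lines of Python; the bit-vector size, the number of hash functions, and the use of salted SHA1 digests to derive bit positions are illustrative assumptions rather than details of DDFS:

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter: stores k bit positions per element, never the element itself."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: bytes):
        # Derive num_hashes bit positions from salted SHA1 digests of the item.
        for salt in range(self.num_hashes):
            digest = hashlib.sha1(bytes([salt]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes) -> bool:
        # False means definitely absent; True means possibly present (false positives are allowed).
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))
```

Note that there is no delete operation: clearing a bit could also erase information shared with other elements, which is exactly the limitation discussed below.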
DDFS uses a summary vector to improve deduplication performance. The summary vector is implemented as a bloom filter, is kept in DDFS main memory, and summarizes the data-segment identifications stored in the file system. When the deduplication system needs to query an identification, it first consults the summary vector; if the summary vector reports that the identification is absent, DDFS treats the segment as new and performs no further lookup. If the summary vector reports that the identification is present, it is present with high probability but not with certainty, so DDFS confirms the result with a database lookup. When the system shuts down, it writes the summary vector to disk so that its contents survive power loss; on restart, DDFS restores the most recent summary vector from disk and then inserts any data newer than the checkpoint, as shown in fig. 1.
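The decision flow described above can be written out as a small sketch; `summary_vector` and `fingerprint_db` are hypothetical stand-ins for the DDFS summary vector and on-disk index, not its actual interfaces:

```python
def segment_is_duplicate(fingerprint: bytes, summary_vector, fingerprint_db) -> bool:
    """Return True only when the data segment is confirmed to be stored already."""
    # A negative answer from the in-memory summary vector is definitive,
    # so the expensive disk lookup is skipped entirely for new segments.
    if not summary_vector.might_contain(fingerprint):
        summary_vector.add(fingerprint)   # the new segment will be stored, so record it
        return False
    # A positive answer may be a false positive; confirm it against the database on disk.
    return fingerprint_db.contains(fingerprint)
```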
A bloom filter does not support deletion, so neither does the bloom-filter-based summary vector. Deletions of files or data segments therefore cannot be propagated to the summary vector, which gradually degrades the accuracy of the information it holds. Since this accuracy directly affects DDFS throughput, the accumulated inaccuracy becomes a bottleneck for the whole system as it runs for a long time.
Agrawal et al. describe a cuckoo-filter-based data deduplication system. The cuckoo filter solves the bloom filter's inability to delete elements without sacrificing space or performance. Its data structure consists of a number of buckets, each containing several cells, and each cell can hold one fingerprint. Like the bloom filter, the cuckoo filter stores only fingerprints of elements, reducing space at the cost of some accuracy. Two hash functions are used to compute the two candidate bucket positions of an element; the two positions are related by partial-key cuckoo hashing, so the location of one candidate bucket can be derived from the element's fingerprint and the location of the other.
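The partial-key relationship can be made concrete with a small sketch; the 16-bit fingerprint width, the SHA1-based hashes, and the power-of-two bucket count are illustrative assumptions:

```python
import hashlib

def fingerprint(item: bytes) -> int:
    # 16-bit fingerprint of the element; only this value is stored in the filter.
    return int.from_bytes(hashlib.sha1(item).digest()[:2], "big") or 1

def bucket_index(item: bytes, num_buckets: int) -> int:
    return int.from_bytes(hashlib.sha1(b"bucket:" + item).digest()[:8], "big") & (num_buckets - 1)

def alt_bucket_index(index: int, fp: int, num_buckets: int) -> int:
    # Partial-key cuckoo hashing: the alternate bucket is derived from the current bucket
    # and the fingerprint alone, so it can be recomputed during relocation without the
    # original element. num_buckets must be a power of two so the XOR stays in range.
    return (index ^ int.from_bytes(hashlib.sha1(fp.to_bytes(2, "big")).digest()[:8], "big")) & (num_buckets - 1)

# The relation is symmetric: applying alt_bucket_index twice returns the starting bucket.
item, buckets = b"block-0001", 1 << 16
fp, i1 = fingerprint(item), bucket_index(item, buckets)
i2 = alt_bucket_index(i1, fp, buckets)
assert alt_bucket_index(i2, fp, buckets) == i1
```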
The authors also use the cuckoo filter to accelerate block-identification queries and reduce disk accesses during deduplication. When checking whether a data block already exists, the system first looks up the hash structure built on the cuckoo filter. If the block identification is found, the system tries to read the block's metadata from the cache, falling back to the metadata record otherwise; the cache is maintained with an LRU policy. If the identification is not found in the cuckoo filter, the block content is written directly to the storage system and the block is added to the metadata record. In this way the cuckoo filter speeds up the entire deduplication process, as shown in fig. 2.
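A compact sketch of that lookup path is shown below; `cuckoo_filter`, `metadata_store`, and `storage` are hypothetical objects standing in for the components described in the text:

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int = 1024):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)          # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict the least recently used entry

def lookup_block(block_id, cuckoo_filter, cache: LRUCache, metadata_store, storage):
    """Filter hit: try cache, then metadata record. Filter miss: store the new block."""
    if cuckoo_filter.contains(block_id):
        meta = cache.get(block_id)
        if meta is None:
            meta = metadata_store.get(block_id)   # fall back to the metadata record
            cache.put(block_id, meta)
        return meta
    meta = storage.write(block_id)                # definitely new: write it out
    metadata_store.add(block_id, meta)
    cuckoo_filter.insert(block_id)
    return meta
```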
The insertion algorithm of the cuckoo filter was not designed with the filter's load distribution in mind: a simple random choice between the two candidate buckets is used. This naive insertion strategy concentrates the load in a few buckets, and the resulting imbalance reduces insertion efficiency, increases latency, harms the usability of the whole cuckoo filter, and in turn lowers the efficiency of the deduplication system.
Disclosure of Invention
The invention aims to provide a two-stage cuckoo filter and a data deduplication method based on the two-stage cuckoo filter, in order to solve the technical problem that the cuckoo filter used in existing deduplication methods has low insertion efficiency.
The invention is realized as follows. The two-stage cuckoo filter consists of a plurality of buckets; each bucket consists of a plurality of cells, each cell can store a data fingerprint, and the cells form a two-dimensional fingerprint matrix structure. Each element to be inserted is associated with two hash functions, which give the positions of the element's two candidate buckets; the element's fingerprint may only be stored in these two buckets. On this structure, a two-stage insertion algorithm with two-stage relocation is defined.
In a further aspect of the invention: the first stage applies while the overall load factor of the cuckoo filter is below a preset threshold. In the first-stage insertion algorithm, the filter first computes the data fingerprint of the element, then the positions of its two candidate buckets, and then the loads of the two candidate buckets.
In a further aspect of the invention: after the loads of the two candidate buckets are computed, the algorithm checks them against a set value. If both loads are below the set value, the algorithm inserts into the candidate bucket with the lowest load and reports a successful insertion. If only one candidate bucket's load is below the set value, the algorithm inserts into that bucket and reports a successful insertion. If both loads are at or above the set value, the algorithm randomly selects one candidate bucket, evicts one fingerprint from it as the victim, inserts the fingerprint of the element to be inserted into the victim's former position, and performs the first-stage relocation operation.
In a further aspect of the invention: the first-stage relocation operation first checks whether the iteration count has reached its upper limit; if so, it reports an insertion failure. Otherwise it compares the iteration count with a preset value. If the count is below the preset value, the victim's other candidate bucket is computed with the partial-key cuckoo hash function and its load is checked: if the load is below the set value, the algorithm inserts the fingerprint into that bucket and reports a successful insertion; otherwise the algorithm randomly selects a fingerprint in that candidate bucket, evicts it, makes it the new victim, places the fingerprint being relocated in the new victim's former position, increments the iteration count, and returns to the first-stage relocation loop. If the iteration count is at or above the preset value, the victim's other candidate bucket is computed with the partial-key cuckoo hash function and the algorithm checks whether it is full: if it is not full, the algorithm inserts the fingerprint into it and reports a successful insertion; if it is full, the algorithm randomly selects a fingerprint in that candidate bucket, evicts it, makes it the new victim, places the fingerprint being relocated in the new victim's former position, increments the iteration count, and returns to the first-stage relocation loop.
In a further aspect of the invention: the second stage applies when the overall load factor of the cuckoo filter is greater than or equal to the preset threshold. In the second-stage insertion algorithm, the filter first computes the data fingerprint of the element, then the positions of its two candidate buckets, and then the loads of the two candidate buckets.
In a further aspect of the invention: in the second-stage insertion algorithm, the filter checks whether the two candidate buckets are full. If neither is full, the algorithm inserts the fingerprint into the bucket with the lower load and reports a successful insertion. If exactly one is full, the algorithm inserts the fingerprint into the bucket that is not full and reports a successful insertion. If both are full, the algorithm randomly selects one candidate bucket, randomly evicts one fingerprint from it as the victim, inserts the fingerprint of the element to be inserted in its place, and performs the second-stage relocation operation.
In a further aspect of the invention: the second-stage relocation operation first checks whether the current iteration count has reached its upper limit; if so, it reports an insertion failure. Otherwise, the algorithm computes the position of the victim's other candidate bucket with the partial-key cuckoo hash function and checks whether that bucket is full. If it is not full, the algorithm inserts the victim into it and reports a successful insertion. If it is full, the algorithm randomly evicts one fingerprint from it, makes that fingerprint the new victim, places the fingerprint being relocated in its former position, increments the iteration count, and returns to the second-stage relocation loop.
Another object of the present invention is to provide a data deduplication method based on the two-stage cuckoo filter. When a file stream enters the storage system, the method comprises the following steps:
S1: cut the file into data blocks and compute the fingerprint of each data block;
S2: send the identification of each data block to the two-stage cuckoo filter and query whether it exists. If the identification does not exist, the block is a brand-new data block: the system stores the block in the container area, stores the block's fingerprint and physical location as a key-value pair in the fingerprint index area, and stores the fingerprint in the file's list area. If the identification exists, the proposed deduplication method enters the disk database to compare fingerprints: if the disk database does not contain the fingerprint, the block is still a brand-new block and is kept, it is stored in the container area, its fingerprint and physical location are stored as a key-value pair in the fingerprint index area, and the fingerprint is stored in the file's list area; if the fingerprint does exist in the disk database, the block has already been stored by the storage system and its storage is abandoned.
In a further aspect of the invention: in step S1 the file is cut into data blocks with a rolling Rabin fingerprint chunking method.
In a further aspect of the invention: in step S1 the fingerprint of each data block is computed with the SHA1 secure hash function.
The invention has the following beneficial effects: the two-stage insertion algorithm alleviates uneven data load, effectively reduces the insertion latency of the cuckoo filter, and thereby increases the efficiency and throughput of the data deduplication system.
Drawings
Fig. 1 is a schematic diagram of a DDFS in the prior art provided by an embodiment of the present invention.
Fig. 2 is a schematic diagram of an accelerated de-duplication process of a cuckoo filter according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a data structure of a two-stage insertion cuckoo filter according to an embodiment of the present invention.
Fig. 4 is a flowchart of a two-phase insertion algorithm provided by an embodiment of the present invention.
Fig. 5 is a structural diagram of a data de-duplication method according to an embodiment of the present invention.
Fig. 6 is a flowchart of a data de-duplication method according to an embodiment of the present invention.
Fig. 7 is a schematic diagram of a rolling Rabin fingerprint method according to an embodiment of the present invention.
Detailed Description
In this scheme, a two-stage insertion algorithm is designed: different insertion strategies are used at different load levels, and relocation is performed proactively in the first, low-load stage to balance the load, laying the groundwork for insertion in the second stage.
With the advent of the information age, data on the network grows rapidly, and more and more enterprises face a data explosion that restricts the growth of their business. A high proportion of this massive data is redundant, duplicate data, which incurs additional space, bandwidth, and energy costs; data deduplication technology was developed in response.
Data deduplication proceeds through the steps of chunking the file stream, computing block fingerprints, comparing block fingerprints, compressing the data, and writing it to disk. Related research shows that the fingerprint-comparison step is the key target for accelerating deduplication. The explosive growth of data hits this step hard: as the total amount of data grows, the number of block fingerprints and the space they require grow with it, until the deduplication system's memory can no longer hold all fingerprints and a large fraction of them must be moved to disk. Because random access to a disk database is far slower than access to in-memory data, the extra disk IO caused by fingerprint queries lengthens index lookups and becomes the performance bottleneck of the whole deduplication process. Academia and industry have therefore proposed various schemes to overcome this bottleneck and speed up deduplication. One approach is to use a highly space-efficient summary data structure: such a probabilistic structure stores not the elements themselves but summary information about them, so the space overhead drops enough for the structure to fit in memory, disk IO is reduced, and the whole deduplication process is accelerated.
The cuckoo filter, a newer such data structure, has been adopted by some deduplication systems, but its insertion algorithm does not consider the load distribution across the filter. This increases insertion latency, reduces throughput and usability, and in turn lowers the efficiency of the whole deduplication system.
The invention discloses a method for deduplicating data in a storage system based on a two-stage cuckoo filter. It provides a two-stage insertion algorithm for the cuckoo filter that effectively balances the load inside the filter, spreading it across the buckets as evenly as possible, which increases the filter's throughput and reduces its insertion latency, yielding an efficient two-stage-cuckoo-filter-based deduplication scheme for storage systems.
First, we will introduce the core of the efficient deduplication technology proposed by the present invention, a two-stage cuckoo filter.
When an element x needs to be inserted into the cuckoo filter, the algorithm first computes its fingerprint with the SHA1 algorithm and computes the locations of its two candidate buckets with two hash functions. The algorithm then reads the current load of the filter and determines which stage it is in: the first stage if the load factor is less than 0.45, and the second stage otherwise. An upper iteration limit is imposed on the two-stage insertion algorithm to prevent the cuckoo filter's insertion from entering an infinite loop.
The two-stage-insertion cuckoo filter is identical to an ordinary cuckoo filter in its data structure; the difference lies in the insertion algorithm.
Each cuckoo filter consists of a number of buckets (one row in the figure); each bucket consists of a number of cells (one cell in the figure), and each cell can store one data fingerprint, so the cuckoo filter forms a two-dimensional fingerprint matrix. Each element to be inserted is associated with two hash functions, which give the positions of the element's two candidate buckets; the element's fingerprint may only be stored in these two buckets.
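The layout just described maps directly onto a two-dimensional array; the bucket count and bucket size below are illustrative assumptions (the insertion logic itself is sketched after the algorithm description below):

```python
class CuckooFilterTable:
    """Bucket-and-cell layout: each row is a bucket, each cell holds one fingerprint or None."""

    def __init__(self, num_buckets: int = 1 << 16, bucket_size: int = 4):
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        # Two-dimensional fingerprint matrix: num_buckets rows of bucket_size cells.
        self.cells = [[None] * bucket_size for _ in range(num_buckets)]
        self.count = 0

    def overall_load(self) -> float:
        # Overall load factor, used to choose between the first and second insertion phase.
        return self.count / (self.num_buckets * self.bucket_size)

    def bucket_load(self, index: int) -> float:
        # Per-bucket load, used by the phase-1 strategy described below.
        return sum(cell is not None for cell in self.cells[index]) / self.bucket_size
```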
The first phase of the insertion algorithm first computes the element's data fingerprint, then the positions of its two candidate buckets, and then the loads of the two candidate buckets, which gives three possible states: a) the loads of both candidate buckets are below 0.5; b) exactly one candidate bucket has a load below 0.5; c) the loads of both candidate buckets are 0.5 or more. In state a, the algorithm inserts into the candidate bucket with the lower load and returns success. In state b, the algorithm inserts into the candidate bucket whose load is below 0.5 and returns success. In state c, the algorithm randomly selects a candidate bucket and evicts one of its fingerprints, called the victim, inserts the fingerprint of the element to be inserted into the victim's former position, and then performs the relocation operation.
In the first-stage relocation operation, the algorithm first checks whether the iteration count has reached its upper limit; if so, it returns insertion failure. If not, two cases exist: case a, the iteration count is less than 3; case b, the iteration count is 3 or more.
1) Case a
The victim's other candidate bucket is computed with the partial-key cuckoo hash function, and its load is then examined; two conditions may exist: i) the load of that candidate bucket is below 0.5; ii) the load is 0.5 or more. In case i, the algorithm inserts the victim's fingerprint into this candidate bucket and returns success. In case ii, the algorithm randomly selects a fingerprint in this candidate bucket, evicts it, makes it the new victim, places the fingerprint being relocated in the new victim's former position, increments the iteration count, and loops back to the relocation operation.
2) Case b
The victim's other candidate bucket is computed with the partial-key cuckoo hash function, and the algorithm then checks whether this candidate bucket is full; two conditions may exist: i) the candidate bucket is not full; ii) the candidate bucket is full. In case i, the algorithm inserts the victim's fingerprint into this candidate bucket and returns success. In case ii, the algorithm randomly selects a fingerprint in this candidate bucket, evicts it, makes it the new victim, places the fingerprint being relocated in the new victim's former position, increments the iteration count, and loops back to the relocation operation.
The second-phase insertion algorithm obtains the loads of the two candidate buckets; there are three cases: a) neither candidate bucket is full; b) one candidate bucket is full and the other is not; c) both candidate buckets are full. In case a, the algorithm inserts the fingerprint into the candidate bucket with the lower load and returns success. In case b, the algorithm inserts the fingerprint into the bucket that is not full and returns success. In case c, the algorithm randomly selects a candidate bucket and randomly evicts one of its fingerprints, called the victim, inserts the fingerprint of the element to be inserted into this position, and enters the relocation operation.
The second-stage relocation operation first checks whether the current iteration count has reached its upper limit and returns insertion failure if it has. Otherwise the algorithm computes the position of the victim's other candidate bucket with the partial-key cuckoo hash function; two cases may exist: a) the candidate bucket is not full; b) the candidate bucket is full. In case a, the algorithm inserts the victim into this candidate bucket and returns success. In case b, the algorithm randomly evicts one of that bucket's fingerprints, makes it the new victim, places the fingerprint being relocated in its former position, increments the iteration count, and loops back to the relocation operation. The overall flow of the two-phase insertion algorithm is shown in fig. 4. The pseudocode of the whole two-phase insertion algorithm is as follows:
(The pseudocode appears only as an image in the original publication.)
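In its place, the following Python sketch reconstructs the two-phase insertion algorithm from the description above. The 0.45 phase threshold, the 0.5 per-bucket limit, and the switch after 3 relocation iterations are taken from the text; the bucket size of 4, the 16-bit fingerprints, the SHA1-based index hashing, and the iteration cap of 500 are illustrative assumptions:

```python
import hashlib
import random

BUCKET_SIZE = 4            # cells per bucket (assumption)
PHASE_THRESHOLD = 0.45     # overall load factor that separates phase 1 from phase 2
BUCKET_LOAD_LIMIT = 0.5    # per-bucket load limit used by the phase-1 strategy
EARLY_RELOC_ITERS = 3      # phase-1 relocation uses the stricter rule for its first iterations
MAX_ITERATIONS = 500       # upper iteration limit that prevents an infinite relocation loop


class TwoStageCuckooFilter:
    def __init__(self, num_buckets: int = 1 << 16):
        assert num_buckets & (num_buckets - 1) == 0, "power of two keeps XOR indices in range"
        self.num_buckets = num_buckets
        self.buckets = [[] for _ in range(num_buckets)]  # each list holds at most BUCKET_SIZE fingerprints
        self.count = 0

    # Hashing helpers (SHA1-based, as in the description; exact widths are assumptions).
    def _fingerprint(self, item: bytes) -> int:
        return int.from_bytes(hashlib.sha1(item).digest()[:2], "big") or 1

    def _index(self, item: bytes) -> int:
        return int.from_bytes(hashlib.sha1(b"idx:" + item).digest()[:8], "big") & (self.num_buckets - 1)

    def _alt_index(self, index: int, fp: int) -> int:
        # Partial-key cuckoo hashing: the alternate bucket depends only on the bucket and the fingerprint.
        h = int.from_bytes(hashlib.sha1(fp.to_bytes(2, "big")).digest()[:8], "big")
        return (index ^ h) & (self.num_buckets - 1)

    def _load(self, index: int) -> float:
        return len(self.buckets[index]) / BUCKET_SIZE

    def _place(self, index: int, fp: int) -> None:
        self.buckets[index].append(fp)
        self.count += 1

    def _evict_and_place(self, index: int, fp: int) -> int:
        # Remove a random fingerprint (the victim) and put fp in its place; bucket size is unchanged.
        victim = self.buckets[index].pop(random.randrange(len(self.buckets[index])))
        self.buckets[index].append(fp)
        return victim

    # Public insert: dispatch on the overall load factor.
    def insert(self, item: bytes) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        if self.count / (self.num_buckets * BUCKET_SIZE) < PHASE_THRESHOLD:
            return self._insert_phase1(fp, i1, i2)
        return self._insert_phase2(fp, i1, i2)

    # Phase 1: relocate proactively whenever both candidate buckets are already half full.
    def _insert_phase1(self, fp: int, i1: int, i2: int) -> bool:
        light = [i for i in (i1, i2) if self._load(i) < BUCKET_LOAD_LIMIT]
        if light:                                    # states a and b: some bucket is below the limit
            self._place(min(light, key=self._load), fp)
            return True
        index = random.choice((i1, i2))              # state c: evict a victim and relocate it
        victim = self._evict_and_place(index, fp)
        self.count += 1
        return self._relocate(victim, index, phase1=True)

    # Phase 2: relocate only when both candidate buckets are completely full.
    def _insert_phase2(self, fp: int, i1: int, i2: int) -> bool:
        not_full = [i for i in (i1, i2) if len(self.buckets[i]) < BUCKET_SIZE]
        if not_full:                                 # cases a and b: some bucket still has a free cell
            self._place(min(not_full, key=self._load), fp)
            return True
        index = random.choice((i1, i2))              # case c: evict a victim and relocate it
        victim = self._evict_and_place(index, fp)
        self.count += 1
        return self._relocate(victim, index, phase1=False)

    # Shared relocation loop; phase 1 applies the stricter half-full rule for its first iterations.
    def _relocate(self, victim: int, index: int, phase1: bool) -> bool:
        for iteration in range(MAX_ITERATIONS):
            index = self._alt_index(index, victim)
            strict = phase1 and iteration < EARLY_RELOC_ITERS
            has_room = (self._load(index) < BUCKET_LOAD_LIMIT) if strict \
                       else (len(self.buckets[index]) < BUCKET_SIZE)
            if has_room:
                self.buckets[index].append(victim)
                return True
            victim = self._evict_and_place(index, victim)
        self.count -= 1   # the dropped victim no longer counts (a real implementation would keep it aside)
        return False      # iteration limit reached: report insertion failure
```

A lookup would check both candidate buckets for the fingerprint and a delete would remove it from whichever bucket holds it; only the insertion path differs from an ordinary cuckoo filter, as noted above.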
Based on the two-stage cuckoo filter, a data deduplication method with the filter at its core is now introduced.
When a file stream enters the storage system, the deduplication flow is as follows. (1) The file is cut into data blocks with the rolling Rabin fingerprint chunking method, and the fingerprint of each data block is then computed with the SHA1 secure hash function. (2) The identification of each data block is sent to the two-stage-insertion cuckoo filter for a query, which can return two results: a) the identification is not present in the two-stage cuckoo filter; b) the identification is present in the two-stage-insertion cuckoo filter. In case a, the block is judged to be a new data block, so the system stores it in the container area, stores the block's fingerprint and physical location as a key-value pair in the fingerprint index area, and finally stores the fingerprint in the file's list area.
The rolling Rabin fingerprint method divides a file into variable-length data blocks: its input is a file data stream and its output is variable-length data blocks, as shown in fig. 7.
The algorithm steps are as follows (a Python sketch of the same loop appears after the list):
(1) Preset a sliding window size and a target fingerprint value.
(2) Place the window at the beginning of the file.
(3) Compute the Rabin fingerprint (hash value) of the data in the window; if it equals the preset fingerprint value, go to step 4, otherwise go to step 5.
(4) Mark the window boundary as a block boundary, then go to step 5.
(5) If more data remains in the file, slide the window forward and go to step 3; otherwise go to step 6.
(6) Finish the algorithm and output the blocks according to the computed boundaries.
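A minimal version of this loop is sketched below; a simple polynomial rolling hash stands in for the Rabin fingerprint, and the window size, modulus, and boundary condition (a masked hash matching a preset target) are illustrative assumptions:

```python
def chunk_boundaries(data: bytes, window: int = 48, mask: int = (1 << 13) - 1, target: int = 0x78):
    """Content-defined chunking: declare a block boundary wherever the windowed hash hits the preset value."""
    base, mod = 257, (1 << 31) - 1
    power = pow(base, window - 1, mod)                  # coefficient of the byte leaving the window
    h, boundaries = 0, []
    for i, byte in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * power) % mod    # step 5: slide the window forward
        h = (h * base + byte) % mod                     # step 3: fingerprint of the current window
        if i + 1 >= window and (h & mask) == target:
            boundaries.append(i + 1)                    # step 4: cut just after the current byte
    return boundaries

def split_into_chunks(data: bytes) -> list:
    """Step 6: output variable-length blocks according to the computed boundaries."""
    cuts = chunk_boundaries(data) + [len(data)]
    chunks, start = [], 0
    for cut in cuts:
        if cut > start:
            chunks.append(data[start:cut])
            start = cut
    return chunks
```

On random input this yields chunks of roughly 8 KB on average (the mask has 13 bits), and a local edit only shifts nearby boundaries, which is the property block-level deduplication relies on.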
Rabin fingerprint algorithm
The input of the Rabin fingerprint algorithm is a binary string and the output is a short binary digest (a small GF(2) implementation sketch follows the steps).
(1) Let A = (b_1, ..., b_m) be the input binary string.
(2) Construct from A the corresponding polynomial A(t) of degree m-1.
(3) Choose a polynomial P(t) of degree k.
(4) Compute the Rabin fingerprint = A(t) mod P(t).
(5) Output the Rabin fingerprint.
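The computation A(t) mod P(t) over GF(2) can be sketched with bit operations; the degree-9 polynomial used below is an arbitrary example (in practice P(t) is chosen as a random irreducible polynomial):

```python
def rabin_fingerprint(bits: int, bit_len: int, poly: int = 0b1011011001, degree: int = 9) -> int:
    """Divide the bit-polynomial A(t) by P(t) over GF(2) and return the remainder."""
    remainder = 0
    for i in range(bit_len - 1, -1, -1):
        remainder = (remainder << 1) | ((bits >> i) & 1)   # bring down the next coefficient of A(t)
        if remainder >> degree:                            # degree reached deg P(t): reduce
            remainder ^= poly                              # XOR is subtraction over GF(2)
    return remainder

# Fingerprint of the 16-bit message 1100 1010 1111 0001, i.e. a degree-15 polynomial A(t).
print(bin(rabin_fingerprint(0b1100101011110001, 16)))
```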
SHA1 function
The SHA1 algorithm is a secure hash algorithm whose input is binary data and whose output is a 160-bit SHA1 message digest. For any input shorter than 2^64 bits, SHA1 produces a 160-bit digest from which the original input cannot be recovered.
For plaintext of any length, SHA1 first splits it into 512-bit groups and then repeats the same compression process for each group.
The digest generation process for each plaintext block is as follows (in practice a library implementation is used, as in the example after the list):
(1) Divide the 512-bit plaintext block into 16 sub-blocks of 32 bits each.
(2) Declare five 32-bit chaining variables, denoted A, B, C, D, E.
(3) Expand the 16 sub-blocks into 80 sub-blocks.
(4) Apply four rounds of operations to the 80 sub-blocks.
(5) Add the chaining variables to the initial chaining variables.
(6) Repeat the above with the resulting chaining variables as input for the next plaintext block.
(7) Finally, the contents of the five chaining variables form the SHA1 digest.
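These rounds are what a library implementation performs internally; in the deduplication flow the fingerprint is simply obtained from such an implementation, for example Python's hashlib:

```python
import hashlib

def chunk_fingerprint(chunk: bytes) -> str:
    # 160-bit SHA1 digest rendered as 40 hex characters; this string is the
    # block identification compared during deduplication.
    return hashlib.sha1(chunk).hexdigest()

fp = chunk_fingerprint(b"example data block")
assert len(fp) == 40
```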
In case b, the proposed deduplication method enters the disk database to compare fingerprints. If the disk database does not contain the fingerprint, the block is still a brand-new block and is kept: it is stored in the container area, its fingerprint and physical location are stored as a key-value pair in the fingerprint index area, and finally the fingerprint is stored in the file's list area. If the fingerprint does exist in the disk database, the block has already been stored by the storage system and its storage is abandoned. The block diagram of the deduplication method is shown in fig. 5 and its flow diagram in fig. 6. The pseudocode of the deduplication method is as follows:
(The pseudocode appears only as an image in the original publication.)
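In its place, the flow is reconstructed as a sketch below; `cuckoo_filter`, `fingerprint_index`, `container_store`, and `file_recipe` are hypothetical components named after the areas described in the text, and the chunks are assumed to come from the chunker sketched earlier:

```python
import hashlib

def deduplicate_chunks(chunks, cuckoo_filter, fingerprint_index, container_store, file_recipe):
    """Block-level deduplication built around the two-stage cuckoo filter."""
    for chunk in chunks:
        fp = hashlib.sha1(chunk).hexdigest()             # block identification
        if not cuckoo_filter.contains(fp):
            # Filter miss: definitely a new block. Store it, index its location, add it to the filter.
            location = container_store.write(chunk)
            fingerprint_index.put(fp, location)
            cuckoo_filter.insert(fp)
        elif fingerprint_index.get(fp) is None:
            # Filter hit but no index entry: a false positive, so the block is still new and is kept.
            location = container_store.write(chunk)
            fingerprint_index.put(fp, location)
        # Duplicate blocks are not stored again; every block is referenced from the file's list area.
        file_recipe.append(fp)
```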
The scheme provides an improved insertion algorithm for the cuckoo filter, namely the two-stage insertion algorithm, which alleviates uneven data load; experimental evaluation shows that it effectively reduces the cuckoo filter's insertion latency, thereby increasing the efficiency and throughput of the data deduplication system.
To address the uneven data distribution of the cuckoo filter, the approximate data structure used in previous deduplication schemes, the invention proposes a two-stage insertion algorithm: a stricter relocation condition is applied in the first stage, and a more proactive relocation strategy balances the data distribution, so that the second stage, which contributes most of the insertion latency, starts from a better data distribution. This reduces the filter's insertion latency and accelerates the whole deduplication process.
The invention provides a block-level data deduplication scheme based on the two-stage-insertion cuckoo filter; by exploiting the filter's superior insertion performance, the latency of the whole deduplication process is effectively reduced.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. A two-stage cuckoo filter, characterized in that the cuckoo filter consists of a plurality of buckets, each bucket consists of a plurality of cells, each cell can store a data fingerprint, and the cells form a two-dimensional fingerprint matrix structure; each element to be inserted is associated with two hash functions; the insertion algorithm is divided into two stages, and the uniformity of the data distribution is increased through active relocation in the first stage, forming a two-stage element-insertion algorithm.
2. The two-stage cuckoo filter of claim 1, wherein the first stage applies when the overall load factor of the cuckoo filter is less than a preset threshold, and wherein, in the first-stage insertion algorithm, the cuckoo filter first computes the data fingerprint of the element to be inserted, then computes the positions of the two candidate buckets, and then computes the loads of the two candidate buckets.
3. The two-stage cuckoo filter of claim 2, wherein, after the loads of the two candidate buckets are computed, the algorithm checks whether they exceed a set value; if both loads are below the set value, the algorithm inserts into the candidate bucket with the lowest load and reports a successful insertion; if only one candidate bucket's load is below the set value, the algorithm inserts into that candidate bucket and reports a successful insertion; if the loads of both candidate buckets exceed the set value, the algorithm randomly selects one candidate bucket, evicts one fingerprint from it as the victim, inserts the fingerprint of the element to be inserted into the victim's former position, and performs the first-stage relocation operation.
4. The two-stage cuckoo filter of claim 3, wherein the first-stage relocation operation determines whether the iteration count has reached an upper limit and, if so, reports an insertion failure; if not, it compares the iteration count with a preset value; if the iteration count is below the preset value, the victim's other candidate bucket is computed with the partial-key cuckoo hash function and its load is checked: if the load is below the set value, the algorithm inserts the fingerprint into that bucket and reports a successful insertion; otherwise, the algorithm randomly selects a fingerprint in the current candidate bucket, evicts it, makes it the new victim, inserts the fingerprint being relocated into the victim's former position, increments the iteration count, and returns to the first-stage relocation loop; if the iteration count is greater than or equal to the preset value, the victim's other candidate bucket is computed with the partial-key cuckoo hash function and it is checked whether that bucket is full: if the bucket is not full, the algorithm inserts the fingerprint into it and reports a successful insertion; if the bucket is full, the algorithm randomly selects a fingerprint in the current candidate bucket, evicts it, makes it the new victim, inserts the fingerprint being relocated into the victim's former position, increments the iteration count, and returns to the first-stage relocation loop.
5. The two-stage cuckoo filter of claim 4, wherein the second stage applies when the overall load factor of the cuckoo filter is greater than or equal to the preset threshold, and wherein, in the second-stage insertion algorithm, the cuckoo filter first computes the data fingerprint of the element, then computes the positions of the two candidate buckets, and then computes the loads of the two candidate buckets.
6. The two-stage cuckoo filter of claim 5, wherein, during insertion, the cuckoo filter checks whether the two candidate buckets are full; if neither candidate bucket is full, the algorithm inserts the fingerprint into the candidate bucket with the lowest load and reports a successful insertion; if one candidate bucket is full and the other is not, the algorithm inserts the fingerprint into the bucket that is not full and reports a successful insertion; if both candidate buckets are full, the algorithm randomly selects one candidate bucket, randomly evicts one of its fingerprints as the victim, inserts the fingerprint of the element to be inserted into that position, and proceeds to the second-stage relocation operation.
7. The two-stage cuckoo filter of claim 6, wherein the second-stage relocation operation determines whether the current iteration count has reached an upper limit and, if so, reports an insertion failure; if not, the algorithm computes the position of the victim's other candidate bucket with the partial-key cuckoo hash function and checks whether that bucket is full; if the bucket is not full, the algorithm inserts the element into it and reports a successful insertion; if the bucket is full, the algorithm randomly evicts one of its fingerprints, makes it the new victim, inserts the fingerprint being relocated into the victim's former position, increments the iteration count, and returns to the second-stage relocation loop.
8. A data deduplication method based on the two-stage cuckoo filter of any one of claims 1-7, characterized in that, when a file stream enters a storage system, the method comprises the following steps:
S1: cutting the file into data blocks and computing the fingerprint of each data block;
S2: sending the identification of each data block to the two-stage cuckoo filter and querying whether it exists; if the identification does not exist, judging that the data block is a brand-new block, storing the block in a container area, storing the block's fingerprint and physical location as a key-value pair in a fingerprint index area, and storing the fingerprint in the file's list area; if the identification exists, entering a disk database to compare fingerprints: if the disk database does not contain the fingerprint, the block is a brand-new block and is kept, the block is stored in the container area, its fingerprint and physical location are stored as a key-value pair in the fingerprint index area, and the fingerprint is stored in the file's list area; if the fingerprint exists in the disk database, the block has already been stored by the storage system and its storage is abandoned.
9. The two-stage-cuckoo-filter-based data deduplication method of claim 8, wherein in step S1 the file is cut into data blocks with a rolling Rabin fingerprint chunking method.
10. The two-stage-cuckoo-filter-based data deduplication method of claim 8 or 9, wherein in step S1 the fingerprint of each data block is computed with the SHA1 secure hash function.
CN202110885281.3A 2021-08-03 2021-08-03 Two-stage cuckoo filter and repeated data deleting method based on two-stage cuckoo filter Active CN113535706B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110885281.3A CN113535706B (en) 2021-08-03 2021-08-03 Two-stage cuckoo filter and repeated data deleting method based on two-stage cuckoo filter

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110885281.3A CN113535706B (en) 2021-08-03 2021-08-03 Two-stage cuckoo filter and repeated data deleting method based on two-stage cuckoo filter

Publications (2)

Publication Number Publication Date
CN113535706A true CN113535706A (en) 2021-10-22
CN113535706B CN113535706B (en) 2023-05-23

Family

ID=78090176

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110885281.3A Active CN113535706B (en) 2021-08-03 2021-08-03 Two-stage cuckoo filter and repeated data deleting method based on two-stage cuckoo filter

Country Status (1)

Country Link
CN (1) CN113535706B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114268501A (en) * 2021-12-24 2022-04-01 深信服科技股份有限公司 Data processing method, firewall generation method, computing device and storage medium
CN114844638A (en) * 2022-07-03 2022-08-02 浙江九州量子信息技术股份有限公司 Big data volume secret key duplication removing method and system based on cuckoo filter
US11416499B1 (en) * 2021-10-12 2022-08-16 National University Of Defense Technology Vertical cuckoo filters
CN115052264A (en) * 2022-08-11 2022-09-13 中国铁道科学研究院集团有限公司电子计算技术研究所 Railway passenger station wireless network communication method and device based on multipath screening
CN115510092A (en) * 2022-09-27 2022-12-23 青海师范大学 Approximate member query optimization method based on cuckoo filter
CN115643301A (en) * 2022-10-24 2023-01-24 湖南大学 DDS (direct digital synthesizer) automatic discovery method and medium based on compressed cuckoo filter
CN116467307A (en) * 2023-03-29 2023-07-21 济南大学 Design method and system for cuckoo filter for reducing false positive rate
CN116701440A (en) * 2023-06-15 2023-09-05 泉城省实验室 Cuckoo filter and data insertion, query and deletion method

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102156727A (en) * 2011-04-01 2011-08-17 华中科技大学 Method for deleting repeated data by using double-fingerprint hash check
US20190026042A1 (en) * 2017-07-18 2019-01-24 Vmware, Inc. Deduplication-Aware Load Balancing in Distributed Storage Systems
CN109815234A (en) * 2018-12-29 2019-05-28 杭州中科先进技术研究院有限公司 A kind of multiple cuckoo filter under streaming computing model
CN110222088A (en) * 2019-05-20 2019-09-10 华中科技大学 Data approximation set representation method and system based on insertion position selection
CN111552692A (en) * 2020-04-30 2020-08-18 南方科技大学 Plus-minus cuckoo filter
CN111552693A (en) * 2020-04-30 2020-08-18 南方科技大学 Tag cuckoo filter
CN111858651A (en) * 2020-09-22 2020-10-30 中国人民解放军国防科技大学 Data processing method and data processing device
CN112148928A (en) * 2020-09-18 2020-12-29 鹏城实验室 Cuckoo filter based on fingerprint family

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102156727A (en) * 2011-04-01 2011-08-17 华中科技大学 Method for deleting repeated data by using double-fingerprint hash check
US20190026042A1 (en) * 2017-07-18 2019-01-24 Vmware, Inc. Deduplication-Aware Load Balancing in Distributed Storage Systems
CN109815234A (en) * 2018-12-29 2019-05-28 杭州中科先进技术研究院有限公司 A kind of multiple cuckoo filter under streaming computing model
CN110222088A (en) * 2019-05-20 2019-09-10 华中科技大学 Data approximation set representation method and system based on insertion position selection
CN111552692A (en) * 2020-04-30 2020-08-18 南方科技大学 Plus-minus cuckoo filter
CN111552693A (en) * 2020-04-30 2020-08-18 南方科技大学 Tag cuckoo filter
CN112148928A (en) * 2020-09-18 2020-12-29 鹏城实验室 Cuckoo filter based on fingerprint family
CN111858651A (en) * 2020-09-22 2020-10-30 中国人民解放军国防科技大学 Data processing method and data processing device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BIN FAN et al.: "Cuckoo Filter: Practically better than bloom", 《PROCEEDINGS OF THE 10TH ACM INTERNATIONAL ON CONFERENCE ON EMERGING NETWORKING EXPERIMENTS AND TECHNOLOGIES》 *
WANG FEIYUE: "Research on Efficient Cuckoo Filters Based on Load Balancing", 《CHINA MASTER'S THESES FULL-TEXT DATABASE, INFORMATION SCIENCE AND TECHNOLOGY》 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11416499B1 (en) * 2021-10-12 2022-08-16 National University Of Defense Technology Vertical cuckoo filters
CN114268501B (en) * 2021-12-24 2024-02-23 深信服科技股份有限公司 Data processing method, firewall generating method, computing device and storage medium
CN114268501A (en) * 2021-12-24 2022-04-01 深信服科技股份有限公司 Data processing method, firewall generation method, computing device and storage medium
CN114844638A (en) * 2022-07-03 2022-08-02 浙江九州量子信息技术股份有限公司 Big data volume secret key duplication removing method and system based on cuckoo filter
CN114844638B (en) * 2022-07-03 2022-09-20 浙江九州量子信息技术股份有限公司 Big data volume secret key duplication removing method and system based on cuckoo filter
CN115052264A (en) * 2022-08-11 2022-09-13 中国铁道科学研究院集团有限公司电子计算技术研究所 Railway passenger station wireless network communication method and device based on multipath screening
CN115052264B (en) * 2022-08-11 2022-11-22 中国铁道科学研究院集团有限公司电子计算技术研究所 Railway passenger station wireless network communication method and device based on multipath screening
CN115510092A (en) * 2022-09-27 2022-12-23 青海师范大学 Approximate member query optimization method based on cuckoo filter
CN115510092B (en) * 2022-09-27 2023-05-12 青海师范大学 Approximate member query optimization method based on cuckoo filter
CN115643301A (en) * 2022-10-24 2023-01-24 湖南大学 DDS (direct digital synthesizer) automatic discovery method and medium based on compressed cuckoo filter
CN115643301B (en) * 2022-10-24 2024-04-09 湖南大学 DDS automatic discovery method and medium based on compressed cuckoo filter
CN116467307A (en) * 2023-03-29 2023-07-21 济南大学 Design method and system for cuckoo filter for reducing false positive rate
CN116467307B (en) * 2023-03-29 2024-02-23 济南大学 Design method and system for cuckoo filter for reducing false positive rate
CN116701440A (en) * 2023-06-15 2023-09-05 泉城省实验室 Cuckoo filter and data insertion, query and deletion method
CN116701440B (en) * 2023-06-15 2024-04-16 泉城省实验室 Cuckoo filter and data insertion, query and deletion method

Also Published As

Publication number Publication date
CN113535706B (en) 2023-05-23

Similar Documents

Publication Publication Date Title
CN113535706B (en) Two-stage cuckoo filter and repeated data deleting method based on two-stage cuckoo filter
US9223794B2 (en) Method and apparatus for content-aware and adaptive deduplication
CN103177111B (en) Data deduplication system and delet method thereof
US9262434B1 (en) Preferential selection of candidates for delta compression
US9405764B1 (en) Method for cleaning a delta storage system
KR101653692B1 (en) Data object processing method and apparatus
EP2348690B1 (en) Methods and apparatus for compression and network transport of data in support of continuous availability of applications
US8051252B2 (en) Method and apparatus for detecting the presence of subblocks in a reduced-redundancy storage system
EP1866776B1 (en) Method for detecting the presence of subblocks in a reduced-redundancy storage system
US20120150824A1 (en) Processing System of Data De-Duplication
CN108415671B (en) Method and system for deleting repeated data facing green cloud computing
CN110032470B (en) Method for constructing heterogeneous partial repeat codes based on Huffman tree
US9116902B1 (en) Preferential selection of candidates for delta compression
CN112380196B (en) Server for data compression transmission
CN112162973A (en) Fingerprint collision avoidance, deduplication and recovery method, storage medium and deduplication system
CN109255090B (en) Index data compression method of web graph
Conde-Canencia et al. Data deduplication with edit errors
Sengar et al. A Parallel Architecture for In-Line Data De-duplication
CN111352587A (en) Data packing method and device
CN111177092A (en) Deduplication method and device based on erasure codes
Joe et al. Comprehensive analysis of content defined de-duplication approaches for big data storage
Liu et al. Tscf: An efficient two-stage cuckoo filter for data deduplication
Goel et al. A Detailed Review of Data Deduplication Approaches in the Cloud and Key Challenges
CN113625961B (en) Self-adaptive threshold value repeated data deleting method based on greedy selection
KR20190049244A (en) Lightweight complexity based packet-level deduplication apparatus and method, storage media storing the same

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230404

Address after: Room 20059, 2nd Floor, Building 5, Phase 1, Guangdong Xiaxi International Rubber and Plastic City, Nanping West Road, Guicheng Street, Nanhai District, Foshan City, Guangdong Province, 528200 (Residence Declaration)

Applicant after: Foshan saisichen Technology Co.,Ltd.

Applicant after: SHENZHEN CESTBON TECHNOLOGY Co.,Ltd.

Address before: 400000 building 10, No. 1, Jiangxia Road, Changshengqiao Town, economic development zone, Nan'an District, Chongqing

Applicant before: Chongqing saiyushen Technology Co.,Ltd.

Applicant before: Foshan saisichen Technology Co.,Ltd.

Applicant before: SHENZHEN CESTBON TECHNOLOGY Co.,Ltd.

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant