CN113535706B - Two-stage cuckoo filter and repeated data deleting method based on two-stage cuckoo filter - Google Patents

Info

Publication number
CN113535706B
CN113535706B CN202110885281.3A CN202110885281A CN113535706B CN 113535706 B CN113535706 B CN 113535706B CN 202110885281 A CN202110885281 A CN 202110885281A CN 113535706 B CN113535706 B CN 113535706B
Authority
CN
China
Prior art keywords
candidate
fingerprint
stage
algorithm
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110885281.3A
Other languages
Chinese (zh)
Other versions
CN113535706A (en
Inventor
李挥
刘涛
王博辉
崔凯
蒋傅礼
张华宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Foshan Saisichen Technology Co ltd
Shenzhen Cestbon Technology Co ltd
Original Assignee
Shenzhen Cestbon Technology Co ltd
Foshan Saisichen Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Cestbon Technology Co ltd, Foshan Saisichen Technology Co ltd
Priority to CN202110885281.3A
Publication of CN113535706A
Application granted
Publication of CN113535706B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24568 Data stream processing; Continuous queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the field of data processing technology and provides a two-stage cuckoo filter. The filter comprises a plurality of buckets, each bucket comprises a plurality of cells, and each cell can store one data fingerprint, so that the cells form a two-dimensional fingerprint matrix structure. Each element to be inserted is associated with two hash functions. The insertion algorithm is divided into two stages according to the load rate: in the first stage, where the load rate is lower, an active relocation strategy is used to relieve uneven data load. This effectively reduces the insertion delay of the cuckoo filter and thereby increases the efficiency and throughput of a deduplication system.

Description

Two-stage cuckoo filter and repeated data deleting method based on two-stage cuckoo filter
Technical Field
The invention belongs to the field of data processing technology improvement, and particularly relates to a two-stage cuckoo filter and a repeated data deleting method based on the two-stage cuckoo filter.
Background
With the advent of the information age, data on the internet has grown explosively. According to IDC reports, the global volume of data reached 33 ZB in 2018 and is predicted to reach 175 ZB by 2025. Meanwhile, to relieve the cost pressure of building and maintaining local storage, more and more individuals, companies and organizations are migrating their data storage to cloud service providers. However, the explosive growth of data presents serious challenges to cloud service providers in terms of storage capacity, network bandwidth and so on. To address this data explosion, redundant data elimination techniques were proposed and have developed over many years, from lossless data compression to lossy data compression to deduplication.
In the early stages of redundant data elimination, coding schemes were widely used and studied. Huffman coding constructs the code with the shortest average length according to the probability of occurrence of each character. The later LZ coding builds a dictionary of the data; if both the sender and the receiver hold this dictionary, the data actually transmitted can be replaced by dictionary indexes, reducing the amount of data transferred.
For multimedia data, lossy compression techniques are widely used. They improve the compression ratio at the cost of some unimportant information. For example, for music with a very complete frequency spectrum, cutting off the spectrum above 20 kHz (the upper limit of human hearing) does not affect the perceived quality of the music; this is the idea behind MP3 lossy compression. For pictures, JPEG and PNG are two of the more popular compression formats (JPEG is lossy, whereas PNG is lossless).
After the turn of the century, deduplication technology formally appeared. It supports deduplication at multiple granularities, scales well, and can be extended from local storage to large-scale distributed storage systems. Deduplication detects duplicate data in a set of digital files and keeps only a unique instance of each piece of data, thereby eliminating redundancy. Deduplication based on hash identifiers is cheap to implement and highly effective, and is widely applied in various storage systems. The system computes an identifier for each data block or file and stores the identifier in a database. During deduplication, the system computes the identifier of the incoming data block and compares it with the identifiers in the database. If a match is found, an identical data block has already been stored, so the system does not store the block again but keeps the index information linking the file to the unique data block, which guarantees that the file can later be reconstructed.
As the volume of data expands rapidly, the space required to store data block identifiers also grows. Once the main memory of the storage system can no longer hold them, slow external storage such as magnetic disks has to store the data block identifiers. This introduces a disk lookup bottleneck, which limits the efficiency of the whole deduplication system and lengthens its response time. More and more deduplication systems therefore use additional techniques to alleviate the performance degradation caused by the disk bottleneck.
The DDFS system proposed by DataDomain uses a classical approximate set-membership data structure, the Bloom filter, to avoid the disk bottleneck. A Bloom filter is a typical data structure that trades a small loss of accuracy for a lower memory overhead: it can decide whether an element is present in a set with very little space. The Bloom filter does not store the original data itself, but rather summary information about it. Its main data structure is a bit vector, together with several hash functions that map each item to bits in the bit vector.
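To make the bit-vector idea concrete, the following is a minimal Python sketch of a Bloom filter. It is an illustration only, not the DDFS implementation: the class name, the parameters, and the choice of deriving the hash functions by salting SHA-1 with an index are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Bit-vector summary of a set; supports insertion and lookup only."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Assumption: k hash functions are simulated by salting SHA-1
        # with the function index. Any k independent hashes would do.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(i.to_bytes(4, "big") + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: bytes) -> bool:
        # False means "definitely absent"; True means "probably present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

Because several items can set the same bit, a bit cannot be cleared safely when one item is removed, which is why this structure offers no delete operation.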
DDFS uses a summary vector to improve the performance of the deduplication system. The summary vector is implemented as a Bloom filter, is kept in DDFS's main memory, and represents a summary of the data segment identifiers in the file system. When the deduplication system needs to check whether an index value exists, it first consults the summary vector. If the summary vector reports that the index value does not exist, DDFS treats the data segment as new without any further lookup. If the summary vector reports that the index value exists, it exists with high probability, but this is not guaranteed, so DDFS verifies the result with a database lookup. When the system shuts down, it saves the summary vector to disk, so that the summary vector is not lost after power-off. When the system restarts and recovers, DDFS restores the most recent checkpoint of the summary vector from disk and then inserts into it the data added after that checkpoint, as shown in fig. 1.
The Bloom filter does not support deletion, so neither does a summary vector built on it. Deletions of files or data segments cannot be propagated to the summary vector, and the accuracy of the information in the summary vector declines. Because the accuracy of the summary vector directly affects the speed of DDFS, this accumulated loss of accuracy becomes a bottleneck for the whole DDFS as the system runs for a long time.
Agrawal et al. proposed a deduplication system based on the cuckoo filter. The cuckoo filter solves the Bloom filter's inability to delete elements without sacrificing space or performance. Its data structure consists of a plurality of buckets, each bucket contains a plurality of cells, and each cell can store one fingerprint. The cuckoo filter stores only the fingerprints of elements, reducing space overhead at the cost of accuracy. It uses a pair of hash functions to compute the two candidate bucket positions of an element. The two candidate buckets are linked through partial-key cuckoo hashing: the position of the other candidate bucket can be derived from the fingerprint of the element and the position of one candidate bucket.
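One common way to realize partial-key cuckoo hashing is to XOR the current bucket index with a hash of the fingerprint. The Python sketch below illustrates that construction. The bucket count (a power of two), the 8-bit fingerprint width, the use of SHA-1 as the underlying hash, and all function names are assumptions for this example and are not taken from the patent.

```python
import hashlib

NUM_BUCKETS = 1 << 10      # assumed: power of two, so XOR stays in range
FINGERPRINT_BITS = 8       # assumed fingerprint width


def _h(data: bytes) -> int:
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")


def fingerprint(item: bytes) -> int:
    # Map into 1 .. 2^bits - 1 so that 0 can mark an empty cell.
    return _h(b"fp" + item) % ((1 << FINGERPRINT_BITS) - 1) + 1


def first_bucket(item: bytes) -> int:
    return _h(item) % NUM_BUCKETS


def alt_bucket(bucket: int, fp: int) -> int:
    # The other candidate depends only on the current bucket index and
    # the fingerprint, never on the original item.
    return (bucket ^ _h(fp.to_bytes(2, "big"))) % NUM_BUCKETS
```

Applying `alt_bucket` twice returns the starting bucket, so a stored fingerprint can be moved between its two candidates without access to the original element.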
The authors also use the cuckoo filter to speed up lookups of data block identifiers, reducing the number of disk accesses during deduplication. When a data block is queried, its identifier is first looked up in the hash structure built by the cuckoo filter. If the identifier is found, the system tries to read the metadata of the data block from a cache and, on a cache miss, obtains the metadata directly from the metadata record. During this process the system updates the cache using the LRU algorithm. If the identifier is not found in the cuckoo filter, the content of the data block is written directly to the storage system and the data block is added to the metadata record. In this way the cuckoo filter accelerates the overall deduplication process, as shown in fig. 2.
The insertion algorithm of the cuckoo filter was not designed with its effect on the distribution of load within the filter in mind; it uses only a simple selection rule to pick one of the two candidate buckets. Such simple random insertion can concentrate the load in a few buckets. The unbalanced load reduces insertion efficiency, increases latency, and degrades the usability of the whole cuckoo filter, which in turn affects the efficiency of the deduplication system.
Disclosure of Invention
The invention aims to provide a two-stage cuckoo filter and a repeated data deleting method based on the two-stage cuckoo filter, and aims to solve the technical problem that the insertion efficiency of the cuckoo filter used in the repeated data deleting method is low.
The invention is realized as follows. A two-stage cuckoo filter is provided in which each cuckoo filter is composed of a plurality of buckets, each bucket is composed of a plurality of cells, and each cell can store one data fingerprint, so that the cells form a two-dimensional fingerprint matrix structure. Each element to be inserted is associated with two hash functions, which give the positions of the element's two candidate buckets; the fingerprint of the element can be stored only in these two candidate buckets. The cuckoo filter inserts elements using a two-stage insertion algorithm with two-stage relocation.
The invention further adopts the technical scheme that: the first stage applies while the overall load rate of the cuckoo filter is smaller than a preset threshold. In the first-stage insertion algorithm, the cuckoo filter first calculates the data fingerprint of the element, then calculates the positions of the two candidate buckets, and then calculates the load of each of the two candidate buckets.
The invention further adopts the technical scheme that: the loads of the two candidate buckets are compared with a set value. If the loads of both candidate buckets are smaller than the set value, the algorithm inserts into the candidate bucket with the lower load rate and reports that the insertion succeeded. If the load of only one candidate bucket is smaller than the set value, the algorithm inserts into that candidate bucket and reports that the insertion succeeded. If the load rates of both candidate buckets are greater than the set value, the algorithm randomly selects one candidate bucket, removes one of its fingerprints, which is called the victim, inserts the fingerprint of the element to be inserted into the position previously occupied by the victim, and performs the first-stage relocation operation.
The invention further adopts the technical scheme that: the first-stage relocation operation first judges whether the iteration number has reached the upper limit. If it has, the algorithm reports that the insertion failed. If it has not, the algorithm judges whether the iteration number is greater than or equal to a preset value. If the iteration number is smaller than the preset value, the algorithm calculates the other candidate bucket of the victim using the partial-key cuckoo hash function and judges the load of that candidate bucket. If the load is smaller than the set value, the algorithm inserts the fingerprint into that bucket and reports that the insertion succeeded; otherwise, the algorithm randomly selects a fingerprint in the current candidate bucket, removes it and makes it the new victim, inserts the fingerprint to be inserted into the original position of the new victim, adds one to the iteration number, and returns to the start of the first-stage relocation loop. If the iteration number is greater than or equal to the preset value, the algorithm calculates the other candidate bucket of the victim using the partial-key cuckoo hash function and judges whether that candidate bucket is full. If it is not full, the algorithm inserts the fingerprint to be inserted into the candidate bucket and reports that the insertion succeeded. If it is full, the algorithm randomly selects a fingerprint in the current candidate bucket, removes it and makes it the new victim, inserts the fingerprint to be inserted into the original position of the new victim, adds one to the iteration number, and returns to the start of the first-stage relocation loop.
The invention further adopts the technical scheme that: the second stage applies while the overall load rate of the cuckoo filter is greater than or equal to the preset threshold. In the second-stage insertion algorithm, the cuckoo filter first calculates the data fingerprint of the element, then calculates the positions of the two candidate buckets, and then calculates the load of each of the two candidate buckets.
The invention further adopts the technical scheme that: the algorithm judges the loads of the two candidate buckets of the element to be inserted. If neither candidate bucket is full, the algorithm inserts the fingerprint into the candidate bucket with the lower load and reports that the insertion succeeded. If one candidate bucket is full and the other is not, the algorithm inserts the fingerprint into the candidate bucket that is not full and reports that the insertion succeeded. If both candidate buckets are full, the algorithm randomly selects one candidate bucket, removes one of its fingerprints, which is called the victim, inserts the fingerprint of the element to be inserted into the vacated position, and performs the second-stage relocation operation.
The invention further adopts the technical scheme that: the second-stage relocation operation first judges whether the current iteration number has reached the upper limit. If it has, the algorithm reports that the insertion failed. If it has not, the algorithm calculates the position of the other candidate bucket of the victim using the partial-key cuckoo hash function and judges whether that candidate bucket is full. If it is not full, the algorithm inserts the element into the candidate bucket and reports that the insertion succeeded. If it is full, the algorithm randomly removes one fingerprint from it and makes it the new victim, inserts the fingerprint to be inserted into the original position of the new victim, adds one to the iteration number, and returns to the start of the second-stage relocation loop.
Another object of the present invention is to provide a two-stage cuckoo filter-based data de-duplication method, which includes the steps of, when a file stream enters a storage system:
S1, cutting a file into data blocks, and calculating the fingerprint of each data block;
S2, sending the identifier of the data block to the two-stage insertion cuckoo filter to query whether the identifier exists. If the identifier does not exist, the data block is judged to be a brand-new data block: the system stores the data block in the container area, forms a key-value pair from the fingerprint and the physical location of the data block and stores it in the fingerprint index area, and stores the fingerprint in the list area of the file. If the identifier exists, the proposed deduplication technique compares the fingerprint against the disk database. If the disk database does not contain the fingerprint, the data block is a brand-new data block and is kept: the data block is stored in the container area, the key-value pair formed from the fingerprint and the physical location of the data block is stored in the fingerprint index area, and the fingerprint is stored in the list area of the file. If the disk database contains the fingerprint, the data block has already been stored by the storage system and is not stored again.
The invention further adopts the technical scheme that: in the step S1, a rolling Rabin fingerprint blocking method is used to cut the file into data blocks.
The invention further adopts the technical scheme that: the fingerprint of each data block is calculated by the SHA1 secure hash function in said step S1.
The beneficial effects of the invention are as follows: the two-stage insertion algorithm relieves the problem of uneven data load, effectively reduces the insertion time delay of the cuckoo filter, and therefore increases the efficiency and throughput of the data de-duplication system.
Drawings
Fig. 1 is a schematic diagram of a DDFS in the prior art provided by an embodiment of the present invention.
Fig. 2 is a schematic diagram of a cuckoo filter accelerating deduplication process according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a two-stage inserted cuckoo filter in a data structure according to an embodiment of the present invention.
Fig. 4 is a flowchart of a dual-stage insertion algorithm provided in an embodiment of the present invention.
Fig. 5 is a block diagram of a method for deleting duplicate data according to an embodiment of the invention.
Fig. 6 is a flowchart of a method for deleting duplicate data according to an embodiment of the present invention.
Fig. 7 is a schematic diagram of a rolling Rabin fingerprinting method according to an embodiment of the present invention.
Detailed Description
This scheme addresses the low insertion efficiency of the cuckoo filter used in deduplication methods. It designs a two-stage insertion algorithm that uses different insertion strategies in two stages with different loads: in the first stage, where the load rate is low, it actively performs relocation to balance the load, paving the way for insertion in the second stage.
With the advent of the information age, data on networks has expanded rapidly, and more and more enterprises face the problem of data explosion, which restricts the expansion of their business. However, a high proportion of this vast amount of data is redundant duplicate data, which causes additional space overhead, bandwidth overhead and energy consumption. Deduplication techniques emerged in response.
Deduplication follows the steps of file stream chunking, data block fingerprint calculation, data block fingerprint comparison, data compression, and writing to disk. According to related research, the data block fingerprint comparison step is the key target for accelerating deduplication. Explosive growth of data has a serious negative impact on this step. As the total amount of data grows rapidly, so do the number of data block fingerprints and the storage space they require, until the memory of the deduplication system is overwhelmed and can no longer hold all data fingerprints. A substantial portion of the fingerprints is therefore moved to disk. Random access to a disk database is very slow compared with access to memory, so the additional disk I/O caused by fingerprint queries increases index lookup time and becomes the performance bottleneck of the whole deduplication process. Academia and industry have proposed various schemes to overcome this bottleneck and speed up deduplication. One feasible approach is to use a space-efficient summary data structure. Such a probabilistic data structure does not store the elements themselves but only some summary information about them, which greatly reduces space overhead and allows the structure to be held in memory, thereby reducing disk I/O and accelerating the whole deduplication process.
Some deduplication systems use the cuckoo filter as this data structure, but its insertion algorithm does not consider its effect on the load of the whole filter. This increases insertion delay, reduces throughput and usability, and so affects the efficiency of the whole deduplication system.
The invention provides a deduplication method for storage systems based on a two-stage cuckoo filter. It proposes a two-stage insertion algorithm for the cuckoo filter that effectively balances the load within the filter, spreading it across the buckets as evenly as possible. This increases the throughput of the cuckoo filter, reduces its insertion delay, and enables an efficient deduplication scheme for storage systems based on the two-stage cuckoo filter.
First, we will introduce the core of the efficient deduplication technique proposed by the present invention, the two-stage cuckoo filter.
When an element x is to be inserted into the cuckoo filter, the algorithm first calculates its fingerprint using the SHA1 algorithm and calculates the positions of its two candidate buckets using two hash functions. The algorithm then obtains the current load of the filter and decides which stage it is in: if the load factor is less than 0.45 it is in the first stage; otherwise it is in the second stage. Both stages of the insertion algorithm have an upper limit on the number of iterations, so that the insertion algorithm cannot enter an infinite loop.
The two-stage insertion cuckoo filter is not different in data structure from the common cuckoo filter, except for the insertion algorithm.
Each cuckoo filter is composed of a plurality of buckets (one row in the figure), each bucket is composed of a plurality of cells (one cell in the figure), and each cell can store one data fingerprint, so that a cuckoo filter forms a two-dimensional fingerprint matrix. Each element to be inserted is associated with two hash functions, which give the positions of its two candidate buckets; the fingerprint of the element can be stored only in these two candidate buckets.
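As an illustration of this layout, the following Python sketch models the filter as a matrix of buckets and cells and shows how the load rate of a single bucket and of the whole filter could be computed. The class and method names are hypothetical, and using `None` to mark an empty cell is an assumption of this example.

```python
from typing import List, Optional


class FingerprintTable:
    """Buckets x cells matrix; None marks an empty cell (assumed encoding)."""

    def __init__(self, num_buckets: int, cells_per_bucket: int) -> None:
        self.cells_per_bucket = cells_per_bucket
        self.buckets: List[List[Optional[int]]] = [
            [None] * cells_per_bucket for _ in range(num_buckets)
        ]

    def bucket_load(self, index: int) -> float:
        # Fraction of occupied cells in one bucket (one row of the matrix).
        used = sum(cell is not None for cell in self.buckets[index])
        return used / self.cells_per_bucket

    def filter_load(self) -> float:
        # Fraction of occupied cells in the whole filter.
        used = sum(cell is not None for b in self.buckets for cell in b)
        return used / (len(self.buckets) * self.cells_per_bucket)

    def put(self, index: int, fp: int) -> bool:
        # Store a fingerprint in the first free cell of the bucket, if any.
        bucket = self.buckets[index]
        for slot, cell in enumerate(bucket):
            if cell is None:
                bucket[slot] = fp
                return True
        return False
```

With four cells per bucket, for example, a bucket holding two fingerprints has a load rate of 0.5.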
The first-stage insertion algorithm first computes the data fingerprint of the element, then the positions of the two candidate buckets, and then the load of each candidate bucket. There are three possible states: a) the load rate of both candidate buckets is less than 0.5; b) the load rate of only one candidate bucket is less than 0.5; c) the load rate of both candidate buckets is greater than 0.5. In state a, the algorithm inserts into the candidate bucket with the lower load rate and returns success. In state b, the algorithm inserts into the candidate bucket whose load rate is less than 0.5 and returns success. In state c, the algorithm randomly selects a candidate bucket and removes one of its fingerprints, called the victim, inserts the fingerprint of the element to be inserted into the position previously occupied by the victim, and then performs the relocation operation.
In the first-stage relocation operation, the algorithm first judges whether the iteration number has reached the upper limit. If so, it returns insertion failure. If not, there are two cases: a) the iteration number is less than 3; b) the iteration number is greater than or equal to 3.
1) Case a
The other candidate bucket of the victim is computed using the partial-key cuckoo hash function, and the load of this candidate bucket is then examined. Two situations are possible: i) the load rate of the candidate bucket is less than 0.5; ii) the load rate of the candidate bucket is greater than or equal to 0.5. In situation i, the algorithm inserts the fingerprint to be inserted into the candidate bucket and returns success. In situation ii, the algorithm randomly selects a fingerprint in the current candidate bucket, removes it and makes it the new victim, inserts the fingerprint to be inserted into the original position of the new victim, adds one to the iteration number, and returns to the start of the relocation loop.
2) Case b
The other candidate bucket of the victim is computed using the partial-key cuckoo hash function, and the algorithm then checks whether this candidate bucket is full. Two situations are possible: i) the candidate bucket is not full; ii) the candidate bucket is full. In situation i, the algorithm inserts the fingerprint to be inserted into the candidate bucket and returns success. In situation ii, the algorithm randomly selects a fingerprint in the current candidate bucket, removes it and makes it the new victim, inserts the fingerprint to be inserted into the original position of the new victim, adds one to the iteration number, and returns to the start of the relocation loop.
The second-stage insertion algorithm obtains the loads of the two candidate buckets. There are three cases: a) neither candidate bucket is full; b) one candidate bucket is full and the other is not; c) both candidate buckets are full. In case a, the algorithm inserts the fingerprint into the candidate bucket with the lower load and returns success. In case b, the algorithm inserts the fingerprint into the candidate bucket that is not full and returns success. In case c, the algorithm randomly selects a candidate bucket and randomly removes one of its fingerprints, called the victim, inserts the fingerprint of the element to be inserted into the vacated position, and enters the relocation operation.
The second-stage relocation operation first judges whether the current iteration number has reached the upper limit; if it has, insertion failure is returned. If the upper limit has not been reached, the algorithm calculates the position of the other candidate bucket of the victim using the partial-key cuckoo hash function. Two cases are possible: a) the candidate bucket is not full; b) the candidate bucket is full. In case a, the algorithm inserts the element into the candidate bucket and returns success. In case b, the algorithm randomly removes one of the fingerprints in the bucket and makes it the new victim, inserts the fingerprint to be inserted into the original position of the new victim, adds one to the iteration number, and returns to the start of the relocation loop. A flowchart of the whole two-stage insertion algorithm is shown in fig. 4. The pseudocode of the whole two-stage insertion algorithm is as follows:
[Figures BDA0003193834030000121 and BDA0003193834030000131: pseudocode of the two-stage insertion algorithm; images not reproduced in this text]
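Since the pseudocode survives here only as figure references, the following Python sketch reconstructs the two-stage insertion from the prose description and claims 1 to 3. It is a reading of the text under stated assumptions, not the patent's own pseudocode: the parameter names `stage_threshold` (overall load factor separating the stages), `soft_limit` (the per-bucket "set value"), `soft_kicks` (the "preset value" of iterations) and `max_kicks` (the iteration upper limit), their default values, and the SHA-1-based hashing are all hypothetical.

```python
import hashlib
import random


class TwoStageCuckooFilter:
    def __init__(self, num_buckets=1024, bucket_size=4, stage_threshold=0.5,
                 soft_limit=2, soft_kicks=8, max_kicks=500, seed=0):
        assert num_buckets & (num_buckets - 1) == 0, "power of two assumed"
        assert 1 <= soft_limit <= bucket_size
        self.n, self.b = num_buckets, bucket_size
        self.stage_threshold = stage_threshold
        self.soft_limit, self.soft_kicks = soft_limit, soft_kicks
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]
        self.count = 0                      # fingerprints currently stored
        self.rng = random.Random(seed)

    @staticmethod
    def _h(data):
        return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

    def _fp(self, item):
        return self._h(b"fp" + item) % 65535 + 1      # 16-bit, non-zero

    def _alt(self, i, fp):
        # Partial-key cuckoo hashing: the other bucket depends on fp only.
        return (i ^ self._h(fp.to_bytes(2, "big"))) % self.n

    def load_factor(self):
        return self.count / (self.n * self.b)

    def _swap(self, i, fp):
        """Replace a random fingerprint in bucket i with fp; return the victim."""
        slot = self.rng.randrange(len(self.buckets[i]))
        victim, self.buckets[i][slot] = self.buckets[i][slot], fp
        return victim

    def insert(self, item):
        fp = self._fp(item)
        i1 = self._h(item) % self.n
        i2 = self._alt(i1, fp)
        stage1 = self.load_factor() < self.stage_threshold
        # Stage 1 accepts a bucket only below soft_limit; stage 2 only if not full.
        limit = self.soft_limit if stage1 else self.b
        usable = [i for i in (i1, i2) if len(self.buckets[i]) < limit]
        if usable:
            target = min(usable, key=lambda i: len(self.buckets[i]))
            self.buckets[target].append(fp)
            self.count += 1
            return True
        i = self.rng.choice((i1, i2))
        return self._relocate(i, self._swap(i, fp), stage1)

    def _relocate(self, i, victim, stage1):
        for kicks in range(self.max_kicks):
            i = self._alt(i, victim)
            # Early stage-1 iterations keep the stricter per-bucket limit.
            strict = stage1 and kicks < self.soft_kicks
            limit = self.soft_limit if strict else self.b
            if len(self.buckets[i]) < limit:
                self.buckets[i].append(victim)
                self.count += 1
                return True
            victim = self._swap(i, victim)
        return False    # upper limit reached; this sketch drops the last victim

    def contains(self, item):
        fp = self._fp(item)
        i1 = self._h(item) % self.n
        return fp in self.buckets[i1] or fp in self.buckets[self._alt(i1, fp)]
```

The single `limit` variable captures the difference between the stages: below the load threshold a bucket counts as unavailable once it holds `soft_limit` fingerprints, which triggers relocation earlier and spreads fingerprints more evenly; afterwards only a completely full bucket triggers relocation.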
The invention also discloses a data deduplication method based on the two-stage cuckoo filter. The deduplication method, which has the two-stage cuckoo filter at its core, is described next.
When a file stream enters the storage system, deduplication proceeds as follows: (1) The file is cut into data blocks using the rolling Rabin fingerprint method, and the fingerprint of each data block is then calculated with the SHA1 secure hash function. (2) The identifier (fingerprint) of the data block is sent to the two-stage insertion cuckoo filter for lookup, and the filter returns one of two results: a) the identifier is not present in the two-stage insertion cuckoo filter; b) the identifier is present in the two-stage insertion cuckoo filter. In case a, the data block can be judged to be a brand-new data block, so the system stores the data block in the container area, forms a key-value pair from the fingerprint and the physical location of the data block, stores the pair in the fingerprint index area, and finally stores the fingerprint in the list area of the file.
The rolling Rabin fingerprint method is an algorithm for dividing a file into variable-length data blocks: its input is a file data stream and its output is data blocks of variable length, as shown in fig. 6.
The algorithm comprises the following steps:
(1) Preset a sliding-window size and a fingerprint value.
(2) Set the beginning of the file as the first window position.
(3) Calculate the Rabin fingerprint (hash value) of the data in the window. If the Rabin fingerprint of the data in the current window equals the preset fingerprint value, jump to step 4; otherwise jump to step 5.
(4) Set the window boundary as a block boundary. Jump to step 5.
(5) If the file has remaining data, move the sliding window forward and jump to step 3; otherwise jump to step 6.
(6) End the algorithm and output the blocks according to the calculated boundaries.
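The six steps above can be sketched in Python as follows. To keep the example short, a simple polynomial rolling hash modulo a prime stands in for a true Rabin fingerprint, and the match test compares only the low bits of the hash (`mask`) with the preset value so that boundaries occur at a predictable average spacing. The window size, `mask`, `target`, `base`, and `mod` are illustrative assumptions, and practical chunkers usually also enforce minimum and maximum block sizes, which this sketch omits.

```python
def chunk_boundaries(data: bytes, window=16, mask=0x3F, target=0,
                     base=257, mod=(1 << 61) - 1):
    """Return the end offsets of content-defined blocks in `data`."""
    boundaries = []
    if len(data) >= window:
        top = pow(base, window - 1, mod)   # weight of the byte leaving the window
        h = 0
        for byte in data[:window]:         # step 2: first window at file start
            h = (h * base + byte) % mod
        pos = window                       # window covers data[pos-window:pos]
        while True:
            if (h & mask) == target:       # step 3: fingerprint matches preset
                boundaries.append(pos)     # step 4: window edge is a boundary
            if pos == len(data):           # step 5: no data left
                break
            # Slide one byte: drop data[pos-window], append data[pos].
            h = ((h - data[pos - window] * top) * base + data[pos]) % mod
            pos += 1
    if data and (not boundaries or boundaries[-1] != len(data)):
        boundaries.append(len(data))       # trailing block up to end of file
    return boundaries


def split(data: bytes, **kwargs):
    """Step 6: cut `data` into blocks at the computed boundaries."""
    blocks, start = [], 0
    for end in chunk_boundaries(data, **kwargs):
        blocks.append(data[start:end])
        start = end
    return blocks
```

Because each boundary depends only on the bytes inside the window, inserting or deleting data near the start of a file shifts later boundaries along with the content, so most blocks after the edit remain identical and can be deduplicated.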
Rabin fingerprint algorithm
The input of the Rabin fingerprint algorithm is binary data, and the output is a binary digest.
(1) A = [b_1, …, b_m] is the input binary string.
(2) Construct from A the corresponding polynomial A(t), whose highest-order term has degree m-1.
(3) Take a given polynomial P(t) whose highest-order term has degree k.
(4) Calculate the Rabin fingerprint f(A) = A(t) mod P(t).
(5) Output the Rabin fingerprint.
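A small Python sketch of steps (1) to (5), assuming the usual convention that the polynomials have coefficients in GF(2), so that subtraction is XOR. The bit-string input format and the example polynomial P(t) = t^3 + t + 1 are illustrative choices, not values from the patent.

```python
def rabin_fingerprint(bits: str, poly: int) -> int:
    """Return A(t) mod P(t) over GF(2).

    `bits` is the input string b_1..b_m (b_1 is the highest-order coefficient);
    `poly` encodes P(t) as an integer whose set bits are its coefficients,
    including the leading t^k term.
    """
    k = poly.bit_length() - 1      # degree of P(t)
    rem = 0
    for bit in bits:
        rem = (rem << 1) | int(bit)    # multiply by t and add the next coefficient
        if rem >> k:                   # degree reached k: subtract (XOR) P(t)
            rem ^= poly
    return rem


# A(t) = t^3 + t^2 + 1, P(t) = t^3 + t + 1; since t^3 = t + 1 (mod P),
# A(t) mod P(t) = t^2 + t, i.e. the bits 110.
print(rabin_fingerprint("1101", 0b1011))   # prints 6
```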
SHA1 function
The SHA1 algorithm is a secure hash algorithm whose input is binary data and whose output is a 160-bit SHA1 message digest. For any input shorter than 2^64 bits, the SHA1 algorithm generates a 160-bit message digest (signature), and the original input data cannot feasibly be recovered from the digest.
For plaintext of arbitrary length, the SHA1 function first pads it and divides it into blocks of 512 bits each, and then processes these plaintext blocks iteratively.
The digest generation process for each plaintext block is as follows:
(1) The 512-bit plaintext block is divided into 16 sub-blocks of 32 bits each.
(2) Five 32-bit chaining variables are allocated, denoted A, B, C, D, E.
(3) The 16 sub-blocks are expanded to 80 sub-blocks.
(4) The 80 sub-blocks are processed in 4 rounds of operations.
(5) The chaining variables are added to the values they held at the start of the block.
(6) The resulting chaining variables serve as the input for the next plaintext block, and the above steps are repeated.
(7) Finally, the data in the 5 chaining variables is the SHA1 digest.
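The seven steps can be followed in a compact Python implementation of the published SHA-1 algorithm, shown only to make the steps concrete; the step numbers in the comments refer to the list above. In practice the standard library's `hashlib.sha1` would be used instead, and the last line cross-checks the sketch against it.

```python
import hashlib
import struct


def _rotl(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF


def sha1(message: bytes) -> str:
    h = [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0]
    # Padding: a 0x80 byte, zero bytes, then the 64-bit big-endian bit length,
    # so that the total length is a multiple of 512 bits.
    bit_len = len(message) * 8
    message += b"\x80"
    message += b"\x00" * ((56 - len(message) % 64) % 64)
    message += struct.pack(">Q", bit_len)
    for off in range(0, len(message), 64):
        w = list(struct.unpack(">16I", message[off:off + 64]))    # step 1
        for t in range(16, 80):                                     # step 3
            w.append(_rotl(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1))
        a, b, c, d, e = h                                           # step 2
        for t in range(80):                                         # step 4
            if t < 20:
                f, k = (b & c) | (~b & d), 0x5A827999
            elif t < 40:
                f, k = b ^ c ^ d, 0x6ED9EBA1
            elif t < 60:
                f, k = (b & c) | (b & d) | (c & d), 0x8F1BBCDC
            else:
                f, k = b ^ c ^ d, 0xCA62C1D6
            a, b, c, d, e = ((_rotl(a, 5) + f + e + k + w[t]) & 0xFFFFFFFF,
                             a, _rotl(b, 30), c, d)
        # Steps 5 and 6: add to the chaining values and carry them forward.
        h = [(x + y) & 0xFFFFFFFF for x, y in zip(h, (a, b, c, d, e))]
    return "".join(f"{x:08x}" for x in h)                           # step 7


# Cross-check against the standard library implementation.
assert sha1(b"abc") == hashlib.sha1(b"abc").hexdigest()
```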
In case b, the proposed deduplication technique queries the disk database to compare fingerprints. If the fingerprint is not in the disk database, the data block is confirmed to be a brand-new data block: the data block is kept and stored in the container area, the fingerprint and the physical location of the data block are formed into a key-value pair and stored in the fingerprint index area, and finally the fingerprint is stored in the list area of the file. If the fingerprint is present in the disk database, the data block is confirmed to have already been stored by the storage system, so it is not stored again. The block diagram of the deduplication method is shown in fig. 5, and the flow chart of the deduplication method is shown in fig. 6. The pseudocode for the deduplication method is as follows:
[Figure BDA0003193834030000161: pseudocode of the deduplication method; image not reproduced in this text]
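Because the pseudocode is available here only as a figure reference, the following Python sketch restates the described deduplication flow. It is an illustration under assumptions rather than the patent's pseudocode: `PrefixFilter` is a hypothetical, deliberately lossy stand-in for the two-stage cuckoo filter, a dictionary plays the role of the on-disk fingerprint index, and all class and attribute names are invented for the example. Inserting new fingerprints into the filter and recording a duplicate block's fingerprint in the file list are also assumptions; the text does not state them explicitly.

```python
import hashlib


class PrefixFilter:
    """Hypothetical stand-in for the two-stage cuckoo filter. It keeps only the
    first byte of each fingerprint, so false positives are frequent but there
    are no false negatives."""

    def __init__(self):
        self._seen = set()

    def insert(self, fp: bytes):
        self._seen.add(fp[:1])

    def contains(self, fp: bytes) -> bool:
        return fp[:1] in self._seen


class DedupStore:
    def __init__(self):
        self.filter = PrefixFilter()
        self.containers = []     # container area: unique data blocks
        self.index = {}          # fingerprint index area: fingerprint -> position
        self.recipes = {}        # file list area: file name -> fingerprints
        self.index_lookups = 0   # how often the (slow) index was consulted

    def put_file(self, name, blocks):
        recipe = []
        for block in blocks:
            fp = hashlib.sha1(block).digest()
            is_new = True                        # case a unless proven otherwise
            if self.filter.contains(fp):         # case b: possible duplicate
                self.index_lookups += 1
                is_new = fp not in self.index    # rule out a false positive
            if is_new:
                self.containers.append(block)
                self.index[fp] = len(self.containers) - 1
                self.filter.insert(fp)           # assumption: filter is updated
            recipe.append(fp)                    # assumption: always recorded
        self.recipes[name] = recipe

    def get_file(self, name) -> bytes:
        """Rebuild a file from its fingerprint list."""
        return b"".join(self.containers[self.index[fp]]
                        for fp in self.recipes[name])
```

The filter answers most lookups for new blocks without touching the index, and the index is consulted only when the filter reports a possible match, which is where a filter with a low false-positive rate and fast insertion saves time.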
This scheme provides an improved insertion algorithm for the cuckoo filter, namely the two-stage insertion algorithm, which addresses the problem of uneven data load. Experimental evaluation also shows that the scheme effectively reduces the insertion latency of the cuckoo filter, thereby increasing the efficiency and throughput of the deduplication system.
To address the uneven data distribution of the cuckoo filter, the summary data structure used in existing deduplication schemes, the invention provides a two-stage insertion algorithm. The first stage sets a stricter relocation condition and balances the data distribution through a more aggressive relocation strategy, so that the second stage, which accounts for most of the insertion latency, starts from a better data distribution. This reduces the insertion latency of the summary structure and speeds up the whole deduplication flow.
The invention provides a block-level data deduplication scheme based on the two-stage insertion cuckoo filter, which effectively reduces the latency of the whole deduplication algorithm by virtue of the insertion performance of the two-stage insertion cuckoo filter.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims (6)

1. A two-stage cuckoo filter, characterized in that each cuckoo filter in the two-stage cuckoo filter consists of a plurality of buckets, each bucket consists of a plurality of units, each unit can be used to store a data fingerprint, and the plurality of units form a two-dimensional fingerprint matrix structure; each element to be inserted is associated with two hash functions; the insertion algorithm is divided into two stages, and the uniformity of the data distribution is increased through active relocation in the first stage, forming the two-stage insertion algorithm for inserted elements;
the first stage applies when the overall load rate of the cuckoo filter is less than a preset threshold; when inserting an element in the first stage of the insertion algorithm, the cuckoo filter first calculates the data fingerprint of the element, then calculates the positions of the two candidate buckets, and then calculates the loads of the two candidate buckets,
and judges whether the loads of the two candidate buckets are greater or less than a set value: if the loads of both candidate buckets are less than the set value, the algorithm selects the candidate bucket with the smallest load rate for insertion and feeds back that the insertion is successful; if the load of only one candidate bucket is less than the set value, the algorithm selects the candidate bucket whose load rate is less than the set value for insertion and feeds back that the insertion is successful; if the load rates of both candidate buckets are greater than the set value, the algorithm randomly selects one candidate bucket, removes one fingerprint from it, names that fingerprint the victim, inserts the fingerprint of the element to be inserted into the victim's former position, and performs the first-stage relocation operation;
the second stage applies when the overall load rate of the cuckoo filter is greater than or equal to the preset threshold; when inserting an element in the second stage, the cuckoo filter first calculates the data fingerprint of the element, then calculates the positions of the two candidate buckets, and then calculates the loads of the two candidate buckets,
and judges whether the two candidate buckets are full: if neither candidate bucket is full, the algorithm selects the candidate bucket with the lowest load and inserts the fingerprint there; if one candidate bucket is full and the other is not, the algorithm selects the candidate bucket that is not full and inserts the fingerprint there; if both candidate buckets are full, the algorithm randomly selects one candidate bucket, randomly removes one fingerprint from it, names that fingerprint the victim, then inserts the fingerprint of the element to be inserted into the vacated position, and performs the second-stage relocation operation.
2. The two-stage cuckoo filter according to claim 1, wherein the first-stage relocation operation of the insertion algorithm judges whether the iteration count has reached an upper limit; if the iteration count has reached the upper limit, insertion failure is fed back; if the iteration count has not reached the upper limit, it judges whether the iteration count is greater than or equal to a preset value; if the iteration count is less than the preset value, another candidate bucket of the victim is calculated using the partial-key cuckoo hash function and the load of that candidate bucket is judged: if the load of the candidate bucket is less than the set value, the algorithm inserts the fingerprint into the bucket and feeds back that the insertion is successful; otherwise, the algorithm randomly selects a fingerprint in the current candidate bucket, removes it, updates it to be the victim, inserts the fingerprint to be inserted into the original position of the victim, adds one to the iteration count, and returns to the first-stage relocation loop; if the iteration count is greater than or equal to the preset value, another candidate bucket of the victim is calculated using the partial-key cuckoo hash function and it is judged whether that candidate bucket is full: if the candidate bucket is not full, the algorithm inserts the fingerprint of the element to be inserted into the candidate bucket and feeds back that the insertion is successful; if the candidate bucket is full, the algorithm randomly selects a fingerprint in the current candidate bucket, removes it, updates it to be the victim, inserts the fingerprint to be inserted into the original position of the victim, adds one to the iteration count, and returns to the first-stage relocation loop.
3. The two-stage cuckoo filter according to claim 2, wherein the second-stage relocation operation judges whether the current iteration count has reached an upper limit; if the current iteration count has reached the upper limit, insertion failure is fed back; if the current iteration count has not reached the upper limit, the algorithm calculates another candidate bucket position of the victim through the partial-key cuckoo hash function and judges whether that candidate bucket is full; if the candidate bucket is not full, the algorithm inserts the element into the candidate bucket and feeds back that the insertion is successful; if the candidate bucket is full, the algorithm randomly removes one of the fingerprints, updates it to be the victim, then inserts the fingerprint to be inserted into the original position of the victim, adds one to the iteration count, and returns to the second-stage relocation loop.
4. A deduplication method based on the two-stage cuckoo filter according to any one of claims 1-3, characterized in that, when a file stream enters a storage system, the deduplication method based on the two-stage cuckoo filter comprises the following steps:
S1, cutting the file into data blocks, and calculating the fingerprint of each data block;
S2, sending the fingerprint of each data block to the two-stage insertion cuckoo filter for lookup and judging whether the fingerprint exists; if the fingerprint does not exist, the data block is judged to be a brand-new data block, the system stores the data block in the container area, forms a key-value pair from the fingerprint and the physical location of the data block, stores the key-value pair in the fingerprint index area, and stores the fingerprint in the list area of the file; if the fingerprint exists, the deduplication method queries the disk database to compare fingerprints; if the fingerprint is not in the disk database, the data block is confirmed to be a brand-new data block, the data block is kept and stored in the container area, the fingerprint and the physical location of the data block are formed into a key-value pair and stored in the fingerprint index area, and the fingerprint is stored in the list area of the file; if the fingerprint is in the disk database, the data block is confirmed to have already been stored by the storage system, and it is not stored again.
5. The deduplication method based on the two-stage cuckoo filter according to claim 4, wherein step S1 uses the rolling Rabin fingerprint chunking method to cut the file into data blocks.
6. The deduplication method based on the two-stage cuckoo filter according to claim 4 or 5, wherein the fingerprint of each data block is calculated with the SHA1 secure hash function in step S1.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110885281.3A CN113535706B (en) 2021-08-03 2021-08-03 Two-stage cuckoo filter and repeated data deleting method based on two-stage cuckoo filter

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110885281.3A CN113535706B (en) 2021-08-03 2021-08-03 Two-stage cuckoo filter and repeated data deleting method based on two-stage cuckoo filter

Publications (2)

Publication Number Publication Date
CN113535706A CN113535706A (en) 2021-10-22
CN113535706B true CN113535706B (en) 2023-05-23

Family

ID=78090176

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110885281.3A Active CN113535706B (en) 2021-08-03 2021-08-03 Two-stage cuckoo filter and repeated data deleting method based on two-stage cuckoo filter

Country Status (1)

Country Link
CN (1) CN113535706B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11416499B1 (en) * 2021-10-12 2022-08-16 National University Of Defense Technology Vertical cuckoo filters
CN114268501B (en) * 2021-12-24 2024-02-23 深信服科技股份有限公司 Data processing method, firewall generating method, computing device and storage medium
CN114844638B (en) * 2022-07-03 2022-09-20 浙江九州量子信息技术股份有限公司 Big data volume secret key duplication removing method and system based on cuckoo filter
CN115052264B (en) * 2022-08-11 2022-11-22 中国铁道科学研究院集团有限公司电子计算技术研究所 Railway passenger station wireless network communication method and device based on multipath screening
CN115510092B (en) * 2022-09-27 2023-05-12 青海师范大学 Approximate member query optimization method based on cuckoo filter
CN115643301B (en) * 2022-10-24 2024-04-09 湖南大学 DDS automatic discovery method and medium based on compressed cuckoo filter
CN116467307B (en) * 2023-03-29 2024-02-23 济南大学 Design method and system for cuckoo filter for reducing false positive rate
CN116701440B (en) * 2023-06-15 2024-04-16 泉城省实验室 Cuckoo filter and data insertion, query and deletion method

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102156727A (en) * 2011-04-01 2011-08-17 华中科技大学 Method for deleting repeated data by using double-fingerprint hash check
CN109815234A (en) * 2018-12-29 2019-05-28 杭州中科先进技术研究院有限公司 A kind of multiple cuckoo filter under streaming computing model
CN110222088A (en) * 2019-05-20 2019-09-10 华中科技大学 Data approximation set representation method and system based on insertion position selection
CN111552693A (en) * 2020-04-30 2020-08-18 南方科技大学 Tag cuckoo filter
CN111552692A (en) * 2020-04-30 2020-08-18 南方科技大学 Plus-minus cuckoo filter
CN111858651A (en) * 2020-09-22 2020-10-30 中国人民解放军国防科技大学 Data processing method and data processing device
CN112148928A (en) * 2020-09-18 2020-12-29 鹏城实验室 Cuckoo filter based on fingerprint family

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11461027B2 (en) * 2017-07-18 2022-10-04 Vmware, Inc. Deduplication-aware load balancing in distributed storage systems

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
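Cuckoo Filter: Practically better than bloom; Bin Fan et al.; Proceedings of the 10th ACM international on conference on emerging networking experiments and technologies; 20141204; pp. 77-88 *
Research on an Efficient Cuckoo Filter Based on Load Balancing (基于负载均衡的高效布谷鸟过滤器研究); Wang Feiyue (王飞越); China Master's Theses Full-text Database, Information Science and Technology (中国优秀硕士学位论文全文数据库 信息科技辑); 20200315; I138-241 *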

Also Published As

Publication number Publication date
CN113535706A (en) 2021-10-22

Similar Documents

Publication Publication Date Title
CN113535706B (en) Two-stage cuckoo filter and repeated data deleting method based on two-stage cuckoo filter
US9223794B2 (en) Method and apparatus for content-aware and adaptive deduplication
US9454318B2 (en) Efficient data storage system
CN103177111B (en) Data deduplication system and delet method thereof
US8051252B2 (en) Method and apparatus for detecting the presence of subblocks in a reduced-redundancy storage system
US8255398B2 (en) Compression of sorted value indexes using common prefixes
US8543555B2 (en) Dictionary for data deduplication
EP1866776B1 (en) Method for detecting the presence of subblocks in a reduced-redundancy storage system
CN107046812B (en) Data storage method and device
CN108089816B (en) Query type repeated data deleting method and device based on load balancing
CN108090125B (en) Non-query type repeated data deleting method and device
WO2017020576A1 (en) Method and apparatus for file compaction in key-value storage system
CN108415671B (en) Method and system for deleting repeated data facing green cloud computing
CN103227818A (en) Terminal, server, file transferring method, file storage management system and file storage management method
CN108804661B (en) Fuzzy clustering-based repeated data deleting method in cloud storage system
CN106066818B (en) A kind of data layout method improving data de-duplication standby system restorability
Viji et al. Comparative analysis for content defined chunking algorithms in data deduplication
CN112162973A (en) Fingerprint collision avoidance, deduplication and recovery method, storage medium and deduplication system
Conde-Canencia et al. Deduplication algorithms and models for efficient data storage
CN111831480B (en) Layered coding method and device based on deduplication system and deduplication system
CN111352587A (en) Data packing method and device
KR102026125B1 (en) Lightweight complexity based packet-level deduplication apparatus and method, storage media storing the same
Liu et al. Tscf: An efficient two-stage cuckoo filter for data deduplication
CN111177092A (en) Deduplication method and device based on erasure codes
CN113625961B (en) Self-adaptive threshold value repeated data deleting method based on greedy selection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230404

Address after: Room 20059, 2nd Floor, Building 5, Phase 1, Guangdong Xiaxi International Rubber and Plastic City, Nanping West Road, Guicheng Street, Nanhai District, Foshan City, Guangdong Province, 528200 (Residence Declaration)

Applicant after: Foshan saisichen Technology Co.,Ltd.

Applicant after: SHENZHEN CESTBON TECHNOLOGY Co.,Ltd.

Address before: 400000 building 10, No. 1, Jiangxia Road, Changshengqiao Town, economic development zone, Nan'an District, Chongqing

Applicant before: Chongqing saiyushen Technology Co.,Ltd.

Applicant before: Foshan saisichen Technology Co.,Ltd.

Applicant before: SHENZHEN CESTBON TECHNOLOGY Co.,Ltd.

GR01 Patent grant