Background
With the advent of the information age, data on the internet has grown explosively: according to IDC, the global data volume reached 33 ZB in 2018 and is predicted to reach 175 ZB by 2025. Meanwhile, to relieve the cost pressure of building and maintaining local storage, more and more individuals, companies, and organizations migrate their data storage to cloud service providers. However, this explosive growth poses serious challenges to the storage capacity, network bandwidth, and other resources of cloud service providers. To address the problem of data explosion, redundant data elimination technology was proposed; over many years of development it has taken three main forms: lossless data compression, lossy data compression, and data deduplication.
In the early stage of redundant data elimination, coding-based methods were widely applied and researched. Huffman coding builds a code with the shortest average length according to the probability of each character's occurrence. The later LZ family of encodings builds a dictionary over the data; if both sender and receiver hold the same dictionary, the actual data can be replaced by dictionary indices, thereby compressing the amount of data actually transferred.
For multimedia data, lossy compression techniques are widely used. They raise the compression ratio at the cost of discarding unimportant information: for example, for music with a very complete frequency spectrum, cutting off the spectrum above 20 kHz (the upper limit of human hearing) does not affect perceived quality, which is the principle behind MP3 lossy compression. For pictures, JPEG and PNG are two of the more common compression formats.
After the turn of the century, data deduplication technology formally appeared. It supports multi-granularity deduplication, has better scalability, and can be extended from a local system to a large-scale distributed storage system. Deduplication means that duplicate data is detected and only a single instance of that data is stored in a collection of digital files, thereby eliminating redundancy. Hash-based deduplication, which has low implementation cost and excellent deduplication effect, is widely applied in various storage systems. The identification (fingerprint) of each data block or file is calculated and stored in a database. During deduplication, the system calculates the identification of the data block to be deduplicated and compares it with the identifications in the database; if a match is found, an identical data block has already been stored, so the system abandons the new copy but retains the index information between the file and the unique data block, ensuring that the file can later be reconstructed normally.
With the rapid expansion of data volume, the space required to store data-block identifications also grows, until the main memory of the storage system can no longer hold them and slow external devices such as disks must take over the task. This introduces a disk-lookup bottleneck that limits the efficiency of the whole deduplication system and increases its response time. More and more deduplication systems therefore use additional techniques to mitigate the performance degradation caused by the disk bottleneck.
The DDFS system proposed by Data Domain uses a classical approximate set-membership data structure, the Bloom filter, to avoid the disk bottleneck. A Bloom filter trades a small amount of accuracy for a large saving in memory, enabling set-membership queries (whether an element exists in a set) to be answered with minimal space overhead. The Bloom filter does not store the original data itself but only summary information about it: its main data structure is a bit vector, together with several hash functions that map each element to bits in the vector.
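As a minimal illustration of the structure just described, the following Python sketch builds a bit vector and derives k bit positions per element from salted hashes; the vector size, number of hash functions, and the SHA-1-based hashing are illustrative assumptions, not DDFS's actual parameters:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: a bit vector plus k hash functions."""

    def __init__(self, m_bits=1024, k_hashes=3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits)  # one byte per bit, for simplicity

    def _positions(self, item):
        # Derive k bit positions from salted SHA-1 digests of the item.
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def might_contain(self, item):
        # False means "definitely absent"; True means "probably present".
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
bf.add("chunk-fingerprint-1")
print(bf.might_contain("chunk-fingerprint-1"))  # True
print(bf.might_contain("chunk-fingerprint-2"))  # almost certainly False
```

The one-sided error is visible here: an inserted element always answers True, while an absent element answers False except with small false-positive probability.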
DDFS uses a summary vector, implemented as a Bloom filter, to improve deduplication performance. The summary vector is stored in the main memory of DDFS and represents a summary of the data-segment identifications in the file system. When the deduplication system needs to query whether an index value exists, it first consults the summary vector; if the summary vector reports that the value is absent, DDFS treats the data segment as new and no further lookup is needed. If the summary vector reports that the value exists, then it exists with high probability but is not guaranteed, and DDFS confirms it with a database lookup. When the system shuts down, it persists the summary vector to disk so that the information survives power loss. When the system restarts, DDFS restores the latest summary vector from disk and then inserts any data written after the checkpoint, as shown in fig. 1.
The Bloom filter does not support deletion, so a summary vector implemented on a Bloom filter cannot reflect the deletion of a file or data segment, and the accuracy of the information in the summary vector degrades over time. Since this accuracy is directly related to the running speed of DDFS, the accumulated degradation becomes a bottleneck of the whole system as it runs for a long time.
Agrawal et al. describe a data deduplication system based on a cuckoo filter. The cuckoo filter solves the problem that a Bloom filter cannot delete elements, without sacrificing space or performance. Its data structure consists of a number of buckets, each bucket comprising several cells, and each cell can store one fingerprint. Like the Bloom filter, the cuckoo filter stores only fingerprint information of the elements, reducing space overhead at the cost of some accuracy. The cuckoo filter employs two hash functions that compute the two candidate bucket positions for an element; the two positions are related through partial-key cuckoo hashing, so that the location of one candidate bucket can be derived from the element's fingerprint and the location of the other candidate bucket.
The authors also use the cuckoo filter to accelerate data-block identification queries and reduce disk accesses during deduplication. When querying whether a data block exists, the block is first looked up in the hash structure built by the cuckoo filter. If the identification is found, the system tries to read the block's metadata from the cache, falling back to the metadata record otherwise; the cache is updated via an LRU policy. If the identification is not found in the cuckoo filter, the block's content is written directly to the storage system and the block is added to the metadata record. In this way, cuckoo filters speed up the entire deduplication process, as shown in fig. 2.
However, the original insertion algorithm of the cuckoo filter did not take into account its influence on the distribution of load within the filter; it simply selects one of the two candidate buckets at random. This simple random insertion can concentrate the load in a few buckets, and the resulting imbalance reduces insertion efficiency, increases latency, and impairs the usability of the whole cuckoo filter, which in turn limits the efficiency of the deduplication system.
Disclosure of Invention
The invention aims to provide a two-stage cuckoo filter and a data deduplication method based on it, so as to solve the technical problem that the cuckoo filter used in existing deduplication methods has low insertion efficiency.
The invention is realized in the following way: the two-stage cuckoo filter consists of a number of buckets, each bucket consists of a number of cells, and each cell can store one data fingerprint, so that the cells form a two-dimensional fingerprint matrix. Each element to be inserted is associated with two hash functions, which yield the positions of the element's two candidate buckets; the element's fingerprint may only be stored in these two candidate buckets. On this structure the filter runs a two-stage insertion algorithm with two-stage relocation.
The further technical scheme of the invention is as follows: when the overall load rate of the cuckoo filter is smaller than a preset threshold value, the filter is in the first stage. In the first-stage insertion algorithm, the filter first calculates the data fingerprint of the element, then the positions of the two candidate buckets, and then the loads of the two candidate buckets.
The further technical scheme of the invention is as follows: after calculating the loads of the two candidate buckets, the algorithm judges whether each load exceeds a set value. If both loads are smaller than the set value, the algorithm inserts into the candidate bucket with the lower load rate and reports success; if only one candidate bucket's load is smaller than the set value, the algorithm inserts into that bucket and reports success; if both load rates exceed the set value, the algorithm randomly selects one candidate bucket, removes one fingerprint from it (called the victim), inserts the fingerprint of the element to be inserted into the victim's former position, and performs the first-stage relocation operation.
The further technical scheme of the invention is as follows: the first-stage relocation operation first judges whether the iteration count has reached its upper limit; if so, the insertion fails. Otherwise it judges whether the iteration count has reached a preset value. If the count is below the preset value, the algorithm computes the victim's other candidate bucket with the partial-key cuckoo hash function and checks that bucket's load: if the load is below the set value, the algorithm inserts the fingerprint there and reports success; otherwise the algorithm randomly selects a fingerprint in the current candidate bucket, removes it, makes it the new victim, inserts the pending fingerprint into the victim's former position, increments the iteration count, and loops back to the first-stage relocation. If the count has reached the preset value, the algorithm likewise computes the victim's other candidate bucket with the partial-key cuckoo hash function but instead judges whether that bucket is full: if it is not full, the fingerprint of the element to be inserted is placed there and success is reported; if it is full, the algorithm randomly selects a fingerprint in the current candidate bucket, removes it, makes it the new victim, inserts the pending fingerprint into the victim's former position, increments the iteration count, and loops back to the first-stage relocation.
The further technical scheme of the invention is as follows: when the overall load rate of the cuckoo filter is greater than or equal to the preset threshold value, the filter is in the second stage. In the second-stage insertion algorithm, the filter first calculates the data fingerprint of the element, then the positions of the two candidate buckets, and then the loads of the two candidate buckets.
The further technical scheme of the invention is as follows: the second-stage algorithm judges whether the two candidate buckets are full. If neither is full, the algorithm inserts the fingerprint into the candidate bucket with the lower load and reports success; if exactly one is full, the algorithm inserts the fingerprint into the bucket that is not full and reports success; if both are full, the algorithm randomly selects one candidate bucket, randomly removes one fingerprint from it (called the victim), inserts the fingerprint of the element to be inserted into that position, and performs the second-stage relocation operation.
The further technical scheme of the invention is as follows: the second-stage relocation operation judges whether the current iteration count has reached its upper limit; if so, the insertion fails. Otherwise the algorithm computes the position of the victim's other candidate bucket with the partial-key cuckoo hash function and judges whether that bucket is full. If it is not full, the algorithm inserts the element there and reports success; if it is full, the algorithm randomly removes one fingerprint, makes it the new victim, inserts the pending fingerprint into the victim's former position, increments the iteration count, and loops back to the second-stage relocation.
Another object of the present invention is to provide a data deduplication method based on the two-stage cuckoo filter. When a file stream enters a storage system, the method comprises the following steps:
S1, cutting the file into data blocks and calculating the fingerprint of each data block;
S2, sending the fingerprint of the data block into the two-stage cuckoo filter to query whether it exists. If it does not exist, the data block is judged to be a brand-new block: the system stores the block in the container area, forms a key-value pair from the fingerprint and the block's physical position and stores it in the fingerprint index area, and stores the fingerprint in the file's list area. If it exists, the proposed deduplication technique enters the disk database to compare fingerprints: if the fingerprint is absent from the disk database, the block is proved to be brand new and is retained, the block is stored in the container area, the fingerprint and physical position form a key-value pair stored in the fingerprint index area, and the fingerprint is stored in the file's list area; if the fingerprint exists in the disk database, the block is proved to have already been stored and the system abandons storing it.
The further technical scheme of the invention is as follows: in step S1, the file is cut into data blocks by using a rolling Rabin fingerprint blocking method.
The further technical scheme of the invention is as follows: the fingerprint of each data block is computed in step S1 by the SHA1 secure hash function.
The invention has the following beneficial effects: the two-stage insertion algorithm relieves the problem of uneven data load and effectively reduces the insertion delay of the cuckoo filter, thereby increasing the efficiency and throughput of the data deduplication system.
Detailed Description
According to the scheme, a two-stage insertion algorithm is designed that uses different insertion strategies at two different load levels: in the low-load first stage, relocation is performed proactively to balance the load, laying the groundwork for insertions in the second stage.
With the advent of the information age, data on the network expands rapidly, and more and more enterprises face the problem of data explosion, which restricts the expansion of their business. Among this massive data, however, a high proportion is redundant duplicate data, which causes additional space overhead, bandwidth overhead, and energy consumption; data deduplication technology was developed in response.
Data deduplication follows the steps of file-stream chunking, data-block fingerprint calculation, fingerprint comparison, data compression, and storage to disk. According to related research, the fingerprint-comparison step is the key target for accelerating deduplication, and the burst growth of data affects it most severely. Once the total amount of data increases rapidly, the total number of data-block fingerprints and the space required to store them also grow rapidly, overloading the memory of the deduplication system so that not all fingerprints can be held there. A significant portion of the fingerprints must therefore be moved to disk. Because the random access speed of a disk database is far lower than that of memory, the extra disk IO caused by fingerprint queries increases index-lookup time and becomes the performance bottleneck of the whole deduplication process. Academia and industry have proposed various schemes to overcome this bottleneck. One possible solution is to use a highly space-efficient summary data structure: such a probabilistic structure stores not the elements themselves but summary information about them, greatly reducing space overhead so that it fits in memory, thereby reducing disk IO and accelerating the whole deduplication process.
The cuckoo filter, as a newer data structure, has been adopted by some deduplication systems, but its insertion algorithm does not consider the load distribution across the filter, which increases insertion delay, reduces throughput and availability, and in turn limits the efficiency of the whole deduplication system.
The invention discloses a method for deduplicating data in a storage system based on a two-stage cuckoo filter, and proposes a two-stage insertion algorithm for the cuckoo filter that effectively balances the load within the filter, spreads the load across the buckets as evenly as possible, increases the throughput of the cuckoo filter, and reduces its insertion delay, thereby realizing an efficient storage-system deduplication scheme based on the two-stage cuckoo filter.
First, we introduce the core of the efficient deduplication technology proposed by the present invention: the two-stage cuckoo filter.
When an element x needs to be inserted into the cuckoo filter, the algorithm first computes its fingerprint with the SHA1 algorithm and computes the locations of its two candidate buckets with two hash functions. The algorithm then reads the current load of the filter to determine the stage it is in: the first stage if the load factor is less than 0.45, the second stage otherwise. Both stages of the insertion algorithm enforce an upper iteration limit to keep the cuckoo filter's insertion from entering an infinite loop.
In its data structure, the two-stage-insertion cuckoo filter is identical to an ordinary cuckoo filter; the difference lies mainly in the insertion algorithm.
Each cuckoo filter consists of a number of buckets (one row in the figure), each bucket consists of a number of cells (one grid in the figure), each cell can be used to store a data fingerprint, so that a cuckoo filter presents a structure of a two-dimensional fingerprint matrix. Meanwhile, each element to be inserted is associated with two hash functions, and through the two hash functions, the element can acquire the positions of two candidate buckets of the element, and the fingerprints of the element can only be stored in the two candidate buckets.
In the first phase, the insertion algorithm first computes the data fingerprint of the element, then the positions of the two candidate buckets, and then the loads of the two candidate buckets, which can be in one of three states: a) the loads of both candidate buckets are less than 0.5; b) exactly one candidate bucket has a load less than 0.5; c) the load rates of both candidate buckets are 0.5 or more. In state a, the algorithm inserts into the candidate bucket with the lower load rate and returns success. In state b, the algorithm inserts into the candidate bucket whose load rate is less than 0.5 and returns success. In state c, the algorithm randomly selects a candidate bucket and removes one of its fingerprints, called the victim, then inserts the fingerprint of the element to be inserted into the victim's former position, and proceeds to the relocation operation.
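The three states of the first stage can be summarized in a small helper; this is a sketch, with the 0.5 threshold taken from the set value above:

```python
def choose_bucket_stage1(load1: float, load2: float, threshold: float = 0.5):
    """First-stage bucket choice: return 0 or 1 for the candidate bucket
    to insert into, or None when both buckets are at or above the
    threshold (state c: a victim must be evicted and relocation begins)."""
    loads = [load1, load2]
    below = [i for i, ld in enumerate(loads) if ld < threshold]
    if len(below) == 2:      # state a: pick the less loaded bucket
        return loads.index(min(loads))
    if len(below) == 1:      # state b: pick the bucket below the threshold
        return below[0]
    return None              # state c: relocation needed

print(choose_bucket_stage1(0.25, 0.25))  # 0 (tie broken toward the first)
print(choose_bucket_stage1(0.75, 0.25))  # 1
print(choose_bucket_stage1(0.75, 0.50))  # None
```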
In the first-stage relocation operation, the algorithm first judges whether the iteration count has reached its upper limit; if so, the insertion fails. If not, two cases exist: case a, the iteration count is less than 3; case b, the iteration count is greater than or equal to 3.
1) Case a
Another candidate bucket for the victim is computed using the partial-key cuckoo hash function, and the load of this candidate bucket is then examined; two conditions may exist: i) the load rate of the candidate bucket is less than 0.5; ii) the load rate of the candidate bucket is 0.5 or more. In condition i, the algorithm inserts the fingerprint of the element to be inserted into this candidate bucket and returns success. In condition ii, the algorithm randomly selects a fingerprint in the current candidate bucket, removes it, makes it the new victim, inserts the pending fingerprint into the victim's former position, increments the iteration count, and returns to the relocation loop.
2) Case b
Another candidate bucket for the victim is computed using the partial-key cuckoo hash function, and the algorithm then judges whether this candidate bucket is full; two situations may exist: i) the candidate bucket is not full; ii) the candidate bucket is full. In situation i, the algorithm inserts the fingerprint of the element to be inserted into this candidate bucket and returns success. In situation ii, the algorithm randomly selects a fingerprint in the current candidate bucket, removes it, makes it the new victim, inserts the pending fingerprint into the victim's former position, increments the iteration count, and returns to the relocation loop.
In the second phase, the insertion algorithm obtains the loads of the two candidate buckets; three cases exist: a) neither candidate bucket is full; b) one candidate bucket is not full and the other is full; c) both candidate buckets are full. In case a, the algorithm inserts the fingerprint into the candidate bucket with the lower load and returns success. In case b, the algorithm inserts the fingerprint into the bucket that is not full and returns success. In case c, the algorithm randomly selects a candidate bucket, randomly removes one of its fingerprints (called the victim), inserts the fingerprint of the element to be inserted into that position, and enters the relocation operation.
In the second-stage relocation operation, the algorithm first judges whether the current iteration count has reached its upper limit; if so, the insertion fails. If not, the algorithm computes the position of the victim's other candidate bucket with the partial-key cuckoo hash function; two cases may exist: a) the candidate bucket is not full; b) the candidate bucket is full. In case a, the algorithm inserts the element into this candidate bucket and returns success. In case b, the algorithm randomly removes one of the bucket's fingerprints, makes it the new victim, inserts the pending fingerprint into the victim's former position, increments the iteration count, and returns to the relocation loop. The overall flow chart of the two-stage insertion algorithm is shown in fig. 4. The pseudo code for the two-stage insertion algorithm as a whole is as follows:
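Since the original pseudo code is not reproduced in this text, the following self-contained Python sketch implements the two-stage insertion as described above; the bucket count, iteration limit, seed, and the use of Python's built-in hash for the partial-key relation are illustrative assumptions:

```python
import random

BUCKET_SIZE = 4        # cells per bucket
MAX_ITER = 500         # upper limit on relocation iterations
STAGE_SPLIT = 0.45     # overall load factor separating the two stages
STAGE1_CAP = 0.5       # per-bucket load threshold used in the first stage
EAGER_ITERS = 3        # first-stage iterations that keep the strict cap

class TwoStageCuckooFilter:
    def __init__(self, n_buckets=32):   # power of two for symmetry
        self.n = n_buckets
        self.buckets = [[] for _ in range(n_buckets)]
        self.count = 0

    def _alt(self, fp, i):
        # Partial-key cuckoo hashing: the alternate bucket is derived
        # from the fingerprint and the current bucket index alone.
        return (i ^ hash((0x9E37, fp))) % self.n

    def _load(self, i):
        return len(self.buckets[i]) / BUCKET_SIZE

    def insert(self, fp, i1, i2):
        stage1 = self.count / (self.n * BUCKET_SIZE) < STAGE_SPLIT
        cap = STAGE1_CAP if stage1 else 1.0
        # Direct placement: prefer the less loaded candidate bucket.
        for i in sorted((i1, i2), key=self._load):
            if self._load(i) < cap:
                self.buckets[i].append(fp)
                self.count += 1
                return True
        # Both candidates at or above the cap: evict a victim and relocate.
        i = random.choice((i1, i2))
        for it in range(MAX_ITER):
            # The strict 0.5 cap holds only for the first few first-stage
            # iterations; afterwards any non-full bucket is acceptable.
            limit = STAGE1_CAP if (stage1 and it < EAGER_ITERS) else 1.0
            victim = self.buckets[i].pop(random.randrange(len(self.buckets[i])))
            self.buckets[i].append(fp)          # take the victim's place
            fp, i = victim, self._alt(victim, i)
            if self._load(i) < limit:
                self.buckets[i].append(fp)
                self.count += 1
                return True
        return False                             # iteration limit reached

random.seed(1)
f = TwoStageCuckooFilter(n_buckets=32)
ok = 0
for x in range(100):
    i1 = hash((0x5BD1, x)) % f.n
    ok += f.insert(x, i1, f._alt(x, i1))
print(ok)   # nearly all of the 100 insertions succeed at ~78% capacity
```

Because the bucket count is a power of two, the XOR-based alternate-bucket computation is its own inverse, which is what lets an evicted victim be moved to its other candidate bucket.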
Based on the two-stage cuckoo filter, the invention now introduces a data deduplication method with the two-stage cuckoo filter as its core.
When a file stream enters the storage system, the deduplication flow is as follows: (1) the file is cut into data blocks using the rolling Rabin fingerprinting method, and the fingerprint of each data block is then calculated with the SHA1 secure hash function. (2) The identification of the data block is sent to the two-stage-insertion cuckoo filter for query, and the filter may return two results: a) the identification is not present in the filter; b) the identification is present in the filter. In case a, the data block is determined to be a new block, so the system stores it in the container area, forms a key-value pair from the block's fingerprint and physical location and stores it in the fingerprint index area, and finally stores the fingerprint in the file's list area.
The rolling Rabin fingerprint method is an algorithm that divides a file into variable-length data blocks; its input is a file data stream and its output is variable-length data blocks, as shown in fig. 6.
The algorithm steps are as follows:
(1) A sliding-window size and a target fingerprint value are preset.
(2) The window is placed at the beginning of the file.
(3) The Rabin fingerprint (hash value) of the data in the window is calculated; if it equals the preset fingerprint value, jump to step 4, otherwise jump to step 5.
(4) The current window boundary is set as one boundary of a block; jump to step 5.
(5) If the file has further data, the sliding window is moved backwards and the algorithm jumps to step 3; otherwise it jumps to step 6.
(6) The algorithm finishes and outputs the blocks according to the calculated boundaries.
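A minimal content-defined chunker in the spirit of the steps above can be sketched as follows; note that a generic polynomial rolling hash stands in for a true Rabin fingerprint, and the window size and boundary mask are illustrative assumptions:

```python
WINDOW = 16            # sliding-window width in bytes (assumed)
MASK = (1 << 11) - 1   # low bits compared against MAGIC (~2 KiB chunks)
MAGIC = 0x78           # stand-in for the preset fingerprint value
BASE, MOD = 257, (1 << 31) - 1

def chunk_boundaries(data: bytes):
    """Return the end offsets of content-defined chunks of `data`.
    A boundary is declared wherever the windowed rolling hash matches
    the preset value, mirroring steps (1)-(6) above."""
    cuts = []
    h = 0
    shift = pow(BASE, WINDOW, MOD)   # weight of the byte leaving the window
    for i, b in enumerate(data):
        h = (h * BASE + b) % MOD
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * shift) % MOD
        if i + 1 >= WINDOW and (h & MASK) == MAGIC:
            cuts.append(i + 1)
    if not cuts or cuts[-1] != len(data):
        cuts.append(len(data))       # the file end closes the last chunk
    return cuts

import random
random.seed(0)
data = bytes(random.randrange(256) for _ in range(8192))
cuts = chunk_boundaries(data)
chunks = [data[a:b] for a, b in zip([0] + cuts, cuts)]
assert b"".join(chunks) == data      # chunks reassemble into the file
```

Because boundaries depend only on window content, inserting bytes near the start of a file shifts later boundaries with the content, so unchanged regions still produce identical chunks, which is the property deduplication relies on.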
Rabin fingerprint algorithm
The input of the Rabin fingerprint algorithm is binary information, and the output is a binary information digest.
(1) Let A = [b_1, …, b_m] be the input binary string.
(2) Construct from A the corresponding polynomial A(t) of highest degree m-1.
(3) Given a polynomial P(t) of highest degree k.
(4) Calculate the Rabin fingerprint = A(t) mod P(t).
(5) Output the Rabin fingerprint.
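The reduction A(t) mod P(t) over GF(2) can be carried out with bit operations; the following sketch encodes each polynomial as an integer, one bit per coefficient, and the small example polynomial is an illustrative choice:

```python
def rabin_fingerprint(a: int, p: int) -> int:
    """Rabin fingerprint sketch over GF(2): interpret the bit string `a`
    as a polynomial A(t) and reduce it modulo the polynomial `p`."""
    deg_p = p.bit_length() - 1
    r = a
    while r.bit_length() > deg_p:
        # XOR (GF(2) subtraction) a shifted copy of p to cancel the top
        # bit of r; the degree of r strictly decreases each step.
        r ^= p << (r.bit_length() - p.bit_length())
    return r

# Example: A(t) = t^5 + t^4 + t^2 + 1, P(t) = t^3 + t + 1 (irreducible).
print(bin(rabin_fingerprint(0b110101, 0b1011)))  # 0b100, i.e. remainder t^2
```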
SHA1 function
The SHA1 algorithm is a secure hash algorithm: its input is binary information and its output is a 160-bit SHA1 message digest. For any input shorter than 2^64 bits, the SHA1 algorithm generates a 160-bit message digest (the identification), and the original input cannot be recovered from the digest.
For plaintext of any length, the SHA1 function first pads and divides it into 512-bit groups, and then processes these plaintext blocks one after another.
The digest generation process for each plaintext packet is as follows:
(1) a 512-bit plaintext block is divided into 16 sub-plaintext blocks, each sub-plaintext block being 32 bits.
(2) Declare five 32-bit chaining variables, denoted A, B, C, D, E.
(3) The 16 sub-plaintext blocks are expanded to 80.
(4) The 80 sub-plaintext blocks are subjected to 4 rounds of operations.
(5) A summation operation is performed on the chaining variables and the initial chaining variables.
(6) The above operations are repeated with the chaining variables as input for the next plaintext block.
(7) Finally, the data in the five chaining variables is the SHA1 digest.
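In practice the digest need not be reimplemented; a minimal example using Python's standard hashlib shows the fixed 160-bit output used as the chunk identification:

```python
import hashlib

def chunk_id(data: bytes) -> str:
    """160-bit SHA1 digest of a data block, hex-encoded (40 hex chars)."""
    return hashlib.sha1(data).hexdigest()

d = chunk_id(b"hello")
print(len(d) * 4)   # 160 bits regardless of input length
print(d)            # aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d
print(chunk_id(b"hello") == d)  # True: identical blocks share one identity
```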
In case b, the deduplication technique provided by the invention enters the disk database to compare fingerprints. If the fingerprint is absent from the disk database, the data block is proved to be brand new and is retained: the block is stored in the container area, the fingerprint and the block's physical position form a key-value pair stored in the fingerprint index area, and finally the fingerprint is stored in the file's list area. If the fingerprint exists in the disk database, the data block is proved to have already been stored by the storage system, and storage is abandoned. The block diagram of the deduplication method is shown in fig. 5, and the flow diagram of the deduplication method is shown in fig. 6. The pseudo code of the deduplication method is as follows:
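As the original pseudo code is not reproduced in this text, the flow of cases a and b above can be sketched as follows; a plain in-memory set stands in for the two-stage cuckoo filter, and the dict and list are simplified stand-ins for the fingerprint index area, container area, and list area:

```python
import hashlib

class DedupStore:
    """Sketch of the deduplication flow: the filter is queried first,
    the disk database is consulted only on a filter hit, and storage is
    abandoned for confirmed duplicates."""

    def __init__(self):
        self.filter = set()    # membership summary queried first
        self.index = {}        # disk DB: fingerprint -> physical location
        self.container = []    # container area holding unique chunks
        self.disk_lookups = 0  # how often the disk DB was consulted

    def store_chunk(self, chunk: bytes) -> str:
        fp = hashlib.sha1(chunk).hexdigest()
        if fp in self.filter:          # case b: filter says "maybe present"
            self.disk_lookups += 1     # confirm against the disk database
            if fp in self.index:
                return fp              # duplicate proved: storage abandoned
        # Case a (or a filter false positive): store the brand-new chunk.
        self.filter.add(fp)
        self.index[fp] = len(self.container)
        self.container.append(chunk)
        return fp                      # fingerprint goes to the file's list area

store = DedupStore()
recipe = [store.store_chunk(c) for c in (b"A", b"B", b"A", b"C", b"B")]
print(len(store.container))  # 3  (only unique chunks stored)
print(store.disk_lookups)    # 2  (two duplicates confirmed on disk)
```

The point of the filter-first layout is visible in the counters: the expensive disk lookup happens only for the two chunks the filter flagged as probable duplicates, never for the three new chunks.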
the scheme provides an improved insertion algorithm of the cuckoo filter, namely a two-stage insertion algorithm, so that the problem of uneven data load is solved, and experimental evaluation shows that the scheme effectively reduces the insertion delay of the cuckoo filter, thereby increasing the efficiency and the throughput of a repeating data deleting system.
Aiming at the uneven data distribution of the summary data structure (the cuckoo filter) used in previous deduplication schemes, the invention proposes a two-stage insertion algorithm: a stricter relocation condition is set in the first stage, and the data distribution is balanced through a more proactive relocation strategy, so that the second stage, which contributes the main part of the insertion delay, starts from a better data distribution; the insertion delay of the summary structure is thus reduced and the whole deduplication process is accelerated.
The invention provides a block-level data deduplication scheme based on the two-stage-insertion cuckoo filter, and the latency of the whole deduplication algorithm is effectively reduced by virtue of the insertion performance of this filter.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.