Background
With the advent of the information age, data on the internet has grown explosively. According to IDC reports, global data reached 33 ZB in 2018 and is predicted to reach 175 ZB by 2025. Meanwhile, to relieve the cost pressure of building and maintaining local storage, more and more individuals, companies, and organizations migrate their data to cloud service providers. However, the explosive growth of data presents serious challenges to cloud service providers in terms of storage capacity, network bandwidth, and so on. To address this data explosion, redundant data elimination technology was proposed and has developed over many years, evolving from lossless data compression, to lossy data compression, to deduplication.
In the early stages of redundant data elimination, coding schemes were widely used and studied. Huffman coding constructs the code with the shortest average length according to the probability of character occurrence. The later LZ encoding builds a dictionary from the data; if both the sender and the receiver hold the same dictionary, the transmitted data can be replaced by dictionary indices, compressing the actual amount of data transferred.
For multimedia data, lossy compression techniques are widely used; they improve compression ratios by discarding information of little perceptual importance. For example, music may have a very complete frequency spectrum, but cutting everything above 20 kHz (the upper limit of human hearing) does not affect its perceived quality, which is the basis of MP3 lossy compression. For images, JPEG and PNG are two popular formats.
After the turn of the century, deduplication technology formally appeared. It supports multi-granularity deduplication, scales well, and can be extended from local storage to large-scale distributed storage systems. Deduplication detects duplicate data in a set of digital files and saves only unique instances of the data, thereby eliminating redundancy. Hash-based deduplication, which has low implementation cost and an excellent deduplication effect, is widely applied in storage systems. The method computes an identifier for each data block or file and stores it in a database. During deduplication, the system recomputes the identifier of the block in question and compares it against the database; if a match is found, an identical block has already been stored, so the system skips storing it again but keeps the index information linking the file to the unique data block, guaranteeing that the file can later be reconstructed normally.
With the rapid expansion of data volume, the space required to store block identifiers also grows, to the point where the main memory of the storage system can no longer hold them, so slow external devices such as magnetic disks take over the task of storing block identifiers. Disk lookup then becomes the bottleneck that limits the efficiency of the whole deduplication system and increases its response time. More and more deduplication systems therefore use additional techniques to alleviate the performance degradation caused by this disk bottleneck.
The DDFS system proposed by Data Domain uses a classical approximate set-membership data structure, the Bloom filter, to avoid the disk bottleneck. A Bloom filter trades a small loss of accuracy for a large saving in memory, and can answer set-membership queries (whether an element is present in a set) with minimal space overhead. It does not store the original data itself, only summary information about it. Its main data structure is a bit vector, together with several hash functions that map data items to bits in that vector.
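As an illustration of the idea (not DDFS's actual implementation), a minimal Bloom filter might look as follows in Python; the vector size, hash count, and double-hashing scheme are assumptions chosen for brevity:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: k derived hash functions over an m-bit vector."""
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, item):
        # Derive k bit positions from one SHA1 digest via double hashing.
        digest = hashlib.sha1(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        # False means "definitely absent"; True means "probably present".
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))
```

A "False" answer is always trustworthy, which is exactly the property DDFS exploits: a negative from the summary vector skips the disk lookup entirely.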
DDFS improves deduplication performance with a summary vector, implemented as a Bloom filter, which is stored in DDFS's main memory and represents a summary of the data segment identifiers in the file system. When the deduplication system needs to query whether an index value exists, it consults the summary vector first; if the summary vector reports that the value does not exist, DDFS treats the data segment as new without any further lookup. If the summary vector reports that the value exists, it exists with high probability, but this is not guaranteed, so DDFS performs a further database lookup to verify. On shutdown, the system saves the summary vector to disk, ensuring its contents are not lost on power failure. On restart, DDFS recovers the latest checkpoint of the summary vector from disk and then inserts the data added after that checkpoint, as shown in fig. 1.
The Bloom filter does not support deletion, so a summary vector implemented on it does not support deletion either: deletions of files or data segments cannot be synchronized into the summary vector, and the accuracy of its information degrades. Since the accuracy of the summary vector directly affects the operating speed of DDFS, this accumulated degradation becomes a bottleneck for the whole system as it runs over a long period.
Agrawal et al. proposed a cuckoo filter-based deduplication system. The cuckoo filter solves the Bloom filter's inability to delete elements without sacrificing space or performance. Its data structure consists of multiple buckets, each containing several cells, and each cell can store one fingerprint. Like the Bloom filter, the cuckoo filter stores only element fingerprints, reducing space overhead at some cost in accuracy. Each element is associated with two hash functions that determine its two candidate bucket positions; the two positions are linked by partial-key cuckoo hashing, so that the other candidate bucket can be derived from the element's fingerprint and one candidate bucket's position.
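The partial-key relationship can be sketched as follows; the bucket count, fingerprint width, and hash choices here are illustrative assumptions, not the cited system's parameters. The key property is that the alternate bucket is computable from a bucket index and the fingerprint alone, so an evicted fingerprint can be relocated without access to the original element:

```python
import hashlib

BUCKETS = 64   # number of buckets (a power of two, so xor stays an involution)
FP_BITS = 16   # fingerprint width in bits

def fingerprint(item: str) -> int:
    """Short, nonzero fingerprint of the element."""
    d = hashlib.sha1(item.encode()).digest()
    return (int.from_bytes(d[:2], "big") % ((1 << FP_BITS) - 1)) + 1

def index_hash(item: str) -> int:
    """First candidate bucket, hashed from the full element."""
    d = hashlib.sha1(item.encode()).digest()
    return int.from_bytes(d[2:6], "big") % BUCKETS

def alt_index(i: int, fp: int) -> int:
    """Partial-key cuckoo hashing: the other candidate bucket is derived
    from the current bucket index and the fingerprint alone."""
    h = int.from_bytes(hashlib.sha1(fp.to_bytes(2, "big")).digest()[:4], "big")
    return (i ^ h) % BUCKETS
```

Applying `alt_index` twice returns to the original bucket, which is what lets relocation bounce a fingerprint between its two candidates.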
The authors also use the cuckoo filter to speed up block-identifier queries, reducing the number of disk accesses during deduplication. When a data block is queried, it is first looked up in a hash structure built on the cuckoo filter; if the block's identifier is found, the system tries to read the block's metadata from a cache, otherwise the metadata is fetched directly from the metadata record. During this process, the system updates the cache with an LRU policy. If the identifier is not found in the cuckoo filter, the block's content is written directly into the storage system and the block is added to the metadata record. In this manner, the cuckoo filter accelerates the overall deduplication process, as shown in fig. 2.
The insertion algorithm of the cuckoo filter was not designed with its effect on the distribution of load inside the filter in mind; it simply picks one of the two candidate buckets at random. This simple random insertion can concentrate the load in a few buckets, and the unbalanced load reduces insertion efficiency, increases latency, and degrades the usability of the whole cuckoo filter, which in turn affects the efficiency of the deduplication system.
Disclosure of Invention
The invention aims to provide a two-stage cuckoo filter and a deduplication method based on it, to solve the technical problem of low insertion efficiency of the cuckoo filter used in existing deduplication methods.
The invention is realized as follows: a two-stage cuckoo filter is provided, in which each cuckoo filter is composed of multiple buckets, each bucket is composed of multiple cells, and each cell can store one data fingerprint, so that the cells form a two-dimensional fingerprint matrix. Each element to be inserted is associated with two hash functions, which yield the positions of the element's two candidate buckets; the element's fingerprint may be stored only in those two buckets. On this structure the cuckoo filter runs a two-stage insertion algorithm with two-stage relocation.
The invention further adopts the technical scheme that: the first stage applies while the overall load factor of the cuckoo filter is below a preset threshold. In the first-stage insertion algorithm, the cuckoo filter first computes the data fingerprint of the element, then computes the two candidate bucket positions, and then computes the loads of the two candidate buckets.
The invention further adopts the technical scheme that: the loads of the two candidate buckets are compared against a set value. If both loads are below the set value, the algorithm selects the candidate bucket with the lower load and reports a successful insertion; if only one candidate bucket's load is below the set value, the algorithm inserts into that bucket and reports a successful insertion; if both load factors are above the set value, the algorithm randomly selects one candidate bucket, removes one fingerprint from it (called the victim), inserts the fingerprint of the element to be inserted into the victim's former position, and performs the first-stage relocation operation.
The invention further adopts the technical scheme that: the first-stage relocation operation first judges whether the iteration count has reached its upper limit; if so, it reports an insertion failure. Otherwise it checks whether the iteration count has reached a preset value. If the count is below the preset value, the algorithm computes the victim's other candidate bucket with the partial-key cuckoo hash function and examines that bucket's load: if the load is below the set value, the algorithm inserts the fingerprint there and reports success; otherwise it randomly selects a fingerprint in the current candidate bucket, removes it, makes it the new victim, inserts the pending fingerprint into the victim's former position, increments the iteration count, and loops back to the first-stage relocation. If the count is at or above the preset value, the algorithm computes the victim's other candidate bucket with the partial-key cuckoo hash function and checks whether it is full: if not full, the algorithm inserts the fingerprint of the element to be inserted there and reports success; if full, it randomly selects a fingerprint in the current candidate bucket, removes it, makes it the new victim, inserts the pending fingerprint into the victim's former position, increments the iteration count, and loops back to the first-stage relocation.
The invention further adopts the technical scheme that: the second stage applies while the overall load factor of the cuckoo filter is at or above the preset threshold. In the second-stage insertion algorithm, the cuckoo filter first computes the data fingerprint of the element, then computes the two candidate bucket positions, and then computes the loads of the two candidate buckets.
The invention further adopts the technical scheme that: the algorithm examines the loads of the element's two candidate buckets. If neither bucket is full, it inserts the fingerprint into the less loaded of the two and reports success; if exactly one bucket is not full, it inserts the fingerprint into that bucket and reports success; if both buckets are full, it randomly selects one candidate bucket, removes one fingerprint (the victim), inserts the fingerprint of the element to be inserted into that position, and performs the second-stage relocation operation.
The invention further adopts the technical scheme that: the second-stage relocation operation first judges whether the current iteration count has reached its upper limit; if so, it reports an insertion failure. Otherwise the algorithm computes the position of the victim's other candidate bucket with the partial-key cuckoo hash function and checks whether it is full: if not full, the algorithm inserts the element there and reports success; if full, it randomly removes one fingerprint, makes it the new victim, inserts the pending fingerprint into the victim's former position, increments the iteration count, and loops back to the second-stage relocation.
Another object of the present invention is to provide a deduplication method based on the two-stage cuckoo filter, which, when a file stream enters the storage system, includes the steps of:
s1, cutting the file into data blocks and calculating the fingerprint of each data block;
s2, sending the identifier of each data block to the two-stage-insertion cuckoo filter to query whether the identifier exists. If it does not exist, the block is judged to be brand new: the system stores the block in the container area, stores the key-value pair formed by the block's fingerprint and physical position in the fingerprint index area, and stores the fingerprint in the file's list area. If the identifier does exist, the proposed deduplication technique consults the disk database to compare fingerprints: if the fingerprint is not in the disk database, the block is in fact brand new, so it is kept, stored in the container area, its fingerprint-position key-value pair is stored in the fingerprint index area, and its fingerprint is stored in the file's list area; if the fingerprint is in the disk database, the block has already been stored by the storage system and storage is abandoned.
The invention further adopts the technical scheme that: in step S1, a rolling Rabin fingerprint chunking method is used to cut the file into data blocks.
The invention further adopts the technical scheme that: in step S1, the fingerprint of each data block is calculated with the SHA1 secure hash function.
The beneficial effects of the invention are as follows: the two-stage insertion algorithm alleviates the problem of uneven data load and effectively reduces the insertion latency of the cuckoo filter, thereby increasing the efficiency and throughput of the deduplication system.
Detailed Description
This scheme addresses the low insertion efficiency of the cuckoo filter used in deduplication methods. It designs a two-stage insertion algorithm that applies different insertion strategies at two different load levels: in the first stage, while the load factor is low, it actively relocates fingerprints to balance the load, laying the groundwork for insertions in the second stage.
With the advent of the information age, data on networks has expanded rapidly, and more and more enterprises face the problem of data explosion, which restricts the expansion of their business. Yet a high proportion of this data is redundant duplicate data, which creates extra space overhead, bandwidth overhead, and energy consumption; deduplication techniques evolved in response.
Deduplication follows the steps of chunking the file stream, computing block fingerprints, comparing block fingerprints, compressing the data, and writing it to disk. According to related research, the fingerprint comparison step is the key target for accelerating deduplication, and the explosive growth of data impacts it severely. As the total amount of data grows rapidly, the total number of block fingerprints and the space they require also grow rapidly, overwhelming the deduplication system's memory so that not all fingerprints can be held there. A substantial portion of the fingerprints is therefore moved to disk. The random access speed of a disk database is very low compared with memory access, so the extra disk IO incurred by fingerprint queries lengthens index lookup and becomes the performance bottleneck of the whole deduplication process. Academia and industry have proposed various schemes to overcome this bottleneck and speed up deduplication. One possible solution is a space-efficient summary data structure: such a probabilistic structure does not store the elements themselves but only summary information about them, greatly reducing space overhead so that it fits in memory, which reduces disk IO and accelerates the whole deduplication process.
The cuckoo filter is used as a novel summary structure by some deduplication systems, but its insertion algorithm does not consider the load across the whole filter, which increases insertion latency and reduces throughput and usability, affecting the efficiency of the whole deduplication system.
The invention provides a storage system deduplication method based on a two-stage cuckoo filter. It proposes a two-stage insertion algorithm for the cuckoo filter that effectively balances the load within the filter, spreading it as evenly as possible across the buckets, increasing the filter's throughput and reducing its insertion latency, and on this basis realizes an efficient storage system deduplication scheme.
First, we introduce the core of the efficient deduplication technique proposed by the present invention: the two-stage cuckoo filter.
When an element x is to be inserted into the cuckoo filter, the algorithm first calculates its fingerprint with the SHA1 algorithm and calculates the positions of its two candidate buckets with two hash functions. The algorithm then reads the current load of the filter and decides which stage it is in: if the load factor is less than 0.45 it is in the first stage, otherwise the second. Both stages of the insertion algorithm impose an upper limit on iterations, to keep the cuckoo filter's insertion from entering an infinite loop.
The two-stage-insertion cuckoo filter differs from an ordinary cuckoo filter only in its insertion algorithm; the data structure is unchanged.
Each cuckoo filter is composed of multiple buckets (one row in the figure), each bucket is composed of multiple cells (one cell in the figure), and each cell can store one data fingerprint, so a cuckoo filter presents a two-dimensional fingerprint matrix. Each element to be inserted is associated with two hash functions through which it obtains the positions of its two candidate buckets, and its fingerprint may be stored only in those two buckets.
The first-stage insertion algorithm first computes the element's data fingerprint, then the positions of the two candidate buckets, then the loads of the two candidate buckets, which can be in one of three states: a) both candidate buckets have load below 0.5; b) exactly one candidate bucket has load below 0.5; c) both candidate buckets have load at or above 0.5. In state a, the algorithm inserts into the candidate bucket with the lower load and returns success. In state b, it inserts into the candidate bucket whose load is below 0.5 and returns success. In state c, it randomly selects a candidate bucket, removes one of its fingerprints (called the victim), inserts the fingerprint of the element to be inserted into the victim's former position, and proceeds to the relocation operation.
In the first-stage relocation operation, the algorithm first checks whether the iteration count has reached the upper limit; if so, it returns insertion failure. If not, there are two cases: a) the iteration count is less than 3, and b) the iteration count is greater than or equal to 3.
1) Case a
Another candidate bucket for the victim is computed with the partial-key cuckoo hash function, and that bucket's load is examined; there are two conditions: i) the candidate bucket's load is below 0.5; ii) the candidate bucket's load is at or above 0.5. In condition i, the algorithm inserts the pending fingerprint into the candidate bucket and the insertion succeeds. In condition ii, the algorithm randomly selects a fingerprint in the current candidate bucket, removes it, makes it the new victim, inserts the pending fingerprint into the victim's former position, increments the iteration count, and loops back to the relocation operation.
2) Case b
Another candidate bucket for the victim is computed with the partial-key cuckoo hash function, and the algorithm checks whether that bucket is full; there are two situations: i) the candidate bucket is not full; ii) the candidate bucket is full. In situation i, the algorithm inserts the pending fingerprint into the candidate bucket and returns success. In situation ii, the algorithm randomly selects a fingerprint in the current candidate bucket, removes it, makes it the new victim, inserts the pending fingerprint into the victim's former position, increments the iteration count, and loops back to the relocation operation.
The second-stage insertion algorithm obtains the loads of the two candidate buckets; there are three cases: a) neither candidate bucket is full; b) one is not full and one is full; c) both are full. In case a, the algorithm inserts the fingerprint into the less loaded candidate bucket and returns success. In case b, it inserts the fingerprint into the bucket that is not full and returns success. In case c, it randomly selects a candidate bucket, randomly removes one of its fingerprints (the victim), inserts the fingerprint of the element to be inserted into that position, and enters the relocation operation.
The second-stage relocation operation first checks whether the current iteration count has reached the upper limit; if so, it returns insertion failure. If not, the algorithm computes the victim's other candidate bucket position with the partial-key cuckoo hash function; there are two cases: a) the candidate bucket is not full, and b) the candidate bucket is full. In case a, the algorithm inserts the element into the candidate bucket and returns success. In case b, it randomly removes one fingerprint, makes it the new victim, inserts the pending fingerprint into the victim's former position, increments the iteration count, and loops back to the relocation operation. A flowchart of the complete two-stage insertion algorithm is shown in fig. 4.
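The two-stage insertion logic described above can be restated as a Python sketch. The bucket size, iteration limit, and the 0.45/0.5 thresholds follow the text; the hash function, eviction choices, and container sizes are simplified assumptions, not the invention's exact implementation:

```python
import random

BUCKET_SIZE = 4          # cells per bucket
MAX_KICKS = 500          # relocation iteration upper limit
STAGE_THRESHOLD = 0.45   # filter load factor separating stage 1 from stage 2
BUCKET_LOAD_LIMIT = 0.5  # per-bucket load considered "low" in stage 1
EARLY_ITERS = 3          # stage-1 iterations that use the stricter condition

class TwoStageCuckooFilter:
    def __init__(self, num_buckets=128):
        self.buckets = [[] for _ in range(num_buckets)]
        self.count = 0

    def _alt(self, i, fp):
        # Partial-key step: alternate bucket from current index and fingerprint.
        return (i ^ hash(fp)) % len(self.buckets)

    def _load(self, i):
        return len(self.buckets[i]) / BUCKET_SIZE

    def insert(self, fp, i1):
        i2 = self._alt(i1, fp)
        load_factor = self.count / (len(self.buckets) * BUCKET_SIZE)
        if load_factor < STAGE_THRESHOLD:
            ok = self._insert_stage1(fp, i1, i2)
        else:
            ok = self._insert_stage2(fp, i1, i2)
        if ok:
            self.count += 1
        return ok

    def _insert_stage1(self, fp, i1, i2):
        # Stage 1: prefer candidate buckets below the per-bucket load limit.
        low = [i for i in (i1, i2) if self._load(i) < BUCKET_LOAD_LIMIT]
        if low:
            self.buckets[min(low, key=self._load)].append(fp)
            return True
        return self._relocate(fp, random.choice((i1, i2)), strict_early=True)

    def _insert_stage2(self, fp, i1, i2):
        # Stage 2: classic policy, any non-full candidate bucket will do.
        free = [i for i in (i1, i2) if len(self.buckets[i]) < BUCKET_SIZE]
        if free:
            self.buckets[min(free, key=self._load)].append(fp)
            return True
        return self._relocate(fp, random.choice((i1, i2)), strict_early=False)

    def _relocate(self, fp, i, strict_early):
        for it in range(MAX_KICKS):
            # Swap the pending fingerprint with a random victim in bucket i.
            slot = random.randrange(len(self.buckets[i]))
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt(i, fp)
            if strict_early and it < EARLY_ITERS:
                # Early stage-1 iterations apply the stricter load test.
                if self._load(i) < BUCKET_LOAD_LIMIT:
                    self.buckets[i].append(fp)
                    return True
            elif len(self.buckets[i]) < BUCKET_SIZE:
                self.buckets[i].append(fp)
                return True
        return False
```

The only difference between the stages is the acceptance condition: stage 1 keeps kicking fingerprints until it finds a bucket below the load limit (for the first few iterations), which is what spreads the load before the filter fills up.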
The invention discloses a deduplication method based on the two-stage cuckoo filter; we now introduce the deduplication method that takes the two-stage cuckoo filter as its core.
When a file stream enters the storage system, the deduplication flow is as follows: (1) The file is cut into data blocks with the rolling Rabin fingerprint method, and the fingerprint of each block is calculated with the SHA1 secure hash function. (2) The block's identifier is sent to the two-stage-insertion cuckoo filter for a query, which may return one of two results: a) the identifier is not present in the two-stage cuckoo filter; b) the identifier is present. In case a, the block can be judged to be brand new, so the system stores the block in the container area, stores the key-value pair formed by the block's fingerprint and physical position in the fingerprint index area, and finally stores the fingerprint in the file's list area.
The rolling Rabin fingerprint method is an algorithm that divides a file into variable-length data blocks: it takes the file data stream as input and outputs variable-length blocks, as shown in fig. 6.
The algorithm comprises the following steps:
(1) A sliding window size and a target fingerprint value are preset.
(2) The window is placed at the beginning of the file.
(3) The Rabin fingerprint (hash value) of the data in the window is calculated; if it equals the preset fingerprint value, jump to step 4, otherwise jump to step 5.
(4) The window boundary is set as one boundary of a block. Jump to step 5.
(5) If the file has further data, slide the window forward and jump to step 3; otherwise jump to step 6.
(6) End the algorithm and output the blocks according to the calculated boundaries.
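The steps above can be sketched as a content-defined chunker in Python. A plain polynomial rolling hash stands in for a true Rabin fingerprint, and the window size and boundary mask are illustrative choices, not the invention's parameters:

```python
WINDOW = 48             # sliding-window width in bytes
MASK = (1 << 11) - 1    # boundary odds ~1/2048 per position -> ~2 KiB avg chunks
TARGET = 0              # preset fingerprint value the windowed hash must hit
BASE = 257
MOD = (1 << 31) - 1

def chunk(data: bytes):
    """Content-defined chunking: cut wherever the windowed rolling hash
    matches the preset target value."""
    chunks, start, h = [], 0, 0
    pow_top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    for i, b in enumerate(data):
        if i - start < WINDOW:
            h = (h * BASE + b) % MOD                    # fill the window
        else:
            out = data[i - WINDOW]
            h = ((h - out * pow_top) * BASE + b) % MOD  # slide the window by one
        if i - start + 1 >= WINDOW and (h & MASK) == TARGET:
            chunks.append(data[start:i + 1])            # boundary: cut here
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])                     # trailing chunk
    return chunks
```

Because boundaries depend on content rather than offsets, inserting bytes near the start of a file shifts only the chunks around the edit, which is what makes variable-length chunking effective for deduplication.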
Rabin fingerprint algorithm
The input of the Rabin fingerprint algorithm is a binary string, and the output is a binary digest.
(1) Let A = (b_1, …, b_m) be the input binary string.
(2) Construct from A the corresponding polynomial A(t) of degree m-1 over GF(2).
(3) Given a polynomial P(t) of degree k.
(4) Calculate the Rabin fingerprint f(A) = A(t) mod P(t).
(5) Output the Rabin fingerprint.
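The mod operation in step (4) is polynomial division over GF(2), which is carry-less: subtraction is xor, so the remainder falls out of shift-and-xor long division. A direct, unoptimized sketch, with a toy polynomial chosen only for illustration:

```python
def rabin_fingerprint(a_bits: int, m: int, p_bits: int, k: int) -> int:
    """Return A(t) mod P(t) over GF(2), where the m-bit integer a_bits
    encodes A(t) and the (k+1)-bit integer p_bits encodes P(t) of degree k.
    GF(2) arithmetic is carry-less, so the remainder is computed by
    shift-and-xor long division."""
    r = a_bits
    for shift in range(m - 1, k - 1, -1):
        if (r >> shift) & 1:              # leading term still present: cancel it
            r ^= p_bits << (shift - k)
    return r                               # degree < k, i.e. a k-bit fingerprint
```

For example, with P(t) = t^3 + t + 1 (0b1011), the 5-bit message 0b11010, i.e. A(t) = t^4 + t^3 + t, reduces to t^2 + t + 1 = 0b111.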
SHA1 function
The SHA1 algorithm is a secure hash algorithm: its input is binary information and its output is a 160-bit SHA1 message digest. For any input shorter than 2^64 bits, SHA1 generates a 160-bit digest (signature), and the original input cannot be recovered from the digest.
For plaintext of arbitrary length, the SHA1 function first splits it into 512-bit groups, then iterates the following process over these plaintext groups.
The digest generation process for each plaintext group is as follows:
(1) The 512-bit plaintext group is divided into 16 sub-blocks of 32 bits each.
(2) Five 32-bit chaining variables, denoted A, B, C, D, E, are initialized.
(3) The 16 sub-blocks are expanded into 80 sub-blocks.
(4) The 80 sub-blocks undergo 4 rounds of operations.
(5) The chaining variables are added to their initial values.
(6) The resulting chaining variables serve as input for the next plaintext group, and the process above repeats.
(7) Finally, the data in the 5 chaining variables is the SHA1 digest.
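In practice the whole procedure is available from a standard library, so a block's fingerprint is one call; e.g. in Python:

```python
import hashlib

def block_fingerprint(block: bytes) -> str:
    """SHA1 digest of a data block, used as its deduplication fingerprint."""
    return hashlib.sha1(block).hexdigest()  # 160 bits = 40 hex characters
```

Identical blocks always yield identical fingerprints, which is the property the fingerprint index relies on.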
In case b, the proposed deduplication technique consults the disk database to compare fingerprints. If the fingerprint is not in the disk database, the block is in fact brand new: it is kept and stored in the container area, the key-value pair formed by its fingerprint and physical position is stored in the fingerprint index area, and finally the fingerprint is stored in the file's list area. If the fingerprint is present in the disk database, the block has already been stored by the storage system, so storage is abandoned. A block diagram of the deduplication method is shown in fig. 5 and its flow chart in fig. 6.
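The overall flow of steps (1)-(2) and cases a/b can be sketched as follows. A Python set stands in for the two-stage cuckoo filter and a dict for the on-disk fingerprint database; the container layout and all names are illustrative assumptions:

```python
import hashlib

class DedupStore:
    """Sketch of the block-level deduplication flow: a fast in-memory filter
    front-ends the authoritative fingerprint index (the on-disk database)."""
    def __init__(self):
        self.filter = set()      # stand-in for the two-stage cuckoo filter
        self.index = {}          # fingerprint -> physical location (container slot)
        self.container = []      # stored unique blocks
        self.file_recipes = {}   # file name -> ordered list of fingerprints

    def store(self, name, blocks):
        recipe = []
        for block in blocks:
            fp = hashlib.sha1(block).hexdigest()
            # Filter miss means definitely new; a filter hit must still be
            # confirmed against the index (a real filter can false-positive).
            if fp not in self.filter or fp not in self.index:
                self.container.append(block)
                self.index[fp] = len(self.container) - 1
                self.filter.add(fp)
            # The file recipe references the block either way, so the
            # file can be reconstructed from unique blocks later.
            recipe.append(fp)
        self.file_recipes[name] = recipe

    def rebuild(self, name):
        return b"".join(self.container[self.index[fp]]
                        for fp in self.file_recipes[name])
```

Duplicate blocks cost only a recipe entry, while each unique block is stored once, mirroring the container/index/list-area split described above.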
the scheme provides an improved insertion algorithm of the cuckoo filter, namely a two-stage insertion algorithm, so that the problem of uneven data load is solved, and meanwhile, experimental evaluation shows that the scheme effectively reduces the insertion time delay of the cuckoo filter, thereby increasing the efficiency and throughput of a repeated data deletion system.
Addressing the uneven data distribution of the cuckoo filter summary structure used in prior deduplication schemes, the invention proposes a two-stage insertion algorithm: a stricter relocation condition is set in the first stage, and the data distribution is balanced through a more aggressive relocation strategy, so that the second stage, which contributes the main part of the insertion latency, starts from a better data distribution. This reduces the insertion latency of the summary structure and accelerates the whole deduplication flow.
The invention provides a block-level deduplication scheme based on the two-stage-insertion cuckoo filter, which, by virtue of the filter's insertion performance, effectively reduces the latency of the whole deduplication process.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.