CN111177092A - Deduplication method and device based on erasure codes

Info

Publication number: CN111177092A
Application number: CN201911251209.4A
Authority: CN
Inventors: 唐聃; 刘龙祥; 蔡红亮; 何磊; 耿微
Original assignee: Chengdu University of Information Technology
Current assignee: Chengdu University of Information Technology
Priority date: 2019-12-09
Filing date: 2019-12-09
Publication date: 2020-05-19

Abstract

The invention discloses an erasure-code-based data de-duplication method and device. The method comprises the following steps: performing security processing on a data block to be stored by using a nonlinear Hash function to obtain a secure data block; processing the secure data block by using an erasure code to obtain a storage value of the data block; judging whether the data block is a duplicate data block according to the storage value of the data block and a pre-stored data storage table; and processing the data block to be stored according to the judgment result.

Description

Deduplication method and device based on erasure codes

Technical Field

The invention relates to the technical field of data storage, in particular to a data de-duplication method and device based on erasure codes.

Background

In the 21st century, with the advent of the information age, MIS (Management Information Systems) have been adopted by industries around the world. They strengthen information management by using computers and network communication to collect, collate, and process enterprise data; decision-makers can then analyze the resulting information resources, thereby improving the management level and benefits of the enterprise. The data volume of modern enterprises grows exponentially, and the required storage capacity has grown from tens of GB through tens of TB to several PB. The big data era is no longer merely theoretical; it has arrived. Research indicates that nearly 60% of stored data is duplicated, and duplicate data not only wastes storage space but also reduces the processing speed and calculation accuracy of the data. Reducing the number of copies of duplicate data blocks has therefore become an effective way to reduce the required storage capacity and save storage space.

Deduplication is a data reduction technique that can effectively optimize storage capacity. IDC (International Data Corporation) defines data de-duplication as a technique that normalizes duplicate data into a single shared data object to improve storage capacity efficiency. The purpose of deduplication is to globally remove redundant data in a storage system, including both intra-file and inter-file redundancy, whereas conventional data compression can only remove redundant information inside files. The data reduction achieved by de-duplication is therefore more pronounced than that of conventional compression: the de-duplication ratio for specific application data can reach 300:1 or even higher, whereas conventional data compression achieves only about 2:1.

The key to data de-duplication technology is detecting duplicate data, that is, determining whether a file, a data block, or even a byte in a storage system is duplicated; the efficiency of de-duplication depends on how files are divided. There are currently two main types of deduplication. File-level de-duplication detects identical files, or files with different names but the same content, at different locations, thereby avoiding repeated storage of the same file. Block-level de-duplication detects identical data blocks within files and ensures that each data block is stored only once.

Data de-duplication exploits the identity and similarity both between files and within files; the finer the processing granularity, the more redundant data is removed. Today, duplicate data is generally detected with a Hash algorithm, and MD5 and SHA-1 are the Hash algorithms most widely applied at present. There are generally two ways of using a Hash algorithm to detect duplicate data: full-file Hash and file-block Hash.

Full-file Hash finds duplicate data at file granularity. Since a file is generally the unit of an information set in a storage system, de-duplication was originally conceived as comparing whole files. For files already stored in the storage system, their hash values are first calculated (usually using MD5 or SHA-1) and organized into a hash value library that is stored separately. De-duplication only pays off when the application contains a large amount of duplicate data; otherwise, storing the hash values of the files actually wastes storage space. When a new file to be stored arrives at the storage system, its hash value is calculated and compared with the values already stored in the hash value library. If a stored file has the same hash value, the two files are judged to be the same, and only a pointer to the stored file is kept in place of the new file. If the hash value of the new file is not found in the library, the file is judged not to be in the storage system; the file is stored and its hash value is added to the library.
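The full-file approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `FileStore`, its methods, and the in-memory dictionaries are hypothetical, and SHA-256 from the standard `hashlib` module is used here in place of the MD5 or SHA-1 functions mentioned in the text.

```python
import hashlib


class FileStore:
    """Toy in-memory store that keeps one copy of each distinct file content."""

    def __init__(self):
        self.hash_library = {}  # hash value -> stored file content
        self.pointers = {}      # file name -> hash value (acts as the pointer)

    def put(self, name: str, content: bytes) -> bool:
        """Store a file; return True if new content was written, False if duplicate."""
        digest = hashlib.sha256(content).hexdigest()
        is_new = digest not in self.hash_library
        if is_new:
            # Unknown hash value: store the content and update the hash library.
            self.hash_library[digest] = content
        # In both cases the file name only records a pointer to the stored content.
        self.pointers[name] = digest
        return is_new

    def get(self, name: str) -> bytes:
        """Follow the pointer recorded for this name back to the stored content."""
        return self.hash_library[self.pointers[name]]
```

Storing two files with identical content under different names would write the content once and record two pointers to it.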

File-block Hash resembles data compression, and in particular dictionary-based compression algorithms. The file is first divided into data blocks, and a Hash value is then calculated for each data block. The simplest way to divide a file is to fix the size of the data blocks. Alternatively, variable-size data blocks, whose size lies within a specified minimum and maximum, may be partitioned by a sliding window: a block boundary is created when the Hash value of the sliding window matches a reference value. In general, the value may be calculated using a Rabin fingerprint, and the range of block-size variation may be narrowed by setting upper and lower limits on the block size. Data blocks are stored in a way similar to full-file Hash, with identical blocks identified by linear block numbers. Fixed block sizes reduce the need for a block-partitioning algorithm, but they also reduce the ability to detect identical blocks.
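The sliding-window partitioning described above can be sketched as follows. This is a minimal illustration, not the scheme of any particular product: the function name `split_content_defined` and all parameter defaults are hypothetical, and a simple polynomial rolling hash stands in for the Rabin fingerprint mentioned in the text. A boundary is declared when the low bits of the window hash are all zero (the "reference value"), subject to the lower and upper limits on block size.

```python
def split_content_defined(data: bytes, window: int = 16, mask: int = 0x3F,
                          min_size: int = 32, max_size: int = 256) -> list:
    """Split data into variable-size blocks using a sliding-window hash."""
    base = 257
    mod = (1 << 31) - 1
    top = pow(base, window - 1, mod)  # weight of the byte leaving the window
    chunks = []
    start = 0  # start offset of the block currently being built
    h = 0      # rolling hash of the last `window` bytes of the current block
    for i, byte in enumerate(data):
        if i - start >= window:
            # The window is full: remove the contribution of the oldest byte.
            h = (h - data[i - window] * top) % mod
        h = (h * base + byte) % mod
        size = i - start + 1
        matches_reference = size >= min_size and (h & mask) == 0
        if matches_reference or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing block may be shorter than min_size
    return chunks
```

Because boundaries depend on content rather than on absolute offsets, an insertion near the start of a file tends to disturb only nearby blocks, which is the detection advantage over fixed-size blocks noted above.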

Full-file Hash has the advantage of high calculation speed in ordinary environments, but it cannot detect identical data shared among different files and therefore cannot eliminate that redundancy. File-block Hash can detect and delete identical data shared among different files, but the Hash index of the blocks must be saved, which consumes additional storage space. Both Hash approaches share a common disadvantage: the security of the data cannot be guaranteed.

Disclosure of Invention

The technical problem solved by the scheme provided by the embodiment of the invention is that data security is low in existing data de-duplication technology.

The deduplication method based on the erasure code provided by the embodiment of the invention comprises the following steps:

performing security processing on a data block to be stored by using a nonlinear Hash function to obtain a secure data block;

processing the secure data block by using an erasure code to obtain a storage value of the data block;

judging whether the data block is a repeated data block or not according to the stored value of the data block and a pre-stored data storage table;

and correspondingly processing the data block needing to be stored according to the judgment result.

Preferably, the method further comprises the following steps:

reading data to be stored;

and segmenting the data to be stored according to a preset size to obtain N data blocks with the same size.

Preferably, the data storage table comprises index locations, data blocks and storage values.

Preferably, the determining whether the data block is a duplicate data block according to the storage value of the data block and a pre-stored data storage table includes:

traversing the stored values in the pre-stored data storage table, and determining whether the stored values of the data blocks are contained in the data storage table;

when the data storage table is determined to contain the storage value of the data block, judging that the data block is a repeated data block;

and when the data storage table is determined not to contain the storage value of the data block, judging that the data block is a non-repeated data block.

Preferably, the performing, according to the determination result, the corresponding processing on the data block to be stored includes:

when the data block is judged to be a repeated data block, discarding the data block, and recording the index position of the data block in the data storage table;

and when the data block is judged to be a non-repeated data block, storing the data block and a storage value thereof, and recording the index position of the data block in the data storage table.

According to an embodiment of the present invention, a de-duplication apparatus based on erasure codes includes:

the security processing module is used for performing security processing on the data block to be stored by using a nonlinear Hash function to obtain a secure data block;

the operation processing module is used for processing the secure data block by using the erasure code to obtain a storage value of the data block;

the judging module is used for judging whether the data block is a repeated data block or not according to the stored value of the data block and a pre-stored data storage table;

and the processing module is used for correspondingly processing the data block needing to be stored according to the judgment result.

Preferably, the apparatus further comprises:

the reading module is used for reading data needing to be stored;

and the segmentation module is used for segmenting the data to be stored according to a preset size to obtain N data blocks with the same size.

Preferably, the judging module includes:

the determining unit is used for traversing the stored values in the pre-stored data storage table and determining whether the stored values of the data blocks are contained in the data storage table;

and the judging unit is used for judging that the data block is a repeated data block when the data storage table is determined to contain the stored value of the data block, and judging that the data block is a non-repeated data block when the data storage table is determined not to contain the stored value of the data block.

Preferably, the processing module is specifically configured to discard the data block and record an index position of the data block in the data storage table when the data block is determined to be a duplicate data block, and store the data block and a storage value thereof and record an index position of the data block in the data storage table when the data block is determined to be a non-duplicate data block.

According to the scheme provided by the embodiment of the invention, the erasure code technology is utilized to prevent the data from being deleted by mistake, thereby ensuring the safety of the data.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a flow chart of a method for erasure code based deduplication provided by an embodiment of the present invention;

FIG. 2 is a schematic diagram of an erasure code based de-duplication apparatus according to an embodiment of the present invention;

FIG. 3 is a flowchart of an erasure code based deduplication method provided by an embodiment of the present invention.

Detailed Description

The preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings, and it should be understood that the preferred embodiments described below are only for the purpose of illustrating and explaining the present invention, and are not to be construed as limiting the present invention.

Fig. 1 is a flowchart of an erasure code-based data de-duplication method according to an embodiment of the present invention, as shown in fig. 1, including:

step S100: performing security processing on a data block to be stored by using a nonlinear Hash function to obtain a secure data block;

step S110: processing the secure data block by using an erasure code to obtain a storage value of the data block;

step S120: judging whether the data block is a repeated data block or not according to the stored value of the data block and a pre-stored data storage table;

step S130: and correspondingly processing the data block needing to be stored according to the judgment result.

The method further includes: reading data to be stored; and segmenting the data to be stored according to a preset size to obtain N data blocks with the same size.

Wherein the data storage table includes an index position, a data block, and a storage value.

Wherein the step S120 includes: traversing the stored values in the pre-stored data storage table, and determining whether the stored values of the data blocks are contained in the data storage table; when the data storage table is determined to contain the storage value of the data block, judging that the data block is a repeated data block; and when the data storage table is determined not to contain the storage value of the data block, judging that the data block is a non-repeated data block. Specifically, the performing, according to the determination result, the corresponding processing on the data block to be stored includes: when the data block is judged to be a repeated data block, discarding the data block, and recording the index position of the data block in the data storage table; and when the data block is judged to be a non-repeated data block, storing the data block and a storage value thereof, and recording the index position of the data block in the data storage table.
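The judging and processing logic of steps S120 and S130 can be sketched as below. The sketch assumes the storage value of each block has already been computed and is passed in as `stored_value`; the class name `DataStorageTable`, its method names, and the use of list positions as index positions are hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DataStorageTable:
    """Each row holds an index position, a data block, and its stored value."""
    blocks: list = field(default_factory=list)  # index position -> data block
    values: list = field(default_factory=list)  # index position -> stored value

    def find(self, stored_value: bytes):
        """Traverse the stored values; return the matching index position or None."""
        for index, value in enumerate(self.values):
            if value == stored_value:
                return index
        return None

    def process(self, block: bytes, stored_value: bytes):
        """Return (index position, is_duplicate) for one incoming data block."""
        index = self.find(stored_value)
        if index is not None:
            # Repeated block: discard it and record the existing index position.
            return index, True
        # Non-repeated block: store the block and its value, record the new position.
        self.blocks.append(block)
        self.values.append(stored_value)
        return len(self.blocks) - 1, False
```

The linear traversal mirrors the wording of the description; an implementation could equally keep a dictionary from stored value to index position so that the lookup does not grow with the size of the table.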

Fig. 2 is a schematic diagram of an erasure code-based data de-duplication apparatus according to an embodiment of the present invention. As shown in fig. 2, the apparatus comprises a security processing module, an operation processing module, a judging module, and a processing module.

The security processing module is used for performing security processing on the data block to be stored by using a nonlinear Hash function to obtain a secure data block; the operation processing module is used for processing the secure data block by using an erasure code to obtain a storage value of the data block; the judging module is used for judging whether the data block is a duplicate data block according to the storage value of the data block and a pre-stored data storage table; and the processing module is used for processing the data block to be stored according to the judgment result.

The apparatus further includes: a reading module used for reading data to be stored; and a segmentation module used for segmenting the data to be stored according to a preset size to obtain N data blocks with the same size.

Wherein, the judging module comprises: the determining unit is used for traversing the stored values in the pre-stored data storage table and determining whether the stored values of the data blocks are contained in the data storage table; and the judging unit is used for judging that the data block is a repeated data block when the data storage table is determined to contain the stored value of the data block, and judging that the data block is a non-repeated data block when the data storage table is determined not to contain the stored value of the data block. Specifically, the processing module is configured to discard the data block and record an index position of the data block in the data storage table when the data block is determined to be a duplicate data block, and store the data block and a storage value thereof and record an index position of the data block in the data storage table when the data block is determined to be a non-duplicate data block.

The method builds on data de-duplication technology: the data is divided into n data blocks of fixed size, each data block is processed with a nonlinear hash function, and a binary Goppa code is applied to the processed data block to obtain a key. Each calculated key is compared in turn with the existing data in the database. If the data block already exists in the database, only its index position is stored and the data block itself is not stored; otherwise, the data block is stored in the database and its index position is saved. When a file needs to be read, the data block index file is retrieved from the database according to the search content, the corresponding data blocks are looked up according to the index positions recorded in that index file, and the data blocks found are restored into the original data file.
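The read path described above amounts to following the recorded index positions back to the stored blocks. A minimal sketch, assuming the index file is simply an ordered list of index positions and the database is a list of unique blocks (both representations, and the name `restore_file`, are hypothetical):

```python
def restore_file(index_positions, stored_blocks) -> bytes:
    """Rebuild the original data from the index positions recorded at write time."""
    return b"".join(stored_blocks[i] for i in index_positions)


# Three unique blocks were kept; the original file consisted of five blocks.
stored_blocks = [b"AAAA", b"BBBB", b"CCCC"]
index_file = [0, 1, 0, 2, 1]
original = restore_file(index_file, stored_blocks)
```

In this example five blocks are reconstructed from three stored ones, which is where the storage saving comes from.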

It should be noted that, the data and the values related to the embodiments of the present invention may be determined according to actual needs, and are not limited herein.

Fig. 3 is a flowchart of a deduplication method based on erasure codes according to an embodiment of the present invention. As shown in fig. 3, taking as an example a file named test with a size of about 32 MB, where the subscript i takes values from 1 to 1024, the method includes:

Step 101: read the data to be stored, here the file test.

Step 102: partition the file test into n data blocks of a fixed size.

Specifically, the file test from step 101 is divided into 1024 blocks of 32 KB each, named n1, n2, n3 … n1024. The size of each data block may be changed as required.
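The block count in this example follows directly from the sizes: assuming binary units, 32 × 1024 × 1024 bytes divided by 32 × 1024 bytes per block gives 1024 blocks. A short sketch of the fixed-size split (the function name `split_fixed` and the all-zero test data are illustrative only):

```python
def split_fixed(data: bytes, block_size: int) -> list:
    """Split data into consecutive blocks of block_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


FILE_SIZE = 32 * 1024 * 1024  # assumed size of the file "test" in bytes
BLOCK_SIZE = 32 * 1024        # assumed size of each data block in bytes
blocks = split_fixed(bytes(FILE_SIZE), BLOCK_SIZE)
```

When the file size is not an exact multiple of the block size, the last block produced by this sketch is shorter; the description does not state how such a remainder is handled.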

Step 103: the 1024 data blocks in step 102 are processed separately using a nonlinear Hash function.

Specifically, the 1024 data blocks from step 102 are each processed with the preset nonlinear Hash function, which helps prevent attacks and also facilitates the subsequent steps. The nonlinear Hash function is denoted H(x), and the processed data blocks are named m1, m2, m3 … m1024, respectively.

Step 104: calculate on the data blocks from step 103 according to the duplicate-data calculation rule.

Specifically, m1, m2, m3 … m1024 from step 103 are operated on with a binary Goppa code to obtain the value corresponding to each data block; these values are named key1, key2, key3 … key1024.
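The two-stage key computation of steps 103 and 104 can be illustrated as follows. The description fixes neither the nonlinear Hash function H(x) nor the parameters of the binary Goppa code, so both are replaced here by stand-ins that are assumptions of this sketch: SHA-256 plays the role of H(x), and a systematic Hamming(7,4) code, a much simpler binary linear block code, plays the role of the Goppa code. All function names are hypothetical.

```python
import hashlib


def nonlinear_hash(block: bytes) -> bytes:
    """Stand-in for H(x): maps a data block n_i to a processed block m_i."""
    return hashlib.sha256(block).digest()


def hamming74_encode_nibble(nibble: int) -> int:
    """Encode 4 data bits d1..d4 as the 7-bit codeword d1 d2 d3 d4 p1 p2 p3."""
    d1 = (nibble >> 3) & 1
    d2 = (nibble >> 2) & 1
    d3 = (nibble >> 1) & 1
    d4 = nibble & 1
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return (nibble << 3) | (p1 << 2) | (p2 << 1) | p3


def encode(message: bytes) -> bytes:
    """Stand-in for the binary Goppa code: one 7-bit codeword per 4-bit nibble."""
    out = bytearray()
    for byte in message:
        out.append(hamming74_encode_nibble(byte >> 4))
        out.append(hamming74_encode_nibble(byte & 0x0F))
    return bytes(out)


def storage_key(block: bytes) -> bytes:
    """key_i = encode(H(n_i)); equal blocks always yield equal keys."""
    return encode(nonlinear_hash(block))
```

Because the encoding step is deterministic and injective, two blocks receive the same key exactly when their hash values agree, so the keys can be compared directly in step 105.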

Step 105: the data block is processed according to the deduplication rule using the value in step 104.

Specifically, the value obtained in step 104 is compared with the values stored in the system to determine whether it is already present. When the value of a data block equals a value in the system, the data block already exists in the system and is not stored again, which avoids duplicate storage; the index position corresponding to that value is recorded and the data block is discarded. The index position records which stored block each piece of the split data corresponds to, and is needed for subsequent data reconstruction. When the value of a data block does not exist in the system, the data block and the value are stored, and the index position of the data block is recorded.

That is, key1, key2, key3 … key1024 obtained in step 104 are compared with the key values stored in the system to determine whether each value already exists. When keyi equals a stored key value, the index position of the data block corresponding to that value in the system is recorded and the data block ni is discarded; when keyi does not exist in the system, the data block ni and keyi are stored, and the index position of ni is recorded.

According to the scheme provided by the embodiment of the invention, the data is divided into n data blocks of fixed size, each data block is then calculated according to a certain rule to obtain a unique key value, and finally this value is compared with the key values of the existing data blocks in the original database. If the key value exists in the original database, the data block is discarded; if it does not exist, the data block is stored in the database, so that the security of the data is ensured.

Although the present invention has been described in detail hereinabove, the present invention is not limited thereto, and various modifications can be made by those skilled in the art in light of the principle of the present invention. Thus, modifications made in accordance with the principles of the present invention should be understood to fall within the scope of the present invention.

Claims

1. A deduplication method based on erasure codes is characterized by comprising the following steps:

2. The method of claim 1, further comprising:

reading data to be stored;

3. The method of claim 1, wherein the data storage table comprises index locations, data blocks, and storage values.

4. The method according to claim 3, wherein the determining whether the data block is a duplicate data block according to the stored value of the data block and a pre-stored data storage table comprises:

5. The method according to claim 4, wherein the performing the corresponding processing on the data block to be stored according to the determination result comprises:

6. An erasure code based de-duplication apparatus, comprising:

7. The apparatus of claim 6, further comprising:

the reading module is used for reading data needing to be stored;

8. The apparatus of claim 6, wherein the data storage table comprises index locations, data blocks, and storage values.

9. The apparatus of claim 8, wherein the determining module comprises:

10. The apparatus according to claim 9, wherein the processing module is specifically configured to discard the data chunk and record an index position of the data chunk in the data storage table when the data chunk is determined to be a duplicate data chunk, and store the data chunk and a storage value thereof and record an index position of the data chunk in the data storage table when the data chunk is determined to be a non-duplicate data chunk.