CN103177111B

CN103177111B - Data deduplication system and deduplication method thereof

Info

Publication number: CN103177111B
Application number: CN201310109231.1A
Authority: CN
Inventors: 王磊; 任振刚; 黑新宏; 高阔; 费蓉
Original assignee: Xian University of Technology
Current assignee: Xian University of Technology
Priority date: 2013-03-29
Filing date: 2013-03-29
Publication date: 2016-02-24
Anticipated expiration: 2033-03-29
Also published as: CN103177111A

Abstract

A data deduplication system and a deduplication method thereof. The system has a distributed architecture consisting mainly of a client, a management server, and storage node servers. The client receives the user's save-file and restore-file requests and splits files into data blocks or reassembles files from them. The management server compares fingerprint values, maintains the fingerprint database, and performs erasure coding and data compression. The storage node servers store the compressed data blocks. The client is connected to the management server, and the management server to the storage node servers, over a local area network (LAN). The user saves and restores files through the client. The invention applies erasure coding and data compression to the split data blocks and distributes the compressed blocks across different storage node servers; if some storage nodes fail, the data held on the remaining storage nodes can be used to restore the file. This both improves the reliability of the data deduplication system and reduces wasted storage space.

Description

Data deduplication system and deduplication method thereof

Technical field

The invention belongs to the field of data deduplication technology and relates to the field of distributed storage, in particular to a data deduplication system based on data compression and erasure coding. The invention also relates to the deduplication method of this data deduplication system.

Background art

With the rapid development of information technology worldwide, the data centers of companies, enterprises, and other organizations face the challenge of ever larger data volumes and rapid data growth. Research shows that the era of big data has arrived. Big data has four characteristics, the most prominent of which is its enormous volume: one report states that the amount of data created and replicated worldwide in 2011 exceeded 1.8 ZB (1.8 trillion GB), a ninefold increase over five years. Research has also found that as much as 60% of the data stored in enterprises is duplicated, and that this share grows over time. The large amount of duplicate data not only wastes storage space but also poses a significant challenge to the speed of data processing and the accuracy of computation. To reduce the large amount of duplicate data in storage systems, data deduplication has become a research focus in recent years.

Data deduplication is a technique that consolidates duplicate data into a single shared data object in order to improve storage capacity efficiency. It is a data reduction technique used mainly in disk-based backup, disaster recovery, and archival storage systems, where it can effectively optimize storage capacity. The flow of an existing data deduplication system is shown in Fig. 1: the file to be stored is first divided into a group of data blocks by a chunking algorithm, and a fingerprint is calculated for each block. The fingerprint value is then used as the key for a lookup in the fingerprint database. If a matching fingerprint value is found, the data block is a duplicate and only its index number is stored; otherwise the data block is new, so it is stored and the corresponding metadata is created.
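The conventional flow described above can be sketched as follows. This is a minimal illustration rather than the implementation of any particular system: the function names, the in-memory dictionary standing in for the fingerprint database, the fixed block size, and the use of MD5 as the fingerprint are all assumptions made for the example.

```python
import hashlib


def dedup_store(data: bytes, block_size: int, store: dict[str, bytes]) -> list[str]:
    """Split data into blocks, store each unseen block once, return the file's block index."""
    recipe = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        fingerprint = hashlib.md5(block).hexdigest()  # fingerprint of the block
        if fingerprint not in store:                   # new block: store it
            store[fingerprint] = block
        recipe.append(fingerprint)                     # duplicate or not, record the index
    return recipe


def restore(recipe: list[str], store: dict[str, bytes]) -> bytes:
    """Rebuild the file by concatenating the blocks named in its index."""
    return b"".join(store[fingerprint] for fingerprint in recipe)
```

Storing the 15-byte input `b"ABCDABCDABCDXYZ"` with a block size of 4 would keep only two distinct blocks while the returned index still lists four entries, one per block of the file.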

In the existing data deduplication system described above, a data block is shared by multiple files in the system, or even by all files. If a data block is lost or corrupted, the restoration of multiple files is affected, which lowers the reliability of the data deduplication system. Data blocks can be backed up redundantly by copying each block to multiple storage nodes, so that if one storage node fails the blocks on the other storage nodes can be used, but this seriously wastes storage space.

Summary of the invention

The object of the present invention is to provide a data deduplication system that solves the problem in the prior art that the loss or corruption of a data block affects the restoration of multiple files, resulting in poor reliability.

Another object of the present invention is to provide a deduplication method for the above data deduplication system.

The object of the present invention is achieved as follows. The data deduplication system has a distributed architecture consisting mainly of a client, a management server, and storage node servers. The client receives the user's save-file and restore-file requests and splits files into data blocks or reassembles files from them. The management server compares fingerprint values, maintains the fingerprint database, and performs erasure coding and data compression. The storage node servers store the compressed data blocks. The client is connected to the management server, and the management server to the storage node servers, over a local area network (LAN).

The present invention is further characterized in that:

The management server consists mainly of three parts: a fingerprint database, a file index database, and a compressed data block index database;

The fingerprint database records the fingerprint values of all data blocks in the system. Its structure consists of FingerPrint and ReferenceCount, where FingerPrint is the fingerprint value and ReferenceCount records the number of times the data block with this fingerprint value is shared; its initial value is 1;

The file index database records the fingerprint values of the data blocks that make up a file, together with the order of those data blocks;

The compressed data block index database records the information of each compressed data block. Its structure consists of DatablockName, IpAddress, SavePath, DataBlockLength, FingerPrint, and ReferenceCount, where DatablockName is the name of the data block, IpAddress is the IP address of the server that stores the data block, SavePath is the directory in which the data block is stored, DataBlockLength is the length of the data block, FingerPrint is the fingerprint value of the data block, and ReferenceCount records the number of times the data block is shared. The initial value of ReferenceCount is 1, and it is equal to the ReferenceCount value recorded for the same fingerprint value in the fingerprint database.

The client is installed on the user's PC.

The other object of the present invention is achieved as follows: in the deduplication method of the above data deduplication system, the user saves and restores files through the client.

The method is further characterized in that:

When a file is saved, the client splits the file supplied by the user into split data blocks, calculates a fingerprint for each split data block, and sends the calculated fingerprint values to the management server. On receiving a fingerprint value, the management server first checks whether an identical fingerprint value exists in the fingerprint database. If it does, the data block has already been saved and the client is notified that it does not need to send the data block. Otherwise the data block is new, and the client is notified to send it to the management server. On receiving the data block, the management server erasure-codes it; the number of basic data blocks and the number of check data blocks required by the erasure code are set in advance according to the number of storage node servers. After erasure coding, each erasure-coded data block is compressed, and the compressed erasure-coded data blocks are sent to the storage node servers to be saved.

When a file is restored, the client sends the name of the file to be restored to the management server. The management server looks up, in the file data block index database, the file that holds the data block index. According to the index positions recorded in that file, it looks up, in the data block erasure code index database, the file that holds the erasure code index. According to the storage locations of the erasure-coded data blocks recorded in the index file, it retrieves the compressed erasure-coded data blocks from each storage node server, decompresses them, reconstructs the split data blocks using the erasure code algorithm, and finally restores the split data blocks to the original file.

The flow of saving a file is as follows:

Step 1: File splitting. The file to be saved is uploaded to the client, which splits it with a fixed-size chunking algorithm to produce temporary split data blocks;

Step 2: Fingerprint calculation. The client uses the MD5 algorithm to calculate the fingerprint value of each split data block;

Step 3: Hash lookup. The calculated fingerprint value is sent to the management server, which uses the fingerprint value as the key and a hash function to look it up in the fingerprint database. If an identical fingerprint value is found, the data block has already been saved: the ReferenceCount field in the fingerprint database is incremented by 1, the block's index is saved to the split index file, and the client is notified that it need not send the data block. Otherwise the data block is new: its index is saved and the client is notified to send the data block for subsequent processing;

Step 4: Erasure coding. Each new split data block is encoded with Reed-Solomon erasure coding, producing erasure-coded data blocks and an erasure code index file;

Step 5: Compression. The erasure-coded data blocks are compressed with the Huffman compression algorithm;

Step 6: Saving. The compressed data blocks are sent to the storage node servers to be saved;

Steps 1 to 6 are repeated until the whole file has been processed;

The flow of restoring a file is as follows:

Step 1: Extract the data block index. The client sends the restore-file request to the management server, which looks up in the file index database the file that holds the file's split data block index. If it is not found, the user is notified that the file cannot be restored; otherwise the flow continues;

Step 2: Extract the erasure code index. According to the index positions recorded in the split data block index file, the management server looks up the erasure code index file in the erasure-coded data block index database. If it is not found, the user is notified that the erasure-coded data blocks are lost and the file cannot be restored; otherwise the flow continues;

Step 3: Extract the erasure-coded data blocks. The compressed erasure-coded data blocks are retrieved from the storage node servers according to the IpAddress and SavePath fields recorded in the erasure code index file;

Step 4: Decompression. The retrieved compressed data blocks are decompressed to produce erasure-coded data blocks;

Step 5: Check the reconstruction condition. Whether the decompressed erasure-coded data blocks satisfy the reconstruction condition is determined according to the Reed-Solomon decoding principle. If they do not, the user is notified that too many erasure-coded data blocks have been lost and reconstruction is impossible; otherwise the flow continues;

Step 6: Reconstruct the data block. The decompressed erasure-coded data blocks are reconstructed into a split data block according to the Reed-Solomon decoding principle;

Step 7: Restore the file. The data in the split data block is sent to the client, which writes it to a file;

Steps 1 to 6 are repeated until all the data in the data block index file has been processed; finally, the client presents the restored file to the user.

The present invention has the following beneficial effects:

1. The data deduplication system of the present invention is highly reliable. The invention encodes the split data blocks with erasure coding and distributes the encoded blocks across different storage nodes. If some storage nodes fail, the original split data block can be reconstructed from the erasure-coded data blocks on the remaining storage nodes. Compared with keeping all data blocks on a single storage node, the invention improves the reliability of the data deduplication system.

2. The data deduplication system of the present invention reduces wasted storage space. Erasure coding adds several check blocks so that data blocks can be reconstructed, so the erasure-coded data blocks occupy more total storage than the original split data block. Compressing the erasure-coded data blocks on top of this reduces the wasted storage space to some extent.

3. By introducing erasure coding and data compression into a data deduplication system, the present invention both improves the reliability of the system and reduces wasted storage space.

4. The deduplication method of the present invention applies erasure coding and data compression to the split data blocks and distributes the compressed blocks across different storage node servers. If some storage nodes fail, the data held on the remaining storage nodes can be used to restore the file. Compared with the prior art, the invention both improves the reliability of data deduplication and reduces wasted storage space.

Brief description of the drawings

Fig. 1 is a flowchart of an existing data deduplication system;

Fig. 2 is a structural diagram of the data deduplication system of the present invention;

Fig. 3 is a schematic diagram of erasure coding;

Fig. 4 is a flowchart of the deduplication method of the data deduplication system of the present invention.

Detailed description

The present invention is described in further detail below with reference to embodiments and the accompanying drawings.

The core function of data deduplication is to compare, at storage time, the data to be stored with the data already held in the storage system. If identical data exists, the data has already been saved: it is filtered out and referenced through a pointer. Otherwise the data is saved. By deduplication granularity, data deduplication can be divided into file-level and block-level deduplication; block-level deduplication has finer granularity and yields a higher deduplication ratio. The present invention uses block-level deduplication.

There are three main data chunking algorithms: fixed-size chunking, variable-size chunking, and sliding-block chunking. Fixed-size chunking splits a file using a predefined block size. Variable-size chunking is based on file content, so the resulting block sizes vary. It slides a fixed-size window over the file data and calculates a fingerprint value for the window contents; if the fingerprint satisfies a certain condition, for example its value modulo a specific number equals a preset number, the window position is taken as a block boundary. Sliding-block chunking combines the advantages of fixed-size and variable-size chunking. Its block size is fixed: a weak checksum is first calculated for a fixed-length block, and if it matches, a strong checksum is then calculated; if both match, the position is taken as a data block boundary. The present invention uses fixed-size chunking.
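The difference between the first two chunking approaches can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function names, the window size, the divisor and remainder, and the toy fingerprint (a plain byte sum rather than a real rolling hash) are all hypothetical choices made for this example.

```python
def fixed_size_chunks(data: bytes, size: int) -> list[bytes]:
    """Fixed-size chunking: cut every `size` bytes; the last chunk may be shorter."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data: bytes, window: int = 4,
                           divisor: int = 8, remainder: int = 0) -> list[bytes]:
    """Variable-size chunking: cut where the window fingerprint mod divisor equals remainder."""
    chunks, start = [], 0
    for i in range(len(data)):
        if i + 1 - start < window:
            continue  # the window must lie entirely inside the current chunk
        fingerprint = sum(data[i + 1 - window:i + 1])  # toy fingerprint (assumption)
        if fingerprint % divisor == remainder:
            chunks.append(data[start:i + 1])           # window end becomes a boundary
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])                    # trailing data forms the last chunk
    return chunks
```

With fixed-size chunking, inserting one byte near the start of a file shifts every later block boundary, whereas content-defined boundaries depend only on nearby bytes. Fixed-size chunking is simpler and cheaper to compute, which is the trade-off the invention accepts.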

In a data deduplication system, data blocks are shared by all the files stored in the system, so the loss or corruption of one data block can make multiple files impossible to restore. Erasure coding is introduced for this reason. Erasure coding is a forward error correction (FEC) technique that has been widely used in many areas of information processing in recent years. An (m, n) erasure code encodes m source data fragments into n (n > m) data fragments, and the original m source fragments can be reconstructed from any x (x ≥ m) of these n fragments. The principle is shown in Fig. 3. Erasure codes fall into four main classes: Reed-Solomon codes, parity array codes, parity-check codes, and LDPC codes. The present invention uses Reed-Solomon codes.
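The "any m of n" property can be demonstrated with a small Reed-Solomon-style sketch. It is a teaching example under stated assumptions, not the coding used by the invention: it works over the prime field GF(257) with Lagrange interpolation, whereas practical Reed-Solomon implementations normally use GF(2^8) so that every symbol fits in one byte. The names `encode`, `decode`, and `_interpolate_at` are hypothetical.

```python
P = 257  # prime field size (assumption for this sketch; real systems use GF(2^8))


def _interpolate_at(points: list[tuple[int, int]], x: int) -> int:
    """Evaluate at x the unique polynomial of degree < len(points) through points, mod P."""
    total = 0
    for j, (xj, yj) in enumerate(points):
        num, den = 1, 1
        for k, (xk, _) in enumerate(points):
            if k != j:
                num = num * (x - xk) % P
                den = den * (xj - xk) % P
        total = (total + yj * num * pow(den, -1, P)) % P  # modular inverse, Python 3.8+
    return total


def encode(data: bytes, m: int, n: int) -> list[list[int]]:
    """Encode data into n fragments; fragments 0..m-1 hold the data itself."""
    assert len(data) % m == 0 and m < n <= P
    length = len(data) // m
    frags = [list(data[i * length:(i + 1) * length]) for i in range(m)]
    for x in range(m, n):  # each check fragment is the polynomial evaluated at x
        frags.append([_interpolate_at([(i, frags[i][pos]) for i in range(m)], x)
                      for pos in range(length)])
    return frags


def decode(available: dict[int, list[int]], m: int) -> bytes:
    """Rebuild the data from any m surviving fragments (fragment index -> symbols)."""
    if len(available) < m:
        raise ValueError("too many fragments lost")
    chosen = sorted(available.items())[:m]
    length = len(chosen[0][1])
    out = bytearray()
    for i in range(m):
        for pos in range(length):
            out.append(_interpolate_at([(x, frag[pos]) for x, frag in chosen], i))
    return bytes(out)
```

For example, encoding 8 bytes with m = 4 and n = 6 yields six fragments, and the original bytes can be recovered from fragments 0, 2, 3, and 5 alone. Check symbols may take the value 256 in this field, which is why the fragments are kept as integer lists rather than bytes.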

When a split data block is saved in the present invention, it is first erasure-coded, and the encoded data blocks are saved to different storage nodes. During file restoration, if some storage nodes have failed or returned errors, the original data block can be reconstructed from the erasure-coded data blocks on the remaining storage nodes, and the source file can then be restored.

When a data block is encoded with an (m, n) erasure code, it is first divided evenly into m data blocks, which are then encoded into n (n > m) data blocks by adding (n - m) check data blocks. The storage capacity after encoding is therefore n/m (n/m > 1) times that of the original data block, which consumes additional storage space. To address this problem, the invention introduces data compression into the data deduplication system on top of erasure coding.
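The n/m expansion can be checked with a few lines of arithmetic. The helper below is a hypothetical illustration; in particular, rounding the fragment length up (that is, padding the block when its length is not a multiple of m) is an assumption that the text does not spell out.

```python
import math


def fragment_layout(block_len: int, m: int, n: int) -> tuple[int, int, float]:
    """Return (fragment length, total bytes stored, expansion ratio) for an (m, n) code."""
    frag_len = math.ceil(block_len / m)  # pad the block up to a multiple of m (assumption)
    stored = frag_len * n                # m data fragments plus (n - m) check fragments
    return frag_len, stored, stored / block_len
```

For a 4096-byte block with m = 4 and n = 6, each fragment is 1024 bytes, 6144 bytes are stored in total, and the expansion ratio is 6/4 = 1.5. Compression has to recover part of this overhead for the scheme to reduce wasted space.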

Data compression is a technique that, during data processing, reduces the storage space occupied by a given amount of data or increases the amount of data stored in a given space. It typically works by eliminating gaps in the data, empty fields, redundant information, and unnecessary data so as to shorten records or blocks; its purpose is to improve the utilization of computer storage space. Data compression is either lossless or lossy. Common lossless methods include Huffman coding, arithmetic coding, run-length coding, and Shannon-Fano coding. Common lossy methods include predictive coding, transform coding, and hybrid coding. The present invention uses Huffman compression, a lossless method.

Huffman compression is a popular lossless compression method based on Huffman coding. Huffman coding builds a prefix-code Huffman tree from the frequencies with which symbols occur so as to minimize the average code length. On this basis, the flow for Huffman-compressing a file is:

1) read each byte of the file and count how often each byte value occurs;

2) for each byte value, create a binary tree consisting of a single node, with the frequency of that byte as the weight of the tree;

3) select the two trees with the smallest weights and merge them into one tree with a new root node whose left and right subtrees are the two selected trees; the weight of the new tree is the sum of the weights of its left and right subtrees;

4) repeat the previous step until only one tree remains;

5) in the tree, assign "0" to the left pointer and "1" to the right pointer of each non-leaf node; the Huffman code of each byte is then obtained by following the path from the root;

6) save the Huffman tree information and the encoded bytes to the compressed file.
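Steps 1) to 5) above can be sketched as follows. This is a minimal illustration, not the invention's implementation: the function names are hypothetical, the compressed output is kept as a string of "0"/"1" characters for readability rather than packed into bytes, and the code table is returned alongside the bits instead of being serialized into a file as in step 6).

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(data: bytes) -> dict[int, str]:
    """Build a prefix code from byte frequencies (steps 1 to 5)."""
    if not data:
        return {}
    tiebreak = count()  # unique sequence number so equal weights never compare nodes
    heap = [(weight, next(tiebreak), byte) for byte, weight in Counter(data).items()]
    heapq.heapify(heap)
    if len(heap) == 1:                      # a single distinct byte still needs one bit
        return {heap[0][2]: "0"}
    while len(heap) > 1:                    # merge the two lightest trees
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (left, right)))
    codes: dict[int, str] = {}

    def walk(node, prefix: str) -> None:
        if isinstance(node, tuple):         # internal node: left gets "0", right gets "1"
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                               # leaf: a byte value
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


def compress(data: bytes) -> tuple[dict[int, str], str]:
    codes = huffman_codes(data)
    return codes, "".join(codes[b] for b in data)


def decompress(codes: dict[int, str], bits: str) -> bytes:
    inverse = {code: byte for byte, code in codes.items()}
    out, current = bytearray(), ""
    for bit in bits:
        current += bit
        if current in inverse:              # prefix property: first match is the symbol
            out.append(inverse[current])
            current = ""
    return bytes(out)
```

Because the most frequent byte receives the shortest code, input with a skewed byte distribution shrinks the most; data that is already close to uniformly distributed gains little from Huffman coding alone.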

When a file is saved, after erasure coding is complete, the erasure-coded data blocks are compressed with the Huffman compression algorithm, and the compressed data blocks are then sent to different storage nodes to be saved. When a file is restored, the data blocks are retrieved from the storage nodes and decompressed with the Huffman decompression algorithm to recover the erasure-coded data blocks.

Fig. 2 is a structural diagram of the data deduplication system of the present invention based on data compression and erasure coding. The system mainly comprises a client, a management server, and storage node servers, and its main functions are concentrated in the management server. The management server consists mainly of three parts: a fingerprint database, a file index database, and a compressed data block index database.

The fingerprint database records the fingerprint values of all data blocks in the system. Its structure consists of <FingerPrint, ReferenceCount>, where FingerPrint is the fingerprint value and ReferenceCount records the number of times the data block with this fingerprint value is shared; its initial value is 1. Lookups in the fingerprint database use a hash lookup algorithm: the fingerprint value string is used as the key to calculate the storage location, and collisions are resolved by linear probing.
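A fingerprint database of this kind can be sketched as an open-addressing hash table. The class below is an illustrative assumption, not the invention's data structure: the class and method names, the use of Python's built-in `hash`, and the policy of doubling the table when it becomes half full are all choices made for this example. Deleting entries is omitted, since removal under linear probing needs tombstones or re-insertion.

```python
class FingerprintTable:
    """<FingerPrint, ReferenceCount> table with linear probing (illustrative sketch)."""

    def __init__(self, capacity: int = 8) -> None:
        self.slots = [None] * capacity  # each slot is [fingerprint, reference_count] or None
        self.size = 0

    def _probe(self, fingerprint: str) -> int:
        """Return the slot holding fingerprint, or the first free slot after its home slot."""
        i = hash(fingerprint) % len(self.slots)
        while self.slots[i] is not None and self.slots[i][0] != fingerprint:
            i = (i + 1) % len(self.slots)  # collision: try the next slot
        return i

    def _grow(self) -> None:
        entries = [entry for entry in self.slots if entry is not None]
        self.slots = [None] * (2 * len(self.slots))
        for entry in entries:
            self.slots[self._probe(entry[0])] = entry

    def add_reference(self, fingerprint: str) -> bool:
        """Record one use of a block; return True if the fingerprint already existed."""
        if (self.size + 1) * 2 > len(self.slots):
            self._grow()                   # keep at least half the slots free
        i = self._probe(fingerprint)
        if self.slots[i] is None:
            self.slots[i] = [fingerprint, 1]  # new block: ReferenceCount starts at 1
            self.size += 1
            return False
        self.slots[i][1] += 1              # duplicate block: increment ReferenceCount
        return True

    def reference_count(self, fingerprint: str) -> int:
        entry = self.slots[self._probe(fingerprint)]
        return entry[1] if entry is not None else 0
```

The return value of `add_reference` corresponds to the decision the management server makes in the save flow: `True` means the client need not send the block, `False` means the block is new.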

The file index database records the fingerprint values of the data blocks that make up a file, together with the order of those data blocks.

The compressed data block index database records the information of each compressed data block. Its structure consists of <DatablockName, IpAddress, SavePath, DataBlockLength, FingerPrint, ReferenceCount>, where DatablockName is the name of the data block, IpAddress is the IP address of the server that stores the data block, SavePath is the directory in which the data block is stored, DataBlockLength is the length of the data block, FingerPrint is the fingerprint value of the data block, and ReferenceCount records the number of times the data block is shared. The initial value of ReferenceCount is 1, and it is equal to the ReferenceCount value recorded for the same fingerprint value in the fingerprint database.

The ReferenceCount field in the compressed data block index database is used during file deletion to determine how many times a data block is shared. When a user deletes a file stored in the system, the data blocks that make up the file need to be deleted. However, a data block is not private to a single file, and deleting it without checking could make other files impossible to restore. To avoid this, the ReferenceCount field in the data block index database is checked when a data block is to be deleted. If its value is 1, the data block is used only by this file and can be deleted. If its value is greater than 1, the data block must not be deleted and the field value is decremented by 1. This ensures that deleting a data block never makes other files impossible to restore.
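The reference-count check can be sketched as follows. The record fields mirror those named above; everything else is an assumption made for illustration, including the function name, the dictionary keyed by fingerprint, and the simplification of keeping one record per fingerprint rather than one per erasure-coded fragment.

```python
from dataclasses import dataclass


@dataclass
class CompressedBlockRecord:
    datablock_name: str
    ip_address: str
    save_path: str
    data_block_length: int
    finger_print: str
    reference_count: int = 1  # initial value is 1, as in the text


def delete_file(file_fingerprints: list[str],
                index: dict[str, CompressedBlockRecord]) -> list[CompressedBlockRecord]:
    """Release a file's blocks; return the records whose stored data may be physically deleted."""
    removable = []
    for fingerprint in file_fingerprints:
        record = index[fingerprint]
        if record.reference_count > 1:
            record.reference_count -= 1              # still shared: only decrement
        else:
            removable.append(index.pop(fingerprint))  # last user: safe to delete
    return removable
```

A block with ReferenceCount 2 survives the first deletion with its count reduced to 1, and is only returned for physical removal when the last file that uses it is deleted.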

Fig. 4 is a flowchart of the data deduplication system based on data compression and erasure coding. To make the flow easier to explain, several terms are defined first:

Definition 1, split data block: a data block produced by applying a chunking algorithm to a file.

Definition 2, erasure-coded data block: a data block produced by erasure-coding a split data block.

Definition 3, compressed data block: a data block produced by applying a compression algorithm to an erasure-coded data block.

Definition 4, split index file: the index file produced when a file is split, used to record the indexes of all data blocks that the file contains.

Definition 5, erasure code index file: the index file produced during erasure coding, used to record the indexes of all data blocks produced by encoding a split data block.

The flow of saving a file is as follows:

Step 1: File splitting. The file to be saved is uploaded to the client, which splits it with a fixed-size chunking algorithm to produce temporary split data blocks.

Step 2: Fingerprint calculation. The client uses the MD5 algorithm to calculate the fingerprint value of each split data block.

Step 3: Hash lookup. The calculated fingerprint value is sent to the management server, which uses the fingerprint value as the key and a hash function to look it up in the fingerprint database. If an identical fingerprint value is found, the data block has already been saved: the ReferenceCount field in the fingerprint database is incremented by 1, the block's index is saved to the split index file, and the client is notified that it need not send the data block. Otherwise the data block is new: its index is saved and the client is notified to send the data block for subsequent processing.

Step 4: Erasure coding. Each new split data block is encoded with Reed-Solomon erasure coding, producing erasure-coded data blocks and an erasure code index file.

Step 5: Compression. The erasure-coded data blocks are compressed with the Huffman compression algorithm.

Step 6: Saving. The compressed data blocks are sent to the storage node servers to be saved. Steps 1 to 5 are repeated until the whole file has been processed.
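The six steps can be tied together in a compact sketch. It is an illustration under several loudly stated assumptions and not the invention's implementation: a single XOR parity fragment stands in for Reed-Solomon coding, `zlib` (DEFLATE, which combines LZ77 with Huffman coding) stands in for plain Huffman compression, and the client, management server, and storage nodes are collapsed into one process with dictionaries in place of the three databases and the network. All names and constants are hypothetical.

```python
import hashlib
import zlib
from functools import reduce
from operator import xor

BLOCK_SIZE = 8  # toy block size (assumption)
M = 2           # data fragments per block (assumption); one XOR parity fragment is added


def erasure_encode(block: bytes) -> list[bytes]:
    """Stand-in for Reed-Solomon: M data fragments plus one XOR parity fragment."""
    frag_len = -(-len(block) // M)                   # ceiling division
    padded = block.ljust(frag_len * M, b"\0")        # pad so the block splits evenly
    frags = [padded[i * frag_len:(i + 1) * frag_len] for i in range(M)]
    parity = bytes(reduce(xor, column) for column in zip(*frags))
    return frags + [parity]


def save_file(name: str, data: bytes, fingerprint_db: dict, file_index: dict,
              block_index: dict, nodes: list[dict]) -> None:
    recipe = []
    for offset in range(0, len(data), BLOCK_SIZE):            # step 1: split
        block = data[offset:offset + BLOCK_SIZE]
        fp = hashlib.md5(block).hexdigest()                   # step 2: fingerprint
        if fp in fingerprint_db:                              # step 3: lookup
            fingerprint_db[fp] += 1                           # duplicate: count only
        else:
            fingerprint_db[fp] = 1
            records = []
            for i, frag in enumerate(erasure_encode(block)):  # step 4: erasure coding
                compressed = zlib.compress(frag)              # step 5: compression
                node_id = i % len(nodes)                      # spread fragments over nodes
                key = f"{fp}.{i}"
                nodes[node_id][key] = compressed              # step 6: save on a node
                records.append((node_id, key))
            block_index[fp] = (records, len(block))           # erasure code index entry
        recipe.append(fp)
    file_index[name] = recipe                                 # split index for the file
```

Saving a file made of three identical 8-byte blocks followed by a short tail would store fragments for only two distinct blocks, with the fingerprint database recording a reference count of 3 for the repeated block.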

The flow of restoring a file is as follows:

Step 1: Extract the data block index. The client sends the restore-file request to the management server, which looks up in the file index database the file that holds the file's split data block index. If it is not found, the user is notified that the file cannot be restored; otherwise the flow continues.

Step 2: Extract the erasure code index. According to the index positions recorded in the split data block index file, the management server looks up the erasure code index file in the erasure-coded data block index database. If it is not found, the user is notified that the erasure-coded data blocks are lost and the file cannot be restored; otherwise the flow continues.

Step 3: Extract the erasure-coded data blocks. The compressed erasure-coded data blocks are retrieved from the storage node servers according to the IpAddress and SavePath fields recorded in the erasure code index file.

Step 4: Decompression. The retrieved compressed data blocks are decompressed to produce erasure-coded data blocks.

Step 5: Check the reconstruction condition. Whether the decompressed erasure-coded data blocks satisfy the reconstruction condition is determined according to the Reed-Solomon decoding principle. If they do not, the user is notified that too many erasure-coded data blocks have been lost and reconstruction is impossible; otherwise the flow continues.

Step 6: Reconstruct the data block. The decompressed erasure-coded data blocks are reconstructed into a split data block according to the Reed-Solomon decoding principle.

Step 7: Restore the file. The data in the split data block is sent to the client, which writes it to a file. Steps 1 to 6 are repeated until all the data in the data block index file has been processed. Finally, the client presents the restored file to the user.
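The restore flow, including the reconstruction check of step 5, can be sketched in the same simplified setting. As before, this is an assumption-laden illustration: a single XOR parity fragment stands in for Reed-Solomon coding, `zlib` stands in for Huffman decompression, a failed storage node is modeled as `None`, and the index layouts and function names are hypothetical.

```python
import zlib
from functools import reduce
from operator import xor

M = 2  # data fragments per block (assumption); fragment index M is the XOR parity


def fetch_fragments(records, nodes) -> dict[int, bytes]:
    """Steps 3 and 4: fetch and decompress the fragments that are still reachable."""
    frags = {}
    for i, (node_id, key) in enumerate(records):
        node = nodes[node_id]                      # None models a failed storage node
        blob = node.get(key) if node is not None else None
        if blob is not None:
            frags[i] = zlib.decompress(blob)
    return frags


def reconstruct(frags: dict[int, bytes], length: int) -> bytes:
    """Steps 5 and 6: check the reconstruction condition, then rebuild the block."""
    if len(frags) < M:
        raise RuntimeError("too many erasure-coded data blocks lost; cannot reconstruct")
    missing = [i for i in range(M) if i not in frags]
    if missing:  # one data fragment lost: XOR of the surviving fragments restores it
        frags[missing[0]] = bytes(reduce(xor, column) for column in zip(*frags.values()))
    return b"".join(frags[i] for i in range(M))[:length]  # drop padding


def restore_file(name: str, file_index: dict, block_index: dict, nodes: list) -> bytes:
    if name not in file_index:                     # step 1: data block index
        raise FileNotFoundError(name)
    out = bytearray()
    for fp in file_index[name]:
        if fp not in block_index:                  # step 2: erasure code index
            raise RuntimeError("erasure code index missing; cannot restore")
        records, length = block_index[fp]
        out += reconstruct(fetch_fragments(records, nodes), length)
    return bytes(out)                              # step 7: hand the file to the client
```

With one of three storage nodes marked as failed, each block is still rebuilt from the two surviving fragments; if a second node fails, fewer than M fragments remain and the reconstruction condition is not met.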

The data deduplication system architecture proposed by the present invention, based on data compression and erasure coding, applies erasure coding and data compression to the split data blocks and distributes the compressed blocks across different storage node servers. If some storage nodes fail, the data held on the remaining storage nodes can be used to restore the file. The proposed architecture both improves the reliability of the data deduplication system and reduces wasted storage space.

The present invention may, of course, have various other embodiments. Without departing from the spirit and essence of the invention, those of ordinary skill in the art can make various corresponding changes and modifications according to the invention, and all such changes and modifications shall fall within the scope of protection of the claims appended to the invention.

Claims

1. A data deduplication system, characterized in that: it has a distributed architecture consisting mainly of a client, a management server, and storage node servers; the client receives the user's save-file and restore-file requests and splits files into data blocks or reassembles files from them; the management server compares fingerprint values, maintains the fingerprint database, and performs erasure coding and data compression; the storage node servers store the compressed data blocks; the client is connected to the management server, and the management server to the storage node servers, over a local area network (LAN);

Wherein the management server consists mainly of three parts: a fingerprint database, a file index database, and a compressed data block index database;

2. The data deduplication system as claimed in claim 1, characterized in that: the client is installed on the user's PC.

3. A deduplication method of the data deduplication system as claimed in claim 1 or 2, characterized in that: the user saves and restores files through the client;

Wherein, when a file is saved, the client splits the file supplied by the user into split data blocks, calculates a fingerprint for each split data block, and sends the calculated fingerprint values to the management server; on receiving a fingerprint value, the management server first checks whether an identical fingerprint value exists in the fingerprint database; if it does, the data block has already been saved and the client is notified that it does not need to send the data block; otherwise the data block is new, and the client is notified to send it to the management server; on receiving the data block, the management server erasure-codes it, the number of basic data blocks and the number of check data blocks required by the erasure code being set in advance according to the number of storage node servers; after erasure coding, each erasure-coded data block is compressed, and the compressed erasure-coded data blocks are sent to the storage node servers to be saved;

Wherein, when a file is restored, the client sends the name of the file to be restored to the management server; the management server looks up, in the file data block index database, the file that holds the data block index; according to the index positions recorded in that file, it looks up, in the data block erasure code index database, the file that holds the erasure code index; according to the storage locations of the erasure-coded data blocks recorded in the index file, it retrieves the compressed erasure-coded data blocks from each storage node server, decompresses them, reconstructs the split data blocks using the erasure code algorithm, and finally restores the split data blocks to the original file;

The flow of saving a file is as follows:

Step 4: Erasure coding: each new split data block is encoded with Reed-Solomon erasure coding, producing erasure-coded data blocks and an erasure code index file;

Step 5: Compression: the erasure-coded data blocks are compressed with the Huffman compression algorithm;

Step 6: Saving: the compressed data blocks are sent to the storage node servers to be saved;

Steps 1 to 6 are repeated until the whole file has been processed;

The flow of restoring a file is as follows: