CN103177111B - Data deduplication system and deduplication method thereof - Google Patents

Data deduplication system and deduplication method thereof

Info

Publication number
CN103177111B
Authority
CN
China
Prior art keywords
data block
file
erasure codes
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201310109231.1A
Other languages
Chinese (zh)
Other versions
CN103177111A (en)
Inventor
王磊
任振刚
黑新宏
高阔
费蓉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology
Priority to CN201310109231.1A
Publication of CN103177111A
Application granted
Publication of CN103177111B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The data deduplication system and its deduplication method use a distributed architecture consisting mainly of a client, a management server, and storage node servers. The client receives the user's save-file and restore-file requests, and splits files into data blocks or reassembles them. The management server compares fingerprint values, maintains the fingerprint database, and performs erasure coding and data compression. The storage node servers store the compressed data blocks. The client and the management server, and the management server and the storage node servers, are connected over a local area network (LAN). Users save and restore files through the client. The invention applies erasure coding and data compression to the split data blocks and distributes the compressed blocks across different storage node servers. If some storage nodes fail, the file can still be restored from the data held on the remaining nodes, which both improves the reliability of the deduplication system and reduces wasted storage space.

Description

Data deduplication system and deduplication method thereof
Technical field
The invention belongs to the field of data deduplication and relates to distributed storage, in particular to a data deduplication system based on data compression and erasure coding. The invention also relates to the deduplication method of this system.
Background art
With the rapid development of global informatization, the data centers of companies, enterprises, and other organizations face the challenge of ever larger data volumes and rapid data growth. Research shows that the era of big data has arrived. Big data has four characteristics, the most prominent of which is its sheer volume: one report states that the amount of data created and replicated worldwide in 2011 exceeded 1.8 ZB (1.8 trillion GB), a ninefold increase within five years. Studies have found that up to 60% of the data held by enterprises is duplicated, and the proportion grows over time. Large amounts of duplicate data not only waste storage space but also pose a serious challenge to data processing speed and computational accuracy. To reduce the duplicate data in storage systems, data deduplication has become a research focus in recent years.
Data deduplication is a technique that normalizes duplicate data into a single shared data object in order to improve storage capacity efficiency. It is a data reduction technique used mainly in disk-based backup, disaster recovery, and archival storage systems, where it can effectively optimize storage capacity. The workflow of an existing data deduplication system is shown in Fig. 1. The file to be stored is first divided into a set of data blocks by a chunking algorithm, and a fingerprint is calculated for each block. The fingerprint value is then used as the key for a lookup in the fingerprint database. If a matching fingerprint is found, the block is a duplicate and only its index number is stored. Otherwise the block is new, so it is stored and its metadata is created.
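The existing workflow described above can be sketched in a few lines. This is a minimal illustration only: the function name `deduplicate`, the in-memory `fingerprint_index` dictionary, and the `block_store` list are assumptions made for the example, and MD5 is used because it is the fingerprint algorithm named later in this description.

```python
import hashlib


def deduplicate(data: bytes, block_size: int,
                fingerprint_index: dict, block_store: list) -> list:
    """Return the list of block index numbers that describes `data`."""
    recipe = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        fingerprint = hashlib.md5(block).hexdigest()
        if fingerprint not in fingerprint_index:
            # New block: store it once and remember its index number.
            fingerprint_index[fingerprint] = len(block_store)
            block_store.append(block)
        # New or duplicate, the file itself only records the index number.
        recipe.append(fingerprint_index[fingerprint])
    return recipe


index, store = {}, []
print(deduplicate(b"ABCDABCDABCDWXYZ", 4, index, store))  # [0, 0, 0, 1]
print(len(store))                                         # 2
```

Four blocks are presented but only two are stored, which is also why the loss of one stored block can damage every file that references it.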
In such existing systems, a data block is shared by multiple files in the system, or even by all of them. If a data block is lost or corrupted, the restoration of multiple files is affected, which lowers the reliability of the deduplication system. Data blocks can be backed up redundantly by copying each block to multiple storage nodes, so that if one node fails the copies on the other nodes can be used, but this seriously wastes storage space.
Summary of the invention
One object of the invention is to provide a data deduplication system that solves the poor reliability of the prior art, in which the loss or corruption of a single data block can affect the restoration of multiple files.
Another object of the invention is to provide the deduplication method of the above data deduplication system.
The first object is achieved as follows. The data deduplication system has a distributed architecture consisting mainly of a client, a management server, and storage node servers. The client receives the user's save-file and restore-file requests, and splits files into data blocks or reassembles them. The management server compares fingerprint values, maintains the fingerprint database, and performs erasure coding and data compression. The storage node servers store the compressed data blocks. The client and the management server, and the management server and the storage node servers, are connected over a local area network (LAN).
The invention is further characterized in that:
The management server consists mainly of three parts: a fingerprint database, a file index database, and a compressed data block index database;
The fingerprint database records the fingerprint values of all data blocks in the system. Its structure consists of FingerPrint and ReferenceCount, where FingerPrint is the fingerprint value and ReferenceCount records the number of times the data block with this fingerprint is shared, with an initial value of 1;
The file index database records the fingerprint values of the data blocks that make up each file, together with the order of the blocks;
The compressed data block index database records the information of each compressed data block. Its structure consists of DatablockName, IpAddress, SavePath, DataBlockLength, FingerPrint, and ReferenceCount, where DatablockName is the name of the data block, IpAddress is the IP address of the server that stores the block, SavePath is the directory in which the block is stored, DataBlockLength is the length of the block, FingerPrint is the fingerprint value of the block, and ReferenceCount records the number of times the block is shared. The initial value of ReferenceCount is 1, and it is equal to the ReferenceCount field of the same fingerprint value in the fingerprint database.
The client is installed on the user's PC.
The other object of the invention is achieved as follows: in the deduplication method of the above data deduplication system, the user saves and restores files through the client.
The method is further characterized in that:
When a file is saved, the client splits the file supplied by the user into split data blocks, calculates a fingerprint for each split data block, and sends the calculated fingerprint values to the management server. On receiving a fingerprint value, the management server first searches the fingerprint database for an identical value. If one exists, the data block has already been saved and the client is told not to send it. Otherwise the data block is new and the client is told to send it to the management server. The management server erasure-codes the received data block; the numbers of base data blocks and check data blocks required by the erasure code are set in advance according to the number of storage node servers. After encoding, each erasure-code data block is compressed, and the compressed erasure-code data blocks are sent to the storage node servers for storage.
When a file is restored, the client sends the name of the file to be restored to the management server. The management server searches the file data block index database for the file that holds the data block index. Using the index positions recorded in that file, it searches the erasure-code data block index database for the file that holds the erasure-code index. Using the storage locations of the erasure-code data blocks recorded in the index file, it fetches the compressed erasure-code data blocks from the storage node servers, decompresses them, reconstructs the split data blocks with the erasure-code algorithm, and finally reassembles the split data blocks into the original file.
The procedure for saving a file is as follows:
Step 1: Split the file. The file to be saved is uploaded to the client, which splits it with a fixed-size chunking algorithm to produce temporary split data blocks;
Step 2: Calculate fingerprint values. The client uses the MD5 algorithm to calculate the fingerprint value of each split data block;
Step 3: Hash lookup. The calculated fingerprint value is sent to the management server, which uses it as the key of a hash function to search the fingerprint database. If an identical fingerprint value is found, the data block has already been saved: the ReferenceCount field in the fingerprint database is incremented by 1, the block's index is saved to the split index file, and the client is told not to send the data block. Otherwise the data block is new: its index is saved and the client is told to send the data block for further processing;
Step 4: Erasure coding. The new split data block is encoded with a Reed-Solomon erasure code, producing erasure-code data blocks and an erasure-code index file;
Step 5: Compress the data. The erasure-code data blocks are compressed with the Huffman compression algorithm;
Step 6: Save. The compressed data blocks are sent to the storage node servers for storage;
Repeat Steps 1 to 5 until the whole file has been processed;
The procedure for restoring a file is as follows:
Step 1: Extract the data block index. The client sends the restore request to the management server, which searches the file index database for the file that holds the split data block index of the requested file. If it is not found, report that the file cannot be restored; otherwise continue;
Step 2: Extract the erasure-code index. Using the index positions recorded in the split data block index file, the management server searches the erasure-code data block index database for the erasure-code index file. If it is not found, report that the erasure-code data blocks are lost and the file cannot be restored; otherwise continue;
Step 3: Extract the erasure-code data blocks. Using the IpAddress and SavePath fields recorded in the erasure-code index file, fetch the compressed erasure-code data blocks from the storage node servers;
Step 4: Decompress. The fetched compressed data blocks are decompressed to produce erasure-code data blocks;
Step 5: Check the reconstruction condition. According to the Reed-Solomon decoding principle, determine whether the decompressed erasure-code data blocks satisfy the reconstruction condition. If not, report that too many erasure-code data blocks are lost and reconstruction is impossible; otherwise continue;
Step 6: Reconstruct the data block. Using the Reed-Solomon decoding principle, reconstruct the split data block from the decompressed erasure-code data blocks;
Step 7: Restore the file. The data in the split data block is sent to the client, which writes it to the file;
Repeat Steps 1 to 6 until all entries in the data block index file have been processed. Finally the client presents the restored file to the user.
The invention has the following beneficial effects:
1. The data deduplication system of the invention is highly reliable. The invention encodes split data blocks with an erasure code and distributes the encoded erasure-code data blocks across different storage nodes. If some storage nodes fail, the original split data block can be reconstructed from the erasure-code data blocks on the remaining nodes. Compared with keeping all data blocks on a single storage node, the invention improves the reliability of the deduplication system.
2. The data deduplication system of the invention reduces wasted storage space. To make reconstruction possible, erasure coding adds several check blocks, so the encoded erasure-code data blocks occupy more storage in total than the original split data block. Compressing the erasure-code data blocks on top of this reduces the wasted storage space to some extent.
3. By introducing erasure coding and data compression into the data deduplication system, the invention both improves the reliability of the system and reduces wasted storage space.
4. The deduplication method of the invention applies erasure coding and data compression to split data blocks and distributes the compressed blocks across different storage node servers. If some storage nodes fail, the file can be restored from the data held on the remaining nodes. Compared with the prior art, the invention both improves the reliability of data deduplication and reduces wasted storage space.
Brief description of the drawings
Fig. 1 is a flowchart of an existing data deduplication system;
Fig. 2 is a structural diagram of the data deduplication system of the invention;
Fig. 3 is a schematic diagram of an erasure code;
Fig. 4 is a flowchart of the deduplication method of the data deduplication system of the invention.
Detailed description of the embodiments
The invention is described in further detail below with reference to an embodiment and the accompanying drawings.
The core function of data deduplication is to compare the data to be stored with the data already held in the storage system. If identical data exists, it has already been saved, so this data is filtered out and referenced through a pointer; otherwise the data is saved. By deduplication granularity, the technique can be divided into file-level and block-level deduplication. Block-level deduplication has a finer granularity and achieves a higher deduplication ratio. The invention uses block-level deduplication.
There are three main chunking algorithms: fixed-size chunking, variable-length chunking, and sliding-block chunking. Fixed-size chunking splits a file using a predefined block size. Variable-length chunking is based on file content, so the resulting block sizes vary: a fixed-size sliding window moves over the file data and a fingerprint is calculated for each window position; when the fingerprint meets a given condition, for example when its value modulo a specific number equals a preset number, the window position is taken as a block boundary. Sliding-block chunking combines the advantages of fixed-size and variable-length chunking, and its block size is fixed: a weak checksum is first calculated for a fixed-length block; if it matches, a strong checksum is calculated, and if both match, a block boundary is declared. The invention uses fixed-size chunking.
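To contrast the first two chunking algorithms, the following minimal sketch shows fixed-size chunking next to a toy variable-length chunker. The window fingerprint (a plain byte sum) and the default values of `window`, `divisor`, and `target` are assumptions chosen for readability; a real system would typically use a rolling hash.

```python
def fixed_size_chunks(data: bytes, size: int) -> list:
    """Cut data every `size` bytes; the last chunk may be shorter."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data: bytes, window: int = 4,
                           divisor: int = 8, target: int = 0) -> list:
    """Cut wherever the fingerprint of the trailing window hits `target`."""
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        if end - start < window:
            continue  # wait until a full window lies inside the current chunk
        fingerprint = sum(data[end - window:end])  # toy fingerprint (assumption)
        if fingerprint % divisor == target:
            chunks.append(data[start:end])  # window position becomes a boundary
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks


print(fixed_size_chunks(b"abcdefghij", 4))  # [b'abcd', b'efgh', b'ij']
```

Both functions return chunks that concatenate back to the input; only the positions of the boundaries differ.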
In a data deduplication system, a data block is shared by all files stored in the system, so the loss or corruption of one data block can make multiple files impossible to restore. Erasure coding is introduced for this reason. An erasure code is a forward error correction (FEC) technique that has been widely used in many areas of information processing in recent years. An (m, n) erasure code encodes m source data fragments into n (n > m) data fragments such that the original m source fragments can be reconstructed from any x (x ≥ m) of the n fragments. The principle is shown in Fig. 3. Erasure codes fall into four main classes: Reed-Solomon codes, parity array codes, parity-check codes, and LDPC codes. The invention uses Reed-Solomon codes.
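The (m, n) property can be demonstrated with a small Reed-Solomon-style code. The sketch below is an illustration under stated assumptions, not the patent's implementation: it works over the prime field GF(257) with Lagrange interpolation instead of the GF(2^8) arithmetic used by production Reed-Solomon libraries, so parity symbols are kept as integers (they can take the value 256), and the names `encode` and `decode` are hypothetical.

```python
P = 257  # smallest prime above 255, so every byte value is a field element


def _interpolate(points, x):
    """Evaluate at x the unique polynomial through `points` (Lagrange, mod P)."""
    total = 0
    for xi, yi in points:
        num, den = 1, 1
        for xj, _ in points:
            if xj != xi:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data: bytes, m: int, n: int) -> list:
    """Split data into m source fragments and append n - m parity fragments."""
    assert len(data) % m == 0 and m < n <= P
    width = len(data) // m
    frags = [list(data[i * width:(i + 1) * width]) for i in range(m)]
    for x in range(m, n):
        frags.append([_interpolate([(i, frags[i][c]) for i in range(m)], x)
                      for c in range(width)])
    return frags


def decode(available: dict, m: int) -> bytes:
    """Rebuild the original bytes from any m fragments, keyed by fragment number."""
    if len(available) < m:
        raise ValueError("fewer than m fragments survive; cannot reconstruct")
    chosen = sorted(available.items())[:m]
    width = len(chosen[0][1])
    out = bytearray()
    for i in range(m):          # source fragment i is the polynomial value at x = i
        for c in range(width):
            out.append(_interpolate([(x, frag[c]) for x, frag in chosen], i))
    return bytes(out)


fragments = encode(b"deduplicate!", m=4, n=6)
survivors = {i: fragments[i] for i in (0, 2, 3, 5)}  # fragments 1 and 4 are lost
print(decode(survivors, m=4))  # b'deduplicate!'
```

Any four of the six fragments are enough here, which matches the x ≥ m condition stated above.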
When a split data block is saved in the invention, it is first erasure-coded, and the encoded data blocks are saved to different storage nodes. During file restoration, if some storage nodes fail or return errors, the original data block can be reconstructed from the erasure-code data blocks on the remaining storage nodes, and the source file can then be restored.
When a data block is encoded with an (m, n) erasure code, the block is first split evenly into m data blocks, which are then encoded into n (n > m) data blocks by adding (n - m) check data blocks. The storage capacity after encoding is therefore n/m (n/m > 1) times that of the original data block, which costs extra storage space. To address this, the invention introduces data compression into the data deduplication system on top of erasure coding.
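As a rough worked example of the n/m factor (the values m = 4 and n = 6 are illustrative assumptions, not taken from the patent), an erasure code that tolerates the loss of n - m = 2 fragments can be compared with plain replication that tolerates the loss of 2 copies:

```latex
\[
\underbrace{\frac{n}{m} = \frac{6}{4} = 1.5}_{\text{erasure coding}}
\qquad\text{versus}\qquad
\underbrace{(n - m) + 1 = 3}_{\text{full copies under replication}}
\]
```

Under these assumed values the encoded block needs 1.5 times the original capacity before compression, against 3 times for replication with the same fault tolerance.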
Data compression is a technique that reduces the storage space occupied by a given amount of data during processing, or increases the amount of data that can be stored in a given space. It usually works by eliminating gaps, empty fields, redundant information, and unnecessary data so as to shorten records or blocks, with the aim of improving the utilization of computer storage space. Data compression is either lossless or lossy. Common lossless methods include Huffman coding, arithmetic coding, run-length coding, and Shannon-Fano coding. Common lossy methods include predictive coding, transform coding, and hybrid coding. The invention uses Huffman compression, a lossless method.
Huffman compression is a widely used lossless compression method based on Huffman coding. Huffman coding builds a prefix-code Huffman tree from the frequency with which symbols occur, so that the resulting code length is as short as possible. The procedure for Huffman-compressing a file is:
1) Read every byte of the file and count how often each byte value occurs;
2) For each byte value, create a binary tree containing a single node, and use the frequency of the byte as the weight of the tree;
3) Select the two trees with the smallest weights and merge them into one tree with a new root node whose left and right subtrees are the two selected trees; the weight of the new tree is the sum of the weights of its left and right subtrees;
4) Repeat the previous step until only one tree remains;
5) In the tree, assign "0" to the left pointer of every non-leaf node and "1" to the right pointer; the Huffman code of each byte can then be read off along the path from the root;
6) Save the Huffman tree information and the code of each byte to the compressed file.
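The six steps above can be sketched as follows. This is a simplified illustration under assumptions: the output is a string of "0"/"1" characters rather than packed bits, the code table is returned to the caller instead of being serialized into the compressed file, and the function names are hypothetical.

```python
import heapq
from collections import Counter


def huffman_codes(data: bytes) -> dict:
    """Map each byte value in `data` to its Huffman code (a bit string)."""
    freq = Counter(data)                                   # step 1
    # Heap entries are (weight, tie_breaker, tree); a tree is either a byte
    # value (leaf) or a (left, right) pair. The unique tie_breaker keeps
    # heapq from ever comparing two trees directly.
    heap = [(w, i, sym) for i, (sym, w) in enumerate(sorted(freq.items()))]  # step 2
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:
        return {heap[0][2]: "0"}  # a lone symbol still needs a one-bit code
    counter = len(heap)
    while len(heap) > 1:                                   # steps 3 and 4
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, counter, (left, right)))
        counter += 1
    codes = {}

    def walk(tree, prefix):                                # step 5
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix

    walk(heap[0][2], "")
    return codes


def compress(data: bytes):
    """Return the code table and the encoded bit string (step 6, simplified)."""
    codes = huffman_codes(data)
    return codes, "".join(codes[b] for b in data)


def decompress(codes: dict, bits: str) -> bytes:
    """Decode a bit string with the code table produced by `compress`."""
    inverse = {code: sym for sym, code in codes.items()}
    out, current = bytearray(), ""
    for bit in bits:
        current += bit
        if current in inverse:  # prefix-free codes make this unambiguous
            out.append(inverse[current])
            current = ""
    return bytes(out)


table, encoded = compress(b"abracadabra")
print(len(encoded))                # 23 bits, against 88 bits uncompressed
print(decompress(table, encoded))  # b'abracadabra'
```

The frequent byte "a" receives the shortest code, which is where the saving comes from.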
When a file is saved, the erasure-code data blocks are compressed with the Huffman compression algorithm once erasure coding is complete, and the compressed data blocks are then sent to different storage nodes for storage. When a file is restored, the data blocks are fetched from the storage nodes and decompressed into erasure-code data blocks with the Huffman decompression algorithm.
Fig. 2 shows the structure of the data deduplication system of the invention based on data compression and erasure coding. The system mainly comprises a client, a management server, and storage node servers, and most of its functions are concentrated in the management server. The management server consists mainly of three parts: a fingerprint database, a file index database, and a compressed data block index database.
The fingerprint database records the fingerprint values of all data blocks in the system. Its structure is <FingerPrint, ReferenceCount>, where FingerPrint is the fingerprint value and ReferenceCount records the number of times the data block with this fingerprint is shared, with an initial value of 1. Lookups in the fingerprint database use a hash search algorithm: the storage position is computed with the fingerprint string as the key, and collisions are resolved by linear probing.
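A fingerprint database with hash lookup and linear probing might look like the sketch below. The class name `FingerprintTable`, the fixed capacity, and the choice of hash function (the hexadecimal fingerprint string interpreted as an integer, modulo the capacity) are assumptions for illustration; the patent does not specify them, and the sketch omits resizing.

```python
class FingerprintTable:
    """Open-addressing hash table of [FingerPrint, ReferenceCount] entries."""

    def __init__(self, capacity: int = 8):
        self.slots = [None] * capacity

    def add_reference(self, fingerprint: str) -> bool:
        """Record one use of a block; return True if the block is new."""
        n = len(self.slots)
        home = int(fingerprint, 16) % n        # position computed from the key
        for step in range(n):
            i = (home + step) % n              # linear probing: try the next slot
            slot = self.slots[i]
            if slot is None:
                self.slots[i] = [fingerprint, 1]   # ReferenceCount starts at 1
                return True
            if slot[0] == fingerprint:
                slot[1] += 1                       # duplicate: count one more share
                return False
        raise RuntimeError("fingerprint table is full")


table = FingerprintTable(capacity=4)
print(table.add_reference("01"))  # True  (stored in slot 1)
print(table.add_reference("05"))  # True  (collides with slot 1, moves to slot 2)
print(table.add_reference("01"))  # False (duplicate, ReferenceCount becomes 2)
```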
The file index database records the fingerprint values of the data blocks that make up each file, together with the order of the blocks.
The compressed data block index database records the information of each compressed data block. Its structure is <DatablockName, IpAddress, SavePath, DataBlockLength, FingerPrint, ReferenceCount>, where DatablockName is the name of the data block, IpAddress is the IP address of the server that stores the block, SavePath is the directory in which the block is stored, DataBlockLength is the length of the block, FingerPrint is the fingerprint value of the block, and ReferenceCount records the number of times the block is shared. The initial value of ReferenceCount is 1, and it is equal to the ReferenceCount field of the same fingerprint value in the fingerprint database.
The ReferenceCount field in the compressed data block index database is used during file deletion to determine how many times a data block is shared. When a user deletes a file stored in the system, the data blocks that make up the file need to be deleted. However, a data block is not private to one file, so deleting it without checking may make other files impossible to restore. To prevent this, the ReferenceCount field in the data block index database is checked before a data block is deleted. If the value is 1, the block is used only by this file and can be deleted. If the value is greater than 1, the block must not be deleted and the field value is decremented by 1. This ensures that no other file becomes unrestorable because a data block was deleted.
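The deletion rule above can be sketched as follows. The dictionary layouts for `file_index` and `block_index` and the function name `delete_file` are assumptions made for the example; only the ReferenceCount field name comes from the description.

```python
def delete_file(file_name: str, file_index: dict, block_index: dict) -> list:
    """Delete a file and return the fingerprints of blocks that were removed.

    file_index maps a file name to its ordered list of block fingerprints;
    block_index maps a fingerprint to a record holding ReferenceCount.
    """
    removed = []
    for fingerprint in file_index.pop(file_name):
        record = block_index[fingerprint]
        if record["ReferenceCount"] > 1:
            record["ReferenceCount"] -= 1   # still shared: keep the block
        else:
            del block_index[fingerprint]    # last user: safe to delete
            removed.append(fingerprint)
    return removed


files = {"a.txt": ["f1", "f2"], "b.txt": ["f2"]}
blocks = {"f1": {"ReferenceCount": 1}, "f2": {"ReferenceCount": 2}}
print(delete_file("a.txt", files, blocks))  # ['f1']
print(blocks)                               # {'f2': {'ReferenceCount': 1}}
```

Block f2 survives because b.txt still references it, so b.txt remains restorable.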
Fig. 4 is the flowchart of the data deduplication system based on data compression and erasure coding. To make the flow easier to explain, several terms are defined first:
Definition 1, split data block: a data block produced by applying a chunking algorithm to a file.
Definition 2, erasure-code data block: a data block produced by erasure-coding a split data block.
Definition 3, compressed data block: a data block produced by applying a compression algorithm to an erasure-code data block.
Definition 4, split index file: the index file produced when a file is split, used to record the indexes of all data blocks that the file contains.
Definition 5, erasure-code index file: the index file produced during erasure coding, used to record the indexes of all data blocks obtained by encoding a split data block.
The procedure for saving a file is as follows:
Step 1: Split the file. The file to be saved is uploaded to the client, which splits it with a fixed-size chunking algorithm to produce temporary split data blocks.
Step 2: Calculate fingerprint values. The client uses the MD5 algorithm to calculate the fingerprint value of each split data block.
Step 3: Hash lookup. The calculated fingerprint value is sent to the management server, which uses it as the key of a hash function to search the fingerprint database. If an identical fingerprint value is found, the data block has already been saved: the ReferenceCount field in the fingerprint database is incremented by 1, the block's index is saved to the split index file, and the client is told not to send the data block. Otherwise the data block is new: its index is saved and the client is told to send the data block for further processing.
Step 4: Erasure coding. The new split data block is encoded with a Reed-Solomon erasure code, producing erasure-code data blocks and an erasure-code index file.
Step 5: Compress the data. The erasure-code data blocks are compressed with the Huffman compression algorithm.
Step 6: Save. The compressed data blocks are sent to the storage node servers for storage. Repeat Steps 1 to 5 until the whole file has been processed.
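The save procedure can be sketched end to end in a single process. Several substitutions are made to keep the example short, and all of them are assumptions rather than the patent's method: a single XOR parity fragment stands in for Reed-Solomon coding, `zlib.compress` (DEFLATE) stands in for the Huffman compressor, storage nodes are plain dictionaries, and the client and management server roles are collapsed into one hypothetical function `save_file`.

```python
import hashlib
import zlib


def xor_parity_encode(block: bytes, m: int) -> list:
    """Stand-in encoder: m equal data fragments plus one XOR parity fragment."""
    width = -(-len(block) // m)                  # ceiling division
    padded = block.ljust(width * m, b"\0")
    frags = [padded[i * width:(i + 1) * width] for i in range(m)]
    parity = bytearray(width)
    for frag in frags:
        for k, byte in enumerate(frag):
            parity[k] ^= byte
    return frags + [bytes(parity)]


def save_file(name, data, block_size, m,
              fingerprints, file_index, block_index, nodes):
    recipe = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]           # Step 1: split
        fp = hashlib.md5(block).hexdigest()                # Step 2: fingerprint
        recipe.append(fp)
        if fp in fingerprints:                             # Step 3: lookup
            fingerprints[fp] += 1                          # duplicate block
            continue
        fingerprints[fp] = 1                               # new block
        locations = []
        for i, frag in enumerate(xor_parity_encode(block, m)):  # Step 4: encode
            node = i % len(nodes)                          # spread over the nodes
            key = f"{fp}.{i}"
            nodes[node][key] = zlib.compress(frag)         # Steps 5 and 6
            locations.append((node, key))
        block_index[fp] = {"DataBlockLength": len(block), "locations": locations}
    file_index[name] = recipe


fingerprints, file_index, block_index = {}, {}, {}
nodes = [{}, {}, {}]
save_file("demo", b"AAAAAAAABBBB", 4, 2,
          fingerprints, file_index, block_index, nodes)
print(sorted(fingerprints.values()))   # [1, 2]
print([len(node) for node in nodes])   # [2, 2, 2]
```

Three blocks are presented, two unique blocks are encoded, and each of the three nodes ends up holding one fragment of each unique block.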
The procedure for restoring a file is as follows:
Step 1: Extract the data block index. The client sends the restore request to the management server, which searches the file index database for the file that holds the split data block index of the requested file. If it is not found, report that the file cannot be restored; otherwise continue.
Step 2: Extract the erasure-code index. Using the index positions recorded in the split data block index file, the management server searches the erasure-code data block index database for the erasure-code index file. If it is not found, report that the erasure-code data blocks are lost and the file cannot be restored; otherwise continue.
Step 3: Extract the erasure-code data blocks. Using the IpAddress and SavePath fields recorded in the erasure-code index file, fetch the compressed erasure-code data blocks from the storage node servers.
Step 4: Decompress. The fetched compressed data blocks are decompressed to produce erasure-code data blocks.
Step 5: Check the reconstruction condition. According to the Reed-Solomon decoding principle, determine whether the decompressed erasure-code data blocks satisfy the reconstruction condition. If not, report that too many erasure-code data blocks are lost and reconstruction is impossible; otherwise continue.
Step 6: Reconstruct the data block. Using the Reed-Solomon decoding principle, reconstruct the split data block from the decompressed erasure-code data blocks.
Step 7: Restore the file. The data in the split data block is sent to the client, which writes it to the file. Repeat Steps 1 to 6 until all entries in the data block index file have been processed. Finally the client presents the restored file to the user.
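Steps 4 to 6 of the restore procedure can be sketched for a single block. As a simplifying assumption, a single XOR parity fragment again stands in for Reed-Solomon decoding, so the reconstruction condition becomes "at most one fragment missing" rather than "at least m of n fragments present"; `zlib` stands in for the Huffman decompressor, and `restore_block` is a hypothetical name.

```python
import zlib


def restore_block(stored: list, m: int, length: int) -> bytes:
    """Rebuild one split data block from m data fragments plus one parity.

    `stored` holds m + 1 compressed fragments; an entry is None when the
    storage node that held it could not be reached.
    """
    frags = [zlib.decompress(f) if f is not None else None
             for f in stored]                                  # Step 4
    missing = [i for i, f in enumerate(frags) if f is None]
    if len(missing) > 1:                                       # Step 5
        raise ValueError("too many erasure-code blocks lost; cannot reconstruct")
    if missing and missing[0] < m:                             # Step 6
        width = len(next(f for f in frags if f is not None))
        rebuilt = bytearray(width)
        for frag in frags:
            if frag is not None:
                for k, byte in enumerate(frag):
                    rebuilt[k] ^= byte   # XOR of the survivors is the lost fragment
        frags[missing[0]] = bytes(rebuilt)
    return b"".join(frags[:m])[:length]  # drop the parity and any padding


data_frags = [b"hell", b"o wo", b"rld!"]
parity = bytes(a ^ b ^ c for a, b, c in zip(*data_frags))
stored = [zlib.compress(f) for f in data_frags + [parity]]
stored[1] = None                      # simulate one failed storage node
print(restore_block(stored, 3, 12))   # b'hello world!'
```

If a second entry in `stored` were also None, the function would raise the error described in Step 5 instead of returning data.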
The data deduplication architecture based on data compression and erasure coding proposed by the invention applies erasure coding and data compression to split data blocks and distributes the compressed blocks across different storage node servers. If some storage nodes fail, the file can be restored from the data held on the remaining nodes. The proposed architecture both improves the reliability of the data deduplication system and reduces wasted storage space.
Of course, the invention may have various other embodiments. Without departing from the spirit and essence of the invention, those of ordinary skill in the art can make corresponding changes and variations according to the invention, and all such changes and variations shall fall within the scope of protection of the appended claims.

Claims (3)

1. A data deduplication system, characterized in that: it has a distributed architecture consisting mainly of a client, a management server, and storage node servers; the client receives the user's save-file and restore-file requests, and splits files into data blocks or reassembles them; the management server compares fingerprint values, maintains the fingerprint database, and performs erasure coding and data compression; the storage node servers store the compressed data blocks; the client and the management server, and the management server and the storage node servers, are connected over a local area network (LAN);
Wherein the management server consists mainly of three parts: a fingerprint database, a file index database, and a compressed data block index database;
The fingerprint database records the fingerprint values of all data blocks in the system; its structure consists of FingerPrint and ReferenceCount, where FingerPrint is the fingerprint value and ReferenceCount records the number of times the data block with this fingerprint is shared, with an initial value of 1;
The file index database records the fingerprint values of the data blocks that make up each file, together with the order of the blocks;
The compressed data block index database records the information of each compressed data block; its structure consists of DatablockName, IpAddress, SavePath, DataBlockLength, FingerPrint, and ReferenceCount, where DatablockName is the name of the data block, IpAddress is the IP address of the server that stores the block, SavePath is the directory in which the block is stored, DataBlockLength is the length of the block, FingerPrint is the fingerprint value of the block, and ReferenceCount records the number of times the block is shared; the initial value of ReferenceCount is 1, and it is equal to the ReferenceCount field of the same fingerprint value in the fingerprint database.
2. The data deduplication system as claimed in claim 1, characterized in that: the client is installed on the user's PC.
3. The deduplication method of the data deduplication system as claimed in claim 1 or 2, characterized in that: the user saves and restores files through the client;
Wherein, when preserving file, client is carried out data cutting to the file that user inputs and is produced cutting data block, carries out fingerprint calculating, the fingerprint value calculated is sent to management server to each cutting data block, after management server receives fingerprint value, first in fingerprint base, search whether there is identical fingerprints value, if there is identical fingerprints value, illustrate that this data block is preserved, notice client does not need to send data block, otherwise illustrate that this data block is a new data block, data block is sent to management server by notice client, after management server receives data block, correcting and eleting codes coding is carried out to data block, basic data block number and the checking data block number of correcting and eleting codes needs is set in advance according to the number of memory node server, correcting and eleting codes has been encoded and has been carried out data compression to each correcting and eleting codes data block afterwards, correcting and eleting codes data block after compression is sent to memory node server preserve,
Wherein, also during original, by client, the filename for reduction is sent to management server, the file preserving data block index is searched in management server to file data blocks index database, according to the index position recorded in file, the file preserving correcting and eleting codes index is searched in data block correcting and eleting codes index database, position is preserved according to the correcting and eleting codes data block recorded in index file, the correcting and eleting codes data block after compression is extracted to each memory node server, it is decompressed, cutting data block is reconstructed according to correcting and eleting codes algorithm, finally cutting data block is reduced to original file,
The flow for saving a file is as follows:
Step 1: File chunking. The file to be saved is uploaded to the client, and the client splits it with a fixed-size chunking algorithm, producing temporary data blocks (chunks);
Step 2: Fingerprint calculation. The client computes the fingerprint value of each chunk with the MD5 algorithm;
Step 3: Hash lookup. The computed fingerprint value is sent to the management server, which uses it as the key to look up the fingerprint database through a hash function; if an identical fingerprint value is found, the data block has already been saved: the ReferenceCount field in the fingerprint database is incremented by 1, the block's index is saved to the chunk index file, and the client is notified that it need not send the data block; otherwise the data block is new: its index is saved and the client is notified to send the data block for subsequent processing;
Step 4: Erasure code encoding. The new chunk is encoded with a Reed-Solomon erasure code, producing erasure-coded data blocks and an erasure code index file;
Step 5: Data compression. The erasure-coded data blocks are compressed with the Huffman compression algorithm;
Step 6: Storage. The compressed data blocks are sent to the storage node servers for storage;
Repeat step 1 to step 6 until the whole file has been processed;
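Steps 1 to 3 of the save flow (fixed-size chunking, MD5 fingerprinting and the lookup that decides whether a block must be sent) can be sketched as follows. This is a hedged illustration, not the patented implementation: the 4096-byte chunk size, the class `ManagementServer`, and the functions `split_fixed`, `fingerprint` and `save_file` are all assumptions introduced for this example, and steps 4 to 6 (erasure coding, compression, storage) are replaced by simply keeping the raw chunk in memory.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed value; the claim only requires a fixed size


def split_fixed(data, size=CHUNK_SIZE):
    """Step 1: split the file content into fixed-size chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprint(chunk):
    """Step 2: compute the MD5 fingerprint of a chunk."""
    return hashlib.md5(chunk).hexdigest()


class ManagementServer:
    """Hypothetical stand-in for the management server's lookup logic."""

    def __init__(self):
        self.reference_count = {}  # fingerprint -> ReferenceCount
        self.file_index = {}       # file name -> ordered fingerprints
        self.new_blocks = {}       # fingerprint -> raw chunk (placeholder for steps 4-6)

    def lookup(self, file_name, fp):
        """Step 3: return True if the client must send the data block."""
        # The chunk's position in the file is recorded in both cases.
        self.file_index.setdefault(file_name, []).append(fp)
        if fp in self.reference_count:
            self.reference_count[fp] += 1  # duplicate: one more sharer
            return False
        self.reference_count[fp] = 1       # new block: initial value 1
        return True

    def receive_block(self, fp, chunk):
        """Placeholder for erasure coding, compression and storage."""
        self.new_blocks[fp] = chunk


def save_file(server, file_name, data):
    """Run steps 1-3 for every chunk; return how many chunks were sent."""
    sent = 0
    for chunk in split_fixed(data):
        fp = fingerprint(chunk)
        if server.lookup(file_name, fp):
            server.receive_block(fp, chunk)
            sent += 1
    return sent
```

In this sketch a Python dictionary plays the role of the hash lookup on the fingerprint database, so a file containing repeated chunks causes only the first occurrence of each chunk to be transferred.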
The flow for restoring a file is as follows:
Step 1: Extract the data block index. The client sends the restore-file request to the management server, which searches the file index database for the file that stores the index of the requested file's data blocks (chunks); if it is not found, the user is notified that the file cannot be restored; otherwise the flow continues;
Step 2: Extract the erasure code index. According to the index position recorded in the chunk index file, the management server searches the erasure-coded data block index database for the erasure code index file; if it is not found, the user is notified that the erasure-coded data blocks are lost and the file cannot be restored; otherwise the flow continues;
Step 3: Extract the erasure-coded data blocks. The compressed erasure-coded data blocks are fetched from the storage node servers according to the IpAddress and SavePath fields recorded in the erasure code index file;
Step 4: Decompression. The fetched compressed data blocks are decompressed, producing erasure-coded data blocks;
Step 5: Check the reconstruction condition. Whether the decompressed erasure-coded data blocks satisfy the reconstruction condition is determined according to the Reed-Solomon erasure code decoding principle; if not, the user is notified that too many erasure-coded data blocks are lost and reconstruction is impossible; otherwise the flow continues;
Step 6: Reconstruct the data block. The decompressed erasure-coded data blocks are reconstructed into chunks using the Reed-Solomon erasure code decoding principle;
Step 7: Restore the file. The data in the chunks is sent to the client, which writes it to the file;
Repeat step 1 to step 6 until all the data in the data block index file has been processed; finally, the client presents the restored file to the user.
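The decompress, check and reconstruct steps of the restore flow can be sketched as follows. Note the substitutions, which are assumptions made to keep the example short: a single XOR parity block (tolerating one lost block) stands in for the Reed-Solomon erasure code named in the claim, and Python's `zlib` (DEFLATE, which combines LZ77 with Huffman coding) stands in for the Huffman compression step. The function names `xor_blocks`, `encode_chunk` and `restore_chunk` are hypothetical.

```python
import zlib


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def encode_chunk(chunk, k):
    """Split a chunk into k basic blocks plus one XOR check block,
    then compress every block (stand-in for the save-side encoding)."""
    shard_len = -(-len(chunk) // k)  # ceiling division
    padded = chunk.ljust(shard_len * k, b"\0")
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    shards.append(xor_blocks(shards))
    return [zlib.compress(s) for s in shards]


def restore_chunk(stored, k, length):
    """Decompress, check the reconstruction condition, and rebuild the chunk.

    stored holds k + 1 compressed blocks; None marks a block whose
    storage node is unavailable. length is the original chunk length.
    """
    # Decompression: recover the erasure-coded blocks that are available.
    shards = [None if s is None else zlib.decompress(s) for s in stored]
    missing = [i for i, s in enumerate(shards) if s is None]
    # Reconstruction condition: at least k blocks must survive.
    if len(missing) > len(shards) - k:
        raise ValueError("too many erasure-coded blocks lost; cannot reconstruct")
    # Reconstruction: the XOR of the surviving blocks equals the lost one.
    if missing:
        survivors = [s for s in shards if s is not None]
        shards[missing[0]] = xor_blocks(survivors)
    # Join the k basic blocks and strip the padding.
    return b"".join(shards[:k])[:length]
```

With a real Reed-Solomon code using k basic blocks and m check blocks, the same condition generalizes to requiring any k of the k + m blocks, which is why the claim sets both numbers in advance according to the number of storage node servers.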
CN201310109231.1A 2013-03-29 2013-03-29 Data deduplication system and delet method thereof Expired - Fee Related CN103177111B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310109231.1A CN103177111B (en) 2013-03-29 2013-03-29 Data deduplication system and delet method thereof

Publications (2)

Publication Number Publication Date
CN103177111A (en) 2013-06-26
CN103177111B (en) 2016-02-24

Family

ID=48636972

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310109231.1A Expired - Fee Related CN103177111B (en) 2013-03-29 2013-03-29 Data deduplication system and delet method thereof

Country Status (1)

Country Link
CN (1) CN103177111B (en)

Families Citing this family (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473298B (en) * 2013-09-04 2017-01-11 华为技术有限公司 Data archiving method and device and storage system
EP3015999A4 (en) 2013-09-29 2016-08-17 Huawei Tech Co Ltd Data processing method, system and client
CN103593264B (en) * 2013-11-28 2017-07-07 中国南方电网有限责任公司超高压输电公司南宁局 Remote Wide Area Network disaster tolerant backup system and method
CN104765693B (en) * 2014-01-06 2018-03-27 国际商业机器公司 A kind of methods, devices and systems for data storage
CN104484126B (en) * 2014-11-13 2017-06-13 华中科技大学 A kind of data safety delet method and system based on correcting and eleting codes
CN104572987B (en) * 2015-01-04 2017-12-22 浙江大学 A kind of method and system that simple regeneration code storage efficiency is improved by compressing
US20160253096A1 (en) * 2015-02-28 2016-09-01 Altera Corporation Methods and apparatus for two-dimensional block bit-stream compression and decompression
CN104793902A (en) * 2015-04-17 2015-07-22 北京赛思信安技术有限公司 Data storage method based on repeating data deleting system
CN105389387B (en) * 2015-12-11 2018-12-14 上海爱数信息技术股份有限公司 A kind of data de-duplication performance based on compression and the method and system for deleting rate promotion again
CN105610921B (en) * 2015-12-23 2018-09-07 华中科技大学 Correcting and eleting codes archiving method based on data buffer storage under a kind of cluster
CN105677238A (en) * 2015-12-28 2016-06-15 国云科技股份有限公司 Method for distributed storage based data deduplication on virtual machine system disk
CN105763600B (en) * 2016-01-29 2019-06-18 华南理工大学 A kind of the grain communication system and its grain communication means of Cache support
CN105912622A (en) * 2016-04-05 2016-08-31 重庆大学 Data de-duplication method for lossless compressed files
CN106527986A (en) * 2016-11-03 2017-03-22 北京百度网讯科技有限公司 Method and device for storing data
CN106713422A (en) * 2016-12-05 2017-05-24 广州因特信息科技有限公司 Method and system for realizing different-place quick data transmission based on Internet
JP6876247B2 (en) 2017-03-09 2021-05-26 コニカミノルタ株式会社 Image forming device
CN107066601A (en) * 2017-04-20 2017-08-18 北京古盘创世科技发展有限公司 File contrasts management method and system
CN107066624B (en) * 2017-05-15 2020-07-28 成都优孚达信息技术有限公司 Data off-line storage method
CN109725836B (en) * 2017-10-30 2021-11-26 普天信息技术有限公司 User context compression method and device
CN108052649A (en) * 2017-12-26 2018-05-18 广州泼墨神网络科技有限公司 The data managing method and its system of a kind of distributed file system
CN110389857B (en) * 2018-04-20 2023-04-21 伊姆西Ip控股有限责任公司 Method, apparatus and non-transitory computer storage medium for data backup
CN109040173A (en) * 2018-06-21 2018-12-18 佛山科学技术学院 A kind of reliable storage method and device of government affairs big data
CN110908589B (en) * 2018-09-14 2023-06-27 阿里巴巴集团控股有限公司 Data file processing method, device, system and storage medium
CN109522283B (en) * 2018-10-30 2021-09-21 深圳先进技术研究院 Method and system for deleting repeated data
CN109213738B (en) * 2018-11-20 2022-01-25 武汉理工光科股份有限公司 Cloud storage file-level repeated data deletion retrieval system and method
CN111177092A (en) * 2019-12-09 2020-05-19 成都信息工程大学 Deduplication method and device based on erasure codes
CN111522791B (en) * 2020-04-30 2023-05-30 电子科技大学 Distributed file repeated data deleting system and method
CN112069510B (en) * 2020-07-24 2024-01-30 北京思特奇信息技术股份有限公司 Data encryption and duplication elimination method
CN112380196B (en) * 2020-10-28 2023-03-21 安擎(天津)计算机有限公司 Server for data compression transmission
CN113472691A (en) * 2021-06-16 2021-10-01 安阳师范学院 Mass time sequence data remote filing method based on message queue and erasure code
CN113270120B (en) * 2021-07-16 2022-02-18 北京金山云网络技术有限公司 Data compression method and device
CN113612829A (en) * 2021-07-27 2021-11-05 安阳师范学院 Remote archiving method for high-density mass data
CN118120212A (en) * 2021-10-28 2024-05-31 华为技术有限公司 File deduplication method, device and equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101777056A (en) * 2009-12-31 2010-07-14 成都市华为赛门铁克科技有限公司 Data storage method and device
CN102200936A (en) * 2011-05-11 2011-09-28 杨钧 Intelligent configuration storage backup method suitable for cloud storage
CN102594899A (en) * 2011-12-31 2012-07-18 成都市华为赛门铁克科技有限公司 Storage service method and storage server using the same
CN102833298A (en) * 2011-06-17 2012-12-19 英业达集团(天津)电子技术有限公司 Distributed repeated data deleting system and processing method thereof
WO2013030893A1 (en) * 2011-08-31 2013-03-07 Hitachi, Ltd. Computer system and data access control method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8307177B2 (en) * 2008-09-05 2012-11-06 Commvault Systems, Inc. Systems and methods for management of virtualization data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
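Research on Key Technologies of Massive Data Processing Based on Online Data Deduplication (基于在线重复数据消除的海量数据处理关键技术研究); Wang Can (王灿); China Doctoral Dissertations Full-text Database (《中国博士学位论文全文数据库》); 20121215; pp. 30-32 *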

Also Published As

Publication number Publication date
CN103177111A (en) 2013-06-26

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160224

Termination date: 20210329