CN105912622A

CN105912622A - Data de-duplication method for lossless compressed files

Info

Publication number: CN105912622A
Application number: CN201610213219.9A
Authority: CN
Inventors: 谭玉娟; 晏志超
Original assignee: Chongqing University
Current assignee: Chongqing University
Priority date: 2016-04-05
Filing date: 2016-04-05
Publication date: 2016-08-31

Abstract

The invention provides a data de-duplication method for lossless compressed files. The method uses a data integrity check code already present in a compressed file, such as a cyclic redundancy check (CRC) code, as a file signature (File Signature) to identify duplicate compressed files. Where collision-rate requirements demand it, other file attributes such as file length can also be extracted, and the file length and the check code together serve as the file signature. If the compressed file has no check code, a check code is computed, or a hash value is computed with a hash algorithm, and used as the file signature to identify duplicate files. The method can be combined with existing data de-duplication techniques and fills the technical gap that data de-duplication cannot be performed on compressed files.

Description

A data de-duplication method for lossless compressed files

Technical field

The invention belongs to the field of computer information storage technology, and specifically relates to a data de-duplication method for lossless compressed files.

Background art

In the field of information storage, data compression is a common means of reducing data volume. Data compression comprises lossless compression and lossy compression. Lossless compression gathers statistics on the redundancy of the original data within a certain range and uses those statistics to re-encode the data, removing redundant data and thereby reducing data volume. Different lossless compression algorithms use different coding methods. Lossless compression is widely used for text data, programs, and image data of particular applications, where the data must be stored exactly. Lossy compression, by contrast, exploits the fact that human vision and hearing are insensitive to certain frequency components of images and sound, and allows some information to be lost during compression. Lossy compression is widely used for voice, image, and video data.

With the explosive growth of data volume, data de-duplication has emerged over the last decade as another powerful tool, besides data compression, for reducing data volume. Data de-duplication divides a data stream into blocks and reduces data volume by finding and removing redundant data blocks. In general, the average block size is between 4 KB and 32 KB, or even larger. Data de-duplication is an emerging lossless data reduction method used mainly in large-scale storage systems; it makes up for the inability of traditional data compression methods to remove redundant data at the data-block level.
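As an illustration of the block-level de-duplication described above, the following is a minimal sketch that assumes fixed-size 4 KB blocks and SHA-1 block fingerprints. The function names `dedupe_blocks` and `rebuild`, the block size, and the fingerprint choice are assumptions made for this example; production systems often use variable-size chunking instead.

```python
import hashlib

def dedupe_blocks(stream: bytes, block_size: int = 4096):
    """Split a stream into fixed-size blocks and keep one copy of each unique block."""
    block_store = {}   # fingerprint -> block content (stored once)
    recipe = []        # ordered fingerprints needed to rebuild the stream
    for offset in range(0, len(stream), block_size):
        block = stream[offset:offset + block_size]
        fingerprint = hashlib.sha1(block).hexdigest()
        block_store.setdefault(fingerprint, block)
        recipe.append(fingerprint)
    return block_store, recipe

def rebuild(block_store, recipe) -> bytes:
    """Reassemble the original stream from the recipe and the unique blocks."""
    return b"".join(block_store[fp] for fp in recipe)

# Three identical blocks followed by one different block.
stream = b"A" * 4096 * 3 + b"B" * 4096
block_store, recipe = dedupe_blocks(stream)
```

Here four blocks are reduced to two stored blocks, and the recipe records how to restore the stream.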

In existing storage systems, data de-duplication has become a routine functional component. For compressed files (that is, files compressed with a traditional compression method), however, data de-duplication cannot find the potential redundant data. The main reason is that identical or similar files, once compressed with different compression algorithms, yield data streams completely different from the original files, so data de-duplication cannot recognize their potential redundancy. Moreover, in order to reduce data volume, many files in a storage system are transmitted and stored as compressed files, so the inability to identify the potential redundant data of compressed files is a major shortcoming of data de-duplication. The present invention proposes a data de-duplication method for lossless compressed files; the method can efficiently identify and remove redundant data in compressed files and fills the technical gap that data de-duplication cannot be performed on compressed files.
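The problem can be illustrated with a small sketch using Python's standard `zlib` and `bz2` modules, chosen here only as two convenient examples of different lossless algorithms. The same content produces different compressed byte streams, while a check code computed over the uncompressed content stays the same.

```python
import bz2
import zlib

original = b"the same file content, repeated many times. " * 200

# Compress identical content with two different lossless algorithms.
deflated = zlib.compress(original)
bzipped = bz2.compress(original)

# The compressed byte streams differ, so block-level de-duplication
# that operates on them finds nothing in common.
streams_differ = deflated != bzipped

# A CRC-32 computed over the uncompressed content is identical,
# whichever algorithm produced the compressed file.
crc_from_deflate = zlib.crc32(zlib.decompress(deflated))
crc_from_bzip2 = zlib.crc32(bz2.decompress(bzipped))
signatures_match = crc_from_deflate == crc_from_bzip2
```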

Summary of the invention

The present invention proposes a data de-duplication method for lossless compressed files. The method uses the data integrity check code that a compressed file already contains, such as a cyclic redundancy check (CRC) code, as a file signature (File Signature) to identify duplicate compressed files: if the check codes of two files are identical, the two compressed files are considered identical, and the duplicate file is removed. Where collision-rate requirements demand it, other file attributes such as file size can also be extracted, and the file size and the check code together serve as the file signature. If the compressed file itself has no check code, a check code is computed, or a hash value is computed with a hash algorithm, and used as the file signature to identify duplicate files. The method can also be combined with existing data de-duplication techniques to strengthen their de-duplication effect, filling the technical gap that data de-duplication cannot be performed on compressed files.
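A minimal sketch of signature extraction, assuming the compressed package is a ZIP archive, follows. The ZIP format records a CRC-32 and the uncompressed size for every member, and Python's `zipfile` module exposes them as `ZipInfo.CRC` and `ZipInfo.file_size`, so no decompression is needed. The helper names `crc_signatures`, `hash_signature`, and `make_zip` are hypothetical, and SHA-1 is just one of the hash algorithms the text mentions.

```python
import hashlib
import io
import zipfile

def crc_signatures(archive_bytes: bytes) -> dict[str, tuple[int, int]]:
    """Map each member name to (CRC-32, uncompressed size) read from ZIP metadata."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return {info.filename: (info.CRC, info.file_size) for info in zf.infolist()}

def hash_signature(archive_bytes: bytes, member: str) -> str:
    """Alternative signature: SHA-1 of the member's original (uncompressed) content."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return hashlib.sha1(zf.read(member)).hexdigest()

def make_zip(files: dict[str, bytes]) -> bytes:
    """Build an in-memory ZIP archive for the demonstration."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()

demo = make_zip({
    "a.txt": b"hello" * 100,
    "copy_of_a.txt": b"hello" * 100,
    "b.txt": b"world",
})
sigs = crc_signatures(demo)
```

In this demonstration `a.txt` and `copy_of_a.txt` receive the same signature and `b.txt` a different one. Pairing the CRC with the file size lowers the chance that two different files share a signature, but a 32-bit check code cannot rule out collisions.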

The specific steps of the method are as follows:

(1) Extract the file signature (File Signature) of each compressed file in the compressed package. The specific steps are:

(1.1) Extract the data integrity check code that each compressed file already contains, such as a cyclic redundancy check (CRC) code, as the file signature. Where collision-rate requirements demand it, other file attributes such as file size can also be extracted, and the file size and the check code together serve as the file signature.

(1.2) If the compressed file itself has no check code, compute the check code of the compressed file (that is, of the original file before compression) and use it as the file signature.

(1.3) Instead of a check code, a hash algorithm such as MD5 or SHA-1 can also be used to compute the hash value of the compressed file (that is, of the original file before compression) as the file signature.

(2) If the file signatures of two files are identical, mark the file as a duplicate file; otherwise mark it as a non-duplicate file. The specific steps are:

(2.1) Search the file signature store (File Signature store) for the file signature obtained in (1).

(2.2) If the file signature obtained in (1) is found in the file signature store, mark the corresponding file in (1) as a duplicate file; the content of this file need not be stored or transmitted.

(2.3) If the file signature obtained in (1) is not found in the file signature store, mark the corresponding file in (1) as a non-duplicate file, and store the file signature obtained in (1) in the file signature store.

(2.4) If the method is combined with an existing data de-duplication method, (2.1), (2.2) and (2.3) can use the file signature store of the existing data de-duplication method.

(3) Remove the duplicate files identified in (2), and build a new compressed package and a compressed package spectrum. The new compressed package is the package rebuilt from the compressed package referred to in (1) after the duplicate files identified in (2.2) have been removed. The compressed package spectrum describes which files make up the compressed package referred to in (1), so that this compressed package can later be restored.

(4) If the method is combined with the system's existing data de-duplication method, proceed to the existing data de-duplication step; otherwise, the de-duplication process ends.
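Steps (1) to (3) can be sketched as follows, assuming ZIP compressed packages, an in-memory set as the file signature store, and a JSON list as the compressed package spectrum. None of these choices comes from the source; the names `signature_of`, `deduplicate`, and `make_package` and the spectrum layout are hypothetical.

```python
import io
import json
import zipfile

def signature_of(info: zipfile.ZipInfo) -> str:
    """Step (1.1): CRC-32 plus uncompressed size, taken from ZIP metadata."""
    return f"{info.CRC:08x}:{info.file_size}"

def deduplicate(package: bytes, store: set[str]) -> tuple[bytes, str]:
    """Return (new_package, spectrum_json); `store` is updated in place."""
    spectrum = []
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package)) as src:
        with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
            for info in src.infolist():
                sig = signature_of(info)
                duplicate = sig in store                      # step (2.1)
                # The spectrum lists every file of the original package.
                spectrum.append({"name": info.filename,
                                 "signature": sig,
                                 "duplicate": duplicate})
                if duplicate:
                    continue                                  # step (2.2): skip content
                store.add(sig)                                # step (2.3)
                dst.writestr(info.filename, src.read(info.filename))
    return out.getvalue(), json.dumps(spectrum)               # step (3)

def make_package(files: dict[str, bytes]) -> bytes:
    """Build an in-memory ZIP package for the demonstration."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()

store: set[str] = set()
package = make_package({"a.txt": b"alpha" * 50,
                        "b.txt": b"alpha" * 50,
                        "c.txt": b"gamma"})
new_package, spectrum = deduplicate(package, store)
kept = zipfile.ZipFile(io.BytesIO(new_package)).namelist()
```

Restoring the original package would additionally need a way to fetch file content by signature, which this sketch leaves out.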

Brief description of the drawings

Fig. 1 is a schematic diagram of the module structure of the present invention.

Fig. 2 is a flow diagram of the present invention.

Detailed description of the invention

The entities involved in the present invention are a client and a storage server. There are two embodiments: 1) Independent processing mode: the client transmits the compressed file to the storage server, and the storage server processes the compressed file. 2) Collaborative mode: the client and the storage server cooperate to process the compressed file.

Fig. 1 is a schematic diagram of the module structure of the present invention. It mainly comprises five parts: a file signature extraction module 101, a duplicate file identification module 102, a file signature store management module 103, a compressed package and compressed package spectrum building module 104, and a data de-duplication module 105. The file signature extraction module 101 extracts the file signature of each compressed file in the compressed package. The duplicate file identification module 102 finds duplicate files by querying the file signature store. The file signature store management module 103 manages the file signature store in which file signatures are kept. The compressed package and compressed package spectrum building module 104 builds the new compressed package, which contains no duplicate files, and the corresponding compressed package spectrum. The data de-duplication module 105 is the system's existing module for deleting duplicate data. In the independent processing mode, all of the above modules reside on the storage server. In the collaborative mode, the file signature extraction module 101 and the compressed package and compressed package spectrum building module 104 reside on the client, the duplicate file identification module 102 and the file signature store management module 103 reside on the storage server, and the data de-duplication module 105 can reside on either the client or the storage server, depending on how the existing system is arranged.
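The collaborative mode can be sketched with two small classes, assuming ZIP compressed packages and a direct method call standing in for the network exchange. The class and method names (`StorageServer`, `Client`, `identify_duplicates`, `upload`) are hypothetical. The point of the split is that only signatures reach the server before the client decides which file contents to send.

```python
import io
import zipfile

class StorageServer:
    """Hosts duplicate identification (102) and signature store management (103)."""

    def __init__(self) -> None:
        self._signatures: set[tuple[int, int]] = set()

    def identify_duplicates(self, signatures: dict[str, tuple[int, int]]) -> set[str]:
        """Return the names whose signature is already known; record the rest."""
        duplicates = set()
        for name, sig in signatures.items():
            if sig in self._signatures:
                duplicates.add(name)
            else:
                self._signatures.add(sig)
        return duplicates

class Client:
    """Hosts signature extraction (101) and package rebuilding (104)."""

    def __init__(self, server: StorageServer) -> None:
        self.server = server

    def upload(self, package: bytes) -> bytes:
        """Return the reduced package that would actually be transmitted."""
        with zipfile.ZipFile(io.BytesIO(package)) as zf:
            # Module 101: read (CRC-32, size) from the ZIP metadata.
            signatures = {i.filename: (i.CRC, i.file_size) for i in zf.infolist()}
            duplicates = self.server.identify_duplicates(signatures)
            # Module 104: rebuild the package without duplicate files.
            out = io.BytesIO()
            with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as new_zf:
                for name in signatures:
                    if name not in duplicates:
                        new_zf.writestr(name, zf.read(name))
        return out.getvalue()

def make_package(files: dict[str, bytes]) -> bytes:
    """Build an in-memory ZIP package for the demonstration."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()

client = Client(StorageServer())
package = make_package({"report.txt": b"data" * 100, "notes.txt": b"notes"})
first = zipfile.ZipFile(io.BytesIO(client.upload(package))).namelist()
second = zipfile.ZipFile(io.BytesIO(client.upload(package))).namelist()
```

The first upload sends both files; the second upload of the same package sends none, because the server already holds both signatures.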

Fig. 2 is a flow diagram of the present invention, in which a cyclic redundancy check (CRC) code is used as the data integrity check code of the compressed file. The steps are as follows:

(1) Read the compressed package file.

(2) Examine one compressed file in the compressed package; the file signature extraction module 101 extracts the file signature of this compressed file. The specific steps are:

(2.1) Check whether this compressed file has a CRC code. If it does, use this CRC code as the file signature; if it does not, compute the CRC code of this compressed file (that is, of the original file before compression) and use it as the file signature. In addition, where collision-rate requirements demand it, other file attributes such as file size can also be extracted, and the file size and the CRC code together serve as the file signature.

(2.2) If the method is combined with an existing data de-duplication method, a hash algorithm can also be used to compute the file's hash value as the file signature of this compressed file.

(3) The duplicate file identification module 102 searches the file signature store for the file signature obtained in step (2). If an identical file signature is found, the compressed file examined in step (2) is marked as a duplicate file; otherwise, it is marked as a non-duplicate file.

(4) The file signature store management module 103 writes the file signature of the non-duplicate file marked in step (3) to the file signature store.

(5) Check whether the compressed package still contains compressed files that have not been examined. If so, go to step (2); otherwise, proceed to the next step.

(6) The compressed package and compressed package spectrum building module 104 rebuilds the compressed package and the corresponding compressed package spectrum, removing the duplicate files marked in step (3).

(7) If the system has an existing data de-duplication method, the existing data de-duplication module 105 processes the new compressed package and compressed package spectrum obtained in step (6); otherwise, the de-duplication process for the compressed file ends.
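The branch in step (2.1) can be sketched as follows. The sketch assumes single-member gzip files, whose last eight bytes hold the CRC-32 and the size (modulo 2^32) of the uncompressed data as specified in RFC 1952, and treats bzip2 as an example of a format for which this sketch does not read a stored check code and therefore recomputes one. The function name `file_signature` is hypothetical.

```python
import bz2
import gzip
import struct
import zlib

GZIP_MAGIC = b"\x1f\x8b"

def file_signature(compressed: bytes) -> tuple[int, int]:
    """Return (CRC-32, length) of the original, uncompressed content."""
    if compressed[:2] == GZIP_MAGIC:
        # Check code exists: read CRC-32 and size from the gzip trailer
        # (little-endian, last 8 bytes) without decompressing anything.
        crc, size = struct.unpack("<II", compressed[-8:])
        return crc, size
    # No check code this sketch can read: decompress (assumed bzip2 here)
    # and compute the CRC-32 of the original file.
    original = bz2.decompress(compressed)
    return zlib.crc32(original), len(original)

data = b"identical original content " * 50
sig_gzip = file_signature(gzip.compress(data))
sig_bzip2 = file_signature(bz2.compress(data))
```

Both calls yield the same (CRC, size) pair, so the two differently compressed copies would be recognized as duplicates.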

Claims

1. A data de-duplication method for lossless compressed files, characterized by the following steps:

(1) Extract the file signature (File Signature) of each compressed file in the compressed package. The specific steps are:

(1.1) Extract the data integrity check code that each compressed file already contains, such as a cyclic redundancy check (CRC) code, as the file signature. Where collision-rate requirements demand it, other file attributes such as file size can also be extracted, and the file size and the check code together serve as the file signature.

(1.2) If the compressed file itself has no check code, compute the check code of the compressed file (that is, of the original file before compression) and use it as the file signature.

(1.3) Instead of a check code, a hash algorithm such as MD5 or SHA-1 can also be used to compute the hash value of the compressed file (that is, of the original file before compression) as the file signature.

(2) If the file signatures of two files are identical, mark the file as a duplicate file; otherwise mark it as a non-duplicate file. The specific steps are:

(2.1) Search the file signature store (File Signature store) for the file signature obtained in (1).

(2.2) If the file signature obtained in (1) is found in the file signature store, mark the corresponding file in (1) as a duplicate file; the content of this file need not be stored or transmitted.

(2.3) If the file signature obtained in (1) is not found in the file signature store, mark the corresponding file in (1) as a non-duplicate file, and store the file signature obtained in (1) in the file signature store.

(2.4) If the method is combined with an existing data de-duplication method, (2.1), (2.2) and (2.3) can use the file signature store of the existing data de-duplication method.

(3) Remove the duplicate files identified in (2), and build a new compressed package and a compressed package spectrum. The new compressed package is the package rebuilt from the compressed package referred to in (1) after the duplicate files identified in (2.2) have been removed. The compressed package spectrum describes which files make up the compressed package referred to in (1), so that this compressed package can later be restored.

(4) If the method is combined with the system's existing data de-duplication method, proceed to the existing data de-duplication step; otherwise, the de-duplication process ends.