CN105912622A - Data de-duplication method for lossless compressed files - Google Patents
Data de-duplication method for lossless compressed files
- Publication number: CN105912622A
- Application number: CN201610213219.9A
- Authority: CN (China)
- Prior art keywords: file, compressed, signature, data, file signature
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/174—Redundancy elimination performed by the file system
- G06F16/1748—De-duplication implemented within the file system, e.g. based on file segments
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a data de-duplication method for lossless compressed files. The method uses a data integrity check code already present in a compressed file, such as a cyclic redundancy check (CRC) code, as a file signature to identify duplicate compressed files. Depending on the collision-rate requirement, other file attributes such as file length can also be extracted, and the file length and check code together serve as the file signature for identifying duplicate files. If a compressed file has no check code, the check code is obtained by calculation, or a hash value computed with a hash algorithm serves as the file signature. The method can be integrated with conventional data de-duplication techniques and fills the technical gap that data de-duplication cannot be performed on compressed files.
Description
Technical field
The invention belongs to the field of computer information storage technology and specifically relates to a data de-duplication method for lossless compressed files.
Background
In the field of information storage, data compression is a common means of reducing data volume. Data compression comprises lossless compression and lossy compression. Lossless compression gathers statistics on the redundancy of the original data within a certain range and uses that statistical information to re-encode the data, removing redundant data and thereby reducing data volume. Different lossless compression algorithms use different encoding methods. Lossless compression is widely used for data that must be stored exactly, such as text data, programs, and image data for particular applications. Lossy compression, by contrast, exploits the insensitivity of human vision and hearing to certain frequency components in images and sound, and allows some information to be lost during compression. Lossy compression is widely used for voice, image, and video data.
With the explosive growth of data volume, data de-duplication has emerged over the past decade as another powerful tool, besides data compression, for reducing data volume. Data de-duplication divides a data stream into blocks and reduces data volume by finding and removing redundant data blocks. In general, the average data block size is between 4 KB and 32 KB, or larger. Data de-duplication is an emerging lossless compression method used mainly in large-scale storage systems; it makes up for the inability of traditional data compression methods to remove redundant data at the data block level.
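The block-level de-duplication described above can be illustrated with a minimal sketch. It assumes fixed-size 4 KB chunks and SHA-256 digests, neither of which is specified by the source; the names `dedupe_chunks`, `rebuild`, `store`, and `recipe` are hypothetical.

```python
import hashlib


def dedupe_chunks(stream: bytes, chunk_size: int = 4096):
    """Split a byte stream into fixed-size chunks and keep each distinct chunk once."""
    store = {}    # chunk digest -> chunk bytes (each distinct chunk stored once)
    recipe = []   # ordered digests needed to rebuild the original stream
    for offset in range(0, len(stream), chunk_size):
        chunk = stream[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)
        recipe.append(digest)
    return store, recipe


def rebuild(store, recipe) -> bytes:
    """Reassemble the original stream from the chunk store and the recipe."""
    return b"".join(store[digest] for digest in recipe)


# Hypothetical input: four 4 KB chunks, three of which carry the same bytes.
data = b"A" * 8192 + b"B" * 4096 + b"A" * 4096
store, recipe = dedupe_chunks(data)
restored = rebuild(store, recipe)
```

Here four chunks are referenced but only two are stored. Production systems often use content-defined rather than fixed-size chunking; that choice is outside the scope of this sketch.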
In existing storage systems, data de-duplication has become a routine functional component. For compressed files (files compressed with a traditional compression method), however, data de-duplication cannot find the potential redundant data. The main reason is that identical or similar files, once compressed with different compression algorithms, yield data streams completely different from the original files, so data de-duplication cannot identify their potential redundancy. Yet in storage systems many files are transmitted and stored in compressed form to reduce data volume, so the inability to identify potential redundant data in compressed files is a major shortcoming of data de-duplication. The invention proposes a data de-duplication method for lossless compressed files that can effectively identify and remove redundant data in compressed files, filling the technical gap that compressed files cannot be de-duplicated.
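A small experiment makes the problem concrete. The sketch below compresses the same content with two algorithms from the Python standard library (DEFLATE via `zlib`, and bzip2); the sample content and variable names are illustrative assumptions.

```python
import bz2
import zlib

# Hypothetical file content, compressed with two different algorithms.
original = b"the same file content, repeated many times\n" * 200

deflated = zlib.compress(original)
bzipped = bz2.compress(original)

# The compressed byte streams differ from each other, so block-level
# de-duplication over the compressed bytes finds nothing in common.
streams_differ = deflated != bzipped

# A checksum of the uncompressed content, however, is the same for both.
crc_from_deflate = zlib.crc32(zlib.decompress(deflated))
crc_from_bzip2 = zlib.crc32(bz2.decompress(bzipped))
same_content = crc_from_deflate == crc_from_bzip2
```

This is why a signature tied to the original content, rather than to the compressed bytes, can reveal duplicates that compression hides.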
Summary of the invention
The invention proposes a data de-duplication method for lossless compressed files. The method uses the data integrity check code that already exists in a compressed file, such as a cyclic redundancy check (CRC) code, as a file signature to identify duplicate compressed files: if two files have the same check code, the two compressed files are considered identical, and the duplicate is removed. Depending on the collision-rate requirement, other file attributes such as file length can also be extracted, and the file length and check code together serve as the file signature for identifying duplicate files. If a compressed file has no check code, the check code is obtained by calculation, or a hash value computed with a hash algorithm is used as the file signature. The method can also be combined with existing data de-duplication techniques to strengthen their de-duplication effect, filling the technical gap that compressed files cannot be de-duplicated.
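The collision-rate requirement mentioned above can be put in rough numbers with the standard birthday-bound estimate. The sketch assumes signatures behave like uniformly random values, which a CRC does not guarantee, and it treats "CRC plus file length" as if it were a uniform 64-bit signature, which is optimistic because real file lengths are far from uniform. The function name and the file count are hypothetical.

```python
import math


def collision_probability(num_files: int, signature_bits: int) -> float:
    """Birthday-bound estimate that at least two of num_files signatures collide."""
    space = 2 ** signature_bits
    exponent = -num_files * (num_files - 1) / (2 * space)
    return -math.expm1(exponent)   # equals 1 - exp(exponent), accurate for tiny values


# Assumption: about 77,000 distinct files in the signature store.
crc_only = collision_probability(77_163, 32)          # roughly one half
crc_plus_length = collision_probability(77_163, 64)   # many orders of magnitude smaller
```

Under these assumptions a 32-bit check code alone becomes risky well below a hundred thousand distinct files, which is a plausible motivation for adding file length or switching to a longer hash.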
The invention proposes a data de-duplication method for lossless compressed files. The specific steps are:
(1) Extract the file signature of each compressed file in a compressed package, as follows:
(1.1) Extract the data integrity check code already present in each compressed file, such as a cyclic redundancy check (CRC) code, as the file signature. Depending on the collision-rate requirement, other file attributes such as file length can also be extracted, with the file length and check code together serving as the file signature.
(1.2) If the compressed file itself has no check code, calculate the check code of the compressed file (that is, of the original file before compression) as the file signature.
(1.3) Instead of a check code, a hash algorithm such as MD5 or SHA-1 can be used to calculate the hash value of the compressed file (the original file before compression) as the file signature.
(2) If two files have the same file signature, mark the file as a duplicate file; otherwise mark it as a non-duplicate file, as follows:
(2.1) Search the file signature store for the file signature obtained in (1).
(2.2) If the file signature obtained in (1) is found in the file signature store, mark the corresponding file in (1) as a duplicate file; its content does not need to be stored or transmitted.
(2.3) If the file signature obtained in (1) is not found in the file signature store, mark the corresponding file in (1) as a non-duplicate file and save the file signature obtained in (1) to the file signature store.
(2.4) If the method is combined with an existing data de-duplication method, (2.1), (2.2), and (2.3) can use the file signature store of that method.
(3) Remove the duplicate files identified in (2) and build a new compressed package and a compressed package spectrum. The new compressed package is the package rebuilt after removing, from the compressed package referred to in (1), the duplicate files identified in (2.2). The compressed package spectrum records which files make up the compressed package referred to in (1), so that the package can be restored later.
(4) If the method is combined with the system's existing data de-duplication method, proceed to the existing data de-duplication step; otherwise, the de-duplication process ends.
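Steps (1) to (3) can be sketched for ZIP packages, whose directory already records the CRC-32 and uncompressed size of every member. This is an illustrative sketch under stated assumptions, not the patent's implementation: the choice of ZIP, the in-memory `set` used as the signature store, the layout of the spectrum entries, and the names `dedupe_package` and `make_package` are all hypothetical.

```python
import io
import zipfile


def dedupe_package(package_bytes: bytes, signature_store: set):
    """Return (new_package_bytes, spectrum) for one ZIP package.

    signature_store is a set of (crc32, size) tuples shared across packages.
    """
    spectrum = []                  # one entry per member of the original package
    out_buf = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as src, \
         zipfile.ZipFile(out_buf, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            # Step (1): read the stored CRC-32 and size; nothing is decompressed.
            signature = (info.CRC, info.file_size)
            # Step (2): look the signature up in the signature store.
            duplicate = signature in signature_store
            if not duplicate:
                signature_store.add(signature)
                # Step (3): only non-duplicate members go into the new package.
                dst.writestr(info.filename, src.read(info))
            spectrum.append({"name": info.filename,
                             "signature": signature,
                             "stored": not duplicate})
    return out_buf.getvalue(), spectrum


def make_package(files: dict) -> bytes:
    """Helper for the demo: build a ZIP package in memory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()


store = set()
first = make_package({"a.txt": b"alpha" * 100, "b.txt": b"beta" * 100})
second = make_package({"c.txt": b"alpha" * 100, "d.txt": b"delta" * 100})
new_first, spectrum_first = dedupe_package(first, store)
new_second, spectrum_second = dedupe_package(second, store)
kept_in_second = zipfile.ZipFile(io.BytesIO(new_second)).namelist()
```

In the demo, `c.txt` has the same content as `a.txt`, so it is dropped from the second package while its spectrum entry remains. Restoring the original package would additionally need an index from signature to stored content, which this sketch does not include.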
Brief description of the drawings
Fig. 1 is a schematic diagram of the modular structure of the invention;
Fig. 2 is a schematic flow chart of the invention.
Detailed description of the invention
The entities involved in the invention are a client and a storage server. There are two embodiments: 1) independent processing mode: the client transmits compressed files to the storage server, and the storage server processes them; 2) collaborative processing mode: the client and the storage server cooperate to process compressed files.
Fig. 1 is a schematic diagram of the modular structure of the invention. It comprises five parts: file signature extraction module 101, duplicate file identification module 102, file signature store management module 103, compressed package and compressed package spectrum construction module 104, and data de-duplication module 105. File signature extraction module 101 extracts the file signature of each compressed file in a compressed package. Duplicate file identification module 102 finds duplicate files by querying the file signature store. File signature store management module 103 manages the file signature store that holds the file signatures. Compressed package and compressed package spectrum construction module 104 builds the new compressed package, which contains no duplicate files, and the corresponding compressed package spectrum. Data de-duplication module 105 is the system's existing module for deleting duplicate data. In the independent processing mode, all of the above modules reside on the storage server. In the collaborative processing mode, file signature extraction module 101 and construction module 104 reside on the client, duplicate file identification module 102 and file signature store management module 103 reside on the storage server, and the data de-duplication module may reside on either the client or the storage server, following the system's existing arrangement.
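The collaborative arrangement can be sketched as two cooperating objects. Everything here is an assumption made for illustration: the class and method names, the use of a CRC-32 plus length computed over raw bytes as a stand-in for signature extraction, and the simplification that the server keeps file content in the same dictionary as the signatures.

```python
import zlib


class StorageServer:
    """Hosts duplicate identification (102) and signature store management (103)."""

    def __init__(self):
        self.signature_store = {}   # signature -> file content (simplified)

    def find_missing(self, signatures):
        """Return the signatures the server has not seen yet."""
        return [s for s in signatures if s not in self.signature_store]

    def upload(self, signature, content):
        self.signature_store[signature] = content


class Client:
    """Hosts signature extraction (101) and package/spectrum construction (104)."""

    def __init__(self, server):
        self.server = server

    def send(self, files: dict):
        # Module 101: one signature per file (CRC-32 and length, as a stand-in).
        signatures = {name: (zlib.crc32(data), len(data))
                      for name, data in files.items()}
        missing = set(self.server.find_missing(signatures.values()))
        sent = 0
        for name, data in files.items():
            if signatures[name] in missing:
                self.server.upload(signatures[name], data)
                missing.discard(signatures[name])   # avoid re-sending within a batch
                sent += 1
        # Module 104: the spectrum lists every file, transmitted or not.
        spectrum = [(name, signatures[name]) for name in files]
        return spectrum, sent


server = StorageServer()
client = Client(server)
spectrum_1, sent_1 = client.send({"a": b"x" * 10, "b": b"y" * 10})
spectrum_2, sent_2 = client.send({"c": b"x" * 10, "d": b"z" * 10})
```

In the second call only `d` is transmitted, because the signature of `c` is already known to the server. The point of the split is that only signatures cross the network for duplicate files.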
Fig. 2 is a schematic flow chart of the invention, in which the cyclic redundancy check (CRC) code is used as the data integrity check code of the compressed file. The steps are:
(1) Read the compressed package file;
(2) Examine one compressed file in the compressed package; file signature extraction module 101 extracts the file signature of this compressed file, as follows:
(2.1) Check whether the compressed file has a CRC code. If it does, use this CRC code as the file signature; if it does not, recalculate the CRC code of the compressed file (the original file before compression) as the file signature. In addition, depending on the collision-rate requirement, other file attributes such as file length can also be extracted, with the file length and CRC code together serving as the file signature;
(2.2) If the method is combined with an existing data de-duplication method, a hash algorithm can also be used to calculate the file's hash value as the file signature of the compressed file;
(3) Duplicate file identification module 102 searches the file signature store for the file signature obtained in step (2). If an identical file signature is found, the compressed file examined in step (2) is marked as a duplicate file; otherwise, it is marked as a non-duplicate file;
(4) File signature store management module 103 writes the file signature of the non-duplicate file marked in step (3) to the file signature store;
(5) Check whether the compressed package still contains unexamined compressed files. If so, go to step (2); otherwise, go to the next step;
(6) Compressed package and compressed package spectrum construction module 104 rebuilds the compressed package and the corresponding compressed package spectrum, removing the duplicate files marked in step (3);
(7) If the system has an existing data de-duplication method, existing data de-duplication module 105 processes the new compressed package and compressed package spectrum obtained in (6); otherwise, the de-duplication process for compressed files ends.
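Step (2.1) can be sketched with two formats from the Python standard library. A single-member gzip file ends with an 8-byte trailer holding the CRC-32 and the length (modulo 2^32) of the uncompressed data, so its signature can be read directly. A zlib stream carries an Adler-32 instead of a CRC-32, so it serves here as the "no CRC code" case in which the CRC must be recalculated. The function names and the two-format dispatch are assumptions made for this sketch.

```python
import gzip
import struct
import zlib


def signature_from_gzip(blob: bytes):
    """Read (CRC-32, length mod 2**32) from the trailer of a single-member gzip file."""
    crc, isize = struct.unpack("<II", blob[-8:])
    return crc, isize


def signature_by_computation(original: bytes):
    """Fallback: compute the CRC-32 over the original (uncompressed) file."""
    return zlib.crc32(original), len(original) % 2**32


def extract_signature(blob: bytes):
    if blob[:2] == b"\x1f\x8b":          # gzip magic number
        return signature_from_gzip(blob)
    # Assumption for this sketch: anything else is a zlib stream, which has
    # no CRC-32 of its own, so the file is decompressed and the CRC computed.
    return signature_by_computation(zlib.decompress(blob))


# Hypothetical content stored once as gzip and once as a zlib stream.
content = b"report data\n" * 1000
sig_gzip = extract_signature(gzip.compress(content))
sig_zlib = extract_signature(zlib.compress(content))
```

Both paths yield the same (CRC, length) pair for the same original content, which is what lets steps (3) and (4) treat the two files as duplicates regardless of how each was compressed.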
Claims (1)
1. The invention proposes a data de-duplication method for lossless compressed files. The specific steps are:
(1) Extract the file signature of each compressed file in a compressed package, as follows:
(1.1) Extract the data integrity check code already present in each compressed file, such as a cyclic redundancy check (CRC) code, as the file signature. Depending on the collision-rate requirement, other file attributes such as file length can also be extracted, with the file length and check code together serving as the file signature.
(1.2) If the compressed file itself has no check code, calculate the check code of the compressed file (the original file before compression) as the file signature.
(1.3) Instead of a check code, a hash algorithm such as MD5 or SHA-1 can be used to calculate the hash value of the compressed file (the original file before compression) as the file signature.
(2) If two files have the same file signature, mark the file as a duplicate file; otherwise mark it as a non-duplicate file, as follows:
(2.1) Search the file signature store for the file signature obtained in (1).
(2.2) If the file signature obtained in (1) is found in the file signature store, mark the corresponding file in (1) as a duplicate file; its content does not need to be stored or transmitted.
(2.3) If the file signature obtained in (1) is not found in the file signature store, mark the corresponding file in (1) as a non-duplicate file and save the file signature obtained in (1) to the file signature store.
(2.4) If the method is combined with an existing data de-duplication method, (2.1), (2.2), and (2.3) can use the file signature store of that method.
(3) Remove the duplicate files identified in (2) and build a new compressed package and a compressed package spectrum. The new compressed package is the package rebuilt after removing, from the compressed package referred to in (1), the duplicate files identified in (2.2). The compressed package spectrum records which files make up the compressed package referred to in (1), so that the package can be restored later.
(4) If the method is combined with the system's existing data de-duplication method, proceed to the existing data de-duplication step; otherwise, the de-duplication process ends.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610213219.9A CN105912622A (en) | 2016-04-05 | 2016-04-05 | Data de-duplication method for lossless compressed files |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105912622A (en) | 2016-08-31 |
Family
ID=56745477
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610213219.9A Pending CN105912622A (en) | 2016-04-05 | 2016-04-05 | Data de-duplication method for lossless compressed files |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105912622A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20160831 |