CN105912622A - Data de-duplication method for lossless compressed files - Google Patents

Info

Publication number
CN105912622A
Authority
CN
China
Prior art keywords
file
compressed
signature
data
file signature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610213219.9A
Other languages
Chinese (zh)
Inventor
谭玉娟
晏志超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University
Priority to CN201610213219.9A
Publication of CN105912622A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/17: Details of further file system functions
    • G06F16/174: Redundancy elimination performed by the file system
    • G06F16/1748: De-duplication implemented within the file system, e.g. based on file segments

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data de-duplication method for losslessly compressed files. The method uses the data integrity check code that a compressed file already carries, such as a cyclic redundancy check (CRC) code, as a file signature to identify duplicate compressed files. Where the required collision rate demands it, other file attributes such as file length can also be extracted, so that the file length and the check code together serve as the file signature. If a compressed file has no check code, either a check code is computed or a hash value is computed with a hash algorithm, and the result serves as the file signature. The method can be integrated with conventional data de-duplication techniques and fills the technical gap that data de-duplication cannot be performed on compressed files.
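The abstract's remark about the collision rate can be made concrete with a rough estimate. The following is a minimal sketch, assuming signatures are uniformly distributed over their bit width (an idealization that the patent does not state); the function name `collision_probability` and the file count of 10,000 are illustrative choices, not values from the source.

```python
import math

def collision_probability(num_files: int, signature_bits: int) -> float:
    """Birthday-bound estimate that at least two of num_files distinct
    files end up with the same signature (assumes uniform signatures)."""
    pairs = num_files * (num_files - 1) / 2
    # 1 - exp(-pairs / 2^bits), computed with expm1 to stay accurate near zero
    return -math.expm1(-pairs / 2.0 ** signature_bits)

# Hypothetical workload: 10,000 distinct files
p_crc32 = collision_probability(10_000, 32)    # a 32-bit check code alone
p_sha1 = collision_probability(10_000, 160)    # a 160-bit hash such as SHA-1
print(f"32-bit signature:  {p_crc32:.4f}")
print(f"160-bit signature: {p_sha1:.3e}")
```

Under this idealized model a 32-bit check code alone gives a noticeable collision probability at this scale, which is one way to read the suggestion to add file length to the signature or to switch to a hash value.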

Description

A data de-duplication method for losslessly compressed files
Technical field
The invention belongs to the field of computer information storage technology, and specifically relates to a data de-duplication method for losslessly compressed files.
Background
In the field of information storage, data compression is a common means of reducing data volume. Data compression comprises lossless compression and lossy compression. Lossless compression gathers statistics on the redundancy of the original data within a certain range and uses those statistics to re-encode the original data, removing redundant data and thereby reducing data volume. Different lossless compression algorithms use different coding methods. Lossless compression is widely used for data that must be stored exactly, such as text data, programs, and image data in particular applications. Lossy compression, by contrast, exploits the fact that human vision and hearing are insensitive to certain frequency components in images and sound, and allows some information to be lost during compression. Lossy compression is widely used for voice, image, and video data.
With the explosive growth of data volume, data de-duplication has emerged over the last decade as another powerful tool for reducing data volume, alongside data compression. Data de-duplication divides a data stream into blocks and reduces data volume by finding and removing redundant blocks. In general, the average block size is between 4 KB and 32 KB, or even larger. Data de-duplication is an emerging lossless data reduction method used mainly in large-scale storage systems; it makes up for the inability of traditional data compression methods to remove redundant data at the data-block level.
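Block-level de-duplication as described in the background can be sketched as follows. This is a minimal illustration that assumes fixed-size 4 KB blocks and SHA-1 fingerprints; the patent does not say how existing systems choose block boundaries or fingerprints, and the names `dedup_blocks`, `store`, and `recipe` are hypothetical.

```python
import hashlib

def dedup_blocks(stream: bytes, block_size: int = 4096):
    """Split a stream into fixed-size blocks and keep each distinct block once."""
    store = {}    # fingerprint -> block content (each distinct block stored once)
    recipe = []   # ordered fingerprints, enough to rebuild the original stream
    for offset in range(0, len(stream), block_size):
        block = stream[offset:offset + block_size]
        fingerprint = hashlib.sha1(block).hexdigest()
        store.setdefault(fingerprint, block)   # keep only the first copy
        recipe.append(fingerprint)
    return store, recipe

# Three identical blocks followed by one different block
data = b"A" * 4096 * 3 + b"B" * 4096
store, recipe = dedup_blocks(data)
rebuilt = b"".join(store[fp] for fp in recipe)
print(len(recipe), "blocks referenced,", len(store), "blocks stored")
```

Only two of the four blocks need to be stored here, and the recipe is enough to rebuild the original stream.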
In existing storage systems, data de-duplication is routinely included as a standard functional component. However, for compressed files (that is, files compressed with a traditional compression method), data de-duplication cannot find the potential redundant data. The main reason is that identical or similar files, once compressed with different compression algorithms, yield data streams that are completely different from the original files, so de-duplication cannot recognize their potential redundancy. Moreover, to reduce data volume, many files in a storage system are transmitted and stored in compressed form, so the inability to identify redundant data in compressed files is a major shortcoming of data de-duplication. The present invention proposes a data de-duplication method for losslessly compressed files that can efficiently identify and remove redundant data in compressed files, filling the technical gap that compressed files cannot be de-duplicated.
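The problem described above can be reproduced in a few lines. The sketch below compresses the same content with two different zlib compression levels, which stands in for the "different compression algorithms" of the text (an assumption made to keep the example within the standard library). The compressed streams differ, while the CRC-32 of the original content is the same for both.

```python
import zlib

original = b"the same original content, repeated many times. " * 200

# Same input, two different compression settings
fast = zlib.compress(original, 1)
best = zlib.compress(original, 9)

# Byte-level comparison of the compressed streams fails to see the duplication
streams_differ = fast != best

# A check code over the original (pre-compression) content still matches
crc_fast = zlib.crc32(zlib.decompress(fast))
crc_best = zlib.crc32(zlib.decompress(best))

print("compressed streams differ:", streams_differ)
print("CRC-32 of original content matches:", crc_fast == crc_best)
```

A de-duplicator that compares only the compressed bytes sees two unrelated streams here, whereas a signature tied to the original content identifies them as the same file.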
Summary of the invention
The present invention proposes a data de-duplication method for losslessly compressed files. The method uses the data integrity check code that a compressed file already carries, such as a cyclic redundancy check (CRC) code, as a file signature to identify duplicate compressed files: if the check codes of two files are identical, the two compressed files are considered identical, and the duplicate is removed. Where the required collision rate demands it, other file attributes such as file length can also be extracted, so that the file length and the check code together serve as the file signature. If a compressed file has no check code, either a check code is computed or a hash value is computed with a hash algorithm, and the result serves as the file signature for identifying duplicate files. The method can also be combined with existing data de-duplication techniques to strengthen their effect, filling the technical gap that compressed files cannot be de-duplicated.
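As one illustration of a check code that a compressed file already carries, ZIP archives record a CRC-32 and the uncompressed size for every member. The sketch below reads both through Python's `zipfile` module without decompressing anything. ZIP is used here only as an example container (the patent does not name a format), and `zip_signatures` is a hypothetical helper name.

```python
import io
import zipfile

def zip_signatures(archive_bytes: bytes):
    """Return {member name: (crc32, uncompressed size)} read from the
    archive's metadata, without decompressing any member."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        return {info.filename: (info.CRC, info.file_size) for info in zf.infolist()}

# Build a small in-memory archive containing one duplicated file
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("a.txt", b"hello world" * 100)
    zf.writestr("copy_of_a.txt", b"hello world" * 100)
    zf.writestr("b.txt", b"something else")

sigs = zip_signatures(buf.getvalue())
for name, sig in sigs.items():
    print(name, sig)
```

The two copies of the same content share one (CRC, length) signature, so they can be recognized as duplicates from metadata alone.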
The specific steps of the method are as follows:
(1) Extract the file signature of each compressed file in the compressed package (archive). Specifically:
(1.1) Extract the data integrity check code that each compressed file already carries, such as a cyclic redundancy check (CRC) code, as the file signature. Where the required collision rate demands it, other file attributes such as file length can also be extracted, so that the file length and the check code together serve as the file signature.
(1.2) If a compressed file has no check code, compute the check code of the compressed file (that is, of the original file before compression) as the file signature.
(1.3) Instead of a check code, a hash algorithm such as MD5 or SHA-1 can also be used to compute the hash value of the compressed file (that is, of the original file before compression) as the file signature.
(2) If two files have the same file signature, mark the file as a duplicate; otherwise mark it as a non-duplicate. Specifically:
(2.1) Look up the file signature obtained in (1) in the file signature store.
(2.2) If the file signature obtained in (1) is found in the file signature store, mark the corresponding file in (1) as a duplicate file; its content does not need to be stored or transmitted.
(2.3) If the file signature obtained in (1) is not found in the file signature store, mark the corresponding file in (1) as a non-duplicate file and store its file signature in the file signature store.
(2.4) If the method is combined with an existing data de-duplication method, steps (2.1), (2.2), and (2.3) can use the file signature store of that existing method.
(3) Remove the duplicate files identified in (2), and build a new compressed package and a compressed package manifest (literally, "compressed package spectrum"). The new compressed package is the package rebuilt from the compressed package of (1) after the duplicate files identified in (2.2) have been removed. The manifest records which files the compressed package of (1) consists of, so that the original compressed package can be restored later.
(4) If the method is combined with the system's existing data de-duplication method, proceed to the existing de-duplication step; otherwise, the de-duplication process ends.
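Steps (1) to (3) can be sketched for ZIP archives as follows. This is a minimal sketch under stated assumptions: the signature is the (CRC-32, uncompressed length) pair stored in the archive, the signature store is an in-memory dictionary, and the manifest (the "compressed package spectrum") is a list of records. The names `dedup_archive` and `signature_store` and the manifest layout are hypothetical; the patent does not specify a concrete format.

```python
import io
import zipfile

def dedup_archive(archive_bytes: bytes, signature_store: dict):
    """Return (new archive bytes, manifest) with duplicate members removed."""
    manifest = []              # one record per member of the original archive
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as src, \
         zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            signature = (info.CRC, info.file_size)           # step (1.1)
            duplicate = signature in signature_store         # step (2.1)
            if not duplicate:
                signature_store[signature] = info.filename   # step (2.3)
                dst.writestr(info.filename, src.read(info.filename))
            # step (3): record every member so the original can be restored
            manifest.append({"name": info.filename,
                             "signature": signature,
                             "duplicate": duplicate})
    return out.getvalue(), manifest

# Build an archive with one duplicated file, then de-duplicate it
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("a.txt", b"hello world" * 100)
    zf.writestr("copy_of_a.txt", b"hello world" * 100)
    zf.writestr("b.txt", b"something else")

store = {}
new_bytes, manifest = dedup_archive(buf.getvalue(), store)
with zipfile.ZipFile(io.BytesIO(new_bytes)) as zf:
    kept_names = zf.namelist()
flags = [entry["duplicate"] for entry in manifest]

# Processing the same archive again finds every member in the store
second_bytes, second_manifest = dedup_archive(buf.getvalue(), store)
with zipfile.ZipFile(io.BytesIO(second_bytes)) as zf:
    second_names = zf.namelist()
```

Because the store persists across calls, the second pass over the same archive keeps nothing: every signature is already known, and only the manifest is needed to describe the package.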
Brief description of the drawings
Fig. 1 is a schematic diagram of the module structure of the invention;
Fig. 2 is a schematic flow chart of the invention.
Detailed description
The parties involved in the invention are a client and a storage server. There are two embodiments: 1) standalone processing mode: the client transmits the compressed files to the storage server, and the storage server processes them; 2) collaborative mode: the client and the storage server cooperate to process the compressed files.
Fig. 1 is a schematic diagram of the module structure of the invention. It comprises five parts: a file signature extraction module 101, a duplicate file identification module 102, a file signature store management module 103, a compressed package and manifest building module 104 (the manifest is literally the "compressed package spectrum"), and a data de-duplication module 105. The file signature extraction module 101 extracts the file signature of each compressed file in a compressed package. The duplicate file identification module 102 finds duplicate files by querying the file signature store. The file signature store management module 103 manages the store that holds the file signatures. The compressed package and manifest building module 104 builds the new compressed package, which contains no duplicate files, and the corresponding manifest. The data de-duplication module 105 is the system's existing module for deleting duplicate data. In the standalone processing mode, all of these modules reside on the storage server. In the collaborative mode, the file signature extraction module 101 and the compressed package and manifest building module 104 reside on the client, the duplicate file identification module 102 and the file signature store management module 103 reside on the storage server, and the data de-duplication module 105 resides on either the client or the storage server, depending on how the existing system is set up.
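The division of work in the collaborative mode can be sketched as follows. This is a simplified illustration, assuming the client already holds each file's original content as bytes and talks to the server through a direct method call. The class `SignatureServer`, the function `client_upload`, and the (CRC-32, length) signature are hypothetical choices, not part of the patent text.

```python
import zlib

class SignatureServer:
    """Server side: duplicate identification (102) and signature store management (103)."""

    def __init__(self):
        self._store = set()

    def filter_new(self, signatures):
        """Return the signatures not seen before, and record them in the store."""
        new = []
        for sig in signatures:
            if sig not in self._store:
                self._store.add(sig)
                new.append(sig)
        return new

def client_upload(files: dict, server: SignatureServer):
    """Client side: signature extraction (101) and package/manifest building (104)."""
    # 101: one signature per file, here CRC-32 plus length of the content
    manifest = {name: (zlib.crc32(data), len(data)) for name, data in files.items()}
    # Only signatures travel to the server; it answers which ones are new
    wanted = set(server.filter_new(list(manifest.values())))
    # 104: the new package holds each previously unknown content exactly once
    payload = {}
    for name, data in files.items():
        if manifest[name] in wanted:
            payload[name] = data
            wanted.discard(manifest[name])
    return payload, manifest

server = SignatureServer()
first_payload, first_manifest = client_upload(
    {"a": b"x" * 100, "a2": b"x" * 100, "b": b"y"}, server)
second_payload, _ = client_upload({"c": b"x" * 100, "d": b"z"}, server)
print(list(first_payload), list(second_payload))
```

In this arrangement duplicate content never leaves the client, which matches the statement in step (2.2) of the method that a duplicate file's content does not need to be stored or transmitted.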
Fig. 2 is a schematic flow chart of the invention, in which the cyclic redundancy check (CRC) code serves as the data integrity check code of the compressed files. The flow is as follows:
(1) Read the compressed package file.
(2) Examine one compressed file in the compressed package; the file signature extraction module 101 extracts the file signature of this compressed file. Specifically:
(2.1) Check whether this compressed file has a CRC code. If it does, use the CRC code as the file signature; if it does not, compute the CRC code of this compressed file (that is, of the original file before compression) as the file signature. In addition, where the required collision rate demands it, other file attributes such as file length can also be extracted, so that the file length and the CRC code together serve as the file signature.
(2.2) If the method is combined with an existing data de-duplication method, a hash algorithm can instead be used to compute the file's hash value as the file signature of this compressed file.
(3) The duplicate file identification module 102 looks up the file signature obtained in step (2) in the file signature store. If an identical file signature is found, the compressed file examined in step (2) is marked as a duplicate file; otherwise, it is marked as a non-duplicate file.
(4) The file signature store management module 103 writes the file signatures of the non-duplicate files marked in step (3) into the file signature store.
(5) Check whether the compressed package still contains unexamined compressed files. If so, go to step (2); otherwise, proceed to the next step.
(6) The compressed package and manifest building module 104 rebuilds the compressed package and the corresponding manifest (literally, "compressed package spectrum"), removing the duplicate files marked in step (3).
(7) If the system has an existing data de-duplication method, the existing data de-duplication module 105 processes the new compressed package and manifest obtained in step (6); otherwise, the de-duplication process for compressed files ends.
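The flow of Fig. 2, including the fallback when no check code is present, can be sketched as follows. This is a minimal sketch that assumes each entry of a package is given as a (name, original content, stored CRC or None) tuple, with CRC-32 as the check code and SHA-1 as the optional hash. The function names and the tuple layout are hypothetical.

```python
import hashlib
import zlib
from typing import Optional

def file_signature(content: bytes, stored_crc: Optional[int] = None,
                   with_length: bool = True, use_hash: bool = False):
    """Step (2): choose the signature for one file of the package."""
    if use_hash:                                   # step (2.2): hash value
        return ("sha1", hashlib.sha1(content).hexdigest())
    # step (2.1): reuse the stored CRC if present, otherwise compute it
    crc = stored_crc if stored_crc is not None else zlib.crc32(content)
    return ("crc32", crc, len(content)) if with_length else ("crc32", crc)

def process_package(entries, store: set):
    """Steps (2)-(6) for one package; returns (kept files, manifest)."""
    kept, manifest = [], []
    for name, content, stored_crc in entries:      # the loop is step (5)
        sig = file_signature(content, stored_crc)
        duplicate = sig in store                   # step (3): look up
        if not duplicate:
            store.add(sig)                         # step (4): record
            kept.append((name, content))
        manifest.append((name, sig, duplicate))
    return kept, manifest                          # step (6): new package + manifest

entries = [
    ("a", b"data", zlib.crc32(b"data")),   # carries a check code
    ("b", b"data", None),                  # no check code: it is computed
    ("c", b"other", None),
]
kept, manifest = process_package(entries, set())
kept_names = [name for name, _ in kept]
flags = [dup for _, _, dup in manifest]
print(kept_names, flags)
```

A stored CRC and a recomputed CRC yield the same signature only when both cover the original content with the same CRC variant, which this sketch assumes.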

Claims (1)

1. A data de-duplication method for losslessly compressed files, comprising the following steps:
(1) Extract the file signature of each compressed file in the compressed package. Specifically:
(1.1) Extract the data integrity check code that each compressed file already carries, such as a cyclic redundancy check (CRC) code, as the file signature. Where the required collision rate demands it, other file attributes such as file length can also be extracted, so that the file length and the check code together serve as the file signature.
(1.2) If a compressed file has no check code, compute the check code of the compressed file (that is, of the original file before compression) as the file signature.
(1.3) Instead of a check code, a hash algorithm such as MD5 or SHA-1 can also be used to compute the hash value of the compressed file (that is, of the original file before compression) as the file signature.
(2) If two files have the same file signature, mark the file as a duplicate; otherwise mark it as a non-duplicate. Specifically:
(2.1) Look up the file signature obtained in (1) in the file signature store.
(2.2) If the file signature obtained in (1) is found in the file signature store, mark the corresponding file in (1) as a duplicate file; its content does not need to be stored or transmitted.
(2.3) If the file signature obtained in (1) is not found in the file signature store, mark the corresponding file in (1) as a non-duplicate file and store its file signature in the file signature store.
(2.4) If the method is combined with an existing data de-duplication method, steps (2.1), (2.2), and (2.3) can use the file signature store of that existing method.
(3) Remove the duplicate files identified in (2), and build a new compressed package and a compressed package manifest (literally, "compressed package spectrum"). The new compressed package is the package rebuilt from the compressed package of (1) after the duplicate files identified in (2.2) have been removed. The manifest records which files the compressed package of (1) consists of, so that the original compressed package can be restored later.
(4) If the method is combined with the system's existing data de-duplication method, proceed to the existing de-duplication step; otherwise, the de-duplication process ends.
CN201610213219.9A 2016-04-05 2016-04-05 Data de-duplication method for lossless compressed files Pending CN105912622A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610213219.9A CN105912622A (en) 2016-04-05 2016-04-05 Data de-duplication method for lossless compressed files

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610213219.9A CN105912622A (en) 2016-04-05 2016-04-05 Data de-duplication method for lossless compressed files

Publications (1)

Publication Number Publication Date
CN105912622A (en) 2016-08-31

Family

ID=56745477

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610213219.9A Pending CN105912622A (en) 2016-04-05 2016-04-05 Data de-duplication method for lossless compressed files

Country Status (1)

Country Link
CN (1) CN105912622A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102207939A (en) * 2010-03-31 2011-10-05 联想(北京)有限公司 Multi-hardware system data processing apparatus and method for deleting duplicated data
CN101908077A (en) * 2010-08-27 2010-12-08 华中科技大学 Duplicated data deleting method applicable to cloud backup
CN102323958A (en) * 2011-10-27 2012-01-18 上海文广互动电视有限公司 Data de-duplication method
CN103873438A (en) * 2012-12-12 2014-06-18 鸿富锦精密工业(深圳)有限公司 Compression packet uploading and duplication-removing system and method
CN103020317A (en) * 2013-01-10 2013-04-03 曙光信息产业(北京)有限公司 Device and method for data compression based on data deduplication
CN103177111A (en) * 2013-03-29 2013-06-26 西安理工大学 System and method for deleting repeating data

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106528703A (en) * 2016-10-26 2017-03-22 杭州宏杉科技股份有限公司 Deduplication mode switching method and apparatus
CN106874399A (en) * 2017-01-16 2017-06-20 厦门天锐科技股份有限公司 One kind networking standby system and backup method
CN106874399B (en) * 2017-01-16 2020-06-12 厦门天锐科技股份有限公司 Networking backup system and backup method
CN107402725A (en) * 2017-03-20 2017-11-28 威盛电子股份有限公司 Nonvolatile memory devices and its data deduplication method
CN107402725B (en) * 2017-03-20 2020-08-25 威盛电子股份有限公司 Nonvolatile memory device and data deduplication method thereof
CN107085613A (en) * 2017-05-17 2017-08-22 广州四三九九信息科技有限公司 Enter the filter method and device of library file
CN107085613B (en) * 2017-05-17 2020-07-28 广州四三九九信息科技有限公司 Method and device for filtering files to be put in storage
CN109144768A (en) * 2017-06-16 2019-01-04 西部数据技术公司 CPU errors repair during correcting and eleting codes coding
CN109144768B (en) * 2017-06-16 2021-12-17 西部数据技术公司 System for data encoding and computer-implemented method thereof
CN108563649A (en) * 2017-12-12 2018-09-21 南京富士通南大软件技术有限公司 Offline De-weight method based on GlusterFS distributed file systems
CN108563649B (en) * 2017-12-12 2021-12-07 南京富士通南大软件技术有限公司 Offline duplicate removal method based on GlusterFS distributed file system
CN112230032A (en) * 2020-08-03 2021-01-15 青岛鼎信通讯股份有限公司 Electric energy meter data compression and decompression method
CN114077569A (en) * 2020-08-18 2022-02-22 富泰华工业(深圳)有限公司 Method and equipment for compressing data and method and equipment for decompressing data
CN114077569B (en) * 2020-08-18 2023-07-18 富泰华工业(深圳)有限公司 Method and device for compressing data, and method and device for decompressing data
CN115993939A (en) * 2023-03-22 2023-04-21 陕西中安数联信息技术有限公司 Method and device for deleting repeated data of storage system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160831