CN104281412A - Method for removing repeating data before data storage - Google Patents


Publication number
CN104281412A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310278342.5A
Other languages
Chinese (zh)
Inventor
邬玉良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
HEATSONE TECHNOLOGY Inc
Original Assignee
HEATSONE TECHNOLOGY Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2013-07-04
Filing date
2013-07-04
Publication date
2015-01-14
Application filed by HEATSONE TECHNOLOGY Inc
Priority to CN201310278342.5A
Publication of CN104281412A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data

Abstract

The invention discloses a method for removing duplicate data before the data is stored, based on the organizational characteristics of the data to be processed. The method addresses the problem of identifying and removing duplicate data before storage. It is characterized by the following steps: the data to be processed is cut into sub-blocks of varying length according to its organizational characteristics; a standard identifier is generated for each sub-block and used to determine whether duplicate data exists; and the data is thus processed before it is stored, which reduces the likelihood of the misjudgments that occur when duplicates are deleted after storage. The method is typically used in computer data archiving, storage, backup, remote disaster tolerance, and disaster recovery to identify duplicate data, keep a single copy, and ignore the rest. It improves the effective utilization of computer storage space, reduces bandwidth usage, lowers the likelihood of misjudgment in post-storage deduplication, and helps ensure data consistency.

Description

A method for removing duplicate data before data storage
Technical field
The present invention relates to a method for removing duplicate data before data is stored, and belongs to the field of computer data processing.
Background
Demand for computer-based data storage has grown in recent years, and so have the requirements on storage speed and efficiency. Enterprise data storage is currently growing explosively; data volumes can double within a short time, which puts considerable financial pressure on enterprises.
Data deduplication is a mainstream and very popular storage technology that can effectively optimize storage capacity. Deduplication is a process that compares the incoming data stream with data already held in the system, finds redundant sub-file information, and keeps only one version of that information. The technique is especially valuable in backup, because most of the data is identical from one backup to the next, particularly from one full backup to the next full backup. Deduplication has become a very popular topic in the storage industry and a large category of commercial products, because it can significantly reduce purchase and operating costs while improving storage efficiency. With the explosive growth of data volumes, nearly half of data center administrators rank data growth among their top three challenges. According to a recent Gartner survey, deduplication can relieve pressure on storage budgets and help storage administrators cope with data growth.
Although deduplication is mainly regarded as a capacity optimization technique, it can also bring performance benefits: as the amount of data to be stored shrinks, the amount of data the system has to migrate shrinks as well.
Deduplication can be applied at different points in the data life cycle: at the source, in transit, or at the storage destination. It can also be applied at every storage tier: backup, archive, and primary storage.
Whichever approach is used, deduplication is a process that identifies duplicate data at some level of granularity and replaces each duplicate with a pointer to a shared copy, which saves both storage space and the bandwidth needed to migrate the data.
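The pointer-replacement idea can be illustrated with a minimal sketch. It is not taken from the patent: the function names `deduplicate` and `rebuild` are hypothetical, and SHA-256 is assumed as the fingerprint purely for illustration. Each distinct block is stored once, and every occurrence is recorded as a reference to the shared copy.

```python
import hashlib


def deduplicate(blocks):
    """Store each distinct block once and return (store, recipe).

    store  maps a digest to the single shared copy of a block.
    recipe holds one digest per input block, acting as the "pointer".
    Both names are illustrative assumptions, not terms from the patent.
    """
    store = {}
    recipe = []
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # keep only the first copy
        recipe.append(digest)            # every occurrence becomes a pointer
    return store, recipe


def rebuild(store, recipe):
    """Reassemble the original byte stream by following the pointers."""
    return b"".join(store[digest] for digest in recipe)


blocks = [b"alpha", b"beta", b"alpha", b"alpha"]
store, recipe = deduplicate(blocks)
```

Here four input blocks are reduced to two stored copies, while the recipe still allows the original stream to be rebuilt exactly.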
The deduplication process includes tracking and identifying the duplicate data that has been removed, and identifying and storing data that is new and unique. End users of the data are entirely unaware that it may have been deduplicated and rebuilt many times over its life cycle.
Deduplication can be performed in several ways. Single-instance storage (SIS) deduplicates at the file or block level: a duplicate copy is replaced by an instance with a pointer, and the pointer refers to the original file or object.
Sub-file deduplication operates at a granularity finer than the file or object. It has two common forms: fixed-block deduplication, in which data is broken into parts or blocks of a fixed length; and variable-length deduplication, in which data is deduplicated according to a sliding window.
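A short sketch of the fixed-block form may help show why the variable-length form exists. The helper `fixed_chunks` and the sample data are assumptions made for illustration. When a single byte is inserted at the front of the data, every fixed-length block boundary shifts, and blocks that previously matched no longer do.

```python
def fixed_chunks(data: bytes, size: int):
    """Split data into consecutive blocks of a fixed length (last may be shorter)."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [data[i:i + size] for i in range(0, len(data), size)]


original = b"ABCDEFGHABCDEFGH"
shifted = b"x" + original          # one byte inserted at the front

a = fixed_chunks(original, 4)      # ABCD, EFGH, ABCD, EFGH
b = fixed_chunks(shifted, 4)       # xABC, DEFG, HABC, DEFG, H

# Blocks that the two versions have in common after the one-byte shift.
shared = set(a) & set(b)
```

Within `original`, fixed blocks deduplicate well (two distinct blocks out of four), but `shared` is empty: the shifted copy shares no block with the original even though all but one byte is identical.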
Data compression encodes data to reduce its size; it can also be applied to data that has already been deduplicated to reduce storage consumption further. Deduplication and compression are different but complementary: for example, data may compress very well yet deduplicate very poorly.
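The difference can be seen in a small constructed example, given here only as an illustration under assumed parameters (a 4096-byte block size and synthetic data). Each block consists of one repeated byte value, so no two blocks are equal and block-level deduplication saves nothing, yet the data as a whole compresses to a small fraction of its size.

```python
import zlib

BLOCK = 4096  # assumed block size for this illustration

# Sixteen blocks, each filled with a different single byte value.
data = b"".join(bytes([i]) * BLOCK for i in range(16))
blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]

unique_blocks = set(blocks)        # what block-level deduplication would keep
compressed = zlib.compress(data)   # what general-purpose compression achieves

dedup_saving = len(blocks) - len(unique_blocks)
```

In this case `dedup_saving` is zero because every block is distinct, while `compressed` is far smaller than `data`.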
In addition, deduplication can be performed inline, that is, as the data is written to the destination; it can also be performed as post-processing, that is, after the data has been written and stored on disk.
Summary of the invention
The present invention proposes a method for deleting duplicate data before storage, aimed at deduplication during the storage process. It is intended to address the low utilization of computer storage space in data archiving, storage, backup, remote disaster tolerance, and disaster recovery, and to reduce the probability of misjudgment when duplicate data is deleted after storage.
Deduplication may be file-based or block-based. The two handle duplicate data with different effectiveness, and data produced by different applications yields different results. The method used in the present invention is block-based deduplication.
The method proposed by the present invention for deleting duplicate data before storage, according to the organizational characteristics of the data to be processed, comprises the following steps:
First, the organizational structure of the data to be processed is obtained and verified against the flag information already present in the local configuration file. After verification, the result is compared with the check codes of the stored data to determine whether the data to be processed is consistent with the stored data. If it is consistent, the data slicer for its structure type is obtained; once the slicer has been obtained, the data is loaded from hard disk into memory and the data to be processed is passed in. If it is inconsistent, the data slicer for the data structure is obtained and the data to be processed is passed in.
Next, the data slicer divides the data to be processed into multiple sub-blocks and generates a unique identifier for each sub-block. Different data types have different identifiers, and the identifier that the algorithm generates for each distinct data block is also different, so the identifier is unique. The stored data is then extracted, and its identifiers and check codes are obtained with the same algorithm used for the data to be processed.
Finally, the identifiers and check codes of the two sets of data are compared to determine whether any identifier and check code are repeated. If an identical identifier and check code are found, the corresponding data block in the data to be processed is deleted.
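The final comparison step can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the patented implementation: the text does not name the check-code algorithm, so CRC-32 is assumed here, MD5 is used as the identifier because the embodiment mentions it, and the function names `fingerprint` and `remove_duplicates` are hypothetical. The sketch also drops repeats within the pending data itself, which the text does not spell out.

```python
import hashlib
import zlib


def fingerprint(block: bytes):
    """Return (identifier, check code) for one sub-block.

    MD5 as identifier follows the embodiment; CRC-32 as check code is an assumption.
    """
    return hashlib.md5(block).hexdigest(), zlib.crc32(block)


def remove_duplicates(pending_blocks, stored_blocks):
    """Keep only pending blocks whose identifier and check code are not already known."""
    known = {fingerprint(block) for block in stored_blocks}
    kept = []
    for block in pending_blocks:
        fp = fingerprint(block)
        if fp in known:
            continue          # same identifier and check code: drop the block
        known.add(fp)         # assumption: also deduplicate within the pending data
        kept.append(block)
    return kept


stored_blocks = [b"header", b"payload-1"]
pending_blocks = [b"header", b"payload-2", b"payload-2", b"payload-1"]
kept = remove_duplicates(pending_blocks, stored_blocks)
```

Only `b"payload-2"` survives, and only once; the other pending blocks match fingerprints of data that is already stored.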
The present invention is typically used in computer data archiving, storage, backup, remote disaster tolerance, and disaster recovery to identify duplicate data, keep a single copy, and ignore the rest. It thereby improves the effective utilization of computer storage space, reduces bandwidth usage, lowers the probability of misjudgment in post-storage deduplication, and ensures data consistency.
Brief description of the drawings
Fig. 1 is a flowchart of data storage.
Fig. 2 is a flowchart of obtaining the data slicer.
Detailed description of the embodiments
The present invention proposes a method for removing duplicate data before data storage. The data storage flow is shown in Fig. 1. First, the data to be stored, referred to here as the data to be processed, is obtained, and it is determined whether its organizational structure is consistent with that of the existing data. If it is consistent, the data slicer for its structure type is obtained, the data is loaded from hard disk into memory, and the data to be processed is passed in; if not, the data slicer for the data structure is obtained and the data to be processed is passed in. A data slicer is essentially an algorithm that divides data into blocks. The block size can be set, and variable-size blocks can be delimited with a sliding window: whenever the hash value of the sliding window matches a reference value, a block boundary is created. The slicer divides the data to be processed into sub-blocks, and the MD5 value of each block is calculated to generate a unique identifier for the sub-block. The identifiers and check codes of the data already in storage are extracted and merged. It is then determined whether any identifier and check code are repeated. If so, the duplicate data blocks are deleted and the remainder is stored; if not, the data is stored directly.
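The sliding-window slicer described above can be sketched roughly as follows. The sketch makes several assumptions that are not in the patent: the rolling hash is a plain sum of the bytes in the window (production systems commonly use stronger rolling hashes such as Rabin fingerprints), and the parameters `window`, `divisor`, `target`, `min_size`, and `max_size` are illustrative. A boundary is declared when the window hash modulo `divisor` equals the reference value `target`, and each resulting sub-block is identified by its MD5 value.

```python
import hashlib


def content_defined_chunks(data: bytes, window=16, divisor=64, target=0,
                           min_size=32, max_size=1024):
    """Cut data into variable-length blocks at content-defined boundaries."""
    chunks = []
    start = 0      # index where the current block begins
    rolling = 0    # sum of the last `window` bytes of the current block
    for i, byte in enumerate(data):
        rolling += byte
        if i - start >= window:
            rolling -= data[i - window]      # slide the window forward
        length = i - start + 1
        at_boundary = length >= min_size and rolling % divisor == target
        if at_boundary or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            rolling = 0
    if start < len(data):
        chunks.append(data[start:])          # trailing partial block
    return chunks


# Synthetic sample data, used only to exercise the slicer.
data = bytes((i * 37 + 11) % 251 for i in range(5000))
chunks = content_defined_chunks(data)
identifiers = [hashlib.md5(chunk).hexdigest() for chunk in chunks]
```

Because `min_size` is at least as large as `window`, every boundary decision depends only on the most recent `window` bytes, so boundaries follow the content rather than absolute offsets. The `max_size` cap is a common safeguard, assumed here, against blocks growing without bound when no window matches.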
The method can be divided into the following detailed steps:
(1) Take out the data to be stored and obtain the organizational structure of the data to be processed. Data organizational structure is defined here as follows: if the data is denoted by D and the relations among the data by R, then DR = (D, R) denotes the data organizational structure. The structure is then verified against the flag information already present in the local configuration file, and the result is compared with the check codes of the stored data to determine whether the data to be processed is consistent with the stored data.
(2) If the data to be processed is consistent with the stored data, the data slicer for its structure type is obtained; once the slicer has been obtained, the data is loaded from hard disk into memory and the data to be processed is passed in.
(3) If the data to be processed is inconsistent with the stored data, the data slicer for the data structure is obtained and the data to be processed is passed in.
(4) The data slicer divides the data to be processed into multiple sub-blocks, and the MD5 value of each sub-block is calculated to give the sub-block a unique identifier. Different data types have different identifiers, the MD5 value of each distinct data block is different, and the identifier is unique.
(5) The stored data is extracted, and its identifiers and check codes are obtained by the same method as in step (4).
(6) The identifiers and check codes of the two sets of data are compared to determine whether any identifier and check code are repeated. If an identical identifier and check code are found, the corresponding data block in the data to be processed is deleted.
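The organizational structure DR = (D, R) defined in step (1) can be pictured as a set of data elements together with a set of relations over them. The following sketch is one possible reading, not the patent's own representation: the class name `DataStructure`, its field names, and the sample elements are all hypothetical, and relations are assumed to be ordered pairs of elements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataStructure:
    """DR = (D, R): elements D and relations R between them (assumed ordered pairs)."""
    elements: frozenset    # D
    relations: frozenset   # R

    def __post_init__(self):
        # Every relation must connect two elements that belong to D.
        for first, second in self.relations:
            if first not in self.elements or second not in self.elements:
                raise ValueError(f"relation {(first, second)} refers to an unknown element")


record_layout = DataStructure(
    elements=frozenset({"header", "body", "trailer"}),
    relations=frozenset({("header", "body"), ("body", "trailer")}),
)
```

Two pieces of data with the same `DataStructure` value would be treated as having the same structure type under this reading, which is the kind of comparison step (1) relies on.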
Fig. 2 is a flowchart of obtaining the data slicer.
After the organizational structure of the data to be processed has been obtained, it must be analyzed. The analysis verifies the obtained data organizational structure against the identification information already present in the local configuration file, and then checks the result against the check codes of the stored data. If the check codes of the stored data include the check code of the data to be stored, the data slicer for that organizational structure type is obtained; otherwise, the data slicer for the conventional organization type is obtained.
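The selection logic of Fig. 2 might be sketched as below. Everything concrete in this sketch is an assumption: the patent does not say how the check code of a structure is computed (a CRC-32 of a structure-type name is used as a stand-in), and the slicers `line_slicer` and `generic_slicer`, the registry `SLICERS`, and the function `select_slicer` are hypothetical names.

```python
import zlib


def line_slicer(data: bytes):
    """Structure-specific slicer (assumed): one sub-block per text line."""
    return data.splitlines(keepends=True)


def generic_slicer(data: bytes, size: int = 8):
    """Fallback slicer for the conventional organization type (assumed fixed-size)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


# Hypothetical registry mapping a structure type to its dedicated slicer.
SLICERS = {"text-lines": line_slicer}


def select_slicer(structure_type, structure_check_code, stored_check_codes):
    """Pick the type-specific slicer only if the check code is already known."""
    if structure_check_code in stored_check_codes:
        return SLICERS.get(structure_type, generic_slicer)
    return generic_slicer


stored_check_codes = {zlib.crc32(b"text-lines")}
known = select_slicer("text-lines", zlib.crc32(b"text-lines"), stored_check_codes)
unknown = select_slicer("binary-blob", zlib.crc32(b"binary-blob"), stored_check_codes)
```

With these inputs, `known` is the line-based slicer and `unknown` falls back to the generic one, mirroring the two branches of the flowchart.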

Claims (4)

1. A method for removing duplicate data before data storage according to the organizational characteristics of the data to be processed, comprising the following steps:
(1) determining whether the organizational structure of the data to be processed is consistent with that of the existing data;
(2) if consistent, obtaining the data slicer for its structure type, then loading the data from hard disk into memory and passing in the data to be processed; if inconsistent, obtaining the data slicer for the data structure and passing in the data to be processed;
(3) dividing the data to be processed into sub-blocks with the slicer, and generating a unique identifier for each sub-block;
(4) extracting and merging the identifiers and check codes of the data in storage;
(5) determining whether any identifier and check code are repeated, and storing the data.
2. The method according to claim 1 for removing duplicate data before data storage according to the organizational characteristics of the data to be processed, wherein the purpose of the method is to solve the problem of identifying and removing duplicate data before data storage.
3. The method according to claim 1, characterized in that the data to be processed is cut, before data storage, into sub-blocks of unequal length according to its organizational characteristics, a standard identifier is generated for each sub-block to identify whether duplicate data exists, and the data is then processed before data storage, reducing the possibility of misjudgment in post-storage deduplication.
4. The method according to claim 1, wherein the data organizational structure in step (1) means: if the data is denoted by D and the relations among the data by R, then DR = (D, R) denotes the data organizational structure.
CN201310278342.5A 2013-07-04 2013-07-04 Method for removing repeating data before data storage Pending CN104281412A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310278342.5A CN104281412A (en) 2013-07-04 2013-07-04 Method for removing repeating data before data storage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310278342.5A CN104281412A (en) 2013-07-04 2013-07-04 Method for removing repeating data before data storage

Publications (1)

Publication Number Publication Date
CN104281412A (en) 2015-01-14

Family

ID=52256328

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310278342.5A Pending CN104281412A (en) 2013-07-04 2013-07-04 Method for removing repeating data before data storage

Country Status (1)

Country Link
CN (1) CN104281412A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101320372A (en) * 2008-05-22 2008-12-10 上海爱数软件有限公司 Compression method for repeated data
US20130046733A1 (en) * 2011-08-19 2013-02-21 Hitachi Computer Peripherals Co., Ltd. Storage apparatus and duplicate data detection method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
严蔚敏 (Yan Weimin) et al.: "《数据结构(C语言版)》" (Data Structures, C Language Edition), 31 May 2011 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106569745A (en) * 2016-10-25 2017-04-19 暨南大学 Memory optimization system for data deduplication under memory overload
CN106569745B (en) * 2016-10-25 2019-07-19 暨南大学 Memory optimizing system towards data de-duplication under a kind of memory overload
CN107402725A (en) * 2017-03-20 2017-11-28 威盛电子股份有限公司 Nonvolatile memory devices and its data deduplication method
CN107402725B (en) * 2017-03-20 2020-08-25 威盛电子股份有限公司 Nonvolatile memory device and data deduplication method thereof
CN112053735A (en) * 2019-06-05 2020-12-08 建兴储存科技(广州)有限公司 Repeated data processing method of solid-state storage device
CN112053735B (en) * 2019-06-05 2023-03-28 建兴储存科技(广州)有限公司 Repeated data processing method of solid-state storage device
CN113126885A (en) * 2020-01-14 2021-07-16 瑞昱半导体股份有限公司 Data writing method, data reading method and storage device

Similar Documents

Publication Publication Date Title
US9952936B2 (en) Storage system and method of controlling storage system
CN101989929B (en) Disaster recovery data backup method and system
He et al. Data deduplication techniques
US9223794B2 (en) Method and apparatus for content-aware and adaptive deduplication
US8447740B1 (en) Stream locality delta compression
US8751462B2 (en) Delta compression after identity deduplication
US7567188B1 (en) Policy based tiered data deduplication strategy
CN107229420B (en) Data storage method, reading method, deleting method and data operating system
US10366072B2 (en) De-duplication data bank
US20120303595A1 (en) Data restoration method for data de-duplication
WO2013051129A1 (en) Deduplication method for storage data, deduplication device for storage data, and deduplication program
KR20170054299A (en) Reference block aggregating into a reference set for deduplication in memory management
US8578112B2 (en) Data management system and data management method
US8667032B1 (en) Efficient content meta-data collection and trace generation from deduplicated storage
CN108170555A (en) A kind of data reconstruction method and equipment
CN102033924B (en) Data storage method and system
CN103118104B (en) A kind of data restoration method and server based on version vector
WO2017020576A1 (en) Method and apparatus for file compaction in key-value storage system
CN104077380A (en) Method and device for deleting duplicated data and system
CN107885619A (en) A kind of data compaction duplicate removal and the method and system of mirror image remote backup protection
CN104281412A (en) Method for removing repeating data before data storage
CN106990914B (en) Data deleting method and device
EP3477462B1 (en) Tenant aware, variable length, deduplication of stored data
US20170308554A1 (en) Auto-determining backup level
US20220245097A1 (en) Hashing with differing hash size and compression size

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C53 Correction of patent of invention or patent application
CB02 Change of applicant information

Address after: Taiyue business center 100086 Beijing Haidian District City, Zhichun Road Tai Yue Park Building No. 1 4 floor

Applicant after: HEATSONE TECHNOLOGY INC.

Address before: 100080 Beijing City, Haidian District Cheng Fu Road No. 268 KYKY No. 1 building 508

Applicant before: HEATSONE TECHNOLOGY INC.

COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 100080 HAIDIAN, BEIJING TO: 100086 HAIDIAN, BEIJING

RJ01 Rejection of invention patent application after publication

Application publication date: 20150114