CN104281412A - Method for removing repeating data before data storage - Google Patents
- Publication number
- CN104281412A
- Authority
- CN
- China
- Prior art keywords
- data
- repeating
- storage
- pending
- store
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0638—Organizing or formatting or addressing of data
Abstract
The invention discloses a method for removing duplicate data before data storage according to the organizational characteristics of the data to be processed. The method addresses the identification and removal of duplicate data before storage. The data to be processed are cut into sub-blocks of varying length according to their organizational characteristics, and a standard identifier is generated for each sub-block to determine whether duplicate data exist. The data are thus processed before storage, which lowers the likelihood of the misjudgments that occur when duplicates are deleted after storage. The method is typically used in computer data archiving, storage, backup, remote disaster tolerance and disaster recovery to identify duplicate data, store only one copy and ignore the rest. This improves the effective utilization of storage space, reduces bandwidth usage, lowers the likelihood of misjudged deletions after storage, and ensures data consistency.
Description
Technical field
The present invention relates to a method for removing duplicate data before data storage, and belongs to the field of computer data processing.
Background art
In recent years, demand for computer-based data storage has grown steadily, and the requirements on storage speed and efficiency have risen with it. Enterprise data storage is currently expanding rapidly, with data volumes doubling within short periods, which places considerable financial pressure on enterprises.
Data deduplication is a mainstream and very popular storage technology that can effectively optimize storage capacity. It is a process that compares the incoming data stream with the data already kept in the system, finds redundant sub-file information, and keeps only one version of that information. The technique is especially valuable in backup, because most of the data are identical from one backup to the next, particularly from one full backup to the next full backup. Deduplication has become a very popular topic in the storage industry and a large class of commercial products, because it can significantly reduce purchase and operating costs while improving storage efficiency. With the explosive growth of data volumes, close to half of data center administrators rank data growth among their top three challenges. According to a recent Gartner survey, deduplication can relieve pressure on storage budgets and help storage administrators cope with data growth.
Although data deduplication is regarded mainly as a capacity optimization technique, it can also bring performance benefits: as the amount of data that must be stored decreases, so does the amount of data the system must migrate.
Deduplication can be applied at different points in the data life cycle: at the source, in transit, or at the storage target. It can also be applied in every storage tier: backup, archive and primary storage.
Whichever approach is used, data deduplication is a process that identifies duplicate data at some level of granularity and replaces each duplicate with a pointer to a shared copy, which saves both storage space and the bandwidth needed to migrate data.
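The idea of replacing duplicates with pointers to a shared copy can be sketched in a few lines of Python. This is a toy illustration only: the class name `DedupStore`, the 4-byte block size and the use of SHA-256 fingerprints are assumptions made for the example, not details taken from the patent.

```python
import hashlib


class DedupStore:
    """Toy block store: identical blocks are kept once and shared by reference."""

    def __init__(self):
        self.blocks = {}  # fingerprint -> block bytes (the single shared copy)
        self.files = {}   # file name -> list of fingerprints (the "pointers")

    def put(self, name, data, block_size=4):
        pointers = []
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            fp = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(fp, block)  # store only the first copy
            pointers.append(fp)
        self.files[name] = pointers

    def get(self, name):
        # Rebuild the original data by following the pointers.
        return b"".join(self.blocks[fp] for fp in self.files[name])


store = DedupStore()
store.put("a.txt", b"ABCDABCDABCD")
store.put("b.txt", b"ABCDWXYZ")
print(len(store.blocks))  # number of distinct blocks actually stored
```

Five logical blocks are written across the two files, but only the distinct ones occupy space, and `get` reconstructs each file transparently.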
The deduplication process includes tracking and identifying the duplicate data that have been removed, and identifying and storing data that are new or unique. End users of the data are entirely unaware that the data may have been deduplicated and rebuilt many times over their life cycle.
There are several ways to deduplicate data. Single-instance storage (SIS) deduplicates at the file or block level: each duplicate copy is replaced by an instance with a pointer, and the pointer refers to the source file or object.
Sub-file deduplication operates at a finer granularity than files or objects. The technique has two common forms: fixed-block deduplication, in which data are broken into parts or blocks of fixed length; and variable-length deduplication, in which data are deduplicated according to a sliding window.
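A short sketch may help show why the fixed-block form is sensitive to insertions, which is the usual motivation for the variable-length form. The helper names `fixed_chunks` and `fingerprints` and the 64-byte block size are assumptions chosen for the example.

```python
import hashlib


def fixed_chunks(data, size):
    """Split data into consecutive blocks of `size` bytes (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprints(chunks):
    """Set of MD5 fingerprints, one per distinct block."""
    return {hashlib.md5(c).hexdigest() for c in chunks}


original = bytes(range(256)) * 4   # 1024 bytes of repeating content
shifted = b"\x00" + original       # the same content with one byte inserted in front

before = fingerprints(fixed_chunks(original, 64))
after = fingerprints(fixed_chunks(shifted, 64))
shared = before & after            # blocks that would still deduplicate
print(len(before), len(shared))
```

Because every block boundary moves by one byte, none of the shifted blocks matches a block of the original, even though the two inputs differ by a single byte.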
Data compression encodes data to reduce their size; it can also be applied to data that have already been deduplicated to reduce storage consumption further. Deduplication and compression are different but complementary: for example, data may compress very efficiently yet deduplicate very poorly.
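The following sketch constructs one such case under assumed parameters (64-byte blocks, `zlib` as the compressor): every block is almost entirely zeros, so the data compress well, but a leading counter byte makes each block distinct, so block-level deduplication saves nothing.

```python
import hashlib
import zlib

BLOCK = 64
# 256 blocks: each is low-entropy (63 zero bytes) but unique (leading counter byte).
data = b"".join(bytes([i]) + b"\x00" * (BLOCK - 1) for i in range(256))

blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
unique_blocks = {hashlib.sha256(b).digest() for b in blocks}
compressed = zlib.compress(data)

dedup_saving = 1 - len(unique_blocks) / len(blocks)  # fraction of blocks removed
compression_ratio = len(compressed) / len(data)      # smaller is better
print(dedup_saving, compression_ratio)
```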
In addition, deduplication can be performed inline, that is, as the data are written to the target; it can also be performed as post-processing, that is, after the data have been written and stored on disk.
Summary of the invention
The present invention proposes a method for deleting duplicate data before storage. It is aimed at two problems of deduplication during the storage process: low utilization of storage space in computer data archiving, storage, backup, remote disaster tolerance and disaster recovery, and the risk of misjudgment when duplicates are deleted after the data have been stored.
Deduplication may be file-based or block-based. The two treat duplicate data differently, and the results differ with the data that different applications produce. The method used in the present invention is block-based deduplication.
The method proposed by the present invention, which deletes duplicate data before storage according to the organizational characteristics of the pending data, comprises the following steps:
First, the organizational structure of the pending data is obtained. It is then verified against the flag information already present in the local configuration file, and the result is compared with the check codes held in storage to determine whether the pending data are consistent with the stored data. If they are consistent, the data slicer for that structure type is obtained; after the slicer is obtained, the data are loaded from hard disk into memory and the pending data are passed to the slicer. If they are inconsistent, the data slicer for the general data structure is obtained and the pending data are passed to it.
Second, the slicer divides the pending data into multiple sub-blocks and generates a unique identifier for each sub-block. Different data types have different identifiers, and the identifiers that the algorithm generates for different data blocks also differ, so each identifier is unique. The data already in storage are extracted, and their identifiers and check codes are obtained with the same algorithm used for the pending data.
Finally, the identifiers and check codes of the two sets of data are compared to determine whether any are repeated. If an identical identifier and check code are found, the corresponding data block is deleted from the pending data.
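The final comparison step might look like the sketch below. The patent does not say how the check code is computed, so CRC32 is used here purely as an assumption, and the function names `signature` and `drop_duplicates` are hypothetical. The sketch also drops repeats that occur inside the pending data itself, which goes slightly beyond what the text states.

```python
import hashlib
import zlib


def signature(block):
    """Return (identifier, check code) for a block.

    The MD5 identifier follows the description; CRC32 as the check code is an assumption.
    """
    return hashlib.md5(block).hexdigest(), zlib.crc32(block)


def drop_duplicates(pending_blocks, stored_blocks):
    """Keep only pending blocks whose signature is not already known."""
    seen = {signature(b) for b in stored_blocks}
    kept = []
    for block in pending_blocks:
        sig = signature(block)
        if sig in seen:
            continue       # duplicate: delete it from the pending data
        seen.add(sig)      # also catches repeats inside the pending data
        kept.append(block)
    return kept


stored = [b"header", b"payload-1"]
pending = [b"header", b"payload-2", b"payload-2", b"payload-1"]
kept = drop_duplicates(pending, stored)
print(kept)
```

Requiring both the identifier and the check code to match before a block is deleted is one way to read the claim that the method lowers the chance of misjudging a block as a duplicate.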
The present invention is typically used in computer data archiving, storage, backup, remote disaster tolerance and disaster recovery to identify duplicate data, keep only one copy and ignore the rest. This improves the effective utilization of storage space, reduces bandwidth usage, lowers the likelihood of misjudgment when duplicates are deleted after storage, and ensures data consistency.
Brief description of the drawings
Fig. 1 is a flowchart of data storage.
Fig. 2 is a flowchart of obtaining the data slicer.
Embodiment
The present invention proposes a method for removing duplicate data before data storage. The storage flow is shown in Fig. 1. First, the data to be stored, referred to here as the pending data, are obtained, and their organizational structure is checked for consistency with the data already present. If consistent, the data slicer for that structure type is obtained, the data are loaded from hard disk into memory, and the pending data are passed to the slicer; if inconsistent, the data slicer for the general data structure is obtained and the pending data are passed to it. A data slicer is essentially an algorithm that divides data into blocks. The block size can be set, and variable-size blocks can be delimited with a sliding window: whenever the hash value of the sliding window matches a reference value, a block boundary is created. The slicer divides the pending data into sub-blocks, and the MD5 value of each block is computed as the unique identifier of that sub-block. The identifiers and check codes of the data in storage are extracted and merged. The identifiers and check codes are then checked for repeats: if there are any, the repeated blocks are deleted and the remainder is stored; if there are none, the data are stored directly.
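A sliding-window slicer of the kind described above can be sketched as follows. The patent does not name a hash function or any parameters, so the polynomial rolling hash, the 16-byte window, the 6-bit mask, the reference value 0 and the minimum and maximum block sizes are all assumptions, and `cdc_chunks` is a hypothetical name.

```python
import random


def cdc_chunks(data, window=16, mask=0x3F, reference=0, min_size=32, max_size=1024):
    """Cut a block wherever the rolling hash of the last `window` bytes matches `reference`."""
    BASE, MOD = 257, (1 << 31) - 1
    top = pow(BASE, window - 1, MOD)  # weight of the byte that leaves the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i - start >= window:
            h = (h - data[i - window] * top) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD                 # add the newest byte
        size = i - start + 1
        if (size >= min_size and (h & mask) == reference) or size >= max_size:
            chunks.append(data[start:i + 1])        # boundary found: emit a block
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])                 # trailing partial block
    return chunks


rng = random.Random(0)
data = bytes(rng.getrandbits(8) for _ in range(20000))
chunks = cdc_chunks(data)
shifted_chunks = cdc_chunks(b"!" + data)  # same content with one byte inserted in front
shared = set(chunks) & set(shifted_chunks)
print(len(chunks), len(shared))
```

Because boundaries depend on content rather than on absolute offsets, the blocks after an insertion tend to line up again with those of the original, so most of them still deduplicate.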
The method can be divided into the following detailed steps:
(1) Take the data to be stored and obtain the organizational structure of the pending data. The organizational structure can be described as follows: if the data are denoted D and the relations among the data are denoted R, then DR = (D, R) denotes the organizational structure. The structure is then verified against the flag information already present in the local configuration file, and the result is compared with the check codes held in storage to determine whether the pending data are consistent with the stored data.
(2) If the pending data are consistent with the stored data, obtain the data slicer for that structure type; after the slicer is obtained, load the data from hard disk into memory and pass the pending data to the slicer.
(3) If the pending data are inconsistent with the stored data, obtain the data slicer for the general data structure and pass the pending data to it.
(4) The slicer divides the pending data into multiple sub-blocks, and the MD5 value of each sub-block is computed as its unique identifier. Different data types have different identifiers, and the MD5 values of different data blocks differ, so each identifier is unique.
(5) Extract the data already in storage and obtain their identifiers and check codes by the same method as in step (4).
(6) Compare the identifiers and check codes of the two sets of data to determine whether any are repeated. If an identical identifier and check code are found, delete the corresponding data block from the pending data.
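Step (1) leaves open how DR = (D, R) is represented and how its check code is derived. One possible reading, offered only as a sketch, models D as a set of element names and R as a set of pairs, and hashes a canonical text form with MD5; the class `DataStructure`, its fields and `is_consistent` are hypothetical names.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class DataStructure:
    """DR = (D, R): D is the set of data elements, R the relations between them."""
    elements: frozenset   # D
    relations: frozenset  # R, as (element, element) pairs

    def check_code(self):
        # Canonical, order-independent text form so equal structures hash equally.
        canon = repr((sorted(self.elements), sorted(self.relations)))
        return hashlib.md5(canon.encode("utf-8")).hexdigest()


def is_consistent(pending, stored_check_codes):
    """True if the pending structure matches a structure already in storage."""
    return pending.check_code() in stored_check_codes


table = DataStructure(frozenset({"id", "name", "email"}),
                      frozenset({("id", "name"), ("id", "email")}))
same = DataStructure(frozenset({"email", "name", "id"}),
                     frozenset({("id", "email"), ("id", "name")}))
other = DataStructure(frozenset({"id", "blob"}), frozenset({("id", "blob")}))

known = {table.check_code()}
print(is_consistent(same, known), is_consistent(other, known))
```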
Fig. 2 is the flowchart for obtaining the data slicer.
Once the organizational structure of the pending data has been obtained, it must be analyzed. The analysis verifies the obtained structure against the identification information already present in the local configuration file and then checks the result against the check codes of the stored data. If the check codes of the stored data include the check code of the data to be stored, the data slicer for that organizational structure type is obtained; otherwise, the data slicer for the conventional organization type is obtained.
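The selection logic of Fig. 2 amounts to a lookup with a fallback. In the sketch below the registry `SLICERS`, the key `"csv-v1"` and both slicer functions are hypothetical stand-ins: a line-based slicer plays the role of a structure-specific slicer and fixed-size blocks play the role of the conventional one.

```python
def line_slicer(data):
    """Structure-aware slicer (assumed): one block per newline-terminated record."""
    return data.splitlines(keepends=True)


def conventional_slicer(data, size=4):
    """Fallback slicer (assumed): plain fixed-size blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


# Hypothetical registry: check code of a known structure -> slicer for that structure.
SLICERS = {"csv-v1": line_slicer}


def get_slicer(pending_check_code, registry=SLICERS):
    """Return the structure-specific slicer if the check code is known, else the fallback."""
    return registry.get(pending_check_code, conventional_slicer)


records = b"1,alice\n2,bob\n"
print(get_slicer("csv-v1")(records))
print(get_slicer("unknown")(records))
```

With a known check code the records are cut along their natural boundaries, which is what lets identical records in different files produce identical blocks.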
Claims (4)
1. A method for removing duplicate data before data storage according to the organizational characteristics of the pending data, comprising the following steps:
(1) Determine whether the organizational structure of the pending data is consistent with the data already present.
(2) If consistent, obtain the data slicer for that structure type, load the data from hard disk into memory, and pass the pending data to the slicer; if inconsistent, obtain the data slicer for the general data structure and pass the pending data to it.
(3) The slicer divides the pending data into sub-blocks and generates a unique identifier for each sub-block.
(4) Extract and merge the identifiers and check codes of the data in storage.
(5) Determine whether any identifier and check code are repeated, and store the data.
2. The method of claim 1 for removing duplicate data before data storage according to the organizational characteristics of the pending data, wherein its purpose is to solve the problem of identifying and removing duplicate data before data storage.
3. The method of claim 1, characterized in that the pending data are cut, according to their organizational characteristics, into sub-blocks of unequal length before storage, a standard identifier is generated for each sub-block to determine whether duplicate data exist, and the data are thus processed before storage, reducing the likelihood of misjudgment when duplicates are deleted after storage.
4. The organizational structure of the data in step (1) of claim 1 is defined as follows: if the data are denoted D and the relations among the data are denoted R, then DR = (D, R) denotes the organizational structure.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310278342.5A CN104281412A (en) | 2013-07-04 | 2013-07-04 | Method for removing repeating data before data storage |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310278342.5A CN104281412A (en) | 2013-07-04 | 2013-07-04 | Method for removing repeating data before data storage |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104281412A true CN104281412A (en) | 2015-01-14 |
Family
ID=52256328
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310278342.5A Pending CN104281412A (en) | 2013-07-04 | 2013-07-04 | Method for removing repeating data before data storage |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104281412A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106569745A (en) * | 2016-10-25 | 2017-04-19 | 暨南大学 | Memory optimization system for data deduplication under memory overload |
CN107402725A (en) * | 2017-03-20 | 2017-11-28 | 威盛电子股份有限公司 | Nonvolatile memory devices and its data deduplication method |
CN112053735A (en) * | 2019-06-05 | 2020-12-08 | 建兴储存科技(广州)有限公司 | Repeated data processing method of solid-state storage device |
CN113126885A (en) * | 2020-01-14 | 2021-07-16 | 瑞昱半导体股份有限公司 | Data writing method, data reading method and storage device |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101320372A (en) * | 2008-05-22 | 2008-12-10 | 上海爱数软件有限公司 | Compression method for repeated data |
US20130046733A1 (en) * | 2011-08-19 | 2013-02-21 | Hitachi Computer Peripherals Co., Ltd. | Storage apparatus and duplicate data detection method |
Non-Patent Citations (1)
Title |
---|
严蔚敏等: "《数据结构(C语言版)》", 31 May 2011 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106569745A (en) * | 2016-10-25 | 2017-04-19 | 暨南大学 | Memory optimization system for data deduplication under memory overload |
CN106569745B (en) * | 2016-10-25 | 2019-07-19 | 暨南大学 | Memory optimizing system towards data de-duplication under a kind of memory overload |
CN107402725A (en) * | 2017-03-20 | 2017-11-28 | 威盛电子股份有限公司 | Nonvolatile memory devices and its data deduplication method |
CN107402725B (en) * | 2017-03-20 | 2020-08-25 | 威盛电子股份有限公司 | Nonvolatile memory device and data deduplication method thereof |
CN112053735A (en) * | 2019-06-05 | 2020-12-08 | 建兴储存科技(广州)有限公司 | Repeated data processing method of solid-state storage device |
CN112053735B (en) * | 2019-06-05 | 2023-03-28 | 建兴储存科技(广州)有限公司 | Repeated data processing method of solid-state storage device |
CN113126885A (en) * | 2020-01-14 | 2021-07-16 | 瑞昱半导体股份有限公司 | Data writing method, data reading method and storage device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9952936B2 (en) | Storage system and method of controlling storage system | |
CN101989929B (en) | Disaster recovery data backup method and system | |
He et al. | Data deduplication techniques | |
US9223794B2 (en) | Method and apparatus for content-aware and adaptive deduplication | |
US8447740B1 (en) | Stream locality delta compression | |
US8751462B2 (en) | Delta compression after identity deduplication | |
US7567188B1 (en) | Policy based tiered data deduplication strategy | |
CN107229420B (en) | Data storage method, reading method, deleting method and data operating system | |
US10366072B2 (en) | De-duplication data bank | |
US20120303595A1 (en) | Data restoration method for data de-duplication | |
WO2013051129A1 (en) | Deduplication method for storage data, deduplication device for storage data, and deduplication program | |
KR20170054299A (en) | Reference block aggregating into a reference set for deduplication in memory management | |
US8578112B2 (en) | Data management system and data management method | |
US8667032B1 (en) | Efficient content meta-data collection and trace generation from deduplicated storage | |
CN108170555A (en) | A kind of data reconstruction method and equipment | |
CN102033924B (en) | Data storage method and system | |
CN103118104B (en) | A kind of data restoration method and server based on version vector | |
WO2017020576A1 (en) | Method and apparatus for file compaction in key-value storage system | |
CN104077380A (en) | Method and device for deleting duplicated data and system | |
CN107885619A (en) | A kind of data compaction duplicate removal and the method and system of mirror image remote backup protection | |
CN104281412A (en) | Method for removing repeating data before data storage | |
CN106990914B (en) | Data deleting method and device | |
EP3477462B1 (en) | Tenant aware, variable length, deduplication of stored data | |
US20170308554A1 (en) | Auto-determining backup level | |
US20220245097A1 (en) | Hashing with differing hash size and compression size |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C53 | Correction of patent of invention or patent application | ||
CB02 | Change of applicant information |
Address after: Taiyue business center 100086 Beijing Haidian District City, Zhichun Road Tai Yue Park Building No. 1 4 floor
Applicant after: HEATSONE TECHNOLOGY INC.
Address before: 100080 Beijing City, Haidian District Cheng Fu Road No. 268 KYKY No. 1 building 508
Applicant before: HEATSONE TECHNOLOGY INC.
COR | Change of bibliographic data |
Free format text: CORRECT: ADDRESS; FROM: 100080 HAIDIAN, BEIJING TO: 100086 HAIDIAN, BEIJING
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20150114 |