CN104281412A

CN104281412A - Method for removing repeating data before data storage

Info

Publication number: CN104281412A
Application number: CN201310278342.5A
Authority: CN
Inventors: 邬玉良
Original assignee: HEATSONE TECHNOLOGY Inc
Current assignee: HEATSONE TECHNOLOGY Inc
Priority date: 2013-07-04
Filing date: 2013-07-04
Publication date: 2015-01-14

Abstract

The invention discloses a method for removing duplicate data before data storage according to the organizational characteristics of the data to be processed. The method addresses the identification and removal of duplicate data before storage. It is characterized in that the data to be processed is cut into sub-blocks of unequal length according to its organizational characteristics, a standard identifier is generated for each sub-block to determine whether duplicate data exists, and the data is thus processed before it is stored, which lowers the likelihood of misjudgment in deduplication performed after storage. The method is typically used in computer data archiving, storage, backup, remote disaster tolerance, and disaster recovery to identify duplicate data and keep only one copy while ignoring the rest. This improves the effective utilization of computer storage space, reduces bandwidth usage, lowers the likelihood of deduplication misjudgment after storage, and ensures data consistency.

Description

A method for removing duplicate data before data storage

Technical field

The present invention relates to a method for removing duplicate data before data storage, and belongs to the field of computer data processing.

Background art

In recent years, demand for computer-based data storage has grown steadily, and so have the requirements on the speed and efficiency of data storage. Enterprise data storage is currently expanding rapidly: data volumes can double within a short time, which places considerable financial pressure on enterprises.

Data deduplication is a mainstream and very popular storage technology that can effectively optimize storage capacity. Deduplication is a process that compares an incoming data stream with the data previously saved in the system, finds redundant sub-file information, and keeps only one version of that information. The technology is especially valuable in backup, because most of the data is identical from one backup to the next, particularly from one full backup to the next full backup. Deduplication has become a very popular topic in the storage industry and a large class of commercial products, because it can significantly reduce purchase and operating costs while improving storage efficiency. With the explosive growth of data volumes, nearly half of data center administrators rank data growth among their top three challenges. According to a recent Gartner survey, deduplication can relieve pressure on storage budgets and help storage administrators cope with data growth.

Although deduplication is mainly regarded as a capacity optimization technique, it also brings performance benefits: as the amount of data to be stored decreases, the amount of data the system has to migrate decreases as well.

Deduplication can be applied at different points in the data life cycle: at the source, in transit, and at the storage target. It can also be applied in all storage tiers: backup, archive, and primary storage.

Whichever approach is used, deduplication is a process that identifies duplicate data at some level of granularity and replaces it with pointers to a shared copy, which saves both storage space and the bandwidth needed to migrate data.
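The pointer-to-shared-copy idea can be sketched in a few lines of Python. This is a minimal illustration rather than the patented method: the `BlockStore` class, the 4-byte block size, and the use of SHA-256 digests as pointers are all assumptions made for the example.

```python
import hashlib


class BlockStore:
    """Toy content-addressed store: each distinct block is kept once."""

    def __init__(self):
        self.blocks = {}  # digest -> block bytes (the single shared copy)

    def put(self, block: bytes) -> str:
        digest = hashlib.sha256(block).hexdigest()
        # Keep the bytes only if this content has not been seen before.
        self.blocks.setdefault(digest, block)
        return digest  # acts as the pointer to the shared copy

    def write(self, data: bytes, block_size: int = 4) -> list:
        """Split data into fixed-size blocks and return a list of pointers."""
        return [self.put(data[i:i + block_size])
                for i in range(0, len(data), block_size)]

    def read(self, recipe: list) -> bytes:
        """Rebuild the original data by following the pointers."""
        return b"".join(self.blocks[d] for d in recipe)


store = BlockStore()
recipe = store.write(b"ABCDABCDABCDEFGH")
restored = store.read(recipe)
```

Here the recipe holds four pointers while the store holds only two distinct blocks, and reading the recipe back reproduces the original bytes.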

The deduplication process includes tracking and identifying the duplicate data that has been removed, and identifying and storing data that is new and unique. End users are entirely unaware that their data may have been deduplicated and reconstructed many times over its life cycle.

There are several ways to deduplicate data. Single-instance storage (SIS) deduplicates at the file or block level: duplicate copies are replaced by a single instance with pointers, and the pointers refer to the original file or object.

Sub-file-level deduplication operates at a granularity finer than the file or object. Two approaches are common: fixed-block deduplication, in which data is broken into parts or blocks of fixed length, and variable-length deduplication, in which data is deduplicated according to a sliding window.
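One reason variable-length approaches exist is that fixed-length blocks are sensitive to shifts in the data. The sketch below illustrates this under stated assumptions: the 256-byte block size, the MD5 digests, and the synthetic `sample` bytes are chosen only for the demonstration.

```python
import hashlib


def fixed_blocks(data: bytes, size: int) -> list:
    """Cut data into consecutive blocks of a fixed size."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def digests(blocks) -> set:
    """Set of MD5 digests, one per block."""
    return {hashlib.md5(b).hexdigest() for b in blocks}


# Deterministic pseudo-random sample data (hypothetical, 4 KiB).
sample = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(128))

original = digests(fixed_blocks(sample, 256))
appended = digests(fixed_blocks(sample + b"tail", 256))  # bytes added at the end
shifted = digests(fixed_blocks(b"X" + sample, 256))      # one byte added in front

shared_after_append = len(original & appended)
shared_after_insert = len(original & shifted)
```

Appending data leaves every existing block intact, whereas inserting a single byte at the front moves every block boundary, so none of the original blocks is recognized any more.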

Data compression encodes data to reduce its size; it can also be applied to data that has already been deduplicated to reduce storage consumption further. Deduplication and compression are different but complementary: for example, some data may deduplicate very efficiently yet compress poorly.
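A small Python experiment can make the difference concrete. It is only a sketch: the 64 KiB block size and the synthetic `noise` block are assumptions, picked so that the repeats lie farther apart than the 32 KiB history window used by zlib's DEFLATE.

```python
import hashlib
import zlib

BLOCK = 64 * 1024  # hypothetical block size, larger than DEFLATE's 32 KiB window

# One block of deterministic bytes that look random and so compress poorly.
noise = b"".join(hashlib.sha256(i.to_bytes(4, "big")).digest()
                 for i in range(BLOCK // 32))
data = noise * 4  # the same block repeated four times

# Deduplication: keep one copy per distinct block digest.
unique = {hashlib.md5(data[i:i + BLOCK]).digest(): data[i:i + BLOCK]
          for i in range(0, len(data), BLOCK)}
dedup_size = sum(len(block) for block in unique.values())

# Compression alone: the repeats are 64 KiB apart, beyond DEFLATE's reach.
compressed_size = len(zlib.compress(data))
```

In this case deduplication keeps a single 64 KiB block out of 256 KiB, while compression alone barely shrinks the input.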

In addition, deduplication can be performed inline, that is, while the data is being written to the target. It can also be performed as post-processing, that is, after the data has been written and stored on disk.

Summary of the invention

The present invention proposes a method for deleting duplicate data before storage, addressing deduplication during the storage process. It aims to solve the low utilization of computer storage space in computer data archiving, storage, backup, remote disaster tolerance, and disaster recovery, and to reduce the probability of misjudgment in deduplication performed after the data has been stored.

Deduplication can be file-based or block-based. The two differ in how effectively they handle duplicate data, and the results also vary with the data that different applications produce. The method used in the present invention is block-based deduplication.

The method proposed by the present invention for deleting duplicate data before storage, according to the organizational characteristics of the pending data, comprises the following steps:

First, obtain the organizational structure of the pending data, then verify it against the flag information already present in the local configuration file. After verification, compare it with the check codes in storage to determine whether the pending data is consistent with the stored data. If the pending data is consistent with the stored data, obtain the data slicer for its structure type; after obtaining the slicer, load the data from hard disk into memory and pass in the pending data. If the pending data is inconsistent with the stored data, obtain the data slicer for the data structure and pass in the pending data.

Second, the data slicer divides the pending data into multiple sub-blocks and generates a unique identifier for each sub-block. Different data types have different identifiers, and the identifiers that a given algorithm generates for different data blocks also differ, so each identifier is unique. The stored data is then extracted, and its identifiers and check codes are obtained with the same algorithm that was used to process the pending data.

Finally, the identifiers and check codes of the two groups of data are compared to determine whether duplicates exist. If an identical identifier and check code are found, the corresponding data block in the pending data is deleted.
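The comparison step can be sketched as follows. This is an illustrative reading of the text, not a definitive implementation: MD5 as the identifier follows the embodiment described later, while CRC-32 as the check code, the helper names `fingerprint` and `drop_duplicates`, and the sample blocks are assumptions, since the source does not name a check-code algorithm.

```python
import hashlib
import zlib


def fingerprint(block: bytes) -> tuple:
    """Return (identifier, check code) for one sub-block.

    MD5 is used as the identifier; CRC-32 as the check code is an
    assumption made for this sketch.
    """
    return hashlib.md5(block).hexdigest(), zlib.crc32(block)


def drop_duplicates(pending_blocks, stored_blocks) -> list:
    """Keep only the pending blocks whose fingerprint is not already known."""
    seen = {fingerprint(b) for b in stored_blocks}
    kept = []
    for block in pending_blocks:
        fp = fingerprint(block)
        if fp in seen:
            continue  # same identifier and check code: treat as a duplicate
        # Assumption: repeats inside the pending data are also dropped.
        seen.add(fp)
        kept.append(block)
    return kept


stored = [b"header", b"payload-1"]
pending = [b"header", b"payload-2", b"payload-2", b"footer"]
to_store = drop_duplicates(pending, stored)
```

Requiring both the identifier and the check code to match before a block is discarded is what the text relies on to lower the chance of a false duplicate.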

The present invention is typically used in computer data archiving, storage, backup, remote disaster tolerance, and disaster recovery to identify duplicate data and keep only one copy while ignoring the rest. This improves the effective utilization of computer storage space, reduces bandwidth usage, lowers the probability of deduplication misjudgment after the data has been stored, and ensures data consistency.

Brief description of the drawings

Fig. 1: data storage flowchart

Fig. 2: flowchart for obtaining the data slicer

Embodiment

The present invention proposes a method for removing duplicate data before data storage. The data storage flow is shown in Fig. 1. First, the data to be stored, referred to here as the pending data, is obtained, and its organizational structure is checked for consistency with the data that already exists. If the two are consistent, the data slicer for that structure type is obtained, the data is loaded from hard disk into memory, and the pending data is passed in. If they are inconsistent, the data slicer for the data structure is obtained and the pending data is passed in. A data slicer is essentially an algorithm that divides data into blocks. The block size can be set, and variable-size blocks can be delimited with a sliding window: whenever the hash value of the sliding window matches a reference value, a block boundary is created. The slicer divides the pending data into sub-blocks, and the MD5 value of each sub-block is computed to serve as that sub-block's unique identifier. The identifiers and check codes of the data already in storage are extracted and merged. The method then checks whether duplicate identifiers and check codes exist. If they do, the duplicate data blocks are deleted and the remaining blocks are stored; if they do not, the data is stored directly.
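A sliding-window slicer of the kind described above might look like the following Python sketch. The source gives no parameters, so the window width, the boundary mask, the reference value, the size bounds, and the polynomial rolling hash are all assumptions chosen for illustration; only the use of MD5 for the identifiers comes from the text.

```python
import hashlib

WINDOW = 16                   # sliding-window width in bytes (hypothetical)
MASK = (1 << 6) - 1           # compare the low 6 bits: about 1 position in 64 matches
REFERENCE = 0                 # reference value the window hash must match
MIN_SIZE, MAX_SIZE = 32, 256  # hypothetical bounds on sub-block size
BASE, MOD = 257, (1 << 31) - 1


def slice_data(data: bytes) -> list:
    """Cut data into variable-size sub-blocks at content-defined boundaries."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i - start >= WINDOW:
            # Slide the window: remove its oldest byte before adding the new one.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD
        size = i - start + 1
        if (size >= MIN_SIZE and (h & MASK) == REFERENCE) or size >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0  # restart the window in the next sub-block
    if start < len(data):
        chunks.append(data[start:])  # trailing partial sub-block
    return chunks


# Deterministic sample data (hypothetical, 2 KiB).
sample = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(64))
chunks = slice_data(sample)
identifiers = [hashlib.md5(c).hexdigest() for c in chunks]
```

Because a boundary depends only on the bytes inside the window, an insertion early in the data tends to disturb only nearby sub-blocks, and later boundaries can typically realign with those of the unmodified data.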

The detailed steps of the above method are as follows:

(1) Take out the data to be stored and obtain the organizational structure of the pending data. Data organizational structure is defined here as follows: if the data is denoted by D and the relations among the data by R, then DR = (D, R) denotes the data organizational structure. The structure is then verified against the flag information already present in the local configuration file, and after verification it is compared with the check codes in storage to determine whether the pending data is consistent with the stored data.

(2) If the pending data is consistent with the stored data, obtain the data slicer for its structure type; after obtaining the slicer, load the data from hard disk into memory and pass in the pending data.

(3) If the pending data is inconsistent with the stored data, obtain the data slicer for the data structure and pass in the pending data.

(4) The data slicer divides the pending data into multiple sub-blocks, and the MD5 value of each sub-block is computed to give the sub-block a unique identifier. Different data types have different identifiers, and the MD5 values of different data blocks differ, so each identifier is unique.

(5) Extract the stored data and obtain its identifiers and check codes using the same method as in step (4).

(6) Compare the identifiers and check codes of the two groups of data to determine whether duplicates exist. If an identical identifier and check code are found, delete the corresponding data block from the pending data.

Fig. 2 is the flowchart for obtaining the data slicer.

Once the organizational structure of the pending data has been obtained, it must be analyzed. The analysis verifies the obtained data organizational structure against the identification information already present in the local configuration file, and then checks it against the check codes of the stored data. If the check codes of the stored data include the check code of the data to be stored, the data slicer for that organizational structure type is obtained; otherwise, the data slicer for the conventional organization type is obtained.
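The selection logic of Fig. 2 could be sketched as a lookup with a fallback. Everything concrete here is hypothetical: the `SLICERS` registry, the `"text-lines"` structure type, the line-based and fixed-size slicers, and the integer check codes are placeholders, since the source does not specify how structure types or check codes are represented.

```python
def fixed_slicer(size: int):
    """Build a slicer that cuts data into blocks of a fixed size."""
    def slicer(data: bytes) -> list:
        return [data[i:i + size] for i in range(0, len(data), size)]
    return slicer


def line_slicer(data: bytes) -> list:
    """Structure-aware slicer for line-oriented data: one sub-block per line."""
    return data.splitlines(keepends=True)


# Hypothetical registry: structure type -> slicer tailored to that structure.
SLICERS = {"text-lines": line_slicer}
DEFAULT_SLICER = fixed_slicer(8)  # stands in for the conventional slicer


def get_slicer(structure_type, pending_check_code, stored_check_codes):
    """Use the structure-specific slicer only when the check code is known."""
    if pending_check_code in stored_check_codes and structure_type in SLICERS:
        return SLICERS[structure_type]
    return DEFAULT_SLICER


known_codes = {0x1234}
matched = get_slicer("text-lines", 0x1234, known_codes)(b"one\ntwo\n")
fallback = get_slicer("text-lines", 0x9999, known_codes)(b"one\ntwo\n")
```

With a known check code the data is cut along its own structure (one sub-block per line); with an unknown check code the conventional slicer is applied instead.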

Claims

1. A method for removing duplicate data before data storage according to the organizational characteristics of the pending data, comprising the following steps:

(1) Determine whether the organizational structure of the pending data is consistent with the data that already exists.

(2) If consistent, obtain the data slicer for its structure type, then load the data from hard disk into memory and pass in the pending data; if inconsistent, obtain the data slicer for the data structure and pass in the pending data.

(3) The slicer divides the pending data into sub-blocks and generates a unique identifier for each sub-block.

(4) Extract and merge the identifiers and check codes of the data in storage.

(5) Determine whether duplicate identifiers and check codes exist, and store the data.

2. The method of claim 1 for removing duplicate data before data storage according to the organizational characteristics of the pending data, wherein the purpose is to solve the problem of identifying and removing duplicate data before data storage.

3. The data storage method of claim 1, characterized in that the pending data is cut, according to its organizational characteristics, into sub-blocks of unequal length before data storage, a standard identifier is generated for each sub-block to determine whether duplicate data exists, and the data is thus processed before storage, reducing the possibility of deduplication misjudgment after storage.

4. The organizational structure of the data in step (1) of claim 1 is defined as follows: if the data is denoted by D and the relations among the data by R, then DR = (D, R) denotes the data organizational structure.