CN103473278A - Repeating data processing technology - Google Patents


Info

Publication number
CN103473278A
Authority
CN
China
Prior art keywords
data
file
fingerprint
data blocks
block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013103789166A
Other languages
Chinese (zh)
Inventor
曹峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SUZHOU TIANYONGBEI NETWORK TECHNOLOGY Co Ltd
Original Assignee
SUZHOU TIANYONGBEI NETWORK TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SUZHOU TIANYONGBEI NETWORK TECHNOLOGY Co Ltd
Priority to CN2013103789166A
Publication of CN103473278A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a repeating data processing technique comprising two methods: static file segmentation and dynamic file segmentation. Static file segmentation cuts a file into blocks of a fixed size. Dynamic file segmentation comprises the following steps: locating the boundary positions of data blocks according to a given algorithm; computing data fingerprints; using the data fingerprints to judge whether two data blocks are identical; and storing only one copy of identical data blocks, together with their index values, for use during recovery. This technical scheme reduces the storage capacity that data requires. Building on an in-depth study of storage capacity optimization in disaster recovery backup, it makes certain technical improvements to data deduplication and achieves high-quality storage.

Description

A repeating data processing technique
Technical field
The present invention relates to warning systems, and specifically to a repeating data processing technique.
Background
Enterprise demand for information storage is growing by leaps and bounds, and the collection and processing of information has become one of the key technical factors that determine whether an enterprise survives and develops. At the same time, the reliability and security of the data in information systems are receiving increasing attention, and a data disaster recovery system is an effective technical means of ensuring data security. Catastrophic events such as the September 11 attacks and the Southeast Asian tsunami, and more recently the snow disaster in southern China and the Wenchuan earthquake, have led enterprises to a common conclusion: a remote disaster recovery system must be established to guarantee business continuity. Disaster recovery systems were proposed in response to current technology trends and the need for data security and business continuity. The most direct problem that rapidly growing data volumes pose for a disaster recovery backup system is insufficient storage space; they also put great pressure on the system's processing power and data transfer bandwidth. To keep a disaster recovery system running efficiently and stably, a storage capacity optimization mechanism is therefore needed to reduce the storage capacity that data requires. Building on an in-depth study of storage capacity optimization in disaster recovery backup, the present invention makes certain technical improvements to data deduplication and achieves high-quality storage.
Summary of the invention
The object of the present invention is to overcome the problems of the prior art by providing a repeating data processing technique.
To achieve the above technical purpose and technical effect, the present invention is implemented through the following technical scheme:
A repeating data processing technique comprises two methods: static file segmentation and dynamic file segmentation. Static file segmentation cuts a file into blocks of a fixed size. Dynamic file segmentation comprises the following steps:
Step 1) locate the boundary positions of the data blocks according to a given algorithm;
Step 2) compute the data fingerprints: once the file has been cut into a number of small data blocks, a data fingerprint is calculated for each small data block;
Step 3) use the data fingerprints to judge whether two data blocks are identical; because the number of data blocks is very large, the data blocks are looked up with a hash-function-based lookup method, which effectively shortens the lookup time;
Step 4) store only one copy of identical data blocks, and store the index values of the identical data blocks so that they can be used during recovery.
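The difference between the two segmentation methods can be illustrated with a short Python sketch. The text does not name the boundary-search algorithm of step 1, so the rolling-sum rule below, the function names `split_fixed` and `split_dynamic`, and all parameter values are assumptions chosen only for illustration, not the method of the invention.

```python
from typing import List


def split_fixed(data: bytes, size: int = 4096) -> List[bytes]:
    """Static segmentation: cut the data every `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def split_dynamic(data: bytes, window: int = 16, mask: int = 0x3F,
                  min_size: int = 32, max_size: int = 1024) -> List[bytes]:
    """Dynamic segmentation (illustrative rule, not from the patent).

    A boundary is declared where the sum of the last `window` bytes has
    all bits selected by `mask` equal to zero, so boundaries depend on
    content rather than on absolute file offsets.
    """
    chunks = []
    start = 0      # offset where the current block begins
    rolling = 0    # sum of the bytes currently inside the window
    for i, byte in enumerate(data):
        rolling += byte
        if i - start >= window:
            rolling -= data[i - window]   # drop the byte leaving the window
        length = i - start + 1
        at_boundary = length >= min_size and (rolling & mask) == 0
        if at_boundary or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            rolling = 0
    if start < len(data):
        chunks.append(data[start:])       # trailing partial block
    return chunks
```

With fixed-size cutting, inserting a single byte near the start of a file shifts every later block, whereas content-defined boundaries generally realign shortly after the edit. In both cases, concatenating the returned blocks reproduces the input.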
Further, the data block fingerprints in dynamic file segmentation are computed using a weak check value and the SHA1 algorithm.
Further, the weak check value is a cyclic redundancy check value calculated for each data block. This algorithm is relatively simple. When the cyclic redundancy values differ, the two data blocks can be judged to be different. When the cyclic redundancy values are the same, it cannot be determined whether the two data blocks are identical, so the SHA1 algorithm is used to calculate a value for each of the two data blocks: if the two data blocks are identical, the 160-bit values produced by SHA1 are the same; otherwise they differ.
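The two-level comparison described above might be sketched in Python as follows. The text specifies only "a cyclic redundancy value" and SHA1; the use of CRC-32 via `zlib.crc32` and the helper names `fingerprint` and `same_block` are assumptions made for this example.

```python
import hashlib
import zlib
from typing import Tuple


def fingerprint(block: bytes) -> Tuple[int, bytes]:
    """Return (weak CRC-32 check value, strong 160-bit SHA-1 digest)."""
    return zlib.crc32(block), hashlib.sha1(block).digest()


def same_block(a: bytes, b: bytes) -> bool:
    """Decide whether two blocks match, cheapest test first."""
    # Different CRC values prove that the blocks differ.
    if zlib.crc32(a) != zlib.crc32(b):
        return False
    # Equal CRC values are inconclusive, so compare the SHA-1 digests.
    return hashlib.sha1(a).digest() == hashlib.sha1(b).digest()
```

The point of the ordering is cost: the weak check is cheap and rejects most non-matching pairs, so the more expensive SHA1 computation is only needed when the weak values agree.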
Beneficial effects of the present invention:
The technical scheme of the present invention reduces the storage capacity that data requires. Building on an in-depth study of storage capacity optimization in disaster recovery backup, it also makes certain technical improvements to data deduplication and achieves high-quality storage.
Brief description of the drawings
Fig. 1 compares the data before and after optimization according to the present invention;
Fig. 2 is a schematic diagram of a specific implementation of the present invention.
Detailed description
The present invention is described in detail below with reference to the accompanying drawings and in conjunction with an embodiment.
As shown in Fig. 2, a repeating data processing technique comprises two methods: static file segmentation and dynamic file segmentation. Static file segmentation cuts a file into blocks of a fixed size. Dynamic file segmentation comprises the following steps:
Step 1) locate the boundary positions of the data blocks according to a given algorithm;
Step 2) compute the data fingerprints: once the file has been cut into a number of small data blocks, a data fingerprint is calculated for each small data block;
Step 3) use the data fingerprints to judge whether two data blocks are identical; because the number of data blocks is very large, the data blocks are looked up with a hash-function-based lookup method, which effectively shortens the lookup time;
Step 4) store only one copy of identical data blocks, and store the index values of the identical data blocks so that they can be used during recovery.
Further, the data block fingerprints in dynamic file segmentation are computed using a weak check value and the SHA1 algorithm.
Further, the weak check value is a cyclic redundancy check value calculated for each data block. This algorithm is relatively simple. When the cyclic redundancy values differ, the two data blocks can be judged to be different. When the cyclic redundancy values are the same, it cannot be determined whether the two data blocks are identical, so the SHA1 algorithm is used to calculate a value for each of the two data blocks: if the two data blocks are identical, the 160-bit values produced by SHA1 are the same; otherwise they differ.
Principle of the present invention:
A file is cut into a number of small data blocks, and a given algorithm is used to calculate the data fingerprint of each small data block. If two data fingerprints are the same, the contents of the two data blocks are identical; otherwise, the contents of the two small data blocks differ. For storage, only one copy of identical data blocks needs to be kept, and the stored block is called a metadata block. To restore the original data, the index values of the identical data blocks in the original data must also be stored.
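The store-once-and-index principle can be sketched in Python as below. The class name `DedupStore`, its methods, and the choice of a Python dictionary keyed by SHA-1 digest as the hash-based lookup structure are assumptions for illustration; the text does not prescribe a particular data structure or index format.

```python
import hashlib
from typing import Dict, Iterable, List


class DedupStore:
    """Keeps one copy of each distinct block plus a per-file index."""

    def __init__(self) -> None:
        # fingerprint -> block content; a dict gives hash-based lookup
        self.blocks: Dict[bytes, bytes] = {}

    def put(self, chunks: Iterable[bytes]) -> List[bytes]:
        """Store a file's blocks and return its index of fingerprints."""
        index = []
        for chunk in chunks:
            key = hashlib.sha1(chunk).digest()
            # Only the first occurrence of a block is actually stored.
            self.blocks.setdefault(key, chunk)
            index.append(key)
        return index

    def restore(self, index: List[bytes]) -> bytes:
        """Rebuild the original data by following the stored index."""
        return b"".join(self.blocks[key] for key in index)
```

For example, storing the three blocks `b"aa"`, `b"bb"`, `b"aa"` keeps two block copies, and the returned index of three fingerprints is enough to rebuild `b"aabbaa"`.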
The foregoing is only a preferred embodiment of the present invention and is not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (3)

1. A repeating data processing technique, characterized in that it comprises two methods: static file segmentation and dynamic file segmentation, wherein static file segmentation cuts a file into blocks of a fixed size, and dynamic file segmentation comprises the following steps:
Step 1) locate the boundary positions of the data blocks according to a given algorithm;
Step 2) compute the data fingerprints: once the file has been cut into a number of small data blocks, a data fingerprint is calculated for each small data block;
Step 3) use the data fingerprints to judge whether two data blocks are identical; because the number of data blocks is very large, the data blocks are looked up with a hash-function-based lookup method, which effectively shortens the lookup time;
Step 4) store only one copy of identical data blocks, and store the index values of the identical data blocks so that they can be used during recovery.
2. The repeating data processing technique according to claim 1, characterized in that the data block fingerprints in dynamic file segmentation are computed using a weak check value and the SHA1 algorithm.
3. The repeating data processing technique according to claim 2, characterized in that the weak check value is a cyclic redundancy check value calculated for each data block; this algorithm is relatively simple; when the cyclic redundancy values differ, the two data blocks can be judged to be different; when the cyclic redundancy values are the same, it cannot be determined whether the two data blocks are identical, so the SHA1 algorithm is used to calculate a value for each of the two data blocks; if the two data blocks are identical, the 160-bit values produced by SHA1 are the same; otherwise they differ.
CN2013103789166A 2013-08-28 2013-08-28 Repeating data processing technology Pending CN103473278A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013103789166A CN103473278A (en) 2013-08-28 2013-08-28 Repeating data processing technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2013103789166A CN103473278A (en) 2013-08-28 2013-08-28 Repeating data processing technology

Publications (1)

Publication Number Publication Date
CN103473278A true CN103473278A (en) 2013-12-25

Family

ID=49798126

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013103789166A Pending CN103473278A (en) 2013-08-28 2013-08-28 Repeating data processing technology

Country Status (1)

Country Link
CN (1) CN103473278A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103955530A (en) * 2014-05-12 2014-07-30 暨南大学 Data reconstruction and optimization method of on-line repeating data deletion system
CN104317823A (en) * 2014-09-30 2015-01-28 北京合力思腾科技股份有限公司 Method for carrying out data detection by utilizing data fingerprints
CN104408154A (en) * 2014-12-04 2015-03-11 华为技术有限公司 Repeated data deletion method and device
CN104407928A (en) * 2014-11-18 2015-03-11 杭州华为企业通信技术有限公司 Data transmission method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20030083292A (en) * 2002-04-20 2003-10-30 주식회사 퓨쳐시스템 Apparatus and method for providing a cipher accelerator using a hash function
CN101706825A (en) * 2009-12-10 2010-05-12 华中科技大学 Replicated data deleting method based on file content types
CN101916171A (en) * 2010-07-16 2010-12-15 中国科学院计算技术研究所 Concurrent hierarchy type replicated data eliminating method and system
CN101989929A (en) * 2010-11-17 2011-03-23 中兴通讯股份有限公司 Disaster recovery data backup method and system
US20120005144A1 (en) * 2010-06-30 2012-01-05 Alcatel-Lucent Canada, Inc. Optimization of rule entities

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103955530A (en) * 2014-05-12 2014-07-30 暨南大学 Data reconstruction and optimization method of on-line repeating data deletion system
CN103955530B (en) * 2014-05-12 2017-02-22 暨南大学 Data reconstruction and optimization method of on-line repeating data deletion system
CN104317823A (en) * 2014-09-30 2015-01-28 北京合力思腾科技股份有限公司 Method for carrying out data detection by utilizing data fingerprints
CN104317823B (en) * 2014-09-30 2016-03-16 北京艾秀信安科技有限公司 A kind of method utilizing data fingerprint to carry out Data Detection
CN104407928A (en) * 2014-11-18 2015-03-11 杭州华为企业通信技术有限公司 Data transmission method and device
CN104408154A (en) * 2014-12-04 2015-03-11 华为技术有限公司 Repeated data deletion method and device
CN104408154B (en) * 2014-12-04 2018-05-29 华为技术有限公司 Data de-duplication method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20131225