CN101789977B - Teledata copying and de-emphasis method based on Hash coding - Google Patents

Info

Publication number
CN101789977B
Authority
CN
China
Prior art keywords
data
node
destination
hash
data block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN2010191850197A
Other languages
Chinese (zh)
Other versions
CN101789977A (en)
Inventor
刘靖宇
周泽湘
谢红军
谭毓安
王成武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING TOYOU FEIJI ELECTRONICS Co Ltd
Original Assignee
BEIJING TOYOU FEIJI ELECTRONICS Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Abstract

The invention provides a remote data replication and deduplication method based on hash coding. On both the source node and the destination node, extra storage space outside the data disk stores the hash value of each data block of the data disk. After the source node receives a write request for a destination data block, it identifies duplicate data blocks by hash matching. For a duplicate block, the source node does not transmit the data block to the destination node; it transmits only the addresses of the source data block and the destination data block. The destination node reads the source data block from its own disk and writes the data into the destination data block. Thus, when data is duplicated, only block numbers need to be transmitted rather than the data block itself, which reduces the network bandwidth consumed by data transmission and shortens synchronization time.

Description

A remote data replication and deduplication method based on hash coding
Technical field
The invention belongs to the technical field of data disaster recovery and relates to a remote data replication and deduplication method, specifically a method that uses hash codes in a remote data disaster recovery system to identify duplicate data blocks and avoid transmitting them.
Background
Data disaster recovery is an important measure for ensuring the integrity and availability of computer systems. Remote replication keeps an independent backup of local data at a remote site over a network link, so that when the local system is damaged, data and business applications can be restored from the remote system. The basic procedure is as follows:
First, all data blocks on the disk of the local source server (the source node) are copied to the remote destination server (the destination node), which completes the initial synchronization. Afterwards, data changes on the source node are replicated to the destination node over the network, either synchronously or asynchronously. This approach has the following shortcomings:
The source node and the destination node are usually deployed in two buildings far apart, or even in two cities. Because dedicated networks are expensive, replication between the source node and the destination node generally uses an ordinary IP network. When data is updated frequently and the transmission volume is large, limited network bandwidth and latency can degrade performance during replication and cause backup data to be lost.
In addition, some data blocks on a disk have identical content. For example, a file may have several copies on the disk, or several versions may be stored, with content repeated between versions. In a data disaster recovery system, when the source node creates a copy of a file or updates a file, all data blocks of that file must be sent to the destination node. However, the destination node already holds part of the file's data, so some of the blocks sent over the network duplicate blocks already on the destination node's disk. This seriously reduces network utilization and adds unnecessary bandwidth consumption.
Summary of the invention
The object of the invention is to overcome the shortcomings of the prior art by proposing a remote data replication and deduplication method based on hash coding, which improves network utilization during remote replication and reduces the network bandwidth consumed by replication. In this method, the source node identifies duplicate data blocks by hash matching. By extending the data replication protocol between the source node and the destination node, the source node does not need to transmit such a block to the destination node; the destination node copies the block directly from its own disk. For a duplicate data block, only address information (the block numbers) is transmitted rather than the data itself, which avoids transmitting duplicate data and reduces the network bandwidth required.
The technical scheme adopted by the invention is as follows:
Existing data disaster recovery systems replicate data blocks between two nodes over an IP network to keep the disks of the two nodes consistent. The invention uses a portion of extra storage space outside the data disk (the hash store) to record the hash value of each data block of the data disk. The hash store is updated in step with the disk data, and the hash stores of the source node and the destination node are kept consistent. The system architecture is shown in Fig. 1.
Typically, each data block is 4 KB (4096 bytes), and the MD5 hash algorithm is used to compute a 128-bit hash value, which occupies 16 bytes. The hash store holds the hash values of all data blocks of the data disk in block order. Each data block takes 16 bytes in the hash store, so the hash store requires 16/4096 = 1/256 of the capacity of the data disk. The structure of the hash store is shown in Fig. 2.
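The sizing above can be checked with a minimal Python sketch. It assumes the 4 KB block size and 16-byte MD5 digest stated in the text; the helper names block_hash and hash_store_size are hypothetical and are not taken from the patent.

```python
import hashlib

BLOCK_SIZE = 4096  # bytes per data block, as stated in the text
HASH_SIZE = 16     # bytes per MD5 digest (128 bits)


def block_hash(block: bytes) -> bytes:
    """Return the 16-byte MD5 digest of one full data block."""
    if len(block) != BLOCK_SIZE:
        raise ValueError("expected a full 4 KB block")
    return hashlib.md5(block).digest()


def hash_store_size(disk_bytes: int) -> int:
    """Bytes needed to hold one digest per block of a disk of the given size."""
    return (disk_bytes // BLOCK_SIZE) * HASH_SIZE


digest = block_hash(bytes(BLOCK_SIZE))  # digest of an all-zero block
one_gib = 1 << 30
store = hash_store_size(one_gib)        # 4 MiB of hashes for a 1 GiB disk
```

For a 1 GiB disk this gives 262144 blocks and a 4 MiB hash store, which is the 1/256 ratio stated above.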
After the source node receives a write request for a data block (called the destination data block), it writes the block to the data disk, computes the hash value of the block, and matches it against the hash store.
If no match is found, the source node sends the content of the destination data block to the destination node, which writes it into the destination data block on its disk.
If a match is found, some data block (called the source data block) on the data disk of the source node has the same content as the destination data block; that is, the data is duplicated. That block was already sent to the destination node during initialization or an earlier replication, so the source data block on the destination node's data disk already contains the content of the destination data block. The source node therefore only needs to send the addresses of the source data block and the destination data block (the source block number and the destination block number) to the destination node. The destination node reads the source data block from its own disk and writes it into the destination data block.
After a successful write, both nodes write the hash value of the destination data block into their hash stores.
When the source node fails, the remote destination node can start the business system and take over the source node's services. Until the source node is repaired, data changes on the destination node cannot be transmitted to the source node; the destination node writes data blocks to its data disk and updates its hash store at the same time. After the source node is repaired, the two nodes must resynchronize their data. Likewise, if the destination node fails, the source node's hash store is updated whenever a data block on its data disk changes, and after the destination node is repaired the data must also be resynchronized.
Comparing the hash stores of the two nodes yields the set of data blocks that have changed. The healthy node sends these blocks to the node that failed, which restores data consistency between the two nodes. The deduplication technique described above can still be used while the changed blocks are transmitted.
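The comparison step used for resynchronization can be sketched as follows. This is only an illustration under the assumption that each node's hash store is available as a flat byte string of 16-byte entries in block order; the function name changed_blocks is hypothetical, and the patent does not say how the two stores are brought together for comparison.

```python
HASH_SIZE = 16  # bytes per hash entry, one entry per data block


def changed_blocks(store_a: bytes, store_b: bytes) -> list[int]:
    """Return the block numbers whose hash entries differ between two stores."""
    if len(store_a) != len(store_b):
        raise ValueError("hash stores must cover the same number of blocks")
    changed = []
    for offset in range(0, len(store_a), HASH_SIZE):
        entry_a = store_a[offset:offset + HASH_SIZE]
        entry_b = store_b[offset:offset + HASH_SIZE]
        if entry_a != entry_b:
            # The entry's position in the store identifies the block.
            changed.append(offset // HASH_SIZE)
    return changed
```

Only the hash stores are read here, which matches the point that the changed set can be found without reading the data disks themselves.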
The beneficial effects of the invention are:
1) Bandwidth is saved. For a duplicate data block, the destination node already holds the data itself, so the source node only needs to transmit block addresses (block numbers). This reduces the volume of data sent over the network and reduces network delay.
2) The time needed for data resynchronization is shortened. By comparing hash stores, the set of changed data blocks can be obtained quickly without reading the data disks, and deduplication reduces the amount of data transferred during synchronization.
Description of drawings
Fig. 1 is the architecture diagram of remote data replication with deduplication;
Fig. 2 is the structure diagram of the hash store;
Fig. 3 is an example of data replication with deduplication;
Fig. 4 is the flow chart of data replication using the deduplication technique;
Fig. 5 shows the format of the data replication protocol.
Embodiment
The invention is described in further detail below with reference to the accompanying drawings and an embodiment.
On both the source node and the destination node, the invention uses a portion of extra storage space to record the hash value of each data block.
Each data block is 4 KB and its hash value is 16 bytes, so the hash store is 1/256 of the size of the data disk. As shown in Fig. 3, the source node receives a write request to write data A into destination data block SD_B; data A must also be written into data block DD_B of the destination node (SD_B = DD_B). The steps are as follows. The source node writes data A into destination data block SD_B, computes the hash value of data A, and matches it against the hash store. If the hash value at address SH_A in the hash store equals the hash value of data A, the match succeeds. The block number on the data disk that corresponds to SH_A is SD_A, which shows that data A already exists on the source node's data disk at address SD_A. The destination node holds the same data at address DD_A (DD_A = SD_A). During synchronization, data A therefore does not need to be sent to the destination node again; instead, the source node sends the block numbers of source data block SD_A and destination data block SD_B. Using the received source and destination block numbers, the destination node reads the data from block DD_A on its local disk and writes it into block DD_B. After completing their respective writes, the source node and the destination node update the hash values at SH_B and DH_B respectively.
For example, in an existing disaster recovery system, suppose a file F of size 8 MB (8192 KB) is copied from location A to location B. With 64-bit disk addresses, each data block is 4 KB and each block address is 8 bytes (64 bits), so file F comprises 8192 KB / 4 KB = 2048 data blocks. Synchronizing the source node and the destination node requires sending every data block of the file together with its block address: 2048 × (4 KB + 8 B) = 8208 KB. With the deduplication method of the invention in the data replication protocol, file F already exists at the destination node, so only the source address, the destination address, and a flag (1 B) need to be sent for each block, not the data of file F itself. The amount of data transmitted is then 2048 × (8 B × 2 + 1 B) = 34 KB, which is 34 KB / 8208 KB ≈ 1/241 of the original volume and greatly reduces the network bandwidth required.
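The arithmetic of this example can be reproduced with a few lines of Python. The variable names are illustrative only; the sizes (8192 KB file, 4 KB blocks, 8-byte addresses, 1-byte flag) are the ones given in the text.

```python
KB = 1024

file_blocks = (8192 * KB) // (4 * KB)        # number of 4 KB blocks in file F

# Without deduplication: each block travels with its 8-byte block address.
plain_bytes = file_blocks * (4 * KB + 8)

# With deduplication: two 8-byte block numbers plus a 1-byte flag per block.
dedup_bytes = file_blocks * (8 * 2 + 1)

ratio = plain_bytes / dedup_bytes            # how many times less data is sent
```

The two totals come out to exactly 8208 KB and 34 KB, and the ratio is a little over 241, in line with the 1/241 figure quoted above.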
Embodiment
On both the source node and the destination node, the invention uses a portion of extra storage space to record the hash value of each data block.
Each data block is 4 KB and its hash value is 16 bytes, so the hash store is 1/256 of the size of the data disk.
The storage address of each data block's hash value in the hash store is computed with the following formula:
Address = block number × 16
Conversely, the block number that corresponds to a hash value found in the hash store can be computed directly from its address:
Block number = address / 16
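The two formulas are inverse mappings between a block number and a byte offset into the hash store. A minimal sketch follows; the function names hash_address and block_number_at are hypothetical, and the alignment check is an addition of this sketch rather than something the patent specifies.

```python
HASH_SIZE = 16  # bytes per hash entry


def hash_address(block_number: int) -> int:
    """Byte offset in the hash store of the entry for the given block."""
    return block_number * HASH_SIZE


def block_number_at(address: int) -> int:
    """Block number whose hash entry starts at the given byte offset."""
    if address % HASH_SIZE != 0:
        raise ValueError("address is not aligned to a hash entry")
    return address // HASH_SIZE
```

Because the hash store is ordered by block number, no separate index is needed to go from a matched entry back to the block that holds the data.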
When the source node receives a data write request for the source disk, it executes the flow shown in Fig. 4.
◆ For the source node:
1) Write all data blocks of the write request to the data disk.
2) For each destination data block in the write request, carry out steps 3) to 5) below.
3) Compute the hash value of the data block.
4) Search the hash store for this hash value.
a) If the hash value is not in the hash store, construct a network packet with the structure shown in Fig. 5(a) and send it to the destination node. The flag is set to "0", indicating that this transmission contains the data block; the destination block number is the block address of this data block on the destination node's data disk.
b) If the hash value is in the hash store, compute the source block number from the position of the hash value. The content of the destination data block is already contained in the source data block on the data disks of both the source node and the destination node, so it is duplicate data. Construct a network packet with the structure shown in Fig. 5(b) and send it to the destination node. The flag is set to "1", indicating that this transmission does not contain the data block itself.
5) Update the hash value of the destination data block in the hash store.
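The source-node steps can be sketched in Python as below. This is an illustration under several assumptions, not the patented implementation: the disk and hash store are in-memory byte arrays, the lookup is a simple linear scan, and the packet layout (1-byte flag followed by 8-byte block numbers in network byte order, source before destination) is a guess, because the exact field order of Fig. 5 is not reproduced in the text. The class name SourceNode and its methods are hypothetical. Like the method described, the sketch treats a hash match as proof of identical content.

```python
import hashlib
import struct

BLOCK_SIZE = 4096
HASH_SIZE = 16


class SourceNode:
    def __init__(self, num_blocks: int):
        self.disk = bytearray(num_blocks * BLOCK_SIZE)
        # Start with a store that is consistent with the all-zero disk.
        zero_digest = hashlib.md5(bytes(BLOCK_SIZE)).digest()
        self.hash_store = bytearray(zero_digest * num_blocks)

    def find_source_block(self, digest: bytes):
        """Step 4: scan the hash store; return the matching block number or None."""
        for addr in range(0, len(self.hash_store), HASH_SIZE):
            if self.hash_store[addr:addr + HASH_SIZE] == digest:
                return addr // HASH_SIZE
        return None

    def write(self, dest_block: int, data: bytes) -> bytes:
        """Handle one block write and return the packet for the destination node."""
        if len(data) != BLOCK_SIZE:
            raise ValueError("expected a full 4 KB block")
        start = dest_block * BLOCK_SIZE
        self.disk[start:start + BLOCK_SIZE] = data            # step 1
        digest = hashlib.md5(data).digest()                   # step 3
        src_block = self.find_source_block(digest)            # step 4
        if src_block is None:
            # 4a: flag 0, destination block number, then the block itself.
            packet = struct.pack("!BQ", 0, dest_block) + data
        else:
            # 4b: flag 1, source and destination block numbers only.
            packet = struct.pack("!BQQ", 1, src_block, dest_block)
        entry = dest_block * HASH_SIZE
        self.hash_store[entry:entry + HASH_SIZE] = digest     # step 5
        return packet
```

With this layout a deduplicated packet is 17 bytes, which agrees with the 8 B × 2 + 1 B per block used in the worked example earlier in the description. A real system would likely replace the linear scan with an index over the hash values.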
◆ For the destination node:
1) Receive the network packet from the source node.
2) Act according to the flag of the network packet:
a) If the flag is "0", extract the destination block number and the data block content from the packet, and write the content to the data disk.
b) If the flag is "1", extract the destination block number and the source block number from the packet, read the source data block from the data disk, and write its content into the destination data block.
3) Compute the hash value of the data block.
4) Update the hash value of the destination data block in the hash store.
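The destination-node steps can be sketched in the same way. The packet layout is again an assumption (1-byte flag, then 8-byte block numbers in network byte order, source before destination, with the block content following the destination block number when the flag is 0), and the function name apply_packet is hypothetical.

```python
import hashlib
import struct

BLOCK_SIZE = 4096
HASH_SIZE = 16


def apply_packet(disk: bytearray, hash_store: bytearray, packet: bytes) -> int:
    """Apply one replication packet; return the destination block number."""
    flag = packet[0]
    if flag == 0:
        # 2a: the packet carries the destination block number and the data.
        (dest_block,) = struct.unpack_from("!Q", packet, 1)
        data = packet[9:9 + BLOCK_SIZE]
    elif flag == 1:
        # 2b: the packet carries block numbers only; copy from the local disk.
        src_block, dest_block = struct.unpack_from("!QQ", packet, 1)
        src = src_block * BLOCK_SIZE
        data = bytes(disk[src:src + BLOCK_SIZE])
    else:
        raise ValueError("unknown packet flag")
    if len(data) != BLOCK_SIZE:
        raise ValueError("packet or source block is not a full 4 KB block")
    start = dest_block * BLOCK_SIZE
    disk[start:start + BLOCK_SIZE] = data
    # Steps 3 and 4: recompute the hash and update the local hash store.
    entry = dest_block * HASH_SIZE
    hash_store[entry:entry + HASH_SIZE] = hashlib.md5(data).digest()
    return dest_block
```

Recomputing the hash locally, rather than receiving it from the source node, follows steps 3) and 4) above and keeps the packet as small as possible.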

Claims (1)

1. A remote data replication and deduplication method based on hash coding, characterized in that:
On the source node and the destination node respectively, a portion of extra storage space (the hash store) is used to record the hash value of each data block of the data disk; the hash store is updated in step with the disk data, and the hash stores of the source node and the destination node are kept consistent;
The hash store holds the hash values of all data blocks of the data disk in block order; each data block takes 16 bytes, so the hash store requires 16/4096 = 1/256 of the capacity of the data disk;
After the source node receives a write request for a data block, namely the destination data block, it writes the block to the data disk, computes the hash value of the block, and matches it against the hash store;
If no match is found, the source node sends the destination block number together with the data block content to the destination node, which writes the content into the destination data block on its disk;
If a match is found, a data block on the data disk of the source node has the same content as the destination data block; in this case the source node only needs to send the addresses of the source data block and the destination data block, namely the source block number and the destination block number, to the destination node, which reads the source data block from its own disk and writes it into the destination data block; after a successful write, both nodes write the hash value of the destination data block into their hash stores;
When the source node fails, the remote destination node starts the business system and takes over the source node's services; until the source node is repaired, data changes on the destination node cannot be transmitted to the source node, and the destination node writes data blocks to its data disk and updates its hash store at the same time; after the source node is repaired, the two nodes must resynchronize their data; likewise, if the destination node fails, the source node's hash store is updated whenever a data block on its data disk changes, and after the destination node is repaired the data must also be resynchronized.
CN2010191850197A 2010-02-08 2010-02-08 Teledata copying and de-emphasis method based on Hash coding Active CN101789977B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010191850197A CN101789977B (en) 2010-02-08 2010-02-08 Teledata copying and de-emphasis method based on Hash coding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010191850197A CN101789977B (en) 2010-02-08 2010-02-08 Teledata copying and de-emphasis method based on Hash coding

Publications (2)

Publication Number Publication Date
CN101789977A CN101789977A (en) 2010-07-28
CN101789977B true CN101789977B (en) 2012-07-25

Family

ID=42533026

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010191850197A Active CN101789977B (en) 2010-02-08 2010-02-08 Teledata copying and de-emphasis method based on Hash coding

Country Status (1)

Country Link
CN (1) CN101789977B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102130939A (en) * 2010-12-10 2011-07-20 创新科存储技术有限公司 Remote duplication method and device
CN102323908A (en) * 2011-08-03 2012-01-18 浪潮(北京)电子信息产业有限公司 Method and system for data caching in large volume data synchronization process on disk
KR101630275B1 (en) * 2012-03-27 2016-06-14 에스케이텔레콤 주식회사 Contents delivery system, method for synchronizing a cache and apparatus thereof
US9042386B2 (en) * 2012-08-14 2015-05-26 International Business Machines Corporation Data transfer optimization through destination analytics and data de-duplication
CN103875229B (en) * 2013-12-02 2017-04-26 华为技术有限公司 asynchronous replication method, device and system
WO2015100639A1 (en) * 2013-12-31 2015-07-09 华为技术有限公司 De-duplication method, apparatus and system
KR102148757B1 (en) * 2015-09-17 2020-08-27 삼성전자주식회사 Method and apparatus for transmitting/receiving data in a communication system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101415016A (en) * 2007-10-17 2009-04-22 深圳市亚贝电气技术有限公司 A kind of data copy method, system and storage server
CN101520743A (en) * 2009-04-17 2009-09-02 杭州华三通信技术有限公司 Data storage method, system and device based on copy-on-write

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
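Cui Xinghua, Du Xiaoli, Zhao Xiaorui. Application of duplicate data detection in multi-version data backup (重复数据检测在多版本数据备份中的应用). Application Research of Computers (《计算机应用研究》), 2009, vol. 26, no. 1. *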

Also Published As

Publication number Publication date
CN101789977A (en) 2010-07-28

Similar Documents

Publication Publication Date Title
CN101789977B (en) Teledata copying and de-emphasis method based on Hash coding
US20200371884A1 (en) Remote Data Replication Method and System
US8335761B1 (en) Replicating in a multi-copy environment
US8745004B1 (en) Reverting an old snapshot on a production volume without a full sweep
US8521691B1 (en) Seamless migration between replication technologies
US9026696B1 (en) Using I/O track information for continuous push with splitter for storage device
US7953947B2 (en) Creating a snapshot based on a marker transferred from a first storage system to a second storage system
US7308545B1 (en) Method and system of providing replication
US20110107025A1 (en) Synchronizing snapshot volumes across hosts
JP4813924B2 (en) Database management system, storage device, disaster recovery system, and database backup method
US20050050115A1 (en) Method and system of providing cascaded replication
CN101808137B (en) Data transmission method, device and system
JP2005018506A (en) Storage system
CN102033786B (en) Method for repairing consistency of copies in object storage system
CN103780638A (en) Data synchronization method and system
CN106919465B (en) Method and apparatus for multiple data protection in a storage system
US20060277376A1 (en) Initial copy system
CN102483711A (en) Synchronization Of Replicated Sequential Access Storage Components
CN104881333A (en) Storage system and method for using same
US20110225382A1 (en) Incremental replication using snapshots
KR102624911B1 (en) Method for increasing endurance of flash memory by improved metadata management
CN102750110B (en) High-reliable disk array system for configuration information
US11023433B1 (en) Systems and methods for bi-directional replication of cloud tiered data across incompatible clusters
WO2014067452A1 (en) Data synchronization method, data synchronization system and storage medium for multilayer association storage architecture
CN102023816A (en) Object storage policy and access method of object storage system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant