CN111309523A - Data reading and writing method, data remote copying method and device and distributed storage system - Google Patents

Publication number
CN111309523A
Authority
CN
China
Prior art keywords
signature
data
signatures
tree
data block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010094576.4A
Other languages
Chinese (zh)
Inventor
刘洋
周耀辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Orca Data Technology Xian Co Ltd
Original Assignee
Orca Data Technology Xian Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Orca Data Technology Xian Co Ltd
Priority to CN202010094576.4A
Publication of CN111309523A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1446 Point-in-time backing up or restoration of persistent data
    • G06F11/1448 Management of the data involved in backup or backup restore
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1446 Point-in-time backing up or restoration of persistent data
    • G06F11/1458 Management of the backup or restore process
    • G06F11/1464 Management of the backup or restore process for networked environments

Abstract

The invention discloses a data reading and writing method, a data remote copying method and device, and a distributed storage system. The data reading and writing and remote copying method comprises the following steps: dividing a logical volume in the distributed storage system to obtain a plurality of data blocks, calculating a hash value of each data block, and using the hash value as the signature of the data block; storing the signature of each data block on a leaf node of a Merkle tree, calculating the hash value of each leaf node, using the hash value as the signature of the leaf node, and storing the signature of each leaf node on its upper-layer parent node, and so on until the signature of the logical volume is stored on the root node of the Merkle tree; and calculating, according to the logical address of the data block to be read or written, the position of the signature of the data block in each layer of the Merkle tree, and acquiring the signature of the data block. During reading, the content of the data block is obtained through an index according to the signature. During writing, the content of the data block is updated, and the signature at the corresponding position in each layer of the Merkle tree is updated. During remote copying, the signatures of the nodes in each layer are compared recursively starting from the root node of the Merkle tree, and only the modified signatures and their subtrees are remotely copied.

Description

Data reading and writing method, data remote copying method and device and distributed storage system
Technical Field
The invention relates to the technical field of computers, in particular to a data reading and writing method, a data remote copying method and device and a distributed storage system.
Background
With the development of cloud computing, conventional storage products increasingly show their limitations. Distributed storage systems address problems such as horizontal scaling, performance bottlenecks and single points of failure, and greatly improve the reliability, availability and storage efficiency of the system.
Meanwhile, to improve the reliability and availability of data, and to keep systems and services available when a single data center is hit by an unexpected event such as a fire, an earthquake or a large-scale power failure, a distributed storage system also needs to be disaster-tolerant. The key to remote disaster recovery is remote copying of data.
The purpose of remote copy is to copy the data of one data center to one or more disaster recovery data centers, synchronously or asynchronously. When an accident at the main data center (also called the source end) makes the service unavailable, a disaster recovery center can quickly take over as the main service center, start the service and provide it to users, while keeping the data available in real time.
In the prior art, mainstream storage service providers such as Veritas, EMC, AWS, etc. all have related remote copy schemes.
Specifically, remote copy is divided into full replication and incremental replication. When a disaster recovery center is newly built, or when a storage volume, identified by a Logical Unit Number (LUN), is assigned to a specific disaster recovery center, full replication is performed, that is, all data on the volume is replicated to the corresponding volume of the disaster recovery center. When data of the main data center changes, for example through Input/Output (IO) operations, the changed part of the data (additions, deletions and modifications) is synchronized to the remote disaster recovery center periodically (for example, every hour); this is called incremental replication. Incremental replication differs from full replication in that it copies only the modified data blocks.
In terms of full replication, vendors differ little technically: the data on a storage volume (LUN) is replicated in full to the disaster recovery center. The main difference lies in incremental replication.
In incremental replication, the important question is how to track data changes. In the related art, mainstream vendors do this with a Change Map. The space of a storage volume (LUN) is typically divided into a number of fixed-size Regions, for example 64MB each. An extra portion of space is allocated as a map, called the Change Map, to record which Regions have changed; it is usually implemented as a bitmap (BitMap). When the data of a Region changes, the corresponding bit of the bitmap changes as well, for example from 0 to 1. As shown in fig. 1, the bits of the Change Map that are set to 1 indicate that the corresponding source data has changed.
With the Change Map, only the data of the changed Regions needs to be copied during incremental replication. After one replication completes, all bits of the Change Map are set to 0 again.
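The prior-art Change Map described above can be pictured with a minimal Python sketch. It is an illustration only, not any vendor's implementation: the class name `ChangeMap`, its method names and the 1 GiB volume size are assumptions, while the 64MB Region size and the 0-to-1 bit flip come from the text.

```python
class ChangeMap:
    """Bitmap with one bit per fixed-size Region of a volume (hypothetical sketch)."""

    def __init__(self, volume_size, region_size=64 * 1024 * 1024):
        self.region_size = region_size
        # Number of Regions, rounded up so a partial last Region is covered.
        self.num_regions = (volume_size + region_size - 1) // region_size
        self.bits = bytearray((self.num_regions + 7) // 8)

    def mark_write(self, offset, length):
        """Set the bit (0 -> 1) of every Region touched by a write."""
        first = offset // self.region_size
        last = (offset + length - 1) // self.region_size
        for region in range(first, last + 1):
            self.bits[region // 8] |= 1 << (region % 8)

    def changed_regions(self):
        """Scan every bit and return the indices of the Regions marked as changed."""
        return [r for r in range(self.num_regions)
                if (self.bits[r // 8] >> (r % 8)) & 1]

    def reset(self):
        """Set all bits back to 0 after one replication completes."""
        self.bits = bytearray(len(self.bits))


cm = ChangeMap(volume_size=1024 * 1024 * 1024)        # assumed 1 GiB volume: 16 Regions
cm.mark_write(offset=70 * 1024 * 1024, length=4096)   # a 4KB write inside Region 1
```

Note that `mark_write` is the second write that every data write incurs, and `changed_regions` has to look at every bit regardless of how little data changed.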
However, the above approach has several problems:
1. extra space is needed to store the Change Map;
2. each data write is amplified into two write operations: writing the data itself, and updating the corresponding bit of the Change Map (from 0 to 1);
3. there is a Region granularity problem. If the granularity is too small, for example 4KB per Region, the required Change Map is relatively large, and because the Change Map is updated on every 4KB write, its update frequency is also high. If the granularity is too large, for example 64MB per Region, then even if only a small part of the 64MB space changes (for example 4KB of data or less), the entire 64MB Region must be copied to the remote end (disaster recovery end) during incremental replication, which causes many unnecessary remote network transmissions and lowers replication efficiency.
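The granularity trade-off in item 3 can be made concrete with a little arithmetic. The sketch below assumes a 1 TiB volume, which is not a figure from the source; the 4KB and 64MB Region sizes are the ones named in the text, and the helper name `change_map_bytes` is hypothetical.

```python
KIB, MIB, TIB = 1024, 1024 ** 2, 1024 ** 4


def change_map_bytes(volume_size, region_size):
    """Size of a one-bit-per-Region bitmap, in bytes (rounded up)."""
    regions = -(-volume_size // region_size)   # ceiling division
    return -(-regions // 8)


volume = 1 * TIB                                   # assumed volume size
fine_map = change_map_bytes(volume, 4 * KIB)       # bitmap size with 4KB Regions
coarse_map = change_map_bytes(volume, 64 * MIB)    # bitmap size with 64MB Regions

# With 64MB Regions, a single 4KB change still ships the whole Region.
amplification = (64 * MIB) // (4 * KIB)
```

Under these assumptions the fine-grained bitmap takes 32 MiB and is touched on every 4KB write, while the coarse bitmap takes only 2 KiB but can transfer 16384 times more data than actually changed.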
Disclosure of Invention
The embodiment of the invention provides a data reading and writing method, a data remote copying method and device and a distributed storage system, which are used for solving the problems in the prior art.
The embodiment of the invention provides a data reading and writing method based on a Merkle tree, which comprises the following steps:
dividing a logical volume in a distributed storage system by a preset size to obtain a plurality of data blocks, calculating a hash value of each data block, and taking the hash value as the signature of the corresponding data block;
storing the signature of each data block on a leaf node of the Merkle tree, calculating the hash value of each leaf node, using the hash value as the signature of the corresponding leaf node, storing the signature of each leaf node on the upper-layer parent node of the leaf node, and so on until the signature of the logical volume is stored on the root node of the Merkle tree;
calculating the position of the signature of the data block in each layer of the Merkle tree according to the logical address of the data block to be read or written, and acquiring the signature of the data block; during reading, acquiring the content of the data block through an index according to the signature; during writing, updating the content of the data block and updating the signature at the corresponding position in each layer of the Merkle tree.
The embodiment of the invention also provides a data remote copying method based on the data reading and writing method, which comprises the following steps:
after receiving a data remote copying request from a source end, acquiring the signature of the root node of the Merkle tree of a logical volume from the source end, and comparing the acquired signature with the locally stored signature of the root node;
and if the root node signatures of the source end and the destination end are different, requesting the content of the root node from the source end, acquiring the signatures of all lower-layer nodes of the root node by reading that content, comparing the acquired signatures of the lower-layer nodes with the locally stored signatures of the corresponding lower-layer nodes, and requesting from the source end the content of the lower-layer nodes whose signatures differ; and so on, until the changed lowest-layer leaf nodes are obtained, the contents of the data blocks whose signatures differ in those leaf nodes are acquired and copied to the local end, and the corresponding signatures in the corresponding nodes of each layer of the Merkle tree are updated.
The embodiment of the invention provides a data read-write device based on a Merkle tree, which is used for a distributed storage system and comprises:
a partition calculation module, used for dividing a logical volume by a preset size to obtain a plurality of data blocks, calculating a hash value of each data block, and using the hash value as the signature of the corresponding data block;
a Merkle tree module, used for storing the signature of each data block on a leaf node of the Merkle tree, calculating the hash value of each leaf node, using the hash value as the signature of the corresponding leaf node, storing the signature of each leaf node on the upper-layer parent node of the leaf node, and so on until the signature of the logical volume is stored on the root node of the Merkle tree;
and a read-write module, used for calculating the position of the signature of the data block in each layer of the Merkle tree according to the logical address of the data block to be read or written, acquiring the signature of the data block, acquiring the content of the data block through an index according to the signature during reading, and, during writing, updating the content of the data block and updating the signature at the corresponding position in each layer of the Merkle tree.
The embodiment of the present invention provides a data remote copying device based on the data read-write device, which is used for a distributed storage system, is arranged at the disaster recovery end, and includes:
a comparison module, used for acquiring the signature of the root node of the Merkle tree of the logical volume from a source end after receiving a data remote copying request from the source end, and comparing the acquired signature with the locally stored signature of the root node; and, if the root node signatures of the source end and the destination end are different, requesting the content of the root node from the source end, acquiring the signatures of all lower-layer nodes of the root node by reading that content, comparing the acquired signatures of the lower-layer nodes with the locally stored signatures of the corresponding lower-layer nodes, and requesting from the source end the content of the lower-layer nodes whose signatures differ, and so on until the changed lowest-layer leaf nodes are obtained;
and a copying module, used for acquiring the content of the data blocks corresponding to the changed signatures in the changed lowest-layer leaf nodes, copying the content to the local end, and updating the corresponding signatures in the corresponding nodes of each layer of the Merkle tree.
The embodiment of the invention provides a distributed storage system, which comprises the data read-write device and the data remote copying device.
An embodiment of the present invention further provides a distributed storage system, including a plurality of distributed storage apparatuses, where each distributed storage apparatus includes a processor and a computer program which, when executed by the processor, implements the data reading and writing method based on the Merkle tree and the data remote copying method.
The embodiment of the invention also provides a computer-readable storage medium on which an implementation program for information transmission is stored; when the program is executed by a processor, the steps of the data remote copying method are implemented.
By adopting the embodiment of the invention, the data blocks that need to be copied to the remote end can be found quickly through the tree structure inherent in the Signature-based Merkle Tree, without introducing an additional Change Map, which improves the read-write efficiency and the replication efficiency of the data volume.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a diagram of a Change Map in the prior art;
FIG. 2 is a flow chart of a method for reading and writing data based on the Merkle tree according to an embodiment of the present invention;
FIG. 3 is a schematic of a 3-layer Merkle Tree according to an embodiment of the present invention;
FIG. 4 is a flow chart of a method for remote replication of data according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of signal interaction of an example of a data remote copy method according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a data read/write device based on the Merkle tree according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a data remote copy apparatus according to an embodiment of the present invention;
FIG. 8 is a diagram of a distributed storage system according to a first embodiment of the present invention;
FIG. 9 is a schematic diagram of a distributed storage system according to a second embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Method embodiment one
According to an embodiment of the present invention, a data reading and writing method based on a Merkle tree is provided, which is used for remote replication and remote disaster recovery of a logical volume in a distributed storage system. Fig. 2 is a flowchart of the data reading and writing method based on a Merkle tree according to an embodiment of the present invention. As shown in fig. 2, the method specifically includes:
Step 201, dividing a logical volume in a distributed storage system by a predetermined size to obtain a plurality of data blocks, calculating a hash value of each data block, and using the hash value as the signature of the corresponding data block; in the present embodiment, the predetermined size is 4KB;
Step 202, storing the signature of each data block on a leaf node of the Merkle tree, calculating the hash value of each leaf node, using the hash value as the signature of the corresponding leaf node, storing the signature of each leaf node on its upper-layer parent node, and so on until the signature of the logical volume is stored on the root node of the Merkle tree; in the embodiment of the present invention, the size of the hash value is 20 bytes, the size of a node in the Merkle tree is 4KB, and each node in the Merkle tree stores up to 204 signatures.
That is, in the distributed storage system, the logical volume is divided in units of 4KB; each 4KB unit is an object (i.e. the data block above), and for each object a hash function computes from its content a hash value of 20 bytes (160 bits), called the Signature (i.e. the signature above). Since a 20-byte (160-bit) value has 2^160 possible values, an astronomically large number, if the Signatures of any two objects are the same, the data contents of the two objects can also be considered the same. When reading object data, it suffices to provide the corresponding 20-byte Signature, and the object data itself is obtained through the relevant index.
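The per-object Signature and the signature-to-content index can be sketched as follows. The source does not name the hash function; SHA-1 is used here purely as an assumption because it produces a 20-byte digest. The names `block_signatures` and `index`, and the zero-padding of a short last block, are likewise illustrative.

```python
import hashlib

BLOCK_SIZE = 4096  # 4KB objects, as in the embodiment


def block_signatures(volume: bytes):
    """Split a volume image into 4KB objects; return (signatures, index)."""
    signatures = []
    index = {}  # Signature -> object content (the "relevant index")
    for offset in range(0, len(volume), BLOCK_SIZE):
        # Assumption: a short final block is padded with zero bytes.
        block = volume[offset:offset + BLOCK_SIZE].ljust(BLOCK_SIZE, b"\0")
        sig = hashlib.sha1(block).digest()  # 20 bytes; SHA-1 is an assumption
        signatures.append(sig)
        index[sig] = block
    return signatures, index


# Three objects, the first and third with identical content.
volume = b"A" * 4096 + b"B" * 4096 + b"A" * 4096
signatures, index = block_signatures(volume)
```

Because identical content yields the identical Signature, the first and third objects share one index entry, so reading either of them resolves to the same stored payload.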
In an embodiment of the invention, the objects of the logical volume are managed with a Merkle Tree, which is a hierarchical N-way tree; in an embodiment of the invention it is a 204-way tree. The Merkle Tree is composed of a group of leaf nodes, a group of intermediate nodes and a root node, and each node stores the Signatures of its child nodes. The size of each node is 4KB, so one node can store 4096/20 = 204 Signatures (rounded down); in other words, each node may have 204 child nodes. The leaf nodes store the Signatures of the objects of the logical volume, that is, of the contents (payload), and each leaf node can store the Signatures of 204 objects; the parent node of the leaf nodes can correspondingly store the Signatures of 204 leaf nodes, and so on layer by layer, so that the Signature of the root node can be calculated, and the Signature of the root node is the Signature of the logical volume. The Merkle Tree of a logical volume is shown in fig. 3, which depicts a 3-layer Merkle Tree; for ease of illustration, only a 5-way tree is drawn.
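The bottom-up construction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: SHA-1 stands in for the unnamed 20-byte hash, a node's Signature is taken to be the hash of its concatenated child Signatures, and `sign` and `build_levels` are hypothetical names. The example uses a fanout of 5, like the simplified drawing in fig. 3.

```python
import hashlib

SIG_SIZE = 20                    # bytes per Signature
NODE_SIZE = 4096                 # bytes per tree node
FANOUT = NODE_SIZE // SIG_SIZE   # 204 Signatures per node


def sign(data: bytes) -> bytes:
    # SHA-1 is an assumption; the source only requires a 20-byte hash.
    return hashlib.sha1(data).digest()


def build_levels(object_sigs, fanout=FANOUT):
    """Group Signatures into nodes, layer by layer, until one root node remains.

    Returns (levels, root_signature); levels[0] holds the leaf nodes and
    levels[-1] holds the single root node. Each node is a list of child Signatures.
    """
    if not object_sigs:
        raise ValueError("a volume needs at least one object")
    levels = []
    sigs = list(object_sigs)
    while True:
        nodes = [sigs[i:i + fanout] for i in range(0, len(sigs), fanout)]
        levels.append(nodes)
        sigs = [sign(b"".join(node)) for node in nodes]
        if len(nodes) == 1:
            return levels, sigs[0]


# 125 objects with fanout 5 give a 3-layer tree: 25 leaf nodes, 5 intermediate, 1 root.
object_sigs = [sign(bytes([i])) for i in range(125)]
levels, root_sig = build_levels(object_sigs, fanout=5)
```

Changing any single object Signature and rebuilding yields a different `root_sig`, which is what lets the root Signature stand for the whole logical volume.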
Step 203, according to the logical address of the data block to be read or written, calculating the position of the signature of the data block in each layer of the Merkle tree, and obtaining the signature of the data block; during reading, obtaining the content of the data block through the index according to the signature; during writing, updating the content of the data block and updating the signature at the corresponding position in each layer of the Merkle tree.
In step 203, calculating the position of the signature of the data block in each layer of the Merkle tree according to the logical address of the data block to be read or written specifically includes:
dividing the logical address by the size of the data block to obtain the logical block number of the data block to be read or written, and then repeatedly dividing (with rounding) the logical block number by the maximum number of signatures that a node in the Merkle tree can store and taking the remainder, to obtain the position of the signature of the data block in each layer of the Merkle tree.
That is, in the embodiment of the present invention, to calculate the position of a certain address on the Merkle Tree, the logical address (LBA) is first divided by 4K (the size of a data block) to obtain its logical block number (VBN), and the position of the address on each layer of the Merkle Tree is then obtained by repeatedly dividing by 204 (the maximum number of Signatures that a node in the Merkle Tree can store), with rounding, and taking the remainder modulo 204.
For example, in a 3-layer Merkle Tree, the path corresponding to logical address 3251489060 is calculated as follows (all divisions are integer divisions):
(3251489060 + 4095) / 4096 = 793821
(793821 + 203) / 204 = 3892; 793821 mod 204 = 57
(3892 + 203) / 204 = 20; 3892 mod 204 = 16
As can be seen from the above calculation, the positions corresponding to the logical address on the Merkle Tree are (positions are counted from 1):
the 20th Signature inside the root node;
the 16th Signature of the 20th child node of the second layer;
the 57th Signature of the 16th child node of the third layer.
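The worked example above can be reproduced with a short function. This is a sketch that mirrors the arithmetic shown in the text; the function name `merkle_path` is hypothetical, and mapping a remainder of 0 to position 204 (the last slot of a node, under 1-based counting) is an assumption the source does not spell out.

```python
BLOCK_SIZE = 4096   # size of a data block
FANOUT = 204        # maximum number of Signatures per node


def merkle_path(lba: int, layers: int = 3):
    """Return the 1-based Signature positions from the root layer down to the leaf layer."""
    index = (lba + BLOCK_SIZE - 1) // BLOCK_SIZE   # logical block number, rounded up
    positions = []
    for _ in range(layers - 1):
        # Assumption: a remainder of 0 denotes the last (204th) slot of the node.
        positions.append(index % FANOUT or FANOUT)
        index = (index + FANOUT - 1) // FANOUT     # number of the node one layer up
    positions.append(index)                        # position inside the root node
    return positions[::-1]


path = merkle_path(3251489060)
```

For the address in the example this yields positions 20, 16 and 57, matching the path listed above.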
When reading or writing the logical volume, the Signature of the corresponding object is located layer by layer from the logical address in the manner described above. During reading, the payload data of the corresponding object is obtained directly through the index according to the Signature; during writing, the payload data is updated, and the Signature at the corresponding position in the leaf node is updated.
From the above process it can be seen that, by the design of the Merkle Tree, each layer of nodes stores the Signatures of the nodes of the next layer down. When the data of a bottom-layer object changes, the leaf node, intermediate nodes and root node on the path of that object all change accordingly, as shown by the white path in fig. 3.
That is, for a write operation on a logical volume, in addition to updating the corresponding payload data, the Signature of the payload in the leaf node is also updated; because the data of the leaf node has changed, the Signature of that leaf node in its parent node is updated in turn, and so on, until the Signature of the root node is finally updated.
Through the above technical solution, the Signature-based Merkle Tree structure of the logical volume can be established to facilitate subsequent remote copying.
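The read path and the bottom-up Signature update on a write can be sketched as a small in-memory class. It is illustrative only: `MerkleVolume` and its members are hypothetical names, SHA-1 is an assumed stand-in for the unnamed 20-byte hash, block numbers are 0-based for brevity, and stale payloads are simply left in the index.

```python
import hashlib


def sign(data: bytes) -> bytes:
    # SHA-1 is an assumption; the source only requires a 20-byte hash.
    return hashlib.sha1(data).digest()


class MerkleVolume:
    """In-memory sketch of a logical volume managed through a Merkle Tree."""

    def __init__(self, blocks, fanout=204):
        if not blocks:
            raise ValueError("a volume needs at least one block")
        self.fanout = fanout
        self.store = {}    # Signature -> payload (the index)
        self.levels = []   # levels[0] = leaf nodes, levels[-1] = [root node]
        sigs = [self._put(block) for block in blocks]
        while True:
            nodes = [sigs[i:i + fanout] for i in range(0, len(sigs), fanout)]
            self.levels.append(nodes)
            sigs = [sign(b"".join(node)) for node in nodes]
            if len(nodes) == 1:
                self.root_sig = sigs[0]
                break

    def _put(self, payload):
        sig = sign(payload)
        self.store[sig] = payload
        return sig

    def read(self, block_no):
        """Look up the Signature in the leaf node, then fetch the payload by Signature."""
        node, slot = divmod(block_no, self.fanout)
        return self.store[self.levels[0][node][slot]]

    def write(self, block_no, payload):
        """Update the payload, then refresh one Signature per layer up to the root."""
        sig = self._put(payload)
        index = block_no
        for nodes in self.levels:
            node, slot = divmod(index, self.fanout)
            nodes[node][slot] = sig               # new Signature of the child
            sig = sign(b"".join(nodes[node]))     # this node's Signature changes too
            index = node
        self.root_sig = sig


blocks = [bytes([i]) * 4096 for i in range(25)]
volume = MerkleVolume(blocks, fanout=5)   # small fanout, as in the drawing of fig. 3
old_root = volume.root_sig
volume.write(7, b"x" * 4096)
```

A write therefore touches exactly one node per layer, and the root Signature after the write equals the root of a tree built from scratch over the updated blocks.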
Method embodiment two
According to an embodiment of the present invention, a data remote replication method based on the data reading and writing method of method embodiment one is provided, which is used for remote replication and remote disaster recovery of a logical volume in a distributed storage system. Fig. 4 is a flowchart of the data remote copy method according to an embodiment of the present invention. As shown in fig. 4, the method specifically includes:
Step 401, after receiving a data remote copy request from a source end, obtaining the signature of the root node of the Merkle tree of a logical volume from the source end, and comparing the obtained signature with the locally stored signature of the root node;
Step 402, if the root node signatures of the source end and the destination end are different, requesting the content of the root node from the source end, obtaining the signatures of all lower-layer nodes of the root node by reading that content, comparing the obtained signatures of the lower-layer nodes with the locally stored signatures of the corresponding lower-layer nodes, and requesting from the source end the content of the lower-layer nodes whose signatures differ; and so on, until the changed lowest-layer leaf nodes are obtained, the contents of the data blocks whose signatures differ in those leaf nodes are obtained and copied to the local end, and the corresponding signatures in the corresponding nodes of each layer of the Merkle tree are updated.
In the above process, if several signatures differ, all nodes of the Merkle tree whose signatures have changed are obtained in turn by a looping, recursive, pre-order traversal, until the contents of all changed data blocks have been obtained.
Because any object change changes all nodes on its path in the Merkle Tree, remote incremental replication of data gets a natural way of tracking data changes. Fig. 5 is a schematic signal interaction diagram of an example of the data remote copy method according to an embodiment of the present invention. As shown in fig. 5, a 2-layer Merkle Tree is taken as an example below to demonstrate the remote copy process:
Step 501, the source end initiates a remote copy request to the disaster recovery end;
Step 502, the disaster recovery end requests the root node data from the source end;
Step 503, the source end reads the root node data (4KB);
Step 504, the source end sends the root node data to the disaster recovery end;
Step 505, the disaster recovery end compares the root node data of the source end with the local root node data and finds that the 89th Signatures differ; the 89th Signature of the source end is SigA_89;
Step 506, the disaster recovery end requests from the source end the 4KB data corresponding to SigA_89;
Step 507, the source end reads the 4KB data corresponding to SigA_89;
Step 508, the source end sends the 4KB data corresponding to SigA_89 to the disaster recovery end;
Step 509, the disaster recovery end compares the source end data corresponding to SigA_89 with the corresponding local data and finds that the 201st Signatures differ; the 201st Signature of the source end is SigB_201;
Step 510, the disaster recovery end requests from the source end the 4KB data corresponding to SigB_201;
Step 511, the source end reads the 4KB data corresponding to SigB_201;
Step 512, the source end sends the 4KB data corresponding to SigB_201 to the disaster recovery end;
Step 513, the disaster recovery end updates the payload corresponding to SigB_201 locally.
As can be seen from the above process, the source end (production end) initiates the remote copy request, while the actual data request and comparison process is driven by the disaster recovery end.
The disaster recovery end first obtains the root node signature sent by the source end and compares it with the local root node signature. If the signatures are the same, the data of the source end has not changed and nothing needs to be copied. If the root node signatures differ, the disaster recovery end requests the root node data (containing 204 Signatures) from the source end and compares it with the local root node data (containing 204 Signatures) to find the Signatures that differ. For example, if the 89th Signatures differ, the disaster recovery end requests the data corresponding to the 89th Signature from the source end and compares it with the corresponding local data; and so on, until the payload data is obtained and updated locally.
In the actual copying process, several Signatures may be found to differ within the same root node or intermediate node. In terms of system implementation, all changed nodes need to be obtained in turn by a looping, recursive, pre-order traversal, and finally all changed payload data is copied in turn to the disaster recovery end.
It can be seen from the above process that the technical solution of the embodiment of the present invention does not need to traverse the whole Merkle Tree to find the changed objects. In the recursive traversal downward from the root node, only the changed subtrees need to be compared and tracked. Taking fig. 5 as an example, the 204 Signatures of the source end root node and of the disaster recovery end root node are compared, and only the 89th Signature is found to differ; the subtrees corresponding to the 1st to 88th and the 90th to 204th Signatures can then be ignored, and only the subtree corresponding to the 89th Signature is compared further, down to the payload, so the remote replication is precisely targeted.
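The pre-order comparison driven by the disaster recovery end can be sketched as below. It is a simplified single-process illustration, not the patented protocol itself: each end is modeled as a dictionary from Signature to content, a dictionary lookup on the source stands in for one network request, SHA-1 is an assumed 20-byte hash, and `build`, `children` and `replicate` are hypothetical names.

```python
import hashlib

SIG_SIZE = 20  # bytes per Signature


def sign(data: bytes) -> bytes:
    # SHA-1 is an assumption; the source only requires a 20-byte hash.
    return hashlib.sha1(data).digest()


def build(blocks, fanout, store):
    """Put payloads and tree nodes into `store`; return (root Signature, node layers)."""
    if not blocks:
        raise ValueError("a volume needs at least one block")
    sigs = []
    for payload in blocks:
        store[sign(payload)] = payload
        sigs.append(sign(payload))
    layers = 0
    while True:
        parents = []
        for i in range(0, len(sigs), fanout):
            node = b"".join(sigs[i:i + fanout])   # node content = child Signatures
            store[sign(node)] = node
            parents.append(sign(node))
        sigs, layers = parents, layers + 1
        if len(sigs) == 1:
            return sigs[0], layers


def children(node: bytes):
    return [node[i:i + SIG_SIZE] for i in range(0, len(node), SIG_SIZE)]


def replicate(src, dst, src_sig, dst_sig, layers, fetched):
    """Pre-order walk: fetch from `src` only what differs below `src_sig`."""
    if src_sig == dst_sig:
        return                        # identical subtree, skipped entirely
    content = src[src_sig]            # stands in for one request to the source end
    fetched.append(src_sig)
    dst[src_sig] = content
    if layers == 0:
        return                        # content is a payload, nothing below it
    local = children(dst[dst_sig]) if dst_sig is not None else []
    for i, child in enumerate(children(content)):
        local_child = local[i] if i < len(local) else None
        replicate(src, dst, child, local_child, layers - 1, fetched)


blocks = [bytes([i]) * 4096 for i in range(25)]
source, target = {}, {}
target_root, layers = build(blocks, 5, target)   # disaster recovery end, in sync
blocks[7] = b"x" * 4096                          # one object changes at the source
source_root, _ = build(blocks, 5, source)
fetched = []
replicate(source, target, source_root, target_root, layers, fetched)
target_root = source_root                        # local root Signature updated last
```

With one changed object in this 2-layer example, only three items cross the wire: the root node, one leaf node and one payload, mirroring steps 502 to 513.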
In the Change Map approach of the prior art, all bits of the Change Map have to be compared on each replication, with time complexity O(N), where N is the number of bits (the number of Regions). Fast incremental replication using the Merkle Tree only needs to track the paths that changed, with time complexity O(K · log_204 N), where K is the number of objects that actually changed. When little data has changed, the number of comparisons is effectively reduced.
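To get a feel for the two bounds, the following sketch counts comparisons for an assumed volume of 204^3 objects and 10 changed objects; neither figure comes from the source. The upper bound charges a full 204-Signature node comparison per layer per changed object, which is a simplifying assumption, and the helper names are hypothetical.

```python
FANOUT = 204


def tree_layers(num_objects: int) -> int:
    """Smallest number of node layers whose capacity covers num_objects."""
    layers, capacity = 1, FANOUT
    while capacity < num_objects:
        layers += 1
        capacity *= FANOUT
    return layers


def merkle_comparisons_upper_bound(num_objects: int, changed: int) -> int:
    """At most one full node comparison (204 Signatures) per layer per changed object."""
    return changed * tree_layers(num_objects) * FANOUT


num_objects = FANOUT ** 3        # assumed volume: a full 3-layer tree
change_map_comparisons = num_objects   # one bit per 4KB Region, all scanned
merkle_comparisons = merkle_comparisons_upper_bound(num_objects, changed=10)
```

Under these assumptions the bitmap scan touches about 8.5 million bits, while the tree walk compares at most 6120 Signatures, and the gap widens as the volume grows while the number of changes stays small.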
In summary, according to the technical solution of the embodiment of the present invention, the data blocks that need to be copied to the remote end can be found quickly through the tree structure inherent in the Signature-based Merkle Tree, without introducing an additional Change Map, which improves the read-write efficiency and the replication efficiency of the data volume.
Apparatus embodiment one
According to an embodiment of the present invention, a data read/write device based on a Merkle tree is provided, which is used in a distributed storage system. Fig. 6 is a schematic diagram of the data read/write device based on the Merkle tree according to an embodiment of the present invention. As shown in fig. 6, the device specifically includes:
a partition calculation module 60, configured to divide the logical volume by a predetermined size to obtain a plurality of data blocks, calculate a hash value of each data block, and use the hash value as the signature of the corresponding data block; in the present embodiment, the predetermined size is 4KB;
a Merkle tree module 62, configured to store the signature of each data block on a leaf node of the Merkle tree, calculate a hash value of each leaf node, use the hash value as the signature of the corresponding leaf node, store the signature of each leaf node on its upper-layer parent node, and so on until the signature of the logical volume is stored on the root node of the Merkle tree; in the embodiment of the present invention, the size of the hash value is 20 bytes, the size of a node in the Merkle tree is 4KB, and each node in the Merkle tree stores up to 204 signatures.
That is, in the distributed storage system, the logical volume is divided in units of 4KB, each 4KB is an object (i.e. the above data block), and each object calculates a 20Byte (160Bit) by a hash function according to its contentThe hash value, we call Signature (i.e. the Signature described above). Since 20 bytes (160 bits) has 2160This possibility is a relatively large astronomical number, so if the signatures of any two objects are the same, the data contents of the two objects can also be considered to be the same. When reading the object data, the object data itself can be obtained through the relevant index by only providing the Signature of the 20 bytes corresponding to the object data.
In an embodiment of the invention, the logical volume is managed for its objects in terms of a Merkle Tree, which is a hierarchy of N-way trees, and in an embodiment of the invention, it is 204-way trees. The Merkle Tree is composed of a group of leaf nodes, a group of intermediate nodes and a root node, and each node is used for storing the Signature of the child node. The size of each node is 4KB, so that one node can store 4096/20-204 signatures, in other words, each node may have 204 child nodes. The leaf nodes store objects of the logical volume, that is, signatures of contents (payload), and each leaf node can store 204 objects of signatures; the father node of the leaf node can correspondingly store the signatures of 204 leaf nodes, and the layer-by-layer analogy is carried out, so that the Signature of the root node can be calculated, and the Signature of the root node is the Signature of the logic volume. The Merkel Tree for a logical volume is shown in FIG. 3, with a 3-tier Merkle Tree shown in FIG. 3, and only 5-way trees are shown for ease of illustration.
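The node capacity described above can be checked with a few lines of arithmetic. The following Python sketch is illustrative only: the constant and function names are hypothetical, and it assumes a completely full tree in which every node holds the maximum number of Signatures.

```python
NODE_SIZE = 4096       # bytes per Merkle tree node (from the text)
SIGNATURE_SIZE = 20    # bytes per Signature (from the text)
BLOCK_SIZE = 4096      # bytes per data block (from the text)

# Integer division: 204 Signatures fit in a node, with 16 bytes left over.
FANOUT = NODE_SIZE // SIGNATURE_SIZE


def addressable_bytes(levels: int) -> int:
    """Bytes of logical volume covered by a full tree with `levels` node layers.

    Assumption: every node is completely filled with FANOUT Signatures.
    """
    return FANOUT ** levels * BLOCK_SIZE


if __name__ == "__main__":
    for levels in (1, 2, 3):
        print(levels, addressable_bytes(levels))
```

Under these assumptions a 3-layer tree addresses 204^3 data blocks of 4 KB each, which comfortably covers the example logical address used in this description.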
The read-write module 64 is configured to calculate, from the logical address of the data block to be read or written, the position of the data block's signature in each layer of the Merkle tree and obtain the signature of the data block; during a read, to obtain the content of the data block through an index according to the signature; and during a write, to update the content of the data block and update the signature at the corresponding position in each layer of the Merkle tree.
The read-write module 64 is specifically configured to: divide the logical address by the size of the data block to obtain the logical block number of the data block to be read or written, and repeatedly divide the logical block number by, and take it modulo, the maximum number of signatures that a node in the Merkle tree can store, to obtain the position of the data block's signature in each layer of the Merkle tree.
That is, in the embodiment of the present invention, to calculate the position of an address on the Merkle Tree, the logical address (LBA) is first divided by 4 KB (the size of a data block) to obtain its logical block number (VBN); the VBN is then repeatedly divided by, and taken modulo, 204 (the maximum number of signatures that a node in the Merkle Tree can store) to obtain the position on each layer of the Merkle Tree.
For example, in a 3-layer Merkle Tree, the path corresponding to logical address 3251489060 is calculated as follows:
$\lfloor (3251489060 + 4095) / 4096 \rfloor = 793821$
$\lfloor (793821 + 203) / 204 \rfloor = 3892$; $793821 \bmod 204 = 57$
$\lfloor (3892 + 203) / 204 \rfloor = 20$; $3892 \bmod 204 = 16$
As can be seen from the above calculation, the positions corresponding to the logical address on the Merkle Tree are as follows (positions are counted from 1):
the 20th Signature inside the root node;
the 16th Signature of the 20th child node of the second layer;
the 57th Signature of the 16th child node of the third layer.
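The path calculation above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `merkle_path` and the `levels` parameter are hypothetical. One labeled assumption: the remainder is computed as `(index - 1) % 204 + 1` so that an index that is an exact multiple of 204 maps to position 204 rather than 0; for the example address this gives the same result as the plain modulo used in the text.

```python
from typing import List

BLOCK_SIZE = 4096  # size of a data block in bytes (from the text)
FANOUT = 204       # maximum number of Signatures per node (from the text)


def merkle_path(lba: int, levels: int = 3) -> List[int]:
    """Return the 1-based Signature positions from the root layer down to the leaf layer."""
    # Logical block number, rounded up as in the worked example.
    index = (lba + BLOCK_SIZE - 1) // BLOCK_SIZE
    positions = []
    for _ in range(levels - 1):
        # Position inside the node on this layer (assumption: 1-based remainder).
        positions.append((index - 1) % FANOUT + 1)
        # 1-based number of that node, i.e. its index on the layer above.
        index = (index - 1) // FANOUT + 1
    # What remains is the position inside the root node.
    positions.append(index)
    return positions[::-1]


if __name__ == "__main__":
    print(merkle_path(3251489060))  # [20, 16, 57]
```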
When reading or writing the logical volume, the Signature of the corresponding object is located layer by layer from the logical address in the manner described above. During a read, the payload data of the object is obtained directly through the index according to the Signature; during a write, the payload data is updated and the Signature at the corresponding position in the leaf node is updated.
As the above process shows, by design each layer of Merkle Tree nodes stores the Signatures of the nodes in the layer below. When the data of an object at the bottom layer changes, the leaf node, intermediate nodes and root node on that object's path all change accordingly, as shown by the white path in fig. 3.
That is, for a write operation on the logical volume, in addition to updating the corresponding payload data, the Signature of the payload in the leaf node is also updated; because the data of the leaf node has changed, the Signature of that leaf node in its parent node is updated in turn, and so on, until the Signature of the root node is finally updated.
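The upward propagation of a write can be illustrated with a small in-memory model. The sketch below rests on several assumptions that are not taken from the text: SHA-1 is used only because it produces a 20-byte digest (the description does not name the hash function), the fan-out is 5 as in the drawing of fig. 3, indices are 0-based, and the class name `MerkleVolume` is hypothetical.

```python
import hashlib

FANOUT = 5  # small fan-out as drawn in fig. 3; the embodiment uses 204


def sig(data: bytes) -> bytes:
    # Assumption: SHA-1, chosen only because its digest is 20 bytes long.
    return hashlib.sha1(data).digest()


class MerkleVolume:
    """Toy logical volume whose Signatures are kept layer by layer."""

    def __init__(self, num_blocks: int, block_size: int = 4096):
        self.blocks = [bytes(block_size) for _ in range(num_blocks)]
        # levels[0] holds the Signatures of the data blocks; each higher
        # level holds the Signatures of the nodes of the level below.
        self.levels = [[sig(b) for b in self.blocks]]
        while len(self.levels[-1]) > 1:
            below = self.levels[-1]
            self.levels.append(
                [sig(b"".join(below[i:i + FANOUT]))
                 for i in range(0, len(below), FANOUT)]
            )

    @property
    def root(self) -> bytes:
        return self.levels[-1][0]

    def write(self, block_no: int, data: bytes) -> None:
        """Update the payload, then recompute only the Signatures on its path."""
        self.blocks[block_no] = data
        self.levels[0][block_no] = sig(data)
        idx = block_no
        for depth in range(1, len(self.levels)):
            idx //= FANOUT
            below = self.levels[depth - 1]
            node = b"".join(below[idx * FANOUT:(idx + 1) * FANOUT])
            self.levels[depth][idx] = sig(node)
```

After a single `write`, only one Signature per layer differs from its previous value, which is the property the remote replication relies on.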
Through the above technical solution, the logical volume's own tree of Signatures, organized as a Merkle Tree, can be established to facilitate subsequent remote replication.
Apparatus embodiment two
According to an embodiment of the present invention, a data remote copying apparatus based on the data read/write device of apparatus embodiment one is provided, which is used in a distributed storage system and is disposed at the disaster recovery end. Fig. 7 is a schematic diagram of the data remote copying apparatus according to an embodiment of the present invention. As shown in fig. 7, the apparatus specifically includes:
a comparing module 70, configured to, after receiving a remote data replication request from a source end, obtain the signature of the root node of the Merkle tree of the logical volume from the source end and compare it with the locally stored signature of the root node; to obtain the signatures that differ between the two root nodes, request from the source end the signatures of the lower-layer nodes corresponding to the differing signatures, compare the obtained signatures of the lower-layer nodes with the locally stored signatures of the corresponding lower-layer nodes, and so on until the changed leaf nodes at the bottom layer are obtained. The comparing module 70 is specifically configured to, when there are multiple differing signatures, obtain in turn all nodes of the Merkle tree whose signatures have changed by looping and recursing in a pre-order traversal, until the contents of all changed data blocks are obtained.
a copying module 72, configured to obtain the content of the data blocks corresponding to the changed leaf nodes at the lowest layer, copy the content to the local end, and update the corresponding signatures in the corresponding nodes of each layer of the Merkle tree.
Because any change to an object changes all nodes on its Merkle Tree path, the tree provides a natural way of tracking data changes for remote incremental replication. As shown in fig. 5, the source (production) end initiates the remote copy request, while the actual data requests and comparisons are driven by the disaster recovery end.
The comparing module 70 of the disaster recovery end first obtains the root node signature sent by the source end and compares it with the local root node signature. If the signatures are the same, the source end has no data changes and nothing needs to be copied. If the root node signatures differ, the module requests the root node data (containing 204 Signatures) from the source end and compares it with the local root node data (containing 204 Signatures) to find the differing Signatures. For example, if the 89th Signature differs, the comparing module 70 of the disaster recovery end requests the data corresponding to the 89th Signature from the source end and compares it with the corresponding local data; and so on, until the payload data is obtained and updated locally by the copying module 72.
In an actual replication process, several Signatures may differ within the same root node or intermediate node. In the system implementation, the comparing module 70 therefore obtains all changed nodes in turn by looping and recursing in a pre-order traversal, and finally copies all changed payload data to the disaster recovery end in order.
As the above processing shows, the technical solution of the embodiment of the present invention does not need to traverse the whole Merkle Tree to find the changed objects. During the recursive traversal from the root node downward, only the changed subtrees need to be compared and followed. Taking fig. 5 as an example, the 204 Signatures of the source-end root node and the disaster-recovery-end root node are compared, and only the 89th Signature is found to differ; the subtrees corresponding to the 1st to 88th and the 90th to 204th Signatures can then be ignored, and only the subtree corresponding to the 89th Signature is compared further, down to the payload, so that the remote replication targets exactly the changed data.
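The pruned pre-order traversal described above can be sketched as follows for the case where several Signatures differ. This is an illustration under stated assumptions, not the apparatus itself: both trees are held in memory and have the same shape, SHA-1 stands in for the unspecified 20-byte hash, the fan-out is 5 for brevity, paths are 0-based, and the names `Node`, `build` and `changed_blocks` are hypothetical.

```python
import hashlib


class Node:
    """Tree node with a cached Signature; a node with a payload is a data block."""

    def __init__(self, payload=None, children=None):
        self.payload = payload
        self.children = children or []
        body = payload if payload is not None else b"".join(
            c.sig for c in self.children)
        self.sig = hashlib.sha1(body).digest()  # assumption: SHA-1


def build(blocks, fanout=5):
    """Build a tree bottom-up over the given payload blocks."""
    nodes = [Node(payload=b) for b in blocks]
    while len(nodes) > 1:
        nodes = [Node(children=nodes[i:i + fanout])
                 for i in range(0, len(nodes), fanout)]
    return nodes[0]


def changed_blocks(src, dst, path=()):
    """Return the paths of changed payload blocks, in pre-order.

    Subtrees whose Signatures match are never entered.
    """
    if src.sig == dst.sig:
        return []
    if src.payload is not None:
        return [path]
    found = []
    for i, (s, d) in enumerate(zip(src.children, dst.children)):
        if s.sig != d.sig:
            found.extend(changed_blocks(s, d, path + (i,)))
    return found
```

Each returned path lists the child index taken at every layer, so it identifies exactly one changed payload block.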
In the prior-art Change Map approach, every Change Map entry must be compared, so the time complexity is $O(N)$, where $N$ is the number of entries (the number of regions). Fast incremental replication using the Merkle Tree only needs to follow the paths that have changed, giving a time complexity of $O(K \log_{204} N)$, where $K$ is the number of objects that actually changed. When little data has changed, the number of comparisons is reduced substantially.
In summary, according to the technical solution of the embodiment of the present invention, the data blocks that need to be copied to the remote end can be found quickly through the logical volume's own tree of Signatures, organized as a Merkle Tree, without introducing an additional Change Map, thereby improving the read-write efficiency and the replication efficiency of the data volume.
System embodiment one
According to an embodiment of the present invention, a distributed storage system is provided. Fig. 8 is a schematic diagram of the distributed storage system according to system embodiment one of the present invention. As shown in fig. 8, the distributed storage system specifically includes: the data read/write device 80 of apparatus embodiment one and the data remote copying apparatus 82 of apparatus embodiment two.
Specifically, as shown in fig. 6, the data reading/writing device 80 includes:
a partition calculation module 60, configured to partition the logical volume by a predetermined size, obtain a plurality of data blocks, calculate a hash value of each data block, and use the hash value as a signature of the corresponding data block; in the present embodiment, the predetermined size is 4 KB;
the Merkle tree module 62, configured to store the signature of each data block in a leaf node of the Merkle tree, calculate a hash value of each leaf node and use it as the signature of that leaf node, store the signature of each leaf node in its parent node, and so on until the signature of the logical volume is stored in the root node of the Merkle tree; in the embodiment of the present invention, the size of the hash value is 20 bytes, the size of a node in the Merkle tree is 4 KB, and each node in the Merkle tree stores up to 204 signatures.
That is, in the distributed storage system, the logical volume is divided in units of 4 KB. Each 4 KB unit is an object (i.e., the data block described above), and for each object a 20-byte (160-bit) hash value is calculated from its content by a hash function; this hash value is called the Signature (i.e., the signature described above). Since a 20-byte (160-bit) value has $2^{160}$ possible values, an astronomically large number, two objects with the same Signature can be regarded as having the same data content. When reading object data, only the 20-byte Signature of the object needs to be provided, and the object data itself can be obtained through the relevant index.
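The idea of retrieving object data by its Signature alone can be illustrated with a small content-addressed store. In this sketch the class name `ObjectStore` is hypothetical, a Python dictionary stands in for the "relevant index", and SHA-1 is assumed only because it yields a 20-byte value; none of these choices are stated in the description.

```python
import hashlib


class ObjectStore:
    """Toy index that maps a 20-byte Signature to the object data."""

    def __init__(self):
        self._index = {}

    def put(self, data: bytes) -> bytes:
        """Store an object and return its Signature."""
        signature = hashlib.sha1(data).digest()  # assumption: SHA-1
        # Objects with the same Signature are treated as the same content,
        # so an existing entry is kept and not stored a second time.
        self._index.setdefault(signature, data)
        return signature

    def get(self, signature: bytes) -> bytes:
        """Return the object data for a Signature."""
        return self._index[signature]


if __name__ == "__main__":
    store = ObjectStore()
    s = store.put(b"\x00" * 4096)
    print(len(s), store.get(s) == b"\x00" * 4096)
```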
In an embodiment of the invention, the logical volume manages its objects with a Merkle Tree, which is a hierarchical N-ary tree; in this embodiment it is a 204-ary tree. The Merkle Tree consists of a group of leaf nodes, a group of intermediate nodes and a root node, and each node stores the Signatures of its child nodes. Each node is 4 KB in size, so one node can store $\lfloor 4096/20 \rfloor = 204$ Signatures; in other words, each node may have up to 204 child nodes. The leaf nodes store the Signatures of the objects of the logical volume, that is, of the content (payload), and each leaf node can store the Signatures of 204 objects. The parent node of a leaf node correspondingly stores the Signatures of up to 204 leaf nodes, and so on layer by layer, until the Signature of the root node is calculated; the Signature of the root node is the Signature of the logical volume. The Merkle Tree of a logical volume is shown in fig. 3, which depicts a 3-layer Merkle Tree; for ease of illustration, only a 5-ary tree is drawn.
The read-write module 64 is configured to calculate, from the logical address of the data block to be read or written, the position of the data block's signature in each layer of the Merkle tree and obtain the signature of the data block; during a read, to obtain the content of the data block through an index according to the signature; and during a write, to update the content of the data block and update the signature at the corresponding position in each layer of the Merkle tree.
The read-write module 64 is specifically configured to: divide the logical address by the size of the data block to obtain the logical block number of the data block to be read or written, and repeatedly divide the logical block number by, and take it modulo, the maximum number of signatures that a node in the Merkle tree can store, to obtain the position of the data block's signature in each layer of the Merkle tree.
That is, in the embodiment of the present invention, to calculate the position of an address on the Merkle Tree, the logical address (LBA) is first divided by 4 KB (the size of a data block) to obtain its logical block number (VBN); the VBN is then repeatedly divided by, and taken modulo, 204 (the maximum number of signatures that a node in the Merkle Tree can store) to obtain the position on each layer of the Merkle Tree.
For example, in a 3-layer Merkle Tree, the path corresponding to logical address 3251489060 is calculated as follows:
$\lfloor (3251489060 + 4095) / 4096 \rfloor = 793821$
$\lfloor (793821 + 203) / 204 \rfloor = 3892$; $793821 \bmod 204 = 57$
$\lfloor (3892 + 203) / 204 \rfloor = 20$; $3892 \bmod 204 = 16$
As can be seen from the above calculation, the positions corresponding to the logical address on the Merkle Tree are as follows (positions are counted from 1):
the 20th Signature inside the root node;
the 16th Signature of the 20th child node of the second layer;
the 57th Signature of the 16th child node of the third layer.
When reading or writing the logical volume, the Signature of the corresponding object is located layer by layer from the logical address in the manner described above. During a read, the payload data of the object is obtained directly through the index according to the Signature; during a write, the payload data is updated and the Signature at the corresponding position in the leaf node is updated.
As the above process shows, by design each layer of Merkle Tree nodes stores the Signatures of the nodes in the layer below. When the data of an object at the bottom layer changes, the leaf node, intermediate nodes and root node on that object's path all change accordingly, as shown by the white path in fig. 3.
That is, for a write operation on the logical volume, in addition to updating the corresponding payload data, the Signature of the payload in the leaf node is also updated; because the data of the leaf node has changed, the Signature of that leaf node in its parent node is updated in turn, and so on, until the Signature of the root node is finally updated.
Through the above technical solution, the logical volume's own tree of Signatures, organized as a Merkle Tree, can be established to facilitate subsequent remote replication.
As shown in fig. 7, the data remote copy apparatus 82 includes:
a comparing module 70, configured to, after receiving a remote data replication request from a source end, obtain the signature of the root node of the Merkle tree of the logical volume from the source end and compare it with the locally stored signature of the root node; if the root node signatures of the source end and the destination end differ, to request the root node content from the source end, obtain the signatures of all lower-layer nodes of the root node by reading that content, compare the obtained signatures of the lower-layer nodes with the locally stored signatures of the corresponding lower-layer nodes, and request from the source end the content of the lower-layer nodes whose signatures differ; and so on until the changed leaf nodes at the bottom layer are obtained. The comparing module 70 is specifically configured to, when there are multiple differing signatures, obtain in turn all nodes of the Merkle tree whose signatures have changed by looping and recursing in a pre-order traversal, until the contents of all changed data blocks are obtained.
a copying module 72, configured to obtain the content of the data blocks corresponding to the differing signatures in the changed leaf nodes at the lowest layer, copy the content to the local end, and update the corresponding signatures in the corresponding nodes of each layer of the Merkle tree.
Because any change to an object changes all nodes on its Merkle Tree path, the tree provides a natural way of tracking data changes for remote incremental replication. As shown in fig. 5, the source (production) end initiates the remote copy request, while the actual data requests and comparisons are driven by the disaster recovery end.
The comparing module 70 of the disaster recovery end first obtains the root node signature sent by the source end and compares it with the local root node signature. If the signatures are the same, the source end has no data changes and nothing needs to be copied. If the root node signatures differ, the module requests the root node data (containing 204 Signatures) from the source end and compares it with the local root node data (containing 204 Signatures) to find the differing Signatures. For example, if the 89th Signature differs, the comparing module 70 of the disaster recovery end requests the data corresponding to the 89th Signature from the source end and compares it with the corresponding local data; and so on, until the payload data is obtained and updated locally by the copying module 72.
In an actual replication process, several Signatures may differ within the same root node or intermediate node. In the system implementation, the comparing module 70 therefore obtains all changed nodes in turn by looping and recursing in a pre-order traversal, and finally copies all changed payload data to the disaster recovery end in order.
As the above processing shows, the technical solution of the embodiment of the present invention does not need to traverse the whole Merkle Tree to find the changed objects. During the recursive traversal from the root node downward, only the changed subtrees need to be compared and followed. Taking fig. 5 as an example, the 204 Signatures of the source-end root node and the disaster-recovery-end root node are compared, and only the 89th Signature is found to differ; the subtrees corresponding to the 1st to 88th and the 90th to 204th Signatures can then be ignored, and only the subtree corresponding to the 89th Signature is compared further, down to the payload, so that the remote replication targets exactly the changed data.
In the prior-art Change Map approach, every Change Map entry must be compared, so the time complexity is $O(N)$, where $N$ is the number of entries (the number of regions). With the Merkle Tree, only the paths that have changed need to be followed, giving a time complexity of $O(K \log_{204} N)$, where $K$ is the number of objects that actually changed. When little data has changed, the number of comparisons is reduced substantially.
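The difference between the two complexity figures can be made concrete with a rough count. The sketch below is only an estimate under stated assumptions: the function names are hypothetical, the Change Map is assumed to cost one comparison per region, and the Merkle Tree figure is an upper bound that charges every changed object a full root-to-leaf path (changed objects that share nodes need fewer fetches). Each fetched node still requires comparing up to 204 Signatures, which is a constant factor.

```python
FANOUT = 204  # Signatures per node (from the text)


def tree_depth(num_objects: int) -> int:
    """Smallest number of node layers whose leaf layer can hold num_objects Signatures."""
    depth, capacity = 1, FANOUT
    while capacity < num_objects:
        depth += 1
        capacity *= FANOUT
    return depth


def change_map_comparisons(num_regions: int) -> int:
    """Assumption: one comparison per Change Map entry, i.e. O(N)."""
    return num_regions


def merkle_node_fetches(num_objects: int, num_changed: int) -> int:
    """Upper bound on nodes fetched: one per layer for each changed object."""
    return num_changed * tree_depth(num_objects)


if __name__ == "__main__":
    n = FANOUT ** 3  # objects addressable by a full 3-layer tree
    print(change_map_comparisons(n), merkle_node_fetches(n, 10))
```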
In summary, according to the technical solution of the embodiment of the present invention, the data blocks that need to be copied to the remote end can be found quickly through the logical volume's own tree of Signatures, organized as a Merkle Tree, without introducing an additional Change Map, thereby improving the read-write efficiency and the replication efficiency of the data volume.
System embodiment two
An embodiment of the present invention provides a distributed storage system, as shown in fig. 9, including a plurality of distributed storage devices, each of which comprises a memory 90, a processor 92, and a computer program stored in the memory 90 and executable on the processor 92; when executed by the processor 92, the computer program implements the steps of the Merkle tree based data reading and writing method and of the data remote copying method:
Specifically, as shown in fig. 2, the Merkle tree based data reading and writing method includes:
step 201, dividing a logical volume in a distributed storage system by a predetermined size to obtain a plurality of data blocks, calculating a hash value of each data block, and using the hash value as a signature of the corresponding data block; in the present embodiment, the predetermined size is 4 KB;
step 202, storing the signature of each data block in a leaf node of the Merkle tree, calculating the hash value of each leaf node and using it as the signature of that leaf node, storing the signature of each leaf node in its parent node, and so on until the signature of the logical volume is stored in the root node of the Merkle tree; in the embodiment of the present invention, the size of the hash value is 20 bytes, the size of a node in the Merkle tree is 4 KB, and each node in the Merkle tree stores up to 204 signatures.
That is, in the distributed storage system, the logical volume is divided in units of 4 KB. Each 4 KB unit is an object (i.e., the data block described above), and for each object a 20-byte (160-bit) hash value is calculated from its content by a hash function; this hash value is called the Signature (i.e., the signature described above). Since a 20-byte (160-bit) value has $2^{160}$ possible values, an astronomically large number, two objects with the same Signature can be regarded as having the same data content. When reading object data, only the 20-byte Signature of the object needs to be provided, and the object data itself can be obtained through the relevant index.
In an embodiment of the invention, the logical volume manages its objects with a Merkle Tree, which is a hierarchical N-ary tree; in this embodiment it is a 204-ary tree. The Merkle Tree consists of a group of leaf nodes, a group of intermediate nodes and a root node, and each node stores the Signatures of its child nodes. Each node is 4 KB in size, so one node can store $\lfloor 4096/20 \rfloor = 204$ Signatures; in other words, each node may have up to 204 child nodes. The leaf nodes store the Signatures of the objects of the logical volume, that is, of the content (payload), and each leaf node can store the Signatures of 204 objects. The parent node of a leaf node correspondingly stores the Signatures of up to 204 leaf nodes, and so on layer by layer, until the Signature of the root node is calculated; the Signature of the root node is the Signature of the logical volume. The Merkle Tree of a logical volume is shown in fig. 3, which depicts a 3-layer Merkle Tree; for ease of illustration, only a 5-ary tree is drawn.
step 203, calculating, from the logical address of the data block to be read or written, the position of the data block's signature in each layer of the Merkle tree and obtaining the signature of the data block; during a read, obtaining the content of the data block through the index according to the signature; during a write, updating the content of the data block and updating the signature at the corresponding position in each layer of the Merkle tree.
In step 203, calculating the position of the signature of the data block in each layer of the Merkle tree according to the logical address of the data block to be read or written specifically includes:
dividing the logical address by the size of the data block to obtain the logical block number of the data block to be read or written, and repeatedly dividing the logical block number by, and taking it modulo, the maximum number of signatures that a node in the Merkle tree can store, to obtain the position of the data block's signature in each layer of the Merkle tree.
That is, in the embodiment of the present invention, to calculate the position of an address on the Merkle Tree, the logical address (LBA) is first divided by 4 KB (the size of a data block) to obtain its logical block number (VBN); the VBN is then repeatedly divided by, and taken modulo, 204 (the maximum number of signatures that a node in the Merkle Tree can store) to obtain the position on each layer of the Merkle Tree.
For example, in a 3-layer Merkle Tree, the path corresponding to logical address 3251489060 is calculated as follows:
$\lfloor (3251489060 + 4095) / 4096 \rfloor = 793821$
$\lfloor (793821 + 203) / 204 \rfloor = 3892$; $793821 \bmod 204 = 57$
$\lfloor (3892 + 203) / 204 \rfloor = 20$; $3892 \bmod 204 = 16$
As can be seen from the above calculation, the positions corresponding to the logical address on the Merkle Tree are as follows (positions are counted from 1):
the 20th Signature inside the root node;
the 16th Signature of the 20th child node of the second layer;
the 57th Signature of the 16th child node of the third layer.
When reading or writing the logical volume, the Signature of the corresponding object is located layer by layer from the logical address in the manner described above. During a read, the payload data of the object is obtained directly through the index according to the Signature; during a write, the payload data is updated and the Signature at the corresponding position in the leaf node is updated.
As the above process shows, by design each layer of Merkle Tree nodes stores the Signatures of the nodes in the layer below. When the data of an object at the bottom layer changes, the leaf node, intermediate nodes and root node on that object's path all change accordingly, as shown by the white path in fig. 3.
That is, for a write operation on the logical volume, in addition to updating the corresponding payload data, the Signature of the payload in the leaf node is also updated; because the data of the leaf node has changed, the Signature of that leaf node in its parent node is updated in turn, and so on, until the Signature of the root node is finally updated.
Through the above technical solution, the logical volume's own tree of Signatures, organized as a Merkle Tree, can be established to facilitate subsequent remote replication.
Specifically, as shown in fig. 4, the data remote copy method specifically includes the following processing:
step 401, after receiving a data remote copy request from a source end, obtaining the signature of the root node of the Merkle tree of the logical volume from the source end, and comparing the obtained signature with the locally stored signature of the root node;
step 402, obtaining the signatures that differ between the two root nodes, requesting from the source end the signatures of the lower-layer nodes corresponding to the differing signatures, comparing the obtained signatures of the lower-layer nodes with the locally stored signatures of the corresponding lower-layer nodes, and so on until the changed leaf nodes at the bottom layer are obtained; then obtaining the content of the data blocks corresponding to those leaf nodes, copying the content to the local end, and updating the corresponding signatures in the corresponding nodes of each layer of the Merkle tree.
In the above processing, if there are multiple differing signatures, all nodes of the Merkle tree whose signatures have changed are obtained in turn by looping and recursing in a pre-order traversal, until the contents of all changed data blocks are obtained.
Because any change to an object changes all nodes on its Merkle Tree path, the tree provides a natural way of tracking data changes for remote incremental replication. Fig. 5 is a schematic signal interaction diagram of an example of the data remote copy method according to an embodiment of the present invention. As shown in fig. 5, a 2-layer Merkle Tree is taken as an example below to illustrate the remote copy process:
step 501, a source terminal initiates a remote copy request to a disaster recovery terminal;
step 502, a disaster recovery end requests a source end to obtain root node data;
step 503, the source end reads root node data (4 KB);
step 504, the source end sends the root node data to the disaster recovery end;
step 505, the disaster recovery end compares the root node data of the source end with the local root node data and finds that the 89th Signature differs; the 89th Signature of the source end is SigA_89;
step 506, the disaster recovery end requests from the source end the 4KB data corresponding to SigA_89;
step 507, the source end reads the 4KB data corresponding to SigA_89;
step 508, the source end sends the 4KB data corresponding to SigA_89 to the disaster recovery end;
step 509, the disaster recovery end compares the source-end data corresponding to SigA_89 with the corresponding local data and finds that the 201st Signature differs; the 201st Signature of the source end is SigB_201;
step 510, the disaster recovery end requests from the source end the 4KB data corresponding to SigB_201;
step 511, the source end reads the 4KB data corresponding to SigB_201;
step 512, the source end sends the 4KB data corresponding to SigB_201 to the disaster recovery end;
step 513, the disaster recovery end updates the Payload corresponding to SigB_201 locally.
As can be seen from the above process, the source (production) end initiates the remote copy request, while the actual data requests and comparisons are driven by the disaster recovery end.
The disaster recovery end first obtains the root node signature sent by the source end and compares it with the local root node signature. If the signatures are the same, the source end has no data changes and nothing needs to be copied. If the root node signatures differ, the disaster recovery end requests the root node data (containing 204 Signatures) from the source end and compares it with the local root node data (containing 204 Signatures) to find the differing Signatures. For example, if the 89th Signature differs, the disaster recovery end requests the data corresponding to the 89th Signature from the source end and compares it with the corresponding local data; and so on, until the payload data is obtained and updated locally.
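The request-and-compare exchange of steps 501 to 513 can be simulated in a few lines. The sketch below is an assumption-laden illustration rather than the actual protocol: every object (payload or tree node) is stored in a dictionary keyed by its Signature, the `fetch` callback stands in for a network request to the source end, SHA-1 is assumed as the 20-byte hash, both volumes are assumed to have the same number of blocks, and all function names are hypothetical.

```python
import hashlib

SIG_SIZE = 20  # bytes per Signature (from the text)


def sha(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()  # assumption: SHA-1


def store_tree(store: dict, blocks, fanout: int):
    """Store payloads and tree nodes by Signature; return (root Signature, node layers)."""
    sigs = []
    for block in blocks:
        store[sha(block)] = block
        sigs.append(sha(block))
    depth = 0
    while True:
        # A node's content is the concatenation of its children's Signatures.
        nodes = [b"".join(sigs[i:i + fanout]) for i in range(0, len(sigs), fanout)]
        sigs = []
        for node in nodes:
            store[sha(node)] = node
            sigs.append(sha(node))
        depth += 1
        if len(sigs) == 1:
            return sigs[0], depth


def sync(fetch, local: dict, src_sig: bytes, dst_sig: bytes, depth: int) -> int:
    """Pull changed objects from the source; return the number of payloads copied."""
    if src_sig == dst_sig:
        return 0                      # identical subtree, nothing to request
    data = fetch(src_sig)             # one request to the source end
    local[src_sig] = data
    if depth == 0:
        return 1                      # the fetched object is payload data
    old = local[dst_sig]
    copied = 0
    for off in range(0, len(data), SIG_SIZE):
        s, d = data[off:off + SIG_SIZE], old[off:off + SIG_SIZE]
        if s != d:
            copied += sync(fetch, local, s, d, depth - 1)
    return copied
```

For a 2-layer tree with a single changed block, this performs three requests (root node, one leaf node, one payload), matching the three request and response rounds in the steps above.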
In the actual copying process, several signatures may be found to be different in the same root node/intermediate node, and in terms of system implementation, all nodes that have been changed need to be sequentially acquired in a circulating, recursive, and preorder traversal manner, and finally all payload data that have been changed are sequentially copied to the disaster recovery side.
It can be seen from the above processing that the technical solution of the embodiment of the present invention does not need to traverse the whole Merkle Tree to obtain the changed object. In the recursive traversal process from the root node down, only the changed subtrees need to be compared and tracked. Taking fig. 5 as an example, 204 signatures of the source end root node and the disaster backup end root node are compared, and it is found that only the 89 th Signature is different; then the subtrees corresponding to the 1 st to 88 th and the 90 th to 204 th signatures can be ignored, and only the subtrees corresponding to the 89 th signatures are used for the next comparison until the payload node, so that the remote replication is quite accurate.
In the prior-art Change Map approach, every Change Map entry must be compared, so the time complexity is $O(N)$, where $N$ is the number of entries (i.e., the number of regions). Fast incremental replication using the Merkle Tree only needs to follow the paths that changed, with a time complexity of $O(K \log_{204} N)$, where $K$ is the number of objects that actually changed. When little data has changed, the number of comparisons is reduced substantially.
In summary, according to the technical solution of the embodiment of the present invention, the data blocks that need to be copied to the remote end can be found quickly through the logical volume's own tree structure of Merkle Tree Signatures, without additionally introducing a Change Map, which improves the read-write efficiency and the copy efficiency of the data volume.
Device embodiment III
The embodiment of the present invention provides a computer-readable storage medium on which an implementation program for information transmission is stored. When executed by a processor 92, the program implements the steps of the Merkle-tree-based data read-write method and data remote copy method:
Specifically, as shown in fig. 2, the Merkle-tree-based data read-write method includes:
step 201, dividing a logical volume in a distributed storage system into a plurality of data blocks of a predetermined size, calculating a hash value of each data block, and using the hash value as the signature of the corresponding data block; in the present embodiment, the predetermined size is 4KB;
step 202, storing the signature of each data block on a leaf node of the Merkle tree, calculating the hash value of each leaf node and using it as the signature of that leaf node, storing the signature of each leaf node on its parent node in the layer above, and so on, until the signature of the logical volume is stored on the root node of the Merkle tree; in the embodiment of the present invention, the hash value is 20 bytes in size, a node in the Merkle tree is 4KB in size, and each node in the Merkle tree stores at most 204 signatures.
That is, in the distributed storage system, the logical volume is divided into units of 4KB. Each 4KB unit is an object (i.e., the data block above), and for each object a hash function computes a 20-byte (160-bit) hash value from its content, called the Signature (i.e., the signature above). Since a 20-byte (160-bit) value has $2^{160}$ possibilities, an astronomically large number, two objects whose Signatures are the same can be considered to have the same data content. When reading object data, it suffices to provide the object's 20-byte Signature; the object data itself can then be obtained through the relevant index.
In an embodiment of the invention, the logical volume manages its objects with a Merkle Tree, which is a hierarchical N-way tree; in this embodiment it is a 204-way tree. The Merkle Tree is composed of a group of leaf nodes, a group of intermediate nodes and a root node, and each node stores the Signatures of its child nodes. The size of each node is 4KB, so one node can store $\lfloor 4096/20 \rfloor = 204$ Signatures; in other words, each node may have 204 child nodes. The leaf nodes store the Signatures of the logical volume's objects, that is, of their contents (payload), and each leaf node can store the Signatures of 204 objects. The parent node of the leaf nodes stores the Signatures of up to 204 leaf nodes, and so on layer by layer, until the Signature of the root node is calculated, which is the Signature of the logical volume. The Merkle Tree of a logical volume is shown in FIG. 3, which depicts a 3-layer Merkle Tree; for ease of illustration, only a 5-way tree is drawn.
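The bottom-up construction described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the text only requires a 20-byte hash, so SHA-1 is an assumed choice, and the names `signature`, `build_levels` and `volume_signature` are hypothetical.

```python
import hashlib

FANOUT = 204  # signatures per 4KB node: 4096 // 20


def signature(data: bytes) -> bytes:
    """Return a 20-byte signature (SHA-1 is an assumption, not from the source)."""
    return hashlib.sha1(data).digest()


def build_levels(blocks, fanout=FANOUT):
    """Build the tree bottom-up from a non-empty list of data blocks.

    Returns a list of levels, root level first. Each level is a list of
    nodes, and each node is a list of child signatures.
    """
    if not blocks:
        raise ValueError("at least one data block is required")
    sigs = [signature(block) for block in blocks]
    levels = []
    while True:
        # Pack consecutive signatures into nodes of at most `fanout` entries.
        nodes = [sigs[i:i + fanout] for i in range(0, len(sigs), fanout)]
        levels.append(nodes)
        if len(nodes) == 1:
            break
        # A node's signature is the hash of its concatenated child signatures.
        sigs = [signature(b"".join(node)) for node in nodes]
    levels.reverse()
    return levels


def volume_signature(levels) -> bytes:
    """Signature of the root node, which serves as the volume signature."""
    return signature(b"".join(levels[0][0]))
```

With a fan-out of 5, as in the simplified drawing, seven blocks produce two leaf nodes (holding five and two signatures) under a single root node.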
Step 203, according to the logical address of the data block to be read or written, calculating the position of the signature of the data block in each layer of the Merkle tree and obtaining the signature of the data block; when reading, obtaining the content of the data block through the index according to the signature; when writing, updating the content of the data block and updating the signature at the corresponding position in each layer of the Merkle tree.
In step 203, calculating the position of the signature of the data block in each layer of the Merkle tree according to the logical address of the data block to be read or written specifically includes:
dividing the logical address by the size of the data block to obtain the logical block number of the data block to be read or written, and repeatedly performing integer division and modulo operations on the logical block number by the maximum number of signatures that a node in the Merkle tree can store, to obtain the position of the signature of the data block in each layer of the Merkle tree.
That is, in the embodiment of the present invention, to calculate the position of a given address on the Merkle Tree, the logical address (LBA) is first divided by 4K (the size of a data block) to obtain the logical block number (VBN) of that address, and the position of the address in each layer of the Merkle Tree is then obtained by repeated integer division and modulo by 204 (the maximum number of signatures that a node in the Merkle Tree can store).
For example, in a 3-layer Merkle Tree, the path corresponding to logical address 3251489060 is calculated as follows:
$\lfloor (3251489060 + 4095) / 4096 \rfloor = 793821$
$\lfloor (793821 + 203) / 204 \rfloor = 3892$; $793821 \bmod 204 = 57$
$\lfloor (3892 + 203) / 204 \rfloor = 20$; $3892 \bmod 204 = 16$
As can be seen from the above calculation, the positions corresponding to this logical address on the Merkle Tree are (positions are numbered from 1):
the 20th Signature in the root node;
the 16th Signature in the 20th child node of the second layer;
the 57th Signature in the 16th child node of the third layer.
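The arithmetic of this example can be reproduced with a short sketch. The function name `signature_path` and its parameters are hypothetical; the code mirrors the example's integer arithmetic literally, and the source does not show how a remainder of 0 is handled, so that case is left as computed.

```python
BLOCK_SIZE = 4096  # size of a data block (object) in bytes
FANOUT = 204       # maximum number of signatures per node


def signature_path(lba: int, layers: int = 3):
    """Return (vbn, positions), with positions ordered from root to leaf layer.

    Follows the worked example: the logical block number is the address
    divided by the block size, rounded up; each lower-layer position is the
    running value modulo FANOUT, and the running value is then divided by
    FANOUT, rounded up, to move one layer toward the root.
    """
    vbn = (lba + BLOCK_SIZE - 1) // BLOCK_SIZE
    positions = []
    value = vbn
    for _ in range(layers - 1):
        positions.append(value % FANOUT)
        value = (value + FANOUT - 1) // FANOUT
    positions.append(value)  # position inside the root node
    positions.reverse()
    return vbn, positions


vbn, path = signature_path(3251489060)
print(vbn, path)  # 793821 [20, 16, 57]
```

The printed path matches the three positions listed above for logical address 3251489060.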
When reading or writing the logical volume, the Signature of the corresponding object is located layer by layer from the logical address in the manner described above. On a read, the payload data of the object is fetched directly through the index according to the Signature; on a write, the payload data is updated and the Signature at the corresponding position in the leaf node is updated.
It follows from the above that, by the design of the Merkle Tree, each layer of nodes stores the Signatures of the nodes in the layer below. When the object data at the bottom layer changes, the leaf node, intermediate nodes and root node on the object's path all change accordingly, as shown by the white path in fig. 3.
That is, a write to the logical volume updates not only the corresponding payload data but also the Signature of that payload in the leaf node; since the leaf node's data has changed, the Signature of the leaf node stored in its parent node is updated in turn, and so on, until the Signature of the root node is finally updated.
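The upward propagation of a write can be sketched as below. This is an illustrative in-memory model only: the tree is assumed to be a list of levels (root first), each node a list of child signatures, with 0-based indices; SHA-1 and the name `write_block` are assumptions, not details from the source.

```python
import hashlib


def signature(data: bytes) -> bytes:
    """20-byte signature; SHA-1 is an assumed choice of hash function."""
    return hashlib.sha1(data).digest()


def write_block(levels, fanout, block_index, new_data):
    """Update the signatures on the path of one rewritten data block.

    `levels` holds the tree with the root level first; every node is a list
    of child signatures. Only the nodes on the block's path are touched.
    Returns the new root node signature, i.e. the new volume signature.
    """
    sig = signature(new_data)
    index = block_index
    for level in reversed(levels):  # from the leaf layer up to the root
        node_index, slot = divmod(index, fanout)
        level[node_index][slot] = sig
        # The changed node gets a new signature, stored one layer above.
        sig = signature(b"".join(level[node_index]))
        index = node_index
    return sig
```

For a tree with fan-out 2 and four blocks, rewriting block 2 changes one entry in the second leaf node and one entry in the root node, while the first leaf node is left untouched.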
Through the above technical solution, the logical volume's own tree structure of Merkle Tree Signatures can be established to facilitate subsequent remote copying.
Specifically, as shown in fig. 4, the data remote copy method includes the following processing:
step 401, after receiving a data remote copy request from the source end, obtaining the signature of the root node of the logical volume's Merkle tree from the source end, and comparing the obtained signature with the locally stored signature of the root node;
step 402, if the root node signatures of the source end and the destination end differ, requesting the content from the source end, obtaining the signatures of all lower layer nodes of the root node by reading that content, comparing the obtained signatures of the lower layer nodes with the locally stored signatures of the corresponding lower layer nodes, and requesting from the source end the content of the lower layer nodes whose signatures differ; and so on, until the changed leaf nodes at the lowest layer are obtained, the content of the data blocks whose signatures differ in those leaf nodes is obtained and copied locally, and the corresponding signatures in the corresponding nodes of each layer of the Merkle tree are updated.
In the above processing, if multiple signatures differ, all nodes with changed signatures in the Merkle tree are acquired in turn by looping and recursing in a preorder traversal, until the content of all changed data blocks has been obtained.
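The comparison in steps 401 and 402 can be sketched as a recursive preorder walk. This is a simplified single-process model under stated assumptions: each side is represented by a dictionary from node signature to the list of child signatures (standing in for the request/response exchange with the source end), SHA-1 is an assumed hash, and `build_store` and `changed_blocks` are hypothetical names.

```python
import hashlib


def signature(data: bytes) -> bytes:
    """20-byte signature; SHA-1 is an assumed choice of hash function."""
    return hashlib.sha1(data).digest()


def build_store(blocks, fanout):
    """Return (root_signature, nodes, layers) for a non-empty block list.

    `nodes` maps each node signature to that node's list of child signatures.
    """
    sigs = [signature(block) for block in blocks]
    nodes, layers = {}, 0
    while True:
        groups = [sigs[i:i + fanout] for i in range(0, len(sigs), fanout)]
        sigs = []
        for group in groups:
            node_sig = signature(b"".join(group))
            nodes[node_sig] = group
            sigs.append(node_sig)
        layers += 1
        if len(sigs) == 1:
            return sigs[0], nodes, layers


def changed_blocks(src_nodes, dst_nodes, src_sig, dst_sig, layers, path=()):
    """Preorder walk that descends only into subtrees whose signatures differ.

    Returns a list of (path, source_signature) for every changed data block,
    where `path` holds the 1-based position taken in each layer.
    """
    if src_sig == dst_sig:
        return []                   # identical subtree: skip it entirely
    if layers == 0:
        return [(path, src_sig)]    # a data block that must be copied
    src_children = src_nodes[src_sig]
    dst_children = dst_nodes.get(dst_sig, [])
    changed = []
    for slot, child in enumerate(src_children):
        local = dst_children[slot] if slot < len(dst_children) else None
        changed += changed_blocks(src_nodes, dst_nodes, child, local,
                                  layers - 1, path + (slot + 1,))
    return changed
```

In a real deployment, each lookup in `src_nodes` would correspond to one 4KB node requested from the source end, so the number of requests grows with the number of changed paths rather than with the size of the volume.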
Because a change to any object changes every node on its path in the Merkle Tree, remote incremental replication gains a natural way of tracking data changes. Fig. 5 is a schematic signal interaction diagram of an example of the data remote copy method according to an embodiment of the present invention. As shown in fig. 5, a 2-layer Merkle Tree is taken as an example below to demonstrate the remote copy process:
step 501, the source end initiates a remote copy request to the disaster recovery end;
step 502, the disaster recovery end requests the root node data from the source end;
step 503, the source end reads the root node data (4KB);
step 504, the source end sends the root node data to the disaster recovery end;
step 505, the disaster recovery end compares the source end's root node data with its local root node data and finds that the 89th signature differs; the 89th signature at the source end is SigA_89;
step 506, the disaster recovery end requests from the source end the 4KB of data corresponding to SigA_89;
step 507, the source end reads the 4KB of data corresponding to SigA_89;
step 508, the source end sends the 4KB of data corresponding to SigA_89 to the disaster recovery end;
step 509, the disaster recovery end compares the source end's data for SigA_89 with the corresponding local data and finds that the 201st signature differs; the 201st signature at the source end is SigB_201;
step 510, the disaster recovery end requests from the source end the 4KB of data corresponding to SigB_201;
step 511, the source end reads the 4KB of data corresponding to SigB_201;
step 512, the source end sends the 4KB of data corresponding to SigB_201 to the disaster recovery end;
step 513, the disaster recovery end updates its local copy with the payload corresponding to SigB_201.
As can be seen from the above process, the source end (the production side) initiates the remote copy request, while the actual data requests and comparisons are driven by the disaster recovery end.
The disaster recovery end first requests the root node data (containing 204 signatures) and compares it with the local root node data (also containing 204 signatures); if they are identical, no data has changed at the source end and nothing needs to be copied. If some signatures differ, for example the 89th signature, the disaster recovery end requests the data corresponding to the 89th signature from the source end and compares it with the corresponding local data; and so on, until the payload data is obtained and updated locally.
In an actual copy, several signatures may differ within the same root node or intermediate node. In the implementation, all changed nodes are therefore acquired in turn by looping and recursing in a preorder traversal, and finally all changed payload data is copied to the disaster recovery end in sequence.
It can be seen from the above processing that the technical solution of the embodiment of the present invention does not need to traverse the whole Merkle Tree to find the changed objects. In the recursive traversal from the root node downward, only the changed subtrees need to be compared and tracked. Taking fig. 5 as an example, the 204 signatures of the source end root node and of the disaster recovery end root node are compared, and only the 89th signature is found to differ. The subtrees corresponding to the 1st to 88th and the 90th to 204th signatures can then be ignored, and only the subtree under the 89th signature is compared further, down to the payload node, so the remote replication targets exactly the data that changed.
In the prior-art Change Map approach, every Change Map entry must be compared, so the time complexity is $O(N)$, where $N$ is the number of entries (i.e., the number of regions). Fast incremental replication using the Merkle Tree only needs to follow the paths that changed, with a time complexity of $O(K \log_{204} N)$, where $K$ is the number of objects that actually changed. When little data has changed, the number of comparisons is reduced substantially.
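To give a feel for the two bounds, the sketch below counts the node layers needed for a given number of objects and the resulting upper bound on nodes fetched. The function names are hypothetical, and the comparison assumes, purely for illustration, a Change Map with one entry per 4KB region, which the source does not specify.

```python
def tree_depth(n_objects: int, fanout: int = 204) -> int:
    """Number of node layers needed to index n_objects signatures."""
    depth, capacity = 1, fanout
    while capacity < n_objects:
        capacity *= fanout
        depth += 1
    return depth


def max_nodes_compared(n_objects: int, k_changed: int, fanout: int = 204) -> int:
    """Upper bound on nodes fetched: one root-to-leaf path per changed object."""
    return k_changed * tree_depth(n_objects, fanout)


n = 2 ** 28  # a 1 TiB volume split into 4 KiB objects
print(tree_depth(n))              # 4
print(max_nodes_compared(n, 10))  # 40
```

Under these assumptions, ten changed objects in a 1 TiB volume require at most 40 node fetches (each comparing up to 204 signatures), whereas a per-region Change Map scan would touch all 268,435,456 entries.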
In summary, according to the technical solution of the embodiment of the present invention, the data blocks that need to be copied to the remote end can be found quickly through the logical volume's own tree structure of Merkle Tree Signatures, without additionally introducing a Change Map, which improves the read-write efficiency and the copy efficiency of the data volume.
The computer-readable storage medium of this embodiment includes, but is not limited to: ROM, RAM, magnetic or optical disks, and the like.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (13)

1. A data read-write method based on a Merkle tree, characterized by comprising the following steps:
dividing a logical volume in a distributed storage system by a preset size to obtain a plurality of data blocks, calculating a hash value of each data block, and taking the hash value as the signature of the corresponding data block;
storing the signature of each data block on a leaf node of the Merkle tree, calculating the hash value of each leaf node, using the hash value as the signature of the corresponding leaf node, storing the signature of each leaf node on the parent node in the layer above, and so on, until the signature of the logical volume is stored on the root node of the Merkle tree;
calculating, according to the logical address of the data block to be read or written, the position of the signature of the data block in each layer of the Merkle tree, and obtaining the signature of the data block; when reading, obtaining the content of the data block through an index according to the signature; when writing, updating the content of the data block and updating the signature at the corresponding position in each layer of the Merkle tree.
2. The method of claim 1, wherein the preset size is 4KB; the size of the hash value is 20 bytes; the size of a node in the Merkle tree is 4KB, and each node in the Merkle tree stores at most 204 signatures.
3. The method of claim 1, wherein calculating the position of the signature of the data block in each layer of the Merkle tree according to the logical address of the data block to be read or written comprises:
dividing the logical address by the size of the data block to obtain the logical block number of the data block to be read or written, and repeatedly performing integer division and modulo operations on the logical block number by the maximum number of signatures that a node in the Merkle tree can store, to obtain the position of the signature of the data block in each layer of the Merkle tree.
4. A data remote copy method based on the data read-write method of any one of claims 1 to 3, the method specifically comprising:
after receiving a data remote copy request from a source end, obtaining the signature of the root node of the Merkle tree of a logical volume from the source end, and comparing the obtained signature with the locally stored signature of the root node;
if the root node signatures of the source end and the destination end differ, requesting the content from the source end, obtaining the signatures of all lower layer nodes of the root node by reading that content, comparing the obtained signatures of the lower layer nodes with the locally stored signatures of the corresponding lower layer nodes, and requesting from the source end the content of the lower layer nodes whose signatures differ; and so on, until the changed leaf nodes at the lowest layer are obtained, the content of the data blocks whose signatures differ in those leaf nodes is obtained and copied locally, and the corresponding signatures in the corresponding nodes of each layer of the Merkle tree are updated.
5. The method of claim 4, wherein when multiple signatures differ, all nodes with changed signatures in the Merkle tree are acquired in turn by looping and recursing in a preorder traversal, until the content of all changed data blocks has been obtained.
6. A data read-write device based on a Merkle tree, for use in a distributed storage system, the data read-write device comprising:
a division and calculation module, configured to divide a logical volume by a preset size to obtain a plurality of data blocks, calculate a hash value of each data block, and use the hash value as the signature of the corresponding data block;
a Merkle tree module, configured to store the signature of each data block on a leaf node of the Merkle tree, calculate the hash value of each leaf node, use the hash value as the signature of the corresponding leaf node, store the signature of each leaf node on the parent node in the layer above, and so on, until the signature of the logical volume is stored on the root node of the Merkle tree;
and a read-write module, configured to calculate, according to the logical address of the data block to be read or written, the position of the signature of the data block in each layer of the Merkle tree and obtain the signature of the data block; when reading, to obtain the content of the data block through an index according to the signature; and when writing, to update the content of the data block and update the signature at the corresponding position in each layer of the Merkle tree.
7. The data read-write device according to claim 6, wherein the preset size is 4KB; the size of the hash value is 20 bytes; the size of a node in the Merkle tree is 4KB, and each node in the Merkle tree stores at most 204 signatures.
8. The data read-write device according to claim 6, wherein the read-write module is specifically configured to:
divide the logical address by the size of the data block to obtain the logical block number of the data block to be read or written, and repeatedly perform integer division and modulo operations on the logical block number by the maximum number of signatures that a node in the Merkle tree can store, to obtain the position of the signature of the data block in each layer of the Merkle tree.
9. A data remote copy device based on the data read-write device according to any one of claims 6 to 8, for use in a distributed storage system and disposed at a disaster recovery end, comprising:
a comparison module, configured to obtain the signature of the root node of the Merkle tree of a logical volume from a source end after receiving a data remote copy request from the source end, and compare the obtained signature with the locally stored signature of the root node; and, if the root node signatures of the source end and the destination end differ, to request the content from the source end, obtain the signatures of all lower layer nodes of the root node by reading that content, compare the obtained signatures of the lower layer nodes with the locally stored signatures of the corresponding lower layer nodes, and request from the source end the content of the lower layer nodes whose signatures differ, and so on, until the changed leaf nodes at the lowest layer are obtained;
and a copy module, configured to obtain the content of the data blocks corresponding to the differing signatures in the changed leaf nodes at the lowest layer, copy the content locally, and update the corresponding signatures in the corresponding nodes of each layer of the Merkle tree.
10. The device according to claim 9, wherein the comparison module is specifically configured to, when multiple signatures differ, acquire all nodes with changed signatures in the Merkle tree in turn by looping and recursing in a preorder traversal, until the content of all changed data blocks has been obtained.
11. A distributed storage system comprising a data reading and writing apparatus according to any one of claims 6 to 8 and a data remote copying apparatus according to any one of claims 9 to 10.
12. A distributed storage system comprising a plurality of distributed storage devices, each distributed storage device comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the steps of the Merkle-tree-based data read-write method of any one of claims 1 to 3 and the steps of the data remote copy method of any one of claims 4 to 5.
13. A computer-readable storage medium on which an implementation program for information transmission is stored, the program, when executed by a processor, implementing the steps of the Merkle-tree-based data read-write method and of the data remote copy method as recited in any one of claims 4 to 5.
CN202010094576.4A 2020-02-16 2020-02-16 Data reading and writing method, data remote copying method and device and distributed storage system Pending CN111309523A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010094576.4A CN111309523A (en) 2020-02-16 2020-02-16 Data reading and writing method, data remote copying method and device and distributed storage system

Publications (1)

Publication Number Publication Date
CN111309523A true CN111309523A (en) 2020-06-19

Family

ID=71148994

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010094576.4A Pending CN111309523A (en) 2020-02-16 2020-02-16 Data reading and writing method, data remote copying method and device and distributed storage system

Country Status (1)

Country Link
CN (1) CN111309523A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102054035A (en) * 2010-12-29 2011-05-11 北京播思软件技术有限公司 Data range-based method for synchronizing data in database
CN105243334A (en) * 2015-09-17 2016-01-13 浪潮(北京)电子信息产业有限公司 Data storage protection method and system
CN106815528A (en) * 2016-12-07 2017-06-09 重庆软云科技有限公司 A kind of file management method and device, storage device
US20190305937A1 (en) * 2016-12-16 2019-10-03 Nokia Technologies Oy Secure document management
CN110647503A (en) * 2019-10-09 2020-01-03 重庆特斯联智慧科技股份有限公司 Distributed storage method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
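Wang Shuyan (王曙燕): "File Systems" (文件系统), in 《计算机专业核心课程辅导及考研攻略》 (Core Computer Science Courses: Tutoring and Postgraduate Entrance Exam Guide) *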

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111309263A (en) * 2020-02-16 2020-06-19 西安奥卡云数据科技有限公司 Method for realizing logical volume in distributed object storage
CN111309263B (en) * 2020-02-16 2020-11-24 西安奥卡云数据科技有限公司 Method for realizing logical volume in distributed object storage
CN112286873A (en) * 2020-10-30 2021-01-29 西安奥卡云数据科技有限公司 Hash tree caching method and device
CN114626532A (en) * 2020-12-10 2022-06-14 合肥本源量子计算科技有限责任公司 Method and device for reading data based on address, storage medium and electronic device
CN114626532B (en) * 2020-12-10 2023-11-03 本源量子计算科技(合肥)股份有限公司 Method and device for reading data based on address, storage medium and electronic device
CN113055431A (en) * 2021-01-13 2021-06-29 湖南天河国云科技有限公司 Block chain-based industrial big data file efficient chaining method and device
CN115190136A (en) * 2021-04-21 2022-10-14 统信软件技术有限公司 Data storage method, data transmission method and computing equipment
CN115190136B (en) * 2021-04-21 2024-03-01 统信软件技术有限公司 Data storage method, data transmission method and computing equipment
CN113259345A (en) * 2021-05-12 2021-08-13 国网山东省电力公司东平县供电公司 Intelligent power distribution network data secure transmission method, system and storage medium
CN113794558A (en) * 2021-09-16 2021-12-14 烽火通信科技股份有限公司 L-tree calculation method, device and system in XMSS algorithm
CN113794558B (en) * 2021-09-16 2024-02-27 烽火通信科技股份有限公司 L-tree calculation method, device and system in XMS algorithm
WO2023108360A1 (en) * 2021-12-13 2023-06-22 华为技术有限公司 Method and apparatus for managing data in storage system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200619