CN112286888B - Distributed file system copy causality consistent access method facing wide area network - Google Patents


Info

Publication number
CN112286888B
Authority
CN
China
Prior art keywords
version
storage
node
copy
request
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011001103.1A
Other languages
Chinese (zh)
Other versions
CN112286888A (en)
Inventor
肖利民
周汉杰
秦广军
霍志胜
宋尧
徐耀文
王超波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University
Priority to CN202011001103.1A
Publication of CN112286888A
Application granted
Publication of CN112286888B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/178 Techniques for file synchronisation in file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems

Abstract

The disclosure provides a wide-area-network-oriented method for causally consistent access to replicas in a distributed file system. In embodiments of the present disclosure, clients are provided with a causally consistent access service for replica files in a wide-area distributed replica space. The method constructs a partial order by mining the dependencies among write operations on replica files; constructs a total order (full order) from the maximum time deviation, the timestamp, and a user-defined priority; stores the data index in an interval tree that supports multi-version control; resolves concurrency conflicts by rolling back versions of the multi-version index interval tree; and ultimately provides causally consistent access to replica files.

Description

Distributed file system copy causality consistent access method facing wide area network
Technical Field
The invention discloses a wide-area-network-oriented method for causally consistent access to replicas in a distributed file system. It addresses challenges faced by wide-area high-performance computing and belongs to the technical field of computers.
Background
Storing file replicas across domains can greatly reduce the access latency of a data set and improve throughput. In high-performance computing (HPC) applications and big data processing applications, most distributed computing tasks compute in memory, and file writes occur only when a checkpoint is taken or when computing output is persisted. Computing tasks can also avoid conflicting with one another by choosing file paths and file names sensibly. Harding's research shows that adopting concurrent access control can cost up to more than 99% of performance, depending on the number of cluster nodes and the proportion of conflicting requests.
However, because network bandwidth between wide-area centers is small, the time window during which a data set is inconsistent across centers is too long. Failures during synchronization, such as out-of-order IO operations and non-atomic execution, can also cause irreversible damage to the data set and its replicas, with serious consequences. Applications therefore still have certain consistency requirements when they access data across multiple centers.
Concurrent access control based on a master-slave architecture requires that write access to a single file or storage unit be submitted only through the master replica node. Most storage systems currently used in industry adopt this centralized form of concurrent access control. To balance load, they also partition data with a consistent hashing algorithm or a data layout strategy, so that read and write accesses are balanced across nodes.
The mainstream distributed storage systems Ceph, GlusterFS and HDFS all use centralized concurrent access control based on a master-slave replica mechanism. Because network communication performance does not differ significantly between nodes within a single center, the commit point for write requests to a storage unit is restricted to a single node and a FIFO queue is built there. This guarantees that all write and update operations are applied sequentially and that data is updated linearly along the time axis.
On a multi-center distributed replica architecture, however, network communication overhead differs greatly between nodes. A concurrent access control method that restricts slave replicas to read-only access severely limits file write throughput, wastes scarce wide-area bandwidth and compute time, and makes it difficult to exploit the locality of file reads and writes.
Two-phase locking (the "two-stage lock") built on distributed locks is a concurrent access control method widely used in current distributed database systems. Distributed locks can be implemented with several algorithms. For example, Spanner implements its lock service on a consensus protocol of the Paxos family; Hadoop implements lock services through Zookeeper, based on the existence of nodes in a Unix-file-system-like tree maintained with the Zab protocol; and Sherlock implements its lock service on a single-machine key-value database. Whatever the means, acquiring a lock costs at least one remote access latency.
The research systems Tango, Granola, Rococo and Ren all use two-phase locking for concurrent access control of data. Two-phase locking divides lock handling into two phases, a growth phase and a contraction phase. Locks may only be acquired during the growth phase and may only be released during the contraction phase. With fine-grained two-phase locking, two conflicting transactions are executed serially, while non-conflicting transactions can run in parallel. Deadlock can be avoided by adding conflict waiting or conflict termination on top of two-phase locking.
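The growth and contraction phases described above can be illustrated with a minimal single-process sketch. It is not taken from any of the cited systems: the class name `TwoPhaseTxn`, the in-memory `lock_table`, and the use of `threading.Lock` in place of a distributed lock service are all assumptions made for illustration.

```python
import threading


class TwoPhaseTxn:
    """Illustrative two-phase locking: acquire only while growing, then release."""

    def __init__(self, lock_table):
        self.lock_table = lock_table   # hypothetical: maps a key to a threading.Lock
        self.held = []                 # locks acquired during the growth phase
        self.shrinking = False         # becomes True once the first release happens

    def acquire(self, key):
        # Growth phase: acquiring after any release would violate 2PL.
        if self.shrinking:
            raise RuntimeError("2PL violation: acquire after release")
        lock = self.lock_table[key]
        lock.acquire()
        self.held.append(lock)

    def release_all(self):
        # Contraction phase: from here on, only releases are allowed.
        self.shrinking = True
        while self.held:
            self.held.pop().release()


lock_table = {"x": threading.Lock(), "y": threading.Lock()}
txn = TwoPhaseTxn(lock_table)
txn.acquire("x")
txn.acquire("y")
txn.release_all()
```

In a distributed deployment each `acquire` would cost at least one remote round trip, which is the overhead the surrounding text refers to.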
However, in a two-phase locking protocol built on distributed locks, each lock operation introduces a large delay, so such protocols perform poorly. The DrTM system also uses concurrent access control based on two-phase locking to resolve conflicts between local and remote operations. Unlike other distributed storage systems, DrTM implements a point-to-point remote lock protocol on top of RDMA atomic compare-and-swap (CAS) operations. An RDMA network achieves very low lock acquisition latency, which greatly improves the performance of a system based on two-phase locking.
Optimistic Concurrency Control (OCC) differs from concurrent access control based on two-phase locking in that it does not block other concurrent processes before or during the execution of an operation. Instead, when the operation finishes executing and submits its result, it enters a validation phase that checks whether any conflicting operation interfered with it during execution. If a conflict occurred, the tentative changes produced by the operation are discarded; if no conflict occurred, the tentative changes are applied to persistent storage.
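A minimal sketch of the validation step follows. It assumes a per-key version counter as the conflict detector, which is one common way to realize OCC and is not claimed to be the scheme of any system cited here; `Store` and `OccTxn` are hypothetical names.

```python
class Store:
    """Hypothetical in-memory store with a version counter per key."""

    def __init__(self):
        self.data = {}
        self.version = {}


class OccTxn:
    def __init__(self, store):
        self.store = store
        self.reads = {}    # key -> version observed when the key was read
        self.writes = {}   # key -> tentative value, private to this transaction

    def read(self, key):
        if key in self.writes:
            return self.writes[key]
        self.reads[key] = self.store.version.get(key, 0)
        return self.store.data.get(key)

    def write(self, key, value):
        self.writes[key] = value   # nothing is blocked, nothing is published yet

    def commit(self):
        # Validation phase: fail if any key read here was changed by another commit.
        for key, seen in self.reads.items():
            if self.store.version.get(key, 0) != seen:
                self.writes.clear()          # discard the tentative changes
                return False
        # No conflict: apply the tentative changes.
        for key, value in self.writes.items():
            self.store.data[key] = value
            self.store.version[key] = self.store.version.get(key, 0) + 1
        return True
```

Two transactions that both read key `x` and then write it illustrate the behavior: the first `commit()` succeeds and the second fails validation because the version it read is stale. A real implementation would also have to make validation and apply atomic with respect to other commits.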
Deneva proposed the MaaT protocol, which uses a transaction-private storage space to record the data updates of write transactions, a time table to record the numbers of active transactions, and, for each record, a metadata table that stores the numbers of the transactions expecting to read or write the record and the number of the transaction that last accessed it. MaaT assigns each transaction a unique number that serves as a logical timestamp. After a transaction's execution phase ends, it enters the validation phase of the MaaT protocol. MaaT detects conflicts by comparing the time information of the records accessed by the current transaction with the execution time information of transactions in the time table. When a conflict occurs, a conflict resolution mechanism guarantees data consistency.
Sovran argues that, for a storage system distributed over a wide area, the overhead of guaranteeing strongly consistent data access is too large, and that read-write conflicts occur frequently under cross-center application loads. Enforcing read-write mutual exclusion across wide-area centers greatly reduces access performance. Sovran therefore proposes parallel snapshot isolation (PSI) for wide-area storage systems and uses it to achieve causal consistency in a wide-area environment. The core idea of parallel snapshot isolation is to relax the commit time of the latest snapshot update, so that a newly generated snapshot can be committed at different points in time at different centers. Compared with mainstream snapshot isolation algorithms, parallel snapshot isolation allows updates from different centers to be committed out of order and only requires that updates from the same center be committed in order. This relaxation lets the centers achieve higher throughput over a wide area network, because they need not coordinate with each other or wait for each other's updates. The drawback of PSI is that, for long periods, data replicas maintain only a causal level of consistency, so applications must tolerate data that is not strongly consistent for some time.
Disclosure of Invention
The disclosure provides a wide-area-network-oriented method for causally consistent access to replicas in a distributed file system.
In embodiments of the present disclosure, clients are provided with a causally consistent access service for replica files in a wide-area distributed replica space. The method constructs a partial order by mining the dependencies among write operations on replica files; constructs a total order (full order) from the maximum time deviation, the timestamp, and a user-defined priority; stores the data index in an interval tree that supports multi-version control; resolves concurrency conflicts by rolling back versions of the multi-version index interval tree; and ultimately provides causally consistent access to replica files.
The technical solution of the invention is as follows:
A wide-area-network-oriented method for causally consistent access to distributed file system replicas, characterized by comprising:
synchronizing only the index structure of a replica file, and synchronizing data on demand;
providing a causally consistent partial order by tracing the dependency order of data operations on replica files in a wide-area distributed environment;
providing a stable total order for data operations through analysis of the maximum timestamp deviation and user-definable priorities;
storing the index structure of a replica file in a multi-version index interval tree that supports rollback, and performing multi-version concurrency control on it;
handling concurrency conflicts in logical time through multi-version concurrency control of the replica file index structure.
The method comprises the following steps:
Step 1: when a client submits a data write request to a replica at a storage center on the wide area network, the storage gateway node assigns a dependency to the request based on all write requests to the target replica file that are visible to the current storage center;
Step 2: when a storage gateway node of a storage center on the wide area network receives data segment update operations broadcast by other storage centers, it constructs a total order (full order) over all received data write operations from the dependency of the current request, the timestamp deviation, and the user-defined nice value of the current node in the replica space;
Step 3: the storage gateway node builds a multi-version index interval tree for the target replica file according to the previously constructed total order of data write operations, and resolves concurrency conflicts caused by out-of-order arrival of network packets by rolling back versions of the tree;
Step 4: when a client submits a data read request to a replica at a storage center on the wide area network, the storage gateway node synchronizes data according to the maximum version satisfying causal consistency that is provided by the multi-version index interval tree maintained at the current storage center, and finally returns replica file data that satisfies causal consistency.
In step 1, when the client submits a data write request to a replica at a storage center on the wide area network, the method further includes:
A1) management nodes organize and distribute the cluster node state diagram, so that the storage gateway nodes of the storage centers can discover one another;
A2) a storage gateway node of a storage center maintains a version vector for each replica file; the vector consists of the version of the last client-submitted write request for that replica file at each center, as maintained by all storage centers that have joined the current replica space, and its slots are ordered by the time at which each storage center joined the replica space;
A3) when a write request broadcast by another storage center is received, the version slot of the originating storage center in the current storage center's version vector is advanced;
A4) when a data write request submitted by a client is received, the replica file version vector maintained by the storage gateway node at the moment of submission is taken as the dependent version of the request, and the current storage center's own slot in the version vector is advanced;
A5) the storage gateway node pushes every data segment update request, together with its timestamp and dependent version vector, to the other centers.
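Steps A2 to A5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `Gateway` and `WriteRequest`, the dictionary representation of the version vector, and the rule that a received request advances the origin slot to one past the request's own dependency are all choices made for the example.

```python
import time
from dataclasses import dataclass, field


@dataclass
class WriteRequest:
    origin: str        # ID of the storage center that accepted the client write
    deps: dict         # dependent version vector captured at submission (A4)
    timestamp: float   # time at which the origin center received the request
    segment: tuple     # hypothetical (offset, length) of the updated data segment


@dataclass
class Gateway:
    center_id: str
    centers: list      # center IDs in the order they joined the replica space (A2)
    vector: dict = field(default_factory=dict)

    def __post_init__(self):
        self.vector = {c: 0 for c in self.centers}

    def submit_local_write(self, segment, now=None):
        """A4: snapshot the vector as the dependency, then advance the local slot."""
        deps = dict(self.vector)
        self.vector[self.center_id] += 1
        ts = time.time() if now is None else now
        return WriteRequest(self.center_id, deps, ts, segment)  # A5: broadcast this

    def receive_remote_write(self, req):
        """A3: advance the slot of the center that the broadcast write came from."""
        seen = req.deps[req.origin] + 1
        self.vector[req.origin] = max(self.vector[req.origin], seen)


a = Gateway("A", ["A", "B"])
b = Gateway("B", ["A", "B"])
r1 = a.submit_local_write((0, 10), now=1.0)   # depends on A:0, B:0
b.receive_remote_write(r1)
r2 = b.submit_local_write((5, 10), now=2.0)   # depends on A:1, B:0
```

Here `r2` carries a dependency on the first write from center A, which is what later allows a receiver to order `r1` before `r2` regardless of arrival order.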
In step 2, when a storage gateway node of a storage center on the wide area network receives data segment update operations broadcast by other storage centers, the method further includes:
B1) when a storage gateway node of a storage center on the wide area network receives data segment update operations broadcast by other storage centers, it constructs a total order (full order) over all received data write operations from the dependency of the current request, the timestamp deviation, and the user-defined nice value of the current node in the replica space;
B2) the storage gateway node constructs a causally consistent partial order from the dependencies carried in the received data segment update operations. The partial order is produced by a function that compares version vectors: for two data segment update requests from different storage centers, the function compares the version slots corresponding to the two originating storage centers in the version vectors of both requests. If the comparisons of the two slots agree in direction, the two requests are causally ordered; if they disagree, the two requests are parallel (concurrent);
B3) when the two requests are parallel, the storage gateway node considers the request timestamps together with the maximum time deviation of each request's source gateway node, as recorded in the cluster node state diagram. The maximum time deviation is half of the maximum delay from the storage gateway node to the wide area network NTP server, and the request timestamp is the time at which the source storage center received the client request. If the absolute value of the difference between the two request timestamps is greater than the sum of the maximum time deviations of the two source centers, the requests are ordered by their timestamps;
B4) when the two requests are parallel and the absolute value of the difference between their timestamps is less than or equal to the sum of the maximum time deviations of the two source centers, the user-defined nice values of the replica space at the two storage centers are compared and the requests are ordered by nice value; the nice value of each storage center is unique;
B5) when two requests are parallel and produce a parallel conflict, that is, when network delay or a similar cause makes them reach the target storage gateway node in an order that differs from the total order, the index interval tree is rolled back to an earlier version and the data segment update requests are reapplied in total order.
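The comparison rules B2 to B4 can be sketched as a single comparator. The sketch rests on several assumptions that the text does not fix: each operation carries a `version` vector equal to its dependency vector with the origin slot already advanced, a smaller nice value sorts first, and `max_dev` and `nice` are plain dictionaries keyed by center ID. The names `Op`, `causal_cmp` and `total_cmp` are hypothetical.

```python
from dataclasses import dataclass
from functools import cmp_to_key


@dataclass(frozen=True)
class Op:
    origin: str        # storage center that accepted the write
    version: dict      # assumed: dependency vector with the origin slot advanced
    timestamp: float   # seconds, taken at the origin center


def causal_cmp(x, y):
    """B2: -1 or 1 if causally ordered, 0 if parallel (concurrent)."""
    slots = (x.origin, y.origin)
    le = all(x.version[s] <= y.version[s] for s in slots)
    ge = all(x.version[s] >= y.version[s] for s in slots)
    if le and not ge:
        return -1
    if ge and not le:
        return 1
    return 0


def total_cmp(x, y, max_dev, nice):
    c = causal_cmp(x, y)
    if c != 0:
        return c
    # B3: timestamps decide only when they differ by more than the clock error bound.
    gap = x.timestamp - y.timestamp
    if abs(gap) > max_dev[x.origin] + max_dev[y.origin]:
        return -1 if gap < 0 else 1
    # B4: otherwise fall back to the per-center nice value (assumed: lower sorts first).
    return -1 if nice[x.origin] < nice[y.origin] else 1


max_dev = {"A": 0.05, "B": 0.02}   # half of the maximum delay to the NTP server
nice = {"A": 0, "B": 1}
ops = [
    Op("B", {"A": 0, "B": 1}, 10.03),
    Op("A", {"A": 1, "B": 0}, 10.00),
]
ordered = sorted(ops, key=cmp_to_key(lambda x, y: total_cmp(x, y, max_dev, nice)))
```

In this example the two operations are parallel and their timestamps differ by less than 0.07 s, so the nice values decide and the operation from center A sorts first. The sketch does not attempt to show that the combined rule is transitive for every set of operations.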
In step 3, when the storage gateway node of a storage center on the wide area network updates the multi-version index interval tree according to the total order of data segment update requests, the method further includes:
C1) the index interval tree is a variant of a B-x tree. The key of a leaf node is the start of a data segment's interval, and its value is the end of the interval together with the ID of the center where the segment's source data resides. The key of a non-leaf node is the interval covered by the data segments of its child nodes, and its value is a vector of version subtree pointers;
C2) the insertion operation of the index interval tree is based on that of a B-tree. When a data segment is inserted or updated, if the target segment's interval is completely contained in a subtree, the target segment is inserted into that subtree;
C3) if the interval of the target data segment spans the intervals of several subtrees of the current node, a copy of the current subtree's root node is appended to the tail of that root node's version pointer vector, the version of the current subtree's root node is advanced, the maximum version number is updated toward the root, references to subtrees whose intervals are completely contained in the target interval are removed, the insertion of the target data segment is performed in the subtree with the minimum value, and the deletion for the target data segment is performed in the subtree with the maximum value;
C4) if the target data segment overlaps or is contiguous with the interval of the current leaf node, a copy of the current leaf node is appended to the tail of that leaf node's version pointer vector. If the values of the target data segment and the leaf node are equal, their intervals are merged; if the values are not equal, the current leaf node is split into an overlapping part and a non-overlapping part, and the value of the overlapping part is overwritten;
C5) the multi-version index interval tree does not support deletion of data segment intervals;
C6) the interval query operation of the multi-version index interval tree is based on that of a B-tree. When the target interval lies in several subtrees of the current node, the query is split into several sub-interval queries, and the result is returned as a vector of interval indexes;
C7) when the multi-version index interval tree is rolled back, every subtree of the current node whose maximum version number is greater than the target version is rolled back; if the version pointer vector of the subtree's root node has a slot for the target version, the subtree's root node is replaced by the copy in that slot, and the operation proceeds recursively.
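The observable behavior of insertion (C4), query (C6) and rollback (C7) can be illustrated with a deliberately simplified stand-in. The sketch below replaces the tree with a flat sorted list and keeps a full snapshot per version, so it does not reproduce the per-node version pointer vectors of the structure described above. The class name, the half-open `[start, end)` intervals and the snapshot strategy are all assumptions.

```python
class VersionedIntervalIndex:
    """Flat, snapshot-per-version stand-in for the multi-version index interval tree."""

    def __init__(self):
        # versions[v] is a sorted list of (start, end, center) with no overlaps.
        self.versions = [[]]

    @property
    def latest(self):
        return len(self.versions) - 1

    def insert(self, start, end, center):
        out = []
        for s, e, c in self.versions[-1]:
            if e <= start or s >= end:
                out.append((s, e, c))           # untouched segment
                continue
            if s < start:
                out.append((s, start, c))       # non-overlapping left remainder
            if e > end:
                out.append((end, e, c))         # non-overlapping right remainder
        out.append((start, end, center))        # overlapping part is overwritten
        out.sort()
        merged = []
        for seg in out:
            # Merge contiguous segments that point at the same center.
            if merged and merged[-1][1] == seg[0] and merged[-1][2] == seg[2]:
                merged[-1] = (merged[-1][0], seg[1], seg[2])
            else:
                merged.append(seg)
        self.versions.append(merged)
        return self.latest

    def query(self, start, end, version=None):
        segs = self.versions[self.latest if version is None else version]
        return [(max(s, start), min(e, end), c)
                for s, e, c in segs if s < end and e > start]

    def rollback(self, version):
        del self.versions[version + 1:]


idx = VersionedIntervalIndex()
v1 = idx.insert(0, 20, "A")
v2 = idx.insert(10, 30, "B")
hits = idx.query(5, 15)      # one piece served by center A, one by center B
idx.rollback(v1)             # undo the second insert
```

After a rollback, the receiver would reapply the pending update requests in total order, which matches the conflict handling described for out-of-order arrival. A full snapshot per version costs far more memory than versioned subtree pointers and is used here only to keep the example short.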
In step 4, when the client submits a data read request to a replica at a storage center on the wide area network, the method further includes:
D1) the storage gateway node synchronizes data according to the maximum version satisfying causal consistency that is provided by the multi-version index interval tree maintained at the current storage center;
D2) to ensure that the multi-version index interval tree provides an index satisfying causal consistency, the storage gateway node takes the maximum version of the tree's root node at the moment the read request is submitted as the read target version, and blocks until all requests whose version vectors precede that version in the partial order have arrived and been applied;
D3) after obtaining the index vector of the target data segment from the multi-version interval tree, the storage gateway node sends a synchronization request for the target data interval and version to the center where the target data segment resides, and finally returns causally consistent data.
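Step D3 can be sketched as a small assembly routine. It assumes the index lookup has already produced a list of `(start, end, center)` entries with half-open intervals, that `fetch` is a caller-supplied function standing in for the cross-center synchronization request, and that ranges with no index entry read as zero bytes. None of these details come from the text above, and the blocking behavior of D2 is not modeled.

```python
def read_range(index_segments, fetch, start, end, version):
    """Assemble bytes for [start, end) from the centers named in the index."""
    buf = bytearray(end - start)               # unindexed holes stay zero-filled
    for s, e, center in index_segments:
        s, e = max(s, start), min(e, end)      # clip the entry to the request
        if s < e:
            # Hypothetical remote call: ask the owning center for this
            # interval at the read target version.
            data = fetch(center, s, e, version)
            buf[s - start:e - start] = data
    return bytes(buf)


def fake_fetch(center, s, e, version):
    # Test double: every byte is the first letter of the center ID.
    return center.encode()[:1] * (e - s)


result = read_range([(0, 3, "A"), (3, 5, "B")], fake_fetch, 1, 5, version=2)
```

Fetching per segment keeps wide-area traffic proportional to the bytes actually read, which is in line with synchronizing only the index eagerly and the data on demand.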
Drawings
FIG. 1 is a flow chart of the wide-area-network-oriented method for causally consistent access to distributed file system replicas.
FIG. 2 is an architecture diagram of the wide-area-network-oriented method for causally consistent access to distributed file system replicas.
FIG. 3 is a diagram of the interval tree that stores data indexes and supports multiple versions.
The concepts in the figures are explained as follows:
Supercomputing center: an organization that provides services using supercomputers and the associated network and storage facilities.
GVDS: the underlying software system supported by the technology of this patent; it consists of a plurality of service nodes in the supercomputing centers.
File data index: in this patent, a set composed of a data segment number (formed from a binary offset and a data length), the ID of the supercomputing center where the data resides, and the ID of the storage cluster where the data resides. It is used to locate a data segment in the GVDS global context.
Data index update operation: the context of a request, that is, the parameters describing how the file data index should be updated after a client file write or a cross-center synchronization between GVDS instances.
OP: in this patent, the abbreviation for a data index update operation. OP a.1 refers to the request with logical sequence number 1 issued from supercomputing center A.
Dependency: in this patent, a newly submitted request within a supercomputing center depends on all requests that have already been applied at that center. "Depends on a.0, b.0" denotes the operations that had been applied at the local center when the current request was submitted there.
Total order (full order): across supercomputing centers, any two operation requests can be compared through a set of comparison rules to determine their execution order.
Multi-version interval tree: the form of the data index tree; it provides the storage structure for data indexes and supports rolling the version of nodes in the tree backward and forward.
[0,20]: denotes a node or subtree of the tree that contains the index data of the file data segment [0,20].
T1/T2: refer to logical times in the figures of this patent; the logical time in a tree node identifies when the node was inserted.
Center A/B: generally denotes the ID of the supercomputing center where the data resides, the ID of the storage cluster where the data resides, and similar items, that is, the positioning information needed to locate a data block in the GVDS global environment.
Detailed Description
The embodiment of the disclosure provides a wide-area-network-oriented method for causally consistent access to distributed file system replicas. It reduces wide area network traffic by synchronizing only the index structure of a replica file and synchronizing data on demand; provides a causally consistent partial order by tracing the dependency order of data operations on replica files in a wide-area distributed environment; provides a stable total order (full order) for data operations through analysis of the maximum timestamp deviation and user-definable priorities; and stores the index structure of replica files in a multi-version index interval tree that supports rollback, on which multi-version concurrency control is performed. The present invention is described in further detail below.
Fig. 1 shows a flowchart of the distributed file system replica access method according to an embodiment of the present disclosure, which mainly includes the following four steps.
S1) Constructing the dependency: when a client submits a data write request to a replica at a storage center on the wide area network, the storage gateway node assigns a dependency to the request based on all write requests to the target replica file at the current storage center.
S2) Constructing the total order: when a storage gateway node of a storage center on the wide area network receives data segment update operations broadcast by other storage centers, it constructs a total order over all received data write operations from the dependency of the current request, the timestamp deviation, and the user-defined nice value of the current node in the replica space.
S3) Updating the multi-version interval tree that stores the data index: the storage gateway node builds a multi-version index interval tree for the target replica file according to the previously constructed total order of data write operations, and resolves concurrency conflicts caused by out-of-order arrival of network packets by rolling back versions of the tree.
S4) Processing a data read request submitted by a client: when a client submits a data read request to a replica at a storage center on the wide area network, the storage gateway node synchronizes data according to the maximum version satisfying causal consistency that is provided by the multi-version index interval tree maintained at the current storage center, and finally returns replica file data that satisfies causal consistency.
S1) In the embodiment of the present disclosure, the dependency is constructed as follows:
Management nodes organize and distribute the cluster node state diagram, so that the storage gateway nodes of the storage centers can discover one another. A storage gateway node of a storage center maintains a version vector for each replica file; the vector consists of the version of the last client-submitted write request for that replica file at each center, as maintained by all storage centers that have joined the current replica space, and its slots are ordered by the time at which each storage center joined the replica space. When a write request broadcast by another storage center is received, the version slot of the originating storage center in the current storage center's version vector is advanced. When a data write request submitted by a client is received, the replica file version vector maintained by the storage gateway node at the moment of submission is taken as the dependent version of the request, and the current storage center's own slot in the version vector is advanced. The storage gateway node pushes every data segment update request, together with its timestamp and dependent version vector, to the other centers.
S2), in the embodiment of the present disclosure, the steps of constructing the full-order relationship are as follows:
When a storage gateway node of a storage center on the wide area network receives a data segment update operation broadcast by another storage center, it constructs a total order over all received data write operations according to the dependency relationship and timestamp deviation of each request and the user-defined nice value of the copy space on the current node.
The storage gateway node first constructs a causally consistent partial order from the dependency relationship carried in each received data segment update operation. The partial order is produced by a function that compares version vectors: for two data segment update requests from different storage centers, the function compares, in the version vectors of the two requests, the version slots corresponding to each other's storage centers. If the two slot comparisons agree in direction, the two requests are causally ordered; if they disagree, the two requests are parallel (concurrent).
When two requests are parallel, the storage gateway node looks up, in the cluster node state diagram, the maximum time deviation of the storage gateway node from which each request originated and compares it with the request timestamps. The maximum time deviation is half of the maximum delay from the storage gateway node to the wide area network NTP server, and the request timestamp is the time at which the source storage center received the client request. If the absolute value of the difference between the two request timestamps is greater than the sum of the maximum time deviations of the two source centers, the requests are ordered by their timestamps.
When two requests are parallel and the absolute value of the difference between their timestamps is less than or equal to the sum of the maximum time deviations of the two source centers, the user-defined nice values of the copy space in the respective storage centers are compared and the requests are ordered by nice value; the nice value of each storage center is unique.
When two requests are parallel and a parallel conflict occurs, that is, the requests do not reach the target storage gateway node in total order because of network delay or similar causes, the index interval tree is rolled back to an earlier version and the data segment update requests are reapplied in total order.
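The three-stage comparison described above (version-vector slots, then timestamps outside the clock-deviation window, then nice values) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `UpdateRequest`, `causal_cmp`, and `total_cmp` are hypothetical, each request is assumed to carry a version vector in which the slot of its own storage center has already been advanced, and a lower nice value is assumed to sort first.

```python
from dataclasses import dataclass


@dataclass
class UpdateRequest:
    # All field names are illustrative assumptions, not taken from the patent.
    center: str            # ID of the storage center that issued the request
    version_vector: dict   # center ID -> version; own slot assumed already advanced
    timestamp: float       # time the source center received the client request
    max_deviation: float   # half the maximum delay from the source gateway to the NTP server
    nice: int              # user-defined nice value of the copy space at the source center


def causal_cmp(a, b):
    """Return -1 if a causally precedes b, 1 if b precedes a, 0 if they are parallel."""
    # Compare only the two slots that belong to the two source centers.
    diffs = [a.version_vector.get(c, 0) - b.version_vector.get(c, 0)
             for c in (a.center, b.center)]
    if all(d <= 0 for d in diffs) and any(d < 0 for d in diffs):
        return -1
    if all(d >= 0 for d in diffs) and any(d > 0 for d in diffs):
        return 1
    return 0  # the slot comparisons disagree (or are equal): parallel requests


def total_cmp(a, b):
    """Extend the causal partial order to a total order for a pair of requests."""
    order = causal_cmp(a, b)
    if order != 0:
        return order
    # Timestamps are trusted only when they differ by more than the combined deviation.
    if abs(a.timestamp - b.timestamp) > a.max_deviation + b.max_deviation:
        return -1 if a.timestamp < b.timestamp else 1
    # Otherwise fall back to the per-center nice value (assumed: lower sorts first).
    return (a.nice > b.nice) - (a.nice < b.nice)
```

Because the nice value is unique per storage center, the final tie-break always yields a nonzero result for two requests from different centers, which is what makes the resulting order total.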
S3) In the embodiment of the present disclosure, the steps of updating the multi-version interval tree that indexes the stored data are as follows:
The index interval tree is a variant of the B-x tree. The key of a leaf node is the interval head of a data segment, and its value is the interval tail of the data segment together with the ID of the center where the source data of the segment resides. The key of a non-leaf node is the interval covered by the data segments of its child nodes, and its value is a vector of subtree pointers, one per version.
Insertion into the index interval tree is based on B-tree insertion. When a data segment is inserted or updated, if the target data segment interval is completely contained in a subtree, the target data segment is inserted into that subtree. If the target data segment interval contains the intervals of several subtrees of the current node, a copy of the root node of the current subtree is created at the tail of that root node's version pointer vector, the version of the root node of the current subtree is advanced, the maximum version number is updated along the path toward the root, the references to the subtrees whose intervals are completely contained are deleted, the insertion operation of the target data segment is executed in the subtree with the minimum value, and the deletion operation of the target data segment is executed in the subtree with the maximum value. If the target data segment overlaps or is contiguous with the interval of the current leaf node, a copy of the current leaf node is created at the tail of that leaf node's version pointer vector; if the values of the target data segment and the leaf node are equal, their intervals are merged, and if the values are not equal, the current leaf node is split into an overlapping part and a non-overlapping part and the value of the overlapping part is overwritten.
The multi-version index interval tree does not support deletion of data segment intervals.
Interval queries on the multi-version index interval tree are based on B-tree search. When the target interval spans several subtrees of the current node, the query is split into several sub-interval queries, and a group of interval index vectors is returned as the result.
When the multi-version index interval tree is rolled back to a target version, every subtree of the current node whose maximum version number is greater than the target version is rolled back: if the version pointer vector of the subtree's root node has a version slot for the target version, the root node of the subtree is replaced by the copy in that slot, and the operation is applied recursively.
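The leaf-level rules above (merge when values are equal, split and overwrite when they differ, split a query across segments, roll back to an earlier version) can be illustrated with a deliberately simplified model. The sketch below is an assumption-laden stand-in, not the tree itself: the hypothetical class `VersionedSegmentIndex` keeps a flat sorted list of segments instead of a B-tree, stores a whole-list snapshot per version instead of per-node version pointer vectors, and treats intervals as half-open ranges `[start, end)`.

```python
class VersionedSegmentIndex:
    """Flat, copy-on-write stand-in for the multi-version index interval tree."""

    def __init__(self):
        # versions[v] is the sorted segment list after the v-th insert;
        # each segment is (start, end, center) over the half-open range [start, end).
        self.versions = [[]]

    def insert(self, start, end, center):
        """Apply one data segment update and return the new version number."""
        kept = []
        for s, e, c in self.versions[-1]:
            # Equal value and overlapping or contiguous: merge the intervals.
            if c == center and s <= end and start <= e:
                start, end = min(start, s), max(end, e)
            else:
                kept.append((s, e, c))
        result = [(start, end, center)]
        for s, e, c in kept:
            if s < end and start < e:
                # Different value and overlapping: keep only the non-overlapping parts.
                if s < start:
                    result.append((s, start, c))
                if e > end:
                    result.append((end, e, c))
            else:
                result.append((s, e, c))
        self.versions.append(sorted(result))
        return len(self.versions) - 1

    def query(self, start, end, version=-1):
        """Return the index entries covering [start, end), clipped to that range."""
        return [(max(s, start), min(e, end), c)
                for s, e, c in self.versions[version]
                if s < end and start < e]

    def rollback(self, version):
        """Discard every version newer than the target version."""
        del self.versions[version + 1:]
```

In this model, rolling back and then re-inserting updates in the corrected order plays the role of the rollback-and-reapply step used to resolve parallel conflicts; a real tree would roll back only the affected subtrees rather than the whole index.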
S4) In the embodiment of the present disclosure, the steps of processing a data read request submitted by a client are as follows:
The storage gateway node performs data synchronization according to the maximum version satisfying causal consistency provided by the multi-version index interval tree maintained in the current storage center. To ensure that the multi-version index interval tree can provide an index satisfying causal consistency, the storage gateway node takes the maximum version of the root node of the multi-version interval tree at the time the read request is submitted as the read target version, and blocks the read until all requests whose version vectors precede that version in the partial order have arrived and been applied. After obtaining the index vector of the target data segment from the multi-version interval tree, the storage gateway node sends a synchronization request for the target data interval and version to the center where the target data segment resides, and finally returns causally consistent data.
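The read path above has two parts that lend themselves to a small sketch: deciding whether the read must still wait, and grouping the index entries by owning center for the synchronization requests. The helpers below are hypothetical (`missing_updates` and `plan_sync` do not appear in the source) and assume that both the read target and the locally applied state can be expressed as version vectors mapping a center ID to a version number.

```python
def missing_updates(target, applied):
    """Return {center: (first_needed, last_needed)} for every slot still behind the target.

    An empty result means all causally preceding updates have been applied,
    so the read no longer needs to block.
    """
    return {center: (applied.get(center, 0) + 1, version)
            for center, version in target.items()
            if applied.get(center, 0) < version}


def plan_sync(index_entries, target_version):
    """Group (start, end, center) index entries into per-center synchronization requests."""
    plan = {}
    for start, end, center in index_entries:
        plan.setdefault(center, []).append((start, end, target_version))
    return plan
```

A gateway built along these lines would re-evaluate `missing_updates` each time a broadcast update is applied and release the read once the result is empty, then issue one request per key of the `plan_sync` result.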
Those skilled in the art will appreciate that the invention may be practiced without these specific details. It is pointed out here that the above description is helpful for the person skilled in the art to understand the invention, but does not limit the scope of the invention. Any such equivalents, modifications and/or omissions as may be made without departing from the spirit and scope of the invention may be resorted to.

Claims (4)

1. A wide-area-network-oriented method for causally consistent access to copies in a distributed file system, characterized by comprising:
synchronizing only the index structure of a copy file, and synchronizing data on demand;
providing a causally consistent partial order relationship by tracking the dependency order of copy file data operations in a wide-area distributed environment;
providing a stable total order relationship for data operations through maximum timestamp deviation analysis and a user-definable priority;
storing the index structure of the copy file in a multi-version index interval tree that supports rollback, and performing multi-version concurrency control;
resolving concurrency conflicts in logical time through multi-version concurrency control of the copy file index structure;
the method comprises the following steps:
step 1, when a client submits a data write request to a copy in a storage center on the wide area network, the storage gateway node assigns a dependency relationship to the request according to all write requests to the target copy file that are visible to the current storage center;
step 2, when a storage gateway node of a storage center on the wide area network receives a data segment update operation broadcast by another storage center, the storage gateway node constructs a total order relationship over all received data write operations according to the dependency relationship and timestamp deviation of the current request and the user-defined nice value of the copy space on the current node;
step 3, the storage gateway node constructs a multi-version index interval tree for the target copy file according to the previously constructed total order relationship of the data write operations, and resolves concurrency conflicts caused by out-of-order arrival of network packets through version rollback of the tree;
step 4, when a client submits a data read request to a copy in a storage center on the wide area network, the storage gateway node performs data synchronization according to the maximum version satisfying causal consistency provided by the multi-version index interval tree maintained in the current storage center, and finally returns copy file data satisfying causal consistency;
wherein in step 2, when a storage gateway node of a storage center on the wide area network receives a data segment update operation broadcast by another storage center, the method further comprises:
B1) when a storage gateway node of a storage center on the wide area network receives a data segment update operation broadcast by another storage center, constructing a total order relationship over all received data write operations according to the dependency relationship and timestamp deviation of the current request and the user-defined nice value of the copy space on the current node;
B2) the storage gateway node constructs a causally consistent partial order relationship from the dependency relationship carried in the received data segment update operation, wherein the partial order relationship is produced by a function that compares version vectors; for two data segment update requests from different storage centers, the comparison function compares, in the version vectors of the two update requests, the version slots corresponding to each other's storage centers; if the two slot comparisons agree in direction, the two requests are causally ordered, and if the two slot comparisons disagree, the two requests are parallel;
B3) when the two requests are parallel, the storage gateway node compares the maximum time deviation, recorded in the cluster node state diagram, of the storage gateway node from which each request originated with the request timestamps, wherein the maximum time deviation is half of the maximum delay from the storage gateway node to the wide area network NTP server and the request timestamp is the time at which the source storage center received the client request; if the absolute value of the difference between the two request timestamps is greater than the sum of the maximum time deviations of the two source centers, the two requests are ordered by their timestamps;
B4) when the two requests are parallel and the absolute value of the difference between their timestamps is less than or equal to the sum of the maximum time deviations of the two source centers, the user-defined nice values of the copy space in the respective storage centers are compared and the two requests are ordered by nice value, wherein the nice value of each storage center is unique;
B5) when the two requests are parallel and a parallel conflict occurs, that is, the two requests do not reach the target storage gateway node in total order because of network delay, the index interval tree is rolled back to an earlier version and the data segment update requests are reapplied in total order.
2. The method according to claim 1, wherein when the client submits a data write request to a copy in a storage center on the wide area network in step 1, the method further comprises:
A1) management nodes organize and distribute a cluster node state diagram so that the storage gateway nodes of the storage centers can discover one another;
A2) the storage gateway node of a storage center maintains a version vector for each of a group of copy files, wherein the version vector consists of the client-submitted versions of the last write request to the copy file in each center, as maintained by all storage centers that have joined the current copy space, arranged in the chronological order in which the copy file was added to the copy space of each storage center;
A3) when a write request broadcast by another storage center is received, the version slot corresponding to the originating storage center in the version vector of the current storage center is advanced;
A4) when a data write request submitted by a client is received, the copy file version vector maintained by the storage gateway node at the time the request is submitted is taken as the dependency version of the data write request, and the version slot of the current storage center in the version vector is advanced;
A5) the storage gateway node pushes every data segment update request, together with its timestamp and dependency version vector, to the other storage centers.
3. The method according to claim 1, wherein when the storage gateway node of a storage center on the wide area network updates the multi-version index interval tree according to the total order relationship of the data segment update requests in step 3, the method further comprises:
C1) the index interval tree is a variant of the B-x tree, wherein the key of a leaf node is the interval head of a data segment and its value is the interval tail of the data segment together with the ID of the center where the source data of the segment resides, and the key of a non-leaf node is the interval covered by the data segments of its child nodes and its value is a vector of subtree pointers, one per version;
C2) insertion into the index interval tree is based on B-tree insertion; when a data segment is inserted or updated, if the target data segment interval is completely contained in a subtree, the target data segment is inserted into that subtree;
C3) if the target data segment interval to be inserted contains the intervals of several subtrees of the current node, a copy of the root node of the current subtree is created at the tail of that root node's version pointer vector, the version of the root node of the current subtree is advanced, the maximum version number is updated along the path toward the root, the references to the subtrees whose intervals are completely contained are deleted, the insertion operation of the target data segment is executed in the subtree with the minimum value, and the deletion operation of the target data segment is executed in the subtree with the maximum value;
C4) if the target data segment to be inserted overlaps or is contiguous with the interval of the current leaf node, a copy of the current leaf node is created at the tail of that leaf node's version pointer vector; if the values of the target data segment and the leaf node are equal, their intervals are merged, and if the values are not equal, the current leaf node is split into an overlapping part and a non-overlapping part and the value of the overlapping part is overwritten;
C5) the multi-version index interval tree does not support deletion of data segments;
C6) interval queries on the multi-version index interval tree are based on B-tree search; when the target interval spans several subtrees of the current node, the query request is split into several sub-interval queries, and a group of interval index vectors is returned as the result;
C7) when the multi-version index interval tree is rolled back to a target version, every subtree of the current node whose maximum version number is greater than the target version is rolled back, and if the version pointer vector of the subtree's root node has a version slot for the target version, the root node of the subtree is replaced by the copy in that slot and the operation is applied recursively.
4. The method according to claim 1, wherein when the client submits a data read request to a copy in a storage center on the wide area network in step 4, the method further comprises:
D1) the storage gateway node performs data synchronization according to the maximum version satisfying causal consistency provided by the multi-version index interval tree maintained in the current storage center;
D2) to ensure that the multi-version index interval tree can provide an index satisfying causal consistency, the storage gateway node takes the maximum version of the root node of the multi-version interval tree at the time the read request is submitted as the read target version, and blocks the read until all requests whose version vectors precede that version in the partial order have arrived and been applied;
D3) after obtaining the index vector of the target data segment from the multi-version interval tree, the storage gateway node sends a synchronization request for the target data interval and version to the center where the target data segment resides, and finally returns causally consistent data.
CN202011001103.1A 2020-09-22 2020-09-22 Distributed file system copy causality consistent access method facing wide area network Active CN112286888B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011001103.1A CN112286888B (en) 2020-09-22 2020-09-22 Distributed file system copy causality consistent access method facing wide area network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011001103.1A CN112286888B (en) 2020-09-22 2020-09-22 Distributed file system copy causality consistent access method facing wide area network

Publications (2)

Publication Number Publication Date
CN112286888A CN112286888A (en) 2021-01-29
CN112286888B true CN112286888B (en) 2022-06-14

Family

ID=74422222

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011001103.1A Active CN112286888B (en) 2020-09-22 2020-09-22 Distributed file system copy causality consistent access method facing wide area network

Country Status (1)

Country Link
CN (1) CN112286888B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188080A (en) * 2019-05-17 2019-08-30 Beihang University Remote file data access performance optimization based on efficient client-side caching
CN110213352A (en) * 2019-05-17 2019-09-06 Beihang University Decentralized autonomous storage resource aggregation with a unified namespace

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9792294B2 (en) * 2014-07-02 2017-10-17 Panzura, Inc Using byte-range locks to manage multiple concurrent accesses to a file in a distributed filesystem

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188080A (en) * 2019-05-17 2019-08-30 Beihang University Remote file data access performance optimization based on efficient client-side caching
CN110213352A (en) * 2019-05-17 2019-09-06 Beihang University Decentralized autonomous storage resource aggregation with a unified namespace

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
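A method for implementing SMP cluster virtualization (实现SMP机群虚拟化的方法); Peng Jinbing et al.; Journal of Beijing University of Aeronautics and Astronautics (《北京航空航天大学学报》); 2009-03-31; Vol. 35, No. 3; pp. 301-303 *
Research and implementation of an edge caching system in a wide-area virtual data space (广域虚拟数据空间中边缘缓存系统的研究与实现); Xu Yaowen et al.; Big Data (《大数据》); 2021-08-25; No. 5; pp. 57-80 *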

Also Published As

Publication number Publication date
CN112286888A (en) 2021-01-29

Similar Documents

Publication Publication Date Title
US20230273904A1 (en) Map-Reduce Ready Distributed File System
US10031813B2 (en) Log record management
US8768977B2 (en) Data management using writeable snapshots in multi-versioned distributed B-trees
US10078681B2 (en) Differentiated secondary index maintenance in log structured NoSQL data stores
US7966298B2 (en) Record-level locking and page-level recovery in a database management system
US7702640B1 (en) Stratified unbalanced trees for indexing of data items within a computer system
US5966706A (en) Local logging in a distributed database management computer system
US20180144015A1 (en) Redoing transaction log records in parallel
AU2017239539A1 (en) In place snapshots
JP2014532919A (en) Online transaction processing
WO2019103950A1 (en) Multi-region, multi-master replication of database tables
US20180276267A1 (en) Methods and system for efficiently performing eventual and transactional edits on distributed metadata in an object storage system
Hakimzadeh et al. Scaling hdfs with a strongly consistent relational model for metadata
Krechowicz et al. Highly scalable distributed architecture for NoSQL datastore supporting strong consistency
CN112286888B (en) Distributed file system copy causality consistent access method facing wide area network
Hiraga et al. PPMDS: A distributed metadata server based on nonblocking transactions
Plantikow et al. Transactions for distributed wikis on structured overlays
Ismail et al. ePipe: Near Real-Time Polyglot Persistence of HopsFS Metadata
Kanungo et al. Original Research Article Concurrency versus consistency in NoSQL databases
Pandey et al. Persisting the AntidoteDB Cache: Design and Implementation of a Cache for a CRDT Datastore
Helt et al. C5: Cloned Concurrency Control that Always Keeps Up
Gessert et al. Transactional Semantics for Globally Distributed Applications
Huaizhong et al. Optimistic Voting for Managing Replicated Data
Agrawal et al. Cloud Data Management: Early Trends
Ben-Chaim et al. Second order snapshot-log relations: Supporting multi-directional database replication using asynchronous snapshot replication

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant