US10713134B2 - Distributed storage and replication system and method - Google Patents
- Publication number: US10713134B2
- Application number: US15/589,856 (US201715589856A)
- Authority: US (United States)
- Prior art keywords: partition, view, OSD node, OSD, node
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G06F9/466 — Transaction processing
- G06F9/5077 — Logical partitioning of resources; Management or configuration of virtualized resources
- G06F11/0709 — Error or fault processing in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
- G06F11/0757 — Error or fault detection by exceeding a time limit, i.e. time-out, e.g. watchdogs
- G06F11/2028 — Failover techniques eliminating a faulty processor or activating a spare
- G06F11/2041 — Active fault-masking where processing functionality is redundant, with more than one idle spare processing component
- G06F11/2048 — Active fault-masking where the redundant components share neither address space nor persistent storage
- G06F11/2064 — Redundancy in hardware by mirroring while ensuring consistency
- G06F11/2094 — Redundant storage or storage space
- G06F12/02 — Addressing or allocation; Relocation
- G06F16/273 — Asynchronous replication or reconciliation
Definitions
- The present disclosure relates to the information technology (IT) field, and in particular, to a distributed storage and replication system and method.
- A large quantity of storage nodes are organized to form a distributed system, and data reliability is ensured by means of data replication and backup between different nodes, so that data has replicas on different storage nodes. How to ensure data consistency across multiple data replicas has long been a central problem for distributed storage systems; while ensuring data consistency, system performance and availability are also considerations of increasing importance.
- FIG. 1 shows the existing two-phase commit protocol (2PC), a typical centralized strong-consistency replica control protocol that is used in many distributed database systems to ensure replica consistency.
- In the two-phase commit protocol, a system generally includes two types of nodes: a coordinator and a participant.
- The coordinator is responsible for initiating voting on a data update and notifying the participants of the voting decision, and each participant votes on the data update and executes the voting decision.
- The two-phase commit protocol includes two phases: Phase 1 is a commit-request phase, in which the coordinator instructs the participants to vote on a data modification and each participant notifies the coordinator of its voting result (Yes or No); Phase 2 is a commit phase, in which the coordinator decides Commit or Abort according to the voting results from the first phase.
- Successfully executing the two-phase commit protocol once requires at least two rounds of interaction between the coordinator and each participant, with four messages, and this number of interactions degrades performance.
- Moreover, if a node becomes faulty or continuously fails to respond, other input/output (IO) requests are blocked, finally fail due to a timeout, and require a data rollback.
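- As a rough illustration of this message pattern (not code from the patent; the participant methods prepare/commit/abort are hypothetical), a minimal sketch of one 2PC coordinator round in Python:

```python
# Minimal two-phase commit coordinator sketch (participant methods are
# hypothetical; this illustrates the message pattern, not any real system).

def two_phase_commit(participants, update):
    # Phase 1 (commit-request): each participant votes Yes (True) or No (False).
    votes = [p.prepare(update) for p in participants]   # interaction round 1

    # Phase 2 (commit): the coordinator decides from the collected votes.
    if all(votes):
        for p in participants:
            p.commit(update)                            # interaction round 2
        return "Commit"
    for p in participants:
        p.abort(update)                                 # any No vote forces a rollback
    return "Abort"
```

- A single participant that never answers the prepare message stalls the whole round, which is the blocking problem described above.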
- Embodiments of the present disclosure provide a distributed storage and replication system and a method for managing data storage and replication in a distributed storage system, so as to resolve a problem of low performance and low availability of an existing consistency replication protocol.
- A distributed storage and replication system includes at least one metadata control (MDC) module, multiple input/output ("IO") routing modules, and multiple object-based storage device (OSD) nodes, where the MDC module is adapted to: configure, for each OSD node, at least one logical partition corresponding to physical storage resources managed by each OSD node, where the at least one partition is a primary partition, a secondary partition, or any combination of primary and secondary partitions, a primary partition and a secondary partition corresponding to the primary partition constitute a partition group, a primary partition and a secondary partition in a same partition group are located on different OSD nodes, an OSD node on which a primary partition is located is a primary OSD node of a partition group that includes the primary partition, and an OSD node on which a secondary partition is located is a secondary OSD node of a partition group that includes the secondary partition; and generate a partition view according to the partition, where the partition view includes information about an OSD on which a partition in a partition group is located.
- the IO routing module is adapted to route a received IO request to an OSD node; and the OSD node is adapted to execute, according to the IO request, storage of data corresponding to the IO request, where the MDC module determines that an OSD node in the system is a faulty OSD node, determines a partition on the faulty OSD node, updates a partition view of a partition group that includes the partition on the faulty OSD node, and sends an updating notification to a primary OSD node in the updated partition view.
- the primary OSD node is adapted to process, according to the updated partition view after receiving the updating notification sent by the MDC module, replication of the data corresponding to the IO request.
- the partition view includes a primary/secondary identity and a corresponding partition status that are of an OSD on which a partition in a partition group is located.
- the primary OSD node is further adapted to update, according to the updated partition view, a partition view locally stored on the primary OSD node.
- the MDC module is further adapted to: generate an IO view, where the IO view includes an identifier of a primary OSD node of a partition group; and send the IO view to the IO routing module and the OSD node on which the partition in the partition view is located.
- the primary OSD node is further configured to: update, according to the updated partition view, an IO view locally stored on the primary OSD node, and process, according to the updated locally-stored IO view, replication of the data corresponding to the IO request.
- the MDC module is further adapted to: when determining that the partition on the faulty OSD node includes a primary partition, update an IO view of a partition group that includes the primary partition, and notify a secondary OSD node in the updated partition view of the updated IO view.
- the secondary OSD node in the updated partition view is configured to: update a locally stored IO view according to the updated IO view, and process, according to the updated locally-stored IO view, replication of the data corresponding to the IO request.
- the updating a partition view of a partition group that includes the partition on the faulty OSD node includes: when the partition on the faulty OSD node includes a secondary partition, marking a partition status of the faulty OSD node in a partition view of a partition group that includes the secondary partition as inconsistent; and when the partition on the faulty OSD node includes the primary partition, setting the faulty OSD node that serves as a primary OSD node in a partition view of the partition group that includes the primary partition as a new secondary OSD node, marking a partition status corresponding to the new secondary OSD node as inconsistent, selecting, from among the original secondary OSD nodes in the partition view of the partition group that includes the primary partition, a secondary OSD node whose partition status is consistent, and setting the selected secondary OSD node as a new primary OSD node.
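- As a hedged sketch (the data structures are hypothetical; the patent defines only the rule, not code), the view update on an OSD failure can be written as:

```python
# Sketch of the MDC fault-handling rule above (data structures hypothetical;
# partition states follow the patent's "consistent"/"inconsistent").

def handle_osd_fault(partition_view, faulty_osd_id):
    for entry in partition_view.replicas:            # entries: (osd_id, identity, status)
        if entry.osd_id != faulty_osd_id:
            continue
        if entry.identity == "secondary":
            entry.status = "inconsistent"            # faulty node held a secondary partition
        else:                                        # faulty node held the primary partition
            entry.identity = "secondary"             # set it as a new secondary OSD node
            entry.status = "inconsistent"
            # select an original secondary whose partition status is consistent
            new_primary = next(e for e in partition_view.replicas
                               if e.identity == "secondary" and e.status == "consistent")
            new_primary.identity = "primary"         # set it as the new primary OSD node
    partition_view.version += 1                      # the updated view carries new version info
    return partition_view
```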
- A distributed storage and replication system includes at least one metadata control (MDC) module, multiple IO routing modules, and multiple object-based storage device (OSD) nodes, where the MDC module is adapted to: configure, for each OSD node, at least one logical partition corresponding to physical storage resources managed by each OSD node, where the at least one partition is a primary partition, a secondary partition, or any combination of primary and secondary partitions, a primary partition and a secondary partition corresponding to the primary partition constitute a partition group, a primary partition and a secondary partition in a same partition group are located on different OSD nodes, an OSD node on which a primary partition is located is a primary OSD node of a partition group that includes the primary partition, and an OSD node on which a secondary partition is located is a secondary OSD node of a partition group that includes the secondary partition; and generate a partition view and an IO view according to the partition, where the partition view includes information about an OSD on which a partition in a partition group is located, and the IO view includes an identifier of a primary OSD node of a partition group.
- the primary OSD node is further adapted to: return an error to the IO routing module after determining, according to the IO view version information, that the IO view version in the IO request is earlier than the IO view version locally stored on the primary OSD; and after determining that the IO view version in the IO request is later than the IO view version locally stored on the primary OSD, add the IO request to a cache queue, and query the MDC module for the IO view version information of the IO view of the partition group to which the data belongs, so as to execute the IO request after determining that the IO view version locally stored on the primary OSD is consistent with the IO view version in the IO request; and the IO routing module is adapted to: after receiving the error returned by the primary OSD node, query the MDC module for the IO view of the partition group to which the data belongs, and after obtaining updated IO view version information, send an IO request that carries the updated IO view version information.
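- A hedged sketch of this comparison on the primary OSD node (method and field names are hypothetical):

```python
# Sketch of the primary OSD node's IO view version check (names hypothetical).

def check_io_view_version(primary, io_request):
    local = primary.io_view.version
    if io_request.io_view_version < local:
        return "error"                             # IO routing module must refresh its IO view
    elif io_request.io_view_version > local:
        primary.cache_queue.append(io_request)     # hold the request in a cache queue
        primary.query_mdc_for_io_view()            # catch up, then execute the queued request
        return "queued"
    else:
        primary.execute(io_request)                # versions consistent: process normally
        return "executed"
```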
- the IO view version information includes an IO view version number
- the primary OSD node further generates a sequence identifier for the IO request, and adds the sequence identifier to the replication request sent to the secondary OSD node, where the sequence identifier includes the IO view version number and a sequence number, and the sequence number indicates a serial number of a modify operation on data corresponding to a partition group in the IO view within an IO view version; and the secondary OSD node is further adapted to execute the replication request according to the sequence identifier.
- the replication request further carries a sequence identifier in a previous replication request sent by the primary OSD node for the partition group; and the secondary OSD node is adapted to: after receiving the replication request, execute the replication request when the sequence identifier in the previous replication request is consistent with a largest sequence identifier locally stored on the secondary OSD node.
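- A hedged sketch of the secondary OSD node's ordering check (field names hypothetical; a sequence identifier compares as an (IO view version number, sequence number) pair):

```python
# Sketch of the secondary OSD node's continuity check (names hypothetical).
# A sequence identifier is an (io_view_version, seq_number) pair, so Python
# tuple comparison gives the ordering described above.

def on_replication_request(secondary, req):
    # req.prev_seq: sequence identifier of the previous replication request
    # the primary OSD node sent for this partition group.
    if req.prev_seq == secondary.largest_local_seq:
        secondary.apply(req)                       # no gap: execute the replication request
        secondary.largest_local_seq = req.seq
    else:
        secondary.reject(req)                      # gap or stale request: do not apply
```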
- the partition view includes a primary/secondary identity and a corresponding partition status that are of an OSD on which a partition in a partition group is located; and the MDC module is further adapted to: when detecting, in a process of processing the IO request, that the primary OSD node becomes faulty, set the primary OSD node in the partition view of the partition group to which the data belongs as a new secondary OSD node, mark a partition status of the new secondary OSD as inconsistent, set any one of the secondary OSD nodes in the partition view of the partition group to which the data belongs as a new primary OSD node, and notify the new primary OSD node of the updated partition view of the partition group to which the data belongs; and update, by using the new primary OSD node, the IO view of the partition group to which the data belongs, and notify the IO routing module of the updated IO view of the partition group to which the data belongs.
- the partition view includes a primary/secondary identity and a corresponding partition status that are of an OSD on which a partition in a partition group is located; and the MDC module is further adapted to: when detecting, in a process of processing the IO request, that any secondary OSD node becomes faulty, mark a partition status of that secondary OSD node in the partition view of the partition group to which the data belongs as inconsistent, and notify the primary OSD node of the updated partition view of the partition group to which the data belongs; and the primary OSD node is adapted to: after receiving the updated partition view of the partition group to which the data belongs, send the replication request to a secondary OSD node whose partition status is consistent in the updated partition view, and skip sending the replication request to the secondary OSD node whose partition status is inconsistent.
- A distributed storage and replication system includes at least one metadata control (MDC) module, multiple IO routing modules, and multiple object-based storage device (OSD) nodes, where the MDC module is adapted to: configure, for each OSD node, at least one logical partition corresponding to physical storage resources managed by each OSD node, where the at least one partition is a primary partition, a secondary partition, or any combination of primary and secondary partitions, a primary partition and a secondary partition corresponding to the primary partition constitute a partition group, a primary partition and a secondary partition in a same partition group are located on different OSD nodes, an OSD node on which a primary partition is located is a primary OSD node of a partition group that includes the primary partition, and an OSD node on which a secondary partition is located is a secondary OSD node of a partition group that includes the secondary partition; and generate a partition view and an IO view according to the partition, where the partition view includes information about an OSD on which a partition in a partition group is located, and the IO view includes an identifier of a primary OSD node of a partition group.
- the primary OSD node is further adapted to: after receiving the data recovery request, receive the IO request sent by the IO routing module for the partition on the failback OSD node, execute the IO request, and send, to the failback OSD node, a replication request that carries IO key information and the data corresponding to the IO request; and the failback OSD node writes, to a log, the replication request that carries the IO key information and the data corresponding to the IO request, and updates, according to a record of the log after the data recovery is completed, the data corresponding to the IO request to physical storage resources managed by the failback OSD node.
- the data recovery request carries a largest sequence identifier that is of an IO operation for the partition on the failback OSD node and that is locally recorded on the failback OSD node, where the largest sequence identifier comprises a latest IO view version number of the IO view of the partition group that includes the partition on the failback OSD node, and a largest serial number of a modify operation on data corresponding to a partition in the IO view corresponding to the latest IO view version number; and the sending the data updated during the failure to the failback OSD node includes: determining that the largest sequence identifier in the data recovery request is greater than or equal to a current smallest sequence identifier locally stored on the primary OSD node, sending an entry that the failback OSD node lacks during the failure to the failback OSD node, receiving a data recovery request initiated by the failback OSD node according to the entry, and sending data corresponding to the entry to the failback OSD node.
- the data recovery request carries a largest sequence identifier that is of an IO operation for the partition on the failback OSD node and that is locally recorded on the failback OSD node, where the largest sequence identifier includes: a latest IO view version number of the IO view of the partition group that includes the partition on the failback OSD node, and a largest serial number of a modify operation on data corresponding to a partition in the IO view within the IO view corresponding to the latest IO view version number; and the sending the data updated during the failure to the failback OSD node includes: determining that the largest sequence identifier in the data recovery request is less than a current smallest sequence identifier locally stored on the primary OSD node, sending the current smallest sequence identifier locally stored on the primary OSD to the failback OSD node, receiving a data recovery request, initiated by the failback OSD node, for synchronizing all data corresponding to a primary partition on the primary OSD node, and sending the data corresponding to the primary partition to the failback OSD node.
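- The two branches above reduce to one comparison on the primary OSD node; a hedged sketch (names hypothetical, sequence identifiers as (IO view version, serial number) pairs):

```python
# Sketch of the primary OSD node's recovery decision (names hypothetical).
# Sequence identifiers are (io_view_version, seq_number) tuples, ordered
# lexicographically by Python tuple comparison.

def on_data_recovery_request(primary, failback_largest_seq):
    if failback_largest_seq >= primary.smallest_local_seq:
        # incremental catch-up: send only the entries missed during the failure
        missing = [e for e in primary.log if e.seq > failback_largest_seq]
        return ("incremental", missing)
    else:
        # the primary's log no longer reaches back far enough: tell the
        # failback node to synchronize the whole primary partition
        return ("full_sync", primary.smallest_local_seq)
```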
- A distributed storage and replication system includes at least one metadata control (MDC) module, multiple IO routing modules, and multiple object-based storage device (OSD) nodes, where the MDC module is adapted to: configure, for each OSD node, at least one logical partition corresponding to physical storage resources managed by each OSD node, where the at least one partition is a primary partition, a secondary partition, or any combination of primary and secondary partitions, a primary partition and a secondary partition corresponding to the primary partition constitute a partition group, a primary partition and a secondary partition in a same partition group are located on different OSD nodes, an OSD node on which a primary partition is located is a primary OSD node of a partition group that includes the primary partition, and an OSD node on which a secondary partition is located is a secondary OSD node of a partition group that includes the secondary partition; and generate a partition view according to the partition, where the partition view includes information about an OSD on which a partition in a partition group is located.
- A distributed storage and replication system includes at least one metadata control (MDC) module, multiple IO routing modules, and multiple object-based storage device (OSD) nodes, where the MDC module is adapted to: configure, for each OSD node, at least one logical partition corresponding to physical storage resources managed by each OSD node, where the at least one partition is a primary partition, a secondary partition, or any combination of primary and secondary partitions, a primary partition and a secondary partition corresponding to the primary partition constitute a partition group, a primary partition and a secondary partition in a same partition group are located on different OSD nodes, an OSD node on which a primary partition is located is a primary OSD node of a partition group that includes the primary partition, and an OSD node on which a secondary partition is located is a secondary OSD node of a partition group that includes the secondary partition; and generate a partition view and an IO view according to the partition, where the partition view includes information about an OSD on which a partition in a partition group is located, and the IO view includes an identifier of a primary OSD node of a partition group.
- A distributed storage and replication system includes at least one metadata control (MDC) module, multiple IO routing modules, and multiple object-based storage device (OSD) nodes, where the MDC module is adapted to: configure, for each OSD node, at least one logical partition corresponding to physical storage resources managed by each OSD node, where the at least one partition is a primary partition, a secondary partition, or any combination of primary and secondary partitions, a primary partition and a secondary partition corresponding to the primary partition constitute a partition group, a primary partition and a secondary partition in a same partition group are located on different OSD nodes, an OSD node on which a primary partition is located is a primary OSD node of a partition group that includes the primary partition, and an OSD node on which a secondary partition is located is a secondary OSD node of a partition group that includes the secondary partition; and generate a partition view and an IO view according to the partition, where the partition view includes information about an OSD on which a partition in a partition group is located, and the IO view includes an identifier of a primary OSD node of a partition group.
- A method for managing data storage and replication in a distributed storage system, where the distributed storage system includes at least one metadata control (MDC) module, multiple IO routing modules, and multiple object-based storage device (OSD) nodes, and the MDC module is adapted to: configure, for each OSD node, at least one logical partition corresponding to physical storage resources managed by each OSD node, where the at least one partition is a primary partition, a secondary partition, or any combination of primary and secondary partitions, a primary partition and a secondary partition corresponding to the primary partition constitute a partition group, a primary partition and a secondary partition in a same partition group are located on different OSD nodes, an OSD node on which a primary partition is located is a primary OSD node of a partition group that includes the primary partition, and an OSD node on which a secondary partition is located is a secondary OSD node of a partition group that includes the secondary partition; and generate a partition view according to the partition, where the partition view includes information about an OSD on which a partition in a partition group is located.
- the partition view includes a primary/secondary identity and a corresponding partition status that are of an OSD on which a partition in a partition group is located; and the primary OSD node is further adapted to update, according to the updated partition view, a partition view locally stored on the primary OSD node; and the processing, according to the updated partition view, replication of the data corresponding to the IO request includes: replicating, according to the updated locally-stored partition view, the data corresponding to the IO request from the IO routing module onto a secondary OSD node whose partition status is consistent in the updated locally-stored partition view, or onto a secondary OSD node whose partition status is consistent in the updated locally-stored partition view and a secondary OSD node whose partition status is inconsistent in the updated locally-stored partition view but that is recovering data.
- the updating a partition view of a partition group that includes the partition on the faulty OSD node includes: when the partition on the faulty OSD node includes a secondary partition, marking a partition status of the faulty OSD node in a partition view of a partition group that includes the secondary partition as inconsistent; and when the partition on the faulty OSD node includes the primary partition, setting the faulty OSD node that serves as a primary OSD node in a partition view of the partition group that includes the primary partition as a new secondary OSD node, marking a partition status corresponding to the new secondary OSD node as inconsistent, selecting, from among the original secondary OSD nodes in the partition view of the partition group that includes the primary partition, a secondary OSD node whose partition status is consistent, and setting the selected secondary OSD node as a new primary OSD node.
- A method for managing data storage and replication in a distributed storage system, where the distributed storage system includes at least one metadata control (MDC) module, multiple IO routing modules, and multiple object-based storage device (OSD) nodes, and the MDC module is adapted to: configure, for each OSD node, at least one logical partition corresponding to physical storage resources managed by each OSD node, where the at least one partition is a primary partition, a secondary partition, or any combination of primary and secondary partitions, a primary partition and a secondary partition corresponding to the primary partition constitute a partition group, a primary partition and a secondary partition in a same partition group are located on different OSD nodes, an OSD node on which a primary partition is located is a primary OSD node of a partition group that includes the primary partition, and an OSD node on which a secondary partition is located is a secondary OSD node of a partition group that includes the secondary partition; and generate a partition view and an IO view according to the partition, where the partition view includes information about an OSD on which a partition in a partition group is located, and the IO view includes an identifier of a primary OSD node of a partition group.
- the partition view includes a primary/secondary identity and a corresponding partition status that are of an OSD on which a partition in a partition group is located; and the method further includes: when detecting, in a process of processing the IO request, that the primary OSD node becomes faulty, setting, by the MDC module, the primary OSD node in a partition view of the partition group to which the data belongs as a new secondary OSD node, marking a partition status of the new secondary OSD as inconsistent; setting any one of the secondary OSD nodes in the partition view of the partition group to which the data belongs as a new primary OSD node, notifying the new primary OSD node of the updated partition view of the partition group to which the data belongs; and updating, by using the new primary OSD node, the IO view of the partition group to which the data belongs, and notifying the IO routing module of the updated IO view of the partition group to which the data belongs; and the IO routing module is further adapted to route the IO request according to the updated IO view.
- the partition view includes a primary/secondary identity and a corresponding partition status that are of an OSD on which a partition in a partition group is located; and the method further includes: when detecting, in a process of processing the IO request, that any secondary OSD node becomes faulty, marking, by the MDC module, a partition status of that secondary OSD node in the partition view of the partition group to which the data belongs as inconsistent, and notifying the primary OSD node of the updated partition view of the partition group to which the data belongs; and the primary OSD node is adapted to: after receiving the updated partition view of the partition group to which the data belongs, send the replication request to a secondary OSD node whose partition status is consistent in the updated partition view, and skip sending the replication request to the secondary OSD node whose partition status is inconsistent.
- FIG. 1 is a flowchart of the two-phase commit protocol in the prior art.
- FIG. 2A is a schematic architecture diagram of a distributed storage and replication system according to an embodiment of the present disclosure.
- FIG. 2B is a schematic structural diagram of a distributed storage and replication system according to another embodiment of the present disclosure.
- FIG. 2C is a schematic structural diagram of a distributed storage and replication system according to another embodiment of the present disclosure.
- FIG. 4 is a schematic diagram of an OSD view status transition according to an embodiment of the present disclosure.
- FIG. 5 is a schematic structural diagram of a distributed storage and replication system according to another embodiment of the present disclosure.
- FIG. 8A and FIG. 8B are a flowchart of OSD failure processing according to an embodiment of the present disclosure.
- FIG. 10A and FIG. 10B are a flowchart of data recovery in an OSD failback processing process according to an embodiment of the present disclosure.
- FIG. 12A and FIG. 12B are a flowchart of processing performed after a new OSD joins a cluster according to an embodiment of the present disclosure.
- a specific embodiment of the present disclosure provides a distributed storage and replication control system, so as to manage and control data storage and replication mentioned in this embodiment of the present disclosure.
- the distributed storage and replication control system mainly includes three sub-layers: a status layer, an interface layer, and a data layer.
- the status layer includes a metadata control (MDC) module 202, and in an actual application, whether a secondary MDC needs to be configured and a quantity of secondary MDCs may be determined according to a requirement, where the secondary MDC is adapted to play the role of a primary MDC when the primary MDC module becomes faulty.
- the interface layer includes multiple IO (input/output) routing modules 204 (which may also be called clients, and the two concepts may be interchangeable in implementation of the present disclosure).
- the data layer includes multiple object-based storage device (OSD) nodes 206.
- the status layer communicates with the interface layer and the data layer by using a status view message.
- the MDC module 202 sends an updating notification to the IO routing module 204 and the OSD node 206 by using the status view message, so as to instruct the IO routing module 204 and the OSD node 206 to update a local cluster view (which may also be called a view, and the two concepts may be interchangeable in implementation of the present disclosure); or directly sends a cluster view generated or updated by the MDC module 202 to the IO routing module 204 and the OSD node 206 .
- the interface layer and the data layer communicate with each other by using a service message.
- the IO routing module 204 sends an IO request message to the OSD node 206 to request IO data storage and replication.
- the MDC module 202 as an entrance for delivery of cluster configuration information, is adapted to allocate a logical partition of logical storage resources in application storage space to each OSD node, generate a cluster view according to the partition, maintain and update the cluster view, and notify the corresponding IO routing module 204 and the OSD node 206 of cluster view updating.
- the IO routing module 204 is adapted to route and forward an IO request of an upper-layer application to a corresponding OSD node according to the cluster view.
- the OSD node 206 is adapted to: execute a related IO operation on the IO request according to the cluster view, where the related IO operation mainly includes storing and replicating data, to implement data backup consistency; and organize a data operation on physical storage resources (for example, local disk or external storage resources) managed by the OSD node 206.
- The MDC module, the IO routing module, and the OSD node may each be implemented by hardware, firmware, software, or a combination thereof.
- a specific implementation manner is determined in consideration of a product design requirement or manufacturing costs, and the present disclosure shall not be limited to a particular implementation manner.
- the entire distributed storage and replication system may be deployed on an independent platform or server (for example, a distributed storage and replication control platform in FIG. 2B), so as to manage data replication and storage in a distributed storage system connected to the platform or server.
- the distributed storage and replication control system may be deployed, in a distributed manner, in a distributed storage system shown in FIG. 2C.
- the distributed storage system includes multiple servers or hosts, where the host or server in this embodiment is a physical host or server, that is, includes hardware such as a processor and a memory.
- the foregoing MDC module 202 may be deployed only on one server or host (no secondary MDC), or on two servers or hosts (one primary MDC module and one secondary MDC module), or on three servers or hosts (one primary MDC module and two secondary MDC modules) in the distributed storage system; the IO routing module 204 is deployed on each server or host in the distributed storage system; and the OSD node 206 is deployed on each server or host that has storage resources in the distributed storage system, so as to manage and control local storage resources or external storage resources.
- either the IO routing module or the OSD node, or both the IO routing module and the OSD node may be deployed on one host, and a specific deployment manner may be determined according to an actual specific situation, which is not limited in the present disclosure.
- the MDC module 202, the IO routing module 204, and the OSD node 206 in FIG. 2C constitute a distributed storage control system, which is called a distributed replication protocol layer in the distributed storage system shown in FIG. 2B.
- the distributed storage system controls IO data storage and replication to storage resources at a storage layer by using the distributed replication protocol layer.
- the storage layer includes local storage resources on the multiple servers or hosts, and the modules that are at the distributed replication protocol layer and are distributed on the server or host interact with each other by using a switched data network at a network layer.
- Ethernet or InfiniBand may be used. It should be understood that the foregoing Ethernet or InfiniBand is merely an exemplary implementation manner of a high-speed switched data network used in this embodiment of the present disclosure, which is not limited in this embodiment of the present disclosure.
- the following uses specific embodiments and implementation manners to describe connection and interaction, specific functions, and the like of the MDC module 202 , the IO routing module 204 , and the OSD node 206 in the foregoing distributed storage and replication control system in detail.
- a partitioning function of the MDC module may include: the MDC module configures, for each OSD node according to a status of physical storage resources managed by each OSD node, a logical partition corresponding to the physical storage resources managed by each OSD node.
- the partition includes a particular quantity of data blocks in application storage space.
- the application storage space at an application layer is a particular amount of logical storage space allocated by the application layer to a user, and is logical mapping of the physical storage space at the storage layer. That is, a concept of the partition herein is different from a concept of a physical storage space partition.
- space of one partition in the application storage space may be mapped to one or more partitions in the physical storage space.
- a specific granularity of the partition may be acquired from cluster configuration information, or may be determined by the MDC module according to a particular rule or determined in another manner, which is not limited in the present disclosure.
- the MDC module may generate a cluster view of the partition according to information such as partition size configuration information, a local storage resource status, and an external storage resource status (for example, logical unit number (LUN) information of a normally accessed storage area network (SAN)).
- a partition has replicas stored on different OSD nodes, and a partition replica quantity may be configured by using a configuration file, or may be determined by the MDC according to a particular algorithm.
- a primary partition and a secondary partition corresponding to the primary partition constitute a partition group
- an OSD node on which a primary partition is located is called a primary OSD node of a partition group that includes the primary partition
- a primary OSD described in this embodiment refers to a primary OSD of a partition group
- an OSD node on which a secondary partition is located is called a secondary OSD node of a partition group that includes the secondary partition
- a secondary OSD described in this embodiment refers to a secondary OSD for a partition group.
- storage resources managed by an OSD on a host or server server_1 are divided into a partition 1, a partition 2, and a partition 3 (P1, P2, and P3 for short) and a partition 4′, a partition 5′, and a partition 6′ (P4′, P5′, and P6′ for short), where the P4′, the P5′, and the P6′ are replicas of a partition 4, a partition 5, and a partition 6 (P4, P5, and P6 for short) on an OSD node on a server server_2 respectively.
- the foregoing partition and the corresponding replicas may be set according to the following factors, and in a specific actual application, another factor may be taken into consideration to set and plan a disk partition.
- data security: replicas of each partition should be distributed to different hosts or servers as much as possible. A bottom line of data security is that multiple replicas of a partition are not allowed to be placed on a same host or server.
- data balance: a quantity of partitions on each OSD keeps the same as much as possible. A quantity of primary partitions, a quantity of secondary partitions 1, and a quantity of secondary partitions 2 on each OSD keep the same as much as possible, so that services processed on all the OSDs are balanced, and no hot spot appears.
- data dispersion: replicas of a partition on each OSD should be distributed to other different OSDs as evenly as possible, and a same requirement is true of a higher-level physical component.
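- A hedged sketch of replica placement under these factors (a greedy heuristic with hypothetical names; the patent does not prescribe a specific algorithm):

```python
# Greedy replica-placement sketch honoring the factors above (illustrative
# only; the patent leaves the placement algorithm open).

def place_replicas(partition_id, hosts, replica_count, load):
    # data security: never put two replicas of one partition on the same host
    candidates = list(hosts)
    placement = []
    for _ in range(replica_count):
        # data balance / dispersion: prefer the least-loaded remaining host
        host = min(candidates, key=lambda h: load[h])
        placement.append(host)
        candidates.remove(host)          # enforces one replica per host
        load[host] += 1
    return placement                     # placement[0] may serve as the primary
```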
- the OSD view includes status information of an OSD node in a cluster.
- the OSD view may include an ID of an OSD node and status information of the OSD node, where the OSD ID is an OSD marker or number.
- an OSD status may be classified into an “UP” state and a “DOWN” state according to whether the OSD is faulty, and classified into an “OUT” state and an “IN” state according to whether the OSD exits from a cluster.
- a specific status transition includes: after a failback, an OSD node is initialized or restarted and then transits from an “IN” and “DOWN” state to an “IN” and “UP” state.
- the OSD view may further include OSD view version information, such as an OSD view version number, an OSD view ID, or any other information that marks a view version.
- the IO view includes an identifier that identifies a primary OSD node of a partition group.
- the IO view may include a partition group ID and an identifier of a primary OSD node of a partition group corresponding to the partition group ID.
- Each IO view has IO view version information that identifies the IO view, where the IO view version information may be an IO view ID (which may also be called an IO view version number), which is used to identify a version of the IO view, so as to help different modules compare IO view versions.
- the IO view version information may be included in the IO view, or may be excluded from the IO view.
- the partition view includes information about an OSD on which a partition in a partition group is located.
- the partition view may include a partition group ID, an OSD on which each partition in a partition group corresponding to the partition group ID is located and a primary/secondary identity of the OSD, and a partition status corresponding to the OSD of each partition.
- the partition view includes information about an OSD node on which a primary partition is located (such as an OSD node ID, a primary/secondary identity of the OSD node, and a partition status corresponding to the OSD node of the primary partition) and information about an OSD node on which a secondary partition (there may be one or more secondary partitions) corresponding to the primary partition is located (such as an OSD node ID, a primary/secondary identity of the OSD node, and a partition status corresponding to the OSD of the secondary partition).
- the partition status may be classified into two types: “consistent” and “inconsistent”, where “consistent” indicates that data in a secondary partition is consistent with that in a primary partition, and “inconsistent” indicates that data in a secondary partition may be inconsistent with that in a primary partition.
- Each partition view has partition view version information that identifies the partition view, where the partition view version information may be a partition view ID (which may also be called a partition view version number), so that the modules compare view versions.
- the partition view version information may be included in the partition view, or may be excluded from the partition view.
- the partition view may further include the IO view version information.
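- Gathering the three definitions, the views can be sketched as small records (Python dataclasses with hypothetical field names derived from the descriptions above):

```python
# Sketch of the three cluster views described above (field names hypothetical).
from dataclasses import dataclass, field
from typing import List

@dataclass
class OSDView:                       # exists on the MDC module
    osd_id: str
    status: str                      # "UP"/"DOWN" and "IN"/"OUT"
    version: int = 0                 # OSD view version information

@dataclass
class IOView:                        # on MDC, IO routing modules, and OSD nodes
    partition_group_id: int
    primary_osd_id: str              # identifier of the primary OSD node
    version: int = 0                 # IO view ID / version number

@dataclass
class ReplicaEntry:
    osd_id: str
    identity: str                    # "primary" or "secondary"
    status: str                      # "consistent" or "inconsistent"

@dataclass
class PartitionView:                 # on the MDC module and the primary OSD node
    partition_group_id: int
    replicas: List[ReplicaEntry] = field(default_factory=list)
    version: int = 0                 # partition view version information
    io_view_version: int = 0         # the partition view may carry the IO view version
```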
- the MDC is further adapted to: maintain, manage, and update the cluster view; update the cluster view according to an OSD node status, such as a failure, a failback, exiting from a cluster after a failure, rejoining a cluster after a failback, and newly joining a cluster; and notify a related module of cluster view updating, so that the related module processes, according to the updated cluster view, replication of data corresponding to a corresponding IO request.
- the OSD view may exist only on the MDC
- the partition view may exist only on the MDC module and the primary OSD node
- the IO view exists on the MDC module, the IO routing module, the primary OSD node, and the secondary OSD node.
- the MDC module sends the partition view only to a primary OSD node on which a partition in the partition view is located, or instructs only a primary OSD node on which a partition in the partition view is located to update a local partition view; and sends the IO view that constitutes a part of the partition view (that is, the IO view may be regarded as a subview of the partition view) to the IO routing module, the primary OSD node, and the secondary OSD node, or instructs a corresponding module to update a locally stored IO view.
- the MDC module may set cluster views in different forms according to configuration information or a particular policy and according to basic functions of the cluster views, which is not limited in this embodiment of the present disclosure.
- the IO routing module is mainly adapted to implement an IO request routing function.
- the IO routing module acquires IO views of all partitions in the cluster from the MDC module and caches the IO views.
- the IO routing module computes, by using a key in the IO request (a hash algorithm or another algorithm may be used in the calculation), the partition group to which the IO belongs; it then searches the locally stored IO views to find the primary OSD node corresponding to the partition group, and sends the IO request to that primary OSD node.
- the IO routing module processes an IO view updating notification received from the MDC module, where the updating notification may include the updated IO view or corresponding updating indication information, which indicates, for example, content that is to be updated; updates a locally stored IO view according to the updating notification, and routes the IO request according to the updated locally-stored IO view.
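- A hedged sketch of this routing step (the hash choice and names are illustrative; the patent allows a hash algorithm or another algorithm):

```python
# Sketch of the IO routing step described above (hash choice illustrative).
import zlib

def route_io_request(io_request, io_views, partition_group_count):
    # map the key in the IO request onto a partition group
    pg_id = zlib.crc32(io_request.key.encode()) % partition_group_count
    # look up the cached IO view to find the primary OSD node
    io_view = io_views[pg_id]
    return io_view.primary_osd_id, pg_id   # forward the request to this node
```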
- that the OSD node processes the IO request according to the cluster view to execute an IO operation includes the following:
- When the OSD node serves as a primary OSD node, the primary OSD node is mainly adapted to: receive the IO request sent by the IO routing module, execute the IO request, and send a replication request to a corresponding secondary OSD node, so as to execute IO data storage and replication.
- the primary OSD node receives a partition view of a partition on the primary OSD node from the MDC module, and stores the partition view.
- the primary OSD node processes replication of the IO request according to the partition view.
- the primary OSD node further receives, from the MDC module, an updating notification about the partition view, updates the locally stored partition view according to the updating notification, and processes, according to the updated partition view, replication of data corresponding to the IO request, where the updating notification may include the updated partition view or corresponding updating information, so that the OSD node updates the locally stored partition view and IO view according to the updated partition view or the updating information.
- the secondary OSD node When the OSD node serves as a secondary OSD node, the secondary OSD node is adapted to: receive a replication request of a primary OSD node, and perform data replication and backup according to the replication request; receive an IO view of a partition to which data on the secondary OSD node belongs from the MDC module, and store the IO view, and process, according to the IO view, replication of data corresponding to the IO request; and further receive, from the MDC module, an updating notification about the IO view, update the locally stored IO view according to the updating notification, and process, according to the updated IO view, replication of the data corresponding to the IO request.
- the distributed storage and replication system (shown in FIG. 2A , FIG. 2B , and FIG. 2C ) in the foregoing embodiment may be implemented based on a system shown in FIG. 5 .
- the system may include one or more memories 502, one or more communications interfaces 504, one or more processors 506, and a data interaction network (which is used for interaction between multiple processors and memories, and is not shown in the figure).
- the memory 502 may be a memory of various types such as a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM).
- the memory 502 may store an operating system and an instruction and application data of another application program, where the instruction includes an instruction used for executing functions of an MDC module, an IO routing module, and an OSD node in various embodiments of the present disclosure.
- the instruction stored in the memory 502 is run and executed by the processor 506 .
- the communications interface 504 is adapted to implement communication between the memory 502 and the processor 506 , communication between processors, communication between memories, and communication between the system and another device or a communications network.
- the processor 506 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is adapted to execute a related program, so as to execute the procedures of interaction among the MDC module, the IO routing module, and the OSD node, and to implement the functions described in various embodiments of the present disclosure.
- the processor is adapted to: be connected to the memory and read the instruction in the memory, where the instruction includes the instruction used for executing the functions of the MDC module, the IO routing module, and the OSD node; and enable, according to the instruction, the processor to execute the following operations:
- the MDC module to: determine that an OSD node in the system is a faulty OSD node, determine a partition on the faulty OSD node, update a partition view of a partition group that includes the partition on the faulty OSD node, and send an updating notification to a primary OSD node on which the partition group in the updated partition view is located, so that the primary OSD node processes, according to the updated partition view, replication of data corresponding to an IO request.
- the functions of the foregoing MDC module, the IO routing module, and the OSD node may be implemented by one host.
- the instruction for implementing the functions of the foregoing MDC module, the IO routing module, and the OSD node may be stored in a memory of the host, and a processor of the host reads the instruction for implementing the functions of the foregoing MDC module, the IO routing module, and the OSD node from the memory.
- the functions of the foregoing MDC module, the IO routing module, and the OSD node may be implemented by multiple hosts in an interactive manner.
- the processor is adapted to: be connected to the memory and read the instruction in the memory, where the instruction includes the instruction used for executing the functions of the MDC module, the IO routing module, and the OSD node; and enable, according to the instruction, the processor to execute the following operations:
- the IO routing module to: receive an IO request, where the IO request includes a key; determine, according to the key, a partition group to which data corresponding to the IO request belongs, and determine a primary OSD node of the partition group to which the data belongs; add IO view version information of an IO view of the partition group to which the data belongs to the IO request, and send, to the determined primary OSD node, the IO request that carries the IO view version information;
- the primary OSD node to: receive the IO request; execute the IO request after determining, according to the IO view version information, that an IO view version in the IO request is consistent with a locally stored IO view version; generate a replication request that carries the IO view version information; and send the replication request to a secondary OSD node of the partition group to which the data belongs; and
- the secondary OSD node to: receive the replication request, and execute the replication request after determining, according to the IO view version information, that an IO view version in the replication request is consistent with an IO view version locally stored on the secondary OSD node, so that data corresponding to the IO request on the secondary OSD node keeps consistent with data corresponding to the IO request on the primary OSD node.
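- the interplay of these version checks can be illustrated with a small in-memory model; the class, field, and function names below are assumptions for illustration, not the actual interfaces of the system.

```python
from dataclasses import dataclass, field

@dataclass
class Osd:
    io_view_version: int                     # locally stored IO view version
    store: dict = field(default_factory=dict)

    def apply(self, key: str, value: bytes, request_version: int) -> None:
        # Execute only when the version carried in the request matches
        # the locally stored IO view version, as described above.
        if request_version != self.io_view_version:
            raise RuntimeError("view mismatch: return error or queue and query MDC")
        self.store[key] = value

def write(router_version: int, primary: Osd, secondaries: list) -> None:
    # The IO routing module stamps the request with its IO view version;
    # the primary executes it, then replicates with the same version.
    primary.apply("blk42", b"data", router_version)
    for s in secondaries:
        s.apply("blk42", b"data", router_version)

replicas = [Osd(io_view_version=3) for _ in range(3)]
write(3, replicas[0], replicas[1:])
assert all(r.store["blk42"] == b"data" for r in replicas)
```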
- the functions of the foregoing MDC module, the IO routing module, and the OSD node may be implemented by one host.
- the instruction for implementing the functions of the foregoing MDC module, the IO routing module, and the OSD node may exist in a memory of the host, and a processor of the host reads the instruction for implementing the functions of the foregoing MDC module, the IO routing module, and the OSD node from the memory.
- the functions of the foregoing MDC module, the IO routing module, and the OSD node may be implemented by multiple hosts in an interactive manner.
- the foregoing MDC module, the IO routing module, and the OSD node are stored in memories of the different hosts in a distributed manner.
- a processor of a host 1 executes a function of the foregoing IO routing module
- a processor of a host 2 executes a function of the primary OSD node
- a processor of a host 3 executes a function of the secondary OSD node
- a processor of a host 4 executes a function of the MDC module.
- the processor is adapted to: be connected to the memory and read the instruction in the memory, where the instruction includes the instruction used for executing the functions of the MDC module, the IO routing module, and the OSD node; and enable, according to the instruction, the processor to execute the following operations:
- the OSD node to: send a query request to the MDC module after a failback to request an IO view of a partition group that includes a partition on the OSD node, where the OSD node is called a failback OSD node, and the query request carries an OSD identifier of the failback OSD node; receive the IO view returned by the MDC; initiate a data recovery request to a primary OSD in the IO view, to request to recover data updated by the failback OSD node during failure; receive the data that is updated during the failure and that is sent by the primary OSD; and process replication of the IO request according to a partition view that is of the partition group and that is updated by the MDC module after the failback OSD node completes data recovery;
- the MDC module to: receive the query request of the failback OSD node, return the IO view to the failback OSD node according to the OSD identifier in the query request, and update the partition view of the partition group after the failback OSD node completes data recovery;
- the primary OSD node to: receive the data recovery request of the failback OSD node, send the data updated during the failure to the failback OSD node, and process, according to the partition view that is of the partition group and that is updated by the MDC module after the failback OSD node completes data recovery, replication of data corresponding to the IO request.
- the functions of the foregoing MDC module, the IO routing module, and the OSD node may be implemented by one host.
- the instruction for implementing the functions of the foregoing MDC module, the IO routing module, and the OSD node may exist in a memory of the host, and a processor of the host reads the instruction for implementing the functions of the foregoing MDC module, the IO routing module, and the OSD node from the memory.
- the functions of the foregoing MDC module, the IO routing module, and the OSD node may be implemented by multiple hosts in an interactive manner.
- the foregoing MDC module, the IO routing module, and the OSD node are stored in memories of the different hosts in a distributed manner.
- a processor of a host 1 executes a function of the foregoing failback OSD node
- a processor of a host 2 executes a function of the primary OSD node
- a processor of a host 3 executes a function of the MDC module
- a processor of a host 4 executes a function of the IO routing module.
- the following uses multiple specific procedure embodiments to further describe connection and interaction, specific functions, and the like of the MDC module 202 , the IO routing module 204 , and the OSD node 206 in the foregoing distributed storage and replication control system in detail.
- These specific procedure embodiments include: a procedure of initialized generation and acquisition of a cluster view, an IO request processing procedure, an OSD failure processing procedure, an OSD node failback procedure, a data recovery procedure, a procedure in which an OSD node exits from a cluster after a failure, and a procedure in which a new OSD node joins a cluster, which are described one by one in detail in the following.
- FIG. 6 shows an embodiment of initialized generation and acquisition of a cluster view according to the present disclosure.
- an MDC module generates an initial cluster view according to cluster configuration information delivered by an administrator.
- An IO routing module and an OSD node query the MDC module for the view during initialization. This embodiment is executed by the MDC module, the IO routing module, and the OSD node mentioned in the embodiments described in FIG. 2A to FIG. 2C and FIG. 5 .
- the MDC module is first started, and then the IO routing module and the OSD node start to acquire the view from the MDC.
- a specific process includes the following steps:
- the MDC generates the initial cluster view according to the delivered cluster configuration information. The three types of cluster views (an OSD view, a partition view, and an IO view) are already described above. The MDC generates the OSD view according to the configured OSD information, and generates the partition view by using a partition allocation algorithm together with the configured partition quantity, replica quantity, and OSD node quantity; the IO view does not need to be additionally generated, because it is a subset of the partition view. When the partition view is being generated, partition allocation balance (a quantity of partitions on each OSD node keeps the same as much as possible) and security (OSD nodes on which partition replicas exist are in different servers or different racks) generally need to be considered.
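- one simple way to meet both goals is rack-aware round-robin placement, as in the sketch below; this is only an assumed stand-in for the unspecified partition allocation algorithm, and it presumes that the replica quantity does not exceed the number of racks.

```python
import itertools

def allocate_partitions(partition_count: int, replica_count: int,
                        osds_by_rack: dict) -> dict:
    """Assign each partition group's replicas to OSDs in distinct racks,
    cycling through racks and OSDs so the per-OSD load stays even."""
    racks = sorted(osds_by_rack)
    osd_cycle = {r: itertools.cycle(osds_by_rack[r]) for r in racks}
    view = {}
    for p in range(partition_count):
        chosen = [racks[(p + i) % len(racks)] for i in range(replica_count)]
        view[p] = [next(osd_cycle[r]) for r in chosen]  # [primary, secondaries]
    return view

# usage: 6 partition groups, 3 replicas each, OSDs grouped by rack
print(allocate_partitions(6, 3, {"rack1": ["osd1", "osd2"],
                                 "rack2": ["osd3", "osd4"],
                                 "rack3": ["osd5", "osd6"]}))
```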
- the MDC module returns IO views of all partition groups to the IO routing module.
- the MDC module returns, to the OSD node, the partition view of the partition group that includes the primary partition distributed on the OSD and the IO view of the partition group that includes the secondary partition distributed on the OSD node.
- FIG. 7A and FIG. 7B show an embodiment of an IO request processing procedure according to the present disclosure. This embodiment is executed by the IO routing module and the OSD node mentioned in the embodiments described in FIG. 2A to FIG. 2C and FIG. 5.
- a partition group (a Partition X) is used as an example for description in this embodiment, where the Partition X may be any partition group managed and maintained by a distributed replication protocol layer in this embodiment of the present disclosure
- a primary OSD node on which a primary partition in the partition X is located is abbreviated as a Partition X primary OSD node
- a secondary OSD 1 node on which a secondary partition in the Partition X is located is abbreviated as a Partition X secondary OSD 1 node
- a secondary OSD 2 node on which a secondary partition in the Partition X is located is abbreviated as a Partition X secondary OSD 2 node.
- the partition X is used as an example in the following specific description of the procedure.
- the IO routing module receives an IO request sent by a host (for example, a server shown in FIG. 2C on which the IO routing module is located).
- the IO routing module obtains, according to the received IO request, a partition of data (which may also be called IO data) corresponding to the IO request, and acquires a primary OSD node of a partition group that includes the partition.
- the IO routing module may obtain, by means of calculation by using a hash algorithm and according to a key carried in the IO request, a partition group ID of the partition group that includes the data corresponding to the IO request, and then searches for an IO view by using the partition group ID, so as to acquire the primary OSD node of the partition group.
- the partition group corresponding to the partition group ID herein is the partition X in this embodiment.
- the Key is a digit or a character string defined in an upper-layer service, and is used to identify a data block.
- the IO routing module sends the IO request to the primary OSD node of the partition group, where the request carries IO view version information (for example, an IO view ID), IO key information, and the IO data.
- the partition group in this embodiment is the partition X, and accordingly, the IO request is sent to the primary OSD node of the partition X, that is, the partition X primary OSD node.
- the IO view ID may also be called an IO view version number, and the IO view version number is mainly used to identify a view version and monotonically increases, where a small IO view ID indicates that a view version maintained by the primary OSD node is an outdated version, and a requirement for ensuring consistency is that views seen by all modules in the IO processing procedure are consistent.
- the IO key information may include a key, an offset, and a length, where the offset indicates an offset of the IO data relative to a start position in a data block identified by the key, and the length indicates a length of the IO data.
- the Partition X primary OSD node generates a Sequence (Seq) ID (which may also be called a sequence identifier) for the IO request after it is determined, according to the IO view version information, that the view version carried in the IO request is consistent with the locally stored view version; that is, after determining that the IO view ID carried in the IO request is consistent with the locally stored IO view ID, the primary OSD node generates the sequence identifier for the IO request.
- the Seq ID includes the view version number and a sequence number (Seq NO).
- the view version number monotonically increases as the IO view changes, and the Seq NO indicates a serial number of a modify operation (for example, writing and deleting) on data corresponding to a partition group in the IO view within one IO view version.
- when the IO view version number changes, the Seq NO in the Seq ID starts to increase again from 0.
- The comparison between two Seq IDs may be performed as follows: first compare the IO view IDs, where a larger IO view ID indicates a larger Seq ID; if the IO view IDs are equal, then compare the Seq NOs, where a larger Seq NO indicates a larger Seq ID; the Seq IDs are consistent only when both the IO view IDs and the Seq NOs are the same.
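- these rules amount to lexicographic comparison of the (IO view ID, Seq NO) pair, as the following sketch shows; the class name and fields are assumptions for illustration.

```python
from functools import total_ordering

@total_ordering
class SeqId:
    """Compare IO view IDs first; only when they are equal do the
    Seq NOs decide, matching the rules described above."""
    def __init__(self, io_view_id: int, seq_no: int):
        self.io_view_id, self.seq_no = io_view_id, seq_no

    def __eq__(self, other):
        return (self.io_view_id, self.seq_no) == (other.io_view_id, other.seq_no)

    def __lt__(self, other):
        return (self.io_view_id, self.seq_no) < (other.io_view_id, other.seq_no)

assert SeqId(2, 0) > SeqId(1, 99)   # larger IO view ID wins outright
assert SeqId(2, 5) > SeqId(2, 4)    # equal views: larger Seq NO wins
assert SeqId(2, 5) == SeqId(2, 5)   # consistent only when both parts match
```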
- if the local IO view ID is larger, an error is returned to the IO routing module; or if the local IO view ID is smaller, the IO request is added to a cache queue, and the MDC module is queried for a partition view.
- after determining that the IO view ID in the IO request is less than the locally stored IO view ID, the Partition X primary OSD node returns the error to the IO routing module, and the IO routing module queries the MDC module for the IO view of the partition group, and resends the IO request after obtaining an updated IO view ID; or after determining that the IO view ID in the IO request is greater than the locally stored IO view ID, the Partition X primary OSD node adds the IO request to the cache queue, and queries the MDC module for an IO view ID of the IO view of the partition group, so as to execute the IO request after determining that the locally stored IO view ID is consistent with the IO view ID in the IO request.
- the operation type may include writing, deleting, or the like.
- the entry may further include the foregoing offset and length.
- the entry may further include status information, which is used to describe whether the operation is successful. Generally, all modify operations (such as write and delete operations) on a same partition group are consecutively numbered.
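- gathered together, an entry with the fields listed above might look like the following dataclass; the exact layout is an assumption for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Entry:
    """One entry per modify operation on a partition group; modify
    operations on a same group are consecutively numbered, as noted above."""
    op_type: str                     # "write" or "delete"
    partition_group_id: int
    io_view_id: int                  # Seq ID, part 1
    seq_no: int                      # Seq ID, part 2
    key: str
    offset: Optional[int] = None     # optional, for writes
    length: Optional[int] = None     # optional, for writes
    success: Optional[bool] = None   # optional status information
```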
- if the IO request is a write request, the IO data is written to local physical storage resources (for example, a cache layer or a persistence layer shown in FIG. 2C, such as a magnetic disk, or the foregoing mentioned external physical storage resources SAN) managed by the Partition X primary OSD node; and if the IO request is a delete request, corresponding data on the local physical storage resources managed by the Partition X primary OSD node is deleted.
- the Partition X primary OSD node further generates a replication request.
- the replication request may be a replication request generated by separately assembling a control part of the IO request, so that data corresponding to the IO request on the secondary OSD node of the Partition X is consistent with data corresponding to the IO request on the Partition X primary OSD node.
- the replication request may further include the information such as the key, the offset, the length, and the IO data in the original request.
- When the replication request is a write replication request, the replication request carries the key, the offset, the length, and the IO data; and when the replication request is a delete replication request, the replication request carries only the key.
- the procedure may further include the following steps:
- After determining that the IO view ID in the replication request is less than the locally stored IO view ID, the Partition X secondary OSD 1 node or the Partition X secondary OSD 2 node returns an error to the Partition X primary OSD node, and the Partition X primary OSD node queries the MDC module for the IO view of the partition group, and resends the replication request after obtaining the updated IO view ID; or after determining that the IO view ID in the replication request is greater than the locally stored IO view ID, the Partition X secondary OSD 1 node or the Partition X secondary OSD 2 node adds the replication request to the cache queue, and queries the MDC module for the IO view version number of the partition group, so as to execute the replication request after determining that the locally stored IO view ID is consistent with the IO view ID in the replication request.
- the replication request may further carry a Seq ID in a previous replication request sent by the Partition X primary OSD node of the partition X, and the IO procedure may further include step 720 .
- if the Seq ID that is in the previous replication request and that is carried in the current replication request is less than the largest local Seq ID, an error is returned to the partition X primary OSD node, and the partition X primary OSD node resends the replication request; or a Seq ID is determined by means of further querying, and after the Seq ID is determined, an updated Seq ID is acquired and then processing continues, instead of directly returning an error.
- processing does not need to be terminated for a rollback, which further improves the fault tolerance and availability of the system and the performance of the entire system.
- when the partition X primary OSD node or the partition X secondary OSD node becomes faulty (for example, after the IO request arrives at the partition X primary OSD node, the partition X primary OSD node becomes faulty, or the partition X secondary OSD node becomes faulty), or a new OSD node joins the system as a secondary OSD node of the partition X, or the like, the IO request processing procedure in the foregoing embodiment may further include the processing process described in the following embodiment.
- the IO processing procedure includes the following:
- the MDC module in the system sets, when detecting, in the process of processing the IO request, that the partition X primary OSD node becomes faulty, the partition X primary OSD node in the partition view of the partition group as a new partition X secondary OSD node, and marks a partition status of the new partition X secondary OSD node as inconsistent; sets the partition X secondary OSD 1 node in the partition view of the partition group as a new partition X primary OSD node, and sends the updated partition view of the partition group to the new partition X primary OSD node; and sets the partition X secondary OSD 1 node in the IO view of the partition group as a new partition X primary OSD node, and sends the updated IO view of the partition group to the IO routing module;
- the IO routing module receives the updated IO view that is of the partition group and that is sent by the MDC module, and sends the IO request to the new partition X primary OSD node according to the updated IO view of the partition group;
- the new partition X primary OSD node is adapted to: receive the IO request, and after executing the IO request, generate a new replication request, and send the new replication request to another secondary OSD node in the updated partition view of the partition group, where the steps of generating the replication request and sending the replication request are the same as the foregoing steps 714 and 716.
- the IO processing procedure includes the following:
- the MDC module is further adapted to: when detecting, in the process of processing the IO request, that the partition X secondary OSD node becomes faulty, mark a partition status of the partition X secondary OSD node in the partition view as inconsistent, and send the updated partition view of the partition group to the partition X primary OSD node; and
- the MDC module can rapidly determine a new primary OSD node by means of voting, and rapidly resume IO processing, and therefore high availability and strong fault tolerance are achieved.
- an influence range of the storage node failure can be greatly narrowed down by means of control of a cluster view at a partition granularity, so that the storage system can be extended on a large scale, and the system extensibility is improved.
- the MDC finds that a new OSD node joins the cluster and serves as a secondary OSD node of the partition X, and the IO processing procedure includes the following:
- the MDC module notifies, when determining, in the process of processing the IO request, that the new OSD node joins a cluster, the partition X primary OSD node that the new OSD node serves as a new secondary OSD node on which the partition X is located; after partition data synchronization is completed, updates a partition view and an IO view of the partition group; and instructs the partition X primary OSD node to update a partition view locally stored on the partition X primary OSD node; and
- the partition X primary OSD node synchronizes data of a primary partition on the partition X primary OSD node to the new secondary OSD node, and sends the replication request to the new secondary OSD node according to the updated locally stored partition view.
- a storage node failure is common in a large-scale distributed storage system.
- OSD nodes When some OSD nodes are faulty, the system needs to be capable of normally providing an IO service.
- processing on all IO requests depends on a cluster view maintained by an MDC module, and when an OSD node in a cluster becomes faulty, the cluster view also needs to be updated accordingly, so that an IO request can be properly and effectively processed.
- first, the MDC module detects a status of an OSD node, and when an OSD node becomes faulty, the MDC module can find the failure in time; second, after finding the OSD failure, the MDC module needs to perform processing in time to make a correct alteration to the view, and notify a related IO routing module and OSD node of the alteration; third, the related IO routing module and OSD node process a corresponding IO request according to the updated view after receiving an updating notification of the MDC, so that the module and the node can obtain the updated view in time, thereby ensuring that the IO request is smoothly and effectively processed.
- the MDC module may detect the OSD node failure in the following two modes: (1) The MDC module is responsible for failure detection on all OSD nodes, and each OSD node regularly sends a heartbeat message to the MDC module, where if an OSD node does not send a heartbeat message to the MDC module within a specified time period, the MDC module determines that the OSD node becomes faulty; and (2) OSD nodes regularly send a heartbeat message to each other to detect a failure, where if a detecting party does not receive a heartbeat message of a detected party within a specified time period, the detecting party reports to the MDC module that a corresponding OSD node becomes faulty.
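- mode (1) reduces to tracking the last heartbeat time per OSD node, as in this sketch; the timeout value and all names are assumptions for illustration.

```python
import time

class HeartbeatMonitor:
    """Centralized detection (mode 1): each OSD node calls heartbeat();
    the MDC periodically asks for nodes silent beyond the timeout."""
    def __init__(self, timeout_seconds: float = 5.0):
        self.timeout = timeout_seconds
        self.last_seen: dict = {}

    def heartbeat(self, osd_id: str) -> None:
        self.last_seen[osd_id] = time.monotonic()

    def faulty_osds(self) -> list:
        now = time.monotonic()
        return [osd for osd, seen in self.last_seen.items()
                if now - seen > self.timeout]

# usage: osd-x heartbeats once, then falls silent past the (tiny) timeout
monitor = HeartbeatMonitor(timeout_seconds=0.01)
monitor.heartbeat("osd-x")
time.sleep(0.02)
assert monitor.faulty_osds() == ["osd-x"]
```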
- if a secondary partition is distributed on the faulty OSD node, a partition status of the faulty OSD node in a partition view of a partition group that includes the secondary partition is marked as inconsistent; at the same time, a primary OSD node of the partition group is notified of an updated partition view, and the primary OSD node of the partition group processes, according to the updated partition view, replication of data corresponding to an IO request;
- if a primary partition is distributed on the faulty OSD node, the faulty OSD node that serves as a primary OSD node in a partition view of a partition group that includes the primary partition is set as a new secondary OSD node, a partition status corresponding to the new secondary OSD node is marked as inconsistent, a secondary OSD node whose partition status is consistent is selected from the original secondary OSD nodes in the partition view of the partition group, and the selected secondary OSD node is set as a new primary OSD node; then the new primary OSD node is notified of the partition view updating, and another secondary OSD node is notified of the IO view updating. If the primary OSD node of the partition group becomes faulty and the partition statuses of the OSD nodes on which all the secondary partitions are located are inconsistent, no alteration is made to the partition view and the IO view, because it is required to ensure that the primary partition replica has the latest complete data, thereby ensuring data replication consistency.
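- the two updating rules above can be condensed into one function over a toy partition-view structure; the layout used below is an assumption for illustration, not the actual encoding of a partition view.

```python
def update_view_on_failure(view: dict, faulty_osd: str) -> dict:
    """Apply the rules above: mark a faulty secondary inconsistent;
    for a faulty primary, promote the first consistent secondary and
    demote the old primary to an inconsistent secondary. If no
    consistent secondary exists, make no alteration."""
    secondaries = list(view["secondaries"])
    if view["primary"] != faulty_osd:
        return {"primary": view["primary"],
                "secondaries": [(o, "inconsistent" if o == faulty_osd else s)
                                for o, s in secondaries]}
    consistent = [o for o, s in secondaries if s == "consistent"]
    if not consistent:
        return view                  # primary replica must keep latest data
    new_primary = consistent[0]
    rest = [(o, s) for o, s in secondaries if o != new_primary]
    rest.append((faulty_osd, "inconsistent"))
    return {"primary": new_primary, "secondaries": rest}

view = {"primary": "osd-x",
        "secondaries": [("osd-y", "consistent"), ("osd-z", "consistent")]}
print(update_view_on_failure(view, "osd-x"))   # osd-y becomes primary
```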
- That the related IO routing module and OSD node process a corresponding IO request according to the updated view after receiving an updating notification of the MDC may include:
- the new primary OSD node replicates, according to the updated locally-stored partition view, the data corresponding to the IO request from the IO routing module onto a secondary OSD node whose partition status is consistent in the updated locally stored partition view, or onto a secondary OSD node on which a partition whose partition status is inconsistent is located in the updated locally-stored partition view but that is recovering data, so as to isolate the failure, and ensure proper and uninterrupted IO request processing, thereby improving fault tolerance of the system, and accordingly improving performance and availability of the system.
- an influence range of the OSD node failure can be narrowed down by means of control of a cluster view at a partition granularity, so that the system can be extended on a large scale, and system extensibility is improved.
- the MDC module further updates the partition view and the IO view, notifies a primary OSD node on which a partition in the further updated partition view is located of the further updated partition view, and sends the further updated IO view to a secondary OSD node on which a partition in the further updated partition view is located, so that the module or the OSD node that receives the further updated partition view or IO view updates a locally stored partition view or IO view, and processes, according to the further updated partition view or IO view, replication of the data corresponding to the IO request.
- the view is updated in time, so that the failback node can rapidly join the cluster to process the IO request, which improves the performance and efficiency of the system.
- FIG. 8A and FIG. 8B show an embodiment of an OSD node failure processing procedure according to the present disclosure. This embodiment is executed by the MDC module, the IO routing module, and the OSD node mentioned in the embodiments described in FIG. 2A to FIG. 2C and FIG. 5 .
- an OSDx node, an OSDy node, and an OSDz node among OSD nodes in this embodiment are used as an example for description, where the OSDx node, the OSDy node, or the OSDz node may be any one OSD node of multiple OSD nodes in a distributed replication protocol layer in this embodiment of the present disclosure.
- the OSDx node is a primary OSD node of a partition group 1 (P 1 for short) and a secondary OSD node of a partition group n (Pn for short);
- the OSDy node is a primary OSD node of the Pn and a secondary OSD node of the P 1 ;
- the OSDz node is a secondary OSD node of the Pn and a secondary OSD node of the P 1 .
- the OSD node failure processing procedure in this embodiment includes the following steps:
- the OSDx node, the OSDy node, and the OSDz node separately and regularly send a heartbeat message to the MDC module.
- the OSDx node becomes faulty.
- the MDC module determines that the OSDx node becomes faulty.
- the MDC module performs view updating according to a failure situation.
- the MDC updates a cluster view of a corresponding partition group according to a partition group that includes a partition on the determined faulty OSD node.
- partition groups that include a partition on the faulty OSDx node include the P 1 and the Pn, and therefore, the MDC needs to update cluster views of the P 1 and the Pn, which may include:
- a status of the OSDx node is updated from an “IN” and “UP” state to an “IN” and “DOWN” state.
- for the P 1, the OSDy node is set as a primary OSD node of the P 1 (the first secondary OSD node whose partition status is consistent is selected from a secondary OSD node list of the P 1 and set as the primary OSD node of the partition group), the OSDx is set as a secondary OSD node of the P 1, and a corresponding partition status thereof is updated to "inconsistent"; for the Pn, a partition status of the secondary OSDx node of the Pn is changed to "inconsistent".
- notify the OSDy node (which serves as a primary OSD node of the P 1 after updating and still serves as a primary OSD node of the Pn) of the updating of the partition views of the P 1 and the Pn, where the updating includes: setting the OSDy node as a primary OSD node of the P 1, setting the OSDx as a secondary OSD node of the P 1 and a corresponding partition status of the OSDx to "inconsistent", and changing a partition status of the secondary OSDx node of the Pn to "inconsistent".
- Notify the OSDz node of the updating of the IO view of the P 1, that is, notify the OSDz node that the primary OSD node of the P 1 is replaced with the OSDy node.
- Notify the IO routing module of the updating of the IO view of the P 1, that is, notify the IO routing module that the primary OSD node of the P 1 is replaced with the OSDy node.
- the OSDy node processes a notification of the MDC module, updates locally stored view information (a partition view and an IO view), and processes, according to a latest view notified by the MDC module, replication of the data corresponding to the IO request.
- as a new primary OSD node of the P 1, the OSDy node updates the partition view and the IO view of the P 1; as the original primary OSD node of the Pn, the OSDy node updates the partition view of the Pn.
- for an IO operation on the P 1, after receiving the IO request forwarded by the IO routing module, the OSDy node executes the IO request, generates a replication request, and sends the replication request to the secondary OSD node of the P 1 in the updated partition view, that is, the OSDz node, where a corresponding partition status of the OSDz node is "consistent". Because the OSDx node serves as a new secondary OSD node of the P 1, and a partition status of the OSDx node is "inconsistent", the OSDy node no longer sends the replication request to the OSDx node, which implements failure isolation, and does not affect continuous IO request processing on the P 1.
- for an IO operation on the Pn, after receiving the IO request forwarded by the IO routing module, the OSDy node executes the IO request, generates a replication request, and sends the replication request to the secondary OSD node of the Pn in the updated partition view, that is, the OSDz node, where a corresponding partition status of the OSDz node is "consistent". Because the OSDx node serves as a secondary OSD node of the Pn, and a partition status of the OSDx node is "inconsistent", the OSDy node no longer sends the replication request to the OSDx node, which implements failure isolation, and does not affect continuous IO request processing on the Pn.
- the MDC instructs a primary node to make a view alteration, so as to ignore the faulty node, isolate the faulty node, and continue IO request processing without blocking processing on the another IO request, which has better fault tolerance and availability.
- the MDC node can rapidly determine a new primary node by means of voting, and rapidly resume IO processing, and if a secondary node becomes faulty, the MDC instructs the primary node to make a view alteration, so as to isolate or ignore the faulty OSD node, and continue IO request processing without blocking processing on another IO request, which has better fault tolerance, and can rapidly handle a node failure. For example, a failure of N replicas in N+1 replicas may be tolerated, which further improves performance and availability of a storage system.
- a system with low availability inevitably has poor extensibility, and because a storage node failure is common in a large-scale distributed storage system, complex and massive protocol interaction may further reduce system extensibility.
- an influence range of the storage node failure can be greatly narrowed down by means of control of a cluster view at a partition granularity, so that the storage system can be extended on a large scale, and the system extensibility is improved.
- a new data modify operation may occur during a failure of an OSD node, and therefore, before the faulty OSD node returns to normal and rejoins a cluster to provide a service, data recovery and synchronization need to be first performed, so as to make a partition replica on the faulty OSD node return to a state consistent with that of a primary replica.
- the OSD node failback procedure may be divided into three phases:
- a secondary OSD node synchronizes, with a primary OSD node, data modification performed by the primary OSD node during the failure, which is an incremental synchronization process, and certainly, in an actual application, all data of a partition may be synchronized according to an actual situation.
- an MDC module alters a cluster view.
- the MDC module notifies each module and node of the updated cluster view, so that each module and node process IO request replication or forwarding according to the notified updated cluster view.
- the OSD node failback procedure may further include the following phase:
- the secondary OSD node plays back a log that is recorded, in the process of synchronizing data with the primary OSD node, for the replication requests received from the primary OSD node, and writes the data that is in the log. In this way, it can be ensured that after the failback, all data of the failback OSD node is consistent with that of the primary OSD node, thereby further improving data consistency between the primary OSD node and the secondary OSD node.
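- in miniature, this phase is: install the synchronized data first, then replay the logged replication requests in their original order; the structures in the following sketch are assumptions for illustration.

```python
def replay_failback_log(store: dict, synced_data: dict, log: list) -> None:
    """First install the data synchronized from the primary, then apply
    the replication requests logged while synchronization was running,
    preserving their original order."""
    store.update(synced_data)                 # result of data recovery
    for op, key, value in log:                # log playback
        if op == "write":
            store[key] = value
        elif op == "delete":
            store.pop(key, None)

# usage: a logged write supersedes the synchronized value for blk1
store: dict = {}
replay_failback_log(store,
                    synced_data={"blk1": b"old"},
                    log=[("write", "blk1", b"new"), ("delete", "blk0", None)])
assert store == {"blk1": b"new"}
```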
- For a specific procedure of the data synchronization process, refer to the specific embodiment provided in the following FIG. 10A and FIG. 10B.
- a cluster view updating and updating notification process may include the following cases:
- if the OSD node is a secondary OSD node of a partition group before the failure, only the partition view may be altered, where a partition status of the secondary OSD node in the partition view is changed to a "consistent" state, and the altered partition view is notified to the primary OSD node.
- if the OSD node is a primary OSD node of a partition group before the failure, both the partition view and the IO view are altered: the OSD node is set as a new primary OSD node of the partition group, the current primary OSD node is set as a secondary OSD node of the partition group, the new primary OSD node is instructed to alter the partition view, and the IO routing module and the secondary OSD node are instructed to alter the IO view.
- if the faulty OSD node does not fail back within a preset time, the MDC module kicks the OSD node out of the cluster, and migrates a partition distributed on the OSD node to another OSD node.
- FIG. 9A , FIG. 9B , and FIG. 9C show an embodiment of an OSD node failback processing procedure according to the present disclosure. This embodiment is executed by the MDC module, the IO routing module, and the OSD node mentioned in the embodiments described in FIG. 2A to FIG. 2C and FIG. 5 .
- the OSD node failback processing procedure in this embodiment includes the following steps:
- the failback OSD node requests the MDC module for cluster view information on the failback OSD node, where the request carries an OSD ID of the OSD node.
- the MDC module queries a partition view according to the OSD ID to acquire partition information on the OSD.
- the MDC separately queries, according to the OSD ID of the failback OSD, partition views corresponding to the P 1 and the Pn on the failback OSD, to separately obtain partition information of the P 1 and partition information of the Pn.
- the partition information may include an IO view, and may further include a partition status.
- when an IO request is received during the failback, a primary OSD node (for example, the primary OSD node of the Pn) needs to send a replication request to all secondary OSD nodes of the Pn, and the failback procedure may further include the following steps 912-916 and step 918.
- the primary OSD node of the Pn receives an IO write request from a host during a failback.
- data corresponding to the IO request may also be written to the log, and for the IO key information in this step, refer to the description of the foregoing IO operation procedure in FIG. 7A and FIG. 7B .
- After completing data recovery and synchronization with the primary OSD node, the failback OSD node writes, to the physical storage resources managed by the failback OSD node and according to the IO information recorded in the log, the data corresponding to the replication request from the primary OSD node of the Pn.
- the IO request received in the failback process is first written to the log, and the logged IO request is applied only after the data missing during the failure is recovered, which can ensure that the sequences in which the primary OSD node and the secondary OSD node execute all IO operations are consistent, and further improve data backup consistency.
- the MDC module may be instructed, in two manners, to execute updating of a cluster view, where in Manner 1, after data recovery is completed, the failback OSD node notifies the primary OSD nodes of the P 1 and the Pn, so that the primary OSD node instructs the MDC node to update the cluster view.
- For Manner 2, refer to the following step 930.
- the failback OSD node may further determine a partition status of the partition on the failback OSD node, and trigger a cluster view updating procedure after determining that the partition status is inconsistent.
- the failback OSD node may further acquire partition status information by gathering the partition information returned in the foregoing step 906 .
- the primary OSD nodes of the P 1 and the Pn separately instruct the MDC to alter the partition view, where a notification carries a partition group ID, an ID of a secondary OSD, and a view version.
- the notification is sent to the MDC module to require the MDC module to update the partition status of the partition on the failback OSD node to consistent, where the notification carries the partition group ID of the partition group that includes the partition on the OSD node, the ID of the secondary OSD node (that is, the failback OSD node), and the view version.
- the partition group ID herein is used to mark a to-be-updated view of the partition group
- the OSD ID is used to mark the faulty OSD node
- the view version herein is a view version of a latest partition view locally stored on the primary OSD nodes of the P 1 and the Pn, where a function of the view version is that the MDC performs cluster view updating processing after receiving the notification and determining that the view version of the partition view in the notification is consistent with a view version of a latest partition view locally maintained by the MDC, thereby ensuring that cluster views seen by all modules or nodes in the IO processing procedure are consistent, and improving data backup consistency.
- the latest partition view is further sent to the primary OSD nodes of the P 1 and the Pn, and after it is determined that primary data and secondary data of the P 1 and the Pn are consistent, the cluster view is updated.
- in Manner 2, after completing data recovery, the failback OSD node sends the notification to the MDC module, to instruct the MDC module to update the partition status of the partition on the failback OSD node to consistent.
- the view version herein is a partition view version of a latest partition view or an IO view version of a latest IO view locally stored on the failback OSD node (whether the view version is the partition view or the IO view depends on whether the failback OSD node is a primary OSD node or a secondary node before the failure), and after determining, according to the view version in the notification, that the corresponding view version locally maintained by the MDC module is consistent, the MDC module performs cluster view updating processing.
- partition statuses corresponding to the failback OSD node in the partition views of the P 1 and the Pn on the failback OSD node are updated to “consistent”.
- for the P 1, because the failback OSD node is originally the primary OSD node of the P 1, the MDC resets the failback OSD node as a new primary OSD node of the P 1, sets the current primary OSD node on which the P 1 is located as a new secondary OSD node of the P 1, and updates the partition view of the P 1; for the Pn, because the failback OSD node is originally a secondary OSD node of the Pn, a primary/secondary identity change of the failback OSD node is not involved.
- If the failback OSD node is originally a primary node, the MDC sends the latest partition view to the failback OSD node.
- After receiving the latest partition view, the failback OSD node determines whether the primary OSDs in the latest partition view and the locally stored IO view are consistent and whether the failback OSD node is set as a primary node; if the failback OSD node is set as a primary node, the failback OSD node updates the locally stored IO view and partition view, and processes, according to the updated partition view and the updated IO view, replication of data related to the IO request.
- current primary OSD nodes of the P 1 and the Pn receive the updated IO view, and update respective locally-stored IO views.
- because the current primary OSD node of the P 1 has been set by the MDC module as a new secondary OSD node of the P 1, the current primary OSD node of the P 1 deletes the locally stored partition view, and processes replication of the IO request according to the updated locally stored IO view.
- when a node becomes faulty, an MDC instructs another related node to perform view updating, so as to isolate or ignore the faulty OSD node and continue IO request processing without blocking processing on another IO request; after a node failback, the MDC updates the view and notifies each related node, so that the failback node can rapidly rejoin the cluster for work. In this way, a node failure and failback can be rapidly processed, better fault tolerance is achieved, and the performance and availability of the storage system are improved.
- a system with low availability inevitably has poor extensibility, and because a storage node failure is common in a large-scale distributed storage system, complex and massive protocol interaction may further reduce system extensibility.
- an influence range of the storage node failure can be greatly narrowed down by means of control of a cluster view at a partition granularity, so that the storage system can be extended on a large scale, and the system extensibility is improved.
- the data recovery procedure in this embodiment includes the following steps:
- the failback OSD node locally acquires a largest Seq ID in a recorded entry of each partition.
- an OSD node in a system records one entry for each IO operation on a partition in an IO processing process.
- the entry includes an IO operation type, a Partition group ID, a Seq ID, and a key, and the entry may further include status information, which is used to describe whether the operation is successful.
- the entry may further include the foregoing offset and length. For example, in this embodiment, largest Seq IDs for IO write operations on the P 1 and the Pn are separately acquired.
- Scenario 1: the largest Seq ID of a secondary OSD falls within the range of the entries recorded by the primary OSD.
- Seq ID numbering rules or manners may be different, and the entries that the failback OSD node lacks may be different for the P 1 and the Pn.
- the failback OSD node sends, according to the acquired entries, data synchronization requests one by one in batches to the primary OSD nodes of the P 1 and the Pn, where the request carries the IO key information such as the key, the offset, and the length.
- the primary OSD nodes of the P 1 and the Pn send, according to information in the acquired data synchronization request, data corresponding to each entry to the failback OSD node.
- Scenario 2: the largest Seq ID of a secondary OSD does not fall within the range of the entries recorded by the primary OSD, and is less than the smallest Seq ID of the primary OSD.
- when the primary OSD node of the Pn determines that the largest Seq ID of the failback OSD node does not fall within the range of the entries recorded by the primary OSD, that is, the largest Seq ID is less than the smallest Seq ID of the primary OSD of the Pn, the smallest Seq ID of the primary OSD of the Pn, and no entry, are sent to the failback OSD node, which helps the failback OSD node determine whether the primary OSD node of the Pn has written no data or has written so much data that data recovery cannot be completed by means of incremental synchronization.
- the failback OSD node requests the primary OSD node of the Pn for synchronization of data of an entire partition, where for example, data of a primary partition on the primary OSD of the Pn in this embodiment is synchronized with data of a secondary partition of the Pn on the failback OSD node, and the request carries the partition group ID.
- a volume of data of a partition is generally quite large, so the data cannot be entirely transmitted by using one request; in addition, the primary node does not know the IO capability of the failback node, and if it continuously sent data to the failback OSD node, the failback node might fail in processing the data; therefore, the primary node sends the data to the failback node only when the failback node requests data synchronization.
- the failback OSD node repeatedly sends a synchronization request to the primary OSD node of the Pn according to a situation, to synchronize data in an entire partition until all the data in the partition are synchronized. In an actual application, the entire partition may be synchronized in another manner, which is not limited in the present disclosure.
- the primary OSD node of the Pn sends, according to the synchronization request sent by the failback OSD node each time, the data corresponding to the one or more keys carried in that request.
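- the choice between the two scenarios can be summarized as a range test on Seq IDs, sketched below with a Seq ID modeled as an (IO view ID, Seq NO) tuple; the exact boundary handling is an assumption where the text leaves it open.

```python
def plan_recovery(secondary_max, primary_min, primary_max) -> str:
    """Scenario 1: the secondary's largest Seq ID lies inside the
    primary's entry range, so the missing entries can be replayed
    incrementally. Scenario 2: it lies below the primary's smallest
    Seq ID, so the log no longer covers the gap and the entire
    partition must be synchronized."""
    if primary_min <= secondary_max <= primary_max:
        return "incremental"
    if secondary_max < primary_min:
        return "full-partition"
    return "needs-arbitration"   # secondary ahead of primary: not covered here

# Seq IDs compare lexicographically as (io_view_id, seq_no) tuples
assert plan_recovery((2, 7), (2, 3), (2, 9)) == "incremental"
assert plan_recovery((1, 4), (2, 0), (2, 9)) == "full-partition"
```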
- if an OSD node still cannot return to normal and rejoin a cluster after the failure lasts over a preset time threshold (for example, 5 minutes), or a hardware fault occurs on the OSD node, the faulty OSD node needs to be kicked out of the cluster, so as to ensure data reliability.
- Exiting from the cluster by the OSD node is a process of partition redistribution and data migration, where in partition redistribution, balance of each node and replica security need to be considered.
- IO processing in a data migration process is consistent with processing in a failback procedure and a data recovery procedure; after data migration is completed, a primary replica and a secondary replica reach a consistent state, and a process in which an MDC performs view updating and notification is consistent with view updating processing performed after a failback is completed.
- a process in which each related OSD node performs replication or forwarding processing on an IO request according to an updated view is also consistent with IO processing performed by each OSD node according to a latest view after a failback is completed.
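- the redistribution goals (per-node balance and replica security) can be illustrated by greedily reassigning each replica held by the exiting OSD to the least-loaded surviving OSD that does not already hold that partition group; the following sketch, its dictionary layout, and the assumption that such a survivor always exists are illustrative only, and the data migration itself is not modeled.

```python
def redistribute(view: dict, leaving: str, survivors: list) -> dict:
    """view maps partition-group id -> list of replica OSDs. Move every
    replica held by `leaving` to the least-loaded survivor that does
    not already hold a replica of that partition group."""
    load = {o: sum(o in reps for reps in view.values()) for o in survivors}
    for pg, replicas in view.items():
        if leaving in replicas:
            candidates = [o for o in survivors if o not in replicas]
            target = min(candidates, key=load.__getitem__)
            replicas[replicas.index(leaving)] = target
            load[target] += 1
    return view

# usage: osd1 exits; its replicas move to the least-loaded eligible nodes
view = {1: ["osd1", "osd2", "osd3"], 2: ["osd2", "osd1", "osd4"]}
print(redistribute(view, "osd1", ["osd2", "osd3", "osd4"]))
```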
- FIG. 11A and FIG. 11B show an embodiment of a procedure in which an OSD node exits from a cluster after a failure according to the present disclosure. This embodiment is executed by the MDC module, the IO routing module, and the OSD node mentioned in the embodiments described in FIG. 2A to FIG. 2C and FIG. 5.
- related OSD nodes are an OSD 1 node, an OSD 2 node, and an OSDn node, where the OSD 1 node is a primary OSD node of a partition group 1 (P 1 for short) and a secondary OSD node of a partition group 2 (P 2 for short), the OSD 2 node is a secondary OSD node of the P 1 , and the OSDn node is a primary OSD node of the P 2 .
- the processing procedure in which an OSD node exits after being faulty in this embodiment includes the following steps:
- the OSD 1 node becomes faulty.
- when the MDC module finds that the failure of the OSD 1 lasts over a predetermined threshold or that a hardware fault occurs on the OSD 1, the MDC kicks the OSD 1 out of a cluster, alters a view, and migrates the partitions (herein, a primary partition of the P 1 and a secondary partition of the P 2) on the OSD 1 node to other OSD nodes, for example, the OSD 2 node and the OSDn node in this embodiment.
- the MDC module notifies the OSD 2 node of view updating: the OSD 2 node is set as a primary node of the P 1 , and becomes a secondary node of the P 2 .
- the MDC module notifies the OSDn node of view updating: the OSDn node is a primary node of the P 2 , and becomes a secondary node of the P 1 .
- the OSD 2 node requests the OSDn node to synchronize data of the P 2 .
- because the OSDn node is the primary OSD node of the P 2, the OSD 2 node requests the OSDn node to synchronize the data of the primary partition of the P 2 on the OSDn, so that the data of the secondary partition of the P 2 on the OSD 2 is consistent with the data of the primary partition of the P 2 on the OSDn.
- a specific synchronization procedure is similar to a synchronization procedure of data of an entire partition in the foregoing data recovery procedure shown in FIG. 10A and FIG. 10B , and details are not described herein again.
- because the OSD 1 node has exited from the cluster, the OSDn node can synchronize the data of the P 1 only with the OSD 2, and the OSDn node requests the OSD 2 node to synchronize the data of the primary partition of the P 1 on the OSD 2, so that the data of the secondary partition of the P 1 on the OSDn is consistent with the data of the primary partition of the P 1 on the OSD 2.
- the OSDn node notifies the MDC module that data migration of the P 1 has been completed.
- the MDC module performs view updating according to a corresponding notification.
- Notify view updating: the partition status of the secondary OSDn node of the P 1 is consistent.
- Notify view updating: the partition status of the secondary OSD 2 node of the P 2 is consistent.
- the OSD 2 node and the OSDn node process, according to a latest view, replication of data corresponding to an IO request.
- IO processing in a data migration process is consistent with processing in a failback procedure and a data recovery procedure; after data migration is completed, a process in which an MDC performs view updating and notification is consistent with view updating processing performed after a failback is completed.
- a process in which each related OSD node processes, according to the updated view, replication of data corresponding to the IO request is also consistent with IO processing performed by each OSD node according to a latest view after a failback is completed.
- performing view updating after data migration is completed may include the following: (1) The new node that joins the cluster is still a secondary OSD node of some partition groups, a partition status is consistent, and an original secondary OSD of the partition group is no longer a secondary node of the partition group; (2) The new node that joins the cluster is set as a primary OSD node of some partition groups, and the partition group no longer belongs to an original primary OSD node (the partition group is no longer distributed on the original primary OSD).
- the following uses a specific embodiment for description.
- This embodiment is executed by the MDC module, the IO routing module, and the OSD node mentioned in the embodiments described in FIG. 2A to FIG. 2C and FIG. 5 .
- related OSD nodes are an OSD 1 node, an OSDn node, and a new OSD node that joins a cluster, where the OSD 1 is a primary OSD node of a partition group P 1 (P 1 for short) and a secondary OSD node of a partition group Pn (Pn for short), the OSDn is a primary OSD node of the Pn and a secondary OSD node of the P 1 , and the new OSD node that joins the cluster is a secondary OSD node of the P 1 and a secondary OSD node of the Pn.
- the procedure in which the new OSD node joins the cluster in this embodiment includes the following steps:
- the MDC module performs view updating, and migrates partitions on some OSD nodes to the new OSD node that joins the cluster.
- the MDC module migrates a secondary partition of the P 1 on the OSD 1 node and a secondary partition of the Pn on the OSDn node to the new OSD node that joins the cluster, so that the new OSD node that joins the cluster serves as a new secondary OSD node of the P 1 and a new secondary OSD node of the Pn.
- the MDC module notifies the OSD 1 node of view updating: the new OSD node that joins the cluster is added to a new partition view of the P 1 as the secondary OSD node of the P 1, and a corresponding partition status is "inconsistent" (because the new OSD node that joins the cluster and the OSD 1 node have not synchronized data of the P 1 yet).
- the MDC module notifies the OSDn node of view updating: the new OSD node that joins the cluster is added to a new partition view of the Pn as the secondary OSD node of the Pn, and a corresponding partition status is "inconsistent" (because the new OSD node that joins the cluster and the OSDn node have not synchronized data of the Pn yet).
- an initialization process is performed.
- the specific process is the same as the foregoing procedure of initial generation and acquisition of a cluster view in FIG. 6, and details are not described herein again.
- the MDC module returns, to the new OSD node that joins the cluster, a view of each partition on that node, that is, the IO view of P1 and the IO view of Pn in this embodiment.
- the new OSD node that joins the cluster requests the primary OSD1 node to synchronize partition data.
- specifically, the new OSD node requests, according to the IO view of P1 returned by the MDC module, the primary OSD node of P1, that is, the OSD1 node, to synchronize the data of P1, that is, the data of the primary partition of P1 on the OSD1 node, so that the data of the secondary partition of P1 on the new OSD node is consistent with the data of the primary partition of P1 on the OSD1 node.
- the specific synchronization procedure is similar to the synchronization of the data of an entire partition in the foregoing data recovery procedure shown in FIG. 10A and FIG. 10B, and details are not described herein again (a simplified sketch follows).
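The whole-partition synchronization can be pictured, in greatly simplified form, as a full copy of the primary partition's objects; the sketch below assumes in-process dictionaries in place of RPC between OSD nodes and omits the log-replay part of the procedure in FIG. 10A and FIG. 10B.

```python
def sync_partition(primary_store: dict, new_store: dict) -> None:
    """Copy every object of the primary partition so the new secondary's
    replica is identical before it can be marked "consistent"."""
    new_store.clear()
    new_store.update(primary_store)   # full-copy path for a brand-new replica

osd1_p1 = {"obj-1": b"a", "obj-2": b"b"}   # primary partition of P1 on OSD1
new_p1: dict = {}                          # empty secondary on the new OSD node
sync_partition(osd1_p1, new_p1)
assert new_p1 == osd1_p1
```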
- the new OSD node that joins the cluster requests the primary OSDn node to synchronize partition data.
- specifically, the new OSD node requests, according to the IO view of Pn returned by the MDC module, the primary OSD node of Pn, that is, the OSDn node, to synchronize the data of Pn, that is, the data of the primary partition of Pn on the OSDn node, so that the data of the secondary partition of Pn on the new OSD node is consistent with the data of the primary partition of Pn on the OSDn node.
- the specific synchronization procedure is likewise similar to the synchronization of the data of an entire partition in the foregoing data recovery procedure shown in FIG. 10A and FIG. 10B, and details are not described herein again.
- the OSD1 node notifies the MDC module that the new OSD node that joins the cluster has completed data synchronization of P1.
- the OSDn node notifies the MDC module that the new OSD node that joins the cluster has completed data synchronization of Pn.
- the MDC module performs view updating accordingly.
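A minimal sketch of this final view updating, assuming the MDC keeps one view record per partition group; the function name on_sync_complete and the view layout are hypothetical.

```python
# Hypothetical MDC-side state for this embodiment: the new OSD node is a
# still-inconsistent secondary of both P1 and Pn.
views = {
    "P1": {"version": 8, "primary": "OSD1",
           "secondaries": {"OSDn": "consistent", "OSDnew": "inconsistent"}},
    "Pn": {"version": 5, "primary": "OSDn",
           "secondaries": {"OSD1": "consistent", "OSDnew": "inconsistent"}},
}

def on_sync_complete(partition_group: str, synced_osd: str) -> None:
    """Flip the synced node's partition status to "consistent" and bump the
    view version; the MDC would then notify the primary OSD node and the
    IO routing modules of the updated view."""
    view = views[partition_group]
    view["secondaries"][synced_osd] = "consistent"
    view["version"] += 1

on_sync_complete("P1", "OSDnew")   # reported by the OSD1 node
on_sync_complete("Pn", "OSDnew")   # reported by the OSDn node
print(views)
```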
- the program may be stored in a computer-readable storage medium; the storage medium is, for example, a ROM/RAM, a magnetic disk, or an optical disc.
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2014/090445 WO2016070375A1 (en) | 2014-11-06 | 2014-11-06 | Distributed storage replication system and method |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2014/090445 Continuation WO2016070375A1 (en) | 2014-11-06 | 2014-11-06 | Distributed storage replication system and method |
Publications (2)
Publication Number | Publication Date |
---|---|
US20170242767A1 US20170242767A1 (en) | 2017-08-24 |
US10713134B2 US10713134B2 (en) | 2020-07-14
Family
ID=55908392
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US15/589,856 Active US10713134B2 (en) | 2014-11-06 | 2017-05-08 | Distributed storage and replication system and method |
Country Status (7)
Country | Link |
---|---|
US (1) | US10713134B2 (en) |
EP (1) | EP3159794B1 (en) |
JP (1) | JP6382454B2 (en) |
CN (1) | CN106062717B (en) |
BR (1) | BR112016030547B1 (en) |
SG (1) | SG11201703220SA (en) |
WO (1) | WO2016070375A1 (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10997130B2 (en) * | 2019-01-31 | 2021-05-04 | Rubrik, Inc. | Systems and methods for node consistency in a clustered database |
US11016952B2 (en) | 2019-01-31 | 2021-05-25 | Rubrik, Inc. | Systems and methods to process a topology change in a clustered database |
US20210278983A1 (en) * | 2018-10-25 | 2021-09-09 | Huawei Technologies Co., Ltd. | Node Capacity Expansion Method in Storage System and Storage System |
US11223681B2 (en) * | 2020-04-10 | 2022-01-11 | Netapp, Inc. | Updating no sync technique for ensuring continuous storage service in event of degraded cluster state |
US11514024B2 (en) | 2019-01-31 | 2022-11-29 | Rubrik, Inc. | Systems and methods for shard consistency in a clustered database |
Families Citing this family (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9875259B2 (en) | 2014-07-22 | 2018-01-23 | Oracle International Corporation | Distribution of an object in volatile memory across a multi-node cluster |
US10002148B2 (en) * | 2014-07-22 | 2018-06-19 | Oracle International Corporation | Memory-aware joins based in a database cluster |
US10275184B2 (en) | 2014-07-22 | 2019-04-30 | Oracle International Corporation | Framework for volatile memory query execution in a multi node cluster |
US10073885B2 (en) | 2015-05-29 | 2018-09-11 | Oracle International Corporation | Optimizer statistics and cost model for in-memory tables |
US10567500B1 (en) * | 2015-12-21 | 2020-02-18 | Amazon Technologies, Inc. | Continuous backup of data in a distributed data store |
JP2018073231A (en) * | 2016-11-01 | 2018-05-10 | 富士通株式会社 | Storage system and storage device |
JP6279780B1 (en) * | 2017-02-20 | 2018-02-14 | 株式会社東芝 | Asynchronous remote replication system for distributed storage and asynchronous remote replication method for distributed storage |
JP6724252B2 (en) * | 2017-04-14 | 2020-07-15 | 華為技術有限公司Huawei Technologies Co.,Ltd. | Data processing method, storage system and switching device |
CN107046575B (en) * | 2017-04-18 | 2019-07-12 | 南京卓盛云信息科技有限公司 | A kind of high density storage method for cloud storage system |
CN107678918B (en) * | 2017-09-26 | 2021-06-29 | 郑州云海信息技术有限公司 | OSD heartbeat mechanism setting method and device of distributed file system |
CN107832164A (en) * | 2017-11-20 | 2018-03-23 | 郑州云海信息技术有限公司 | A kind of method and device of the faulty hard disk processing based on Ceph |
CN108235751B (en) * | 2017-12-18 | 2020-04-14 | 华为技术有限公司 | Method and device for identifying sub-health of object storage equipment and data storage system |
CN109995813B (en) * | 2017-12-29 | 2021-02-26 | 华为技术有限公司 | Partition expansion method, data storage method and device |
CN110096220B (en) | 2018-01-31 | 2020-06-26 | 华为技术有限公司 | Distributed storage system, data processing method and storage node |
CN110515535B (en) * | 2018-05-22 | 2021-01-01 | 杭州海康威视数字技术股份有限公司 | Hard disk read-write control method and device, electronic equipment and storage medium |
CN108845772B (en) * | 2018-07-11 | 2021-06-29 | 郑州云海信息技术有限公司 | Hard disk fault processing method, system, equipment and computer storage medium |
CN110874382B (en) * | 2018-08-29 | 2023-07-04 | 阿里云计算有限公司 | Data writing method, device and equipment thereof |
CN109144788B (en) * | 2018-09-10 | 2021-10-22 | 网宿科技股份有限公司 | Method, device and system for reconstructing OSD |
CN109144789B (en) * | 2018-09-10 | 2020-12-29 | 网宿科技股份有限公司 | Method, device and system for restarting OSD |
CN109189738A (en) * | 2018-09-18 | 2019-01-11 | 郑州云海信息技术有限公司 | Choosing method, the apparatus and system of main OSD in a kind of distributed file system |
CN109558437B (en) * | 2018-11-16 | 2021-01-01 | 新华三技术有限公司成都分公司 | Main OSD (on-screen display) adjusting method and device |
CN111435331B (en) * | 2019-01-14 | 2022-08-26 | 杭州宏杉科技股份有限公司 | Data writing method and device for storage volume, electronic equipment and machine-readable storage medium |
CN111510338B (en) * | 2020-03-09 | 2022-04-26 | 苏州浪潮智能科技有限公司 | Distributed block storage network sub-health test method, device and storage medium |
CN112596935B (en) * | 2020-11-16 | 2022-08-30 | 新华三大数据技术有限公司 | OSD (on-screen display) fault processing method and device |
CN112819592B (en) * | 2021-04-16 | 2021-08-03 | 深圳华锐金融技术股份有限公司 | Service request processing method, system, computer equipment and storage medium |
CN115480798B (en) * | 2021-06-15 | 2023-06-16 | 荣耀终端有限公司 | Operating system upgrade method, device, storage medium and computer program product |
CN113254277B (en) * | 2021-06-15 | 2021-11-02 | 云宏信息科技股份有限公司 | Storage cluster OSD fault repairing method, storage medium, monitor and storage cluster |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP4519573B2 (en) * | 2004-08-27 | 2010-08-04 | 株式会社日立製作所 | Data processing system and method |
US7917469B2 (en) * | 2006-11-08 | 2011-03-29 | Hitachi Data Systems Corporation | Fast primary cluster recovery |
2014
- 2014-11-06 SG SG11201703220SA patent/SG11201703220SA/en unknown
- 2014-11-06 EP EP14905449.6A patent/EP3159794B1/en active Active
- 2014-11-06 BR BR112016030547-7A patent/BR112016030547B1/en active IP Right Grant
- 2014-11-06 CN CN201480040590.9A patent/CN106062717B/en active Active
- 2014-11-06 WO PCT/CN2014/090445 patent/WO2016070375A1/en active Application Filing
- 2014-11-06 JP JP2017539482A patent/JP6382454B2/en active Active
2017
- 2017-05-08 US US15/589,856 patent/US10713134B2/en active Active
Patent Citations (31)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5790788A (en) * | 1996-07-23 | 1998-08-04 | International Business Machines Corporation | Managing group events by a name server for a group of processors in a distributed computing environment |
US6065065A (en) * | 1997-01-30 | 2000-05-16 | Fujitsu Limited | Parallel computer system and file processing method using multiple I/O nodes |
US6308300B1 (en) * | 1999-06-04 | 2001-10-23 | Rutgers University | Test generation for analog circuits using partitioning and inverted system simulation |
CN1474275A (en) | 2002-08-06 | 2004-02-11 | 中国科学院计算技术研究所 | System of intellignent network storage device based on virtual storage |
US20120131272A1 (en) | 2004-08-27 | 2012-05-24 | Hitachi, Ltd. | Data Processing System and Storage Subsystem Provided in Data Processing System |
US20060182050A1 (en) | 2005-01-28 | 2006-08-17 | Hewlett-Packard Development Company, L.P. | Storage replication system with data tracking |
US20110167113A1 (en) * | 2008-10-07 | 2011-07-07 | Huazhong University Of Science And Technology | Method for managing object-based storage system |
US8644188B1 (en) * | 2009-06-25 | 2014-02-04 | Amazon Technologies, Inc. | Providing virtual networking functionality for managed computer networks |
US20110099420A1 (en) * | 2009-10-26 | 2011-04-28 | Macdonald Mcalister Grant Alexander | Failover and recovery for replicated data instances |
US20110106757A1 (en) | 2009-10-30 | 2011-05-05 | Pickney David B | Fixed content storage within a partitioned content platform, with replication |
CN101751284A (en) | 2009-12-25 | 2010-06-23 | 北京航空航天大学 | I/O resource scheduling method for distributed virtual machine monitor |
WO2011100366A2 (en) | 2010-02-09 | 2011-08-18 | Google Inc. | System and method for managing replicas of objects in a distributed storage system |
US20110313973A1 (en) | 2010-06-19 | 2011-12-22 | Srivas Mandayam C | Map-Reduce Ready Distributed File System |
CN102025550A (en) | 2010-12-20 | 2011-04-20 | 中兴通讯股份有限公司 | System and method for managing data in distributed cluster |
US20120166390A1 (en) | 2010-12-23 | 2012-06-28 | Dwight Merriman | Method and apparatus for maintaining replica sets |
US20130290249A1 (en) | 2010-12-23 | 2013-10-31 | Dwight Merriman | Large distributed database clustering systems and methods |
US20120179798A1 (en) | 2011-01-11 | 2012-07-12 | Ibm Corporation | Autonomous primary node election within a virtual input/output server cluster |
US20120246517A1 (en) | 2011-03-24 | 2012-09-27 | Ibm Corporation | Providing first field data capture in a virtual input/output server (vios) cluster environment with cluster-aware vioses |
US8713282B1 (en) * | 2011-03-31 | 2014-04-29 | Emc Corporation | Large scale data storage system with fault tolerance |
US20140032496A1 (en) | 2011-04-13 | 2014-01-30 | Hitachi, Ltd. | Information storage system and data replication method thereof |
US20150143376A1 (en) * | 2011-06-08 | 2015-05-21 | Workday, Inc. | System for error checking of process definitions for batch processes |
US20130029024A1 (en) | 2011-07-25 | 2013-01-31 | David Warren | Barbeque stove |
CN102355369A (en) | 2011-09-27 | 2012-02-15 | 华为技术有限公司 | Virtual clustered system as well as processing method and processing device thereof |
US20140351636A1 (en) * | 2012-02-09 | 2014-11-27 | Huawei Technologies Co., Ltd. | Method, device, and system for data reconstruction |
CN102571452A (en) | 2012-02-20 | 2012-07-11 | 华为技术有限公司 | Multi-node management method and system |
CN103294675A (en) | 2012-02-23 | 2013-09-11 | 上海盛霄云计算技术有限公司 | Method and device for updating data in distributed storage system |
CN102724057A (en) | 2012-02-23 | 2012-10-10 | 北京市计算中心 | Distributed hierarchical autonomous management method facing cloud calculating platform |
US20140136800A1 (en) | 2012-11-13 | 2014-05-15 | International Business Machines Corporation | Dynamically improving memory affinity of logical partitions |
CN103810047A (en) | 2012-11-13 | 2014-05-21 | 国际商业机器公司 | Dynamically improving memory affinity of logical partitions |
CN103051691A (en) | 2012-12-12 | 2013-04-17 | 华为技术有限公司 | Subarea distribution method, device and distributed type storage system |
US20150378775A1 (en) * | 2014-06-26 | 2015-12-31 | Amazon Technologies, Inc. | Log-based transaction constraint management |
Non-Patent Citations (2)
Title |
---|
Philip A. Bernstein, Istvan Cseri, Nishant Dani, Nigel Ellis, Ajay Kalhan, Gopal Kakivaya, David B. Lomet, Ramesh Manne, et al., "Adapting Microsoft SQL Server for Cloud Computing", Data Engineering (ICDE), 2011 IEEE 27th International Conference on, Apr. 11, 2011, pp. 1255-1263, XP031868529, ISBN: 978-1-4244-8959-6, DOI: 10.1109/ICDE.2011.5767935 |
Philip A. Bernstein et al., "Adapting Microsoft SQL Server for Cloud Computing", Data Engineering (ICDE), 2011 IEEE 27th International Conference, Apr. 11, 2011, XP031868529, 9 pages. |
Also Published As
Publication number | Publication date |
---|---|
BR112016030547A8 (en) | 2022-07-12 |
JP2017534133A (en) | 2017-11-16 |
CN106062717B (en) | 2019-05-03 |
US20170242767A1 (en) | 2017-08-24 |
WO2016070375A1 (en) | 2016-05-12 |
EP3159794A1 (en) | 2017-04-26 |
BR112016030547A2 (en) | 2017-05-22 |
CN106062717A (en) | 2016-10-26 |
EP3159794A4 (en) | 2017-10-25 |
SG11201703220SA (en) | 2017-05-30 |
JP6382454B2 (en) | 2018-08-29 |
BR112016030547B1 (en) | 2022-11-16 |
EP3159794B1 (en) | 2019-01-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10713134B2 (en) | Distributed storage and replication system and method | |
US11360854B2 (en) | Storage cluster configuration change method, storage cluster, and computer system | |
US10693958B2 (en) | System and method for adding node in blockchain network | |
US11265216B2 (en) | Communicating state information in distributed operating systems | |
US10740325B2 (en) | System and method for deleting node in blockchain network | |
CA2938768C (en) | Geographically-distributed file system using coordinated namespace replication | |
JP6491210B2 (en) | System and method for supporting persistent partition recovery in a distributed data grid | |
CN113010496B (en) | Data migration method, device, equipment and storage medium | |
US11709743B2 (en) | Methods and systems for a non-disruptive automatic unplanned failover from a primary copy of data at a primary storage system to a mirror copy of the data at a cross-site secondary storage system | |
JP2019219954A (en) | Cluster storage system, data management control method, and data management control program | |
US10320905B2 (en) | Highly available network filer super cluster | |
WO2012071920A1 (en) | Method, system, token conreoller and memory database for implementing distribute-type main memory database system | |
US12050558B2 (en) | Facilitating immediate performance of volume resynchronization with the use of passive cache entries | |
WO2016180005A1 (en) | Method for processing virtual machine cluster and computer system | |
CN105323271B (en) | Cloud computing system and processing method and device thereof | |
WO2022130005A1 (en) | Granular replica healing for distributed databases |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: HUAWEI TECHNOLOGIES CO., LTD., CHINA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:WANG, DAOHUI;ZHANG, FENG;LIU, XUYOU;SIGNING DATES FROM 20170613 TO 20170619;REEL/FRAME:042883/0923 |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED |
|
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
|
CC | Certificate of correction | ||
AS | Assignment |
Owner name: HUAWEI CLOUD COMPUTING TECHNOLOGIES CO., LTD., CHINA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:HUAWEI TECHNOLOGIES CO., LTD.;REEL/FRAME:059267/0088 Effective date: 20220224 |
|
MAFP | Maintenance fee payment |
Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY Year of fee payment: 4 |