WO2016070375A1 - Distributed storage replication system and method - Google Patents

Distributed storage replication system and method

Info

Publication number
WO2016070375A1
Authority
WO
WIPO (PCT)
Prior art keywords
partition
view
osd
osd node
node
Application number
PCT/CN2014/090445
Other languages
English (en)
French (fr)
Inventor
王道辉
张烽
刘叙友
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to BR112016030547-7A (BR112016030547B1)
Priority to JP2017539482A (JP6382454B2)
Priority to CN201480040590.9A (CN106062717B)
Priority to EP14905449.6A (EP3159794B1)
Priority to PCT/CN2014/090445
Priority to SG11201703220SA
Publication of WO2016070375A1
Priority to US15/589,856 (US10713134B2)

Classifications

    • G06F11/2064: Persistent mass storage redundancy by mirroring while ensuring consistency
    • G06F11/2028: Failover techniques eliminating a faulty processor or activating a spare
    • G06F11/2041: Redundant processing with more than one idle spare processing component
    • G06F11/2048: Redundant processing where the redundant components share neither address space nor persistent storage
    • G06F11/2094: Redundant storage or storage space
    • G06F12/02: Addressing or allocation; Relocation
    • G06F16/273: Asynchronous replication or reconciliation
    • G06F9/466: Transaction processing
    • G06F9/5077: Logical partitioning of resources; Management or configuration of virtualized resources
    • G06F11/0709: Error or fault processing not based on redundancy, in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • G06F11/0757: Error or fault detection not based on redundancy by exceeding a time limit, i.e. time-out, e.g. watchdogs

Definitions

  • The present invention relates to the field of information technology (IT), and in particular to a distributed storage replication system and method.
  • Distributed storage systems generally organize a large number of storage nodes into a distributed system and ensure data reliability by replicating data across different nodes, so that each piece of data has copies on different storage nodes. How to keep multiple copies of the same data consistent is a long-standing problem for distributed storage systems, and the performance and availability of the system are increasingly important considerations when ensuring data consistency.
  • Figure 1 shows the existing two-phase commit protocol (Two-Phase Commit, 2PC), a typical centralized strong-consistency replica control protocol. Many distributed database systems use this protocol to ensure the consistency of replicas.
  • In a two-phase commit protocol, the system generally contains two types of nodes: a coordinator and participants.
  • The coordinator is responsible for initiating a vote on a data update and for informing the participants of the voting decision to be executed; the participants vote on the data update and execute the decision resulting from the vote.
  • The two-phase commit protocol consists of two phases. Phase 1, the request phase: the coordinator notifies the participants to vote on the modification of the data, and each participant informs the coordinator of its own voting result, consent or rejection. Phase 2, the commit phase: the coordinator makes a decision based on the results of the first-phase vote, execution or cancellation.
  • In the two-phase commit protocol, the coordinator and each participant need at least two rounds of interaction, i.e. four messages, and this excessive interaction degrades performance. In addition, if a node in the two-phase commit protocol fails or remains unresponsive, other IO requests will block, eventually causing a timeout failure and requiring data rollback; fault tolerance and availability are therefore low.
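  • For illustration only, the following is a minimal sketch of the two-phase commit message flow described above; the class and method names are assumptions for the sketch, not part of the patent.

```python
# Minimal two-phase commit sketch (illustrative only; names are assumptions).
class Participant:
    def vote(self, update):
        # Phase 1: return True (consent) or False (rejection) for the proposed update.
        return self.can_apply(update)

    def can_apply(self, update):
        return True  # placeholder acceptance rule

    def commit(self, update):
        pass  # apply the update durably

    def abort(self, update):
        pass  # roll back any prepared state


class Coordinator:
    def __init__(self, participants):
        self.participants = participants

    def replicate(self, update):
        # Phase 1 (request phase): ask every participant to vote.
        votes = [p.vote(update) for p in self.participants]
        # Phase 2 (commit phase): decide based on the votes.
        if all(votes):
            for p in self.participants:
                p.commit(update)
            return "committed"
        for p in self.participants:
            p.abort(update)
        return "aborted"
```

  • Even in the failure-free case, each write costs at least two message rounds (four messages) per participant, and a failed or unresponsive node blocks other requests until a timeout, which is exactly the performance and availability drawback noted above.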
  • Embodiments of the present invention provide a distributed storage replication system and a method for managing data storage and replication in a distributed storage system, to solve the problems of low performance and low availability in existing consistent replication protocols.
  • A first aspect provides a distributed storage replication system, the system comprising at least one metadata control (MDC) module, a plurality of IO routing modules, and a plurality of object storage device (OSD) nodes. The MDC module is configured to configure, for each OSD node, at least one logical partition corresponding to the physical storage resources managed by that OSD node, the at least one partition being a primary partition, a standby partition, or any combination of primary and standby partitions; a primary partition and the standby partition corresponding to the primary partition constitute a partition group, the primary partition and the standby partition in the same partition group are located on different OSD nodes, and the OSD node where the primary partition is located is the primary OSD node of the partition group in which the primary partition is located.
  • The OSD node where the standby partition is located is the standby OSD node of the partition group in which the standby partition is located. The MDC module generates a partition view according to the partitions, the partition view including OSD information of each partition in a partition group; the IO routing module is configured to route received IO requests to the OSD nodes; and the OSD nodes are configured to execute IO requests and store the corresponding data.
  • The MDC module is further configured to: when it determines that an OSD node in the system is a faulty OSD node, determine the partitions on the faulty OSD node, update the partition view of the partition group in which each partition on the faulty OSD node is located, and send an update notification to the primary OSD node in the updated partition view; after receiving the update notification sent by the MDC module, the primary OSD node processes replication of the data corresponding to IO requests according to the updated partition view.
  • The partition view specifically includes the active/standby identity and the corresponding partition state of the OSD node on which each partition in a partition group is located. The primary OSD node is further configured to update the partition view saved locally by the primary OSD node according to the updated partition view. Processing replication of the data corresponding to the IO request according to the updated partition view specifically includes: according to the updated locally saved partition view, replicating the data corresponding to the IO request from the IO routing module to the standby OSD node whose partition state is consistent in the updated locally saved partition view, or to both the standby OSD node whose partition state is consistent in the updated locally saved partition view and the standby OSD node whose partition state is inconsistent in the updated locally saved partition view but whose data is being restored.
  • The MDC module is further configured to generate an IO view, where the IO view includes an identifier of the primary OSD node of a partition group, and to send the IO view to the IO routing module and to the OSD nodes on which the partitions in the partition view are located; the primary OSD node is further configured to update the IO view saved locally by the primary OSD node according to the updated partition view, and to process replication of the data corresponding to the IO request according to the updated locally saved IO view.
  • The MDC module is further configured to: when it determines that a partition on the faulty OSD node includes a primary partition, update the IO view of the partition group in which the primary partition is located, and notify the updated IO view to the standby OSD nodes in the updated partition view; the standby OSD nodes in the updated partition view are configured to update their locally saved IO view according to the updated IO view and to process replication of the data corresponding to the IO request according to the updated locally saved IO view.
  • Updating the partition view of the partition group in which a partition on the faulty OSD node is located specifically includes: when the partition on the faulty OSD node includes a standby partition, marking the partition state of the faulty OSD node in the partition view of the partition group in which that standby partition is located as inconsistent; and when the partition on the faulty OSD node includes a primary partition, demoting the faulty OSD node, which acts as the primary OSD node in the partition view of the partition group in which that primary partition is located, to a new standby OSD node, marking the partition state corresponding to the new standby OSD node as inconsistent, and selecting, from the original standby OSD nodes in the partition view of the partition group in which that primary partition is located, a standby OSD node whose partition state is consistent as the new primary OSD node.
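  • A minimal sketch of the partition-view update rule just described, assuming the view is held as a per-group list of replica records; the data structure and names are assumptions for the sketch.

```python
# Sketch of the partition-view update rule on OSD failure (names are assumptions).
from dataclasses import dataclass

@dataclass
class Replica:
    osd_id: str
    role: str      # "primary" or "standby"
    state: str     # "consistent" or "inconsistent"

def handle_osd_failure(partition_view, faulty_osd_id):
    """partition_view: list[Replica] for one partition group."""
    faulty = next(r for r in partition_view if r.osd_id == faulty_osd_id)
    if faulty.role == "standby":
        # Standby partition on the faulty node: just mark its state inconsistent.
        faulty.state = "inconsistent"
    else:
        # Primary partition on the faulty node: demote it and mark it inconsistent,
        # then promote a standby whose partition state is consistent
        # (assumes at least one such standby exists).
        faulty.role, faulty.state = "standby", "inconsistent"
        new_primary = next(r for r in partition_view
                           if r.role == "standby" and r.state == "consistent")
        new_primary.role = "primary"
    return partition_view
```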
  • A second aspect provides a distributed storage replication system, the system comprising at least one metadata control (MDC) module, a plurality of IO routing modules, and a plurality of object storage device (OSD) nodes. The MDC module is configured to configure, for each OSD node, at least one logical partition corresponding to the physical storage resources managed by that OSD node, the at least one partition being a primary partition, a standby partition, or any combination of primary and standby partitions; a primary partition and the standby partition corresponding to the primary partition constitute a partition group, and the primary partition and the standby partition in the same partition group are located on different OSD nodes.
  • The OSD node where the primary partition is located is the primary OSD node of the partition group in which the primary partition is located, and the OSD node where the standby partition is located is the standby OSD node of the partition group in which the standby partition is located. A partition view and an IO view are generated according to the partitions; the partition view includes OSD information of each partition in a partition group, and the IO view includes an identifier of the primary OSD of a partition group.
  • The IO routing module is configured to route received IO requests to the OSD nodes, and the OSD nodes are configured to store IO data according to the IO requests. Specifically, the IO routing module is configured to receive an IO request, the IO request including a key, determine, according to the key, the partition group to which the data corresponding to the IO request belongs and the primary OSD node of that partition group, add the IO view version information of the IO view of that partition group to the IO request, and send the IO request carrying the IO view version information to the determined primary OSD node. The primary OSD node is configured to receive the IO request and, after determining according to the IO view version information that the IO view version in the IO request is consistent with the IO view version saved locally by the primary OSD node, execute the IO request, generate a replication request carrying the IO view version information, and send the replication request to the standby OSD node of the partition group to which the data belongs.
  • The primary OSD node is further configured to: return an error to the IO routing module after determining, according to the IO view version information, that the IO view version in the IO request is older than the IO view version saved locally by the primary OSD node; and, after determining that the IO view version in the IO request is newer than the IO view version saved locally by the primary OSD node, add the IO request to a buffer queue and query the MDC module for the IO view version information of the IO view of the partition group to which the data belongs, executing the IO request once the IO view version saved locally by the primary OSD node is consistent with the IO view version in the IO request. The IO routing module is configured to: after receiving the error returned by the primary OSD node, query the MDC module for the IO view of the partition group to which the data belongs and, after obtaining the updated IO view version information, send the IO request carrying the updated IO view version information.
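  • The version comparison performed by the primary OSD node on an incoming IO request can be sketched as follows; the object attributes and method names are assumptions, and versions are assumed to be integers.

```python
# Sketch of the primary OSD's IO-view version check (illustrative names).
def handle_io_request(primary, io_request):
    local_ver = primary.io_view_version
    req_ver = io_request.io_view_version
    if req_ver == local_ver:
        primary.execute(io_request)              # versions match: execute and replicate
        primary.replicate_to_standbys(io_request)
        return "OK"
    if req_ver < local_ver:
        return "ERR_STALE_VIEW"                  # routing module must re-query the MDC
    # Request carries a newer view than the primary knows about:
    primary.buffer_queue.append(io_request)      # buffer it and
    primary.query_mdc_for_io_view()              # catch up with the MDC before executing
    return "BUFFERED"
```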
  • The IO view version information includes an IO view version number.
  • The primary OSD node further generates a sequence identifier for the IO request, the sequence identifier including the IO view version number and a sequence number, and adds the sequence identifier to the replication request sent to the standby OSD node; the sequence number represents a consecutive numbering of the modification operations performed, within one IO view version, on the data corresponding to the partition group in the IO view. The standby OSD node is further configured to execute the replication request according to the sequence identifier.
  • The replication request further carries the sequence identifier of the last replication request sent by the primary OSD node for the partition group; the standby OSD node is configured to execute the replication request after receiving it and determining that the sequence identifier of that last replication request is consistent with the largest sequence identifier saved locally by the standby OSD node.
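  • A sketch of the sequence identifier (IO view version number plus per-view sequence number) and of the standby's gap check described above; field and attribute names are assumptions.

```python
# Sequence-identifier ordering check at a standby OSD (illustrative sketch).
from dataclasses import dataclass

@dataclass(order=True)
class SeqId:
    io_view_version: int   # IO view version number
    seq_no: int            # consecutive modification count within that IO view version

def apply_replication_request(standby, request):
    # The request carries its own SeqId plus the SeqId of the previous request
    # sent by the primary for this partition group.
    if request.prev_seq_id == standby.max_seq_id:
        standby.execute(request)
        standby.max_seq_id = request.seq_id
        return True
    return False  # out of order or a gap: do not apply yet
```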
  • The partition view specifically includes the active/standby identity and the corresponding partition state of the OSD node on which each partition in the partition group is located. The MDC module is further configured to: when it detects during IO request processing that the primary OSD node has failed, demote the primary OSD node in the partition view of the partition group to which the data belongs to a new standby OSD node, mark the partition state of that new standby OSD node as inconsistent, promote one of the standby OSD nodes in the partition view of the partition group to which the data belongs to be the new primary OSD node, and notify the updated IO view of the partition group to which the data belongs to the IO routing module. The IO routing module is further configured to receive the updated IO view of the partition group sent by the MDC module and to send the IO request to the new primary OSD node according to the updated IO view of the partition group. The new primary OSD node is configured to receive the IO request, execute it, generate a second replication request, and send the second replication request to the standby OSD nodes whose partition state is consistent in the updated partition view of the partition group to which the data belongs.
  • The partition view specifically includes the active/standby identity and the corresponding partition state of the OSD node on which each partition in the partition group is located. The MDC module is further configured to: when it detects during IO request processing that any standby OSD node has failed, mark the partition state of that standby OSD node in the partition view of the partition group to which the data belongs as inconsistent, and notify the updated partition view of the partition group to which the data belongs to the primary OSD node. The primary OSD node is configured to, after receiving the updated partition view of the partition group to which the data belongs, send the replication request to the standby OSD nodes whose partition state is consistent in the updated partition view and not send the replication request to the standby OSD nodes whose partition state is inconsistent.
  • A third aspect provides a distributed storage replication system, the system comprising at least one metadata control (MDC) module, a plurality of IO routing modules, and a plurality of object storage device (OSD) nodes. The MDC module is configured to configure, for each OSD node, at least one logical partition corresponding to the physical storage resources managed by that OSD node, the at least one partition being a primary partition, a standby partition, or any combination of primary and standby partitions; a primary partition and the standby partition corresponding to the primary partition constitute a partition group, the primary partition and the standby partition in the same partition group are located on different OSD nodes, and the OSD node where the primary partition is located is the primary OSD node of the partition group in which the primary partition is located.
  • The OSD node where the standby partition is located is the standby OSD node of the partition group in which the standby partition is located; a partition view and an IO view are generated according to the partitions, the partition view includes OSD information of each partition in a partition group, and the IO view includes an identifier of the primary OSD of a partition group. The IO routing module is configured to route received IO requests to the OSD nodes, and the OSD nodes are configured to store the data corresponding to the IO requests. An OSD node is configured to, after recovering from its own failure, send a query request to the MDC module to request the IO view of the partition group in which a partition on that OSD node is located; this OSD node is called the fault-recovery OSD node, and the query request carries the OSD identifier of the fault-recovery OSD node. The fault-recovery OSD node receives the IO view returned by the MDC module, initiates a data recovery request to the primary OSD node in the IO view to request recovery of the data updated during its failure, receives the data updated during the failure sent by the primary OSD node, and processes replication of IO requests according to the partition view of the partition group updated by the MDC module after the data recovery of the fault-recovery OSD node is completed.
  • The MDC module is configured to receive the query request of the fault-recovery OSD node, return the IO view to the fault-recovery OSD node according to the OSD identifier in the query request, and update the partition view of the partition group after the data recovery of the fault-recovery OSD node is completed.
  • The primary OSD node is configured to receive the data recovery request of the fault-recovery OSD node, send the data updated during the failure to the fault-recovery OSD node, and process replication of the IO request according to the partition view of the partition group updated by the MDC module after the data recovery of the fault-recovery OSD node is completed. The primary OSD node is further configured to: after receiving the data recovery request, receive an IO request sent by the IO routing module for a partition on the fault-recovery OSD node, execute the IO request, and send to the fault-recovery OSD node a replication request carrying IO key information and the data corresponding to the IO request. The fault-recovery OSD node writes the replication request carrying the IO key information and the data corresponding to the IO request into a log and, after the data recovery is completed, updates the data corresponding to the IO request into the physical storage resources managed by the fault-recovery OSD node according to the records of the log.
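  • A sketch of the log-then-replay behaviour of the fault-recovery OSD node described above: replication requests arriving during catch-up are only logged, and the log is replayed into the managed storage once recovery completes. Class and method names are assumptions.

```python
# Log-then-replay at a fault-recovery OSD (illustrative sketch).
class RecoveringOSD:
    def __init__(self):
        self.log = []            # replication requests received during recovery
        self.recovering = True

    def on_replication_request(self, key_info, data):
        if self.recovering:
            # During catch-up, only record the request; do not touch the store yet.
            self.log.append((key_info, data))
        else:
            self.write_to_storage(key_info, data)

    def finish_recovery(self):
        # Replay the logged requests into the managed physical storage, in order.
        for key_info, data in self.log:
            self.write_to_storage(key_info, data)
        self.log.clear()
        self.recovering = False

    def write_to_storage(self, key_info, data):
        pass  # placeholder for the actual persistence path
```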
  • The data recovery request carries the largest sequence identifier locally recorded by the fault-recovery OSD node for IO operations on the partition on the fault-recovery OSD node; the largest sequence identifier consists of the latest IO view version number of the IO view of the partition group in which the partition on the fault-recovery OSD node is located and the largest number, within the IO view corresponding to that latest IO view version number, of the modification operations on the data corresponding to the partition. Sending the data updated during the failure to the fault-recovery OSD node includes: determining that the largest sequence identifier in the data recovery request is greater than or equal to the current smallest sequence identifier stored locally by the primary OSD node, sending the entries missed by the fault-recovery OSD node during its failure to the fault-recovery OSD node, receiving the data recovery request initiated by the fault-recovery OSD node according to those entries, and sending the data corresponding to those entries to the fault-recovery OSD node.
  • The data recovery request carries the largest sequence identifier locally recorded by the fault-recovery OSD node for IO operations on the partition on the fault-recovery OSD node; the largest sequence identifier includes the latest IO view version number of the IO view of the partition group in which the partition on the fault-recovery OSD node is located and the largest number, within the IO view corresponding to that latest IO view version number, of the modification operations on the data corresponding to the partition. Sending the data updated during the failure to the fault-recovery OSD node includes: determining that the largest sequence identifier in the data recovery request is smaller than the current smallest sequence identifier stored locally by the primary OSD node, sending the current smallest sequence identifier stored locally by the primary OSD node to the fault-recovery OSD node, and receiving the synchronization request initiated by the fault-recovery OSD node. The smallest sequence identifier consists of the smallest IO view version number of the IO view of the partition group saved by the primary OSD node and the smallest number, within the IO view corresponding to that smallest IO view version number, of the modification operations on the data corresponding to the partition.
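  • A sketch of the primary-side choice between the two recovery paths described above, driven by comparing the recovering node's largest sequence identifier with the smallest one the primary still retains; the attribute names are assumptions.

```python
# Primary-side choice between incremental and full recovery (illustrative sketch).
def handle_data_recovery_request(primary, recovering_max_seq):
    if recovering_max_seq >= primary.min_retained_seq:
        # The primary still retains every entry the recovering node missed:
        # send the missing entries so the standby can then fetch their data.
        missing = [e for e in primary.entries if e.seq_id > recovering_max_seq]
        return ("incremental", missing)
    # The gap is older than anything the primary retains: return the smallest
    # retained sequence identifier so the recovering node falls back to a
    # full synchronization of the partition data.
    return ("full_sync", primary.min_retained_seq)
```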
  • A fourth aspect provides a distributed storage replication system, the system comprising at least one metadata control (MDC) module, a plurality of IO routing modules, and a plurality of object storage device (OSD) nodes. The MDC module is configured to configure, for each OSD node, at least one logical partition corresponding to the physical storage resources managed by that OSD node, the at least one partition being a primary partition, a standby partition, or any combination of primary and standby partitions.
  • The OSD node where the primary partition is located is the primary OSD node of the partition group in which the primary partition is located, and the OSD node where the standby partition is located is the standby OSD node of the partition group in which the standby partition is located. A partition view is generated according to the partitions, and the partition view includes OSD information of each partition in a partition group. The IO routing module is configured to route received IO requests to the OSD nodes, and the OSD nodes are configured to store the data corresponding to the IO requests. The system includes a memory and a processor; the memory is configured to store computer-readable instructions for performing the functions of the MDC module, the IO routing modules, and the OSD nodes; the processor is coupled to the memory and reads the instructions in the memory, which cause the processor to perform the following operations: determining that an OSD node in the system is a faulty OSD node, determining the partitions on the faulty OSD node, updating the partition view of the partition group in which each partition on the faulty OSD node is located, and sending an update notification to the primary OSD node in the updated partition view, so that the primary OSD node processes replication of the data corresponding to IO requests according to the updated partition view.
  • A fifth aspect provides a distributed storage replication system, the system comprising at least one metadata control (MDC) module, a plurality of IO routing modules, and a plurality of object storage device (OSD) nodes. The MDC module is configured to configure, for each OSD node, at least one logical partition corresponding to the physical storage resources managed by that OSD node, the at least one partition being a primary partition, a standby partition, or any combination of primary and standby partitions; a primary partition and the standby partition corresponding to the primary partition constitute a partition group, the primary partition and the standby partition in the same partition group are located on different OSD nodes, and the OSD node where the primary partition is located is the primary OSD node of the partition group in which the primary partition is located.
  • The OSD node where the standby partition is located is the standby OSD node of the partition group in which the standby partition is located; a partition view and an IO view are generated according to the partitions, the partition view includes OSD information of each partition in a partition group, and the IO view includes an identifier of the primary OSD of a partition group. The IO routing module is configured to route received IO requests to the OSD nodes, and the OSD nodes are configured to store the data corresponding to the IO requests. The system includes a memory and a processor; the memory is configured to store computer-readable instructions for performing the functions of the MDC module, the IO routing modules, and the OSD nodes; the processor is coupled to the memory and reads the instructions in the memory, which cause the processor to perform the following operations: causing the IO routing module to receive an IO request, the IO request including a key, determine, according to the key, the partition group to which the data corresponding to the IO request belongs and the primary OSD node of that partition group, add the IO view version information of the IO view of that partition group to the IO request, and send the IO request carrying the IO view version information to the determined primary OSD node.
  • A sixth aspect provides a distributed storage replication system, the system comprising at least one metadata control (MDC) module, a plurality of IO routing modules, and a plurality of object storage device (OSD) nodes. The MDC module is configured to configure, for each OSD node, at least one logical partition corresponding to the physical storage resources managed by that OSD node, the at least one partition being a primary partition, a standby partition, or any combination of primary and standby partitions; a primary partition and the standby partition corresponding to the primary partition constitute a partition group, and the primary partition and the standby partition in the same partition group are located on different OSD nodes. The OSD node where the primary partition is located is the primary OSD node of the partition group in which the primary partition is located, and the OSD node where the standby partition is located is the standby OSD node of the partition group in which the standby partition is located. A partition view and an IO view are generated according to the partitions; the partition view includes OSD information of each partition in a partition group, and the IO view includes an identifier of the primary OSD of a partition group. The IO routing module is configured to route received IO requests to the OSD nodes, and the OSD nodes are configured to store the data corresponding to the IO requests.
  • The system includes a memory and a processor; the memory is configured to store computer-readable instructions for performing the functions of the MDC module, the IO routing modules, and the OSD nodes; the processor is coupled to the memory and reads the instructions in the memory, which cause the processor to perform the following operations: causing an OSD node to, after recovering from its own failure, send a query request to the MDC module to request the IO view of the partition group in which a partition on that OSD node is located, this OSD node being called the fault-recovery OSD node and the query request carrying the OSD identifier of the fault-recovery OSD node, receive the IO view returned by the MDC module, initiate a data recovery request to the primary OSD node in the IO view to request recovery of the data updated by the fault-recovery OSD node during its failure, receive the data updated during the failure sent by the primary OSD node, and process replication of IO requests according to the partition view of the partition group updated by the MDC module after the data recovery of the fault-recovery OSD node is completed; causing the MDC module to receive the query request of the fault-recovery OSD node, return the IO view to the fault-recovery OSD node according to the OSD identifier in the query request, and update the partition view of the partition group after the data recovery of the fault-recovery OSD node is completed; and causing the primary OSD node to receive the data recovery request of the fault-recovery OSD node, send the data updated during the failure to the fault-recovery OSD node, and process replication of the data corresponding to the IO request according to the partition view of the partition group updated by the MDC module after the data recovery of the fault-recovery OSD node is completed.
  • A seventh aspect provides a method for managing data storage and replication in a distributed storage system, the system comprising at least one metadata control (MDC) module, a plurality of IO routing modules, and a plurality of object storage device (OSD) nodes. The MDC module is configured to configure, for each OSD node, at least one logical partition corresponding to the physical storage resources managed by that OSD node, the at least one partition being a primary partition, a standby partition, or any combination of primary and standby partitions; a primary partition and the standby partition corresponding to the primary partition constitute a partition group, the primary partition and the standby partition in the same partition group are located on different OSD nodes, the OSD node where the primary partition is located is the primary OSD node of the partition group in which the primary partition is located, and the OSD node where the standby partition is located is the standby OSD node of the partition group in which the standby partition is located. A partition view is generated according to the partitions, and the partition view includes OSD information of each partition in a partition group. The IO routing module is configured to route received IO requests to the OSD nodes, and the OSD nodes are configured to store the data corresponding to the IO requests. The method includes: determining that an OSD node in the system is a faulty OSD node, determining the partitions on the faulty OSD node, updating the partition view of the partition group in which each partition on the faulty OSD node is located, and sending an update notification to the primary OSD node in the updated partition view, so that the primary OSD node, after receiving the update notification sent by the MDC module, processes replication of the data corresponding to IO requests according to the updated partition view.
  • The partition view specifically includes the active/standby identity and the corresponding partition state of the OSD node on which each partition in a partition group is located. The primary OSD node is further configured to update the partition view saved locally by the primary OSD node according to the updated partition view. Processing replication of the data corresponding to the IO request according to the updated partition view specifically includes: according to the updated locally saved partition view, replicating the data corresponding to the IO request from the IO routing module to the standby OSD node whose partition state is consistent in the updated locally saved partition view, or to both the standby OSD node whose partition state is consistent in the updated locally saved partition view and the standby OSD node whose partition state is inconsistent in the updated locally saved partition view but whose data is being restored.
  • The method further includes: when the MDC module determines that a partition on the faulty OSD node includes a primary partition, updating the IO view of the partition group in which the primary partition is located and notifying the updated IO view to the standby OSD nodes in the updated partition view; the standby OSD nodes in the updated partition view update their locally saved IO view according to the updated IO view and process replication of the data corresponding to the IO request according to the updated locally saved IO view.
  • Updating the partition view of the partition group in which a partition on the faulty OSD node is located specifically includes: when the partition on the faulty OSD node includes a standby partition, marking the partition state of the faulty OSD node in the partition view of the partition group in which that standby partition is located as inconsistent; and when the partition on the faulty OSD node includes a primary partition, demoting the faulty OSD node, which acts as the primary OSD node in the partition view of the partition group in which that primary partition is located, to a new standby OSD node, marking the partition state corresponding to the new standby OSD node as inconsistent, and selecting, from the original standby OSD nodes in the partition view of the partition group in which that primary partition is located, a standby OSD node whose partition state is consistent as the new primary OSD node.
  • An eighth aspect provides a method for managing data storage and replication in a distributed storage system, the system comprising at least one metadata control (MDC) module, a plurality of IO routing modules, and a plurality of object storage device (OSD) nodes. The MDC module is configured to configure, for each OSD node, at least one logical partition corresponding to the physical storage resources managed by that OSD node, the at least one partition being a primary partition, a standby partition, or any combination of primary and standby partitions; a primary partition and the standby partition corresponding to the primary partition constitute a partition group, the primary partition and the standby partition in the same partition group are located on different OSD nodes, the OSD node where the primary partition is located is the primary OSD node of the partition group in which the primary partition is located, and the OSD node where the standby partition is located is the standby OSD node of the partition group in which the standby partition is located. A partition view and an IO view are generated according to the partitions; the partition view includes OSD information of each partition in a partition group, and the IO view includes an identifier of the primary OSD of a partition group. The IO routing module is configured to route received IO requests to the OSD nodes, and the OSD nodes are configured to store IO data according to the IO requests. The method includes: the IO routing module receives an IO request, the IO request including a key, determines, according to the key, the partition group to which the data corresponding to the IO request belongs and the primary OSD node of that partition group, adds the IO view version information of the IO view of that partition group to the IO request, and sends the IO request carrying the IO view version information to the determined primary OSD node; the primary OSD node receives the IO request and, after determining according to the IO view version information that the IO view version in the IO request is consistent with the IO view version saved locally by the primary OSD node, executes the IO request, generates a replication request carrying the IO view version information, and sends the replication request to the standby OSD node of the partition group to which the data belongs.
  • The partition view specifically includes the active/standby identity and the corresponding partition state of the OSD node on which each partition in a partition group is located. The method further includes: when the MDC module detects during IO request processing that the primary OSD node has failed, the MDC module demotes the primary OSD node in the partition view of the partition group to which the data belongs to a new standby OSD node, marks the partition state of that new standby OSD node as inconsistent, promotes one of the standby OSD nodes in the partition view of the partition group to which the data belongs to be the new primary OSD node, and notifies the updated IO view of the partition group to which the data belongs to the IO routing module; the IO routing module receives the updated IO view of the partition group sent by the MDC module and sends the IO request to the new primary OSD node according to the updated IO view of the partition group; and the new primary OSD node receives the IO request, executes it, generates a second replication request, and sends the second replication request to the standby OSD nodes whose partition state is consistent in the updated partition view of the partition group to which the data belongs.
  • The partition view specifically includes the active/standby identity and the corresponding partition state of the OSD node on which each partition in a partition group is located. The method further includes: when the MDC module detects during IO request processing that any standby OSD node has failed, the MDC module marks the partition state of that standby OSD node in the partition view of the partition group to which the data belongs as inconsistent and notifies the updated partition view of the partition group to which the data belongs to the primary OSD node; after receiving the updated partition view of the partition group to which the data belongs, the primary OSD node sends the replication request to the standby OSD nodes whose partition state is consistent in the updated partition view and does not send the replication request to the standby OSD nodes whose partition state is inconsistent.
  • A ninth aspect provides a method for managing data storage and replication in a distributed storage system, the system comprising at least one metadata control (MDC) module, a plurality of IO routing modules, and a plurality of object storage device (OSD) nodes. The MDC module is configured to configure, for each OSD node, at least one logical partition corresponding to the physical storage resources managed by that OSD node, the at least one partition being a primary partition, a standby partition, or any combination of primary and standby partitions; a primary partition and the standby partition corresponding to the primary partition constitute a partition group, the primary partition and the standby partition in the same partition group are located on different OSD nodes, the OSD node where the primary partition is located is the primary OSD node of the partition group in which the primary partition is located, and the OSD node where the standby partition is located is the standby OSD node of the partition group in which the standby partition is located. A partition view and an IO view are generated according to the partitions; the partition view includes OSD information of each partition in a partition group, and the IO view includes an identifier of the primary OSD of a partition group. The IO routing module is configured to route received IO requests to the OSD nodes, and the OSD nodes are configured to store the data corresponding to the IO requests. The method includes: an OSD node, after recovering from its own failure, sends a query request to the MDC module to request the IO view of the partition group in which a partition on that OSD node is located, this OSD node being called the fault-recovery OSD node and the query request carrying the OSD identifier of the fault-recovery OSD node; the fault-recovery OSD node receives the IO view returned by the MDC module, initiates a data recovery request to the primary OSD node in the IO view to request recovery of the data updated by the fault-recovery OSD node during its failure, receives the data updated during the failure sent by the primary OSD node, and processes replication of IO requests according to the partition view of the partition group updated by the MDC module after the data recovery of the fault-recovery OSD node is completed.
  • FIG. 2A is a schematic structural diagram of a distributed storage replication system according to an embodiment of the present invention.
  • FIG. 2B is a schematic structural diagram of a distributed storage replication system according to another embodiment of the present invention.
  • FIG. 2C is a schematic structural diagram of a distributed storage replication system according to another embodiment of the present invention.
  • FIG. 3 is a schematic diagram of a cluster view according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of a state transition of an OSD view according to an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a distributed storage replication system according to another embodiment of the present invention.
  • FIG. 6 is a flowchart of a view initialization according to an embodiment of the present invention.
  • FIG. 7 is a flowchart of an IO request process according to an embodiment of the present invention.
  • FIG. 8 is a flowchart of an OSD fault process according to an embodiment of the present invention.
  • FIG. 9 is a flowchart of an OSD fault recovery process according to an embodiment of the present invention.
  • FIG. 10 is a flowchart of data recovery in an OSD failure recovery process according to an embodiment of the present invention.
  • FIG. 11 is a flowchart of processing after an OSD exits a cluster according to an embodiment of the present invention.
  • FIG. 12 is a flowchart of processing after OSD expansion according to an embodiment of the present invention.
  • A specific embodiment of the present invention provides a distributed storage replication control system for managing and controlling the data storage and replication mentioned in the embodiments of the present invention.
  • the distributed storage replication control system mainly includes three sub-layers: a state layer, an interface layer, and a data layer.
  • the state layer includes a metadata control Meta Data Controller (MDC) module 202.
  • The MDC modules may include a primary MDC module and a number of standby MDC modules, a standby MDC module assuming the role of the primary MDC when the primary MDC module fails.
  • The interface layer includes a plurality of IO routing modules (input/output routing modules) 204, also referred to as clients; the two terms are interchangeable in the implementation of the present invention.
  • the data layer includes a plurality of object storage device Object Storage Device (OSD) nodes 206.
  • The state layer communicates with the interface layer and the data layer through state view messages. For example, the MDC module 202 sends update notifications to the IO routing modules 204 and the OSD nodes 206 via state view messages to notify them to update the locally saved cluster view (which may also be referred to simply as a view; the two terms are interchangeable in the implementation of the present invention), or directly sends the cluster view generated or updated by the MDC module 202 to the IO routing modules 204 and the OSD nodes 206.
  • the interface layer and the data layer communicate via a service message, for example, the IO routing module 204 sends an IO request message to the OSD node 206 to request storage and copying of the IO data.
  • The MDC module 202 serves as the entry point for deploying cluster configuration information and is configured to allocate, to each OSD node, logical partitions of the logical storage resources in the application storage space, generate a cluster view according to the partitions, maintain and update the cluster view, and notify the corresponding IO routing modules 204 and OSD nodes 206 of updates to the cluster view.
  • The IO routing module 204 is configured to route and forward IO requests from upper-layer applications to the corresponding OSD nodes according to the cluster view.
  • The OSD node 206 is configured to perform the relevant IO operations for an IO request according to the cluster view, mainly including performing storage and replication of data to implement consistent backup of the data, and organizing data operations on the physical storage resources it manages (for example, local disks or external storage resources).
  • the foregoing MDC module, the IO routing module, and the OSD node can be implemented by hardware, firmware, software, or a combination thereof.
  • the specific implementation is determined according to product design requirements or manufacturing cost considerations. The invention should not be limited by a particular implementation.
  • The distributed storage replication system may be deployed centrally on a separate platform or server (such as the distributed storage replication control platform in FIG. 2B above) to manage data replication and storage of the distributed storage systems connected to that platform or server.
  • Alternatively, the distributed storage replication control system may be deployed in a distributed manner in a distributed storage system as shown in FIG. 2C, where the distributed storage system includes multiple servers or hosts; the host or server in this embodiment is a host or server in the physical sense, that is, hardware including a processor and a memory.
  • the MDC module 202 may be deployed only on one (without backup MDC), two (one primary and one standby), or three (one primary and two standby) servers or hosts in the distributed storage system; the IO routing module 204 is deployed.
  • an OSD node 206 is deployed on each server or host having storage resources in the distributed storage system for managing and controlling local storage resources or external Storage resources.
  • the MDC module 202, the IO routing module 204, and the OSD node 206 in FIG. 2C constitute a distributed storage control system.
  • In the distributed storage system shown in FIG. 2B, the distributed storage replication control system is referred to as a distributed replication protocol layer, and the distributed storage system controls the storage and replication of IO data onto the storage resources of the storage layer through this distributed replication protocol layer. The storage layer is composed of the local storage resources on the multiple servers or hosts, and the modules of the distributed replication protocol layer distributed across the servers or hosts interact through the data exchange network of the network layer; in a specific embodiment, Ethernet or InfiniBand may be used. It should be understood that Ethernet and InfiniBand are merely exemplary implementations of the high-speed data exchange network used in the embodiments of the present invention, which is not limited thereto.
  • connection interaction and specific functions of the MDC module 202, the IO routing module 204, and the OSD node 206 in the distributed storage replication control system are described in detail below through specific embodiments and implementation manners.
  • The partitioning function of the MDC module may include: the MDC module configures, for each OSD node, partitions corresponding to the physical storage resources managed by that OSD node, where a partition is composed of a certain number of data blocks in the application storage space.
  • The application storage space of the application layer, as opposed to the physical storage space of the storage layer, is a certain amount of logical storage space allocated to users at the application layer and is a logical mapping of the physical storage space of the storage layer; that is, the concept of a partition here is different from that of a physical storage space partition. The space of one partition in the application storage space may be mapped to one or more partitions of the physical storage space, and the specific granularity of a partition may be obtained from the cluster configuration information, determined by the MDC module according to certain rules, or set in other manners, which is not limited by the present invention.
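  • As a purely illustrative sketch of how an IO routing module might map an IO key to a partition group and look up its primary OSD: the hash scheme, fixed partition count, and field names below are assumptions, not taken from the patent.

```python
# Key-to-partition routing sketch (hash scheme and names are assumptions).
import hashlib

def partition_of(key: str, partition_count: int) -> int:
    # Hash the key and reduce it to a partition (group) index.
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count

def route_io_request(io_view, key, partition_count):
    # io_view: dict mapping partition index -> {"primary_osd": ..., "version": ...}
    pid = partition_of(key, partition_count)
    entry = io_view[pid]
    return entry["primary_osd"], entry["version"], pid
```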
  • The MDC module may generate the cluster view of the partitions according to information such as the configured partition size, the local storage resources, and the external storage resources that can be accessed normally (for example, a LUN (Logical Unit Number) of a SAN (Storage Area Network)).
  • The copies of a partition are stored on different OSD nodes; the number of copies of a partition may be configured through a configuration file or determined by the MDC module according to a certain algorithm.
  • the partitions are divided into main partitions and standby partitions. Among the multiple copies of a partition, one copy is selected as the primary copy, called the main partition; a copy other than the primary copy is called a standby partition. The main partition and the standby partitions corresponding to it form a partition group.
  • the OSD node where the main partition is located is called the main OSD node of the partition group in which the main partition is located, and the OSD node where a standby partition is located is called the standby OSD node of the partition group in which that standby partition is located.
  • the main OSD node described in this embodiment refers to the main OSD node of a certain partition group, and the standby OSD node refers to the standby OSD node of a certain partition group.
  • for example, the storage resources managed by the OSD on the host or server server_1 (the concepts of host and server are interchangeable in the embodiments of the present invention) are divided into partition1, partition2 and partition3 (referred to as P1, P2 and P3), and partition4', partition5' and partition6' (referred to as P4', P5' and P6').
  • P4', P5' and P6' are copies of partition4, partition5 and partition6 (referred to as P4, P5 and P6) on the OSD node on the server server_2, respectively.
  • the partitions in the OSD on server_1 have a corresponding mapping relationship with the physical storage resources in the storage layer; for example, the space of a partition on the OSD is mapped to one or more blocks of the physical storage space.
  • the OSD on server_1 manages the main partitions (P1, P2 and P3) and the standby partitions (P4', P5' and P6'); it is therefore the main OSD node of the partition groups composed of P1 and P1', P2 and P2', and P3 and P3', and the standby OSD node of the partition groups composed of P4 and P4', P5 and P5', and P6 and P6'. It can be seen that the same OSD node can exist both as a main OSD node and as a standby OSD node for different partition groups.
  • the above partition division and the corresponding placement of copies may be performed according to the following factors; further practical factors may be added when planning the partition layout.
  • First, data security: each copy of a partition should be distributed to a different host or server. The bottom line for data security is that multiple copies of one partition are never placed on the same host or server.
  • Second, data balance: the number of partitions on each OSD should be as equal as possible, and the numbers of main partitions, first standby partitions and second standby partitions on each OSD should each be as equal as possible, so that the services processed on each OSD are balanced and no hot spots appear.
  • Third, data dispersion: the copies of the partitions on each OSD should be distributed as evenly as possible across the other OSDs, and the same requirement applies to higher-level physical components. A simplified placement sketch following these factors is given below.
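  • The following is a minimal illustrative sketch (not taken from the patent text; the data layout, the function names, and the greedy strategy are assumptions) of a placement routine that follows the security factor (copies of one partition never share a host) and the balance factor (the least-loaded eligible OSD is chosen each time):
```python
# Illustrative sketch only: greedy replica placement that respects the security
# factor (no two copies of a partition on the same host) and the balance factor
# (prefer the least-loaded OSD). All names are hypothetical.

def place_partition(partition_id, replica_count, osds):
    """osds: list of dicts like {"osd_id": "osd-1", "host": "server_1", "load": 0}."""
    placement = []
    used_hosts = set()
    for _ in range(replica_count):
        candidates = [o for o in osds if o["host"] not in used_hosts]
        if not candidates:
            raise RuntimeError("not enough hosts to satisfy the security factor")
        target = min(candidates, key=lambda o: o["load"])   # balance factor
        placement.append(target["osd_id"])
        used_hosts.add(target["host"])
        target["load"] += 1
    return placement  # placement[0] could serve as the OSD of the main partition

osds = [{"osd_id": f"osd-{i}", "host": f"server_{i % 3}", "load": 0} for i in range(6)]
print(place_partition("P1", 3, osds))
```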
  • the generating of the cluster view information by the MDC module may include: the MDC generates the cluster view information according to the cluster configuration information delivered by the administrator and the partition allocation described above.
  • the cluster view information includes a cluster view of three dimensions, an OSD view, an IO view, and a partition view.
  • the OSD view includes status information of an OSD node in a cluster.
  • the OSD view in the specific implementation manner may include an ID of an OSD node and status information of an OSD node.
  • the OSD ID is an OSD identifier or a number.
  • the OSD state may specifically be classified into an "UP" state and a "DOWN" state according to whether the OSD is faulty, and into an "exit cluster (OUT)" state and an "in cluster (IN)" state according to whether the OSD has exited the cluster, as shown in the figure.
  • the specific state transitions include: when an OSD node is initialized or restarted after fault recovery, its state changes from "in cluster (IN)" and "failed (DOWN)" to "in cluster (IN)" and "effective (UP)".
  • if an OSD node remains faulty for longer than a certain threshold (for example, more than 5 minutes), the OSD node is removed from the cluster, and its corresponding state changes from "in cluster (IN)" and "failed (DOWN)" to "exit cluster (OUT)" and "failed (DOWN)". A sketch of these transitions is given below.
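  • A minimal sketch of the state transitions described above follows; the threshold value and the class and method names are assumptions used only for illustration:
```python
# Illustrative sketch of the OSD state transitions described above.
# The threshold and names are assumptions, not part of the patent.

FAILURE_THRESHOLD_SECONDS = 5 * 60  # e.g. "more than 5 minutes"

class OsdStatus:
    def __init__(self):
        self.cluster_state = "IN"    # "IN" (in cluster) or "OUT" (exited cluster)
        self.health_state = "DOWN"   # "UP" (effective) or "DOWN" (failed)

    def on_startup_or_failback(self):
        # initialization or restart after fault recovery: IN/DOWN -> IN/UP
        self.cluster_state, self.health_state = "IN", "UP"

    def on_failure_detected(self):
        # fault detected: the node stays in the cluster but is marked DOWN
        self.health_state = "DOWN"

    def on_failure_duration(self, seconds_down):
        # fault lasting beyond the threshold: IN/DOWN -> OUT/DOWN
        if self.health_state == "DOWN" and seconds_down > FAILURE_THRESHOLD_SECONDS:
            self.cluster_state = "OUT"
```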
  • the OSD view may further include OSD view version information, such as an OSD view version number, an OSD view ID, or another mark identifying the view version.
  • the IO view includes an identification of a primary OSD node that identifies a partition group.
  • the IO view may include a partition group ID and an identifier of a main OSD node of the partition group corresponding to the partition group ID.
  • Each of the IO views has an IO view version information that identifies the IO view.
  • the IO view version information may be an IO view ID (also referred to as an IO view version number), which is used to identify the version of the IO view and allows different modules to compare whether their IO views are older or newer.
  • the IO view version information may be included in the IO view or outside the IO view.
  • the partition view includes the information of the OSDs where the partitions in a partition group are located.
  • specifically, the partition view may include a partition group ID; for each partition in the partition group corresponding to the partition group ID, the OSD where it is located and the active/standby identity of that OSD; and the partition state of the partition corresponding to each OSD.
  • in other words, the partition view includes the OSD node information of the main partition (for example, an OSD node ID, the active/standby identity of that OSD node, and the partition state of the main partition on that OSD node) and the OSD node information of the one or more standby partitions corresponding to the main partition (for example, an OSD node ID, the active/standby identity of that OSD node, and the partition state of the standby partition on that OSD node).
  • the partition state can be divided into two types: “consistent” and “inconsistent”. “consistent” means that the data of the standby partition is consistent with the main partition, and “inconsistent” indicates that the data of the standby partition may be inconsistent with the main partition.
  • Each partition view has a partition view version information that identifies the partition view.
  • the partition view version information may be a partition view ID (also referred to as a partition view version number), which is used by the modules to compare whether their partition views are older or newer.
  • the partition view version information may be included in the partition view or outside the partition view.
  • since the IO view is a subset of the partition view, that is, the partition view includes the IO view information, the partition view may further include the IO view version information. A sketch of the three views is given below.
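  • The following is a minimal sketch of the three views described above, expressed as plain data structures; the field names are assumptions chosen to mirror the description and are not part of the patent text:
```python
# Illustrative sketch of the OSD view, IO view and partition view described above.
# Field names are assumptions chosen to mirror the textual description.

from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class OsdView:                      # maintained by the MDC module
    version: int
    osd_states: Dict[str, str]      # osd_id -> e.g. "IN/UP", "IN/DOWN", "OUT/DOWN"

@dataclass
class IoView:                       # a subset (sub-view) of the partition view
    io_view_version: int
    partition_group_id: str
    primary_osd_id: str             # identifies the main OSD node of the group

@dataclass
class PartitionReplica:
    osd_id: str
    role: str                       # "primary" or "standby"
    partition_state: str            # "consistent" or "inconsistent"

@dataclass
class PartitionView:                # held by the MDC module and the main OSD node
    partition_view_version: int
    partition_group_id: str
    replicas: List[PartitionReplica] = field(default_factory=list)
    io_view: Optional[IoView] = None  # the IO view is embedded as a sub-view
```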
  • the MDC is further used to maintain and manage the cluster views, to update the cluster views according to changes in the states of the OSD nodes (such as a fault, fault recovery, exiting the cluster after a fault, rejoining the cluster after fault recovery, newly joining the cluster, and the like), and to notify the relevant modules, so that the relevant modules process the copies of the data corresponding to the corresponding IO requests according to the updated cluster views.
  • in a specific embodiment, the OSD view may exist only in the MDC module, the partition view may exist only in the MDC module and the main OSD node, and the IO view exists in the MDC module, the IO routing module, the main OSD node and the standby OSD node.
  • in practice, the MDC module only sends the partition view to the main OSD node where the partitions in the partition view are located, or only informs that main OSD node to update its local partition view; the MDC module sends the IO view (that is, the IO view can be regarded as a sub-view of the partition view) to the IO routing module, the main OSD node and the standby OSD node, or notifies the corresponding modules to update the locally saved IO view.
  • for the specific implementation process, reference may be made to the specific processes below, such as the process in which an OSD joins the cluster.
  • in practical applications, on the basis of the above basic functions of the cluster views, the MDC module can set different types of cluster views according to the configuration information or a certain policy, which is not limited in the embodiment of the present invention.
  • the IO routing module is mainly used to implement the function of IO request routing.
  • the IO routing module acquires and caches the IO view of all the partitions in the cluster from the MDC module.
  • specifically, the IO routing module calculates the partition group to which an IO belongs by using the key in the IO request (the calculation may use a hash algorithm or another algorithm), then looks up the locally saved IO view to find the main OSD node corresponding to that partition group, and sends the IO request to that main OSD node, as sketched below.
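  • A minimal sketch of this routing step follows; the modulo-hash mapping and all names are assumptions, since the patent only states that a hash algorithm or another algorithm may be used:
```python
# Illustrative sketch: hash the key in the IO request to a partition group, look
# up its cached IO view, and forward the request to that group's main OSD node.

import hashlib

def route_io_request(io_request, io_views, partition_count, send):
    key = io_request["key"]
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    partition_group_id = int(digest, 16) % partition_count   # group the IO belongs to
    io_view = io_views[partition_group_id]                   # locally cached IO view
    io_request["io_view_version"] = io_view["io_view_version"]
    send(io_view["primary_osd_id"], io_request)              # forward to the main OSD node
```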
  • the IO routing module further receives update notifications of the IO view from the MDC module. An update notification may include the updated IO view or corresponding update information, such as an indication of which content is updated; the IO routing module updates the locally saved IO view according to the update notification and routes IO requests according to the locally saved updated IO view.
  • the OSD node processing IO requests according to the cluster view to perform IO operations specifically includes the following.
  • when the OSD node is a primary OSD node, the primary OSD node is mainly used to receive the IO requests sent by the IO routing module, execute the IO requests, and send copy requests to the corresponding standby OSD nodes to perform the storage and replication of the IO data.
  • the primary OSD node receives from the MDC module the partition views of the partitions on the primary OSD node, and processes the replication of the IO requests according to the partition views.
  • the primary OSD node further receives update notifications about the partition view from the MDC module, updates the locally saved partition view according to the update notification, and processes the copies of the data corresponding to the IO requests according to the updated partition view. The update notification may include the updated partition view or corresponding update information, so that the OSD node updates the locally saved partition view and IO view according to the updated partition view or the update information.
  • when the OSD node is a standby OSD node, the standby OSD node is configured to receive the copy requests of the primary OSD node and perform the replication backup of the data according to the copy requests; it receives from the MDC module the IO view of the partition group to which the data belongs and processes the data corresponding to the IO requests according to the IO view.
  • the standby OSD node further receives update notifications of the IO view from the MDC module, updates the locally saved IO view according to the update notification, and processes the copies of the data corresponding to the IO requests according to the updated IO view. For specific implementation procedures, reference may be made to the specific processes described below.
  • the distributed storage replication system (such as FIGS. 2A, 2B, and 2C) in the foregoing embodiment may be implemented based on the system shown in FIG. 5.
  • the system may include one or more memories 502, one or more communication interfaces 504, and one or more processors 506, as well as a data interaction network connecting the processors and memories (not shown in the figure).
  • the memory 502 can be any type of memory such as a read only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM).
  • Memory 502 can store instructions for operating systems and other applications, as well as application data, including instructions for performing the functions of the MDC module, the IO routing module, and the OSD node in various embodiments of the present invention.
  • the instructions stored in the memory 502 are executed by the processor 506.
  • Communication interface 504 is used to effect communication between memory 502 and processor 506, as well as between processors and between memory, as well as communication between the system and other devices or communication networks.
  • the processor 506 can be a general-purpose central processing unit (CPU), a microprocessor, an application specific integrated circuit (ASIC), or one or more integrated circuits for executing related programs.
  • the processor 506 executes the interaction flows between the MDC module, the IO routing module and the OSD node described in the embodiments of the present invention and implements the corresponding functions.
  • Embodiment 1:
  • the processor is configured to be coupled to the memory, to read an instruction in the memory, the instruction comprising an instruction to execute a function of the MDC module, an IO routing module, and an OSD node, according to the instruction
  • the processor is caused to perform the following operations:
  • when the MDC module determines that an OSD node in the system is a faulty OSD node, the MDC module determines the partitions on the faulty OSD node, updates the partition view of the partition group in which each partition on the faulty OSD node is located, and sends an update notification to the primary OSD node of the partition group in the updated partition view, so that the primary OSD node processes the copy of the data corresponding to the IO request according to the updated partition view.
  • in a specific embodiment, the functions of the foregoing MDC module, IO routing module and OSD node may be implemented by one host. In this case, the instructions for implementing the functions of the MDC module, the IO routing module and the OSD node may exist in the memory of the host, and the processor of the host reads from the memory the instructions corresponding to the functions of the MDC module, the IO routing module and the OSD node.
  • in another specific embodiment, the functions of the above MDC module, IO routing module and OSD node may be implemented by multiple hosts interacting with one another. In this case, the instructions of the MDC module, the IO routing module and the OSD node are distributedly stored in the memories of different hosts. For example, the processor of host 1 performs the functions of the above-described MDC module, the processor of host 2 performs the functions of the main OSD node, and the processor of host 3 performs the functions of the IO routing module.
  • Embodiment 2:
  • the processor is configured to be coupled to the memory, to read an instruction in the memory, the instruction comprising an instruction to execute a function of the MDC module, an IO routing module, and an OSD node, according to the instruction
  • the processor is caused to perform the following operations:
  • the IO routing module to receive an IO request, where the IO request includes a key, determining, according to the key, a partition group to which the data corresponding to the IO request belongs and determining a main OSD node of the partition group to which the data belongs, The IO view version information of the IO view of the partition group to which the data belongs is added to the IO request, and the IO request carrying the IO view version information is sent to the determined main OSD node;
  • the determined main OSD node executes the IO request and sends a copy request to the corresponding standby OSD node, so that the data corresponding to the IO request on the standby OSD node is consistent with the data corresponding to the IO request on the primary OSD node.
  • in a specific embodiment, the functions of the foregoing MDC module, IO routing module and OSD node may be implemented by one host. In this case, the instructions for implementing the functions of the MDC module, the IO routing module and the OSD node may exist in the memory of the host, and the processor of the host reads from the memory the instructions corresponding to the functions of the MDC module, the IO routing module and the OSD node.
  • in another specific embodiment, the functions of the above MDC module, IO routing module and OSD node may be implemented by multiple hosts interacting with one another. In this case, the MDC module, the IO routing module and the OSD node are distributedly stored in the memories of different hosts. For example, the processor of host 1 performs the functions of the above-described IO routing module, the processor of host 2 performs the functions of the main OSD node, the processor of host 3 performs the functions of the standby OSD node, and the processor of host 4 performs the functions of the MDC module.
  • Embodiment 3:
  • the processor is configured to be coupled to the memory, to read an instruction in the memory, the instruction comprising an instruction to execute a function of the MDC module, an IO routing module, and an OSD node, according to the instruction
  • the processor is caused to perform the following operations:
  • after an OSD node in the system recovers from a fault (the OSD node is referred to as a fault recovery OSD node), the fault recovery OSD node sends a query request to the MDC module to request the IO views of the partition groups in which the partitions on it are located, the query request carrying the OSD identifier of the fault recovery OSD node. It receives the IO views returned by the MDC, initiates a data recovery request to the primary OSD in each IO view to request the data updated during the fault, receives the data updated during the fault sent by the primary OSD, and, after the data recovery of the fault recovery OSD node is completed, processes the replication of IO requests according to the partition view of the partition group updated by the MDC module.
  • the copies of the data corresponding to the IO requests are thereafter processed according to the partition view of the partition group updated after the recovery is completed.
  • in a specific embodiment, the functions of the foregoing MDC module, IO routing module and OSD node may be implemented by one host. In this case, the instructions for implementing the functions of the MDC module, the IO routing module and the OSD node may exist in the memory of the host, and the processor of the host reads from the memory the instructions corresponding to the functions of the MDC module, the IO routing module and the OSD node.
  • in another specific embodiment, the functions of the above MDC module, IO routing module and OSD node may be implemented by multiple hosts interacting with one another. In this case, the MDC module, the IO routing module and the OSD node are distributedly stored in the memories of different hosts.
  • for example, the processor of host 1 performs the functions of the above-described fault recovery OSD node, the processor of host 2 performs the functions of the main OSD node, the processor of host 3 performs the functions of the MDC module, and the processor of host 4 performs the functions of the IO routing module.
  • the connection interaction and specific functions of the MDC module 202, the IO routing module 204, and the OSD node 206 in the distributed storage replication control system are further described in detail below through a plurality of specific process embodiments.
  • the specific process embodiments include: a cluster view initialization, generation and acquisition process, an IO request processing flow, an OSD failure processing flow, an OSD node failure recovery process, a data recovery process, a process of exiting the cluster after an OSD node failure, and an OSD node expansion process.
  • the following is a detailed description.
  • the MDC module generates an initial cluster view according to the cluster configuration information delivered by the administrator.
  • the execution subjects involved in this embodiment are the MDC module, the IO routing module and the OSD node mentioned in the embodiments described in FIG. 2A to FIG. 2C and FIG. 5.
  • when the system is initialized, the MDC module is started first, and then the IO routing module and the OSD nodes obtain the views from the MDC module and start.
  • the specific process includes:
  • the user or the administrator sends the cluster configuration information to the MDC module.
  • the cluster configuration information may include the topology information of the cluster and system configuration parameters such as the number of partitions and the number of copies. The cluster topology information mainly includes the number of servers and their IP addresses, rack information, the number of OSD nodes on each server, and the physical storage resource information managed by each OSD node (such as its corresponding local disk information).
  • the MDC generates an initial cluster view according to the issued cluster configuration.
  • the three views of the cluster (OSD view, partition view, and IO view) are described above.
  • the OSD view is generated according to the configured OSD information.
  • the partition view is generated by the partition allocation algorithm according to the configured number of partitions, the number of copies, and the number of OSD nodes.
  • the IO view is a subset of the partition view and does not need to be generated separately. When generating the partition view, it is usually necessary to consider the balance of the partition allocation (the number of partitions on each OSD node should be as consistent as possible) and security (the OSD nodes where the copies of a partition are located should be on different servers or different racks).
  • the IO routing module is initialized, and the IO view is queried to the MDC module.
  • when the IO routing module starts, it needs to obtain the related views from the MDC in order to work normally.
  • the OSD node is initialized and queries the MDC module for the partition view and the IO view; specifically, the OSD node needs to obtain the partition view of each partition group in which a main partition distributed on the OSD node is located, and the IO view of each partition group in which a standby partition distributed on the OSD node is located.
  • the MDC module returns an IO view of all the partition groups to the IO routing module.
  • the MDC module returns to the OSD node the partition view of each partition group in which a main partition on the OSD is located and the IO view of each partition group in which a standby partition on the OSD is located.
  • Partition X may be any partition of the distributed replication protocol layer in the embodiment of the present invention. The main OSD node where Partition X is located is referred to as the Partition X main OSD node, the standby OSD1 node where a standby partition of Partition X is located is referred to as the Partition X standby OSD1 node, and the standby OSD2 node where another standby partition of Partition X is located is referred to as the Partition X standby OSD2 node. The subsequent specific description takes Partition X as an example.
  • the IO operation process (such as a write operation or a modification operation) in this embodiment specifically includes:
  • the IO routing module receives an IO request sent by a host (for example, the server where the IO routing module is shown in FIG. 3).
  • the IO routing module obtains, according to the received IO request, the partition to which the data corresponding to the IO request (also referred to as IO data) belongs, and obtains the main OSD node of the partition group in which that partition is located.
  • the IO routing module may calculate the partition ID of the partition group in which the data corresponding to the IO request is located according to the Key carried in the IO request, and then search for the IO view through the partition ID.
  • the partition group corresponding to the partition ID is partition X.
  • the Key is a number or a string defined by the upper layer service, and is used to identify a data block.
  • the IO routing module sends the IO request to the primary OSD node of the partition group, where the request carries IO view version information (eg, IO view ID), IO key information, and the IO data.
  • the partition group in this embodiment is partition X.
  • the IO request is sent to the main OSD node of the partition X, that is, the partition X main OSD node.
  • the IO view ID may also be referred to as the IO view version number. It is mainly used to identify the version of a view and increases monotonically; a smaller IO view ID indicates that the view version maintained by its holder is outdated.
  • the IO key information may include a key, an offset, and a length, where offset refers to an offset of IO data relative to a starting position in a data block identified by a key, and length refers to a length of IO data.
  • the Seq ID is composed of the view version number and a sequence number (Seq NO).
  • the view version number increases monotonically as the IO view changes, and the Seq NO indicates the consecutive numbering, within one IO view version, of the modification operations (for example, write and delete) on the data corresponding to the partition in the IO view.
  • when the IO view version changes, the Seq NO in the Seq ID is incremented again from 0.
  • the Partition X main OSD node compares whether the IO view ID carried in the IO request is consistent with the locally saved IO view ID.
  • in a specific comparison, the IO view IDs may be compared first: the Seq ID with the larger IO view ID is the larger one. If the IO view IDs are equal, the Seq NOs are then compared, and the Seq ID with the larger Seq NO is the larger one. Only when both are the same are the two Seq IDs considered consistent, as sketched below.
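  • A minimal sketch of this comparison rule, using an assumed tuple representation of the Seq ID:
```python
# Illustrative sketch of the Seq ID comparison rule described above: compare the
# IO view IDs first, and only when they are equal compare the Seq NOs.

def compare_seq_id(a, b):
    """a, b: (io_view_id, seq_no) tuples. Returns -1, 0 or 1."""
    if a[0] != b[0]:
        return -1 if a[0] < b[0] else 1   # larger IO view ID => larger Seq ID
    if a[1] != b[1]:
        return -1 if a[1] < b[1] else 1   # same view: larger Seq NO => larger Seq ID
    return 0                              # both equal => the Seq IDs are consistent

assert compare_seq_id((3, 7), (4, 0)) == -1   # newer view wins regardless of Seq NO
assert compare_seq_id((4, 2), (4, 2)) == 0
```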
  • after determining that the IO view ID in the IO request is smaller than the locally saved IO view ID, the Partition X main OSD node returns an error to the IO routing module; the IO routing module queries the MDC module for the IO view of the partition group and resends the IO request after obtaining the updated IO view ID.
  • after determining that the IO view ID in the IO request is greater than the locally saved IO view ID, the Partition X main OSD node adds the IO request to a buffer queue, queries the MDC module for the IO view of the partition group, and executes the IO request after determining that the locally saved IO view ID is consistent with the IO view ID in the IO request.
  • the operation type may include writing, deleting, and the like. If it is a write operation, the entry may further include the aforementioned offset and length. In addition, for various types of operations, the entry may further include status information describing whether the operation is successful. In general, the modification operations (such as write and delete operations) for the same partition group are uniformly and consecutively numbered.
  • executing the IO request specifically includes: if the IO request is a write IO request, writing the IO data to the local physical storage resource managed by the Partition X main OSD node (for example, the cache layer shown in FIG. 3, a persistence layer such as a disk, or the aforementioned external physical storage resource SAN); if the IO request is a delete request, deleting the corresponding data on the local physical storage resource managed by the Partition X main OSD node.
  • the Partition X main OSD node further generates a copy request.
  • the copy request may be generated by separately assembling the control part of the IO request, so that the data corresponding to the IO request on the standby OSD nodes of Partition X remains consistent with the data corresponding to the IO request on the Partition X main OSD node.
  • the copy request may further include information such as key, offset, length, and IO data in the original request.
  • when the copy request is a write copy request, the copy request carries the key, offset, length, and IO data; when the copy request is a delete copy request, only the key is carried in the copy request.
  • if the standby OSD node of Partition X determines that the IO view ID in the copy request is smaller than the locally saved IO view ID, it returns an error to the Partition X main OSD node; the Partition X main OSD node then queries the MDC module for the IO view of the partition group and resends the copy request after obtaining the updated IO view ID.
  • if the standby OSD node determines that the IO view ID in the copy request is greater than the locally saved IO view ID, it adds the copy request to a buffer queue, queries the MDC module for the IO view version number of the partition group, and executes the copy request after determining that the locally saved IO view ID is consistent with the IO view ID in the copy request.
  • in the existing two-phase commit protocol, if a participant rejects the proposal, the entire IO process needs to be rolled back, which brings a lot of overhead.
  • in the embodiment of the present invention, when the standby OSD node rejects a request, it first queries for the latest view and then continues processing, so no rollback is needed, which improves the fault tolerance and availability of the entire system.
  • in addition, the existing two-phase commit protocol requires two rounds of message interaction between the coordinator and the participants in a normal IO process.
  • in the embodiment of the present invention, only one round is needed in a normal IO process, and the primary node has only one message interaction with each standby node, which reduces the IO delay caused by message interaction and improves the efficiency and performance of the entire system.
  • in a specific embodiment, the copy request may further carry the Seq ID of the last copy request sent by the Partition X main OSD node for Partition X, and the IO process may further include step 720.
  • by carrying the Seq ID of the last copy request, data omission caused by a change of the IO view version number can be prevented, and by resending the lost data, the execution order of the IOs on the active and standby OSD nodes can be kept consistent, thereby further improving the consistency of the data backup.
  • if the Seq ID of the last copy request carried in the copy request is smaller than the local maximum Seq ID, an error is returned to the Partition X main OSD node, and the Partition X main OSD node retransmits the copy request; alternatively, the Seq ID is confirmed through a further query, and processing continues after the updated Seq ID is obtained, instead of directly returning the error. In both cases, there is no need to terminate the processing and then roll back, which further improves the system's fault tolerance, availability, and overall performance.
  • the entry includes an operation type, a Partition ID, a Seq ID, and a key
  • the operation type may include writing or deleting. If it is a write operation, the entry may further include the aforementioned offset and length. In addition, for various types of operations, the entry may further include status information for describing whether the operation is successful.
  • executing the copy request specifically includes: when the copy request is a write copy request, the copy request carries the key, offset, length and IO data, and the data is written accordingly; when the copy request is a delete copy request, only the key is carried in the copy request, and the corresponding data is deleted. A sketch of this handling on the standby OSD node is given below.
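  • A minimal sketch of a standby OSD node executing a copy request and recording the corresponding entry follows; the in-memory store and all names are assumptions used only for illustration:
```python
# Illustrative sketch: a standby OSD records an entry for the modification
# operation and then applies the write or delete to its local store.

local_store = {}   # key -> bytes, stands in for the locally managed physical storage
entry_log = []     # recorded entries, later used during fault recovery

def execute_copy_request(copy_request):
    entry = {
        "op_type": copy_request["op_type"],          # "write" or "delete"
        "partition_group_id": copy_request["partition_group_id"],
        "seq_id": copy_request["seq_id"],            # (io_view_id, seq_no)
        "key": copy_request["key"],
    }
    if copy_request["op_type"] == "write":
        entry["offset"] = copy_request["offset"]
        entry["length"] = copy_request["length"]
        buf = bytearray(local_store.get(copy_request["key"], b""))
        end = copy_request["offset"] + copy_request["length"]
        buf.extend(b"\x00" * max(0, end - len(buf)))  # grow the block if needed
        buf[copy_request["offset"]:end] = copy_request["data"]
        local_store[copy_request["key"]] = bytes(buf)
    elif copy_request["op_type"] == "delete":
        local_store.pop(copy_request["key"], None)
    entry["status"] = "success"
    entry_log.append(entry)
```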
  • the partition X standby OSD1 node and the partition X standby OSD2 node respectively send a response request success message to the partition X primary OSD node.
  • the partition X main OSD node sends a response request success message to the IO routing module.
  • if the Partition X main OSD node or a Partition X standby OSD node fails during the IO request processing (for example, the Partition X main OSD node fails after the IO request reaches it, or a Partition X standby OSD node fails, or a new OSD node joins the system as a standby OSD node of Partition X), then in these cases the IO request processing flow of the foregoing embodiment may further include the processing procedures described in the following embodiments.
  • the IO processing flow after the partition X main OSD node fails includes the following:
  • when the MDC module in the system detects that the Partition X main OSD node has failed during the IO request processing, the MDC module demotes the Partition X main OSD node in the partition view of the partition group to a new Partition X standby OSD node and marks the partition state of that new Partition X standby OSD node as inconsistent, promotes the Partition X standby OSD1 node in the partition view of the partition group to the new Partition X main OSD node, and sends the updated IO view of the partition group to the IO routing module.
  • the IO routing module receives the updated IO view of the partition group sent by the MDC module and sends the IO request to the new Partition X main OSD node according to the updated IO view of the partition group;
  • the new Partition X main OSD node is configured to receive the IO request, execute the IO request, generate a copy request, and send the copy request to the other standby OSD nodes in the updated partition view of the partition group.
  • the IO processing flow after the partition X standby OSD node fails includes the following:
  • the MDC module is further configured to: when a Partition X standby OSD node fails during the IO request processing, mark the partition state of that Partition X standby OSD node in the partition view as inconsistent, and send the updated partition view of the partition group to the Partition X main OSD node;
  • after receiving the updated partition view of the partition group, the Partition X main OSD node sends the copy request to the other standby OSD nodes whose partition state in the updated partition view is consistent, and no longer sends the copy request to the Partition X standby OSD node whose partition state is inconsistent.
  • in the existing two-phase commit protocol, if the coordinator fails, the IO processing is interrupted, and the coordinator must recover before processing can continue.
  • in the embodiment of the present invention, the MDC module can quickly select a new primary OSD node, which enables fast recovery of IO processing, high availability, and strong fault tolerance.
  • in the existing two-phase commit protocol, if a participant fails or does not respond, other IO requests are blocked, the transaction fails after a timeout, and a rollback needs to be performed.
  • in the embodiment of the present invention, when a standby node fails, the MDC notifies the primary node to perform a view change, isolate or ignore the faulty OSD node, and continue the processing of IO requests without blocking other IO request processing. This gives better fault tolerance and allows node faults and node fault recovery to be handled quickly; for example, N+1 copies can tolerate N replica failures, which in turn improves storage system performance and availability.
  • a system with poor availability cannot have good scalability, because in a large-scale distributed storage system the failure of storage nodes is a common occurrence, and complex and numerous protocol interactions also reduce the scalability of the system.
  • the cluster view control of the partition granularity can greatly reduce the impact range of the storage node failure, so that the storage system can be expanded on a large scale and the scalability of the system is improved.
  • the MDC finds that a new OSD node joins the cluster, and as the standby OSD node of the partition X, the IO processing flow includes the following:
  • when the MDC module determines that a new OSD node has joined the cluster, the MDC module notifies the Partition X main OSD node that the new OSD node is a new standby OSD node of Partition X, and, after the data synchronization of the partition is completed, updates the partition view and the IO view of the partition group and notifies the Partition X main OSD node to update its locally saved partition view;
  • the Partition X main OSD node synchronizes the data of the main partition on it to the new standby OSD node, and sends the copy request to the new standby OSD node according to the updated locally saved partition view.
  • in a distributed storage system, the failure of a storage node is a normal state. When some OSD nodes fail, the system needs to be able to continue providing IO services normally. Since all processing of IO requests depends on the cluster views maintained by the MDC module, when an OSD node fails the cluster views need to be updated accordingly so that the processing of IO requests can continue effectively.
  • the MDC module detects the states of the OSD nodes, and when an OSD node fails, the MDC module can discover the fault in time. When the MDC module finds an OSD fault, it needs to process it in time, make the correct changes to the views, and notify the relevant IO routing modules and OSD nodes. The relevant IO routing modules and OSD nodes then process the corresponding IO requests according to the updated views upon receiving the update notification from the MDC, so that the modules and nodes obtain the updated views in time, thereby ensuring smooth and efficient processing of the IO requests.
  • the MDC module can detect the failure of an OSD node in the following two modes: 1) the MDC module is uniformly responsible for detection, and each OSD node periodically sends a heartbeat message to the MDC module; if an OSD node does not send a heartbeat message to the MDC module within a specified time period, the MDC module determines that the OSD node has failed (a minimal sketch of this mode is given below); 2) the OSD nodes periodically send heartbeat messages to each other for fault detection, and if a detecting node does not receive the heartbeat message of the detected node within a specified time period, it reports to the MDC module that the corresponding OSD node has failed.
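  • A minimal sketch of the first detection mode follows; the timeout value and the names are assumptions used only for illustration:
```python
# Illustrative sketch of heartbeat-based failure detection by the MDC module.

import time

HEARTBEAT_TIMEOUT_SECONDS = 30   # assumed "specified time period"

class HeartbeatMonitor:
    def __init__(self):
        self.last_heartbeat = {}   # osd_id -> timestamp of the last heartbeat

    def on_heartbeat(self, osd_id):
        self.last_heartbeat[osd_id] = time.time()

    def detect_failed_osds(self):
        now = time.time()
        return [osd_id for osd_id, ts in self.last_heartbeat.items()
                if now - ts > HEARTBEAT_TIMEOUT_SECONDS]
```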
  • since the partition view and the IO view are both defined for a certain partition group, the view updates for the different partition groups are independent of each other:
  • when the partition on the faulty OSD node includes a standby partition, the partition state of the faulty OSD node in the partition view of the partition group in which that standby partition is located is marked as inconsistent, and the updated partition view is notified to the primary OSD node of that partition group, so that the primary OSD node performs the copy processing of the data corresponding to the IO request according to the updated partition view;
  • when the partition on the faulty OSD node includes a main partition, the faulty OSD node, which is the main OSD node in the partition view of the partition group in which that main partition is located, is demoted to a new standby OSD node, and the partition state corresponding to that new standby OSD node is marked as inconsistent; a standby OSD node whose partition state is consistent is then selected from the original standby OSD nodes in the partition view of that partition group and promoted to be the new primary OSD node.
  • the new primary OSD node is notified of the partition view update, and the other standby OSD nodes are notified of the IO view update. A sketch of this per-group update is given below.
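  • A minimal sketch of this per-partition-group update follows; the dict-based view representation and all names are assumptions, not part of the patent text:
```python
# Illustrative sketch: the MDC-side view update when an OSD node fails.
# Each view is a dict with "partition_view_version", "primary_osd_id" and
# "replicas" (a list of {"osd_id", "role", "state"}).

def handle_osd_failure(failed_osd_id, partition_views):
    for view in partition_views:
        replicas = view["replicas"]
        if not any(r["osd_id"] == failed_osd_id for r in replicas):
            continue                           # this partition group is unaffected
        view["partition_view_version"] += 1
        for r in replicas:
            if r["osd_id"] != failed_osd_id:
                continue
            if r["role"] == "primary":
                # main partition on the faulty node: demote it ...
                r["role"], r["state"] = "standby", "inconsistent"
                # ... and promote a consistent standby to be the new primary
                for cand in replicas:
                    if cand["role"] == "standby" and cand["state"] == "consistent":
                        cand["role"] = "primary"
                        view["primary_osd_id"] = cand["osd_id"]
                        break
            else:
                # standby partition on the faulty node: just mark it inconsistent
                r["state"] = "inconsistent"
        # the MDC would then notify the (new) primary of the partition view update
        # and notify the IO routing module / standby OSDs of the IO view update
```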
  • that the related IO routing modules and OSD nodes process the corresponding IO requests according to the updated views upon the update notification from the MDC may specifically include: the new primary OSD node replicates the data corresponding to the IO requests from the IO routing module, according to the updated locally saved partition view, to the standby OSD nodes whose partition state in the updated locally saved partition view is consistent, or to a standby OSD node whose partition state in the updated locally saved partition view is inconsistent but which is recovering data.
  • in this way, fault isolation is performed to ensure the normal and uninterrupted processing of the IO requests, thereby improving the fault tolerance of the system and correspondingly improving the performance and availability of the system.
  • the cluster view control of the partition granularity can reduce the impact range of the OSD node failure, thereby enabling the system to expand on a large scale and improving the scalability of the system.
  • after the faulty OSD node recovers and its data is synchronized, the MDC module further updates the partition view and the IO view, notifies the further updated partition view to the main OSD node where the partitions in the further updated partition view are located, and sends the further updated IO view to the standby OSD nodes where the partitions in the further updated partition view are located, so that the modules or OSD nodes receiving the further updated partition view or IO view update the locally saved partition view or IO view and process the copies of the data corresponding to the IO requests according to the further updated partition view or IO view.
  • in this way, the fault recovery node can quickly rejoin the cluster to process IO requests, improving system performance and efficiency.
  • an OSD node fault processing flow is provided in the embodiment of the present invention.
  • the execution subjects involved in this embodiment are the MDC module, the IO routing module and the OSD nodes described in FIG. 2A to FIG. 2C and FIG. 5.
  • the present embodiment is described by taking the OSDx node, the OSDy node and the OSDz node among the OSD nodes as an example.
  • the OSDx node, the OSDy node and the OSDz node may be any multiple OSD nodes of the distributed replication protocol layer in the embodiment of the present invention.
  • the OSDx node is assumed to be the primary OSD node of the partition1 group (referred to as P1), and is the standby OSD node of the partition n group (referred to as Pn);
  • the OSDy node is the primary OSD node of Pn and a standby OSD node of P1;
  • the OSDz node is the standby OSD node of the Pn, and is the standby OSD node of the P1.
  • the OSD node fault processing process in this embodiment specifically includes:
  • the OSDx node, the OSDy node, and the OSDz node periodically send heartbeat messages to the primary MDC module.
  • the fault may include various software, hardware, or network faults, such as a program restart caused by a software bug, a short network failure, or a server restart, which cause the OSD node to fail to process IO requests and fail to complete the data storage and replication functions.
  • the MDC module detects that the OSDx node does not send a heartbeat message within a predetermined time, and determines that it has failed.
  • the MDC can also be notified of an OSD node failure by other means, for example by an OSD node that is not faulty.
  • the MDC module performs a view update according to the fault condition.
  • the MDC updates the cluster view of the corresponding partition group according to the partition group in which the partition on the faulty OSD node is determined.
  • in this embodiment, the partitions on the faulty OSDx node belong to the partition groups P1 and Pn, so the MDC needs to update the cluster views of P1 and Pn, which may include:
  • updating the IO view: for P1, the main OSD node in the IO view of P1 is changed from the original OSDx node to the OSDy node; for Pn, the faulty OSDx node acts only as a standby OSD node of Pn, and the main OSD node of Pn, the OSDy node, has not failed, so the IO view of Pn is not updated.
  • the OSDy node processes the notification of the MDC module, updates the locally saved view information (partition view and IO view), and processes the copy of the data corresponding to the IO request according to the latest view notified by the MDC module.
  • specifically, the OSDy node updates the partition view and the IO view of P1 as the new main OSD node of P1, and updates the partition view of Pn as the original main OSD node of Pn.
  • for the IO operations of P1, after the OSDy node receives an IO request forwarded by the IO routing module, it executes the IO request, generates a copy request, and sends the copy request to the standby OSD node of P1 in the updated partition view, the OSDz node, whose corresponding partition state is "consistent". Since the OSDx node now acts as a new standby OSD node of P1 whose partition state is "inconsistent", the OSDy node no longer sends the copy request to the OSDx node. Fault isolation is thereby achieved, so that the continued processing of IO requests for P1 is not affected.
  • for the IO operations of Pn, after the OSDy node receives an IO request forwarded by the IO routing module, it executes the IO request, generates a copy request, and sends the copy request to the standby OSD node of Pn in the updated partition view, the OSDz node, whose corresponding partition state is "consistent". Since the OSDx node acts as a standby OSD node of Pn whose partition state is "inconsistent", the OSDy node no longer sends the copy request to the OSDx node, achieving fault isolation without affecting the continued processing of IO requests for Pn.
  • in the existing two-phase commit protocol, if a participant fails or does not respond, other IO requests are blocked, the transaction fails after a timeout, and a rollback is required.
  • in the embodiment of the present invention, when a standby OSD node fails, the MDC notifies the primary node to make a view change, ignore and isolate the faulty node, and continue the processing of IO requests without blocking other IO request processing, which gives better fault tolerance and availability.
  • if the primary OSD node fails, the MDC updates the partition view of P1 in time, and the IO routing module re-sends the IO request to the newly selected primary OSDy node of P1 according to the updated partition view.
  • in this way, the MDC can quickly select a new primary node to quickly restore IO processing; if a standby node fails, the MDC notifies the primary node to perform a view change and isolate or ignore the faulty OSD node, so that the processing of IO requests continues and other IO request processing is not blocked. This provides better fault tolerance and allows node faults to be handled quickly; for example, N+1 copies can tolerate N replica failures, thereby improving the performance and availability of the storage system.
  • a system with poor availability cannot have good scalability, because in a large-scale distributed storage system the failure of storage nodes is a common occurrence, and complex and numerous protocol interactions also reduce the scalability of the system.
  • the cluster view control at partition granularity reduces the impact range of a storage node failure, so that the storage system can be expanded on a large scale and the scalability of the system is improved.
  • the OSD node failure recovery process can be divided into the following three stages:
  • first, the fault recovery OSD node, acting as a standby OSD node, synchronizes from the primary OSD node the data modifications made during its failure; this is an incremental synchronization process.
  • depending on the actual situation, all the data of a certain partition may also be synchronized.
  • second, after the data on the standby OSD node is restored to a state consistent with the primary OSD node, the MDC module performs the change of the cluster view;
  • third, the MDC module notifies each module and node of the updated cluster view, so that each module and node processes the replication or forwarding of IO requests according to the updated cluster view.
  • the OSD node fault recovery process may further include the following stages:
  • in the process of synchronizing data from the primary OSD node, the standby OSD node records in a log the copy requests received from the primary OSD node; after the synchronization, it plays back the log and writes the data recorded in the log. This ensures that, during the fault recovery process, all data on the fault recovery OSD node becomes exactly the same as on the primary OSD node, thereby improving the data consistency of the active and standby OSD nodes.
  • the specific process of the data synchronization process can be referred to a specific embodiment given in FIG. 10 below.
  • the cluster view update and update notification process can include the following:
  • if the OSD node was the main OSD node of a partition group before the failure, the partition view and the IO view are changed simultaneously: the fault recovery OSD node is promoted back to the new main OSD node of the partition group, the current main OSD node is demoted to a standby OSD node of the partition group, the new primary OSD node is notified to perform the partition view change, and the IO routing module and the standby OSD nodes are notified to perform the IO view change.
  • if the faulty OSD node cannot be recovered, the MDC module removes the OSD node from the cluster and migrates the partitions distributed on that OSD node to other OSD nodes; for the specific process, reference may be made to the embodiment of an OSD node exiting the cluster given in FIG. 11 described below.
  • FIG. 9 shows an embodiment of the OSD node fault recovery processing flow provided by the present invention; the execution subjects involved are the modules and nodes described in FIG. 2A to FIG. 2C and FIG. 5.
  • this embodiment is described by taking as an example that the partition groups in which the partitions on the fault recovery OSD node are located include the partition1 group (P1 for short) and the partition n group (Pn for short).
  • the OSD node fault recovery process in this embodiment specifically includes:
  • the fault recovery OSD node requests the MDC module to restore the cluster view information on the OSD node, and the request carries the OSD ID of the OSD node.
  • the MDC module queries the partition view according to the OSD ID to obtain the partition information on the OSD.
  • the MDC queries the partition view corresponding to P1 and Pn on the fault recovery node OSD according to the OSD ID of the fault recovery OSD to obtain the partition information of P1 and the partition information of Pn, respectively.
  • the partition information may include an IO view, and may further include a partition state.
  • 908/910: the fault recovery OSD node initiates a data recovery process according to the partition information returned by the MDC, requesting from the primary OSD nodes of P1 and Pn, respectively, the entry information missing during the failure, with each request carrying its Seq ID.
  • the fault recovery process may further include the following steps 912-916 and 918:
  • the primary OSD node of Pn receives a host IO write request during the failure recovery.
  • steps 912-914 may refer to the IO operation flow in FIG. 7 above.
  • the data corresponding to the IO request may also be written into the log.
  • the IO key information in this step may refer to the description of the IO operation flow in FIG. 7 above.
  • after the data synchronization, the data corresponding to the replication requests received from the primary OSD node of Pn is written into the fault recovery OSD node according to the IO information recorded in the log.
  • the MDC module may be notified to perform the update of the cluster view in two ways.
  • one way is that, after the data recovery is completed, the fault recovery OSD node notifies the primary OSD nodes of P1 and Pn, and the primary OSD nodes then notify the MDC module to perform the cluster view update.
  • the second way is described in step 930 below.
  • the fault recovery OSD node may further confirm the partition states of the partitions on it before the notification of the cluster view update, and trigger the cluster view update process after confirming that a partition state is inconsistent.
  • the fault recovery OSD node may further obtain the partition state information by using the partition information returned in the above step 906.
  • the primary OSD nodes of P1 and Pn respectively notify the MDC to change the partition view, carrying the partition group ID, the standby OSD ID, and the view version.
  • that is, a notification is sent to the MDC module requesting that the partition states of the partitions on the fault recovery OSD node be updated to consistent; the notification carries the partition group ID of the partition group in which each partition on the OSD node is located, the standby OSD node ID (that is, the ID of the fault recovery OSD node), and the view version.
  • the partition group ID is used to indicate the view of the partition group to be updated.
  • the OSD ID is used to indicate the fault recovery OSD node.
  • the view version is the view version of the latest partition view saved locally by the main OSD node where the P1 and Pn are located.
  • the cluster view update process is then performed, thereby ensuring that the cluster view seen by all modules and nodes involved in IO processing is consistent, which improves the consistency of the data backup.
  • in a specific embodiment, the latest partition view may further be sent to the main OSD nodes where P1 and Pn are located, and the cluster view is updated after it is confirmed that the primary and backup data for P1 and Pn are the same.
  • in the second way, the fault recovery OSD node notifies the MDC to change the partition view, requesting that the state of the fault recovery OSD node be updated to consistent; the notification carries the partition group ID, the standby OSD ID, and the view version.
  • that is, different from the first way, the fault recovery OSD node itself sends a notification to the MDC module to have the MDC module update the partition states of the partitions on the fault recovery OSD node to consistent.
  • here, the view version is the partition view version of the latest partition view saved locally by the fault recovery OSD node, or the IO view version of its latest IO view (whether it is the partition view or the IO view depends on whether the fault recovery OSD node was the primary OSD node or a standby OSD node before the fault).
  • the MDC module performs the cluster view update process after determining, according to the view version in the notification, that the view version is consistent with the corresponding view version it maintains.
  • the partition status corresponding to the fault recovery OSD node in the partition view of P1 and Pn on the fault recovery OSD node is respectively updated to “consistent”.
  • the MDC module confirms that the fault recovery OSD node was the original primary OSD node of P1 by comparing the latest partition views of P1 and Pn that it maintains with the initial partition views of P1 and Pn, or by comparing the latest IO views of P1 and Pn with the initial IO views of P1 and Pn.
  • the MDC promotes the fault recovery OSD node back to the new primary OSD node of P1, demotes the current primary OSD node of P1 to a new standby OSD node of P1, and updates the partition view of P1 accordingly.
  • since the fault recovery OSD node was originally only a standby OSD node of Pn, no primary/standby identity change of the fault recovery OSD node is involved for Pn.
  • since the fault recovery OSD node was originally the primary OSD node of P1, it is promoted to the new primary OSD node of P1 according to step 937; therefore, the updated latest partition view of P1 needs to be sent to it.
  • the fault recovery OSD node determines whether the main OSD in the received partition view is consistent with the main OSD in the locally saved IO view, thereby determining whether it has been promoted, and updates the IO view accordingly.
  • specifically, after receiving the latest partition view, the fault recovery OSD node determines whether the main OSD in the latest partition view is consistent with the main OSD in the locally saved IO view, and thereby determines whether it has been promoted; if so, it updates the IO view and the locally saved partition view, and replicates the data involved in IO requests according to the updated partition view and IO view.
  • the current primary OSD node of the P1 and Pn receives the updated IO view and updates the respective locally saved IO views.
  • since the current primary OSD node of P1 has been demoted by the MDC module to a new standby OSD node of P1, it deletes the locally saved partition view and handles the replication of IO requests according to the updated locally saved IO view.
  • it can be seen from the above that, after an OSD node fails, the MDC notifies the other related nodes to update the views, isolate or ignore the faulty OSD node, and continue processing IO requests without blocking other IO request processing; after the node recovers from the fault, the views are updated again and the related nodes are notified.
  • in this way, the fault recovery node can quickly rejoin the cluster, node faults and fault recovery can be handled quickly, the system has better fault tolerance, and the performance and availability of the storage system are improved.
  • a system with poor availability cannot have good scalability, because in a large-scale distributed storage system the failure of storage nodes is a common occurrence, and complex and numerous protocol interactions also reduce the scalability of the system.
  • the cluster view control of the partition granularity can greatly reduce the impact range of the storage node failure, so that the storage system can be expanded on a large scale and the scalability of the system is improved.
  • The data recovery procedure (shown in FIG. 10) during the OSD node failure recovery process of this embodiment specifically includes:
  • The failback OSD node obtains, from its local records, the largest SeqID among the entries of each partition.
  • As described above, during IO processing an OSD node in the system records an entry for each IO operation on a partition; the entry includes the IO operation type, the partition group ID, the SeqID, and the key, and may further include status information describing whether the operation succeeded; for a write IO operation the entry may further include the aforementioned offset and length.
  • For example, in this embodiment the maximum SeqIDs of the write IO operations for P1 and Pn are obtained respectively (a minimal data-structure sketch follows below).
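The sketch below models the entry record and the computation of the per-partition maximum SeqID described above; it is a simplified illustration whose field and type names are assumptions, with the SeqID modeled as a pair of the IO view version number and the sequence number so that tuple comparison matches the comparison rule used in this document.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

SeqID = Tuple[int, int]   # (IO view version number, sequence number within that version)


@dataclass
class Entry:
    op_type: str                   # "write" or "delete"
    pg_id: str                     # partition group ID
    seq_id: SeqID
    key: str
    status: str = "ok"             # whether the operation succeeded
    offset: Optional[int] = None   # present only for write operations
    length: Optional[int] = None


def max_seq_id_per_partition(entries: List[Entry]) -> Dict[str, SeqID]:
    """Return the largest locally recorded SeqID for each partition group; the
    failback OSD node reports these values when requesting its missing entries."""
    result: Dict[str, SeqID] = {}
    for e in entries:
        current = result.get(e.pg_id)
        # Tuple comparison looks at the IO view version first and then at the
        # sequence number, matching the SeqID comparison rule in this document.
        if current is None or e.seq_id > current:
            result[e.pg_id] = e.seq_id
    return result
```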
  • The failback OSD node requests the primary OSD nodes of P1 and Pn, respectively, to obtain the entries that it is missing; each request carries the partition group ID and the corresponding maximum SeqID.
  • Scenario 1: the maximum SeqID of the standby OSD is within the range of entries recorded by the primary OSD.
  • The primary OSD nodes of P1 and Pn respectively send the entries missing on the fault-recovery OSD node to the fault-recovery OSD node.
  • For example, if the maximum SeqID of the fault-recovery OSD node is 1.6 and the current maximum SeqID of the primary OSD is 1.16, the ten entries corresponding to SeqIDs 1.7 through 1.16 are sent to the standby OSD. This is only an example given for ease of understanding; in practice the SeqID numbering rules or methods may differ, and the entries missing from the fault-recovery OSD node may differ between the primary OSD nodes of P1 and Pn.
  • The fault-recovery OSD node then repeatedly performs data synchronization according to the entries obtained in the previous step: it sends data synchronization requests, in batches and entry by entry, to the primary OSD nodes of P1 and Pn, and each request carries IO key information such as the key, offset, and length.
  • The primary OSD nodes of P1 and Pn respectively send the data corresponding to each entry to the fault-recovery OSD node according to the information carried in the data synchronization requests.
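A minimal sketch of this incremental catch-up (Scenario 1) is given below, reusing the Entry and SeqID types from the previous sketch; the primary-side helpers (entries_after, read_value), the batch size, and copying whole values instead of applying offset and length are simplifying assumptions made only for illustration.

```python
from typing import Dict, List


class PrimaryOsd:
    """Primary-side helper: an ordered operation log plus a key/value store."""

    def __init__(self, log: List[Entry], store: Dict[str, bytes]):
        self.log = sorted(log, key=lambda e: e.seq_id)
        self.store = store

    def entries_after(self, pg_id: str, max_seq_id: SeqID) -> List[Entry]:
        # Entries of this partition group that the failback standby is missing.
        return [e for e in self.log if e.pg_id == pg_id and e.seq_id > max_seq_id]

    def read_value(self, key: str) -> bytes:
        return self.store[key]


def incremental_recover(standby_store: Dict[str, bytes], primary: PrimaryOsd,
                        pg_id: str, local_max_seq_id: SeqID,
                        batch_size: int = 16) -> None:
    """Scenario 1: fetch the missing entries, then synchronize their data in batches."""
    missing = primary.entries_after(pg_id, local_max_seq_id)
    for i in range(0, len(missing), batch_size):
        # Each data synchronization request would carry the IO key information
        # (key, offset, length) taken from the entries in this batch.
        for entry in missing[i:i + batch_size]:
            if entry.op_type == "write":
                # For brevity the whole value is copied; a real implementation
                # would apply only the (offset, length) range from the entry.
                standby_store[entry.key] = primary.read_value(entry.key)
            elif entry.op_type == "delete":
                standby_store.pop(entry.key, None)
```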
  • Scenario 2: the maximum SeqID of the standby OSD is not within the range of entries recorded by the primary OSD, that is, it is smaller than the minimum SeqID of the primary OSD.
  • When the primary OSD node of Pn determines that the maximum SeqID of the fault-recovery OSD node is not within the range of entries recorded by the primary OSD, that is, it is smaller than the minimum SeqID of the primary OSD of Pn, it sends the minimum SeqID of the primary OSD node of Pn and zero entries to the fault-recovery OSD node, which allows the fault-recovery OSD node to determine whether the primary OSD node of Pn has never written data or has written so much data that data recovery cannot be completed by incremental synchronization.
  • The fault-recovery OSD node then repeatedly requests synchronization of the data of the entire partition, with each request carrying the partition group ID: it requests the primary OSD node of Pn to synchronize the data of the whole partition, in this embodiment synchronizing the primary partition data of Pn on the primary OSD of Pn with the standby partition of Pn on the fault-recovery OSD node.
  • Because the data volume of a partition is usually large and cannot be transferred in a single request, and because the primary node does not know the IO capability of the fault-recovery node (if the primary simply pushed data, the fault-recovery node might be unable to process it), data is sent to the fault-recovery node only when the fault-recovery node requests synchronization data. The fault-recovery OSD node therefore repeatedly sends synchronization requests to the primary OSD node of Pn as needed, synchronizing the data of the entire partition until all the data in the partition has been synchronized.
  • In practice, other methods of synchronizing an entire partition may also be used; the present invention is not limited in this respect.
  • The primary OSD node of Pn sends the data corresponding to one or more keys in response to each synchronization request sent by the fault-recovery OSD node.
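The sketch below illustrates this pull-based whole-partition synchronization (Scenario 2) under the same simplifying assumptions as the previous sketches; the cursor-based chunk interface and the chunk size are assumptions introduced only to show that the standby drives the transfer at its own pace.

```python
from typing import Dict, Optional, Tuple


class PartitionPrimary:
    """Primary-side view of one partition's data for whole-partition synchronization."""

    def __init__(self, partition_data: Dict[str, bytes]):
        self._keys = sorted(partition_data)
        self._data = partition_data

    def sync_chunk(self, cursor: Optional[str],
                   max_keys: int = 8) -> Tuple[Dict[str, bytes], Optional[str]]:
        """Return at most max_keys key/value pairs after `cursor`, plus the next
        cursor (None once all data in the partition has been handed out)."""
        start = 0 if cursor is None else self._keys.index(cursor) + 1
        chunk_keys = self._keys[start:start + max_keys]
        next_cursor = chunk_keys[-1] if start + max_keys < len(self._keys) else None
        return {k: self._data[k] for k in chunk_keys}, next_cursor


def full_partition_recover(standby_store: Dict[str, bytes],
                           primary: PartitionPrimary) -> None:
    """Scenario 2: the fault-recovery node repeatedly requests chunks of the whole
    partition, at its own pace, until everything has been synchronized."""
    cursor: Optional[str] = None
    while True:
        chunk, cursor = primary.sync_chunk(cursor)
        standby_store.update(chunk)
        if cursor is None:
            break
```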
  • If an OSD node cannot rejoin the cluster within a predetermined time threshold (for example, 5 minutes) after its failure, or if the OSD node has a hardware fault, the faulty OSD node needs to be kicked out of the cluster to ensure the reliability of the data.
  • An OSD node exiting the cluster involves a partition redistribution and data migration process. The redistribution of partitions should take into account the balance across nodes and the security of the replicas; the IO handling during data migration is consistent with the handling in the failure recovery process and the data recovery process.
  • After the data migration is complete and the primary and standby replicas reach a consistent state, the process in which the MDC updates the view and issues notifications is consistent with the view update performed after failure recovery is completed, and the process in which each related OSD node replicates or forwards IO requests according to the updated view is likewise consistent with the IO processing that each OSD node performs according to the latest view after failure recovery is completed.
  • FIG. 11 shows an embodiment, provided by the present invention, of the processing flow performed when an OSD node exits the cluster after a failure; the execution entities involved are the MDC module, the IO routing module, and the OSD nodes mentioned in the embodiments described in FIG. 2A to FIG. 2C and FIG. 5.
  • this embodiment assumes that the OSD nodes involved are an OSD1 node, an OSD2 node, and an OSDn node, wherein the OSD1 node is a primary OSD node of the partition1 group (referred to as P1), and a standby OSD node of the partition2 group (referred to as P2).
  • the OSD2 node is the standby OSD node of P1
  • the OSDn node is the primary OSD node of P2.
  • The processing flow for an OSD node exiting the cluster after a failure in this embodiment specifically includes:
  • The MDC module finds that the failure of OSD1 has exceeded a predetermined time threshold or that a hardware failure has occurred.
  • The MDC kicks OSD1 out of the cluster, performs a view change, and migrates the partitions on the OSD1 node (here, the primary partition of P1 and the standby partition of P2) to other nodes.
  • The MDC module notifies the OSD2 node of the view update: the OSD2 node is promoted to the primary of P1 and becomes the standby of P2.
  • The MDC module notifies the OSDn node of the view update: the OSDn node remains the primary of P2 and becomes the standby of P1.
  • The OSD2 node requests the OSDn node to synchronize the data of P2.
  • Because the OSDn node is the primary OSD node of P2, the OSD2 node requests the OSDn node to synchronize the data of the primary partition of P2 on OSDn, so that the data of the P2 standby partition on OSD2 becomes consistent with the data of the P2 primary partition on OSDn.
  • The specific synchronization process is similar to the whole-partition data synchronization in the data recovery process shown in FIG. 10 above, and details are not described herein again.
  • The OSDn node requests the OSD2 node to synchronize the data of P1.
  • As the new standby OSD node of P1, the OSDn node can only synchronize the data of P1 from OSD2; it therefore requests the OSD2 node to synchronize the data of the primary partition of P1 on OSD2, so that the data of the standby partition of P1 on OSDn becomes consistent with the data of the primary partition of P1 on OSD2.
  • the partition data is synchronized.
  • the OSD2 node notifies the MDC module that the data migration of P2 has been completed.
  • the OSDn node notifies the MDC module that the data migration of P1 has been completed.
  • the MDC module performs a view update according to the corresponding notification.
  • The MDC notifies the view update: the partition status of the standby OSDn node of P1 is set to consistent.
  • The MDC notifies the view update: the partition status of the standby OSD2 node of P2 is set to consistent.
  • The OSD2 node and the OSDn node perform the replication of the data corresponding to IO requests according to the latest view (an MDC-side sketch of this exit handling follows below).
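A minimal MDC-side sketch of the exit handling described in this flow is given below; the data structures, the replacement-selection input, and the method names are simplifying assumptions and do not represent the actual module interfaces of the system.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PartitionGroup:
    pg_id: str
    primary: str
    standbys: Dict[str, str] = field(default_factory=dict)  # standby OSD id -> partition state
    view_version: int = 1


class MdcExitHandler:
    def __init__(self, groups: Dict[str, PartitionGroup]):
        self.groups = groups

    def kick_out(self, failed_osd: str, replacements: Dict[str, str]) -> List[str]:
        """Remove failed_osd from every partition group it served; `replacements`
        maps pg_id -> the OSD chosen to host the migrated replica. Returns the
        partition groups whose views changed."""
        changed: List[str] = []
        for pg in self.groups.values():
            if failed_osd == pg.primary:
                consistent = [o for o, s in pg.standbys.items() if s == "consistent"]
                if not consistent:
                    # The primary may only be switched when a consistent standby
                    # exists; otherwise the view is left unchanged.
                    continue
                new_primary = consistent[0]
                pg.standbys.pop(new_primary)
                pg.primary = new_primary
                # The migration target still has to copy the data, so it starts
                # out as an "inconsistent" standby.
                pg.standbys[replacements[pg.pg_id]] = "inconsistent"
            elif failed_osd in pg.standbys:
                pg.standbys.pop(failed_osd)
                pg.standbys[replacements[pg.pg_id]] = "inconsistent"
            else:
                continue
            pg.view_version += 1
            changed.append(pg.pg_id)
        return changed

    def on_migration_complete(self, pg_id: str, osd_id: str) -> None:
        # After being notified that the data migration has finished, the MDC
        # marks the standby consistent and bumps the view version again.
        pg = self.groups[pg_id]
        pg.standbys[osd_id] = "consistent"
        pg.view_version += 1
```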
  • The node expansion process mainly involves data migration, a view update after the data migration, notification of the view update, and replication of IO requests by the related OSD nodes according to the updated view.
  • The IO handling during data migration is consistent with the handling in the failure recovery process and the data recovery process; after the data migration is completed, the process in which the MDC updates the view and issues notifications is consistent with the view update performed after failure recovery is completed, and each related OSD node replicates the data corresponding to IO requests according to the updated view, in the same way that each OSD node performs IO processing according to the latest view after failure recovery is completed.
  • The view update after the data migration is completed may include: 1) the newly expanded node remains a standby OSD node of the partition group and its partition state is set to the consistent state, while the original standby OSD node of the partition group is no longer a standby node of that partition; or 2) the newly expanded node is promoted to the primary OSD node of some partition groups, and those partition groups no longer belong to the original primary OSD node (the partition groups are no longer distributed on the original primary OSD).
  • The execution entities involved in this embodiment are the MDC module, the IO routing module, and the OSD nodes mentioned in the embodiments described in FIG. 2A to FIG. 2C and FIG. 5.
  • The OSD nodes involved are an OSD1 node, an OSDn node, and a newly expanded OSD node, where OSD1 is the primary OSD node of partition group 1 (referred to as P1) and a standby OSD node of partition group n (referred to as Pn), OSDn is the primary OSD node of Pn and a standby OSD node of P1, and the newly expanded OSD node becomes a standby OSD node of both P1 and Pn.
  • the OSD node expansion process in this embodiment specifically includes:
  • The OSD node expansion configuration and commands are delivered to the MDC module; the system administrator may notify the MDC module of the OSD node expansion by means of a configuration command.
  • The MDC module performs a view update so as to migrate the partitions on some OSD nodes to the newly expanded OSD node.
  • Specifically, the MDC module migrates the standby partition of P1 and the standby partition of Pn that reside on the OSD1 node and the OSDn node to the newly expanded OSD node, so that the newly expanded OSD node serves as the new standby OSD node of both P1 and Pn.
  • The MDC module notifies the OSD1 node of the view update: a newly expanded OSD node is added as a standby OSD node of P1 in the new partition view of P1, and the corresponding partition state is “inconsistent” (because the newly expanded OSD node has not yet synchronized the data of P1 with the OSD1 node).
  • Likewise, the newly expanded OSD node is added as a standby OSD node of Pn in the new partition view of Pn, and the corresponding partition state is “inconsistent” (because the newly expanded OSD node has not yet synchronized the data of Pn with the OSDn node).
  • The newly expanded OSD node performs the initialization process; the specific process is the same as the cluster view initialization generation and acquisition process in FIG. 6 above, and details are not described herein again.
  • The MDC module returns the views of the partitions on the newly expanded OSD node to the newly expanded OSD node; in this example these are the IO view of P1 and the IO view of Pn.
  • According to the IO view of P1 returned by the MDC, the newly expanded OSD node requests the primary OSD node of P1 (OSD1) to synchronize the data of P1, that is, it synchronizes the data of the primary partition of P1 on the OSD1 node, so that the data of the standby partition of P1 on the newly expanded OSD node becomes consistent with the data of the primary partition of P1 on the OSD1 node.
  • The specific synchronization process is similar to the whole-partition data synchronization in the data recovery process shown in FIG. 10 above, and details are not described herein again.
  • The newly expanded OSD node requests the primary OSDn node to synchronize the partition data.
  • According to the IO view of Pn returned by the MDC, the newly expanded OSD node requests the primary OSD node of Pn (OSDn) to synchronize the data of Pn, that is, it synchronizes the data of the primary partition of Pn on the OSDn node, so that the data of the standby partition of Pn on the newly expanded OSD node becomes consistent with the data of the primary partition of Pn on the OSDn node.
  • The specific synchronization process is similar to the whole-partition data synchronization in the data recovery process shown in FIG. 10 above, and details are not described herein again.
  • the partition data is synchronized.
  • The MDC module is notified that the partition data migration has been completed.
  • the OSD1 node notifies the MDC module that the newly expanded OSD node has completed data synchronization of P1.
  • the OSDn node notifies the MDC module that the newly expanded OSD node has completed data synchronization of the Pn.
  • the MDC module performs view update.
  • The OSD1 node, the OSDn node, and the newly expanded OSD node perform the replication of the data corresponding to IO requests according to the updated view (a sketch of the new node's side of this expansion flow follows below).
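The sketch below shows, under the same illustrative assumptions, the newly expanded node's side of this flow; it reuses the IOView type from the first sketch and the PartitionPrimary / full_partition_recover helpers from the Scenario 2 sketch, and the notify_sync_done callback stands in for the completion notification that ultimately triggers the MDC view update.

```python
from typing import Callable, Dict


def join_as_standby(local_stores: Dict[str, Dict[str, bytes]],
                    io_views: Dict[str, IOView],
                    primaries: Dict[str, PartitionPrimary],
                    notify_sync_done: Callable[[str, str], None],
                    self_id: str) -> None:
    """io_views maps each partition group migrated onto this node (here P1 and Pn)
    to the IO view returned by the MDC; primaries maps a primary OSD id to a
    handle on the partition data it hosts for the corresponding group."""
    for pg_id, io_view in io_views.items():
        primary = primaries[io_view.primary_osd]   # OSD1 for P1, OSDn for Pn here
        # Whole-partition synchronization, as in the data recovery flow of FIG. 10.
        full_partition_recover(local_stores.setdefault(pg_id, {}), primary)
        # Report completion so that the partition state can be switched from
        # "inconsistent" to "consistent" in the subsequent view update.
        notify_sync_done(pg_id, self_id)
```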

Abstract

本发明公开了一种分布式存储复制系统、方法,该系统包括至少一个元数据控制(MDC)模块,多个IO路由模块以及多个存储对象设备(OSD)节点,所述MDC模块用于为所述各OSD节点配置与所述各OSD节点所管理的物理存储资源相对应的至少一个逻辑分区(partition);所述IO路由模块用于将接收的IO请求路由至所述OSD节点;所述OSD节点用于根据所述IO请求执行IO请求所对应的数据的存储;所述MDC确定所述系统中某个OSD节点为故障OSD节点,确定所述故障OSD节点上的partition,更新所述故障OSD节点上的partition所在的partition群的partition视图,向所述更新的partition视图中的主OSD节点发送更新通知;所述主OSD节点用于接收所述MDC模块发送的所述更新通知后,根据所述更新的partition视图处理IO请求所对应的数据的复制。应用本发明实施例,提高了一致性复制的处理性能,容错能力以及可用性。

Description

一种分布式存储复制系统和方法 技术领域
本发明涉及IT(Information Technology)信息技术领域,特别涉及一种分布式存储复制系统和方法。
背景技术
随着信息技术的快速发展和互联网的广泛应用,人类产生的数据呈现爆炸式增长,这样对数据存储提出了更高的扩展性要求。相对于传统的存储阵列系统,分布式存储系统有着更好的扩展性和通用硬件设备适应性,能够更好的满足未来对数据存储的需求。
分布式存储系统一般是把大量存储节点组织成一个分布式系统,通过不同节点之间的数据复制备份来保证数据的可靠性,这样一份数据在不同的存储节点上都有副本,如何保证多个数据副本的数据一致性,是分布式存储系统长期以来面临的一个问题。在保证数据一致性的情况下,系统的性能和可用性也越来越成为很重要的考虑因素。
如图1所示为现有的两阶段提交协议(2Phase Commit,2PC),其是一种典型的中心化的强一致性副本控制协议,许多分布数据库系统采用该协议来保证副本的一致性。
在两阶段提交协议中,系统一般包含两类节点:一类是协调者(coordinator),一类是参与者(participant)。协调者负责发起对数据更新的投票以及通知投票决定的执行,参与者参与数据更新的投票以及执行投票的决定。
两阶段提交协议由两个阶段组成:阶段一:请求阶段,协调者通知参与者对数据的修改进行投票,参与者告知协调者自己的投票结果:同意或者否决;阶段二:提交阶段,协调者根据第一阶段的投票结果进行决策:执行或者取消。
一次成功的两阶段提交协议中,协调者与每个参与者至少需要两轮交互4个消息,过多的交互次数会降低性能,此外,两阶段提交协议中的节点如果发生故障或者一直没有响应,其他的IO请求便会阻塞,最后导致超时失败而需要进行数据回滚,容错能力及可用性较低。
发明内容
本发明实施例提供一种分布式存储复制系统,以及一种应用于分布式存储系统中管理数据存储和复制的方法以解决现有一致性复制协议中低性能以及可用性低的问题。
第一方面,一种分布式存储复制系统,所述系统包括至少一个元数据控制(MDC)模块,多个IO路由模块以及多个存储对象设备(OSD)节点,所述MDC模块用于为所述各OSD节点配置与所述各OSD节点所管理的物理存储资源相对应的至少一个逻辑分区(partition),所述至少一个partition为主partition或备partition或主备partition的任意组合,主partition和所述主partition对应的备partition构成一个partition群,同一个partition群中的主partition和备partition位于不同的OSD节点上,主partition所在的OSD节点为所述主partition所在的partition群的主OSD节点,备partition所在的OSD节点为所述备partition所在的partition群的备OSD节点,根据所述partition生成partition视图,所述partition 视图包括一个partition群中的各partition所在的OSD信息;所述IO路由模块用于将接收的IO请求路由至所述OSD节点;所述OSD节点用于根据所述IO请求执行IO请求所对应的数据的存储;所述MDC确定所述系统中某个OSD节点为故障OSD节点,确定所述故障OSD节点上的partition,更新所述故障OSD节点上的partition所在的partition群的partition视图,向所述更新的partition视图中的主OSD节点发送更新通知;所述主OSD节点用于接收所述MDC模块发送的所述更新通知后,根据所述更新的partition视图处理IO请求所对应的数据的复制。
在第一方面的第一种可能的实现方式中,所述partition视图具体包括一个partition群中各partition所在的OSD的主备身份及对应的partition状态;所述主OSD节点进一步用于根据所述更新的partition视图更新所述主OSD节点本地保存的partition视图;所述根据所述更新的partition视图处理IO请求所对应的数据的复制具体包括:根据所述更新后本地保存的partition视图将来自所述IO路由模块的IO请求所对应的数据复制到所述更新后本地保存的partition视图partition状态为一致的备OSD节点上,或者复制到所述更新后本地保存的partition视图中partition状态为一致的备OSD节点上以及所述更新后本地保存的partition视图中partition状态为不一致但正在进行数据恢复的备OSD节点上。
结合第一方面的第一种可能的实现方式,在第二种可能的实现方式中,所述MDC模块进一步用于生成IO视图,所述IO视图包括一个partition群的主OSD节点的标识,将所述IO视图发送给所述IO路由模块以及所述partition视图中的partition所在的OSD节点;以及所述主OSD节点进一步用于根据所述更新的partition视图更新所述主OSD节点本地保存的IO视图,并根据所述更新后本地保存的IO视图处理所述IO请求所对应的数据的复制。
结合第一方面的第二种可能的实现方式,在第三种可能的实现方式中,所述的MDC模块进一步用于当确定所述故障OSD节点上的partition包括主partition时,更新所述主partition所在的partition群的IO视图,将所述更新的IO视图通知给所述更新的partition视图中的备OSD节点;以及所述更新的partition视图中的备OSD节点用于根据所述更新的IO视图更新本地保存的IO视图并根据所述更新后本地保存的IO视图处理所述IO请求所对应的数据的复制。
结合第一方面的第三种可能的实现方式,在第四种可能的实现方式中,所述更新所述故障OSD节点上的partition所在的partition群的partition视图具体包括:当所述故障OSD节点上的partition包括备partition时,将所述备partition所在的partition群的partition视图中所述故障OSD节点的partition状态标记为不一致;以及,当所述故障OSD节点上的partition包括主partition时,将所述主partition所在的partition群的partition视图中作为主OSD节点的所述故障OSD节点降为新的备OSD节点并将所述新的备OSD节点对应的partition状态标记为不一致,并从所述主partition所在的partition群的partition视图中的原备OSD节点中选择一个partition状态为一致的备OSD节点升为新的主OSD节点。
第二方面,一种分布式存储复制系统,所述系统包括至少一个元数据控制(MDC)模块以及多个IO路由模块,多个存储对象设备(OSD)节点,所述MDC模块用于为所述各OSD节点配置与所述各OSD节点所管理的物理存储资源所对应的至少一个逻辑分区(partition),所述至少一个partition为主partition或备partition或主备partition的任意组合,主partition和所述主partition对应的备partition构 成一个partition群,同一个partition群中的主partition和备partition位于不同的OSD节点上,主partition所在的OSD节点为所述主partition所在的partition群的主OSD节点,备partition所在的OSD节点为所述备partition所在的partition群的备OSD节点,根据所述partition生成partition视图以及IO视图,所述partition视图包括一个partition群中的各partition所在的OSD信息,所述IO视图包括一个partition群的主OSD的标识;所述IO路由模块用于将接收的IO请求路由至所述OSD节点;所述OSD节点用于根据所述IO请求执行IO数据的存储;所述IO路由模块用于接收IO请求,所述IO请求包括key,根据所述key确定所述IO请求所对应的数据所属的partition群并确定所述数据所属的partition群的主OSD节点,将所述数据所属的partition群的IO视图的IO视图版本信息加入所述IO请求,并将携带有所述IO视图版本信息的所述IO请求发送至所述确定出的主OSD节点;所述主OSD节点用于接收所述IO请求,在根据所述IO视图版本信息确定所述IO请求中的IO视图版本与所述主OSD节点本地保存的IO视图版本一致后,执行所述IO请求,生成携带有所述IO视图版本信息的复制请求,并将所述的复制请求发送至所述数据所属的partition群的备OSD节点;以及所述备OSD节点用于接收所述复制请求,并在根据所述IO视图版本信息确定所述复制请求中的IO视图版本与所述备OSD节点本地保存的IO视图版本一致后,执行所述复制请求以使得所述备OSD节点上的所述IO请求所对应的数据与所述主OSD节点上的所述IO请求所对应的数据保持一致。
在第二方面的第一种可能的实现方式中,所述主OSD节点进一步用于在根据所述IO视图版本信息确定所述IO请求中的IO视图版本旧于所述主OSD本地保存的IO视图版本后向所述的IO路由模块返回错误,以及在确定所述IO请求中的IO视图版本新于所述主OSD本地保存的IO视图版本后将所述IO请求加入缓冲队列,并向所述MDC模块查询所述数据所属的partition群的IO视图的IO视图版本信息以确定所述主OSD本地保存的IO视图版本与所述IO请求中的IO视图版本一致后执行所述IO请求;以及所述IO路由模块用于收到所述主OSD节点返回的错误后向所述MDC模块查询所述数据所属的partition群的IO视图,在获得更新的IO视图版本信息后发送所述携带所述更新的IO视图版本信息的IO请求。
结合第二方面或第二方面的第一种可能的实现方式,在第二种可能的实现方式中,所述IO视图版本信息包括IO视图版本号,所述主OSD节点进一步为所述IO请求生成序列标识,所述序列标识包括所述IO视图版本号和序列号,并将所述序列标识加入到发送至所述备OSD节点的所述复制请求;所述序列号表示在一个IO视图版本内针对所述IO视图中的partition群对应的数据的修改操作的连续编号;所述备OSD节点进一步用于根据所述序列标识执行所述复制请求。
结合第二方面的第二种可能的实现方式,在第三种可能的实现方式中,所述复制请求中进一步携带有所述主OSD节点针对所述partition群发送的上一次复制请求中的序列标识;所述备OSD节点用于在收到所述的复制请求后并在确定所述上一次复制请求的序列标识与所述备OSD节点本地保存的最大序列标识一致的情况下执行所述复制请求。
结合第二方面或者第二方面的第一、第二以及第三种可能的实现方式,在第四种可能的实现方式中,所述partition视图具体包括一个partition群中各partition所在的OSD的主备身份及对应的partition状 态;所述MDC模块进一步用于:在所述IO请求处理过程中检测到所述主OSD节点发生故障时将所述数据所属的partition群的partition视图中的所述主OSD节点降为新的备OSD节点,并将所述新的备OSD的partition状态标记为不一致,将所述数据所属的partition群的partition视图中的所述备OSD节点中任一备OSD节点升为新的主OSD节点,将更新后的所述数据所属的partition群的partition视图通知至所述新的主OSD节点,用所述新的主OSD节点更新所述数据所属的partition群的IO视图,将更新后的所述数据所属的partition的IO视图通知至所述IO路由模块;所述IO路由模块进一步用于接收所述MDC模块发送的所述更新后的所述partition群的IO视图,根据所述更新后的所述partition群的IO视图将所述IO请求发送至所述新的主OSD节点;以及所述新的主OSD节点用于接收所述IO请求,在执行所述IO请求后生成第二复制请求,将所述第二复制请求发送至所述更新后的所述数据所属的partition群的partition视图中的partition状态为一致的备OSD节点。
结合第二方面或者第二方面的第一、第二以及第三种可能的实现方式,在第五种可能的实现方式中,所述partition视图具体包括一个partition群中各partition所在的OSD的主备身份及相应的partition状态,所述MDC模块进一步用于在所述IO请求处理过程中检测到所述备OSD节点中任一备OSD节点发生故障时,将所述数据所属的partition群的partition视图中的所述任一备OSD的partition状态标记为不一致,将更新后的所述数据所属的partition群的partition视图通知至所述主OSD节点;以及所述主OSD节点用于接收到所述更新后的所述数据所属的partition群的partition视图后,将所述复制请求发送至所述更新后的partition视图中的partition状态为一致的备OSD节点,不将所述复制请求发送至partition状态为不一致的备OSD节点。
第三方面,一种分布式存储复制系统,所述系统包括至少一个元数据控制(MDC)模以及多个IO路由模块,多个存储对象设备(OSD)节点,所述MDC模块用于为所述各OSD节点配置与所述各OSD节点所管理的物理存储资源所对应的至少一个逻辑分区(partition),所述至少一个partition为主partition或备partition或主备partition的任意组合,主partition和所述主partition对应的备partition构成一个partition群,同一个partition群中的主partition和备partition位于不同的OSD节点上,主partition所在的OSD节点为所述主partition所在的partition群的主OSD节点,备partition所在的OSD节点为所述备partition所在的partition群的备OSD节点,根据所述partition生成partition视图以及IO视图,所述partition视图包括一个partition群中的各partition所在的OSD信息,所述IO视图包括一个partition群的主OSD的标识;所述IO路由模块用于对接收的IO请求进行路由转所述OSD节点;所述OSD节点用于根据所述IO请求所对应的数据的存储;所述OSD节点用于在自身故障恢复后向所述MDC模块发送查询请求以请求所述OSD节点上的partition所在的partition群的IO视图,所述OSD节点称为故障恢复OSD节点,所述查询请求中携带所述故障恢复OSD节点的OSD标识,接收所述MDC返回的所述IO视图,向所述IO视图中的主OSD发起数据恢复请求以请求恢复所述故障恢复OSD节点在故障期间更新的数据,接收所述主OSD发送的所述故障期间更新的数据,根据所述MDC模块在所述故障恢复OSD节点数据恢复执行完毕后更新的所述partition群的partition视图处理IO请求的复制;所述MDC模块用于接收所述故障恢复OSD节点的所述查询请求,根据查询请求中的OSD标识将所述IO视图返 回给所述故障恢复OSD节点,并在所述故障恢复OSD节点数据恢复执行完毕后更新所述partition群的partition视图;以及所述主OSD节点用于接收所述故障恢复OSD节点的数据恢复请求,将所述故障期间更新的数据发送至所述的故障恢复OSD节点,以及根据所述MDC模块在所述故障恢复OSD节点数据恢复执行完毕后更新的所述partition群的partition视图处理所述IO请求所对应的数据的复制。
在第三方面的第一种可能的实现方式中,所述主OSD节点进一步用于在接收所述数据恢复请求后接收所述IO路由模块发送的针对所述故障恢复OSD节点上的partition的IO请求,执行所述IO请求,向所述的故障恢复OSD节点发送携带有IO关键信息以及所述IO请求所对应的数据的复制请求;以及所述的故障恢复OSD节点将所述IO关键信息以及所述IO请求所对应的数据的复制请求写入日志,在所述数据恢复执行完毕后根据所述日志的记录将所述IO请求所对应的数据更新到所述故障恢复OSD节点所管理的物理存储资源中。
结合第三方面的第一种可能的实现方式中,在第二种可能的实现方式中,所述数据恢复请求中携带有所述故障恢复OSD节点本地记录的针对所述故障恢复OSD节点上的partition的IO操作的最大序列标识;所述最大序列标识为:所述故障恢复OSD节点上的partition所在的partition群的IO视图的最新IO视图版本号,以及针对所述最新IO视图版本号对应的IO视图中的partition对应的数据的修改操作的最大编号;所述将所述故障期间更新的数据发送至所述的故障恢复OSD节点包括:确定所述数据恢复请求中最大的序列标识大于或等于所述主OSD节点本地存储的当前最小的序列标识,向所述故障恢复OSD节点发送所述故障恢复OSD节点在故障期间缺少的entry,接收所述故障恢复OSD节点根据所述entry发起的数据恢复请求,将对应所述entry的数据发送至所述的故障恢复OSD节点;所述最小序列标识为:所述主OSD节点保存的所述partition群的IO视图的最小IO视图版本号,以及针对所述最小IO视图版本号对应的IO视图中的partition对应的数据的修改操作的最小编号。
结合第三方面的第一种可能的实现方式中,在第二种可能的实现方式中,所述数据恢复请求中携带有所述故障恢复OSD节点本地记录的针对所述故障恢复OSD节点上的partition的IO操作的最大序列标识;所述最大序列标识包括:所述故障恢复OSD节点上的partition所在的partition群的IO视图的最新IO视图版本号,以及在所述最新IO视图版本号对应的IO视图内针对所述IO视图中的partition对应的数据的修改操作的最大编号;所述将所述故障期间更新的数据发送至所述的故障恢复OSD节点包括:确定所述数据恢复请求中最大的序列标识小于所述主OSD节点本地存储的当前最小的序列标识,向所述故障恢复OSD节点发送所述主OSD本地存储的当前最小的序列标识,接收所述故障恢复OSD节点发起的同步所述主OSD节点上所属所述partition群的主partition所对应的全部数据的数据恢复请求,将所述主partition所对应的全部数据发送至所述故障恢复OSD节点;所述最小序列标识为:所述主OSD节点保存的所述partition群的IO视图的最小IO视图版本号,以及针对所述最小IO视图版本号对应的IO视图中的partition对应的数据的修改操作的最小编号。
第四方面,一种分布式存储复制系统,所述系统包括至少一个元数据控制(MDC)模以及多个IO路由模块,多个存储对象设备(OSD)节点,所述MDC模块用于为所述各OSD节点配置与所述各OSD节点所管理的物理存储资源所对应的至少一个逻辑分区(partition),所述至少一个partition为主partition 或备partition或主备partition的任意组合,主partition和所述主partition对应的备partition构成一个partition群,同一个partition群中的主partition和备partition位于不同的OSD节点上,主partition所在的OSD节点为所述主partition所在的partition群的主OSD节点,备partition所在的OSD节点为所述备partition所在的partition群的备OSD节点,根据所述partition生成partition视图,所述partition视图包括一个partition群中的各partition所在的OSD信息;所述IO路由模块用于将接收的IO请求路由至所述OSD节点;所述OSD节点用于根据所述IO请求所对应的数据的存储;所述系统包括存储器以及处理器;所述的存储器用于存储计算机可读指令,所述指令用于执行所述MDC模块、IO路由模块以及OSD节点的功能;所述的处理器用于与所述的存储器连接,读取所述存储器中的指令,根据所述的指令促使所述处理器执行下述操作:确定所述系统中某个OSD节点为故障OSD节点,确定所述故障OSD节点上的partition,更新所述故障OSD节点上的partition所在的partition群的partition视图,向所述更新的partition视图中的partition群所在的主OSD节点发送更新通知,以便所述主OSD节点根据所述更新的partition视图处理IO请求所对应的数据的复制。
第五方面,一种分布式存储复制系统,所述系统包括至少一个元数据控制(MDC)模以及多个IO路由模块,多个存储对象设备(OSD)节点,所述MDC模块用于为所述各OSD节点配置与所述各OSD节点所管理的物理存储资源所对应的至少一个逻辑分区(partition),所述至少一个partition为主partition或备partition或主备partition的任意组合,主partition和所述主partition对应的备partition构成一个partition群,同一个partition群中的主partition和备partition位于不同的OSD节点上,主partition所在的OSD节点为所述主partition所在的partition群的主OSD节点,备partition所在的OSD节点为所述备partition所在的partition群的备OSD节点,根据所述partition生成partition视图以及IO视图,所述partition视图包括一个partition群中的各partition所在的OSD信息,所述IO视图包括一个partition群的主OSD的标识;所述IO路由模块用于将接收的IO请求路由至所述OSD节点;所述OSD节点用于根据所述IO请求所对应的数据的存储;所述系统包括存储器以及处理器;所述的存储器用于存储计算机可读指令,所述指令用于执行所述MDC模块、IO路由模块以及OSD节点的功能;所述的处理器用于与所述的存储器连接,读取所述存储器中的指令,根据所述的指令促使所述处理器执行下述操作:促使所述IO路由模块接收IO请求,所述IO请求包括key,根据所述key确定所述IO请求所对应的数据所属的partition群并确定所述数据所属的partition群的主OSD节点,将所述数据所属的partition群的IO视图的IO视图版本信息加入所述IO请求,并将携带有所述IO视图版本信息的所述IO请求发送至所述确定出的主OSD节点;促使所述主OSD节点接收所述IO请求,在根据所述IO视图版本信息确定所述IO请求中的IO视图版本与本地保存的IO视图版本一致后,执行所述IO请求,生成携带有所述IO视图版本信息的复制请求,并将所述的复制请求发送至所述数据所属的partition的备OSD节点;以及促使所述备OSD节点接收所述复制请求,并在根据所述IO视图版本信息确定所述复制请求中的IO视图版本与所述备OSD节点本地保存的IO视图版本一致后,执行所述复制请求以使得所述备OSD节点上的所述IO请求所对应的数据与所述主OSD节点上的所述IO请求所对应的数据保持一致。
第六方面,一种分布式存储复制系统,所述系统包括至少一个元数据控制(MDC)模以及多个IO 路由模块,多个存储对象设备(OSD)节点,所述MDC模块用于为所述各OSD节点配置与所述各OSD节点所管理的物理存储资源所对应的至少一个逻辑分区(partition),所述至少一个partition为主partition或备partition或主备partition的任意组合,主partition和所述主partition对应的备partition构成一个partition群,同一个partition群中的主partition和备partition位于不同的OSD节点上,主partition所在的OSD节点为所述主partition所在的partition群的主OSD节点,备partition所在的OSD节点为所述备partition所在的partition群的备OSD节点,根据所述partition生成partition视图以及IO视图,所述partition视图包括一个partition群中的各partition所在的OSD信息,所述IO视图包括一个partition群的主OSD的标识;所述IO路由模块用于将接收的IO请求路由至所述OSD节点;所述OSD节点用于根据所述IO请求所对应的数据的存储;所述系统包括存储器以及处理器;所述的存储器用于存储计算机可读指令,所述指令用于执行所述MDC模块、IO路由模块以及OSD节点的功能;所述的处理器用于与所述的存储器连接,读取所述存储器中的指令,根据所述的指令促使所述处理器执行下述操作:促使所述OSD节点在自身故障恢复后向所述MDC模块发送查询请求以请求所述OSD节点上的partition所在的partition群的IO视图,所述OSD节点称为故障恢复OSD节点,所述查询请求中携带所述故障恢复OSD节点的OSD标识,接收所述MDC返回的所述IO视图,向所述IO视图中的的主OSD发起数据恢复请求以请求恢复所述故障恢复OSD节点在故障期间更新的数据,接收所述主OSD发送的所述故障期间更新的数据,根据所述MDC模块在所述故障恢复OSD节点数据恢复执行完毕后更新的所述partition群的partition视图处理IO请求的复制;促使所述MDC模块接收所述故障恢复OSD节点的所述查询请求,根据查询请求中的OSD标识将所述IO视图返回给所述故障恢复OSD节点,并在所述故障恢复OSD节点数据恢复执行完毕后更新所述partition群的partition视图;以及促使所述主OSD节点接收所述故障恢复OSD节点的数据恢复请求,将所述故障期间更新的数据发送至所述的故障恢复OSD节点,以及根据所述MDC模块在所述故障恢复OSD节点数据恢复执行完毕后更新的所述partition群的partition视图处理所述IO请求所对应的数据的复制。
第七方面,一种应用于分布式存储系统中管理数据存储和复制的方法,所述系统包括至少一个元数据控制(MDC)模块以及多个IO路由模块,多个存储对象设备(OSD)节点,所述MDC模块用于为所述各OSD节点所管理的物理存储资源相对应的至少一个逻辑分区(partition),所述至少一个partition为主partition或备partition或主备partition的任意组合,主partition和所述主partition对应的备partition构成一个partition群,同一个partition群中的主partition和备partition位于不同的OSD节点上,主partition所在的OSD节点为所述主partition所在的partition群的主OSD节点,备partition所在的OSD节点为所述备partition所在的partition群的备OSD节点,根据所述partition生成partition视图,所述partition视图包括一个partition群中的各partition所在的OSD信息;所述IO路由模块用于将接收的IO请求路由至所述OSD节点;所述OSD节点用于根据所述IO请求执行IO请求所对应的数据的存储;所述的方法包括:确定所述系统中某个OSD节点为故障OSD节点,确定所述故障OSD节点上的partition,更新所故障OSD节点上的partition所在的partition群的partition视图,向所述更新的partition视图中的主OSD节点发送更新通知;以及所述主OSD节点用于接收所述MDC模块发送的所述更新通知后, 根据所述更新的partition视图处理IO请求所对应的数据的复制。
在第七方面的第一种可能的实现方式中,所述partition视图具体包括一个partition群中各partition所在的OSD的主备身份及对应的partition状态;所述主OSD节点进一步用于根据所述更新的partition视图更新所述主OSD节点本地保存的partition视图;所述根据所述更新的partition视图处理IO请求所对应的数据的复制具体包括:根据所述更新后本地保存的partition视图将来自所述IO路由模块的IO请求所对应的数据复制到所述更新后本地保存的partition视图中partition状态为一致的备OSD节点上,或者复制到所述更新后本地保存的partition视图中partition状态为一致的备OSD节点上以及所述更新后本地保存的partition视图中partition状态为不一致但正在进行数据恢复的备OSD节点上。
结合第七方面的第一种可能的实现方式,在第二种可能的实现方式中,所述的方法进一步包括:当所述的MDC模块确定所述故障OSD节点上的partition包括主partition时,更新所述主partition所在的partition群的IO视图,将所述更新的IO视图通知给所述更新的partition视图中的备OSD节点;以及所述更新的partition视图中的备OSD节点根据所述更新的IO视图更新本地保存的IO视图并根据所述更新后本地保存的IO视图处理所述IO请求所对应的数据的复制。
结合第七方面的第二种可能的实现方式,在第三种可能的实现方式中,所述更新所述故障OSD节点上的partition所在的partition群的partition视图具体包括:当所述故障OSD节点上的partition包括备partition时,将所述备partition所在的partition群的partition视图中所述故障OSD节点的partition状态标记为不一致;以及当所述故障OSD节点上的partition包括主partition时,将所述主partition所在的partition群的partition视图中作为主OSD节点的所述故障OSD节点降为新的备OSD节点并将所述新的备OSD节点对应的partition状态标记为不一致,并从所述主partition所在的partition群的partition视图中的原备OSD节点中选择一个partition状态为一致的备OSD节点升为新的主OSD节点。
第八方面,一种应用于分布式存储系统中管理数据存储和复制的方法,所述系统包括至少一个元数据控制(MDC)模块以及多个IO路由模块,多个存储对象设备(OSD)节点,所述MDC模块用于为所述各OSD节点配置与所述各OSD节点所管理的物理存储资源所对应的至少一个逻辑分区(partition),所述至少一个partition为主partition或备partition或主备partition的任意组合,主partition和所述主partition对应的备partition构成一个partition群,同一个partition群中的主partition和备partition位于不同的OSD节点上,主partition所在的OSD节点为所述主partition所在的partition群的主OSD节点,备partition所在的OSD节点为所述备partition所在的partition群的备OSD节点,根据所述partition生成partition视图以及IO视图,所述partition视图包括一个partition群中的各partition所在的OSD信息,所述IO视图包括一个partition群的主OSD的标识;所述IO路由模块用于将接收的IO请求路由至所述OSD节点;所述OSD节点用于根据所述IO请求执行IO数据的存储;所述的方法包括:所述IO路由模块用于接收IO请求,所述IO请求包括key,根据所述key确定所述IO请求所对应的数据所属的partition群并确定所述数据所属的partition群的主OSD节点,将所述数据所属的partition群的IO视图的IO视图版本信息加入所述IO请求,并将携带有所述IO视图版本信息的所述IO请求发送至所述确定出的主OSD节点;所述主OSD节点用于接收所述IO请求,在根据所述IO视图版本信息确定所述 IO请求中的IO视图版本与所述主OSD节点本地保存的IO视图版本一致后,执行所述IO请求,生成携带有所述IO视图版本信息的复制请求,并将所述的复制请求发送至所述数据所属的partition群的备OSD节点;以及所述备OSD节点用于接收所述复制请求,并在根据所述IO视图版本信息确定所述复制请求中的IO视图版本与所述备OSD节点本地保存的IO视图版本一致后,执行所述复制请求以使得所述备OSD节点上的所述IO请求所对应的数据与所述主OSD节点上的所述IO请求所对应的数据保持一致。
在第八方面的第一种可能的实现方式中,所述partition视图具体包括一个partition群中各partition所在的OSD的主备身份及对应的partition状态,所述方法进一步包括:所述MDC模块在所述IO请求处理过程中检测到所述主OSD节点发生故障时将所述数据所属的partition群的partition视图中的所述主OSD节点降为新的备OSD节点,并将所述新的备OSD的partition状态标记为不一致,将所述数据所属的partition群的partition视图中的所述备OSD节点中任一备OSD节点升为新的主OSD节点,将更新后的所述数据所属的partition群的partition视图通知至所述新的主OSD节点,用所述新的主OSD节点更新所述数据所属的partition群的IO视图,将更新后的所述数据所属的partition的IO视图通知至所述IO路由模块;所述IO路由模块进一步用于接收所述MDC模块发送的所述更新后的所述partition群的IO视图,根据所述更新后的所述partition群的IO视图将所述IO请求发送至所述新的主OSD节点;以及所述新的主OSD节点用于接收所述IO请求,在执行所述IO请求后生成第二复制请求,将所述第二复制请求发送至所述更新后的所述数据所属的partition群的partition视图中的partition状态为一致的备OSD节点。
在第八方面的第二种可能的实现方式中,所述partition视图具体包括一个partition群中各partition所在的OSD的主备身份及相应的partition状态,所述方法进一步包括:所述MDC模块在所述IO请求处理过程中检测到所述备OSD节点中任一备OSD节点发生故障时,将所述数据所属的partition群的partition视图中的所述任一备OSD的partition状态标记为不一致,将更新后的所述数据所属的partition群的partition视图通知至所述主OSD节点;以及所述主OSD节点用于接收到所述更新后的所述数据所属的partition群的partition视图后,将所述复制请求发送至所述更新后的partition视图中的partition状态为一致的备OSD节点,不将所述复制请求发送至partition状态为不一致的备OSD节点。
第九方面,一种应用于分布式存储系统中管理数据存储和复制的方法,所述系统包括至少一个元数据控制(MDC)模以及多个IO路由模块,多个存储对象设备(OSD)节点,所述MDC模块用于为所述各OSD节点配置与所述各OSD节点所管理的物理存储资源所对应的至少一个逻辑分区(partition),所述至少一个partition为主partition或备partition或主备partition的任意组合,主partition和所述主partition对应的备partition构成一个partition群,同一个partition群中的主partition和备partition位于不同的OSD节点上,主partition所在的OSD节点为所述主partition所在的partition群的主OSD节点,备partition所在的OSD节点为所述备partition所在的partition群的备OSD节点,根据所述partition生成partition视图以及IO视图,所述partition视图包括一个partition群中的各partition所在的OSD信息,所述IO视图包括一个partition群的主OSD的标识;所述IO路由模块用于对接收的IO请求进行路由 转所述OSD节点;所述OSD节点用于根据所述IO请求所对应的数据的存储;所述方法包括:所述OSD节点用于在自身故障恢复后向所述MDC模块发送查询请求以请求所述OSD节点上的partition所在的partition群的IO视图,所述OSD节点称为故障恢复OSD节点,所述查询请求中携带所述故障恢复OSD节点的OSD标识,接收所述MDC返回的所述IO视图,向所述IO视图中的主OSD发起数据恢复请求以请求恢复所述故障恢复OSD节点在故障期间更新的数据,接收所述主OSD发送的所述故障期间更新的数据,根据所述MDC模块在所述故障恢复OSD节点数据恢复执行完毕后更新的所述partition群的partition视图处理IO请求的复制;所述MDC模块用于接收所述故障恢复OSD节点的所述查询请求,根据查询请求中的OSD标识将所述IO视图返回给所述故障恢复OSD节点,并在所述故障恢复OSD节点数据恢复执行完毕后更新所述partition群的partition视图;以及所述主OSD节点用于接收所述故障恢复OSD节点的数据恢复请求,将所述故障期间更新的数据发送至所述的故障恢复OSD节点,以及根据所述MDC模块在所述故障恢复OSD节点数据恢复执行完毕后更新的所述partition群的partition视图处理所述IO请求所对应的数据的复制。
附图说明
为了更清楚地说明本发明实施例的技术方案,下面将对实施例中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本发明的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是现有技术中的两阶段提交协议流程图;
图2A是本发明实施例提供的分布式存储复制系统架构示意图;
图2B是本发明另一实施例提供的分布式存储复制系统结构示意图;
图2C是本发明另一实施例提供的分布式存储复制系统结构示意图;
图3是本发明实施例提供的集群视图示意图;
图4是本发明实施例提供的OSD视图的状态变迁示意图;
图5是本发明另一实施例提供的分布式存储复制系统结构示意图;
图6是本发明实施例提供的视图初始化流程图;
图7是本发明实施例提供的IO请求处理的流程图;
图8是本发明实施例提供的OSD故障处理流程图;
图9是本发明实施例提供的OSD故障恢复处理流程图;
图10是本发明实施例提供的OSD故障恢复处理过程中数据恢复流程图;
图11是本发明实施例提供的OSD退出集群后的处理流程图;
图12是本发明实施例提供的OSD扩容后的处理流程图。
具体实施方式
下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。
如图2A所示,本发明的一个具体实施例提供了一种分布式存储复制控制系统用于管理控制本发明实施例中提及的数据存储和复制。该分布式存储复制控制系统主要包括状态层、接口层以及数据层三个子层,其中状态层包括元数据控制Meta Data Controller(MDC)模块202,实际应用中可以根据需要确定是否需要配置MDC的备MDC用于在主MDC模块发生故障时承担主MDC的角色以及备MDC数量;接口层包括多个IO路由模块(Input/Output输入/输出路由模块)204(也可以称为client,本发明实施中该两个概念可以互用);数据层包括多个对象存储设备Object Storage Device(OSD)节点206。其中所述的状态层通过状态视图消息与所述接口层和所述的数据层进行通信,例如MDC模块202通过状态视图消息发送更新通知给所述IO路由模块204以及OSD节点206告知更新本地的集群视图(也可以称为视图,本发明实施中该两个概念可以互用),或者直接发送MDC模块202生成或者更新的集群视图给所述IO路由模块204以及OSD节点206。接口层和数据层之间通过业务消息进行通信,例如IO路由模块204向OSD节点206发送IO请求消息以请求IO数据的存储和复制。
其中,所述MDC模块202作为集群配置信息下发的入口,用于将应用存储空间中的逻辑存储资源的逻辑分区(partition)分配给各OSD节点,根据该partition生成集群视图,维护并更新所述集群视图,并向相应的IO路由模块204和OSD节点206通知所述集群视图的更新。
所述IO路由模块204用于根据所述集群视图将上层应用的IO请求路由转发至相应的ODS节点。
所述OSD节点206用于根据所述的集群视图对所述IO请求执行相关IO操作,主要包括执行数据的存储和复制,实现数据的一致性备份,以及组织对其管理的物理存储资源(例如本地磁盘或者外部存储资源)的数据操作。
可以理解的是,上述MDC模块,IO路由模块以及OSD节点可以由硬件、固件、软件或者他们的组合予以实施,实际应用中,具体用什么方式实现是根据产品设计要求或者制造成本考虑所确定,本发明不应被某种特定的实现方式限定。
在本发明的一个具体实施例中,所述的分布式存储复制系统可以统一部署在一个独立的平台或者服务器上(如上述图2B中的分布式存储复制控制平台)用于管理与该平台或者服务器连接的分布式存储系统的数据复制和存储。
在本发明的另一个具体实施例中,该分布式存储复制控制系统可以分布式部署在如图2C所示的分布式存储系统中,所述分布式存储系统包括多个服务器server或者主机,本实施例中的主机或服务器为物理意义上的主机或服务器,即包括处理器以及存储器等硬件。上述MDC模块202可以只部署在所述分布式存储系统中的一个(没有备份MDC)、两个(一主一备)或三个(一主两备)服务器或主机上;IO路由模块204部署在所述分布式存储系统中的每个服务器或主机上;OSD节点206部署在所述分布式存储系统中的每个具有存储资源的服务器或主机上,用于管理和控制本地存储资源或外部存储资源。 实际应用中,也可以在一台主机上可以只部署IO路由模块,或者只部署OSD节点,或二者都部署,具体的部署方式可以根据实际具体的情况确定,本发明不做限制。图2C中的MDC模块202、IO路由模块204以及OSD节点206构成分布式存储控制系统在图2B所示的分布式存储系统中称为分布式复制协议层,所述分布式存储系统通过该分布式复制协议层控制IO数据存储和复制到所述存储层中的存储资源。所述存储层由所述多个服务器或主机上的本地存储资源构成,所述分布式复制协议层分布在服务器或主机中的各模块通过所述网络层的数据交换网络进行交互,一个具体的实施方式中可以采用以太网或者infiniband,应该理解上述太网或者infiniband只是本发明实施例采用的高速数据交换网络的示例性实现方式,本发明实施例并不对此进行限定。
以下通过具体实施例以及实施方式详细说明上述分布式存储复制控制系统中所述MDC模块202、IO路由模块204,以及OSD节点206相互之间的连接交互以及具体功能等情况。
在本发明的一个具体实施例中,所述MDC模块的分区功能具体可以包括:所述MDC模块根据各OSD节点所管理的物理存储资源情况在各OSD节点上配置与所述各OSD节点所管理的物理存储资源对应的逻辑分区(partition)。其中,该partition是由应用存储空间中的一定数量的数据block组成的。其中该应用层的应用存储空间是相对于存储层的物理存储空间而言的,是在应用层为用户分配的一定数量的逻辑存储空间,是存储层的物理层出空间的逻辑映射。即,此处的partition的概念区别于物理存储空间分区的概念,在数据进行存储时,可以把应用存储空间上的一个partition的空间映射到物理存储空间的一个或者多个分区,该partition具体粒度可以从集群配置信息中获取,也可以由该MDC模块根据一定的规则确定,以及其他方式确定,本发明对此不作限制。
具体实施方式中,所述MDC模块可以根分区大小配置信息,本地存储资源情况,外部存储资源情况(例如正常接入的SAN(Storage Area Network,存储区域网络)的LUN(Logical Unit Number,逻辑单元数)信息等信息生成所述partition的集群视图。
通常为了保证数据的可靠性和可用性,一个partition在不同的OSD节点上存储有副本,partition的副本数目可以通过配置文件进行配置,也可以由MDC根据一定的算法确定,partition分为主partition和备partition,在partition的多个副本中,选取其中一个副本为主副本,称之为主partition,一个partition主副本之外的副本,称之为备partition,主partition和该主partition对应的备partition构成一个partition群,主partition所在的OSD节点,称为该主partition所在partition群的主OSD节点,本实施例中描述的主OSD都是指某一个partition群的主OSD,备partition所在的OSD节点,称之为该备partition所在的partition群的备OSD节点,本实施例中描述的备OSD都是指某一个partition群的备OSD。
为了便于理解,下面结合图2B提供的实施例进行进一步说明,如图2B所示,主机或者服务器(本发明实施例中主机和服务器的概念可以互换)server_1上的OSD中的所管理的存储资源被划分为partition1,partition2和partition3(简称P1,P2和P3),以及partition4’,partition5’,和partition6’(简称P4’,P5’和P6’)。其中,P4’,P5’和P6’分别为服务器server_2上的OSD节点上的partition4,partition5和partition6(简称P4,P5和P6)的副本。server_1上的OSD中的partition与存储层中的物理存储资源之间具有对应的映射关系,例如,该OSD上的一个partition的空间映射到物理存储 空间的一个或者多个Block。
主机或者服务器server_1中的OSD管理主partition(P1,P2和P3)以及备partition(P4’,P5’和P6’),该ODS分别为P1和P1’组成的partition群主OSD节点,P2和P2’组成的partition群主OSD节点以及P3和P3’组成的partition群的主OSD节点,同时该OSD分别为P4和P4’组成的partition的备OSD节点,P5和P5’组成的partition的备OSD节点,以及P6和P6’组成的partition的备OSD节点。可以看出,同一OSD节点,针对不同的partition群,可以既作为主OSD节点同时又是作为备OSD节点而存在。
上述分区partition以及对应副本的设置可以根据基于下述因素进行,具体实际应用中可以加入其他的考虑因素设置规划磁盘分区。
其一,数据安全性,每个partition的副本应尽量分布到不同主机或服务器中。数据安全的底线是,某个partition的多个副本不允许放在同一个主机或服务器中。其二,数据平衡性,每个OSD上的partition数量尽量相同,每个OSD上的主partition、备partition1、备partition2数量尽量相同。这样,每个OSD上处理的业务是均衡的,不会出现热点。其三,数据分散性,每个OSD上的partition,其副本应尽量平均地分布到其他不同的OSD上,对于更高级别的物理部件也有同样的要求。
如图4所示,本发明一个具体的实施例中,所述MDC模块生成集群视图信息具体可以包括MDC根据管理员下发的集群配置信息,以及所述分区情况生成集群视图信息。具体而言,该集群视图信息包括三个维度的集群视图,OSD视图(OSD view)、IO视图(IO view)以及Partition视图(partition view)。
所述OSD view包括在集群中的OSD节点的状态信息。具体实施方式中所述OSD视图可以包括OSD节点的ID以及OSD节点的状态信息。其中OSD ID为OSD标示或者编号,如图5所示的本发明的一个实施例中,OSD状态具体可以包括根据所述OSD是故障分为“有效(UP)”状态和“失效(DOWN)”状态,以及根据所述OSD是否退出集群分为“退出集群(OUT)”和“在集群中(IN)”状态。如图5所示,具体的状态变迁包括,当某个OSD节点于故障回复后经过初始化或者重启后由“在集群中(IN)”以及“不失效(DOWN)”状态变更为“在集群中(IN)”以及“有效(UP)”。当某个OSD故障发生超过一定阈值(例如超过5分钟时间),则所述OSD节点将被提出集群,相应的其状态由“在集群中(IN)”以及“失效(DOWN)”状态变更为“退出集群(OUT)”以及“失效(DOWN)”状态。本发明的具体实施例中,所述OSD视图还可以进一步包括OSD视图版本信息,例如OSD视图版本号、OSD视图ID或者其他标记视图版本的任何信息。
所述IO view包括标识一个partition群的主OSD节点的标识。具体实施方式中,所述IO view可以包括partition群ID以及所述partition群ID所对应的partition群的主OSD节点的标识。其中每一个IO view有标识该IO view的IO视图版本信息,该IO视图版本信息可以是IO view ID(也可以称为IO视图版本号),用于标识IO view的版本,方便不同模块进行IO view版本新旧的对比。具体实施例中,该IO视图版本信息可以包括在IO视图中,也可以在IO视图之外。
partition view包括一个partition群中的各partition所在的OSD信息。具体实施方式中,所述partition  view可以包括partition群ID,所述partition群ID对应的partition群中各partition所在的OSD以及所述OSD的主备身份,以及对应所述各partition的所述OSD的partition状态。该partition view中包括主partition所在的OSD节点信息(例如OSD节点ID、该OSD节点的主备身份以及对应所述主partition的该OSD节点的partition状态)、该主partition对应的备partition(可以为一个或者多个)所在的OSD节点信息(例如OSD节点ID、所述OSD节点的主备身份以及所述OSD对应所述备partition的partition状态)。具体实施例中,所述partition状态可以分为“一致”和“不一致”两种,“一致”表示备partition的数据与主partition是一致的,“不一致”表示备partition的数据可能与主partition不一致;每一个partition view有标识该partition view的partition视图版本信息,该partition视图版本信息可以是partition view ID(也可以称为partition视图版本号),用于模块之间比较view的新旧。具体实施例中,该partition视图版本信息可以包括在partition视图中,也可以在partition视图之外。具体实施例中,由于IO视图是partition视图的一个子集,即partition视图包括了IO视图信息,该partition视图可以进一步包括IO视图版本信息。
所述MDC进一步用于维护管理和更新所述集群视图,根据OSD节点的状态、如故障、故障恢复、故障后退出集群、和故障恢复后恢复加入集群、新加入集群等进行集群视图的更新并通知到相关模块,以便相关模块根据该更新的集群视图处理相应的IO请求所对应的数据的复制。
在具体的实施方式中,为了减少交互以及节省管理和存储资源,所述OSD view可以只存在于MDC中,所述partition view可以只存在于MDC模块和主OSD节点中,所述IO view存在所述MDC模块、IO路由模块、主OSD节点以及备OSD节点中。所述MDC模块仅将所述partition view发给所述partition view中的partition所在的主OSD节点,或者仅通知所述partition view中的partition所在的主OSD节点进行本地partition view更新;而将构成partition view一部分的IO view(即IO view可以看成是partition view的子视图)发给IO路由模块、主OSD节点以及备OSD节点中,或者通知相应的模块进行本地保存的IO视图的更新。具体的实施过程可以参考下述各具体流程,以及OSD加入集群流程。实际应用中,MDC模块可以根据配置信息或者一定的策略根据集群视图的基本功能设置不同形式的集群视图,本发明实施例不做限制。
本发明的一个具体实施例中,所述IO路由模块主要用于实现IO请求路由的功能。所述IO路由模块从所述MDC模块获取集群中所有partition的IO view并进行缓存,当业务IO请求到达所述IO路由模块时,IO路由模块通过IO请求中的key计算出IO所属的partition群(计算法可以采用哈希算法或者其他算法);然后查找本地保存的IO view,找到partition群对应的主OSD节点,把IO请求发送给该主OSD节点。处理从所述MDC模块接收的IO view更新通知,所述的更新通知可以包括所述的更新后的IO view或者相应的更新知识信息,例如指示更新哪些内容,根据更新通知更新本地保存的IO view,并根据本地保存的更新后的IO view进行IO请求的路由。具体的实施过程可以参考下述各具体流程。
本发明的一个具体实施例中,所述OSD节点根据所述集群视图处理IO请求执行IO操作具体包括:
当所述OSD节点作为主OSD节点时,所述的主OSD节点主要用于接收所述IO路由模块发送的 IO请求,执行IO请求以及向相应的备OSD节点发送复制请求,以执行IO数据的存储和复制。所述主OSD节点从所述MDC模块接收其作为主OSD节点上的partition的partition view,并进行保存,所述主OSD节点根据所述partition view对所述IO请求的复制进行处理。所述主OSD节点进一步接收MDC模块关于partition view的更新通知,根据所述更新通知更新本地保存的partition view,并根据更新后的partition view处理所述IO请求所对应的数据的复制,所述更新通知可以包括更新的partition view或者相应的更新信息以便所述OSD节点根据所述更新的partition view或者所述更新信息更新本地保存的partition view和IO view。当所述OSD节点作为备OSD节点时,所述备OSD节点用于接收主OSD节点的复制请求,根据所述复制请求进行数据的复制备份,从所述MDC模块接收其该备OSD节点上该数据所属的partition的IO view,并进行保存,根据IO view处理所述的IO请求所对应的数据的复制,并进一步接收MDC模块关于IO view的更新通知,根据所述更新通知更新本地保存的IO view,以及根据所述更新的IO view处理所述IO请求所对应的数据的复制。具体的实施过程可以参考下述各具体流程。
本发明的一个具体实施例中,上述实施例中的分布式存储复制系统(如图2A、2B以及2C)可以基于如图5所示的系统实施,如图5所述,该系统可以包一个或者多个括存储器502、一个或者多个通信接口504,以及一个或者多个处理器506,或者其他数据交互网络(用于多个处理器以及存储器之间的交互,图中对此未示出)。
其中,存储器502可以是只读存储器(Read Only Memory,ROM),静态存储设备,动态存储设备或者随机存取存储器(Random Access Memory,RAM)等各种类型的存储器。存储器502可以存储操作系统和其他应用程序的指令以及应用数据,所述指令包括用于执行本发明各实施例中的MDC模块、IO路由模块以及OSD节点的功能的指令。
存储器502中存储的指令由处理器506来运行执行。
通信接口504用来实现存储器502与处理器506之间的通信,以及处理器之间和存储器之间的通信,以及该系统与其他设备或通信网络之间的通信。
处理器506可以采用通用的中央处理器(Central Processing Unit,CPU),微处理器,应用专用集成电路(Application Specific Integrated Circuit,ASIC),或者一个或多个集成电路,用于执行相关程序,以执行本发明各实施例中所描述的MDC模块、IO路由模块以及OSD节点之间的交互流程以及实现的功能。
为了便于理解同时避免不必要的重复描述,以下具体的实施例用于说明图5所示的系统具体如何执行本发明各实施例中所描述的MDC模块、IO路由模块以及OSD节点之间的交互流程以及实现的功能。本领域技术人员在阅读本发明所有实施例的基础上,可以理解图5所示的系统可以用于实施其他各实施例中描述的情况,这些都在本发明的记载以及公开的范围内。
实施例一:
所述的处理器用于与所述的存储器连接,读取所述存储器中的指令,所述指令包括用于执行所述MDC模块、IO路由模块以及OSD节点的功能的指令,根据所述的指令促使所述处理器执行下述操作:
促使所述MDC模块确定所述系统中某个OSD节点为故障OSD节点,确定所述故障OSD节点上的partition,更新所述故障OSD节点上的partition所在的partition群的partition视图,向所述更新的partition视图中的partition群所在的主OSD节点发送更新通知,以便所述主OSD节点根据所述更新的partition视图处理IO请求所对应的数据的复制。
需要说明的是,上述实施例中,上述MDC模块、IO路由模块以及OSD节点功能可以由一个主机实现,在这种情况下,实现上述MDC模块、IO路由模块以及OSD节点功能的指令可以存在于该主机的存储器中,由该主机的处理器读取该存储器中实现上述MDC模块、IO路由模块以及OSD节点功能的指令。在另外的实施例中,上述MDC模块、IO路由模块以及OSD节点功能可以由多个主机交互实现,在这种情况下,上述MDC模块、IO路由模块以及OSD节点分布式存储于不同主机的存储器中,例如,主机1中的处理器执行上述MDC模块的功能,主机2的处理器执行主OSD节点的功能,主机3执行IO路由模块功能。
实施例二:
所述的处理器用于与所述的存储器连接,读取所述存储器中的指令,所述指令包括用于执行所述MDC模块、IO路由模块以及OSD节点的功能的指令,根据所述的指令促使所述处理器执行下述操作:
促使所述IO路由模块接收IO请求,所述IO请求包括key,根据所述key确定所述IO请求所对应的数据所属的partition群并确定所述数据所属的partition群的主OSD节点,将所述数据所属的partition群的IO视图的IO视图版本信息加入所述IO请求,并将携带有所述IO视图版本信息的所述IO请求发送至所述确定出的主OSD节点;
促使所述主OSD节点接收所述IO请求,在根据所述IO视图版本信息确定所述IO请求中的IO视图版本与本地保存的IO视图版本一致后,执行所述IO请求,生成携带有所述IO视图版本信息的复制请求,并将所述的复制请求发送至所述数据所属的partition的备OSD节点;以及
促使所述备OSD节点接收所述复制请求,并在根据所述IO视图版本信息确定所述复制请求中的IO视图版本与所述备OSD节点本地保存的IO视图版本一致后,执行所述复制请求以使得所述备OSD节点上的所述IO请求所对应的数据与所述主OSD节点上的所述IO请求所对应的数据保持一致。
与上面一样,上述实施例中,上述MDC模块、IO路由模块以及OSD节点功能可以由一个主机实现,在这种情况下,实现上述MDC模块、IO路由模块以及OSD节点功能的指令可以存在于该主机的存储器中,由该主机的处理器读取该存储器中实现上述MDC模块、IO路由模块以及OSD节点功能的指令。在另外的实施例中,上述MDC模块、IO路由模块以及OSD节点功能可以由多个主机交互实现,在这种情况下,上述MDC模块、IO路由模块以及OSD节点分布式存储于不同主机的存储器中,例如,主机1中的处理器执行上述IO路由模块的功能,主机2的处理器执行主OSD节点的功能,主机3处理器执行备OSD节点的功能,主机4处理器执行MDC模块的功能。
实施例三:
所述的处理器用于与所述的存储器连接,读取所述存储器中的指令,所述指令包括用于执行所述MDC模块、IO路由模块以及OSD节点的功能的指令,根据所述的指令促使所述处理器执行下述操作:
促使所述OSD节点在自身故障恢复后向所述MDC模块发送查询请求以请求所述OSD节点上的partition所在的partition群的IO视图,所述OSD节点称为故障恢复OSD节点,所述查询请求中携带所述故障恢复OSD节点的OSD标识,接收所述MDC返回的所述IO视图,向所述IO视图中的的主OSD发起数据恢复请求以请求恢复所述故障恢复OSD节点在故障期间更新的数据,接收所述主OSD发送的所述故障期间更新的数据,根据所述MDC模块在所述故障恢复OSD节点数据恢复执行完毕后更新的所述partition群的partition视图处理IO请求的复制;
促使所述MDC模块接收所述故障恢复OSD节点的所述查询请求,根据查询请求中的OSD标识将所述IO视图返回给所述故障恢复OSD节点,并在所述故障恢复OSD节点数据恢复执行完毕后更新所述partition群的partition视图;以及
促使所述主OSD节点接收所述故障恢复OSD节点的数据恢复请求,将所述故障期间更新的数据发送至所述的故障恢复OSD节点,以及根据所述MDC模块在所述故障恢复OSD节点数据恢复执行完毕后更新的所述partition群的partition视图处理所述IO请求所对应的数据的复制。
同理,上述实施例中,上述MDC模块、IO路由模块以及OSD节点功能可以由一个主机实现,在这种情况下,实现上述MDC模块、IO路由模块以及OSD节点功能的指令可以存在于该主机的存储器中,由该主机的处理器读取该存储器中实现上述MDC模块、IO路由模块以及OSD节点功能的指令。在另外的实施例中,上述MDC模块、IO路由模块以及OSD节点功能可以由多个主机交互实现,在这种情况下,上述MDC模块、IO路由模块以及OSD节点分布式存储于不同主机的存储器中,例如,主机1中的处理器执行上述故障恢复OSD节点的功能,主机2的处理器执行主OSD节点的功能,主机3处理器执行所述MDC模块的功能,主机4处理器执行IO路由模块的功能。
下面通过多个具体流程实施例进一步详细说明上述分布式存储复制控制系统中所述MDC模块202、IO路由模块204以及OSD节点206相互之间的连接交互以及具体功能等情况。这些具体的流程实施例包括:集群视图初始化生成和获取流程、IO请求处理流程、OSD故障的处理流程、OSD节点故障恢复流程、数据恢复流程、OSD节点故障后退出集群的流程以及OSD节点扩容流程。以下一一进行详细说明。
需要说明的是,以下实施例中各步骤或功能并不都是必须的,各步骤之间的顺序仅仅是为了描述的方便,除了根据本发明方案的原理必然要求的顺序之外,各步骤之间的顺序没有特别限定,而且各步骤中具体的实现或实施方式仅仅是举例说明,并不对本发明的保护范围构成特定限定。本领域技术人员在阅读本发明整个说明书的基础上,可以根据实际情况对上述各步骤进行相应的增减或者进行非创造性的变更或者替换,将各步骤中不同的实施方式与其他步骤的实施方式进行组合形成不同的实施例等,都属于本发明记载以及公开的范围内。
集群视图初始化生成和获取流程
如图6所示,为本发明提供的一个集群视图初始化生成和获取的实施例,在该实施例中MDC模块根据管理员下发的集群配置信息,生成初始的集群视图。IO路由模块和OSD节点初始化时,向所述 MDC模块查询视图。本实施例中涉及的执行主体为图2A至图2C以及图5中所描述的实施例中提及的MDC模块,IO路由模块以及OSD节点。
系统初始化时,先启动所述MDC模块,然后所述IO路由模块和OSD节点从所述MDC获取视图启动,具体过程包括:
602:用户或者管理员下发集群配置信息至所述MDC模块。所述集群配置信息可以包括集群的拓扑信息以及partition数目、副本数目等系统配置参数,其中所述集群拓扑信息主要包括服务器的数目及其IP地址、机架信息、每个服务器上的OSD节点数目以及其所管理的物理存储资源信息(例如其对应的本地磁盘信息)等。
604:所述MDC根据下发的所述集群配置生成初始的集群视图,上文已经介绍了集群的三种视图(OSD view、partition view、IO view);根据配置的OSD信息生成OSD view,通过partition分配算法以及配置的partition数目、副本数目、OSD节点的数目生成partition view,IO view是partition view的一个子集,不需要额外去生成。在生成partition视图时,通常需要考虑partition分配的均衡性(每个OSD节点上的partition数目尽量一致),安全性(partition的副本所在的OSD节点位于不同的服务器或者不同的机架)。
606:IO路由模块初始化启动,向所述MDC模块查询IO view,所述IO路由模块启动时需要从MDC获取到相关的视图才能正常工作。
608:OSD节点初始化启动,向所述MDC模块查询partition view和IO view,所述OSD节点需要获取分布在该OSD节点上的主partition所在的partition群的的partition view和分布在该OSD节点上的备partition所在的partition群的IO view。
610:所述MDC模块向所述IO路由模块返回所有partition群的IO view。
612:所述MDC模块向所述OSD节点返回分布在该OSD上主partition所在的partition群的的partition view和备partition所在的partition群的的IO view。
IO请求处理流程
如图7所示,为本发明提供的IO请求流程的实施例,本实施例中涉及的执行主体为图2A至图2C以及图5中所描述的实施例中提及的IO路由模块和OSD节点。为了理解的方便本实施例以某个partition群(Partition X)为例说明,Partition X可以是本发明实施例中的分布式复制协议层的管理维护的任一分区Partition,Partition X所在的主OSD节点(简称Partition X主OSD节点),Partition X的备partition所在的备OSD1节点(简称Partition X备OSD1节点),Partition X的备partition所在的备OSD2节点(简称Partition X备OSD2节点),后续具体描述中将以partition X为例描述。该实施例中的IO操作流程(例如写操作或修改操作)具体包括:
702:所述IO路由模块接收主机(例如,图3中所示的该IO路由模块所在的server)发送的IO请求。
704:所述IO路由模块根据该接收的IO请求获得该IO请求所对应的数据(也可以称为IO数据) 的partition,并获取该partition所在的partition群的主OSD节点。
在具体实施方式中,IO路由模块可以根据该IO请求中携带的Key,通过哈希算法计算出该IO请求所对应的数据所在的partition群的partition ID,然后通过该partition ID查找IO view,获取该partition群的主OSD节点。如上所述,本实施例中此处partition ID所对应的partition群为partition X。所述的Key是上层业务定义的一个数字或者字符串,用于标识一个数据块。
706:所述IO路由模块将该IO请求发送给所述partition群的主OSD节点,所述请求中携带IO视图版本信息(例如IO view ID),IO关键信息以及该IO数据。
根据前述描述本实施例中的partition群为partition X,相应的,该IO请求被发送给所述partition X的主OSD节点,即partition X主OSD节点。在具体实施方式中,所述IO view ID也可以称为IO视图版本号,其主要用来标识一个视图的版本,单调递增,IO view ID小说明其维护的视图版本是一个过时的版本,保证一致性的一个要求是IO处理流程中所有模块看到的视图是一致的。所述的IO关键信息可以包括key、offset以及length,其中offset是指IO数据在key所标识的数据块中相对起始位置的偏移量,length是指IO数据的长度大小。
708:根据所述IO视图版本信息确定所述IO请求中携带的视图版本与本地保存的视图版本一致后,为所述IO请求生成一个sequence(Seq)ID(也可以称为序列标识)。具体而言,所述主OSD节点确定所述IO请求中携带的IO view ID与本地保存的IO view ID一致后,为所述的IO请求生成一个序列标识。
在具体实施方式中,所述的Seq ID由所述视图版本号以及序列号sequence number(Seq NO)组成。其中视图版本号随着所述IO view的变更单调递增,所述Seq NO表示在一个IO view版本内,针对该IO视图中的partition对应的数据的修改操作(例如写和删除)的连续编号。IO view变更后,所述Seq ID中的Seq NO重新从0开始递增。其中,所述Partition X主OSD节点比较所述IO请求中携带的IO view ID与本地保存的IO view ID是否一致具体可以为先比较其中的IO view ID,IO view ID大则认为其seqID大,如果IO view ID相等,再比较其中的seq NO,seq NO大的seq ID大,只有两者都相同则表示所述的Seq ID一致。
710:如果本地IO view大,则向所述IO路由模块返回错误,如果本地的IO view小,则将所述IO请求加入缓冲队列,向所述MDC模块查询partition view。
具体实施方式中,所述Partition X主OSD节点在确定所述IO请求中的IO视图ID小于本地保存的IO视图ID后向所述的IO路由模块返回错误,由所述IO路由模块向所述MDC模块查询所述partition群的IO视图,在获得更新的IO视图ID后重新发送所述IO请求;以及在确定所述IO请求中的IO视图ID大于本地保存的IO视图ID后将所述IO请求加入缓冲队列,并向所述MDC模块查询所述partition群的IO视图的IO视图ID以确定本地保存的IO视图ID与所述IO请求中的IO视图ID一致后执行所述IO请求。
712:记录一条entry,所述的entry包括操作类型,Partition群ID,Seq ID,以及key。
具体实施方式中,所述的操作类型可以包括写或者删除等。如果是写操作,则所述的entry可以进 一步包括前述的offset以及length。另外,针对各种类型的操作,所述的entry中还可以进一步包括状态信息,用于描述所述的操作是否成功。一般而言,针对同一个partition群的修改操作(如写和删除操作)进行统一连续编号的。
714:执行所述的IO请求,查询所述partition群的partition view获得所述partition群备OSD节点信息。
具体实施方式中,如果所述执行所述IO请求为写IO请求,则将所述IO数据写入所述Partition X主OSD节点所管理的本地物理存储资源(例如图3中所示的缓存层或者持久化层,如磁盘,或者前面提及的外部物理存储资源SAN);如果所述IO请求为删除请求,则将所述Partition X主OSD节点所管理的本地物理存储资源上的相应数据删除。具体实施例中,所述Partition X主OSD节点进一步的生成复制请求。具体实现中,所述的复制请求可以为将IO请求的控制部分另外组装以生成复制请求,从而使得Partition X的备OSD节点上的所述IO请求所对应的数据与所述Partition X主OSD节点上的所述IO请求所对应的数据保持一致。
716(1)/(2):分别向所述partition X备OSD1节点以及所述partitionX备OSD2节点发送复制请求,所述复制请求携带Seq ID。
具体实施例方式中,所述的复制请求可以进一步包括原始请求中的key、offset、length以及IO数据等信息。当所述复制请求为写入复制请求时,所述复制请求中携带有key、offset、length以及IO数据。当所述的复制请求为删除复制请求,所述复制请求中仅携带有所述的key。
718:确定所述复制请求中的IO view ID与本地保存的IO view ID一致后,处理所述的复制请求。
在具体的实施方式中,在上述718步骤之外可以进一步包括以下步骤:
如果所述Partition X备OSD1节点或备OSD2在确定所述复制请求中的IOview ID小于本地保存的IOview ID后向所述的Partition X主OSD节点返回错误,由所述Partition X主OSD节点向所述MDC模块查询所述partition群的IO视图,在获得更新的IO视图ID后重新发送所述复制请求;或者在确定所述复制请求中的IO视图ID大于本地保存的IO视图ID后将所述复制请求加入缓冲队列,并向所述MDC模块查询所述partition群的IO视图版本号以确定本地保存的IO视图ID与所述IO请求中的IO视图ID一致后执行所述复制请求。
相比现有两阶段提交协议中,如果参与者拒绝了提议,整个IO流程需要进行回退,带来了很大的开销,而本实施例中如果备OSD节点拒绝了请求,而是先查询最新的视图后继续处理,不需要进行回退,从而提高了整个系统的容错能力以及可用性。
现有的两阶段提交协议在正常的一次IO过程中,协调者与参与者之间需要两次消息交互,而本实施例中只需要进行一次IO过程,主节点与备节点只有一次消息交互,降低了消息交互带来的IO时延,提高了整个系统的效率和性能。
在具体的实施方式中,所述的复制请求进一步可以携带有所述Partition X主OSD节点针对所述partition X发送的上一次复制请求中的Seq ID,所述IO流程可以进一步包括步骤720.
720:比较复制请求中携带携带的上一次复制请求的Seq ID与本地最大的Seq ID进行比较,如果比本地的大,则要求所述Partition X主OSD节点发送丢失的请求,如果一致则继续后续步骤的处理。
通过携带上一次复制请求的Seq ID可以防止IO view版本号变更导致的数据遗漏,而且通过发送丢失的数据,可以保证主备OSD节点对每个IO的执行顺序是一致的,从而进一步提高数据备份的一致性。
具体实施方式中,如果复制请求中携带携带的上一次复制请求的Seq ID比本地最大的Seq ID小,则向所述partition X主OSD节点返回错误,由所述partition X主OSD节点重发复制请求;或者通过进一步的查询确认Seq ID,并在确认后获取更新seq ID后继续进行处理,而不直接返回错误。两种情况都不需要终止处理,进而进行回退,进一步提高了系统的容错能力和可用性,以及整个系统的性能。
当IO在主备OSD节点上都需要执行时,需要保证主备OSD节点对每个IO操作的执行顺序是一致的,这是保证partition的多个副本一致性的重点。
722、记录一条entry,执行所述的复制操作。
具体实施方式中,所述的entry包括操作类型,Partition ID,Seq ID,以及key,所述的操作类型可以包括写或者删除等。如果是写操作,则所述的entry可以进一步包括前述的offset以及length。另外,针对各种类型的操作,所述的entry中还可以进一步包括状态信息,用于描述所述的操作是否成功。执行所述的复制请求具体包括:当所述复制请求为写入复制请求时,所述复制请求中携带有key、offset、length以及IO数据。当所述的复制请求为删除复制请求,所述复制请求中仅携带有所述的key。
724:所述partition X备OSD1节点和partition X备OSD2节点分别向所述partition X主OSD节点发送响应请求成功消息。
726:所述partition X主OSD节点向所述的IO路由模块发送响应请求成功消息。
在上述IO请求实施例的基础上,假设在所述IO请求处理过程中,所述的partition X主OSD节点或者partition X备OSD节点发生故障(例如在IO请求达到partition X主OSD节点之后,所述partition X主OSD节点发生故障),或者所述partition X备OSD节点发生故障,或者系统有新的OSD节点加入作为所述partition X的备OSD节点等,在这些情况下,前述实施例中的IO请求处理流程进一步可以包括以下实施例中所描述的处理过程。
在IO请求达到partition X主OSD节点之后,所述partition X主OSD节点发生故障后IO处理流程包括如下:
当所述partition X主OSD节点发生故障后,系统中的MDC模块在所述IO请求处理过程中检测到所述partition X主OSD节点发生故障后将所述partition群的partition view中的所述partition X主OSD节点降为新的partition X备OSD节点,并将所述新的partition X备OSD节点的partition状态标记为不一致,将所述partition群的partition view中的所述partition X备OSD1节点升为新的partition X主OSD节点,将更新后的所述partition群的partition view发送至所述新的partition X主OSD节点,将所述partition群的IO view中的所述partition X备OSD1节点升为新的partition X主OSD节点,将更新后的所述partition群的IO视图发送至所述IO路由模块。
所述IO路由模块接收所述MDC模块发送的所述更新后的所述partition群的IO视图,根据所述更 新后的所述partition群的IO视图将所述IO请求发送至所述新的partition X主OSD节点;以及
所述新的partition X主OSD节点用于接收所述IO请求,在执行所述IO请求后生成复制请求,将所述新的复制请求发送至所述更新后的所述partition的partition视图中的其他备OSD节点。这里生成复制请求与发送复制请求的步骤与上述步骤714和716步骤一样。
在IO请求达到partition X主OSD节点之后,所述partition X备OSD节点发生故障后IO处理流程包括如下:
当所述partition X备OSD节点发生故障后,所述MDC模块进一步用于在所述IO请求处理过程中检测到所述partition X备OSD节点发生故障时,将所述partition视图中的所述partition X备OSD节点的partition状态标记为不一致,将更新后的所述partition群的partition视图发送至所述partition X主OSD节点;以及
所述主OSD节点用于接收到所述更新后的所述partition群的partition视图后,将所述复制请求发送至所述更新后的partition视图中的partition状态为一致的其他备OSD节点,不将所述复制请求发送至所述partition状态为不一致的partition X备OSD节点。
在两阶段提交协议中,协调者发生故障,IO处理中断,需要协调者恢复后才能继续处理,在本实施例中,主OSD节点发生故障后,MDC节点能够快速选举新的主OSD节点,快速的恢复IO的处理,可用性高,容错能力强。
另外,在两阶段端提交协议中,如果参与者发生故障或者一直没有响应,其他的IO请求一直阻塞,最后超时失败,需要进行回退;本实施例中如果备节点发生故障后,MDC会通知主节点进行视图变更,隔离或忽略该故障OSD节点,继续IO请求的处理,不会阻塞其他的IO请求处理,具有更好的容错能力,能够快速处理节点故障和进行节点故障恢复,例如,N+1个副本可以容忍N个副本故障,进而提高了存储系统的性能和可用性。可用性差的系统决定了其扩展性不会太好,因为在大规模分布式存储系统中,存储节点的故障是一件经常发生的事情,复杂且大量的协议交互也会使得系统的扩展能力减低。
此外,可以针对partition粒度的集群视图控制,可以极大的缩小存储节点故障的影响范围,使得存储系统能够进行大规模的扩展,提高了系统的扩展性。
在IO请求达到partition X主OSD节点之后,所述MDC发现有新的OSD节点加入集群,并且作为所述partition X的备OSD节点,所述IO处理流程包括如下:
所述MDC模块在所述IO请求处理过程中确定有新的OSD节点加入集群时,通知所述partition X主OSD节点所述新的OSD节点作为所述partition X所在的新的备OSD节点,并在partition的数据同步完毕后更新所述partition群的partition视图以及IO视图,通知所述的partition X主OSD节点更新所述partition X主OSD节点本地保存的partition视图;以及
所述partition X主OSD节点将其上面的主partition的数据同步至所述新的备OSD节点,并根据更新后的本地保存的partition将所述复制请求发送至新的备OSD节点。
OSD故障的处理流程
在大规模的分布式存储系统中,存储节点的故障是一种常态。当出现部分OSD节点故障时,系统需要能够正常提供IO服务。在本发明实施例中,所有的IO请求的处理都依赖于MDC模块维护的集群视图,当集群中的OSD节点发生故障,集群视图也需要进行相应的更新,以便于IO请求的处理能够正常有效的进行。
为不让所述的OSD节点的故障影响IO请求的复制的正常处理,通常通常需要进行以下处理:第一,MDC模块对OSD节点的状态进行检测,当某个OSD节点发生了故障,MDC模块能够及时发现该故障;第二,MDC模块发现某OSD故障后,需要及时进行处理以对视图做出正确的变更,并通知到相关的IO路由模块和OSD节点;第三,相关的IO路由模块和OSD节点根据所述MDC的更新通知后根据更新后的视图处理相应的IO请求,以便所述的模块和节点能够及时获得更新的视图,从而保证IO请求顺利高效的处理。
MDC模块可以通过下述两种模式对OSD节点的故障进行检测:1)由MDC模块统一负责,每个OSD节点都定时向MDC模块发送心跳消息,如果某个OSD节点在规定的时间周期内没有向MDC模块发送心跳消息,则MDC模块判定该OSD节点发生了故障;2)OSD节点之间定时互相发送心跳消息进行故障检测,检测者在规定的时间周期内没有收到被检测者的心跳消息,则上报MDC模块对应的OSD节点发生了故障。
如前所述,partition view和IO view都是针对某个partition群来说的,一般情况下,一个OSD节点上有多个partition;一个OSD节点故障,涉及到多个partition所在的多个partition群的视图的更新,每个partition群的视图更新是相互独立的:
1)当故障的OSD节点上的partition包括备partition时,将所述备partition所在的partition群的partition视图中所述故障OSD节点的partition状态标记为不一致,同时把更新后的partition视图通知给所述partition群的主OSD节点,所述partition的主OSD节点根据更新的partition视图进行IO请求所对应的数据的复制处理;以及
2)当故障OSD节点上的partition包括主partition时,将所述主partition所在的partition群的partition视图中作为主OSD节点的所述故障OSD节点降为新的备OSD节点并将所述新的备OSD节点对应的partition状态标记为不一致,并从所述主partition所在的partition群的partition视图中的原备OSD节点中选择一个partition状态为一致的备OSD节点升为新的主OSD节点。之后,通知新主的OSD节点partition视图更新,通知其他的备OSD节点IO view更新。若partition群的主OSD故障并且所有的备partition所在的OSD节点的partition状态都处于不一致状态,则不进行partition视图和IO视图的变更,因为需要保证partition的主副本拥有最新的完整数据,从而保证数据复制的一致性。
相关的IO路由模块和OSD节点根据所述MDC的更新通知后根据更新后的视图处理相应的IO请求具体可以包括:
新主的OSD节点根据所述更新后本地保存的partition view将来自所述IO路由模块的IO请求所对应的数据复制到所述更新后本地保存的partition view中的partition状态为一致的partition所在的备OSD节点上,或复制到所述更新后本地保存的partition view中的partition所在的partition状态为不一致但正 在进行数据恢复的备OSD节点上,从而进行故障隔离,保障IO请求的正常不受干扰的处理,从而提高了系统的容错能力,相应的提高了系统的性能和可用性。此外,可以通过对partition粒度的集群视图控制,可以缩小OSD节点故障的影响范围,从而使得系统能够进行大规模的扩展,提高了系统的扩展性。
最后,在所述故障OSD节点故障恢复并完成数据恢复后,MDC模块进一步更新所述partition view以及IO view,将所述进一步更新的partitionview通知给所述进一步更新的partition view中的partition所在的主OSD节点,将所述进一步更新的IOview发送给所述进一步更新的partition view中的partition所在的备OSD节点,以便接收到该进一步更新的partition view或者IO view的模块或者OSD节点,更新本地保存的partition view或者IO view,并根据该进一步更新的partition view或者IO view处理IO请求所对应数据的复制。
通过及时的更新视图,故障恢复节点可以快速加入集群处理IO请求,提高了系统的性能和效率。
为了便于理解,以下通过一个具体的实施例进行说明,如图8所示,为本发明提供的OSD节点故障处理流程实施例,本实施例中涉及的执行主体为图2A至图2C以及图5中所描述的实施例中提及的MDC模块,IO路由模块以及OSD节点。为了理解的方便,本实施例以OSD节点中的OSDx节点、OSDy节点以及OSDz节点为例说明,OSDx节点、OSDy节点以及OSDz节点可以是本发明实施例中的分布式复制协议层的中多个OSD节点的任一OSD节点,并且,为了便于理解,本实施例中假设OSDx节点为partition1群(简称P1)的主OSD节点,为partition n群(简称Pn)的备OSD节点;OSDy节点为所述Pn的主OSD节点,为所述P1的备OSD节点;OSDz节点为所述Pn的备OSD节点,为所述P1的备OSD节点。该实施例中的OSD节点故障处理流程具体包括:
802/804/806:OSDx节点、OSDy节点、OSDz节点分别定时向主MDC模块发送心跳消息。
808:OSDx节点发生故障。
实际应用中,该故障可以包括各种软、硬件或者网络的故障,如软件BUG导致的程序进程重启、网络短暂不通、服务器重启等导致OSD节点不能处理所述的IO请求,不能完成数据的存储和复制功能。
810:MDC模块检测到所述OSDx节点在预定的时间内没有发送心跳消息,则判断其发生了故障。
实际应用中,如前所述,MDC还可以通过其他方式,比如由没有故障的OSD节点通知某一OSD节点发生故障。
812:MDC模块根据故障情况进行视图更新。
具体实施方式中,MDC根据确定后的故障OSD节点上的partition所在的partition群,更新相应partition群的集群视图,本实施例中,故障OSDx节点上的partition所在的partition包括P1和Pn,因此,MDC需要更新P1和Pn的集群视图,具体可以包括:
1)更新OSD view:将OSDx节点的状态由“在集群中(IN)”以及“有效(UP)”更新为“在集群中(IN)”以及“失效(DOWN)”状态。
2)变更partition view:针对P1,将OSDy节点升为P1的主(在P1的备OSD节点列表中,选择第一个partition状态为一致的备OSD节点升为partition群的主OSD节点),将OSDx降为P1的备,且对应的partition状态更新为“不一致”;针对Pn,将Pn的备OSDx的partition状态变更为不一致。
3)更新IO view:针对P1,将P1的IO view中的主OSD节点由原来的OSDx节点变更为OSDy节点;针对Pn,由于故障OSDx节点仅作为Pn的备OSD节点,Pn的主OSDy节点并没有发生故障,所以Pn的IO view不发生更新。
814:向OSDy节点(作为更新后的P1的主OSD节点,以及继续保持Pn的主OSD节点)通知P1和Pn的partition视图的更新,包括OSDy节点升为P1的主OSD节点,OSDx降为P1的备OSD节点且对应的partition状态为“不一致”,Pn的备OSDx节点的partition状态变更为“不一致”。
816:向OSDz节点通知P1的IO view更新,告知P1的主OSD节点变更为OSDy节点。
818:向IO路由模块通知P1的IO view更新,告知P1的主OSD节点变更为OSDy节点。
820:OSDy节点处理MDC模块的通知,更新本地保存的视图信息(partition view和IO view),并根据MDC模块通知的最新的视图处理IO请求所对应的数据的复制。
具体实施方式中,OSDy节点作为更新后的P1的主OSD节点,更新P1的partition view以及IO view,作为原Pn的主OSD节点,更新针对Pn的partition view。
针对P1的IO操作,当所述OSDy节点接收到所述IO路由模块转发的IO请求之后,执行IO请求,产生复制请求,将所述的复制请求发送至更新后的partition view中的P1的备OSD节点OSDz节点,且对应的partition状态为“一致”,由于OSDx节点作为P1的新的备OSD节点,其partition状态为“不一致”,OSDy节点不再将所述的复制请求发给OSDx节点,实现故障隔离,从而不影响所述针对P1的IO请求处理的继续进行。
针对Pn的IO操作,当所述OSDy节点接收到所述IO路由模块转发的IO请求之后,执行IO请求,产生复制请求,将所述的复制请求发送至更新后的partition view中的Pn的备OSD节点OSDz节点,且对应的partition状态为“一致”,由于OSDx节点作为Pn的备OSD节点,其partition状态为“不一致”,OSDy节点不再将所述的复制请求发给OSDx节点,实现故障隔离,从而不影响所述针对P1的IO请求处理的继续进行。
在两阶段端提交协议中,如果参与者发生故障或者一直没有响应,其他的IO请求一直阻塞,最后超时失败,需要进行回退;本实施例中如果备OSD节点发生故障后,MDC会通知主节点进行视图变更,忽略该故障节点,对该故障节点进行隔离,继续IO请求的处理,不会阻塞其他的IO请求处理,具有更好的容错能力和可用性。
822:处理MDC模块的视图通知,更新本地保存的IO view信息。
824:处理MDC模块的视图通知,更新本地保存的IO view信息,并根据MDC模块通知的最新IO view视图进行IO请求的路由转发。
具体实施方式中,针对P1的IO处理,如果在所述P1的IO处理过程中,P1所在的原主OSDx节点故障,MDC会及时更新P1的partition view,IO路由模块根据更新后的partition view中新选出的P1的主OSDy节点,将所述的IO请求重新转发至新选出的OSDy节点。
在本发明实施例中,主OSD节点发生故障后,MDC节点能够快速选举新的主节点,快速的恢复IO的处理,如果备节点发生故障后,MDC会通知主节点进行视图变更,隔离或忽略该故障OSD节点,继续IO请求的处理,不会阻塞其他的IO请求处理,具有更好的容错能力,能够快速处理节点故障,例如,N+1个副本可以容忍N个副本故障,进而提高了存储系统的性能和可用性。可用性差的系统决定了其扩展性不会太好,因为在大规模分布式存储系统中,存储节点的故障是一件经常发生的事情,复杂且大量的协议交互也会使得系统的扩展能力减低。此外,可以针对partition粒度的集群视图控制,可 以极大的缩小存储节点故障的影响范围,使得存储系统能够进行大规模的扩展,提高了系统的扩展性。
OSD节点故障恢复流程
在OSD节点故障期间,可能发生了新的数据修改操作,因此,故障的OSD节点恢复重新加入集群提供服务前,需要先进行数据的恢复同步,使自己的状态恢复到主副本一致。
本发明的一个实施例中,OSD节点故障恢复流程可以分为以下三个阶段:
1)备OSD节点向主OSD节点同步所述主OSD节点在故障期间所做的数据修改,这是一个增量同步的过程,当然实际应用中,也可以根据实际情况同步某个partition的所有数据;
2)备OSD节点恢复到与主OSD节点一致的状态后,MDC模块进行集群视图的变更;
3)MDC模块通知各模块和节点更新后的集群视图,以便各模块和节点根据通知所述更新的集群视图处理IO请求的复制或转发。
实际应用中,如果在数据恢复过程中,主OSD节点收到IO请求后,将复制请求发送至该故障恢复节点,则该OSD节点故障恢复流程可以进一步包括以下阶段:
备OSD节点回放在向主OSD同步数据的过程中在收到主OSD节点的复制请求后记录的log,将log中的数据写入。这样可以保证,故障恢复OSD节点在故障恢复过程中确保所有数据同主OSD节点保持完全一致,进而提高主备OSD节点的数据一致性。
其中的数据同步过程的具体流程可以参见下述图10的给出的一个具体实施例。
其中的集群视图更新与更新通知过程可以包括以下情况:
1)若OSD节点在故障前是partition群的备OSD节点,则可以只变更partition view,把partition view中对应的备OSD节点的partition状态修改为“一致”状态,并把变更后的partition view通知给主OSD节点;
2)若OSD节点在故障前是partition群的主OSD,则同时变更partition view和IO view,该前OSD节点升为该partition群的新的主OSD节点,当前的主OSD降为该partition群的备OSD节点,通知新的主OSD节点进行partition view变更,通知IO路由模块和备OSD节点进行IO view变更。
实际应用中,如果故障的OSD节点在规定的时间周期内没有恢复,MDC模块则把该OSD节点踢出集群,把分布在该OSD节点的partition迁移到其他的OSD节点上。具体流程参见下述图11给出的关于OSD节点退出集群的一个具体实施例。
为了便于理解,以下通过一个具体的实施例进行说明,如图9所示,为本发明提供的OSD节点故障恢复处理流程实施例,本实施例中涉及的执行主体为图2A至图2C以及图5中所描述的实施例中提及的MDC模块,IO路由模块以及OSD节点。为了理解的方便本实施例假设所述故障恢复OSD节点上的partition所在的partition群包括partition1群(简称P1)以及partition n群(简称Pn)为例进行说明,该实施例中的OSD节点故障处理流程具体包括:
902:故障恢复OSD节点向MDC模块请求该故障恢复OSD节点上的集群视图信息,请求中携带该OSD节点的OSD ID。
具体的实施方式中,可以参考上述图6中步骤608所描述的过程。
904:MDC模块根据OSD ID查询partition view获取该OSD上的partition信息。
具体实施方式中,MDC根据所述故障恢复OSD的OSD ID分别查询所述故障恢复节点OSD上的P1和Pn所对应的partition view分别获得P1的partition信息和Pn的partition信息。该partition信息可以包括IO view,也可以进一步包括partition状态。
906:向OSD节点返回partition信息。
908/910:根据MDC返回的Partition信息向partition群的主OSD节点发起数据恢复过程,分别向P1和Pn的主OSD节点获取故障期间缺少的entry信息,携带SeqID。
在本发明的具体实施例中,如果在所述故障恢复过程中,某个主OSD节点(如Pn的主OSD节点)收到IO写请求,需要向该Pn的所有备OSD节点发送复制请求,则该故障恢复流程可以进一步包括以下步骤912-916以及918:
912:Pn的主节OSD点在故障恢复期间接收主机IO写请求。
914:向Partition所在备OSD节点发送复制IO信息。
具体实施方式中,上述步骤912-914可以参考上述图7中的IO操作流程。
916:将复制IO关键信息写入日志(log)。
具体实施方式中,也可以将所述IO请求所对应的数据写入日志,该步骤中的IO关键信息可以参考上述图7中的IO操作流程的描述。
918:根据主OSD节点返回的差异entry数目信息,反复发送根据entry请求数据的消息,确保数据恢复过程执行完毕。
具体实施方式参考下属图11中关于数据恢复的流程。
920:根据日志记录的IO信息,按日志记录将IO写入。
具体实施方式中,在所故障恢复OSD节点与所述主OSD节点完成数据恢复同步后,再根据日志中记录的IO信息将来自该Pn的主OSD的复制请求对应的数据写入该故障恢复OSD节点所管理的物理存储资源。
通过将故障恢复过程中的IO请求写入日志,然后在故障期间遗漏的数据恢复完成后写入故障期间发生的IO请求,可以保证主备OSD节点对每个IO的执行顺序是一致的,从而进一步提高数据备份的一致性。
922/924:数据恢复完成,分别通知P1和Pn的主OSD节点。
具体实施方式中,可以包括两种方式通知MDC模块实执行集群视图的更新,方式之一是数据恢复完毕后,由所述故障恢复OSD节点通知P1和Pn的主OSD节点,以便由该主OSD节点通知所述MDC节点进行集群视图更新。方式二参见下面的步骤930。
实际应用中,所述故障恢复OSD节点在进行集群视图更新的通知之前可以进一步确认该故障恢复OSD节点上的partition的partition状态,并在确认partition状态不一致后触发集群式图更新流程。该故障恢复OSD节点可以通过上述步骤906中返回的partition信息汇总进一步获取partition状态信息。
926/928:P1和Pn的主OSD节点分别通知MDC变更partition View,携带partition群ID、备OSD ID以及视图版本。
具体实施方式中,向MDC模块发送通知,要求其将故障恢复OSD节点上的partition的partition的状态更新为一致,通知中携带OSD节点上partition所在的partition群的partition群ID、备OSD节点ID(即所述故障恢复OSD节点),以及视图版本。这里的partition群ID用于标示待更新的partition群的视图,OSD ID用于标示故障OSD节点,这里视图版本为该P1和Pn所在的主OSD节点本地保存的最新partition view的视图版本,其作用在于所述MDC收到该通知后,确认通知中的partition view的视图版本与该MDC本地维护的最新的partition view的视图版本一致,则进行集群视图更新处理,从而保证IO处理流程中所有模块或者节点看到的集群视图是一致的,从而提高数据备份的一致性。
实际应用中,如果确认通知中的partition view的视图版本与该MDC本地维护的最新的partition view的视图版本不一致,则进一步将最新的partition view发送给所述P1和Pn所在的主OSD节点,并在确认针对该P1和Pn的主备数据一致后,进行集群视图的更新。
928:通知MDC变更Partition View,将故障恢复OSD节点的状态更新为一致,携带的Partition群ID,备OSD ID,视图版本。
930:数据恢复完成,通知主MDC变更Partition View,携带的Partition群ID,备OSD ID,视图版本。
具体实施方式中,该故障恢复OSD节点在数据恢复完成之后,向MDC模块发送通知,告知MDC模块将故障恢复OSD节点上的partition的partition状态更新为一致,与上述方式一不同点在于,此处的视图版本为所述故障恢复OSD节点本地保存的最新的partition view的partition viw版本或者最新IO view的IO view版本(具体看该故障恢复OSD节点在故障前是主OSD节点还是备OSD节点确定是partition view还是IO view),MDC模块根据通知中的视图版本,确定该MDC模块于本地维护的相应的视图版本一致后,进行集群视图更新处理。
932:将partition View中的故障恢复OSD节点的partition状态更新为一致。
具体实施方式中,分别将故障恢复OSD节点上的P1和Pn的partition view中的所述故障恢复OSD节点对应的partition状态更新为“一致”。
934:通过对视图比较,确认故障恢复OSD节点在故障前是否是某些partition的主OSD节点。
具体实施方式中,通过比较所述MDC模块本地维护的针对所述故障恢复OSD节点上P1和Pn的最新的partition view与该P1和Pn的初始化partition view,确认所述故障恢复OSD节点是P1的主OSD节点;或者通过比较所述MDC模块本地维护的针对所述故障恢复OSD节点上P1和Pn的最新的IO view与该P1和Pn的初始化IO view,确认所述故障恢复OSD节点是P1的主OSD节点。
936:将故障恢复节点恢复为主OSD节点,更新Partition View。
具体实施方式中,针对P1,所述MDC将所述故障恢复OSD节点恢复为所述P1的新的主OSD节点,并将所述P1的当前主OSD节点降为P1的新的备OSD节点,并更新所述P1的partition视图;针对Pn,由于所述故障恢复OSD节点原来就是Pn的备OSD节点,所以针对Pn不涉及所述故障恢复OSD节点的主备身份变更问题。
938:如果故障恢复OSD节点原先是主节点,则将最新的Partition View发送至所述故障恢复OSD节点。
具体实施方式中,由于该故障恢复OSD节点原先是P1的主节点,按照步骤936将其恢复为所述P1的新的主OSD节点,因此,需要将更新后P1的最新的Partition View发送给所述P1的新的主OSD节点,即所述故障恢复OSD节点。由于故障恢复OSD节点原先不是Pn的主节点,所以Pn的最新partition view可不发给所述故障恢复OSD节点。
940/942:通知备OSD更新本地的IO view。
具体实施方式中,根据所述P1和Pn的最新的partition View或者IO view,分别获取P1和Pn的新的主OSD节点,将所述P1和Pn的最新的IO View分别发送给所述P1的新的主OSD节点和Pn的主OSD节点(由于Pn不涉及主OSD节点的变更,所以还是发给Pn原来的主OSD节点)。
944:所述故障恢复OSD节点判断Partition View与本地保存的IO view中主OSD是否一致,判断自己是否升主,更新IO view。
所述故障恢复OSD节点接收到所述最新的partition视图后,判断该最新的partition View中的主OSD与本地保存的IO view中的主OSD是否一致,以判断自己是否升主;如果是,则更新IO view以及本地保存的partition view,并根据所述更新的partition视图和IO view处理IO请求所涉及的数据的复制。
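下面给出一个示意性的代码草图(Python),说明故障恢复OSD节点判断自己是否升主并更新本地视图的逻辑;其中的字段名(primary、version)均为本文假设。

```python
# 示意性草图:故障恢复OSD节点收到最新partition view后,比较其中的主OSD与本地IO view中的主OSD,
# 判断自己是否升主并更新本地视图(字段名为本文假设)
def on_partition_view(self_osd_id, new_partition_view, local_io_view):
    promoted = (new_partition_view["primary"] == self_osd_id
                and local_io_view["primary"] != self_osd_id)
    if promoted:
        # 自己升为主OSD:更新本地IO view,并保存partition view用于后续复制处理
        local_io_view = {"primary": self_osd_id, "version": local_io_view["version"] + 1}
        return promoted, local_io_view, new_partition_view
    return promoted, local_io_view, None

# 用法示例:osd1 在最新partition view中已是主OSD,而本地IO view仍记录osd2为主,因此判定升主
print(on_partition_view("osd1", {"primary": "osd1", "backups": ["osd2"]},
                        {"primary": "osd2", "version": 2}))
```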
946:更新IO view,如果是主OSD降备,则删除Partition View。
具体实施方式中,所述P1和Pn的当前的主OSD节点(即数据恢复完毕后、视图更新前的主OSD节点)接收到所述更新的IO视图,更新各自本地保存的IO视图。针对P1的当前主OSD节点,由于其已经由MDC模块降为P1的新的备OSD节点,所以P1的当前主OSD节点将本地保存的partition视图删除,并根据更新后的本地保存的IO视图处理IO请求的复制。
在本发明实施例中,OSD节点发生故障后,MDC会通知其他相关节点进行视图更新,隔离或忽略该故障OSD节点,继续IO请求的处理,不会阻塞其他的IO请求处理,并在节点故障恢复后更新视图并通知各相关节点,使得该故障恢复节点可以快速地重新加入集群工作。这样能够快速处理节点故障和故障的恢复,具有更好的容错能力,提高了存储系统的性能和可用性。可用性差的系统决定了其扩展性不会太好,因为在大规模分布式存储系统中,存储节点的故障是经常发生的事情,复杂且大量的协议交互也会使得系统的扩展能力降低。此外,针对partition粒度的集群视图控制,可以极大地缩小存储节点故障的影响范围,使得存储系统能够进行大规模的扩展,提高了系统的扩展性。
数据恢复流程
下面通过一个具体实施例说明上述图9的OSD故障恢复处理过程中的数据恢复处理流程。为了理解的方便,本实施例以所述故障恢复OSD节点上的partition所在的partition群包括partition1群(简称P1)以及partition n群(简称Pn)为例进行说明,如图10所示,该实施例中的数据恢复处理流程具体包括:
1002:故障恢复OSD节点从本地获取记录的每个partition的entry中最大的SeqID。
如上述图7所述的实施例中步骤712以及722所示,系统中的OSD节点在IO处理过程中针对某个partition的每一IO操作会记录一条entry,如前所述,所述的entry包括IO操作类型,Partition ID,Seq ID,以及key,所述的entry中还可以进一步包括状态信息,用于描述所述的操作是否成功,另外针对写IO操作还可以进一步包括前述的offset以及length。例如,本实施例中,分别获取针对P1以及Pn的写IO操作的最大Seq ID。
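下面给出一个示意性的代码草图(Python),示意entry的记录结构以及获取某个partition最大SeqID的方式;其中Entry的字段划分以及SeqID采用"(IO视图版本号, 版本内连续编号)"的元组表示均为本文假设,并非对实现方式的限定。

```python
# 示意性草图:entry 记录结构与获取某 partition 最大 SeqID(字段名与表示方式为本文假设)
from dataclasses import dataclass
from typing import Optional

@dataclass
class Entry:
    op_type: str            # IO操作类型,如 "write" / "delete"
    partition_id: int       # Partition(群)ID
    seq_id: tuple           # (IO视图版本号, 该版本内的连续编号)
    key: str                # 数据的key
    status: str = "ok"      # 可选:描述操作是否成功
    offset: int = 0         # 可选:写IO的偏移
    length: int = 0         # 可选:写IO的长度

def max_seq_id(entries, partition_id) -> Optional[tuple]:
    """返回本地记录的某 partition 的 entry 中最大的 SeqID(无记录时返回 None)。"""
    seqs = [e.seq_id for e in entries if e.partition_id == partition_id]
    return max(seqs) if seqs else None

# 用法示例:P1 上记录了两条写操作,最大 SeqID 为 (1, 6)
log = [Entry("write", 1, (1, 5), "k1", offset=0, length=4096),
       Entry("write", 1, (1, 6), "k2", offset=0, length=512)]
print(max_seq_id(log, 1))   # 输出 (1, 6)
```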
1004/1006:分别向P1和Pn的主OSD节点请求获取该OSD缺少的entry,请求中携带partition群ID,各自的最大的SeqID。
场景一:备OSD最大的SeqID在主OSD记录的entry范围内。
1008/1010:P1和Pn的主OSD节点分别向所述故障恢复OSD节点发送该故障恢复OSD节点缺少的entry。
具体实施方式中,如果所述故障恢复OSD节点最大的SeqID为1.6,主OSD当前最大的SeqID为1.16,则把SeqID为1.7到1.16对应的10条entry发送给备OSD。这里仅仅是为了理解的方便而给出的一个例子,实际应用中,Seq ID的编号规则或者方式可能不一样,而且P1和Pn的主OSD节点上可能会有不同的缺少的entry。
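下面给出一个与上述例子对应的示意性代码草图(Python),示意主OSD节点如何挑选出备OSD缺少的entry;其中entry以字典表示、SeqID以元组表示,均为本文假设。

```python
# 示意性草图:场景一的增量同步,主OSD挑选出备OSD缺少的entry(数据结构为本文假设)
def missing_entries(primary_entries, backup_max_seq):
    """返回主OSD上 SeqID 大于备OSD最大 SeqID 的 entry,按 SeqID 升序排列。"""
    return sorted((e for e in primary_entries if e["seq_id"] > backup_max_seq),
                  key=lambda e: e["seq_id"])

# 用法示例:备OSD最大SeqID为(1, 6),主OSD最新为(1, 16),则返回(1, 7)~(1, 16)共10条entry
primary_log = [{"seq_id": (1, n), "key": "k%d" % n} for n in range(1, 17)]
diff = missing_entries(primary_log, (1, 6))
print(len(diff), diff[0]["seq_id"], diff[-1]["seq_id"])   # 10 (1, 7) (1, 16)
```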
1012/1014:所述故障恢复OSD节点根据上一步获取的entry反复进行数据同步:请求中携带key、offset、length等IO关键信息。
具体实施方式中,所述故障恢复OSD节点根据获取的entry分批逐条向所述P1和Pn所在的主OSD节点进行数据同步请求,请求中包括key、offset、length等IO关键信息。
1016:发送对应的数据。
具体实施方式中,所述P1和Pn的主OSD节点分别根据获取的所述数据同步请求中的信息将每条entry对应的数据发送至所述故障恢复OSD节点。
场景二:备OSD最大的SeqID不在主OSD记录的entry范围内,比主OSD最小SeqID小。
1018:发送主OSD最小SeqID和零条entry。
具体实施方式中,所述Pn的主OSD节点在确定所述故障恢复OSD节点的最大的SeqID不在主OSD记录的entry范围内,即比所述Pn的主OSD的最小SeqID还小的情况下,向所述故障恢复OSD节点发送所述Pn的主OSD节点的最小SeqID和零条entry,方便所述故障恢复OSD节点判断所述Pn的主OSD节点是没有写过数据,还是写的数据太多以至于不能通过增量同步完成数据恢复。
1020:反复请求同步partition的数据,请求中携带partition群ID。
所述故障恢复OSD节点向所述Pn的主OSD节点请求同步整个partition的数据,如本实施例中将Pn的主OSD上的主partition数据与所述故障恢复OSD节点上的Pn备partition上的数据进行同步,请求中携带partition群ID。具体实施方式中,由于通常一个partition的数据量很大,一个请求无法传输完,同时主OSD节点不知道所述故障恢复节点的IO能力,如果一直往所述故障恢复节点发送数据,所述故障恢复节点可能无法处理,所以只有等所述故障恢复节点来请求同步数据时才会把数据发送给所述故障恢复节点;因此所述故障恢复OSD节点会根据情况反复向Pn的主OSD节点发送同步请求,同步整个partition中的数据,直至所述partition中的所有数据都同步完毕。实际应用中,也可以有其他的整体partition同步的方式,本发明不做限制。
1022:发送一个key或者多个key对应的数据。
所述Pn的主OSD节点根据所述故障恢复OSD节点每次发送的同步请求发送一个或多个Key对应的数据。
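结合上述场景一与场景二,下面给出一个示意性的代码草图(Python),示意主OSD节点如何根据备OSD的最大SeqID在增量同步与整个partition同步之间选择,以及备OSD如何分批拉取整个partition的数据;其中的函数名、数据结构与批大小均为本文假设的简化处理。

```python
# 示意性草图:主OSD根据备OSD的最大SeqID选择增量同步或整个partition同步,
# 以及备OSD按批次反复拉取整个partition数据(接口与批大小为本文假设)
def handle_sync_request(primary_min_seq, primary_entries, backup_max_seq):
    """备OSD最大SeqID不小于主OSD最小SeqID时走增量同步,否则返回主OSD最小SeqID和零条entry。"""
    if backup_max_seq >= primary_min_seq:
        return {"entries": [e for e in primary_entries if e["seq_id"] > backup_max_seq]}
    return {"min_seq": primary_min_seq, "entries": []}   # 提示备OSD改走整个partition同步

def pull_whole_partition(primary_data, batch=2):
    """备OSD按key分批反复请求,直至整个partition的数据同步完毕。"""
    keys, synced = sorted(primary_data), {}
    for i in range(0, len(keys), batch):
        for k in keys[i:i + batch]:          # 每次请求返回一个或多个key对应的数据
            synced[k] = primary_data[k]
    return synced

# 用法示例:备OSD落后太多(最大SeqID小于主OSD最小SeqID),改为整个partition同步
print(handle_sync_request((1, 8), [], (1, 2)))          # {'min_seq': (1, 8), 'entries': []}
print(pull_whole_partition({"k1": b"a", "k2": b"b", "k3": b"c"}))
```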
OSD节点故障后退出集群的流程
OSD节点故障后超过预定的时间阈值(例如5分钟)仍不能恢复加入集群,或者OSD节点发生硬件故障,为了保证数据的可靠性,需要把故障的OSD节点踢出集群。
OSD节点退出集群是一个partition重新分布和数据迁移的过程,partition的重新分布要考虑各个节点的均衡性和副本的安全性;数据迁移过程的IO处理与故障恢复流程以及数据恢复流程中的处理一致;数据迁移完成、主备副本到达一致状态后,MDC进行视图更新与通知的过程与故障恢复完成后视图更新的处理一致。各相关OSD节点根据更新后的视图进行IO请求的复制或转发的处理过程,也与故障恢复完成后各OSD节点根据最新的视图进行IO处理的过程一致。
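下面给出一个示意性的代码草图(Python),示意把故障OSD节点上的partition重新分布到其余节点时兼顾均衡性与副本安全性的一种简化策略;其中placement的数据结构与选择策略均为本文假设,实际系统可以采用其他的分布算法。

```python
# 示意性草图:故障OSD被踢出集群后,把其上的partition重新分布到其余OSD节点,
# 兼顾各节点partition数量的均衡以及同一partition群副本不同节点的约束(策略为本文假设的简化示例)
def redistribute(failed_osd, placement, alive_osds):
    """placement: partition群ID -> 该partition群副本所在OSD列表。"""
    load = {o: sum(o in osds for osds in placement.values()) for o in alive_osds}
    for pg, osds in placement.items():
        if failed_osd in osds:
            # 候选节点:存活、且尚未持有该partition群的其他副本(保证副本的安全性)
            candidates = [o for o in alive_osds if o not in osds]
            target = min(candidates, key=lambda o: load[o])     # 选择负载最小者,保证均衡性
            osds[osds.index(failed_osd)] = target
            load[target] += 1
    return placement

# 用法示例:osd1 故障,P1/P2 中 osd1 的副本分别迁移到负载较低的存活节点
print(redistribute("osd1", {"P1": ["osd1", "osd2"], "P2": ["osdn", "osd1"]},
                   ["osd2", "osdn", "osd3"]))
```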
为了便于理解,以下通过一个具体的实施例进行说明。如图11所示,为本发明提供的OSD节点故障退出集群处理流程实施例,本实施例中涉及的执行主体为图2A至图2C以及图5中所描述的实施例中提及的MDC模块、IO路由模块以及OSD节点。为了理解的方便,本实施例假设所涉及的OSD节点为OSD1节点、OSD2节点以及OSDn节点,其中OSD1节点为partition1群(简称P1)的主OSD节点以及partition2群(简称P2)的备OSD节点,OSD2节点为P1的备OSD节点,OSDn节点为P2的主OSD节点。该实施例中的OSD节点故障退出处理流程具体包括:
1100:OSD1节点发生故障。
1102:MDC模块发现节点OSD1故障超过预定的阈值或者发生了硬件故障,MDC把OSD1踢出集群,进行视图变更,把OSD1节点上的partition(此处为P1的主partition和P2的备partition)迁移到其他OSD节点,例如本实施例中的OSD2节点和OSDn节点。
1104:MDC模块向OSD2节点通知视图更新:OSD2节点升为P1的主,成为P2的备。
1106:MDC模块向OSDn节点通知视图更新:OSDn节点为P2的主,成为P1的备。
1108:OSD2节点向OSDn节点请求同步P2的数据。
由于OSDn节点为P2的主OSD节点,所以OSD2节点向OSDn节点请求同步OSDn上的P2的主partition的数据,使得OSD2上的P2的备partition的数据与OSDn上的P2的主partition的数据一致。具体的同步流程与上述图10所示的数据恢复流程中整个partition的数据同步流程相似,此处不再赘述。
1110:OSDn节点向OSD2节点请求同步P1的数据。
由于OSD2节点原来就是P1的备OSD节点,OSD1节点故障后,OSDn节点作为P1新的备OSD节点,只能向OSD2同步P1的数据,即OSDn节点向OSD2节点请求同步OSD2上的P1的主partition的数据,使得OSDn上的P1的备partition的数据与OSD2上的P1的主partition的数据一致。
1112:partition数据同步完成。
1114:OSD2节点通知MDC模块已经完成P2的数据迁移。
1116:OSDn节点通知MDC模块已经完成P1的数据迁移。
1118:MDC模块根据相应的通知进行视图更新。
具体的视图更新原理和更新过程与上述各流程中所描述的一样,此处不再赘述。
1120:通知视图更新:P1的备OSDn节点的partition状态为一致。
1122:通知视图更新:P2的备OSD2节点的partition状态为一致。
1124:所述OSD2节点和OSDn节点根据最新的视图进行IO请求所对应的数据的复制处理。
OSD节点扩容流程
新节点扩容加入集群,需要把原来分布在其他OSD节点的partition迁移到新扩容加入集群的OSD节点,以保证数据分布的均衡性。节点扩容的流程主要涉及数据迁移,数据迁移后的视图更新,视图更新的通知,相关OSD节点根据更新后的视图处理IO请求的复制。
数据迁移过程的IO处理与故障恢复流程以及数据恢复流程中的处理一致;数据迁移完成后,MDC进行视图更新与通知的过程与故障恢复完成后视图更新的处理一致。各相关OSD节点根据更新后的视图进行IO请求所对应的数据的复制的处理过程,也与故障恢复完成后各OSD节点根据最新的视图进行IO处理的过程一致。
具体的实施例中,在数据迁移完成后进行视图更新可以包括:1)新扩容节点仍然是部分partition群的备OSD节点,partition状态为一致状态,该partition群的原来的一个备OSD不再是partition的备节点;2)新扩容节点升为部分partition群的主OSD节点,该partition群不再属于原来的主OSD节点(该partition群不再分布在原来的主OSD)。
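下面给出一个示意性的代码草图(Python),示意扩容时按副本数量均衡从现有节点向新扩容OSD节点迁移部分partition副本的一种简化策略;该草图未区分主备副本,数据结构与策略均为本文假设,并非对实现方式的限定。

```python
# 示意性草图:新OSD节点扩容加入集群后,从现有节点迁移部分partition副本到新节点以保持分布均衡
# (仅按"每节点副本数尽量接近平均值"做简化示意,策略为本文假设)
def plan_migration(placement, new_osd):
    replicas = [(pg, osd) for pg, osds in placement.items() for osd in osds]
    target_per_node = len(replicas) // (len({o for _, o in replicas}) + 1)  # 含新节点的平均值
    load = {}
    for _, osd in replicas:
        load[osd] = load.get(osd, 0) + 1
    moves = []
    for pg, osds in placement.items():
        donor = max(osds, key=lambda o: load.get(o, 0))      # 选出该partition群中负载最高的节点
        if load.get(donor, 0) > target_per_node and new_osd not in osds:
            moves.append((pg, donor, new_osd))               # 把 donor 上该partition群的副本迁给新节点
            load[donor] -= 1
            osds[osds.index(donor)] = new_osd
    return moves

# 用法示例:每个旧节点各让出一个副本给新扩容节点
print(plan_migration({"P1": ["osd1", "osdn"], "Pn": ["osdn", "osd1"]}, "new_osd"))
```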
为了便于理解,以下通过一个具体的实施例进行说明。本实施例中涉及的执行主体为图2A至图2C以及图5中所描述的实施例中提及的MDC模块、IO路由模块以及OSD节点。为了理解的方便,本实施例假设所涉及的OSD节点为OSD1节点、OSDn节点以及新扩容OSD节点,其中OSD1节点为partition群P1(简称P1)的主OSD节点、partition群Pn(简称Pn)的备OSD节点,OSDn节点为Pn的主OSD节点、P1的备OSD节点,新扩容OSD节点为P1的新的备OSD节点以及Pn的新的备OSD节点。如图12所示,该实施例中的OSD节点扩容流程具体包括:
1202:向MDC模块下发OSD节点扩容配置和命令。
具体实施方式中,可以是系统管理员通过配置命令向MDC模块通知OSD节点扩容。
1204:MDC模块进行视图更新,把部分OSD节点上的partition迁移到新扩容OSD节点。
本实施例中,MDC模块将OSDn节点上的P1备partition和OSD1节点上的Pn备partition迁移到所述新扩容OSD节点,使得新扩容OSD节点作为P1的新的备OSD节点和Pn的新的备OSD节点。
1206:向OSD1节点通知视图更新:增加新的备OSD。
具体实施方式中,MDC模块向OSD1节点通知视图更新,针对P1的新的partition view中增加新扩容OSD节点作为P1的备OSD节点,且对应的partition状态为“不一致”(因为新扩容OSD节点和OSD1节点还没有对P1的数据进行同步)。
1208:向OSDn节点通知视图更新:增加新的备OSD。
具体实施方式中,针对Pn的新的partition view中增加新扩容OSD节点作为Pn的备OSD节点,且对应的partition状态为“不一致”(因为新扩容OSD节点和OSDn节点还没有对Pn的数据进行同步)。
1210:扩容新OSD节点启动。
具体实施方式中,新扩容OSD节点作为新的OSD节点加入集群后,进行初始化过程。具体过程与上述图6中的集群视图初始化生成和获取流程一样,此处不再赘述。
1212:返回该OSD节点的partition信息。
具体实施方式中,MDC模块向所述新扩容OSD节点返回该新扩容OSD节点上的partition的视图,本实施例中为P1的IO view和Pn的IO view。
1214:扩容新OSD节点向主OSD1节点请求同步partition的数据。
具体实施方式中,扩容新OSD节点根据MDC返回的P1的IO view向P1的主OSD节点OSD1请求同步P1的数据,即同步OSD1节点上P1的主partition的数据,以使得该扩容新OSD节点上P1的备partition的数据与OSD1节点上P1的主partition的数据一致。
具体的同步流程与上述图10所示的数据恢复流程中整个partition的数据同步流程相似,此处不再赘述。
1216:扩容新OSD节点向主OSDn节点请求同步partition的数据。
具体实施方式中,扩容新OSD节点根据MDC返回的Pn的IO view向Pn的主OSD节点OSDn请求同步Pn的数据,即同步OSDn节点上Pn的主partition的数据,以使得该扩容新OSD节点上Pn的备partition的数据与OSDn节点上Pn的主partition的数据一致。
具体的同步流程与上述图10所示的数据恢复流程中整个partition的数据同步流程相似,此处不再赘述。
1218:partition数据同步完成。
1220:通知MDC模块,备节点已经完成partition数据迁移。
具体实施方式中,OSD1节点向MDC模块通知,新扩容OSD节点已经完成P1的数据同步。
1222:通知MDC模块,备节点已经完成partition数据迁移。
具体实施方式中,OSDn节点向MDC模块通知,新扩容OSD节点已经完成Pn的数据同步。
1224:MDC模块进行视图更新。
具体的视图更新原理和更新过程与上述各流程中所描述的一样,此处不再赘述。
1226-1230:分别向所述OSD1节点、OSDn节点和新扩容OSD节点通知视图更新。
1232:所述OSD1节点、OSDn节点和新扩容OSD节点根据更新后的视图进行IO请求所对应的数据的复制处理。
通过以上实施例的描述,本领域普通技术人员可以理解:实现上述实施例方法中的全部或部分步骤可以通过程序指令相关的硬件完成,所述的程序可以存储于一计算机可读取存储介质中,该程序在执行时,包括如上述方法实施例的步骤;所述的存储介质如:ROM/RAM、磁碟、光盘等。
以上所述,仅为本发明的具体实施例,但本发明的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本发明揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本发明的保护范围之内。因此,本发明的保护范围应以权利要求的保护范围为准。

Claims (37)

  1. 一种分布式存储复制系统,所述系统包括至少一个元数据控制(MDC)模块,多个IO路由模块以及多个存储对象设备(OSD)节点,所述MDC模块用于为所述各OSD节点配置与所述各OSD节点所管理的物理存储资源相对应的至少一个逻辑分区(partition),所述至少一个partition为主partition或备partition或主备partition的任意组合,主partition和所述主partition对应的备partition构成一个partition群,同一个partition群中的主partition和备partition位于不同的OSD节点上,主partition所在的OSD节点为所述主partition所在的partition群的主OSD节点,备partition所在的OSD节点为所述备partition所在的partition群的备OSD节点,根据所述partition生成partition视图,所述partition视图包括一个partition群中的各partition所在的OSD信息;所述IO路由模块用于将接收的IO请求路由至所述OSD节点;所述OSD节点用于根据所述IO请求执行IO请求所对应的数据的存储;其特征在于:
    所述MDC确定所述系统中某个OSD节点为故障OSD节点,确定所述故障OSD节点上的partition,更新所述故障OSD节点上的partition所在的partition群的partition视图,向所述更新的partition视图中的主OSD节点发送更新通知;以及
    所述主OSD节点用于接收所述MDC模块发送的所述更新通知后,根据所述更新的partition视图处理IO请求所对应的数据的复制。
  2. 根据权利要求1所述的系统,其特征在于,所述partition视图具体包括一个partition群中各partition所在的OSD的主备身份及对应的partition状态;所述主OSD节点进一步用于根据所述更新的partition视图更新所述主OSD节点本地保存的partition视图;
    所述根据所述更新的partition视图处理IO请求所对应的数据的复制具体包括:
    根据所述更新后本地保存的partition视图将来自所述IO路由模块的IO请求所对应的数据复制到所述更新后本地保存的partition视图中partition状态为一致的备OSD节点上,或者复制到所述更新后本地保存的partition视图中partition状态为一致的备OSD节点上以及所述更新后本地保存的partition视图中partition状态为不一致但正在进行数据恢复的备OSD节点上。
  3. 根据权利要求2所述的系统,其特征在于,所述MDC模块进一步用于生成IO视图,所述IO视图包括一个partition群的主OSD节点的标识,将所述IO视图发送给所述IO路由模块以及所述partition视图中的partition所在的OSD节点;以及
    所述主OSD节点进一步用于根据所述更新的partition视图更新所述主OSD节点本地保存的IO视图,并根据所述更新后本地保存的IO视图处理所述IO请求所对应的数据的复制。
  4. 根据权利要求3所述的系统,其特征在于:
    所述的MDC模块进一步用于当确定所述故障OSD节点上的partition包括主partition时,更新所述主partition所在的partition群的IO视图,将所述更新的IO视图通知给所述更新的partition视图中的备OSD节点;以及
    所述更新的partition视图中的备OSD节点用于根据所述更新的IO视图更新本地保存的IO视图并根据所述更新后本地保存的IO视图处理所述IO请求所对应的数据的复制。
  5. 根据权利要求4所述的系统,其特征在于:
    所述MDC模块进一步用于将所述更新的IO视图通知给所述IO路由模块;以及
    所述IO路由模块根据所述更新的IO视图更新所述IO路由模块本地保存的IO视图并根据所述更新后本地保存的IO视图处理IO请求的转发。
  6. 根据权利要求4所述的系统,其特征在于,所述更新所述故障OSD节点上的partition所在的partition群的partition视图具体包括:
    当所述故障OSD节点上的partition包括备partition时,将所述备partition所在的partition群的partition视图中所述故障OSD节点的partition状态标记为不一致;以及
    当所述故障OSD节点上的partition包括主partition时,将所述主partition所在的partition群的partition视图中作为主OSD节点的所述故障OSD节点降为新的备OSD节点并将所述新的备OSD节点对应的partition状态标记为不一致,并从所述主partition所在的partition群的partition视图中的原备OSD节点中选择一个partition状态为一致的备OSD节点升为新的主OSD节点。
  7. 根据权利要求6所述的系统,其特征在于:
    在所述故障OSD节点故障恢复并完成数据恢复后,所述MDC模块用于进一步更新所述更新的partition视图以及所述更新的IO视图,向所述进一步更新的partition视图中的主OSD节点发送更新通知,向所述进一步更新的partition视图中的备OSD节点发送更新通知;
    所述进一步更新的partition视图中的主OSD节点用于根据所述进一步更新的partition视图处理所述IO请求所对应的数据的复制;以及
    所述进一步更新的partition视图中的备OSD节点用于根据所述进一步更新的IO视图处理所述IO请求所对应的数据的复制。
  8. 根据权利要求1-7所述的任一系统,其特征在于,所述系统包括多个主机,所述MDC模块,IO路由模块和OSD节点分别部署在所述多个主机中的至少一个主机上,所述OSD节点用于管理所述主机上的物理存储资源。
  9. 一种分布式存储复制系统,所述系统包括至少一个元数据控制(MDC)模块以及多个IO路由模块,多个存储对象设备(OSD)节点,所述MDC模块用于为所述各OSD节点配置与所述各OSD节点所管理的物理存储资源所对应的至少一个逻辑分区(partition),所述至少一个partition为主partition 或备partition或主备partition的任意组合,主partition和所述主partition对应的备partition构成一个partition群,同一个partition群中的主partition和备partition位于不同的OSD节点上,主partition所在的OSD节点为所述主partition所在的partition群的主OSD节点,备partition所在的OSD节点为所述备partition所在的partition群的备OSD节点,根据所述partition生成partition视图以及IO视图,所述partition视图包括一个partition群中的各partition所在的OSD信息,所述IO视图包括一个partition群的主OSD的标识;所述IO路由模块用于将接收的IO请求路由至所述OSD节点;所述OSD节点用于根据所述IO请求执行IO数据的存储;其特征在于:
    所述IO路由模块用于接收IO请求,所述IO请求包括key,根据所述key确定所述IO请求所对应的数据所属的partition群并确定所述数据所属的partition群的主OSD节点,将所述数据所属的partition群的IO视图的IO视图版本信息加入所述IO请求,并将携带有所述IO视图版本信息的所述IO请求发送至所述确定出的主OSD节点;
    所述主OSD节点用于接收所述IO请求,在根据所述IO视图版本信息确定所述IO请求中的IO视图版本与所述主OSD节点本地保存的IO视图版本一致后,执行所述IO请求,生成携带有所述IO视图版本信息的复制请求,并将所述的复制请求发送至所述数据所属的partition群的备OSD节点;以及
    所述备OSD节点用于接收所述复制请求,并在根据所述IO视图版本信息确定所述复制请求中的IO视图版本与所述备OSD节点本地保存的IO视图版本一致后,执行所述复制请求以使得所述备OSD节点上的所述IO请求所对应的数据与所述主OSD节点上的所述IO请求所对应的数据保持一致。
  10. 根据权利要求9所述的系统,其特征在于,所述主OSD节点进一步用于在根据所述IO视图版本信息确定所述IO请求中的IO视图版本旧于所述主OSD本地保存的IO视图版本后向所述的IO路由模块返回错误,以及在确定所述IO请求中的IO视图版本新于所述主OSD本地保存的IO视图版本后将所述IO请求加入缓冲队列,并向所述MDC模块查询所述数据所属的partition群的IO视图的IO视图版本信息以确定所述主OSD本地保存的IO视图版本与所述IO请求中的IO视图版本一致后执行所述IO请求;以及
    所述IO路由模块用于收到所述主OSD节点返回的错误后向所述MDC模块查询所述数据所属的partition群的IO视图,在获得更新的IO视图版本信息后发送所述携带所述更新的IO视图版本信息的IO请求。
  11. 根据权利要求9所述的系统,其特征在于,所述IO视图版本信息包括IO视图版本号,所述主OSD节点进一步为所述IO请求生成序列标识,所述序列标识包括所述IO视图版本号和序列号,并将所述序列标识加入到发送至所述备OSD节点的所述复制请求;所述序列号表示在一个IO视图版本内针对所述IO视图中的partition群对应的数据的修改操作的连续编号;
    所述备OSD节点进一步用于根据所述序列标识执行所述复制请求。
  12. 根据权利要求11所述的系统,其特征在于,所述主OSD节点进一步用于在所述IO请求为修改操作时记录一条entry,所述的entry包括操作类型、所述数据所属的partition群的partition群ID、所述序列标识以及所述key。
  13. 根据权利要求11所述的系统,其特征在于,所述复制请求中进一步携带有所述主OSD节点针对所述partition群发送的上一次复制请求中的序列标识;
    所述备OSD节点用于在收到所述的复制请求后并在确定所述上一次复制请求的序列标识与所述备OSD节点本地保存的最大序列标识一致的情况下执行所述复制请求。
  14. 根据权利要求13所述的系统,其特征在于,所述备OSD节点进一步用于在确定所述上一次复制请求的序列标识大于所述备OSD节点本地保存的最大序列标识时请求所述主OSD节点重发丢失的请求,以保持所述备OSD节点与所述主OSD节点上的数据一致。
  15. 根据权利要求9所述的系统,其特征在于,所述partition视图具体包括一个partition群中各partition所在的OSD的主备身份及对应的partition状态;所述MDC模块进一步用于:
    在所述IO请求处理过程中检测到所述主OSD节点发生故障时将所述数据所属的partition群的partition视图中的所述主OSD节点降为新的备OSD节点,并将所述新的备OSD的partition状态标记为不一致,将所述数据所属的partition群的partition视图中的所述备OSD节点中任一备OSD节点升为新的主OSD节点,将更新后的所述数据所属的partition群的partition视图通知至所述新的主OSD节点,用所述新的主OSD节点更新所述数据所属的partition群的IO视图,将更新后的所述数据所属的partition的IO视图通知至所述IO路由模块;
    所述IO路由模块进一步用于接收所述MDC模块发送的所述更新后的所述partition群的IO视图,根据所述更新后的所述partition群的IO视图将所述IO请求发送至所述新的主OSD节点;以及
    所述新的主OSD节点用于接收所述IO请求,在执行所述IO请求后生成第二复制请求,将所述第二复制请求发送至所述更新后的所述数据所属的partition群的partition视图中的partition状态为一致的备OSD节点。
  16. 根据权利要求9所述的系统,其特征在于,所述partition视图具体包括一个partition群中各partition所在的OSD的主备身份及相应的partition状态,所述MDC模块进一步用于在所述IO请求处理过程中检测到所述备OSD节点中任一备OSD节点发生故障时,将所述数据所属的partition群的partition视图中的所述任一备OSD的partition状态标记为不一致,将更新后的所述数据所属的partition群的partition视图通知至所述主OSD节点;以及
    所述主OSD节点用于接收到所述更新后的所述数据所属的partition群的partition视图后,将所述复制请求发送至所述更新后的partition视图中的partition状态为一致的备OSD节点,不将所述复制请求发送至partition状态为不一致的备OSD节点。
  17. 根据权利要求9-16所述的任一系统,其特征在于,所述系统包括多个主机,所述MDC模块,IO路由模块和OSD节点分别部署在所述多个主机中的至少一个主机上,所述OSD节点用于管理所述主机上的物理存储资源。
  18. 一种分布式存储复制系统,所述系统包括至少一个元数据控制(MDC)模块以及多个IO路由模块,多个存储对象设备(OSD)节点,所述MDC模块用于为所述各OSD节点配置与所述各OSD节点所管理的物理存储资源所对应的至少一个逻辑分区(partition),所述至少一个partition为主partition或备partition或主备partition的任意组合,主partition和所述主partition对应的备partition构成一个partition群,同一个partition群中的主partition和备partition位于不同的OSD节点上,主partition所在的OSD节点为所述主partition所在的partition群的主OSD节点,备partition所在的OSD节点为所述备partition所在的partition群的备OSD节点,根据所述partition生成partition视图以及IO视图,所述partition视图包括一个partition群中的各partition所在的OSD信息,所述IO视图包括一个partition群的主OSD的标识;所述IO路由模块用于对接收的IO请求进行路由转发至所述OSD节点;所述OSD节点用于执行所述IO请求所对应的数据的存储;其特征在于:
    所述OSD节点用于在自身故障恢复后向所述MDC模块发送查询请求以请求所述OSD节点上的partition所在的partition群的IO视图,所述OSD节点称为故障恢复OSD节点,所述查询请求中携带所述故障恢复OSD节点的OSD标识,接收所述MDC返回的所述IO视图,向所述IO视图中的主OSD发起数据恢复请求以请求恢复所述故障恢复OSD节点在故障期间更新的数据,接收所述主OSD发送的所述故障期间更新的数据,根据所述MDC模块在所述故障恢复OSD节点数据恢复执行完毕后更新的所述partition群的partition视图处理IO请求的复制;
    所述MDC模块用于接收所述故障恢复OSD节点的所述查询请求,根据查询请求中的OSD标识将所述IO视图返回给所述故障恢复OSD节点,并在所述故障恢复OSD节点数据恢复执行完毕后更新所述partition群的partition视图;以及
    所述主OSD节点用于接收所述故障恢复OSD节点的数据恢复请求,将所述故障期间更新的数据发送至所述的故障恢复OSD节点,以及根据所述MDC模块在所述故障恢复OSD节点数据恢复执行完毕后更新的所述partition群的partition视图处理所述IO请求所对应的数据的复制。
  19. 根据权利要求18所述的系统,其特征在于:
    所述MDC模块进一步用于在所述故障恢复OSD节点的故障发生后以及故障恢复之前更新所述partition群的partition视图和IO视图;
    所述根据查询请求中的OSD标识将所述IO视图返回给所述故障恢复OSD节点具体为将所述更新后的IO视图返回给所述故障恢复OSD节点。
  20. 根据权利要求18所述的系统,其特征在于:
    所述主OSD节点进一步用于在接收所述数据恢复请求后接收所述IO路由模块发送的针对所述故障恢复OSD节点上的partition的IO请求,执行所述IO请求,向所述的故障恢复OSD节点发送携带有IO关键信息以及所述IO请求所对应的数据的复制请求;以及
    所述的故障恢复OSD节点将所述IO关键信息以及所述IO请求所对应的数据的复制请求写入日志,在所述数据恢复执行完毕后根据所述日志的记录将所述IO请求所对应的数据更新到所述故障恢复OSD节点所管理的物理存储资源中。
  21. 根据权利要求18所述的系统,其特征在于,所述数据恢复请求中携带有所述故障恢复OSD节点本地记录的针对所述故障恢复OSD节点上的partition的IO操作的最大序列标识;所述最大序列标识为:所述故障恢复OSD节点上的partition所在的partition群的IO视图的最新IO视图版本号,以及针对所述最新IO视图版本号对应的IO视图中的partition对应的数据的修改操作的最大编号;
    所述将所述故障期间更新的数据发送至所述的故障恢复OSD节点包括:
    确定所述数据恢复请求中最大的序列标识大于或等于所述主OSD节点本地存储的当前最小的序列标识,向所述故障恢复OSD节点发送所述故障恢复OSD节点在故障期间缺少的entry,接收所述故障恢复OSD节点根据所述entry发起的数据恢复请求,将对应所述entry的数据发送至所述的故障恢复OSD节点;
    所述最小序列标识为:所述主OSD节点保存的所述partition群的IO视图的最小IO视图版本号,以及针对所述最小IO视图版本号对应的IO视图中的partition对应的数据的修改操作的最小编号。
  22. 根据权利要求18所述的系统,其特征在于,所述数据恢复请求中携带有所述故障恢复OSD节点本地记录的针对所述故障恢复OSD节点上的partition的IO操作的最大序列标识;
    所述最大序列标识包括:所述故障恢复OSD节点上的partition所在的partition群的IO视图的最新IO视图版本号,以及在所述最新IO视图版本号对应的IO视图内针对所述IO视图中的partition对应的数据的修改操作的最大编号;
    所述将所述故障期间更新的数据发送至所述的故障恢复OSD节点包括:
    确定所述数据恢复请求中最大的序列标识小于所述主OSD节点本地存储的当前最小的序列标识,向所述故障恢复OSD节点发送所述主OSD本地存储的当前最小的序列标识,接收所述故障恢复OSD节点发起的同步所述主OSD节点上所属所述partition群的主partition所对应的全部数据的数据恢复请求,将所述主partition所对应的全部数据发送至所述故障恢复OSD节点;
    所述最小序列标识为:所述主OSD节点保存的所述partition群的IO视图的最小IO视图版本号,以及针对所述最小IO视图版本号对应的IO视图中的partition对应的数据的修改操作的最小编号。
  23. 根据权利要求18所述的系统,其特征在于,所述MDC模块进一步用于在所述故障恢复OSD节点数据恢复执行完毕后更新所述partition群的partition视图之前:
    接收所述主OSD节点发送的请求更新视图的通知,所述请求更新视图的通知中携带所述主OSD节点本地保存的所述partition群的最新partition视图版本号,确定所述主OSD节点本地保存的所述partition群的最新partition视图版本号与所述MDC本地维护的最新partition视图版本号一致后执行所述更新操作。
  24. 根据权利要求18所述的系统,其特征在于,所述更新所述partition群的partition视图具体包括:
    将所述partition视图中的故障恢复OSD节点的partition状态更新为一致,并在确认所述故障恢复OSD节点在故障前是所述partition群的主OSD节点后将所述的故障恢复OSD节点恢复为所述partition群的新的主OSD节点,将所述主OSD节点降为所述partition群的新的备OSD节点。
  25. 根据权利要求18-24所述的任一系统,其特征在于,所述系统包括多个主机,所述MDC模块,IO路由模块和OSD节点分别部署在所述多个主机中的至少一个主机上,所述OSD节点用于管理所述主机上的物理存储资源。
  26. 一种分布式存储复制系统,所述系统包括至少一个元数据控制(MDC)模以及多个IO路由模块,多个存储对象设备(OSD)节点,所述MDC模块用于为所述各OSD节点配置与所述各OSD节点所管理的物理存储资源所对应的至少一个逻辑分区(partition),所述至少一个partition为主partition或备partition或主备partition的任意组合,主partition和所述主partition对应的备partition构成一个partition群,同一个partition群中的主partition和备partition位于不同的OSD节点上,主partition所在的OSD节点为所述主partition所在的partition群的主OSD节点,备partition所在的OSD节点为所述备partition所在的partition群的备OSD节点,根据所述partition生成partition视图,所述partition视图包括一个partition群中的各partition所在的OSD信息;所述IO路由模块用于将接收的IO请求路由至所述OSD节点;所述OSD节点用于根据所述IO请求所对应的数据的存储;其特征在于:
    所述系统包括存储器以及处理器;
    所述的存储器用于存储计算机可读指令,所述指令用于执行所述MDC模块、IO路由模块以及OSD节点的功能;
    所述的处理器用于与所述的存储器连接,读取所述存储器中的指令,根据所述的指令促使所述处理器执行下述操作:
    确定所述系统中某个OSD节点为故障OSD节点,确定所述故障OSD节点上的partition,更新所述故障OSD节点上的partition所在的partition群的partition视图,向所述更新的partition视图中的partition群所在的主OSD节点发送更新通知,以便所述主OSD节点根据所述更新的partition视图处理IO请求所对应的数据的复制。
  27. 一种分布式存储复制系统,所述系统包括至少一个元数据控制(MDC)模以及多个IO路由模块,多个存储对象设备(OSD)节点,所述MDC模块用于为所述各OSD节点配置与所述各OSD节点所管理的物理存储资源所对应的至少一个逻辑分区(partition),所述至少一个partition为主partition或备partition或主备partition的任意组合,主partition和所述主partition对应的备partition构成一个partition群,同一个partition群中的主partition和备partition位于不同的OSD节点上,主partition所在的OSD节点为所述主partition所在的partition群的主OSD节点,备partition所在的OSD节点为所述备partition所在的partition群的备OSD节点,根据所述partition生成partition视图以及IO视图,所述partition视图包括一个partition群中的各partition所在的OSD信息,所述IO视图包括一个partition群的主OSD的标识;所述IO路由模块用于将接收的IO请求路由至所述OSD节点;所述OSD节点用于根据所述IO请求所对应的数据的存储;其特征在于:
    所述系统包括存储器以及处理器;
    所述的存储器用于存储计算机可读指令,所述指令用于执行所述MDC模块、IO路由模块以及OSD节点的功能;
    所述的处理器用于与所述的存储器连接,读取所述存储器中的指令,根据所述的指令促使所述处理器执行下述操作:
    促使所述IO路由模块接收IO请求,所述IO请求包括key,根据所述key确定所述IO请求所对应的数据所属的partition群并确定所述数据所属的partition群的主OSD节点,将所述数据所属的partition群的IO视图的IO视图版本信息加入所述IO请求,并将携带有所述IO视图版本信息的所述IO请求发送至所述确定出的主OSD节点;
    促使所述主OSD节点接收所述IO请求,在根据所述IO视图版本信息确定所述IO请求中的IO视图版本与本地保存的IO视图版本一致后,执行所述IO请求,生成携带有所述IO视图版本信息的复制请求,并将所述的复制请求发送至所述数据所属的partition的备OSD节点;以及
    促使所述备OSD节点接收所述复制请求,并在根据所述IO视图版本信息确定所述复制请求中的IO视图版本与所述备OSD节点本地保存的IO视图版本一致后,执行所述复制请求以使得所述备OSD节点上的所述IO请求所对应的数据与所述主OSD节点上的所述IO请求所对应的数据保持一致。
  28. 一种分布式存储复制系统,所述系统包括至少一个元数据控制(MDC)模以及多个IO路由模块,多个存储对象设备(OSD)节点,所述MDC模块用于为所述各OSD节点配置与所述各OSD节点所管理的物理存储资源所对应的至少一个逻辑分区(partition),所述至少一个partition为主partition或备partition或主备partition的任意组合,主partition和所述主partition对应的备partition构成一个partition群,同一个partition群中的主partition和备partition位于不同的OSD节点上,主partition所在的OSD节点为所述主partition所在的partition群的主OSD节点,备partition所在的OSD节点为所述备partition所在的partition群的备OSD节点,根据所述partition生成partition视图以及IO视图,所述 partition视图包括一个partition群中的各partition所在的OSD信息,所述IO视图包括一个partition群的主OSD的标识;所述IO路由模块用于将接收的IO请求路由至所述OSD节点;所述OSD节点用于根据所述IO请求所对应的数据的存储;其特征在于:
    所述系统包括存储器以及处理器;
    所述的存储器用于存储计算机可读指令,所述指令用于执行所述MDC模块、IO路由模块以及OSD节点的功能;
    所述的处理器用于与所述的存储器连接,读取所述存储器中的指令,根据所述的指令促使所述处理器执行下述操作:
    促使所述OSD节点在自身故障恢复后向所述MDC模块发送查询请求以请求所述OSD节点上的partition所在的partition群的IO视图,所述OSD节点称为故障恢复OSD节点,所述查询请求中携带所述故障恢复OSD节点的OSD标识,接收所述MDC返回的所述IO视图,向所述IO视图中的主OSD发起数据恢复请求以请求恢复所述故障恢复OSD节点在故障期间更新的数据,接收所述主OSD发送的所述故障期间更新的数据,根据所述MDC模块在所述故障恢复OSD节点数据恢复执行完毕后更新的所述partition群的partition视图处理IO请求的复制;
    促使所述MDC模块接收所述故障恢复OSD节点的所述查询请求,根据查询请求中的OSD标识将所述IO视图返回给所述故障恢复OSD节点,并在所述故障恢复OSD节点数据恢复执行完毕后更新所述partition群的partition视图;以及
    促使所述主OSD节点接收所述故障恢复OSD节点的数据恢复请求,将所述故障期间更新的数据发送至所述的故障恢复OSD节点,以及根据所述MDC模块在所述故障恢复OSD节点数据恢复执行完毕后更新的所述partition群的partition视图处理所述IO请求所对应的数据的复制。
  29. 一种应用于分布式存储系统中管理数据存储和复制的方法,所述系统包括至少一个元数据控制(MDC)模块以及多个IO路由模块,多个存储对象设备(OSD)节点,所述MDC模块用于为所述各OSD节点配置与所述各OSD节点所管理的物理存储资源相对应的至少一个逻辑分区(partition),所述至少一个partition为主partition或备partition或主备partition的任意组合,主partition和所述主partition对应的备partition构成一个partition群,同一个partition群中的主partition和备partition位于不同的OSD节点上,主partition所在的OSD节点为所述主partition所在的partition群的主OSD节点,备partition所在的OSD节点为所述备partition所在的partition群的备OSD节点,根据所述partition生成partition视图,所述partition视图包括一个partition群中的各partition所在的OSD信息;所述IO路由模块用于将接收的IO请求路由至所述OSD节点;所述OSD节点用于根据所述IO请求执行IO请求所对应的数据的存储;其特征在于,所述的方法包括:
    确定所述系统中某个OSD节点为故障OSD节点,确定所述故障OSD节点上的partition,更新所述故障OSD节点上的partition所在的partition群的partition视图,向所述更新的partition视图中的主OSD节点发送更新通知;以及
    所述主OSD节点用于接收所述MDC模块发送的所述更新通知后,根据所述更新的partition视图处理IO请求所对应的数据的复制。
  30. 根据权利要求29所述的方法,其特征在于,所述partition视图具体包括一个partition群中各partition所在的OSD的主备身份及对应的partition状态;所述主OSD节点进一步用于根据所述更新的partition视图更新所述主OSD节点本地保存的partition视图;
    所述根据所述更新的partition视图处理IO请求所对应的数据的复制具体包括:
    根据所述更新后本地保存的partition视图将来自所述IO路由模块的IO请求所对应的数据复制到所述更新后本地保存的partition视图中partition状态为一致的备OSD节点上,或者复制到所述更新后本地保存的partition视图中partition状态为一致的备OSD节点上以及所述更新后本地保存的partition视图中partition状态为不一致但正在进行数据恢复的备OSD节点上。
  31. 根据权利要求29所述的方法,其特征在于,所述的方法进一步包括:
    当所述的MDC模块确定所述故障OSD节点上的partition包括主partition时,更新所述主partition所在的partition群的IO视图,将所述更新的IO视图通知给所述更新的partition视图中的备OSD节点;以及
    所述更新的partition视图中的备OSD节点根据所述更新的IO视图更新本地保存的IO视图并根据所述更新后本地保存的IO视图处理所述IO请求所对应的数据的复制。
  32. 根据权利要求31所述的方法,其特征在于,所述更新所述故障OSD节点上的partition所在的partition群的partition视图具体包括:
    当所述故障OSD节点上的partition包括备partition时,将所述备partition所在的partition群的partition视图中所述故障OSD节点的partition状态标记为不一致;以及
    当所述故障OSD节点上的partition包括主partition时,将所述主partition所在的partition群的partition视图中作为主OSD节点的所述故障OSD节点降为新的备OSD节点并将所述新的备OSD节点对应的partition状态标记为不一致,并从所述主partition所在的partition群的partition视图中的原备OSD节点中选择一个partition状态为一致的备OSD节点升为新的主OSD节点。
  33. 根据权利要求32所述的方法,其特征在于,所述方法进一步包括:
    所述MDC模块在所述故障OSD节点故障恢复并完成数据恢复后进一步更新所述更新的partition视图以及所述更新的IO视图,向所述进一步更新的partition视图中的主OSD节点发送更新通知,向所述进一步更新的partition视图中的备OSD节点发送更新通知;
    所述进一步更新的partition视图中的主OSD节点根据所述进一步更新的partition视图处理所述IO请求所对应的数据的复制;以及
    所述进一步更新的partition视图中的备OSD节点根据所述进一步更新的IO视图处理所述IO请求所对应的数据的复制。
  34. 一种应用于分布式存储系统中管理数据存储和复制的方法,所述系统包括至少一个元数据控制(MDC)模块以及多个IO路由模块,多个存储对象设备(OSD)节点,所述MDC模块用于为所述各OSD节点配置与所述各OSD节点所管理的物理存储资源所对应的至少一个逻辑分区(partition),所述至少一个partition为主partition或备partition或主备partition的任意组合,主partition和所述主partition对应的备partition构成一个partition群,同一个partition群中的主partition和备partition位于不同的OSD节点上,主partition所在的OSD节点为所述主partition所在的partition群的主OSD节点,备partition所在的OSD节点为所述备partition所在的partition群的备OSD节点,根据所述partition生成partition视图以及IO视图,所述partition视图包括一个partition群中的各partition所在的OSD信息,所述IO视图包括一个partition群的主OSD的标识;所述IO路由模块用于将接收的IO请求路由至所述OSD节点;所述OSD节点用于根据所述IO请求执行IO数据的存储;其特征在于,所述的方法包括:
    所述IO路由模块用于接收IO请求,所述IO请求包括key,根据所述key确定所述IO请求所对应的数据所属的partition群并确定所述数据所属的partition群的主OSD节点,将所述数据所属的partition群的IO视图的IO视图版本信息加入所述IO请求,并将携带有所述IO视图版本信息的所述IO请求发送至所述确定出的主OSD节点;
    所述主OSD节点用于接收所述IO请求,在根据所述IO视图版本信息确定所述IO请求中的IO视图版本与所述主OSD节点本地保存的IO视图版本一致后,执行所述IO请求,生成携带有所述IO视图版本信息的复制请求,并将所述的复制请求发送至所述数据所属的partition群的备OSD节点;以及
    所述备OSD节点用于接收所述复制请求,并在根据所述IO视图版本信息确定所述复制请求中的IO视图版本与所述备OSD节点本地保存的IO视图版本一致后,执行所述复制请求以使得所述备OSD节点上的所述IO请求所对应的数据与所述主OSD节点上的所述IO请求所对应的数据保持一致。
  35. 根据权利要求34所述的方法,其特征在于,所述partition视图具体包括一个partition群中各partition所在的OSD的主备身份及对应的partition状态,所述方法进一步包括:
    所述MDC模块在所述IO请求处理过程中检测到所述主OSD节点发生故障时将所述数据所属的partition群的partition视图中的所述主OSD节点降为新的备OSD节点,并将所述新的备OSD的partition状态标记为不一致,将所述数据所属的partition群的partition视图中的所述备OSD节点中任一备OSD节点升为新的主OSD节点,将更新后的所述数据所属的partition群的partition视图通知至所述新的主OSD节点,用所述新的主OSD节点更新所述数据所属的partition群的IO视图,将更新后的所述数据所属的partition的IO视图通知至所述IO路由模块;
    所述IO路由模块进一步用于接收所述MDC模块发送的所述更新后的所述partition群的IO视图, 根据所述更新后的所述partition群的IO视图将所述IO请求发送至所述新的主OSD节点;以及
    所述新的主OSD节点用于接收所述IO请求,在执行所述IO请求后生成第二复制请求,将所述第二复制请求发送至所述更新后的所述数据所属的partition群的partition视图中的partition状态为一致的备OSD节点。
  36. 根据权利要求34所述的方法,其特征在于,所述partition视图具体包括一个partition群中各partition所在的OSD的主备身份及相应的partition状态,所述方法进一步包括:
    所述MDC模块在所述IO请求处理过程中检测到所述备OSD节点中任一备OSD节点发生故障时,将所述数据所属的partition群的partition视图中的所述任一备OSD的partition状态标记为不一致,将更新后的所述数据所属的partition群的partition视图通知至所述主OSD节点;以及
    所述主OSD节点用于接收到所述更新后的所述数据所属的partition群的partition视图后,将所述复制请求发送至所述更新后的partition视图中的partition状态为一致的备OSD节点,不将所述复制请求发送至partition状态为不一致的备OSD节点。
  37. 一种应用于分布式存储系统中管理数据存储和复制的方法,所述系统包括至少一个元数据控制(MDC)模块以及多个IO路由模块,多个存储对象设备(OSD)节点,所述MDC模块用于为所述各OSD节点配置与所述各OSD节点所管理的物理存储资源所对应的至少一个逻辑分区(partition),所述至少一个partition为主partition或备partition或主备partition的任意组合,主partition和所述主partition对应的备partition构成一个partition群,同一个partition群中的主partition和备partition位于不同的OSD节点上,主partition所在的OSD节点为所述主partition所在的partition群的主OSD节点,备partition所在的OSD节点为所述备partition所在的partition群的备OSD节点,根据所述partition生成partition视图以及IO视图,所述partition视图包括一个partition群中的各partition所在的OSD信息,所述IO视图包括一个partition群的主OSD的标识;所述IO路由模块用于对接收的IO请求进行路由转发至所述OSD节点;所述OSD节点用于执行所述IO请求所对应的数据的存储;其特征在于,所述方法包括:
    所述OSD节点用于在自身故障恢复后向所述MDC模块发送查询请求以请求所述OSD节点上的partition所在的partition群的IO视图,所述OSD节点称为故障恢复OSD节点,所述查询请求中携带所述故障恢复OSD节点的OSD标识,接收所述MDC返回的所述IO视图,向所述IO视图中的主OSD发起数据恢复请求以请求恢复所述故障恢复OSD节点在故障期间更新的数据,接收所述主OSD发送的所述故障期间更新的数据,根据所述MDC模块在所述故障恢复OSD节点数据恢复执行完毕后更新的所述partition群的partition视图处理IO请求的复制;
    所述MDC模块用于接收所述故障恢复OSD节点的所述查询请求,根据查询请求中的OSD标识将所述IO视图返回给所述故障恢复OSD节点,并在所述故障恢复OSD节点数据恢复执行完毕后更新所述partition群的partition视图;以及
    所述主OSD节点用于接收所述故障恢复OSD节点的数据恢复请求,将所述故障期间更新的数据发送至所述的故障恢复OSD节点,以及根据所述MDC模块在所述故障恢复OSD节点数据恢复执行完毕后更新的所述partition群的partition视图处理所述IO请求所对应的数据的复制。
PCT/CN2014/090445 2014-11-06 2014-11-06 一种分布式存储复制系统和方法 WO2016070375A1 (zh)

Priority Applications (7)

Application Number Priority Date Filing Date Title
BR112016030547-7A BR112016030547B1 (pt) 2014-11-06 2014-11-06 Sistema e método de replicação e de armazenamento distribuído
JP2017539482A JP6382454B2 (ja) 2014-11-06 2014-11-06 分散ストレージ及びレプリケーションシステム、並びに方法
CN201480040590.9A CN106062717B (zh) 2014-11-06 2014-11-06 一种分布式存储复制系统和方法
EP14905449.6A EP3159794B1 (en) 2014-11-06 2014-11-06 Distributed storage replication system and method
PCT/CN2014/090445 WO2016070375A1 (zh) 2014-11-06 2014-11-06 一种分布式存储复制系统和方法
SG11201703220SA SG11201703220SA (en) 2014-11-06 2014-11-06 Distributed storage and replication system and method
US15/589,856 US10713134B2 (en) 2014-11-06 2017-05-08 Distributed storage and replication system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2014/090445 WO2016070375A1 (zh) 2014-11-06 2014-11-06 一种分布式存储复制系统和方法

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/589,856 Continuation US10713134B2 (en) 2014-11-06 2017-05-08 Distributed storage and replication system and method

Publications (1)

Publication Number Publication Date
WO2016070375A1 true WO2016070375A1 (zh) 2016-05-12

Family

ID=55908392

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/090445 WO2016070375A1 (zh) 2014-11-06 2014-11-06 一种分布式存储复制系统和方法

Country Status (7)

Country Link
US (1) US10713134B2 (zh)
EP (1) EP3159794B1 (zh)
JP (1) JP6382454B2 (zh)
CN (1) CN106062717B (zh)
BR (1) BR112016030547B1 (zh)
SG (1) SG11201703220SA (zh)
WO (1) WO2016070375A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107046575A (zh) * 2017-04-18 2017-08-15 南京卓盛云信息科技有限公司 一种云存储系统及其高密度存储方法
JP6279780B1 (ja) * 2017-02-20 2018-02-14 株式会社東芝 分散ストレージの非同期リモートレプリケーションシステムおよび分散ストレージの非同期リモートレプリケーション方法
CN109558437A (zh) * 2018-11-16 2019-04-02 新华三技术有限公司成都分公司 主osd调整方法及装置
WO2019148841A1 (zh) * 2018-01-31 2019-08-08 华为技术有限公司 一种分布式存储系统、数据处理方法和存储节点

Families Citing this family (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10275184B2 (en) 2014-07-22 2019-04-30 Oracle International Corporation Framework for volatile memory query execution in a multi node cluster
US9875259B2 (en) 2014-07-22 2018-01-23 Oracle International Corporation Distribution of an object in volatile memory across a multi-node cluster
US10002148B2 (en) * 2014-07-22 2018-06-19 Oracle International Corporation Memory-aware joins based in a database cluster
US10025822B2 (en) 2015-05-29 2018-07-17 Oracle International Corporation Optimizing execution plans for in-memory-aware joins
US10567500B1 (en) * 2015-12-21 2020-02-18 Amazon Technologies, Inc. Continuous backup of data in a distributed data store
JP2018073231A (ja) * 2016-11-01 2018-05-10 富士通株式会社 ストレージシステムおよびストレージ装置
JP6724252B2 (ja) * 2017-04-14 2020-07-15 華為技術有限公司Huawei Technologies Co.,Ltd. データ処理方法、記憶システムおよび切り換え装置
CN107678918B (zh) * 2017-09-26 2021-06-29 郑州云海信息技术有限公司 一种分布式文件系统的osd心跳机制设置方法及装置
CN107832164A (zh) * 2017-11-20 2018-03-23 郑州云海信息技术有限公司 一种基于Ceph的故障硬盘处理的方法及装置
CN108235751B (zh) 2017-12-18 2020-04-14 华为技术有限公司 识别对象存储设备亚健康的方法、装置和数据存储系统
CN109995813B (zh) * 2017-12-29 2021-02-26 华为技术有限公司 一种分区扩展方法、数据存储方法及装置
CN110515535B (zh) * 2018-05-22 2021-01-01 杭州海康威视数字技术股份有限公司 硬盘读写控制方法、装置、电子设备及存储介质
CN108845772B (zh) * 2018-07-11 2021-06-29 郑州云海信息技术有限公司 一种硬盘故障处理方法、系统、设备及计算机存储介质
CN110874382B (zh) * 2018-08-29 2023-07-04 阿里云计算有限公司 一种数据写入方法、装置及其设备
CN109144788B (zh) * 2018-09-10 2021-10-22 网宿科技股份有限公司 一种重建osd的方法、装置及系统
CN109144789B (zh) * 2018-09-10 2020-12-29 网宿科技股份有限公司 一种重启osd的方法、装置及系统
CN109189738A (zh) * 2018-09-18 2019-01-11 郑州云海信息技术有限公司 一种分布式文件系统中主osd的选取方法、装置及系统
CN111104057B (zh) * 2018-10-25 2022-03-29 华为技术有限公司 存储系统中的节点扩容方法和存储系统
CN111435331B (zh) * 2019-01-14 2022-08-26 杭州宏杉科技股份有限公司 存储卷写数据方法、装置、电子设备及机器可读存储介质
US11514024B2 (en) 2019-01-31 2022-11-29 Rubrik, Inc. Systems and methods for shard consistency in a clustered database
US10997130B2 (en) * 2019-01-31 2021-05-04 Rubrik, Inc. Systems and methods for node consistency in a clustered database
US11016952B2 (en) 2019-01-31 2021-05-25 Rubrik, Inc. Systems and methods to process a topology change in a clustered database
CN111510338B (zh) * 2020-03-09 2022-04-26 苏州浪潮智能科技有限公司 一种分布式块存储网络亚健康测试方法、装置及存储介质
US11223681B2 (en) * 2020-04-10 2022-01-11 Netapp, Inc. Updating no sync technique for ensuring continuous storage service in event of degraded cluster state
CN112596935B (zh) * 2020-11-16 2022-08-30 新华三大数据技术有限公司 一种osd故障处理方法及装置
CN112819592B (zh) * 2021-04-16 2021-08-03 深圳华锐金融技术股份有限公司 业务请求处理方法、系统、计算机设备和存储介质
CN113254277B (zh) * 2021-06-15 2021-11-02 云宏信息科技股份有限公司 存储集群osd故障修复方法、存储介质、监视器及存储集群

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1474275A (zh) * 2002-08-06 2004-02-11 中国科学院计算技术研究所 基于虚拟存储的智能网络存储设备的系统
CN101751284A (zh) * 2009-12-25 2010-06-23 北京航空航天大学 一种分布式虚拟机监控器的i/o资源调度方法
CN103051691A (zh) * 2012-12-12 2013-04-17 华为技术有限公司 分区分配方法、装置以及分布式存储系统
CN103810047A (zh) * 2012-11-13 2014-05-21 国际商业机器公司 动态改善逻辑分区的存储器亲和性

Family Cites Families (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5790788A (en) * 1996-07-23 1998-08-04 International Business Machines Corporation Managing group events by a name server for a group of processors in a distributed computing environment
JP3563907B2 (ja) * 1997-01-30 2004-09-08 富士通株式会社 並列計算機
US6308300B1 (en) * 1999-06-04 2001-10-23 Rutgers University Test generation for analog circuits using partitioning and inverted system simulation
US7395265B2 (en) 2004-08-27 2008-07-01 Hitachi, Ltd. Data processing system and storage subsystem provided in data processing system
JP4519573B2 (ja) * 2004-08-27 2010-08-04 株式会社日立製作所 データ処理システム及び方法
US20060182050A1 (en) * 2005-01-28 2006-08-17 Hewlett-Packard Development Company, L.P. Storage replication system with data tracking
US7917469B2 (en) * 2006-11-08 2011-03-29 Hitachi Data Systems Corporation Fast primary cluster recovery
US8533155B2 (en) 2009-10-30 2013-09-10 Hitachi Data Systems Corporation Fixed content storage within a partitioned content platform, with replication
CN101796514B (zh) * 2008-10-07 2012-04-18 华中科技大学 对象存储系统的管理方法
US8644188B1 (en) * 2009-06-25 2014-02-04 Amazon Technologies, Inc. Providing virtual networking functionality for managed computer networks
US8074107B2 (en) * 2009-10-26 2011-12-06 Amazon Technologies, Inc. Failover and recovery for replicated data instances
EP2534569B1 (en) 2010-02-09 2015-12-30 Google, Inc. System and method for managing replicas of objects in a distributed storage system
US9323775B2 (en) 2010-06-19 2016-04-26 Mapr Technologies, Inc. Map-reduce ready distributed file system
CN102025550A (zh) * 2010-12-20 2011-04-20 中兴通讯股份有限公司 一种分布式集群中数据管理的系统和方法
US9805108B2 (en) * 2010-12-23 2017-10-31 Mongodb, Inc. Large distributed database clustering systems and methods
US8572031B2 (en) * 2010-12-23 2013-10-29 Mongodb, Inc. Method and apparatus for maintaining replica sets
US8583773B2 (en) * 2011-01-11 2013-11-12 International Business Machines Corporation Autonomous primary node election within a virtual input/output server cluster
US8615676B2 (en) * 2011-03-24 2013-12-24 International Business Machines Corporation Providing first field data capture in a virtual input/output server (VIOS) cluster environment with cluster-aware vioses
US8713282B1 (en) * 2011-03-31 2014-04-29 Emc Corporation Large scale data storage system with fault tolerance
JP2012221419A (ja) 2011-04-13 2012-11-12 Hitachi Ltd 情報記憶システム及びそのデータ複製方法
US8924974B1 (en) * 2011-06-08 2014-12-30 Workday, Inc. System for error checking of process definitions for batch processes
US20130029024A1 (en) 2011-07-25 2013-01-31 David Warren Barbeque stove
CN102355369B (zh) * 2011-09-27 2014-01-08 华为技术有限公司 虚拟化集群系统及其处理方法和设备
WO2013117002A1 (zh) * 2012-02-09 2013-08-15 华为技术有限公司 一种数据重建方法、装置和系统
CN102571452B (zh) * 2012-02-20 2015-04-08 华为技术有限公司 多节点管理的方法和系统
CN103294675B (zh) 2012-02-23 2018-08-03 上海盛大网络发展有限公司 一种分布式存储系统中的数据更新方法及装置
CN102724057B (zh) * 2012-02-23 2017-03-08 北京市计算中心 一种面向云计算平台的分布式层次化自主管理方法
US10282228B2 (en) * 2014-06-26 2019-05-07 Amazon Technologies, Inc. Log-based transaction constraint management

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1474275A (zh) * 2002-08-06 2004-02-11 中国科学院计算技术研究所 基于虚拟存储的智能网络存储设备的系统
CN101751284A (zh) * 2009-12-25 2010-06-23 北京航空航天大学 一种分布式虚拟机监控器的i/o资源调度方法
CN103810047A (zh) * 2012-11-13 2014-05-21 国际商业机器公司 动态改善逻辑分区的存储器亲和性
CN103051691A (zh) * 2012-12-12 2013-04-17 华为技术有限公司 分区分配方法、装置以及分布式存储系统

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3159794A4 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6279780B1 (ja) * 2017-02-20 2018-02-14 株式会社東芝 分散ストレージの非同期リモートレプリケーションシステムおよび分散ストレージの非同期リモートレプリケーション方法
CN107046575A (zh) * 2017-04-18 2017-08-15 南京卓盛云信息科技有限公司 一种云存储系统及其高密度存储方法
CN107046575B (zh) * 2017-04-18 2019-07-12 南京卓盛云信息科技有限公司 一种用于云存储系统的高密度存储方法
WO2019148841A1 (zh) * 2018-01-31 2019-08-08 华为技术有限公司 一种分布式存储系统、数据处理方法和存储节点
US11262916B2 (en) 2018-01-31 2022-03-01 Huawei Technologies Co., Ltd. Distributed storage system, data processing method, and storage node
CN109558437A (zh) * 2018-11-16 2019-04-02 新华三技术有限公司成都分公司 主osd调整方法及装置
CN109558437B (zh) * 2018-11-16 2021-01-01 新华三技术有限公司成都分公司 主osd调整方法及装置

Also Published As

Publication number Publication date
CN106062717A (zh) 2016-10-26
JP2017534133A (ja) 2017-11-16
BR112016030547B1 (pt) 2022-11-16
EP3159794A1 (en) 2017-04-26
US10713134B2 (en) 2020-07-14
US20170242767A1 (en) 2017-08-24
EP3159794B1 (en) 2019-01-16
JP6382454B2 (ja) 2018-08-29
EP3159794A4 (en) 2017-10-25
SG11201703220SA (en) 2017-05-30
BR112016030547A8 (pt) 2022-07-12
BR112016030547A2 (zh) 2017-05-22
CN106062717B (zh) 2019-05-03

Similar Documents

Publication Publication Date Title
WO2016070375A1 (zh) 一种分布式存储复制系统和方法
US11360854B2 (en) Storage cluster configuration change method, storage cluster, and computer system
US11853263B2 (en) Geographically-distributed file system using coordinated namespace replication over a wide area network
US9934242B2 (en) Replication of data between mirrored data sites
US9495381B2 (en) Geographically-distributed file system using coordinated namespace replication over a wide area network
JP6491210B2 (ja) 分散データグリッドにおいて永続性パーティションリカバリをサポートするためのシステムおよび方法
WO2012071920A1 (zh) 分布式内存数据库的实现方法、系统、令牌控制器及内存数据库
US9659078B2 (en) System and method for supporting failover during synchronization between clusters in a distributed data grid
JP2012528382A (ja) キャッシュクラスタを構成可能モードで用いるキャッシュデータ処理
WO2014205847A1 (zh) 一种分区平衡子任务下发方法、装置与系统
CN113010496B (zh) 一种数据迁移方法、装置、设备和存储介质
WO2015196692A1 (zh) 一种云计算系统以及云计算系统的处理方法和装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14905449

Country of ref document: EP

Kind code of ref document: A1

REG Reference to national code

Ref country code: BR

Ref legal event code: B01A

Ref document number: 112016030547

Country of ref document: BR

REEP Request for entry into the european phase

Ref document number: 2014905449

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 2014905449

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2017539482

Country of ref document: JP

Kind code of ref document: A

WWE Wipo information: entry into national phase

Ref document number: 11201703220S

Country of ref document: SG

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 112016030547

Country of ref document: BR

Kind code of ref document: A2

Effective date: 20161226