WO2019085875A1 - Configuration modification method for storage cluster, storage cluster, and computer system - Google Patents

Configuration modification method for storage cluster, storage cluster, and computer system

Info

Publication number
WO2019085875A1
Authority
WO
WIPO (PCT)
Prior art keywords
storage
cluster
node
configuration
nodes
Prior art date
Application number
PCT/CN2018/112580
Other languages
English (en)
French (fr)
Inventor
周思义
梁锋
智雅楠
黄西华
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to EP18872063.5A (patent EP3694148B1)
Publication of WO2019085875A1
Priority to US16/862,591 (patent US11360854B2)

Classifications

    • G06F 11/142: Reconfiguring to eliminate the error
    • G06F 11/1425: Reconfiguring to eliminate the error by reconfiguration of node membership
    • G06F 11/0709: Error or fault processing in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • G06F 11/0727: Error or fault processing in a storage system, e.g. in a DASD or network-based storage system
    • G06F 11/0751: Error or fault detection not based on redundancy
    • G06F 11/2025: Failover techniques using centralised failover control functionality
    • G06F 11/2028: Failover techniques eliminating a faulty processor or activating a spare
    • G06F 11/2092: Techniques of failing over between control units
    • G06F 11/2094: Redundant storage or storage space
    • G06F 16/1815: Journaling file systems
    • G06F 16/9024: Graphs; Linked lists
    • H04L 41/0668: Network fault recovery by dynamic selection of recovery network elements, e.g. replacement by the most appropriate element after failure
    • H04L 45/02: Topology update or discovery
    • H04L 45/12: Shortest path evaluation
    • H04L 67/1097: Protocols in which an application is distributed across nodes in the network, for distributed storage of data in networks, e.g. NFS, SAN or NAS
    • H04L 67/34: Network arrangements or protocols involving the movement of software or configuration parameters
    • H04L 69/40: Arrangements for recovering from a failure of a protocol instance or entity, e.g. service redundancy protocols, protocol state redundancy or protocol service redirection

Definitions

  • the present application relates to a distributed storage cluster technology, and in particular, to a cluster configuration modification technology after a storage cluster fails.
  • a plurality of distributed storage nodes are deployed in the storage cluster, and the multiple storage nodes can adopt a consistent replication protocol, such as the Raft protocol, to ensure data consistency.
  • The Raft protocol is a commonly used consistent replication protocol. It stipulates that the latest data must be stored on a majority (more than half) of the storage nodes before the storage cluster can continue to provide the data replication service (or data write service). That is, each time data is updated, the storage master node in the Raft protocol needs to receive responses from a majority of nodes (including the storage master node itself) confirming that the latest data has been saved before the storage cluster can continue to provide the data replication service. When a certain number of storage nodes fail, the Raft protocol requires that the cluster configuration be modified so that the latest data is still stored on a majority of the nodes under the new cluster configuration; the data replication service can continue to be provided only under this premise.
  • The modification of the cluster configuration is itself a data replication process, except that the replicated data (also called a log) is a configuration change command for modifying the cluster configuration. Therefore, a cluster configuration modification must also follow the aforementioned "saved on a majority" rule.
  • When fewer than half of the storage nodes fail, the configuration change command sent by the storage master node can still obtain responses from a majority of the non-faulty storage nodes, so the data replication service can continue to be provided.
  • Specifically, the storage master node sends a configuration change command to all remaining non-faulty storage nodes; the command is saved by these non-faulty storage nodes, which then return a response to the storage master node (returning a response indicates that the save was successful).
  • The storage master node then instructs all the non-faulty storage nodes to modify their own cluster configuration information, so that the storage cluster continues to operate under the new cluster configuration.
  • If the storage master node itself has failed, the Raft protocol requires that a new storage master node first be elected, and the new storage master node then performs the foregoing cluster configuration modification process.
  • FIG. 1 shows an example of a storage cluster.
  • The storage cluster includes two availability zones (AZs).
  • An AZ is generally composed of multiple data centers, each of which can have an independent power supply and an independent network. When a typical fault occurs in one AZ, such as a power fault, a network fault, a software deployment fault, or flooding, other AZs are generally not affected.
  • AZs are typically interconnected by a low-latency network.
  • Two storage nodes (the storage layer) are deployed in each of AZ1 and AZ2, that is, a symmetric deployment.
  • A symmetric two-AZ deployment means that the same number of storage nodes is deployed in each of the two AZs.
  • Two compute nodes (the computation layer) are also deployed in the two AZs.
  • the computing node may include, for example, a structured query language (SQL) read and write node and a SQL read node.
  • The storage nodes include one storage primary node and three storage standby nodes. According to the Raft protocol, at least three of the four storage nodes need to store the latest data. Assume that the latest data has been saved on the storage master node L and the storage standby nodes F1 and F2 in FIG. 1. As shown in FIG. 1, after AZ1 fails (or after any other two storage nodes fail), half of the four storage nodes, that is, two, have failed. Even if a configuration change command is issued, it can at most obtain responses from the two storage nodes in AZ2, which cannot meet the majority-response requirement. Therefore, the cluster configuration change fails, and even though the storage standby node F2 holding the latest data still exists, the storage cluster cannot continue to provide the data replication service.
  • SQL: structured query language.
  • Node: a computer with an independent operating system, which can be an ordinary virtual machine, a lightweight virtual machine, or a physical computer.
  • Persistent storage: storage in which information is not lost after the computer is powered off or restarted.
  • Typical persistent storage includes file storage and database storage.
  • Majority: more than half of the members of a set of size N, that is, M members form a majority when M > N/2.
  • Half: exactly N/2 members. When N is odd, N/2 is not an integer, while the number of failed nodes is always an integer, so the "exactly half of the nodes failed" case can only arise when N is even.
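For illustration only, here is a minimal Go sketch of these two notions (the function names are hypothetical, not part of the application):

```go
package main

import "fmt"

// majority returns the smallest number of nodes that forms a majority of a
// cluster with n members, i.e. more than half of n.
func majority(n int) int {
	return n/2 + 1
}

// isExactlyHalf reports whether `failed` is exactly half of n; because failed
// is an integer, this can only be true when n is even.
func isExactlyHalf(failed, n int) bool {
	return 2*failed == n
}

func main() {
	fmt.Println(majority(4))         // 3
	fmt.Println(isExactlyHalf(2, 4)) // true: the case this application targets
	fmt.Println(isExactlyHalf(2, 5)) // false: with odd n, "exactly half" cannot occur
}
```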
  • Log: an operation command for any type of database data operation, file operation, status operation, and so on, such as a data read command, a data write command, or a configuration change command.
  • Logs are kept in order by monotonically increasing index numbers.
  • A database executes the logs in the order of their index numbers, which reconstructs the actual content of the data.
  • "Committed" means that the storage cluster has acknowledged that a log (or command) has been successfully saved on a majority of storage nodes.
  • A "committed log" is a log that has been successfully saved on a majority of storage nodes. Uncommitted logs are not executed by any storage node.
  • "Information for indicating A" in this application refers to information from which A can be obtained; the information may be A itself, or A may be obtained indirectly through the information.
  • "A/B" or "A and/or B" in this application covers three possibilities: A alone, B alone, and both A and B.
  • The application provides a configuration change method for a storage cluster, a corresponding apparatus, and a storage cluster; the method is applied to a storage cluster adopting a consistent replication protocol.
  • In the prior art, when half of the storage nodes of a storage cluster fail, the storage cluster can no longer provide the data replication service. The method provided by the present application enables the storage cluster to continue providing the data replication service under certain conditions even when half of the storage nodes fail, thereby improving the availability of the storage cluster.
  • The application provides a storage cluster, where the storage cluster includes a plurality of storage nodes: one storage primary node and a plurality of storage standby nodes. The storage nodes of the storage cluster adopt a consistent replication protocol, for example Raft, to maintain data consistency.
  • the storage cluster includes an arbitration module and a configuration library.
  • the configuration library is configured to store configuration data of the storage cluster, the configuration data includes cluster configuration information, and the cluster configuration information includes information of all non-faulty storage nodes in the storage cluster, such as an ID of a storage node.
  • The arbitration module is configured to: after the storage cluster fails, if condition A and condition B are both satisfied, modify the cluster configuration information stored in the configuration library, and send a forced cluster configuration change instruction to the non-faulty storage nodes, where the forced cluster configuration change instruction is used to instruct the non-faulty storage nodes to modify their local cluster configuration information.
  • Condition A: the number of failed storage nodes is half of the number of all storage nodes in the storage cluster before the failure occurred.
  • Condition B: after the failure, at least one storage node exists in the storage cluster whose latest log index number is greater than or equal to the index number of the committed log provided by the storage cluster to the client.
  • the index number represents the capability of the data replication service and is a quality of service that the storage cluster promises to the client.
  • the fault here usually refers to a fault that the storage node is not available, or other faults that affect the cluster configuration.
  • The "local" storage of a storage node generally refers to storage such as memory and/or disk inside the storage node.
  • If the storage node is a physical computer, "local" usually refers to memory and/or disk on that physical computer;
  • if the storage node is a virtual machine, "local" usually refers to storage located inside the physical host that the virtual machine can access, such as memory and/or disk.
  • Determining condition A does not necessarily require obtaining the two values explicitly (the number of failed storage nodes and the number of all storage nodes in the storage cluster before the failure); it can also be determined in simpler ways based on the actual situation. For example, in a storage cluster with two AZs in which storage nodes are symmetrically deployed, when an AZ-level fault occurs it can be directly determined that half of the nodes have failed. The same applies to condition B: there are various specific ways to judge whether it is satisfied, and this application does not limit them.
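As a non-authoritative illustration, the following Go sketch checks condition A and condition B from the counts and index numbers described above; the type and field names (ClusterState, SurvivorLatestIdx, and so on) are hypothetical:

```go
package main

import "fmt"

// ClusterState is a hypothetical snapshot used only for illustration.
type ClusterState struct {
	TotalBeforeFailure int   // storage nodes in the cluster before the failure
	FailedNodes        int   // failed storage nodes
	SurvivorLatestIdx  []int // latest log index number of each non-faulty node
	CommittedIdx       int   // committed log index number promised to clients
}

// conditionA: exactly half of the pre-failure storage nodes have failed.
func conditionA(s ClusterState) bool {
	return 2*s.FailedNodes == s.TotalBeforeFailure
}

// conditionB: at least one surviving node's latest log index number is greater
// than or equal to the committed log index number promised to clients.
func conditionB(s ClusterState) bool {
	for _, idx := range s.SurvivorLatestIdx {
		if idx >= s.CommittedIdx {
			return true
		}
	}
	return false
}

func main() {
	s := ClusterState{TotalBeforeFailure: 4, FailedNodes: 2,
		SurvivorLatestIdx: []int{5, 4}, CommittedIdx: 5}
	fmt.Println(conditionA(s) && conditionB(s)) // true: forced downgrade allowed
}
```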
  • The arbitration module can be deployed on a storage node or deployed elsewhere, and the same applies to the configuration library; this application does not limit the deployment.
  • The arbitration module is further configured to: after the storage cluster fails and the forced cluster configuration change is performed, reselect a candidate storage primary node from the non-faulty storage nodes, and send an election request to the candidate storage primary node, where the election request is used to instruct the candidate storage primary node to initiate an election process that conforms to the consistent replication protocol.
  • The arbitration module is configured to: select the storage node with the largest latest log index number among all the non-faulty storage nodes as the candidate storage primary node.
  • the configuration data further includes network topology information of a network to which the storage cluster belongs.
  • The arbitration module is configured to: construct, according to the network topology information, a directed weighted graph from the client node or proxy node to each non-faulty storage node, where the weight of an edge between two nodes is determined by the network communication rate or load (or by both the rate and the load) between those nodes; calculate the shortest path in the directed weighted graph; and determine the storage node located on the shortest path to be the candidate storage primary node.
  • The network communication rate or load between nodes may be stored on the node to which the arbitration module belongs, or may be stored by other nodes and acquired by the arbitration module.
  • the manner of obtaining the communication rate or the load is not limited in this application.
  • The network topology information primarily includes information indicating in which AZ and/or on which computer each storage (or SQL) node is located.
  • The network topology information and the configuration information may be represented as separate data or combined into one piece of data. For example, the configuration data may include multiple entries of the form "cluster ID, AZ ID, computer name, ID of a (non-faulty) storage node (or SQL node)", which serve as both the configuration information and the network topology information.
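The following Go sketch illustrates this selection strategy under simplified assumptions: the network is modelled as a hypothetical adjacency map whose edge weights stand in for the measured communication rate or load, and a plain Dijkstra search finds the non-faulty storage node closest to the client or proxy node. It is an illustration of the idea, not the exact procedure claimed by the application:

```go
package main

import (
	"fmt"
	"math"
)

// Graph maps a node to its outgoing edges; a larger weight is assumed to mean
// a slower or more heavily loaded link.
type Graph map[string]map[string]float64

// shortestDistances runs a plain Dijkstra search over the directed weighted graph.
func shortestDistances(g Graph, source string) map[string]float64 {
	dist := map[string]float64{source: 0}
	visited := map[string]bool{}
	for {
		// pick the unvisited node with the smallest known distance
		u, best := "", math.Inf(1)
		for n, d := range dist {
			if !visited[n] && d < best {
				u, best = n, d
			}
		}
		if u == "" {
			return dist
		}
		visited[u] = true
		for v, w := range g[u] {
			if nd := best + w; nd < distOr(dist, v) {
				dist[v] = nd
			}
		}
	}
}

func distOr(dist map[string]float64, n string) float64 {
	if d, ok := dist[n]; ok {
		return d
	}
	return math.Inf(1)
}

// pickCandidate returns the non-faulty storage node closest to the client/proxy node.
func pickCandidate(g Graph, source string, survivors []string) string {
	dist := shortestDistances(g, source)
	best, bestD := "", math.Inf(1)
	for _, s := range survivors {
		if d := distOr(dist, s); d < bestD {
			best, bestD = s, d
		}
	}
	return best
}

func main() {
	g := Graph{"sqlproxy": {"az2-node230": 3, "az2-node240": 5}} // hypothetical weights
	fmt.Println(pickCandidate(g, "sqlproxy", []string{"az2-node230", "az2-node240"}))
}
```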
  • The reselected storage master node is configured to: send a write request for a cluster configuration change log to the current storage standby nodes (that is, the non-faulty storage standby nodes) in the storage cluster, where the cluster configuration change log includes information about the storage nodes removed due to faults in the cluster configuration change.
  • the "reselected storage master node" is the new storage master node selected by the storage cluster according to the selection request and the consistency replication protocol mentioned in the previous embodiment after the failure occurs, and the other is not The faulty storage node is the storage standby node.
  • In this way, the logs that record cluster configuration changes are saved in the storage cluster, so that the storage cluster can later be restored according to these logs. This step is not required if subsequent recovery of the storage cluster is not a concern.
  • Alternatively, the write request for the cluster configuration change log may be sent by the arbitration module to all non-faulty storage nodes, instructing all the non-faulty storage nodes to save the cluster configuration change log locally.
  • The arbitration module is configured to: obtain the latest log index numbers of all non-faulty storage nodes after the current fault, and determine that condition B is satisfied if the maximum of these index numbers is greater than or equal to the maximum latest log index number recorded among all non-faulty storage nodes after the previous fault; and/or,
  • the arbitration module is configured to determine that the condition B is satisfied if the current failure is the first failure of the cluster.
  • The configuration library is a distributed configuration library deployed across the plurality of storage nodes and one additional node.
  • the present application provides a method for modifying a configuration of a storage cluster, which may be applied to a storage cluster adopting a consistent replication protocol, where the storage cluster includes a plurality of storage nodes and a configuration library.
  • The method includes: after the storage cluster fails, determining whether condition A and condition B are both satisfied, and if so, modifying the cluster configuration information stored in the configuration library and sending a forced cluster configuration change instruction to the non-faulty storage nodes, where:
  • Condition A: the number of failed storage nodes is half of the number of all storage nodes in the storage cluster before the failure occurred;
  • Condition B: after the failure, at least one storage node exists in the storage cluster whose latest log index number is greater than or equal to the index number of the committed log provided by the storage cluster to the client.
  • the configuration database stores configuration data of the storage cluster, where the configuration data includes cluster configuration information, and the cluster configuration information includes information of all non-faulty storage nodes in the storage cluster.
  • the method may be implemented by an arbitration module, which may be deployed on a storage node, or deployed on other types of nodes, or may be independently deployed as a separate node.
  • The configuration library described above may be a distributed configuration library deployed across multiple storage nodes. Further, the configuration library can also be deployed on a lightweight node that is independent of the storage nodes.
  • After the forced configuration change, the storage cluster can select a new storage primary node; further, after the new storage primary node is selected, the arbitration module or the new storage primary node can send a write request for a cluster configuration change log to the other storage nodes. The cluster configuration change log records information about the cluster configuration change for use in subsequent cluster recovery operations.
  • The present application provides an apparatus for modifying a configuration of a storage cluster, the apparatus comprising one or more modules, and the module or modules are used to implement any of the methods provided by the foregoing aspects.
  • The present application provides a computer readable storage medium including computer readable instructions that, when read and executed by a processor, implement the method provided by any embodiment of the present application.
  • the present application provides a computer program product, comprising computer readable instructions, which are read by a processor and executed to implement a method provided by any embodiment of the present application.
  • The present application provides a computer system including a processor and a memory, where the memory is configured to store computer instructions, and the processor reads the computer instructions to implement the method provided by any embodiment of the present application.
  • FIG. 1 is a schematic diagram of a storage cluster fault;
  • FIG. 2 is a schematic diagram of the logical composition of a storage cluster;
  • FIG. 3 is a schematic flowchart of a cluster configuration change method;
  • FIG. 5 is a schematic flowchart of another cluster configuration change method;
  • FIG. 6 is a schematic deployment diagram of a distributed database;
  • FIG. 7 is a schematic interaction diagram of a cluster configuration change method based on FIG. 6;
  • FIG. 9 is a schematic interaction diagram of reselecting a storage master node based on FIG. 6 and FIG. 7;
  • FIG. 10 is an example of a directed weighted graph used when selecting a candidate storage master node.
  • FIG. 2 is a schematic diagram of a storage cluster provided by this embodiment.
  • The figure exemplarily shows four storage nodes in a storage cluster, including one storage primary node L and three storage standby nodes F1-F3, each of which contains a storage module for providing data replication and data storage.
  • a configuration library 100 and an arbitration module are also included in the storage cluster.
  • the arbitration module may include a primary arbitration module 200 and one or more standby arbitration modules 200a. The standby arbitration module 200a does not operate if the primary arbitration module 200 is available.
  • the storage master node L and the storage standby nodes F1-Fn are nodes of a storage cluster adopting a coherence protocol (for example, Raft). Multiple copies of the same data (hereinafter referred to as data copies) are maintained by a plurality of storage nodes through a coherence protocol.
  • the configuration repository 100 is a configuration repository that can still provide cluster configuration information reading and storage services when the storage node fails.
  • the configuration repository 100 stores configuration data of the storage cluster, and the configuration data includes cluster configuration information.
  • The cluster configuration information may include the ID of each storage node in the storage cluster, and may further include a mapping from each storage node's ID to its data synchronization progress.
  • the configuration data may further include status information and network topology information during the configuration change process.
  • Multiple storage clusters can share the same configuration repository.
  • the configuration library 100 may specifically be a distributed configuration library, and the configuration library may be deployed on one or more nodes, and some or all of the one or more nodes may be storage nodes.
  • The arbitration modules (200 and 200a) play a stateless arbitration role. When storage nodes fail, the configuration change process of the storage cluster is completed by the member change module (210 or 210a) in the arbitration module. "Stateless" means that the arbitration module itself does not persist the storage cluster configuration information, the state information during configuration changes, the network topology information, and so on (this information is persistently stored in the configuration repository 100).
  • The arbitration modules (200 and 200a) receive externally input instructions (or information) and, according to their programmed logic, access the configuration library to obtain the information required to make an arbitration decision, or coordinate other nodes to make an arbitration decision. In normal operation, only the master arbitration module 200 works. After the master arbitration module 200 fails, the standby arbitration module 200a can take over its work. This takeover involves an active/standby switchover, but because the module is stateless, the switchover is simple and fast.
  • the arbitration module can be deployed on a separate arbitration node or on a storage node.
  • FIG. 3 is a schematic flowchart of a storage cluster configuration change method according to an embodiment of the present disclosure.
  • the method is performed by the member change module 210 if the primary arbitration module 200 is not faulty.
  • The member change module 210 first determines whether the number of failed storage nodes is less than half of the total, that is, less than (n+1)/2 (S101). If so, the consistent cluster configuration change flow of the prior art is performed (S106); if not, it is determined whether the configuration degradation condition is satisfied (S102).
  • The consistent cluster configuration change flow is described in the background section; it may involve reselecting the storage master node.
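To make the S101/S102/S105/S106 branching concrete, here is a hedged Go sketch of the member change module's decision logic; the type, the function names, and the stubbed steps are hypothetical:

```go
package main

import "errors"

// memberChange is a hypothetical stand-in for the member change module's view
// of the cluster when a fault is reported.
type memberChange struct {
	totalBefore int // storage nodes before the fault
	failed      int // faulty storage nodes reported in the fault information
}

// handleFault mirrors the branching in FIG. 3; conditionsAB stands for the
// combined check of configuration degradation conditions A and B.
func (m memberChange) handleFault(conditionsAB bool) error {
	switch {
	case 2*m.failed < m.totalBefore: // S101: fewer than half failed
		return m.normalConfigChange() // S106: ordinary consistent change
	case conditionsAB: // S102: exactly half failed and conditions A and B hold
		return m.forcedDowngrade() // S103 and S104
	default:
		return errors.New("cluster configuration cannot be modified") // S105
	}
}

func (m memberChange) normalConfigChange() error { return nil } // stub
func (m memberChange) forcedDowngrade() error    { return nil } // stub

func main() {
	_ = memberChange{totalBefore: 4, failed: 2}.handleFault(true)
}
```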
  • the fault information can be entered into the configuration repository 100 by an external device, such as a cluster monitoring device, or an administrator.
  • Alternatively, the master arbitration module 200 may itself monitor the fault information (such as the IDs of the faulty nodes or the IDs of the nodes that survived the fault), or a cluster monitoring device or an administrator may monitor the fault information and send it to the master arbitration module 200 through a management client.
  • the member change module 210 can compare the received fault information with the original cluster configuration information to determine whether the number of faulty nodes is less than half.
  • The fault information itself may carry information indicating whether fewer than half of the storage nodes are faulty. For example, in a symmetric two-AZ deployment, if the received fault information indicates that one AZ has failed, it can be determined that the number of faulty storage nodes is not less than half, but exactly half.
  • configuration downgrade can also be understood as a kind of configuration change. Because there is a storage node failure in the storage cluster, the total number of nodes in the storage cluster is reduced, that is, the cluster configuration is reduced. Therefore, this configuration change is referred to as “configuration degradation" in this application.
  • The member change module 210 reads the current cluster configuration information from the configuration library 100 and determines whether the configuration degradation condition is satisfied (S102). If the configuration degradation condition is not satisfied, error or failure information is returned (S105); if it is satisfied, the configuration degradation continues.
  • The configuration degradation conditions are conditions A and B below; both must be satisfied for the configuration to be degraded.
  • Configuration degradation condition A: the number of failed storage nodes is half of the number of all storage nodes before the failure, that is, (n+1)/2. The judgment of this condition can be merged with step S101. The number of failed storage nodes can be obtained from the received fault information, and the number of all storage nodes before the failure can be obtained from the cluster configuration information read from the configuration library 100.
  • Configuration degradation condition B: among the non-faulty storage nodes there is at least one storage node whose latest log index number is greater than or equal to the committed log index number provided by the storage cluster to the user (or client) before the fault. In other words, the storage cluster after the failure must still be able to provide the service it promised to the user before, and must not lower that service level because of the failure.
  • the log represents a series of actual operations such as data operations, file operations, and state operations in the storage node.
  • the log maintains the order by incrementing the index number.
  • the storage node executes the log in the order of the index number of the log, and can construct the actual content of the data.
  • "committed” is a response from the storage cluster that the log has been saved successfully on most storage nodes.
  • the "committed log” has saved a successful log in most storage nodes. Uncommitted logs are not executed by any of the storage nodes.
  • Each consistent data replication process is accompanied by an index number of at least one new committed log.
  • The member change module 210 may obtain the identifier (ID) of each non-faulty storage node from the configuration repository 100, establish a network connection (or use an existing network connection) with each of these non-faulty storage nodes, and then obtain their latest log index numbers.
  • Alternatively, the check of configuration degradation condition B may be initiated by the member change module 210, performed independently by the individual storage nodes, and the results fed back to the member change module.
  • If it is determined that the configuration degradation condition is not satisfied, error or fault information is returned (S105), indicating that the cluster configuration cannot be modified and the current storage cluster cannot continue to provide the consistent data replication service. If it is determined that the configuration degradation condition is satisfied, the cluster configuration information stored in the configuration repository 100 is updated (S103).
  • Then the member change module 210 forces the non-faulty storage nodes to perform the cluster configuration change (S104). Specifically, the member change module 210 sends a forced member change command to all non-faulty storage nodes, where the command includes the IDs of the storage nodes that need to be deleted (that is, the IDs of the failed storage nodes). After receiving the forced member change command, each non-faulty storage node deletes the information of the faulty nodes from the cluster configuration information held in its memory and on its hard disk. The forced member change on each storage node can be performed in parallel.
  • The cluster configuration information in the memory of the storage primary node is used to determine which storage nodes are in the cluster (and their data synchronization progress); the cluster configuration information in the memory of a storage standby node is kept because the standby node may later become the storage primary node.
  • The cluster configuration information on the hard disk is used to reload the cluster configuration information from the hard disk into memory when the storage node restarts.
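A minimal sketch, assuming a simple JSON file for the on-disk copy (the localConfig type, the file path, and the function name are hypothetical), of how a non-faulty storage node might apply the forced member change to its in-memory and on-disk cluster configuration information:

```go
package main

import (
	"encoding/json"
	"os"
)

// localConfig is a hypothetical local copy of the cluster configuration,
// kept both in memory and persisted on disk.
type localConfig struct {
	Nodes []string `json:"nodes"` // IDs of the storage nodes in the cluster
}

// applyForcedMemberChange removes the failed node IDs from the in-memory
// configuration and persists the result, mirroring step S104.
func applyForcedMemberChange(cfg *localConfig, failed []string, path string) error {
	dead := map[string]bool{}
	for _, id := range failed {
		dead[id] = true
	}
	kept := cfg.Nodes[:0]
	for _, id := range cfg.Nodes {
		if !dead[id] {
			kept = append(kept, id)
		}
	}
	cfg.Nodes = kept // update the in-memory copy

	data, err := json.Marshal(cfg)
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o600) // update the on-disk copy
}

func main() {
	cfg := &localConfig{Nodes: []string{"130", "140", "230", "240"}}
	_ = applyForcedMemberChange(cfg, []string{"130", "140"}, "cluster_config.json")
}
```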
  • If the storage primary node of the pre-failure storage cluster is not among the faulty nodes, it can continue to serve as the storage primary node, and the storage cluster continues to provide the consistent data replication service.
  • After the cluster configuration changes, the network topology of the storage cluster may also change, so the storage master node can be reselected; the reselected storage master node may be the same node as before.
  • FIG. 4 is a schematic diagram of a storage cluster according to another embodiment of the present application.
  • The storage cluster in FIG. 4 additionally includes master selection modules 220 and 220a.
  • The master selection module is used to reselect the storage master node after the cluster configuration changes.
  • FIG. 5 is a schematic flowchart of a storage cluster configuration change method according to an embodiment of the present disclosure. Steps S101-S104 are similar to the previous embodiments, and reference may be made to the description of the foregoing embodiments.
  • the selection master module 220 causes the storage cluster to reselect the storage master node. There are many strategies for re-selecting the storage master node, which can be determined according to the needs of the current storage cluster.
  • The process of reselecting the storage master node must also satisfy the requirements of the consistency protocol, so it cannot be completed by the master selection module 220 alone; all non-faulty storage nodes are involved.
  • This application will exemplarily describe a process of reselecting a master node in the following embodiments.
  • A storage standby node whose data lags behind may (but does not have to) synchronize logs from the new storage primary node to bring its data copy up to date.
  • The reselected storage master node may also send a write request for a cluster configuration change log to all the storage standby nodes in the new cluster configuration, where the log content includes the nodes deleted in the cluster configuration change.
  • After a majority of the nodes have saved this log, it is committed. The purpose of committing this log is to keep all operation logs of the storage cluster saved in order, so that during abnormal recovery (or restart) the logs can be executed in order to restore the data and state of the storage cluster.
  • In this way, a storage cluster adopting the consistency protocol can still provide the data replication service after half of its nodes fail, while ensuring the promised service capability, instead of becoming completely unable to provide the data replication service as soon as half of the nodes fail, as in the prior art. This increases the availability of the storage cluster.
  • the above embodiment generally describes the logical composition of the storage cluster provided by the present application and the method flow of the method.
  • the following embodiment will take a distributed database as an example to describe one embodiment of the technical solution provided by the present application.
  • the storage cluster proposed in this application is deployed in the storage layer of a distributed database.
  • FIG. 6 is a schematic diagram of deployment of a distributed database provided by this embodiment.
  • the storage cluster of the distributed database storage tier is symmetrically deployed within two AZs (AZ100 and AZ200).
  • Each AZ contains two storage nodes, a total of four storage nodes, including one storage master node 130 and three storage standby nodes 140, 230 and 240.
  • the distributed database computing layer includes SQL nodes, which are divided into SQL read-write nodes 110 and SQL read-only nodes 120, 210, and 220.
  • the SQL read-write node 110 and the storage master node 130 are deployed in the same AZ 100.
  • the SQL read-only node 120 is deployed in the AZ 100, and the SQL read-only node 210 and the SQL read-only node 220 are deployed in the AZ 200.
  • FIG. 6 also includes a client 400 and an SQL proxy 500.
  • the client 400 is a management client that handles management operations performed by a user or an engineer on this distributed database. Cluster configuration changes are one of the management operations for this database.
  • the client that uses the data service provided by the distributed database can be integrated with the client 400, or can be a separate client. This application is not limited.
  • the SQL proxy 500 is configured to receive the SQL request sent by the client 400, and distribute the request to the SQL read-write node of the computing layer or one of the SQL read-only nodes according to the type of the SQL request and the load condition of each SQL node in the computing layer.
  • the types of SQL requests include read requests and write requests.
  • The SQL read-write node is responsible for translating both SQL read requests and SQL write requests, while SQL read-only nodes can only translate SQL read requests. Translation converts an SQL read or write request into a series of actual database operations, such as data operations, file operations, and state operations. These operations can be represented in the form of logs.
  • the SQL node then sends these logs to the storage nodes of the storage tier.
  • the storage node is mainly used to store database data (including logs, metadata, data itself, etc.), and can execute logs to operate on data, metadata, files, etc., and return the operation result to the SQL node.
  • the storage main module 131 is deployed in the storage main node 130, and the storage standby modules 141, 231, and 241 are respectively disposed on the storage standby nodes 140, 230, and 240.
  • the storage module here is a storage module for providing a data replication service, and provides a function similar to the storage node of the prior art.
  • one arbitration module is also deployed in each of the four storage nodes, wherein the primary arbitration module 132 is deployed in the storage master node 130, and the standby arbitration modules 142, 232, and 242 are respectively deployed in the storage standby nodes 140, 230 and 240 on.
  • The three standby arbitration modules do not work during normal operation; only after the primary arbitration module 132 fails or the storage master node 130 fails does one of the standby arbitration modules take over the work of the original primary arbitration module 132.
  • The primary arbitration module 132 can switch over to any one of the standby arbitration modules; active/standby switchover methods are available in the prior art.
  • The arbitration module does not need to support persistent storage of the cluster configuration information.
  • the arbitration module can be deployed within other nodes that are independent of the storage node.
  • the configuration repository can also be deployed in other nodes.
  • the cluster configuration information is stored in a distributed configuration repository cluster.
  • the configuration repository cluster is distributed in three AZs, and there are a total of five configuration libraries.
  • the five configuration libraries include four configuration libraries 133, 143, 233, and 243 deployed within the storage node, and a configuration library 333 deployed within a lightweight AZ 300 to further increase the availability of the configuration library.
  • the configuration data stored in the configuration repository can have not only cluster configuration information, but also status information and network topology information during configuration change.
  • the various "configuration libraries" in Figure 6 can be understood as a runtime configuration repository instance of a distributed configuration repository cluster.
  • A distributed configuration library cluster consists of multiple configuration library instances, but externally (that is, to the arbitration module) it appears as a unified service. Which configuration library instance the arbitration module accesses is determined by the interface that the distributed configuration library exposes to the arbitration module; it may be the instance on the local node, an instance in the local AZ, or an instance in another AZ. From the perspective of the arbitration module, there is only one "configuration library". Configuration data is synchronously replicated between configuration library instances through a consistent replication protocol such as Raft.
  • In this way, when one AZ fails, a majority (three) of the configuration library instances still work normally, so the distributed configuration library cluster can still provide normal configuration data persistence and consistent data replication services.
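A small illustration, assuming the 2+2+1 instance placement of FIG. 6 and the majority rule of the replication protocol, of why the failure of any single AZ never drops the configuration library below a majority:

```go
package main

import "fmt"

func main() {
	// Hypothetical placement of the five configuration library instances (FIG. 6).
	instancesPerAZ := map[string]int{"AZ100": 2, "AZ200": 2, "AZ300": 1}
	total := 0
	for _, n := range instancesPerAZ {
		total += n
	}
	need := total/2 + 1 // majority required by the consistent replication protocol

	for az, lost := range instancesPerAZ {
		remaining := total - lost
		fmt.Printf("%s fails: %d of %d instances remain (need %d) -> available: %v\n",
			az, remaining, total, need, remaining >= need)
	}
}
```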
  • the configuration libraries within the AZ100 and AZ200 may not be deployed within the storage node, but rather within a separate node.
  • the node that deploys the configuration repository instance can be a lightweight node with a small storage capacity.
  • This embodiment assumes that AZ100 is faulty (the storage master node 130 and the storage standby node 140 fail simultaneously). As described above, the distributed configuration library still provides the configuration data read/write service normally, and in the non-faulty AZ200 the standby arbitration module 232 or 242 can take over the work of the primary arbitration module 132 through an active/standby switchover.
  • After the client 400 determines that storage nodes in the storage cluster or an entire AZ have failed, it sends a cluster change message to the arbitration module 232, where the cluster change message includes the node IDs of the failed storage nodes or the identifier of the faulty AZ.
  • In this embodiment, the identifiers of the failed storage nodes are 130 and 140, and the faulty AZ is AZ100.
  • The client 400 can obtain the status of the storage cluster from a monitoring device, or can itself monitor the status of the storage cluster as part of the monitoring system, thereby obtaining a fault message of the storage cluster and determining that the storage cluster has failed.
  • the fault message may include the ID of the faulty storage node, the faulty AZ identifier, or both. In this embodiment, it is assumed that the fault message includes the faulty AZ100.
  • the client 400 sends a cluster change message to the arbitration module 232, which carries the faulty AZ 100 to cause the arbitration module 232 to initiate a cluster configuration change check.
  • the arbitration module 232 determines whether the configuration degradation condition is met.
  • The configuration degradation conditions are the two conditions A and B described in the foregoing embodiments.
  • The specific ways of judging condition A include the following:
  • If the fault information indicates the faulty AZ100, the storage nodes deployed in the faulty AZ100 and all the storage nodes of the current storage cluster are read from the configuration library, and condition A is judged according to the two counts.
  • If the fault information includes the IDs 130 and 140 of the failed storage nodes, all the storage nodes of the current storage cluster are read from the configuration library, and whether condition A is satisfied is judged according to the two counts.
  • Differences in the content or form of the configuration data stored in the configuration library, or of the received fault information, may lead to different specific ways of judging condition A; these are not enumerated one by one in this application.
  • The specific way of judging condition B is as follows:
  • The arbitration module 232 reads the "storage cluster last maximum index number" from the configuration library (see step S209). If this value does not exist, the storage cluster has never undergone a forced cluster configuration change, and condition B holds. If this value exists, the storage cluster has undergone at least one forced cluster configuration change; the arbitration module then accesses each non-faulty storage node through a network connection and reads its latest index number. If the maximum of these latest index numbers is greater than or equal to the "storage cluster last maximum index number", condition B holds; otherwise condition B does not hold.
  • Each storage node maintains a latest index number and a committed index number.
  • The latest index number is the index number of the most recent log that has been saved on that node.
  • The committed index number of the storage master node is the largest index number among the logs that the storage master node has determined to be saved on a majority of nodes.
  • The committed index number of the storage master node typically (but not always) represents the index number of the committed log that the storage cluster provides to the client or user.
  • The committed index number of a storage standby node is received periodically from the storage master node and may lag slightly behind the storage master node's committed index number. In some cases, the committed index number of the storage master node may itself lag behind the index number of the committed log that the storage cluster currently provides to the client or user.
  • For example, if the faulty storage nodes include the storage master node, a new storage master node is selected from the storage standby nodes by the method of the present application. The committed index number of the new storage master node is then the committed index number it had as a storage standby node, which may lag behind the index number of the committed log that the storage cluster has provided to the client.
  • FIG. 8 shows the index numbers inside the four storage nodes in this embodiment.
  • The log with index number 5 has been saved on three storage nodes (131, 141, and 231), which is a majority, so the committed index number of the storage master node is 5.
  • The committed index numbers of the storage standby nodes 140, 230, and 240 all lag behind the committed index number of the storage master node; they are 4, 4, and 2, respectively.
  • Because the latest index number on a majority of the nodes is greater than or equal to the committed index number of the storage cluster, after the failure at least one surviving node (node 230, with latest index number 5) still satisfies this requirement, so condition B is satisfied.
  • If the "storage cluster last maximum index number" is empty, it is determined that condition B is satisfied. It is easy to see that, for a storage cluster using a consistency protocol, a majority of the storage nodes store logs up to the committed index number of the storage master node, so after the first failure of half of the storage nodes there must still be at least one storage node whose latest index number is greater than or equal to the committed index number of the storage master node. The committed index number of the storage master node before the first failure represents the index number of the committed log provided by the storage cluster to the client or user, or can be understood as one aspect of the data service quality promised by the storage cluster to the client or user.
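Using the index numbers of FIG. 8 as assumed input, the short Go check below shows that condition B would also hold by the index comparison after AZ100 fails (node 230 still holds the log with index number 5):

```go
package main

import "fmt"

func main() {
	// Latest index numbers from FIG. 8 after AZ100 (nodes 130 and 140) fails.
	survivorLatest := map[string]int{"230": 5, "240": 4}
	committed := 5 // committed index number promised to clients before the fault

	maxLatest := 0
	for _, idx := range survivorLatest {
		if idx > maxLatest {
			maxLatest = idx
		}
	}
	fmt.Println("condition B holds:", maxLatest >= committed) // true
}
```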
  • After determining that the configuration degradation condition holds, the arbitration module 232 writes the new post-failure cluster configuration information to the configuration library.
  • the new configuration information may include the ID of each surviving node after the failure.
  • The configuration information of the storage cluster includes information about each storage node, in roughly the following format: "cluster ID, AZ ID, computer name, ID of the storage node (or SQL node)". The arbitration module 232 therefore deletes the information of the corresponding faulty storage nodes from the configuration library, which yields the new cluster configuration information.
  • If the previously received fault information indicates an AZ fault, all storage node information corresponding to that AZ is deleted from the configuration library.
  • If the previously received fault information indicates that multiple storage nodes are faulty, the fault information may include the information of those storage nodes in the form "cluster ID, AZ ID, computer name, storage node ID", and the corresponding storage node information can be deleted from the configuration library.
  • S204: The arbitration module 232 sends an asynchronous request to the failed storage nodes 130 and 140, asking them to stop service. This asynchronous request may or may not be received by the failed nodes. Steps S204 and S205 are not necessary in some other embodiments.
  • S205: The storage nodes 130 and 140 stop service.
  • the arbitration module 232 sends a "force member change" command to the storage node that is not faulty (ie, the storage standby nodes 230 and 240). This command is used to instruct them to delete information about the failed storage node (storage node 130 and storage node 140) in memory and on the hard disk. In this embodiment, the arbitration module 232 is deployed on the storage standby node 230, so the mandatory member change command is actually sent to the storage module 231.
  • S207 The non-failed storage standby nodes 230 and 240 modify the cluster configuration information stored in the memory and the hard disk, stop receiving the message from the failed storage node, and stop sending the message to the failed storage node.
  • the non-faulty storage standby nodes 230 and 240 return a "change success" response to the arbitration module 232, which includes the latest index number of the respective log, that is, the index number 5 and the index number 4 in FIG.
  • the arbitration module 232 is deployed on the storage standby node 230. Therefore, for the storage standby node 230, the storage standby module 231 returns a response to the arbitration module 232.
  • The latest index number in the "change success" response is theoretically the same as the latest index number obtained during the earlier configuration degradation condition check, so if the arbitration module 232 has stored the previously obtained latest index numbers, the response here does not need to contain the latest index number.
  • S209: When the arbitration module 232 has received the "change success" responses from all non-faulty storage nodes, it saves the largest latest index number among those responses (that is, index number 5) in the distributed configuration library, marked as the "storage cluster last latest index number". At this point, the cluster configuration change of the storage cluster is complete.
  • Assume that another failure occurs after step S209. If the failed storage node is 230, the arbitration module 232 will determine that the latest index number 4 of the non-faulty storage node 240 is less than the "storage cluster last latest index number", that is, 5, so condition B does not hold (condition A, of course, does hold); the configuration degradation cannot be performed again, and the storage cluster stops providing service. If the failed storage node is 240, the arbitration module 232 will determine that the latest index number 5 of the non-faulty storage node 230 is equal to the "storage cluster last latest index number", that is, 5, so condition B holds; the cluster configuration is forcibly changed again, and the storage cluster can continue to provide service.
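
The degradation decision for a later failure reduces to the comparison below: the largest latest index number among the surviving nodes must be at least the "storage cluster last latest index number" recorded in the configuration library, and condition B holds unconditionally if no such value has been recorded yet (first failure). This is only a sketch with assumed names, not the arbitration module's actual code.

```go
package arbitration

// ConditionBHolds reports whether the storage cluster may be degraded again.
// latestIndexes are the latest log index numbers read from the surviving
// storage nodes; lastClusterIndex is the "storage cluster last latest index
// number" saved in the configuration library, with recorded=false if it was
// never written (i.e. this is the first forced configuration change).
func ConditionBHolds(latestIndexes []uint64, lastClusterIndex uint64, recorded bool) bool {
	if !recorded {
		// First failure of the cluster: condition B holds.
		return true
	}
	var max uint64
	for _, idx := range latestIndexes {
		if idx > max {
			max = idx
		}
	}
	return max >= lastClusterIndex
}
```

With the index numbers of FIG. 8, ConditionBHolds([]uint64{4}, 5, true) is false when node 230 fails, and ConditionBHolds([]uint64{5}, 5, true) is true when node 240 fails, matching the two scenarios described above.
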
  • S210: The arbitration module 232 initiates reselection of the storage master node under the new cluster configuration. This step is described in detail later.
  • The reselected storage master node may, according to the Raft protocol, request the storage cluster to commit a cluster configuration change log; this log may include the IDs of the nodes deleted in this configuration change, that is, storage nodes 130 and 140.
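
The configuration change log mentioned above might be proposed roughly as in the sketch below; the encoding and the proposer interface are assumptions made for illustration. Under the Raft protocol, the entry is committed once a majority of the nodes in the new (post-change) configuration have stored it.

```go
package storage

import "encoding/json"

// ConfChangeLog records a forced membership change so that the cluster can
// replay it during later recovery.
type ConfChangeLog struct {
	RemovedNodeIDs []string `json:"removed_node_ids"` // e.g. storage nodes 130 and 140
}

// proposer is an assumed interface to the node's replicated log.
type proposer interface {
	Propose(data []byte) error // hands the entry to the consistency protocol for replication
}

// SubmitConfChange encodes the change and hands it to the replicated log;
// commit still follows the normal majority rule of the new configuration.
func SubmitConfChange(p proposer, removed []string) error {
	data, err := json.Marshal(ConfChangeLog{RemovedNodeIDs: removed})
	if err != nil {
		return err
	}
	return p.Propose(data)
}
```
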
  • S211: The arbitration module 232 responds to the client that the cluster configuration change has succeeded, and the system resumes service.
  • Further, the arbitration module 232 initiates the process of reselecting the storage master node. Figure 9 shows this reselection process.
  • S301: The arbitration module 232 sends a request message to the non-faulty storage nodes to obtain the latest index numbers of the non-faulty storage nodes 230 and 240.
  • S302: The storage nodes 230 and 240 return their latest index numbers to the arbitration module 232, namely index numbers 5 and 4 in FIG. 8.
  • S303: The arbitration module 232 selects the storage node with the largest latest index number as the candidate storage master node, that is, the storage standby node 230.
  • S304: The arbitration module 232 sends a candidacy request to the candidate storage master node 230. The request indicates whether the data copy (that is, the log) of the candidate storage master node is up to date and, if it is not, the ID of the node that owns the latest data copy. In this embodiment, because the node selected in the previous step is the storage standby node 230 with the largest index number, the request indicates that the data copy (that is, the log) of the candidate storage master node is the latest.
  • S305: After receiving the candidacy request, the storage standby node 230 determines whether its own data copy (that is, the log) is the latest; if it is not, it obtains the data copy from the node that owns the latest data copy and updates its own copy to the latest. In this embodiment, the data copy stored in the storage standby node 230 is already the latest.
  • S306: According to the current cluster configuration (the cluster configuration that has already been changed after the failure), the storage standby node 230 initiates, in the leader-election manner of the Raft protocol, an election that requires positive responses from a majority of the nodes. This is existing technology and is not described in detail in this application.
  • S307: After the new storage master node (which may or may not be the storage standby node 230) is established through the election, the other storage standby nodes may actively update their data copies, specifically by synchronizing the data logs with the new storage master node to obtain the latest data copy. A storage standby node may also choose not to update its data copy actively and wait for the next data replication, at which point its data copy is updated automatically according to the requirements of the Raft protocol.
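
Steps S301 to S306 can be condensed into the following sketch: query the surviving nodes' latest index numbers, pick the node with the largest one as the candidate, and ask it to start an election under the changed configuration. The StorageNode interface is an assumption introduced for this illustration, not the actual node API.

```go
package arbitration

import "errors"

// StorageNode is an assumed view of a surviving storage node.
type StorageNode interface {
	ID() string
	LatestIndex() (uint64, error) // S301/S302: report the latest log index number
	StartElection() error         // S306: Raft-style election under the new configuration
}

// ReselectMaster picks the surviving node with the largest latest log index
// as the candidate (S303) and asks it to start an election (S304-S306).
func ReselectMaster(nodes []StorageNode) (StorageNode, error) {
	if len(nodes) == 0 {
		return nil, errors.New("no surviving storage nodes")
	}
	var candidate StorageNode
	var best uint64
	for _, n := range nodes {
		idx, err := n.LatestIndex()
		if err != nil {
			continue // an unreachable node cannot become the candidate
		}
		if candidate == nil || idx > best {
			candidate, best = n, idx
		}
	}
	if candidate == nil {
		return nil, errors.New("no reachable storage node")
	}
	// Because the candidate already holds the largest index, it does not need
	// to catch up before campaigning (S305 is a no-op in this embodiment).
	return candidate, candidate.StartElection()
}
```
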
  • Because AZ100 has failed, the only SQL read-write node 110 is also unavailable, and the distributed database needs to re-designate one of the nodes 210 and 220 as the SQL read-write node. In this embodiment, the node may be chosen arbitrarily by the client 400, or chosen by some programmed algorithm (randomly, or preferentially based on known machine performance, and so on).
  • The change of the SQL read-write node can be notified to the SQL agent 500 in various ways: the SQL agent 500 may listen for change events of the SQL read-write node in the configuration library and thereby learn of the change (as in the implemented solution); the SQL agent 500 may periodically probe each SQL node; or the arbitration module 232 may notify the SQL agent 500 of the appearance of the new SQL read-write node. These methods are not enumerated one by one in this application.
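
As an illustration of the first notification method (the SQL agent listening for change events in the configuration library), a watch loop might look roughly like the sketch below; the event channel and the key name are assumptions and do not refer to any specific configuration-library product.

```go
package sqlagent

// ConfigEvent is an assumed change event delivered by the configuration library.
type ConfigEvent struct {
	Key   string
	Value string
}

// WatchReadWriteNode forwards every change of the SQL read-write node key to
// the routing logic of the SQL agent, so that new write requests are sent to
// the re-designated SQL read-write node.
func WatchReadWriteNode(events <-chan ConfigEvent, onChange func(newNodeID string)) {
	for ev := range events {
		if ev.Key == "sql/readwrite_node" { // assumed key layout
			onChange(ev.Value)
		}
	}
}
```
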
  • In other embodiments, in step S303, the method of selecting the candidate storage master node may also be replaced by the following algorithm.
  • The arbitration module 232 reads the network topology data between the computing layer and the storage layer of the entire system from the configuration library, and builds a directed, weighted, acyclic graph from that topology data, as shown in FIG. 10, in which each circle may represent a node. It then computes the shortest path of this graph. The weights of the edges of the graph (W1-W6) may be determined by the network communication rate and load between nodes, and are dynamically adjusted according to communication feedback while the system is running. The shortest-path computation may use the Dijkstra algorithm, and the result includes, for example, the ID of the next-hop node and a mapping of the IP address and port of that node. The storage node on the computed shortest path is the candidate storage master node; using the mapping of its IP address and port, the arbitration module 232 communicates with that storage node to continue with the next step. In addition, the SQL node on the shortest path is determined to be the SQL read-write node, and the other SQL nodes may be SQL read-only nodes.
  • It should be noted that the SQL nodes of the computing layer and the storage nodes of the storage layer may periodically send their own topology information to the configuration library, so that the configuration library always holds the network topology data of the entire system.
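
A minimal sketch of this alternative algorithm is given below: build a weighted directed graph from the topology data and run Dijkstra's algorithm from the client or agent node toward the storage layer; the storage node (and SQL node) lying on the resulting shortest path become the candidates. The graph representation and function signature are assumptions for illustration only.

```go
package arbitration

import "container/heap"

// Edge is a directed, weighted edge of the topology graph; the weight may be
// derived from network communication rate and load and adjusted at runtime.
type Edge struct {
	To     string
	Weight float64
}

type item struct {
	node string
	dist float64
}

type pq []item

func (p pq) Len() int            { return len(p) }
func (p pq) Less(i, j int) bool  { return p[i].dist < p[j].dist }
func (p pq) Swap(i, j int)       { p[i], p[j] = p[j], p[i] }
func (p *pq) Push(x interface{}) { *p = append(*p, x.(item)) }
func (p *pq) Pop() interface{} {
	old := *p
	x := old[len(old)-1]
	*p = old[:len(old)-1]
	return x
}

// ShortestPath runs Dijkstra's algorithm from source and returns the path to
// target as a slice of node IDs (source first). The storage node on this path
// would be the candidate storage master node.
func ShortestPath(graph map[string][]Edge, source, target string) []string {
	dist := map[string]float64{source: 0}
	prev := map[string]string{}
	h := &pq{{node: source, dist: 0}}
	for h.Len() > 0 {
		cur := heap.Pop(h).(item)
		if cur.dist > dist[cur.node] {
			continue // stale queue entry
		}
		for _, e := range graph[cur.node] {
			nd := cur.dist + e.Weight
			if d, ok := dist[e.To]; !ok || nd < d {
				dist[e.To] = nd
				prev[e.To] = cur.node
				heap.Push(h, item{node: e.To, dist: nd})
			}
		}
	}
	if _, ok := dist[target]; !ok {
		return nil // target unreachable in the current topology
	}
	var path []string
	for n := target; ; n = prev[n] {
		path = append([]string{n}, path...)
		if n == source {
			return path
		}
	}
}
```
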
  • The prior art does not provide a solution for optimizing the network performance between the storage nodes and the computing layer under the new cluster configuration after a storage cluster membership change. This selection process reselects the SQL read-write node and the storage master node (through the latest-data-copy rule or the optimal-master-selection algorithm), so that communication between the computing layer and the storage layer recovers quickly under the new cluster configuration. The optimal-master-selection algorithm dynamically adjusts the weights of the topology network between the computing layer and the storage layer according to the network communication rate and load, and computes the shortest path, thereby optimizing the read and write performance between the computing layer and the storage layer.
  • The cluster configuration change method provided by this application can be used not only in the storage layer of a distributed database, but also in other storage clusters that use a consistency replication protocol, such as a distributed key-value system, a distributed lock, and a distributed file system.
  • The apparatus embodiments described above are merely illustrative. The modules described as separate components may or may not be physically separate, and the components displayed as modules may or may not be physical modules; that is, they may be located in one place or distributed over multiple network modules. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments. In addition, the connection relationships between the modules indicate that they have communication connections with each other, which may be specifically implemented as one or more communication buses or signal lines.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Hardware Redundancy (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Debugging And Monitoring (AREA)

Abstract

本申请提供一种修改存储集群配置的方法、装置及计算机系统等。采用一致性复制协议的存储集群,在半数存储节点发生故障的前提下,若确定故障后的存储集群中存在至少一个存储节点,该存储节点上的最新的日志索引号大于或等于存储集群向客户端提供的已提交日志的索引号,则由仲裁模块向未故障的存储节点发送强制集群配置变更指令,所述强制集群配置变更指令用于指示所述未故障的存储节点修改本地的集群配置信息;且仲裁模块更新配置库中存储的集群配置信息,使其指示新的集群配置。这样解决半数存储节点故障后存储集群不可用的问题,提高了存储集群的可用性。

Description

存储集群的配置修改方法、存储集群及计算机系统 技术领域
本申请涉及分布式存储集群技术,尤其涉及一种存储集群发生故障后的集群配置修改技术。
背景技术
存储集群中部署有分布式的多个存储节点,这多个存储节点可以采用一致性复制协议,例如Raft协议,来保证数据的一致性。
Raft协议是一种常用的一致性复制协议,它规定最新数据要保存在大多数(超过半数)存储节点中,然后存储集群才能继续提供数据复制服务(或称为数据写服务)。也就是说,每次数据更新,Raft协议中的存储主节点要得到大多数节点(包括存储主节点自己)已保存最新数据的响应,然后存储集群才能继续提供数据复制服务。当一定数量的存储节点发生故障后,Raft协议要求修改集群配置,使得在新的集群配置下仍满足最新数据保存在大多数节点中,在这个前提下才能继续提供数据复制服务。而在Raft协议中,集群配置的修改过程其实也是一个数据复制过程,只是复制的数据(或称为日志)是一个用于修改集群配置的配置变更命令,因此在执行集群配置的修改时也要遵循前述“保存在大多数”的规定。
在现有的Raft协议中,当少于半数的存储节点发生故障后,存储主节点发送的配置变更命令可以得到大多数未故障的存储节点的响应,所以能够继续提供数据复制服务。具体的,当少于半数的存储节点发生故障后,存储主节点向剩余的所有未故障的存储节点发送配置变更命令,该命令被所有未故障的存储节点保存且向存储主节点返回响应(返回响应即指示保存成功);由于未故障的存储节点超过半数,满足Raft协议要求,因此存储主节点指示所有未故障存储节点各自修改自己的集群配置信息,从而使存储集群在新的集群配置下继续提供数据复制服务。当存储主节点也在故障节点中时,Raft协议要求先执行重新选择存储主节点的过程,新的存储主节点再执行前述集群配置修改过程。
但是,当半数的存储节点同时发生故障时,现有的Raft协议无法获得大多数存储节点对于配置变更命令的响应,所以无法满足Raft协议的要求,也就不能对集群配置的修改达成决定,从而导致存储集群无法完成集群配置的修改。这样,即便存储集群中实质上还保存有满足用户需求的数据,也无法再继续提供数据复制服务了。
图1为一个存储集群的示例。该存储集群包括两个可用区(available zone,AZ)。一个AZ一般由多个数据中心组成,每个可AZ具有独立的供电和独立的网络等。当一个AZ出现通常的故障时,例如电源、网络、软件部署、洪水等灾害,一般不会影响其它AZ。AZ与AZ之间一般通过低延迟网络连接。AZ1和AZ2中分别部署有2个存储节点(存储层),即对称部署。两个AZ对称部署,意味着两个AZ内分别部署有相同数量的存储节点。两个AZ内还分别部署有2个计算节点(计算层)。计算节点例如可以包括结构化查询语言(structured query language,SQL)读写节点和SQL读节点。存储节点包括1个存储主节点和3个存储备节点。根据Raft协议的规定,4个存储节点中需要有3个或3个以上的存储节点保存最新数据。假设图1中存储主节点L、存储备节点F1和F2中已保存最新数据。如 图1所示,当AZ1发生故障(或任意两个其它存储节点发生故障)后,由于4个存储节点中的半数即2个已经故障,那么配置变更命令即便被发出最多也只能得到AZ2内这2个存储节点的响应,无法满足大多数响应的要求,所以集群配置变更失败,即便保存最新数据的存储备节点F2还存在,该存储集群也无法继续提供数据复制服务。
由以上分析和示例可见,在应用Raft等一致性复制协议的存储集群中,急需一种技术方案,该技术方案能够在半数的存储节点发生故障之后仍能保证存储集群在一定条件下继续提供一致性数据复制服务。
发明内容
为了方便理解本申请提出的技术方案,首先在此介绍本申请描述中会引入的几个要素。
节点:具有独立操作系统的计算机,可以是普通虚拟机、轻量级虚拟机等类型的虚拟机,也可以是一台物理计算机。
持久化存储:计算机断电或重启后被存储的信息不会丢失。典型的持久化存储包括文件存储和数据库存储。
大多数:一个大小为N的集合的大多数M指超过此集合大小的一半的数量,即、|N/2|+1≤M≤N,其中M,N都为正整数,||表示向下取整。
半数(或一半):一个大小为N的集合的半数M,即M=N/2。注意这里N/2不取整。在本申请中,当存储集群故障前的节点数量为单数N时,半数M不为整数,但是故障的节点数量一定为整数,所以总不能满足“半数”的要求。
日志:一条日志代表数据库数据操作、文件操作、状态操作等类型中任意一种类型的任意一个操作命令,例如数据读取命令,数据写命令或配置变更命令等。日志通过递增的索引号来维护顺序。数据库操作按日志的索引号的顺序执行日志,能构造出数据的实际内容。“提交”(committed)是存储集群对日志(或称命令)在大多数存储节点中已保存成功的一种响应。“已提交的日志”即已在大多数存储节点中保存成功的日志。任意一个存储节点都不会执行未提交的日志。
“用于指示A的信息”在本申请中指的是能够获得A的信息,获取A的方式可能是信息本身就是A,或者通过该信息可以间接获得A。本申请中的“A/B”或“A和/或B”包括A、B以及A和B,这三种可能的形式。
本申请提供一种存储集群的配置变更方法、相应的装置以及存储集群,该方法应用在采用一致性复制协议的存储集群中。在这样的存储集群中,当半数存储节点发生故障后,该存储集群就一定不能继续提供数据复制服务,而本申请提供的方法可以在半数存储节点发生故障的情况下,仍然能够保证存储集群在满足一定条件的前提下继续提供数据复制服务,提高了存储集群的可用性。
下面介绍本申请提供的多个方面的发明内容,应理解的是,以下多个方面的发明内容并非本申请提供的全部内容,且可互相参考彼此的实现方式和有益效果。
第一方面,本申请提供一种存储集群,该存储集群包括多个存储节点,其中有一个存储主节点和多个存储备节点,该存储集群的各个存储节点之间采用一致性复制协议,例如Raft,来维持一致性。进一步的,该存储集群中包括仲裁模块和配置库。所述配置库被配 置为存储所述存储集群的配置数据,所述配置数据包括集群配置信息,所述集群配置信息中包括存储集群中所有未故障的存储节点的信息,例如存储节点的ID。所述仲裁模块被配置为:在存储集群发生故障之后,若如下条件A和条件B满足则修改所述配置库中存储的所述集群配置信息,并向未故障的存储节点发送强制集群配置变更指令,所述强制集群配置变更指令用于指示所述未故障的存储节点修改本地的集群配置信息。条件A:故障的存储节点的数量为未发生本次故障之前存储集群中所有存储节点的数量的一半;条件B:故障后的存储集群中存在至少一个存储节点,该存储节点上的最新的日志索引号大于或等于存储集群向客户端提供的已提交日志的索引号。该索引号代表数据复制服务的能力,是存储集群向客户端承诺的一个服务质量。
这里的故障通常指的是存储节点不可用的故障,或其它影响集群配置的故障。
存储节点的“本地”通常指的是存储节点内部的内存和/或磁盘等存储器。当存储节点是一台物理计算机时,“本地”通常指的是该物理计算机上的内存和/或磁盘等存储器;当存储节点是一台虚拟机或其他类型的虚拟化设备时,“本地”通常指的是位于物理宿主机内部的、该虚拟机可以访问的内存和/或磁盘等存储器。
条件A的判断方式并非一定要获取这两个数值:故障的存储节点数量和未发生本次故障之前存储集群中所有存储节点的数量,才能判断。也可以通过一些简单的方式,根据实际情况确定该条件成立。例如在两个AZ,且AZ内部存储节点呈对称部署的存储集群中,当发生AZ级故障之后,可以直接确定半数节点故障。同理,条件B也是如此,判断条件是否满足的具体的实现方式有多种,本申请不做限定。
仲裁模块可以部署在存储节点上,也可以部署在别处,配置库也是如此。本申请不做限定。
可见,首先确定半数的存储节点发生故障,然后确定故障后的存储集群还能保证服务质量,之后执行强制的配置集群变更,让存储集群继续服务,在保障服务质量的前提下提升了存储集群的可用性。
基于第一方面,在一些实现方式中,所述仲裁模块还被配置为:在所述存储集群发生故障且执行强制集群配置变更之后,从未故障的存储节点中重新选择一个候选存储主节点,并向所述候选存储主节点发送选主请求,所述选主请求用于指示所述候选存储主节点发起符合一致性复制协议的选主过程。
存储集群配置变更之后,可以根据一些现有存储集群配置的情况和要求重新选择存储主节点,以提升存储集群在新配置下的性能。
基于第一方面,在一些实现方式中,所述仲裁模块被配置为:选择所有未故障的存储节点中最新日志索引号最大的存储节点为所述候选存储主节点。
基于第一方面,在一些实现方式中,所述配置数据中还包括所述存储集群所部属的网络的网络拓扑信息。所述仲裁模块被配置为:获取并根据所述网络拓扑信息构建客户端节点或代理节点到各个所述未故障的存储节点的有向有权图,其中节点之间的边的权值由节点与节点间的网络通信速率或负载(速率和负载也可以)确定,并计算所述有向有权图中的最短路径,确定位于所述最短路径上的存储节点为所述候选存储主节点。
节点与节点间的网络通信速率或负载可以在仲裁模块所部属的节点上存储,也可以在 由其它节点存储由仲裁模块获取。通信速率或负载的获得方式本申请不做限定。
网络拓扑信息主要包括用于指示每个存储(或SQL)节点位于哪个AZ和/或哪台计算机的信息。网络拓扑信息和配置信息可以用独立的数据指示,也可以用合并的数据标识,例如配置数据中若包含多条这样的信息:“集群ID、AZ ID、计算机名、(未故障的)存储节点(或SQL节点)的ID”,那就即表示了配置信息,又表示了网络拓扑信息。
重新选择存储主节点的方式有多种,可以根据日志索引号或根据网络访问性能等因素,也可以结合多个因素,以提升存储集群在新配置下的性能。以上仅是两个示例。
基于第一方面,在一些实现方式中,重新选择出的存储主节点被配置为:向所述存储集群中当前的存储备节点(即未故障的存储备节点)发送集群配置变更日志的写入请求,所述集群配置变更日志包括此次集群配置变更中故障的存储节点的信息。这里这个“重新选择出的存储主节点”为此次故障发生后所述存储集群依据前面实施例中提到的那次选主请求和一致性复制协议选择出的新的存储主节点,其它未故障的存储节点即为存储备节点。
将记录集群配置变更的日志在存储集群内部保存起来,可以在后续存储集群恢复的时按照该日志进行存储集群的恢复。如果不考虑存储集群后续的恢复的话,这个步骤不是必须的。
在其他一些实现方式中,该集群配置变更日志的写入请求也可以由仲裁模块向所有未故障的存储节点发送,指示所有未故障的存储节点将该集群配置变更日志保存在本地。基于第一方面,在一些实现方式中,所述仲裁模块被配置为:获取本次故障后所有未故障的存储节点中的最新的日志索引号,若其中的最大值大于或等于上一次故障后所有未故障的存储节点中的最新的日志索引号的最大值,则确定所述条件B满足;和/或,
所述仲裁模块被配置为:若本次故障为所述集群的首次故障,则确定所述条件B满足。
基于第一方面,在一些实现方式中,所述配置库为分布式配置库,分布地部署在所述多个存储节点以及一个另外的节点上。
第二方面,本申请提供一种修改存储集群的配置的方法,该方法可以应用于采用一致性复制协议的存储集群中,所述存储集群包括多个存储节点、以及配置库。该方法包括:
在所述存储集群发生故障之后,若如下条件A和条件B满足,则修改所述配置库中存储的所述集群配置信息,并向未故障的存储节点发送强制集群配置变更指令,所述强制集群配置变更指令用于指示所述未故障的存储节点修改本地的集群配置信息。其中条件A:故障的存储节点的数量为未发生本次故障之前存储集群中所有存储节点的数量的一半;条件B:故障后的存储集群中存在至少一个存储节点,该存储节点上的最新的日志索引号大于或等于存储集群向客户端提供的已提交日志的索引号。所述配置库中存储所述存储集群的配置数据,所述配置数据包括集群配置信息,所述集群配置信息中包括存储集群中所有未故障的存储节点的信息。
在一些实现方式中,该方法可以由仲裁模块实现,该仲裁模块可以部署在存储节点上,也可以部署在其他类型的节点上,也可以自己独立部署成为一个独立的节点。
在一些实现方式中,以上所述的配置库可以是分布式配置库,分布地部署在多个存储节点上。进一步的,该配置库还可以在独立于存储节点的一个轻量级节点上部署。
在一些实现方式中,集群配置变更之后,存储集群可以选择新的存储主节点;进一步 的,选择出新的存储主节点之后,仲裁模块或新的存储主节点可以向其他存储节点发送集群配置变更日志的写入请求,该写入请求中包含集群配置变更日志,该日志用来记录本次集群配置变更的相关信息,以用于后续集群恢复等操作。这些具体实现可参考与第一方面互相参考。
第三方面,本申请提供一种修改存储集群的配置的装置,该装置包括一个或多个模块,这个模块或这些模块用于实现前述方面提供的任意一种方法。
第四方面,本申请提供一种计算机可读存储介质,所述计算机可读存储介质中包括计算机可读指令,所述计算机可读指令被处理器读取并执行时实现本申请任意实施例提供的方法。
第五方面,本申请提供一种计算机程序产品,所述计算机程序产品中包括计算机可读指令,所述计算机可读指令被处理器读取并执行时实现本申请任意实施例提供的方法。
第六方面,本申请提供一种计算机系统,包括处理器和存储器,所述存储器用于存储计算机指令,所述处理器用于读取所述计算机指令并实现本申请任意实施例提供的方法。
附图说明
为了更清楚地说明本申请提供的技术方案,下面将对附图作简单地介绍。显而易见地,下面描述的附图仅仅是本申请的一些实施例。
图1为一种存储集群的故障示意图;
图2为一种存储集群的逻辑组成示意图;
图3为一种集群配置变更方法的流程示意图;
图4为另一种存储集群的逻辑组成示意图;
图5为另一种集群配置变更方法的流程示意图;
图6为一种分布式数据库的部属示意图;
图7为基于图6的一种集群配置变更方法的交互示意图;
图8为存储节点上存储的日志索引号的示例;
图9为基于图6和图7的一种重新选择存储主节点的交互示意图;
图10为选择候选存储主节点时应用的有限有权图的示例。
具体实施方式
图2为本实施例提供的一种存储集群的示意图。图中示例性地示出了存储集群中的4个存储节点,其中包括1个存储主节点L和3个存储备节点F1-F3,各个存储节点里都包含存储模块,用于提供与数据存储相关的服务。另外,在该存储集群中还包括配置库100和仲裁模块。进一步的,仲裁模块可以包括主仲裁模块200和一个或多个备用仲裁模块200a。备用仲裁模块200a在主仲裁模块200可用的情况下不运行。
存储主节点L和存储备节点F1-Fn为采用一致性协议(例如Raft)的存储集群的节点。多个存储节点之间通过一致性协议维护同一份数据的多个副本(下文简称为数据副本)。
配置库100是一个在存储节点故障时,依然可以正常提供集群配置信息读取和存储服务的配置库。配置库100中存储有存储集群的配置数据,配置数据包括集群配置信息。其 中,集群配置信息可以包括存储集群中每个存储节点的ID,进一步的还可以包括每个存储节点的ID与其已同步数据的进度(index)的一个映射(map)。
进一步的,配置数据还可以包括配置变更过程中的状态信息以及网络拓扑信息等。多个存储集群可以共享同一个配置库。配置库100具体可以是分布式配置库,配置库可以部署在一个或多个节点上,这个一个或多个节点中可以部分或全部是存储节点。
仲裁模块(200和200a)是一个无状态的仲裁角色。当存储节点发生故障时,由仲裁模块中的成员变更模块(210或210a)完成存储集群的配置变更过程。“无状态”的意思是仲裁模块本身不持久化存储集群配置信息、以及配置变更过程中的状态信息以及网络拓扑信息等(这些信息都持久化存储在配置库100中)。仲裁模块(200和200a)接收外部输入指令(或信息),根据编程逻辑访问配置库获取需要的信息做出仲裁决定,或协调其它节点做出仲裁决定等。平时只有一个主仲裁模块200工作,主仲裁模块200发生故障以后,备用仲裁模块200a能够接替主仲裁模块200的工作。中间涉及主备切换,但因为无状态的特性,所以主备切换方便快捷。仲裁模块可以部署在独立的仲裁节点上,也可以部署在存储节点上。
图3为本实施例提供的存储集群配置变更方法的流程示意图。该方法在主仲裁模块200未故障的情况下由成员变更模块210执行。成员变更模块210在接收到故障信息或配置变更指令之后,判断故障的存储节点的数量是否小于半数,即(n+1)/2(S101),如果是,则执行现有技术中的一致性集群配置变更流程(S106);如果否,则判断是否满足配置降级条件(S102)。一致性集群配置变更流程可参考背景技术部分的描述,可能会涉及重新选择存储主节点的过程。
在一些实现方式下,故障信息可以由外部(如集群监控装置、或管理员)输入到配置库100中。主仲裁模块200自己监听到该故障信息(如故障的节点的ID或故障后存活的节点的ID),或者集群监控装置、或管理员监控到故障信息之后通过管理客户端发送给主仲裁模块200。
在一些实现方式下,成员变更模块210可以将接收到的故障信息与原有的集群配置信息比较,判断故障节点的数量是否小于半数。在另一些实现方式下,故障信息里本身可以携带指示故障的存储节点是否小于半数的信息,例如两个AZ对称部署的情况下,接收到的故障信息指示其中一个AZ故障,那可以确定故障的存储节点不小于半数,而是恰好半数。
这里“配置降级”也可以理解为配置变更的一种。因为存储集群中存在存储节点故障,所以存储集群的节点总数变少了,也就是集群配置降低了,所以本申请中将这种配置变更称为“配置降级”。
成员变更模块210从配置库100中读取当前的集群配置信息,并判断是否满足配置降级条件(S102)。如果不满足配置降级条件,则返回错误或故障信息(S105)。如果满足配置降级条件,则继续执行配置降级。配置降级条件有以下A和B两个,若要执行配置降级必须都满足。
配置降级条件A:故障的存储节点的数量占故障前全部存储节点的半数,即(n+1)/2。该条件的判断可以和步骤S101融合在一起。故障的存储节点的数量可以从接收到的故障信 息中获取,故障前全部存储节点的数量可以从配置库100中读取的集群配置信息中获取。
配置降级条件B:未故障的存储节点中至少存在一个存储节点,其最新的日志索引号大于或等于故障前存储集群对用户(或客户端)提供的已提交的日志索引号。也就是说,故障之后的存储集群还有能力向用户提供之前对用户承诺的服务,不能因为故障就降低服务要求。
日志代表了存储节点中的数据操作、文件操作、状态操作等一系列实际操作。日志通过递增的索引号来维护顺序。存储节点按日志的索引号的顺序执行日志,能构造出数据的实际内容。“提交”(committed)是存储集群对日志在大多数存储节点中已保存成功的一种响应。“已提交的日志”即已在大多数存储节点中保存成功的日志。任意一个存储节点都不会执行未提交的日志。每一次一致性数据复制过程都伴随着至少一个新的已提交日志的索引号。
具体的,成员变更模块210可以从配置库100中获取每个未故障的存储节点的标识(ID),然后构建与每个未故障的存储节点的网络连接(或使用已有的网络连接),之后获取它们的最新的日志索引号。
在其它一些实施例中,配置降级条件B的检查也可以由成员变更模块210发起,由各个存储节点独立完成后反馈到成员变更模块。
继续参考图3,若确定上述配置降级条件不满足,则返回错误/故障信息(S105),表明集群配置无法修改,当前存储集群不能再继续提供一致性数据复制服务。若确定上述配置降级条件满足,则更新配置库100中存储的集群配置信息(S103)。
配置库100中的集群配置信息被更新之后,成员变更模块210强制未故障的存储节点实施集群配置变更(S104)。具体的,成员变更模块210向所有未故障的存储节点下发“强制成员变更”命令,该命令中包括需要删除的存储节点的ID(也就是故障的存储节点的ID)。未故障的存储节点收到“强制成员变更”命令后,在其内存或硬盘包括的集群配置信息中将故障节点的信息删除。各个存储节点的“强制成员变更”可以并行进行。
需要说明的是,存储主节点内存中的集群配置信息是为了要确认集群中一共有哪些存储节点(以及它们的数据同步进度);存储备节点内存中的集群配置信息是为了以后它可能变为存储主节点。硬盘或磁盘中的集群配置信息是为了存储节点重启时,可以从硬盘中加载集群配置信息到内存中。
集群配置降级之后,如果故障前的存储集群中的存储主节点没有故障,那可以继续使用该存储主节点,存储集群继续提供一致性数据复制服务。但是,集群配置变化了,存储集群的网络拓扑可能发生变化,为了提高服务性能可以重新选择存储主节点。重新选择的存储主节点可能与原来一样。
图4为本申请另一实施例提供的一种存储集群的示意图。该存储集群中除图2所示的组件之外,仲裁模块内还包括选主模块220和220a。选主模块用于在集群配置变更之后重新选择存储主节点。图5为本实施例提供的存储集群配置变更方法的流程示意图。步骤S101-S104与前述实施例类似,可参考前述实施例的描述。如图5所示,在步骤S104之后,选主模块220促使存储集群重新选择存储主节点。重新选择存储主节点的策略有很多种,可根据当前存储集群的需求确定。重新选择存储主节点的过程也要满足一致性协议的要求, 所以不可能仅由选主模块220实现,需要所有未故障的节点都参与。本申请将在下面的实施例中对重新选择主节点的过程做示例性地描述。
在其它一些实施例中,在重新选择出存储主节点后,数据落后的存储备节点可以(但非必须)从新的存储主节点同步日志,更新得到最新的数据副本。
在其它一些实施例中,重新选择出的存储主节点还可以向新的集群配置下的所有存储备节点发送一条集群配置变更日志的写入请求,日志内容包括此次集群配置变更中删除的节点,当得到大多数存储备节点接受并已保存该日志的响应后,该日志提交成功。提交该日志的目的是为了维护对存储集群的所有操作日志,顺序保存日志,以便在异常恢复(或重启)的过程中,可以通过按顺序执行日志以恢复存储集群的数据和状态。
通过上述实施例提供的方法,采用一致性协议的存储集群可以在半数节点故障的时后仍然能在保证服务能力的前提下提供数据复制服务,而不是像现有技术那样只要半数节点故障就完全不能提供数据复制服务了,从而提高了存储集群的可用性。
以上实施例概括性地描述了本申请提供的存储集群的逻辑组成以及其中执行的方法流程,下面一个实施例将以分布式数据库为例,详细介绍本申请提供的技术方案的一个实施例。本申请提出的存储集群部署在分布式数据库的存储层。
请参考图6,为本实施例提供的一种分布式数据库的部署示意图。分布式数据库存储层的存储集群在两个AZ(AZ100和AZ200)内对称部署。每个AZ内包含两个存储节点,共4个存储节点,其中包括1个存储主节点130,3个存储备节点140、230和240。分布式数据库计算层包括SQL节点,SQL节点分为SQL读写节点110和SQL只读节点120、210和220。SQL读写节点110与存储主节点130部署在同一AZ100内。SQL只读节点120部署在AZ100内,SQL只读节点210和SQL只读节点220部署在AZ200内。
另外,图6还包括客户端400和SQL代理500。该客户端400为处理用户或工程人员对这个分布式数据库执行管理操作的管理客户端。集群配置变更属于对这个数据库的管理操作之一。使用分布式数据库提供的数据服务的客户端可以和这个客户端400集成在一起,也可以单独另一个客户端,本申请不做限定。
SQL代理500用于接收客户端400发出的SQL请求,根据SQL请求的类型,以及计算层中每个SQL节点的负载情况,分发请求到计算层的SQL读写节点或其中某个SQL只读节点。SQL请求的类型包括读请求和写请求。SQL读写节点负责翻译SQL读请求和SQL写请求,SQL只读节点则只能翻译SQL读请求。翻译就是将SQL读请求或SQL写请求转换成数据库的数据操作、文件操作、状态操作等一系列实际操作,这些操作可以以日志的形式来表示。SQL节点再将这些日志发送给存储层的存储节点。存储节点主要用于存储数据库数据(包括日志、元数据、数据本身等),并可以执行日志以对数据、元数据、文件等进行操作,并返回操作结果给SQL节点。
4个存储节点内分别部署有1个存储模块,其中存储主模块131部署在存储主节点130内,存储备模块141、231以及241分别部署在存储备节点140、230和240上。这里的存储模块是用于提供数据复制服务的存储模块,提供类似于现有技术的存储节点的功能。
本实施例中,4个存储节点内还分别部署有1个仲裁模块,其中主仲裁模块132部署在存储主节点130内,备仲裁模块142、232以及242分别部署在存储备节点140、230和240上。3个备仲裁模块正常情况下不工作,只有当主仲裁模块132故障或所在的存储主节点130故障之后,其中一个备仲裁模块才接替原有的主仲裁模块132的工作。主仲裁模块132切换到任意一个备仲裁模块涉及主备切换,可采用现有技术中的主备切换的方法,本申请在此不赘述。仲裁模块可以不支持存储持久化的集群配置信息。
在其他一些实施例中,仲裁模块可以部署在独立于存储节点的其他节点内。同样的,配置库也可以部署在其他节点内。
集群配置信息持久化存储在一个分布式配置库集群中,该配置库集群分布在3个AZ内,总共有5个配置库。如图所示,这5个配置库包括存储节点内部署的4个配置库133、143、233以及243,以及为了进一步提高配置库的可用性,在一个轻量级AZ300内部署的配置库333。配置库内存储的配置数据不仅可以有集群配置信息,还可以包括配置变更过程中的状态信息以及网络拓扑信息等。
图6中各个“配置库”可以理解为分布式配置库集群的一个运行时的配置库实例。分布式配置库集群由多个配置库实例构成,但对外(即对仲裁模块)体现为一个统一的服务。仲裁模块访问哪个配置库实例根据分布式配置库暴露给仲裁模块的接口确定,可能是本地节点内的配置库实例、或本地AZ内配置库实例,也可能是另外一个AZ内的配置库实例。从仲裁模块的角度,只有一个“配置库”。配置库实例之间通过一致性复制协议(例如Raft)进行配置数据的同步复制。分布式配置库集群可以在1个AZ发生故障的情况下,大多数(3个)配置库实例仍然正常工作,从而继续正常提供配置数据持久化和一致性数据复制服务。
在其他一些实施例中,在AZ100和AZ200内的配置库可以不部署在存储节点内部,而是部署单独的节点内。不同于存储节点的大量数据存储需求,部署配置库实例的节点可以是存储容量较小的轻量级节点。
在图6所示的分布式数据库的部署示例中,当发生AZ级故障,即整个AZ的节点全部故障的时候,本申请提供的方法可以快速恢复一致性数据复制服务。具体过程如图7所示。
当发生AZ级故障时,本实施例假设AZ100故障(存储主节点130及存储备节点140同时故障),如上述,分布式配置库仍然正常提供配置数据读写服务,未故障的AZ200里的备仲裁模块232或242可以通过主备切换接替主仲裁模块132的工作。
S201:当客户端400确定存储集群中的存储节点或某个AZ发生故障之后,客户端400向仲裁模块232发送集群变化消息,该集群变化消息中包含故障的存储节点的节点ID或故障的AZ的标识,本实施例中故障的存储节点是130和140,故障的AZ是AZ100。
具体的,客户端400可以获取监控装置监控存储集群的状态,或作为监控系统的一部分监控存储集群的状态,从而获得存储集群的故障消息,确定存储集群故障。该故障消息中可以包含故障的存储节点的ID,也可以包括故障的AZ的标识,或两者都有。本实施例中,假设故障消息中包括的是故障的AZ100。之后,客户端400向仲裁模块232发送集群变化消息,该消息中携带故障的AZ100,以促使仲裁模块232发起集群配置变更检查。
S202:仲裁模块232判断是否满足配置降级条件。配置降级条件如前述实施例所述的A和B两个。
条件A具体的判断方式包括如下几种:
在一种实现方式下,如果已知两个AZ为对称部署,那么如果接收到的故障消息中携带故障的AZ100,则确定满足条件A。
在另一种实现方式下,从配置库中读取故障的AZ100中部署的存储节点以及当前存储集群所有存储节点,根据二者的个数确定是否满足条件A。
在其他实施例中,如果故障信息中包括故障的存储节点的ID130和140,则从配置库中读取当前存储集群所有存储节点,并根据两者个数值确定是否满足条件A。
也可以参考前述实施例,先判断故障的节点个数是否小于半数,若否再判断是否满足条件A;或者采用分支的形式,故障小于半数的时候执行原有Raft集群配置变更,等于半数的时候执行本申请提供的集群配置变更,大于半数的时候返回变更失败消息。
需要说明的是,配置库存储的配置数据的内容或形式不同,或收到的故障信息的内容或形式不同,都可能造成条件A具体的判断方式的差别,本申请不一一列举。
条件B具体的判断方式如下:
仲裁模块232读取配置库中的“存储集群上次最新索引号”(见步骤S209),如果没有这个值,说明存储集群没有经过强制集群配置变更,条件B成立。如果存在这个值,说明存储集群至少经过一次集群配置变更,通过网络连接访问当前未故障的每个存储节点,读取它们的最新索引号,这些最新索引号中最大的索引号如果大于或等于“存储集群上次最新索引号”,则条件B成立,否则条件B不成立。
每个存储节点都会有一个最新索引号和已提交索引号。最新索引号是每个节点已保存的最新日志的索引号。存储主节点已提交索引号是存储主节点确定已保存到大多数节点的日志中的最大的索引号。存储主节点的已提交索引号通常(但非绝对)代表着存储集群向客户端或用户提供的已提交日志的索引号。存储备节点的已提交索引号是定期从存储主节点接收的,可能比存储主节点的已提交索引号滞后一点。在某些情况下,存储主节点的已提交索引号可能滞后于存储集群当前向客户端或用户提供的已提交日志的索引号,比如在半数存储节点故障后,故障的存储节点中包括存储主节点,通过本申请的方法从一个存储备节点中选择一个新的存储主节点,此时的存储主节点的已提交索引号还是它之前作为存储备节点的已提交索引号,可能滞后于存储集群当前向客户端或用户提供的已提交日志的索引号。
图8示出了本实施例当前4个存储节点内部的索引号情况。如图8所示,存储主节点索引号5对应的日志已经保存到3个(131、141以及231)存储节点中,占大多数,所以存储主节点已提交索引号为5。3个存储备节点140、230和240的已提交索引号都滞后于存储主节点的已提交索引号,分别为4、4和2。在存储集群中,大多数节点的最新索引号大于或等于存储集群的已提交索引号。假设,存储集群在本次故障前向用户或客户端400承诺的服务要达到索引号为5的日志提供的服务,故障后未故障的存储节点为230和240,其中230中最新索引号为5,因此满足条件B。
或者,假设本实施例中本次故障是首次故障,那么“存储集群上次最新索引号”为空,则确定条件B满足。容易理解的是,对于使用一致性协议的存储集群来说,由于大多数存储节点中存储了存储主节点的已提交索引号,所以首次半数存储节点故障之后,一定还存在至少一个存储节点,其最新的索引号大于或等于存储主节点的已提交索引号,而在首次故障之前存储主节点的已提交索引号就代表着存储集群向客户端或用户提供的已提交日志的索引号,也可以理解为是存储集群向客户端或用户承诺的数据服务质量的其中一个方面。
S203:确定配置降级条件成立之后,仲裁模块232向配置库写入故障后存储集群新的配置信息。该新的配置信息可以包括故障后每个存活节点的ID。
本申请对“配置信息”的格式不做具体限定,故障后如何修改配置信息可以根据实际情况确定。在一些实现方式中,存储集群的配置信息包括每个存储节点的信息,格式大概如下“集群ID、AZ ID、计算机名、存储节点(或SQL节点)的ID”,因此仲裁模块232向配置库写入(更新)故障后存储集群新的配置信息,其实是在把对应的故障的存储节点信息删除,就变成了新的集群配置信息。例如,之前收到的故障信息指示一个AZ故障,那就是把配置库中的所有对应该AZ的存储节点信息的删除。再例如,之前收到的故障信息指示多个存储节点故障,那该故障信息中可能包含存储节点的信息"集群ID、AZ ID、计算机名、存储节点的ID”,在配置库中把对应的存储节点中的信息删除即可。
S204:仲裁模块232发送一个异步请求到故障的存储节点130和140,要求它们停止服务。此异步服务可能被故障节点接收到,也可能不被故障节点接收到。该步骤S204和S205在其他一些实施例中不是必须的。
S205:存储节点130和140停止服务。
S206:仲裁模块232发送“强制成员变更”命令到未故障的存储节点(即存储备节点230和240)。该命令用于指示它们在内存及硬盘中删除故障的存储节点(存储节点130和存储节点140)的信息。本实施例中,仲裁模块232部署在存储备节点230上,所以强制成员变更命令实际上发送给存储模块231。
S207:未故障的存储备节点230和240修改存储在其内存及硬盘中集群配置信息,并停止接收来自故障的存储节点的消息,停止发送去往故障的存储节点的消息。
S208:未故障的存储备节点230和240向仲裁模块232返回“变更成功”响应,该响应中包含各自的日志的最新索引号,即为图8中的索引号5和索引号4。仲裁模块232部署在存储备节点230上,所以对于存储备节点230而言,是存储备模块231向仲裁模块232返回响应。
需要说明的是,变更成功的响应中的最新索引号理论上来说与之前配置降级条件检查时获取的最新索引号是一致的,所以如果之前仲裁模块232存储了之前获取的最新索引号,则这里的响应中并不需要包含最新索引号。
S209:当仲裁模块232得到所有未故障的存储节点的“变更成功”响应,把所有响应中最大的一个最新索引号(即索引号5)保存在分布式配置库中,标记为“存储集群上次最新索引号”。至此,存储集群的集群配置变更完成。
假设在步骤S209之后再发生一次故障,若故障的存储节点为230,那么仲裁模块232将会判断未故障的存储节点240的最新索引号4小于“存储集群上次最新索引号”即5,因 此条件B不成立(当然条件A也成立),不能再执行配置降级,存储集群将停止提供服务;若故障的存储节点为240,那么仲裁模块232将会判断未故障的存储节点230的最新索引号5等于“存储集群上次最新索引号”即5,因此条件B成立,将再次进行集群配置强制变更,存储集群仍然可以继续提供服务。
S210:仲裁模块232在新的集群配置下发起重新选择存储主节点。该步骤将在后面详细描述。
重新选择出的存储主节点可以按照Raft协议向存储集群请求提交一条存储集群配置变更日志,该日志中可以包含此次集群配置变更中删除的节点的ID,即存储节点130和140。
S211:仲裁模块232响应客户端,集群配置变更成功,系统恢复服务。
进一步的,仲裁模块232发起重新选择存储主节点的过程。
图9为重新选择存储主节点的过程。
S301:仲裁模块232向未故障的存储节点发送请求消息,用以获取未故障的存储节点230和240的最新索引号。
S302:存储节点230和240向仲裁模块232返回最新索引号,即图8中的索引号5和4。
S303:仲裁模块232选择最新索引号最大的存储节点作为候选存储主节点,即存储备节点230。
S304:仲裁模块232向候选存储主节点230发送一个选主请求,该请求包括候选存储主节点的数据副本(即日志)是否最新,以及不是最新的时候拥有最新数据副本的节点的ID。本实施例中由于上一步骤选择的就是索引号最大的存储备节点230,所以该请求中包含候选存储主节点的数据副本(即日志)是最新。
S305:存储备节点230接收到选主请求之后,确定自身数据副本(即日志)是否最新,如果不是最新,从拥有最新数据副本的节点中获取数据副本,把自己的数据副本更新到最新。在本实施例中,存储备节点230中存储的数据副本是最新的,
S306:存储备节点230根据当前集群配置(故障后已更改的集群配置),通过Raft协议的选主方式,发起需要大多数节点积极响应的选举。此处为现有技术,本申请不赘述。
S307:当新的存储主节点(可能是存储备节点230,也可能不是)通过选举确立以后,其它存储备节点可以主动更新数据副本,具体的,通过与新的存储主节点同步数据日志更新到最新数据副本。存储备节点也可以不主动更新数据副本,等待下一次数据复制的时候,按照Raft协议的要求自动更新存储备节点的数据副本。
由于AZ100发生故障,所以唯一的SQL读写节点110也不可用了,分布式数据库需要重新指定210和220中的一个节点作为SQL读写节点,本实施例中可以由客户端400任意选择,或者通过某种编程算法(或随机、或根据预知的机器性能择优等等)选择。SQL节点的变化可以通过多种方法通知到SQL代理500,一种是SQL代理500监听配置库中关于SQL读写节点的变更事件,从而得知SQL读写节点变化(落地方案中),另一种是SQL代理500定时检测各个SQL节点,也可以是仲裁模块232通知SQL代理500新的SQL读写节点的出现等,本申请不一一列举。
在其他实施例中,在步骤S303中,选择候选存储主节点的方法也可以替换为以下算法。
仲裁模块232从配置库中读取整个系统计算层与存储层间的网络拓扑数据,并根据该网络拓扑数据建立一个有向有权无环图,如图10所示,每个圆形都可以代表一个节点。然后计算这个图的最短路径。其中,此图的边的权值(W1-W6)可以由节点与节点间的网络通信速率和负载确定,并会根据系统运行过程中通信的反馈进行动态调整。最短路径计算方法可以采用dijkstra算法,计算结果例如包括下一跳节点的ID,以及该节点的IP地址和端口的映射。计算出的最短路径中的存储节点即为候选的存储主节点。由仲裁模块232IP地址和端口的映射与该存储节点通信,继续下一步骤的操作。另外,确定最短路径中的SQL节点,为SQL读写节点,其他SQL节点可以为SQL读节点。
需要说明的是,计算层的SQL节点和存储层的存储节点可以定时向配置库发送自身的拓扑信息,以使得配置库中总是保存有整个系统的网络拓扑数据。
现有技术均没有对存储集群成员变更后,在新的集群配置下,如何优化存储节点与计算层的网络性能提供解决方案。此选主过程对SQL读写节点和存储主节点的重新选取(通过数据副本最新规则或最优选主算法),实现在新的集群配置下的计算层与存储层通信快速恢复。其中,最优选主算法根据网络通信速率和负载,对计算层与存储层间节点的拓扑网络进行动态调整权值,并计算最短路径,来实现计算层与存储层读写性能最优化。
本申请提供的集群配置变更方法除了可以用在分布式数据库的存储层,也可以用于其他采用一致性复制协议的存储集群,如分布式键值系统、分布式锁和分布式文件系统等。
需要说明的是,前述实施例中提出模块或单元的划分仅作为一种示例性的示出,所描述的各个模块的功能仅是举例说明,本申请并不以此为限。本领域普通技术人员可以根据需求合并其中两个或更多模块的功能,或者将一个模块的功能拆分从而获得更多更细粒度的模块,以及其他变形方式。
以上描述的各个实施例之间相同或相似的部分可相互参考。
以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的模块可以是或者也可以不是物理上分开的,作为模块显示的部件可以是或者也可以不是物理模块,即可以位于一个地方,或者也可以分布到多个网络模块上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。另外,本发明提供的装置实施例附图中,模块之间的连接关系表示它们之间具有通信连接,具体可以实现为一条或多条通信总线或信号线。本领域普通技术人员在不付出创造性劳动的情况下,即可以理解并实施。
以上所述,仅为本申请的一些具体实施方式,但本申请的保护范围并不局限于此。

Claims (17)

  1. 一种存储集群,包括多个存储节点,其特征在于,还包括仲裁模块和配置库,其中:
    所述配置库被配置为:存储所述存储集群的配置数据,所述配置数据包括集群配置信息,所述集群配置信息中包括存储集群中所有未故障的存储节点的信息;
    所述仲裁模块被配置为:在存储集群发生故障之后,若如下条件A和条件B满足则修改所述配置库中存储的所述集群配置信息,并向未故障的存储节点发送强制集群配置变更指令,所述强制集群配置变更指令用于指示所述未故障的存储节点修改本地的集群配置信息;
    条件A:故障的存储节点的数量为未发生本次故障之前存储集群中所有存储节点的数量的一半;
    条件B:故障后的存储集群中存在至少一个存储节点,该存储节点上的最新的日志索引号大于或等于存储集群向客户端提供的已提交日志的索引号。
  2. 如权利要求1所述的存储集群,其特征在于,
    所述仲裁模块还被配置为:在所述存储集群发生故障且执行强制集群配置变更之后,从未故障的存储节点中重新选择一个候选存储主节点,并向所述候选存储主节点发送选主请求,所述选主请求用于指示所述候选存储主节点发起选主过程。
  3. 如权利要求2所述的存储集群,其特征在于,
    所述仲裁模块被配置为:选择所有未故障的存储节点中最新日志索引号最大的存储节点为所述候选存储主节点。
  4. 如权利要求2所述的存储集群,其特征在于,所述配置数据中还包括所述存储集群所部属的网络的网络拓扑信息;
    所述仲裁模块被配置为:获取并根据所述网络拓扑信息构建客户端节点或代理节点到各个所述未故障的存储节点的有向有权图,其中节点之间的边的权值由节点与节点间的网络通信速率或负载确定,并计算所述有向有权图中的最短路径,确定位于所述最短路径上的存储节点为所述候选存储主节点。
  5. 如权利要求2-4任意一项所述的存储集群,其特征在于,
    重新选择出的存储主节点被配置为:向所述存储集群中的存储备节点发送集群配置变更日志的写入请求,所述集群配置变更日志包括此次集群配置变更中故障的存储节点的信息,所述重新选择出的存储主节点为所述存储集群根据所述选主请求选择出的存储主节点。
  6. 如权利要求1-5任意一项所述的存储集群,其特征在于,
    所述仲裁模块被配置为:获取本次故障后所有未故障的存储节点中的最新的日志索引号,若其中的最大值大于或等于上一次故障后所有未故障的存储节点中的最新的日志索引号的最大值,则确定所述条件B满足;和/或,
    所述仲裁模块被配置为:若本次故障为所述集群的首次故障,则确定所述条件B满足。
  7. 如权利要求1-6任意一项所述的存储集群,其特征在于,所述配置库为分布式配置库,分布地部署在所述多个存储节点以及一个另外的节点上。
  8. 如权利要求1-7任意一项所述的存储集群,其特征在于,还包括备用仲裁模块,被 配置为在所述仲裁模块故障之后,接替所述仲裁模块以实现所述仲裁模块的功能。
  9. 一种修改存储集群配置的方法,其特征在于,所述存储集群包括多个存储节点以及配置库,包括:
    在所述存储集群发生故障之后,若如下条件A和条件B满足,则修改所述配置库中存储的集群配置信息,并向未故障的存储节点发送强制集群配置变更指令,所述强制集群配置变更指令用于指示所述未故障的存储节点修改本地的集群配置信息;
    条件A:故障的存储节点的数量为未发生本次故障之前存储集群中所有存储节点的数量的一半;
    条件B:故障后的存储集群中存在至少一个存储节点,该存储节点上的最新的日志索引号大于或等于存储集群向客户端提供的已提交日志的索引号;
    其中,所述配置库中存储所述存储集群的配置数据,所述配置数据包括所述集群配置信息,所述集群配置信息中包括存储集群中所有未故障的存储节点的信息。
  10. 如权利要求9所述的方法,其特征在于,还包括:
    在所述存储集群发生故障且执行所述强制集群配置变更之后,从未故障的存储节点中重新选择一个候选存储主节点,并向所述候选存储主节点发送选主请求,所述选主请求用于指示所述候选存储主节点发起选主过程。
  11. 如权利要求10所述的方法,其特征在于,
    所述从未故障的存储节点中重新选择一个候选存储主节点,包括:
    选择所有未故障的存储节点中最新日志索引号最大的存储节点为所述候选存储主节点。
  12. 如权利要求10所述的方法,其特征在于,所述配置数据中还包括所述存储集群所部属的网络的网络拓扑信息;
    所述从未故障的存储节点中重新选择一个候选存储主节点,包括:
    获取并根据所述网络拓扑信息构建客户端节点或代理节点到各个所述未故障的存储节点的有向有权图,其中节点之间的边的权值由节点与节点间的网络通信速率或负载确定,并计算所述有向有权图中的最短路径,确定位于所述最短路径上的存储节点为所述候选存储主节点。
  13. 如权利要求9-12任意一项所述的方法,其特征在于,所述方法还包括:
    向所述存储集群中的存储备节点发送集群配置变更日志的写入请求,所述集群配置变更日志包括此次集群配置变更中故障的存储节点的信息,所述重新选择出的存储主节点为所述存储集群根据所述选主请求选择出的存储主节点。
  14. 如权利要求9-13任意一项所述的方法,其特征在于,判断条件B是否满足,包括:
    获取本次故障后所有未故障的存储节点中的最新的日志索引号,若其中的最大值大于或等于上一次故障后所有未故障的存储节点中的最新的日志索引号的最大值,则确定所述条件B满足;和/或,
    若本次故障为所述集群的首次故障,则确定所述条件B满足。
  15. 一种计算机系统,其特征在于,包括处理器和存储器,所述存储器用于存储计算机指令,所述处理器用于读取所述计算机指令并实现如权利要求9-14任意一项所述的方法。
  16. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质中包括计算机可读指令,所述计算机可读指令被处理器读取并执行时实现如权利要求9-14任意一项所述的方法。
  17. 一种计算机程序产品,其特征在于,所述计算机程序产品中包括计算机可读指令,所述计算机可读指令被处理器读取并执行时实现如权利要求9-14任意一项所述的方法。
PCT/CN2018/112580 2017-10-31 2018-10-30 存储集群的配置修改方法、存储集群及计算机系统 WO2019085875A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP18872063.5A EP3694148B1 (en) 2017-10-31 2018-10-30 Configuration modification method for storage cluster, storage cluster and computer system
US16/862,591 US11360854B2 (en) 2017-10-31 2020-04-30 Storage cluster configuration change method, storage cluster, and computer system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711054330.9A CN109729129B (zh) 2017-10-31 2017-10-31 存储集群系统的配置修改方法、存储集群及计算机系统
CN201711054330.9 2017-10-31

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/862,591 Continuation US11360854B2 (en) 2017-10-31 2020-04-30 Storage cluster configuration change method, storage cluster, and computer system

Publications (1)

Publication Number Publication Date
WO2019085875A1 true WO2019085875A1 (zh) 2019-05-09

Family

ID=66293618

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/112580 WO2019085875A1 (zh) 2017-10-31 2018-10-30 存储集群的配置修改方法、存储集群及计算机系统

Country Status (4)

Country Link
US (1) US11360854B2 (zh)
EP (1) EP3694148B1 (zh)
CN (1) CN109729129B (zh)
WO (1) WO2019085875A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111475821A (zh) * 2020-01-17 2020-07-31 吉林大学 基于文件存储证明的区块链共识机制方法
CN112468596A (zh) * 2020-12-02 2021-03-09 苏州浪潮智能科技有限公司 一种集群仲裁方法、装置、电子设备及可读存储介质
CN112667449A (zh) * 2020-12-29 2021-04-16 新华三技术有限公司 一种集群管理方法及装置
CN114401457A (zh) * 2021-12-30 2022-04-26 广东职业技术学院 一种水表集抄方法和系统

Families Citing this family (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10990282B1 (en) 2017-11-28 2021-04-27 Pure Storage, Inc. Hybrid data tiering with cloud storage
US10834190B2 (en) 2018-01-18 2020-11-10 Portworx, Inc. Provisioning of clustered containerized applications
US11436344B1 (en) * 2018-04-24 2022-09-06 Pure Storage, Inc. Secure encryption in deduplication cluster
US11392553B1 (en) 2018-04-24 2022-07-19 Pure Storage, Inc. Remote data management
CN110493326B (zh) * 2019-08-02 2021-11-12 泰华智慧产业集团股份有限公司 基于zookeeper管理集群配置文件的系统和方法
CN112749178A (zh) * 2019-10-31 2021-05-04 华为技术有限公司 一种保证数据一致性的方法及相关设备
US11064051B2 (en) * 2019-12-11 2021-07-13 Vast Data Ltd. System and method for leader election in distributed storage systems
CN111125219A (zh) * 2019-12-18 2020-05-08 紫光云(南京)数字技术有限公司 一种修改云平台上Redis集群参数的方法
US11200254B2 (en) * 2020-01-14 2021-12-14 EMC IP Holding Company LLC Efficient configuration replication using a configuration change log
CN113225362B (zh) * 2020-02-06 2024-04-05 北京京东振世信息技术有限公司 服务器集群系统和服务器集群系统的实现方法
CN111400112B (zh) * 2020-03-18 2021-04-13 深圳市腾讯计算机系统有限公司 分布式集群的存储系统的写入方法、装置及可读存储介质
CN111444537B (zh) * 2020-03-24 2023-07-18 网络通信与安全紫金山实验室 一种适用于拟态环境的日志处理方法及系统
CN111628895B (zh) * 2020-05-28 2022-06-17 苏州浪潮智能科技有限公司 一种配置数据同步方法、装置、设备及可读存储介质
US11119872B1 (en) * 2020-06-02 2021-09-14 Hewlett Packard Enterprise Development Lp Log management for a multi-node data processing system
CN111770158B (zh) * 2020-06-24 2023-09-19 腾讯科技(深圳)有限公司 云平台恢复方法、装置、电子设备及计算机可读存储介质
US11755229B2 (en) * 2020-06-25 2023-09-12 EMC IP Holding Company LLC Archival task processing in a data storage system
CN112650624B (zh) * 2020-12-25 2023-05-16 浪潮(北京)电子信息产业有限公司 一种集群升级方法、装置、设备及计算机可读存储介质
US11593210B2 (en) 2020-12-29 2023-02-28 Hewlett Packard Enterprise Development Lp Leader election in a distributed system based on node weight and leadership priority based on network performance
CN113067733B (zh) * 2021-03-25 2022-08-26 支付宝(杭州)信息技术有限公司 具有隐私保护的多站点配置控制方法、装置以及设备
US20220358118A1 (en) * 2021-05-10 2022-11-10 International Business Machines Corporation Data synchronization in edge computing networks
JP2022180956A (ja) * 2021-05-25 2022-12-07 富士通株式会社 情報処理装置,プログラム及び情報処理方法
CN113422692A (zh) * 2021-05-28 2021-09-21 作业帮教育科技(北京)有限公司 一种K8s集群内节点故障检测及处理方法、装置及存储介质
CN116644119A (zh) * 2022-02-16 2023-08-25 华为技术有限公司 一种数据存储系统及方法
CN114448996B (zh) * 2022-03-08 2022-11-11 南京大学 基于计算存储分离框架下的冗余存储资源的共识方法和系统
CN114564458B (zh) * 2022-03-10 2024-01-23 苏州浪潮智能科技有限公司 集群间数据同步的方法、装置、设备和存储介质
CN115051913B (zh) * 2022-08-12 2022-10-28 杭州悦数科技有限公司 Raft配置变更方法和装置
CN116991666A (zh) * 2023-08-01 2023-11-03 合芯科技(苏州)有限公司 一种hdfs数据监控分析系统、方法、设备及介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105426439A (zh) * 2015-11-05 2016-03-23 腾讯科技(深圳)有限公司 一种元数据的处理方法和装置
CN107105032A (zh) * 2017-04-20 2017-08-29 腾讯科技(深圳)有限公司 节点设备运行方法及节点设备
CN107124305A (zh) * 2017-04-20 2017-09-01 腾讯科技(深圳)有限公司 节点设备运行方法及节点设备
CN107295080A (zh) * 2017-06-19 2017-10-24 北京百度网讯科技有限公司 应用于分布式服务器集群的数据存储方法和服务器

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8074107B2 (en) 2009-10-26 2011-12-06 Amazon Technologies, Inc. Failover and recovery for replicated data instances
US8812631B2 (en) * 2011-05-11 2014-08-19 International Business Machines Corporation Method and arrangement for operating a computer cluster
CN104283906B (zh) * 2013-07-02 2018-06-19 华为技术有限公司 分布式存储系统、集群节点及其区间管理方法
US9251017B2 (en) * 2014-03-25 2016-02-02 International Business Machines Corporation Handling failed cluster members when replicating a database between clusters
US9367410B2 (en) * 2014-09-12 2016-06-14 Facebook, Inc. Failover mechanism in a distributed computing system
US10496669B2 (en) * 2015-07-02 2019-12-03 Mongodb, Inc. System and method for augmenting consensus election in a distributed database
CN105159818B (zh) * 2015-08-28 2018-01-02 东北大学 内存数据管理中日志恢复方法及其仿真系统
US10387248B2 (en) * 2016-03-29 2019-08-20 International Business Machines Corporation Allocating data for storage by utilizing a location-based hierarchy in a dispersed storage network
CN105915369B (zh) * 2016-03-31 2019-04-12 北京奇艺世纪科技有限公司 一种配置信息管理方法及装置
CN106407042B (zh) 2016-09-06 2019-04-23 深圳市华成峰数据技术有限公司 一种基于开源数据库的跨数据中心容灾解决系统及方法
CN106878382B (zh) * 2016-12-29 2020-02-14 北京华为数字技术有限公司 一种分布式仲裁集群中动态改变集群规模的方法及装置
CN106953901B (zh) * 2017-03-10 2020-04-07 重庆邮电大学 一种提高消息传递性能的集群通信系统及其方法
US10742724B2 (en) * 2017-08-17 2020-08-11 Hewlett Packard Enterprise Development Lp Cluster computer system with failover handling
US11210187B1 (en) * 2020-06-30 2021-12-28 Oracle International Corporation Computer cluster with adaptive quorum rules

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105426439A (zh) * 2015-11-05 2016-03-23 腾讯科技(深圳)有限公司 一种元数据的处理方法和装置
CN107105032A (zh) * 2017-04-20 2017-08-29 腾讯科技(深圳)有限公司 节点设备运行方法及节点设备
CN107124305A (zh) * 2017-04-20 2017-09-01 腾讯科技(深圳)有限公司 节点设备运行方法及节点设备
CN107295080A (zh) * 2017-06-19 2017-10-24 北京百度网讯科技有限公司 应用于分布式服务器集群的数据存储方法和服务器

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3694148A4 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111475821A (zh) * 2020-01-17 2020-07-31 吉林大学 基于文件存储证明的区块链共识机制方法
CN111475821B (zh) * 2020-01-17 2023-04-18 吉林大学 基于文件存储证明的区块链共识机制方法
CN112468596A (zh) * 2020-12-02 2021-03-09 苏州浪潮智能科技有限公司 一种集群仲裁方法、装置、电子设备及可读存储介质
CN112468596B (zh) * 2020-12-02 2022-07-05 苏州浪潮智能科技有限公司 一种集群仲裁方法、装置、电子设备及可读存储介质
US11902095B2 (en) 2020-12-02 2024-02-13 Inspur Suzhou Intelligent Technology Co., Ltd. Cluster quorum method and apparatus, electronic device, and readable storage medium
CN112667449A (zh) * 2020-12-29 2021-04-16 新华三技术有限公司 一种集群管理方法及装置
CN112667449B (zh) * 2020-12-29 2024-03-08 新华三技术有限公司 一种集群管理方法及装置
CN114401457A (zh) * 2021-12-30 2022-04-26 广东职业技术学院 一种水表集抄方法和系统
CN114401457B (zh) * 2021-12-30 2023-08-01 广东职业技术学院 一种水表集抄方法和系统

Also Published As

Publication number Publication date
US11360854B2 (en) 2022-06-14
US20200257593A1 (en) 2020-08-13
CN109729129B (zh) 2021-10-26
EP3694148B1 (en) 2022-08-24
CN109729129A (zh) 2019-05-07
EP3694148A4 (en) 2020-12-16
EP3694148A1 (en) 2020-08-12

Similar Documents

Publication Publication Date Title
WO2019085875A1 (zh) 存储集群的配置修改方法、存储集群及计算机系统
US11163653B2 (en) Storage cluster failure detection
US10713134B2 (en) Distributed storage and replication system and method
US7680994B2 (en) Automatically managing the state of replicated data of a computing environment, and methods therefor
US9280430B2 (en) Deferred replication of recovery information at site switchover
JP5352115B2 (ja) ストレージシステム及びその監視条件変更方法
JP2019219954A (ja) クラスタストレージシステム、データ管理制御方法、データ管理制御プログラム
US20060203718A1 (en) Method, apparatus and program storage device for providing a triad copy of storage data
EP2643771B1 (en) Real time database system
US20140317438A1 (en) System, software, and method for storing and processing information
US20150169718A1 (en) System and method for supporting persistence partition discovery in a distributed data grid
GB2484086A (en) Reliability and performance modes in a distributed storage system
WO2014205847A1 (zh) 一种分区平衡子任务下发方法、装置与系统
US20230385244A1 (en) Facilitating immediate performance of volume resynchronization with the use of passive cache entries
CN115794499B (zh) 一种用于分布式块存储集群间双活复制数据的方法和系统
JP2007042001A (ja) 計算機システム、同期化処理方法、およびプログラム
CN113326251B (zh) 数据管理方法、系统、设备和存储介质
WO2015196692A1 (zh) 一种云计算系统以及云计算系统的处理方法和装置
CN112887367A (zh) 实现分布式集群高可用的方法、系统及计算机可读介质
CN116389233A (zh) 容器云管理平台主备切换系统、方法、装置和计算机设备
JP2009265973A (ja) データ同期システム、障害復旧方法、及び、プログラム
US20240028611A1 (en) Granular Replica Healing for Distributed Databases
CN117176799A (zh) 分布式系统的故障处理方法及相关设备
CN117520050A (zh) 一种跨idc数据湖容灾方法、系统、设备及介质
CN117725100A (zh) 数据库主从切换方法、系统、设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18872063

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2018872063

Country of ref document: EP

Effective date: 20200508