CN111708668A - Cluster fault processing method and device and electronic equipment - Google Patents


Publication number
CN111708668A
Authority
CN
China
Prior art keywords
node
cluster
nodes
root
slave
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010477541.9A
Other languages
Chinese (zh)
Other versions
CN111708668B (en)
Inventor
汤爱迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Cloud Network Technology Co Ltd
Original Assignee
Beijing Kingsoft Cloud Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Cloud Network Technology Co Ltd
Priority to CN202010477541.9A
Publication of CN111708668A
Application granted
Publication of CN111708668B
Status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G06F16/184 Distributed file systems implemented as replicated file system
    • G06F16/1844 Management specifically adapted to replicated file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases

Abstract

The invention relates to a processing method and device of cluster faults and electronic equipment, and relates to the field of cloud computing. The method comprises the following steps: receiving node information reported by a root node of a cluster, wherein the root node is determined by mutual communication among nodes in the cluster; determining the number of root nodes in the cluster according to the received node information; and determining that the cluster has a network partition fault when the number of the root nodes is more than one. The method can automatically find the fault of the database cluster under the condition of network partition, find the problem in time and avoid production accidents.

Description

Cluster fault processing method and device and electronic equipment
Technical Field
The invention relates to the technical field of cloud computing, and in particular to a method and a device for processing a cluster fault, and to electronic equipment.
Background
Databases (e.g., Redis) are widely used in various service scenarios for content identification. When a single machine runs into bottlenecks in memory, concurrency, traffic, and the like, a database cluster can be used to achieve high availability.
The nodes in the database cluster are divided into Master nodes and Slave nodes. A Master node is responsible for all read-write requests and for maintaining key cluster information; a Slave node is only responsible for replicating the data and state information of its Master node.
The nodes can communicate using the P2P Gossip protocol. For example, each Master node in the cluster periodically sends ping messages to the other Master nodes and receives their pong replies. If Master node A fails to communicate with Master node B for the whole of cluster-node-timeout, A marks B as subjectively offline and broadcasts to the Master nodes of the cluster the message that it considers B subjectively offline. When more than half of the Master nodes consider B subjectively offline, B is determined to be objectively offline, and fault discovery is complete.
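The majority rule described above can be illustrated with a minimal sketch. The class name `FailureDetector` and its methods are hypothetical names chosen for this illustration; they are not part of Redis or of the invention, and the sketch ignores timeouts and message transport.

```python
from collections import defaultdict


class FailureDetector:
    """Hypothetical bookkeeping of subjective-offline reports among Master nodes."""

    def __init__(self, master_ids):
        self.master_ids = set(master_ids)
        # target node id -> set of Master ids that consider it subjectively offline
        self.reports = defaultdict(set)

    def report_subjective_offline(self, reporter, target):
        # Only reports from known Master nodes about another node are counted.
        if reporter in self.master_ids and reporter != target:
            self.reports[target].add(reporter)

    def is_objectively_offline(self, target):
        # Objectively offline once more than half of the Masters agree.
        return len(self.reports[target]) > len(self.master_ids) / 2


detector = FailureDetector(["A", "B", "C", "D", "E"])
detector.report_subjective_offline("A", "B")
detector.report_subjective_offline("C", "B")
two_reports = detector.is_objectively_offline("B")    # 2 of 5 is not a majority
detector.report_subjective_offline("D", "B")
three_reports = detector.is_objectively_offline("B")  # 3 of 5 is a majority
```

Note that under a network partition no side may be able to gather such a majority, which is the gap the following paragraphs address.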
In the related art, the fault discovery function of the cluster can only handle the case in which, with the network operating normally, a server fault causes an individual Master node to fail; it cannot discover faults related to network partitions.
Therefore, it is necessary to provide a new technical solution for processing cluster failures.
Disclosure of Invention
An object of the present invention is to provide a new technical solution for cluster fault handling.
According to a first aspect of the present invention, there is provided a method for handling a cluster failure, where the cluster includes a plurality of nodes, and the plurality of nodes includes at least one master node corresponding to at least one slave node, where the method is implemented by a monitor outside the cluster, and includes:
receiving node information reported by a root node of the cluster, wherein the root node is determined through mutual communication among nodes in the cluster;
determining the number of root nodes in the cluster according to the received node information;
and determining that the cluster has a network partition fault when the number of the root nodes is more than one.
Optionally, after the determining that the cluster has a network partition failure, the method further includes:
sending a merging instruction to the root nodes of the cluster so as to enable more than one root node in the cluster to perform merging operation;
receiving merged feedback information of more than one root node in the cluster;
and under the condition that the merging feedback information indicates that the merging operation fails to be executed, confirming that the cluster has network partition faults.
Optionally, after the determining that the cluster has a network partition failure, the method further includes:
sending a state reporting instruction to a plurality of nodes of the cluster so that each node of the plurality of nodes reports its own state information to the monitor, wherein the state information comprises at least one of a node identifier, a node address, a node type, an identifier of the root node to which the node belongs, information of the master node to which the node belongs, and an identifier of a corresponding hash slot;
and receiving the state information sent by the plurality of nodes.
Optionally, after the receiving the status information sent by the plurality of nodes, the method further includes:
determining at least two network communication areas according to the identifier of a root node to which each node belongs, wherein the network communication areas correspond to the root nodes one to one;
detecting whether each master node and its corresponding slave nodes are located in the same network communication area;
and if a first master node and at least one of its slave nodes are located in a first network communication area and at least one other slave node of the first master node is located in a second network communication area, sending a release instruction to the cluster to clear the master-slave relationship between the first master node and the slave node in the second network communication area.
Optionally, after the receiving the status information sent by the plurality of nodes, the method further includes:
periodically sending a connectivity detection instruction to each node of the cluster according to a preset frequency;
determining whether offline nodes exist in a plurality of nodes in the cluster according to the response result of each node to the connectivity detection instruction;
and determining a new master node from the slave nodes of the offline node under the condition that the offline node exists in the cluster and the node type of the offline node is the master node.
Optionally, the determining a new master node from the slave nodes of the offline node includes:
sending an offset reporting instruction to the slave nodes of the offline node so that each slave node feeds back the offset of its own synchronized data;
and determining the slave node with the maximum offset from the slave nodes of the offline node, and sending a type conversion instruction to the slave node with the maximum offset so as to convert the slave node with the maximum offset into the master node.
Optionally, the method further comprises:
sending a hash slot clearing instruction to the cluster so as to clear the corresponding relation between the offline node and the corresponding hash slot;
and sending a hash slot distribution instruction to the cluster to establish a corresponding relation between the hash slot corresponding to the offline node and the slave node with the maximum offset.
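As an illustration of the two optional steps above (choosing the slave node with the maximum offset, then moving the hash slots of the offline node to it), the following is a minimal sketch. The function names `pick_new_master` and `reassign_slots` and the dictionary layouts are assumptions made for this example only; the actual instructions exchanged between monitor and cluster are not specified here.

```python
def pick_new_master(slave_offsets):
    """Return the slave id with the largest reported offset, or None if there are no slaves.

    slave_offsets: dict mapping slave node id -> offset of its synchronized data.
    """
    if not slave_offsets:
        return None
    return max(slave_offsets, key=slave_offsets.get)


def reassign_slots(slot_owner, offline_master, new_master):
    """Return a new slot -> owner map in which the offline master's slots belong to new_master."""
    return {
        slot: (new_master if owner == offline_master else owner)
        for slot, owner in slot_owner.items()
    }


offsets = {"s1": 100, "s2": 250, "s3": 180}
new_master = pick_new_master(offsets)

slots_before = {0: "m1", 1: "m1", 2: "m2"}
slots_after = reassign_slots(slots_before, offline_master="m1", new_master=new_master)
```

Choosing the largest offset favors the slave node that has replicated the most data from the offline master node, which limits the amount of data lost in the switch.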
Optionally, the method further comprises:
and sending a network partition recovery instruction to the cluster under the condition that the merging feedback information shows that the merging operation is successfully executed, so that the cluster is recovered to a normal operation state.
According to a second aspect of the present invention, there is provided a method for processing a cluster fault, which is implemented by a first node in a cluster, including:
receiving a parent node identifier sent by a second node in the cluster, wherein the parent node identifier of each node is initially the node's own identifier;
comparing the parent node identifier sent by the second node with its own parent node identifier, so as to update its own parent node identifier;
after at least one updating step, if its own parent node identifier is the same as its own identifier, determining that it is a root node;
and, in the case where it is a root node, sending its own node information to the monitor described above.
According to a third aspect of the present invention, there is provided an apparatus for processing a cluster failure, where the cluster includes a plurality of nodes, and the plurality of nodes includes at least one master node corresponding to at least one slave node, where the apparatus is applied to a monitor outside the cluster, and includes:
a first receiving module, configured to receive node information reported by a root node of the cluster, where the root node is determined by mutual communication between nodes in the cluster;
the first processing module is used for determining the number of root nodes in the cluster according to the received node information;
and the second processing module is used for determining that the cluster has a network partition fault when the number of the root nodes is more than one.
Optionally, the apparatus further comprises a review module configured to:
sending a merging instruction to the root nodes of the cluster so as to enable more than one root node in the cluster to perform merging operation;
receiving merged feedback information of more than one root node in the cluster;
and under the condition that the merging feedback information indicates that the merging operation fails to be executed, confirming that the cluster has network partition faults.
Optionally, the apparatus further comprises a status collection module configured to:
sending a state reporting instruction to a plurality of nodes of the cluster so that each node of the plurality of nodes reports its own state information to the monitor, wherein the state information comprises at least one of a node identifier, a node address, a node type, an identifier of the root node to which the node belongs, information of the master node to which the node belongs, and an identifier of a corresponding hash slot;
and receiving the state information sent by the plurality of nodes.
Optionally, the apparatus further comprises a master-slave control module configured to:
determining at least two network communication areas according to the identifier of a root node to which each node belongs, wherein the network communication areas correspond to the root nodes one to one;
detecting whether each master node and its corresponding slave nodes are located in the same network communication area;
and if a first master node and at least one of its slave nodes are located in a first network communication area and at least one other slave node of the first master node is located in a second network communication area, sending a release instruction to the cluster to clear the master-slave relationship between the first master node and the slave node in the second network communication area.
Optionally, the apparatus further comprises a failover module, the failover module further comprising:
the sending unit is used for regularly sending a connectivity detection instruction to each node of the cluster according to a preset frequency;
a detecting unit, configured to determine whether offline nodes exist in multiple nodes in the cluster according to a response result of each node to the connectivity detection instruction;
and the master selecting unit is used for determining a new master node from the slave nodes of the offline node under the condition that the offline node exists in the cluster and the node type of the offline node is the master node.
Optionally, the master selecting unit further comprises:
the offset obtaining subunit is configured to send an offset reporting instruction to the slave nodes of the offline node, so that each slave node feeds back the offset of its own synchronized data;
and the master-slave conversion subunit is used for determining the slave node with the maximum offset from the slave nodes of the offline node and sending a type conversion instruction to the slave node with the maximum offset so as to convert the slave node with the maximum offset into the master node.
Optionally, the apparatus further comprises a clearing module configured to:
sending a hash slot clearing instruction to the cluster so as to clear the corresponding relation between the offline node and the corresponding hash slot;
and sending a hash slot distribution instruction to the cluster to establish a corresponding relation between the hash slot corresponding to the offline node and the slave node with the maximum offset.
Optionally, the apparatus further comprises a recovery module configured to:
and sending a network partition recovery instruction to the cluster under the condition that the merging feedback information shows that the merging operation is successfully executed, so that the cluster is recovered to a normal operation state.
According to a fourth aspect of the present invention, there is provided a device for processing a cluster fault, which is applied to a first node in a cluster, and includes:
the receiving module is used for receiving a parent node identifier sent by a second node in the cluster, wherein the parent node identifier of each node is initially the node's own identifier;
the updating module is used for comparing the parent node identifier sent by the second node with the first node's own parent node identifier so as to update the first node's own parent node identifier;
the determining module is used for determining that the first node is a root node if, after at least one updating step, its own parent node identifier is the same as its own identifier;
a sending module, configured to send the first node's own node information to the monitor described above when the first node is a root node.
According to a fifth aspect of the present invention, there is also provided an electronic device, including a memory and a processor, where the memory stores executable commands, and the processor executes the executable commands to implement the method according to the first aspect or the second aspect of the present invention.
In the cluster fault processing method in this embodiment, the monitor acquires the root node information of the cluster, and determines whether the cluster has a network partition fault or not according to the root node information, so that a fault of the database cluster can be automatically found under the condition of network partition, a problem can be timely found, and a production accident can be avoided.
Other features of the present invention and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a schematic diagram of an electronic device that may be used to implement an embodiment of the invention.
Fig. 2 is a flowchart of a processing method of a cluster failure according to an embodiment of the present invention.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
< hardware configuration >
Fig. 1 shows a hardware configuration of an electronic device that can be used to implement an embodiment of the present invention.
Referring to fig. 1, an electronic device 1000 includes a processor 1100, a memory 1200, an interface device 1300, a communication device 1400, a display device 1500, and an input device 1600. The processor 1100 may be, for example, a central processing unit CPU, a micro control unit MCU, or the like. The memory 1200 includes, for example, a ROM (read only memory), a RAM (random access memory), a nonvolatile memory such as a hard disk, and the like. The interface device 1300 includes, for example, a USB interface, a serial interface, and the like. The communication device 1400 is, for example, a wired network card or a wireless network card. The display device 1500 is, for example, a liquid crystal display panel. The input device 1600 includes, for example, a touch screen, a keyboard, a mouse, a microphone, and the like.
In an embodiment applied to this description, the memory 1200 of the electronic device 1000 is used to store instructions for controlling the processor 1100 to operate in support of implementing a method according to any embodiment of this description. The skilled person can design the instructions according to the solution disclosed in the present specification. How the instructions control the operation of the processor is well known in the art and will not be described in detail herein.
It should be understood by those skilled in the art that although a plurality of devices of the electronic apparatus 1000 are shown in fig. 1, the electronic apparatus 1000 of the embodiments of the present specification may refer to only some of the devices, for example, only the processor 1100, the memory 1200 and the communication device 1400.
The hardware configuration shown in FIG. 1 is illustrative only and is not intended to limit the present invention, its application, or uses in any way.
< method examples >
The embodiment provides a method for processing a cluster fault, which is implemented by the electronic device 1000 shown in fig. 1, for example.
In this embodiment, the cluster includes a plurality of nodes, where the plurality of nodes includes at least one master node, and the master node corresponds to at least one slave node.
In addition, it should be further noted that the present invention is applicable to all database clusters, and in the following embodiments, a cluster of a Redis database is taken as an example, but is not limited to the Redis database, and therefore, the following example does not constitute a limitation to the present invention.
In this embodiment, the electronic device 1000 is a monitor outside the cluster. The monitor is a component dedicated to monitoring the cluster state, for example, a client software, which is capable of communicating with all nodes.
As shown in fig. 2, the method includes the following steps S1100-S1300.
In step S1100, node information reported by a root node of a cluster is received, where the root node is determined by mutual communication between nodes in the cluster.
In this embodiment, the root node is determined by mutual communication between nodes in the cluster. One root node corresponds to one network communication area. A network communication area is a set of nodes in which any two nodes can exchange data; that is, a connected relationship exists between any two nodes in the same network communication area.
In this embodiment, the connected relationship means that two nodes can directly or indirectly communicate.
In one example, the acquisition process of the root node includes: a first node in the plurality of nodes acquires parent node information of a second node based on a consistency protocol; the first node updates its own parent node information according to the parent node information of the second node and a preset updating rule; and the first node judges that it is a root node in the case that its parent node is the first node itself.
In the above example, the parent node of a node is initially the node itself.
In the above example, the consistency protocol is, for example, the Gossip protocol.
In the above example, the update rule is, for example, a node ID minimum rule or a node ID maximum rule.
In one example, the updating, by the first node, of its own parent node information according to the parent node information of the second node and a preset updating rule includes: in the case that the first node and the second node have different parent nodes, acquiring the root node of the first node and the root node of the second node; and the first node updates its own parent node information according to its own root node, the root node of the second node and the preset updating rule.
An example of obtaining the root node of the first node and the root node of the second node: in a chain E-F-G-H-K-J (where J is the root node), the parent node of E is F and the root node of E is J. E changes its parent node to J, so that the chain becomes E-J, and both the parent node and the root node of E are J.
The process of acquiring the root node can be implemented based on the Union-Find algorithm. A union-find set, also known as a disjoint-set data structure, maintains a collection of disjoint sets and provides two operations, merge (Union) and find (Find). find(i) returns the set to which i belongs; find(i) and find(j) are usually compared to determine whether i and j are connected, that is, whether they belong to the same set. The Union operation connects the sets containing i and j; after it is executed, all elements of the set of i are connected with all elements of the set of j.
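A minimal sketch of such a disjoint-set structure follows, reproducing the chain E-F-G-H-K-J from the example. The class `UnionFind` is an illustrative name, and the "minimum ID becomes the root" rule in `union` is one of the update rules mentioned above, chosen here as an assumption.

```python
class UnionFind:
    """Disjoint-set structure; each element initially is its own parent (its own root)."""

    def __init__(self, ids):
        self.parent = {i: i for i in ids}

    def find(self, x):
        # Walk up to the root: the element whose parent is itself.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every element on the path directly at the root.
        while self.parent[x] != root:
            next_node = self.parent[x]
            self.parent[x] = root
            x = next_node
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        # Assumed update rule: the smaller ID becomes the root of the merged set.
        new_root, old_root = (ra, rb) if ra < rb else (rb, ra)
        self.parent[old_root] = new_root
        return new_root

    def connected(self, a, b):
        return self.find(a) == self.find(b)


# Chain E-F-G-H-K-J from the example above, with J as the root node.
uf = UnionFind("EFGHKJ")
uf.parent.update({"E": "F", "F": "G", "G": "H", "H": "K", "K": "J"})
root_of_e = uf.find("E")
```

After `find("E")` returns, E points directly at J, which corresponds to the chain E-J described in the example; in this sketch the intermediate nodes F, G and H are compressed in the same pass.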
In one example, the obtaining process of the node information of the root node includes the following steps.
1. All Master and Slave nodes in the cluster exchange information within the cluster following the Gossip protocol, and the information carries the nodeId (node ID) of the sender's parent node.
2. When node A receives the information of node B, it checks whether the parent node of B is the same as its own parent node.
3. If they are the same, the two nodes are already in the same connected graph, and node A takes no further action.
4. If not, node A executes the find operation, following parent nodes until it finds the final root node (namely, the node whose parent node is itself).
5. The find operation updates the parent node to the root node, performing path compression.
6. Node A judges whether its root node and the root node of node B are the same; if so, the two nodes are already in the same connected graph, and node A takes no further action.
7. If the root nodes are different, then because A and B can communicate with each other, node A initiates a Union operation on its own root node and the root node of B; for example, the root node with the minimum nodeId is taken as the root node of the new connected graph, and the root nodes of both are updated. At this time, A and B are in the same connected graph.
7. Because Redis cluster communication follows the Gossip protocol, after a period of time all nodes that can communicate with each other have the same root node.
8. If a node's parent node is the node itself, the node is a root node; each connected graph has only one root node.
9. The root node periodically reports its own state to the monitor.
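The steps above can be sketched as a small simulation. This is a simplification under stated assumptions: the parent table is a single shared dictionary rather than per-node state carried in Gossip messages, `links` is a hypothetical list of node pairs that can currently reach each other, and the minimum-nodeId rule is used for the Union operation.

```python
def find_root(parent, node):
    """Follow parent links to the root (parent == self), compressing the path."""
    root = node
    while parent[root] != root:
        root = parent[root]
    while parent[node] != root:
        next_node = parent[node]
        parent[node] = root
        node = next_node
    return root


def gossip_round(parent, links):
    """Process one exchange per reachable pair; return True if any Union happened."""
    merged = False
    for a, b in links:
        if parent[a] == parent[b]:
            continue  # same parent: already in the same connected graph
        root_a, root_b = find_root(parent, a), find_root(parent, b)
        if root_a == root_b:
            continue  # same root: already in the same connected graph
        new_root = min(root_a, root_b)  # assumed rule: minimum nodeId wins
        parent[root_a] = parent[root_b] = new_root
        merged = True
    return merged


def root_nodes(parent):
    """A node is a root node when its parent is itself."""
    return sorted(n for n, p in parent.items() if p == n)


nodes = ["n1", "n2", "n3", "n4", "n5", "n6"]
parent = {n: n for n in nodes}

# Two partitions: {n1, n2, n3} and {n4, n5, n6} cannot reach each other.
partitioned_links = [("n1", "n2"), ("n2", "n3"), ("n4", "n5"), ("n5", "n6")]
while gossip_round(parent, partitioned_links):
    pass
roots_during_partition = root_nodes(parent)

# The partition heals: n3 and n4 can talk again.
healed_links = partitioned_links + [("n3", "n4")]
while gossip_round(parent, healed_links):
    pass
roots_after_heal = root_nodes(parent)
```

In this sketch each partition converges on one root node, so the number of root nodes equals the number of partitions, which is the quantity the monitor inspects in the next step.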
In step S1200, the number of root nodes in the cluster is determined according to the received node information.
In this embodiment, the monitor may determine the number of root node IDs according to the received node information, that is, the number of root nodes in the cluster.
In step S1300, when the number of root nodes is greater than one, it is determined that a network partition failure occurs in the cluster.
In this embodiment, the number of root nodes is greater than one, which means that the cluster network includes at least two partitions, that is, a network partition failure occurs.
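A minimal sketch of steps S1200 and S1300 on the monitor side is given below. The report layout (`root_id`, `timestamp`) and the reporting window `window_seconds` are assumptions introduced for illustration; the embodiment only requires that the monitor count the distinct root nodes that have reported.

```python
def count_roots(reports, now, window_seconds=10.0):
    """Count distinct root node IDs among reports received within the window."""
    recent_roots = {
        r["root_id"] for r in reports if now - r["timestamp"] <= window_seconds
    }
    return len(recent_roots)


def has_partition_fault(reports, now, window_seconds=10.0):
    """More than one reporting root node means the cluster is split into partitions."""
    return count_roots(reports, now, window_seconds) > 1


healthy_reports = [
    {"root_id": "n1", "timestamp": 100.0},
    {"root_id": "n1", "timestamp": 105.0},
]
split_reports = healthy_reports + [{"root_id": "n4", "timestamp": 104.0}]

healthy = has_partition_fault(healthy_reports, now=106.0)
split = has_partition_fault(split_reports, now=106.0)
```

Restricting the count to a recent window keeps a stale report from a former root node from being mistaken for a second partition; the window length would need to be chosen relative to the reporting period.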
In the method for processing the cluster fault in the embodiment, the monitor is used for acquiring the root node information of the cluster, and whether the cluster has the network partition fault is judged according to the root node information, so that the fault of the Redis cluster can be automatically found under the condition of network partition, the problem can be timely found, and the production accident can be avoided.
In one example, after determining that the cluster has a network partition failure, the method further comprises: sending a merging instruction to the root nodes of the cluster so as to enable more than one root node in the cluster to perform merging operation; receiving the combined feedback information of more than one root node in the cluster; and confirming that the cluster has a network partition fault in the case that the merging feedback information indicates that the merging operation fails to be executed.
Through the process, the network partition fault can be confirmed again, so that the accuracy of partition fault detection is guaranteed.
As an example of reconfirming the network partition failure, after the step 9, the method further includes the following steps:
10. If the Redis monitor receives information reported by only one root node in the same time period, the network is in a normal state and the nodes can communicate with each other; monitoring continues.
11. If the Redis monitor receives information reported by two or more root nodes, a network partition fault may have occurred in the Redis cluster.
12. The Redis monitor issues Union requests to the root nodes.
13. A Union request carries the address information of another node B. Node A, on receiving the Union request, first pings node B; if the ping succeeds, the two nodes perform the Union operation, taking the root node with the minimum nodeId as the root node of the new connected graph.
14. On success, the node updates its root node and returns Union success; on failure, it returns Union failure.
15. If all root nodes are merged successfully, the network is normal, and the Redis cluster is updated so that all nodes belong to one connected graph.
16. If any Union fails, it is confirmed that a network partition has occurred.
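The review in steps 10 to 16 can be sketched as follows. Several details are assumptions made for this illustration: the monitor tries every pair of reporting root nodes, `can_ping(a, b)` is a hypothetical callback standing in for node A pinging node B, and the minimum-nodeId rule selects the surviving root node.

```python
import itertools


def review_partition(root_ids, can_ping):
    """Ask root nodes to merge pairwise; return (partition_confirmed, surviving_roots)."""
    roots = sorted(set(root_ids))
    group = {r: r for r in roots}  # root id -> root id of the merged group

    def find(r):
        while group[r] != r:
            r = group[r]
        return r

    failures = []
    for a, b in itertools.combinations(roots, 2):
        if can_ping(a, b):
            ga, gb = find(a), find(b)
            if ga != gb:
                group[max(ga, gb)] = min(ga, gb)  # minimum nodeId becomes the root
        else:
            failures.append((a, b))  # this pair returns Union failure

    surviving = sorted({find(r) for r in roots})
    # Step 16: any Union failure confirms the network partition.
    return bool(failures), surviving


# r1 and r2 can reach each other; r3 is cut off from both.
reachable = {frozenset(("r1", "r2"))}
confirmed, surviving = review_partition(
    ["r1", "r2", "r3"], lambda a, b: frozenset((a, b)) in reachable
)

# False alarm: every pair of root nodes can merge.
recovered, merged_roots = review_partition(["r1", "r2", "r3"], lambda a, b: True)
```

The second call models the case in step 15, where the extra root nodes were only a transient state and the cluster collapses back into a single connected graph.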
In one example, in the event of a network partition failure, the method further comprises the following steps, in which the monitor detects the cluster state: sending a state reporting instruction to a plurality of nodes of the cluster so that each node of the plurality of nodes reports its own state information to the monitor, wherein the state information comprises at least one of a node identifier, a node address, a node type, an identifier of the root node to which the node belongs, information of the master node to which the node belongs, and an identifier of a corresponding hash slot; and receiving the state information sent by the plurality of nodes.
In one example of the monitor detecting the cluster state, the following steps 1-4 are specifically included.
1. The monitor informs all root nodes that a network partition has occurred, and each root node propagates to the nodes it can reach, according to the Gossip protocol, the message that the cluster is currently in network partition mode.
2. All nodes switch to the network partition mode after receiving the change information, and update their parent node information to the root node.
3. All nodes in the network partition mode periodically report their states to the Redis monitor. The state includes the nodeId, the address, the root node to which the node belongs, and the node type (Master/Slave); a Master node also reports its hash slot information, and a Slave node also reports the nodeId of the Master node to which it belongs.
4. In the network partition mode, the Redis cluster loses the automatic fault transfer capability, fault discovery and fault transfer responsibilities are handed over from the Redis cluster to a Redis monitor, and the monitor is responsible for managing and maintaining node metadata information.
A Redis cluster does not use consistent hashing but instead introduces the concept of a hash slot. A Redis cluster has 16384 built-in hash slots. When a key-value pair is to be placed in the cluster, Redis computes the CRC16 checksum of the key and takes the remainder of the result modulo 16384, so that each key corresponds to a hash slot numbered between 0 and 16383. Redis maps the hash slots to the different nodes in approximately equal numbers.
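The slot calculation can be sketched in a few lines. Redis uses the XMODEM variant of CRC16, which Python's standard `binascii.crc_hqx` computes when started from 0. The helper `assign_slot_ranges` is a hypothetical illustration of an approximately equal split and is not claimed to reproduce the exact ranges any Redis tool would choose; the sketch also ignores Redis hash tags (the `{...}` part of a key).

```python
import binascii

HASH_SLOTS = 16384


def key_hash_slot(key: str) -> int:
    """CRC16 (XMODEM variant) of the key, modulo the number of hash slots."""
    return binascii.crc_hqx(key.encode("utf-8"), 0) % HASH_SLOTS


def assign_slot_ranges(node_ids):
    """Split slots 0..16383 into contiguous, approximately equal ranges, one per node."""
    base, extra = divmod(HASH_SLOTS, len(node_ids))
    ranges = {}
    start = 0
    for index, node_id in enumerate(node_ids):
        size = base + (1 if index < extra else 0)
        ranges[node_id] = range(start, start + size)
        start += size
    return ranges


slot = key_hash_slot("user:1000")
ranges = assign_slot_ranges(["m1", "m2", "m3"])
owner = next(node for node, slots in ranges.items() if slot in slots)
```

Because the slot of a key depends only on the key, moving a slot from one node to another (as the monitor does during failover) does not change where clients compute the key to live, only which node serves that slot.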
In the above example, the cluster metadata, which is originally maintained by every node in the cluster and kept up to date by the nodes broadcasting updates to one another, is instead maintained by the monitor. This helps avoid metadata inconsistency under a network partition failure and helps avoid data loss during the network partition.
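As a hedged illustration of the state information that each node reports and of how the monitor might keep it, the following sketch defines a report record and groups the reporting nodes by the root node they belong to. The class name `StateReport`, its field names, and the function `group_by_root` are assumptions made for this example and do not appear in the embodiment.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StateReport:
    """One periodic report sent by a node in network partition mode."""
    node_id: str
    address: str
    root_id: str                       # root node the reporter belongs to
    node_type: str                     # "master" or "slave"
    slots: List[int] = field(default_factory=list)  # reported by masters
    master_id: Optional[str] = None    # reported by slaves


def group_by_root(reports: List[StateReport]) -> dict:
    """Group node ids by root id; each group is one connected area."""
    areas = defaultdict(list)
    for report in reports:
        areas[report.root_id].append(report.node_id)
    return dict(areas)


if __name__ == "__main__":
    reports = [
        StateReport("m1", "10.0.0.1:6379", "r1", "master", slots=[0, 1]),
        StateReport("s1", "10.0.0.2:6379", "r1", "slave", master_id="m1"),
        StateReport("s2", "10.0.1.2:6379", "r2", "slave", master_id="m1"),
    ]
    print(group_by_root(reports))
```

Keeping the reports in one place on the monitor, rather than on each node, is what lets a single consistent view of the metadata survive while the partitions cannot reach each other.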
In one example, when a network partition failure occurs in the cluster, the method further includes reassigning the master and slave nodes through the following steps: determining at least two network communication areas according to the identifier of the root node to which each node belongs, wherein the network communication areas correspond one-to-one to the root nodes; detecting whether each master node and its corresponding slave nodes are located in the same network communication area; and if a first master node and at least one of its slave nodes are located in a first network communication area while at least one other slave node of the first master node is located in a second network communication area, sending a release instruction to the cluster to clear the master-slave relationship between the first master node and the slave node in the second network communication area.
In another example, master-slave node assignment includes: when the target master node and at least one target slave node are located in the same partition, keeping the target master node and that target slave node unchanged and releasing the slave nodes of the target master node that are located in other partitions; or, when the target master node is located in a first partition and at least two slave nodes of the target master node are located in a second partition, releasing the target master node and selecting a new master node from among those at least two slave nodes.
As an example, the master and slave nodes are reassigned through the following steps.
1. If the Master node and all of its Slave nodes are in the same network partition, the monitor makes no change and continues to check the next Master node.
2. If the Master node and some of its Slave nodes are in different partitions, the monitor checks whether the number of Slave nodes in the same partition as the Master is greater than or equal to 1; if so, it releases the Slave node resources that are not in that partition and continues to check the next node.
3. If the Master node and all of its Slave nodes are in different partitions, the monitor checks whether the partition where the Master node is located has resources for a new Slave node; if so, it adds a new Slave node there and releases the Slave resources in the other partitions.
4. If the Master node and all of its Slave nodes are in different partitions and the partition where the Master is located has no resources for allocating a new Slave, the monitor checks whether the number of Slave nodes in some partition is greater than or equal to 2; if so, it promotes the Slave node with the largest offset to Master, so that the one-Master, multiple-Slave topology is preserved, and then releases the original Master node resource and the Slave node resources in the remaining partitions.
5. If no partition holds 2 or more of the Slave nodes, the monitor checks whether the partition where a Slave node is located has resources to start a new node; if so, it adds a new Slave node, promotes the original Slave node to Master, and then releases the resources that are no longer used.
6. If none of the above conditions is met, the monitor notifies the administrator, for example by e-mail, to add machines.
7. After the Slave nodes are reassigned, the monitor broadcasts a ping message to the cluster to update the information of all Master nodes and Slave nodes.
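The decision made for a single Master node in steps 1-6 above can be sketched as the following function. It is a simplified illustration under stated assumptions: the names `Slave`, `plan_for_master` and `roots_with_capacity` are hypothetical, a partition is identified by the identifier of its root node, "has resources" is modelled as membership in a set, and in step 4 the first partition found with two or more Slave nodes is used.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Slave:
    node_id: str
    root_id: str   # root node (partition) the slave belongs to
    offset: int    # replication offset reported by the slave


def plan_for_master(master_root: str, slaves: List[Slave],
                    roots_with_capacity: Set[str]) -> str:
    """Return which of steps 1-6 applies to one master and its slaves."""
    same = [s for s in slaves if s.root_id == master_root]
    if len(same) == len(slaves):
        return "1: no change"
    if same:
        return "2: release slaves outside the master's partition"
    if master_root in roots_with_capacity:
        return "3: add a slave next to the master, release the others"
    # The master is alone and cannot get a new slave in its own partition.
    per_root = Counter(s.root_id for s in slaves)
    crowded = [root for root, count in per_root.items() if count >= 2]
    if crowded:
        candidates = [s for s in slaves if s.root_id == crowded[0]]
        best = max(candidates, key=lambda s: s.offset)
        return f"4: promote {best.node_id}, release the old master"
    for s in slaves:
        if s.root_id in roots_with_capacity:
            return f"5: add a slave beside {s.node_id}, then promote it"
    return "6: notify the administrator"


if __name__ == "__main__":
    slaves = [Slave("s1", "B", 10), Slave("s2", "B", 20)]
    print(plan_for_master("A", slaves, set()))
```

The function only names the applicable step; issuing the actual release, promotion and broadcast instructions (step 7) is left out of the sketch.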
In the above example, the master and slave nodes are automatically reassigned during network partitioning, saving labor and time and ensuring the availability of the cluster in the partitioned state.
In one example, when a network partition failure occurs in the cluster, the method further includes the following failover steps: periodically sending a connectivity detection instruction to each node of the cluster at a preset frequency; determining, according to each node's response to the connectivity detection instruction, whether any of the plurality of nodes in the cluster is offline; when an offline node exists in the cluster and the node type of the offline node is master node, determining a new master node from the slave nodes of the offline node; sending a hash slot clearing instruction to the cluster so as to clear the correspondence between the offline node and its hash slots; and sending a hash slot allocation instruction to the cluster to establish a correspondence between the hash slots of the offline node and the slave node with the largest offset.
In the above example, determining a new master node from the slave nodes of the offline node includes: sending an offset reporting instruction to the slave nodes of the offline node so that each slave node feeds back the offset of its own synchronized data; and determining, among the slave nodes of the offline node, the slave node with the largest offset, and sending a type conversion instruction to that slave node so as to convert it into the master node.
As an example, the process of failover specifically includes the following steps.
1. When the monitor finds that a node has failed to communicate for the entire cluster-node-timeout period, the node is considered faulty and is marked as down.
2. If the node is a Slave node, the monitor takes no action; after the Slave node recovers, it automatically synchronizes with its Master node and fully replicates the Master's existing data.
3. If the node is a Master node, the monitor sends a qualification check request to all Slave nodes under that Master node, and the Slave nodes report their offsets to the monitor.
4. The monitor selects the Slave node with the largest offset to replace the Master node, and that Slave node is promoted to Master.
5. The clusterDelSlot operation is executed to revoke the slots for which the failed Master node was responsible, and the clusterAddSlot operation is executed to assign those slots to the promoted Slave node.
6. The monitor broadcasts a pong message to the cluster, notifying all nodes in the cluster that the Slave node has become a Master node and has taken over the slot information of the failed Master.
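The failover steps above can be sketched on an in-memory model of the monitor's metadata. This is an assumed illustration rather than the embodiment's implementation: the names `Node`, `is_down` and `fail_over` are hypothetical, the slot transfer stands in for the clusterDelSlot and clusterAddSlot operations, and the final broadcast of step 6 is omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Node:
    node_id: str
    is_master: bool
    master_id: Optional[str] = None   # set for slaves only
    offset: int = 0                   # replication offset of a slave
    slots: Set[int] = field(default_factory=set)
    last_seen: float = 0.0            # time of the last successful reply


def is_down(node: Node, now: float, timeout: float) -> bool:
    """Step 1: a node silent for the whole timeout is marked as down."""
    return now - node.last_seen > timeout


def fail_over(nodes: Dict[str, Node], failed_id: str) -> Optional[str]:
    """Steps 2-5: promote the slave with the largest offset, move slots."""
    failed = nodes[failed_id]
    if not failed.is_master:
        return None                   # step 2: a failed slave needs no action
    slaves = [n for n in nodes.values() if n.master_id == failed_id]
    if not slaves:
        return None                   # nothing to promote
    new_master = max(slaves, key=lambda n: n.offset)    # steps 3 and 4
    new_master.is_master, new_master.master_id = True, None
    # Step 5: revoke the slots from the failed master and hand them over.
    new_master.slots, failed.slots = set(failed.slots), set()
    return new_master.node_id


if __name__ == "__main__":
    nodes = {
        "m1": Node("m1", True, slots={0, 1, 2}),
        "s1": Node("s1", False, master_id="m1", offset=100),
        "s2": Node("s2", False, master_id="m1", offset=250),
    }
    print(fail_over(nodes, "m1"))
```

Choosing the largest offset favours the Slave node that has replicated the most data from the failed Master, which limits the amount of data lost by the promotion.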
In the above example, failover is automatically performed during network partitioning, saving labor and time, and ensuring that the cluster is highly available.
In one example, when a network partition failure occurs in the cluster, the method further includes the following cluster recovery step: when the merge feedback information indicates that the merge operation was executed successfully, sending a network partition recovery instruction to the cluster so that the cluster returns to its normal operating state.
As an example, the process of cluster recovery includes the following steps.
1. When the monitor detects that all nodes again belong to the same root node, the network partition is deemed to have recovered.
2. The monitor issues a network partition recovery request to the root node; the root node propagates, via the Gossip protocol, the message that the cluster is in the normal state, and every node in the connected graph switches to the normal state after receiving the message.
3. The monitor hands fault discovery and automatic failover back to the cluster's own self-management, and the cluster returns to its normal operating state.
In the above example, after the network partition recovers, the monitor drives the recovery so that the cluster returns to self-management, which helps restore the normal state promptly.
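The handover of responsibility between the cluster and the monitor can be pictured as a two-state switch driven by the reported root nodes. The sketch below is illustrative only; `Mode`, `MonitorState`, `observe` and `failover_owner` are hypothetical names, and the input is assumed to be a mapping from each node identifier to the identifier of the root node it currently belongs to.

```python
from enum import Enum
from typing import Dict


class Mode(Enum):
    NORMAL = "normal"
    PARTITION = "partition"


class MonitorState:
    """Tracks whether the cluster or the monitor owns failover."""

    def __init__(self) -> None:
        self.mode = Mode.NORMAL

    @property
    def failover_owner(self) -> str:
        # In partition mode the monitor performs fault discovery and
        # failover; otherwise the cluster manages itself.
        return "monitor" if self.mode is Mode.PARTITION else "cluster"

    def observe(self, root_of: Dict[str, str]) -> Mode:
        """Update the mode from the root id reported by every node."""
        roots = set(root_of.values())
        self.mode = Mode.PARTITION if len(roots) > 1 else Mode.NORMAL
        return self.mode


if __name__ == "__main__":
    state = MonitorState()
    state.observe({"n1": "r1", "n2": "r2"})
    print(state.mode, state.failover_owner)
    state.observe({"n1": "r1", "n2": "r1"})
    print(state.mode, state.failover_owner)
```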
In this embodiment, the Redis monitor uses a Union-Find algorithm to detect whether a network partition exists in the Redis cluster; if one exists, responsibility for fault discovery and failover is handed over from the Redis cluster to the Redis monitor. The Redis monitor is also responsible for maintaining the cluster metadata: if a Master node and its Slave nodes are found to be in networks that are not connected to each other, the Slave nodes are reassigned so that each Slave is in the same network environment as its Master. The monitor is responsible for detecting node failures; when a node fails, the monitor acts as the leader, performs the failover, and notifies all nodes. When the monitor detects that the network partition has recovered, responsibility for fault discovery and failover is handed back from the Redis monitor to the Redis cluster, ensuring elastic recovery.
This embodiment further provides another method for handling a cluster fault, implemented by a first node in the cluster, including the following steps: receiving a parent node identifier sent by a second node in the cluster, wherein the parent node identifier of each node is initially the node's own identifier; comparing the parent node identifier sent by the second node with the first node's own parent node identifier so as to update the first node's own parent node identifier; after at least one such update, determining that the first node is a root node if its parent node identifier is the same as its own identifier; and, when the first node is a root node, sending its node information to the monitor.
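The node-side exchange of parent node identifiers can be simulated as below. The comparison rule is an assumption made for this sketch: a node adopts the received parent identifier when it is smaller than its own, which is a simplified label-propagation stand-in for Union-Find and is not stated in this passage. The function name `find_roots` and the representation of connectivity as a list of bidirectional links are likewise hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def find_roots(node_ids: Iterable[str],
               links: List[Tuple[str, str]]) -> Tuple[Set[str], Dict[str, str]]:
    """Exchange parent ids over reachable links until nothing changes."""
    node_ids = list(node_ids)
    parent = {n: n for n in node_ids}   # each node starts as its own parent
    changed = True
    while changed:
        changed = False
        for a, b in links:
            # Each link carries the sender's parent id in both directions.
            for receiver, sender in ((a, b), (b, a)):
                # Assumed rule: keep the smaller of the two identifiers.
                if parent[sender] < parent[receiver]:
                    parent[receiver] = parent[sender]
                    changed = True
    # A node whose parent id is still its own id is a root node.
    roots = {n for n in node_ids if parent[n] == n}
    return roots, parent


if __name__ == "__main__":
    # Two groups that cannot reach each other: {a, b, c} and {d, e}.
    roots, parent = find_roots("abcde", [("a", "b"), ("b", "c"), ("d", "e")])
    print(roots, parent)
    print("network partition" if len(roots) > 1 else "healthy")
```

Each connected group of nodes converges on exactly one root node, so the number of root nodes that report to the monitor equals the number of mutually unreachable groups.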
< apparatus embodiment >
The embodiment provides a processing apparatus for a cluster fault, where a cluster includes a plurality of nodes, the plurality of nodes includes at least one master node, and the master node corresponds to at least one slave node, where the apparatus is applied to a monitor outside the cluster, and includes a first receiving module, a first processing module, and a second processing module.
The first receiving module is used for receiving node information reported by a root node of a cluster, wherein the root node is determined by mutual communication among nodes in the cluster.
The first processing module is used for determining the number of root nodes in the cluster according to the received node information.
The second processing module is used for determining that the cluster has a network partition fault when the number of root nodes is more than one.
In one example, the apparatus further comprises a review module used for: sending a merge instruction to the root nodes of the cluster so that the more than one root nodes in the cluster perform a merge operation; receiving merge feedback information from the more than one root nodes in the cluster; and confirming that the cluster has a network partition fault when the merge feedback information indicates that the merge operation failed.
In one example, the apparatus further comprises a state collection module used for: sending a state reporting instruction to the plurality of nodes of the cluster so that each node reports its own state information to the monitor, wherein the state information includes at least one of a node identifier, a node address, a node type, an identifier of the root node to which the node belongs, information on the master node to which the node belongs, and an identifier of the corresponding hash slot; and receiving the state information sent by the plurality of nodes.
In one example, the apparatus further comprises a master-slave control module used for: determining at least two network communication areas according to the identifier of the root node to which each node belongs, wherein the network communication areas correspond one-to-one to the root nodes; detecting whether each master node and its corresponding slave nodes are located in the same network communication area; and if a first master node and at least one of its slave nodes are located in a first network communication area while at least one other slave node of the first master node is located in a second network communication area, sending a release instruction to the cluster to clear the master-slave relationship between the first master node and the slave node in the second network communication area.
In one example, the apparatus further comprises a failover module, the failover module further comprising: the sending unit is used for regularly sending a connectivity detection instruction to each node of the cluster according to a preset frequency; the detection unit is used for determining whether a plurality of nodes in the cluster have offline nodes according to the response result of each node to the connectivity detection instruction; and the master selecting unit is used for determining a new master node from the slave nodes of the offline node under the condition that the offline node exists in the cluster and the node type of the offline node is the master node.
In one example, the master selecting unit further comprises: an offset acquisition subunit used for sending an offset reporting instruction to the slave nodes of the offline node so that each slave node feeds back the offset of its own synchronized data; and a master-slave conversion subunit used for determining, among the slave nodes of the offline node, the slave node with the largest offset and sending a type conversion instruction to that slave node so as to convert it into the master node.
In one example, the apparatus further comprises a clearing module used for: sending a hash slot clearing instruction to the cluster so as to clear the correspondence between the offline node and its hash slots; and sending a hash slot allocation instruction to the cluster to establish a correspondence between the hash slots of the offline node and the slave node with the largest offset.
In one example, the apparatus further comprises a recovery module used for: sending a network partition recovery instruction to the cluster when the merge feedback information indicates that the merge operation was executed successfully, so that the cluster returns to its normal operating state.
This embodiment further provides a device for handling a cluster fault, applied to a first node in a cluster, including: a receiving module used for receiving a parent node identifier sent by a second node in the cluster, wherein the parent node identifier of each node is initially the node's own identifier; an updating module used for comparing the parent node identifier sent by the second node with the first node's own parent node identifier so as to update the first node's own parent node identifier; a determining module used for determining that the first node is a root node if, after at least one update, its parent node identifier is the same as its own identifier; and a sending module used for sending the first node's node information to the monitor when the first node is a root node.
The processing apparatus for cluster failure in this embodiment can implement each step described in the method embodiment of the present invention, and can also implement the same technical effect, which is not described herein again.
< electronic device embodiment >
The embodiment provides an electronic device, which includes a processor and a memory, where the memory stores machine executable instructions capable of being executed by the processor, and the processor executes the machine executable instructions to implement the method for handling cluster faults described in the method embodiment of the present invention.
The electronic device in this embodiment can implement each step described in the method embodiment of the present invention, and can also implement the same technical effect, which is not described herein again.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing an electronic circuit, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), with state information of computer-readable program instructions, which can execute the computer-readable program instructions.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, by software, and by a combination of software and hardware are equivalent.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims (12)

1. A method for handling a cluster failure, wherein the cluster includes a plurality of nodes, and wherein the plurality of nodes includes at least one master node corresponding to at least one slave node, and wherein the method is implemented by a monitor outside the cluster, and comprises:
receiving node information reported by a root node of the cluster, wherein the root node is determined through mutual communication among nodes in the cluster;
determining the number of root nodes in the cluster according to the received node information;
and determining that the cluster has a network partition fault when the number of the root nodes is more than one.
2. The method of claim 1, wherein after the determining that the cluster has a network partition failure, the method further comprises:
sending a merging instruction to the root nodes of the cluster so as to enable more than one root node in the cluster to perform merging operation;
receiving merge feedback information from the more than one root nodes in the cluster;
and under the condition that the merging feedback information indicates that the merging operation fails to be executed, confirming that the cluster has network partition faults.
3. The method of claim 1, wherein after the determining that the cluster has a network partition failure, the method further comprises:
sending a state reporting instruction to a plurality of nodes of the cluster so that each node of the plurality of nodes reports its own state information to the monitor, wherein the state information comprises at least one of a node identifier, a node address, a node type, an identifier of a root node to which the node belongs, information of a master node to which the node belongs, and an identifier of a corresponding hash slot;
and receiving the state information sent by the plurality of nodes.
4. The method of claim 3, wherein after said receiving the status information sent by the plurality of nodes, the method further comprises:
determining at least two network communication areas according to the identifier of a root node to which each node belongs, wherein the network communication areas correspond to the root nodes one to one;
detecting whether each master node and the corresponding slave node are located in the same network communication area;
and if a first master node and at least one corresponding slave node are located in a first network communication area and at least one slave node of the first master node is located in a second network communication area, sending a release instruction to the cluster to clear the master-slave relationship between the first master node and the slave node in the second network communication area.
5. The method of claim 3, wherein after said receiving the status information sent by the plurality of nodes, the method further comprises:
periodically sending a connectivity detection instruction to each node of the cluster according to a preset frequency;
determining whether offline nodes exist in a plurality of nodes in the cluster according to the response result of each node to the connectivity detection instruction;
and determining a new master node from the slave nodes of the offline node under the condition that the offline node exists in the cluster and the node type of the offline node is the master node.
6. The method of claim 5, wherein determining a new master node from the slave nodes of the offline node comprises:
sending an offset reporting instruction to the slave nodes of the offline node so that each slave node feeds back the offset of its own synchronized data;
and determining the slave node with the maximum offset from the slave nodes of the offline node, and sending a type conversion instruction to the slave node with the maximum offset so as to convert the slave node with the maximum offset into the master node.
7. The method of claim 5, further comprising:
sending a hash slot clearing instruction to the cluster so as to clear the corresponding relation between the offline node and the corresponding hash slot;
and sending a hash slot allocation instruction to the cluster to establish a corresponding relation between the hash slot corresponding to the offline node and the slave node with the maximum offset.
8. The method of claim 1, further comprising:
and sending a network partition recovery instruction to the cluster under the condition that the merging feedback information shows that the merging operation is successfully executed, so that the cluster is recovered to a normal operation state.
9. A method for handling a cluster failure, implemented by a first node in a cluster, comprising:
receiving a parent node identifier sent by a second node in the cluster, wherein the parent node identifier of each node is initially the node's own identifier;
comparing the parent node identifier sent by the second node with the first node's own parent node identifier so as to update the first node's own parent node identifier;
after at least one updating step, determining that the first node is a root node if its parent node identifier is the same as its own identifier;
and when the first node is a root node, sending its own node information to the monitor of claim 1.
10. An apparatus for processing cluster failure, wherein the cluster comprises a plurality of nodes, and the plurality of nodes comprises at least one master node corresponding to at least one slave node, and wherein the apparatus is applied to a monitor outside the cluster, and comprises:
a first receiving module, configured to receive node information reported by a root node of the cluster, where the root node is determined by mutual communication between nodes in the cluster;
the first processing module is used for determining the number of root nodes in the cluster according to the received node information;
and the second processing module is used for determining that the cluster has a network partition fault when the number of the root nodes is more than one.
11. An apparatus for processing a cluster failure, applied to a first node in a cluster, comprising:
a receiving module, configured to receive a parent node identifier sent by a second node in the cluster, wherein the parent node identifier of each node is initially the node's own identifier;
an updating module, configured to compare the parent node identifier sent by the second node with the first node's own parent node identifier so as to update the first node's own parent node identifier;
a determining module, configured to determine that the first node is a root node if, after at least one updating step, its parent node identifier is the same as its own identifier;
a sending module, configured to send the first node's own node information to the monitor according to claim 1 when the first node is a root node.
12. An electronic device comprising a memory storing executable commands and a processor that, when executing the executable commands, implements the method of any one of claims 1-9.
CN202010477541.9A 2020-05-29 2020-05-29 Cluster fault processing method and device and electronic equipment Active CN111708668B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010477541.9A CN111708668B (en) 2020-05-29 2020-05-29 Cluster fault processing method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN111708668A (en) 2020-09-25
CN111708668B CN111708668B (en) 2023-07-07

Family

ID=72538409

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010477541.9A Active CN111708668B (en) 2020-05-29 2020-05-29 Cluster fault processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN111708668B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113315657A (en) * 2021-05-26 2021-08-27 中国电信集团系统集成有限责任公司 Telecommunication transmission network customer influence analysis method and system based on parallel search set
CN113810216A (en) * 2020-12-31 2021-12-17 京东科技控股股份有限公司 Cluster fault switching method and device and electronic equipment
CN115037595A (en) * 2022-04-29 2022-09-09 北京华耀科技有限公司 Network recovery method, device, equipment and storage medium
CN115037595B (en) * 2022-04-29 2024-04-23 北京华耀科技有限公司 Network recovery method, device, equipment and storage medium

Citations (11)

Publication number Priority date Publication date Assignee Title
CN1552020A (en) * 2001-07-05 2004-12-01 Method for ensuring operation during node failures and network partitions in a clustered message passing server
CN102594596A (en) * 2012-02-15 2012-07-18 华为技术有限公司 Method and device for recognizing available partitions, and clustering network system
JP2012209625A (en) * 2011-03-29 2012-10-25 Nec Corp System and method for reducing wiring complexity in cluster system
US20130179729A1 (en) * 2012-01-05 2013-07-11 International Business Machines Corporation Fault tolerant system in a loosely-coupled cluster environment
US20140325257A1 (en) * 2013-04-29 2014-10-30 King Fahd University Of Petroleum And Minerals Wsan simultaneous failures recovery method
CN106656624A (en) * 2017-01-04 2017-05-10 合肥康捷信息科技有限公司 Optimization method based on Gossip communication protocol and Raft election algorithm
US20170364423A1 (en) * 2016-06-21 2017-12-21 EMC IP Holding Company LLC Method and apparatus for failover processing
US20180227363A1 (en) * 2017-02-08 2018-08-09 Vmware, Inc. Maintaining partition-tolerant distributed metadata
CN108551765A (en) * 2015-09-30 2018-09-18 华睿泰科技有限责任公司 input/output isolation optimization
CN109040212A (en) * 2018-07-24 2018-12-18 苏州科达科技股份有限公司 Equipment access server cluster method, system, equipment and storage medium
US20190235978A1 (en) * 2018-01-30 2019-08-01 EMC IP Holding Company LLC Data Protection Cluster System Supporting Multiple Data Tiers

Patent Citations (11)

Publication number Priority date Publication date Assignee Title
CN1552020A (en) * 2001-07-05 2004-12-01 Method for ensuring operation during node failures and network partitions in a clustered message passing server
JP2012209625A (en) * 2011-03-29 2012-10-25 Nec Corp System and method for reducing wiring complexity in cluster system
US20130179729A1 (en) * 2012-01-05 2013-07-11 International Business Machines Corporation Fault tolerant system in a loosely-coupled cluster environment
CN102594596A (en) * 2012-02-15 2012-07-18 华为技术有限公司 Method and device for recognizing available partitions, and clustering network system
US20140325257A1 (en) * 2013-04-29 2014-10-30 King Fahd University Of Petroleum And Minerals Wsan simultaneous failures recovery method
CN108551765A (en) * 2015-09-30 2018-09-18 华睿泰科技有限责任公司 input/output isolation optimization
US20170364423A1 (en) * 2016-06-21 2017-12-21 EMC IP Holding Company LLC Method and apparatus for failover processing
CN106656624A (en) * 2017-01-04 2017-05-10 合肥康捷信息科技有限公司 Optimization method based on Gossip communication protocol and Raft election algorithm
US20180227363A1 (en) * 2017-02-08 2018-08-09 Vmware, Inc. Maintaining partition-tolerant distributed metadata
US20190235978A1 (en) * 2018-01-30 2019-08-01 EMC IP Holding Company LLC Data Protection Cluster System Supporting Multiple Data Tiers
CN109040212A (en) * 2018-07-24 2018-12-18 苏州科达科技股份有限公司 Equipment access server cluster method, system, equipment and storage medium

Non-Patent Citations (4)

Title
JAVASHUO: "Redis集群进阶之路", pages 1 - 30 *
PCULPA: "网络故障后, RabbitMQ群集未重新连接", pages 1 - 2 *
佚名: "模拟RabbitMQ网络分区", pages 1 - 3 *
阿卡牛: "网络分区故障", pages 1 *

Cited By (5)

Publication number Priority date Publication date Assignee Title
CN113810216A (en) * 2020-12-31 2021-12-17 京东科技控股股份有限公司 Cluster fault switching method and device and electronic equipment
CN113315657A (en) * 2021-05-26 2021-08-27 中国电信集团系统集成有限责任公司 Telecommunication transmission network customer influence analysis method and system based on parallel search set
CN113315657B (en) * 2021-05-26 2023-11-24 中电信数智科技有限公司 Method and system for analyzing influence of telecommunication transmission network clients based on union collection
CN115037595A (en) * 2022-04-29 2022-09-09 北京华耀科技有限公司 Network recovery method, device, equipment and storage medium
CN115037595B (en) * 2022-04-29 2024-04-23 北京华耀科技有限公司 Network recovery method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN111708668B (en) 2023-07-07

Similar Documents

Publication Publication Date Title
CN109729111B (en) Method, apparatus and computer program product for managing distributed systems
US8949828B2 (en) Single point, scalable data synchronization for management of a virtual input/output server cluster
US8055735B2 (en) Method and system for forming a cluster of networked nodes
US11360867B1 (en) Re-aligning data replication configuration of primary and secondary data serving entities of a cross-site storage solution after a failover event
US9450700B1 (en) Efficient network fleet monitoring
CN107924362B (en) Database system, server device, computer-readable recording medium, and information processing method
JP2016085753A (en) Failover and recovery for replicated data instances
US20220318104A1 (en) Methods and systems for a non-disruptive automatic unplanned failover from a primary copy of data at a primary storage system to a mirror copy of the data at a cross-site secondary storage system
US10341168B2 (en) Topology manager for failure detection in a distributed computing system
CN107153660B (en) Fault detection processing method and system for distributed database system
CN103460203A (en) Cluster unique identifier
CN106657167B (en) Management server, server cluster, and management method
CN109144748B (en) Server, distributed server cluster and state driving method thereof
JPWO2014076838A1 (en) Virtual machine synchronization system
CN103036719A (en) Cross-regional service disaster method and device based on main cluster servers
EP3291487B1 (en) Method for processing virtual machine cluster and computer system
WO2015047240A1 (en) Baseboard management controller providing peer system identification
CN104850416A (en) Upgrading system, method and device and cloud computing node
CN111708668B (en) Cluster fault processing method and device and electronic equipment
CN112100005A (en) Redis copy set implementation method and device
CN106230622B (en) Cluster implementation method and device
CN107071189B (en) Connection method of communication equipment physical interface
CN111752488A (en) Management method and device of storage cluster, management node and storage medium
CN109189854B (en) Method and node equipment for providing continuous service
JP5176231B2 (en) Computer system, computer control method, and computer control program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant