CN113515408A - Data disaster tolerance method, device, equipment and medium - Google Patents

Data disaster tolerance method, device, equipment and medium

Info

Publication number
CN113515408A
CN113515408A (application CN202010279154.4A)
Authority
CN
China
Prior art keywords
data center
cluster
nodes
data
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010279154.4A
Other languages
Chinese (zh)
Inventor
罗永佳
高书明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Cloud Computing Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202010279154.4A priority Critical patent/CN113515408A/en
Publication of CN113515408A publication Critical patent/CN113515408A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1446Point-in-time backing up or restoration of persistent data
    • G06F11/1448Management of the data involved in backup or backup restore
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1446Point-in-time backing up or restoration of persistent data
    • G06F11/1458Management of the backup or restore process
    • G06F11/1464Management of the backup or restore process for networked environments

Abstract

The application provides a data disaster tolerance method applied to a first data center and a second data center, wherein the first data center is deployed with M nodes and the second data center is deployed with N nodes, M is an even number, N is an odd number, and M is greater than N. The M nodes of the first data center and the N nodes of the second data center form a first cluster. The second data center obtains the working state of the first data center; when the first data center fails, the nodes of the second data center start a second cluster to provide services through a leader node of the second cluster, where the leader node of the second cluster is one of the N nodes and is used to maintain the consistency of data among the N nodes in the second cluster. This solves the problem in the related art that, when the first data center fails, the first cluster may be unable to provide services externally, making it difficult to meet high-availability requirements.

Description

Data disaster tolerance method, device, equipment and medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data disaster recovery method, apparatus, device, and computer-readable storage medium.
Background
With the increasing requirements of users on data security, more and more applications adopt a cluster mode to store data, for example an ETCD cluster. An ETCD cluster is a distributed system that typically comprises a plurality of nodes. The nodes communicate with each other and provide service externally as a whole. Each node stores a complete copy of the data, and a consistency protocol such as the Raft protocol is used among the nodes to ensure that the data maintained by each node is consistent.
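For illustration only (the patent text contains no code), the following minimal Go sketch uses the etcd v3 client to write and read a key through such a cluster; the endpoint addresses and the key are placeholders, not values prescribed by the application:

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Connect to the cluster; any member can serve the request.
	// The endpoints below are placeholder addresses.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://10.0.0.1:2379", "http://10.0.0.2:2379", "http://10.0.0.3:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	// A write is committed once a majority of nodes (the Raft quorum) accept it.
	if _, err := cli.Put(ctx, "/config/feature-flag", "on"); err != nil {
		panic(err)
	}

	// Any node returns the same committed value.
	resp, err := cli.Get(ctx, "/config/feature-flag")
	if err != nil {
		panic(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s = %s\n", kv.Key, kv.Value)
	}
}
```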
If the distributed system is deployed in a single data center, there may be a problem that infrastructure of the data center, such as water, electricity, network, etc., fails, resulting in failure to provide external services. To this end, the industry often deploys distributed systems in two separate data centers to increase availability.
A typical deployment places two nodes in one data center and one node in another, forming a three-node cluster. When the data center hosting the two nodes fails, the cluster may be unable to provide services externally, making it difficult to meet high-availability requirements.
Disclosure of Invention
The application provides a data disaster tolerance method, which solves the problem that a cluster in a dual-data-center architecture cannot provide services externally when the data center with more nodes fails, thereby meeting the requirement of high availability. The application also provides an apparatus, a device, a computer-readable storage medium and a computer program product corresponding to the data disaster recovery method.
In a first aspect, the present application provides a data disaster recovery method. The method is applied to a first data center and a second data center. The first data center is provided with M nodes, and the second data center is provided with N nodes. M is even number, N is odd number, M is greater than N. The M nodes of the first data center and the N nodes of the second data center form a first cluster, and the first cluster provides services based on a leader node in the first cluster.
The second data center may obtain an operating state of the first data center, and when the operating state of the first data center indicates a failure of the first data center, the second data center initiates a second cluster (a sub-cluster of the first cluster) to provide service through a leader node of the second cluster. The leader node of the second cluster is one of the N nodes, and the leader node of the second cluster is used for maintaining the consistency of the data among the N nodes in the second cluster.
Therefore, when the first data center fails, the application can still provide services externally through the second cluster, which solves the problem in the related art that the cluster cannot provide services externally because only one node remains and no leader node can be elected. In addition, when the second data center fails, more than half of the nodes in the first cluster (the M nodes of the first data center) still work, so services can still be provided externally. That is, when either the first data center or the second data center fails, services can be provided externally, meeting the requirement of high availability.
Wherein a second cluster may be pre-created and then activated upon failure of the first data center. In some possible implementations, the second data center may create a second cluster upon failure of the first data center and then activate the second cluster. The creation time of the second cluster does not affect the specific implementation of the present application.
In some possible implementations, the second data center may add the nodes of the first data center to the second cluster when the first data center recovers. Thus, data consistency between the M nodes of the first data center and the N nodes of the second data center can be achieved, and application availability is further improved.
In some possible implementations, the second data center may send an inquiry request to the arbitration node, where the inquiry request is used to inquire the operating state of the first data center, and then receive an inquiry response sent by the arbitration node, so as to obtain the operating state of the first data center from the inquiry response.
In some possible implementations, when the second data center recovers from a failure, the second data center may further determine the leader node of the first cluster from the N nodes deployed in the second data center. Specifically, the second data center may determine the leader node from the second data center in response to a leader node transfer operation triggered by a node of the first data center. That is, whenever the second data center is normal, a node in the second data center serves as the leader node, which ensures that the data written to the nodes of the second data center is complete whether the first data center subsequently fails or operates normally. Both the first data center and the second data center then store complete data, and data consistency is guaranteed.
In some possible implementations, the first data center includes 2K nodes and the second data center includes 2K-1 nodes, where K is a positive integer. As an example, K may take a value of 1 or 2; correspondingly, the first cluster may be a 3-node cluster or a 7-node cluster. When the first cluster is a 3-node cluster, the first data center includes 2 nodes and the second data center includes 1 node. When the first cluster is a 7-node cluster, the first data center includes 4 nodes and the second data center includes 3 nodes. In this way, a balance between fault tolerance and performance can be achieved.
In a second aspect, the present application provides a data disaster recovery device. The device is applied to a first data center and a second data center. The first data center is provided with M nodes, and the second data center is provided with N nodes. M is even number, N is odd number, M is greater than N. The M nodes of the first data center and the N nodes of the second data center form a first cluster, and the first cluster provides services based on a leader node in the first cluster.
The device comprises a communication module and an initiating module. The communication module is used for acquiring the working state of the first data center, and the starting module is used for starting the second cluster to provide service through the leader node of the second cluster when the first data center fails. The leader node of the second cluster is one of the N nodes, and the leader node of the second cluster is used for maintaining consistency of data among the N nodes in the second cluster.
In some possible implementations, the apparatus further includes:
an adding module, configured to add a node of the first data center to the second cluster when the first data center is restored.
In some possible implementations, the communication module is specifically configured to:
sending an inquiry request to an arbitration node, wherein the inquiry request is used for inquiring the working state of the first data center;
and receiving an inquiry response sent by the arbitration node, wherein the inquiry response comprises the working state of the first data center.
In some possible implementations, the apparatus further includes:
a determining module to determine a leader node of the first cluster from the second data center when the second data center recovers from a failure.
In some possible implementations, the first data center includes 2K nodes, and the second data center includes 2K-1 nodes, where K is a positive integer. As an example, K may take the value 1 or 2. When the value of K is 1, the first data center includes 2 nodes, the second data center includes 1 node, and when the value of K is 2, the first data center includes 4 nodes, and the second data center includes 3 nodes.
In a third aspect, the present application provides a device, which may be a computer device such as a server or a cloud server. The apparatus includes a processor and a memory. The processor and the memory are in communication with each other. The processor is configured to execute the instructions stored in the memory to cause the device to perform the data disaster recovery method according to the first aspect or any implementation manner of the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium. The computer-readable storage medium includes instructions that instruct to execute the data disaster recovery method according to the first aspect or any implementation manner of the first aspect.
In a fifth aspect, the present application provides a computer program product containing instructions. The computer program product, when running on a device, such as a computer device, causes the device to perform the data disaster recovery method according to the first aspect or any implementation manner of the first aspect.
The present application can further combine to provide more implementations on the basis of the implementations provided by the above aspects.
Drawings
In order to more clearly illustrate the technical method of the embodiments of the present application, the drawings used in the embodiments will be briefly described below.
Fig. 1 is a system architecture diagram of a data disaster recovery method according to an embodiment of the present application;
fig. 2 is a flowchart of a data disaster recovery method according to an embodiment of the present application;
fig. 3 is a system architecture diagram of a data disaster recovery method according to an embodiment of the present application;
fig. 4 is a system architecture diagram of a data disaster recovery method according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a data disaster recovery device according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an apparatus according to an embodiment of the present application.
Detailed Description
The scheme in the embodiments provided in the present application will be described below with reference to the drawings in the present application.
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and are merely descriptive of the various embodiments of the application and how objects of the same nature can be distinguished.
In order to facilitate understanding of the technical solutions of the present application, some technical terms related to the present application are described below.
Data disaster recovery is a technique to protect data security and to improve the continuous availability of data. Data disaster recovery generally includes data backup, which is implemented to guarantee data security and improve data continuous availability. Data disaster tolerance can be generally divided into different levels according to different data backup modes. For example, the disaster tolerance levels of a tape drive that backs up data locally and a data center that backs up data remotely are significantly different, and the disaster tolerance capability of the latter is significantly higher than that of the former.
A data center is an infrastructure for transferring, accelerating, presenting, computing, and storing data. The infrastructure may be understood as a machine room that provides power, network, heat dissipation and other services. Nodes are deployed in the data center; a node is a device that implements data computation and/or storage, and the data center implements functions such as data transmission, acceleration, presentation, computation, and storage through the nodes deployed in it. A node may be a physical device, such as a server. Of course, a node may also be a logical device, such as a virtual machine (VM) on a server. The data center provides power supply and network services for the nodes to ensure that the nodes can provide services externally.
In view of data security and continued availability, many applications employ a dual data center architecture to prevent a single data center failure from rendering services unavailable. Under the double data center architecture, two data centers can receive transactions at the same time and process the same data. When one data center fails, the transaction can continue at the other data center without a system switch.
The key to switching to another data center when a data center fails is to maintain data consistency between the two data centers. The two data centers can generally maintain data consistency by forming a cluster. For ease of understanding, the embodiments of the present application are illustrated with an ETCD cluster.
An ETCD cluster is a distributed storage system for shared configuration and service discovery. ETCD clusters use the Raft protocol to maintain consistency of the state of the nodes within the cluster. The Raft protocol, also called the Raft algorithm, is a consensus algorithm. Consensus means that a plurality of nodes reach agreement on a certain event even when some nodes fail, the network is delayed, or the network is partitioned. Generally, when more than half of the nodes in an ETCD cluster agree on something, that opinion is considered trustworthy. For example, if the ETCD cluster includes 4K-1 nodes, where K is a positive integer, then as long as 2K nodes agree on writing a piece of data, every node in the ETCD cluster can write the data. Even if some nodes do not update the data immediately due to network or other reasons, the data is eventually kept consistent.
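To make the quorum arithmetic concrete, the following small Go sketch (an illustration under the assumptions of this paragraph, not part of the original disclosure) computes the majority threshold of a 4K-1 node cluster and checks which data center keeps quorum when the other fails:

```go
package main

import "fmt"

// quorum returns the minimum number of nodes whose agreement commits a write.
func quorum(clusterSize int) int {
	return clusterSize/2 + 1
}

func main() {
	for _, k := range []int{1, 2} {
		total := 4*k - 1      // full cluster: 2K nodes in DC1 + 2K-1 nodes in DC2
		dc1, dc2 := 2*k, 2*k-1

		fmt.Printf("K=%d: cluster of %d nodes needs %d votes to commit\n",
			k, total, quorum(total))
		// If DC2 fails, the 2K nodes of DC1 are still a majority of 4K-1.
		fmt.Printf("  DC2 down: %d survivors, quorum kept: %v\n",
			dc1, dc1 >= quorum(total))
		// If DC1 fails, only 2K-1 nodes remain, which is below the majority,
		// hence the need to start a separate second cluster in DC2.
		fmt.Printf("  DC1 down: %d survivors, quorum kept: %v\n",
			dc2, dc2 >= quorum(total))
	}
}
```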
The industry provides an ETCD cluster with a dual data center architecture. The ETCD cluster comprises two data centers, DC1 and DC2. DC1 includes two nodes, denoted VM1 and VM3. DC2 includes one node, denoted VM2. When DC2 fails, VM1 and VM3 can continue to provide services outside the cluster. However, when DC1 fails, only VM2 remains and the cluster cannot elect a leader node, so the cluster cannot provide services externally. At this time, the application's external service is abnormal, which does not meet the requirement of high availability.
In view of this, the present application provides a data disaster recovery method. The method is applied to dual data centers, specifically a first data center and a second data center, where the first data center is deployed with M nodes and the second data center is deployed with N nodes. M is an even number, N is an odd number, and M is greater than N. The M nodes of the first data center and the N nodes of the second data center form a first cluster of size M + N. The first cluster provides services externally based on the leader node of the first cluster. The second data center can obtain the working state of the first data center; when the working state indicates that the first data center has failed, the second data center can start a second cluster (a sub-cluster of the first cluster), with one node of the second data center serving as the leader node of the second cluster and maintaining consistency of data among the N nodes in the second cluster, so that the application can still provide services externally when the first data center fails. This solves the problem in the related art that the cluster cannot provide services externally because only one node remains and no leader node can be elected, and meets the requirement of high availability.
In particular implementations, the first data center may include 2K nodes and the second data center may include 2K-1 nodes. Wherein K is a positive integer. Through the arrangement, the nodes can be distributed in the first data center and the second data center more uniformly, and the balance of fault tolerance and performance can be realized.
In order to make the technical solution of the present application clearer and easier to understand, an application environment of the data disaster recovery method provided in the embodiment of the present application will be described below with reference to the accompanying drawings.
The data disaster recovery method provided by the embodiment of the application includes, but is not limited to, the application environment shown in fig. 1. As shown in fig. 1, the scenario includes a data center 102 and a data center 104. 2K nodes, denoted as node 1 to node 2K, are deployed in the data center 102. The data center 104 is deployed with 2K-1 nodes, denoted as node 2K+1 to node 4K-1. K is a positive integer, and may be, for example, 1 or 2. Correspondingly, data center 102 may include 2 nodes and data center 104 may include 1 node. Of course, the data center 102 may also include 4 nodes, and correspondingly, the data center 104 includes 3 nodes.
Data center 104 may obtain the operational status of data center 102. When data center 102 fails, data center 104 may initiate a second cluster. The second cluster includes the 2K-1 nodes, node 2K+1 through node 4K-1, in the data center 104.
The leader node of the second cluster is one of the nodes 2K+1 to 4K-1. Specifically, a node in the cluster is in one of three states at any time: leader node, candidate node, or follower node. All nodes start as follower nodes; when a node does not receive a heartbeat message from the leader node within a period of time, it switches from follower to candidate. A candidate node votes for itself and sends election requests to the other nodes; when a candidate node obtains a majority of the votes (e.g., the votes of more than half of the nodes), it wins the election and becomes the leader node. The leader node is responsible for receiving update requests (e.g., data write requests) from clients, replicating the update requests to the follower nodes, and executing the update requests when it is "safe" to do so (e.g., more than half of the nodes agree), thereby maintaining data consistency among the 2K-1 nodes in the second cluster.
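The following simplified Go sketch illustrates only the follower/candidate/leader transition and the majority-vote rule described above; it deliberately omits the terms, timeouts, persistence and RPCs of a real Raft implementation, and the node numbering is just an example:

```go
package main

import "fmt"

// A highly simplified sketch of the election rule, not a real Raft implementation.
type State int

const (
	Follower State = iota
	Candidate
	Leader
)

type Node struct {
	id    int
	state State
}

// startElection simulates a follower whose election timeout fired: it becomes a
// candidate, votes for itself, requests votes from its peers, and becomes leader
// only when it holds a majority of the full cluster.
func startElection(c *Node, peers []*Node, clusterSize int) {
	c.state = Candidate
	votes := 1 // the candidate votes for itself
	for _, p := range peers {
		if p.state == Follower { // in this sketch a follower always grants its vote
			votes++
		}
	}
	if votes > clusterSize/2 {
		c.state = Leader
	}
}

func main() {
	// The 2K-1 nodes of the second cluster with K=2: nodes 5, 6 and 7.
	n5 := &Node{id: 5, state: Follower}
	n6 := &Node{id: 6, state: Follower}
	n7 := &Node{id: 7, state: Follower}
	startElection(n5, []*Node{n6, n7}, 3)
	fmt.Printf("node %d state after election: %v (2 = Leader)\n", n5.id, n5.state)
}
```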
Next, a data disaster recovery method provided in the embodiment of the present application will be described from the perspective of the data center 104.
Referring to a flow chart of a data disaster recovery method shown in fig. 2, the method includes:
s202: the data center 104 obtains the operating status of the data center 102. When the operating status of the data center 102 indicates that the data center 102 is malfunctioning, S204 is performed.
The operating state of the data center 102 specifically includes two states. One state is a normal operating state, and at this time, the data center 102 can provide power, network, heat dissipation system, and the like for the nodes, so that the nodes can provide services to the outside normally. The other state is a fault operating state, and at this time, the data center 102 cannot provide a power supply, a network or a heat dissipation system for the node, so that the node cannot normally provide a service to the outside.
It should be noted that the data center 102 being in the failure operating state is completely different from a single node in the data center 102 being faulty. When the data center 102 is in the failure operating state, for example due to a power failure, none of the nodes in the data center 102 can provide external services. When only a certain node in the data center 102 is faulty, the other nodes in the data center 102 can still provide external services.
The data center 104 may obtain the working state of the data center 102 through any node deployed in the data center 104, and then perform data processing according to that working state, thereby implementing data disaster recovery. When obtaining the operating status of the data center 102, the nodes of the data center 104 may do so with the help of an arbitration node.
Referring to fig. 3, a schematic structural diagram of a data disaster recovery system is shown, where the data disaster recovery system includes a data center 102 and a data center 104, and further includes an arbitration node 106. Arbitration node 106 is used to interact with data center 102 and data center 104. In this manner, data center 104 may obtain the operational status of data center 102 via arbitration node 106.
In some possible implementations, the data center 104 may send a query request to the arbitration node 106, where the query request is used to query the operating state of the data center 102. The arbitration node 106 may determine the operating state of the data center 102 based on heartbeat messages from the nodes in the data center 102. For example, if the arbitration node 106 receives a heartbeat message from at least one node in the data center 102 within a preset time period, it determines that the data center 102 is in the normal operating state; if the arbitration node 106 does not receive a heartbeat message from any node in the data center 102 within the preset time period, it determines that the data center 102 is in the failure operating state. The arbitration node 106 generates a query response according to the operating state of the data center 102 and sends the query response to the data center 104, so that the data center 104 learns the operating state of the data center 102 from the query response.
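The patent does not define a concrete interface for the arbitration node, so the following Go sketch is a hypothetical illustration of the heartbeat-based state determination described above; the node identifiers and the timeout are assumptions:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Arbiter is a hypothetical arbitration service: it records heartbeats from the
// nodes of data center 102 and reports the data center as failed when no node
// has reported within the threshold.
type Arbiter struct {
	mu        sync.Mutex
	lastSeen  map[string]time.Time // node ID -> last heartbeat time
	threshold time.Duration
}

func NewArbiter(threshold time.Duration) *Arbiter {
	return &Arbiter{lastSeen: map[string]time.Time{}, threshold: threshold}
}

// Heartbeat is called periodically by the nodes of data center 102.
func (a *Arbiter) Heartbeat(nodeID string) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.lastSeen[nodeID] = time.Now()
}

// DataCenterAlive answers a query from data center 104: the data center is
// considered alive if at least one of its nodes reported within the threshold.
func (a *Arbiter) DataCenterAlive() bool {
	a.mu.Lock()
	defer a.mu.Unlock()
	for _, t := range a.lastSeen {
		if time.Since(t) <= a.threshold {
			return true
		}
	}
	return false
}

func main() {
	arb := NewArbiter(3 * time.Second)
	arb.Heartbeat("node1")
	fmt.Println("data center 102 alive:", arb.DataCenterAlive()) // true
	time.Sleep(4 * time.Second)
	fmt.Println("data center 102 alive:", arb.DataCenterAlive()) // false after the timeout
}
```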
When the data center 104 learns that the data center 102 has failed, the data center 104 may execute S204, so that data consistency among the remaining nodes can be maintained and services can still be provided externally even though the data center 102 has failed.
S204: the node of the data center 104 initiates a second cluster to provide service through a leader node of the second cluster.
The leader node of the second cluster is one of the nodes of the second data center. When the second data center is deployed with N nodes, the leader node of the second cluster is one of the N nodes, and when N is equal to 2K-1, the leader node of the second cluster is one of the 2K-1 nodes. The leader node of the second cluster is specifically configured to maintain consistency of data among N (N may take a value of 2K-1) nodes in the second cluster.
When the data center 102 fails, any node in the data center 104, or a first preset node in the data center 104, may create a second cluster, for example through a command such as force-new-cluster. The second cluster includes the 2K-1 nodes deployed in the data center 104. The second cluster takes one of the 2K-1 nodes as its leader node, and the leader node maintains the consistency of data among the 2K-1 nodes in the second cluster.
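As a hedged illustration of one way to realize this step with etcd in the single-node second-cluster case (K = 1), the sketch below restarts the local member with etcd's --force-new-cluster flag, which discards the old peer list and starts a new single-member cluster from the existing data directory. The member name, paths and URLs are placeholders, and in practice the running etcd process must be stopped first:

```go
package main

import (
	"log"
	"os/exec"
)

// forceNewCluster restarts the local etcd member with --force-new-cluster so
// that it drops the old peer list and serves as a single-member cluster using
// its existing data directory. All names, paths and URLs are placeholders.
func forceNewCluster() error {
	cmd := exec.Command("etcd",
		"--name", "vm2",
		"--data-dir", "/var/lib/etcd",
		"--listen-client-urls", "http://0.0.0.0:2379",
		"--advertise-client-urls", "http://vm2:2379",
		"--force-new-cluster",
	)
	return cmd.Start()
}

func main() {
	if err := forceNewCluster(); err != nil {
		log.Fatal(err)
	}
	log.Println("etcd restarted as a single-member cluster")
}
```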
Wherein the leader node of the second cluster may be a second preset node of the 2K-1 nodes. The second preset node and the first preset node may be the same node or different nodes. When a node of the data center 104 creates a second cluster, the second preset node may send a notification message to other nodes in the second cluster to notify the other nodes in the second cluster that the leader node is the second preset node. In this way, the overhead caused by operations such as election can be reduced.
In some implementations, the leader node of the second cluster may also be determined by election. When the second cluster is created, the 2K-1 nodes deployed in the data center 104 change from follower nodes to candidate nodes; each candidate node votes for itself and sends election requests to the other nodes in the second cluster, and the candidate node that obtains a majority of the votes switches to the leader node.
In some implementations, the data center 104 may also pre-create a second cluster, determine a leader node of the second cluster, and then initiate the second cluster to provide service through the leader node of the second cluster when the first data center fails.
When the data center 104 fails, the data center 102 still contains 2K nodes, which is more than half of the nodes in the first cluster, so when the 2K nodes agree on an update request (e.g., a data write request), data consistency among the 2K nodes in the first cluster can be maintained. When the data center 104 recovers, the leader node in the first cluster can replicate the update requests (e.g., data write requests) to the 2K-1 nodes deployed in the data center 104, thereby maintaining data consistency among the 4K-1 nodes in the first cluster.
In the embodiment of the present application, the number of nodes deployed in the data centers 102 and 104 may be set according to service requirements. More nodes may be deployed in data center 102 and data center 104 for fault tolerance; fewer nodes may be deployed for performance. Considering fault tolerance and performance together, K may take a value of 1 or 2.
Based on this, in some possible implementations, data center 102 may deploy 2 nodes and data center 104 may deploy 1 node. In this implementation, when data center 102 fails, the node in data center 104 may create a single-node cluster, which is the second cluster described above. The single-node cluster takes the node included in the data center 104 as its leader node.
In other possible implementations, data center 102 may deploy 4 nodes and data center 104 may deploy 3 nodes. In this implementation, when the data center 102 fails, the nodes in the data center 104 may create a three-node cluster, which is the second cluster described above. The leader node in the three-node cluster may be a second preset node, or a node determined in an election manner, which is not limited in this embodiment.
Based on the above description, the data disaster recovery method provided in the embodiment of the present application supports the following: when the data center with more nodes (i.e., the first data center) in a dual-center data disaster recovery system fails, the other data center (i.e., the second data center) can separately create a second cluster, and a node of the second data center serves as the leader node of the second cluster to maintain consistency of data among the nodes in the second cluster, so that the application can still provide services externally when the first data center fails. This solves the problem in the related art that the cluster cannot provide services externally because only one node remains and no leader node can be elected, and meets the requirement of high availability.
In the embodiment shown in fig. 2, when the data center 102 recovers, i.e., switches from the failure operating state to the normal operating state, the data center 104 adds the nodes of the data center 102 to the second cluster. Specifically, the nodes of the data center 104 include the leader node of the second cluster, and the leader node may add the nodes deployed in the data center 102 to the second cluster through a member addition command such as member add.
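For illustration, a Go sketch using the etcd v3 client's MemberAdd API (the programmatic counterpart of a member add command) re-adds the recovered nodes; the endpoints and peer URLs are placeholders:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Connect to the leader node of the second cluster (placeholder endpoint).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://vm2:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Re-add the members of the recovered data center one by one.
	for _, peerURL := range []string{"http://vm1:2380", "http://vm3:2380"} {
		resp, err := cli.MemberAdd(ctx, []string{peerURL})
		if err != nil {
			log.Fatalf("member add %s: %v", peerURL, err)
		}
		// The re-added member must then be started with the returned cluster
		// configuration and --initial-cluster-state=existing.
		log.Printf("added member %x with peer URL %s", resp.Member.ID, peerURL)
	}
}
```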
In some possible implementations, when the data center 104 recovers from a failure, that is, when the data center 104 switches from the failure operating state to the normal operating state, the data center 104 may determine the leader node of the first cluster from the nodes deployed in the data center 104. Specifically, data center 104 may determine the leader node from data center 104 in response to a leader node transfer operation triggered by a node of data center 102. The node in the data center 102 that currently serves as the leader node can transfer leadership to a node in the data center 104 through a leader transfer command such as move-leader. The leader node of the first cluster is thereby switched from a node deployed in data center 102 to a node deployed in data center 104.
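Similarly, the following hedged Go sketch illustrates the leader transfer step using the etcd v3 client's MemberList and MoveLeader APIs (the counterpart of a move-leader command); the endpoint and the member name "vm2" are placeholders:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Connect to the current leader in data center 102 (placeholder endpoint);
	// MoveLeader must be issued against the current leader's endpoint.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://vm1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Look up the member ID of the node in data center 104 by its name.
	members, err := cli.MemberList(ctx)
	if err != nil {
		log.Fatal(err)
	}
	var target uint64
	for _, m := range members.Members {
		if m.Name == "vm2" {
			target = m.ID
		}
	}
	if target == 0 {
		log.Fatal("member vm2 not found")
	}

	// Transfer leadership to the member in data center 104.
	if _, err := cli.MoveLeader(ctx, target); err != nil {
		log.Fatal(err)
	}
	log.Printf("leadership transferred to member %x", target)
}
```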
In order to facilitate understanding of the technical solution of the present application, the data disaster tolerance method is illustrated below with an example in which 2 nodes, namely VM1 and VM3, are deployed in the data center 102 and 1 node, namely VM2, is deployed in the data center 104.
Referring to the structural diagram of the data disaster recovery system shown in fig. 4, the nodes VM1, VM2, and VM3 deployed in the data center 102 and the data center 104 form a three-node ETCD cluster. Each node in the ETCD cluster runs a monitor program, and through the monitor program the node may call the arbitration proxy interface corresponding to the arbitration service provided by the arbitration node 106 for arbitration.
Specifically, the monitor program of the node VM2 deployed in the data center 104 calls the arbitration proxy interface of the arbitration node 106 to determine the operating state of the data center 102. When the data center 102 fails, VM2 automatically converts the ETCD cluster to a single-node cluster using the force-new-cluster command, and VM2 becomes the leader node of the single-node cluster.
When the monitor program of VM2 determines, by calling the arbitration proxy interface of the arbitration node 106, that the data center 102 has recovered, VM2, as the leader node, adds VM1 and VM3 back to the cluster using the member add command.
When the data center 104 fails, the nodes VM1 and VM3 in the data center 102 are still available, and the ETCD cluster can rely on its election capability to elect VM1 or VM3 as the leader node, so that high availability of the cluster service can still be achieved. When the data center 104 recovers, i.e., switches from the failure operating state to the normal operating state, the ETCD cluster brings VM2 back under cluster management, and VM1 or VM3 transfers leadership through the move-leader command so that VM2 becomes the leader node. In this way, whether the data center 102 fails or operates normally, the data written to VM2 is complete; both the data center 102 and the data center 104 store complete data, and data consistency is guaranteed.
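Tying these steps together, the following hypothetical Go sketch outlines a monitor loop such as the one described for VM2. The arbitration query is stubbed out, and the systemd unit name and etcdctl invocations are illustrative assumptions rather than the patent's prescribed implementation:

```go
package main

import (
	"log"
	"os/exec"
	"time"
)

// dataCenter102Alive stands in for a call to the arbitration proxy interface
// of arbitration node 106; it is a stub that always reports "alive" here.
func dataCenter102Alive() bool {
	return true
}

func main() {
	wasAlive := true
	for {
		alive := dataCenter102Alive()
		switch {
		case wasAlive && !alive:
			// Data center 102 just failed: restart etcd as a single-node cluster.
			// The unit name is a placeholder for a service that passes --force-new-cluster.
			log.Println("DC102 failed, forcing a new single-node cluster")
			_ = exec.Command("systemctl", "restart", "etcd-force-new-cluster.service").Run()
		case !wasAlive && alive:
			// Data center 102 recovered: re-add its members to the cluster.
			log.Println("DC102 recovered, re-adding VM1 and VM3")
			_ = exec.Command("etcdctl", "member", "add", "vm1", "--peer-urls=http://vm1:2380").Run()
			_ = exec.Command("etcdctl", "member", "add", "vm3", "--peer-urls=http://vm3:2380").Run()
		}
		wasAlive = alive
		time.Sleep(5 * time.Second)
	}
}
```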
By the method provided in the embodiment of the present application, the problem that the service becomes unavailable because no leader can be elected when the data center containing more nodes in a dual-data-center disaster recovery system suffers an overall fault can be solved, and the requirement of high availability is met. In addition, there is no need to build 3 data centers to satisfy the odd-node requirement of the ETCD cluster, which reduces the cost of data disaster recovery.
The data disaster recovery method provided in the embodiment of the present application is described in detail with reference to fig. 1 to 4, and then, the data disaster recovery device and the data disaster recovery apparatus provided in the embodiment of the present application are described with reference to the drawings.
Referring to a schematic structural diagram of a data disaster recovery device 500 shown in fig. 5, the device 500 is applied to a first data center and a second data center. The first data center is provided with M nodes, and the second data center is provided with N nodes. M is even number, N is odd number, M is greater than N. The M nodes of the first data center and the N nodes of the second data center form a first cluster.
The apparatus 500 comprises:
a communication module 502, configured to obtain a working state of the first data center;
a starting module 504, configured to start the second cluster to provide service through a leader node of the second cluster when the working state of the first data center indicates that the first data center is faulty, where the leader node of the second cluster is one of the N nodes, and the leader node is configured to maintain consistency of data among the N nodes in the second cluster.
The specific implementation of the communication module 502 may refer to the description of the relevant content of S202 in the embodiment shown in fig. 2, and the specific implementation of the starting module 504 may refer to the description of the relevant content of S204 in the embodiment shown in fig. 2, which is not described herein again.
In some possible implementations, the apparatus 500 further includes:
an adding module 506, configured to add a node of the first data center to the second cluster when the first data center is restored.
The specific implementation of the adding module 506 may refer to the description of the relevant content in the embodiment shown in fig. 2, and is not described herein again.
In some possible implementations, the communication module 502 is specifically configured to:
sending an inquiry request to an arbitration node, wherein the inquiry request is used for inquiring the working state of the first data center;
and receiving an inquiry response sent by the arbitration node, wherein the inquiry response comprises the working state of the first data center.
In some possible implementations, the apparatus 500 further includes:
a determining module 508, configured to determine a leader node from the second data center when the second data center recovers from the failure.
The specific implementation of the determining module 508 may refer to the description of the related content in the embodiment shown in fig. 2, and is not described herein again.
In some possible implementations, the first data center includes 2K nodes, and the second data center includes 2K-1 nodes, where K is a positive integer.
The data disaster recovery apparatus 500 according to the embodiment of the present application may correspond to perform the method described in the embodiment of the present application, and the above and other operations and/or functions of each module/unit of the data disaster recovery apparatus 500 are respectively for implementing corresponding flows of each method in the embodiment shown in fig. 2, and are not described herein again for brevity.
The embodiment of the application also provides equipment. The device may be a physical device such as a server, or a virtualized device such as a cloud server. The device is specifically configured to implement the function of the data disaster recovery apparatus 500 in the embodiment shown in fig. 5.
Fig. 6 provides a schematic structural diagram of a device 600, and as shown in fig. 6, the device 600 includes a bus 601, a processor 602, a communication interface 603, and a memory 604. The processor 602, memory 604, and communication interface 603 communicate over a bus 601. The bus 601 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 6, but this is not intended to represent only one bus or type of bus. The communication interface 603 is used for communication with the outside. For example, receiving an update request (e.g., a data write request), sending an inquiry request to arbitration node 106, receiving an inquiry response sent by arbitration node 106, and so on.
The processor 602 may be a Central Processing Unit (CPU). The memory 604 may include a volatile memory (volatile memory), such as a Random Access Memory (RAM). The memory 604 may also include a non-volatile memory (non-volatile memory), such as a read-only memory (ROM), a flash memory, an HDD, or an SSD.
The memory 604 stores executable code that the processor 602 executes to perform the data disaster recovery method described above.
Specifically, in the case of implementing the embodiment shown in fig. 5, and in the case that the modules of the data disaster recovery device 500 described in the embodiment of fig. 5 are implemented by software, the software or program code required for executing the functions of the starting module 504, the adding module 506, and the determining module 508 in fig. 5 is stored in the memory 604. The communication module 502 obtains the operating status of a first data center, such as the data center 102, and transmits the operating status to the processor 602 through the bus 601. The processor 602 executes the program code corresponding to each module stored in the memory 604, for example the program code corresponding to the starting module 504, to start a second cluster to provide service through the leader node of the second cluster when the operating status of the first data center indicates a failure of the first data center. In this way, the application can still provide services externally based on the second cluster, and data disaster recovery is achieved.
Of course, the processor 602 may also execute the program code corresponding to the adding module 506 to add the node of the first data center to the second cluster when the first data center is restored. Processor 602 can further execute program code corresponding to determining module 508 to re-determine a leader node from a second data center in response to a leader node transfer operation triggered by a node of a first data center when the second data center recovers from a failure.
An embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium includes instructions that instruct the device 600 to execute the data disaster recovery method applied to the data disaster recovery apparatus 500.
The embodiment of the present application further provides a computer program product, and when the computer program product is executed by a computer, the computer executes any one of the foregoing data disaster recovery methods. The computer program product may be a software installation package, which may be downloaded and executed on a computer in case any of the aforementioned data disaster recovery methods needs to be used.

Claims (12)

1. A data disaster tolerance method is applied to a first data center and a second data center, wherein the first data center is deployed with M nodes, the second data center is deployed with N nodes, M is an even number, N is an odd number, M is greater than N, the M nodes of the first data center and the N nodes of the second data center form a first cluster, and the first cluster provides services based on a leader node in the first cluster, and the method comprises the following steps:
the second data center acquires the working state of the first data center;
when the working state of the first data center indicates that the first data center is in failure, the second data center starts a second cluster to provide service through a leader node of the second cluster, wherein the leader node of the second cluster is one of the N nodes, and the leader node of the second cluster is used for maintaining consistency of data among the N nodes in the second cluster.
2. The method of claim 1, further comprising:
when the first data center recovers, the second data center adds the nodes of the first data center to the second cluster.
3. The method according to claim 1 or 2, wherein the second data center acquiring the working state of the first data center comprises:
the second data center sends a query request to an arbitration node, wherein the query request is used for querying the working state of the first data center;
and the second data center receives an inquiry response sent by the arbitration node, wherein the inquiry response comprises the working state of the first data center.
4. The method according to any one of claims 1 to 3, further comprising:
upon recovery from a failure by the second data center, determining a leader node of the first cluster from the second data center.
5. The method according to any one of claims 1 to 4, wherein the first data center comprises 2K nodes and the second data center comprises 2K-1 nodes, and K is a positive integer.
6. A data disaster recovery device, applied to a first data center and a second data center, where the first data center is deployed with M nodes, the second data center is deployed with N nodes, M is an even number, N is an odd number, and M is greater than N, the M nodes of the first data center and the N nodes of the second data center form a first cluster, and the first cluster provides services based on a leader node in the first cluster, and the device includes:
the communication module is used for acquiring the working state of the first data center;
a starting module, configured to start a second cluster to provide service through a leader node of the second cluster when the working state of the first data center indicates that the first data center is faulty, where the leader node of the second cluster is one of the N nodes, and the leader node of the second cluster is used to maintain consistency of data among the N nodes in the second cluster.
7. The apparatus of claim 6, further comprising:
an adding module, configured to add a node of the first data center to the second cluster when the first data center is restored.
8. The apparatus according to claim 6 or 7, wherein the communication module is specifically configured to:
sending an inquiry request to an arbitration node, wherein the inquiry request is used for inquiring the working state of the first data center;
and receiving an inquiry response sent by the arbitration node, wherein the inquiry response comprises the working state of the first data center.
9. The apparatus of any one of claims 6 to 8, further comprising:
a determining module to determine a leader node of the first cluster from the second data center when the second data center recovers from a failure.
10. The apparatus of any of claims 6 to 9, wherein the first data center comprises 2K nodes and the second data center comprises 2K-1 nodes, and K is a positive integer.
11. An apparatus, comprising a processor and a memory;
the processor is configured to execute instructions stored in the memory to cause the device to perform the data disaster recovery method according to any one of claims 1 to 5.
12. A computer-readable storage medium, comprising instructions that instruct a computer to perform the data disaster recovery method according to any one of claims 1 to 5.
CN202010279154.4A 2020-04-10 2020-04-10 Data disaster tolerance method, device, equipment and medium Pending CN113515408A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010279154.4A CN113515408A (en) 2020-04-10 2020-04-10 Data disaster tolerance method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010279154.4A CN113515408A (en) 2020-04-10 2020-04-10 Data disaster tolerance method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN113515408A (en) 2021-10-19

Family

ID=78060512

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010279154.4A Pending CN113515408A (en) 2020-04-10 2020-04-10 Data disaster tolerance method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN113515408A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114461438A (en) * 2022-04-12 2022-05-10 北京易鲸捷信息技术有限公司 Distributed database disaster recovery system and method of asymmetric center mode
CN115037745A (en) * 2022-05-18 2022-09-09 阿里巴巴(中国)有限公司 Method and device for election in distributed system
CN115037745B (en) * 2022-05-18 2023-09-26 阿里巴巴(中国)有限公司 Method and device for electing in distributed system
CN115208743A (en) * 2022-07-18 2022-10-18 中国工商银行股份有限公司 ETCD-based cross-site cluster deployment method and device

Similar Documents

Publication Publication Date Title
US10983880B2 (en) Role designation in a high availability node
CN113515408A (en) Data disaster tolerance method, device, equipment and medium
EP4083786A1 (en) Cloud operating system management method and apparatus, server, management system, and medium
CN105159798A (en) Dual-machine hot-standby method for virtual machines, dual-machine hot-standby management server and system
US20140317438A1 (en) System, software, and method for storing and processing information
US8032786B2 (en) Information-processing equipment and system therefor with switching control for switchover operation
CN112181660A (en) High-availability method based on server cluster
CN111935244B (en) Service request processing system and super-integration all-in-one machine
US20210271420A1 (en) Method and apparatus for performing data access management of all flash array server
CN111147274A (en) System and method for creating a highly available arbitration set for a cluster solution
CN113467873A (en) Virtual machine scheduling method and device, electronic equipment and storage medium
JP2012190175A (en) Fault tolerant system, server and method and program for fault tolerance
EP3167372B1 (en) Methods for facilitating high availability storage services and corresponding devices
CN114124803B (en) Device management method and device, electronic device and storage medium
WO2021012169A1 (en) Method of improving reliability of storage system, and related apparatus
CN112131201B (en) Method, system, equipment and medium for high availability of network additional storage
CN111416726B (en) Resource management method, sending end equipment and receiving end equipment
JP2009075710A (en) Redundant system
JP3621634B2 (en) Redundant configuration switching system
JP2008003731A (en) Information processing system
WO2024051577A1 (en) Distributed system deployment method and configuration method, system, device, and medium
JP2005115472A (en) Operation control system
US20220215001A1 (en) Replacing dedicated witness node in a stretched cluster with distributed management controllers
CN116881053B (en) Data processing method, exchange board, data processing system and data processing device
US7484124B2 (en) Method and system for fault protection in communication networks, related network and computer program product

Legal Events

Date Code Title Description
PB01 Publication
TA01 Transfer of patent application right

Effective date of registration: 20220215

Address after: 550025 Huawei cloud data center, jiaoxinggong Road, Qianzhong Avenue, Gui'an New District, Guiyang City, Guizhou Province

Applicant after: Huawei Cloud Computing Technology Co.,Ltd.

Address before: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen

Applicant before: HUAWEI TECHNOLOGIES Co.,Ltd.

SE01 Entry into force of request for substantive examination