CN115202917A - Distributed cluster fault-tolerant recovery method and system used under virtualization platform - Google Patents


Info

Publication number
CN115202917A
Authority
CN
China
Prior art keywords
cluster
node
address
copy
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210783786.3A
Other languages
Chinese (zh)
Inventor
刘刚
王阳
赵山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Cloud Information Technology Co Ltd
Original Assignee
Inspur Cloud Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Cloud Information Technology Co Ltd filed Critical Inspur Cloud Information Technology Co Ltd
Priority to CN202210783786.3A priority Critical patent/CN115202917A/en
Publication of CN115202917A publication Critical patent/CN115202917A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793Remedial or corrective actions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems


Abstract

The invention discloses a distributed cluster fault-tolerant recovery method and system for a virtualization platform, belonging to the fields of virtualization and distributed computing. The system comprises a copy starting module, a copy processing module, a node processing module, a request processing module, an exception handling module, an information inquiry module, an IP detection module and a data synchronization module. The invention solves the problem that an application cluster cannot recover automatically when a Pod restart in a K8S environment changes its IP address; by reusing the data already stored on the original node, it recovers quickly and avoids copying that node's data in full from other nodes; and by keeping metadata in a reliable distributed key-value storage medium, it guarantees that every node in the cluster obtains identical metadata, avoiding the failure to recover cluster state that locally cached, stale cluster snapshots would cause.

Description

Distributed cluster fault-tolerant recovery method and system used under virtualization platform
Technical Field
The invention discloses a distributed cluster fault-tolerant recovery method and a distributed cluster fault-tolerant recovery system for a virtualization platform, and relates to the technical field of virtualization and distributed computing.
Background
A distributed system spreads and balances tasks across the whole computer system, overcoming the host-resource shortage and response bottlenecks of a traditional centralized system, and solving problems of the traditional monolithic architecture such as data heterogeneity, inability to share data, and low computational efficiency.
A core problem in distributed technology is distributed consistency (cf. the CAP principle): the data stored on each copy across multiple nodes must be kept consistent. A distributed system typically consists of multiple nodes connected by an asynchronous network, each with independent compute and storage, cooperating through network communication. Distributed application clusters usually achieve reliable data storage through multi-copy redundancy. The distributed consistency problem requires that the values of the data copies (or variables) stored on different nodes be identical; furthermore, all nodes must observe the same sequence of values for each variable.
Kubernetes (K8S for short) is an open-source platform for managing containerized applications across multiple hosts in a cloud platform, and is the most widely used containerization platform in cloud computing today. K8S deploys applications as containers: containers are isolated from one another, each has its own file system, processes in different containers cannot affect each other, and computing resources can be partitioned. Compared with a virtual machine, a container can be deployed quickly, and because containers are decoupled from the underlying infrastructure and the host file system, they can be migrated across clouds and operating-system versions.
An application cluster running under a K8S platform organizes each of its copies as a container started inside a Pod. When a Pod exits abnormally, its container is restarted, and under the conventional Pod deployment method the restarted Pod is assigned a new IP address while the previous address is withdrawn. This creates a problem for recovering the distributed application deployed in the Pod: the distributed consistency algorithm can no longer communicate over the old IP, it cannot trust whether the new IP may correctly join the cluster, and the cluster state cannot recover automatically; the cluster must be re-bootstrapped by manual intervention to restore the application cluster to its normal running state.
Nomad is an open-source, lightweight cluster management system for task scheduling; in its high-availability deployment mode, its server side performs distributed consistent data synchronization with the Raft protocol, making it representative of the problem scenario addressed by this patent.
Against this technical background, how to recover the running state of an application cluster quickly and effectively, how to tolerate an IP change after a copy restarts, and how to guarantee high availability of the application cluster are urgent problems to be solved. Existing solutions require manual intervention, respond slowly, take long to recover, must suspend service for a period of time to rebuild the cluster, and cannot reuse existing node metadata, all of which severely affect the continuity of online services.
Therefore, the invention provides a distributed cluster fault-tolerant recovery method and a distributed cluster fault-tolerant recovery system for a virtualization platform, so as to solve the problems.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a distributed cluster fault-tolerant recovery method and system for a virtualization platform. The adopted technical scheme is as follows: a distributed cluster fault-tolerant recovery method used under a virtualization platform comprises the following specific steps:
s1, a cluster starts a plurality of independent application copies for the first time, and each copy mounts an independent storage space;
s2, each application copy writes a copy unique identifier and an IP address into the distributed key value storage, inquires other copies of the IP of the same-name service, and establishes network connection with the other copies;
s3, forming a cluster snapshot by each node, and operating a Raft algorithm to enter a normal operation state;
s4, the master node operating the Raft cluster receives the write-in request and returns success of the client after the majority of dispatches are successfully submitted;
s5, restarting the node which abnormally exits in operation by using the cluster, and reallocating an IP address for the node;
s6, the restarted node queries an information list of other members in the cluster from the distributed key value storage;
s7, acquiring a cluster member IP address list, and starting to circularly detect member IP validity;
and S8, re-mounting the original storage space of the restart node, and synchronizing part of incremental data in the shutdown time period of the restart node.
The specific steps of S2 are as follows:
s21, each application copy writes a local IP address and a unique service identifier of the copy into a distributed key value storage according to a distributed key value storage access address configured locally, and inquires other IP address lists of the same-name service;
s22, when the number of the inquired addresses reaches the required seed node number guided by the Raft cluster, the node establishes network connection with the inquired IP addresses of other nodes;
s23, comparing each node in the cluster with the IP address initiated by the received opposite-side node connection request according to the inquired IP list;
s24, if the IP address requesting connection is in the IP list queried from the distributed key value store, allowing to establish the network connection, otherwise rejecting the network connection request.
The specific steps of S3 are as follows:
s31, generating a cluster snapshot for all nodes which establish network connection with the opposite side node and have correct IP addresses;
s32, based on the Raft consistency protocol algorithm of the cluster snapshot, the voting of the main node is executed.
S32 continues until the cluster election produces the master node; the cluster then enters a stable, available running state, publishes the master node address externally, and provides application service normally.
the restarted node in the S5 reads the cluster snapshot from the storage data directory; reading IP addresses of other nodes from the snapshot; network connections are established with other nodes.
And the restarted node in the S5 tries to connect to the distributed key value storage according to the locally configured distributed key value storage access address.
The S6 comprises the following specific steps:
S61, the node traverses the acquired IP addresses of the other cluster members and detects whether each member address is valid via ICMP (Internet Control Message Protocol);
and S62, if a member IP address is invalid, that member's IP address is queried from the distributed key-value store again.
The specific steps of S7 are as follows:
S71, when the whole member list of the cluster has been traversed and every IP address in the list is valid, the obtained IP address list is used to re-establish network connections with the other nodes in the cluster, and the node waits for the connection requests to be established correctly;
S72, after detecting that some nodes in the cluster are unreachable, the other nodes in the cluster query the IP addresses of the unreachable nodes from the distributed key-value store;
and S73, when, on every node in the cluster, the IP list queried from the distributed key-value store is consistent with the list of admitted network connection requests, a new cluster snapshot is regenerated.
For the IP address queried in S72:
S721, if it has not changed, the node keeps trying the old IP to restore the network connection while continuing to fetch the IP address from the distributed key-value store;
S722, if it has changed, the node attempts to restore the network connection using the new IP address;
and S723, when the latest queried IP address is consistent with the source address of a received new-IP connection request, that address is allowed to connect.
A distributed cluster fault tolerance recovery system used under a virtualization platform specifically comprises a copy starting module, a copy processing module, a node processing module, a request processing module, an exception handling module, an information inquiry module, an IP detection module and a data synchronization module:
a copy starting module: the cluster starts a plurality of independent application copies for the first time, and each copy mounts an independent storage space;
a copy processing module: each application copy writes a copy unique identifier and an IP address into the distributed key value storage, inquires other copies of the IP of the same-name service, and establishes network connection with the other copies;
a node processing module: each node forms a cluster snapshot and runs a Raft algorithm to enter a normal running state;
a request processing module: the master node of the running Raft cluster receives a write request and returns success to the client after a majority of the replicas have committed it;
an exception handling module: the cluster restarts a node that exits abnormally during operation and reallocates an IP address to it;
an information inquiry module: the restarted node queries an information list of other members in the cluster from the distributed key value storage;
and an IP detection module: acquiring a cluster member IP address list, and starting to circularly detect member IP validity;
a data synchronization module: and re-mounting the original storage space of the restarting node, and synchronizing part of incremental data in the shutdown time period of the restarting node.
The invention has the beneficial effects that: aiming at the problems mentioned in the background art, the distributed cluster fault-tolerant recovery method for the virtualization platform stores metadata information in a reliable distributed key-value storage medium, regenerates a cluster recovery snapshot after some or all copies of the application cluster restart, reuses the existing node metadata information, and rebuilds the Raft consistency protocol cluster. The method solves the problem that the application cluster cannot recover automatically when a Pod restart in a K8S environment changes its IP address; it reuses the data stored on the original node for quick recovery, avoiding a full copy of that node's data from other nodes; and because the metadata is kept in a reliable distributed key-value storage medium, the metadata obtained by every node in the cluster is completely consistent, avoiding the failure to recover cluster state caused by locally cached, stale cluster snapshots. This patent provides a method, based on the Raft algorithm, for rapidly restoring an application cluster to its normal running state, and the method is more efficient and more practical than existing methods.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a schematic diagram of an optimal copy start relationship in an embodiment of the method of the present invention; FIG. 2 is a schematic diagram of the normal operation of distributed storage in an embodiment of the method of the present invention; FIG. 3 is a schematic diagram of a distributed storage operation formed by an embodiment of the method of the present invention; fig. 4 is a flow chart of an implementation of an embodiment of the method of the present invention.
Detailed Description
The present invention is further described below in conjunction with the following figures and specific examples so that those skilled in the art may better understand the present invention and practice it, but the examples are not intended to limit the present invention.
The first embodiment is as follows:
a distributed cluster fault tolerance recovery method used under a virtualization platform comprises the following specific steps:
the method comprises the steps that (1) an S1K8S application cluster (such as Nomad) is started for the first time, independent multi-copy cluster application is started according to the optimal number (such as 3) of copies of a distributed consistency algorithm, each copy is mounted with an independent storage space and used for storing business data needing real-time synchronization when application service runs, and a schematic diagram is shown in figure 1;
s2, each application copy writes a copy unique identifier and an IP address into the distributed key value storage, inquires other copies of the IP of the same-name service, and establishes network connection with the other copies;
s3, forming a cluster snapshot by each node, and operating a Raft algorithm to enter a normal operation state;
S4, the master node of the running Raft cluster receives a write request and returns success to the client after a majority of the replicas have committed it:
while the application service cluster runs, the master node receives a write command and synchronizes the written business data to the other replica nodes in the cluster; each replica node updates the application data in its own mounted storage directory, and a successful write result is returned only after a majority of the replica nodes have acknowledged storing the data. The multi-copy data redundancy of the Raft cluster thus guarantees data reliability;
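The majority-commit rule described in S4 can be sketched as a tiny decision function (an illustrative Python sketch; the function and variable names are ours, not the patent's):

```python
def majority_committed(acks: int, cluster_size: int) -> bool:
    # A Raft write is durable once a strict majority of the cluster
    # has acknowledged it, e.g. 2 of 3 nodes or 3 of 5 nodes.
    return acks >= cluster_size // 2 + 1

def handle_write(replica_acks: list) -> str:
    # The master node counts its own local write as one acknowledgement
    # and answers the client only after a majority has committed.
    acks = 1 + sum(1 for ok in replica_acks if ok)
    return "success" if majority_committed(acks, 1 + len(replica_acks)) else "retry"
```

With three nodes, one replica acknowledgement plus the leader's own write already forms a majority, which is why a single slow follower does not block the client.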
S5, the cluster restarts a node that exits abnormally during operation and reallocates an IP address to it;
s6, the restarted node queries information lists of other members in the cluster from the distributed key value storage;
S7, the cluster member IP address list is acquired, and cyclic probing of member IP validity begins;
S8, the restarted node re-mounts its original storage space and can fully reuse the complete data it previously stored, which prevents the restarted node from re-copying all data in full; only the incremental data produced during the node's downtime is synchronized;
further, the specific steps of S2 are as follows:
S21, each application copy writes its local IP address and unique service identifier (UUID) into the distributed key-value store at the locally configured access address, and queries the list of the other IP addresses of the same-name service;
S22, when the number of queried addresses reaches the number of seed nodes required to bootstrap the Raft cluster, the node establishes network connections with the queried IP addresses of the other nodes;
S23, each node in the cluster checks the source IP address of every received peer (Peer) node connection request against the queried IP list;
S24, if the IP address requesting the connection is in the IP list queried from the distributed key-value store, the network connection is allowed to be established; otherwise, the connection request is rejected;
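Steps S21–S24 amount to: register `<service>/<UUID> → IP` in the key-value store, query the peers under the same service prefix, and admit only connection requests whose source IP appears in that query. A self-contained sketch (the `FakeKVStore` class stands in for a real distributed key-value store such as etcd, which the patent does not name; all identifiers here are illustrative):

```python
import uuid

class FakeKVStore:
    """Dict-backed stand-in for a distributed key-value store."""
    def __init__(self):
        self._data = {}
    def put(self, key: str, value: str) -> None:
        self._data[key] = value
    def get_prefix(self, prefix: str) -> dict:
        return {k: v for k, v in self._data.items() if k.startswith(prefix)}

class Replica:
    def __init__(self, kv: FakeKVStore, service: str, ip: str):
        self.kv, self.service, self.ip = kv, service, ip
        self.node_id = str(uuid.uuid4())   # copy unique identifier (S21)

    def register(self) -> None:
        # S21: write <service>/<UUID> -> IP into the key-value store.
        self.kv.put(f"{self.service}/{self.node_id}", self.ip)

    def peer_ips(self) -> set:
        # S21: query the IPs of all same-name service copies except our own.
        return {ip for ip in self.kv.get_prefix(self.service + "/").values()
                if ip != self.ip}

    def accept_connection(self, source_ip: str) -> bool:
        # S23/S24: admit a peer only if its source IP is in the queried list.
        return source_ip in self.peer_ips()
```

A node that shows up with an IP never registered under the service prefix is rejected, which is exactly the guard that later blocks a restarted Pod until it re-registers its new address.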
further, the specific steps of S3 are as follows:
s31, generating a cluster snapshot for all nodes which establish network connection with the opposite side node and have correct IP addresses;
s32, executing main node voting based on the Raft consistency protocol algorithm of the cluster snapshot;
when all IP addresses inquired by the node are correct and establish network connection with the opposite side node, the node generates a cluster snapshot (snapshot), wherein the snapshot comprises information such as unique identifiers (UUIDs) of all nodes in the cluster, IP addresses and a snapshot timestamp;
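A minimal sketch of the snapshot just described, holding each node's UUID, its IP address, and a timestamp (the JSON layout is our assumption; the patent fixes only the contents, not the encoding):

```python
import json
import time

def build_cluster_snapshot(members: dict) -> str:
    # S31: once every queried IP is verified and connected, serialize a
    # snapshot of {uuid: ip} membership plus a snapshot timestamp.
    snapshot = {
        "timestamp": int(time.time()),
        "members": [{"uuid": u, "ip": ip} for u, ip in sorted(members.items())],
    }
    return json.dumps(snapshot)
```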
further, in S32, a master node (leader) election is executed based on the Raft consistency protocol algorithm over the cluster snapshot until the cluster election produces the master node; the cluster then enters a stable, available running state, publishes the master node address externally, and provides application service normally, with a schematic diagram of normal operation shown in fig. 2;
further, when a copy (Pod) node of the cluster exits during operation because of a program abnormality or a similar cause as in S5, the K8S platform restarts the node and its deployment policy reallocates an IP address to it, so the old IP address is no longer available; the restarted node reads the cluster snapshot from its storage data directory, reads the IP addresses of the other nodes from the snapshot, and tries to establish network connections with them;
but the other nodes will refuse the connection request, because the cluster snapshot they have cached does not contain the newly allocated IP address;
the Raft cluster rejects the connection request in order to:
avoid the large amount of business data copying that adding a brand-new node would trigger, which would leave the business cluster unavailable for a long time;
and avoid incorrectly bootstrapping an untrusted new node into the cluster, which could lose business data;
in order to avoid the problem that the new IP address cannot join the cluster, further, the restarted node in S5 tries to connect to the distributed key-value store at the locally configured access address; after connecting successfully, the node's locally stored unique identifier, which has not changed, serves as proof that the node was correctly bootstrapped into the cluster, and the node uses that unique identifier as the key to update its IP address information in the distributed key-value store;
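This re-registration step can be sketched as follows, with the distributed key-value store modelled as a plain dict and all names illustrative:

```python
def reannounce_ip(kv: dict, service: str, node_id: str, new_ip: str) -> None:
    # The locally persisted unique identifier survives the restart and acts
    # as proof that this node was already bootstrapped into the cluster,
    # so it may simply overwrite its own IP entry under that identifier.
    key = f"{service}/{node_id}"
    if key not in kv:
        raise KeyError("unknown identifier: node was never bootstrapped")
    kv[key] = new_ip
```

Because the UUID, not the IP, is the key, the entry for the node is updated in place rather than duplicated, so peers querying the store always see exactly one address per member.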
further, the specific step of S6 is as follows:
S61, the node traverses the acquired IP addresses of the other cluster members and detects whether each member address is valid via ICMP (Internet Control Message Protocol);
S62, if an IP address is invalid, that member's IP address is queried from the distributed key-value store again, and the attempt continues until all other member IP addresses are valid or the detection timeout is finally reached;
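Steps S61–S62 form a probe-and-refresh loop. A hedged sketch with the ICMP probe and the key-value-store lookup injected as callables, so the example needs neither raw sockets nor a real store (all names are ours):

```python
import time

def wait_members_reachable(ips, probe, refresh, timeout=30.0, interval=0.5):
    # S61: traverse the member addresses and probe each one (ICMP in the
    # patent; here `probe(ip) -> bool` is supplied by the caller).
    # S62: an address that fails is re-queried from the key-value store via
    # `refresh(ip) -> new_ip` until every member answers or time runs out.
    deadline = time.monotonic() + timeout
    ips = list(ips)
    while time.monotonic() < deadline:
        stale = {ip for ip in ips if not probe(ip)}
        if not stale:
            return ips            # every member address is valid
        ips = [refresh(ip) if ip in stale else ip for ip in ips]
        time.sleep(interval)
    raise TimeoutError("cluster members still unreachable after timeout")
```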
further, the specific steps of S7 are as follows:
S71, when the whole member list of the cluster has been traversed and every IP address in the list is valid, the obtained IP address list is used to re-establish network connections with the other nodes in the cluster, and the node waits for the connection requests to be established correctly;
S72, after detecting that some nodes in the cluster are unreachable, the other nodes in the cluster query the IP addresses of the unreachable nodes from the distributed key-value store;
S73, when, on every node in the cluster, the IP list queried from the distributed key-value store is consistent with the list of admitted network connection requests, a new cluster snapshot is regenerated;
still further, for the IP address queried in S72:
S721, if it has not changed, the node keeps trying the old IP to restore the network connection while continuing to fetch the IP address from the distributed key-value store;
S722, if it has changed, the node attempts to restore the network connection using the new IP address;
S723, when the latest queried IP address is consistent with the source address of a received new-IP connection request, that address is allowed to connect; the Raft consistency protocol then re-elects the master node based on the latest cluster snapshot, the cluster enters its normal service state, and the process of rejoining the restarted node to the cluster is complete.
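The three cases S721–S723 can be condensed into one pure decision function (the return labels are illustrative, not from the patent): it compares the peer IP cached in the local snapshot, the latest IP queried from the key-value store, and the source IP of an incoming connection request.

```python
def reconnect_decision(cached_ip: str, queried_ip: str, request_ip: str) -> str:
    if queried_ip == cached_ip:
        # S721: address unchanged -- keep retrying the old IP while
        # continuing to poll the key-value store.
        return "retry-old-ip"
    if request_ip == queried_ip:
        # S723: the request comes from the latest registered IP -- admit it.
        return "admit"
    # S722: address changed; dial the newly queried IP ourselves.
    return "dial-new-ip"
```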
Example two:
On the basis of the first embodiment, if several nodes, or all nodes, in the cluster exit abnormally and are restarted, they rejoin the cluster according to the processes described in S4 to S6 above until the cluster returns to stable normal operation.
The method caches the cluster metadata through S2 to S3 and updates the cached node IP address information promptly, effectively improving the efficiency and reliability of metadata synchronization among the cluster members;
through the cluster snapshot construction process of S4 to S8, it tolerates the IP change after a copy restart that would otherwise force a manual re-bootstrap of the cluster, avoids the need for a restarted node to copy its data in full from the original nodes in the cluster, guarantees that the generated cluster snapshot is valid, and greatly raises the success rate of recovering the cluster from abnormal states, thereby reducing operation-and-maintenance complexity and cost;
compared with the prior art, the method provided by the invention recovers cluster nodes from abnormal states faster and more efficiently.
Example three:
a distributed cluster fault-tolerant recovery system used under a virtualization platform specifically comprises a copy starting module, a copy processing module, a node processing module, a request processing module, an exception handling module, an information inquiry module, an IP detection module and a data synchronization module:
a copy starting module: a cluster starts a plurality of independent application copies for the first time, and each copy mounts an independent storage space;
a copy processing module: each application copy writes a copy unique identifier and an IP address into the distributed key value storage, inquires other copies of the IP of the same-name service, and establishes network connection with the other copies;
the node processing module: each node forms a cluster snapshot and runs a Raft algorithm to enter a normal running state;
a request processing module: the master node of the running Raft cluster receives a write request and returns success to the client after a majority of the replicas have committed it;
an exception handling module: the cluster restarts a node that exits abnormally during operation and reallocates an IP address to it;
an information inquiry module: the restarted node queries an information list of other members in the cluster from the distributed key value storage;
and an IP detection module: acquiring a cluster member IP address list, and starting to circularly detect member IP validity;
a data synchronization module: and re-mounting the original storage space of the restart node, and synchronizing part of incremental data in the shutdown time period of the restart node.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (9)

1. A distributed cluster fault-tolerant recovery method used under a virtualization platform is characterized by comprising the following specific steps:
s1, a cluster starts a plurality of independent application copies for the first time, and each copy mounts an independent storage space;
s2, each application copy writes a copy unique identifier and an IP address into the distributed key value storage, inquires other copies of the IP of the same-name service, and establishes network connection with the other copies;
s3, forming a cluster snapshot by each node, and operating a Raft algorithm to enter a normal operation state;
S4, the master node of the running Raft cluster receives a write request and returns success to the client after a majority of the replicas have committed it;
s5, restarting the node which abnormally exits in operation by using the cluster, and reallocating an IP address for the node;
s6, the restarted node queries an information list of other members in the cluster from the distributed key value storage;
s7, acquiring a cluster member IP address list, and starting to circularly detect member IP validity;
and S8, re-mounting the original storage space of the restart node, and synchronizing part of incremental data in the shutdown time period of the restart node.
2. The method of claim 1, wherein the step of S2 is as follows:
s21, each application copy writes a local IP address and a unique service identifier of the copy into a distributed key value storage according to a distributed key value storage access address configured locally, and inquires other IP address lists of the same-name service;
s22, when the number of the inquired addresses reaches the required seed node number guided by the Raft cluster, the node establishes network connection with the inquired IP addresses of other nodes;
S23, each node in the cluster checks the source IP address of every received opposite-side node connection request against the queried IP list;
s24, if the IP address requesting connection is in the IP list queried from the distributed key value store, allowing to establish the network connection, otherwise rejecting the network connection request.
3. The method of claim 2, wherein step S3 comprises:
S31, a cluster snapshot is generated covering all nodes that have established network connections with their peer nodes and whose IP addresses are correct;
S32, master node voting and election are performed by the Raft consensus protocol algorithm on the basis of the cluster snapshot.
4. The method as claimed in claim 3, wherein in S32 the master node election based on the cluster snapshot and the Raft consensus protocol algorithm is performed until the cluster elects a master node, whereupon the cluster enters a stable, available operating state and publishes the master node address externally so that the application service is provided normally.
5. The method as claimed in claim 4, wherein the restarted node in S5 reads the cluster snapshot from the storage data directory, reads the IP addresses of the other nodes from the snapshot, and establishes network connections with the other nodes.
6. The method as claimed in claim 5, wherein the restarted node in S5 attempts to connect to the distributed key-value store based on the locally configured key-value store access address.
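The peer-discovery path of claims 5 and 6 can be sketched as a snapshot-first lookup with the key-value store as fallback; `kv_lookup` is a hypothetical callable standing in for a query against the store:

```python
def discover_peers(snapshot_ips, kv_lookup):
    """Claims 5-6 (sketch): a restarted node first takes peer IP addresses
    from the on-disk cluster snapshot; if the snapshot yields no usable
    addresses, it falls back to querying the distributed key-value store
    at the locally configured access address."""
    if snapshot_ips:
        return list(snapshot_ips)
    return list(kv_lookup())
```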
7. The method of claim 6, wherein step S6 comprises:
S61, the node traverses the acquired IP addresses of the other cluster members and checks whether each member address is valid through ICMP (Internet Control Message Protocol);
and S62, if a member IP address is invalid, the member IP address is looked up again from the distributed key-value store.
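The probe-and-requery loop of steps S61–S62 can be sketched as below; `is_reachable` stands in for an ICMP echo probe (a real implementation would send an echo request and wait for the reply) and `kv_lookup` for a re-query against the key-value store — both names are hypothetical:

```python
def refresh_member_list(members, is_reachable, kv_lookup):
    """S61/S62 (sketch): walk the acquired member list, probe each address,
    and re-query the key-value store for any member whose address is no
    longer valid."""
    refreshed = {}
    for member_id, ip in members.items():
        # keep a reachable address; otherwise fetch the member's
        # current address from the store
        refreshed[member_id] = ip if is_reachable(ip) else kv_lookup(member_id)
    return refreshed
```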
8. The method of claim 7, wherein step S7 comprises:
S71, when the entire member list of the cluster has been traversed and every IP address in the list is valid, the obtained IP address list is used to re-establish network connections with the other nodes of the cluster, and the connection requests are awaited until correctly established;
S72, after detecting that some nodes in the cluster are unreachable, the other nodes of the cluster query the IP addresses of the unreachable nodes from the distributed key-value store;
and S73, when the IP list queried from the distributed key-value store by each node in the cluster is consistent with the list of accepted network connection requests, a new cluster snapshot is regenerated.
9. The method as claimed in claim 8, wherein for the IP address of S72:
S721, if the address has not changed, connecting to the old IP is retried to recover the network connection, and the IP address continues to be fetched from the distributed key-value cache;
S722, if the address has changed, the new IP address is used to attempt to recover the network connection;
and S723, when the latest queried IP address is consistent with the address of the newly received IP connection request, the address is allowed access.
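The address-change decision of steps S721–S723 reduces to two small checks; this is an illustrative sketch with hypothetical names, not the patent's implementation:

```python
def reconnect_address(old_ip, latest_ip):
    # S721/S722: keep retrying the old address while the queried address
    # is unchanged; switch to the new address once the key-value store
    # reports a change
    return latest_ip if latest_ip != old_ip else old_ip


def admit_reconnect(request_ip, latest_ip):
    # S723: accept an inbound connection only when its source address
    # matches the latest address queried from the key-value store
    return request_ip == latest_ip
```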
A distributed cluster fault-tolerant recovery system for use under a virtualization platform, characterized by specifically comprising a copy starting module, a copy processing module, a node processing module, a request processing module, an exception handling module, an information query module, an IP detection module and a data synchronization module, wherein:
the copy starting module: when the cluster starts for the first time, starts a plurality of independent application copies, each copy mounting an independent storage space;
the copy processing module: each application copy writes its unique copy identifier and IP address into the distributed key-value store, queries the IP addresses of the other copies of the same-named service, and establishes network connections with the other copies;
the node processing module: each node forms a cluster snapshot and runs the Raft algorithm to enter the normal operation state;
the request processing module: the master node of the running Raft cluster receives write requests and returns success to the client after a majority of the copies have successfully committed;
the exception handling module: the cluster restarts a node that exits abnormally during operation and reallocates an IP address to the node;
the information query module: the restarted node queries the information list of the other members of the cluster from the distributed key-value store;
the IP detection module: acquires the cluster member IP address list and starts cyclically checking the validity of each member IP;
and the data synchronization module: re-mounts the original storage space of the restarted node and synchronizes the incremental data written during the downtime of the restarted node.
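The system claim's module composition, and how the recovery path (steps S5–S8) threads through it, can be sketched as a dataclass of callables; all field names and the `recover` wiring are hypothetical illustrations, not the patent's structure:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FaultTolerantRecoverySystem:
    """Sketch of the eight modules named in the system claim, each
    represented by a callable."""
    start_copies: Callable      # copy starting module
    register_copy: Callable     # copy processing module
    form_snapshot: Callable     # node processing module
    handle_write: Callable      # request processing module
    restart_node: Callable      # exception handling module
    query_members: Callable     # information query module
    probe_members: Callable     # IP detection module
    sync_increments: Callable   # data synchronization module

    def recover(self, node_id):
        # recovery path of steps S5-S8 wired through the modules
        self.restart_node(node_id)
        members = self.query_members(node_id)
        self.probe_members(members)
        self.sync_increments(node_id)
```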
CN202210783786.3A 2022-07-05 2022-07-05 Distributed cluster fault-tolerant recovery method and system used under virtualization platform Pending CN115202917A (en)

Publications (1)

Publication Number Publication Date
CN115202917A true CN115202917A (en) 2022-10-18

Family

ID=83578571

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118277488A (en) * 2024-05-31 2024-07-02 济南浪潮数据技术有限公司 Distributed storage cluster management method, device and medium for super fusion system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination