CN117579465A - Fault processing method, device, equipment and storage medium - Google Patents

Fault processing method, device, equipment and storage medium

Info

Publication number
CN117579465A
Authority
CN
China
Prior art keywords
node
standby
management interface
main
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311416025.5A
Other languages
Chinese (zh)
Inventor
杨容权
龚肖
张慧敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lenovo Beijing Ltd filed Critical Lenovo Beijing Ltd
Priority to CN202311416025.5A
Publication of CN117579465A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/06 Management of faults, events, alarms or notifications
    • H04L 41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L 41/0663 Performing the actions predefined by failover planning, e.g. switching to standby network elements
    • H04L 41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Hardware Redundancy (AREA)

Abstract

The application discloses a fault processing method, device, equipment and storage medium, applied to a multi-node cluster. The method includes: if it is determined that the master node is abnormal, switching the virtual ip address of the master node to a standby node; and simulating the master node to transmit state information to a management interface, wherein the state information includes at least abnormality information of the master node.

Description

Fault processing method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of multi-node cluster technologies, and in particular, to a fault handling method, device, apparatus, and storage medium.
Background
In a multi-node cluster built on an open source container orchestration system (Kubernetes), if the master node fails and the failure cannot be reported in time, service recovery time is wasted. Service recovery involves cooperation among multiple components of the multi-node cluster, which is difficult; the recovery process is complex, time-consuming and inefficient.
Disclosure of Invention
Embodiments of the present application are intended to provide a fault processing method, device, equipment and storage medium.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a fault processing method, which is applied to a multi-node cluster and includes the following steps: if it is determined that the master node is abnormal, switching the virtual ip address of the master node to a standby node; and simulating the master node to transmit state information to a management interface, wherein the state information includes at least abnormality information of the master node.
The embodiment of the application provides a fault processing device, which comprises:
the switching module is used for switching the virtual ip address of the master node to the standby node if it is determined that the master node is abnormal;
the simulation module is used for simulating the master node to transmit state information to the management interface;
wherein the state information includes at least abnormality information of the master node.
The embodiment of the application provides fault processing equipment, which comprises: a processor, a memory, and a communication bus;
the communication bus is used for realizing communication connection between the processor and the memory;
the processor is configured to execute the computer program stored in the memory, so as to implement the fault processing method.
Embodiments of the present application provide a computer readable storage medium storing one or more computer programs executable by one or more processors to implement the above-described fault handling method.
The embodiment of the application provides a fault processing method, device, equipment and storage medium applied to a multi-node cluster, wherein the method includes: if it is determined that the master node is abnormal, switching the virtual ip address of the master node to a standby node; and simulating the master node to transmit state information to a management interface, wherein the state information includes at least abnormality information of the master node.
Drawings
FIG. 1 is a schematic flow chart of a fault handling method provided in the prior art;
fig. 2 is a schematic flow chart of a fault handling method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an exemplary multi-node cluster according to an embodiment of the present disclosure;
FIG. 4 is a schematic flow chart of an exemplary simulated fault report provided in an embodiment of the present application;
FIG. 5 is a flow chart of an exemplary fault handling method according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of a fault handling device according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of fault handling equipment according to an embodiment of the present application.
Detailed Description
The technical solutions in the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the specific embodiments described herein are merely illustrative of the application and not limiting of the application. It should be noted that, for convenience of description, only the portions related to the present application are shown in the drawings.
Fig. 1 is a schematic flow chart of a fault handling method provided in the prior art. If the master node fails, the existing solution relies on the service recovery mechanism of Kubernetes. As shown in fig. 1, an exemplary service recovery process is: step S101, after the master node fails, the distributed key-value storage system (Etcd database) service of the cluster is interrupted; step S102, the standby node fails to acquire the lock (lease) resource from the cluster Etcd database and keeps retrying to acquire the lease and become its owner; step S103, the Etcd database service is eventually restored; step S104, the standby node successfully acquires the lease resource, is promoted to master node, and starts all control components; step S105, the series of components managed by the control management component (kube-controller-manager), such as the node controller (node-controller), the service controller (service-controller) and the endpoint controller (endpoint-controller), start to work normally; step S106, after the node-monitor-grace-period elapses, the powered-off node (the original master node) is determined to have failed (not-ready), and the state of all applications (Pod) on that node is updated through the management interface (API-server); step S107, after the API-server modifies the state of the pods, the management component (kube-proxy) running on the standby node monitors the change, updates the service forwarding rule table (ipvs table entries) of the services (service) corresponding to those pods, and after the update is completed the cluster services are restored to an available state. As can be seen from fig. 1, service recovery of the cluster relies on the cooperation of multiple components, and the cooperation mechanism of each component is complex, which introduces a number of risks for the high availability of cluster services: if any one component fails to achieve high availability, the service becomes unavailable. In addition, service recovery in the existing solution is time consuming, especially waiting for the control management components to resume normal operation, which affects the traffic running on the multi-node cluster.
To address the above defects in the related art, the application provides a fault processing method, which is discussed in detail below:
the embodiment of the application provides a fault processing method, which is implemented by fault processing equipment, as shown in fig. 2, and includes the following steps S201 and S202:
step S201, if it is determined that the master node is abnormal, the virtual ip address of the master node is switched to the standby node.
In the embodiment of the present application, the fault handling device is an electronic device having a fault handling function, for example a computer device, a smart terminal device, or the like.
In the embodiment of the application, if the fault processing device determines that the master node in the multi-node cluster is abnormal, the virtual ip address of the master node is switched to the standby node in the multi-node cluster. It should be noted that the multi-node cluster may be a multi-node cluster built on an open source container orchestration system, or a multi-node cluster built on a cloud development platform (OpenShift) or a workload orchestration tool (Nomad), which is not limited in this application.
Illustratively, as shown in FIG. 3, a multi-node cluster may have a plurality of nodes deployed therein, wherein the plurality of nodes may include a primary node and a backup node.
In the embodiment of the application, if the fault processing device monitors that the master node is abnormal, the virtual ip address of the master node is switched to the standby node, so that the standby node plays the role of the master node in the multi-node cluster.
Step S202, simulating the master node to transmit state information to the management interface; wherein the state information includes at least abnormality information of the master node.
In an embodiment of the present application, the fault handling device may simulate the transmission of status information from the master node to the management interface using the standby node.
Illustratively, as shown in fig. 3, a control platform (Control plane) is further deployed in the multi-node cluster, and a management interface (API-server) is deployed on the control platform. Thus, after the master node fails, the fault processing device can use the standby node to simulate the master node and report the abnormality information of the master node to the management interface.
Compared with the related art, in which the master node cannot report its fault in time, switching the virtual ip address of the master node to the standby node when the master node is abnormal allows the standby node to simulate the master node and report the abnormality information of the master node to the management interface in time. This avoids the defect that a fault cannot be reported to the management interface when a node fails, and improves fault handling efficiency. In addition, compared with the prior art shown in fig. 1, the present solution does not depend on the control management component, simplifies the high-availability mechanism of the overall service, is more reliable, and shortens the service recovery time from the minute level to the second level. Because the standby node is directly used to simulate the master node and report the abnormality information to the management interface, the native code is not invasively modified, and the solution is easy to implement and maintain.
In some embodiments, before performing the above step S201, the fault handling apparatus may further perform the following step: if the standby node does not receive a status message from the master node within a preset duration, it is determined that the master node is abnormal.
In embodiments of the present application, the standby node may monitor the status of the primary node. As shown in fig. 3, a high availability Service component (Service-HA) is deployed on both the primary node and the backup node.
Exemplarily, the high availability service component may include a detection module, and the detection module may include a heartbeat detection module (heartbeat) deployed on both the primary and standby nodes for detecting whether the peer node is alive and the state of the high-reliability module currently associated with the peer. The workflow of the heartbeat detection module may be as follows: the heartbeat detection module contains a heartbeat monitoring part, which can run over a network link or a serial port and supports redundant links; the master node and the standby node send messages (status messages) to each other to report their current state, and if no message is received from the peer within a designated time (the preset duration), the peer is considered to have failed, that is, to be abnormal. The high availability provided by the heartbeat detection module is at the operating-system level, and service-level high availability can be realized through simple script control. The preset duration may be set according to the application scenario and actual requirements, which is not limited in this application.
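For illustration only, the heartbeat exchange described above can be sketched as follows. The addresses, port, message format and the 2 s / 10 s timings are assumptions made for this sketch rather than part of the claimed implementation, and a real deployment would more likely rely on an existing heartbeat tool (for example heartbeat or keepalived).

    # Minimal heartbeat-monitoring sketch on the standby node (illustrative only;
    # peer address, port, message format and timings are assumptions).
    import socket
    import time

    PEER_ADDR = ("192.168.1.10", 9000)   # assumed address of the peer (master) node
    INTERVAL = 2                         # assumed heartbeat period in seconds
    PRESET_DURATION = 10                 # assumed "preset duration" before declaring a failure

    def peer_alive() -> bool:
        """Send a probe to the peer and wait for its status message."""
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.settimeout(PRESET_DURATION)
            try:
                s.sendto(b"PING", PEER_ADDR)
                data, _ = s.recvfrom(1024)   # the peer replies with its current state
                return data.startswith(b"ALIVE")
            except socket.timeout:
                return False                 # no status message within the preset duration

    def monitor() -> None:
        while peer_alive():
            time.sleep(INTERVAL)
        print("master node considered abnormal, starting takeover")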
Illustratively, the scenarios in which the master node becomes abnormal may include: the host server of the master node is physically down (hardware damage or operating system failure); the heartbeat detection module on the master node fails (crashes or is shut down); or the connection line between the master and standby nodes (servers) fails.
In the embodiment of the present application, when performing "switch the virtual ip address of the primary node to the standby node" in step S201, the fault handling device may perform the following step: switching the virtual ip address of the master node to the standby node through a resource takeover component of the standby node, so that the standby node takes over the resources and services running on the master node.
Illustratively, the detection module in the high availability service component deployed on the primary and standby nodes further comprises a virtual internet protocol (vip) probing module, mainly used to probe whether the virtual ip is on the current node. If the standby node detects that the virtual ip is not on the current node and the master node is abnormal, the fault handling device starts the resource takeover part of the heartbeat detection module to take over the resources and services running on the peer host (the master node).
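For illustration only, the virtual ip takeover on the standby node may look like the following sketch. It assumes a Linux host with the iproute2 ip tool and the arping utility installed; the interface name and the virtual ip are placeholders.

    # Sketch of taking over the virtual ip on the standby node (illustrative only;
    # assumes Linux with the "ip" and "arping" commands; names are placeholders).
    import subprocess

    VIP = "192.168.1.100/24"   # assumed virtual ip of the master node
    IFACE = "eth0"             # assumed network interface of the standby node

    def take_over_vip() -> None:
        # Bind the virtual ip to the local interface of the standby node.
        subprocess.run(["ip", "addr", "add", VIP, "dev", IFACE], check=True)
        # Announce the new location of the vip with gratuitous ARP so that
        # switches and peer hosts refresh their ARP caches.
        subprocess.run(
            ["arping", "-U", "-c", "3", "-I", IFACE, VIP.split("/")[0]],
            check=False,
        )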
In some embodiments, when performing the above step S202, the fault handling apparatus may perform the following step: through the standby node, simulating the proxy component of the master node, calling the representational state transfer application programming interface, and sending the state information to the management interface.
In an embodiment of the present application, the fault handling device uses the standby node to emulate the proxy component (kubelet) of the master node and invoke the representational state transfer application programming interface (Representational State Transfer Application Programming Interface, REST API) to send the state information to the management interface.
Illustratively, as shown in fig. 3, the master node is provided with a proxy component, and the high-availability service component deployed on the standby node may further include a fault repairing module. If the master node is abnormal, the fault processing device simulates the proxy component of the master node through the standby node, invokes the representational state transfer application programming interface, and sends the state information to the management interface in combination with the information (status messages) acquired by the detection module.
Illustratively, as shown in fig. 3, the standby node may simulate the proxy component of the master node, invoke the representational state transfer application programming interface, send a repair request (PATCH) carrying the state information of the master node to the management interface, and update the node state in the management interface. An example of constructing the repair request is as follows: (1) specify the internet protocol (ip) address of the management interface; (2) specify the authentication information for fault repair; (3) determine the request header information based on the authentication information; (4) add the name of the failed node; (5) construct the interface (endpoint) of the repair request; (6) construct the json data marking the failed node (master node) as unavailable; (7) send the management interface request that marks the node (master node) as unavailable. When the node (master node) is marked as unavailable, information such as the state type, state reason, state message, last heartbeat time and last state transition time of the master node is recorded.
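For illustration only, steps (1) to (7) can be sketched as a single PATCH call to the node status interface of the API-server, which mirrors how the kubelet normally reports node conditions. The API-server address, token and node name are placeholders, TLS verification is disabled only to keep the sketch short, and the exact payload fields may differ in a real deployment.

    # Sketch of the repair request of steps (1)-(7): the standby node impersonates
    # the proxy component (kubelet) of the failed master node and PATCHes the node
    # status through the management interface. All concrete values are placeholders.
    import json
    import ssl
    import urllib.request
    from datetime import datetime, timezone

    API_SERVER = "https://192.168.1.100:6443"   # (1) ip address of the management interface
    TOKEN = "<service-account-token>"           # (2) authentication information for fault repair
    NODE_NAME = "master-node-1"                 # (4) name of the failed node

    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    body = json.dumps({                         # (6) json data marking the failed node as unavailable
        "status": {
            "conditions": [{
                "type": "Ready",                # state type
                "status": "Unknown",            # the node is no longer known to be ready
                "reason": "NodeStatusUnknown",  # state reason
                "message": "reported by standby node after heartbeat loss",  # state message
                "lastHeartbeatTime": now,       # last heartbeat time
                "lastTransitionTime": now,      # last state transition time
            }]
        }
    }).encode()

    req = urllib.request.Request(
        url=f"{API_SERVER}/api/v1/nodes/{NODE_NAME}/status",   # (5) interface of the repair request
        data=body,
        method="PATCH",                                        # (7) send the request marking the node unavailable
        headers={                                              # (3) request header built from the authentication info
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/strategic-merge-patch+json",
        },
    )
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE      # TLS verification skipped in this sketch only
    urllib.request.urlopen(req, context=ctx)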
In some embodiments, after performing the above step S202, the fault handling apparatus may further perform the following step S401, as shown in fig. 4:
step S401, detecting a service state of a preset database in the multi-node cluster.
In the embodiment of the present application, the fault handling device may detect the service state of a preset database in the multi-node cluster. As shown in fig. 3, a distributed key-value storage system (the preset database) is deployed on the control platform, and the fault handling device detects the service state of this preset database of the multi-node cluster.
Accordingly, the fault handling apparatus may include the following step S402 when executing the above step S202:
step S402, if the service state of the preset database is normal, the master node is simulated to transmit state information to the management interface through the standby node.
In the embodiment of the present application, if the service state of the preset database is normal, the fault handling device simulates, through the standby node, the master node to transmit the state information to the management interface.
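For illustration only, the check of the service state of the preset database can be sketched as a probe of the etcd health endpoint. The endpoint address and the plain-HTTP access are assumptions; a real cluster usually requires TLS client certificates. If the probe fails, the reporting of the state information is deferred until the preset database recovers, consistent with steps S507 and S508 described below.

    # Sketch of detecting the service state of the preset database (etcd) before
    # the state information is reported (illustrative only; endpoint is a placeholder).
    import json
    import urllib.request

    ETCD_ENDPOINT = "http://192.168.1.101:2379"   # assumed etcd client endpoint

    def etcd_healthy() -> bool:
        try:
            with urllib.request.urlopen(f"{ETCD_ENDPOINT}/health", timeout=3) as resp:
                return json.loads(resp.read()).get("health") == "true"
        except OSError:
            return False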
In some embodiments, after performing the above step S202, the fault handling apparatus may further perform the following steps: and updating the state of the master node in the preset database according to the abnormal information of the master node included in the state information through the management interface.
Illustratively, as shown in fig. 3, the fault handling apparatus updates the node abnormality information to a preset database according to the node name included in the state information through a management interface of the control platform.
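For illustration only, the updated node state can be read back through the management interface, which also shows the state type, reason, message and timestamps recorded for the master node. The placeholders below match the earlier repair-request sketch and are assumptions, not values defined by this application.

    # Sketch of reading the node object back after the update (illustrative only;
    # placeholders match the earlier repair-request sketch).
    import json
    import ssl
    import urllib.request

    API_SERVER = "https://192.168.1.100:6443"
    TOKEN = "<service-account-token>"
    NODE_NAME = "master-node-1"

    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE      # TLS verification skipped in this sketch only

    req = urllib.request.Request(
        url=f"{API_SERVER}/api/v1/nodes/{NODE_NAME}",
        headers={"Authorization": f"Bearer {TOKEN}"},
    )
    with urllib.request.urlopen(req, context=ctx) as resp:
        node = json.loads(resp.read())

    for cond in node["status"]["conditions"]:
        if cond["type"] == "Ready":
            print(cond["status"], cond["reason"], cond["lastTransitionTime"])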
In some embodiments, the fault handling apparatus may further perform the steps of: if the state of the main node in the preset database is updated, updating a service forwarding rule table of the multi-node cluster through a management component of the standby node.
Illustratively, after the failure handling device updates the distributed key value storage system through the management interface of the control platform, the management component of the standby node monitors changes to the open source container orchestration system services and endpoint objects and updates the service forwarding rule table accordingly.
Fig. 5 is a flow chart of an exemplary fault handling method according to an embodiment of the present application. This exemplary fault handling method, as shown in fig. 5, includes the following steps S501 to S511:
step S501, the node is started.
Here, the active and standby nodes in the multi-node cluster are all started up and run, and then step S502 is executed.
Step S502, the high availability service component is started.
Here, the failure processing apparatus starts up the high availability service component on the primary and standby nodes, and thereafter, performs step S503.
Step S503, heartbeat detection.
When the fault processing device works normally in the multi-node cluster, the detection module of the high-availability service component deployed on the standby node monitors the heartbeat of the master node with a period of 2 s.
Step S504, whether the peer node fails.
Here, if the standby node keeps receiving the status messages sent by the master node, it indicates that the peer node (master node) has not failed, and execution returns to step S503; if the standby node does not receive a status message from the master node within the preset duration, it indicates that the peer node has failed, and step S505 is executed.
Step S505, whether the virtual ip address is at the home end.
Here, the standby node detects whether the virtual ip is local (on the standby node) based on the vip probing part of the detection module in the deployed high-availability service component. If the virtual ip is already local, execution returns to step S503; otherwise, step S506 is executed.
Step S506, switching the virtual ip to the local terminal.
Here, the standby node uses the resource takeover portion of the probing module in the high availability service component to switch the vip of the master node to the home end, and then step S507 is performed.
Step S507, whether the service of the distributed key value storage system is restored.
Here, the standby node uses the detection module in the high availability service component to detect whether the service of the high availability distributed key value storage system of the multi-node cluster is normal, if not, step S508 is executed, and if so, step S509 is executed.
And step S508, waiting for the recovery of the distributed key value storage system.
Here, if the highly available distributed key-value storage system (Etcd) has not yet recovered, the device waits for it to return to normal and then executes step S509.
Step S509, transmitting node failure information to the management interface.
Here, the standby node transmits node failure information to the management interface using the high availability service component, and then, step S510 is performed.
Step S510, the management component updates the service forwarding rule table.
Here, the management component of the standby node automatically updates the service forwarding rule table of the multi-node cluster when the management component monitors the failure of the master node through the management interface.
Step S511, the service works normally.
Here, the management component of the standby node automatically updates the service forwarding rule table of the multi-node cluster, so that normal operation of the service can be realized.
From fig. 5, it can be seen that, from the failure of the master node to full availability of the service, the recovery time depends only on the time spent detecting the node failure (the heartbeat monitoring period) and the recovery time of the highly available distributed key-value storage system, so the overall service recovery time is greatly shortened and a higher level of service availability is achieved.
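For illustration only, the flow of fig. 5 can be condensed into a single loop that reuses the illustrative helpers sketched earlier (peer_alive, take_over_vip and etcd_healthy); vip_is_local and report_node_failure are additional assumed helpers, the latter standing for the repair-request sketch, and none of them are functions defined by this application.

    # Condensed sketch of the flow in fig. 5 on the standby node (illustrative only).
    import time

    def failover_loop() -> None:
        while True:                      # S503: heartbeat detection
            if peer_alive():             # S504: the peer (master) node has not failed
                time.sleep(2)
                continue
            if not vip_is_local():       # S505: is the virtual ip at the home end?
                take_over_vip()          # S506: switch the virtual ip to the home end
            while not etcd_healthy():    # S507/S508: wait for the key-value store to recover
                time.sleep(1)
            report_node_failure()        # S509: send node failure information to the management interface
            break                        # S510/S511: the management component updates the forwarding rules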
The embodiment of the application provides a fault processing method applied to a multi-node cluster, the method including: if it is determined that the master node is abnormal, switching the virtual ip address of the master node to a standby node; and simulating the master node to transmit state information to a management interface, wherein the state information includes at least abnormality information of the master node. In this fault processing method, the virtual ip address of the master node is switched to the standby node when the master node is abnormal, so that the standby node can simulate the master node and report the abnormality information of the master node to the management interface in time, which avoids the defect that a fault cannot be reported to the management interface when a node fails and improves fault handling efficiency.
The embodiment of the application provides a fault processing device, as shown in fig. 6, including:
a switching module 601, configured to switch, if it is determined that the master node is abnormal, a virtual ip address of the master node to a standby node;
the simulation module 602 is configured to simulate the master node to transmit status information to the management interface;
wherein the state information includes at least abnormality information of the master node.
In an embodiment of the present application, the simulation module 602 is further configured to simulate, through the standby node, the proxy component of the master node, call the representational state transfer application programming interface, and send the state information to the management interface.
In an embodiment of the present application, the switching module 601 is further configured to determine that an abnormality occurs in the master node if the standby node does not receive the status message sent by the master node beyond a preset duration.
In an embodiment of the present application, the switching module 601 is further configured to switch, through a resource takeover component of the standby node, the virtual ip address of the primary node to the standby node, so that the standby node takes over resources and services running on the primary node.
In an embodiment of the present application, the switching module 601 is further configured to detect a service state of a preset database in the multi-node cluster; correspondingly, the simulation module 602 is further configured to simulate, if the service state of the preset database is normal, the master node to transmit the state information to the management interface through the standby node.
In an embodiment of the present application, the simulation module 602 is further configured to update, through the management interface, a state of the master node in the preset database according to the abnormality information of the master node included in the state information.
In an embodiment of the present application, the simulation module 602 is further configured to update, through the management component of the standby node, the service forwarding rule table of the multi-node cluster if the state of the master node in the preset database is updated.
The embodiment of the application provides a fault handling device, as shown in fig. 7, the fault handling device includes: a processor 701, a memory 702, and a communication bus 703;
a communication bus 703 for implementing a communication connection between the processor 701 and the memory 702;
a processor 701 for executing a computer program stored in a memory 702 to implement the above-described fault handling method.
The embodiment of the application provides fault processing equipment, which is used for: switching the virtual ip address of the master node to a standby node if it is determined that the master node is abnormal; and simulating the master node to transmit state information to the management interface, wherein the state information includes at least abnormality information of the master node. With the fault processing equipment, the virtual ip address of the master node is switched to the standby node when the master node is abnormal, so that the standby node can simulate the master node and report the abnormality information of the master node to the management interface in time, which avoids the defect that a fault cannot be reported to the management interface when a node fails and improves fault handling efficiency.
Embodiments of the present application provide a computer readable storage medium storing one or more computer programs executable by one or more processors to implement the above-described fault handling method. The computer readable storage medium may be a volatile memory, such as a Random-Access Memory (RAM); or a non-volatile memory, such as a Read-Only Memory (ROM), a flash memory, a Hard Disk Drive (HDD) or a Solid State Drive (SSD); or may be a device including one or any combination of the above memories, such as a mobile phone, a computer, a tablet device, or a personal digital assistant.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions easily conceivable by those skilled in the art within the technical scope of the present application should be covered in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A fault handling method applied to a multi-node cluster, the method comprising:
if it is determined that the master node is abnormal, switching the virtual ip address of the master node to a standby node;
simulating the master node to transmit state information to a management interface;
wherein the state information includes at least abnormality information of the master node.
2. The method of claim 1, the simulating the master node transmitting status information to a management interface, comprising:
and simulating an agent component of the main node through the standby node, calling a state transition application programming interface of the characterization layer, and sending the state information to the management interface.
3. The method of claim 1, wherein if it is determined that an exception occurs in the master node, before switching the virtual ip address of the master node to the standby node, the method further comprises:
if the standby node does not receive the status message sent by the master node within a preset duration, determining that the master node is abnormal.
4. The method of claim 1, the switching the virtual ip address of the master node to a standby node, comprising:
And switching the virtual ip address of the master node to the standby node through a resource takeover component of the standby node, so that the standby node takes over resources and services running on the master node.
5. The method of claim 1, after said switching the virtual ip address of the master node to a standby node, the method further comprising:
detecting the service state of a preset database in the multi-node cluster;
accordingly, the simulating the master node to transmit status information to the management interface includes:
and if the service state of the preset database is normal, simulating the main node to transmit the state information to the management interface through the standby node.
6. The method of claim 1, after simulating the master node transmitting status information to a management interface, the method further comprising:
and updating the state of the master node in a preset database according to the abnormal information of the master node included in the state information through the management interface.
7. The method of claim 6, the method further comprising:
and if the state of the main node in the preset database is updated, updating the service forwarding rule table of the multi-node cluster through the management component of the standby node.
8. A fault handling apparatus, the fault handling apparatus comprising:
the switching module is used for switching the virtual ip address of the master node to the standby node if it is determined that the master node is abnormal;
the simulation module is used for simulating the master node to transmit state information to the management interface;
wherein the state information includes at least abnormality information of the master node.
9. A fault handling apparatus comprising: a processor, a memory, and a communication bus;
the communication bus is used for realizing communication connection between the processor and the memory;
the processor for executing a computer program stored in the memory to implement the fault handling method of any one of claims 1 to 7.
10. A computer readable storage medium storing one or more computer programs executable by one or more processors to implement the fault handling method of any of claims 1 to 7.
CN202311416025.5A 2023-10-27 2023-10-27 Fault processing method, device, equipment and storage medium Pending CN117579465A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311416025.5A CN117579465A (en) 2023-10-27 2023-10-27 Fault processing method, device, equipment and storage medium


Publications (1)

Publication Number Publication Date
CN117579465A 2024-02-20

Family

ID=89859691

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311416025.5A Pending CN117579465A (en) 2023-10-27 2023-10-27 Fault processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117579465A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination