CN114691445A - Cluster fault processing method and device, electronic equipment and readable storage medium

Info

Publication number: CN114691445A
Application number: CN202011577951.7A
Authority: CN
Inventors: 陈阔
Original assignee: Suzhou Guoshuang Software Co ltd
Current assignee: Suzhou Guoshuang Software Co ltd
Priority date: 2020-12-28
Filing date: 2020-12-28
Publication date: 2022-07-01

Abstract

The cluster fault processing method and device, electronic equipment, and storage medium perform fault node detection on a cluster through a preset proxy service corresponding to the cluster. When a fault node is found in the cluster, a snapshot function corresponding to the fault node is created, and the container internal snapshot information and container external snapshot information of the fault node are obtained by executing the snapshot function. Because the snapshot task is executed as soon as a fault node is identified, the scheme addresses the difficulty of collecting evidence and locating faults that arises when fault location relies only on the running state of the node.

Description

Cluster fault processing method and device, electronic equipment and readable storage medium

Technical Field

The embodiment of the invention relates to the technical field of clusters, in particular to a cluster fault processing method and device, electronic equipment and a readable storage medium.

Background

With the advancement of information technology, enterprises and other organizations increasingly rely on computer systems. As data volumes expand rapidly, a single computer can no longer meet demand, and using a supercomputer greatly increases cost. Kubernetes clusters emerged in response: a Kubernetes cluster is composed of a plurality of nodes for running containerized applications, where the nodes may be physical or virtual machines and are responsible for executing requests and assigned tasks.

While a Kubernetes cluster is running, a node may fail and become unable to provide services normally, which affects the reliability of the production system.

The fault handling method currently used by clusters is to monitor the running state of the nodes in the cluster and change the running state when a node fault is detected; operation and maintenance personnel then log in to the system and find the faulty nodes according to the running state of the nodes. With this approach, if the fault is triggered intermittently and the operation and maintenance personnel cannot log in to collect the relevant evidence as soon as the fault occurs, locating the fault may be difficult.

The above description of the discovery process of the problems is only for the purpose of assisting understanding of the technical solutions of the present invention, and does not represent an admission that the above is prior art.

Disclosure of Invention

In order to solve the technical problem of difficulty in fault location, embodiments of the present invention provide a cluster fault processing method and apparatus, an electronic device, and a storage medium.

In view of this, in a first aspect, an embodiment of the present invention provides a cluster fault processing method, including:

detecting a fault node of the cluster by using a preset proxy service corresponding to the cluster;

when a fault node exists in the cluster, creating a snapshot function corresponding to the fault node;

and executing the snapshot function to acquire the container internal snapshot information and the container external snapshot information of the fault node.

As a possible implementation manner, executing the snapshot function to obtain container internal snapshot information and container external snapshot information of the failed node includes:

the snapshot function comprises a plurality of snapshot instructions, and the snapshot instructions comprise a snapshot instruction for acquiring snapshot information outside the container and a snapshot instruction for acquiring snapshot information inside the container;

and sequentially executing the plurality of snapshot instructions according to the execution sequence corresponding to each snapshot instruction, so as to obtain the container internal snapshot information and the container external snapshot information of the fault node.

As a possible implementation manner, the snapshot instruction for obtaining the container external snapshot information includes at least one of the following instructions:

a snapshot instruction for acquiring operating system snapshot information, a snapshot instruction for acquiring host network snapshot information, a snapshot instruction for acquiring system operation log snapshot information, and a snapshot instruction for acquiring host hardware snapshot information;

the snapshot instruction for acquiring the snapshot information inside the container comprises:

snapshot instructions for obtaining container snapshot information.

As a possible implementation manner, performing fault node detection on a cluster based on a preset proxy service corresponding to the cluster, includes:

the proxy service acquires operation information of the cluster, wherein the operation information comprises working state information of each node in the cluster;

aiming at each node, comparing the working state information of the node with the preset working state information corresponding to the node;

and if the working state information of the node is inconsistent with the preset working state information corresponding to the node, determining the node as a fault node.

As a possible implementation, the method further includes:

creating a target directory in advance;

and storing the container internal snapshot information and the container external snapshot information of the fault node based on the target directory.

As a possible implementation manner, storing container internal snapshot information and container external snapshot information of the failed node based on the target directory includes:

the target directory comprises a plurality of subdirectories, and different subdirectories are used for storing snapshot information obtained by different snapshot instructions;

for each piece of snapshot information in the container internal snapshot information and the container external snapshot information, executing the following steps:

determining a snapshot instruction corresponding to the snapshot information, determining a subdirectory corresponding to the snapshot information based on the snapshot instruction, and storing the snapshot information to the subdirectory.

As a possible implementation, the method further includes:

and sending the container internal snapshot information and the container external snapshot information of the fault node to designated personnel based on a preset sending mode.

In a second aspect, an embodiment of the present application further provides a cluster fault processing apparatus, including:

the fault detection module is used for detecting a fault node of the cluster by using a preset proxy service corresponding to the cluster;

the snapshot function creating module is used for creating a snapshot function corresponding to a fault node when the fault node is determined to exist in the cluster;

and the snapshot module is used for executing the snapshot function so as to acquire the container internal snapshot information and the container external snapshot information of the fault node.

In a third aspect, an embodiment of the present application further provides an electronic device, including at least one processor, and at least one memory and a bus connected to the processor; the processor and the memory complete mutual communication through a bus; the processor is configured to call program instructions in the memory to perform the steps of the cluster failure handling method according to the first aspect.

In a fourth aspect, this application further provides a readable storage medium, where the readable storage medium stores computer instructions, and the computer instructions cause a computer to execute the steps of the cluster failure processing method according to the first aspect.

Compared with the prior art, the cluster fault processing method provided by the embodiment of the invention performs fault node detection on the cluster through the preset proxy service corresponding to the cluster, creates a snapshot function corresponding to the fault node when a fault node exists in the cluster, and obtains the container internal snapshot information and container external snapshot information of the fault node by executing the snapshot function. Because the snapshot task is executed as soon as a fault node is identified, the scheme addresses the difficulty of collecting evidence and locating faults that arises when fault location relies only on the running state of the node.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.

Fig. 1 is a flowchart of a cluster fault processing method according to an embodiment of the present invention;

Fig. 2 is a flowchart of another cluster fault processing method according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of a relationship between a proxy service and a cluster according to an embodiment of the present invention;

Fig. 4 is a block diagram of a cluster fault processing apparatus according to an embodiment of the present invention;

Fig. 5 is a block diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.

Fig. 1 is a flowchart of a cluster fault processing method according to an embodiment of the present invention, and as shown in fig. 1, the method may include the following steps:

S11: performing fault node detection on the cluster by using a preset proxy service corresponding to the cluster.

As an embodiment, a proxy service corresponding to the cluster is preset, the proxy service performs fault node detection on the cluster, and S12 is executed when a fault node is detected in the cluster.

S12: when a fault node exists in the cluster, creating a snapshot function corresponding to the fault node.

S13: executing the snapshot function to acquire container internal snapshot information and container external snapshot information of the fault node.

The following collectively describes steps S12 and S13:

To avoid the difficulty in collecting fault location evidence that results when operation and maintenance personnel do not react in time, and to make it easier for them to locate the fault in the fault node, the proxy service creates a snapshot function corresponding to the fault node once the fault node in the cluster has been determined, and obtains the snapshot information of the fault node by executing the snapshot function.

Since the nodes in the cluster are physical machines or virtual machines for running the containerized application, the failure of the nodes may occur outside the containerized application, such as a failure of an operating system, a network, hardware, or the like, or inside the containerized application, such as a failure of the containerized application itself. Therefore, in this embodiment, the acquired snapshot information of the failed node includes the container internal snapshot information and the container external snapshot information of the failed node, so that the failure can be more accurately positioned subsequently according to the snapshot information.

In the fault processing method provided in this embodiment, a preset proxy service corresponding to a cluster is used to perform fault node detection on the cluster; when it is determined that a fault node exists in the cluster, a snapshot function corresponding to the fault node is created, and container internal snapshot information and container external snapshot information of the fault node are obtained by executing the snapshot function. Because the snapshot task is executed as soon as a fault node is identified, the scheme addresses the difficulty of collecting evidence and locating faults that arises when fault location relies only on the running state of the node.

Furthermore, the snapshot information obtained in this embodiment includes container internal snapshot information and container external snapshot information of the failed node, which is convenient for more comprehensively performing fault location on the failed node, so that the fault location result is more accurate.

Fig. 2 is a flowchart of another cluster fault handling method according to an embodiment of the present invention, and as shown in fig. 2, the method may include the following steps:

S21: performing fault node detection on the cluster by using a preset proxy service corresponding to the cluster.

As an embodiment, the proxy service may subscribe to the operation information of the cluster so that it can acquire that information. In application, as shown in fig. 3, an agent (i.e., the proxy service) corresponding to a Kubernetes cluster may be set up, and the agent may subscribe to the condition information (i.e., operation state information) of the Kubernetes cluster by way of a Kubernetes informer, so that the agent can acquire the condition information of the cluster.
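The subscription relationship described above can be pictured with a small publish/subscribe sketch. This is only an illustration: `ConditionFeed` and `Agent` are hypothetical names, and the feed is a toy in-process stand-in rather than any Kubernetes API.

```python
class ConditionFeed:
    """Toy stand-in for a cluster's condition stream (not a Kubernetes API)."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        # Register a callable that receives (node_name, condition) updates.
        self._subscribers.append(callback)

    def publish(self, node_name, condition):
        # Deliver one condition update to every subscriber.
        for callback in self._subscribers:
            callback(node_name, condition)


class Agent:
    """Hypothetical proxy service that keeps the latest condition per node."""

    def __init__(self):
        self.latest = {}

    def on_condition(self, node_name, condition):
        self.latest[node_name] = condition


feed = ConditionFeed()
agent = Agent()
feed.subscribe(agent.on_condition)

# The cluster side pushes updates; the agent never logs in to a node.
feed.publish("node-a", {"Ready": "True"})
feed.publish("node-b", {"Ready": "False"})
```

After the two updates, `agent.latest` holds the most recent condition for each node, which is the input to the comparison steps that follow.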

Further, the operation information of the cluster includes working state information (e.g., status information) of each node in the cluster. Based on this, the detection of the failed node by the proxy service on the cluster may include the following steps:

Step 1: the proxy service acquires the operation information of the cluster, wherein the operation information comprises the working state information of each node in the cluster.

Step 2: for each node, the working state information of the node is compared with the preset working state information corresponding to the node.

Step 3: if the working state information of the node is inconsistent with the preset working state information corresponding to the node, the node is determined to be a fault node.

The following provides a unified description of step 2 and step 3:

In application, the preset working state information corresponding to a node may be the working state information of the node during normal operation. When the obtained working state information of a node is inconsistent with the preset working state information corresponding to that node, it can be determined that the node has a fault, and the node is therefore determined to be a fault node.

For example, the operating state information may include first fault identification information used to identify whether a node has a fault, where in normal operation of the node, the corresponding first fault identification information is 0, which indicates that the node has no fault, and if the first fault identification information of the node in the obtained operating state information is 1, which indicates that the node has a fault, the node is determined to be a faulty node.
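Steps 1 to 3 amount to a per-node comparison against a preset baseline. A minimal sketch follows, assuming a hypothetical `NodeStatus` record whose `fault_flag` field plays the role of the first fault identification information (0 for normal, 1 for fault); the field names are illustrative and not taken from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeStatus:
    # Hypothetical working-state record; field names are assumptions.
    ready: bool
    fault_flag: int  # 0 = no fault, 1 = fault


def find_fault_nodes(observed, expected):
    """Return the names of nodes whose observed status differs from the preset one.

    A node with no preset entry is also reported, since expected.get() yields
    None and any status differs from None (an assumption of this sketch).
    """
    fault_nodes = []
    for name, status in observed.items():
        if status != expected.get(name):
            fault_nodes.append(name)
    return fault_nodes


expected = {
    "node-a": NodeStatus(ready=True, fault_flag=0),
    "node-b": NodeStatus(ready=True, fault_flag=0),
}
observed = {
    "node-a": NodeStatus(ready=True, fault_flag=0),
    "node-b": NodeStatus(ready=False, fault_flag=1),
}
fault_nodes = find_fault_nodes(observed, expected)
```

Here only `node-b` deviates from its preset state, so it is the only node returned.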

In this embodiment, a proxy service corresponding to the cluster is set up and fault node detection is performed through it, so the working state information of a node can be acquired without logging in to the node. This avoids the poor security and stability associated with logging in to nodes (especially password-free login). The proxy service can also continuously monitor all nodes in the cluster, so the acquired data is comprehensive and real-time. Acquiring the working state information of the nodes through the proxy service further allows batch acquisition, which improves the efficiency of fault detection.

S22, when the fault node exists in the cluster, creating a snapshot function corresponding to the fault node.

As one embodiment, the snapshot function corresponding to the failed node may be generated by calling a preset method for generating the snapshot function.

For example, a preset createsnapshot method is called to create the snapshot function corresponding to the fault node. Specifically, the proxy service generates a request for calling the createsnapshot method, and the request includes an identifier of the fault node; when the request for calling the createsnapshot method is detected, a snapshot function script, that is, create_snapshot, is loaded. Because a corresponding snapshot function is created for each fault node, different snapshot functions can be created for different fault nodes, which provides diversity of snapshot functions and improves the applicability of fault processing.

As an alternative implementation, loading the snapshot function script may include directly obtaining the corresponding script from a preset script library.

As another alternative implementation, loading the snapshot function script may include obtaining a plurality of snapshot instructions from a preset instruction library, where the plurality of snapshot instructions form a corresponding snapshot script.

In this embodiment, the snapshot function script includes a plurality of snapshot instructions, which include a snapshot instruction for acquiring snapshot information outside the container and a snapshot instruction for acquiring snapshot information inside the container. The snapshot instruction for acquiring the snapshot information outside the container comprises at least one of the following instructions: a snapshot instruction for acquiring operating system snapshot information, a snapshot instruction for acquiring host network snapshot information, a snapshot instruction for acquiring system operation log snapshot information, a snapshot instruction for acquiring host hardware snapshot information, and the like. The snapshot instruction for acquiring the snapshot information inside the container comprises: a snapshot instruction for obtaining container snapshot information, and the like.
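One way to picture a snapshot function assembled from a preset instruction library is sketched below. Everything here is hypothetical: `INSTRUCTION_LIBRARY`, the instruction names, and the placeholder collectors (which return strings instead of running real commands) are assumptions made for illustration.

```python
# Hypothetical instruction library: name -> (scope, collector).
# Real collectors would gather OS, network, log, hardware, or container data;
# these placeholders only return a label so the sketch stays runnable.
INSTRUCTION_LIBRARY = {
    "os": ("external", lambda node: f"os:{node}"),
    "network": ("external", lambda node: f"network:{node}"),
    "syslog": ("external", lambda node: f"syslog:{node}"),
    "hardware": ("external", lambda node: f"hardware:{node}"),
    "container": ("internal", lambda node: f"container:{node}"),
}


def create_snapshot_function(node, instruction_names):
    """Build a snapshot function bound to one fault node."""
    steps = [(name, *INSTRUCTION_LIBRARY[name]) for name in instruction_names]

    def snapshot():
        result = {"internal": {}, "external": {}}
        # Instructions run in the order given when the function was created.
        for name, scope, collect in steps:
            result[scope][name] = collect(node)
        return result

    return snapshot


snapshot_node_b = create_snapshot_function("node-b", ["os", "network", "container"])
snapshot_info = snapshot_node_b()
```

Because the function is created per fault node, a different instruction list can be chosen for each node, which matches the diversity of snapshot functions described in the text.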

S23: calling the snapshot function, and sequentially executing the plurality of snapshot instructions contained in the snapshot function according to the execution sequence corresponding to each snapshot instruction in the snapshot function, so as to obtain container internal snapshot information and container external snapshot information of the fault node.

As an optional embodiment, sequentially executing the plurality of snapshot instructions included in the snapshot function according to the execution sequence corresponding to each snapshot instruction may include: taking the order in which the snapshot instructions are arranged in the snapshot function script as their execution sequence, and executing the snapshot instructions in that order, so that an instruction appearing earlier in the script is executed earlier.

As another optional implementation manner, the execution sequence corresponding to the snapshot instruction may be set by the user according to an actual requirement, for example, the user sets a snapshot instruction execution sequence table in advance according to the actual requirement, and the execution sequence table includes the execution sequence corresponding to each snapshot instruction. Based on this, when executing step S23, the execution sequence table is loaded first, and then the snapshot instructions are executed in sequence according to the execution sequence corresponding to the snapshot instructions in the table.

Setting the execution sequence of the snapshot instructions allows the snapshots to be taken in an orderly manner. Further, because different snapshot instructions occupy different resources, the user can set the execution sequence according to actual requirements in order to allocate resources reasonably and optimize resource utilization.
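The two ordering options (script order, or a user-supplied execution sequence table) can be sketched as a single helper. The function name and the table format (instruction name mapped to an integer rank) are assumptions of this sketch.

```python
def order_instructions(instruction_names, order_table=None):
    """Return instruction names in the order they should be executed.

    Without a table, the script order is kept. With a table, lower ranks run
    first; instructions missing from the table run last and, because sorted()
    is stable, keep their relative script order.
    """
    if order_table is None:
        return list(instruction_names)
    return sorted(
        instruction_names,
        key=lambda name: order_table.get(name, float("inf")),
    )


script_order = ["os", "network", "container"]
default_order = order_instructions(script_order)
custom_order = order_instructions(script_order, {"container": 1, "os": 2})
```

With the table above, the container instruction runs first, then the operating system instruction, and the network instruction (absent from the table) runs last.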

S24, a target directory is created in advance, and container internal snapshot information and container external snapshot information of the fault node are stored based on the target directory.

As an embodiment, before the snapshot information is stored, it may first be packaged, so as to reduce the space occupied by the stored snapshot information and improve space utilization.

As an optional implementation manner, all snapshot information of the same fault node may be packaged into one file, so that all snapshot information of the fault node can be obtained conveniently.

As another alternative implementation, different snapshot information collected by different snapshot instructions may be separately packaged. This facilitates subsequent acquisition of snapshot information corresponding to different content (e.g., operating system, network, system log, container, etc.).
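Both packaging options can be sketched with Python's standard `tarfile` module. The archive format (gzip-compressed tar), the `.txt` member names, and the function names are assumptions; the source does not say which packaging format is used.

```python
import io
import tarfile


def package_snapshots(snapshots):
    """Pack {name: text} into one gzip-compressed tar archive and return its bytes."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, text in snapshots.items():
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=f"{name}.txt")
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def package_separately(snapshots):
    """Pack each piece of snapshot information into its own archive."""
    return {name: package_snapshots({name: text}) for name, text in snapshots.items()}


snapshots = {"os": "kernel version ...", "network": "interfaces ..."}
single_archive = package_snapshots(snapshots)
per_instruction = package_separately(snapshots)
```

The first function corresponds to one file per fault node; the second yields one archive per snapshot instruction, which makes it easier to fetch only the operating system or only the network data later.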

As an optional implementation manner, only one target directory is provided, and all the snapshot information is stored in the target directory, and in this manner, when the snapshot data is subsequently searched, all the needed snapshot information can be obtained by searching only one directory.

As another implementation manner, the created target directory includes a plurality of subdirectories, and different subdirectories are used for storing snapshot information obtained by different snapshot instructions, for example, two subdirectories are included, one is used for storing container internal snapshot information, and the other is used for storing container external snapshot information; for example, the number of the sub directories is consistent with the number of the snapshot instructions, and each sub directory stores snapshot information corresponding to one snapshot instruction. Based on this, storing the container internal snapshot information and the container external snapshot information of the failed node based on the target directory may include: for each piece of snapshot information in the container internal snapshot information and the container external snapshot information, executing the following steps: determining a snapshot instruction corresponding to the snapshot information, determining a subdirectory corresponding to the snapshot information based on the snapshot instruction, and storing the snapshot information to the subdirectory. By the method, the snapshot information is stored respectively, if only partial snapshot information of a certain fault node needs to be searched, only the subdirectory corresponding to the partial snapshot information needs to be searched, the whole target directory does not need to be searched, and the searching time is saved.
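The subdirectory scheme can be sketched as follows, assuming one subdirectory per snapshot instruction. The function name, the `snapshot.txt` file name, and the directory layout are illustrative choices, not details given in the source.

```python
from pathlib import Path
import tempfile


def store_snapshot(target_dir, instruction_name, content):
    """Store one piece of snapshot information under the subdirectory
    that corresponds to the instruction which produced it."""
    subdirectory = Path(target_dir) / instruction_name
    subdirectory.mkdir(parents=True, exist_ok=True)
    path = subdirectory / "snapshot.txt"
    path.write_text(content, encoding="utf-8")
    return path


# A temporary directory stands in for the pre-created target directory.
target_directory = tempfile.mkdtemp()
network_path = store_snapshot(target_directory, "network", "eth0 up")
container_path = store_snapshot(target_directory, "container", "3 containers running")
```

Looking up only the network snapshot then means reading `<target>/network/` rather than scanning the whole target directory.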

S25: sending the container internal snapshot information and the container external snapshot information of the fault node to designated personnel based on a preset sending mode.

As an embodiment, the information sending mode (e.g., mail, short message, etc.) and the designated personnel who receive the information (e.g., operation and maintenance personnel) may be set in advance according to actual requirements. After the snapshot information of the fault node is obtained, it is sent to the designated personnel in the set sending mode, so that they obtain the snapshot information in time and do not need to spend time searching for it manually. Sending the snapshot information also serves as an alarm: on receiving it, the designated personnel know that a fault node has currently appeared, so the fault can be handled in time and fault processing efficiency is improved.
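For the mail sending mode, the message that carries the snapshot information could be built as sketched below. This only constructs the message with the standard `email.message.EmailMessage` class; the subject text, the recipient address, and the attachment name are hypothetical, and actual delivery (for example through an SMTP server) is left out.

```python
from email.message import EmailMessage


def build_alert(node, recipient, archive_bytes):
    """Build a mail that doubles as an alarm and carries the packaged snapshot."""
    message = EmailMessage()
    message["Subject"] = f"Fault detected on node {node}"
    message["To"] = recipient
    message.set_content(f"Snapshot information for fault node {node} is attached.")
    # Attach the packaged snapshot information as a binary file.
    message.add_attachment(
        archive_bytes,
        maintype="application",
        subtype="gzip",
        filename=f"{node}-snapshot.tar.gz",
    )
    return message


alert = build_alert("node-b", "ops@example.com", b"packaged snapshot bytes")
```

The recipient and the sending mode would come from the preset configuration; a short-message channel would replace this function with an equivalent for that channel.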

Another embodiment of the present invention further provides a device for handling a cluster failure, as shown in fig. 4, the device may include: a failure detection module 401, a snapshot function creation module 402 and a snapshot module 403.

A fault detection module 401, configured to perform fault node detection on a cluster by using a preset proxy service corresponding to the cluster;

a snapshot function creating module 402, configured to create a snapshot function corresponding to a faulty node when it is determined that the faulty node exists in the cluster;

a snapshot module 403, configured to execute the snapshot function to obtain container internal snapshot information and container external snapshot information of the failed node.

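As an embodiment, the snapshot module 403 is specifically configured to:

sequentially execute the plurality of snapshot instructions included in the snapshot function according to the execution sequence corresponding to each snapshot instruction, so as to obtain the container internal snapshot information and the container external snapshot information of the fault node, wherein the plurality of snapshot instructions comprise a snapshot instruction for acquiring snapshot information outside the container and a snapshot instruction for acquiring snapshot information inside the container.

As an embodiment, the snapshot instruction for obtaining the snapshot information outside the container includes at least one of the following instructions:

a snapshot instruction for acquiring operating system snapshot information, a snapshot instruction for acquiring host network snapshot information, a snapshot instruction for acquiring system operation log snapshot information, and a snapshot instruction for acquiring host hardware snapshot information;

the snapshot instruction for acquiring the snapshot information inside the container comprises:

snapshot instructions for obtaining container snapshot information.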

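As an embodiment, the fault detection module 401 is specifically configured to:

acquire, through the proxy service, operation information of the cluster, wherein the operation information comprises working state information of each node in the cluster;

for each node, compare the working state information of the node with the preset working state information corresponding to the node;

and if the working state information of the node is inconsistent with the preset working state information corresponding to the node, determine the node as a fault node.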

As an embodiment, the device further comprises (not shown in fig. 4):

and the storage module is used for storing the container internal snapshot information and the container external snapshot information of the fault node based on a pre-created target directory.

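As an embodiment, storing the container internal snapshot information and the container external snapshot information of the failed node based on a target directory includes:

the target directory comprises a plurality of subdirectories, and different subdirectories are used for storing snapshot information obtained by different snapshot instructions;

for each piece of snapshot information in the container internal snapshot information and the container external snapshot information, executing the following steps:

determining a snapshot instruction corresponding to the snapshot information, determining a subdirectory corresponding to the snapshot information based on the snapshot instruction, and storing the snapshot information to the subdirectory.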

As an embodiment, the device further comprises (not shown in fig. 4):

and the sending module is used for sending the container internal snapshot information and the container external snapshot information of the fault node to designated personnel based on a preset sending mode.

The cluster fault processing apparatus includes a processor and a memory, the fault detection module 401, the snapshot function creation module 402, the snapshot module 403, and the like are all stored in the memory as program units, and the processor executes the program units stored in the memory to implement corresponding functions.

The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided, and by adjusting the kernel parameters, the operation information of the fault node is obtained and stored when a fault is detected in the cluster.

An embodiment of the present invention provides a storage medium, on which a program is stored, where the program, when executed by a processor, implements the cluster fault handling method.

The embodiment of the invention provides a processor, which is used for running a program, wherein the cluster fault processing method is executed when the program runs.

As shown in fig. 5, an embodiment of the present invention provides an electronic device 50, which includes at least one processor 501, at least one memory 502 connected to the processor, and a bus 503; the processor 501 and the memory 502 complete communication with each other through the bus 503; the processor 501 is used to call program instructions in the memory 502 to perform the cluster failure handling method described above. The electronic device 50 herein may be a server, a PC, a PAD, a cell phone, etc.

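The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps:

detecting a fault node of the cluster by using a preset proxy service corresponding to the cluster;

when a fault node exists in the cluster, creating a snapshot function corresponding to the fault node;

and executing the snapshot function to acquire the container internal snapshot information and the container external snapshot information of the fault node.

Executing the snapshot function to obtain container internal snapshot information and container external snapshot information of the failed node, including:

the snapshot function comprises a plurality of snapshot instructions, and the snapshot instructions comprise a snapshot instruction for acquiring snapshot information outside the container and a snapshot instruction for acquiring snapshot information inside the container;

and sequentially executing the plurality of snapshot instructions according to the execution sequence corresponding to each snapshot instruction, so as to obtain the container internal snapshot information and the container external snapshot information of the fault node.

The snapshot instruction for acquiring the snapshot information outside the container comprises at least one of the following instructions:

a snapshot instruction for acquiring operating system snapshot information, a snapshot instruction for acquiring host network snapshot information, a snapshot instruction for acquiring system operation log snapshot information, and a snapshot instruction for acquiring host hardware snapshot information;

the snapshot instruction for acquiring the snapshot information inside the container comprises:

snapshot instructions for obtaining container snapshot information.

Based on the preset proxy service corresponding to the cluster, the fault node detection is carried out on the cluster, and the method comprises the following steps:

the proxy service acquires operation information of the cluster, wherein the operation information comprises working state information of each node in the cluster;

for each node, comparing the working state information of the node with the preset working state information corresponding to the node;

and if the working state information of the node is inconsistent with the preset working state information corresponding to the node, determining the node as a fault node.

The method further comprises the following steps:

creating a target directory in advance;

and storing the container internal snapshot information and the container external snapshot information of the fault node based on the target directory.

Storing the container internal snapshot information and the container external snapshot information of the fault node based on the target directory, wherein the storing comprises:

the target directory comprises a plurality of subdirectories, and different subdirectories are used for storing snapshot information obtained by different snapshot instructions;

for each piece of snapshot information in the container internal snapshot information and the container external snapshot information, executing the following steps:

determining a snapshot instruction corresponding to the snapshot information, determining a subdirectory corresponding to the snapshot information based on the snapshot instruction, and storing the snapshot information to the subdirectory.

The method further comprises the following steps:

sending the container internal snapshot information and the container external snapshot information of the fault node to designated personnel based on a preset sending mode.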

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In a typical configuration, a device includes one or more processors (CPUs), memory, and a bus. The device may also include input/output interfaces, network interfaces, and the like.

The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip. The memory is an example of a computer-readable medium.

Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer-readable medium does not include a transitory computer-readable medium such as a modulated data signal or a carrier wave.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

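1. A cluster fault handling method, characterized by comprising the following steps:

detecting a fault node of the cluster by using a preset proxy service corresponding to the cluster;

when a fault node exists in the cluster, creating a snapshot function corresponding to the fault node;

and executing the snapshot function to acquire the container internal snapshot information and the container external snapshot information of the fault node.

2. The method of claim 1, wherein executing the snapshot function to obtain container internal snapshot information and container external snapshot information for the failed node comprises:

the snapshot function comprises a plurality of snapshot instructions, and the snapshot instructions comprise a snapshot instruction for acquiring snapshot information outside the container and a snapshot instruction for acquiring snapshot information inside the container;

and sequentially executing the plurality of snapshot instructions according to the execution sequence corresponding to each snapshot instruction, so as to obtain the container internal snapshot information and the container external snapshot information of the fault node.

3. The method according to claim 2, wherein the snapshot instruction for obtaining snapshot information outside the container comprises at least one of the following instructions:

a snapshot instruction for acquiring operating system snapshot information, a snapshot instruction for acquiring host network snapshot information, a snapshot instruction for acquiring system operation log snapshot information, and a snapshot instruction for acquiring host hardware snapshot information;

the snapshot instruction for acquiring the snapshot information inside the container comprises:

snapshot instructions for obtaining container snapshot information.

4. The method of claim 1, wherein performing fault node detection on the cluster based on a preset proxy service corresponding to the cluster comprises:

the proxy service acquires operation information of the cluster, wherein the operation information comprises working state information of each node in the cluster;

for each node, comparing the working state information of the node with the preset working state information corresponding to the node;

and if the working state information of the node is inconsistent with the preset working state information corresponding to the node, determining the node as a fault node.

5. The method of claim 1, further comprising:

creating a target directory in advance;

and storing the container internal snapshot information and the container external snapshot information of the fault node based on the target directory.

6. The method of claim 5, wherein storing container internal snapshot information and container external snapshot information for the failed node based on the target directory comprises:

the target directory comprises a plurality of subdirectories, and different subdirectories are used for storing snapshot information obtained by different snapshot instructions;

for each piece of snapshot information in the container internal snapshot information and the container external snapshot information, executing the following steps:

determining a snapshot instruction corresponding to the snapshot information, determining a subdirectory corresponding to the snapshot information based on the snapshot instruction, and storing the snapshot information to the subdirectory.

7. The method of claim 1, further comprising:

sending the container internal snapshot information and the container external snapshot information of the fault node to designated personnel based on a preset sending mode.

8. A cluster failure handling apparatus, comprising:

the fault detection module is used for detecting a fault node of the cluster by using a preset proxy service corresponding to the cluster;

the snapshot function creating module is used for creating a snapshot function corresponding to a fault node when the fault node is determined to exist in the cluster;

and the snapshot module is used for executing the snapshot function so as to acquire the container internal snapshot information and the container external snapshot information of the fault node.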

9. An electronic device, comprising at least one processor, and at least one memory and a bus connected to the processor; the processor and the memory communicate with each other through the bus; the processor is adapted to call program instructions in the memory to perform the steps of the cluster failure handling method of any of claims 1-7.

10. A readable storage medium storing computer instructions for causing a computer to perform the steps of the cluster failure handling method of any one of claims 1 to 7.