CN113422692A - Method, device and storage medium for detecting and processing node faults in K8s cluster - Google Patents

Method, device and storage medium for detecting and processing node faults in K8s cluster

Info

Publication number
CN113422692A
CN113422692A (application CN202110594222.0A)
Authority
CN
China
Prior art keywords
node
fault
detection
worker
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110594222.0A
Other languages
Chinese (zh)
Inventor
别路
吕亚霖
董晓聪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zuoyebang Education Technology Beijing Co Ltd
Original Assignee
Zuoyebang Education Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zuoyebang Education Technology Beijing Co Ltd filed Critical Zuoyebang Education Technology Beijing Co Ltd
Priority to CN202110594222.0A priority Critical patent/CN113422692A/en
Publication of CN113422692A publication Critical patent/CN113422692A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06Management of faults, events, alarms or notifications
    • H04L41/0631Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06Management of faults, events, alarms or notifications
    • H04L41/0654Management of faults, events, alarms or notifications using network fault recovery
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • G06F2009/45591Monitoring or debugging support
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • G06F2009/45595Network integration; Enabling network access in virtual machine instances

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention relates to the technical field of clusters, and discloses a method, a device and a storage medium for detecting and processing node faults in a K8s cluster. The method comprises the following steps: performing fault detection on each Worker node of the K8s cluster through a customized node fault detection program; reporting the detection result to the API Server of the K8s cluster and saving it; and monitoring the state of the Worker node according to the detection result and, when the state changes, executing the corresponding recovery action according to pre-configured rules so as to handle the fault. The method forms a complete closed-loop automated system: it strengthens the node fault detection and handling capability of K8s, runs entirely in a cloud-native manner, does not depend on other third-party components, offers high availability, greatly improves the efficiency of node fault handling, requires no manual intervention, and has high practical value.

Description

Method, device and storage medium for detecting and processing node faults in K8s cluster
Technical Field
The invention relates to the technical field of clusters, in particular to a method, a device and a storage medium for detecting and processing a node fault in a K8s cluster.
Background
K8s, i.e. Kubernetes, is the standard open-source container orchestration and scheduling platform in the cloud-native domain. A Node is the computer entity managed by a K8s cluster; nodes are divided into Master nodes and Worker nodes, and a number of nodes together form a cluster. A Pod is the smallest unit scheduled by K8s; a service consists of one or more Pods, and a Pod corresponds to a specific computing task on a computer.
K8s executes workloads by placing containers into Pods that run on Nodes. Each node contains the services needed to run Pods; these nodes are managed by the control plane. Components on a node include the kubelet, the container runtime, and kube-proxy. The state of a node contains the following information: addresses, conditions, capacity and allocatable, and general info; kubectl may be used to view node status and other details. The conditions field describes the state of all Running nodes. Examples of conditions include:
Ready: True if the node is healthy and ready to receive Pods; False indicates that the node is unhealthy and cannot receive Pods; Unknown indicates that the node controller has not received a message from the node during the last node-monitor-grace-period (default 40 seconds).
DiskPressure: True indicates that the node's free disk space is insufficient for adding a new Pod; otherwise False.
MemoryPressure: True indicates that the node is under memory pressure, i.e. the node's available memory is low; otherwise False.
PIDPressure: True indicates that the node is under process pressure, i.e. there are too many processes on the node; otherwise False.
NetworkUnavailable: True indicates that the node's network configuration is incorrect; otherwise False.
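For example, the node conditions described above can be inspected from the command line; a minimal kubectl sketch is shown below (the node name worker-1 is only an illustrative assumption):
kubectl get nodes                 # list nodes and their Ready summary
kubectl describe node worker-1    # human-readable view including the Conditions section
kubectl get node worker-1 -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'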
The existing K8s performs node health detection through the kubelet on the node and relies on a lease renewal mechanism; by default it can detect the following four kinds of problems: network unavailability, insufficient memory, insufficient disk space, and an insufficient number of available processes. Compared with complex real-world scenarios, this default detection capability of K8s is weak; other software or hardware faults that the system depends on for operation, beyond the above four problems, cannot be detected, for example: ntp service exceptions (which may cause the server clock to go out of sync), graphics card failures (which may cause GPU-dependent programs to fail to run), and so on. In addition, after a fault is found, manual intervention is needed to move the node out of the cluster and then purchase a new node to add to the cluster, which has high labor cost and low efficiency.
In view of the above, the present invention is particularly proposed.
Disclosure of Invention
In order to solve the problem of node fault detection and automatic handling in a K8s cluster, the invention provides a technical scheme in which whether a node is abnormal can be judged according to customized conditions and handled automatically; the technical scheme specifically comprises the following steps:
the invention provides a method for detecting and processing a fault of a node in a K8s cluster, which comprises the following steps:
carrying out fault detection on each working node Worker of the K8s cluster through a self-defined node fault detection program;
reporting the detection result to an API Server of the K8s cluster, and storing the detection result;
and monitoring the state of the Worker node Worker according to the detection result, and executing corresponding recovery action according to a preset rule when the state changes so as to process the fault.
As an optional embodiment of the present invention, the performing fault detection on each working node Worker of the K8s cluster by using a self-defined node fault detection program includes:
a fault detection program (detector) is deployed on each Worker node and used for detecting faults of that node;
the detector supports plug-in operation, and performs fault detection on each Worker node of the K8s cluster by executing the customized node fault detection programs;
optionally, the detector is deployed on each Worker node in the K8s cluster in the form of a DaemonSet.
As an optional embodiment of the present invention, the customized node fault detection program includes: an ntp detection program for detecting node ntp service exceptions, and/or a graphics card detection program for detecting node graphics card (GPU) faults.
As an alternative embodiment of the present invention, referring to fig. 2, the reporting the detection result to the API Server of the K8s cluster, and the saving includes:
the detection result is reported to the API Server in the form of a NodeCondition through remote calls, and the API Server stores the NodeCondition into Etcd;
NodeCondition is a field carried by the Node structure of K8s and used for storing the default node states of K8s; this NodeCondition field is reused to store the customized node states detected by the customized node fault detection programs.
As an optional implementation manner of the present invention, the monitoring the state of the work node Worker according to the detection result includes:
monitoring the state of a Worker node Worker by monitoring the change event of the detection result in the API Server through the list-watch;
where list means calling the resource's list API to enumerate resources, implemented over short-lived HTTP connections, and watch means calling the resource's watch API to listen for resource change events.
As an optional embodiment of the present invention, when the state changes, the performing the corresponding recovery action according to a pre-configured rule, and performing the fault processing includes:
when the status of the Worker node Worker is a 'GPUFallOff' fault and/or an 'NTPProblem' fault, performing a corresponding recovery action includes blocking, wherein the blocking is to prevent a new pod from being scheduled on the node;
optionally, the performing the corresponding recovery action further comprises alerting.
As an optional embodiment of the present invention, when the state changes, the performing the corresponding recovery action according to a pre-configured rule, and performing the fault processing includes:
when the state of the Worker node is a KernelDeadlock fault, and/or a ReadonlyFilesystem fault, and/or a CorruptDockerOverlay2 fault, and/or a KubeletError fault, and/or a NICDeleted fault;
performing corresponding recovery actions including blocking and eviction;
wherein said blocking is to prevent a new pod from being dispatched to the node; the eviction is to transfer the pod on the node to other nodes;
optionally, the performing the corresponding recovery action further comprises alerting.
As an optional embodiment of the present invention, when the state of the Worker node is a "GPUFallOff" fault, the plug-in detects the dmesg log and matches the pattern "NVRM: Xid (.+): \d+, GPU has fallen off the bus", the detection reason being "GPUHasFallenOffTheBus";
and/or, when the state of the Worker node is an "NTPProblem" fault, the detection reason "NTPIsDown" is obtained by executing the corresponding script;
and/or, when the state of the Worker node is a "KernelDeadlock" fault:
the plug-in detects the dmesg log, the detection reason being "AUFSUmountHung"; the plug-in also detects the dmesg log and matches the pattern "task docker:\w+ blocked for more than \w+ seconds\.", the detection reason being "DockerHung";
and/or, when the state of the Worker node is a "ReadonlyFilesystem" fault, the plug-in detects the dmesg log and matches the pattern "Remounting filesystem read-only nvme\w+: Identify Controller failed (.+)", the detection reason being "FilesystemIsReadOnly";
and/or, when the state of the Worker node is a "CorruptDockerOverlay2" fault, the plug-in detects the docker log and matches the pattern "returned error: readlink /var/lib/docker/overlay2.*: invalid argument.*", the detection reason being "CorruptDockerOverlay2";
and/or, when the state of the Worker node is a "KubeletError" fault, the plug-in detects the kubelet log and matches the pattern "Failed to get system container stats for (.+): failed to get cgroup stats for (.+): failed to get container info for (.+): unknown container (.+)", the detection reason being "KubeletError";
and/or, when the state of the Worker node is a "NICDeleted" fault, the plug-in detects /var/log/messages and matches the pattern "ntpd\[\d+\]: Deleting interface (.+)", the detection reason being "NICHasBeenDeleted".
The invention also provides a device for detecting and processing the fault of the node in the K8s cluster, which comprises:
the node fault detection module is used for carrying out fault detection on each working node Worker of the K8s cluster through a self-defined node fault detection program, reporting a detection result to an API Server of the K8s cluster and storing the detection result;
and the fault processing module monitors the state of the working node Worker according to the detection result, and executes corresponding recovery action according to a preset rule when the state changes so as to process the fault.
The invention also provides a storage medium which stores a computer executable program, and when the computer executable program is executed, the method for detecting and processing the fault of the node in the K8s cluster is realized.
Compared with the prior art, the invention has the beneficial effects that:
the method for detecting and processing the node fault in the K8s cluster can run a self-defined detection program in a plug-in mode to complete the fault detection of the node, report the detection result to the API Server of the K8s cluster, and then perform persistent storage to trigger the recovery action, thereby forming a whole set of closed-loop automatic system; the node fault detection and processing capability of K8s is enhanced, the system runs in a cloud native mode completely, does not depend on other third-party components, has high availability, greatly improves the node fault processing efficiency, does not need manual intervention, and has high practical value.
Description of the drawings:
FIG. 1 is a block flow diagram of a method for detecting and handling a failure of a node in a K8s cluster according to the present invention;
FIG. 2 is a block flow diagram of an embodiment of a K8s cluster intra-node fault detection and handling method of the present invention;
fig. 3 is an overall framework diagram of the K8s cluster node fault detection and processing apparatus according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments.
Thus, the following detailed description of the embodiments of the invention is not intended to limit the scope of the invention as claimed, but is merely representative of some embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the embodiments of the present invention and the features and technical solutions thereof may be combined with each other without conflict.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
In the description of the present invention, it should be noted that the terms "upper", "lower", and the like refer to orientations or positional relationships based on those shown in the drawings, or orientations or positional relationships that are conventionally arranged when the products of the present invention are used, or orientations or positional relationships that are conventionally understood by those skilled in the art, and such terms are used for convenience of description and simplification of the description, and do not refer to or imply that the devices or elements referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like are used merely to distinguish one description from another, and are not to be construed as indicating or implying relative importance.
Referring to fig. 1, the present embodiment provides a method for detecting and processing a node fault in a K8s cluster, including:
carrying out fault detection on each working node Worker of the K8s cluster through a self-defined node fault detection program;
reporting the detection result to an API Server of the K8s cluster, and storing the detection result;
and monitoring the state of the Worker node Worker according to the detection result, and executing corresponding recovery action according to a preset rule when the state changes so as to process the fault.
In order to realize more kinds of fault detection on the nodes in a K8s cluster, the fault detection and processing method of this embodiment performs detection through customized node fault detection programs; compared with relying on K8s's kubelet on the node for node health detection, customized program design allows more targeted and diversified fault detection, thereby meeting higher requirements of node fault detection.
The method for detecting and processing the node fault in the K8s cluster in the embodiment reports the detection result to the API Server of the K8s cluster and stores the detection result, so that the detected fault data can be conveniently called and monitored, and the automatic monitoring of the node fault is realized, so that the feedback can be made in time.
The method for detecting and processing the node fault in the K8s cluster in this embodiment monitors the state of the Worker node Worker according to the detection result, and executes a corresponding recovery action according to a preset rule to process the fault when the state changes. By pre-configuring the node fault and the corresponding recovery action, when the node fault is monitored, the fault is automatically processed according to the pre-configuration, so that the cost of manual processing is saved, and the efficiency of node fault processing is improved.
Therefore, the method for detecting and processing the node fault in the K8s cluster in the embodiment solves the problem of node fault detection and automatic processing in the K8s cluster, and particularly realizes a method for determining whether a node is abnormal according to a user-defined condition and automatically processing the node. The node fault detection and processing method in the K8s cluster of this embodiment can run a self-defined detection program in the form of a plug-in to complete the fault detection of the node, and report the detection result to the API Server of the K8s cluster, and then perform persistent storage, thereby triggering the recovery action, and forming a whole set of closed-loop automation system; the node fault detection and processing capability of K8s is enhanced, the system runs in a cloud native mode completely, does not depend on other third-party components, has high availability, greatly improves the node fault processing efficiency, does not need manual intervention, and has high practical value.
As an optional implementation manner of this embodiment, referring to fig. 2, in the method for detecting and processing a node fault in a K8s cluster of this embodiment, the performing fault detection on each Worker node Worker of the K8s cluster through a customized node fault detection program includes:
a fault detection program detector is respectively deployed on each working node Worker and used for detecting the fault of the node;
the fault detection program detector supports plug-in operation, and performs fault detection on each working node Worker of the K8s cluster by executing a self-defined node fault detection program.
In this embodiment, a fault detection program detector is deployed on each work node Worker of the K8s cluster and used for performing more targeted and diversified fault detection on the work node Worker, the fault detection program detector supports plug-in operation, and developers can develop a self-defined node fault detection program according to the requirement of node fault detection and operate on the fault detection program detector in a plug-in mode to detect node faults.
Optionally, the detector is deployed on each Worker node in the K8s cluster in the form of a DaemonSet.
A DaemonSet is a deployment form of Pods in K8s, similar to a daemon process: it runs one Pod on each designated node. A DaemonSet defines Pods that provide node-local facilities. These Pods may be important for the operation and maintenance of the cluster, for example as a helper for network links or as part of a network plug-in. Each time a new node is added to the cluster, if the node matches the DaemonSet specification, the control plane schedules a Pod of the DaemonSet to run on the new node. Thus, this embodiment deploys the fault detection program as a containerized DaemonSet; the fault detection program provides the necessary fault detection for the Worker nodes, and when a node is newly added, the fault detection program is automatically deployed on the new node, which greatly improves management and operation-and-maintenance efficiency compared with the traditional approach of deploying a monitoring program (such as Zabbix or Open-falcon, commonly used in the industry) on each node.
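By way of illustration only, such a detector could be deployed as a DaemonSet roughly as follows; the namespace, labels and image name are assumptions made for this sketch and are not specified in the patent:
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-fault-detector            # hypothetical name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: node-fault-detector
  template:
    metadata:
      labels:
        app: node-fault-detector
    spec:
      containers:
      - name: detector
        image: registry.example.com/node-fault-detector:latest   # hypothetical image
        volumeMounts:
        - name: host-log
          mountPath: /var/log           # give plug-ins read access to host logs
          readOnly: true
      volumes:
      - name: host-log
        hostPath:
          path: /var/log
EOF
With such a manifest, the control plane schedules one detector Pod onto every matching node, including nodes added later.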
Further, the customized node fault detection programs in this embodiment include: an ntp detection program for detecting node ntp service exceptions, and/or a graphics card detection program for detecting node graphics card (GPU) faults.
NTP (Network Time Protocol) is a protocol used to synchronize computer time; it can synchronize a computer to its server or clock source (e.g. a quartz clock, GPS, etc.), provide highly accurate time correction (a deviation of less than 1 millisecond from the standard on a LAN, and tens of milliseconds on a WAN), and prevent malicious protocol attacks by means of encrypted confirmation. An ntp service exception may cause the server clock to go out of sync. The ntp detection program of this embodiment is implemented by the following custom script:
Execute the script:
if systemctl -q is-active "ntp.service"; then
  echo "ntp.service is running"
  exit 0
else
  echo "ntp.service is not running"
  exit 1
fi
A node graphics card fault may cause GPU-dependent programs to fail to run. The graphics card detection program of this embodiment detects the dmesg log in the "plug-in" detection mode and matches the pattern "NVRM: Xid (.+): \d+, GPU has fallen off the bus" to detect the fault.
The customized node fault detection programs of this embodiment include a Kernel detection program for detecting a node "KernelDeadlock" fault; a "KernelDeadlock" fault means the node kernel is locked up: the screen shows no useful print information, the network is interrupted, and the keyboard and mouse do not respond. The Kernel detection program of this embodiment detects the dmesg log in the "plug-in" detection mode, the detection reason being "AUFSUmountHung" (a deadlock); it also detects the dmesg log in the "plug-in" detection mode and matches the pattern "task docker:\w+ blocked for more than \w+ seconds\.", the detection reason being "DockerHung".
The customized node fault detection programs of this embodiment include a Readonly detection program for detecting a node "ReadonlyFilesystem" fault; a "ReadonlyFilesystem" fault means that files cannot be written or newly created, i.e. there is only read permission and no write permission. The Readonly detection program of this embodiment detects the dmesg log in the "plug-in" detection mode and matches the pattern "Remounting filesystem read-only nvme\w+: Identify Controller failed (.+)", the detection reason being "FilesystemIsReadOnly".
The customized node fault detection programs of this embodiment include a CorruptDocker detection program for detecting a node "CorruptDockerOverlay2" fault; a "CorruptDockerOverlay2" fault means that the space currently occupied by Docker is too large. The CorruptDocker detection program of this embodiment detects the docker log in the "plug-in" detection mode and matches the pattern "returned error: readlink /var/lib/docker/overlay2.*: invalid argument.*", the detection reason being "CorruptDockerOverlay2".
The customized node fault detection programs of this embodiment include a KubeletError detection program for detecting a node "KubeletError" fault, which is a kubelet running error on a Worker node. The KubeletError detection program of this embodiment detects the kubelet log in the "plug-in" detection mode and matches the pattern "Failed to get system container stats for (.+): failed to get cgroup stats for (.+): failed to get container info for (.+): unknown container (.+)", the detection reason being "KubeletError".
The customized node fault detection programs of this embodiment include an NICD detection program for detecting a node "NICDeleted" fault; a "NICDeleted" fault means the network card has been deleted and networking is impossible. The NICD detection program of this embodiment detects /var/log/messages in the "plug-in" detection mode and matches the pattern "ntpd\[\d+\]: Deleting interface (.+)", the detection reason being "NICHasBeenDeleted".
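As an illustration of such a plug-in (a sketch only; the exact output contract between the plug-in and the detector is assumed here, not specified in the text above), the GPU check could be written as a small shell script that scans the dmesg log:
# hypothetical plug-in: report whether the GPU has fallen off the bus,
# using the same pattern as the rule described above
if dmesg | grep -E -q 'NVRM: Xid (.+): [0-9]+, GPU has fallen off the bus'; then
  echo "GPUHasFallenOffTheBus"
  exit 1   # non-zero exit: fault detected
else
  echo "gpu ok"
  exit 0   # zero exit: this check passed
fi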
Referring to fig. 2, in the method for detecting and processing a node fault in a K8s cluster according to this embodiment, reporting a detection result to an API Server of a K8s cluster, and storing the detection result includes:
the detection result is reported to the API Server in the form of a NodeCondition through remote calls, and the API Server stores the NodeCondition into Etcd;
NodeCondition is a field carried by the Node structure of K8s and used for storing the default node states of K8s; this NodeCondition field is reused to store the customized node states detected by the customized node fault detection programs.
In this embodiment the NodeCondition field is reused: the node fault detection result is reported to the API Server through remote calls in a manner similar to the existing K8s mechanism, and the API Server stores the NodeCondition into Etcd in the existing way.
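For orientation, the reported condition might look roughly as follows when written onto the Node object; this is a hand-made sketch (the node name, timestamps and the use of kubectl's --subresource=status flag are assumptions, since the patent only states that the detector reports the NodeCondition to the API Server by remote call):
# requires a kubectl version supporting --subresource (assumption)
kubectl patch node worker-1 --subresource=status --type=strategic -p '
{
  "status": {
    "conditions": [
      {
        "type": "NTPProblem",
        "status": "True",
        "reason": "NTPIsDown",
        "message": "ntp.service is not running",
        "lastHeartbeatTime": "2021-05-28T08:00:00Z",
        "lastTransitionTime": "2021-05-28T08:00:00Z"
      }
    ]
  }
}'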
etcd is a distributed, reliable key-value storage system for storing critical data in a distributed system. The cluster data can be accessed by the client provided by the etcd, and the etcd can also be directly accessed by http (similar to curl command). Inside the etcd, the data representation is also simple, and the data storage of the etcd can be directly understood as an ordered map which stores key-value data. Meanwhile, the etcd also supports a watch mechanism for facilitating the client to subscribe the change of the data, and the incremental update of the data in the etcd is carried out in real time through the watch, so that the business logic such as data synchronization and the like in the etcd is realized.
The interfaces provided by etcd are divided into the following 5 groups:
the first group is Put and Delete. The put and delete operations are very simple, data can be written into the cluster by only providing one key and one value, and only the key needs to be specified when the data is deleted.
The second group is query operations. etcd supports two types of queries: the first is a query of a single key, and the second is a range of a specified key.
The third group is data subscription. etcd provides a Watch mechanism; a Watch can subscribe in real time to incremental data updates in etcd, and it can be placed either on a single key or on a key prefix; in practical application scenarios the second case is usually used.
The fourth group is transactional operations. etcd provides simple transaction support: a user can perform one set of operations when a group of conditions is met and another set of operations when the conditions are not met, similar to an if-else statement in code; etcd ensures the atomicity of the whole operation.
The fifth group is the Leases interface. The Leases interface is a common design model in distributed systems.
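As a generic illustration (not prescribed by the patent), the five interface groups correspond roughly to the following etcdctl (v3) commands:
etcdctl put /demo/key "value"      # group 1: Put
etcdctl del /demo/key              # group 1: Delete
etcdctl get /demo/key              # group 2: single-key query
etcdctl get /demo/ --prefix        # group 2: range/prefix query
etcdctl watch /demo/ --prefix      # group 3: subscribe to incremental updates
etcdctl txn --interactive          # group 4: conditional transaction (if/then/else)
etcdctl lease grant 60             # group 5: create a 60-second lease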
Further, referring to fig. 2, in the method for detecting and processing a node fault in a K8s cluster according to the present embodiment, the monitoring the state of the Worker node Worker according to the detection result includes:
monitoring the state of a Worker node Worker by monitoring the change event of the detection result in the API Server through the list-watch;
where list means calling the resource's list API to enumerate resources, implemented over short-lived HTTP connections, and watch means calling the resource's watch API to listen for resource change events.
Etcd stores the data information of the cluster, and the API Server serves as the unified entry: any operation on the data must go through the API Server. Clients (kubelet/scheduler/controller-manager) listen, through list-watch, for create, update and delete events of resources (Pod/RS/RC, etc.) in the API Server, and call the corresponding event handling functions according to the event type.
What, then, is list-watch? As the name implies, it consists of two parts, list and watch. list means calling the resource's list API to enumerate resources, implemented over short-lived HTTP connections; watch means calling the resource's watch API to listen for resource change events, implemented over a long-lived HTTP connection.
The list and the watch ensure the reliability of the message together, and avoid the scene of inconsistent state caused by the loss of the message. Specifically, the list API may query the current resource and its corresponding state (i.e., the expected state), and the client compares the expected state with the actual state to correct the resource with inconsistent state. The Watch API and the API Server keep a long link, receive the state change event of the resource and do corresponding processing. If only the watch API is called, if the connection is interrupted at a certain time point, the message can be lost, so that the problem of message loss needs to be solved through the list API. From another perspective, we can consider the list API to obtain full data and the watch API to obtain incremental data. Although the effect of synchronizing the resource states can be achieved only by polling the list API, there are problems of high overhead and insufficient real-time performance.
The message is required to be real-time, and under a list-watch mechanism, each time when the state change event is generated by the resource of the API Server, the event is timely pushed to the client, so that the real-time property of the message is ensured.
The sequentiality of the message is also very important, in a concurrent scenario, the client may receive multiple events of the same resource in a short time, and for K8S concerning final consistency, it needs to know which event is the most recent event and guarantee the final state of the resource as the state expressed by the most recent event. K8S has a resourceVersion tag in each resource event, which is an incremental number, so when the client concurrently processes events of the same resource, it can compare the resourceVersion to ensure that the final state is consistent with the expected state of the latest event.
The List-watch also has the characteristic of high performance, and although the effect of final consistency of resources can be achieved only by periodically calling the List API, the overhead is greatly increased by periodically and frequently polling, and the pressure of the API Server is increased. The watch is used as an asynchronous message notification mechanism and multiplexes a long link, so that the performance is guaranteed while the real-time performance is guaranteed.
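From the command line, the same list-watch pattern can be observed with kubectl (generic Kubernetes usage shown for illustration):
kubectl get nodes            # list: the full current state over a short HTTP request
kubectl get nodes --watch    # watch: a long-lived connection streaming node change events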
Therefore, the method for detecting and processing the node fault in the K8s cluster in the embodiment realizes real-time monitoring of the state of the working node Worker through a list-watch mechanism, so that the fault processing system can execute a recovery action in time according to a monitoring result to process the node fault.
Referring to fig. 2, in the method for detecting and processing a node fault in a K8s cluster of this embodiment: once a change of the node state is observed, the corresponding recovery actions are executed according to the pre-configured rules; the recovery actions include blocking (preventing new Pods from being scheduled onto the node), eviction (transferring the Pods on the node to other nodes), removing the node from the cluster, applying for a new node to join the cluster, and the like.
Specifically, in the method for detecting and processing a node failure in a K8s cluster of this embodiment, when a state changes, a corresponding recovery action is executed according to a preconfigured rule, and performing failure processing includes:
when the state of the working node Worker is GPUFallOff fault and/or NTPProblem fault, executing corresponding recovery actions including alarming and blocking;
wherein, the blocking is to prevent a new pod from being scheduled to the node, and the alarming is to send out warning information to inform relevant staff to perform subsequent processing of the fault.
In the method for detecting and processing a node failure in a K8s cluster of this embodiment, when a state changes, a corresponding recovery action is executed according to a preconfigured rule, and performing failure processing includes:
when the state of the Worker node is a KernelDeadlock fault, and/or a ReadonlyFilesystem fault, and/or a CorruptDockerOverlay2 fault, and/or a KubeletError fault, and/or a NICDeleted fault;
executing corresponding recovery actions including alarming, blocking and evicting;
wherein said blocking is to prevent a new pod from being dispatched to the node; the eviction is to transfer the pod on the node to other nodes; the alarm means that warning information is sent to inform relevant workers to carry out follow-up processing of faults.
The recovery actions in the method for detecting and processing node faults in a K8s cluster of this embodiment further include removing the node from the cluster, applying for a new node to join the cluster, and the like.
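The blocking, eviction and removal actions correspond to standard Kubernetes node operations; a minimal command-line sketch is given below (the node name worker-1 is illustrative):
kubectl cordon worker-1                       # block: mark the node unschedulable
kubectl drain worker-1 --ignore-daemonsets    # evict: transfer Pods to other nodes
kubectl delete node worker-1                  # remove the node from the cluster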
In the method for detecting and processing a node fault in a K8s cluster of this embodiment:
when the state of the Worker node is a "GPUFallOff" fault, the plug-in detects the dmesg log and matches the pattern "NVRM: Xid (.+): \d+, GPU has fallen off the bus", the detection reason being "GPUHasFallenOffTheBus";
and/or when the state of the work node Worker is "NTPProblem" fault, the detection mode is as follows:
executing the script:
if systemctl -q is-active "ntp.service"; then
  echo "ntp.service is running"
  exit 0
else
  echo "ntp.service is not running"
  exit 1
fi;
and/or, when the state of the Worker node is a "KernelDeadlock" fault:
the plug-in detects the dmesg log, the detection reason being "AUFSUmountHung"; the plug-in also detects the dmesg log and matches the pattern "task docker:\w+ blocked for more than \w+ seconds\.", the detection reason being "DockerHung";
and/or, when the state of the Worker node is a "ReadonlyFilesystem" fault, the plug-in detects the dmesg log and matches the pattern "Remounting filesystem read-only nvme\w+: Identify Controller failed (.+)", the detection reason being "FilesystemIsReadOnly";
and/or, when the state of the Worker node is a "CorruptDockerOverlay2" fault, the plug-in detects the docker log and matches the pattern "returned error: readlink /var/lib/docker/overlay2.*: invalid argument.*", the detection reason being "CorruptDockerOverlay2";
and/or, when the state of the Worker node is a "KubeletError" fault, the plug-in detects the kubelet log and matches the pattern "Failed to get system container stats for (.+): failed to get cgroup stats for (.+): failed to get container info for (.+): unknown container (.+)", the detection reason being "KubeletError";
and/or, when the state of the Worker node is a "NICDeleted" fault, the plug-in detects /var/log/messages and matches the pattern "ntpd\[\d+\]: Deleting interface (.+)", the detection reason being "NICHasBeenDeleted".
Thus, the preconfigured rules of the present embodiment are shown in the following table:
Node condition | Detection reason | Recovery actions
GPUFallOff | GPUHasFallenOffTheBus | alarm, block (cordon)
NTPProblem | NTPIsDown | alarm, block (cordon)
KernelDeadlock | AUFSUmountHung / DockerHung | alarm, block, evict
ReadonlyFilesystem | FilesystemIsReadOnly | alarm, block, evict
CorruptDockerOverlay2 | CorruptDockerOverlay2 | alarm, block, evict
KubeletError | KubeletError | alarm, block, evict
NICDeleted | NICHasBeenDeleted | alarm, block, evict
this embodiment provides a K8s cluster inner node fault detection and processing apparatus simultaneously, includes:
the node fault detection module is used for carrying out fault detection on each working node Worker of the K8s cluster through a self-defined node fault detection program, reporting a detection result to an API Server of the K8s cluster and storing the detection result;
and the fault processing module monitors the state of the working node Worker according to the detection result, and executes corresponding recovery action according to a preset rule when the state changes so as to process the fault.
In the device for detecting and processing node faults in a K8s cluster of this embodiment, in order to achieve more kinds of node fault detection in the K8s cluster, the node fault detection module detects through customized node fault detection programs; compared with relying on K8s's kubelet on the node for node health detection, customized program design allows more targeted and diversified fault detection, thereby meeting higher requirements of node fault detection.
The device for detecting and processing node faults in a K8s cluster of this embodiment reports the detection result to the API Server of the K8s cluster and saves it, which makes it convenient to query and monitor the detected fault data and realizes automatic monitoring of node faults, so that feedback can be given in time.
In the device for detecting and processing a node fault in a K8s cluster according to this embodiment, a fault processing module monitors the state of a Worker node Worker according to a detection result, and executes a corresponding recovery action according to a preset rule to perform fault processing when the state changes. The fault processing module is used for carrying out automatic processing on the fault according to the pre-configuration when the node fault is monitored by pre-configuring the node fault and the corresponding recovery action, so that the cost of manual processing is saved, and the efficiency of node fault processing is improved.
Therefore, the device for detecting and processing the node fault in the K8s cluster in the embodiment solves the problem of detecting and automatically processing the node fault in the K8s cluster, and particularly realizes a method for determining whether a node is abnormal according to a user-defined condition and automatically processing the node. The node fault detection and processing device in the K8s cluster of this embodiment can run a self-defined detection program in the form of a plug-in to complete fault detection of the node, report the detection result to the API Server of the K8s cluster, and then perform persistent storage, thereby triggering recovery actions, and forming a whole set of closed-loop automation system; the node fault detection and processing capability of K8s is enhanced, the system runs in a cloud native mode completely, does not depend on other third-party components, has high availability, greatly improves the node fault processing efficiency, does not need manual intervention, and has high practical value.
As an optional implementation manner of this embodiment, the fault detection module of the node fault detection and processing apparatus in the K8s cluster of this embodiment includes a fault detection program detector respectively deployed on each Worker node Worker, and is used for detecting a fault of a node; the fault detection program detector supports plug-in operation, and performs fault detection on each working node Worker of the K8s cluster by executing a self-defined node fault detection program.
In this embodiment, a fault detection program detector is deployed on each work node Worker of the K8s cluster and used for performing more targeted and diversified fault detection on the work node Worker, the fault detection program detector supports plug-in operation, and developers can develop a self-defined node fault detection program according to the requirement of node fault detection and operate on the fault detection program detector in a plug-in mode to detect node faults.
Optionally, the detector is deployed on each Worker node in the K8s cluster in the form of a DaemonSet.
A DaemonSet is a deployment form of Pods in K8s, similar to a daemon process: it runs one Pod on each designated node. A DaemonSet defines Pods that provide node-local facilities. These Pods may be important for the operation and maintenance of the cluster, for example as a helper for network links or as part of a network plug-in. Each time a new node is added to the cluster, if the node matches the DaemonSet specification, the control plane schedules a Pod of the DaemonSet to run on the new node. Thus, this embodiment deploys the fault detection program as a containerized DaemonSet; the fault detection program provides the necessary fault detection for the Worker nodes, and when a node is newly added, the fault detection program is automatically deployed on the new node, which greatly improves management and operation-and-maintenance efficiency compared with the traditional approach of deploying a monitoring program (such as Zabbix or Open-falcon, commonly used in the industry) on each node.
Further, the customized node fault detection programs in this embodiment include: an ntp detection program for detecting node ntp service exceptions, and/or a graphics card detection program for detecting node graphics card (GPU) faults.
NTP (Network Time Protocol) is a protocol used to synchronize computer time; it can synchronize a computer to its server or clock source (e.g. a quartz clock, GPS, etc.), provide highly accurate time correction (a deviation of less than 1 millisecond from the standard on a LAN, and tens of milliseconds on a WAN), and prevent malicious protocol attacks by means of encrypted confirmation. An ntp service exception may cause the server clock to go out of sync. The ntp detection program of this embodiment is implemented by the following custom script:
Execute the script:
if systemctl -q is-active "ntp.service"; then
  echo "ntp.service is running"
  exit 0
else
  echo "ntp.service is not running"
  exit 1
fi
A node graphics card fault may cause GPU-dependent programs to fail to run. The graphics card detection program of this embodiment detects the dmesg log in the "plug-in" detection mode and matches the pattern "NVRM: Xid (.+): \d+, GPU has fallen off the bus" to detect the fault.
The customized node fault detection programs of this embodiment include a Kernel detection program for detecting a node "KernelDeadlock" fault; a "KernelDeadlock" fault means the node kernel is locked up: the screen shows no useful print information, the network is interrupted, and the keyboard and mouse do not respond. The Kernel detection program of this embodiment detects the dmesg log in the "plug-in" detection mode, the detection reason being "AUFSUmountHung" (a deadlock); it also detects the dmesg log in the "plug-in" detection mode and matches the pattern "task docker:\w+ blocked for more than \w+ seconds\.", the detection reason being "DockerHung".
The customized node fault detection programs of this embodiment include a Readonly detection program for detecting a node "ReadonlyFilesystem" fault; a "ReadonlyFilesystem" fault means that files cannot be written or newly created, i.e. there is only read permission and no write permission. The Readonly detection program of this embodiment detects the dmesg log in the "plug-in" detection mode and matches the pattern "Remounting filesystem read-only nvme\w+: Identify Controller failed (.+)", the detection reason being "FilesystemIsReadOnly".
The customized node fault detection programs of this embodiment include a CorruptDocker detection program for detecting a node "CorruptDockerOverlay2" fault; a "CorruptDockerOverlay2" fault means that the space currently occupied by Docker is too large. The CorruptDocker detection program of this embodiment detects the docker log in the "plug-in" detection mode and matches the pattern "returned error: readlink /var/lib/docker/overlay2.*: invalid argument.*", the detection reason being "CorruptDockerOverlay2".
The customized node fault detection programs of this embodiment include a KubeletError detection program for detecting a node "KubeletError" fault, which is a kubelet running error on a Worker node. The KubeletError detection program of this embodiment detects the kubelet log in the "plug-in" detection mode and matches the pattern "Failed to get system container stats for (.+): failed to get cgroup stats for (.+): failed to get container info for (.+): unknown container (.+)", the detection reason being "KubeletError".
The customized node fault detection programs of this embodiment include an NICD detection program for detecting a node "NICDeleted" fault; a "NICDeleted" fault means the network card has been deleted and networking is impossible. The NICD detection program of this embodiment detects /var/log/messages in the "plug-in" detection mode and matches the pattern "ntpd\[\d+\]: Deleting interface (.+)", the detection reason being "NICHasBeenDeleted".
In the device for detecting and processing a node failure in a K8s cluster of this embodiment, reporting a detection result to an API Server of a K8s cluster, and storing includes:
the detection result of the detector is reported to the API Server in the form of a NodeCondition through remote calls, and the API Server stores the NodeCondition into Etcd;
NodeCondition is a field carried by the Node structure of K8s and used for storing the default node states of K8s; this NodeCondition field is reused to store the customized node states detected by the customized node fault detection programs.
In this embodiment the NodeCondition field is reused: the detection result of the detector is reported to the API Server through remote calls in a manner similar to the existing K8s mechanism, and the API Server stores the NodeCondition into Etcd in the existing way.
etcd is a distributed, reliable key-value storage system for storing critical data in a distributed system. The cluster data can be accessed by the client provided by the etcd, and the etcd can also be directly accessed by http (similar to curl command). Inside the etcd, the data representation is also simple, and the data storage of the etcd can be directly understood as an ordered map which stores key-value data. Meanwhile, the etcd also supports a watch mechanism for facilitating the client to subscribe the change of the data, and the incremental update of the data in the etcd is carried out in real time through the watch, so that the business logic such as data synchronization and the like in the etcd is realized.
The interfaces provided by etcd are divided into the following 5 groups:
the first group is Put and Delete. The put and delete operations are very simple, data can be written into the cluster by only providing one key and one value, and only the key needs to be specified when the data is deleted.
The second group is query operations. etcd supports two types of queries: the first is a query of a single key, and the second is a range of a specified key.
The third group is data subscription. etcd provides a Watch mechanism; a Watch can subscribe in real time to incremental data updates in etcd, and it can be placed either on a single key or on a key prefix; in practical application scenarios the second case is usually used.
The fourth group is transactional operations. etcd provides simple transaction support: a user can perform one set of operations when a group of conditions is met and another set of operations when the conditions are not met, similar to an if-else statement in code; etcd ensures the atomicity of the whole operation.
The fifth group is the Leases interface. The Leases interface is a common design model in distributed systems.
Further, the monitoring, by the fault processing module in the device for detecting and processing a fault of a node in a K8s cluster according to a detection result, a state of a Worker node Worker includes:
the fault processing module monitors the change event of the detection result in the API Server through list-watch to realize the monitoring of the state of the Worker node;
where list means calling the resource's list API to enumerate resources, implemented over short-lived HTTP connections, and watch means calling the resource's watch API to listen for resource change events.
Etcd stores the data information of the cluster, and the API Server serves as the unified entry: any operation on the data must go through the API Server. Clients (kubelet/scheduler/controller-manager) listen, through list-watch, for create, update and delete events of resources (Pod/RS/RC, etc.) in the API Server, and call the corresponding event handling functions according to the event type.
What, then, is list-watch? As the name implies, it consists of two parts, list and watch. list means calling the resource's list API to enumerate resources, implemented over short-lived HTTP connections; watch means calling the resource's watch API to listen for resource change events, implemented over a long-lived HTTP connection.
The list and the watch ensure the reliability of the message together, and avoid the scene of inconsistent state caused by the loss of the message. Specifically, the list API may query the current resource and its corresponding state (i.e., the expected state), and the client compares the expected state with the actual state to correct the resource with inconsistent state. The Watch API and the API Server keep a long link, receive the state change event of the resource and do corresponding processing. If only the watch API is called, if the connection is interrupted at a certain time point, the message can be lost, so that the problem of message loss needs to be solved through the list API. From another perspective, we can consider the list API to obtain full data and the watch API to obtain incremental data. Although the effect of synchronizing the resource states can be achieved only by polling the list API, there are problems of high overhead and insufficient real-time performance.
The message is required to be real-time, and under a list-watch mechanism, each time when the state change event is generated by the resource of the API Server, the event is timely pushed to the client, so that the real-time property of the message is ensured.
The sequentiality of the message is also very important, in a concurrent scenario, the client may receive multiple events of the same resource in a short time, and for K8S concerning final consistency, it needs to know which event is the most recent event and guarantee the final state of the resource as the state expressed by the most recent event. K8S has a resourceVersion tag in each resource event, which is an incremental number, so when the client concurrently processes events of the same resource, it can compare the resourceVersion to ensure that the final state is consistent with the expected state of the latest event.
The List-watch also has the characteristic of high performance, and although the effect of final consistency of resources can be achieved only by periodically calling the List API, the overhead is greatly increased by periodically and frequently polling, and the pressure of the API Server is increased. The watch is used as an asynchronous message notification mechanism and multiplexes a long link, so that the performance is guaranteed while the real-time performance is guaranteed.
Therefore, the device for detecting and processing node faults in a K8s cluster of this embodiment monitors the state of the working node Worker in real time through the list-watch mechanism, which helps the fault processing module execute recovery actions in time according to the monitoring result and handle the node fault.
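For illustration only, the following minimal Go sketch (not part of the claimed embodiment; the handler logic is an assumption) shows how a fault processing module could monitor Node condition changes through the Kubernetes client-go shared informer, which is built on the list-watch mechanism described above: an initial list obtains the full data and the subsequent watch receives incremental events.

package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// The fault processing module runs inside the K8s cluster, so it can use in-cluster configuration.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The shared informer first lists all Node objects and then watches for changes,
	// which is exactly the list-watch behaviour described above.
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	nodeInformer := factory.Core().V1().Nodes().Informer()
	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			node := newObj.(*corev1.Node)
			for _, cond := range node.Status.Conditions {
				// Custom conditions reported by the node fault detector reuse the NodeCondition field.
				if cond.Status == corev1.ConditionTrue && cond.Type != corev1.NodeReady {
					fmt.Printf("node %s reports condition %s: %s\n", node.Name, cond.Type, cond.Message)
				}
			}
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // block; the handler fires on every change event pushed by the API Server
}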
In the device for detecting and processing node faults in a K8s cluster of this embodiment, when the fault processing module detects that the node state has changed, it executes a corresponding recovery action according to preconfigured rules, and the fault processing includes:
when the state of the working node Worker is a 'GPUFallOff' fault and/or an 'NTPProblem' fault, executing corresponding recovery actions including alarming and blocking;
wherein blocking means preventing new pods from being scheduled onto the node (see the illustrative sketch below), and alarming means sending warning information to notify the relevant staff to carry out subsequent handling of the fault.
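A hedged illustration of the blocking action: the Go sketch below (the function name cordonNode and its placement are assumptions, not taken from the patent) marks the node unschedulable so that the scheduler places no new pod on it, equivalent to kubectl cordon.

package recovery

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// cordonNode implements the "blocking" recovery action: it patches
// spec.unschedulable to true so that no new pod is scheduled onto the node.
func cordonNode(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	patch := []byte(`{"spec":{"unschedulable":true}}`)
	_, err := client.CoreV1().Nodes().Patch(ctx, nodeName,
		types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}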
In the method for detecting and processing a node fault in a K8s cluster of this embodiment, when the state changes, a corresponding recovery action is executed according to preconfigured rules, and the fault processing includes:
when the state of the working node Worker is a 'KernelDeadlock' fault, and/or a 'ReadonlyFilesystem' fault, and/or a 'CorruptDockerOverlay2' fault, and/or a 'KubeletError' fault, and/or an 'NICDeleted' fault;
executing corresponding recovery actions including alarming, blocking and evicting;
wherein the blocking is to prevent new pods from being scheduled onto the node; the eviction is to transfer the pods on the node to other nodes (see the illustrative sketch below); the alarm means that warning information is sent to notify the relevant staff to carry out follow-up handling of the fault.
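To picture the eviction action, the following Go sketch (an assumption for illustration: the function name evictPodsOnNode, the use of the policy/v1 Eviction API, and the absence of DaemonSet or mirror-pod filtering are not specified by the patent) evicts every pod on the faulty node so that its controller recreates it on a healthy node.

package recovery

import (
	"context"
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// evictPodsOnNode implements the "eviction" recovery action: each pod running
// on the faulty node is evicted through the Eviction API, so workload
// controllers reschedule the pods onto other, healthy nodes.
func evictPodsOnNode(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	pods, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return fmt.Errorf("list pods on node %s: %w", nodeName, err)
	}
	for _, pod := range pods.Items {
		eviction := &policyv1.Eviction{
			ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
		}
		if err := client.PolicyV1().Evictions(pod.Namespace).Evict(ctx, eviction); err != nil {
			return fmt.Errorf("evict pod %s/%s: %w", pod.Namespace, pod.Name, err)
		}
	}
	return nil
}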
In the device for detecting and processing a node fault in a K8s cluster of this embodiment:
the state of the working node Worker is a 'GPUFallOff' fault; the detection mode is that a plug-in inspects the dmesg log and matches entries containing 'NVRM', and the detected reason is 'GPU has fallen off the bus';
and/or, when the state of the working node Worker is an 'NTPProblem' fault, the detection mode is to execute the following script:
if systemctl -q is-active "ntp.service"; then
    echo "ntp.service is running"
    exit 0
else
    echo "ntp.service is not running"
    exit 1
fi
and/or the state of the working node Worker is a 'KernelDeadlock' fault; the detection mode is that a plug-in inspects the dmesg log and, on a match, reports the detected reason 'AUFSUmountHung'; the plug-in also inspects the dmesg log and matches 'task docker:\w+ blocked for more than \w+ seconds' with the detected reason 'DockerHung';
and/or the state of the working node Worker is a 'ReadonlyFilesystem' fault; the detection mode is that a plug-in inspects the dmesg log and matches 'Remounting filesystem read-only' and/or 'nvme\w+: Identify Controller failed (.+)', with the detected reason 'FilesystemIsReadOnly';
and/or the state of the working node Worker is a 'CorruptDockerOverlay2' fault; the detection mode is that a plug-in inspects the docker log and matches 'return error: readlink /var/lib/docker/overlay2.*: invalid argument', with the detected reason 'CorruptDockerOverlay2';
and/or the state of the working node Worker is a 'KubeletError' fault; the detection mode is that a plug-in inspects the kubelet log and matches 'Failed to get system container stats for (.+): failed to get cgroup stats for (.+): failed to get container info for (.+): unknown container (.+)';
and/or the state of the working node Worker is an 'NICDeleted' fault; the detection mode is that a plug-in inspects /var/log/messages and matches 'ntpd\[\d+\]: Deleting interface (.+)', and the detected reason is 'NIC has been deleted'; an illustrative sketch of this pattern-matching style of detection follows.
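Purely for illustration, the detection rules above can be pictured as a table of pattern/reason pairs scanned against a log stream. The Go sketch below is an assumption about how such a plug-in could be structured; the rule set shown only echoes a few of the patterns from this embodiment and is not taken from any existing detector implementation.

package detector

import (
	"bufio"
	"io"
	"regexp"
)

// rule maps a log pattern to the fault reason that would be reported
// back to the API Server as a custom NodeCondition.
type rule struct {
	pattern *regexp.Regexp
	reason  string
}

// Example rules echoing some of the dmesg patterns described above.
var dmesgRules = []rule{
	{regexp.MustCompile(`task docker:\w+ blocked for more than \w+ seconds`), "DockerHung"},
	{regexp.MustCompile(`Remounting filesystem read-only`), "FilesystemIsReadOnly"},
	{regexp.MustCompile(`NVRM.*GPU has fallen off the bus`), "GPUFallOff"},
}

// scanLog reads a log stream line by line and returns the reasons of every
// rule that matched, in the order the matches were found.
func scanLog(r io.Reader, rules []rule) []string {
	var reasons []string
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		line := scanner.Text()
		for _, ru := range rules {
			if ru.pattern.MatchString(line) {
				reasons = append(reasons, ru.reason)
			}
		}
	}
	return reasons
}

In use, such a plug-in would pipe the output of dmesg (or the docker/kubelet log) into scanLog and report each returned reason as a NodeCondition.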
Fig. 3 is a block diagram of the device for detecting and processing a node fault in a K8s cluster according to this embodiment.
The embodiment also provides a storage medium, which stores a computer executable program, and when the computer executable program is executed, the method for detecting and processing the node fault in the K8s cluster is implemented.
The storage medium of this embodiment may comprise a propagated data signal with readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic signals, optical signals, or any suitable combination thereof. A readable signal medium may also be any readable medium, other than a readable storage medium, that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
The embodiment also provides an electronic device, which includes a processor and a memory, where the memory is used to store a computer executable program, and when the computer program is executed by the processor, the processor executes the method for detecting and processing the node failure in the K8s cluster.
The electronic device is in the form of a general purpose computing device. The processor can be one or more and can work together. The invention also does not exclude that distributed processing is performed, i.e. the processors may be distributed over different physical devices. The electronic device of the present invention is not limited to a single entity, and may be a sum of a plurality of entity devices.
The memory stores a computer executable program, typically machine readable code. The computer readable program may be executed by the processor to enable an electronic device to perform the method of the invention, or at least some of the steps of the method.
The memory may include volatile memory, such as Random Access Memory (RAM) and/or cache memory, and may also be non-volatile memory, such as read-only memory (ROM).
It should be understood that elements or components not shown in the above examples may also be included in the electronic device of the present invention. For example, some electronic devices further include a display unit such as a display screen, and some electronic devices further include a human-computer interaction element such as a button, a keyboard, and the like. Electronic devices are considered to be covered by the present invention as long as the electronic devices are capable of executing a computer-readable program in a memory to implement the method of the present invention or at least a part of the steps of the method. From the above description of the embodiments, those skilled in the art will readily appreciate that the present invention can be implemented by hardware capable of executing a specific computer program, such as the system of the present invention, and electronic processing units, servers, clients, mobile phones, control units, processors, etc. included in the system. The invention may also be implemented by computer software for performing the method of the invention, e.g. control software executed by a microprocessor, an electronic control unit, a client, a server, etc. It should be noted that the computer software for executing the method of the present invention is not limited to be executed by one or a specific hardware entity, and can also be realized in a distributed manner by non-specific hardware. For computer software, the software product may be stored in a computer readable storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or may be distributed over a network, as long as it enables the electronic device to perform the method according to the present invention.
The above embodiments are only intended to illustrate the invention, not to limit the technical solutions described herein. Although the present invention has been described in detail in this specification with reference to the above embodiments, the invention is not limited to them; any modification or equivalent replacement of the present invention, and all such modifications and variations, are intended to be included within the scope of this disclosure and the appended claims.

Claims (10)

1. A method for detecting and processing faults of nodes in a K8s cluster is characterized by comprising the following steps:
carrying out fault detection on each working node Worker of the K8s cluster through a self-defined node fault detection program;
reporting the detection result to an API Server of the K8s cluster, and storing the detection result;
and monitoring the state of the Worker node Worker according to the detection result, and executing corresponding recovery action according to a preset rule when the state changes so as to process the fault.
2. The method for detecting and processing the fault of the node in the K8s cluster according to claim 1, wherein the step of performing the fault detection on each Worker node Worker in the K8s cluster through a customized node fault detection program comprises:
a fault detection program detector is respectively deployed on each working node Worker and used for detecting the fault of the node;
the fault detection program detector supports plug-in operation, and performs fault detection on each working node Worker of the K8s cluster by executing a self-defined node fault detection program;
optionally, the fault detection program detector is deployed on each working node Worker in the form of a DaemonSet of the K8s cluster.
3. The method for detecting and processing the node fault in the K8s cluster according to claim 1 or 2, wherein the customized node fault detection program includes: an ntp detection program for detecting abnormalities of the node's ntp service, and/or a graphics card detection program for detecting graphics card (GPU) faults of the node.
4. The method according to claim 1, wherein the reporting the detection result to an API Server of the K8s cluster and the saving comprises:
the detection result is reported to an API Server in a NodeCondition form through a remote calling mode, and the API Server stores the NodeCondition to Etcd;
the NodeCondition is a field carried by a node structure body of the K8s and is used for storing a default node state of the K8s, and the NodeCondition field is multiplexed and used for storing a self-defined node state detected by a self-defined node fault detection program.
5. The method for detecting and processing the fault of the node in the K8s cluster according to claim 1, wherein the monitoring the state of a Worker node Worker according to the detection result includes:
monitoring the state of a Worker node Worker by monitoring the change event of the detection result in the API Server through the list-watch;
the list part calls the list API of a resource to enumerate resources and is implemented over short-lived HTTP connections; the watch part calls the watch API of a resource to listen for resource change events.
6. The method for detecting and processing the failure of the node in the K8s cluster according to claim 1, wherein the performing the failure processing according to the pre-configured rule when the state changes includes:
when the status of the Worker node Worker is a 'GPUFallOff' fault and/or an 'NTPProblem' fault, performing a corresponding recovery action includes blocking, wherein the blocking is to prevent a new pod from being scheduled on the node;
optionally, the performing the corresponding recovery action further comprises alerting.
7. The method for detecting and processing the failure of the node in the K8s cluster according to claim 1, wherein the performing the failure processing according to the pre-configured rule when the state changes includes:
when the state of the working node Worker is a 'KernelDeadlock' fault, and/or a 'ReadonlyFilesystem' fault, and/or a 'CorruptDockerOverlay2' fault, and/or a 'KubeletError' fault, and/or an 'NICDeleted' fault;
performing corresponding recovery actions including blocking and eviction;
wherein said blocking is to prevent a new pod from being dispatched to the node; the eviction is to transfer the pod on the node to other nodes;
optionally, the performing the corresponding recovery action further comprises alerting.
8. The method for detecting and processing the failure of the nodes in the K8s cluster according to claim 6 or 7,
the state of the working node Worker is a 'GPUFallOff' fault; the detection mode is that a plug-in inspects the dmesg log and matches entries containing 'NVRM', and the detected reason is 'GPU has fallen off the bus';
and/or, when the state of the Worker node Worker is 'NTPProblem' fault, detecting that the reason is 'NTPIsDown' by executing a corresponding script;
and/or the state of the working node Worker is a 'KernelDeadlock' fault; the detection mode is that a plug-in inspects the dmesg log and, on a match, reports the detected reason 'AUFSUmountHung'; the plug-in also inspects the dmesg log and matches 'task docker:\w+ blocked for more than \w+ seconds' with the detected reason 'DockerHung';
and/or the state of the working node Worker is a 'ReadonlyFilesystem' fault; the detection mode is that a plug-in inspects the dmesg log and matches 'Remounting filesystem read-only' and/or 'nvme\w+: Identify Controller failed (.+)', with the detected reason 'FilesystemIsReadOnly';
and/or the state of the working node Worker is a 'CorruptDockerOverlay2' fault; the detection mode is that a plug-in inspects the docker log and matches 'return error: readlink /var/lib/docker/overlay2.*: invalid argument', with the detected reason 'CorruptDockerOverlay2';
and/or the state of the working node Worker is a 'KubeletError' fault; the detection mode is that a plug-in inspects the kubelet log and matches 'Failed to get system container stats for (.+): failed to get cgroup stats for (.+): failed to get container info for (.+): unknown container (.+)';
and/or the state of the working node Worker is an 'NICDeleted' fault; the detection mode is that a plug-in inspects /var/log/messages and matches 'ntpd\[\d+\]: Deleting interface (.+)', and the detected reason is 'NIC has been deleted'.
9. A node fault detection and processing device in a K8s cluster is characterized by comprising:
the node fault detection module is used for carrying out fault detection on each working node Worker of the K8s cluster through a self-defined node fault detection program, reporting a detection result to an API Server of the K8s cluster and storing the detection result;
and the fault processing module monitors the state of the working node Worker according to the detection result, and executes corresponding recovery action according to a preset rule when the state changes so as to process the fault.
10. A storage medium storing a computer-executable program, which when executed performs a method of detecting and handling node failure within a K8s cluster as claimed in any one of claims 1 to 8.
CN202110594222.0A 2021-05-28 2021-05-28 Method, device and storage medium for detecting and processing node faults in K8s cluster Pending CN113422692A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110594222.0A CN113422692A (en) 2021-05-28 2021-05-28 Method, device and storage medium for detecting and processing node faults in K8s cluster

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110594222.0A CN113422692A (en) 2021-05-28 2021-05-28 Method, device and storage medium for detecting and processing node faults in K8s cluster

Publications (1)

Publication Number Publication Date
CN113422692A true CN113422692A (en) 2021-09-21

Family

ID=77713183

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110594222.0A Pending CN113422692A (en) 2021-05-28 2021-05-28 Method, device and storage medium for detecting and processing node faults in K8s cluster

Country Status (1)

Country Link
CN (1) CN113422692A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109684036A (en) * 2018-12-17 2019-04-26 武汉烽火信息集成技术有限公司 A kind of container cluster management method, storage medium, electronic equipment and system
CN110798375A (en) * 2019-09-29 2020-02-14 烽火通信科技股份有限公司 Monitoring method, system and terminal equipment for enhancing high availability of container cluster
US20200257593A1 (en) * 2017-10-31 2020-08-13 Huawei Technologies Co., Ltd. Storage cluster configuration change method, storage cluster, and computer system
CN111752759A (en) * 2020-06-30 2020-10-09 重庆紫光华山智安科技有限公司 Kafka cluster fault recovery method, device, equipment and medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200257593A1 (en) * 2017-10-31 2020-08-13 Huawei Technologies Co., Ltd. Storage cluster configuration change method, storage cluster, and computer system
CN109684036A (en) * 2018-12-17 2019-04-26 武汉烽火信息集成技术有限公司 A kind of container cluster management method, storage medium, electronic equipment and system
CN110798375A (en) * 2019-09-29 2020-02-14 烽火通信科技股份有限公司 Monitoring method, system and terminal equipment for enhancing high availability of container cluster
CN111752759A (en) * 2020-06-30 2020-10-09 重庆紫光华山智安科技有限公司 Kafka cluster fault recovery method, device, equipment and medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
孟凡杰 et al.: "《Kubernetes生产化实践之路》", 31 December 2020 *
龚正 et al.: "《Kubernetes权威指南:从Docker到Kubernetes实践全接触第4版》", 30 June 2019 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114827157A (en) * 2022-04-12 2022-07-29 北京云思智学科技有限公司 Cluster task processing method, device and system, electronic equipment and readable medium
CN115189934A (en) * 2022-07-06 2022-10-14 上海交通大学 Automatic configuration safety detection method and system for Kubernets
CN115396291A (en) * 2022-08-23 2022-11-25 度小满科技(北京)有限公司 Redis cluster fault self-healing method based on kubernets trustees
CN116016123A (en) * 2022-12-09 2023-04-25 京东科技信息技术有限公司 Fault processing method, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN113422692A (en) Method, device and storage medium for detecting and processing node faults in K8s cluster
CN106776212B (en) Supervision system and method for container cluster deployment of multi-process application
CN105653425B (en) Monitoring system based on complex event processing engine
CN112667362B (en) Method and system for deploying Kubernetes virtual machine cluster on Kubernetes
CN111459763B (en) Cross-kubernetes cluster monitoring system and method
CN107016480B (en) Task scheduling method, device and system
US20170063965A1 (en) Data transfer in a collaborative file sharing system
CN109634716B (en) OpenStack virtual machine high-availability management end device for preventing brain cracking and management method
CN104360878B (en) A kind of method and device of application software deployment
EP3495946A1 (en) Server updates
CN109656742B (en) Node exception handling method and device and storage medium
CN112416581B (en) Distributed calling system for timed tasks
US10892961B2 (en) Application- and infrastructure-aware orchestration for cloud monitoring applications
US10498817B1 (en) Performance tuning in distributed computing systems
CN111225064A (en) Ceph cluster deployment method, system, device and computer-readable storage medium
CN109144534A (en) Service module dynamic updating method, device and electronic equipment
US11397632B2 (en) Safely recovering workloads within a finite timeframe from unhealthy cluster nodes
CN114490272A (en) Data processing method and device, electronic equipment and computer readable storage medium
CN115640110A (en) Distributed cloud computing system scheduling method and device
CN117130730A (en) Metadata management method for federal Kubernetes cluster
CN110196749A (en) The restoration methods and device of virtual machine, storage medium and electronic device
US20210157690A1 (en) System and method for on-demand warm standby disaster recovery
CN112463561B (en) Fault positioning method, device, equipment and storage medium
CN114691445A (en) Cluster fault processing method and device, electronic equipment and readable storage medium
US20080216057A1 (en) Recording medium storing monitoring program, monitoring method, and monitoring system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210921