CN113422692A - Method, device and storage medium for detecting and processing node faults in K8s cluster - Google Patents

Method, device and storage medium for detecting and processing node faults in K8s cluster

Info

Publication number
CN113422692A
CN113422692A (application CN202110594222.0A)
Authority
CN
China
Prior art keywords
node
fault
detection
worker
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110594222.0A
Other languages
Chinese (zh)
Inventor
别路
吕亚霖
董晓聪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zuoyebang Education Technology Beijing Co Ltd
Original Assignee
Zuoyebang Education Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zuoyebang Education Technology Beijing Co Ltd filed Critical Zuoyebang Education Technology Beijing Co Ltd
Priority to CN202110594222.0A priority Critical patent/CN113422692A/en
Publication of CN113422692A publication Critical patent/CN113422692A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06Management of faults, events, alarms or notifications
    • H04L41/0631Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06Management of faults, events, alarms or notifications
    • H04L41/0654Management of faults, events, alarms or notifications using network fault recovery
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • G06F2009/45591Monitoring or debugging support
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • G06F2009/45595Network integration; Enabling network access in virtual machine instances

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention relates to the technical field of clusters, and discloses a method, a device and a storage medium for detecting and processing node faults in a K8s cluster. The method comprises the following steps: performing fault detection on each Worker node of the K8s cluster through a customized node fault detection program; reporting the detection result to the API Server of the K8s cluster and saving it; and monitoring the state of the Worker node according to the detection result and, when the state changes, executing the corresponding recovery action according to pre-configured rules so as to handle the fault. The method forms a complete closed-loop automated system: it strengthens the node fault detection and handling capability of K8s, runs entirely in a cloud-native manner, does not depend on other third-party components, offers high availability, greatly improves the efficiency of node fault handling, requires no manual intervention, and has high practical value.

Description

Method, device and storage medium for detecting and processing node faults in K8s cluster
Technical Field
The invention relates to the technical field of clusters, in particular to a method, a device and a storage medium for detecting and processing a node fault in a K8s cluster.
Background
K8s, i.e. Kubernetes, is the standard open-source container orchestration and scheduling platform in the cloud-native domain. A Node is the computer entity managed by a K8s cluster; nodes are divided into Master nodes and Worker nodes, and a number of nodes together form a cluster. A Pod is the smallest unit scheduled by K8s; a service consists of one or more Pods, and a Pod corresponds to a specific computing task on a computer.
K8s executes workloads by placing containers into Pods that run on Nodes. Each node contains the services needed to run Pods; these nodes are managed by the control plane. Components on a node include the kubelet, the container runtime, and kube-proxy. The state of a node contains the following information: addresses, conditions, capacity and allocatable, and general info; kubectl may be used to view node status and other details. The conditions field describes the state of all Running nodes. Examples of conditions include:
Ready: True if the node is healthy and ready to receive Pods; False indicates that the node is unhealthy and cannot receive Pods; Unknown indicates that the node controller has not received a message from the node during the last node-monitor-grace-period (default 40 seconds).
DiskPressure: True indicates that the node's free disk space is insufficient for adding a new Pod; otherwise False.
MemoryPressure: True indicates that the node is under memory pressure, i.e. the node's available memory is low; otherwise False.
PIDPressure: True indicates that the node is under process pressure, i.e. there are too many processes on the node; otherwise False.
NetworkUnavailable: True indicates that the node's network configuration is incorrect; otherwise False.
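For example, the node conditions described above can be inspected from the command line; a minimal kubectl sketch is shown below (the node name worker-1 is only an illustrative assumption):
kubectl get nodes                 # list nodes and their Ready summary
kubectl describe node worker-1    # human-readable view including the Conditions section
kubectl get node worker-1 -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'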
The existing K8s performs node health detection through the kubelet on the node and relies on a lease renewal mechanism; by default it can detect the following four kinds of problems: network unavailability, insufficient memory, insufficient disk space, and an insufficient number of available processes. Compared with complex real-world scenarios, this default detection capability of K8s is weak; other software or hardware faults that the system depends on for operation, beyond the above four problems, cannot be detected, for example: ntp service exceptions (which may cause the server clock to go out of sync), graphics card failures (which may cause GPU-dependent programs to fail to run), and so on. In addition, after a fault is found, manual intervention is needed to move the node out of the cluster and then purchase a new node to add to the cluster, which has high labor cost and low efficiency.
In view of the above, the present invention is particularly proposed.
Disclosure of Invention
In order to solve the problem of node fault detection and automatic handling in a K8s cluster, the invention provides a technical scheme in which whether a node is abnormal can be judged according to customized conditions and handled automatically; the technical scheme specifically comprises the following steps:
the invention provides a method for detecting and processing a fault of a node in a K8s cluster, which comprises the following steps:
carrying out fault detection on each working node Worker of the K8s cluster through a self-defined node fault detection program;
reporting the detection result to an API Server of the K8s cluster, and storing the detection result;
and monitoring the state of the Worker node Worker according to the detection result, and executing corresponding recovery action according to a preset rule when the state changes so as to process the fault.
As an optional embodiment of the present invention, the performing fault detection on each working node Worker of the K8s cluster by using a self-defined node fault detection program includes:
a fault detection program (detector) is deployed on each Worker node and used for detecting faults of that node;
the detector supports plug-in operation, and performs fault detection on each Worker node of the K8s cluster by executing the customized node fault detection programs;
optionally, the detector is deployed on each Worker node in the K8s cluster in the form of a DaemonSet.
As an optional embodiment of the present invention, the customized node fault detection program includes: an ntp detection program for detecting node ntp service exceptions, and/or a graphics card detection program for detecting node graphics card (GPU) faults.
As an alternative embodiment of the present invention, referring to fig. 2, the reporting the detection result to the API Server of the K8s cluster, and the saving includes:
the detection result is reported to the API Server in the form of a NodeCondition through remote calls, and the API Server stores the NodeCondition into Etcd;
NodeCondition is a field carried by the Node structure of K8s and used for storing the default node states of K8s; this NodeCondition field is reused to store the customized node states detected by the customized node fault detection programs.
As an optional implementation manner of the present invention, the monitoring the state of the work node Worker according to the detection result includes:
monitoring the state of a Worker node Worker by monitoring the change event of the detection result in the API Server through the list-watch;
where list means calling the resource's list API to enumerate resources, implemented over short-lived HTTP connections, and watch means calling the resource's watch API to listen for resource change events.
As an optional embodiment of the present invention, when the state changes, the performing the corresponding recovery action according to a pre-configured rule, and performing the fault processing includes:
when the status of the Worker node Worker is a 'GPUFallOff' fault and/or an 'NTPProblem' fault, performing a corresponding recovery action includes blocking, wherein the blocking is to prevent a new pod from being scheduled on the node;
optionally, the performing the corresponding recovery action further comprises alerting.
As an optional embodiment of the present invention, when the state changes, the performing the corresponding recovery action according to a pre-configured rule, and performing the fault processing includes:
when the state of the Worker node is a KernelDeadlock fault, and/or a ReadonlyFilesystem fault, and/or a CorruptDockerOverlay2 fault, and/or a KubeletError fault, and/or a NICDeleted fault;
performing corresponding recovery actions including blocking and eviction;
wherein said blocking is to prevent a new pod from being dispatched to the node; the eviction is to transfer the pod on the node to other nodes;
optionally, the performing the corresponding recovery action further comprises alerting.
As an optional embodiment of the present invention, when the state of the Worker node is a "GPUFallOff" fault, the plug-in detects the dmesg log and matches the pattern "NVRM: Xid (.+): \d+, GPU has fallen off the bus", the detection reason being "GPUHasFallenOffTheBus";
and/or, when the state of the Worker node is an "NTPProblem" fault, the detection reason "NTPIsDown" is obtained by executing the corresponding script;
and/or, when the state of the Worker node is a "KernelDeadlock" fault:
the plug-in detects the dmesg log, the detection reason being "AUFSUmountHung"; the plug-in also detects the dmesg log and matches the pattern "task docker:\w+ blocked for more than \w+ seconds\.", the detection reason being "DockerHung";
and/or, when the state of the Worker node is a "ReadonlyFilesystem" fault, the plug-in detects the dmesg log and matches the pattern "Remounting filesystem read-only nvme\w+: Identify Controller failed (.+)", the detection reason being "FilesystemIsReadOnly";
and/or, when the state of the Worker node is a "CorruptDockerOverlay2" fault, the plug-in detects the docker log and matches the pattern "returned error: readlink /var/lib/docker/overlay2.*: invalid argument.*", the detection reason being "CorruptDockerOverlay2";
and/or, when the state of the Worker node is a "KubeletError" fault, the plug-in detects the kubelet log and matches the pattern "Failed to get system container stats for (.+): failed to get cgroup stats for (.+): failed to get container info for (.+): unknown container (.+)", the detection reason being "KubeletError";
and/or, when the state of the Worker node is a "NICDeleted" fault, the plug-in detects /var/log/messages and matches the pattern "ntpd\[\d+\]: Deleting interface (.+)", the detection reason being "NICHasBeenDeleted".
The invention also provides a device for detecting and processing the fault of the node in the K8s cluster, which comprises:
the node fault detection module is used for carrying out fault detection on each working node Worker of the K8s cluster through a self-defined node fault detection program, reporting a detection result to an API Server of the K8s cluster and storing the detection result;
and the fault processing module monitors the state of the working node Worker according to the detection result, and executes corresponding recovery action according to a preset rule when the state changes so as to process the fault.
The invention also provides a storage medium which stores a computer executable program, and when the computer executable program is executed, the method for detecting and processing the fault of the node in the K8s cluster is realized.
Compared with the prior art, the invention has the beneficial effects that:
the method for detecting and processing the node fault in the K8s cluster can run a self-defined detection program in a plug-in mode to complete the fault detection of the node, report the detection result to the API Server of the K8s cluster, and then perform persistent storage to trigger the recovery action, thereby forming a whole set of closed-loop automatic system; the node fault detection and processing capability of K8s is enhanced, the system runs in a cloud native mode completely, does not depend on other third-party components, has high availability, greatly improves the node fault processing efficiency, does not need manual intervention, and has high practical value.
Description of the drawings:
FIG. 1 is a block flow diagram of a method for detecting and handling a failure of a node in a K8s cluster according to the present invention;
FIG. 2 is a block flow diagram of an embodiment of a K8s cluster intra-node fault detection and handling method of the present invention;
fig. 3 is an overall framework diagram of the K8s cluster node fault detection and processing apparatus according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments.
Thus, the following detailed description of the embodiments of the invention is not intended to limit the scope of the invention as claimed, but is merely representative of some embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the embodiments of the present invention and the features and technical solutions thereof may be combined with each other without conflict.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
In the description of the present invention, it should be noted that the terms "upper", "lower", and the like refer to orientations or positional relationships based on those shown in the drawings, or orientations or positional relationships that are conventionally arranged when the products of the present invention are used, or orientations or positional relationships that are conventionally understood by those skilled in the art, and such terms are used for convenience of description and simplification of the description, and do not refer to or imply that the devices or elements referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like are used merely to distinguish one description from another, and are not to be construed as indicating or implying relative importance.
Referring to fig. 1, the present embodiment provides a method for detecting and processing a node fault in a K8s cluster, including:
carrying out fault detection on each working node Worker of the K8s cluster through a self-defined node fault detection program;
reporting the detection result to an API Server of the K8s cluster, and storing the detection result;
and monitoring the state of the Worker node Worker according to the detection result, and executing corresponding recovery action according to a preset rule when the state changes so as to process the fault.
In order to realize more kinds of fault detection on the nodes in a K8s cluster, the fault detection and processing method of this embodiment performs detection through customized node fault detection programs; compared with relying on K8s's kubelet on the node for node health detection, customized program design allows more targeted and diversified fault detection, thereby meeting higher requirements of node fault detection.
The method for detecting and processing the node fault in the K8s cluster in the embodiment reports the detection result to the API Server of the K8s cluster and stores the detection result, so that the detected fault data can be conveniently called and monitored, and the automatic monitoring of the node fault is realized, so that the feedback can be made in time.
The method for detecting and processing the node fault in the K8s cluster in this embodiment monitors the state of the Worker node Worker according to the detection result, and executes a corresponding recovery action according to a preset rule to process the fault when the state changes. By pre-configuring the node fault and the corresponding recovery action, when the node fault is monitored, the fault is automatically processed according to the pre-configuration, so that the cost of manual processing is saved, and the efficiency of node fault processing is improved.
Therefore, the method for detecting and processing the node fault in the K8s cluster in the embodiment solves the problem of node fault detection and automatic processing in the K8s cluster, and particularly realizes a method for determining whether a node is abnormal according to a user-defined condition and automatically processing the node. The node fault detection and processing method in the K8s cluster of this embodiment can run a self-defined detection program in the form of a plug-in to complete the fault detection of the node, and report the detection result to the API Server of the K8s cluster, and then perform persistent storage, thereby triggering the recovery action, and forming a whole set of closed-loop automation system; the node fault detection and processing capability of K8s is enhanced, the system runs in a cloud native mode completely, does not depend on other third-party components, has high availability, greatly improves the node fault processing efficiency, does not need manual intervention, and has high practical value.
As an optional implementation manner of this embodiment, referring to fig. 2, in the method for detecting and processing a node fault in a K8s cluster of this embodiment, the performing fault detection on each Worker node Worker of the K8s cluster through a customized node fault detection program includes:
a fault detection program detector is respectively deployed on each working node Worker and used for detecting the fault of the node;
the fault detection program detector supports plug-in operation, and performs fault detection on each working node Worker of the K8s cluster by executing a self-defined node fault detection program.
In this embodiment, a fault detection program detector is deployed on each work node Worker of the K8s cluster and used for performing more targeted and diversified fault detection on the work node Worker, the fault detection program detector supports plug-in operation, and developers can develop a self-defined node fault detection program according to the requirement of node fault detection and operate on the fault detection program detector in a plug-in mode to detect node faults.
Optionally, the detector is deployed on each Worker node in the K8s cluster in the form of a DaemonSet.
A DaemonSet is a deployment form of Pods in K8s, similar to a daemon process: it runs one Pod on each designated node. A DaemonSet defines Pods that provide node-local facilities. These Pods may be important for the operation and maintenance of the cluster, for example as a helper for network links or as part of a network plug-in. Each time a new node is added to the cluster, if the node matches the DaemonSet specification, the control plane schedules a Pod of the DaemonSet to run on the new node. Thus, this embodiment deploys the fault detection program as a containerized DaemonSet; the fault detection program provides the necessary fault detection for the Worker nodes, and when a node is newly added, the fault detection program is automatically deployed on the new node, which greatly improves management and operation-and-maintenance efficiency compared with the traditional approach of deploying a monitoring program (such as Zabbix or Open-falcon, commonly used in the industry) on each node.
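By way of illustration only, such a detector could be deployed as a DaemonSet roughly as follows; the namespace, labels and image name are assumptions made for this sketch and are not specified in the patent:
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-fault-detector            # hypothetical name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: node-fault-detector
  template:
    metadata:
      labels:
        app: node-fault-detector
    spec:
      containers:
      - name: detector
        image: registry.example.com/node-fault-detector:latest   # hypothetical image
        volumeMounts:
        - name: host-log
          mountPath: /var/log           # give plug-ins read access to host logs
          readOnly: true
      volumes:
      - name: host-log
        hostPath:
          path: /var/log
EOF
With such a manifest, the control plane schedules one detector Pod onto every matching node, including nodes added later.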
Further, the customized node fault detection programs in this embodiment include: an ntp detection program for detecting node ntp service exceptions, and/or a graphics card detection program for detecting node graphics card (GPU) faults.
NTP (Network Time Protocol) is a protocol used to synchronize computer time; it can synchronize a computer to its server or clock source (e.g. a quartz clock, GPS, etc.), provide highly accurate time correction (a deviation of less than 1 millisecond from the standard on a LAN, and tens of milliseconds on a WAN), and prevent malicious protocol attacks by means of encrypted confirmation. An ntp service exception may cause the server clock to go out of sync. The ntp detection program of this embodiment is implemented by the following custom script:
Execute the script:
if systemctl -q is-active "ntp.service"; then
  echo "ntp.service is running"
  exit 0
else
  echo "ntp.service is not running"
  exit 1
fi
A node graphics card fault may cause GPU-dependent programs to fail to run. The graphics card detection program of this embodiment detects the dmesg log in the "plug-in" detection mode and matches the pattern "NVRM: Xid (.+): \d+, GPU has fallen off the bus" to detect the fault.
The customized node fault detection programs of this embodiment include a Kernel detection program for detecting a node "KernelDeadlock" fault; a "KernelDeadlock" fault means the node kernel is locked up: the screen shows no useful print information, the network is interrupted, and the keyboard and mouse do not respond. The Kernel detection program of this embodiment detects the dmesg log in the "plug-in" detection mode, the detection reason being "AUFSUmountHung" (a deadlock); it also detects the dmesg log in the "plug-in" detection mode and matches the pattern "task docker:\w+ blocked for more than \w+ seconds\.", the detection reason being "DockerHung".
The customized node fault detection programs of this embodiment include a Readonly detection program for detecting a node "ReadonlyFilesystem" fault; a "ReadonlyFilesystem" fault means that files cannot be written or newly created, i.e. there is only read permission and no write permission. The Readonly detection program of this embodiment detects the dmesg log in the "plug-in" detection mode and matches the pattern "Remounting filesystem read-only nvme\w+: Identify Controller failed (.+)", the detection reason being "FilesystemIsReadOnly".
The customized node fault detection programs of this embodiment include a CorruptDocker detection program for detecting a node "CorruptDockerOverlay2" fault; a "CorruptDockerOverlay2" fault means that the space currently occupied by Docker is too large. The CorruptDocker detection program of this embodiment detects the docker log in the "plug-in" detection mode and matches the pattern "returned error: readlink /var/lib/docker/overlay2.*: invalid argument.*", the detection reason being "CorruptDockerOverlay2".
The customized node fault detection programs of this embodiment include a KubeletError detection program for detecting a node "KubeletError" fault, which is a kubelet running error on a Worker node. The KubeletError detection program of this embodiment detects the kubelet log in the "plug-in" detection mode and matches the pattern "Failed to get system container stats for (.+): failed to get cgroup stats for (.+): failed to get container info for (.+): unknown container (.+)", the detection reason being "KubeletError".
The customized node fault detection programs of this embodiment include an NICD detection program for detecting a node "NICDeleted" fault; a "NICDeleted" fault means the network card has been deleted and networking is impossible. The NICD detection program of this embodiment detects /var/log/messages in the "plug-in" detection mode and matches the pattern "ntpd\[\d+\]: Deleting interface (.+)", the detection reason being "NICHasBeenDeleted".
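As an illustration of such a plug-in (a sketch only; the exact output contract between the plug-in and the detector is assumed here, not specified in the text above), the GPU check could be written as a small shell script that scans the dmesg log:
# hypothetical plug-in: report whether the GPU has fallen off the bus,
# using the same pattern as the rule described above
if dmesg | grep -E -q 'NVRM: Xid (.+): [0-9]+, GPU has fallen off the bus'; then
  echo "GPUHasFallenOffTheBus"
  exit 1   # non-zero exit: fault detected
else
  echo "gpu ok"
  exit 0   # zero exit: this check passed
fi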
Referring to fig. 2, in the method for detecting and processing a node fault in a K8s cluster according to this embodiment, reporting a detection result to an API Server of a K8s cluster, and storing the detection result includes:
the detection result is reported to the API Server in the form of a NodeCondition through remote calls, and the API Server stores the NodeCondition into Etcd;
NodeCondition is a field carried by the Node structure of K8s and used for storing the default node states of K8s; this NodeCondition field is reused to store the customized node states detected by the customized node fault detection programs.
In this embodiment the NodeCondition field is reused: the node fault detection result is reported to the API Server through remote calls in a manner similar to the existing K8s mechanism, and the API Server stores the NodeCondition into Etcd in the existing way.
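For orientation, the reported condition might look roughly as follows when written onto the Node object; this is a hand-made sketch (the node name, timestamps and the use of kubectl's --subresource=status flag are assumptions, since the patent only states that the detector reports the NodeCondition to the API Server by remote call):
# requires a kubectl version supporting --subresource (assumption)
kubectl patch node worker-1 --subresource=status --type=strategic -p '
{
  "status": {
    "conditions": [
      {
        "type": "NTPProblem",
        "status": "True",
        "reason": "NTPIsDown",
        "message": "ntp.service is not running",
        "lastHeartbeatTime": "2021-05-28T08:00:00Z",
        "lastTransitionTime": "2021-05-28T08:00:00Z"
      }
    ]
  }
}'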
etcd is a distributed, reliable key-value storage system for storing critical data in a distributed system. The cluster data can be accessed by the client provided by the etcd, and the etcd can also be directly accessed by http (similar to curl command). Inside the etcd, the data representation is also simple, and the data storage of the etcd can be directly understood as an ordered map which stores key-value data. Meanwhile, the etcd also supports a watch mechanism for facilitating the client to subscribe the change of the data, and the incremental update of the data in the etcd is carried out in real time through the watch, so that the business logic such as data synchronization and the like in the etcd is realized.
The interfaces provided by etcd are divided into the following 5 groups:
the first group is Put and Delete. The put and delete operations are very simple, data can be written into the cluster by only providing one key and one value, and only the key needs to be specified when the data is deleted.
The second group is query operations. etcd supports two types of queries: the first is a query of a single key, and the second is a range of a specified key.
The third group is data subscription. etcd provides a Watch mechanism; a Watch can subscribe in real time to incremental data updates in etcd, and it can be placed either on a single key or on a key prefix; in practical application scenarios the second case is usually used.
The fourth group is transactional operations. etcd provides simple transaction support: a user can perform one set of operations when a group of conditions is met and another set of operations when the conditions are not met, similar to an if-else statement in code; etcd ensures the atomicity of the whole operation.
The fifth group is the Leases interface. The Leases interface is a common design model in distributed systems.
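As a generic illustration (not prescribed by the patent), the five interface groups correspond roughly to the following etcdctl (v3) commands:
etcdctl put /demo/key "value"      # group 1: Put
etcdctl del /demo/key              # group 1: Delete
etcdctl get /demo/key              # group 2: single-key query
etcdctl get /demo/ --prefix        # group 2: range/prefix query
etcdctl watch /demo/ --prefix      # group 3: subscribe to incremental updates
etcdctl txn --interactive          # group 4: conditional transaction (if/then/else)
etcdctl lease grant 60             # group 5: create a 60-second lease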
Further, referring to fig. 2, in the method for detecting and processing a node fault in a K8s cluster according to the present embodiment, the monitoring the state of the Worker node Worker according to the detection result includes:
monitoring the state of a Worker node Worker by monitoring the change event of the detection result in the API Server through the list-watch;
where list means calling the resource's list API to enumerate resources, implemented over short-lived HTTP connections, and watch means calling the resource's watch API to listen for resource change events.
Etcd stores the data information of the cluster, and the API Server serves as the unified entry: any operation on the data must go through the API Server. Clients (kubelet/scheduler/controller-manager) listen, through list-watch, for create, update and delete events of resources (Pod/RS/RC, etc.) in the API Server, and call the corresponding event handling functions according to the event type.
What, then, is list-watch? As the name implies, it consists of two parts, list and watch. list means calling the resource's list API to enumerate resources, implemented over short-lived HTTP connections; watch means calling the resource's watch API to listen for resource change events, implemented over a long-lived HTTP connection.
The list and the watch ensure the reliability of the message together, and avoid the scene of inconsistent state caused by the loss of the message. Specifically, the list API may query the current resource and its corresponding state (i.e., the expected state), and the client compares the expected state with the actual state to correct the resource with inconsistent state. The Watch API and the API Server keep a long link, receive the state change event of the resource and do corresponding processing. If only the watch API is called, if the connection is interrupted at a certain time point, the message can be lost, so that the problem of message loss needs to be solved through the list API. From another perspective, we can consider the list API to obtain full data and the watch API to obtain incremental data. Although the effect of synchronizing the resource states can be achieved only by polling the list API, there are problems of high overhead and insufficient real-time performance.
The message is required to be real-time, and under a list-watch mechanism, each time when the state change event is generated by the resource of the API Server, the event is timely pushed to the client, so that the real-time property of the message is ensured.
The sequentiality of the message is also very important, in a concurrent scenario, the client may receive multiple events of the same resource in a short time, and for K8S concerning final consistency, it needs to know which event is the most recent event and guarantee the final state of the resource as the state expressed by the most recent event. K8S has a resourceVersion tag in each resource event, which is an incremental number, so when the client concurrently processes events of the same resource, it can compare the resourceVersion to ensure that the final state is consistent with the expected state of the latest event.
The List-watch also has the characteristic of high performance, and although the effect of final consistency of resources can be achieved only by periodically calling the List API, the overhead is greatly increased by periodically and frequently polling, and the pressure of the API Server is increased. The watch is used as an asynchronous message notification mechanism and multiplexes a long link, so that the performance is guaranteed while the real-time performance is guaranteed.
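From the command line, the same list-watch pattern can be observed with kubectl (generic Kubernetes usage shown for illustration):
kubectl get nodes            # list: the full current state over a short HTTP request
kubectl get nodes --watch    # watch: a long-lived connection streaming node change events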
Therefore, the method for detecting and processing the node fault in the K8s cluster in the embodiment realizes real-time monitoring of the state of the working node Worker through a list-watch mechanism, so that the fault processing system can execute a recovery action in time according to a monitoring result to process the node fault.
Referring to fig. 2, in the method for detecting and processing a node fault in a K8s cluster of this embodiment: once a change of the node state is observed, the corresponding recovery actions are executed according to the pre-configured rules; the recovery actions include blocking (preventing new Pods from being scheduled onto the node), eviction (transferring the Pods on the node to other nodes), removing the node from the cluster, applying for a new node to join the cluster, and the like.
Specifically, in the method for detecting and processing a node failure in a K8s cluster of this embodiment, when a state changes, a corresponding recovery action is executed according to a preconfigured rule, and performing failure processing includes:
when the state of the working node Worker is GPUFallOff fault and/or NTPProblem fault, executing corresponding recovery actions including alarming and blocking;
wherein, the blocking is to prevent a new pod from being scheduled to the node, and the alarming is to send out warning information to inform relevant staff to perform subsequent processing of the fault.
In the method for detecting and processing a node failure in a K8s cluster of this embodiment, when a state changes, a corresponding recovery action is executed according to a preconfigured rule, and performing failure processing includes:
when the state of the Worker node is a KernelDeadlock fault, and/or a ReadonlyFilesystem fault, and/or a CorruptDockerOverlay2 fault, and/or a KubeletError fault, and/or a NICDeleted fault;
executing corresponding recovery actions including alarming, blocking and evicting;
wherein said blocking is to prevent a new pod from being dispatched to the node; the eviction is to transfer the pod on the node to other nodes; the alarm means that warning information is sent to inform relevant workers to carry out follow-up processing of faults.
The recovery actions in the method for detecting and processing node faults in a K8s cluster of this embodiment further include removing the node from the cluster, applying for a new node to join the cluster, and the like.
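The blocking, eviction and removal actions correspond to standard Kubernetes node operations; a minimal command-line sketch is given below (the node name worker-1 is illustrative):
kubectl cordon worker-1                       # block: mark the node unschedulable
kubectl drain worker-1 --ignore-daemonsets    # evict: transfer Pods to other nodes
kubectl delete node worker-1                  # remove the node from the cluster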
In the method for detecting and processing a node fault in a K8s cluster of this embodiment:
when the state of the Worker node is a "GPUFallOff" fault, the plug-in detects the dmesg log and matches the pattern "NVRM: Xid (.+): \d+, GPU has fallen off the bus", the detection reason being "GPUHasFallenOffTheBus";
and/or when the state of the work node Worker is "NTPProblem" fault, the detection mode is as follows:
executing the script:
if systemctl -q is-active "ntp.service"; then
  echo "ntp.service is running"
  exit 0
else
  echo "ntp.service is not running"
  exit 1
fi;
and/or, when the state of the Worker node is a "KernelDeadlock" fault:
the plug-in detects the dmesg log, the detection reason being "AUFSUmountHung"; the plug-in also detects the dmesg log and matches the pattern "task docker:\w+ blocked for more than \w+ seconds\.", the detection reason being "DockerHung";
and/or, when the state of the Worker node is a "ReadonlyFilesystem" fault, the plug-in detects the dmesg log and matches the pattern "Remounting filesystem read-only nvme\w+: Identify Controller failed (.+)", the detection reason being "FilesystemIsReadOnly";
and/or, when the state of the Worker node is a "CorruptDockerOverlay2" fault, the plug-in detects the docker log and matches the pattern "returned error: readlink /var/lib/docker/overlay2.*: invalid argument.*", the detection reason being "CorruptDockerOverlay2";
and/or, when the state of the Worker node is a "KubeletError" fault, the plug-in detects the kubelet log and matches the pattern "Failed to get system container stats for (.+): failed to get cgroup stats for (.+): failed to get container info for (.+): unknown container (.+)", the detection reason being "KubeletError";
and/or, when the state of the Worker node is a "NICDeleted" fault, the plug-in detects /var/log/messages and matches the pattern "ntpd\[\d+\]: Deleting interface (.+)", the detection reason being "NICHasBeenDeleted".
Thus, the preconfigured rules of the present embodiment are shown in the following table:
Node condition | Detection reason | Recovery actions
GPUFallOff | GPUHasFallenOffTheBus | alarm, block (cordon)
NTPProblem | NTPIsDown | alarm, block (cordon)
KernelDeadlock | AUFSUmountHung / DockerHung | alarm, block, evict
ReadonlyFilesystem | FilesystemIsReadOnly | alarm, block, evict
CorruptDockerOverlay2 | CorruptDockerOverlay2 | alarm, block, evict
KubeletError | KubeletError | alarm, block, evict
NICDeleted | NICHasBeenDeleted | alarm, block, evict
this embodiment provides a K8s cluster inner node fault detection and processing apparatus simultaneously, includes:
the node fault detection module is used for carrying out fault detection on each working node Worker of the K8s cluster through a self-defined node fault detection program, reporting a detection result to an API Server of the K8s cluster and storing the detection result;
and the fault processing module monitors the state of the working node Worker according to the detection result, and executes corresponding recovery action according to a preset rule when the state changes so as to process the fault.
In the device for detecting and processing node faults in a K8s cluster of this embodiment, in order to achieve more kinds of node fault detection in the K8s cluster, the node fault detection module detects through customized node fault detection programs; compared with relying on K8s's kubelet on the node for node health detection, customized program design allows more targeted and diversified fault detection, thereby meeting higher requirements of node fault detection.
The device for detecting and processing node faults in a K8s cluster of this embodiment reports the detection result to the API Server of the K8s cluster and saves it, which makes it convenient to query and monitor the detected fault data and realizes automatic monitoring of node faults, so that feedback can be given in time.
In the device for detecting and processing a node fault in a K8s cluster according to this embodiment, a fault processing module monitors the state of a Worker node Worker according to a detection result, and executes a corresponding recovery action according to a preset rule to perform fault processing when the state changes. The fault processing module is used for carrying out automatic processing on the fault according to the pre-configuration when the node fault is monitored by pre-configuring the node fault and the corresponding recovery action, so that the cost of manual processing is saved, and the efficiency of node fault processing is improved.
Therefore, the device for detecting and processing the node fault in the K8s cluster in the embodiment solves the problem of detecting and automatically processing the node fault in the K8s cluster, and particularly realizes a method for determining whether a node is abnormal according to a user-defined condition and automatically processing the node. The node fault detection and processing device in the K8s cluster of this embodiment can run a self-defined detection program in the form of a plug-in to complete fault detection of the node, report the detection result to the API Server of the K8s cluster, and then perform persistent storage, thereby triggering recovery actions, and forming a whole set of closed-loop automation system; the node fault detection and processing capability of K8s is enhanced, the system runs in a cloud native mode completely, does not depend on other third-party components, has high availability, greatly improves the node fault processing efficiency, does not need manual intervention, and has high practical value.
As an optional implementation manner of this embodiment, the fault detection module of the node fault detection and processing apparatus in the K8s cluster of this embodiment includes a fault detection program detector respectively deployed on each Worker node Worker, and is used for detecting a fault of a node; the fault detection program detector supports plug-in operation, and performs fault detection on each working node Worker of the K8s cluster by executing a self-defined node fault detection program.
In this embodiment, a fault detection program detector is deployed on each work node Worker of the K8s cluster and used for performing more targeted and diversified fault detection on the work node Worker, the fault detection program detector supports plug-in operation, and developers can develop a self-defined node fault detection program according to the requirement of node fault detection and operate on the fault detection program detector in a plug-in mode to detect node faults.
Optionally, the detector is deployed on each Worker node in the K8s cluster in the form of a DaemonSet.
A DaemonSet is a deployment form of Pods in K8s, similar to a daemon process: it runs one Pod on each designated node. A DaemonSet defines Pods that provide node-local facilities. These Pods may be important for the operation and maintenance of the cluster, for example as a helper for network links or as part of a network plug-in. Each time a new node is added to the cluster, if the node matches the DaemonSet specification, the control plane schedules a Pod of the DaemonSet to run on the new node. Thus, this embodiment deploys the fault detection program as a containerized DaemonSet; the fault detection program provides the necessary fault detection for the Worker nodes, and when a node is newly added, the fault detection program is automatically deployed on the new node, which greatly improves management and operation-and-maintenance efficiency compared with the traditional approach of deploying a monitoring program (such as Zabbix or Open-falcon, commonly used in the industry) on each node.
Further, the customized node fault detection programs in this embodiment include: an ntp detection program for detecting node ntp service exceptions, and/or a graphics card detection program for detecting node graphics card (GPU) faults.
NTP (Network Time Protocol) is a protocol used to synchronize computer time; it can synchronize a computer to its server or clock source (e.g. a quartz clock, GPS, etc.), provide highly accurate time correction (a deviation of less than 1 millisecond from the standard on a LAN, and tens of milliseconds on a WAN), and prevent malicious protocol attacks by means of encrypted confirmation. An ntp service exception may cause the server clock to go out of sync. The ntp detection program of this embodiment is implemented by the following custom script:
Execute the script:
if systemctl -q is-active "ntp.service"; then
  echo "ntp.service is running"
  exit 0
else
  echo "ntp.service is not running"
  exit 1
fi
A node graphics card fault may cause GPU-dependent programs to fail to run. The graphics card detection program of this embodiment detects the dmesg log in the "plug-in" detection mode and matches the pattern "NVRM: Xid (.+): \d+, GPU has fallen off the bus" to detect the fault.
The customized node fault detection programs of this embodiment include a Kernel detection program for detecting a node "KernelDeadlock" fault; a "KernelDeadlock" fault means the node kernel is locked up: the screen shows no useful print information, the network is interrupted, and the keyboard and mouse do not respond. The Kernel detection program of this embodiment detects the dmesg log in the "plug-in" detection mode, the detection reason being "AUFSUmountHung" (a deadlock); it also detects the dmesg log in the "plug-in" detection mode and matches the pattern "task docker:\w+ blocked for more than \w+ seconds\.", the detection reason being "DockerHung".
The customized node fault detection programs of this embodiment include a Readonly detection program for detecting a node "ReadonlyFilesystem" fault; a "ReadonlyFilesystem" fault means that files cannot be written or newly created, i.e. there is only read permission and no write permission. The Readonly detection program of this embodiment detects the dmesg log in the "plug-in" detection mode and matches the pattern "Remounting filesystem read-only nvme\w+: Identify Controller failed (.+)", the detection reason being "FilesystemIsReadOnly".
The customized node fault detection programs of this embodiment include a CorruptDocker detection program for detecting a node "CorruptDockerOverlay2" fault; a "CorruptDockerOverlay2" fault means that the space currently occupied by Docker is too large. The CorruptDocker detection program of this embodiment detects the docker log in the "plug-in" detection mode and matches the pattern "returned error: readlink /var/lib/docker/overlay2.*: invalid argument.*", the detection reason being "CorruptDockerOverlay2".
The customized node fault detection programs of this embodiment include a KubeletError detection program for detecting a node "KubeletError" fault, which is a kubelet running error on a Worker node. The KubeletError detection program of this embodiment detects the kubelet log in the "plug-in" detection mode and matches the pattern "Failed to get system container stats for (.+): failed to get cgroup stats for (.+): failed to get container info for (.+): unknown container (.+)", the detection reason being "KubeletError".
The customized node fault detection programs of this embodiment include an NICD detection program for detecting a node "NICDeleted" fault; a "NICDeleted" fault means the network card has been deleted and networking is impossible. The NICD detection program of this embodiment detects /var/log/messages in the "plug-in" detection mode and matches the pattern "ntpd\[\d+\]: Deleting interface (.+)", the detection reason being "NICHasBeenDeleted".
In the device for detecting and processing a node failure in a K8s cluster of this embodiment, reporting a detection result to an API Server of a K8s cluster, and storing includes:
the detection result of the detector is reported to the API Server in the form of a NodeCondition through remote calls, and the API Server stores the NodeCondition into Etcd;
NodeCondition is a field carried by the Node structure of K8s and used for storing the default node states of K8s; this NodeCondition field is reused to store the customized node states detected by the customized node fault detection programs.
In this embodiment the NodeCondition field is reused: the detection result of the detector is reported to the API Server through remote calls in a manner similar to the existing K8s mechanism, and the API Server stores the NodeCondition into Etcd in the existing way.
etcd is a distributed, reliable key-value storage system for storing critical data in a distributed system. The cluster data can be accessed by the client provided by the etcd, and the etcd can also be directly accessed by http (similar to curl command). Inside the etcd, the data representation is also simple, and the data storage of the etcd can be directly understood as an ordered map which stores key-value data. Meanwhile, the etcd also supports a watch mechanism for facilitating the client to subscribe the change of the data, and the incremental update of the data in the etcd is carried out in real time through the watch, so that the business logic such as data synchronization and the like in the etcd is realized.
The interfaces provided by etcd are divided into the following 5 groups:
the first group is Put and Delete. The put and delete operations are very simple, data can be written into the cluster by only providing one key and one value, and only the key needs to be specified when the data is deleted.
The second group is query operations. etcd supports two types of queries: the first is a query of a single key, and the second is a range of a specified key.
The third group is data subscription. etcd provides a Watch mechanism; a Watch can subscribe in real time to incremental data updates in etcd, and it can be placed either on a single key or on a key prefix; in practical application scenarios the second case is usually used.
The fourth group is transactional operations. etcd provides simple transaction support: a user can perform one set of operations when a group of conditions is met and another set of operations when the conditions are not met, similar to an if-else statement in code; etcd ensures the atomicity of the whole operation.
The fifth group is the Leases interface. The Leases interface is a common design model in distributed systems.
Further, the monitoring, by the fault processing module in the device for detecting and processing a fault of a node in a K8s cluster according to a detection result, a state of a Worker node Worker includes:
the fault processing module monitors the change event of the detection result in the API Server through list-watch to realize the monitoring of the state of the Worker node;
where list means calling the resource's list API to enumerate resources, implemented over short-lived HTTP connections, and watch means calling the resource's watch API to listen for resource change events.
Etcd stores the data information of the cluster, and the API Server serves as the unified entry: any operation on the data must go through the API Server. Clients (kubelet/scheduler/controller-manager) listen, through list-watch, for create, update and delete events of resources (Pod/RS/RC, etc.) in the API Server, and call the corresponding event handling functions according to the event type.
What, then, is list-watch? As the name implies, it consists of two parts, list and watch. list means calling the resource's list API to enumerate resources, implemented over short-lived HTTP connections; watch means calling the resource's watch API to listen for resource change events, implemented over a long-lived HTTP connection.
The list and the watch ensure the reliability of the message together, and avoid the scene of inconsistent state caused by the loss of the message. Specifically, the list API may query the current resource and its corresponding state (i.e., the expected state), and the client compares the expected state with the actual state to correct the resource with inconsistent state. The Watch API and the API Server keep a long link, receive the state change event of the resource and do corresponding processing. If only the watch API is called, if the connection is interrupted at a certain time point, the message can be lost, so that the problem of message loss needs to be solved through the list API. From another perspective, we can consider the list API to obtain full data and the watch API to obtain incremental data. Although the effect of synchronizing the resource states can be achieved only by polling the list API, there are problems of high overhead and insufficient real-time performance.
The message is required to be real-time, and under a list-watch mechanism, each time when the state change event is generated by the resource of the API Server, the event is timely pushed to the client, so that the real-time property of the message is ensured.
The sequentiality of the message is also very important, in a concurrent scenario, the client may receive multiple events of the same resource in a short time, and for K8S concerning final consistency, it needs to know which event is the most recent event and guarantee the final state of the resource as the state expressed by the most recent event. K8S has a resourceVersion tag in each resource event, which is an incremental number, so when the client concurrently processes events of the same resource, it can compare the resourceVersion to ensure that the final state is consistent with the expected state of the latest event.
The List-watch also has the characteristic of high performance, and although the effect of final consistency of resources can be achieved only by periodically calling the List API, the overhead is greatly increased by periodically and frequently polling, and the pressure of the API Server is increased. The watch is used as an asynchronous message notification mechanism and multiplexes a long link, so that the performance is guaranteed while the real-time performance is guaranteed.
Therefore, the device for detecting and processing node faults in a K8s cluster of this embodiment monitors the state of the working node Worker in real time through the list-watch mechanism, which helps the fault processing module execute recovery actions in time according to the monitoring result and handle the node fault.
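For illustration only, the following minimal Go sketch (not part of the claimed embodiment; the handler logic is an assumption) shows how a fault processing module could monitor Node condition changes through the Kubernetes client-go shared informer, which is built on the list-watch mechanism described above: an initial list obtains the full data and the subsequent watch receives incremental events.

package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// The fault processing module runs inside the K8s cluster, so it can use in-cluster configuration.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The shared informer first lists all Node objects and then watches for changes,
	// which is exactly the list-watch behaviour described above.
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	nodeInformer := factory.Core().V1().Nodes().Informer()
	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			node := newObj.(*corev1.Node)
			for _, cond := range node.Status.Conditions {
				// Custom conditions reported by the node fault detector reuse the NodeCondition field.
				if cond.Status == corev1.ConditionTrue && cond.Type != corev1.NodeReady {
					fmt.Printf("node %s reports condition %s: %s\n", node.Name, cond.Type, cond.Message)
				}
			}
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // block; the handler fires on every change event pushed by the API Server
}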
In the device for detecting and processing node faults in a K8s cluster of this embodiment, when the fault processing module detects that the node state has changed, it executes a corresponding recovery action according to preconfigured rules, and the fault processing includes:
when the state of the working node Worker is a 'GPUFallOff' fault and/or an 'NTPProblem' fault, executing corresponding recovery actions including alarming and blocking;
wherein blocking means preventing new pods from being scheduled onto the node (see the illustrative sketch below), and alarming means sending warning information to notify the relevant staff to carry out subsequent handling of the fault.
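A hedged illustration of the blocking action: the Go sketch below (the function name cordonNode and its placement are assumptions, not taken from the patent) marks the node unschedulable so that the scheduler places no new pod on it, equivalent to kubectl cordon.

package recovery

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// cordonNode implements the "blocking" recovery action: it patches
// spec.unschedulable to true so that no new pod is scheduled onto the node.
func cordonNode(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	patch := []byte(`{"spec":{"unschedulable":true}}`)
	_, err := client.CoreV1().Nodes().Patch(ctx, nodeName,
		types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}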
In the method for detecting and processing a node fault in a K8s cluster of this embodiment, when the state changes, a corresponding recovery action is executed according to preconfigured rules, and the fault processing includes:
when the state of the working node Worker is a 'KernelDeadlock' fault, and/or a 'ReadonlyFilesystem' fault, and/or a 'CorruptDockerOverlay2' fault, and/or a 'KubeletError' fault, and/or an 'NICDeleted' fault;
executing corresponding recovery actions including alarming, blocking and evicting;
wherein the blocking is to prevent new pods from being scheduled onto the node; the eviction is to transfer the pods on the node to other nodes (see the illustrative sketch below); the alarm means that warning information is sent to notify the relevant staff to carry out follow-up handling of the fault.
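To picture the eviction action, the following Go sketch (an assumption for illustration: the function name evictPodsOnNode, the use of the policy/v1 Eviction API, and the absence of DaemonSet or mirror-pod filtering are not specified by the patent) evicts every pod on the faulty node so that its controller recreates it on a healthy node.

package recovery

import (
	"context"
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// evictPodsOnNode implements the "eviction" recovery action: each pod running
// on the faulty node is evicted through the Eviction API, so workload
// controllers reschedule the pods onto other, healthy nodes.
func evictPodsOnNode(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	pods, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return fmt.Errorf("list pods on node %s: %w", nodeName, err)
	}
	for _, pod := range pods.Items {
		eviction := &policyv1.Eviction{
			ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
		}
		if err := client.PolicyV1().Evictions(pod.Namespace).Evict(ctx, eviction); err != nil {
			return fmt.Errorf("evict pod %s/%s: %w", pod.Namespace, pod.Name, err)
		}
	}
	return nil
}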
In the device for detecting and processing a node fault in a K8s cluster of this embodiment:
the state of the working node Worker is a 'GPUFallOff' fault; the detection mode is that a plug-in inspects the dmesg log and matches entries containing 'NVRM', and the detected reason is 'GPU has fallen off the bus';
and/or, when the state of the working node Worker is an 'NTPProblem' fault, the detection mode is to execute the following script:
if systemctl -q is-active "ntp.service"; then
    echo "ntp.service is running"
    exit 0
else
    echo "ntp.service is not running"
    exit 1
fi
and/or the state of the working node Worker is a 'KernelDeadlock' fault; the detection mode is that a plug-in inspects the dmesg log and, on a match, reports the detected reason 'AUFSUmountHung'; the plug-in also inspects the dmesg log and matches 'task docker:\w+ blocked for more than \w+ seconds' with the detected reason 'DockerHung';
and/or the state of the working node Worker is a 'ReadonlyFilesystem' fault; the detection mode is that a plug-in inspects the dmesg log and matches 'Remounting filesystem read-only' and/or 'nvme\w+: Identify Controller failed (.+)', with the detected reason 'FilesystemIsReadOnly';
and/or the state of the working node Worker is a 'CorruptDockerOverlay2' fault; the detection mode is that a plug-in inspects the docker log and matches 'return error: readlink /var/lib/docker/overlay2.*: invalid argument', with the detected reason 'CorruptDockerOverlay2';
and/or the state of the working node Worker is a 'KubeletError' fault; the detection mode is that a plug-in inspects the kubelet log and matches 'Failed to get system container stats for (.+): failed to get cgroup stats for (.+): failed to get container info for (.+): unknown container (.+)';
and/or the state of the working node Worker is an 'NICDeleted' fault; the detection mode is that a plug-in inspects /var/log/messages and matches 'ntpd\[\d+\]: Deleting interface (.+)', and the detected reason is 'NIC has been deleted'; an illustrative sketch of this pattern-matching style of detection follows.
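Purely for illustration, the detection rules above can be pictured as a table of pattern/reason pairs scanned against a log stream. The Go sketch below is an assumption about how such a plug-in could be structured; the rule set shown only echoes a few of the patterns from this embodiment and is not taken from any existing detector implementation.

package detector

import (
	"bufio"
	"io"
	"regexp"
)

// rule maps a log pattern to the fault reason that would be reported
// back to the API Server as a custom NodeCondition.
type rule struct {
	pattern *regexp.Regexp
	reason  string
}

// Example rules echoing some of the dmesg patterns described above.
var dmesgRules = []rule{
	{regexp.MustCompile(`task docker:\w+ blocked for more than \w+ seconds`), "DockerHung"},
	{regexp.MustCompile(`Remounting filesystem read-only`), "FilesystemIsReadOnly"},
	{regexp.MustCompile(`NVRM.*GPU has fallen off the bus`), "GPUFallOff"},
}

// scanLog reads a log stream line by line and returns the reasons of every
// rule that matched, in the order the matches were found.
func scanLog(r io.Reader, rules []rule) []string {
	var reasons []string
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		line := scanner.Text()
		for _, ru := range rules {
			if ru.pattern.MatchString(line) {
				reasons = append(reasons, ru.reason)
			}
		}
	}
	return reasons
}

In use, such a plug-in would pipe the output of dmesg (or the docker/kubelet log) into scanLog and report each returned reason as a NodeCondition.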
Fig. 3 is a block diagram of the device for detecting and processing a node fault in a K8s cluster according to this embodiment.
The embodiment also provides a storage medium, which stores a computer executable program, and when the computer executable program is executed, the method for detecting and processing the node fault in the K8s cluster is implemented.
The storage medium of this embodiment may comprise a propagated data signal with readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic signals, optical signals, or any suitable combination thereof. A readable signal medium may also be any readable medium, other than a readable storage medium, that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
The embodiment also provides an electronic device, which includes a processor and a memory, where the memory is used to store a computer executable program, and when the computer program is executed by the processor, the processor executes the method for detecting and processing the node failure in the K8s cluster.
The electronic device is in the form of a general purpose computing device. The processor can be one or more and can work together. The invention also does not exclude that distributed processing is performed, i.e. the processors may be distributed over different physical devices. The electronic device of the present invention is not limited to a single entity, and may be a sum of a plurality of entity devices.
The memory stores a computer executable program, typically machine readable code. The computer readable program may be executed by the processor to enable an electronic device to perform the method of the invention, or at least some of the steps of the method.
The memory may include volatile memory, such as Random Access Memory (RAM) and/or cache memory, and may also be non-volatile memory, such as read-only memory (ROM).
It should be understood that elements or components not shown in the above examples may also be included in the electronic device of the present invention. For example, some electronic devices further include a display unit such as a display screen, and some electronic devices further include a human-computer interaction element such as a button, a keyboard, and the like. Electronic devices are considered to be covered by the present invention as long as the electronic devices are capable of executing a computer-readable program in a memory to implement the method of the present invention or at least a part of the steps of the method. From the above description of the embodiments, those skilled in the art will readily appreciate that the present invention can be implemented by hardware capable of executing a specific computer program, such as the system of the present invention, and electronic processing units, servers, clients, mobile phones, control units, processors, etc. included in the system. The invention may also be implemented by computer software for performing the method of the invention, e.g. control software executed by a microprocessor, an electronic control unit, a client, a server, etc. It should be noted that the computer software for executing the method of the present invention is not limited to be executed by one or a specific hardware entity, and can also be realized in a distributed manner by non-specific hardware. For computer software, the software product may be stored in a computer readable storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or may be distributed over a network, as long as it enables the electronic device to perform the method according to the present invention.
The above embodiments are only intended to illustrate the invention, not to limit the technical solutions described herein. Although the present invention has been described in detail in this specification with reference to the above embodiments, the invention is not limited to them; any modification or equivalent replacement of the present invention, and all such modifications and variations, are intended to be included within the scope of this disclosure and the appended claims.

Claims (10)

1. A method for detecting and processing faults of nodes in a K8s cluster is characterized by comprising the following steps:
carrying out fault detection on each working node Worker of the K8s cluster through a self-defined node fault detection program;
reporting the detection result to an API Server of the K8s cluster, and storing the detection result;
and monitoring the state of the Worker node Worker according to the detection result, and executing corresponding recovery action according to a preset rule when the state changes so as to process the fault.
2. The method for detecting and processing the fault of the node in the K8s cluster according to claim 1, wherein the step of performing the fault detection on each Worker node Worker in the K8s cluster through a customized node fault detection program comprises:
a fault detection program detector is respectively deployed on each working node Worker and used for detecting the fault of the node;
the fault detection program detector supports plug-in operation, and performs fault detection on each working node Worker of the K8s cluster by executing a self-defined node fault detection program;
optionally, the fault detection program detector is deployed on each working node Worker in the form of a DaemonSet of the K8s cluster.
3. The method for detecting and processing the node fault in the K8s cluster according to claim 1 or 2, wherein the customized node fault detection program includes: an ntp detection program for detecting abnormalities of the node's ntp service, and/or a graphics card detection program for detecting graphics card (GPU) faults of the node.
4. The method according to claim 1, wherein the reporting the detection result to an API Server of the K8s cluster and the saving comprises:
the detection result is reported to an API Server in a NodeCondition form through a remote calling mode, and the API Server stores the NodeCondition to Etcd;
the NodeCondition is a field carried by a node structure body of the K8s and is used for storing a default node state of the K8s, and the NodeCondition field is multiplexed and used for storing a self-defined node state detected by a self-defined node fault detection program.
5. The method for detecting and processing the fault of the node in the K8s cluster according to claim 1, wherein the monitoring the state of a Worker node Worker according to the detection result includes:
monitoring the state of a Worker node Worker by monitoring the change event of the detection result in the API Server through the list-watch;
the list part calls the list API of a resource to enumerate resources and is implemented over short-lived HTTP connections; the watch part calls the watch API of a resource to listen for resource change events.
6. The method for detecting and processing the failure of the node in the K8s cluster according to claim 1, wherein the performing the failure processing according to the pre-configured rule when the state changes includes:
when the status of the Worker node Worker is a 'GPUFallOff' fault and/or an 'NTPProblem' fault, performing a corresponding recovery action includes blocking, wherein the blocking is to prevent a new pod from being scheduled on the node;
optionally, the performing the corresponding recovery action further comprises alerting.
7. The method for detecting and processing the failure of the node in the K8s cluster according to claim 1, wherein the performing the failure processing according to the pre-configured rule when the state changes includes:
when the state of the working node Worker is a 'KernelDeadlock' fault, and/or a 'ReadonlyFilesystem' fault, and/or a 'CorruptDockerOverlay2' fault, and/or a 'KubeletError' fault, and/or an 'NICDeleted' fault;
performing corresponding recovery actions including blocking and eviction;
wherein said blocking is to prevent a new pod from being dispatched to the node; the eviction is to transfer the pod on the node to other nodes;
optionally, the performing the corresponding recovery action further comprises alerting.
8. The method for detecting and processing the failure of the nodes in the K8s cluster according to claim 6 or 7,
the state of the working node Worker is a 'GPUFallOff' fault; the detection mode is that a plug-in inspects the dmesg log and matches entries containing 'NVRM', and the detected reason is 'GPU has fallen off the bus';
and/or, when the state of the Worker node Worker is 'NTPProblem' fault, detecting that the reason is 'NTPIsDown' by executing a corresponding script;
and/or the state of the working node Worker is a 'KernelDeadlock' fault; the detection mode is that a plug-in inspects the dmesg log and, on a match, reports the detected reason 'AUFSUmountHung'; the plug-in also inspects the dmesg log and matches 'task docker:\w+ blocked for more than \w+ seconds' with the detected reason 'DockerHung';
and/or the state of the working node Worker is a 'ReadonlyFilesystem' fault; the detection mode is that a plug-in inspects the dmesg log and matches 'Remounting filesystem read-only' and/or 'nvme\w+: Identify Controller failed (.+)', with the detected reason 'FilesystemIsReadOnly';
and/or the state of the working node Worker is a 'CorruptDockerOverlay2' fault; the detection mode is that a plug-in inspects the docker log and matches 'return error: readlink /var/lib/docker/overlay2.*: invalid argument', with the detected reason 'CorruptDockerOverlay2';
and/or the state of the working node Worker is a 'KubeletError' fault; the detection mode is that a plug-in inspects the kubelet log and matches 'Failed to get system container stats for (.+): failed to get cgroup stats for (.+): failed to get container info for (.+): unknown container (.+)';
and/or the state of the working node Worker is an 'NICDeleted' fault; the detection mode is that a plug-in inspects /var/log/messages and matches 'ntpd\[\d+\]: Deleting interface (.+)', and the detected reason is 'NIC has been deleted'.
9. A node fault detection and processing device in a K8s cluster is characterized by comprising:
the node fault detection module is used for carrying out fault detection on each working node Worker of the K8s cluster through a self-defined node fault detection program, reporting a detection result to an API Server of the K8s cluster and storing the detection result;
and the fault processing module monitors the state of the working node Worker according to the detection result, and executes corresponding recovery action according to a preset rule when the state changes so as to process the fault.
10. A storage medium storing a computer-executable program, which when executed performs a method of detecting and handling node failure within a K8s cluster as claimed in any one of claims 1 to 8.
CN202110594222.0A 2021-05-28 2021-05-28 Method, device and storage medium for detecting and processing node faults in K8s cluster Pending CN113422692A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110594222.0A CN113422692A (en) 2021-05-28 2021-05-28 Method, device and storage medium for detecting and processing node faults in K8s cluster

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110594222.0A CN113422692A (en) 2021-05-28 2021-05-28 Method, device and storage medium for detecting and processing node faults in K8s cluster

Publications (1)

Publication Number Publication Date
CN113422692A true CN113422692A (en) 2021-09-21

Family

ID=77713183

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110594222.0A Pending CN113422692A (en) 2021-05-28 2021-05-28 Method, device and storage medium for detecting and processing node faults in K8s cluster

Country Status (1)

Country Link
CN (1) CN113422692A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109684036A (en) * 2018-12-17 2019-04-26 武汉烽火信息集成技术有限公司 A kind of container cluster management method, storage medium, electronic equipment and system
CN110798375A (en) * 2019-09-29 2020-02-14 烽火通信科技股份有限公司 Monitoring method, system and terminal equipment for enhancing high availability of container cluster
US20200257593A1 (en) * 2017-10-31 2020-08-13 Huawei Technologies Co., Ltd. Storage cluster configuration change method, storage cluster, and computer system
CN111752759A (en) * 2020-06-30 2020-10-09 重庆紫光华山智安科技有限公司 Kafka cluster fault recovery method, device, equipment and medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200257593A1 (en) * 2017-10-31 2020-08-13 Huawei Technologies Co., Ltd. Storage cluster configuration change method, storage cluster, and computer system
CN109684036A (en) * 2018-12-17 2019-04-26 武汉烽火信息集成技术有限公司 A kind of container cluster management method, storage medium, electronic equipment and system
CN110798375A (en) * 2019-09-29 2020-02-14 烽火通信科技股份有限公司 Monitoring method, system and terminal equipment for enhancing high availability of container cluster
CN111752759A (en) * 2020-06-30 2020-10-09 重庆紫光华山智安科技有限公司 Kafka cluster fault recovery method, device, equipment and medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
孟凡杰 et al.: "《Kubernetes生产化实践之路》", 31 December 2020 *
龚正 et al.: "《Kubernetes权威指南:从Docker到Kubernetes实践全接触第4版》", 30 June 2019 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114827157A (en) * 2022-04-12 2022-07-29 北京云思智学科技有限公司 Cluster task processing method, device and system, electronic equipment and readable medium
CN115189934A (en) * 2022-07-06 2022-10-14 上海交通大学 Automatic configuration safety detection method and system for Kubernets
CN115396291A (en) * 2022-08-23 2022-11-25 度小满科技(北京)有限公司 Redis cluster fault self-healing method based on kubernets trustees
CN116016123A (en) * 2022-12-09 2023-04-25 京东科技信息技术有限公司 Fault processing method, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN113422692A (en) Method, device and storage medium for detecting and processing node faults in K8s cluster
CN106776212B (en) Supervision system and method for container cluster deployment of multi-process application
CN105653425B (en) Monitoring system based on complex event processing engine
CN112667362B (en) Method and system for deploying Kubernetes virtual machine cluster on Kubernetes
CN111459763B (en) Cross-kubernetes cluster monitoring system and method
CN107016480B (en) Task scheduling method, device and system
US20170063965A1 (en) Data transfer in a collaborative file sharing system
CN109634716B (en) OpenStack virtual machine high-availability management end device for preventing brain cracking and management method
CN104360878B (en) A kind of method and device of application software deployment
EP3495946A1 (en) Server updates
CN109656742B (en) Node exception handling method and device and storage medium
CN112416581B (en) Distributed calling system for timed tasks
US10892961B2 (en) Application- and infrastructure-aware orchestration for cloud monitoring applications
US10498817B1 (en) Performance tuning in distributed computing systems
CN111225064A (en) Ceph cluster deployment method, system, device and computer-readable storage medium
CN109144534A (en) Service module dynamic updating method, device and electronic equipment
US11397632B2 (en) Safely recovering workloads within a finite timeframe from unhealthy cluster nodes
CN114490272A (en) Data processing method and device, electronic equipment and computer readable storage medium
CN115640110A (en) Distributed cloud computing system scheduling method and device
CN117130730A (en) Metadata management method for federal Kubernetes cluster
CN110196749A (en) The restoration methods and device of virtual machine, storage medium and electronic device
US20210157690A1 (en) System and method for on-demand warm standby disaster recovery
CN112463561B (en) Fault positioning method, device, equipment and storage medium
CN114691445A (en) Cluster fault processing method and device, electronic equipment and readable storage medium
US20080216057A1 (en) Recording medium storing monitoring program, monitoring method, and monitoring system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210921