CN114942859A - Method, device, equipment, medium and program product for processing node failure - Google Patents

Method, device, equipment, medium and program product for processing node failure Download PDF

Info

Publication number
CN114942859A
CN114942859A (Application CN202210690105.9A)
Authority
CN
China
Prior art keywords
node
current node
fault
current
log information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210690105.9A
Other languages
Chinese (zh)
Inventor
李春祝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan United Imaging Healthcare Co Ltd
Original Assignee
Wuhan United Imaging Healthcare Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan United Imaging Healthcare Co Ltd filed Critical Wuhan United Imaging Healthcare Co Ltd
Priority to CN202210690105.9A priority Critical patent/CN114942859A/en
Publication of CN114942859A publication Critical patent/CN114942859A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079Root cause analysis, i.e. error or fault diagnosis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766Error or fault reporting or storing

Abstract

The present application relates to a method, apparatus, device, medium, and program product for processing a node failure. The processing method comprises: obtaining at least one type of log information on a current node in a computing cluster; determining, according to the log information, whether the current node is a failed node; and releasing all container integration units on the current node when the current node is a failed node. The node failure processing method provided by the application can detect various faults existing in the node itself and can ensure normal operation of the Kubernetes system.

Description

Method, device, equipment, medium and program product for processing node failure
Technical Field
The present application relates to the field of cluster technologies, and in particular, to a method, an apparatus, a device, a medium, and a program product for processing a node failure.
Background
The container cluster management (Kubernetes) system is a standard open-source container orchestration and scheduling platform in the cloud-native domain. The Kubernetes system has the ability to manage and maintain containers in a cluster, providing a near-zero-downtime guarantee for services. The Kubernetes system consists of multiple nodes, and each node performs various computing tasks by running Pods.
The current Kubernetes system can monitor the Pods on each node during operation and repair a Pod when it fails. Specifically, when a Pod, or a container packaged in a Pod, fails, the Kubernetes system manages and controls the running state of the application program by using the built-in liveness probe and readiness probe, so that the failed Pod or the container packaged in it can self-heal.
However, the Kubernetes system's ability to repair Pods is established on the premise that the node itself is normal; various faults of the node itself (hardware problems, kernel deadlock, file system damage, runtime hangs, and the like) cannot be detected, which causes the Kubernetes system to run in a disordered state.
Disclosure of Invention
In view of the above, it is necessary to provide a method, an apparatus, a device, a medium, and a program product for processing node failure.
In a first aspect, an embodiment of the present application provides a method for processing a node failure, where the method includes:
acquiring at least one type of log information on a current node in a computing cluster; the log information comprises the working state information of the current node;
and determining whether the current node is a fault node or not according to the log information, and releasing all container integration units on the current node under the condition that the current node is the fault node.
In one embodiment, the processing method further includes:
and migrating all the released container integrated units to other nodes according to a preset scheduling strategy.
In one embodiment, the processing method further includes:
performing stain label setting on the current node; the dirty label is used to instruct the current node to stop receiving other container integrated units.
In one embodiment, the processing method further includes:
and if a node release command sent by a management node in the computing cluster is received, cutting off the connection relation between the current node and each node in the computing cluster.
In one embodiment, the processing method further includes:
and repairing the fault of the current node, and reestablishing the connection relation between the current node and each node in the computing cluster if a node joining command sent by a management node in the computing cluster is received.
In one embodiment, the method for repairing the current node failure comprises the following steps: reconfiguring the kernel parameter and the daemon process parameter of the system where the current node is located, restarting the daemon process, restarting the current node, and reinstalling the system of the current node.
In a second aspect, an embodiment of the present application provides a node failure processing apparatus, including:
the acquisition module is used for acquiring at least one type of log information on the current node in the computing cluster; the log information comprises the working state information of the current node;
and the determining module is used for determining whether the current node is a fault node or not according to the log information and releasing all container integration units on the current node under the condition that the current node is the fault node.
In a third aspect, an embodiment of the present application provides a computer device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the processing method provided in the first aspect when executing the computer program.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the processing method as provided in the first aspect above.
In a fifth aspect, an embodiment of the present application further provides a computer program product, which includes a computer program, and when the computer program is executed by a processor, the computer program implements the steps of the processing method provided in the first aspect.
The embodiment of the application provides a method, a device, equipment, a medium and a program product for processing node faults. The processing method includes: obtaining at least one type of log information on a current node in a computing cluster; determining, according to the log information, whether the current node is a failed node; and releasing all container integration units on the current node when the current node is a failed node. In this embodiment, various faults existing in each node of the computing cluster can be detected from the log information of each node, and the failed node can be determined, so that a worker can repair the failed node in time. In addition, in this embodiment, all container integration units running on the failed node are released, which avoids the situation where containers on the failed node cannot run and cause the Kubernetes system to operate in a disordered manner.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments or the conventional technologies of the present application, the drawings used in the description of the embodiments or the conventional technologies will be briefly introduced below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a schematic flowchart illustrating steps of a method for processing a node failure according to an embodiment;
FIG. 2 is a schematic diagram illustrating an exemplary operating mode of a node problem detector;
FIG. 3 is a flowchart illustrating steps of a method for handling a node failure according to another embodiment;
FIG. 4 is a schematic diagram of a Kubernetes system according to an embodiment;
FIG. 5 is a schematic flow chart for monitoring a Kubernetes system according to an embodiment;
fig. 6 is a schematic structural diagram of a device for processing a node failure according to an embodiment;
fig. 7 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, embodiments of the present application are described in detail below with reference to the accompanying drawings. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. The application is capable of being embodied in many forms other than those described herein, and modifications may be made by those skilled in the art without departing from the spirit and scope of the application; it is therefore not intended to be limited to the specific embodiments disclosed below.
First, before specifically describing the technical solution of the embodiments of the present disclosure, the technical background or technical evolution on which the embodiments are based is described. The container cluster management (Kubernetes) system is a standard open-source container orchestration and scheduling platform in the cloud-native domain. The Kubernetes system has the ability to manage and maintain containers in a cluster, providing a near-zero-downtime guarantee for services. The Kubernetes system comprises a management node and a plurality of working nodes: the management node schedules the Pods to be run onto the working nodes, and each working node executes various computing tasks by running Pods. After a Pod resource is created, the management node in the Kubernetes system selects a working node for it, and the Pod is then scheduled to that working node, where the containers in the Pod are run.
Currently, during the running of the Kubernetes system, the Pods and containers on each working node can be monitored, and when a Pod or container fails, it can be repaired. The repair capability of the Kubernetes system itself mainly includes two aspects:
in the first aspect, a restart strategy, also called restartPolicy, relies on the restart strategy of Pod. It is a standard field (Pod. Spec. restartpolicy) of the Spec part of Pod, with default values of Always, i.e.: whenever the container stops, it must be automatically restarted, and the recovery policy for Pod can be changed by setting restartPolicy. In addition to Always, restartPolicy has two values, OnFailure (a container that stops abnormally will restart automatically, and a normal stop will not restart) and Never (a container that stops in any way, will not restart automatically).
In the second aspect, two health checks are used to implement a self-healing policy. After a Pod is dispatched to a working node, the kubelet component on that node runs the containers in the Pod and keeps them running for the life of the Pod. If the main process of a container crashes, the kubelet component restarts the container. If an application in a container throws errors that cause it to restart continuously, the Kubernetes system can repair it by using the correct diagnostics and following the Pod's restart policy.
Wherein the two health checks comprise:
liveness test (Liveness): kubelet uses the return status of the active probe (livenessProbe) as a basis for restarting the container. One active probe is used to detect problems with the container while the application is running. After the container enters the state, the kubelet component of the node where the Pod is located can restart the container through the Pod policy.
Readiness check (readinessProbe): this type of probe is used to detect whether a container is ready to accept traffic. Such probes can be used to manage which Pods are used as backends of a Service: if a Pod is not ready, it is removed from the Service's backend list.
The two health checks described above use the same types of probe handlers, but the corrective measures taken for a Pod that fails the check differ: livenessProbe restarts the container, anticipating that the error will not recur after the restart; readinessProbe isolates the Pod from traffic until the cause of the failure disappears.
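The interplay of restartPolicy and the two probes can be illustrated with the official Kubernetes Python client; this is only a minimal sketch, and the image name, port, probe paths and Pod name below are placeholder assumptions rather than values from the patent.

```python
from kubernetes import client, config

def build_self_healing_pod() -> client.V1Pod:
    """Build a Pod whose container is restarted on abnormal exit and is
    health-checked by liveness/readiness probes (illustrative values)."""
    def http_probe(path: str) -> client.V1Probe:
        return client.V1Probe(
            http_get=client.V1HTTPGetAction(path=path, port=8080),
            initial_delay_seconds=5,
            period_seconds=10,
        )

    container = client.V1Container(
        name="app",
        image="example/app:latest",            # placeholder image
        liveness_probe=http_probe("/healthz"),  # failure -> kubelet restarts the container
        readiness_probe=http_probe("/ready"),   # failure -> Pod removed from Service backends
    )
    return client.V1Pod(
        metadata=client.V1ObjectMeta(name="self-healing-demo"),
        spec=client.V1PodSpec(
            restart_policy="OnFailure",          # Always (default) / OnFailure / Never
            containers=[container],
        ),
    )

if __name__ == "__main__":
    config.load_kube_config()                    # assumes a reachable cluster and kubeconfig
    client.CoreV1Api().create_namespaced_pod(
        namespace="default", body=build_self_healing_pod())
```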
However, the repair capability of the Kubernetes system is all built on the premise that the working nodes are normal and that the management node and the working nodes can communicate normally. Various problems of the working node itself, such as hardware problems, kernel deadlock, file system damage, and runtime hangs, cannot be detected by components in the Kubernetes system, let alone repaired. Yet such problems of the working node itself affect the normal creation and service of Pods, and the control plane (management node) of the Kubernetes system, unaware of them, continues to schedule Pods onto the problematic working node, causing old and new Pods to fail repeatedly, resulting in access failures and, further, in disordered operation of the Kubernetes system.
The following describes the technical solutions of the present application and how to solve the technical problems with the technical solutions of the present application in detail with specific embodiments. These several specific embodiments may be combined with each other below, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
In an embodiment, as shown in fig. 1, a method for processing a node failure is provided. This embodiment is illustrated by applying the method to a terminal on which a Kubernetes system is installed. The terminal can be, but is not limited to, various personal computers, notebook computers, Internet-of-Things devices, and the like. It is understood that the method can also be applied to a server installed with a Kubernetes system, and the server can be implemented by an independent server or a server cluster formed by a plurality of servers. In this embodiment, the method includes the following steps:
step 100, acquiring at least one type of log information on a current node in a computing cluster; the log information includes the operating state information of the current node.
The computing cluster is managed by the Kubernetes system installed on the terminal and comprises a plurality of working nodes. A node problem detector is deployed on each working node, and a plurality of problem daemon threads run in each working node as coroutines. While the Kubernetes system on the terminal is running, the node problem detector runs the problem daemon threads by executing specific scripts to acquire the log information of the current node; different problem daemon threads correspond to different log information. Specifically, the problem daemon threads include a kernel problem daemon thread, a disk daemon thread, a folder daemon thread, and user-defined daemon threads that can monitor other problems of the node. The kernel problem daemon thread acquires the state of the current node's kernel, i.e., the kernel's log information; the disk daemon thread acquires the state of the current node's disk, i.e., the disk's log information; and the folder daemon thread acquires the state of the current node's folders, i.e., the folder log information.
The terminal acquires at least one type of log information on the current node in the computing cluster through different problem daemon threads; different types of log information correspond to different problem daemon threads. This embodiment does not limit which specific log information is obtained: a worker may choose to run different problem daemon threads according to actual needs to obtain different log information.
In a specific embodiment, the node problem detector operates in the mode shown in fig. 2: the node problem detector is connected to a plurality of different monitors, different problem daemon threads run in the different monitors, and different types of log information can be acquired through the different monitors.
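A collection of monitors of this kind can be approximated as follows; this is a simplified stand-in for a real node problem detector, and the log paths and fault patterns are illustrative assumptions only.

```python
import re
from dataclasses import dataclass

@dataclass
class Monitor:
    name: str          # e.g. "kernel", "disk", "filesystem"
    log_path: str      # log source watched by this problem daemon thread
    patterns: list     # regexes that indicate a node-level fault

MONITORS = [
    Monitor("kernel", "/var/log/kern.log",
            [r"kernel BUG", r"soft lockup", r"Out of memory"]),
    Monitor("filesystem", "/var/log/syslog",
            [r"EXT4-fs error", r"Remounting filesystem read-only"]),
]

def collect_log_info(monitor: Monitor, max_lines: int = 1000) -> list:
    """Return the most recent lines of one type of log information."""
    try:
        with open(monitor.log_path, errors="ignore") as f:
            return f.readlines()[-max_lines:]
    except OSError:
        return []   # log source absent on this node

def scan(monitor: Monitor) -> list:
    """Return the log lines that match this monitor's fault patterns."""
    lines = collect_log_info(monitor)
    return [line for line in lines
            if any(re.search(p, line) for p in monitor.patterns)]
```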
And step 110, determining whether the current node is a fault node according to the log information, and releasing all container integration units on the current node under the condition that the current node is the fault node.
When the current node fails, the log information contains information indicating the failure. After the terminal acquires at least one type of log information, it can determine whether the current node is a failed node according to the log information. This embodiment does not limit the specific method of determining the failed node from the log information, as long as the function can be realized.
In one embodiment, an implementation manner related to determining whether a current node is a failed node according to the obtained log information is shown in fig. 3, and the steps include:
and step 300, matching the log information with preset fault information to obtain a matching result.
The preset fault information comprises sub-fault information corresponding to each type of log information. The preset fault information may be preset by a worker and stored in the memory of the terminal. After the terminal acquires the log information, it matches the log information against the preset fault information to obtain a matching result. Specifically, when multiple pieces of log information are acquired, the terminal may match each piece of log information one by one against the pieces of sub-fault information in the preset fault information to obtain the matching result; alternatively, the terminal may first determine the sub-fault information corresponding to each piece of log information, and then match each piece of log information against its corresponding sub-fault information. This embodiment does not limit the specific matching process, as long as its function can be achieved.
And step 210, determining whether the current node is a fault node according to the matching result.
After the terminal obtains the matching result: if the result is a match, i.e., fault information matching the log information exists in the preset fault information, indicating that the current node has a fault, the terminal determines that the current node is a failed node. If the result is no match, no fault information matching the log information exists in the preset fault information, indicating that the current node has not failed.
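This matching of log information against preset sub-fault information can be sketched as below; the fault table is a placeholder assumption and the patterns are examples, not values prescribed by the patent.

```python
import re

# Preset fault information: one list of sub-fault patterns per log type.
PRESET_FAULTS = {
    "kernel":     [r"soft lockup", r"hung_task", r"kernel BUG"],
    "disk":       [r"I/O error", r"read-only file system"],
    "filesystem": [r"EXT4-fs error", r"corruption detected"],
}

def is_failed_node(log_info: dict) -> bool:
    """log_info maps a log type to its collected lines; the node is a
    failed node as soon as any line matches its preset sub-fault info."""
    for log_type, lines in log_info.items():
        for pattern in PRESET_FAULTS.get(log_type, []):
            if any(re.search(pattern, line) for line in lines):
                return True     # matching result: matched -> failed node
    return False                # no match -> node considered healthy
```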
After the terminal determines that the current node is a failed node, it releases all container integration units on the current node. A container integration unit is a Pod and comprises a plurality of containers. That is, when the current node is a failed node, the terminal removes all containers running on the current node from it. Specifically, the terminal may first stop all containers running on the current node and then remove the stopped containers from the node, or it may directly remove all running containers from the current node. This embodiment does not limit how the container integration units released from the current node are subsequently handled.
In an alternative embodiment, the Kubernetes system includes a Draino component that is used to evict all container integration units (Pods) from the current node when the current node is determined to be a failed node. Specifically, a buffer time for evicting the Pods on the current node is configured in the Draino component; that is, when the current node is determined to be a failed node, the Pods on the current node are evicted within the configured buffer time.
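Releasing all container integration units on a failed node amounts to deleting every Pod scheduled there. The sketch below uses the official Python client with a configurable grace period standing in for the buffer time; it is not the Draino component itself, and the node name is a hypothetical example.

```python
from kubernetes import client, config

def release_pods(node_name: str, grace_seconds: int = 60) -> None:
    """Delete every Pod running on node_name, giving each Pod a buffer
    time (grace period) in which to terminate."""
    config.load_kube_config()
    v1 = client.CoreV1Api()
    pods = v1.list_pod_for_all_namespaces(
        field_selector=f"spec.nodeName={node_name}")
    for pod in pods.items:
        v1.delete_namespaced_pod(
            name=pod.metadata.name,
            namespace=pod.metadata.namespace,
            grace_period_seconds=grace_seconds)

if __name__ == "__main__":
    release_pods("worker-1")   # hypothetical node name
```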
The node failure processing method provided by the embodiment of the application obtains at least one type of log information on the current node in the computing cluster, determines whether the current node is a failed node according to the log information, and releases all container integration units on the current node when the current node is a failed node. In this embodiment, various faults existing in each node of the computing cluster can be detected from each node's log information and the failed node can be determined, so that a worker can repair the failed node in time. In addition, all container integration units (Pods) running on the failed node are released, which avoids the situation where containers on the failed node cannot run and cause the Kubernetes system to operate in a disordered manner.
In one embodiment, after all container integration units on the current node are released, the terminal's processing of the released container integration units includes:
and migrating all the released container integrated units to other nodes according to a preset scheduling strategy.
Since a cluster in the Kubernetes system is dynamic and its state changes over time, already running Pods may need to be migrated to other nodes, for which a preset scheduling policy is configured. If any one or more of the following occur: a node is underutilized or overused, a node fails, or a new node joins the cluster, the Pods running on that node can be migrated using the preset scheduling policy. The preset scheduling policy may include multiple scheduling policies, and different scheduling policies may be used for different situations occurring on a node. This embodiment does not limit the specific content of the preset scheduling policy; a worker may set it according to the actual application environment.
After determining the failed node and releasing the container integration unit running on the failed node, the terminal can migrate all the released container integration units to other nodes according to a preset scheduling strategy, that is, migrate the container running on the failed node to other nodes, so that the container can continue to run.
In this embodiment, after all container integration units on the failed node are released, all the released container integration units are migrated to other nodes according to the preset scheduling policy, so that the container integration units that were running on the failed node can operate normally on other nodes, and the Kubernetes system is ensured to run normally.
In an alternative embodiment, the Kubernetes system includes a Descheduler, and when the current node is determined to be a failed node, all container integration units on the current node can be migrated to other nodes using the Descheduler.
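The sketch below is not the Descheduler; it only illustrates, under the assumption that the workloads are controller-managed (Deployments, ReplicaSets and the like), how releasing Pods from a cordoned failed node leads the default scheduler to recreate them on other nodes.

```python
from kubernetes import client, config

def cordon_node(node_name: str) -> None:
    """Mark the failed node unschedulable so that Pods recreated by their
    controllers land on other, healthy nodes."""
    config.load_kube_config()
    client.CoreV1Api().patch_node(
        node_name, {"spec": {"unschedulable": True}})

# Typical order for one failed node (names are hypothetical):
#   cordon_node("worker-1")    # keep new Pods away from the failed node
#   release_pods("worker-1")   # see the eviction sketch above; controller-managed
#                              # Pods are then rescheduled onto other nodes
```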
In one embodiment, after determining that the current node is a failed node, the method for processing a node failure further includes:
setting a taint label on the current node; the taint label is used to instruct the current node to stop receiving other container integration units.
After the terminal determines that the current node is a failed node, it adds a taint label to the current node; through the taint label, the current node no longer receives other container integration units. That is, when a taint label is set on the current node, the current node rejects any new container integration unit that is scheduled to it.
In this embodiment, by adding a taint label to the failed current node, new container integration units are prevented from being scheduled onto a node where they could not run, which would otherwise cause abnormal access at the application layer; normal operation of the Kubernetes system can thus be ensured.
In an alternative embodiment, when the current node fails, the Kubernetes system of the terminal adds to the current node a taint whose Effect is NoSchedule (no scheduling). The taint acts as a condition: the Kubernetes system will not place onto the current node any Pod that does not tolerate the taint, that is, the current node no longer receives Pods that do not satisfy the taint's condition.
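Adding such a NoSchedule taint can be sketched with the Python client as follows; the taint key and value are illustrative assumptions, not identifiers defined by the patent.

```python
from kubernetes import client, config

def taint_failed_node(node_name: str) -> None:
    """Add a NoSchedule taint so the scheduler stops placing Pods that do
    not tolerate the taint onto the failed node."""
    config.load_kube_config()
    v1 = client.CoreV1Api()
    new_taint = {                       # hypothetical taint key/value
        "key": "node.failure/detected",
        "value": "true",
        "effect": "NoSchedule",
    }
    node = v1.read_node(node_name)
    existing = node.spec.taints or []
    # keep any taints already on the node and append the failure taint
    body = {"spec": {"taints": [
        {"key": t.key, "value": t.value, "effect": t.effect} for t in existing
    ] + [new_taint]}}
    v1.patch_node(node_name, body)
```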
In one embodiment, the method for processing node failure further includes:
and if a node release command sent by a management node in the computing cluster is received, cutting off the connection relation between the current node and each node in the computing cluster.
After receiving a fault message about the current node sent by the node problem detector deployed on the current node, the terminal can determine that the current node has failed. The node problem detector sends a message to the management node in the terminal's Kubernetes system to inform it that the current node has failed, and the management node sends a node release command, meaning that the failed current node needs to be removed from the computing cluster. If the current node receives the node release command sent by the management node, it responds to the command and cuts off its connection with the other nodes in the computing cluster, so as to remove the current node from the computing cluster.
In this embodiment, the current node can be removed from the computing cluster by cutting off the connection relationship between the current node and other nodes in the computing cluster, so that the work of the current node with a fault can be terminated, and no new Pod is scheduled to the current node.
In one embodiment, a cluster autoscaler is configured in the Kubernetes system. After the Pods on a failed current node are released, the resources of the current node are no longer fully utilized, and at that point the cluster autoscaler removes the current node from the computing cluster. In particular, the cluster autoscaler may be used together with the Draino component, i.e., after the Pods on the failed current node are released using the Draino component, the cluster autoscaler removes the current node from the computing cluster. The cluster autoscaler may also be used together with the Descheduler, i.e., after the Pods on the failed current node are migrated to other nodes of the cluster using the Descheduler, the cluster autoscaler removes the current node from the computing cluster.
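From the cluster's point of view, removing the node can be reduced to deleting its Node object; the call below is a simplified stand-in for the management node's release step and the autoscaler's removal, and the node name is a hypothetical example.

```python
from kubernetes import client, config

def remove_node(node_name: str) -> None:
    """Delete the Node object from the cluster, a simplified stand-in for
    the 'node release' step after the node's Pods have been released."""
    config.load_kube_config()
    client.CoreV1Api().delete_node(node_name)

# remove_node("worker-1")   # hypothetical node name
```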
In one embodiment, after the terminal removes the current node with the failure from the computing cluster, the processing method of the node failure further includes:
and repairing the fault of the current node, and reestablishing the connection relation between the current node and each node in the computing cluster if a node joining command sent by a management node in the computing cluster is received.
After the current node has been removed from the computing cluster, the terminal repairs its fault. The specific repair method corresponds to the specific fault occurring on the current node; this embodiment does not limit the fault repair method, as long as its function can be realized.
When the fault of the failed current node has been repaired and the node can work normally, the node problem detector deployed on the current node detects that the node can work normally and sends a message to the management node in the Kubernetes system to inform it that the current node has been repaired; the management node then sends a node joining command to the repaired current node. If the repaired current node receives the node joining command, it reestablishes its connection with the other nodes in the computing cluster in response to the command, so that the current node rejoins the computing cluster.
In this embodiment, after the fault of the failed current node is repaired, the repaired node can rejoin the computing cluster and share the Pods running on the other nodes in the cluster, so that phenomena such as overuse of other nodes in the computing cluster can be avoided and normal operation of the Kubernetes system ensured.
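If the node was cordoned and tainted as in the earlier sketches, rejoining can be sketched as the inverse operation; again this uses the official Python client, and the taint key is the hypothetical one introduced above.

```python
from kubernetes import client, config

def rejoin_node(node_name: str) -> None:
    """After repair, clear the failure taint and mark the node schedulable
    again so it can receive Pods from the computing cluster."""
    config.load_kube_config()
    v1 = client.CoreV1Api()
    node = v1.read_node(node_name)
    kept = [
        {"key": t.key, "value": t.value, "effect": t.effect}
        for t in (node.spec.taints or [])
        if t.key != "node.failure/detected"   # hypothetical key from the taint sketch
    ]
    v1.patch_node(node_name, {
        "spec": {"unschedulable": False, "taints": kept}})
```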
In one embodiment, a method of performing fault repair on a current node includes: reconfiguring the kernel parameter and the daemon process parameter of the system where the current node is located, restarting the daemon process, restarting the current node, and reinstalling the system of the current node.
The daemon process is a process that ensures normal communication between the Kubernetes system and external terminals or servers. In an optional embodiment, when the current node fails, the terminal may first reconfigure the kernel parameters and daemon parameters of the system where the current node is located. If the fault is repaired after the parameters are reconfigured, the step of rejoining the current node to the computing cluster is then executed; if not, the daemon process is restarted. If the fault is repaired after the daemon process is restarted, the step of rejoining the current node to the computing cluster is executed; if not, the current node is restarted. If the fault is repaired after the current node is restarted, the step of rejoining the current node to the computing cluster is executed; if not, the system of the current node is reinstalled.
In another alternative embodiment, different faults of the current node call for different repair methods. If the kernel of the current node fails, the kernel parameters and daemon parameters of the system where the current node is located can be reconfigured, or the daemon process can be restarted, to repair the fault. If the file system of the current node fails, the current node can be restarted or its system can be reinstalled. If the current node has a hardware problem, a worker is required to replace the current node.
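The escalating repair order (reconfigure parameters, restart the daemon, reboot, reinstall) can be sketched as follows; the specific commands, the health check, and the choice of kubelet as the daemon process are assumptions for illustration only.

```python
import subprocess

def node_is_healthy() -> bool:
    """Placeholder health check; in practice this would re-run the node
    problem detector's monitors after each repair attempt."""
    return False   # assume still faulty so the escalation continues

REPAIR_STEPS = [
    # 1. reconfigure kernel / daemon parameters (values are illustrative)
    ["sysctl", "-w", "vm.panic_on_oom=0"],
    # 2. restart the daemon process (assumed here to be the kubelet)
    ["systemctl", "restart", "kubelet"],
    # 3. restart the current node
    ["systemctl", "reboot"],
    # 4. reinstalling the system is left as a manual / provisioning step
]

def repair_node() -> bool:
    """Apply repair steps in order, stopping as soon as the node recovers."""
    for step in REPAIR_STEPS:
        subprocess.run(step, check=False)
        if node_is_healthy():
            return True      # rejoin the computing cluster next
    return False             # escalate to OS reinstall / hardware replacement
```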
In one embodiment, the method for processing node failure further includes:
and acquiring the fault information of the current node with the fault, storing the fault information and sending warning information to the management equipment of the computing cluster.
The fault information of the current node comprises the log information of the failed current node, and the terminal stores the fault information for subsequent analysis. Meanwhile, the terminal sends warning information to the management device of the computing cluster, so that the management device can perform the corresponding repair operation on the failed node in time. The warning information may include an alert bell. The management device of the computing cluster may be a monitoring system (Prometheus).
In one embodiment, when a node is removed from the computing cluster, the removal and repair of the node are performed according to a preset policy.
In order to prevent a node avalanche in the computing cluster, strict rate limiting is applied when nodes are removed from the computing cluster and before they are repaired, so that nodes are not restarted on a large scale. The specific rate-limiting method may include: allowing only a certain number of nodes (e.g., one node) in the computing cluster to be removed and repaired at the same time; or requiring the time interval between two nodes undergoing removal and repair to be at least a preset duration (e.g., 1 minute).
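A minimal sketch of this rate limiting, assuming one repair at a time and a fixed minimum interval between repair starts (both values are illustrative):

```python
import threading
import time

class RepairRateLimiter:
    """Allow at most max_concurrent nodes to be removed/repaired at once,
    with at least min_interval seconds between two repair starts."""

    def __init__(self, max_concurrent: int = 1, min_interval: float = 60.0):
        self._slots = threading.Semaphore(max_concurrent)
        self._min_interval = min_interval
        self._last_start = 0.0
        self._lock = threading.Lock()

    def acquire(self) -> None:
        self._slots.acquire()
        with self._lock:
            wait = self._min_interval - (time.monotonic() - self._last_start)
            if wait > 0:
                time.sleep(wait)
            self._last_start = time.monotonic()

    def release(self) -> None:
        self._slots.release()

# limiter = RepairRateLimiter()
# limiter.acquire()
# try:
#     repair_node()          # see the escalation sketch above
# finally:
#     limiter.release()
```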
In one embodiment, the current node is given a certain tolerance time when the repaired current node rejoins the computing cluster. That is, when a repaired current node is added back to the computing cluster, container integration units are scheduled to it only after a tolerance time (e.g., 2 minutes). This prevents the instability of a node that has just rejoined the computing cluster from causing a misjudgment about whether the current node has failed.
In an embodiment, as shown in fig. 4, the Kubernetes system further includes an automatic repair controller. The node problem detector in the Kubernetes system determines a failed node according to the acquired log information and sends the log information corresponding to the failed node to the automatic repair controller through the access server, and the automatic repair controller performs the corresponding repair operation on the failed node according to the received log information.
In one embodiment, the monitoring flow for each node of the computing cluster in the Kubernetes system is shown in fig. 5. The node problem detector obtains the log information of the nodes of the computing cluster in the Kubernetes system, determines the failed node according to the log information, and sends the log information corresponding to the failed node to a database and to the user terminal. The log information sent to the database can be displayed visually or analyzed and processed by various algorithms. The log information sent to the user terminal can prompt the user that a failed node exists in the computing cluster of the Kubernetes system. The node problem detector also sends the log information corresponding to the failed node to a monitoring system (Prometheus), which forwards the received information to the user terminal.
The repairing of the failed node may be that the node problem detector sends log information corresponding to the failed node to the automatic repair controller, and the automatic repair controller repairs the failed node according to the log information.
It should be understood that, although the steps in the flowcharts related to the embodiments described above are shown in sequence as indicated by the arrows, the steps are not necessarily performed in sequence as indicated by the arrows. The steps are not performed in the exact order shown and described, and may be performed in other orders, unless explicitly stated otherwise. Moreover, at least a part of the steps in the flowcharts related to the embodiments described above may include multiple steps or multiple stages, which are not necessarily performed at the same time, but may be performed at different times, and the execution order of the steps or stages is not necessarily sequential, but may be performed alternately or alternately with other steps or at least a part of the steps or stages in other steps.
Based on the same inventive concept, the embodiment of the present application further provides a node failure processing apparatus for implementing the above node failure processing method. The implementation scheme for solving the problem provided by the device is similar to the implementation scheme described in the above method, so specific limitations in the following embodiments of the processing device for one or more node failures may refer to the above limitations on the processing method for the node failure, and details are not described here.
In one embodiment, as shown in fig. 6, a node failure processing apparatus 10 is provided, which includes an obtaining module 11 and a determining module 12. Wherein:
the obtaining module 11 is configured to obtain at least one type of log information on a current node in a computing cluster; the log information comprises the working state information of the current node;
the determining module 12 is configured to determine whether the current node is a failed node according to the log information, and release all container integration units on the current node if the current node is the failed node.
In one embodiment, the processing apparatus 10 for node failure further comprises a migration module. And the migration module is used for migrating all the released container integrated units to other nodes according to a preset scheduling strategy.
In one embodiment, the node failure processing apparatus 10 further comprises a setting module. The setting module is used for setting a taint label on the current node; the taint label is used to instruct the current node to stop receiving other container integration units.
In one embodiment, the processing device 10 of the node failure further comprises a shutdown module. And the cutting-off module is used for cutting off the connection relation between the current node and each node in the computing cluster if a node release command sent by a management node in the computing cluster is received.
In one embodiment, the processing apparatus 10 of the node failure further comprises a repair module. And the repairing module is used for repairing the fault of the current node, and reestablishing the connection relation between the current node and each node in the computing cluster if a node joining command sent by a management node in the computing cluster is received.
In one embodiment, a method of performing fault repair on a current node includes: reconfiguring the kernel parameter and the daemon process parameter of the system where the current node is located, restarting the daemon process, restarting the current node, and reinstalling the system of the current node.
The modules in the node failure processing device can be wholly or partially implemented by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 7. The computer apparatus includes a processor, a memory, a communication interface, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a method of handling node failures. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 7 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, there is provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the following steps when executing the computer program:
acquiring at least one type of log information on a current node in a computing cluster; the log information comprises the working state information of the current node;
and determining whether the current node is a fault node or not according to the log information, and releasing all container integration units on the current node under the condition that the current node is the fault node.
In one embodiment, the processor, when executing the computer program, further performs the steps of: and migrating all the released container integrated units to other nodes according to a preset scheduling strategy.
In one embodiment, the processor, when executing the computer program, further performs the steps of: setting a taint label on the current node; the taint label is used to instruct the current node to stop receiving other container integration units.
In one embodiment, the processor, when executing the computer program, further performs the steps of: and if a node release command sent by a management node in the computing cluster is received, cutting off the connection relation between the current node and each node in the computing cluster.
In one embodiment, the processor, when executing the computer program, further performs the steps of: and repairing the fault of the current node, and reestablishing the connection relation between the current node and each node in the computing cluster if a node joining command sent by a management node in the computing cluster is received.
In one embodiment, a method of performing fault repair on a current node includes: reconfiguring the kernel parameter and the daemon process parameter of the system where the current node is located, restarting the daemon process, restarting the current node, and reinstalling the system of the current node.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
acquiring at least one type of log information on a current node in a computing cluster; the log information comprises the working state information of the current node;
and determining whether the current node is a fault node or not according to the log information, and releasing all container integration units on the current node under the condition that the current node is the fault node.
In one embodiment, the computer program when executed by the processor further performs the steps of: and migrating all the released container integrated units to other nodes according to a preset scheduling strategy.
In one embodiment, the computer program when executed by the processor further performs the steps of: setting a taint label on the current node; the taint label is used to instruct the current node to stop receiving other container integration units.
In one embodiment, the computer program when executed by the processor further performs the steps of: and if a node release command sent by a management node in the computing cluster is received, cutting off the connection relation between the current node and each node in the computing cluster.
In one embodiment, the computer program when executed by the processor further performs the steps of: and repairing the fault of the current node, and reestablishing the connection relation between the current node and each node in the computing cluster if a node joining command sent by a management node in the computing cluster is received.
In one embodiment, a method of performing fault repair on a current node includes: reconfiguring the kernel parameter and the daemon process parameter of the system where the current node is located, restarting the daemon process, restarting the current node, and reinstalling the system of the current node.
In one embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, performs the steps of:
acquiring at least one type of log information on a current node in a computing cluster; the log information comprises the working state information of the current node;
and determining whether the current node is a fault node or not according to the log information, and releasing all container integration units on the current node under the condition that the current node is the fault node.
In one embodiment, the computer program when executed by the processor further performs the steps of: and migrating all the released container integrated units to other nodes according to a preset scheduling strategy.
In one embodiment, the computer program when executed by the processor further performs the steps of: setting a taint label on the current node; the taint label is used to instruct the current node to stop receiving other container integration units.
In one embodiment, the computer program when executed by the processor further performs the steps of: and if a node release command sent by a management node in the computing cluster is received, cutting off the connection relation between the current node and each node in the computing cluster.
In one embodiment, the computer program when executed by the processor further performs the steps of: and repairing the fault of the current node, and reestablishing the connection relation between the current node and each node in the computing cluster if a node joining command sent by a management node in the computing cluster is received.
In one embodiment, a method of performing fault repair on a current node includes: reconfiguring the kernel parameter and the daemon process parameter of the system where the current node is located, restarting the daemon process, restarting the current node, and reinstalling the system of the current node.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, databases, or other media used in the embodiments provided herein can include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, high-density embedded nonvolatile Memory, resistive Random Access Memory (ReRAM), Magnetic Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene Memory, and the like. Volatile Memory can include Random Access Memory (RAM), external cache Memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others. The databases referred to in various embodiments provided herein may include at least one of relational and non-relational databases. The non-relational database may include, but is not limited to, a block chain based distributed database, and the like. The processors referred to in the embodiments provided herein may be general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, quantum computing based data processing logic devices, etc., without limitation.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (10)

1. A processing method for node failure is characterized in that the processing method comprises the following steps:
acquiring at least one type of log information on a current node in a computing cluster; the log information comprises the working state information of the current node;
and determining whether the current node is a fault node according to the log information, and releasing all container integration units on the current node under the condition that the current node is the fault node.
2. The processing method according to claim 1, characterized in that it further comprises:
and migrating all the released container integrated units to other nodes according to a preset scheduling strategy.
3. The processing method according to claim 1, characterized in that it further comprises:
setting a taint label for the current node; the taint label is used to instruct the current node to stop receiving other container integration units.
4. The processing method according to any one of claims 1-2, characterized in that the processing method further comprises:
and if a node release command sent by a management node in the computing cluster is received, cutting off the connection relation between the current node and each node in the computing cluster.
5. The processing method according to any one of claims 1-2, characterized in that the processing method further comprises:
and repairing the fault of the current node, and if a node joining command sent by a management node in the computing cluster is received, reestablishing the connection relationship between the current node and each node in the computing cluster.
6. The processing method according to claim 1, wherein the method of repairing the fault of the current node comprises: reconfiguring the kernel parameter and the daemon process parameter of the system where the current node is located, restarting the daemon process, restarting the current node, and reinstalling the system of the current node.
7. A node failure handling apparatus, the apparatus comprising:
the acquisition module is used for acquiring at least one type of log information on the current node in the computing cluster; the log information comprises the working state information of the current node;
and the determining module is used for determining whether the current node is a fault node or not according to the log information and releasing all container integration units on the current node under the condition that the current node is the fault node.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor realizes the steps of the processing method of any one of claims 1 to 6 when executing the computer program.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the processing method of any one of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the processing method of any one of claims 1 to 6 when executed by a processor.
CN202210690105.9A 2022-06-17 2022-06-17 Method, device, equipment, medium and program product for processing node failure Pending CN114942859A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210690105.9A CN114942859A (en) 2022-06-17 2022-06-17 Method, device, equipment, medium and program product for processing node failure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210690105.9A CN114942859A (en) 2022-06-17 2022-06-17 Method, device, equipment, medium and program product for processing node failure

Publications (1)

Publication Number Publication Date
CN114942859A true CN114942859A (en) 2022-08-26

Family

ID=82910971

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210690105.9A Pending CN114942859A (en) 2022-06-17 2022-06-17 Method, device, equipment, medium and program product for processing node failure

Country Status (1)

Country Link
CN (1) CN114942859A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116155686A (en) * 2023-01-30 2023-05-23 浪潮云信息技术股份公司 Method for judging node faults in cloud environment


Similar Documents

Publication Publication Date Title
US10152382B2 (en) Method and system for monitoring virtual machine cluster
US8954579B2 (en) Transaction-level health monitoring of online services
US11321197B2 (en) File service auto-remediation in storage systems
US8156490B2 (en) Dynamic migration of virtual machine computer programs upon satisfaction of conditions
US8910172B2 (en) Application resource switchover systems and methods
US9712418B2 (en) Automated network control
US8990388B2 (en) Identification of critical web services and their dynamic optimal relocation
US9483314B2 (en) Systems and methods for fault tolerant batch processing in a virtual environment
US9495234B1 (en) Detecting anomalous behavior by determining correlations
US11157373B2 (en) Prioritized transfer of failure event log data
US9535754B1 (en) Dynamic provisioning of computing resources
US7673178B2 (en) Break and optional hold on failure
US9195528B1 (en) Systems and methods for managing failover clusters
CN115562911A (en) Virtual machine data backup method, device, system, electronic equipment and storage medium
CN114942859A (en) Method, device, equipment, medium and program product for processing node failure
CN111538585A (en) Js-based server process scheduling method, system and device
US11544091B2 (en) Determining and implementing recovery actions for containers to recover the containers from failures
US9274905B1 (en) Configuration tests for computer system
CN114691304A (en) Method, device, equipment and medium for realizing high availability of cluster virtual machine
CN112256384B (en) Service set processing method and device based on container technology and computer equipment
JP2018169920A (en) Management device, management method and management program
JP2012181737A (en) Computer system
US11323315B1 (en) Automated host management service
CN107783855B (en) Fault self-healing control device and method for virtual network element
WO2023185355A1 (en) Method and apparatus for achieving high availability of clustered virtual machines, device, and medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination