CN113765748A

CN113765748A - Method for processing fault of computing node and computer readable storage medium

Info

Publication number: CN113765748A
Application number: CN202111027451.0A
Authority: CN
Inventors: 李义; 邹理贤; 刘建平
Original assignee: Aerospace Winhong Technology Guizhou Co ltd; Winhong Information Technology Co ltd
Current assignee: Aerospace Winhong Technology Guizhou Co ltd; Winhong Information Technology Co ltd
Priority date: 2021-09-02
Filing date: 2021-09-02
Publication date: 2021-12-07

Abstract

The invention discloses a method for processing a fault of a computing node and a computer-readable storage medium. In the method, if a fault processing condition is met, step X is executed: the virtual machine running on the computing node is shut down. The fault processing condition comprises that no heartbeat detection signal sent by the management node is received within a preset time length. The method is executed by the computing node itself: the computing node automatically judges whether it has failed and, if so, automatically shuts down the virtual machine running on it, so that the virtual machine can be restarted on another computing node.

Description

Method for processing fault of computing node and computer readable storage medium

Technical Field

The present invention relates to the field of virtualization technologies, and in particular, to a method for processing a failure of a compute node and a computer-readable storage medium.

Background

With the rapid development of computer technology, more and more companies have begun to pay attention to reducing energy consumption and improving resource utilization, and against this background the cloud computing model has emerged. Cloud computing abstracts all computers into computing resources and then provides those resources to a user, rather than providing one or more computers directly to the user as before. The user can therefore apply for computing resources as needed, which avoids unnecessary waste and improves resource utilization.

In order to save equipment cost and further improve resource utilization rate, a cloud computing builder adopts a virtualization technology to virtualize a plurality of virtual computers (i.e., virtual machines) on a single physical computer for respective use by a plurality of users, and the physical computer is called a computing node. A plurality of such physical computers form a virtualized cluster. The virtualization cluster performs unified management on a plurality of physical computers.

A virtualized cluster typically includes a management node and a plurality of computing nodes, with the management node monitoring the health of each computing node. The management node performs fault detection on each computing node through heartbeat packets: if the management node does not receive a heartbeat packet sent by a computing node within a preset time length, that computing node is considered to have failed. To improve the service continuity of the virtualized cluster and allow virtual machines to recover their running services promptly after a failure, the management node, after detecting the failure of a computing node, sends an instruction to the failed computing node to shut down all the virtual machines running on it, and then restarts those virtual machines on other computing nodes in the cluster to recover the virtual machine services. The client therefore does not perceive the virtual machine failure, and service continuity is ensured. However, when a computing node fails, it may be unable to respond to the instruction of the management node, so the virtual machines may not be shut down. As long as the original computing node has not shut down the virtual machines, other computing nodes cannot start them; if a virtual machine is started forcibly, its data may be damaged.

Disclosure of Invention

The present invention is directed to a method for processing a failure of a compute node, which is capable of shutting down a virtual machine running on the failed compute node to enable the virtual machine to be restarted on another compute node, and a computer-readable storage medium storing a computer program that, when executed, implements the method.

In order to solve the above technical problem, in the method for processing a fault of a computing node according to the present invention, step X is executed if a fault processing condition is met: the virtual machine running on the computing node is shut down. The fault processing condition comprises that no heartbeat detection signal sent by the management node is received within a preset time length.

Optionally, the step X is specifically to close the virtual machine on the current computing node from the kernel.

Optionally, the step X is specifically to shut down a virtual machine that runs on the current computing node and uses non-local resources.

Optionally, in the step X, a virtual machine that uses local resources is not shut down.

Optionally, the preset time length is three times the sending interval of the heartbeat detection signal.

Optionally, the fault handling condition specifically includes that the kernel does not receive a heartbeat detection signal sent by the management node within a preset time period.

Optionally, the method is a fault processing method of the computing node kernel.

A computer-readable storage medium having stored thereon an executable computer program, characterized by: the computer program when executed implements a compute node failure handling method as described above.

The method for processing a fault of a computing node is executed by the computing node itself: the computing node automatically judges whether it has failed and, if so, automatically shuts down the virtual machine running on it, so that the virtual machine can be restarted on another computing node.
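
The fault processing condition can be illustrated with a minimal sketch. It assumes a 30 s heartbeat interval (the value used in the embodiments below) together with the optional rule that the preset time length is three times that interval. The function and constant names are hypothetical; the patent does not prescribe an implementation.

```python
HEARTBEAT_INTERVAL_S = 30.0                    # assumed; matches the embodiments
PRESET_DURATION_S = 3 * HEARTBEAT_INTERVAL_S   # optional "three times the interval" rule


def fault_condition_met(last_signal_at: float, now: float,
                        preset_duration_s: float = PRESET_DURATION_S) -> bool:
    """Return True when no heartbeat detection signal from the management
    node has arrived within the preset time length (times in seconds)."""
    return (now - last_signal_at) >= preset_duration_s
```

With these assumed values, a computing node that last heard from the management node 60 s ago does not yet meet the condition, while one that has heard nothing for 90 s does and would proceed to step X.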

Drawings

FIG. 1 is a block diagram of a system architecture for a virtualization cluster.

Detailed Description

The invention is described in detail below with reference to specific embodiments.

As shown in FIG. 1, a virtualized cluster includes a management node and three computing nodes 1, 2, 3 communicatively connected to the management node. Virtual machines run on each of the computing nodes 1, 2, 3. The management node includes a processor and a computer-readable storage medium storing an executable computer program, the processor executing the computer program to implement the functions of the management node. Each computing node likewise includes a processor and a computer-readable storage medium storing an executable computer program, the computer program being executed by the processor of the computing node to implement the functions of the computing node.

Example one

The management node sends its own IP address and port name to each of the computing nodes 1, 2, 3. After receiving them, each of the computing nodes 1, 2, 3 passes the IP address and port name of the management node to its own kernel through netlink communication, and a TCP heartbeat detection mechanism is then established between the kernel and the management node, so that the management node has a TCP socket communication channel with the kernel of each of the computing nodes 1, 2, 3. Through these channels, the management node sends to the kernel of each computing node the disk paths of the virtual machines running on that node together with a detection policy, which comprises a detection mode and a detection feedback mode. In this way the management node detects, through the computing node, whether a virtual machine has failed to issue any I/O request to its disk within a preset time length.
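
As a rough illustration of what the management node sends over the channel, the sketch below models the disk paths and detection policy as a small message. The field names, the example disk paths, and the use of JSON are all assumptions made for illustration; the patent only states that the policy contains a detection mode and a detection feedback mode.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class DetectionPolicy:
    # Hypothetical fields standing in for the "detection mode" and
    # "detection feedback mode" named in the text.
    detection_mode: str        # e.g. "disk_io": watch for I/O requests on the disk path
    report_interval_s: int     # feedback mode: how often results are reported
    preset_duration_s: int     # no-I/O time after which a VM is treated as failed


@dataclass
class VmDetectionRequest:
    disk_paths: List[str]      # one disk path per virtual machine on the node
    policy: DetectionPolicy

    def to_wire(self) -> bytes:
        """Serialize for the TCP channel (JSON encoding is an assumption)."""
        return json.dumps(asdict(self)).encode("utf-8")


request = VmDetectionRequest(
    disk_paths=["/vms/vm11.img", "/vms/vm12.img", "/vms/vm13.img"],  # made-up paths
    policy=DetectionPolicy("disk_io", report_interval_s=30, preset_duration_s=90),
)
payload = request.to_wire()
```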

Taking computing node 1 as an example, after receiving the disk paths and the detection policy for the virtual machines 11, 12, 13, the kernel of computing node 1 monitors the disk paths of the virtual machines 11, 12, 13 according to the detection mode in the detection policy, thereby detecting whether the virtual machines 11, 12, 13 issue I/O requests to their disks. According to the detection feedback mode in the detection policy, it sends a virtual machine detection heartbeat packet to the management node every 30 s to report the detection result for that 30 s period. Assume that the kernel of computing node 1 starts detection at 9:00:00 and that the QEMU process of virtual machine 11 is blocked: although the QEMU process still exists, the virtual machine has in fact failed and cannot issue I/O requests to its disk. Suppose that during 9:00:00 to 9:00:30 none of the virtual machines 11, 12, 13 issues an I/O request to its disk. The kernel of computing node 1 then detects no I/O request from the virtual machines 11, 12, 13 during that period and, at 9:00:30, generates a virtual machine detection heartbeat packet containing the detection result for 9:00:00 to 9:00:30 and sends it to the management node. After receiving the virtual machine detection heartbeat packet, the management node sends a virtual machine detection reply heartbeat packet to the kernel of computing node 1 to inform it that the packet has been received, and records that the virtual machines 11, 12, 13 issued no I/O request during 9:00:00 to 9:00:30.
Then, during 9:00:30 to 9:01:00, the kernel of computing node 1 still detects no I/O request issued by virtual machine 11 to its disk, but detects that the virtual machines 12, 13 issue I/O requests to their disks. At 9:01:00 it generates a virtual machine detection heartbeat packet containing the detection result for 9:00:30 to 9:01:00 and sends it to the management node. After receiving the packet, the management node sends a virtual machine detection reply heartbeat packet to the kernel of computing node 1, and records that virtual machine 11 issued no I/O request during 9:00:30 to 9:01:00 while the virtual machines 12, 13 did. Then, during 9:01:00 to 9:01:30, the kernel of computing node 1 still detects no I/O request issued by virtual machine 11 to its disk, but detects that the virtual machines 12, 13 issue I/O requests to their disks. At 9:01:30 it generates a virtual machine detection heartbeat packet containing the detection result for 9:01:00 to 9:01:30 and sends it to the management node. After receiving the packet, the management node sends a virtual machine detection reply heartbeat packet to the kernel of computing node 1, and records that virtual machine 11 issued no I/O request during 9:01:00 to 9:01:30 while the virtual machines 12, 13 did.
At this point the management node has received three consecutive virtual machine detection heartbeat packets reporting that virtual machine 11 issued no I/O request, that is, virtual machine 11 has issued no I/O request to its disk within the preset time length of 90 s. The management node therefore determines that virtual machine 11 has failed and sends a virtual machine 11 fault message to the kernel of computing node 1, so that computing node 1 shuts down virtual machine 11 from the kernel. Because the virtual machines 12, 13 issued I/O requests during 9:00:30 to 9:01:30, the management node judges that they are running normally.
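
The management node's bookkeeping in this walkthrough can be sketched as a per-VM counter of consecutive idle reports. This is only an illustration: the class and method names are hypothetical, and resetting the counter whenever I/O is reported is an assumption consistent with, but not stated by, the example.

```python
from typing import Dict

FAULT_AFTER_IDLE_REPORTS = 3   # 3 reports x 30 s = the 90 s preset time length


class VmIoMonitor:
    """Management-node view: counts consecutive reports with no I/O per VM."""

    def __init__(self) -> None:
        self.idle_reports: Dict[str, int] = {}

    def record_report(self, report: Dict[str, bool]) -> Dict[str, str]:
        """`report` maps VM name -> True if I/O was seen in the last window.
        Returns VM name -> "faulty" or "normal"."""
        verdicts = {}
        for vm, saw_io in report.items():
            # Any I/O resets the streak (assumption); an idle window extends it.
            self.idle_reports[vm] = 0 if saw_io else self.idle_reports.get(vm, 0) + 1
            idle = self.idle_reports[vm]
            verdicts[vm] = "faulty" if idle >= FAULT_AFTER_IDLE_REPORTS else "normal"
        return verdicts


monitor = VmIoMonitor()
monitor.record_report({"vm11": False, "vm12": False, "vm13": False})        # 9:00:00-9:00:30
monitor.record_report({"vm11": False, "vm12": True, "vm13": True})          # 9:00:30-9:01:00
final = monitor.record_report({"vm11": False, "vm12": True, "vm13": True})  # 9:01:00-9:01:30
```

Replaying the three reports from the example leaves virtual machine 11 marked faulty and virtual machines 12, 13 marked normal.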

Similarly, after receiving the disk paths of the virtual machines 21, 22, and 23, the kernel of the compute node 2 detects the disk paths of the virtual machines 21, 22, and 23, respectively, so as to detect whether the virtual machines 21, 22, and 23 issue an I/O request to their disks, and sends a virtual machine detection heartbeat packet to the management node every 30s, thereby reporting the detection result during this 30s period. The same is true of the compute node 3 kernel.

In this embodiment, the management node issues a detection policy to each of the computing nodes 1, 2, 3, so that the computing nodes 1, 2, 3 detect the virtual machines according to that policy. Less preferably, other embodiments may instead configure the detection policy directly on each of the computing nodes 1, 2, 3; in that case, after receiving the disk paths of the virtual machines, each computing node automatically detects the virtual machines according to its local detection policy.

The kernels of the computing nodes 1, 2, 3 not only periodically send virtual machine detection heartbeat packets to the management node to report the virtual machine detection results, but also periodically send computing node health heartbeat packets to the management node to report the health state of the computing node. After receiving the computing node health heartbeat packets sent by the kernels of the computing nodes 1, 2, 3, the management node sends a computing node health reply heartbeat packet to each kernel to inform it that its computing node health heartbeat packet has been received. Suppose that each of the computing nodes 1, 2, 3 also sends a computing node health heartbeat packet to the management node every 30 s, and that the kernels of the computing nodes 1, 2, 3 start detection at 9:00:00. At 9:00:30, the kernels of the computing nodes 1, 2, 3 send both a virtual machine detection heartbeat packet and a computing node health heartbeat packet to the management node. Assuming that the management node receives only the computing node health heartbeat packets of computing nodes 1 and 2 and not that of computing node 3, it records computing nodes 1 and 2 as healthy and computing node 3 as undetermined, and then sends computing node health reply heartbeat packets to the kernels of computing nodes 1 and 2. The kernels of computing nodes 1 and 2 thus receive the computing node health reply heartbeat packet sent by the management node, while the kernel of computing node 3 does not. The kernel of computing node 3 records that the computing node health reply heartbeat packet has been missed once.
Next, at 9:01:00, the kernels of the computing nodes 1, 2, 3 again send computing node health heartbeat packets to the management node. The management node again receives only those of computing nodes 1 and 2 and not that of computing node 3; it records computing nodes 1 and 2 as healthy and computing node 3 as undetermined, and then sends computing node health reply heartbeat packets to the kernels of computing nodes 1 and 2. The kernels of computing nodes 1 and 2 receive the computing node health reply heartbeat packet sent by the management node, while the kernel of computing node 3 does not. The kernel of computing node 3 records that the computing node health reply heartbeat packet has been missed twice in a row. Next, at 9:01:30, the kernels of the computing nodes 1, 2, 3 again send computing node health heartbeat packets to the management node. The management node still receives only those of computing nodes 1 and 2 and not that of computing node 3. Since the management node has now failed three consecutive times to receive a computing node health heartbeat packet from computing node 3, that is, it has received none within the preset time length of 90 s (three times the sending interval of the computing node health heartbeat packet), it records computing node 3 as failed and computing nodes 1 and 2 as healthy, and then sends computing node health reply heartbeat packets to the kernels of computing nodes 1 and 2.
At this time, the kernels of computing nodes 1 and 2 receive the computing node health reply heartbeat packet sent by the management node, while the kernel of computing node 3 does not. The kernel of computing node 3 records that the computing node health reply heartbeat packet has been missed three times in a row, that is, the kernel of computing node 3 has not received the computing node health reply heartbeat packet (namely, the heartbeat detection signal) sent by the management node within the preset time length of 90 s, and it accordingly determines that computing node 3 has failed. Assuming that virtual machine 33 running on computing node 3 uses a remote disk shared with computing nodes 1, 2 and does not use the local storage resources of computing node 3, the kernel of computing node 3, having determined that the node has failed, shuts down virtual machine 33 from the kernel. Because the kernel of computing node 3 actively shuts down virtual machine 33, the situation in which the virtual machine cannot be shut down because the node cannot respond to the shutdown instruction of the management node is avoided. At the same time as the management node determines that computing node 3 has failed, computing node 3 also determines that it has failed and shuts down virtual machine 33, so that the management node can restart virtual machine 33 on computing node 1 or computing node 2.
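
The symmetry in this timeline, where the management node and computing node 3 reach the same conclusion at the same moment, can be sketched with one miss counter on each side. The class name and state strings are hypothetical, and the loop simply replays the three lost heartbeats from the example rather than performing any real network I/O.

```python
MAX_MISSES = 3  # three 30 s intervals = the 90 s preset time length


class MissCounter:
    """Counts consecutive missed heartbeats; usable on either side of the link."""

    def __init__(self) -> None:
        self.misses = 0

    def observe(self, received: bool) -> str:
        self.misses = 0 if received else self.misses + 1
        if self.misses == 0:
            return "healthy"
        return "failed" if self.misses >= MAX_MISSES else "undetermined"


# The management node tracks health heartbeats from computing node 3;
# the kernel of computing node 3 tracks reply heartbeats from the management node.
manager_view_of_node3 = MissCounter()
node3_view_of_manager = MissCounter()

timeline = []
for tick in ("9:00:30", "9:01:00", "9:01:30"):
    # The heartbeat is lost, so the management node sends no reply either.
    manager_state = manager_view_of_node3.observe(received=False)
    node_state = node3_view_of_manager.observe(received=False)
    timeline.append((tick, manager_state, node_state))
```

Both counters report "undetermined" at 9:00:30 and 9:01:00 and "failed" at 9:01:30, which is the point at which computing node 3 would shut down virtual machine 33 and the management node could restart it elsewhere.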

In this embodiment, the failed computing node 3 shuts down only virtual machine 33, which uses non-local resources, and for the time being leaves the virtual machines 31, 32, which use local resources, untouched. Less preferably, all the virtual machines 31, 32, 33 running on the failed computing node 3 may be shut down instead.
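
The choice between the two variants can be sketched as a simple filter over the virtual machines on the failed node. The `VirtualMachine` type, its `uses_local_storage` flag, and the function name are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VirtualMachine:
    name: str
    uses_local_storage: bool   # True if the VM's disk lives on this computing node


def select_vms_to_shut_down(vms: List[VirtualMachine],
                            only_non_local: bool = True) -> List[str]:
    """Pick the VMs a failed computing node should shut down.

    only_non_local=True is the preferred behaviour of this embodiment;
    False is the less preferred variant that shuts down every VM."""
    return [vm.name for vm in vms
            if not only_non_local or not vm.uses_local_storage]


node3_vms = [
    VirtualMachine("vm31", uses_local_storage=True),
    VirtualMachine("vm32", uses_local_storage=True),
    VirtualMachine("vm33", uses_local_storage=False),  # shared remote disk
]
```

Calling `select_vms_to_shut_down(node3_vms)` selects only `vm33`, while passing `only_non_local=False` selects all three.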

Example two

The present embodiment is substantially the same as the first embodiment, and only the differences between the present embodiment and the first embodiment will be described below.

In the first embodiment, the computing node only reports to the management node the detection result concerning whether the virtual machines issued I/O requests; the management node determines from the detection result whether a virtual machine has failed, and the computing node itself makes no such determination. In this embodiment, the computing node instead determines from the detection result whether a virtual machine has failed. Still taking computing node 1 as an example, assume that its kernel starts detection at 9:00:00. Assuming that during 9:00:00 to 9:00:30 none of the virtual machines 11, 12, 13 issues an I/O request to its disk, the kernel of computing node 1 detects no I/O request from the virtual machines 11, 12, 13 during that period and records that the virtual machines 11, 12, 13 issued no I/O request during 9:00:00 to 9:00:30. At this time, the virtual machines 11, 12, 13 have gone only 30 s without issuing an I/O request to their disks, which is less than the preset time length of 90 s, so the kernel of computing node 1 determines that the virtual machines 11, 12, 13 are healthy, generates a virtual machine health heartbeat packet stating that the virtual machines 11, 12, 13 are healthy, and sends it to the management node. After receiving the virtual machine health heartbeat packet, the management node immediately sends a virtual machine health reply heartbeat packet to the kernel of computing node 1 to inform computing node 1 that the management node has received the virtual machine health heartbeat packet.
Then, during 9:00:30 to 9:01:00, the kernel of computing node 1 still detects no I/O request issued by virtual machine 11 to its disk, but detects that the virtual machines 12, 13 issue I/O requests to their disks, and records that virtual machine 11 issued no I/O request during 9:00:30 to 9:01:00 while the virtual machines 12, 13 did. At this time, virtual machine 11 has gone only 60 s without issuing an I/O request to its disk, which is still less than the preset time length of 90 s, so the kernel of computing node 1 determines that the virtual machines 11, 12, 13 are healthy, generates a virtual machine health heartbeat packet stating that the virtual machines 11, 12, 13 are healthy, and sends it to the management node. After receiving the virtual machine health heartbeat packet, the management node immediately sends a virtual machine health reply heartbeat packet to the kernel of computing node 1. Then, during 9:01:00 to 9:01:30, the kernel of computing node 1 still detects no I/O request issued by virtual machine 11 to its disk, but detects that the virtual machines 12, 13 issue I/O requests to their disks, and records that virtual machine 11 issued no I/O request during 9:01:00 to 9:01:30 while the virtual machines 12, 13 did.
At this time, virtual machine 11 has gone 90 s without issuing an I/O request to its disk, that is, it has issued no I/O request within the preset time length of 90 s. The kernel of computing node 1 therefore determines that virtual machine 11 has failed and that the virtual machines 12, 13 are still healthy, shuts down virtual machine 11 from the kernel, generates a virtual machine health heartbeat packet stating that virtual machine 11 is faulty and the virtual machines 12, 13 are healthy, and sends it to the management node. After the kernel of the computing node 1 judges that the virtual machine 11 fails.
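
The difference from the first embodiment, namely that the computing node itself reaches the verdict, can be sketched as follows. The function name, the per-VM idle-time bookkeeping, and the shape of the heartbeat content are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

WINDOW_S = 30            # detection window, as in the embodiment
PRESET_DURATION_S = 90   # no-I/O time after which a VM is judged faulty


def judge_locally(windows: List[Dict[str, bool]]) -> List[Tuple[Dict[str, str], List[str]]]:
    """For each detection window, return (health heartbeat content, VMs shut down).

    `windows` holds one dict per 30 s window, mapping VM name -> True if
    I/O was seen in that window."""
    idle_s: Dict[str, int] = {}
    results = []
    for window in windows:
        health, to_shut_down = {}, []
        for vm, saw_io in window.items():
            # I/O resets the idle time (assumption); otherwise it accumulates.
            idle_s[vm] = 0 if saw_io else idle_s.get(vm, 0) + WINDOW_S
            if idle_s[vm] >= PRESET_DURATION_S:
                health[vm] = "faulty"
                to_shut_down.append(vm)   # the kernel shuts the VM down itself
            else:
                health[vm] = "healthy"
        results.append((health, to_shut_down))
    return results


results = judge_locally([
    {"vm11": False, "vm12": False, "vm13": False},  # 9:00:00-9:00:30
    {"vm11": False, "vm12": True, "vm13": True},    # 9:00:30-9:01:00
    {"vm11": False, "vm12": True, "vm13": True},    # 9:01:00-9:01:30
])
```

In this replay the first two heartbeats report all three virtual machines as healthy, and the third reports virtual machine 11 as faulty and lists it for shutdown.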

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and not to limit its protection scope. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from their spirit and scope.

Claims

1. A compute node fault handling method, in which step X is executed if a fault processing condition is met, step X being: shutting down the virtual machine running on the computing node, characterized in that: the fault processing condition comprises that no heartbeat detection signal sent by the management node is received within a preset time length.

2. The compute node fault handling method of claim 1, wherein: step X is specifically to shut down the virtual machine on the computing node from the kernel.

3. The compute node fault handling method of claim 2, wherein: step X is specifically to shut down the virtual machine that runs on the computing node and uses non-local resources.

4. The compute node fault handling method of claim 3, wherein: in step X, a virtual machine that uses local resources is not shut down.

5. The compute node fault handling method of claim 1, wherein: the preset time length is three times the sending interval of the heartbeat detection signal.

6. The compute node fault handling method of claim 1, wherein: the fault processing condition specifically comprises that the kernel does not receive a heartbeat detection signal sent by the management node within the preset time length.

7. The compute node fault handling method of claim 1, wherein: the method is a fault processing method of the computing node kernel.

8. A computer-readable storage medium having stored thereon an executable computer program, characterized in that: the computer program, when executed, implements the compute node fault handling method as claimed in any one of claims 1 to 7.