CN113760459A

CN113760459A - Virtual machine fault detection method, storage medium and virtualization cluster

Info

Publication number: CN113760459A
Application number: CN202111027985.3A
Authority: CN
Inventors: 李义; 邹理贤; 刘建平
Original assignee: Aerospace Winhong Technology Guizhou Co ltd; Winhong Information Technology Co ltd
Current assignee: Aerospace Winhong Technology Guizhou Co ltd; Winhong Information Technology Co ltd
Priority date: 2021-09-02
Filing date: 2021-09-02
Publication date: 2021-12-07

Abstract

The invention discloses a virtual machine fault detection method, a method for a management node to perform fault detection on a virtual machine, a storage medium and a virtualization cluster. In the virtual machine fault detection method, if it is detected that the virtual machine has not issued an I/O request to its disk within a preset duration, the virtual machine is judged to have failed. The method judges whether a virtual machine has failed by monitoring its I/O requests. If the QEMU process of the virtual machine is blocked, the virtual machine has actually failed and cannot run even though the QEMU process still exists, and it can no longer issue I/O requests to its disk. Therefore, if the virtual machine is detected not to have issued an I/O request to its disk within the preset duration, it can be judged to have failed.

Description

Virtual machine fault detection method, storage medium and virtualization cluster

Technical Field

The invention relates to the technical field of virtualization, in particular to a virtual machine fault detection method, a storage medium and a virtualization cluster.

Background

With the rapid development of computer technology, more and more companies are paying attention to reducing energy consumption and improving resource utilization. Against this background, the cloud computing model emerged. Cloud computing abstracts all computers into computing resources and provides those resources to users, rather than providing one or more computers directly to a user as before. Users can therefore apply for computing resources according to their needs, which avoids unnecessary waste and improves resource utilization.

In order to save equipment cost and further improve resource utilization rate, a cloud computing builder adopts a virtualization technology to virtualize a plurality of virtual computers (i.e., virtual machines) on a single physical computer for respective use by a plurality of users, and the physical computer is called a computing node. A plurality of such physical computers form a virtualized cluster. The virtualization cluster performs unified management on a plurality of physical computers.

A virtualization cluster typically includes a management node and a plurality of computing nodes, and the management node monitors the health of each computing node and of the virtual machines running on it. The existing way of detecting the health state of a virtual machine is as follows: the computing node periodically checks whether the QEMU process of each virtual machine running on it is healthy; if so, the virtual machine is judged to be normal, and otherwise it is judged to have failed. However, this detection method is not accurate enough. For example, when the QEMU process is blocked, the process still exists but the virtual machine cannot run. The computing node detects that the QEMU process exists and judges the virtual machine to be normal, although the virtual machine has actually failed and cannot be used.

Disclosure of Invention

The technical problem to be solved by the present invention is to provide a virtual machine fault detection method, a method for a management node to perform fault detection on a virtual machine, a computer readable storage medium storing a computer program capable of implementing any one of the methods when executed, and a virtualization cluster, wherein the virtual machine fault detection method can detect whether a virtual machine is faulty or not relatively accurately.

In order to solve the technical problem, in the virtual machine fault detection method provided by the invention, if it is detected that the virtual machine has not issued an I/O request to its disk within a preset duration, the virtual machine is judged to have failed.

Optionally, whether the virtual machine issues an I/O request to its disk is detected by monitoring the disk path of the virtual machine.

Optionally, the management node sends the disk path of a virtual machine running on a computing node to that computing node.

Optionally, the computing node detects, according to the disk path, whether the virtual machine issues an I/O request to its disk, and periodically reports the detection result to the management node.

Optionally, the computing node is used to detect whether the virtual machine has not issued an I/O request to its disk within the preset duration.

Optionally, after the virtual machine is judged to have failed, the failed virtual machine is shut down.

Optionally, the management node causes the computing node on which the failed virtual machine is located to shut down the failed virtual machine from the kernel.

Optionally, the computing node, after judging that the virtual machine has failed, reports information about the failed virtual machine to the management node.
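As an illustration of the disk-path-based detection in the optional features above, the following minimal sketch decides, for one detection window, whether any I/O request was seen on each disk path. It assumes a cumulative per-path request counter is available; the function name `detect_io_in_window`, the counter itself, and the example paths are hypothetical and are not specified by the source, which performs the detection in the computing node kernel.

```python
from typing import Dict


def detect_io_in_window(start_counts: Dict[str, int],
                        end_counts: Dict[str, int]) -> Dict[str, bool]:
    """Return, per disk path, whether any I/O request was seen in the window.

    Both arguments map a disk path to a cumulative count of I/O requests
    observed for that path (a hypothetical counter; how it is collected
    is outside this sketch). A path missing from end_counts is treated
    as having seen no new I/O.
    """
    return {
        path: end_counts.get(path, start_counts[path]) > start_counts[path]
        for path in start_counts
    }


# Disk paths as sent by the management node (illustrative values only).
start = {"/vms/vm11/disk.img": 500, "/vms/vm12/disk.img": 120}
end = {"/vms/vm11/disk.img": 500, "/vms/vm12/disk.img": 134}

report = detect_io_in_window(start, end)
print(report)
```

The resulting dictionary is the kind of per-window detection result a computing node could report to the management node at regular intervals.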

A method for a management node to perform fault detection on a virtual machine, in which the management node sends the disk path information of a virtual machine running on a computing node to that computing node.

A computer readable storage medium having stored thereon an executable computer program which, when executed by a management node, implements a virtual machine fault detection method as described above, or a method of fault detection of a virtual machine by a management node as described above.

A computer readable storage medium having stored thereon an executable computer program which, when executed by a computing node, implements a virtual machine fault detection method as described above.

A virtualization cluster comprises a management node and a computing node in communication connection with the management node, wherein a virtual machine runs on the computing node, the management node sends a disk path of the virtual machine running on the computing node to the computing node, the computing node detects whether the virtual machine issues an I/O request to a disk of the virtual machine according to the disk path and reports a detection result to the management node, and if the management node detects that the virtual machine does not issue the I/O request to the disk of the virtual machine within a preset time through the detection result reported by the computing node, the management node judges that the virtual machine fails.

The virtual machine fault detection method judges whether a virtual machine has failed by monitoring its I/O requests. If the QEMU process of the virtual machine is blocked, the virtual machine has actually failed and cannot run even though the QEMU process still exists, and it can no longer issue I/O requests to its disk. Therefore, if the virtual machine is detected not to have issued an I/O request to its disk within the preset duration, it can be judged to have failed.
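The difference between the existing process-existence check and the I/O-based check can be sketched as follows. This is a minimal illustration rather than the patented implementation: `VmSample`, `process_alive` and `seconds_since_last_io` are hypothetical names, and the 90 s threshold is taken from the embodiments as an assumed value.

```python
from dataclasses import dataclass

PRESET_SECONDS = 90  # assumed threshold; the embodiments use 90 s


@dataclass
class VmSample:
    """Hypothetical snapshot of one virtual machine's observed state."""
    name: str
    process_alive: bool           # does the QEMU process still exist?
    seconds_since_last_io: float  # time since the last I/O request to the VM disk


def healthy_by_process_check(vm: VmSample) -> bool:
    # Existing approach: the VM counts as normal whenever its QEMU process exists.
    return vm.process_alive


def faulty_by_io_check(vm: VmSample, preset: float = PRESET_SECONDS) -> bool:
    # Approach described here: no I/O request within the preset duration means fault.
    return vm.seconds_since_last_io >= preset


# A VM whose QEMU process is blocked: the process exists, but no I/O is issued.
blocked = VmSample("vm11", process_alive=True, seconds_since_last_io=90.0)
print(healthy_by_process_check(blocked))  # True: the process check is fooled
print(faulty_by_io_check(blocked))        # True: the I/O check flags the VM
```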

Drawings

FIG. 1 is a block diagram of a system architecture for a virtualization cluster.

Detailed Description

The invention is described in detail below with reference to specific embodiments.

As shown in fig. 1, a virtualized cluster includes a management node and three compute nodes 1, 2, 3 communicatively connected to the management node. Virtual machines run on the computing nodes 1, 2, and 3, respectively. The management node includes a processor and a computer-readable storage medium storing an executable computer program, the processor executing the computer program to implement the functions of the management node. Each computing node includes a processor and a computer-readable storage medium having an executable computer program stored therein, the computer program being executed by the processor of the computing node to perform the functions of the computing node.

Example one

The management node sends its own IP address and port name to each of the computing nodes 1, 2 and 3. After receiving the IP address and port name of the management node, each of the computing nodes 1, 2 and 3 passes them to its own kernel through netlink communication, and a TCP heartbeat detection mechanism is then established between the kernel and the management node, so that the management node has a TCP socket communication channel with the kernel of each of the computing nodes 1, 2 and 3. Through these communication channels, the management node sends to the kernel of each computing node the disk paths of the virtual machines running on that computing node, together with a detection policy. The detection policy comprises a detection mode and a detection feedback mode. The management node uses the computing nodes to detect whether a virtual machine has not issued an I/O request to its disk within a preset duration.
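One possible shape for the message carrying the disk paths and the detection policy is sketched below. The source does not specify a wire format, so the JSON encoding, the class names `DetectionPolicy` and `DetectionTask`, the field names, and the `"disk_path_io"` mode value are all assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class DetectionPolicy:
    detection_mode: str       # how to detect (hypothetical value below)
    feedback_interval_s: int  # detection feedback mode: report every N seconds


@dataclass
class DetectionTask:
    disk_paths: List[str]     # disk paths of the VMs running on the target node
    policy: DetectionPolicy

    def to_json(self) -> str:
        # asdict() converts the nested dataclass into plain dictionaries.
        return json.dumps(asdict(self))

    @staticmethod
    def from_json(text: str) -> "DetectionTask":
        raw = json.loads(text)
        return DetectionTask(raw["disk_paths"], DetectionPolicy(**raw["policy"]))


task = DetectionTask(
    disk_paths=["/vms/vm11/disk.img", "/vms/vm12/disk.img"],  # illustrative paths
    policy=DetectionPolicy(detection_mode="disk_path_io", feedback_interval_s=30),
)
decoded = DetectionTask.from_json(task.to_json())
print(decoded == task)  # True: the message survives a round trip
```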

Taking computing node 1 as an example, after receiving the disk paths and the detection policy of the virtual machines 11, 12 and 13, the kernel of computing node 1 monitors the disk paths of the virtual machines 11, 12 and 13 according to the detection mode in the detection policy, thereby detecting whether the virtual machines 11, 12 and 13 issue I/O requests to their disks. According to the detection feedback mode in the detection policy, it sends a virtual machine detection heartbeat packet to the management node every 30 s to report the detection result for that 30 s period. Assume that the kernel of computing node 1 starts detection at 9:00:00. If the QEMU process of virtual machine 11 is blocked, virtual machine 11 has actually failed even though its QEMU process still exists, and it cannot issue I/O requests to its disk. Suppose that none of the virtual machines 11, 12 and 13 issues an I/O request to its disk from 9:00:00 to 9:00:30. The kernel of computing node 1 then detects no I/O request from the virtual machines 11, 12 and 13 during that period, and at 9:00:30 it generates a virtual machine detection heartbeat packet containing the detection result for 9:00:00 to 9:00:30 and sends it to the management node. After receiving the virtual machine detection heartbeat packet, the management node sends a virtual machine detection reply heartbeat packet to the kernel of computing node 1 to inform it that the management node has received the virtual machine detection heartbeat packet, and records that the virtual machines 11, 12 and 13 issued no I/O request from 9:00:00 to 9:00:30.

Then, from 9:00:30 to 9:01:00, the kernel of computing node 1 still does not detect any I/O request from virtual machine 11 to its disk, but detects that the virtual machines 12 and 13 issue I/O requests to their disks. At 9:01:00 it generates a virtual machine detection heartbeat packet containing the detection result for 9:00:30 to 9:01:00 and sends it to the management node. After receiving the virtual machine detection heartbeat packet, the management node sends a virtual machine detection reply heartbeat packet to the kernel of computing node 1, and records that virtual machine 11 issued no I/O request from 9:00:30 to 9:01:00 and that the virtual machines 12 and 13 issued I/O requests during that period. Then, from 9:01:00 to 9:01:30, the kernel of computing node 1 still does not detect any I/O request from virtual machine 11 to its disk, but detects that the virtual machines 12 and 13 issue I/O requests to their disks. At 9:01:30 it generates a virtual machine detection heartbeat packet containing the detection result for 9:01:00 to 9:01:30 and sends it to the management node. After receiving the virtual machine detection heartbeat packet, the management node sends a virtual machine detection reply heartbeat packet to the kernel of computing node 1, and records that virtual machine 11 issued no I/O request from 9:01:00 to 9:01:30 and that the virtual machines 12 and 13 issued I/O requests during that period.

At this point, the management node has received three virtual machine detection heartbeat packets reporting that virtual machine 11 issued no I/O request; that is, virtual machine 11 has not issued an I/O request to its disk within the preset duration of 90 s. The management node accordingly judges that virtual machine 11 has failed and sends a virtual machine 11 fault message to the kernel of computing node 1, so that computing node 1 shuts down virtual machine 11 from the kernel. Because the virtual machines 12 and 13 issued I/O requests from 9:00:30 to 9:01:30, the management node judges that the virtual machines 12 and 13 are running normally.
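The management node's bookkeeping in this example can be sketched as a counter of consecutive 30 s windows without I/O, with a fault declared once the count covers the 90 s preset duration. The class name `VmFaultTracker` and the packet layout are hypothetical, and resetting the counter whenever I/O is seen is an assumption of this sketch rather than something the source states explicitly.

```python
from collections import defaultdict
from typing import Dict, List

REPORT_INTERVAL_S = 30
PRESET_S = 90
WINDOWS_FOR_FAULT = PRESET_S // REPORT_INTERVAL_S  # 3 windows of 30 s each


class VmFaultTracker:
    """Management-node side: tallies idle windows per virtual machine."""

    def __init__(self) -> None:
        self._idle_windows: Dict[str, int] = defaultdict(int)

    def on_heartbeat(self, io_seen: Dict[str, bool]) -> List[str]:
        """Record one detection heartbeat packet; return VMs now judged faulty."""
        faulty = []
        for vm, seen in io_seen.items():
            if seen:
                self._idle_windows[vm] = 0   # assumed: any I/O resets the tally
            else:
                self._idle_windows[vm] += 1
            if self._idle_windows[vm] >= WINDOWS_FOR_FAULT:
                faulty.append(vm)
        return faulty


tracker = VmFaultTracker()
timeline = [
    {"vm11": False, "vm12": False, "vm13": False},  # 9:00:00 to 9:00:30
    {"vm11": False, "vm12": True, "vm13": True},    # 9:00:30 to 9:01:00
    {"vm11": False, "vm12": True, "vm13": True},    # 9:01:00 to 9:01:30
]
results = [tracker.on_heartbeat(packet) for packet in timeline]
print(results)  # [[], [], ['vm11']]
```

Replaying the three heartbeat packets of the example yields a fault only for virtual machine 11, and only after the third packet.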

Similarly, after receiving the disk paths of the virtual machines 21, 22, and 23, the kernel of the compute node 2 detects the disk paths of the virtual machines 21, 22, and 23, respectively, so as to detect whether the virtual machines 21, 22, and 23 issue an I/O request to their disks, and sends a virtual machine detection heartbeat packet to the management node every 30s, thereby reporting the detection result during this 30s period. The same is true of the compute node 3 kernel.

In this embodiment, the management node issues a detection policy to each of the computing nodes 1, 2 and 3, so that the computing nodes 1, 2 and 3 detect the virtual machines according to the detection policy. In other embodiments, less preferably, the detection policy may instead be set directly on each of the computing nodes 1, 2 and 3. In that case, after receiving the disk path of a virtual machine, each computing node automatically detects the virtual machine according to its local detection policy.

The kernels of the computing nodes 1, 2 and 3 not only periodically send virtual machine detection heartbeat packets to the management node to report the virtual machine detection results, but also periodically send computing node health heartbeat packets to the management node to report the health state of the computing nodes. After receiving the computing node health heartbeat packets sent by the kernels of the computing nodes 1, 2 and 3, the management node sends a computing node health reply heartbeat packet to each of those kernels to inform it that the management node has received its computing node health heartbeat packet. Suppose that each of the computing nodes 1, 2 and 3 also sends a computing node health heartbeat packet to the management node every 30 s, and that the kernels of the computing nodes 1, 2 and 3 start detection at 9:00:00. At 9:00:30, the kernels of the computing nodes 1, 2 and 3 send a virtual machine detection heartbeat packet to the management node and also send a computing node health heartbeat packet. Assume that the management node receives only the health heartbeat packets of computing nodes 1 and 2 and does not receive that of computing node 3. The management node records computing nodes 1 and 2 as healthy and computing node 3 as undetermined, and then sends computing node health reply heartbeat packets to the kernels of computing nodes 1 and 2. At this time, the kernels of computing nodes 1 and 2 receive the computing node health reply heartbeat packet sent by the management node, and the kernel of computing node 3 does not. The kernel of computing node 3 records that the computing node health reply heartbeat packet has been missed once.

Next, at 9:01:00, the kernels of the computing nodes 1, 2 and 3 again send computing node health heartbeat packets to the management node. The management node receives only the health heartbeat packets of computing nodes 1 and 2 and not that of computing node 3; it records computing nodes 1 and 2 as healthy and computing node 3 as undetermined, and then sends computing node health reply heartbeat packets to the kernels of computing nodes 1 and 2. At this time, the kernels of computing nodes 1 and 2 receive the computing node health reply heartbeat packet sent by the management node, and the kernel of computing node 3 does not. The kernel of computing node 3 records that the computing node health reply heartbeat packet has been missed twice in a row. Next, at 9:01:30, the kernels of the computing nodes 1, 2 and 3 again send computing node health heartbeat packets to the management node. The management node still receives only the health heartbeat packets of computing nodes 1 and 2 and not that of computing node 3. The management node has now failed three consecutive times to receive the computing node health heartbeat packet from computing node 3; that is, it has received no such packet from computing node 3 within the preset duration of 90 s (three times the 30 s sending interval of the computing node health heartbeat packet). Accordingly, the management node records computing node 3 as faulty, records computing nodes 1 and 2 as healthy, and then sends computing node health reply heartbeat packets to the kernels of computing nodes 1 and 2.

At this time, the kernels of computing nodes 1 and 2 receive the computing node health reply heartbeat packet sent by the management node, and the kernel of computing node 3 does not. The kernel of computing node 3 records that the computing node health reply heartbeat packet has been missed three times in a row; that is, it has not received the computing node health reply heartbeat packet (i.e., the heartbeat detection signal) from the management node within the preset duration of 90 s, and it accordingly judges that computing node 3 itself has failed. Assuming that virtual machine 33 running on computing node 3 uses a remote disk shared with computing nodes 1 and 2 and does not use the local storage resources of computing node 3, the kernel of computing node 3, having judged that the node has failed, shuts down virtual machine 33 from the kernel. Because the kernel of computing node 3 shuts down virtual machine 33 on its own initiative, this avoids the situation in which the virtual machine cannot be shut down because the kernel cannot respond to a shutdown command from the management node. By the time the management node judges that computing node 3 has failed, computing node 3 has likewise judged itself failed and shut down virtual machine 33, so the management node can restart virtual machine 33 on computing node 1 or computing node 2.
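The symmetric miss counting on both sides of the node health heartbeat can be sketched as below. `MissCounter` is a hypothetical helper, and the simulation assumes, as in the example, that every heartbeat from computing node 3 and every reply to it is lost.

```python
class MissCounter:
    """Counts consecutive expected heartbeats that never arrived."""

    def __init__(self, limit: int = 3) -> None:
        self.limit = limit  # 3 misses x 30 s interval = 90 s preset duration
        self.misses = 0

    def tick(self, received: bool) -> bool:
        """Call once per 30 s interval; returns True once the limit is reached."""
        self.misses = 0 if received else self.misses + 1
        return self.misses >= self.limit


# The management node's view of computing node 3, and node 3's view of the replies.
mgmt_view_of_node3 = MissCounter()
node3_view_of_replies = MissCounter()

states = []
for _ in range(3):  # intervals ending at 9:00:30, 9:01:00 and 9:01:30
    node_failed = mgmt_view_of_node3.tick(received=False)
    self_failed = node3_view_of_replies.tick(received=False)
    states.append((node_failed, self_failed))

print(states)  # [(False, False), (False, False), (True, True)]
```

Because both sides use the same interval and limit, they reach the fault verdict in the same interval, which is what allows the management node to restart the virtual machine elsewhere once the failed node has shut it down.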

In this embodiment, the failed computing node 3 shuts down only virtual machine 33, which uses non-local resources, and for the time being does not handle the virtual machines 31 and 32, which use local resources. Alternatively, though less preferably, all of the virtual machines 31, 32 and 33 running on the failed computing node 3 may be shut down.
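The choice of which virtual machines a failed computing node shuts down can be sketched as a simple filter. The `Vm` record, its `uses_local_storage` flag, and the `shut_down_all` switch for the less preferred alternative are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Vm:
    name: str
    uses_local_storage: bool  # True if the VM depends on the node's local disks


def vms_to_shut_down(vms: List[Vm], shut_down_all: bool = False) -> List[str]:
    """Pick the VMs a self-diagnosed failed node should shut down."""
    if shut_down_all:
        # Less preferred alternative: shut down everything on the node.
        return [vm.name for vm in vms]
    # Preferred behaviour: only VMs on shared remote disks, which another
    # node is able to restart.
    return [vm.name for vm in vms if not vm.uses_local_storage]


node3_vms = [Vm("vm31", True), Vm("vm32", True), Vm("vm33", False)]
print(vms_to_shut_down(node3_vms))        # ['vm33']
print(vms_to_shut_down(node3_vms, True))  # ['vm31', 'vm32', 'vm33']
```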

Example two

The present embodiment is substantially the same as the first embodiment, and only the differences between the present embodiment and the first embodiment will be described below.

In the first embodiment, the computing node only reports to the management node the detection result on whether each virtual machine issued I/O requests; the management node judges from the detection result whether the virtual machine has failed, and the computing node itself makes no such judgment. In this embodiment, the computing node instead judges from the detection result whether the virtual machine has failed. Still taking computing node 1 as an example, assume that the kernel of computing node 1 starts detection at 9:00:00. Suppose that none of the virtual machines 11, 12 and 13 issues an I/O request to its disk from 9:00:00 to 9:00:30. The kernel of computing node 1 then detects no I/O request from the virtual machines 11, 12 and 13 during that period and records that the virtual machines 11, 12 and 13 issued no I/O request from 9:00:00 to 9:00:30. At this point the virtual machines 11, 12 and 13 have gone only 30 s without issuing an I/O request to their disks, which is short of the preset duration of 90 s, so the kernel of computing node 1 judges that the virtual machines 11, 12 and 13 are healthy at this time, generates a virtual machine health heartbeat packet stating that the virtual machines 11, 12 and 13 are healthy, and sends it to the management node. After receiving the virtual machine health heartbeat packet, the management node immediately sends a virtual machine health reply heartbeat packet to the kernel of computing node 1, informing computing node 1 that the management node has received the virtual machine health heartbeat packet.

Then, from 9:00:30 to 9:01:00, the kernel of computing node 1 still does not detect any I/O request from virtual machine 11 to its disk, but detects that the virtual machines 12 and 13 issue I/O requests to their disks. It records that virtual machine 11 issued no I/O request from 9:00:30 to 9:01:00 and that the virtual machines 12 and 13 issued I/O requests during that period. At this point virtual machine 11 has gone only 60 s without issuing an I/O request to its disk, which is short of the preset duration of 90 s, so the kernel of computing node 1 judges that the virtual machines 11, 12 and 13 are healthy at this time, generates a virtual machine health heartbeat packet stating that the virtual machines 11, 12 and 13 are healthy, and sends it to the management node. After receiving the virtual machine health heartbeat packet, the management node immediately sends a virtual machine health reply heartbeat packet to the kernel of computing node 1. Then, from 9:01:00 to 9:01:30, the kernel of computing node 1 still does not detect any I/O request from virtual machine 11 to its disk, but detects that the virtual machines 12 and 13 issue I/O requests to their disks. It records that virtual machine 11 issued no I/O request from 9:01:00 to 9:01:30 and that the virtual machines 12 and 13 issued I/O requests during that period.

At this point virtual machine 11 has gone 90 s without issuing an I/O request to its disk; that is, it has not issued an I/O request to its disk within the preset duration of 90 s. The kernel of computing node 1 therefore judges that virtual machine 11 has failed and that the virtual machines 12 and 13 are still healthy. It then shuts down virtual machine 11 from the kernel, generates a virtual machine health heartbeat packet stating that virtual machine 11 is faulty and that the virtual machines 12 and 13 are healthy, and sends it to the management node. In this way, once the kernel of computing node 1 judges that virtual machine 11 has failed, it reports the information of the failed virtual machine to the management node.
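In this second embodiment the judgment moves to the computing node, which can be sketched as tracking idle seconds per virtual machine and emitting a health verdict in each heartbeat. `LocalVmJudge`, the payload layout, and the reset-on-I/O behaviour are assumptions of this sketch and are not taken from the source.

```python
from typing import Dict

WINDOW_S = 30   # one detection window
PRESET_S = 90   # preset duration from the embodiment


class LocalVmJudge:
    """Computing-node side: judges VM health locally and builds the heartbeat."""

    def __init__(self) -> None:
        self.idle_s: Dict[str, int] = {}

    def end_of_window(self, io_seen: Dict[str, bool]) -> Dict[str, str]:
        """Update idle time per VM and return the health heartbeat payload."""
        payload = {}
        for vm, seen in io_seen.items():
            # Assumed: any I/O in the window resets the idle time to zero.
            self.idle_s[vm] = 0 if seen else self.idle_s.get(vm, 0) + WINDOW_S
            payload[vm] = "faulty" if self.idle_s[vm] >= PRESET_S else "healthy"
        return payload


judge = LocalVmJudge()
windows = [
    {"vm11": False, "vm12": False, "vm13": False},  # 9:00:00 to 9:00:30
    {"vm11": False, "vm12": True, "vm13": True},    # 9:00:30 to 9:01:00
    {"vm11": False, "vm12": True, "vm13": True},    # 9:01:00 to 9:01:30
]
payloads = [judge.end_of_window(w) for w in windows]
print(payloads[-1])  # {'vm11': 'faulty', 'vm12': 'healthy', 'vm13': 'healthy'}
```

Compared with the first embodiment, the management node here only receives verdicts, while the raw per-window results stay on the computing node.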

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and not to limit its scope of protection. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from their spirit and scope.

Claims

1. A virtual machine fault detection method, characterized in that: if it is detected that the virtual machine has not issued an I/O request to its disk within a preset duration, the virtual machine is judged to have failed.

2. The virtual machine fault detection method of claim 1, wherein: specifically, whether the virtual machine issues an I/O request to a virtual machine disk is detected by detecting the disk path of the virtual machine.

3. The virtual machine fault detection method of claim 2, wherein: the management node sends the disk path of a virtual machine running on a computing node to that computing node.

4. The virtual machine fault detection method of claim 3, wherein: the computing node is caused to detect, according to the disk path, whether the virtual machine issues an I/O request to its disk, and to report the detection result to the management node periodically.

5. The virtual machine fault detection method of claim 1, wherein: specifically, whether the virtual machine does not issue an I/O request to a virtual machine disk within a preset time length is detected through a computing node.

6. The virtual machine fault detection method of claim 1, wherein: after the virtual machine is judged to have failed, the failed virtual machine is shut down.

7. The virtual machine fault detection method of claim 6, wherein: the management node causes the computing node on which the failed virtual machine is located to shut down the failed virtual machine from the kernel.

8. The virtual machine fault detection method of claim 1, wherein: the computing node, after judging that the virtual machine has failed, reports information about the failed virtual machine to the management node.

9. A method for a management node to perform fault detection on a virtual machine, characterized by comprising: sending the disk path information of a virtual machine running on a computing node to that computing node.

10. The method for a management node to perform fault detection on a virtual machine of claim 9, wherein: the computing node is caused to detect, according to the disk path, whether the virtual machine issues an I/O request to its disk, and to report the detection result to the management node periodically.

11. A computer-readable storage medium having stored thereon an executable computer program, characterized by: the computer program, when executed by a management node, implements a virtual machine fault detection method as claimed in any one of claims 1 to 7, or implements a method for fault detection of a virtual machine by a management node as claimed in claim 9 or 10.

12. A computer-readable storage medium having stored thereon an executable computer program, characterized by: the computer program, when executed by a compute node, implements a virtual machine fault detection method as claimed in claim 1, 2, 6 or 8.

13. A virtualization cluster comprises a management node and a computing node which is in communication connection with the management node, wherein a virtual machine runs on the computing node, and the virtualization cluster is characterized in that: the management node sends a disk path of a virtual machine running on the computing node to the computing node, the computing node detects whether the virtual machine issues an I/O request to a disk of the virtual machine according to the disk path and reports a detection result to the management node, and if the management node detects that the virtual machine does not issue the I/O request to the disk of the virtual machine within a preset time length through the detection result reported by the computing node, the management node judges that the virtual machine is in a fault.