CN113765748A - Method for processing fault of computing node and computer readable storage medium


Info

Publication number
CN113765748A
CN113765748A CN202111027451.0A CN202111027451A CN113765748A CN 113765748 A CN113765748 A CN 113765748A CN 202111027451 A CN202111027451 A CN 202111027451A CN 113765748 A CN113765748 A CN 113765748A
Authority
CN
China
Prior art keywords
node
virtual machine
computing
computing node
fault
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111027451.0A
Other languages
Chinese (zh)
Inventor
李义
邹理贤
刘建平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aerospace Winhong Technology Guizhou Co ltd
Winhong Information Technology Co ltd
Original Assignee
Aerospace Winhong Technology Guizhou Co ltd
Winhong Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aerospace Winhong Technology Guizhou Co ltd and Winhong Information Technology Co ltd
Priority to CN202111027451.0A
Publication of CN113765748A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/10 Active monitoring, e.g. heartbeat, ping or trace-route
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1415 Saving, restoring, recovering or retrying at system level
    • G06F11/1438 Restarting or rejuvenating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0663 Performing the actions predefined by failover planning, e.g. switching to standby network elements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G06F2009/45575 Starting, stopping, suspending or resuming virtual machine instances

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Health & Medical Sciences (AREA)
  • Cardiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a method for handling a fault of a computing node and a computer-readable storage medium. In the method, if a fault handling condition is met, step X is executed: the virtual machines running on the computing node are shut down. The fault handling condition includes that no heartbeat detection signal sent by the management node is received within a preset duration. The method is executed by the computing node itself: the computing node determines on its own whether it has failed and, if so, automatically shuts down the virtual machines running on it, so that they can be restarted on other computing nodes.

Description

Method for processing fault of computing node and computer readable storage medium
Technical Field
The present invention relates to the field of virtualization technologies, and in particular, to a method for processing a failure of a compute node and a computer-readable storage medium.
Background
With the rapid development of computer technology, more and more companies have begun to pay attention to reducing energy consumption and improving resource utilization, and against this background the cloud computing model emerged. Cloud computing abstracts all computers into computing resources and provides those resources to users, rather than providing one or more computers directly to a user as before. Users can therefore apply for computing resources according to their own needs, which avoids unnecessary waste and improves resource utilization.
To save equipment cost and further improve resource utilization, builders of cloud computing platforms use virtualization technology to create multiple virtual computers (i.e., virtual machines) on a single physical computer for use by multiple users; such a physical computer is called a computing node. A plurality of such physical computers form a virtualized cluster, which manages them in a unified manner.
A virtualized cluster typically includes a management node and a plurality of computing nodes, with the management node monitoring the health of each computing node. The management node performs fault detection on each computing node by means of heartbeat packets: if the management node does not receive a heartbeat packet from a computing node within a preset duration, that computing node is considered to have failed. To improve the service continuity of the virtualized cluster and allow virtual machines to recover their running services promptly after a failure, the management node, after detecting that a computing node has failed, sends an instruction to the failed computing node to shut down all the virtual machines running on it, and then restarts those virtual machines on other computing nodes in the cluster to restore their services. The virtual machine failure is thus not perceived by the client, which ensures service continuity. However, when a computing node fails, it may be unable to respond to the instruction from the management node, so its virtual machines may not be shut down. As long as the original computing node has not shut down the virtual machines, other computing nodes cannot start them; if a virtual machine is started forcibly anyway, its data may be damaged.
Disclosure of Invention
The present invention is directed to a method for processing a failure of a compute node, which is capable of shutting down a virtual machine running on the failed compute node to enable the virtual machine to be restarted on another compute node, and a computer-readable storage medium storing a computer program that, when executed, implements the method.
In order to solve the above technical problem, in the method for handling a fault of a computing node according to the present invention, if a fault handling condition is met, step X is executed: the virtual machine running on the computing node is shut down, wherein the fault handling condition includes that no heartbeat detection signal sent by the management node is received within a preset duration.
Optionally, step X is specifically to shut down the virtual machine on the current computing node from the kernel.
Optionally, step X is specifically to shut down the virtual machines that run on the current computing node and use non-local resources.
Optionally, in step X, virtual machines that use local resources are not shut down.
Optionally, the preset duration is three times the sending interval of the heartbeat detection signal.
Optionally, the fault handling condition specifically includes that the kernel does not receive a heartbeat detection signal sent by the management node within the preset duration.
Optionally, the method is a processing method of checking faults in the computing node.
A computer-readable storage medium having stored thereon an executable computer program, characterized by: the computer program when executed implements a compute node failure handling method as described above.
The method for handling a fault of a computing node is executed by the computing node itself: the computing node determines on its own whether it has failed and, if so, automatically shuts down the virtual machines running on it, so that they can be restarted on other computing nodes.
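As an illustration only, the fault handling condition and step X can be sketched in a few lines of Python. This is not the implementation disclosed here: the names `fault_condition_met`, `step_x` and `handle`, the 30 s interval (borrowed from the embodiments below), and the use of plain numeric timestamps are all assumptions made for the sketch.

```python
from typing import Callable, Iterable, List

HEARTBEAT_INTERVAL_S = 30.0                   # assumed interval, as in the embodiments
PRESET_DURATION_S = 3 * HEARTBEAT_INTERVAL_S  # optional feature: three times the interval


def fault_condition_met(last_signal_at: float, now: float,
                        preset_s: float = PRESET_DURATION_S) -> bool:
    """True when no heartbeat detection signal arrived within the preset duration."""
    return now - last_signal_at >= preset_s


def step_x(vms: Iterable[str], shut_down: Callable[[str], None]) -> List[str]:
    """Step X: shut down the virtual machines running on this computing node."""
    closed = []
    for vm in vms:
        shut_down(vm)  # hypothetical hook; a real node would act from its kernel
        closed.append(vm)
    return closed


def handle(last_signal_at: float, now: float, vms: Iterable[str],
           shut_down: Callable[[str], None]) -> List[str]:
    """Run step X only if the fault handling condition is met."""
    if fault_condition_met(last_signal_at, now):
        return step_x(vms, shut_down)
    return []
```

With these assumed values, a node whose last heartbeat detection signal arrived at time 0 takes no action at 60 s and executes step X at 90 s.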
Drawings
FIG. 1 is a block diagram of a system architecture for a virtualization cluster.
Detailed Description
The invention is described in detail below with reference to specific embodiments.
As shown in fig. 1, a virtualized cluster includes a management node and three compute nodes 1, 2, 3 communicatively connected to the management node. Virtual machines run on the computing nodes 1, 2, and 3, respectively. The management node includes a processor and a computer-readable storage medium storing an executable computer program, the processor executing the computer program to implement the functions of the management node. Each computing node includes a processor and a computer-readable storage medium having an executable computer program stored therein, the computer program being executed by the processor of the computing node to perform the functions of the computing node.
Example one
The management node sends its own IP address and port name to each of computing nodes 1, 2 and 3. After receiving the IP address and port name of the management node, each of computing nodes 1, 2 and 3 passes them to its own kernel through netlink communication, and a TCP heartbeat detection mechanism is then established between the kernel and the management node, so that the management node has a TCP socket communication channel with the kernel of each of computing nodes 1, 2 and 3. Through these channels, the management node sends to the kernel of each computing node the disk paths of the virtual machines running on that node together with a detection policy, which comprises a detection mode and a detection feedback mode. In this way the management node detects, through the computing nodes, whether a virtual machine has issued no I/O request to its disk within a preset duration.
Taking computing node 1 as an example, after receiving the disk paths and detection policies of virtual machines 11, 12 and 13, the kernel of computing node 1 monitors the disk paths of virtual machines 11, 12 and 13 according to the detection mode in the detection policy, thereby detecting whether each of them issues I/O requests to its disk. According to the detection feedback mode in the detection policy, it sends a virtual machine detection heartbeat packet to the management node every 30 s to report the detection result for that 30 s period. Assume that the kernel of computing node 1 starts detection at 9:00:00, and that the QEMU process of virtual machine 11 is blocked: although the QEMU process still exists, the virtual machine has in fact failed and cannot issue I/O requests to its disk. If none of virtual machines 11, 12 and 13 issues an I/O request to its disk during 9:00:00 to 9:00:30, the kernel of computing node 1 detects no I/O request from them during that period, and at 9:00:30 it generates a virtual machine detection heartbeat packet containing the detection result for 9:00:00 to 9:00:30 and sends it to the management node. After receiving the virtual machine detection heartbeat packet, the management node sends a virtual machine detection reply heartbeat packet to the kernel of computing node 1 to inform it that the management node has received the virtual machine detection heartbeat packet, and records that virtual machines 11, 12 and 13 issued no I/O request during 9:00:00 to 9:00:30.
Then, during 9:00:30 to 9:01:00, the kernel of computing node 1 still does not detect virtual machine 11 issuing an I/O request to its disk, but does detect virtual machines 12 and 13 issuing I/O requests to their disks. At 9:01:00 it generates a virtual machine detection heartbeat packet containing the detection result for 9:00:30 to 9:01:00 and sends it to the management node. After receiving the virtual machine detection heartbeat packet, the management node sends a virtual machine detection reply heartbeat packet to the kernel of computing node 1, and records that virtual machine 11 issued no I/O request during 9:00:30 to 9:01:00 and that virtual machines 12 and 13 issued I/O requests during that period. Then, during 9:01:00 to 9:01:30, the kernel of computing node 1 still does not detect virtual machine 11 issuing an I/O request to its disk, but does detect virtual machines 12 and 13 issuing I/O requests to their disks. At 9:01:30 it generates a virtual machine detection heartbeat packet containing the detection result for 9:01:00 to 9:01:30 and sends it to the management node. After receiving the virtual machine detection heartbeat packet, the management node sends a virtual machine detection reply heartbeat packet to the kernel of computing node 1, and records that virtual machine 11 issued no I/O request during 9:01:00 to 9:01:30 and that virtual machines 12 and 13 issued I/O requests during that period.
At this point, the management node has received three consecutive virtual machine detection heartbeat packets reporting that virtual machine 11 issued no I/O request, that is, virtual machine 11 has issued no I/O request to its disk within the preset duration of 90 s. Based on these heartbeat packets, the management node determines that virtual machine 11 has failed and sends a virtual machine 11 fault message to the kernel of computing node 1, so that computing node 1 shuts down virtual machine 11 from the kernel. Because virtual machines 12 and 13 issued I/O requests during 9:00:30 to 9:01:30, the management node determines that virtual machines 12 and 13 are running normally.
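The management node's bookkeeping in this walk-through can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the class `VmIoMonitor`, the report format (a mapping from virtual machine number to whether I/O was seen in the last 30 s window), and the rule that any observed I/O resets the count are assumptions chosen to reproduce the timeline above.

```python
from collections import defaultdict
from typing import Dict, List

WINDOWS_TO_FAIL = 3  # 3 windows x 30 s = the 90 s preset duration in the example


class VmIoMonitor:
    """Management-node side: track consecutive windows without I/O per VM."""

    def __init__(self, windows_to_fail: int = WINDOWS_TO_FAIL) -> None:
        self.windows_to_fail = windows_to_fail
        self.idle_windows: Dict[int, int] = defaultdict(int)

    def on_detection_packet(self, report: Dict[int, bool]) -> List[int]:
        """Record one detection heartbeat packet; return the VMs judged faulty."""
        faulty = []
        for vm, issued_io in sorted(report.items()):
            if issued_io:
                self.idle_windows[vm] = 0  # assumed rule: any I/O resets the count
            else:
                self.idle_windows[vm] += 1
            if self.idle_windows[vm] >= self.windows_to_fail:
                faulty.append(vm)
        return faulty


monitor = VmIoMonitor()
results = [
    monitor.on_detection_packet({11: False, 12: False, 13: False}),  # 9:00:00 to 9:00:30
    monitor.on_detection_packet({11: False, 12: True, 13: True}),    # 9:00:30 to 9:01:00
    monitor.on_detection_packet({11: False, 12: True, 13: True}),    # 9:01:00 to 9:01:30
]
print(results)  # [[], [], [11]]
```

Only the third packet causes virtual machine 11 to be reported as faulty, at which point the management node would send the fault message to the kernel of computing node 1.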
Similarly, after receiving the disk paths of the virtual machines 21, 22, and 23, the kernel of the compute node 2 detects the disk paths of the virtual machines 21, 22, and 23, respectively, so as to detect whether the virtual machines 21, 22, and 23 issue an I/O request to their disks, and sends a virtual machine detection heartbeat packet to the management node every 30s, thereby reporting the detection result during this 30s period. The same is true of the compute node 3 kernel.
In this embodiment, the management node issues a detection policy to each of computing nodes 1, 2 and 3, so that they detect the virtual machines according to that policy. In other, less preferred embodiments, the detection policy may instead be set directly on each of computing nodes 1, 2 and 3. In that case, after receiving the disk path of a virtual machine, each of computing nodes 1, 2 and 3 automatically detects the virtual machine according to its local detection policy.
In addition to periodically sending virtual machine detection heartbeat packets to the management node to report virtual machine detection results, the kernels of computing nodes 1, 2 and 3 also periodically send computing node health heartbeat packets to the management node to report the health state of the computing nodes. After receiving the computing node health heartbeat packets sent by the kernels of computing nodes 1, 2 and 3, the management node sends a computing node health reply heartbeat packet to each of those kernels to inform it that the management node has received its computing node health heartbeat packet. Assume that each of computing nodes 1, 2 and 3 also sends a computing node health heartbeat packet to the management node every 30 s, and that the kernels of computing nodes 1, 2 and 3 start detection at 9:00:00. At 9:00:30, the kernels of computing nodes 1, 2 and 3 send not only virtual machine detection heartbeat packets but also computing node health heartbeat packets to the management node. Assume that the management node receives the health heartbeat packets of computing nodes 1 and 2 but not that of computing node 3. The management node then records computing nodes 1 and 2 as healthy and computing node 3 as undetermined, and sends computing node health reply heartbeat packets to the kernels of computing nodes 1 and 2. The kernels of computing nodes 1 and 2 receive the reply heartbeat packet, while the kernel of computing node 3 does not; the kernel of computing node 3 records that the computing node health reply heartbeat packet has been missed once.
Next, at 9:01:00, the kernels of computing nodes 1, 2 and 3 again send computing node health heartbeat packets to the management node. The management node again receives only the health heartbeat packets of computing nodes 1 and 2 and not that of computing node 3; it records computing nodes 1 and 2 as healthy and computing node 3 as undetermined, and sends computing node health reply heartbeat packets to the kernels of computing nodes 1 and 2. The kernels of computing nodes 1 and 2 receive the reply heartbeat packet, while the kernel of computing node 3 does not; the kernel of computing node 3 records that the computing node health reply heartbeat packet has been missed twice in a row. Next, at 9:01:30, the kernels of computing nodes 1, 2 and 3 once more send computing node health heartbeat packets to the management node. The management node still receives only the health heartbeat packets of computing nodes 1 and 2. Because the management node has now failed three consecutive times to receive a health heartbeat packet from computing node 3, that is, it has received no health heartbeat packet from computing node 3 within the preset duration of 90 s (three times the 30 s sending interval of the computing node health heartbeat packet), it records computing node 3 as faulty and computing nodes 1 and 2 as healthy, and then sends computing node health reply heartbeat packets to the kernels of computing nodes 1 and 2.
The kernels of computing nodes 1 and 2 receive the computing node health reply heartbeat packet sent by the management node, while the kernel of computing node 3 does not. The kernel of computing node 3 records that the computing node health reply heartbeat packet has been missed three times in a row, that is, it has not received a computing node health reply heartbeat packet (i.e., the heartbeat detection signal) from the management node within the preset duration of 90 s, and it accordingly determines that computing node 3 has failed. Assuming that virtual machine 33 running on computing node 3 uses a remote disk shared with computing nodes 1 and 2 and does not use the local storage resources of computing node 3, the kernel of computing node 3, having determined that the node has failed, shuts down virtual machine 33 from the kernel. Because the kernel of computing node 3 shuts down virtual machine 33 on its own initiative, the situation in which the virtual machine cannot be shut down because the node cannot respond to the management node's shutdown instruction is avoided. At the same time as the management node determines that computing node 3 has failed, computing node 3 also determines that it has failed and shuts down virtual machine 33, so the management node can restart virtual machine 33 on computing node 1 or computing node 2.
In this embodiment, the failed computing node 3 shuts down only virtual machine 33, which uses non-local resources, and for the time being takes no action on virtual machines 31 and 32, which use local resources. Less preferably, all of virtual machines 31, 32 and 33 running on the failed computing node 3 may be shut down instead.
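The three states that the management node records for each computing node in this walk-through (healthy, undetermined, faulty) can be sketched as a small tracker. The sketch is illustrative only: `NodeState`, `NodeHealthTracker` and `end_of_interval` are hypothetical names, and the behaviour after a heartbeat arrives from a node previously marked faulty (here it returns to healthy) is an assumption, since the description does not cover that case.

```python
from enum import Enum
from typing import Dict, Iterable, Set


class NodeState(Enum):
    HEALTHY = "healthy"
    UNDETERMINED = "undetermined"
    FAULT = "fault"


class NodeHealthTracker:
    """Management-node side: classify nodes by consecutive missed health heartbeats."""

    def __init__(self, nodes: Iterable[int], misses_to_fail: int = 3) -> None:
        self.misses_to_fail = misses_to_fail
        self.missed: Dict[int, int] = {n: 0 for n in nodes}
        self.state: Dict[int, NodeState] = {n: NodeState.HEALTHY for n in self.missed}

    def end_of_interval(self, received_from: Set[int]) -> Set[int]:
        """Update states after one 30 s interval; return the nodes to reply to."""
        for node in self.missed:
            if node in received_from:
                self.missed[node] = 0
                self.state[node] = NodeState.HEALTHY  # assumed recovery behaviour
            else:
                self.missed[node] += 1
                if self.missed[node] >= self.misses_to_fail:
                    self.state[node] = NodeState.FAULT
                else:
                    self.state[node] = NodeState.UNDETERMINED
        return set(received_from) & set(self.missed)


tracker = NodeHealthTracker([1, 2, 3])
history = []
for _ in range(3):  # 9:00:30, 9:01:00, 9:01:30
    tracker.end_of_interval({1, 2})
    history.append(tracker.state[3])
```

After the three intervals, `history` holds undetermined, undetermined, faulty for computing node 3, while computing nodes 1 and 2 remain healthy and are the only nodes that receive reply heartbeat packets.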
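The computing node's side of this embodiment, counting missed reply heartbeat packets and then choosing which virtual machines to shut down, can be sketched as below. The names `Vm`, `ReplyWatchdog` and `vms_to_shut_down`, and the boolean flag describing whether a virtual machine uses local resources, are hypothetical and serve only to illustrate the selection rule; the actual shutdown performed from the kernel is outside the sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Vm:
    vm_id: int
    uses_local_resources: bool


class ReplyWatchdog:
    """Compute-node side: count consecutive missing health reply heartbeat packets."""

    def __init__(self, misses_to_fail: int = 3) -> None:
        self.misses_to_fail = misses_to_fail
        self.missed = 0

    def end_of_interval(self, reply_received: bool) -> bool:
        """Return True once the node should consider itself failed."""
        self.missed = 0 if reply_received else self.missed + 1
        return self.missed >= self.misses_to_fail


def vms_to_shut_down(vms: List[Vm], only_non_local: bool = True) -> List[int]:
    """Preferred variant: only VMs that use non-local resources are shut down."""
    return [vm.vm_id for vm in vms
            if not (only_non_local and vm.uses_local_resources)]


vms = [Vm(31, True), Vm(32, True), Vm(33, False)]
watchdog = ReplyWatchdog()
verdicts = [watchdog.end_of_interval(False) for _ in range(3)]
preferred = vms_to_shut_down(vms)                        # only virtual machine 33
alternative = vms_to_shut_down(vms, only_non_local=False)  # all three virtual machines
```

The `only_non_local=False` call corresponds to the less preferred variant in which every virtual machine on the failed node is shut down.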
Example two
The present embodiment is substantially the same as the first embodiment, and only the differences between the present embodiment and the first embodiment will be described below.
In the first embodiment, the computing node only reports to the management node the detection result regarding the I/O requests issued by the virtual machines and does not itself determine whether a virtual machine has failed; the management node makes that determination from the detection result. In this embodiment, the computing node instead determines whether a virtual machine has failed according to the detection result. Still taking computing node 1 as an example, assume that the kernel of computing node 1 starts detection at 9:00:00. If none of virtual machines 11, 12 and 13 issues an I/O request to its disk during 9:00:00 to 9:00:30, the kernel of computing node 1 detects no I/O request from them during that period and records that virtual machines 11, 12 and 13 issued no I/O request during 9:00:00 to 9:00:30. At this point, virtual machines 11, 12 and 13 have gone only 30 s without issuing an I/O request to their disks, which has not yet reached 90 s, so the kernel of computing node 1 determines that virtual machines 11, 12 and 13 are healthy, and generates a virtual machine health heartbeat packet stating that virtual machines 11, 12 and 13 are healthy and sends it to the management node. After receiving the virtual machine health heartbeat packet, the management node immediately sends a virtual machine health reply heartbeat packet to the kernel of computing node 1 to inform computing node 1 that the management node has received the virtual machine health heartbeat packet.
Then, during 9:00:30 to 9:01:00, the kernel of computing node 1 still does not detect virtual machine 11 issuing an I/O request to its disk, but does detect virtual machines 12 and 13 issuing I/O requests to their disks, and records that virtual machine 11 issued no I/O request during 9:00:30 to 9:01:00 and that virtual machines 12 and 13 issued I/O requests during that period. At this point, virtual machine 11 has gone only 60 s without issuing an I/O request to its disk, which has not yet reached the preset duration of 90 s, so the kernel of computing node 1 determines that virtual machines 11, 12 and 13 are healthy, and generates a virtual machine health heartbeat packet stating that virtual machines 11, 12 and 13 are healthy and sends it to the management node. After receiving the virtual machine health heartbeat packet, the management node immediately sends a virtual machine health reply heartbeat packet to the kernel of computing node 1. Then, during 9:01:00 to 9:01:30, the kernel of computing node 1 still does not detect virtual machine 11 issuing an I/O request to its disk, but does detect virtual machines 12 and 13 issuing I/O requests to their disks, and records that virtual machine 11 issued no I/O request during 9:01:00 to 9:01:30 and that virtual machines 12 and 13 issued I/O requests during that period.
At this point, virtual machine 11 has gone 90 s without issuing an I/O request to its disk, that is, it has issued no I/O request within the preset duration of 90 s. The kernel of computing node 1 therefore determines that virtual machine 11 has failed while virtual machines 12 and 13 are still healthy, shuts down virtual machine 11 from the kernel, and generates a virtual machine health heartbeat packet stating that virtual machine 11 is faulty and virtual machines 12 and 13 are healthy, which it sends to the management node. After the kernel of the computing node 1 judges that the virtual machine 11 fails.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention and not to limit its scope of protection. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from their spirit and scope.
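The local judgement made by the computing node in this embodiment can be sketched as follows. It is a minimal illustration under assumptions: `LocalVmJudge`, the per-window input mapping, and the "healthy"/"faulty" strings used as the content of the virtual machine health heartbeat packet are hypothetical, and the 30 s window and 90 s preset duration are taken from the example.

```python
from typing import Dict, Iterable

WINDOW_S = 30   # detection window used in the example
PRESET_S = 90   # preset duration used in the example


class LocalVmJudge:
    """Compute-node kernel side (example two): judge VM health locally."""

    def __init__(self, vm_ids: Iterable[int]) -> None:
        self.idle_s: Dict[int, int] = {vm: 0 for vm in vm_ids}

    def end_of_window(self, issued_io: Dict[int, bool]) -> Dict[int, str]:
        """Return the status map carried in the VM health heartbeat packet."""
        packet = {}
        for vm in self.idle_s:
            if issued_io.get(vm, False):
                self.idle_s[vm] = 0            # assumed rule: I/O resets the idle time
            else:
                self.idle_s[vm] += WINDOW_S
            packet[vm] = "faulty" if self.idle_s[vm] >= PRESET_S else "healthy"
        return packet


judge = LocalVmJudge([11, 12, 13])
packets = [
    judge.end_of_window({11: False, 12: False, 13: False}),  # 9:00:00 to 9:00:30
    judge.end_of_window({11: False, 12: True, 13: True}),    # 9:00:30 to 9:01:00
    judge.end_of_window({11: False, 12: True, 13: True}),    # 9:01:00 to 9:01:30
]
```

Under these assumptions the first two packets report all three virtual machines as healthy, and the third reports virtual machine 11 as faulty, which is the point at which the kernel would shut it down.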

Claims (8)

1. A computing node fault handling method, in which step X is executed if a fault handling condition is met: shutting down the virtual machine running on the computing node, characterized in that: the fault handling condition includes that no heartbeat detection signal sent by the management node is received within a preset duration.
2. The computing node fault handling method of claim 1, wherein: step X is specifically to shut down the virtual machine on the computing node from the kernel.
3. The computing node fault handling method of claim 2, wherein: step X is specifically to shut down the virtual machines that run on the computing node and use non-local resources.
4. The computing node fault handling method of claim 3, wherein: in step X, virtual machines that use local resources are not shut down.
5. The computing node fault handling method of claim 1, wherein: the preset duration is three times the sending interval of the heartbeat detection signal.
6. The computing node fault handling method of claim 1, wherein: the fault handling condition specifically includes that the kernel does not receive a heartbeat detection signal sent by the management node within the preset duration.
7. The compute node fault handling method of claim 1 wherein: the method is a processing method for checking faults in the computing nodes.
8. A computer-readable storage medium having stored thereon an executable computer program, characterized by: the computer program when executed implements a compute node failure handling method as claimed in any one of claims 1 to 7.
CN202111027451.0A 2021-09-02 2021-09-02 Method for processing fault of computing node and computer readable storage medium Pending CN113765748A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111027451.0A CN113765748A (en) 2021-09-02 2021-09-02 Method for processing fault of computing node and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111027451.0A CN113765748A (en) 2021-09-02 2021-09-02 Method for processing fault of computing node and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN113765748A true CN113765748A (en) 2021-12-07

Family

ID=78792698

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111027451.0A Pending CN113765748A (en) 2021-09-02 2021-09-02 Method for processing fault of computing node and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN113765748A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104253860A (en) * 2014-09-11 2014-12-31 武汉噢易云计算有限公司 Shared storage message queue-based implementation method for high availability of virtual machines
CN109728981A (en) * 2019-03-19 2019-05-07 江苏汇智达信息科技有限公司 A kind of cloud platform fault monitoring method and device
CN111880906A (en) * 2020-08-03 2020-11-03 广东省华南技术转移中心有限公司 Virtual machine high-availability management method, system and storage medium


Similar Documents

Publication Publication Date Title
TWI746512B (en) Physical machine fault classification processing method and device, and virtual machine recovery method and system
US9489230B1 (en) Handling of virtual machine migration while performing clustering operations
CN103201724B (en) Providing application high availability in highly-available virtual machine environments
US10095576B2 (en) Anomaly recovery method for virtual machine in distributed environment
JP2001101033A (en) Fault monitoring method for operating system and application program
US7979744B2 (en) Fault model and rule based fault management apparatus in home network and method thereof
CN105024879A (en) Virtual machine fault detection and recovery system and virtual machine detection, recovery and starting method
US10120779B1 (en) Debugging of hosted computer programs
CN104158707A (en) Method and device of detecting and processing brain split in cluster
WO2023092772A1 (en) Method and device for implementing high availability of virtualized cluster
US20090138757A1 (en) Failure recovery method in cluster system
CN113965576A (en) Container-based big data acquisition method and device, storage medium and equipment
CN113760459A (en) Virtual machine fault detection method, storage medium and virtualization cluster
US20100085871A1 (en) Resource leak recovery in a multi-node computer system
CN111756826B (en) Lock information transmission method of DLM and related device
KR100943213B1 (en) Fault model and rule based apparatus and its method in a home network
CN114691304B (en) Method, device, equipment and medium for realizing high availability of cluster virtual machine
CN113765748A (en) Method for processing fault of computing node and computer readable storage medium
CN116192885A (en) High-availability cluster architecture artificial intelligent experiment cloud platform data processing method and system
US11074120B2 (en) Preventing corruption by blocking requests
US10365934B1 (en) Determining and reporting impaired conditions in a multi-tenant web services environment
CN112685803A (en) Hot standby state switching method, device, equipment and storage medium
CN107783855B (en) Fault self-healing control device and method for virtual network element
CN107423113B (en) Method for managing virtual equipment, out-of-band management equipment and standby virtual equipment
WO2023185355A1 (en) Method and apparatus for achieving high availability of clustered virtual machines, device, and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Zou Lixian

Inventor after: Li Yi

Inventor after: Liu Jianping

Inventor before: Li Yi

Inventor before: Zou Lixian

Inventor before: Liu Jianping

RJ01 Rejection of invention patent application after publication

Application publication date: 20211207