CN111953566B

CN111953566B - Distributed fault monitoring-based method and virtual machine high-availability system

Info

Publication number: CN111953566B
Application number: CN202010812521.2A
Authority: CN
Inventors: 姚培; 瞿洪桂; 冯龙飞; 赵策
Original assignee: Beijing Sinonet Science and Technology Co Ltd
Current assignee: Beijing Sinonet Science and Technology Co Ltd
Priority date: 2020-08-13
Filing date: 2020-08-13
Publication date: 2022-03-11
Anticipated expiration: 2040-08-13
Also published as: CN111953566A

Abstract

A distributed fault monitoring-based method and a virtual machine high-availability system monitor the state of a physical machine at several layers (the management network layer, the storage network layer and the service network layer) and detect abnormal virtual machine life cycle events and the running state of running processes. Different reaction actions are executed promptly according to the fault combination, and interrupted virtual machines are recovered promptly. The method and system detect the state of the cloud platform from multiple dimensions; avoid the unexpected service interruption that results when abnormal states at different layers all trigger the same HA operation; execute different isolation and recovery operations for different faults; avoid introducing unexpected faults through the recovery action itself; avoid interrupting other virtual machines while recovering the service of a faulty virtual machine; and improve the stability and reliability of the cloud platform. The invention realizes distributed fault detection and avoids the high-availability system becoming unavailable because of a single-node fault.

Description

Distributed fault monitoring-based method and virtual machine high-availability system

Technical Field

The invention relates to the field of network fault monitoring, in particular to a distributed fault monitoring-based method and a virtual machine high-availability system.

Background

As traditional applications grow more complex, they must support more users, provide greater computing power, guarantee stability and strengthen security. To meet these growing demands, enterprises have to purchase various hardware devices and software and, most difficult of all, build a complete team to keep the devices and software running normally. This maintenance work mainly includes installation, configuration, testing, operation and upgrading, as well as guaranteeing the security of the system. Carrying out the whole maintenance process incurs a very large overhead, and the cost keeps rising as the scale of the application grows. Moreover, resource utilization in the traditional mode is low, which causes great waste; cloud computing emerged in response.

As more and more users of traditional IT architectures move to cloud computing, migrating enterprise applications to the cloud has become a major trend, while the uninterrupted operation of traditional centralized application systems relies almost entirely on the high availability of servers. If a cloud service must run 7x24 hours without failure, the cost of meeting that requirement, or of failing to meet it, can be very high. Cloud users therefore urgently need a virtual machine high-availability function. The existing mainstream virtual machine high-availability schemes detect physical machine faults only at the management network layer: when the heartbeat on the management network is interrupted, a high-availability action is triggered. In current cloud computing deployments, however, the management network, the storage network and the service network are deployed independently on different hardware. If only the management network heartbeat is interrupted, the normal operation of the virtual machine is not affected and the service perceives no interruption; triggering an HA action at that moment instead causes a service interruption and unnecessary loss.

Disclosure of Invention

The invention aims to provide a method based on distributed fault monitoring and a virtual machine high-availability system, so as to solve the problems in the prior art.

In order to achieve the purpose, the technical scheme adopted by the invention is as follows:

A method based on distributed fault monitoring comprises physical machine fault detection, virtual machine life cycle event detection and running process detection;

the detection of the physical machine comprises the following steps:

S101, detecting the network state of the node where the resident process is located, and stopping detection if the network state is abnormal; if the network state is normal, starting to detect the non-maintenance state nodes;

S102, acquiring the information of a non-maintenance state node and detecting whether a detection task is running for the non-maintenance state node; if no detection task exists for the non-maintenance state node, creating a new detection task; otherwise, the node where the resident process is located starts to execute the detection task of the non-maintenance state node;

S103, setting a preset maximum duration for the detection task; if the execution time of the detection task exceeds the preset maximum duration, the node where a resident process is located that executes the detection task on the non-maintenance state node is replaced; if the preset maximum duration is not exceeded, skipping the non-maintenance state node and starting the detection of the next non-maintenance state node;

S104, if the detection result for the non-maintenance state node is a normal state, stopping the detection task corresponding to the non-maintenance state node and starting to execute the detection task of the next non-maintenance state node; otherwise, acquiring the information of the non-maintenance state node corresponding to the detection task, generating a fault notification and sending the fault notification;

S105, repeating steps S102 to S104, a detection period of the physical machine ending when the detection tasks of all non-maintenance state nodes have each been executed once;

the life cycle event detection of the virtual machine comprises the following steps:

S201, establishing a connection with the virtualization software of the node, querying the virtual machine information on the node, and monitoring the life cycle of the virtual machines;

S202, when an abnormal life cycle event of a virtual machine is detected, acquiring the virtual machine information corresponding to the event;

S203, taking the details of the abnormal event detected in step S202 and the corresponding virtual machine information as fault information, and sending a fault notification;

the detection of the running process comprises the following steps:

S301, acquiring, by a resident process, information of all running processes to be detected;

S302, traversing and checking all running processes to be detected, and ending the detection period if all of them are running normally; if a running process to be detected is running abnormally, attempting to pull up (restart) the abnormally exited process and checking the pull-up result;

S303, if the process was successfully pulled up in step S302, finishing the detection of the running process; if the pull-up failed, collecting the node information corresponding to the running process and the information of the running process to form fault information, and sending a fault notification.

Preferably, the detection tasks are synchronized, and when the non-maintenance node where the detection task is located completes detection, nodes where resident processes of other detection tasks are located synchronize detected information, so that repeated detection on the same non-maintenance node is avoided.
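As an illustration only, one detection period of steps S101 to S105 can be sketched as follows. The names `DetectionTask`, `Detector`, `probe` and `notify` are hypothetical and do not appear in the disclosure; the shared `tasks` dictionary stands in for whatever synchronized task record the nodes use, and removing a task after its result is known is an assumption of this sketch.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class DetectionTask:
    target: str        # non-maintenance state node being checked
    owner: str         # node whose resident process executes the task
    started_at: float  # start time, used for the S103 timeout check


@dataclass
class Detector:
    me: str                            # node where this resident process runs
    max_duration: float                # preset maximum duration (S103)
    probe: Callable[[str], bool]       # hypothetical check: True if target is healthy
    tasks: Dict[str, DetectionTask]    # task record assumed to be synchronized between nodes
    notify: Callable[[dict], None]     # sends a fault notification
    clock: Callable[[], float] = time.monotonic

    def run_cycle(self, local_network_ok: bool, nodes: List[str]) -> List[str]:
        """Run one detection period and return the nodes reported as faulty."""
        if not local_network_ok:       # S101: a detector with a bad network stops
            return []
        faulty = []
        for target in nodes:           # S105: visit every non-maintenance state node once
            task = self.tasks.get(target)
            if task is None:           # S102: no task yet, create one
                self.tasks[target] = DetectionTask(target, self.me, self.clock())
            elif task.owner != self.me:
                if self.clock() - task.started_at <= self.max_duration:
                    continue           # S103: another node is still within its time limit
                # S103: the task timed out, so this node replaces the executor
                self.tasks[target] = DetectionTask(target, self.me, self.clock())
            healthy = self.probe(target)
            self.tasks.pop(target, None)   # assumption: the task ends once a result exists
            if not healthy:            # S104: report the faulty node
                faulty.append(target)
                self.notify({"type": "host", "node": target, "detected_by": self.me})
        return faulty
```

In this sketch a node skips any target that another node is still checking within the time limit, which is one way to obtain the "no repeated detection" property described above.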

A virtual machine high-availability system based on distributed fault monitoring comprises a fault detection module, a fault notification processing module and a fault recovery module; the output end of the fault detection module is connected with the input end of the fault notification processing module; the fault notification processing module and the fault recovery module carry out data interaction;

the fault detection module detects the network state of the physical machine by reading the network configuration information of each node and sends the node information with abnormal network state to the fault notification processing module; calling an interface to receive a life cycle event of the virtual machine and sending the abnormal virtual machine information of the life cycle event to the fault notification processing module; monitoring the running state of a preset specific process through a resident process, and sending node information corresponding to the preset specific process which cannot be recovered by self to the fault notification processing module;

the fault notification processing module receives and stores the fault notification sent by the fault detection module; and transmitting to the failure recovery module; receiving a fault processing result returned by the fault recovery module;

and the fault recovery module receives the fault notification transmitted by the fault notification processing module, judges the isolation level of the fault node according to different fault conditions, triggers different fault recovery tasks and sends a fault recovery result to the fault notification processing module.

Preferably, the fault detection module detects the network states of the management network layer, the storage network layer and the service network layer at the same time.

Preferably, the action relationship between the detection result in the fault detection module and the reaction action of the fault recovery module is as follows:

when the network of the management network layer is interrupted and the networks of the storage network layer and the service network layer are normal, the fault recovery module sends fault information to the fault notification processing module;

when the network of the storage network layer is interrupted and the networks of the management network layer and the service network layer are normal, maintaining and isolating corresponding fault nodes and evacuating all the virtual machines of the fault nodes;

when the network of the service network layer is interrupted and the networks of the storage network layer and the management network layer are normal, maintaining the fault node and migrating the virtual machine in the fault node to a normal node;

when the network of the management network layer and the network of the service network layer are interrupted and the network of the storage network layer is normal, maintaining and isolating corresponding fault nodes and evacuating all the virtual machines of the fault nodes;

when the network of the management network layer, the storage network layer and the service network layer is interrupted, maintaining and isolating the corresponding fault node, and evacuating all the virtual machines of the fault node.

Preferably, the processing procedure of the lifecycle event of the virtual machine is as follows: monitoring life cycle events of all virtual machines on a host machine, and acquiring virtual machine information with abnormal life cycle events in the virtual machines; and restoring the specified virtual machine according to the virtual machine information.

Preferably, the recovery process for a fault of the running process is as follows: periodically checking the running state of every process in the process list and attempting to restart any abnormal process; if the abnormal process returns to normal after the restart, the processing is finished; if the restart fails, the fault node corresponding to the abnormal process is isolated.

The invention has the following beneficial effects. The invention discloses a distributed fault monitoring-based method and a virtual machine high-availability system that monitor physical machine faults at several layers (the management network layer, the storage network layer and the service network layer) and detect abnormal virtual machine life cycle events and the running state of running processes. Different reaction actions are executed promptly according to the fault combination, and interrupted virtual machines are recovered promptly. The method and system detect the state of the cloud platform from multiple dimensions; avoid the unexpected service interruption that results when abnormal states at different layers all trigger the same HA operation; execute different isolation and recovery operations for different faults; avoid introducing unexpected faults through the recovery action itself; avoid interrupting other virtual machines while recovering the service of a faulty virtual machine; and improve the stability and reliability of the cloud platform. By detecting the related processes on which the cloud platform depends, the method makes it easier to troubleshoot service unavailability caused by the failure of components outside the cloud platform. The invention realizes distributed fault detection: every computing node is an execution unit for fault detection, so a single-node fault does not make the high-availability system unavailable.

Drawings

FIG. 1 is a flow diagram of a distributed fault monitoring architecture;

FIG. 2 is a schematic diagram of the operation of a virtual machine high availability system;

FIG. 3 is a physical machine fault handling tree diagram;

FIG. 4 is a fault monitoring flow diagram;

FIG. 5 is a virtual machine failure recovery flow diagram;

FIG. 6 is a process exception recovery flow diagram;

FIG. 7 is a diagram of a physical machine fault recovery operation matrix.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.

A method based on distributed fault monitoring, as shown in FIG. 1, includes fault detection of the physical machine, life cycle event detection of the virtual machine, and detection of the running processes on which the cloud platform depends;

the steps of the fault detection of the physical machine are as follows:

in the above steps, each detection task is synchronized, and when the non-maintenance node where the detection task is located completes detection, nodes where resident processes of other detection tasks are located synchronize detected information, so that the waste of computing resources caused by repeated detection on the same non-maintenance node is avoided.

The steps of the life cycle event detection of the virtual machine are as follows:

S201, presetting the virtual machine life cycle events that correspond to possible faults;

S202, establishing a connection between the physical machine nodes and the libvirt server, querying the virtual machine information of all nodes, and monitoring the life cycle events of the virtual machines;

S203, if a life cycle event preset in step S201 is observed in step S202, acquiring the virtual machine information to form fault information and sending a fault notification; otherwise, remaining resident in the background and continuing to monitor the life cycle events of the virtual machines in real time.
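The life cycle monitoring of steps S201 to S203 might be wired up with the libvirt Python bindings roughly as below. This is a sketch under assumptions: the function names `make_lifecycle_handler` and `monitor`, the notification fields, and the choice of the stopped and crashed events as the "preset" events are all illustrative and are not taken from the patent.

```python
import socket
from typing import Callable, Optional, Set


def make_lifecycle_handler(watched_events: Set[int], host: str,
                           send: Callable[[dict], None]):
    """Build a libvirt lifecycle callback that reports only preset events (S201, S203)."""
    def handler(conn, dom, event, detail, opaque) -> Optional[dict]:
        if event not in watched_events:
            return None                      # not a preset event: keep listening
        notification = {
            "type": "vm",
            "host": host,
            "vm_name": dom.name(),
            "vm_uuid": dom.UUIDString(),
            "event": event,
            "detail": detail,
        }
        send(notification)                   # S203: send the fault notification
        return notification
    return handler


def monitor(uri: str = "qemu:///system", send: Callable[[dict], None] = print) -> None:
    """Stay resident and forward preset lifecycle events (S202). Requires libvirt-python."""
    import libvirt                           # imported here so the handler is testable without it

    libvirt.virEventRegisterDefaultImpl()    # must be called before opening the connection
    conn = libvirt.openReadOnly(uri)
    # Assumption: stopped and crashed domains are the events that indicate a fault.
    watched = {libvirt.VIR_DOMAIN_EVENT_STOPPED, libvirt.VIR_DOMAIN_EVENT_CRASHED}
    conn.domainEventRegisterAny(
        None,                                # None: receive events for all domains
        libvirt.VIR_DOMAIN_EVENT_ID_LIFECYCLE,
        make_lifecycle_handler(watched, socket.gethostname(), send),
        None,
    )
    while True:                              # background loop dispatching libvirt events
        libvirt.virEventRunDefaultImpl()
```

A real deployment would probably also inspect `detail`, since a stopped event is raised for an ordinary shutdown as well as for a failure.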

The steps for detecting the running processes on which the cloud platform depends are as follows:

S301, the running processes on which the cloud platform depends are the processes to be detected; a resident process traverses all processes to be detected and obtains their process information;

S302, detecting the running state of the processes to be detected; if all of them are running normally, ending the detection period; otherwise, attempting to pull up (restart) each interrupted process to be detected; if every abnormal process is successfully pulled up, ending the detection process; otherwise, collecting the process information of each process that failed to be pulled up together with the corresponding node information to form fault information, and sending a fault notification.
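A minimal sketch of one process detection period (S301 and S302) is given below. The callables `is_running`, `pull_up` and `notify` are hypothetical hooks supplied by the caller, and sending a single notification that lists every process that could not be pulled up is a simplification chosen for this example.

```python
from typing import Callable, Iterable, List


def check_processes(names: Iterable[str],
                    is_running: Callable[[str], bool],
                    pull_up: Callable[[str], None],
                    node: str,
                    notify: Callable[[dict], None]) -> List[str]:
    """Check every process once, try to pull up the dead ones, report the failures."""
    failed = []
    for name in names:                 # S301: traverse all processes to be detected
        if is_running(name):
            continue                   # running normally, nothing to do
        pull_up(name)                  # S302: attempt to restart the interrupted process
        if not is_running(name):       # check the pull-up result
            failed.append(name)
    if failed:                         # only unrecoverable processes produce a notification
        notify({"type": "process", "node": node, "processes": failed})
    return failed
```

Passing the checks and the restart action in as callables keeps the loop independent of how processes are actually inspected on a given node, for example through a service manager or a PID file.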

The virtual machine high-availability system based on the distributed fault monitoring method comprises a fault detection module, a fault notification processing module and a fault recovery module. The fault detection module detects the network state of the physical machine by reading the network information of the nodes, collects the information of nodes whose network state is abnormal as fault information, and sends a fault notification to the fault notification processing module; it monitors the life cycle events of the virtual machines by calling the libvirt interface, acquires the node information of the node hosting a faulty virtual machine as fault information, and sends a fault notification to the fault notification processing module; and it monitors the processes on which the cloud platform depends, collects the information of interrupted processes and the corresponding node information to form fault information, and sends a fault notification to the fault notification processing module. The network states include those of the management network layer, the storage network layer and the service network layer.

The fault notification processing module receives the fault notification sent by the fault detection module, stores the fault information and transmits the fault information to the fault recovery module; and receiving a fault recovery result returned by the fault recovery module, and updating the node information after fault processing by combining the fault information.

The fault recovery module receives the fault information sent by the fault notification processing module and determines the fault scenario from it; according to the fault scenario, it determines the isolation level of the faulty node, avoiding virtual machine split-brain, and triggers the corresponding fault recovery task; it returns the fault recovery result to the fault notification processing module; and it checks whether any notification remains in an unprocessed state and, if so, retries the recovery process.
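The interplay between stored notifications, recovery results and retries could look like the sketch below. The class names, the status values (`new`, `running`, `finished`, `error`) and the synchronous call to `recover` are assumptions made for illustration; the description itself only states that notifications are stored, that results are returned, and that unprocessed notifications are retried.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Notification:
    id: int
    payload: dict
    status: str = "new"    # hypothetical states: new -> running -> finished | error


class NotificationStore:
    """Stores fault notifications and tracks the result of each recovery attempt."""

    def __init__(self, recover: Callable[[dict], bool]):
        self._recover = recover              # stands in for the fault recovery module
        self._items: Dict[int, Notification] = {}
        self._next_id = 1

    def receive(self, payload: dict) -> int:
        """Store a new notification, hand it to recovery, and return its id."""
        item = Notification(self._next_id, payload)
        self._items[item.id] = item
        self._next_id += 1
        self._process(item)
        return item.id

    def _process(self, item: Notification) -> None:
        item.status = "running"
        try:
            ok = self._recover(item.payload)
        except Exception:                    # a crashed recovery task counts as a failure
            ok = False
        item.status = "finished" if ok else "error"

    def retry_unfinished(self) -> List[int]:
        """Retry every notification whose recovery has not finished; return their ids."""
        retried = []
        for item in self._items.values():
            if item.status != "finished":
                self._process(item)
                retried.append(item.id)
        return retried

    def status(self, notification_id: int) -> str:
        return self._items[notification_id].status
```

Calling `retry_unfinished` from a periodic task gives the "check and retry" behaviour; a production system would likely also cap the number of retries.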

The working principle of the virtual machine high-availability system is shown in FIG. 2. The fault detection module comprises an instance monitor, a process monitor and a host monitor. The instance monitor runs on a computing node and calls the libvirt interface to detect the life cycle events of the virtual machines; the process monitor runs on a computing node and detects the running state of the key processes on which the cloud platform depends; the host monitor runs on the computing nodes and checks each computing node of the physical machine for abnormality. The fault information detected by the fault detection module is transmitted to the fault notification processing module, which is implemented as an HA-API running on a control node; it provides a service interface and sends API processing requests to the fault recovery module through RPC. The fault recovery module handles the API processing requests through an HA-Engine, which runs on a control node, executes the recovery workflow asynchronously, and processes the fault notifications sent by the fault notification processing module.

The fault handling rules for the network state of the physical machine are shown in FIG. 3. If the storage network layer of a computing node is interrupted, the computing node is unlikely to run virtual machines successfully again, whether or not the management network layer and the service network layer are normal; the computing node must then be isolated from the node cluster and shut down, and its virtual machines evacuated to other computing nodes whose network state is normal. If the storage network layer and the management network layer of the computing node are normal but the service network layer is interrupted, the virtual machines cannot provide service externally although they still run normally; the computing node must then be isolated, and its virtual machines migrated to other computing nodes whose network state is normal. If the storage network layer and the service network layer are normal but the management network layer is interrupted, only an e-mail notifying the administrator of the fault information needs to be sent, and no other operation is performed on the virtual machines. If only the storage network layer of the computing node is normal and both the service network layer and the management network layer are interrupted, the virtual machines on the computing node cannot be migrated to other nodes, and the evacuation interface is called to evacuate them.
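The handling rules described for FIG. 3 amount to a small decision tree over the three network states. The sketch below encodes it; the `Action` names are hypothetical labels for the reactions described in the text, and returning `NONE` when all three networks are normal is an assumption, since the text does not discuss that case.

```python
from enum import Enum


class Action(Enum):
    NONE = "no action"
    NOTIFY_ADMIN = "notify administrator only"
    ISOLATE_AND_MIGRATE = "isolate node and migrate virtual machines"
    EVACUATE = "call evacuation interface"
    ISOLATE_SHUTDOWN_EVACUATE = "isolate node from cluster, shut down, evacuate virtual machines"


def choose_action(management_ok: bool, storage_ok: bool, service_ok: bool) -> Action:
    """Map the network state of a computing node to a recovery action."""
    if not storage_ok:
        # Storage interrupted: the other two layers no longer matter.
        return Action.ISOLATE_SHUTDOWN_EVACUATE
    if management_ok and service_ok:
        return Action.NONE                    # assumption: a healthy node needs nothing
    if management_ok:
        # Only the service network is interrupted: virtual machines still run.
        return Action.ISOLATE_AND_MIGRATE
    if service_ok:
        # Only the management network is interrupted: services are unaffected.
        return Action.NOTIFY_ADMIN
    # Management and service interrupted, storage normal: migration is impossible.
    return Action.EVACUATE
```

Checking the storage layer first mirrors the statement that a storage interruption decides the outcome regardless of the other two layers.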

Examples

The workflow of the fault detection module is shown in FIG. 4: first, a periodic inspection task is started and a long-lived connection to the host libvirt API is established; second, the node state, the virtual machine state and the running state of the processes are detected; the fault is then isolated at the point where it was detected; finally, the affected virtual machines are recovered according to the fault node information;

the fault recovery flow of the virtual machine in the fault recovery module is shown in fig. 5, and the virtual machine information of the fault event is acquired by monitoring the life cycle event of the virtual machine on the host machine, so as to recover the specified virtual machine;

the recovery flow in the fault recovery module for the key processes on which the cloud platform depends is shown in FIG. 6: the key processes are recorded together in a process list, and the running state of every process in the list is checked periodically; a restart is attempted for any interrupted process; if the restart succeeds, recovery of the process ends; if the restart fails, the fault node on which the process runs is isolated.

The rule corresponding to the network state failure and the recovery operation of the physical machine is shown in fig. 7:

when the management network layer network is interrupted, and the storage network layer and the service network layer network are normal, only a notification is sent to an administrator;

when the storage network layer network is interrupted and the management network layer and the service network layer network are normal, isolating the corresponding computing nodes out of the cluster, and evacuating virtual machines in the computing nodes to other nodes with normal network states;

when the network of the service network layer is interrupted and the networks of the storage network layer and the management network layer are normal, the virtual machines in the corresponding computing nodes are evacuated to other nodes with normal network states;

when the management network layer and the service network layer are interrupted and the storage network layer is normal, isolating the corresponding computing nodes out of the cluster, and evacuating virtual machines in the computing nodes to other nodes with normal network states;

when the network states of the management network layer, the storage network layer and the service network layer are interrupted, the corresponding computing nodes are isolated out of the cluster, and the virtual machines in the computing nodes are evacuated to other nodes with normal network states.

By adopting the technical scheme disclosed by the invention, the following beneficial effects are obtained:

The invention discloses a distributed fault monitoring-based method and a virtual machine high-availability system that monitor physical machine faults at several layers (the management network layer, the storage network layer and the service network layer) and detect abnormal virtual machine life cycle events and the running state of running processes. Different reaction actions are executed promptly according to the fault combination, and interrupted virtual machines are recovered promptly. The method and system detect the state of the physical machines in the cloud platform from multiple dimensions; avoid the unexpected service interruption that results when an abnormal state at a single network layer triggers an HA operation; execute different isolation and recovery operations for different faults; avoid introducing unexpected faults through the recovery action itself; avoid interrupting other virtual machines while recovering the service of a faulty virtual machine; and improve the stability and reliability of the cloud platform. By detecting the related processes on which the cloud platform depends, the method makes it easier to troubleshoot service unavailability caused by the failure of components outside the cloud platform. The invention realizes distributed fault detection: every computing node is an execution unit for fault detection, so a single-node fault does not make the high-availability system unavailable.

The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and improvements can be made without departing from the principle of the present invention, and such modifications and improvements should also be considered within the scope of the present invention.

Claims

1. A method based on distributed fault monitoring is characterized by comprising physical machine fault detection, virtual machine life cycle event detection and running process detection;

the detection of the physical machine comprises the following steps:

S101, detecting the network state of a node where the resident process is located, and stopping detection if the network state is abnormal; if the network state is normal, starting to detect the nodes in the non-maintenance state;

S102, obtaining information of the non-maintenance state node, detecting whether a detection task runs in the non-maintenance state node, and if the detection task does not exist in the non-maintenance state node, establishing the detection task; otherwise, the node where the resident process is located starts to execute the detection task of the node in the non-maintenance state;

S103, setting a preset maximum time length for the detection task, and if the execution time of the detection task exceeds the preset maximum time length, replacing a detection task execution unit to execute the detection task on the non-maintenance state node; if the preset maximum duration is not exceeded, skipping the non-maintenance state node, and starting to judge whether the detection task of the next non-maintenance state node is overtime;

S104, if the detected non-maintenance state node is in a normal state, stopping running the detection task corresponding to the non-maintenance state node, and starting executing the detection task on the next non-maintenance state node; otherwise, acquiring the non-maintenance node information corresponding to the detection task, generating a fault notification and sending the fault notification;

S201, establishing connection between the node and virtualization software, inquiring virtual machine information on the node, and monitoring the life cycle of the virtual machine;

S202, when the virtual machine is detected to have an abnormal life cycle event, acquiring the virtual machine information corresponding to the abnormal life cycle event;

S203, taking the details of the abnormal life cycle event detected in the step S202 and the corresponding virtual machine information as fault information, and sending a fault notification;

the detection of the running process comprises the following steps:

2. The distributed fault monitoring-based method according to claim 1, wherein the detection tasks are synchronized, and when the non-maintenance state node where the detection task is located has completed detection, nodes where resident processes of other detection tasks are executed synchronize detected information, thereby avoiding repeated detection on the same non-maintenance state node.

3. A virtual machine high-availability system based on distributed fault monitoring is characterized by comprising a fault detection module, a fault notification processing module and a fault recovery module; the output end of the fault detection module is connected with the input end of the fault notification processing module; the fault notification processing module and the fault recovery module carry out data interaction;

the fault detection module adopts the method of claim 1 to monitor faults, detects the network state of a physical machine by reading the network configuration information of each node, and sends the node information with abnormal network state to the fault notification processing module; calling an interface to receive a life cycle event of the virtual machine and sending virtual machine information with abnormal life cycle event to the fault notification processing module; monitoring the running state of a preset specific process through a resident process, and sending node information corresponding to the preset specific process which cannot be recovered by self to the fault notification processing module;

4. The distributed fault monitoring based virtual machine high availability system according to claim 3, wherein the fault detection module detects network states of a management network layer, a storage network layer and a service network layer at the same time.

5. The distributed fault monitoring based virtual machine high availability system according to claim 4, wherein the action relationship between the detection result in the fault detection module and the reaction action of the fault recovery module is as follows:

when the network of the management network layer and the network of the service network layer are interrupted and the network of the storage network layer is normal, maintaining and isolating the corresponding fault node and evacuating all the virtual machines of the fault node;

6. The distributed fault monitoring based virtual machine high availability system according to claim 3, wherein the processing procedure of the life cycle event of the virtual machine is as follows: monitoring life cycle events of all virtual machines on a host machine, and acquiring virtual machine information with abnormal life cycle events in the virtual machines; and restoring the specified virtual machine according to the virtual machine information.

7. The virtual machine high availability system based on distributed fault monitoring as claimed in claim 3, wherein the recovery process for the fault of the running process is as follows: periodically checking the running states of all processes in the process list, attempting restart aiming at the abnormal process, and finishing the processing if the abnormal process is recovered to be normal after being restarted; and if the abnormal process is not restarted, isolating the fault node corresponding to the abnormal process.