CN111953566B - Distributed fault monitoring-based method and virtual machine high-availability system - Google Patents

Info

Publication number
CN111953566B
Authority
CN
China
Prior art keywords
fault
node
detection
virtual machine
network layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010812521.2A
Other languages
Chinese (zh)
Other versions
CN111953566A (en)
Inventor
姚培
瞿洪桂
冯龙飞
赵策
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sinonet Science and Technology Co Ltd
Original Assignee
Beijing Sinonet Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sinonet Science and Technology Co Ltd filed Critical Beijing Sinonet Science and Technology Co Ltd
Priority to CN202010812521.2A priority Critical patent/CN111953566B/en
Publication of CN111953566A publication Critical patent/CN111953566A/en
Application granted granted Critical
Publication of CN111953566B publication Critical patent/CN111953566B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00: Arrangements for monitoring or testing data switching networks
    • H04L43/08: Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0805: Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
    • H04L43/0817: Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability by checking functioning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44: Arrangements for executing specific programs
    • G06F9/455: Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533: Hypervisors; Virtual machine monitors
    • G06F9/45558: Hypervisor-specific management and integration aspects
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06: Management of faults, events, alarms or notifications
    • H04L41/0654: Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0659: Management of faults, events, alarms or notifications using network fault recovery by isolating or reconfiguring faulty entities

Abstract

A distributed fault monitoring method and a virtual machine high-availability system monitor the state of a physical machine at multiple layers (the management network layer, the storage network layer and the service network layer) and detect abnormal virtual machine life cycle events and the running state of running processes. Different recovery actions are executed promptly according to the combination of faults, and interrupted virtual machines are recovered in time. The state of the cloud platform is thus detected from multiple dimensions, which avoids the unexpected service interruption caused when abnormal states at different layers all trigger the same HA operation. Different isolation and recovery operations are executed for different faults, so that the recovery action itself does not introduce unexpected faults and recovering the service of a faulty virtual machine does not interrupt other virtual machines, which improves the stability and reliability of the cloud platform. The invention implements distributed fault detection and prevents a single-node fault from making the high-availability system unavailable.

Description

Distributed fault monitoring-based method and virtual machine high-availability system
Technical Field
The invention relates to the field of network fault monitoring, in particular to a distributed fault monitoring-based method and a virtual machine high-availability system.
Background
As traditional applications grow more complex, they must support more users, provide greater computing power, guarantee stability and strengthen security. To meet these growing demands, enterprises have to purchase a variety of hardware devices and software and, hardest of all, build a complete team to keep these devices and software running normally. This maintenance work mainly includes installation, configuration, testing, operation, upgrading and ensuring the security of the system. Carrying out the whole maintenance process incurs very large overhead, and the cost keeps rising as the application grows in scale. Moreover, resource utilization in the traditional model is low, which causes great waste; cloud computing emerged in response.
As more and more users move from traditional IT architectures to cloud computing, migrating enterprise applications to the cloud has become a major trend, while the uninterrupted operation of traditional centralized application systems relies almost entirely on the high availability of servers. If a cloud service is required to run 7x24 hours without failure, the cost of meeting the requirement, or of failing to meet it, can be very high. High availability of virtual machines is therefore urgently needed by many cloud users. Existing mainstream virtual machine high-availability schemes detect physical machine faults only at the management network layer: when the heartbeat of the management network is interrupted, a high-availability (HA) action is triggered. In current cloud computing scenarios, however, the management network, the storage network and the service network are deployed independently on different hardware. If only the heartbeat of the management network is interrupted, the normal operation of the virtual machine is not affected and the service perceives no interruption; triggering the HA action at that moment would instead cause a service interruption and unnecessary loss.
Disclosure of Invention
The invention aims to provide a method based on distributed fault monitoring and a virtual machine high-availability system, so as to solve the problems in the prior art.
In order to achieve the purpose, the technical scheme adopted by the invention is as follows:
A distributed fault monitoring method comprises physical machine fault detection, virtual machine life cycle event detection and running process detection;
the physical machine fault detection comprises the following steps:
S101, detecting the network state of the node where the resident process is located; if the network state is abnormal, stopping detection; if the network state is normal, starting to detect the non-maintenance nodes;
S102, acquiring the information of a non-maintenance node and checking whether a detection task is running for that node; if no detection task exists, creating one; otherwise, the node where the resident process is located starts to execute the detection task for the non-maintenance node;
S103, setting a preset maximum duration for the detection task; if the execution time of the detection task exceeds the preset maximum duration, a different node where a resident process is located takes over executing the detection task for the non-maintenance node; if the preset maximum duration is not exceeded, skipping the non-maintenance node and moving on to the next non-maintenance node;
S104, if the detection result for the non-maintenance node is normal, stopping the corresponding detection task and starting the detection task for the next non-maintenance node; otherwise, acquiring the information of the non-maintenance node corresponding to the detection task, generating a fault notification and sending it;
S105, repeating steps S102 to S104; one detection period of the physical machine ends when the detection tasks for all non-maintenance nodes have each been executed once;
the virtual machine life cycle event detection comprises the following steps:
S201, establishing a connection with the virtualization software of the node, querying the virtual machine information on the node, and monitoring the life cycle of the virtual machines;
S202, when an abnormal life cycle event is detected for a virtual machine, acquiring the virtual machine information corresponding to the event;
S203, taking the details of the abnormal event detected in step S202 and the corresponding virtual machine information as fault information, and sending a fault notification;
the running process detection comprises the following steps:
S301, the resident process acquires the information of all running processes to be detected;
S302, traversing and checking all running processes to be detected; if all of them are running normally, ending the detection period; if a running process to be detected is abnormal, attempting to pull up (restart) the abnormally exited process and checking the result;
S303, if the process was pulled up successfully in step S302, ending the running process detection; if the pull-up failed, collecting the node information corresponding to the running process and the information of the running process to form fault information, and sending a fault notification.
Preferably, the detection tasks are synchronized: when one detection task completes detection of a non-maintenance node, the nodes where the resident processes of the other detection tasks are located synchronize the detected information, so that the same non-maintenance node is not detected repeatedly.
A virtual machine high-availability system based on distributed fault monitoring comprises a fault detection module, a fault notification processing module and a fault recovery module; the output of the fault detection module is connected to the input of the fault notification processing module; the fault notification processing module and the fault recovery module exchange data with each other;
the fault detection module detects the network state of the physical machine by reading the network configuration information of each node and sends the information of nodes with an abnormal network state to the fault notification processing module; it calls an interface to receive virtual machine life cycle events and sends the information of virtual machines with abnormal life cycle events to the fault notification processing module; and it monitors the running state of preset specific processes through a resident process and sends the node information corresponding to any preset specific process that cannot recover by itself to the fault notification processing module;
the fault notification processing module receives and stores the fault notifications sent by the fault detection module, forwards them to the fault recovery module, and receives the fault processing results returned by the fault recovery module;
the fault recovery module receives the fault notifications forwarded by the fault notification processing module, determines the isolation level of the fault node according to the fault condition, triggers the corresponding fault recovery task and sends the fault recovery result to the fault notification processing module.
Preferably, the fault detection module detects the network states of the management network layer, the storage network layer and the service network layer at the same time.
Preferably, the correspondence between the detection results of the fault detection module and the recovery actions of the fault recovery module is as follows:
when the network of the management network layer is interrupted and the networks of the storage network layer and the service network layer are normal, the fault recovery module sends fault information to the fault notification processing module;
when the network of the storage network layer is interrupted and the networks of the management network layer and the service network layer are normal, the corresponding fault node is placed in maintenance and isolated, and all virtual machines on the fault node are evacuated;
when the network of the service network layer is interrupted and the networks of the storage network layer and the management network layer are normal, the fault node is placed in maintenance and the virtual machines on the fault node are migrated to a normal node;
when the networks of the management network layer and the service network layer are interrupted and the network of the storage network layer is normal, the corresponding fault node is placed in maintenance and isolated, and all virtual machines on the fault node are evacuated;
when the networks of the management network layer, the storage network layer and the service network layer are all interrupted, the corresponding fault node is placed in maintenance and isolated, and all virtual machines on the fault node are evacuated.
Preferably, virtual machine life cycle events are processed as follows: the life cycle events of all virtual machines on a host machine are monitored, and the information of any virtual machine with an abnormal life cycle event is acquired; the specified virtual machine is then recovered according to that virtual machine information.
Preferably, running process faults are recovered as follows: the running states of all processes in the process list are checked periodically, and a restart is attempted for each abnormal process; if the abnormal process returns to normal after the restart, processing ends; if the restart fails, the fault node corresponding to the abnormal process is isolated.
The beneficial effects of the invention are as follows. The invention discloses a distributed fault monitoring method and a virtual machine high-availability system that monitor faults of a physical machine at multiple layers (the management network layer, the storage network layer and the service network layer) and detect abnormal virtual machine life cycle events and the running state of running processes. Different recovery actions are executed promptly according to the combination of faults, and interrupted virtual machines are recovered in time. The state of the cloud platform is detected from multiple dimensions, which avoids the unexpected service interruption caused when abnormal states at different layers all trigger the same HA operation. Different isolation and recovery operations are executed for different faults, so that the recovery action itself does not introduce unexpected faults and recovering the service of a faulty virtual machine does not interrupt other virtual machines, which improves the stability and reliability of the cloud platform. By detecting the processes on which the cloud platform depends, the method makes it easier to troubleshoot service unavailability caused by the failure of components that are not part of the cloud platform. The invention implements distributed fault detection: every computing node is an execution unit for fault detection, so a single-node fault does not make the high-availability system unavailable.
Drawings
FIG. 1 is a flow diagram of a distributed fault monitoring architecture;
FIG. 2 is a schematic diagram of the operation of a virtual machine high availability system;
FIG. 3 is a physical machine fault handling tree diagram;
FIG. 4 is a fault monitoring flow diagram;
FIG. 5 is a virtual machine failure recovery flow diagram;
FIG. 6 is a process exception recovery flow diagram;
FIG. 7 is a diagram of a physical machine fault recovery operation matrix.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.
A distributed fault monitoring method is disclosed. As shown in FIG. 1, it includes fault detection of the physical machine, life cycle event detection of the virtual machine, and detection of the running processes on which the cloud platform depends.
The steps of the fault detection of the physical machine are as follows:
S101, detecting the network state of the node where the resident process is located; if the network state is abnormal, stopping detection; if the network state is normal, starting to detect the non-maintenance nodes;
S102, acquiring the information of a non-maintenance node and checking whether a detection task is running for that node; if no detection task exists, creating one; otherwise, the node where the resident process is located starts to execute the detection task for the non-maintenance node;
S103, setting a preset maximum duration for the detection task; if the execution time of the detection task exceeds the preset maximum duration, a different node where a resident process is located takes over executing the detection task for the non-maintenance node; if the preset maximum duration is not exceeded, skipping the non-maintenance node and moving on to the next non-maintenance node;
S104, if the detection result for the non-maintenance node is normal, stopping the corresponding detection task and starting the detection task for the next non-maintenance node; otherwise, acquiring the information of the non-maintenance node corresponding to the detection task, generating a fault notification and sending it;
S105, repeating steps S102 to S104; one detection period of the physical machine ends when the detection tasks for all non-maintenance nodes have each been executed once.
In the above steps, the detection tasks are synchronized: when one detection task completes detection of a non-maintenance node, the nodes where the resident processes of the other detection tasks are located synchronize the detected information, which avoids wasting computing resources on repeated detection of the same non-maintenance node.
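One possible reading of steps S101 to S105 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the `HostMonitor` and `DetectionTask` names, the shared `tasks` and `results` tables (standing in for whatever synchronization mechanism the monitors use), the `probe` callable and the 30-second default are all assumptions introduced here.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DetectionTask:
    """Hypothetical record of one detection task."""
    target: str        # node being probed
    owner: str         # node whose resident process executes the task
    started_at: float  # clock value when the current owner started


@dataclass
class HostMonitor:
    """Resident process on one node (assumed: every compute node runs one)."""
    name: str
    tasks: Dict[str, DetectionTask]   # shared table: target -> task
    results: Dict[str, bool]          # shared, synchronized detection results
    probe: Callable[[str], bool]      # True if the node's network is normal
    max_duration: float = 30.0        # assumed preset maximum duration
    clock: Callable[[], float] = time.monotonic
    notifications: List[dict] = field(default_factory=list)

    def run_cycle(self, nodes, maintenance):
        # S101: a monitor whose own network is abnormal stops detecting.
        if not self.probe(self.name):
            return
        for node in nodes:
            # Skip maintenance nodes and nodes another monitor already checked.
            if node == self.name or node in maintenance or node in self.results:
                continue
            task = self.tasks.get(node)
            if task is None:
                # S102: no task exists for this node, so create one.
                task = self.tasks[node] = DetectionTask(node, self.name, self.clock())
            elif task.owner != self.name:
                if self.clock() - task.started_at <= self.max_duration:
                    continue  # S103: owner is within its time budget, skip
                # S103: the task timed out, so this monitor takes it over.
                task.owner, task.started_at = self.name, self.clock()
            # S104: execute the task and act on the result.
            healthy = self.probe(node)
            self.results[node] = healthy
            if healthy:
                del self.tasks[node]  # normal: stop the detection task
            else:
                self.notifications.append({"type": "host", "node": node})
```

In this sketch a detection period (S105) would end once `results` holds an entry for every non-maintenance node, after which the caller clears `results` before the next period.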
The steps of the life cycle event detection of the virtual machine are as follows:
S201, presetting the virtual machine life cycle events that correspond to possible faults;
S202, establishing a connection between the nodes of the physical machine and the libvirt service, querying the virtual machine information of all the nodes, and monitoring the life cycle events of the virtual machines;
S203, if a life cycle event preset in step S201 is observed in step S202, acquiring the virtual machine information to form fault information and sending a fault notification; otherwise, remaining resident in the background and continuing to monitor virtual machine life cycle events in real time.
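The filtering logic of steps S201 to S203 can be illustrated with a short Python sketch. It deliberately does not call libvirt: events are modeled as plain tuples, and the `LifecycleEvent` type, the `FAULT_EVENTS` set and the event and detail strings are hypothetical placeholders for whatever the monitor actually receives from the libvirt event interface.

```python
from typing import Iterable, Iterator, NamedTuple


class LifecycleEvent(NamedTuple):
    """Hypothetical, simplified form of a virtual machine life cycle event."""
    vm_id: str
    host: str
    event: str   # e.g. "started", "stopped", "suspended"
    detail: str  # e.g. "booted", "crashed", "shutdown"


# S201: preset (event, detail) pairs treated as possible faults.
# The pairs listed here are illustrative assumptions only.
FAULT_EVENTS = {
    ("stopped", "crashed"),
    ("stopped", "failed"),
    ("suspended", "ioerror"),
}


def fault_notifications(events: Iterable[LifecycleEvent]) -> Iterator[dict]:
    """S202-S203: yield one fault notification per preset abnormal event."""
    for ev in events:
        if (ev.event, ev.detail) in FAULT_EVENTS:
            yield {
                "type": "vm",
                "vm_id": ev.vm_id,
                "host": ev.host,
                "event": ev.event,
                "detail": ev.detail,
            }
        # Any other event is ignored and monitoring simply continues.
```

Because the function consumes any iterable, it could be fed from a long-running event stream, which matches the "remain resident and keep monitoring" branch of S203.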
The steps for detecting the running processes on which the cloud platform depends are as follows:
S301, the running processes on which the cloud platform depends are the processes to be detected; a resident process traverses all the processes to be detected and acquires their process information;
S302, detecting the running state of the processes to be detected; if all of them are running normally, ending the detection period; otherwise, attempting to pull up (restart) each interrupted process to be detected; if all the abnormal processes are pulled up successfully, ending the detection process; otherwise, collecting the process information of each process that failed to be pulled up together with the corresponding node information to form fault information, and sending a fault notification.
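Steps S301 and S302 amount to a check, restart, re-check loop. The sketch below shows that loop under stated assumptions: `check_processes`, `is_running` and `restart` are hypothetical names, and the two callables stand in for whatever mechanism the resident process uses to inspect and pull up a process (for example a service manager).

```python
from typing import Callable, Iterable, List


def check_processes(
    node: str,
    names: Iterable[str],
    is_running: Callable[[str], bool],
    restart: Callable[[str], bool],
) -> List[dict]:
    """Run one detection period and return the fault notifications to send."""
    notifications = []
    for name in names:          # S301: traverse every process to be detected
        if is_running(name):    # S302: normal process, nothing to do
            continue
        # S302: try to pull the process up, then verify that it is running.
        if restart(name) and is_running(name):
            continue
        # Pull-up failed: report the process together with its node.
        notifications.append({"type": "process", "node": node, "process": name})
    return notifications
```

An empty return value corresponds to the case where the detection period ends without a notification.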
The virtual machine high-availability system based on the distributed fault monitoring method comprises a fault detection module, a fault notification processing module and a fault recovery module. The fault detection module detects the network state of the physical machine by reading the network information of the nodes, collects the information of nodes with an abnormal network state as fault information and sends a fault notification to the fault notification processing module; it monitors virtual machine life cycle events by calling the libvirt interface, acquires the information of the node hosting the faulty virtual machine as fault information, and sends a fault notification to the fault notification processing module; and it monitors the processes on which the cloud platform depends, collects the information of interrupted processes and of the corresponding nodes to form fault information, and sends a fault notification to the fault notification processing module. The network states include those of the management network layer, the storage network layer and the service network layer.
The fault notification processing module receives the fault notification sent by the fault detection module, stores the fault information and forwards it to the fault recovery module; it also receives the fault recovery result returned by the fault recovery module and, combining it with the fault information, updates the node information after the fault has been processed.
The fault recovery module receives the fault information sent by the fault notification processing module and determines the fault scenario from it; according to the fault scenario, it determines the isolation level of the faulty node, so as to avoid a split-brain condition of the virtual machine, and triggers the corresponding fault recovery task; it returns the fault recovery result to the fault notification processing module; and it checks whether any notification remains in an unfinished state and, if so, retries the recovery processing.
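The store, forward, record-result and retry behavior described for the notification processing and recovery modules can be sketched as below. This is a simplified, synchronous illustration under assumptions: the `NotificationStore` class, the status strings and the `recover` callable are hypothetical, and a direct function call stands in for the hand-off to the fault recovery module.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class NotificationStore:
    """Hypothetical in-memory stand-in for the notification processing module."""
    recover: Callable[[dict], bool]  # stand-in for the fault recovery module
    records: Dict[int, dict] = field(default_factory=dict)
    _next_id: int = 0

    def submit(self, fault: dict) -> int:
        """Store a fault notification, forward it, and record the result."""
        nid = self._next_id
        self._next_id += 1
        self.records[nid] = {"fault": fault, "status": "new"}
        self._process(nid)
        return nid

    def _process(self, nid: int) -> None:
        rec = self.records[nid]
        rec["status"] = "running"
        try:
            ok = self.recover(rec["fault"])
        except Exception:
            ok = False  # a crashed recovery attempt counts as unfinished
        rec["status"] = "finished" if ok else "error"

    def retry_unfinished(self) -> int:
        """Retry every notification that is not finished; return how many."""
        pending = [n for n, r in self.records.items() if r["status"] != "finished"]
        for nid in pending:
            self._process(nid)
        return len(pending)
```

Calling `retry_unfinished()` periodically corresponds to the check for notifications whose processing has not completed.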
The working principle of the virtual machine high-availability system is shown in FIG. 2. The fault detection module comprises an instance monitor, a process monitor and a host monitor. The instance monitor runs on a computing node and calls the libvirt interface to detect virtual machine life cycle events; the process monitor runs on a computing node and detects the running state of the key processes on which the cloud platform depends; the host monitor runs on the computing nodes and detects whether any computing node of the physical machines is abnormal. The fault information detected by the fault detection module is transmitted to the fault notification processing module, which is implemented as an HA-API running on a control node; it provides a service interface and sends API processing requests to the fault recovery module through RPC. The fault recovery module handles the API processing requests through an HA-Engine, which runs on a control node, executes the recovery workflow asynchronously and processes the fault notifications sent by the fault notification processing module.
The fault handling rules for the network state of the physical machine are shown in FIG. 3. If the network of the storage network layer on a computing node is interrupted, the computing node is unlikely to run virtual machines successfully again, whether or not the networks of the management network layer and the service network layer are normal; the computing node then needs to be isolated from the node cluster and shut down, and the virtual machines on it are evacuated to other computing nodes whose network state is normal. If the networks of the storage network layer and the management network layer on the computing node are normal and the network of the service network layer is interrupted, the virtual machines cannot provide service externally but can still run normally; the computing node then needs to be isolated within the node cluster, and the virtual machines on it are migrated to other computing nodes whose network state is normal. If the networks of the storage network layer and the service network layer on the computing node are normal and the network of the management network layer is interrupted, only an e-mail notifying the administrator of the fault information needs to be sent, and no other operation is performed on the virtual machines. If only the network of the storage network layer on the computing node is normal and the networks of the service network layer and the management network layer are interrupted, the virtual machines on the computing node cannot be migrated to other nodes; in this case an evacuation interface is called to evacuate the virtual machines on the computing node.
Examples
The work flow of the fault detection module is shown in FIG. 4: first, a periodic inspection task is started and a long-lived connection to the libvirt API of the host is established; next, the node state, the virtual machine state and the running state of the processes are detected; the fault is then isolated where it is detected; finally, the affected virtual machines are recovered according to the fault node information.
The virtual machine fault recovery flow in the fault recovery module is shown in FIG. 5: the life cycle events of the virtual machines on the host machine are monitored, the virtual machine information of the fault event is acquired, and the specified virtual machine is recovered.
The recovery flow for the key processes on which the cloud platform depends is shown in FIG. 6: the key processes are all recorded in a process list, and the running states of all processes in the list are checked periodically; a restart is attempted for each interrupted process; if the restart succeeds, recovery of the process ends; if the restart fails, the fault node on which the process runs is isolated.
The correspondence between network state faults of the physical machine and recovery operations is shown in FIG. 7:
when the network of the management network layer is interrupted and the networks of the storage network layer and the service network layer are normal, only a notification is sent to the administrator;
when the network of the storage network layer is interrupted and the networks of the management network layer and the service network layer are normal, the corresponding computing node is isolated from the cluster, and the virtual machines on it are evacuated to other nodes whose network state is normal;
when the network of the service network layer is interrupted and the networks of the storage network layer and the management network layer are normal, the virtual machines on the corresponding computing node are evacuated to other nodes whose network state is normal;
when the networks of the management network layer and the service network layer are interrupted and the network of the storage network layer is normal, the corresponding computing node is isolated from the cluster, and the virtual machines on it are evacuated to other nodes whose network state is normal;
when the networks of the management network layer, the storage network layer and the service network layer are all interrupted, the corresponding computing node is isolated from the cluster, and the virtual machines on it are evacuated to other nodes whose network state is normal.
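The operation matrix above can be written as a small decision function. The sketch below is illustrative only: the `Action` names are assumptions, the all-normal case returning `NONE` is implied rather than stated, and the two combinations not listed in the matrix (storage interrupted together with exactly one other layer) are handled by the FIG. 3 rule that a storage network interruption leads to isolation regardless of the other layers.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical labels for the recovery operations in the matrix."""
    NONE = "no action"
    NOTIFY_ADMIN = "notify administrator only"
    EVACUATE = "evacuate virtual machines"
    ISOLATE_AND_EVACUATE = "isolate node and evacuate virtual machines"


def recovery_action(management_ok: bool, storage_ok: bool, service_ok: bool) -> Action:
    """Map the three network-layer states of a node to a recovery operation."""
    if management_ok and storage_ok and service_ok:
        return Action.NONE  # assumption: a healthy node needs no recovery
    if not storage_ok:
        # Storage interrupted: isolate, whatever the other layers report.
        return Action.ISOLATE_AND_EVACUATE
    if not management_ok and not service_ok:
        return Action.ISOLATE_AND_EVACUATE
    if not service_ok:
        return Action.EVACUATE
    # Only the management network is interrupted.
    return Action.NOTIFY_ADMIN
```

For example, `recovery_action(False, True, True)` yields `Action.NOTIFY_ADMIN`, which is the case the background section identifies as wrongly triggering HA in schemes that watch only the management network.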
The technical scheme disclosed by the invention provides the following beneficial effects:
The invention discloses a distributed fault monitoring method and a virtual machine high-availability system that monitor faults of a physical machine at multiple layers (the management network layer, the storage network layer and the service network layer) and detect abnormal virtual machine life cycle events and the running state of running processes. Different recovery actions are executed promptly according to the combination of faults, and interrupted virtual machines are recovered in time. The state of the physical machines in the cloud platform is detected from multiple dimensions, which avoids the unexpected service interruption caused when an abnormal state of a single network layer triggers an HA operation. Different isolation and recovery operations are executed for different faults, so that the recovery action itself does not introduce unexpected faults and recovering the service of a faulty virtual machine does not interrupt other virtual machines, which improves the stability and reliability of the cloud platform. By detecting the processes on which the cloud platform depends, the method makes it easier to troubleshoot service unavailability caused by the failure of components that are not part of the cloud platform. The invention implements distributed fault detection: every computing node is an execution unit for fault detection, so a single-node fault does not make the high-availability system unavailable.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and improvements can be made without departing from the principle of the present invention, and such modifications and improvements should also be considered within the scope of the present invention.

Claims (7)

1. A method based on distributed fault monitoring, characterized by comprising physical machine fault detection, virtual machine life cycle event detection and running process detection;
the detection of the physical machine comprises the following steps:
S101, detecting the network state of the node where the resident process is located, and stopping detection if the network state is abnormal; if the network state is normal, starting to detect the nodes in the non-maintenance state;
S102, obtaining information of the non-maintenance state node, and detecting whether a detection task is running in the non-maintenance state node; if no detection task exists in the non-maintenance state node, establishing the detection task; otherwise, the node where the resident process is located starts to execute the detection task of the non-maintenance state node;
S103, setting a preset maximum duration for the detection task; if the execution time of the detection task exceeds the preset maximum duration, replacing the detection task execution unit that executes the detection task on the non-maintenance state node; if the preset maximum duration is not exceeded, skipping the non-maintenance state node and starting to judge whether the detection task of the next non-maintenance state node has timed out;
S104, if the detected non-maintenance state node is in a normal state, stopping the detection task corresponding to the non-maintenance state node and starting to execute the detection task on the next non-maintenance state node; otherwise, acquiring the information of the non-maintenance state node corresponding to the detection task, generating a fault notification and sending the fault notification;
S105, repeating steps S102 to S104, and ending one detection period of the physical machine when the detection tasks of all the non-maintenance state nodes have each been executed once;
the life cycle event detection of the virtual machine comprises the following steps:
S201, establishing a connection between the node and the virtualization software, querying the virtual machine information on the node, and monitoring the life cycle of the virtual machine;
S202, when an abnormal life cycle event of the virtual machine is detected, acquiring the virtual machine information corresponding to the abnormal life cycle event;
S203, taking the details of the abnormal life cycle event detected in step S202 and the corresponding virtual machine information as fault information, and sending a fault notification;
the detection of the running process comprises the following steps:
S301, acquiring, by a resident process, information of all running processes to be detected;
S302, traversing and checking all the running processes to be detected, and ending the detection period if all of them are running normally; if a running process to be detected is running abnormally, attempting to pull up (restart) the abnormally exited process and detecting the pull-up result;
S303, if the process is successfully pulled up in step S302, finishing the detection of the running process; if the pull-up of the process fails, collecting the node information corresponding to the running process and the information of the running process to form fault information, and sending a fault notification.
2. The distributed fault monitoring-based method according to claim 1, wherein the detection tasks are synchronized, and when the non-maintenance state node where the detection task is located has completed detection, nodes where resident processes of other detection tasks are executed synchronize detected information, thereby avoiding repeated detection on the same non-maintenance state node.
3. A virtual machine high-availability system based on distributed fault monitoring, characterized by comprising a fault detection module, a fault notification processing module and a fault recovery module; the output end of the fault detection module is connected with the input end of the fault notification processing module; the fault notification processing module and the fault recovery module carry out data interaction;
the fault detection module adopts the method of claim 1 to monitor faults: it detects the network state of a physical machine by reading the network configuration information of each node, and sends the information of nodes with an abnormal network state to the fault notification processing module; it calls an interface to receive life cycle events of the virtual machine, and sends the information of virtual machines with abnormal life cycle events to the fault notification processing module; it monitors the running state of preset specific processes through a resident process, and sends the node information corresponding to any preset specific process that cannot recover by itself to the fault notification processing module;
the fault notification processing module receives and stores the fault notification sent by the fault detection module, transmits the fault notification to the fault recovery module, and receives the fault processing result returned by the fault recovery module;
and the fault recovery module receives the fault notification transmitted by the fault notification processing module, judges the isolation level of the fault node according to different fault conditions, triggers different fault recovery tasks and sends a fault recovery result to the fault notification processing module.
4. The distributed fault monitoring based virtual machine high availability system according to claim 3, wherein the fault detection module detects network states of a management network layer, a storage network layer and a service network layer at the same time.
5. The distributed fault monitoring based virtual machine high availability system according to claim 4, wherein the correspondence between the detection result of the fault detection module and the reaction action of the fault recovery module is as follows:
when the network of the management network layer is interrupted and the networks of the storage network layer and the service network layer are normal, the fault recovery module sends fault information to the fault notification processing module;
when the network of the storage network layer is interrupted and the networks of the management network layer and the service network layer are normal, maintaining and isolating corresponding fault nodes and evacuating all the virtual machines of the fault nodes;
when the network of the service network layer is interrupted and the networks of the storage network layer and the management network layer are normal, maintaining the fault node and migrating the virtual machine in the fault node to a normal node;
when the network of the management network layer and the network of the service network layer are interrupted and the network of the storage network layer is normal, maintaining and isolating the corresponding fault node and evacuating all the virtual machines of the fault node;
when the networks of the management network layer, the storage network layer and the service network layer are all interrupted, maintaining and isolating the corresponding fault node, and evacuating all the virtual machines of the fault node.
6. The distributed fault monitoring based virtual machine high availability system according to claim 3, wherein the processing procedure of the life cycle event of the virtual machine is as follows: monitoring life cycle events of all virtual machines on a host machine, and acquiring virtual machine information with abnormal life cycle events in the virtual machines; and restoring the specified virtual machine according to the virtual machine information.
7. The virtual machine high availability system based on distributed fault monitoring as claimed in claim 3, wherein the recovery process for a fault of the running process is as follows: periodically checking the running states of all processes in the process list and attempting to restart any abnormal process; if the abnormal process returns to normal after being restarted, finishing the processing; and if the abnormal process cannot be restarted, isolating the fault node corresponding to the abnormal process.
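The check, restart and isolate sequence of claim 7 (and steps S301 to S303 of claim 1) can be sketched as below. This is a minimal illustration under stated assumptions, not the patented implementation: `check_processes`, `is_running` and `restart` are hypothetical names, and how a process is actually probed or restarted (for example through a service manager) is left to the caller-supplied callbacks.

```python
from typing import Callable, Iterable, List


def check_processes(
    process_names: Iterable[str],
    is_running: Callable[[str], bool],
    restart: Callable[[str], None],
) -> List[str]:
    """Run one detection period over the process list.

    Returns the names of processes that are still abnormal after one
    restart attempt. A non-empty result is the point at which a fault
    notification would be sent and the node isolated.
    """
    unrecovered: List[str] = []
    for name in process_names:
        if is_running(name):
            continue  # process is normal; nothing to do
        restart(name)  # attempt to pull the process back up
        if not is_running(name):  # re-check the pull-up result
            unrecovered.append(name)
    return unrecovered
```

A caller would invoke `check_processes` periodically; an empty result ends the detection period, while a non-empty result supplies the process information to combine with the node information in the fault notification.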
CN202010812521.2A 2020-08-13 2020-08-13 Distributed fault monitoring-based method and virtual machine high-availability system Active CN111953566B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010812521.2A CN111953566B (en) 2020-08-13 2020-08-13 Distributed fault monitoring-based method and virtual machine high-availability system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010812521.2A CN111953566B (en) 2020-08-13 2020-08-13 Distributed fault monitoring-based method and virtual machine high-availability system

Publications (2)

Publication Number Publication Date
CN111953566A CN111953566A (en) 2020-11-17
CN111953566B true CN111953566B (en) 2022-03-11

Family

ID=73341982

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010812521.2A Active CN111953566B (en) 2020-08-13 2020-08-13 Distributed fault monitoring-based method and virtual machine high-availability system

Country Status (1)

Country Link
CN (1) CN111953566B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112506691B (en) * 2020-12-14 2024-04-19 贵州电网有限责任公司 Digital twin application fault recovery method and system for multi-energy system
CN113765709B (en) * 2021-08-23 2022-09-20 中国人寿保险股份有限公司上海数据中心 Openstack cloud platform-based multi-dimensional monitoring-based high-availability realization system and method for virtual machine
CN113965459A (en) * 2021-10-08 2022-01-21 浪潮云信息技术股份公司 Consul-based method for monitoring host network to realize high availability of computing nodes
CN114090184B (en) * 2021-11-26 2022-11-29 中电信数智科技有限公司 Method and equipment for realizing high availability of virtualization cluster
CN114363356B (en) * 2021-12-17 2024-04-26 上海浦东发展银行股份有限公司 Data synchronization method, system, device, computer equipment and storage medium
CN115022314B (en) * 2022-04-24 2024-02-20 中银金融科技有限公司 Enterprise-level RPA cloud management platform
CN117032881A (en) * 2023-07-31 2023-11-10 广东保伦电子股份有限公司 Method, device and storage medium for detecting and recovering abnormality of virtual machine

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103701627A (en) * 2012-09-27 2014-04-02 北京搜狐新媒体信息技术有限公司 Cloud computing platform fault detection method, cloud computing platform fault detection method, solving method and solving device
CN103778031A (en) * 2014-01-15 2014-05-07 华中科技大学 Distributed system multilevel fault tolerance method under cloud environment
CN104378262A (en) * 2013-12-13 2015-02-25 国家计算机网络与信息安全管理中心 Intelligent monitoring analyzing method and system under cloud computing
CN110175451A (en) * 2019-04-23 2019-08-27 国家电网公司华东分部 A kind of method for safety monitoring and system based on electric power cloud

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10025610B2 (en) * 2013-04-30 2018-07-17 Telefonaktiebolaget Lm Ericsson (Publ) Availability management of virtual machines hosting highly available applications
US9946614B2 (en) * 2014-12-16 2018-04-17 At&T Intellectual Property I, L.P. Methods, systems, and computer readable storage devices for managing faults in a virtual machine network
US10361919B2 (en) * 2015-11-09 2019-07-23 At&T Intellectual Property I, L.P. Self-healing and dynamic optimization of VM server cluster management in multi-cloud platform
CN109995568B (en) * 2018-01-02 2022-03-29 中国移动通信有限公司研究院 Fault linkage processing method, network element and storage medium
CN109558209B (en) * 2018-11-20 2021-10-29 郑州云海信息技术有限公司 Monitoring method for virtual machine

Also Published As

Publication number Publication date
CN111953566A (en) 2020-11-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant