CN107656845A - A virtual machine high availability method - Google Patents

Info

Publication number: CN107656845A
Authority: CN (China)
Prior art keywords: virtual machine, control module, host, monitoring, virtual
Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN201710843325.XA
Other languages: Chinese (zh)
Inventors: 韩飞, 邓玉芳, 季统凯
Current Assignee: G Cloud Technology Co Ltd (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: G Cloud Technology Co Ltd
Priority date: 2017-09-18 (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2017-09-18
Publication date: 2018-02-02

Application filed by G Cloud Technology Co Ltd
Priority to CN201710843325.XA
Publication of CN107656845A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0712 Error or fault processing not based on redundancy, the processing taking place in a virtual computing platform, e.g. logically partitioned systems
    • G06F11/0751 Error or fault detection not based on redundancy
    • G06F11/0793 Remedial or corrective actions
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/301 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is a virtual computing platform, e.g. logically partitioned systems
    • G06F11/3055 Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G06F2009/45575 Starting, stopping, suspending or resuming virtual machine instances
    • G06F2009/45591 Monitoring or debugging support
    • G06F2201/00 Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/815 Virtual
    • G06F2201/835 Timestamp

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The present invention relates to the field of virtual machine technology, and in particular to a virtual machine high availability method. In the method, the virtual machine monitor starts a monitoring process; while the monitoring process finds the virtual machine's storage system to be normal, it writes a timestamp mark to the virtual machine's disk; if the monitoring module on the host detects that the timestamp is not being updated normally, it alerts the control module on the control host; after the control module confirms the failure, it changes the owner of the virtual machine and performs fault recovery. The invention provides a highly reliable, easy-to-use high availability scheme for virtual machines on cloud computing platforms, and can be used in virtual machine management.

Description

A virtual machine high availability method
Technical field
The present invention relates to the field of virtual machine technology, and in particular to a virtual machine high availability method.
Background
On a cloud computing platform, failures of the network, the storage system, hardware, or software can cause a virtual machine to fail and its service to stop. To address this, many cloud computing platforms provide a high availability mechanism that automatically and quickly restores a virtual machine after it fails. These mechanisms fall into two broad classes: those with a controller and those without. Their typical implementations often have the following problems:
First, in controller-less schemes, a high-availability cluster must be deployed for the virtual machines in advance, and a distributed algorithm is used to monitor virtual machine state and coordinate failover. VMware and Azure use this kind of implementation. Its drawbacks are operational complexity, wasted resources, poor controllability, limited reliability, and a complex implementation. Another controller-less approach uses ready-made open-source tools such as keepalived or Heartbeat, but it is also complex to operate and its reliability is low.
Second, in controller-based schemes, automatic recovery of a virtual machine does not depend on a distributed coordination algorithm; instead, a control module coordinates the whole recovery process, so controllability is higher. However, mainstream implementations generally suffer from insufficient reliability, for example: dependence on an agent, undetected failures, misjudged failures, and disk data corruption.
Summary of the invention
The technical problem solved by the present invention is to provide a virtual machine high availability method that is easy to use, controllable, and highly reliable, meeting the need for automatic virtual machine fault recovery in a cloud computing platform environment.
The present invention solves the above technical problem as follows:
In the described method, the virtual machine monitor starts a monitoring process; while the monitoring process finds the virtual machine's storage system to be normal, it writes a timestamp mark to the virtual machine's disk; if the monitoring module on the host detects that the timestamp is not being updated normally, it alerts the control module on the control host; after the control module confirms the failure, it changes the owner of the virtual machine and performs fault recovery.
Specifically, the method comprises the following steps:
Step 1: Start an independent monitoring process in the virtual machine monitor. When the virtual machine starts, check whether the disk belongs to this virtual machine; if it does not, raise an alert and exit; if it does, run normally and check the virtual machine's disk periodically;
Step 2: If the virtual machine's storage system is normal, the monitoring process writes a timestamp mark to the virtual machine's disk; if the storage system is abnormal, the timestamp cannot be updated;
Step 3: The monitoring module on the host where the virtual machine runs periodically checks the timestamp updated by the virtual machine; if it has not been updated normally, the module sends an alert to the control module on the control host;
Step 4: After receiving the alert, the control module checks the virtual machine's state again through the disk of the suspected faulty virtual machine. If the failure is confirmed, perform step 5; if there is no failure or the failure has already recovered, perform step 6. If the control module receives no heartbeat signal from the monitoring module, there is a problem at least somewhere between the monitoring module and the control module, and the control module also issues a warning.
Step 5: Change the owner of the virtual machine, wait for a safety period, and then recover the virtual machine on another host;
Step 6: Take no recovery action, report to the administrators, and end the process.
The virtual machine monitor is the hypervisor, i.e. the system on the host that actually runs and manages virtual machines, such as Xen or qemu-kvm;
The host is the physical server that actually runs the virtual machine; the monitoring module runs on the host independently of the virtual machine monitor;
The control host is the server in the cloud computing cluster responsible for running and providing the control service;
The owner indicates which virtual machine on which host holds the right to use the virtual disk.
The monitoring process is a thread started before the virtual machine is started. The thread is dedicated to periodically checking that the virtual machine's disk is readable and writable, and to updating the timestamp to show that the hypervisor and the virtual machine can access the storage system normally.
The server on which the control module runs must be able to access the virtual disk of the suspected faulty virtual machine. It only needs to check whether the timestamp changes within the virtual machine's timestamp update period; the accuracy of the time value itself does not matter.
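
The change-only check can be illustrated with a minimal Python sketch. It is not taken from the patent: the function name `timestamp_changed`, the injected `read_stamp` callable, and the `margin` factor are all assumptions made for illustration. The point it shows is that the checker compares two raw readings taken a little more than one update period apart and never interprets the stamp as a clock value, so clock skew between hosts is irrelevant.

```python
import time
from typing import Callable


def timestamp_changed(
    read_stamp: Callable[[], bytes],
    update_period_s: float,
    sleep: Callable[[float], None] = time.sleep,
    margin: float = 1.5,  # assumed slack so one slow update is not misread as a fault
) -> bool:
    """Return True if the stamp on the virtual disk changes within one update period.

    Only equality of the two raw readings is tested; the stamp is never
    compared against the local clock, so its accuracy does not matter.
    """
    before = read_stamp()
    sleep(update_period_s * margin)
    after = read_stamp()
    return after != before
```

In this sketch `read_stamp` would wrap whatever mechanism the control host uses to read the marker from the suspected virtual disk; that mechanism is not specified in the text.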
The control module selects a suitable host on which to resume the failed virtual machine. After changing the owner of the virtual disk, it waits for a safety period and then starts the virtual machine to be recovered on the selected host;
Waiting for a safety period prevents a false alarm caused merely by a transient network fault from wrongly triggering failover, which could interrupt service or even corrupt data. It also ensures that the original virtual machine exits on its own within the safety period, preventing split-brain.
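
The ordering described here (change the owner first, wait, then start the virtual machine elsewhere) can be sketched as follows. This is a hedged Python illustration, not the patent's implementation: `VirtualDisk`, `fail_over`, `old_instance_should_exit`, and the host names are hypothetical, and host selection is reduced to "first candidate that is not the failed host". A further assumption is that the safety period is chosen longer than the interval at which the original instance re-checks ownership, so that it has time to notice the change and exit.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class VirtualDisk:
    owner: str  # host currently entitled to run the VM backed by this disk


def fail_over(
    disk: VirtualDisk,
    candidates: List[str],
    failed_host: str,
    safety_period_s: float,
    start_vm: Callable[[str], None],
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Change the disk owner, wait out the safety period, then start the VM."""
    target = next((h for h in candidates if h != failed_host), None)
    if target is None:
        raise RuntimeError("no healthy host available for recovery")
    disk.owner = target      # 1. revoke the old host's right to use the disk
    sleep(safety_period_s)   # 2. give the old instance time to notice and exit
    start_vm(target)         # 3. only now start the VM on the new host
    return target


def old_instance_should_exit(disk: VirtualDisk, my_host: str) -> bool:
    """Check run on the original host: exit once ownership has moved away."""
    return disk.owner != my_host
```

Starting the new instance only after the wait is what keeps two instances from writing to the same disk at once in this sketch.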
The beneficial effects of the present invention are as follows:
(1) With a high availability scheme based on the virtual machine monitor, the invention provides automated, simple, and reliable virtual machine fault recovery when a virtual machine stops running and its service is interrupted because of network faults, storage system faults, equipment faults, software faults, and similar causes, thereby ensuring the high availability of the virtual machine.
(2) By implementing the fault detection mechanism at the virtual machine monitor layer, the invention achieves simple and reliable virtual machine fault detection and recovery; by working together with a controller, it enables a flexible, controllable, easy-to-use, policy-based virtual machine fault recovery mechanism.
For these reasons, achieving a virtual machine high availability mechanism that is easy to deploy and operate, controllable, resource-efficient, and highly reliable requires a controller-based scheme in which the hypervisor layer provides the underlying support.
Brief description of the drawings
The present invention is further described below with reference to the accompanying drawings:
Fig. 1 is a flow chart of the method of the present invention;
Fig. 2 is a module deployment topology diagram of the present invention.
Detailed description
As shown in Figs. 1 and 2, the basic flow of the invention is:
Step 1: Start an independent monitoring process in the virtual machine monitor. When the virtual machine starts, check whether the disk belongs to this virtual machine; if it does not, raise an alert and exit; if it does, run normally and check the virtual machine's disk periodically;
Step 2: If the virtual machine's storage system is normal, the monitoring process writes a timestamp mark to the virtual machine's disk; if the storage system is abnormal, the timestamp cannot be updated;
Step 3: The monitoring module on the host where the virtual machine runs periodically checks the timestamp updated by the virtual machine; if it has not been updated normally, the module sends an alert to the control module on the control host;
Step 4: After receiving the alert, the control module checks the virtual machine's state again through the disk of the suspected faulty virtual machine. If the failure is confirmed, perform step 5; if there is no failure or the failure has already recovered, perform step 6. If the control module receives no heartbeat signal from the monitoring module, there is a problem at least somewhere between the monitoring module and the control module, and the control module also issues a warning.
Step 5: Change the owner of the virtual machine, wait for a safety period, and then recover the virtual machine on another host;
Step 6: Take no recovery action, report to the administrators, and end the process.
In the above flow, the detection process at the hypervisor level is the core and focus of this scheme and provides the underlying support mechanism for high availability. It is implemented inside the hypervisor: it runs its checks before the virtual machine starts, and keeps running while the virtual machine runs to detect potential failures. Pseudocode for the basic hypervisor detection flow is as follows:
The basic flow of the monitoring module on the host is as follows:
The basic flow of the control module on the control host is as follows:
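
The original listing is not preserved in this text. As a stand-in, the following Python sketch illustrates steps 1 and 2 under stated assumptions: `InMemoryMarker` is a hypothetical in-memory substitute for the owner and timestamp marker kept on the virtual disk, and `monitor_once` and `monitor_loop` are illustrative names, not identifiers from the patent or from any hypervisor.

```python
import time
from typing import Callable, Optional


class InMemoryMarker:
    """Hypothetical stand-in for the owner/timestamp marker on the virtual disk."""

    def __init__(self, owner: str) -> None:
        self.owner = owner
        self.stamp: Optional[float] = None
        self.storage_ok = True  # set to False to simulate a storage fault

    def read_owner(self) -> str:
        if not self.storage_ok:
            raise OSError("storage unavailable")
        return self.owner

    def write_stamp(self, value: float) -> None:
        if not self.storage_ok:
            raise OSError("storage unavailable")
        self.stamp = value


def monitor_once(marker: InMemoryMarker, vm_id: str,
                 now: Callable[[], float] = time.time) -> str:
    """One check: verify ownership, then refresh the stamp if storage works."""
    try:
        if marker.read_owner() != vm_id:
            return "not_owner"
        marker.write_stamp(now())
        return "ok"
    except OSError:
        # Storage fault: the stamp simply stops advancing (step 2).
        return "storage_error"


def monitor_loop(marker: InMemoryMarker, vm_id: str, interval_s: float,
                 alert: Callable[[str], None], stop_vm: Callable[[], None],
                 sleep: Callable[[float], None] = time.sleep) -> None:
    """Run the check periodically; alert and exit once the disk is not ours."""
    while True:
        if monitor_once(marker, vm_id) == "not_owner":
            alert(f"{vm_id} does not own its disk")
            stop_vm()
            return
        sleep(interval_s)
```

In a real hypervisor the loop would run in the dedicated thread described earlier and the marker would live on shared storage; both are abstracted away here.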
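
The original listing is not preserved in this text. The sketch below illustrates step 3 in Python under stated assumptions: `StampWatcher`, `check_vms`, and the `max_age_s` threshold are hypothetical names and parameters. The watcher remembers when each virtual machine's stamp was last seen to change, measured on the host's own monotonic clock, and reports staleness once that age exceeds the threshold.

```python
import time
from typing import Callable, Dict, Hashable, Iterable, Tuple


class StampWatcher:
    """Tracks, per VM, when its on-disk stamp was last seen to change."""

    def __init__(self, max_age_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_age_s = max_age_s
        self.clock = clock
        # vm_id -> (last stamp value, local time that value was first seen)
        self._last: Dict[str, Tuple[Hashable, float]] = {}

    def observe(self, vm_id: str, stamp: Hashable) -> bool:
        """Record a reading; return True if the stamp is stale (alert due)."""
        now = self.clock()
        prev = self._last.get(vm_id)
        if prev is None or prev[0] != stamp:
            self._last[vm_id] = (stamp, now)
            return False
        return now - prev[1] > self.max_age_s


def check_vms(watcher: StampWatcher, vm_ids: Iterable[str],
              read_stamp: Callable[[str], Hashable],
              send_alert: Callable[[str], None]) -> None:
    """One monitoring pass over all VMs on this host."""
    for vm_id in vm_ids:
        if watcher.observe(vm_id, read_stamp(vm_id)):
            send_alert(vm_id)
```

Here `send_alert` stands for whatever channel carries the alert to the control module, which the text does not specify.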
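
The original listing is not preserved in this text. The following Python sketch illustrates the decision in step 4 under stated assumptions: `Action`, `handle_alert`, and `heartbeat_missing` are hypothetical names, and the re-check of the disk, the recovery routine, and the reporting channel are passed in as callables because the text does not specify them.

```python
from enum import Enum
from typing import Callable


class Action(Enum):
    RECOVER = "recover"     # step 5: change owner, wait, restart elsewhere
    REPORT_ONLY = "report"  # step 6: no recovery, report to administrators


def handle_alert(vm_id: str,
                 stamp_is_updating: Callable[[str], bool],
                 recover: Callable[[str], None],
                 report: Callable[[str], None]) -> Action:
    """Re-check the suspected VM through its disk before acting on an alert."""
    if stamp_is_updating(vm_id):
        # Not a fault, or the fault has already cleared.
        report(f"{vm_id}: alert not confirmed, no recovery performed")
        return Action.REPORT_ONLY
    recover(vm_id)
    return Action.RECOVER


def heartbeat_missing(last_heartbeat_s: float, now_s: float,
                      timeout_s: float) -> bool:
    """True when the monitoring module has been silent longer than timeout_s."""
    return now_s - last_heartbeat_s > timeout_s
```

Confirming the fault independently on the control host is what lets a spurious alert end in a report rather than an unnecessary failover.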

Claims (8)

  1. A virtual machine high availability method, characterized in that: the virtual machine monitor starts a monitoring process; while the monitoring process finds the virtual machine's storage system to be normal, it writes a timestamp mark to the virtual machine's disk; if the monitoring module on the host detects that the timestamp is not being updated normally, it alerts the control module on the control host; after the control module confirms the failure, it changes the owner of the virtual machine and performs fault recovery.
  2. The method according to claim 1, characterized in that it specifically comprises the following steps:
    Step 1: Start an independent monitoring process in the virtual machine monitor. When the virtual machine starts, check whether the disk belongs to this virtual machine; if it does not, raise an alert and exit; if it does, run normally and check the virtual machine's disk periodically;
    Step 2: If the virtual machine's storage system is normal, the monitoring process writes a timestamp mark to the virtual machine's disk; if the storage system is abnormal, the timestamp cannot be updated;
    Step 3: The monitoring module on the host where the virtual machine runs periodically checks the timestamp updated by the virtual machine; if it has not been updated normally, the module sends an alert to the control module on the control host;
    Step 4: After receiving the alert, the control module checks the virtual machine's state again through the disk of the suspected faulty virtual machine. If the failure is confirmed, perform step 5; if there is no failure or the failure has already recovered, perform step 6. If the control module receives no heartbeat signal from the monitoring module, there is a problem at least somewhere between the monitoring module and the control module, and the control module also issues a warning.
    Step 5: Change the owner of the virtual machine, wait for a safety period, and then recover the virtual machine on another host;
    Step 6: Take no recovery action, report to the administrators, and end the process.
  3. The method according to claim 2, characterized in that: the virtual machine monitor is the hypervisor, i.e. the system on the host that actually runs and manages virtual machines, such as Xen or qemu-kvm;
    The host is the physical server that actually runs the virtual machine; the monitoring module runs on the host independently of the virtual machine monitor;
    The control host is the server in the cloud computing cluster responsible for running and providing the control service;
    The owner indicates which virtual machine on which host holds the right to use the virtual disk.
  4. The method according to claim 2, characterized in that: the monitoring process is a thread started before the virtual machine is started; the thread is dedicated to periodically checking that the virtual machine's disk is readable and writable, and to updating the timestamp to show that the hypervisor and the virtual machine can access the storage system normally.
  5. The method according to claim 3, characterized in that: the monitoring process is a thread started before the virtual machine is started; the thread is dedicated to periodically checking that the virtual machine's disk is readable and writable, and to updating the timestamp to show that the hypervisor and the virtual machine can access the storage system normally.
  6. The method according to any one of claims 2 to 5, characterized in that: the server on which the control module runs must be able to access the virtual disk of the suspected faulty virtual machine, and only needs to check whether the timestamp changes within the virtual machine's timestamp update period; the accuracy of the time value itself does not matter.
  7. The method according to any one of claims 2 to 5, characterized in that: the control module selects a suitable host on which to resume the failed virtual machine; after changing the owner of the virtual disk, it waits for a safety period and then starts the virtual machine to be recovered on the selected host;
    Waiting for a safety period prevents a false alarm caused merely by a transient network fault from wrongly triggering failover, which could interrupt service or even corrupt data; it also ensures that the original virtual machine exits on its own within the safety period, preventing split-brain.
  8. The method according to claim 6, characterized in that: the control module selects a suitable host on which to resume the failed virtual machine; after changing the owner of the virtual disk, it waits for a safety period and then starts the virtual machine to be recovered on the selected host;
    Waiting for a safety period prevents a false alarm caused merely by a transient network fault from wrongly triggering failover, which could interrupt service or even corrupt data; it also ensures that the original virtual machine exits on its own within the safety period, preventing split-brain.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710843325.XA CN107656845A (en) 2017-09-18 2017-09-18 A virtual machine high availability method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710843325.XA CN107656845A (en) 2017-09-18 2017-09-18 A virtual machine high availability method

Publications (1)

Publication Number Publication Date
CN107656845A (en) 2018-02-02

Family

ID=61130566

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710843325.XA Pending CN107656845A (en) 2017-09-18 2017-09-18 A virtual machine high availability method

Country Status (1)

Country Link
CN (1) CN107656845A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113220409A (en) * 2021-02-01 2021-08-06 浪潮云信息技术股份公司 Virtual machine monitoring system and method
GB2605268A (en) * 2020-03-31 2022-09-28 Imagination Tech Ltd Hypervisor Removal

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103209095A (en) * 2013-03-13 2013-07-17 广东新支点技术服务有限公司 Method and device for preventing split brain on basis of disk service lock
CN104268061A (en) * 2014-09-12 2015-01-07 国云科技股份有限公司 Storage state monitoring mechanism for virtual machine
US20160323427A1 (en) * 2014-01-22 2016-11-03 Shanghai Jiao Tong University A dual-machine hot standby disaster tolerance system and method for network services in virtualilzed environment

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2605268A (en) * 2020-03-31 2022-09-28 Imagination Tech Ltd Hypervisor Removal
GB2605268B (en) * 2020-03-31 2023-06-14 Imagination Tech Ltd Hypervisor Removal
CN113220409A (en) * 2021-02-01 2021-08-06 浪潮云信息技术股份公司 Virtual machine monitoring system and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180202