WO2015169199A1 - Anomaly recovery method for virtual machine in distributed environment - Google Patents

Anomaly recovery method for virtual machine in distributed environment

Info

Publication number
WO2015169199A1
Authority
WO
WIPO (PCT)
Prior art keywords
physical machine
abnormal
machine
physical
virtual machine
Prior art date
Application number
PCT/CN2015/078248
Other languages
English (en)
French (fr)
Inventor
柴洪峰
鲁志军
祖立军
严逸兴
Original Assignee
中国银联股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国银联股份有限公司
Priority to EP15788953.6A (EP3142011B9)
Priority to US15/308,497 (US10095576B2)
Publication of WO2015169199A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709 Error or fault processing not based on redundancy, the processing taking place in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • G06F11/0712 Error or fault processing not based on redundancy, the processing taking place in a virtual computing platform, e.g. logically partitioned systems
    • G06F11/0751 Error or fault detection not based on redundancy
    • G06F11/0766 Error or fault reporting or storing
    • G06F11/0787 Storage of error reports, e.g. persistent data storage, storage using memory protection
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G06F11/0793 Remedial or corrective actions
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1479 Generic software techniques for error detection or fault masking
    • G06F11/1482 Generic software techniques for error detection or fault masking by means of middleware or OS functionality
    • G06F11/1484 Generic software techniques for error detection or fault masking by means of middleware or OS functionality involving virtual machines

Definitions

  • The present invention relates to a virtual machine abnormality recovery method, and more particularly to a virtual machine abnormality recovery method in a distributed environment.
  • A virtual machine is a computer system that is simulated in software on a physical machine, has complete hardware system functions, and runs in a completely isolated environment.
  • High-availability technology means that, after physical machine A goes down or otherwise fails, the virtual machines running on physical machine A can be started on physical machine B without manual intervention, so that the virtual machines keep running. This technology is becoming more and more important.
  • In existing solutions, a logical cluster composed of multiple physical machines is defined as a high-availability unit, whereby, when any physical machine in the logical cluster goes down or fails, all virtual machines running on it are started on other physical machines in the same logical cluster.
  • The control node detects the state of a physical machine by heartbeat or by periodically pinging it; that is, once the control node cannot detect a physical machine, it considers that physical machine to have a problem.
  • The existing technical solutions have the following problems: (1) once a virtual machine is assigned to a high-availability cluster, it is highly available by default regardless of whether the services running on it are important, so such a design cannot guarantee that virtual machines running important services are started first, and it also wastes resources and creates redundancy; (2) because only the state of the physical machine is checked, detection is single-mode and one-sided, which may cause misjudgment (for example, if a physical machine has ping disabled, the virtual machines on a normally running physical machine may be migrated to another physical machine); (3) because probing of the physical machine state is initiated only from the control node, the judgment of the physical machine state is not sufficiently comprehensive or accurate.
  • The present invention proposes a virtual machine abnormality recovery method capable of accurately determining and efficiently handling physical machine failures in a distributed environment.
  • A virtual machine abnormality recovery method in a distributed environment, the method including the following steps:
  • The high availability controller periodically polls the state database to check the operating state of all physical machines in the physical machine cluster under the control of the high availability controller;
  • The exception handling operation includes: the high availability controller probing the connectivity of the physical machine whose operating state is abnormal to the management network, wherein the probing is performed in the following two ways: (1) pinging the physical machine; (2) monitoring port 22 of the physical machine.
  • The exception handling operation further includes: if probing in either way finds that the physical machine whose operating state is abnormal is reachable over the management network, the exception handling operation ends; if probing in both ways finds that it is unreachable over the management network, the connectivity to the service network of all valid virtual machines running on that physical machine is probed; if any valid virtual machine is reachable over the service network, the exception handling operation ends; and if all valid virtual machines are unreachable over the service network, a secondary voting operation is performed to finally confirm whether the physical machine whose operating state is abnormal has failed.
  • The secondary voting operation includes: (1) the high availability controller randomly selects several physical machines from the physical machine cluster, excluding the physical machine whose operating state is abnormal; (2) the high availability controller instructs each selected physical machine to probe the connectivity of the abnormal physical machine to the management network and/or the service network by pinging it and by monitoring its port 22; (3) if any one of the selected physical machines finds that the abnormal physical machine is reachable over the management network or the service network, the secondary voting operation ends with the result "the physical machine whose operating state is abnormal has not failed"; if all of the selected physical machines find that it is unreachable over both the management network and the service network, the secondary voting operation ends with the result "the physical machine whose operating state is abnormal has failed", and the virtual machine migration operation is then performed.
  • The virtual machine migration operation includes: (1) the high availability controller sends a shutdown command via the Intelligent Platform Management Interface (IPMI) to the physical machine whose operating state is abnormal to put it in a powered-off state, thereby destroying the virtual machines residing in its memory; (2) the high availability controller sends a migration scheduling instruction to the scheduling controller; (3) after receiving the migration scheduling instruction, the scheduling controller selects physical machines in the physical machine cluster that have idle resources, and then sends migration instructions one by one to the selected physical machines, so as to migrate all valid virtual machines that were running on the abnormal physical machine to the selected physical machines, wherein the virtual machines to be migrated that are assigned to different physical machines with idle resources are different from one another; (4) via a shared storage device, the computing component running on each physical machine with idle resources migrates the virtual machines assigned to that physical machine to that physical machine.
  • The user can configure a high-availability flag for each virtual machine; before performing the virtual machine migration operation, the high availability controller checks the high-availability flags of all valid virtual machines running on the physical machine whose operating state is abnormal, and performs the subsequent virtual machine migration operation only for virtual machines whose high-availability flag is set to "enabled".
  • The user can configure a high-availability priority for each virtual machine, and the high availability controller migrates the virtual machines to be migrated one after another in order of their high-availability priority.
  • The virtual machine abnormality recovery method in a distributed environment disclosed by the present invention has the following advantages: (1) it ensures that virtual machines running important services are started and recovered first, and it saves resources; (2) because the network detection methods are diverse and comprehensive, the possibility of misjudgment is significantly reduced; (3) because probing of a physical machine's state can be initiated not only from the control node but also from other, randomly selected physical machines, the state of the physical machine can be determined more comprehensively and accurately.
  • FIG. 1 is a flow chart of a virtual machine abnormality recovery method in a distributed environment according to an embodiment of the present invention.
  • The virtual machine abnormality recovery method in a distributed environment disclosed by the present invention includes the following steps: (A1) an independent computing component runs on each physical machine on which virtual machines reside, and the computing component periodically (for example, every 1 minute) reports the current operating state of the corresponding physical machine to the state database; (A2) the high availability controller periodically (for example, every 2 seconds) polls the state database to check the operating state of all physical machines in the physical machine cluster under the control of the high availability controller; (A3) if the operating states of all physical machines in the physical machine cluster are normal, the current check ends; if the operating states of multiple physical machines in the physical machine cluster are abnormal, the current check ends and an alarm is raised by logging; and if the operating state of only one physical machine in the physical machine cluster is abnormal (for example, a physical machine has not reported its own operating state within 1 minute), a subsequent exception handling operation is performed to ensure that the virtual machines on that physical machine continue to run normally.
  • The exception handling operation includes: the high availability controller probing the connectivity of the physical machine whose operating state is abnormal to the management network, wherein the probing is performed in the following two ways: (1) pinging the physical machine; (2) monitoring port 22 of the physical machine.
  • The exception handling operation further includes: if probing in either way finds that the physical machine whose operating state is abnormal is reachable over the management network, the exception handling operation ends; if probing in both ways finds that it is unreachable over the management network, the connectivity to the service network of all valid virtual machines running on that physical machine is probed; if any valid virtual machine is reachable over the service network, the exception handling operation ends; and if all valid virtual machines are unreachable over the service network, a secondary voting operation is performed to finally confirm whether the physical machine whose operating state is abnormal has failed.
  • The secondary voting operation includes: (1) the high availability controller randomly selects several physical machines (for example, three) from the physical machine cluster, excluding the physical machine whose operating state is abnormal; (2) the high availability controller instructs each selected physical machine to probe the connectivity of the abnormal physical machine to the management network and/or the service network by pinging it and by monitoring its port 22; (3) if any one of the selected physical machines finds that the abnormal physical machine is reachable over the management network or the service network, the secondary voting operation ends with the result "the physical machine whose operating state is abnormal has not failed"; if all of the selected physical machines find that it is unreachable over both the management network and the service network, the secondary voting operation ends with the result "the physical machine whose operating state is abnormal has failed", and the virtual machine migration operation is then performed.
  • The virtual machine migration operation includes: (1) the high availability controller sends a shutdown command via the Intelligent Platform Management Interface (IPMI) to the physical machine whose operating state is abnormal to put it in a powered-off state (that is, it no longer provides any service externally), thereby destroying the virtual machines residing in its memory (by way of example, if the IPMI is abnormal, the virtual machine migration operation is not stopped, but an alarm is raised in the form of a log); (2) the high availability controller sends a migration scheduling instruction to the scheduling controller; (3) after receiving the migration scheduling instruction, the scheduling controller selects physical machines in the physical machine cluster that have idle resources, and then sends migration instructions one by one to the selected physical machines, so as to migrate all valid (active) virtual machines that were running on the abnormal physical machine to the selected physical machines, wherein the virtual machines to be migrated that are assigned to different physical machines with idle resources are different from one another; (4) via a shared storage device, the computing component running on each physical machine with idle resources migrates the virtual machines assigned to that physical machine to that physical machine.
  • To ensure that, at any point in time, each independent virtual machine image file has one and only one virtual machine instance running in the entire distributed system, the high availability controller changes the storage directory of the virtual machine's image file, to prevent the physical machine whose operating state is abnormal from starting a virtual machine instance during the migration.
  • The user can configure a high-availability flag for each virtual machine; before performing the virtual machine migration operation, the high availability controller checks the high-availability flags of all valid virtual machines running on the physical machine whose operating state is abnormal, and performs the subsequent virtual machine migration operation only for virtual machines whose high-availability flag is set to "enabled".
  • The user can configure a high-availability priority for each virtual machine, and the high availability controller migrates the virtual machines to be migrated one after another in order of their high-availability priority.
  • If the high-availability priority of a virtual machine is configured as "high", sufficient idle resources must be reserved for that virtual machine to guarantee that it can be migrated; if the high-availability priority is configured as "medium" or "low", the corresponding priority order is respected during migration, but there is no guarantee that sufficient idle resources have been reserved.
  • The virtual machine abnormality recovery method in a distributed environment disclosed by the present invention has the following advantages: (1) it ensures that virtual machines running important services are started and recovered first, and it saves resources; (2) because the network detection methods are diverse and comprehensive, the possibility of misjudgment is significantly reduced; (3) because probing of a physical machine's state can be initiated not only from the control node but also from other, randomly selected physical machines, the state of the physical machine can be determined more comprehensively and accurately.

Abstract

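A virtual machine abnormality recovery method in a distributed environment, the method comprising: running an independent computing component on each physical machine on which virtual machines reside, the computing component periodically reporting the current operating state of the corresponding physical machine to a state database (A1); a high availability controller periodically polling the state database to check the operating state of all physical machines in the physical machine cluster under the control of the high availability controller (A2); and, if the operating state of only one physical machine in the physical machine cluster is abnormal, performing a subsequent exception handling operation to ensure that the virtual machines on that physical machine continue to run normally (A3). The method can accurately determine and efficiently handle physical machine failures in a distributed environment.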

Description

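Anomaly recovery method for virtual machine in distributed environment
Technical Field
The present invention relates to a virtual machine abnormality recovery method, and more particularly to a virtual machine abnormality recovery method in a distributed environment.
Background Art
At present, with the increasingly wide use of computers and networks and the growing variety of services in different fields, high-availability technology for virtual machines in a distributed environment is becoming more and more important. Here, a virtual machine is a computer system that is simulated in software on a physical machine, has complete hardware system functions, and runs in a completely isolated environment. High-availability technology means that, after physical machine A goes down or otherwise fails, the virtual machines running on physical machine A can be started on physical machine B without manual intervention, so that the virtual machines keep running.
In existing technical solutions, virtual machine high availability in a distributed environment is usually implemented as follows: a logical cluster composed of multiple physical machines is defined as a high-availability unit, so that when any physical machine in the logical cluster goes down or fails, all virtual machines running on it are started on other physical machines in the same logical cluster. In addition, the control node detects the state of a physical machine by heartbeat or by periodically pinging it; that is, once the control node cannot detect a physical machine, it considers that physical machine to have a problem.
However, the existing technical solutions have the following problems: (1) once a virtual machine is assigned to a high-availability cluster, it is highly available by default regardless of whether the services running on it are important, so such a design cannot guarantee that virtual machines running important services are started first, and it also wastes resources and creates redundancy; (2) because only the state of the physical machine is checked, detection is single-mode and one-sided, which may cause misjudgment (for example, if a physical machine has ping disabled, the virtual machines on a normally running physical machine may be migrated to another physical machine); (3) because probing of the physical machine state is initiated only from the control node, the judgment of the physical machine state is not sufficiently comprehensive or accurate.
Therefore, there is a need for a virtual machine abnormality recovery method that can accurately determine and efficiently handle physical machine failures in a distributed environment.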
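Summary of the Invention
To solve the problems of the existing technical solutions described above, the present invention proposes a virtual machine abnormality recovery method capable of accurately determining and efficiently handling physical machine failures in a distributed environment.
The object of the present invention is achieved by the following technical solution:
A virtual machine abnormality recovery method in a distributed environment, the method including the following steps:
(A1) an independent computing component runs on each physical machine on which virtual machines reside, and the computing component periodically reports the current operating state of the corresponding physical machine to a state database;
(A2) a high availability controller periodically polls the state database to check the operating state of all physical machines in the physical machine cluster under the control of the high availability controller;
(A3) if the operating states of all physical machines in the physical machine cluster are normal, the current check ends; if the operating states of multiple physical machines in the physical machine cluster are abnormal, the current check ends and an alarm is raised by logging; and if the operating state of only one physical machine in the physical machine cluster is abnormal, a subsequent exception handling operation is performed to ensure that the virtual machines on that physical machine continue to run normally.
In the solution disclosed above, preferably, the exception handling operation includes: the high availability controller probing the connectivity of the physical machine whose operating state is abnormal to the management network, wherein the probing is performed in the following two ways: (1) pinging the physical machine; (2) monitoring port 22 of the physical machine.
In the solution disclosed above, preferably, the exception handling operation further includes: if probing in either way finds that the physical machine whose operating state is abnormal is reachable over the management network, the exception handling operation ends; if probing in both ways finds that it is unreachable over the management network, the connectivity to the service network of all valid virtual machines running on that physical machine is probed; if any valid virtual machine is reachable over the service network, the exception handling operation ends; and if all valid virtual machines are unreachable over the service network, a secondary voting operation is performed to finally confirm whether the physical machine whose operating state is abnormal has failed.
In the solution disclosed above, preferably, the secondary voting operation includes: (1) the high availability controller randomly selects several physical machines from the physical machine cluster, excluding the physical machine whose operating state is abnormal; (2) the high availability controller instructs each selected physical machine to probe the connectivity of the abnormal physical machine to the management network and/or the service network by pinging it and by monitoring its port 22; (3) if any one of the selected physical machines finds that the abnormal physical machine is reachable over the management network or the service network, the secondary voting operation ends with the result "the physical machine whose operating state is abnormal has not failed"; if all of the selected physical machines find that it is unreachable over both the management network and the service network, the secondary voting operation ends with the result "the physical machine whose operating state is abnormal has failed", and the virtual machine migration operation is then performed.
In the solution disclosed above, preferably, the virtual machine migration operation includes: (1) the high availability controller sends a shutdown command via the Intelligent Platform Management Interface (IPMI) to the physical machine whose operating state is abnormal to put it in a powered-off state, thereby destroying the virtual machines residing in its memory; (2) the high availability controller sends a migration scheduling instruction to the scheduling controller; (3) after receiving the migration scheduling instruction, the scheduling controller selects physical machines in the physical machine cluster that have idle resources, and then sends migration instructions one by one to the selected physical machines, so as to migrate all valid virtual machines that were running on the abnormal physical machine to the selected physical machines, wherein the virtual machines to be migrated that are assigned to different physical machines with idle resources are different from one another; (4) via a shared storage device, the computing component running on each physical machine with idle resources migrates the virtual machines assigned to that physical machine to that physical machine.
In the solution disclosed above, preferably, the user can configure a high-availability flag for each virtual machine; before performing the virtual machine migration operation, the high availability controller checks the high-availability flags of all valid virtual machines running on the physical machine whose operating state is abnormal, and performs the subsequent virtual machine migration operation only for virtual machines whose high-availability flag is set to "enabled".
In the solution disclosed above, preferably, the user can configure a high-availability priority for each virtual machine, and the high availability controller migrates the virtual machines to be migrated one after another in order of their high-availability priority.
The virtual machine abnormality recovery method in a distributed environment disclosed by the present invention has the following advantages: (1) it ensures that virtual machines running important services are started and recovered first, and it saves resources; (2) because the network detection methods are diverse and comprehensive, the possibility of misjudgment is significantly reduced; (3) because probing of a physical machine's state can be initiated not only from the control node but also from other, randomly selected physical machines, the state of the physical machine can be determined more comprehensively and accurately.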
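Brief Description of the Drawings
The technical features and advantages of the present invention will be better understood by those skilled in the art with reference to the accompanying drawing, in which:
FIG. 1 is a flow chart of a virtual machine abnormality recovery method in a distributed environment according to an embodiment of the present invention.
Detailed Description
FIG. 1 is a flow chart of a virtual machine abnormality recovery method in a distributed environment according to an embodiment of the present invention. As shown in FIG. 1, the method disclosed by the present invention includes the following steps: (A1) an independent computing component runs on each physical machine on which virtual machines reside, and the computing component periodically (for example, every 1 minute) reports the current operating state of the corresponding physical machine to a state database; (A2) a high availability controller periodically (for example, every 2 seconds) polls the state database to check the operating state of all physical machines in the physical machine cluster under the control of the high availability controller; (A3) if the operating states of all physical machines in the physical machine cluster are normal, the current check ends; if the operating states of multiple physical machines in the physical machine cluster are abnormal, the current check ends and an alarm is raised by logging; and if the operating state of only one physical machine in the physical machine cluster is abnormal (for example, a physical machine has not reported its own operating state within 1 minute), a subsequent exception handling operation is performed to ensure that the virtual machines on that physical machine continue to run normally.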
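As a purely illustrative sketch of the three-way decision in step (A3), the following Python fragment classifies one polling pass. It assumes the state database can be read as a mapping from host name to the time of the last report; the names `check_cluster`, `CheckResult` and the 60-second timeout constant are hypothetical and are not taken from the patent.

```python
from enum import Enum


class CheckResult(Enum):
    ALL_NORMAL = "all_normal"        # end this check
    ALERT_ONLY = "alert_only"        # several hosts abnormal: log an alarm, end this check
    HANDLE_SINGLE = "handle_single"  # exactly one host abnormal: run exception handling


# Assumption: a host is "abnormal" if it has not reported within 1 minute,
# following the example given in the text.
REPORT_TIMEOUT_S = 60.0


def check_cluster(last_report, now, timeout=REPORT_TIMEOUT_S):
    """Classify one polling pass over the state database.

    last_report maps host name -> timestamp (seconds) of its latest report.
    Returns (CheckResult, sorted list of abnormal host names).
    """
    abnormal = sorted(h for h, t in last_report.items() if now - t > timeout)
    if not abnormal:
        return CheckResult.ALL_NORMAL, abnormal
    if len(abnormal) > 1:
        return CheckResult.ALERT_ONLY, abnormal
    return CheckResult.HANDLE_SINGLE, abnormal
```

In a real controller this function would be called from a periodic loop (for example, every 2 seconds, as in the text), with the single abnormal host handed to the exception handling operation.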
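Preferably, in the virtual machine abnormality recovery method in a distributed environment disclosed by the present invention, the exception handling operation includes: the high availability controller probing the connectivity of the physical machine whose operating state is abnormal to the management network, wherein the probing is performed in the following two ways: (1) pinging the physical machine; (2) monitoring port 22 of the physical machine.
Preferably, in the virtual machine abnormality recovery method in a distributed environment disclosed by the present invention, the exception handling operation further includes: if probing in either way finds that the physical machine whose operating state is abnormal is reachable over the management network, the exception handling operation ends; if probing in both ways finds that it is unreachable over the management network, the connectivity to the service network of all valid virtual machines running on that physical machine is probed; if any valid virtual machine is reachable over the service network, the exception handling operation ends; and if all valid virtual machines are unreachable over the service network, a secondary voting operation is performed to finally confirm whether the physical machine whose operating state is abnormal has failed.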
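The probing cascade described in the two paragraphs above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the probes are passed in as callables so that the decision logic stays independent of how ping or virtual machine reachability is actually measured, and the function names `tcp_port_open` and `needs_second_vote` are hypothetical. Only the port check is given a concrete body, using the standard library's `socket.create_connection`.

```python
import socket


def tcp_port_open(host, port=22, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def needs_second_vote(host, vms, ping, port_open, vm_reachable):
    """Decide whether the secondary voting operation should run.

    ping(host) and port_open(host) probe the management network;
    vm_reachable(vm) probes one valid VM over the service network.
    All three are caller-supplied callables returning bool.
    """
    # Reachable over the management network by either probe: stop here.
    if ping(host) or port_open(host):
        return False
    # Any valid VM still reachable over the service network: stop here.
    if any(vm_reachable(vm) for vm in vms):
        return False
    # Both management probes failed and no VM answered: escalate.
    # (Assumption: a host with no valid VMs also escalates.)
    return True
```

A caller might pass `tcp_port_open` as the `port_open` argument and wrap the system `ping` utility for the `ping` argument; how virtual machine reachability is measured is left open by the text.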
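Preferably, in the virtual machine abnormality recovery method in a distributed environment disclosed by the present invention, the secondary voting operation includes: (1) the high availability controller randomly selects several physical machines (for example, three) from the physical machine cluster, excluding the physical machine whose operating state is abnormal; (2) the high availability controller instructs each selected physical machine to probe the connectivity of the abnormal physical machine to the management network and/or the service network by pinging it and by monitoring its port 22; (3) if any one of the selected physical machines finds that the abnormal physical machine is reachable over the management network or the service network, the secondary voting operation ends with the result "the physical machine whose operating state is abnormal has not failed"; if all of the selected physical machines find that it is unreachable over both the management network and the service network, the secondary voting operation ends with the result "the physical machine whose operating state is abnormal has failed", and the virtual machine migration operation is then performed.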
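The voting rule above amounts to a veto: a single peer that can still reach the suspect host is enough to conclude that it has not failed. A hedged sketch follows; `second_vote` and its parameters are hypothetical names, `probe(voter, suspect)` stands in for the remote ping and port-22 checks, and the behaviour when no peers are available is an assumption, since the text does not cover that case.

```python
import random


def second_vote(suspect, cluster, probe, sample_size=3, rng=random):
    """Return True if the suspect host is judged to have failed.

    cluster is an iterable of host names; probe(voter, suspect) returns
    True when the voter can reach the suspect over either network.
    """
    candidates = [h for h in cluster if h != suspect]
    if not candidates:
        # Assumption: with no peers to ask, do not declare a failure.
        return False
    voters = rng.sample(candidates, min(sample_size, len(candidates)))
    # Any voter that reaches the suspect vetoes the failure verdict.
    return not any(probe(voter, suspect) for voter in voters)
```

Passing a seeded `random.Random` instance as `rng` makes the voter selection reproducible, which can be convenient when testing.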
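Preferably, in the virtual machine abnormality recovery method in a distributed environment disclosed by the present invention, the virtual machine migration operation includes: (1) the high availability controller sends a shutdown command via the Intelligent Platform Management Interface (IPMI) to the physical machine whose operating state is abnormal to put it in a powered-off state (that is, it no longer provides any service externally), thereby destroying the virtual machines residing in its memory (by way of example, if the IPMI is abnormal, the virtual machine migration operation is not stopped, but an alarm is raised in the form of a log); (2) the high availability controller sends a migration scheduling instruction to the scheduling controller; (3) after receiving the migration scheduling instruction, the scheduling controller selects physical machines in the physical machine cluster that have idle resources, and then sends migration instructions one by one to the selected physical machines, so as to migrate all valid (active) virtual machines that were running on the abnormal physical machine to the selected physical machines, wherein the virtual machines to be migrated that are assigned to different physical machines with idle resources are different from one another; (4) via a shared storage device, the computing component running on each physical machine with idle resources migrates the virtual machines assigned to that physical machine to that physical machine. By way of example, to ensure that, at any point in time, each independent virtual machine image file has one and only one virtual machine instance running in the entire distributed system, the high availability controller changes the storage directory of the virtual machine's image file, to prevent the physical machine whose operating state is abnormal from starting a virtual machine instance during the migration.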
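One detail of the migration operation that is easy to miss is that a failed IPMI shutdown is logged but does not stop the migration. The sketch below illustrates only that ordering and error handling for sub-steps (1) and (2); `power_off` and `request_migration` are hypothetical callables standing in for the IPMI command and the instruction to the scheduling controller, and the logger name is likewise an assumption.

```python
import logging

log = logging.getLogger("ha_controller")  # hypothetical logger name


def fence_and_schedule(host, power_off, request_migration):
    """Power off the failed host, then request migration of its VMs.

    Returns True if the power-off call succeeded, False if it raised.
    Migration is requested in both cases, as in the example in the text.
    """
    fenced = True
    try:
        power_off(host)  # stand-in for the IPMI shutdown command
    except Exception as exc:
        fenced = False
        # An IPMI problem is reported through the log, not treated as fatal.
        log.warning("IPMI power-off of %s failed: %s", host, exc)
    request_migration(host)  # stand-in for the migration scheduling instruction
    return fenced
```

Because an unfenced host could in principle restart its virtual machines, the text pairs this behaviour with the change of image file storage directory described above; that safeguard is not modelled here.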
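Preferably, in the virtual machine abnormality recovery method in a distributed environment disclosed by the present invention, the user can configure a high-availability flag for each virtual machine; before performing the virtual machine migration operation, the high availability controller checks the high-availability flags of all valid virtual machines running on the physical machine whose operating state is abnormal, and performs the subsequent virtual machine migration operation only for virtual machines whose high-availability flag is set to "enabled".
Preferably, in the virtual machine abnormality recovery method in a distributed environment disclosed by the present invention, the user can configure a high-availability priority for each virtual machine, and the high availability controller migrates the virtual machines to be migrated one after another in order of their high-availability priority. By way of example, if the high-availability priority of a virtual machine is configured as "high", sufficient idle resources must be reserved for that virtual machine to guarantee that it can be migrated; if the high-availability priority is configured as "medium" or "low", the corresponding priority order is respected during migration, but there is no guarantee that sufficient idle resources have been reserved.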
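The flag filter and priority ordering described above can be illustrated with a small planning function. This is a sketch under several assumptions that are not in the patent: resources are reduced to a single abstract integer `size`, the first host with enough free capacity is chosen, and the advance reservation of resources for "high" priority virtual machines is not modelled. The names `VM`, `PRIORITY_ORDER` and `plan_migration` are hypothetical.

```python
from dataclasses import dataclass

# Lower number = migrated earlier.
PRIORITY_ORDER = {"high": 0, "medium": 1, "low": 2}


@dataclass
class VM:
    name: str
    ha_enabled: bool  # the high-availability flag
    priority: str     # "high", "medium" or "low"
    size: int         # abstract resource units (illustrative only)


def plan_migration(vms, free):
    """Return a list of (vm name, target host) pairs.

    vms: VMs that were running on the failed host.
    free: mapping host name -> free resource units.
    Only VMs with the flag enabled are considered, highest priority first.
    """
    remaining = dict(free)
    candidates = sorted(
        (vm for vm in vms if vm.ha_enabled),
        key=lambda vm: PRIORITY_ORDER[vm.priority],
    )
    plan = []
    for vm in candidates:
        target = next((h for h, cap in remaining.items() if cap >= vm.size), None)
        if target is None:
            continue  # no capacity left; lower priorities are not guaranteed
        remaining[target] -= vm.size
        plan.append((vm.name, target))
    return plan
```

Each virtual machine appears at most once in the plan, which matches the requirement that the virtual machines assigned to different target hosts are different from one another.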
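As can be seen from the above, the virtual machine abnormality recovery method in a distributed environment disclosed by the present invention has the following advantages: (1) it ensures that virtual machines running important services are started and recovered first, and it saves resources; (2) because the network detection methods are diverse and comprehensive, the possibility of misjudgment is significantly reduced; (3) because probing of a physical machine's state can be initiated not only from the control node but also from other, randomly selected physical machines, the state of the physical machine can be determined more comprehensively and accurately.
Although the present invention has been described by way of the preferred embodiments above, its implementation is not limited to those embodiments. It should be recognized that those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope.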

Claims (7)

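  1. A virtual machine abnormality recovery method in a distributed environment, the method comprising the following steps:
    (A1) running an independent computing component on each physical machine on which virtual machines reside, the computing component periodically reporting the current operating state of the corresponding physical machine to a state database;
    (A2) a high availability controller periodically polling the state database to check the operating state of all physical machines in the physical machine cluster under the control of the high availability controller;
    (A3) if the operating states of all physical machines in the physical machine cluster are normal, ending the current check; if the operating states of multiple physical machines in the physical machine cluster are abnormal, ending the current check and raising an alarm by logging; and if the operating state of only one physical machine in the physical machine cluster is abnormal, performing a subsequent exception handling operation to ensure that the virtual machines on that physical machine continue to run normally.
  2. The virtual machine abnormality recovery method in a distributed environment according to claim 1, characterized in that the exception handling operation comprises: the high availability controller probing the connectivity of the physical machine whose operating state is abnormal to the management network, wherein the probing is performed in the following two ways: (1) pinging the physical machine; (2) monitoring port 22 of the physical machine.
  3. The virtual machine abnormality recovery method in a distributed environment according to claim 2, characterized in that the exception handling operation further comprises: if probing in either way finds that the physical machine whose operating state is abnormal is reachable over the management network, the exception handling operation ends; if probing in both ways finds that it is unreachable over the management network, the connectivity to the service network of all valid virtual machines running on that physical machine is probed; if any valid virtual machine is reachable over the service network, the exception handling operation ends; and if all valid virtual machines are unreachable over the service network, a secondary voting operation is performed to finally confirm whether the physical machine whose operating state is abnormal has failed.
  4. The virtual machine abnormality recovery method in a distributed environment according to claim 3, characterized in that the secondary voting operation comprises: (1) the high availability controller randomly selects several physical machines from the physical machine cluster, excluding the physical machine whose operating state is abnormal; (2) the high availability controller instructs each selected physical machine to probe the connectivity of the abnormal physical machine to the management network and/or the service network by pinging it and by monitoring its port 22; (3) if any one of the selected physical machines finds that the abnormal physical machine is reachable over the management network or the service network, the secondary voting operation ends with the result "the physical machine whose operating state is abnormal has not failed"; if all of the selected physical machines find that it is unreachable over both the management network and the service network, the secondary voting operation ends with the result "the physical machine whose operating state is abnormal has failed", and the virtual machine migration operation is then performed.
  5. The virtual machine abnormality recovery method in a distributed environment according to claim 4, characterized in that the virtual machine migration operation comprises: (1) the high availability controller sends a shutdown command via the Intelligent Platform Management Interface (IPMI) to the physical machine whose operating state is abnormal to put it in a powered-off state, thereby destroying the virtual machines residing in its memory; (2) the high availability controller sends a migration scheduling instruction to the scheduling controller; (3) after receiving the migration scheduling instruction, the scheduling controller selects physical machines in the physical machine cluster that have idle resources, and then sends migration instructions one by one to the selected physical machines, so as to migrate all valid virtual machines that were running on the abnormal physical machine to the selected physical machines, wherein the virtual machines to be migrated that are assigned to different physical machines with idle resources are different from one another; (4) via a shared storage device, the computing component running on each physical machine with idle resources migrates the virtual machines assigned to that physical machine to that physical machine.
  6. The virtual machine abnormality recovery method in a distributed environment according to claim 5, characterized in that the user can configure a high-availability flag for each virtual machine; before performing the virtual machine migration operation, the high availability controller checks the high-availability flags of all valid virtual machines running on the physical machine whose operating state is abnormal, and performs the subsequent virtual machine migration operation only for virtual machines whose high-availability flag is set to "enabled".
  7. The virtual machine abnormality recovery method in a distributed environment according to claim 6, characterized in that the user can configure a high-availability priority for each virtual machine, and the high availability controller migrates the virtual machines to be migrated one after another in order of their high-availability priority.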
PCT/CN2015/078248 2014-05-08 2015-05-05 Anomaly recovery method for virtual machine in distributed environment WO2015169199A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP15788953.6A EP3142011B9 (en) 2014-05-08 2015-05-05 Anomaly recovery method for virtual machine in distributed environment
US15/308,497 US10095576B2 (en) 2014-05-08 2015-05-05 Anomaly recovery method for virtual machine in distributed environment

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410191655.1 2014-05-08
CN201410191655.1A CN105095001B (zh) 2014-05-08 2014-05-08 Anomaly recovery method for virtual machine in distributed environment

Publications (1)

Publication Number Publication Date
WO2015169199A1 true WO2015169199A1 (zh) 2015-11-12

Family

ID=54392140

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/078248 WO2015169199A1 (zh) 2014-05-08 2015-05-05 Anomaly recovery method for virtual machine in distributed environment

Country Status (4)

Country Link
US (1) US10095576B2 (zh)
EP (1) EP3142011B9 (zh)
CN (1) CN105095001B (zh)
WO (1) WO2015169199A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106874111A (zh) * 2017-01-11 2017-06-20 深圳证券通信有限公司 一种云计算平台的虚拟机高可用性管理方法

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106980528A (zh) * 2016-01-18 2017-07-25 中兴通讯股份有限公司 一种恢复虚拟机的方法和装置
CN107453888B (zh) * 2016-05-31 2020-11-20 深信服科技股份有限公司 高可用性的虚拟机集群的管理方法及装置
CN107544839B (zh) * 2016-06-27 2021-05-25 腾讯科技(深圳)有限公司 虚拟机迁移系统、方法及装置
CN107870801B (zh) * 2016-09-26 2020-05-26 中国电信股份有限公司 虚拟机高可用功能自动开通方法、装置和系统
JP2018170618A (ja) * 2017-03-29 2018-11-01 Kddi株式会社 障害自動復旧システム、制御装置、手順作成装置およびプログラム
CN107491344B (zh) * 2017-09-26 2020-09-01 北京思特奇信息技术股份有限公司 一种实现虚拟机高可用性的方法及装置
CN109491836B (zh) * 2018-10-30 2021-04-27 京信通信系统(中国)有限公司 数据恢复方法、装置及基站
CN109710377B (zh) * 2018-12-14 2023-06-30 国云科技股份有限公司 一种从故障的分布式存储里恢复kvm虚拟机的方法
CN110532090B (zh) * 2019-08-16 2022-03-15 国网冀北电力有限公司 私有云计算业务恢复调度方法及装置
CN112148485A (zh) * 2020-09-16 2020-12-29 杭州安恒信息技术股份有限公司 超融合平台故障恢复方法、装置、电子装置和存储介质
CN113608826A (zh) * 2021-06-29 2021-11-05 济南浪潮数据技术有限公司 虚拟化平台迁移方法、装置、电子设备及可读存储介质
CN113568710B (zh) * 2021-08-03 2023-07-21 罗慧 一种虚拟机高可用实现方法、装置和设备
CN113765709B (zh) * 2021-08-23 2022-09-20 中国人寿保险股份有限公司上海数据中心 基于Openstack云平台多维监控的虚拟机高可用实现系统及方法
CN114090184B (zh) * 2021-11-26 2022-11-29 中电信数智科技有限公司 一种虚拟化集群高可用性的实现方法和设备
CN114553917B (zh) * 2021-12-30 2024-01-26 北京天成通链科技有限公司 一种基于区块链的网络智能治理方法

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102708018A (zh) * 2012-04-20 2012-10-03 华为技术有限公司 一种异常处理方法及系统、代理设备与控制装置
CN102819465A (zh) * 2012-06-29 2012-12-12 华中科技大学 一种虚拟化环境中故障恢复的方法
JP2013254354A (ja) * 2012-06-07 2013-12-19 Mitsubishi Electric Corp コンピュータ装置及びソフトウェア管理方法及びプログラム
CN103729280A (zh) * 2013-12-23 2014-04-16 国云科技股份有限公司 一种虚拟机高可用机制

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7197561B1 (en) * 2001-03-28 2007-03-27 Shoregroup, Inc. Method and apparatus for maintaining the status of objects in computer networks using virtual state machines
US6986076B1 (en) * 2002-05-28 2006-01-10 Unisys Corporation Proactive method for ensuring availability in a clustered system
US20060080678A1 (en) * 2004-09-07 2006-04-13 Bailey Mark W Task distribution method for protecting servers and tasks in a distributed system
US7925923B1 (en) * 2008-01-31 2011-04-12 Hewlett-Packard Development Company, L.P. Migrating a virtual machine in response to failure of an instruction to execute
US8566650B2 (en) * 2009-08-04 2013-10-22 Red Hat Israel, Ltd. Virtual machine infrastructure with storage domain monitoring
CN102708818B (zh) 2012-04-24 2014-07-09 京东方科技集团股份有限公司 一种移位寄存器和显示器
CN103118121B (zh) * 2013-02-19 2017-05-17 浪潮电子信息产业股份有限公司 一种高可用集群在虚拟化技术中的应用方法
US9146819B2 (en) * 2013-07-02 2015-09-29 International Business Machines Corporation Using RDMA for fast system recovery in virtualized environments
CN103440160B (zh) * 2013-08-15 2016-12-28 华为技术有限公司 虚拟机恢复方法和虚拟机迁移方法以及装置与系统
CN103559108B (zh) * 2013-11-11 2017-05-17 中国科学院信息工程研究所 一种基于虚拟化实现主备故障自动恢复的方法及系统
WO2015116048A1 (en) * 2014-01-29 2015-08-06 Hewlett-Packard Development Company, L.P. Shutdown of computing devices
US9417976B2 (en) * 2014-08-29 2016-08-16 Vmware, Inc. Preventing migration of a virtual machine from affecting disaster recovery of replica
US9798635B2 (en) * 2015-12-11 2017-10-24 International Business Machines Corporation Service level agreement-based resource allocation for failure recovery
US10521315B2 (en) * 2016-02-23 2019-12-31 Vmware, Inc. High availability handling network segmentation in a cluster

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3142011A4 *

Also Published As

Publication number Publication date
EP3142011B1 (en) 2018-12-12
EP3142011A4 (en) 2018-01-10
EP3142011A1 (en) 2017-03-15
CN105095001A (zh) 2015-11-25
US20170060671A1 (en) 2017-03-02
US10095576B2 (en) 2018-10-09
EP3142011B9 (en) 2019-05-29
CN105095001B (zh) 2018-01-30

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15788953

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 15308497

Country of ref document: US

REEP Request for entry into the european phase

Ref document number: 2015788953

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 2015788953

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE