WO2015188619A1 - Physical host fault detection method and apparatus, and virtual machine management method and system

Info

Publication number
WO2015188619A1
WO2015188619A1 PCT/CN2015/070237 CN2015070237W WO2015188619A1 WO 2015188619 A1 WO2015188619 A1 WO 2015188619A1 CN 2015070237 W CN2015070237 W CN 2015070237W WO 2015188619 A1 WO2015188619 A1 WO 2015188619A1
Authority
WO
WIPO (PCT)
Prior art keywords
physical host
detection
management interface
platform management
physical
Prior art date
Application number
PCT/CN2015/070237
Other languages
English (en)
French (fr)
Inventor
胡岩岩
Original Assignee
中兴通讯股份有限公司
Priority date
2014-06-09
Filing date
2015-01-06
Publication date
2015-12-17
Application filed by 中兴通讯股份有限公司

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00: Arrangements for monitoring or testing data switching networks

Definitions

  • The present invention relates to the field of communications, and in particular to a physical host fault detection method and apparatus, and a virtual machine management method and system.
  • Virtual machine high availability means that when a physical host fails, the virtual machines on the failed host can be restarted at any time on other normally running physical hosts in the cluster; that is, after a physical host fails, the virtual machines on it are transferred to other normally running physical hosts.
  • An important issue for the virtualization management center is how to detect, effectively and reliably, a physical host that has suffered an unexpected failure.
  • The existing way to detect whether a physical host is faulty is simply to check whether the host's heartbeat messages are normal: if they are abnormal, the host is directly determined to be faulty and the virtual machines on it are migrated. However, the accuracy of judging host failure from heartbeat messages alone is relatively low.
  • A physical host that is not faulty is often determined to be faulty because of its heartbeat status. Its virtual machines are then migrated even though the host has not failed, which causes the same virtual machine to start on multiple physical hosts at the same time.
  • The main technical problem to be solved by the present invention is to provide a physical host fault detection method and apparatus, and a virtual machine management method and system, that address the low accuracy of existing physical host fault detection.
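  • An embodiment of the present invention provides a physical host fault detection method, including:
  • sending an intelligent platform management interface detection instruction to a physical host to probe the physical host, and determining, according to the detection result, whether the physical host is normal.
  • In an embodiment, before the intelligent platform management interface detection instruction is sent to the physical host, the method further includes: monitoring whether reporting of the physical host's heartbeat messages is normal and, if not, triggering the sending of the intelligent platform management interface detection instruction to the physical host; and/or
  • setting a timing module and, when the timing module reaches a preset time value, triggering the sending of the intelligent platform management interface detection instruction to the physical host.
  • In an embodiment, when the method includes monitoring whether reporting of the physical host's heartbeat messages is normal, that process includes: determining whether the heartbeat message reported by the physical host through the management network has gone undetected N consecutive times and, if so, determining that reporting of the physical host's heartbeat messages is abnormal, where N is greater than or equal to 1.
  • In an embodiment, after the physical host is determined to be abnormal according to the detection result, the method further includes:
  • determining, through the storage network, whether the unique identification code of the physical host is reported normally and, if not, determining that the physical host is faulty.
  • The present invention further provides a virtual machine management method, including: after the physical host is determined to be faulty by the physical host fault detection method described above, transferring the virtual machines on the physical host to other normally running physical hosts.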
  • An embodiment of the present invention further provides a physical host fault detection apparatus, including: an intelligent platform management interface detection module, configured to send an intelligent platform management interface detection instruction to the physical host to probe the physical host, and to determine, according to the detection result, whether the physical host is normal.
  • In an embodiment, the apparatus further includes a heartbeat detection module, configured to monitor, before the intelligent platform management interface detection module sends the detection instruction to the physical host, whether reporting of the physical host's heartbeat messages is normal and, if not, to trigger the intelligent platform management interface detection module to send the detection instruction to the physical host; and/or
  • a timing module, configured to start timing before the intelligent platform management interface detection module sends the detection instruction to the physical host and, when the timing reaches a preset time value, to trigger the intelligent platform management interface detection module to send the detection instruction to the physical host.
  • In an embodiment, when the heartbeat detection module is included, it includes a monitoring submodule and a determining submodule; the determining submodule is configured to determine whether the monitoring submodule has failed N consecutive times to detect the heartbeat message reported by the physical host through the management network and, if so, to determine that reporting of the physical host's heartbeat messages is abnormal, where N is greater than or equal to 1.
  • In an embodiment, the apparatus further includes an identifier determining module, configured to determine through the storage network, after the intelligent platform management interface detection module has determined from the detection result that the physical host is abnormal, whether the unique identification code of the physical host is reported normally and, if not, to determine that the physical host is faulty.
  • An embodiment of the present invention further provides a virtual machine management system, including a virtual machine transfer apparatus and the physical host fault detection apparatus described above; the virtual machine transfer apparatus is configured to transfer the virtual machines on a physical host to other normally running physical hosts after the physical host fault detection apparatus detects that the physical host is faulty.
  • The physical host fault detection method and apparatus and the virtual machine management method and system provided by the embodiments of the present invention probe a physical host by sending it an intelligent platform management interface detection instruction and determine from the detection result whether the physical host is normal, rather than simply determining whether the physical host is faulty according to whether its heartbeat messages are normal.
  • This can considerably improve the accuracy of physical host fault detection and thereby prevent the same virtual machine from being started on multiple physical hosts because a physical host fault was falsely detected.
  • FIG. 1 is a schematic flowchart of physical host fault detection according to Embodiment 1 of the present invention;
  • FIG. 2 is a schematic diagram of the connection between a virtual machine management system and physical hosts according to Embodiment 2 of the present invention;
  • FIG. 3 is a schematic structural diagram of a physical host fault detection apparatus according to Embodiment 3 of the present invention.
  • Embodiment 1:
  • The physical host fault detection method provided in this embodiment does not follow the existing practice of deciding whether a physical host is faulty solely from whether its heartbeat messages are normal.
  • In this embodiment, the process of determining whether a physical host is faulty includes sending an Intelligent Platform Management Interface (IPMI) detection instruction to the physical host to probe it (for example, checking whether its power-on state is normal) and then determining from the detection result whether the physical host is normal.
  • IPMI checks the state of a blade server in hardware through a separate CPU and can be used, for example, to check the server's power-on and power-off state, so its detection results are comparatively accurate. Heartbeat messages, by contrast, travel over the management network: if, for example, the management network cable is unplugged, the heartbeat messages are lost even though the blade server may still be powered on and running without any problem. Judging from heartbeat messages alone would then produce a misjudgment, whereas judging from the IPMI detection result would not. IPMI detection can therefore considerably improve the accuracy of physical host fault detection and thereby prevent the same virtual machine from being started on multiple physical hosts because a physical host fault was falsely detected. In this embodiment, the corresponding IPMI instructions can be sent to the physical host over an IPMI network deployed separately in the production environment.
  • Generally, when the heartbeat messages of a physical host are being reported normally, the host itself is usually running normally as well. In that case there is little need to send it an IPMI detection instruction, because the result obtained would almost always be that the host is running normally. Therefore, to reduce unnecessary system overhead and improve resource utilization, the IPMI detection instruction can be sent to a physical host only when the heartbeat messages it reports are abnormal. In this case, before the IPMI detection instruction is sent to the physical host, the method further includes: monitoring whether reporting of the physical host's heartbeat messages is normal; if it is abnormal, triggering the sending of the IPMI detection instruction to the physical host; if it is normal, not triggering that operation.
  • Whether the heartbeat messages of the physical host are normal can be determined by checking whether the heartbeat message reported by the host through the management network has gone undetected N consecutive times; if it has, reporting of the host's heartbeat messages is determined to be abnormal. In theory N can be any integer greater than or equal to 1; the actual value can be set according to the current network environment, user requirements and other factors, for example 3 or 5.
  • Besides an abnormal heartbeat, sending the IPMI detection instruction to the physical host can also be triggered by a timer. In this case, before the IPMI detection instruction is sent to the physical host, the method further includes:
  • setting a timing module and, when the timing module reaches a preset time value (for example, sending once every 1 second or every 5 seconds), triggering the operation of sending the intelligent platform management interface detection instruction to the physical host. Compared with triggering on heartbeat state, this approach consumes somewhat more resources, but it is also a relatively flexible and effective trigger.
  • The two approaches can also be combined, that is, the operation of sending the IPMI detection instruction is triggered only when both trigger conditions are met.
  • When the IPMI detection instruction shows the physical host to be abnormal, in the vast majority of cases the host can be judged faulty. However, to make sure that the IPMI detection result has not been invalidated by an abnormal condition such as the physical host hanging, after the physical host is determined to be abnormal according to the detection result, the method may further include:
  • determining, through the storage network, whether the unique identification code of the physical host is reported normally and, if not, determining that the physical host is faulty.
  • Each physical host can access a shared storage through a separate storage network. On this basis, each physical host periodically writes to the shared storage a physical host marker carrying a unique identification code (for example, a Universally Unique Identifier, UUID). Whether the physical host is faulty can therefore be checked further by detecting whether the host is periodically reporting this unique identification code through the storage network. If the host's unique identification code is not detected in the shared storage, the host has not reported its unique identification code normally and is determined to be faulty; otherwise, the host is not faulty, or not necessarily faulty.
  • The physical host detection method provided in this embodiment can detect a physical host fault using the IPMI detection instruction alone, or by combining the host's heartbeat messages with the IPMI detection instruction; to improve detection accuracy further, it can even combine the heartbeat messages, the IPMI detection instruction and the host's reporting of its unique identification code through the storage network. Once a physical host fault has been detected in this way, the virtual machines on that host can be transferred to other normally running physical hosts (generally other physical hosts in the same cluster as the failed host).
  • Taking as an example a flow that combines the host's heartbeat messages, the IPMI detection instruction and the host's reporting of its unique identification code through the storage network, and referring to FIG. 1, the flow includes:
  • Step 101: Monitor whether reporting of the physical host's heartbeat messages is normal; if not, go to step 102; otherwise, continue monitoring;
  • The virtualization management center node sits first of all on the management network, and all message communication between the management program and the physical hosts goes through the management network. The first step can therefore be to check, from the heartbeat messages the physical host reports through the management network, whether the host has failed. If the heartbeat is lost, the first line of defense, heartbeat detection on the management network, has been breached, which indicates that the host may have failed;
  • Step 102: Send an IPMI detection instruction to the physical host to probe it;
  • Step 103: Determine from the detection result whether the physical host is normal; if not, go to step 104; otherwise, end;
  • An IPMI network is deployed separately in the production environment, so the virtualization management center can issue an IPMI detection instruction to check whether the physical host is faulty. If the IPMI detection result indicates a fault, the check of step 104 can be performed as well, to make sure that the IPMI instruction result has not been invalidated by an abnormal condition such as the host hanging;
  • Step 104: Determine, through the storage network, whether the unique identification code of the physical host is reported normally; if not, go to step 105; otherwise, end;
  • Step 105: Determine that the physical host is faulty.
  • The virtualization management center and each physical host can access a shared storage through a separate storage network. On this basis, each physical host periodically writes to the shared storage a physical host marker carrying a unique identification code (for example, a UUID), so whether the physical host is faulty can be checked further by detecting whether the host is periodically reporting this unique identification code through the storage network. If the host's unique identification code is not detected in the shared storage, the host has not reported its unique identification code normally and is determined to be faulty.
  • Embodiment 2:
  • This embodiment provides a virtual machine management system, that is, a virtualization management center, which includes a physical host fault detection apparatus and a virtual machine transfer apparatus. FIG. 2 shows the connection between the virtual machine management system and the physical hosts.
  • Referring to FIG. 3, the physical host fault detection apparatus in this embodiment includes an intelligent platform management interface detection module (IPMI detection module), configured to send an intelligent platform management interface detection instruction to a physical host to probe the physical host, and to determine from the detection result whether the physical host is normal.
  • The physical host fault detection apparatus may further include a heartbeat detection module, configured to monitor, before the IPMI detection module sends the detection instruction to the physical host, whether reporting of the physical host's heartbeat messages is normal and, if not, to trigger the IPMI detection module to send the detection instruction to the physical host.
  • The heartbeat detection module in this embodiment may include a monitoring submodule and a determining submodule; the determining submodule is configured to determine whether the monitoring submodule has failed N consecutive times to detect the heartbeat message reported by the physical host through the management network and, if so, to determine that reporting of the physical host's heartbeat messages is abnormal, where N is greater than or equal to 1.
  • Besides an abnormal heartbeat, sending the IPMI detection instruction to the physical host can also be triggered by a timer.
  • In this case, the physical host fault detection apparatus may further include a timing module, configured to start timing before the IPMI detection module sends the detection instruction to the physical host and, when the timing reaches the preset time value, to trigger the IPMI detection module to send the detection instruction to the physical host.
  • The two approaches can also be combined, that is, the operation of sending the IPMI detection instruction is triggered only when both trigger conditions are met.
  • The physical host fault detection apparatus may further include an identifier determining module, configured to determine through the storage network, after the IPMI detection module has determined from the detection result that the physical host is abnormal, whether the unique identification code of the physical host is reported normally and, if not, to determine that the physical host is faulty.
  • The virtual machine management system and each physical host can access a shared storage through a separate storage network. On this basis, each physical host periodically writes to the shared storage a physical host marker carrying a unique identification code (for example, a Universally Unique Identifier, UUID). Whether the physical host is faulty can therefore be checked further by detecting whether the host is periodically reporting this unique identification code through the storage network. If the host's unique identification code is not detected in the shared storage, the host has not reported its unique identification code normally and is determined to be faulty.
  • After the physical host fault detection apparatus determines that a physical host is faulty, the virtual machine transfer apparatus can transfer the virtual machines on that host to other normally running physical hosts.
  • The physical host fault detection method provided by the present invention can accurately determine whether a physical host is faulty by working through the management network, the IPMI network and the storage network respectively, combining the heartbeat messages, the IPMI detection result and the reporting status of the host's unique identifier. This prevents the same virtual machine from being started on multiple physical hosts because a physical host fault was falsely detected.
  • The physical host is probed by sending it an intelligent platform management interface detection instruction, and whether it is normal is determined from the detection result, rather than simply determining whether the physical host is faulty according to whether its heartbeat messages are normal. This can considerably improve the accuracy of physical host fault detection and thereby prevent the same virtual machine from being started on multiple physical hosts because a physical host fault was falsely detected.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Debugging And Monitoring (AREA)

Abstract

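A physical host fault detection method and apparatus, and a virtual machine management method and system. An intelligent platform management interface detection instruction is sent to a physical host to probe the physical host (S102), and whether the physical host is normal is determined according to the detection result (S103), rather than simply determining whether the physical host is faulty according to whether its heartbeat messages are normal. This can considerably improve the accuracy of physical host fault detection and thereby prevent the same virtual machine from being started on multiple physical hosts because a physical host fault was falsely detected.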

Description

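Physical host fault detection method and apparatus, and virtual machine management method and system
Technical Field
The present invention relates to the field of communications, and in particular to a physical host fault detection method and apparatus, and a virtual machine management method and system.
Background
Virtual machine high availability means that when a physical host fails, the virtual machines on the failed host can be restarted at any time on other normally running physical hosts in the cluster; in other words, it is the process of transferring the virtual machines on a failed physical host to other normally running physical hosts. An important question for the virtualization management center is how to detect, effectively and reliably, a physical host that has suffered an unexpected failure. The existing approach simply checks whether the host's heartbeat messages are normal: if they are judged abnormal, the host is directly determined to be faulty and its virtual machines are migrated. However, judging host failure from heartbeat messages alone has relatively low accuracy. A host that has not failed is often determined to be faulty because of its heartbeat status, so its virtual machines are migrated even though the host has not failed, which causes the same virtual machine to start on multiple physical hosts at the same time.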
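Summary of the Invention
The main technical problem to be solved by the present invention is to provide a physical host fault detection method and apparatus, and a virtual machine management method and system, that address the low accuracy of existing physical host fault detection.
To solve the above technical problem, an embodiment of the present invention provides a physical host fault detection method, including:
sending an intelligent platform management interface detection instruction to a physical host to probe the physical host, and determining, according to the detection result, whether the physical host is normal.
In an embodiment of the present invention, before the intelligent platform management interface detection instruction is sent to the physical host, the method further includes:
monitoring whether reporting of the physical host's heartbeat messages is normal and, if not, triggering the sending of the intelligent platform management interface detection instruction to the physical host;
and/or,
setting a timing module and, when the timing module reaches a preset time value, triggering the sending of the intelligent platform management interface detection instruction to the physical host.
In an embodiment of the present invention, when the method includes monitoring whether reporting of the physical host's heartbeat messages is normal, that process includes:
determining whether the heartbeat message reported by the physical host through the management network has gone undetected N consecutive times and, if so, determining that reporting of the physical host's heartbeat messages is abnormal, where N is greater than or equal to 1.
In an embodiment of the present invention, after the physical host is determined to be abnormal according to the detection result, the method further includes:
determining, through the storage network, whether the unique identification code of the physical host is reported normally and, if not, determining that the physical host is faulty.
To solve the above problem, the present invention further provides a virtual machine management method, including: after the physical host is determined to be faulty by the physical host fault detection method described above, transferring the virtual machines on the physical host to other normally running physical hosts.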
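To solve the above problem, an embodiment of the present invention further provides a physical host fault detection apparatus, including: an intelligent platform management interface detection module, configured to send an intelligent platform management interface detection instruction to the physical host to probe the physical host, and to determine, according to the detection result, whether the physical host is normal.
In an embodiment of the present invention, the apparatus further includes a heartbeat detection module, configured to monitor, before the intelligent platform management interface detection module sends the intelligent platform management interface detection instruction to the physical host, whether reporting of the physical host's heartbeat messages is normal and, if not, to trigger the intelligent platform management interface detection module to send the intelligent platform management interface detection instruction to the physical host;
and/or,
the apparatus further includes a timing module, configured to start timing before the intelligent platform management interface detection module sends the intelligent platform management interface detection instruction to the physical host and, when the timing reaches a preset time value, to trigger the intelligent platform management interface detection module to send the intelligent platform management interface detection instruction to the physical host.
In an embodiment of the present invention, when the heartbeat detection module is included, the heartbeat detection module includes a monitoring submodule and a determining submodule; the determining submodule is configured to determine whether the monitoring submodule has failed N consecutive times to detect the heartbeat message reported by the physical host through the management network and, if so, to determine that reporting of the physical host's heartbeat messages is abnormal, where N is greater than or equal to 1.
In an embodiment of the present invention, the apparatus further includes an identifier determining module, configured to determine through the storage network, after the intelligent platform management interface detection module has determined from the detection result that the physical host is abnormal, whether the unique identification code of the physical host is reported normally and, if not, to determine that the physical host is faulty.
To solve the above problem, an embodiment of the present invention further provides a virtual machine management system, including a virtual machine transfer apparatus and the physical host fault detection apparatus described above; the virtual machine transfer apparatus is configured to transfer the virtual machines on a physical host to other normally running physical hosts after the physical host fault detection apparatus detects that the physical host is faulty.
The beneficial effects of the present invention are as follows:
The physical host fault detection method and apparatus and the virtual machine management method and system provided by the embodiments of the present invention probe a physical host by sending it an intelligent platform management interface detection instruction and determine from the detection result whether the physical host is normal, rather than simply determining whether the physical host is faulty according to whether its heartbeat messages are normal. This can considerably improve the accuracy of physical host fault detection and thereby prevent the same virtual machine from being started on multiple physical hosts because a physical host fault was falsely detected.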
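Brief Description of the Drawings
FIG. 1 is a schematic flowchart of physical host fault detection according to Embodiment 1 of the present invention;
FIG. 2 is a schematic diagram of the connection between a virtual machine management system and physical hosts according to Embodiment 2 of the present invention;
FIG. 3 is a schematic structural diagram of a physical host fault detection apparatus according to Embodiment 3 of the present invention.
Detailed Description
The present invention is described in further detail below through specific embodiments with reference to the accompanying drawings.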
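Embodiment 1:
The physical host fault detection method provided in this embodiment does not follow the existing practice of deciding whether a physical host is faulty solely from whether its heartbeat messages are normal. In this embodiment, the process of determining whether a physical host is faulty includes sending an Intelligent Platform Management Interface (IPMI) detection instruction to the physical host to probe it (for example, checking whether its power-on state is normal) and then determining from the detection result whether the physical host is normal. IPMI checks the state of a blade server in hardware through a separate CPU and can be used, for example, to check the server's power-on and power-off state, so its detection results are comparatively accurate. Heartbeat messages, by contrast, travel over the management network: if, for example, the management network cable is unplugged, the heartbeat messages are lost even though the blade server may still be powered on and running without any problem. Judging from heartbeat messages alone would then produce a misjudgment, whereas judging from the IPMI detection result would not. IPMI detection can therefore considerably improve the accuracy of physical host fault detection and thereby prevent the same virtual machine from being started on multiple physical hosts because a physical host fault was falsely detected. In this embodiment, the corresponding IPMI instructions can be sent to the physical host over an IPMI network deployed separately in the production environment.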
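The power-state probe described above might be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the embodiment: it assumes the `ipmitool` command-line utility is installed and the host's management controller is reachable over the IPMI network. The function names, the credentials and the `run` parameter (a hypothetical hook that lets the parsing be exercised without hardware) are not from the source.

```python
import subprocess
from typing import Callable, List, Optional


def default_runner(cmd: List[str]) -> str:
    """Run a command and return its stdout; raises on failure or timeout."""
    done = subprocess.run(cmd, capture_output=True, text=True,
                          timeout=10, check=True)
    return done.stdout


def ipmi_power_is_on(host: str, user: str, password: str,
                     run: Callable[[List[str]], str] = default_runner
                     ) -> Optional[bool]:
    """Return True/False for power on/off, or None if the probe gave no verdict."""
    cmd = ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-P", password,
           "chassis", "power", "status"]
    try:
        out = run(cmd).strip().lower()
    except (OSError, subprocess.SubprocessError):
        return None  # ipmitool missing, controller unreachable, or timeout
    if out.endswith("is on"):
        return True
    if out.endswith("is off"):
        return False
    return None  # unexpected output: treat as inconclusive
```

A caller following the flow of this embodiment would presumably treat anything other than `True` as "abnormal" and continue with further checks rather than declaring the host faulty outright.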
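Generally, when the heartbeat messages of a physical host are being reported normally, the host itself is usually running normally as well. In that case there is little need to send it an IPMI detection instruction, because the result obtained would almost always be that the host is running normally. Therefore, to minimize unnecessary system overhead and improve resource utilization, this embodiment can be configured to send an IPMI detection instruction to a physical host only when the heartbeat messages it reports are detected to be abnormal. In this case, before the IPMI detection instruction is sent to the physical host, the method further includes:
monitoring whether reporting of the physical host's heartbeat messages is normal; if it is abnormal, triggering the sending of the IPMI detection instruction to the physical host; if it is normal, not triggering that operation.
In this embodiment, whether the heartbeat messages of the physical host are normal can be determined by checking whether the heartbeat message reported by the host through the management network has gone undetected N consecutive times; if it has not been detected N consecutive times, reporting of the host's heartbeat messages is determined to be abnormal. In theory N can be any integer greater than or equal to 1; the actual value can be set according to the current network environment, user requirements and other factors, for example 3 or 5.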
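The "N consecutive misses" rule can be illustrated with a small counter. This is a minimal sketch; the class name `HeartbeatMonitor` and the idea of calling `observe` once per reporting period are assumptions made for the example.

```python
class HeartbeatMonitor:
    """Counts consecutive reporting periods in which no heartbeat arrived."""

    def __init__(self, n: int = 3) -> None:
        if n < 1:
            raise ValueError("N must be greater than or equal to 1")
        self.n = n
        self.missed = 0

    def observe(self, heartbeat_received: bool) -> bool:
        """Record one reporting period; return True if reporting is abnormal."""
        # Any received heartbeat resets the run of misses.
        self.missed = 0 if heartbeat_received else self.missed + 1
        return self.missed >= self.n
```

With N = 3, two misses followed by a heartbeat do not trip the monitor; only three misses in a row do. A larger N tolerates more transient loss on the management network at the cost of slower detection.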
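Of course, besides an abnormal heartbeat, sending the IPMI detection instruction to the physical host can also be triggered by a timer. In this case, before the IPMI detection instruction is sent to the physical host, the method further includes:
setting a timing module and, when the timing module reaches a preset time value (for example, sending once every 1 second or every 5 seconds), triggering the operation of sending the intelligent platform management interface detection instruction to the physical host. Compared with triggering on heartbeat state, this approach consumes somewhat more resources, but it is also a relatively flexible and effective trigger. In this embodiment the two approaches can also be combined, that is, the operation of sending the IPMI detection instruction is triggered only when both trigger conditions are met.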
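The three trigger options (heartbeat only, timer only, or both conditions together) could be expressed as a single decision function. The enum and parameter names below are hypothetical, and times are plain seconds supplied by the caller.

```python
import enum


class TriggerMode(enum.Enum):
    HEARTBEAT = "heartbeat"  # probe only when heartbeat reporting is abnormal
    TIMER = "timer"          # probe whenever the preset interval has elapsed
    BOTH = "both"            # probe only when both conditions hold


def should_send_ipmi_probe(mode: TriggerMode, heartbeat_abnormal: bool,
                           now: float, last_probe_at: float,
                           interval_s: float) -> bool:
    """Decide whether an IPMI detection instruction should be sent now."""
    timer_due = (now - last_probe_at) >= interval_s
    if mode is TriggerMode.HEARTBEAT:
        return heartbeat_abnormal
    if mode is TriggerMode.TIMER:
        return timer_due
    return heartbeat_abnormal and timer_due
```

In the combined mode the timer effectively rate-limits probes that an abnormal heartbeat would otherwise trigger.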
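When the IPMI detection instruction shows the physical host to be abnormal, in the vast majority of cases the host can be judged faulty and its virtual machines need to be migrated. However, to make sure that the IPMI detection result has not been invalidated by an abnormal condition such as the physical host hanging, in this embodiment, after the physical host is determined to be abnormal according to the detection result, the method may further include:
determining, through the storage network, whether the unique identification code of the physical host is reported normally and, if not, determining that the physical host is faulty.
Each physical host can access a shared storage through a separate storage network. On this basis, each physical host periodically writes to the shared storage a physical host marker carrying a unique identification code (for example, a Universally Unique Identifier, UUID). Whether the physical host is faulty can therefore be checked further by detecting whether the host is reporting this unique identification code through the storage network regularly and on schedule. If the host's unique identification code is not detected in the shared storage, the host has not reported its unique identification code normally and is determined to be faulty; otherwise, the host is not faulty, or not necessarily faulty.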
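One way to picture the shared-storage marker is a small file per host, named after its UUID and holding the time of the last write. The file layout, the `.marker` suffix, the freshness window `max_age_s` and both function names are assumptions for illustration; the source only states that a marker carrying the unique identification code is written periodically.

```python
import time
from pathlib import Path
from typing import Optional


def write_marker(shared_dir: Path, host_uuid: str,
                 now: Optional[float] = None) -> None:
    """Host side: record the current time under the host's UUID."""
    stamp = time.time() if now is None else now
    (shared_dir / f"{host_uuid}.marker").write_text(repr(stamp))


def marker_is_fresh(shared_dir: Path, host_uuid: str, max_age_s: float,
                    now: Optional[float] = None) -> bool:
    """Management side: True if the host reported its UUID recently enough."""
    current = time.time() if now is None else now
    try:
        stamp = float((shared_dir / f"{host_uuid}.marker").read_text())
    except (OSError, ValueError):
        return False  # marker missing or unreadable: not reported normally
    return current - stamp <= max_age_s
```

Because this check travels over the storage network rather than the management or IPMI network, it offers an independent view of whether the host is still alive.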
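As can be seen, the physical host detection method provided in this embodiment can detect a physical host fault using the IPMI detection instruction alone, or by combining the host's heartbeat messages with the IPMI detection instruction; to improve detection accuracy further, it can even combine the heartbeat messages, the IPMI detection instruction and the host's reporting of its unique identification code through the storage network. Once a physical host fault has been detected in this way, the virtual machines on that host can be transferred to other normally running physical hosts (generally other physical hosts in the same cluster as the failed host).
For a better understanding of the present invention, the following takes as an example a flow that detects physical host faults reliably by combining the host's heartbeat messages, the IPMI detection instruction and the host's reporting of its unique identification code through the storage network. Referring to FIG. 1, the flow includes: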
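Step 101: Monitor whether reporting of the physical host's heartbeat messages is normal; if not, go to step 102; otherwise, continue monitoring;
The virtualization management center node sits first of all on the management network, and all message communication between the management program and the physical hosts goes through the management network. The first step can therefore be to check, from the heartbeat messages the physical host reports through the management network, whether the host has failed. If the heartbeat is lost, the first line of defense, heartbeat detection on the management network, has been breached, which indicates that the host may have failed;
Step 102: Send an IPMI detection instruction to the physical host to probe it;
Step 103: Determine from the detection result whether the physical host is normal; if not, go to step 104; otherwise, end;
An IPMI network is deployed separately in the production environment, so the virtualization management center can issue an IPMI detection instruction to check whether the physical host is faulty. If the IPMI detection result indicates a fault, the check of step 104 can be performed as well, to make sure that the IPMI instruction result has not been invalidated by an abnormal condition such as the host hanging;
Step 104: Determine, through the storage network, whether the unique identification code of the physical host is reported normally; if not, go to step 105; otherwise, end;
Step 105: Determine that the physical host is faulty.
The virtualization management center and each physical host can access a shared storage through a separate storage network. On this basis, each physical host periodically writes to the shared storage a physical host marker carrying a unique identification code (for example, a UUID), so whether the physical host is faulty can be checked further by detecting whether the host is reporting this unique identification code through the storage network regularly and on schedule. If the host's unique identification code is not detected in the shared storage, the host has not reported its unique identification code normally and is determined to be faulty.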
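Steps 101 to 105 amount to a short-circuiting chain of three checks, each on a different network. The sketch below takes the checks as callables so that it stays independent of how each one is carried out; the function and parameter names are illustrative only.

```python
from typing import Callable


def host_is_faulty(heartbeat_abnormal: Callable[[], bool],
                   ipmi_reports_normal: Callable[[], bool],
                   uuid_reported_normally: Callable[[], bool]) -> bool:
    """Return True only if all three checks point to a fault."""
    # Step 101: heartbeat over the management network.
    if not heartbeat_abnormal():
        return False
    # Steps 102-103: IPMI probe over the IPMI network.
    if ipmi_reports_normal():
        return False
    # Step 104: unique identification code over the storage network.
    if uuid_reported_normally():
        return False
    # Step 105: every line of defense has been breached.
    return True
```

Later checks run only when the earlier ones indicate a problem, which matches the stated aim of avoiding unnecessary probing overhead.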
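Embodiment 2:
This embodiment provides a virtual machine management system, that is, a virtualization management center, which includes a physical host fault detection apparatus and a virtual machine transfer apparatus. Refer to FIG. 2, which shows the connection between the virtual machine management system and the physical hosts.
Referring to FIG. 3, the physical host fault detection apparatus in this embodiment includes an intelligent platform management interface detection module (IPMI detection module), configured to send an intelligent platform management interface detection instruction to a physical host to probe the physical host, and to determine from the detection result whether the physical host is normal.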
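Generally, when the heartbeat messages of a physical host are being reported normally, the host itself is usually running normally as well. In that case there is little need to send it an IPMI detection instruction, because the result obtained would almost always be that the host is running normally. Therefore, to minimize unnecessary system overhead and improve resource utilization, this embodiment can be configured to send an IPMI detection instruction to a physical host only when the heartbeat messages it reports are detected to be abnormal. In this case, referring to FIG. 3, the physical host fault detection apparatus may further include a heartbeat detection module, configured to monitor, before the intelligent platform management interface detection module sends the intelligent platform management interface detection instruction to the physical host, whether reporting of the physical host's heartbeat messages is normal and, if not, to trigger the intelligent platform management interface detection module to send the intelligent platform management interface detection instruction to the physical host. The heartbeat detection module in this embodiment may include a monitoring submodule and a determining submodule; the determining submodule is configured to determine whether the monitoring submodule has failed N consecutive times to detect the heartbeat message reported by the physical host through the management network and, if so, to determine that reporting of the physical host's heartbeat messages is abnormal, where N is greater than or equal to 1.
In this embodiment, besides an abnormal heartbeat, sending the IPMI detection instruction to the physical host can also be triggered by a timer. In this case, the physical host fault detection apparatus may further include a timing module, configured to start timing before the intelligent platform management interface detection module sends the intelligent platform management interface detection instruction to the physical host and, when the timing reaches the preset time value, to trigger the intelligent platform management interface detection module to send the intelligent platform management interface detection instruction to the physical host. It should be understood that in this embodiment the two approaches can also be combined, that is, the operation of sending the IPMI detection instruction is triggered only when both trigger conditions are met.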
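When the IPMI detection instruction shows the physical host to be abnormal, in the vast majority of cases the host can be judged faulty and its virtual machines need to be migrated. However, to make sure that the IPMI detection result has not been invalidated by an abnormal condition such as the physical host hanging, and referring to FIG. 3, the physical host fault detection apparatus in this embodiment may further include an identifier determining module, configured to determine through the storage network, after the intelligent platform management interface detection module has determined from the detection result that the physical host is abnormal, whether the unique identification code of the physical host is reported normally and, if not, to determine that the physical host is faulty.
The virtual machine management system and each physical host can access a shared storage through a separate storage network. On this basis, each physical host periodically writes to the shared storage a physical host marker carrying a unique identification code (for example, a UUID, Universally Unique Identifier), so whether the physical host is faulty can be checked further by detecting whether the host is reporting this unique identification code through the storage network regularly and on schedule. If the host's unique identification code is not detected in the shared storage, the host has not reported its unique identification code normally and is determined to be faulty.
After the physical host fault detection apparatus determines that a physical host is faulty, the virtual machine transfer apparatus can transfer the virtual machines on that host to other normally running physical hosts.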
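As a rough illustration of what the virtual machine transfer apparatus might compute, the sketch below maps each virtual machine on a failed host to a normally running host in the same cluster. The source does not specify a placement policy, so the "fewest virtual machines first" rule, the `Host` record and the function name `plan_transfer` are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Host:
    name: str
    cluster: str
    healthy: bool = True
    vms: List[str] = field(default_factory=list)


def plan_transfer(failed: Host, hosts: List[Host]) -> Dict[str, str]:
    """Map each VM on the failed host to a healthy host in the same cluster."""
    targets = [h for h in hosts
               if h is not failed and h.healthy and h.cluster == failed.cluster]
    if not targets:
        raise RuntimeError("no normally running host available in the cluster")
    load = {h.name: len(h.vms) for h in targets}
    plan: Dict[str, str] = {}
    for vm in failed.vms:
        dest = min(load, key=load.get)  # host currently holding the fewest VMs
        plan[vm] = dest
        load[dest] += 1
    return plan
```

Restarting the virtual machines on the chosen hosts is left out; the point is only that targets are restricted to hosts that the fault detection apparatus still considers normal.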
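As can be seen, the physical host fault detection method provided by the present invention can accurately determine whether a physical host is faulty by working through the management network, the IPMI network and the storage network respectively, combining the heartbeat messages, the IPMI detection result and the reporting status of the host's unique identifier. This prevents the same virtual machine from being started on multiple physical hosts because a physical host fault was falsely detected.
The above is a further detailed description of the present invention in connection with specific embodiments, and the specific implementation of the present invention shall not be regarded as limited to these descriptions. A person of ordinary skill in the art to which the present invention belongs may make a number of simple deductions or substitutions without departing from the concept of the present invention, and all of these shall be regarded as falling within the scope of protection of the present invention.
Industrial Applicability
Based on the above technical solution provided by the embodiments of the present invention, a physical host is probed by sending it an intelligent platform management interface detection instruction, and whether the physical host is normal is determined from the detection result, rather than simply determining whether the physical host is faulty according to whether its heartbeat messages are normal. This can considerably improve the accuracy of physical host fault detection and thereby prevent the same virtual machine from being started on multiple physical hosts because a physical host fault was falsely detected.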

Claims (10)

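  1. A physical host fault detection method, comprising:
    sending an intelligent platform management interface detection instruction to a physical host to probe the physical host, and determining, according to the detection result, whether the physical host is normal.
  2. The physical host fault detection method according to claim 1, wherein, before the intelligent platform management interface detection instruction is sent to the physical host, the method further comprises:
    monitoring whether reporting of the physical host's heartbeat messages is normal and, if not, triggering the sending of the intelligent platform management interface detection instruction to the physical host;
    and/or,
    setting a timing module and, when the timing module reaches a preset time value, triggering the sending of the intelligent platform management interface detection instruction to the physical host.
  3. The physical host fault detection method according to claim 2, wherein monitoring whether reporting of the physical host's heartbeat messages is normal comprises:
    determining whether the heartbeat message reported by the physical host through the management network has gone undetected N consecutive times and, if so, determining that reporting of the physical host's heartbeat messages is abnormal, where N is greater than or equal to 1.
  4. The physical host fault detection method according to any one of claims 1 to 3, wherein, after the physical host is determined to be abnormal according to the detection result, the method further comprises:
    determining, through the storage network, whether the unique identification code of the physical host is reported normally and, if not, determining that the physical host is faulty.
  5. A virtual machine management method, comprising: after the physical host is determined to be faulty by the physical host fault detection method according to any one of claims 1 to 4, transferring the virtual machines on the physical host to other normally running physical hosts.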
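  6. A physical host fault detection apparatus, comprising: an intelligent platform management interface detection module, configured to send an intelligent platform management interface detection instruction to the physical host to probe the physical host, and to determine, according to the detection result, whether the physical host is normal.
  7. The physical host fault detection apparatus according to claim 6, further comprising a heartbeat detection module, configured to monitor, before the intelligent platform management interface detection module sends the intelligent platform management interface detection instruction to the physical host, whether reporting of the physical host's heartbeat messages is normal and, if not, to trigger the intelligent platform management interface detection module to send the intelligent platform management interface detection instruction to the physical host;
    and/or,
    further comprising a timing module, configured to start timing before the intelligent platform management interface detection module sends the intelligent platform management interface detection instruction to the physical host and, when the timing reaches a preset time value, to trigger the intelligent platform management interface detection module to send the intelligent platform management interface detection instruction to the physical host.
  8. The physical host fault detection apparatus according to claim 6, wherein, when the heartbeat detection module is included, the heartbeat detection module comprises a monitoring submodule and a determining submodule; the determining submodule is configured to determine whether the monitoring submodule has failed N consecutive times to detect the heartbeat message reported by the physical host through the management network and, if so, to determine that reporting of the physical host's heartbeat messages is abnormal, where N is greater than or equal to 1.
  9. The physical host fault detection apparatus according to any one of claims 6 to 8, further comprising an identifier determining module, configured to determine through the storage network, after the intelligent platform management interface detection module has determined from the detection result that the physical host is abnormal, whether the unique identification code of the physical host is reported normally and, if not, to determine that the physical host is faulty.
  10. A virtual machine management system, comprising a virtual machine transfer apparatus and the physical host fault detection apparatus according to any one of claims 6 to 9; the virtual machine transfer apparatus is configured to transfer the virtual machines on a physical host to other normally running physical hosts after the physical host fault detection apparatus detects that the physical host is faulty.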
PCT/CN2015/070237 2014-06-09 2015-01-06 Physical host fault detection method and apparatus, and virtual machine management method and system WO2015188619A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410253708.8A CN105224426A (zh) 2014-06-09 2014-06-09 Physical host fault detection method and apparatus, and virtual machine management method and system
CN201410253708.8 2014-06-09

Publications (1)

Publication Number Publication Date
WO2015188619A1 (zh) 2015-12-17

Family

ID=54832856

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/070237 WO2015188619A1 (zh) 2014-06-09 2015-01-06 Physical host fault detection method and apparatus, and virtual machine management method and system

Country Status (2)

Country Link
CN (1) CN105224426A (zh)
WO (1) WO2015188619A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
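CN106444700A (zh) * 2016-09-09 2017-02-22 郑州宇通客车股份有限公司 Fault determination method for a vehicle monitoring host and positioning module
CN107608766A (zh) * 2017-10-20 2018-01-19 北京易思捷信息技术有限公司 Virtualized cross-platform HA system
CN111447094B (zh) * 2020-03-27 2023-06-16 深圳融安网络科技有限公司 Master-slave switchover method for a dual-machine system, terminal device, and computer-readable storage medium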

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070234123A1 (en) * 2006-03-31 2007-10-04 Inventec Corporation Method for detecting switching failure
TW200743025A (en) * 2006-05-09 2007-11-16 Giga Byte Tech Co Ltd Method for simulating IPMI using BIOS
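CN102819465A (zh) * 2012-06-29 2012-12-12 华中科技大学 Method for fault recovery in a virtualized environment
CN103200050A (zh) * 2013-04-12 2013-07-10 北京百度网讯科技有限公司 Method and system for monitoring server hardware status
CN103500133A (zh) * 2013-09-17 2014-01-08 华为技术有限公司 Fault locating method and apparatus
CN103617104A (zh) * 2013-12-01 2014-03-05 中国船舶重工集团公司第七一六研究所 IPMI-based active and passive node fault detection method for redundant computer systems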

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
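CN103019366B (zh) * 2012-11-28 2015-06-10 国睿集团有限公司 Physical host load detection method based on CPU heartbeat amplitude
CN103051479B (zh) * 2012-12-24 2016-01-20 北京启明星辰信息技术股份有限公司 Migration processing method and system for virtual machine network control policies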

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113965496A (zh) * 2021-10-15 2022-01-21 上汽通用五菱汽车股份有限公司 Method for optimizing screen-casting process response
CN113965496B (zh) * 2021-10-15 2023-11-17 上汽通用五菱汽车股份有限公司 Method for optimizing screen-casting process response

Also Published As

Publication number Publication date
CN105224426A (zh) 2016-01-06

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15806757

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15806757

Country of ref document: EP

Kind code of ref document: A1