WO2017152763A1 - Physical machine fault classification processing method and apparatus, and virtual machine recovery method and system (物理机故障分类处理方法、装置和虚拟机恢复方法、系统) - Google Patents

Info

Publication number: WO2017152763A1
Application number: PCT/CN2017/074618
Authority: WIPO (PCT)
Prior art keywords: physical machine, fault, network
Other languages: English (en), French (fr)
Inventor: 张文
Applicant: 阿里巴巴集团控股有限公司, 张文

Classifications

    • G06F11/0709 — Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation; the processing taking place on a specific hardware platform or in a specific software environment, in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • G06F11/07 — Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/1484 — Generic software techniques for error detection or fault masking by means of middleware or OS functionality involving virtual machines

Definitions

  • the present application relates to the field of communications technologies, and in particular, to a physical machine fault classification processing method and apparatus, and a virtual machine recovery method and system, which are applied to a virtualized cluster system.
  • Cloud computing abstracts all computers into specific computing resources and provides those resources to users, rather than directly providing one or more computers as in the traditional model.
  • the biggest advantage of the cloud computing model is that users can apply for resources according to their own needs, avoid unnecessary waste of resources, and improve resource utilization.
  • virtualized cluster technology is one of the key technologies.
  • the virtualized cluster combines multiple virtualized servers into an organic whole, which achieves high computing speed and improves the overall computing power of the virtualized system.
  • a virtualized cluster manages multiple servers in a unified manner.
  • Virtualization technology abstracts physical resources into a large resource pool of storage, computing, network, and other resources.
  • Virtual machines are provided to users, who apply for resources from this pool on demand.
  • In view of the above, embodiments of the present application provide a physical machine fault classification processing method and device, and a virtual machine recovery method and system, applied to a virtualized cluster system, which overcome the above problems or at least partially solve them.
  • the present application discloses a cluster physical machine fault classification processing method, including:
  • if a physical machine fault caused by a network attack is detected in the physical machine fault information list, the security attack protection center outside the cluster is triggered to handle it;
  • the present application also discloses a cluster physical machine fault classification processing device, including:
  • An obtaining module configured to obtain a physical machine fault information list from the physical machine fault information storage center
  • a first processing module configured to trigger a security attack protection center processing outside the cluster if a physical machine failure caused by a network attack is detected in the physical machine fault information list;
  • the second processing module further includes:
  • a processing unit configured to: if a software and hardware failure that cannot be repaired by the physical machine itself is detected in the physical machine failure information list, send an instruction to shut down the failed physical machine to the failed physical machine;
  • a migration processing unit configured to migrate the virtual machine on the faulty physical machine to other healthy physical machines in the cluster system through the virtualization interface.
  • the application also discloses a virtual machine recovery method, which is applied to a virtualized cluster system, and the method includes:
  • the physical machine in the virtualized cluster system autonomously detects its own fault dynamics
  • if a physical machine fault caused by a network attack is detected in the physical machine fault information list, the security attack protection center outside the cluster is triggered to handle it;
  • a virtual machine recovery system including:
  • the physical machine fault repairing device is applied to the physical machines in the virtualized cluster system, which autonomously detect their own fault dynamics: if a hardware or software fault that the physical machine itself can repair in a fault-tolerant manner is detected, the fault is repaired by fault tolerance; if a hardware or software fault that the physical machine itself can repair by restarting is detected, the fault is repaired by restarting the physical machine;
  • the physical machine fault information storage center is configured to collect all the reported physical machine fault information into a physical machine fault information list;
  • a physical machine fault classification processing device configured to obtain the physical machine fault information list from the physical machine fault information storage center; if a physical machine fault caused by a network attack is detected in the physical machine fault information list, trigger the security attack protection center outside the cluster to handle it; and if a hardware or software fault that cannot be repaired by the physical machine itself is detected in the physical machine fault information list, send an instruction to shut down the faulty physical machine to the faulty physical machine and migrate the virtual machines on the faulty physical machine through the virtualization interface to other healthy physical machines in the cluster system.
  • the present application discloses the following technical effects:
  • The embodiments of the present application can quickly and accurately identify fine-grained faults in the various physical machine fault scenarios of a large-scale cloud computing cluster and perform targeted processing, thereby achieving fast and highly reliable physical machine fault repair and ensuring rapid recovery of the virtual machine services running on the faulty machines.
  • In the embodiment of the present application, the physical machine autonomously detects its own fault dynamics and performs targeted repair processing for the physical machine fault conditions that it can repair itself; physical machine fault conditions that the physical machine cannot repair itself are classified and repaired in a targeted manner by the physical machine fault classification processing module outside the cluster, thereby effectively reducing false positives and missed detections of physical machine faults and allowing virtual machines to be recovered automatically in a safe, stable, and fast manner.
  • For physical machine fault conditions that cannot be repaired by the physical machine itself, the embodiment of the present application can also have the physical machine fault classification processing module outside the cluster instruct the faulty physical machine to shut itself down, which makes up for the fact that the availability of the out-of-band management module's shutdown operation cannot meet commercial standards, and also ensures the effectiveness of automated physical machine isolation.
  • The embodiment of the present application also considers the possibility of large-scale physical machine failure in a large-scale cloud computing cluster: it determines whether the number of failed physical machines reaches machine-room scale and adopts different repair processing methods accordingly. In particular, for large-scale physical machine failures the repair is done by manual processing, which effectively avoids the system performance problems that frequent migration of the virtual machines on the failed physical machines would otherwise cause.
  • FIG. 1 is a flow chart of steps of an embodiment of a cluster physical machine fault classification processing method according to the present application
  • FIG. 2 is a flow chart of steps of another embodiment of a cluster physical machine fault classification processing method according to the present application.
  • FIG. 3 is a flow chart of steps of an embodiment of a virtual machine recovery method according to the present application.
  • FIG. 4 is a flow chart of steps of another embodiment of a virtual machine recovery method of the present application.
  • FIG. 5 is a structural block diagram of an embodiment of a physical machine fault repairing apparatus of the present application.
  • FIG. 6 is a structural block diagram of an embodiment of a cluster physical machine fault classification processing apparatus according to the present application.
  • FIG. 7 is a structural block diagram of an embodiment of a virtual machine recovery system of the present application.
  • Cloud computing is a model for the addition, use, and delivery of Internet-based services, in which distributed computing is performed across a cluster of servers. In other words, cloud computing provides a virtualized, elastic resource platform for dynamically provisioning hardware, software, and data sets on demand.
  • Cluster management on a cloud computing platform constitutes a virtual cluster.
  • A so-called virtual cluster uses virtualization technology to virtualize multiple computing nodes and build a large-scale cluster system similar to a physical cluster. That is to say, a virtual cluster is a system that connects multiple homogeneous or heterogeneous computers to work together to accomplish a specific task.
  • The physical computers in such a cluster are referred to as cluster physical computers or cluster physical machines.
  • One physical machine can simulate one or more virtual computers: virtual machine software simulates these virtual machines on a single physical machine, and they work like real computers.
  • Operating systems and applications can be installed on a virtual machine, and virtual machines can also access Internet resources.
  • For the user, working on a virtual machine is like working on a real computer.
  • The embodiment of the present application can be applied to a large-scale cloud computing virtualized cluster system: the physical machines in the cluster system autonomously detect their own fault dynamics and then perform targeted repair processing on the physical machine faults that they can repair themselves. Physical machine fault phenomena commonly seen in practice include the following.
  • the physical machine network is unreachable.
  • The reasons include: physical machine downtime, network card abnormality, uplink switch failure, hardware abnormality, kernel module abnormality, physical machine restart, and distributed denial of service (DDoS) attacks.
  • The reasons include: high physical machine load, uplink network device switching, network DDoS attacks, and so on.
  • For example, exceptions of the physical machine's file system, virtualization-related modules, operating system kernel modules, and other operating-system-level software.
  • the reasons include: network packet loss, system service exceptions, file system exceptions, and so on.
  • Abnormal physical machine input/output (I/O), high load, and so on.
  • The main reasons include: physical machine hardware failure, physical machine kernel module abnormality, physical machine user-space process abnormality, and the like.
  • The above physical machine failure phenomena are not static; they can transform into one another within a certain period of time and may even be related and intertwined. Moreover, the causes behind the same physical machine phenomenon may differ, so the repair processing method for a faulty physical machine needs to distinguish between them. For example, a physical machine network failure caused by a network DDoS attack and a physical machine network failure caused by physical machine downtime must be treated differently. If the virtual machines are migrated to other physical machines while the physical machine is suffering a network DDoS attack, a domino effect will occur and the failure risk will spread: other physical machines will be attacked in succession and become unavailable, which may eventually flood the network devices of the whole cluster and create a risk of physical machine failure across the entire cluster.
  • the physical machine fault can be summarized into the following categories:
  • Hardware and software faults that the physical machine itself can repair in a fault-tolerant manner, for example: failures of disks that store data, virtualization-related kernel module exceptions, exceptions of file systems that store data, and so on.
  • Hardware and software faults that can be repaired by restarting the physical machine, for example: root file system read-only exceptions, network card driver exceptions fixable by a restart, operating system kernel module exceptions, and so on.
  • Faults with unknown causes, such as the system load class, the system network class, and the hardware fault class. Although the root causes of such faults are difficult to determine, their symptoms are very clear, mainly: physical machine network packet loss, physical machine management channel access exceptions, and abnormal physical machine performance.
  • Physical machine faults caused by the physical machine being subjected to a network attack. For example, a network DDoS-type security attack causes heavy network packet loss or even a complete network failure. The phenomena of such faults mainly include: the physical machine network is unreachable, network packets are lost, and the management channel is unreachable.
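  • As an illustrative, non-authoritative sketch of the four fault categories summarized above, they could be modelled as follows; the enum names and the example phenomenon-to-category mapping are assumptions made for illustration only, not terms defined by the present application.
```python
from enum import Enum, auto

class FaultCategory(Enum):
    """Hypothetical labels for the four fault categories discussed above."""
    SELF_FAULT_TOLERANT = auto()   # repairable by the physical machine itself via fault tolerance
    SELF_RESTARTABLE = auto()      # repairable by restarting the physical machine
    UNKNOWN_CAUSE = auto()         # unknown root cause: load / network / hardware symptoms
    NETWORK_ATTACK = auto()        # caused by a security attack such as a network DDoS

# Example phenomenon-to-category mapping (assumed values, for illustration only).
EXAMPLE_CLASSIFICATION = {
    "data_disk_failure": FaultCategory.SELF_FAULT_TOLERANT,
    "root_fs_readonly": FaultCategory.SELF_RESTARTABLE,
    "nic_driver_exception": FaultCategory.SELF_RESTARTABLE,
    "management_channel_exception": FaultCategory.UNKNOWN_CAUSE,
    "ddos_packet_flood": FaultCategory.NETWORK_ATTACK,
}
```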
  • The embodiments of the present application quickly and accurately identify fine-grained faults in these various physical machine fault scenarios and perform targeted processing, thereby realizing fast and highly reliable physical machine fault repair and ensuring rapid recovery of the virtual machine services running on the faulty machines.
  • The embodiments of the present application can complete virtual machine recovery on a failed physical machine within ten minutes, so that the virtual machine meets a commercial availability standard of over 99.95%.
  • Referring to FIG. 1, a flow chart of the steps of an embodiment of a cluster physical machine fault classification processing method of the present application is shown.
  • the physical machine fault classification processing method can be applied to a virtualized cluster system, and specifically includes the following steps:
  • Step 210 Obtain a physical machine fault information list from the physical machine fault information storage center.
  • The physical machine fault information list includes: physical machine fault information detected by the physical machine fault detecting module outside the cluster from the faulty physical machine and reported to the physical machine fault information storage center, and physical machine fault information collected by the physical machine fault collection module outside the cluster from the faulty physical machine and reported to the physical machine fault information storage center.
  • Step 220 If it is detected in the physical machine fault information list that the physical machine is faulty due to a network attack, the security attack protection center outside the cluster is triggered to handle it;
  • Step 230 If a hardware or software failure that cannot be repaired by the physical machine itself is detected in the physical machine failure information list, send an instruction to shut down the failed physical machine to the failed physical machine, and migrate the virtual machines on the failed physical machine through the virtualization interface to other healthy physical machines in the cluster system;
  • The instruction to shut down the failed physical machine is sent to the failed physical machine to instruct it either to shut itself down autonomously or to be shut down through the out-of-band management module on the physical machine.
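  • A minimal sketch of the two shutdown paths described above, assuming hypothetical helper names (send_agent_shutdown, oob_power_off); neither name comes from the present application, and a real implementation would call the cluster's own management interfaces.
```python
def send_agent_shutdown(machine_id: str) -> None:
    """Hypothetical: ask the faulty machine's own agent to shut the machine down."""
    print(f"[agent] shutdown requested for {machine_id}")

def oob_power_off(machine_id: str) -> None:
    """Hypothetical: power the machine off via its out-of-band management module."""
    print(f"[oob]   power-off issued for {machine_id}")

def shut_down_faulty_machine(machine_id: str, agent_reachable: bool) -> None:
    """Isolate a faulty physical machine before its virtual machines are migrated.

    The text describes two alternatives: instruct the faulty machine to shut
    itself down autonomously, or shut it down through the out-of-band module.
    """
    if agent_reachable:
        send_agent_shutdown(machine_id)
    else:
        oob_power_off(machine_id)
```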
  • The types of hardware and software faults that cannot be repaired by the physical machine itself may include: physical machine downtime, physical machine CPU abnormality, physical machine memory abnormality, physical machine power module abnormality, and other hardware problems.
  • This type of fault directly renders the physical machine unavailable and requires replacement of the hardware module. Therefore, the embodiment of the present application isolates the faulty physical machine from the cluster so that hardware replacement or maintenance can be performed on it.
  • The out-of-band control system on a traditional physical machine usually has an availability of only about 90% or even lower due to hardware failure rates and cost. Under the commercial availability requirement of at least 99.95% for the cloud computing service itself, the total allowed unavailability is 262.8 minutes per year (365 × 24 × 60 × 0.05% = 262.8 minutes). If a faulty physical machine cannot be repaired in time, a single physical machine failure directly leads to tens of minutes of manual processing. Therefore, the availability of the out-of-band control system in the prior art cannot match the fault-recovery Service-Level Agreement (SLA) of a commercial cloud computing service.
  • In the embodiment of the present application, the faulty physical machine can be instructed by the physical machine fault classification processing module outside the cluster to shut itself down, after which the physical machine fault classification processing module outside the cluster migrates the virtual machines on the faulty physical machine through the virtualization interface to other healthy physical machines in the cluster system, thereby greatly reducing the repair time of the faulty physical machine and increasing the commercial availability of the system.
  • The embodiment of the present application can quickly and accurately identify multiple kinds of physical machine faults and perform targeted processing, achieving fast and highly reliable physical machine fault repair and ensuring rapid recovery of the virtual machine services on the faulty machines.
  • For physical machine fault conditions that cannot be repaired by the physical machine itself, the embodiment of the present application can also have the physical machine fault classification processing module outside the cluster instruct the faulty physical machine to shut itself down, which makes up for the fact that the availability of the out-of-band management module's shutdown operation cannot meet commercial standards, and also ensures the effectiveness of automated physical machine isolation.
  • Referring to FIG. 2, a flow chart of the steps of another embodiment of the cluster physical machine fault classification processing method of the present application is shown, which may specifically include the following steps:
  • Step 210 Obtain a physical machine fault information list from the physical machine fault information storage center.
  • The physical machine fault information list includes: physical machine fault information detected by the physical machine fault detecting module outside the cluster from the faulty physical machine and reported to the physical machine fault information storage center, and physical machine fault information collected by the physical machine fault collection module outside the cluster from the faulty physical machine and reported to the physical machine fault information storage center.
  • Step 220 If it is detected in the physical machine fault information list that the physical machine is faulty due to a network attack, the security attack protection center outside the cluster is triggered to handle it;
  • Step 230 If a hardware or software failure that cannot be repaired by the physical machine itself is detected in the physical machine failure information list, send an instruction to shut down the failed physical machine to the failed physical machine, and migrate the virtual machines on the failed physical machine through the virtualization interface to other healthy physical machines in the cluster system;
  • The instruction to shut down the failed physical machine is sent to the failed physical machine to instruct it either to shut itself down autonomously or to be shut down through the out-of-band management module on the physical machine.
  • The types of hardware and software faults that cannot be repaired by the physical machine itself may include: physical machine downtime, physical machine CPU abnormality, physical machine memory abnormality, physical machine power module abnormality, and other hardware problems.
  • This type of fault directly renders the physical machine unavailable and requires replacement of the hardware module. Therefore, the embodiment of the present application isolates the faulty physical machine from the cluster so that hardware replacement or maintenance can be performed on it.
  • The out-of-band control system on a traditional physical machine usually has an availability of only about 90% or even lower due to hardware failure rates and cost. Under the commercial availability requirement of at least 99.95% for the cloud computing service itself, the total allowed unavailability is 262.8 minutes per year. If a faulty physical machine cannot be repaired in time, a single physical machine failure directly leads to tens of minutes of manual processing. Therefore, the availability of the out-of-band management system in the prior art cannot match the fault-recovery Service Level Agreement (SLA) of a commercial cloud computing service.
  • In the embodiment of the present application, the faulty physical machine can be instructed by the physical machine fault classification processing module outside the cluster to shut itself down, after which the physical machine fault classification processing module outside the cluster migrates the virtual machines on the faulty physical machine through the virtualization interface to other healthy physical machines in the cluster system, thereby greatly reducing the repair time of the faulty physical machine and increasing the commercial availability of the system.
  • Step 240 If it is detected in the physical machine fault information list that the physical machine network is completely unreachable and the unreachable duration reaches a preset time, determine whether the number of network-unreachable physical machines exceeds a preset number; if yes, notify the operation and maintenance personnel to repair manually; otherwise, migrate the virtual machines on the failed physical machine through the virtualization interface to other healthy physical machines;
  • The preset time may be set to a suitable period, such as 3 minutes or 5 minutes, according to actual conditions.
  • In this case, the embodiment of the present application further checks whether the number of failed, network-unreachable physical machines exceeds the number of physical machines in one cabinet or the number of physical machines connected to one switch. If it does, the situation is considered a cluster-scale network failure: a telephone alarm is raised, the operation and maintenance personnel repair it manually, and it is no longer handled automatically. This is because, for a large-scale physical machine failure, migrating the virtual machines would require shutting down the physical machines, whereas once the equipment room (network equipment, power equipment, etc.) is restored, the physical machines only need to be restarted.
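  • The scale check described above can be sketched as follows; the 3-minute preset time comes from the examples in this text, while the function name, the per-cabinet machine count of 16, and the returned action labels are illustrative assumptions.
```python
def handle_unreachable_machines(unreachable_seconds: dict,
                                preset_time_s: int = 180,
                                machines_per_cabinet: int = 16) -> str:
    """Choose between manual repair and automatic migration for unreachable machines.

    unreachable_seconds maps machine id -> seconds the machine's network has been
    unreachable; machines_per_cabinet stands in for "machines in one cabinet or
    behind one switch" (16 is only an assumed example value).
    """
    long_unreachable = [m for m, t in unreachable_seconds.items() if t >= preset_time_s]
    if not long_unreachable:
        return "keep_observing"                 # below the preset time, keep watching
    if len(long_unreachable) > machines_per_cabinet:
        return "telephone_alarm_manual_repair"  # cluster-scale fault: no automatic handling
    return "migrate_virtual_machines"           # isolated faults: migrate VMs to healthy machines
```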
  • The method provided by the embodiment of the present application can distinguish between types of physical machine faults and can greatly shorten the repair time of a faulty physical machine, thereby greatly shortening the time for which virtual machines are unavailable and improving the commercial availability of the system.
  • the method in the embodiment of the present application may further include:
  • Step 250 If it is detected in the physical machine fault information list that the physical machine network was unreachable but returned to normal before the unreachable duration reached the preset time, and it is determined that the physical machine network failure was caused by a physical machine restart, then determine whether the current physical machine is healthy; if healthy, restart the virtual machines on the physical machine through the virtualization interface, and if not healthy, migrate the virtual machines on the failed physical machine through the virtualization interface to other healthy physical machines in the cluster;
  • Step 260 If it is detected in the physical machine fault information list that the physical machine network is unstable and the instability duration reaches a preset time, send an instruction to the faulty physical machine to instruct it to shut itself down, or shut down the faulty physical machine through the out-of-band management module on the physical machine; and migrate the virtual machines on the faulty physical machine through the virtualization interface to other healthy physical machines in the cluster system;
  • The situation in which the physical machine network is unstable and the instability duration reaches the preset time usually corresponds to physical machine faults with unknown causes, such as the system load class, the system network class, and the hardware fault class. Although the root causes of such faults are difficult to determine, their symptoms are very clear, mainly: physical machine network packet loss, physical machine management channel access exceptions, and abnormal physical machine performance.
  • For such physical machine faults, the same processing manner may be adopted, that is, sending an instruction to the failed physical machine to instruct it to shut itself down, or shutting down the failed physical machine through the out-of-band management module on the physical machine;
  • the virtual machine on the failed physical machine is migrated to other healthy physical machines in the cluster system through the virtualization interface.
  • the healthy physical machine is determined by:
  • A physical machine that does not match any entry in the physical machine fault information list is determined to be a healthy physical machine.
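  • A minimal sketch of this selection, under the reading that a physical machine "matches" when it appears in the physical machine fault information list (an interpretation; the exact matching rule is not spelled out in this text, and the machine_id field name is assumed):
```python
def healthy_machines(cluster_machines: list, fault_list: list) -> list:
    """Return the machines that do not match any entry in the fault information list."""
    faulty = {entry["machine_id"] for entry in fault_list}   # assumed field name
    return [m for m in cluster_machines if m not in faulty]

# Example: machines B and C would be considered healthy migration targets.
print(healthy_machines(["A", "B", "C"], [{"machine_id": "A", "type": "downtime"}]))
```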
  • The embodiment of the present application can quickly and accurately identify multiple kinds of physical machine faults and perform targeted processing, achieving fast and highly reliable physical machine fault repair and ensuring rapid recovery of the virtual machine services on the faulty machines.
  • In the embodiment of the present application, the physical machine autonomously detects its own fault dynamics and performs targeted repair processing for the physical machine fault conditions that it can repair itself; physical machine fault conditions that the physical machine cannot repair itself are classified and repaired in a targeted manner by the physical machine fault classification processing module outside the cluster, thereby effectively reducing false positives and missed detections of physical machine faults and allowing virtual machines to be recovered automatically in a safe, stable, and fast manner.
  • For physical machine fault conditions that cannot be repaired by the physical machine itself, the embodiment of the present application can also have the physical machine fault classification processing module outside the cluster instruct the faulty physical machine to shut itself down, which makes up for the fact that the availability of the out-of-band management module's shutdown operation cannot meet commercial standards, and also ensures the effectiveness of automated physical machine isolation.
  • The embodiment of the present application also considers the possibility of large-scale physical machine failure in a large-scale cloud computing cluster: it determines whether the number of failed physical machines reaches machine-room scale and adopts different repair processing methods accordingly. In particular, for large-scale physical machine failures the repair is done by manual processing, which effectively avoids the system performance problems that frequent migration of the virtual machines on the failed physical machines would otherwise cause.
  • Referring to FIG. 3, a flow chart of the steps of an embodiment of a virtual machine recovery method of the present application is shown, which may specifically include the following steps:
  • Step 310 The physical machine in the virtualized cluster system independently detects its own fault dynamics
  • each physical machine can periodically detect its own fault dynamics at regular time intervals, for example, once every 30 seconds.
  • Step 320 If a hardware or software fault that the physical machine itself can repair in a fault-tolerant manner is detected autonomously, the fault is repaired in a fault-tolerant manner;
  • In the embodiments of the present application, the hardware and software faults that the physical machine itself can repair in a fault-tolerant manner may include: failures of disks that store data, virtualization-related kernel module exceptions, exceptions of file systems that store data, and the like.
  • Taking a disk failure as an example, the fault-tolerant repair method specifically includes first isolating the disk and then using the cluster's distributed multi-replica storage mechanism to automatically copy the data on the disk to other healthy disks, which effectively ensures that the system continues to run stably after the faulty disk is isolated. Similarly, when a file system that stores data is corrupted, fault-tolerant repair can be achieved by isolating the disk mounted by that file system.
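  • A rough sketch of this fault-tolerant path, assuming a toy in-memory replica map; the actual distributed multi-replica storage mechanism is not specified here, so the data structures and names below are illustrative only.
```python
def fault_tolerant_disk_repair(bad_disk: str, replicas: dict, healthy_disks: list) -> dict:
    """Isolate a faulty disk, then re-replicate its data blocks onto healthy disks.

    replicas maps block id -> list of disks currently holding a copy; the faulty
    disk is removed from every block it serves and a healthy disk is added instead.
    """
    for block, disks in replicas.items():
        if bad_disk in disks:
            disks.remove(bad_disk)                           # step 1: isolate the faulty disk
            spare = [d for d in healthy_disks if d not in disks]
            if spare:
                disks.append(spare[0])                       # step 2: copy the block to a healthy disk
    return replicas

# Example: block "b1" loses its copy on disk1 and gains one on disk3.
print(fault_tolerant_disk_repair("disk1", {"b1": ["disk1", "disk2"]}, ["disk2", "disk3"]))
```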
  • Step 330 If a hardware or software fault that the physical machine itself can repair by restarting is detected autonomously, the fault is repaired by restarting the physical machine;
  • In the embodiment of the present application, the hardware and software faults that can be repaired by restarting the physical machine may include: root file system read-only exceptions, network card driver exceptions fixable by a restart, operating system kernel module exceptions, and the like. Such hardware and software failures can be fixed by restarting the physical machine.
  • Step 340 Obtain a physical machine fault information list from the physical machine fault information storage center.
  • the physical machine fault classification processing module acquires the physical machine fault information list from the physical machine fault information storage center.
  • The physical machine fault information list includes: physical machine fault information detected by the physical machine fault detecting module outside the cluster from the faulty physical machine and reported to the physical machine fault information storage center, and physical machine fault information collected by the physical machine fault collection module outside the cluster from the faulty physical machine and reported to the physical machine fault information storage center.
  • Step 350 If it is detected in the physical machine fault information list that the physical machine is faulty due to a network attack, the security attack protection center outside the cluster is triggered to handle it;
  • Step 360 If a hardware or software failure that cannot be repaired by the physical machine itself is detected in the physical machine failure information list, send an instruction to the failed physical machine to instruct it to shut itself down, or shut down the failed physical machine through the out-of-band management module on the physical machine; and migrate the virtual machines on the failed physical machine through the virtualization interface to other healthy physical machines in the cluster system.
  • The types of hardware and software faults that cannot be repaired by the physical machine itself may include: physical machine downtime, physical machine CPU abnormality, physical machine memory abnormality, physical machine power module abnormality, and other hardware problems.
  • This type of fault directly renders the physical machine unavailable and requires replacement of the hardware module. Therefore, the embodiment of the present application isolates the faulty physical machine from the cluster so that hardware replacement or maintenance can be performed on it.
  • The out-of-band control system on a traditional physical machine usually has an availability of only about 90% or even lower due to hardware failure rates and cost. Under the commercial availability requirement of at least 99.95% for the cloud computing service itself, the total allowed unavailability is 262.8 minutes per year. If a faulty physical machine cannot be repaired in time, a single physical machine failure directly leads to tens of minutes of manual processing. Therefore, the availability of the out-of-band management system in the prior art cannot match the fault-recovery Service-Level Agreement (SLA) of a commercial cloud computing service.
  • In the embodiment of the present application, the faulty physical machine can be instructed by the physical machine fault classification processing module outside the cluster to shut itself down, after which the physical machine fault classification processing module outside the cluster migrates the virtual machines on the faulty physical machine through the virtualization interface to other healthy physical machines in the cluster system, thereby greatly reducing the repair time of the faulty physical machine and increasing the commercial availability of the system.
  • the method in the embodiment of the present application may further include:
  • Step 370 If it is detected in the physical machine fault information list that the physical machine network is completely unreachable and the unreachable duration reaches a preset time, determine whether the number of network-unreachable physical machines exceeds a preset number; if yes, notify the operation and maintenance personnel to repair manually; otherwise, migrate the virtual machines on the failed physical machine through the virtualization interface to other healthy physical machines in the cluster system.
  • The preset time may be set to a suitable period, such as 3 minutes or 5 minutes, according to actual conditions.
  • In this case, the embodiment of the present application further checks whether the number of failed, network-unreachable physical machines exceeds the number of physical machines in one cabinet or the number of physical machines connected to one switch. If it does, the situation is considered a cluster-scale network failure: a telephone alarm is raised, the operation and maintenance personnel repair it manually, and it is no longer handled automatically. This is because, for a large-scale physical machine failure, migrating the virtual machines would require shutting down the physical machines, whereas once the equipment room (network equipment, power equipment, etc.) is restored, the physical machines only need to be restarted.
  • The method provided by the embodiment of the present application can distinguish between types of physical machine faults and can greatly shorten the repair time of a faulty physical machine, thereby greatly shortening the time for which virtual machines are unavailable and improving the commercial availability of the system.
  • the method in the embodiment of the present application may further include:
  • Step 380 If it is detected in the physical machine fault information list that the physical machine network was unreachable but returned to normal before the unreachable duration reached the preset time, and it is determined that the physical machine network failure was caused by a physical machine restart, then determine whether the current physical machine is healthy; if healthy, restart the virtual machines on the physical machine through the virtualization interface, and if not healthy, migrate the virtual machines on the failed physical machine through the virtualization interface to other healthy physical machines in the cluster.
  • the method in the embodiment of the present application may further include:
  • Step 390 If it is detected in the physical machine fault information list that the physical machine network is unstable and the instability duration reaches a preset time, send an instruction to the faulty physical machine to instruct it to shut itself down, or shut down the faulty physical machine through the out-of-band management module on the physical machine; and migrate the virtual machines on the faulty physical machine through the virtualization interface to other healthy physical machines in the cluster system.
  • The situation in which the physical machine network is unstable and the instability duration reaches the preset time usually corresponds to physical machine faults with unknown causes, such as the system load class, the system network class, and the hardware fault class. Although the root causes of such faults are difficult to determine, their symptoms are very clear, mainly: physical machine network packet loss, physical machine management channel access exceptions, and abnormal physical machine performance.
  • For such physical machine faults, the same processing method can be adopted, that is, an instruction is sent to the failed physical machine to instruct it to shut itself down, or the failed physical machine is shut down through the out-of-band management module on the physical machine; and the virtual machines on the failed physical machine are migrated through the virtualization interface to other healthy physical machines in the cluster system.
  • the healthy physical machine is determined by:
  • A physical machine that does not match any entry in the physical machine fault information list is determined to be a healthy physical machine.
  • The embodiment of the present application can quickly and accurately identify multiple kinds of physical machine faults and perform targeted processing, achieving fast and highly reliable physical machine fault repair and ensuring rapid recovery of the virtual machine services on the faulty machines.
  • In the embodiment of the present application, the physical machine autonomously detects its own fault dynamics and performs targeted repair processing for the physical machine fault conditions that it can repair itself; physical machine fault conditions that the physical machine cannot repair itself are classified and repaired in a targeted manner by the physical machine fault classification processing module outside the cluster, thereby effectively reducing false positives and missed detections of physical machine faults and allowing virtual machines to be recovered automatically in a safe, stable, and fast manner.
  • For physical machine fault conditions that cannot be repaired by the physical machine itself, the embodiment of the present application can also have the physical machine fault classification processing module outside the cluster instruct the faulty physical machine to shut itself down, which makes up for the fact that the availability of the out-of-band management module's shutdown operation cannot meet commercial standards, and also ensures the effectiveness of automated physical machine isolation.
  • The embodiment of the present application also considers the possibility of large-scale physical machine failure in a large-scale cloud computing cluster: it determines whether the number of failed physical machines reaches machine-room scale and adopts different repair processing methods accordingly. In particular, for large-scale physical machine failures the repair is done by manual processing, which effectively avoids the system performance problems that frequent migration of the virtual machines on the failed physical machines would otherwise cause.
  • Referring to FIG. 4, a flow chart of the steps of another embodiment of a virtual machine recovery method of the present application is shown, which may specifically include the following steps:
  • The physical machine fault detection module checks the network status of each physical machine in the cluster every 30 seconds and updates the result to the physical machine fault information storage center; each physical machine in the cluster system autonomously detects its own fault condition and updates it to the physical machine fault information storage center through the physical machine fault collection module.
  • For hardware and software faults that the physical machine itself can repair in a fault-tolerant manner, the physical machine repairs them by fault tolerance; for hardware and software faults that the physical machine can repair by restarting, the physical machine repairs them by restarting itself; for hardware or software faults that cannot be repaired by the physical machine itself, shutdown processing is performed.
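  • A rough illustration of this three-way autonomous handling (the fault labels and returned action names are assumptions for illustration, not terms from the present application):
```python
def handle_own_fault(fault_kind: str) -> str:
    """Map a locally detected fault to the autonomous repair action described above."""
    fault_tolerant = {"data_disk_failure", "data_fs_exception", "virtualization_kernel_module_exception"}
    restartable = {"root_fs_readonly", "nic_driver_exception", "os_kernel_module_exception"}
    if fault_kind in fault_tolerant:
        return "fault_tolerant_repair"        # e.g. isolate the disk, rely on multi-replica storage
    if fault_kind in restartable:
        return "restart_physical_machine"
    return "shutdown_processing"              # not locally repairable; cluster-side module takes over
```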
  • The physical machine fault classification processing module acquires the physical machine fault information list from the physical machine fault information storage center every 1 minute and determines whether the list is empty; if it is empty, it returns to the loop. Otherwise, it determines whether there is a physical machine failure caused by a network attack in the list; if so, it triggers the security attack protection center outside the cluster to handle it. Otherwise, it determines whether there is a hardware or software failure in the list that cannot be repaired by the physical machine itself; if so, an instruction to shut down the faulty physical machine is sent to it, and the virtualization interface migrates the virtual machines on the failed physical machine to other healthy physical machines in the cluster system.
  • If the physical machine fault information list contains no physical machine fault caused by a network attack, it is then determined whether the list contains a physical machine whose network is completely unreachable and whose unreachable duration reaches the preset time, for example 3 minutes. If so, it is determined whether the number of network-unreachable physical machines exceeds a preset number, for example whether the number of failed physical machines exceeds the number of physical machines in one cabinet or the number of physical machines connected to one switch; if it does, the situation is considered a cluster-scale network failure, a telephone alarm is raised, the operation and maintenance personnel repair it manually, and it is no longer handled automatically. Otherwise, the virtual machines on the failed physical machine are migrated through the virtualization interface to other healthy physical machines in the cluster system.
  • It is also determined whether the physical machine fault information list contains a physical machine whose network was unreachable but returned to normal before the unreachable duration reached the preset time, where the unreachability was caused by a physical machine restart; if so, it is determined whether the current physical machine is healthy. If it is healthy, the virtual machines on the physical machine are restarted through the virtualization interface; if it is not healthy, the virtual machines on the failed physical machine are migrated through the virtualization interface to other healthy physical machines in the cluster.
  • If it is detected in the physical machine fault information list that the physical machine network is unstable and the instability duration reaches a preset time, an instruction is sent to the faulty physical machine to instruct it to shut itself down, or the faulty physical machine is shut down through the out-of-band management module on the physical machine; and the virtual machines on the failed physical machine are migrated through the virtualization interface to other healthy physical machines in the cluster system.
  • The situation in which the physical machine network is unstable and the instability duration reaches the preset time usually corresponds to physical machine faults with unknown causes, such as the system load class, the system network class, and the hardware fault class. Although the root causes of such faults are difficult to determine, their symptoms are very clear, mainly: physical machine network packet loss, physical machine management channel access exceptions, and abnormal physical machine performance. For such physical machine failures, the same treatment can be used.
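  • The per-fault decision logic of FIG. 4 could be condensed into a dispatch function such as the sketch below; the dictionary keys, the 3-minute preset, and the returned action labels are assumptions used only to illustrate the branch order described in the preceding paragraphs.
```python
def classify_fault(fault: dict) -> str:
    """Map one fault record to the branch described in the FIG. 4 flow (illustrative only)."""
    PRESET_S = 180                                    # example preset time of 3 minutes
    kind = fault.get("type")
    if kind == "network_attack":
        return "trigger_security_attack_protection_center"
    if kind == "unrepairable_hw_sw":
        return "shut_down_and_migrate_vms"
    if kind == "network_unreachable":
        if fault.get("unreachable_s", 0) >= PRESET_S:
            return "check_scale_then_manual_repair_or_migrate"
        if fault.get("caused_by_restart"):
            return "restart_vms_if_healthy_else_migrate"
        return "keep_observing"
    if kind == "network_unstable" and fault.get("unstable_s", 0) >= PRESET_S:
        return "shut_down_and_migrate_vms"
    return "keep_observing"

# Example: an unreachable machine past the preset time triggers the scale check.
print(classify_fault({"type": "network_unreachable", "unreachable_s": 200}))
```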
  • the healthy physical machine is determined by:
  • A physical machine that does not match any entry in the physical machine fault information list is determined to be a healthy physical machine.
  • The embodiment of the present application can quickly and accurately identify multiple kinds of physical machine faults and perform targeted processing, achieving fast and highly reliable physical machine fault repair and ensuring rapid recovery of the virtual machine services on the faulty machines.
  • In the embodiment of the present application, the physical machine autonomously detects its own fault dynamics and performs targeted repair processing for the physical machine fault conditions that it can repair itself; physical machine fault conditions that the physical machine cannot repair itself are classified and repaired in a targeted manner by the physical machine fault classification processing module outside the cluster, thereby effectively reducing false positives and missed detections of physical machine faults and allowing virtual machines to be recovered automatically in a safe, stable, and fast manner.
  • For physical machine fault conditions that cannot be repaired by the physical machine itself, the embodiment of the present application can also have the physical machine fault classification processing module outside the cluster instruct the faulty physical machine to shut itself down, which makes up for the fact that the availability of the out-of-band management module's shutdown operation cannot meet commercial standards, and also ensures the effectiveness of automated physical machine isolation.
  • The embodiment of the present application also considers the possibility of large-scale physical machine failure in a large-scale cloud computing cluster: it determines whether the number of failed physical machines reaches machine-room scale and adopts different repair processing methods accordingly. In particular, for large-scale physical machine failures the repair is done by manual processing, which effectively avoids the system performance problems that frequent migration of the virtual machines on the failed physical machines would otherwise cause.
  • Referring to FIG. 5, the physical machine fault repair apparatus 500 is applied to a physical machine in a virtualized cluster system and may specifically include: an autonomous detection module (selfChecker) 510 and an autonomous processing module (selfHandler) 520, wherein:
  • The autonomous detection module 510 specifically includes: a detecting unit 511 configured to autonomously detect the fault dynamics of the physical machine itself; preferably, the detecting unit 511 can periodically detect the fault dynamics of the physical machine itself at a fixed time interval, for example once every 30 seconds.
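  • A rough sketch of such a periodic self-check loop; only the 30-second interval comes from this text, while the probe/handler interfaces and the bounded number of rounds are assumptions so the example terminates when run.
```python
import time
from typing import Callable, Optional

def self_check_loop(probe: Callable[[], Optional[str]],
                    handle: Callable[[str], None],
                    interval_s: int = 30,
                    rounds: int = 3) -> None:
    """Periodically probe the local machine and pass any detected fault to the handler.

    probe returns a fault description (or None if healthy); handle plays the role
    of the autonomous processing module (fault-tolerant repair, restart, or shutdown).
    """
    for _ in range(rounds):
        fault = probe()
        if fault is not None:
            handle(fault)
        time.sleep(interval_s)

# Example usage with trivial stand-ins for the probe and the handler.
self_check_loop(lambda: None, lambda f: print("handling", f), interval_s=1, rounds=2)
```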
  • the autonomous processing module 520 specifically includes:
  • The fault-tolerant unit 521 is configured to repair the fault in a fault-tolerant manner if the detecting unit 511 detects a hardware or software fault that the physical machine itself can repair in a fault-tolerant manner.
  • In the embodiments of the present application, the hardware and software faults that the physical machine itself can repair in a fault-tolerant manner may include: failures of disks that store data, virtualization-related kernel module exceptions, exceptions of file systems that store data, and the like.
  • Taking a disk failure as an example, the fault-tolerant repair method specifically includes first isolating the disk and then using the cluster's distributed multi-replica storage mechanism to automatically copy the data on the disk to other healthy disks, which effectively ensures that the system continues to run stably after the faulty disk is isolated.
  • Similarly, when a file system that stores data is corrupted, fault-tolerant repair can also be achieved by isolating the disk mounted by that file system.
  • The restarting unit 522 is configured to repair the fault by restarting the physical machine if the detecting unit 511 detects a hardware or software fault that the physical machine itself can repair by restarting.
  • In the embodiment of the present application, the hardware and software faults that the physical machine itself can repair by restarting may include: root file system read-only exceptions, network card driver exceptions fixable by a restart, operating system kernel module exceptions, and the like.
  • Such hardware and software failures can be fixed by restarting the physical machine.
  • the autonomous processing module 520 may further include:
  • The shutdown unit 523 is configured to: if the detecting unit 511 detects a hardware or software failure that cannot be repaired by the physical machine itself, shut down the faulty physical machine according to an instruction of the physical machine fault classification processing module outside the cluster or through the out-of-band management module 530 on the physical machine; the physical machine fault classification processing module outside the cluster then migrates the virtual machines on the faulty physical machine through the virtualization interface to other healthy physical machines in the cluster system.
  • The out-of-band control system on a conventional physical machine usually has an availability of only about 90% or even lower due to hardware failure rates and cost. Under the commercial availability requirement of at least 99.95% for the cloud computing service itself, the total allowed unavailability is 262.8 minutes per year. If a faulty physical machine cannot be repaired in time, a single physical machine failure directly leads to tens of minutes of manual processing. Therefore, the availability of the out-of-band management system in the prior art cannot match the fault-recovery Service-Level Agreement (SLA) of a commercial cloud computing service.
  • In the embodiment of the present application, the faulty physical machine can be instructed by the physical machine fault classification processing module outside the cluster to shut itself down, after which the physical machine fault classification processing module outside the cluster migrates the virtual machines on the faulty physical machine through the virtualization interface to other healthy physical machines in the cluster system, thereby greatly reducing the repair time of the faulty physical machine and increasing the commercial availability of the system.
  • the autonomous detection module 510 may further include:
  • The reporting unit 512 is configured to, when the detecting unit 511 autonomously detects a physical machine fault caused by a network attack, report the physical machine fault information to the physical machine fault information storage center through the physical machine fault collection module, so that the physical machine fault classification processing module outside the cluster triggers the security attack protection center outside the cluster to handle it.
  • After the security attack protection center is triggered, a security cleaning process is started, for example traffic cleaning, so that the faulty physical machine is restored to health.
  • As noted above, a physical machine network failure caused by a network DDoS attack and a physical machine network failure caused by physical machine downtime need to be treated differently. If the virtual machines are migrated to other physical machines while the physical machine is suffering a network DDoS attack, a domino effect will occur and the failure risk will spread: other physical machines will be attacked in succession and become unavailable, which may eventually flood the network devices of the whole cluster and create a risk of physical machine failure across the entire cluster.
  • The embodiment of the present application can quickly and accurately identify multiple kinds of physical machine faults and perform targeted processing, achieving fast and highly reliable physical machine fault repair and ensuring rapid recovery of the virtual machine services on the faulty machines.
  • In the embodiment of the present application, the physical machine autonomously detects its own fault dynamics and performs targeted repair processing for the physical machine fault conditions that it can repair itself; physical machine fault conditions that the physical machine cannot repair itself are classified and repaired in a targeted manner by the physical machine fault classification processing module outside the cluster, thereby effectively reducing false positives and missed detections of physical machine faults and allowing virtual machines to be recovered automatically in a safe, stable, and fast manner.
  • For physical machine fault conditions that cannot be repaired by the physical machine itself, the embodiment of the present application can also have the physical machine fault classification processing module outside the cluster instruct the faulty physical machine to shut itself down, which makes up for the fact that the availability of the out-of-band management module's shutdown operation cannot meet commercial standards, and also ensures the effectiveness of automated physical machine isolation.
  • Referring to FIG. 6, the physical machine fault classification processing apparatus 600 may specifically include the following modules:
  • the obtaining module 610 is configured to obtain a physical machine fault information list from the physical machine fault information storage center.
  • the physical machine fault information list includes: physical machine fault information detected from faulty physical machines by the physical machine fault detection module outside the cluster and reported to the physical machine fault information storage center, and physical machine fault information collected from faulty physical machines by the physical machine fault collection module outside the cluster and reported to the physical machine fault information storage center.
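  • A fault information entry can be pictured as a small record that carries the faulty host, the fault type, and which reporting path produced it; the field names below are illustrative assumptions, not the storage schema of this application.

```python
from dataclasses import dataclass


@dataclass
class FaultRecord:
    host: str         # identifier of the faulty physical machine
    fault_type: str   # e.g. "network_attack", "unrepairable_hw_sw", "network_unreachable"
    source: str       # "fault_detection_module" (probed) or "fault_collection_module" (host-reported)
    duration_s: int   # how long the fault has persisted, in seconds


# The fault information list obtained from the storage center is then just a list of
# such records, e.g. [FaultRecord("pm-03", "network_attack", "fault_collection_module", 42)].
```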
  • the first processing module 620 is configured to trigger processing by the security attack protection center outside the cluster if a physical machine fault caused by a network attack is detected in the physical machine fault information list;
  • the second processing module 630 further includes:
  • a shutdown processing unit configured to send an instruction to shut down the faulty physical machine to that machine if a software or hardware fault that the physical machine cannot repair by itself is detected in the physical machine fault information list; preferably, the instruction may instruct the faulty physical machine to shut itself down autonomously or to shut down through the out-of-band management module on the physical machine;
  • a migration processing unit configured to migrate the virtual machine on the faulty physical machine to other healthy physical machines in the cluster system through the virtualization interface.
  • It should be noted that the types of software and hardware faults that the physical machine cannot repair by itself may include: physical machine downtime, CPU abnormality, memory abnormality, power supply module failure, and other hardware problems.
  • Such faults directly make the physical machine unavailable and can only be repaired by replacing hardware modules; therefore, the embodiments of the present application first isolate the faulty physical machine from the cluster and then perform hardware replacement or maintenance on it.
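  • A compact sketch of how the shutdown and migration units might cooperate for this fault class is given below; send_shutdown_instruction, oob_shutdown, and migrate_vms are hypothetical helpers, and the fallback ordering shown (autonomous shutdown first, out-of-band second) is one possible policy rather than a requirement of this application.

```python
def handle_unrepairable_fault(host: str, send_shutdown_instruction,
                              oob_shutdown, migrate_vms) -> None:
    """Isolate a host with a fault it cannot repair itself, then migrate its VMs."""
    try:
        send_shutdown_instruction(host)   # ask the host to shut itself down autonomously
    except ConnectionError:
        oob_shutdown(host)                # fall back to the out-of-band management module
    migrate_vms(host)                     # then move its VMs to healthy hosts in the cluster
```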
  • the physical machine fault classification processing device 600 may further include a third processing module 640, and the third processing module 640 specifically includes:
  • a notification processing unit configured to notify operation and maintenance personnel to perform manual repair if it is detected in the physical machine fault information list that the physical machine network is completely unreachable, the outage has lasted for the preset time, and the number of unreachable physical machines exceeds one;
  • a migration processing unit configured to migrate, through the virtualization interface, the virtual machines on the faulty physical machine to other healthy physical machines in the cluster system if it is detected in the physical machine fault information list that the physical machine network is completely unreachable, the outage has lasted for the preset time, and the number of unreachable physical machines does not exceed the preset number.
  • The preset time may be set to a suitable period, such as 3 minutes or 5 minutes, according to actual conditions.
  • It should be noted that, when it is detected that the physical machine network is completely unreachable and the outage has lasted for the preset time, the embodiments of the present application further check whether the number of unreachable faulty physical machines exceeds the number of physical machines in one cabinet or the number of physical machines attached to one switch. If it does, the situation is treated as a cluster-scale network failure, and operation and maintenance personnel are alerted to repair it manually rather than handling it automatically. This is because, for a large-scale physical machine failure, isolating the physical machines and migrating their virtual machines would shut down a large number of physical machines; after the equipment room facilities (network equipment, power equipment, and the like) are restored, the physical machines would have to be restarted and the virtual machines recovered again, which doubles or further increases the manual processing time and greatly lengthens the period during which the virtual machines are unavailable.
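  • The scale check described above reduces to a threshold comparison, sketched below; the RACK_SIZE value and the notify_ops/migrate_vms helpers are illustrative assumptions.

```python
RACK_SIZE = 16  # assumed number of physical machines per cabinet (or per access switch)


def handle_total_network_outage(unreachable_hosts, outage_seconds, preset_seconds,
                                notify_ops, migrate_vms) -> None:
    """Choose between manual repair and automatic migration for full network outages."""
    if outage_seconds < preset_seconds:
        return  # not yet a confirmed outage
    if len(unreachable_hosts) > RACK_SIZE:
        # Likely a cabinet- or switch-level failure: alert operations staff instead of
        # automatically shutting down and migrating a large number of machines.
        notify_ops(unreachable_hosts)
    else:
        for host in unreachable_hosts:
            migrate_vms(host)
```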
  • Therefore, the method provided by the embodiments of the present application distinguishes between these physical machine fault types and can greatly shorten the repair time of the faulty physical machine, thereby greatly shortening the time during which the virtual machines on it are unavailable and improving the commercial availability of the system.
  • the physical machine fault classification processing device 600 may further include a fourth processing module 650, where the fourth processing module 650 specifically includes:
  • a restart processing unit configured to restart the virtual machines on the physical machine through the virtualization interface if it is detected in the physical machine fault information list that the physical machine network was unreachable, the network recovered before the outage reached the preset time, it is determined that the outage was caused by a physical machine restart, and the current physical machine is determined to be healthy;
  • a migration processing unit configured to migrate, through the virtualization interface, the virtual machines on the faulty physical machine to other healthy physical machines in the cluster if it is detected in the physical machine fault information list that the physical machine network was unreachable, the network recovered before the outage reached the preset time, it is determined that the outage was caused by a physical machine restart, and the current physical machine is determined to be unhealthy.
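  • A rough sketch of this restart-versus-migrate decision follows; is_healthy, restart_vms, and migrate_vms are assumed helpers.

```python
def handle_transient_outage(host: str, caused_by_restart: bool,
                            is_healthy, restart_vms, migrate_vms) -> None:
    """After a short outage that ended on its own, recover VMs in place when possible."""
    if not caused_by_restart:
        return  # only the restart-induced case is handled here
    if is_healthy(host):
        restart_vms(host)   # the host came back healthy: restart its VMs in place
    else:
        migrate_vms(host)   # the host is still unhealthy: move its VMs to healthy hosts
```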
  • the physical machine fault classification processing device 600 may further include a fifth processing module 660, where the fifth processing module 660 specifically includes:
  • a shutdown processing unit configured to send an instruction to the faulty physical machine if it is detected in the physical machine fault information list that the physical machine network is unstable and the instability has lasted for a preset time, the instruction instructing the faulty physical machine to shut itself down autonomously or to shut down through the out-of-band management module on the physical machine;
  • a migration processing unit configured to migrate the virtual machine on the faulty physical machine to other healthy physical machines in the cluster system through the virtualization interface.
  • It should be noted that the situation in which the physical machine network is unstable and the instability lasts for the preset time is mainly caused by physical machine faults of unknown cause, for example faults related to system load, the system network, or hardware.
  • Although the root causes of such faults are difficult to determine, their symptoms are very clear, mainly: physical machine network packet loss, abnormal access to the physical machine management channel, and abnormal physical machine performance.
  • For such physical machine faults, the same processing manner may be adopted, that is, an instruction is sent to the faulty physical machine instructing it to shut itself down autonomously or to shut down through the out-of-band management module on the physical machine, and the virtual machines on the faulty physical machine are migrated to other healthy physical machines in the cluster system through the virtualization interface.
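  • The handling of this fault class can be pictured as the same shutdown-and-migrate flow keyed on how long the instability has lasted; the helper names below are assumptions.

```python
def handle_unstable_network(host: str, unstable_seconds: int, preset_seconds: int,
                            send_shutdown_instruction, migrate_vms) -> None:
    """Isolate a host whose network has been flapping for too long, then migrate its VMs."""
    if unstable_seconds < preset_seconds:
        return  # give short-lived instability a chance to clear on its own
    send_shutdown_instruction(host)  # autonomous shutdown or via the out-of-band module
    migrate_vms(host)
```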
  • the physical machine fault classification processing device 600 may further include:
  • the determining module 670 is configured to match all physical machines in the cluster against the physical machine fault information list and to determine the physical machines that are not matched as healthy physical machines.
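  • Determining the healthy set amounts to a set difference between all hosts in the cluster and the hosts named in the fault information list, as in the sketch below (which assumes fault records shaped like the FaultRecord example earlier).

```python
def healthy_hosts(all_hosts, fault_records):
    """Hosts that do not appear in the fault information list are treated as healthy."""
    faulty = {record.host for record in fault_records}
    return [host for host in all_hosts if host not in faulty]
```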
  • In a large-scale cloud computing cluster, the embodiments of the present application can quickly and accurately identify a variety of physical machine fault scenarios and process them in a targeted, classified manner, thereby achieving fast and highly reliable physical machine fault repair and ensuring rapid recovery of the virtual machine services running on the faulty machines.
  • Further, in the embodiments of the present application each physical machine autonomously detects its own fault dynamics; fault conditions that the physical machine can repair by itself are repaired on the machine in a targeted, classified manner, while fault conditions that the physical machine cannot repair by itself are classified and repaired in a targeted manner by the physical machine fault classification processing module outside the cluster. This effectively reduces false positives and missed detections of physical machine faults, and makes automatic virtual machine recovery safer, more stable, and faster.
  • In addition, for fault conditions that the physical machine cannot repair by itself, besides shutting down the faulty physical machine through its out-of-band management module, the physical machine fault classification processing module outside the cluster may also instruct the faulty physical machine to shut itself down autonomously. This compensates for the fact that the availability of shutdown operations invoked through the out-of-band management module cannot meet commercial standards, while also ensuring the effectiveness of automated physical machine isolation.
  • In addition, the embodiments of the present application also take into account the possibility of a large-scale physical machine failure occurring in a large cloud computing cluster: by judging whether the number of faulty physical machines reaches equipment-room scale, different repair strategies are adopted in a targeted manner. In particular, large-scale physical machine failures are repaired manually, which effectively prevents frequent migrations of the virtual machines on the faulty physical machines from degrading system performance.
  • Referring to FIG. 7, the virtual machine recovery system includes: a physical machine fault repair device 710 applied to each physical machine in the virtualized cluster system 700, a physical machine fault classification processing device 720, and a physical machine fault information storage center 730.
  • The physical machine fault repair device 710 may specifically include an autonomous detection module 711 and an autonomous processing module 712, wherein the autonomous detection module 711 is configured to autonomously detect the fault dynamics of the physical machine itself, and the autonomous processing module 712 is configured to repair, in a fault-tolerant manner, software and hardware faults that the autonomous detection module 711 detects the physical machine can repair through fault tolerance, and to repair, by restarting the physical machine, software and hardware faults that the autonomous detection module 711 detects the physical machine can repair through a restart.
  • Preferably, the autonomous processing module 712 is further configured to, if the autonomous detection module 711 detects a software or hardware fault that the physical machine cannot repair by itself, shut down the faulty physical machine according to an instruction from the physical machine fault classification processing device 720 outside the cluster or through the out-of-band management module 713 on the physical machine; the physical machine fault classification processing device 720 outside the cluster then migrates the virtual machines on the faulty physical machine to other healthy physical machines in the cluster system through the virtualization interface.
  • Preferably, the autonomous detection module 711 is further configured to report, through the physical machine fault collection module 760, physical machine fault information to the physical machine fault information storage center 730 when it autonomously detects a physical machine fault caused by a network attack, so that the physical machine fault classification processing device 720 outside the cluster triggers processing by the security attack protection center 740 outside the cluster.
  • It should be noted that the autonomous detection module 711 and the autonomous processing module 712 may be software modules deployed on each physical machine of the cluster that start automatically when the physical machine is powered on; their operation does not depend on the file system and relies only on the CPU and memory.
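  • A minimal sketch of such an on-host agent loop is shown below, assuming the 30-second check interval mentioned elsewhere in the application; the detection and repair callables and the fault "class" labels are placeholders.

```python
import time


def self_check_loop(detect_faults, repair_with_fault_tolerance, repair_by_reboot,
                    report_fault, interval_s: int = 30) -> None:
    """Periodic self-check agent started when the physical machine boots."""
    while True:
        for fault in detect_faults():
            if fault["class"] == "fault_tolerant":        # e.g. a failed data disk
                repair_with_fault_tolerance(fault)
            elif fault["class"] == "reboot_repairable":   # e.g. a read-only root filesystem
                repair_by_reboot(fault)
            else:
                # Faults the machine cannot repair itself are reported so that the
                # classification module outside the cluster can handle them.
                report_fault(fault)
        time.sleep(interval_s)
```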
  • The physical machine fault information storage center 730 is configured to collect all reported physical machine fault information into a physical machine fault information list. The physical machine fault information list includes: physical machine fault information detected from faulty physical machines by the physical machine fault detection module 750 outside the cluster and reported to the physical machine fault information storage center 730, and physical machine fault information collected from faulty physical machines by the physical machine fault collection module 760 outside the cluster and reported to the physical machine fault information storage center 730.
  • The physical machine fault classification processing device 720 is configured to obtain the physical machine fault information list from the physical machine fault information storage center 730 through the obtaining module 721; if a physical machine fault caused by a network attack is detected in the physical machine fault information list, the first processing module 722 triggers processing by the security attack protection center 740 outside the cluster; if a software or hardware fault that the physical machine cannot repair by itself is detected in the physical machine fault information list, the second processing module 723 sends an instruction to the faulty physical machine instructing it to shut itself down autonomously or to shut down through the out-of-band management module 713 on the physical machine, and migrates the virtual machines on the faulty physical machine to other healthy physical machines in the cluster system through the virtualization interface.
  • Preferably, the physical machine fault classification processing device 720 may further include a third processing module 724 configured to, if it is detected in the physical machine fault information list that the physical machine network is completely unreachable and the outage has lasted for a preset time, determine whether the number of unreachable physical machines exceeds a preset number; if so, operation and maintenance personnel are notified to repair manually; otherwise, the virtual machines on the faulty physical machine are migrated to other healthy physical machines in the cluster system through the virtualization interface.
  • Preferably, the physical machine fault classification processing device 720 may further include a fourth processing module 725 configured to, if it is detected in the physical machine fault information list that the physical machine network was unreachable but the network recovered before the outage reached the preset time, and it is determined that the outage was caused by a physical machine restart, judge whether the current physical machine is healthy; if it is healthy, the virtual machines on the physical machine are restarted through the virtualization interface; if it is not healthy, the virtual machines on the faulty physical machine are migrated to other healthy physical machines in the cluster through the virtualization interface.
  • Preferably, the physical machine fault classification processing device 720 may further include a fifth processing module 726 configured to, if it is detected in the physical machine fault information list that the physical machine network is unstable and the instability has lasted for a preset time, send an instruction to the faulty physical machine instructing it to shut itself down autonomously or to shut down through the out-of-band management module on the physical machine, and to migrate the virtual machines on the faulty physical machine to other healthy physical machines in the cluster system through the virtualization interface.
  • Preferably, the physical machine fault classification processing device 720 may further include a determining module 727 configured to match all physical machines in the cluster against the physical machine fault information list and to determine the physical machines that are not matched as healthy physical machines.
  • It should be noted that, in another embodiment of the present application, the physical machine fault classification processing device 720, the physical machine fault detection module 750, and the physical machine fault collection module 760 in the virtual machine recovery system are all software modules deployed on physical machines outside the virtualized cluster system 700; they may be deployed independently on different physical machines or together on the same physical machine.
  • The physical machine fault information storage center 730 is a database system deployed outside the virtualized cluster system 700. The security attack protection center 740 may directly use an existing security attack protection system; the embodiments of the present application do not limit this.
  • In a large-scale cloud computing cluster, the embodiments of the present application can quickly and accurately identify a variety of physical machine fault scenarios and process them in a targeted, classified manner, thereby achieving fast and highly reliable physical machine fault repair and ensuring rapid recovery of the virtual machine services running on the faulty machines.
  • Further, in the embodiments of the present application each physical machine autonomously detects its own fault dynamics; fault conditions that the physical machine can repair by itself are repaired on the machine in a targeted, classified manner, while fault conditions that the physical machine cannot repair by itself are classified and repaired in a targeted manner by the physical machine fault classification processing module outside the cluster. This effectively reduces false positives and missed detections of physical machine faults, and makes automatic virtual machine recovery safer, more stable, and faster.
  • In addition, for fault conditions that the physical machine cannot repair by itself, besides shutting down the faulty physical machine through its out-of-band management module, the physical machine fault classification processing module outside the cluster may also instruct the faulty physical machine to shut itself down autonomously. This compensates for the fact that the availability of shutdown operations invoked through the out-of-band management module cannot meet commercial standards, while also ensuring the effectiveness of automated physical machine isolation.
  • In addition, the embodiments of the present application also take into account the possibility of a large-scale physical machine failure occurring in a large cloud computing cluster: by judging whether the number of faulty physical machines reaches equipment-room scale, different repair strategies are adopted in a targeted manner. In particular, large-scale physical machine failures are repaired manually, which effectively prevents frequent migrations of the virtual machines on the faulty physical machines from degrading system performance.
  • Since the device embodiments are substantially similar to the method embodiments, their description is relatively brief; for relevant details, refer to the description of the method embodiments.
  • Those skilled in the art will appreciate that the embodiments of the present application may be provided as a method, an apparatus, or a computer program product. Therefore, the embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
  • In a typical configuration, the computer device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • the memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in a computer readable medium, such as read only memory (ROM) or flash memory.
  • Memory is an example of a computer readable medium.
  • Computer readable media include permanent and non-permanent, removable and non-removable media, and can implement information storage by any method or technology. The information can be computer readable instructions, data structures, program modules, or other data.
  • Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by a computing device.
  • As defined herein, computer readable media do not include transitory computer readable media, such as modulated data signals and carrier waves.
  • Embodiments of the present application are described with reference to flowcharts and/or block diagrams of the methods, terminal devices (systems), and computer program products according to the embodiments of the present application. It will be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions.
  • These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing terminal device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing terminal device produce means for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
  • These computer program instructions can also be stored in a computer readable memory that can direct a computer or other programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture comprising instruction means that implement the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Debugging And Monitoring (AREA)

Abstract

A cluster physical machine fault classification processing method and apparatus, and a virtual machine recovery method and system. The physical machine fault classification processing method comprises: obtaining a physical machine fault information list from a physical machine fault information storage center (110); if a physical machine fault caused by a network attack is detected in the physical machine fault information list, triggering processing by a security attack protection center outside the cluster (120); and if a software or hardware fault that the physical machine cannot repair by itself is detected in the physical machine fault information list, sending an instruction to shut down the faulty physical machine to the faulty physical machine, and migrating, through a virtualization interface, the virtual machines on the faulty physical machine to other healthy physical machines in the cluster system (130). By quickly and accurately identifying a variety of physical machine fault scenarios in a fine-grained manner and processing them in a targeted, classified way, the method achieves fast and highly reliable physical machine fault repair, thereby ensuring rapid recovery of the virtual machine services running on the faulty machines.

Description

物理机故障分类处理方法、装置和虚拟机恢复方法、系统 技术领域
本申请涉及通信技术领域,特别是涉及一种应用于虚拟化集群系统的物理机故障分类处理方法、装置及虚拟机恢复方法、系统。
背景技术
随着计算机技术的迅猛发展,人们开始越来越多的关注如何降低能耗和提高资源利用率,云计算模式应运而生。云计算将所有的计算机抽象成特定的计算资源,然后将这些计算资源提供给用户,而不是像传统那样直接提供一台或多台计算机。云计算模式最大的好处就是用户可以根据自己的需要来申请资源,避免不必要的资源浪费,提高资源利用率。
在云计算环境中,虚拟化集群技术是关键技术之一。虚拟化集群将多台虚拟化服务器组成为一个有机的整体,从而获得很高的计算速度,提升虚拟化系统整体的计算能力。虚拟化集群对多台服务器进行统一管理,通过虚拟化技术将物理资源抽象为存储、计算、网络等各种资源组成大的资源池,通过按需申请资源的方式提供虚拟机给用户。
随着虚拟化集群规模的逐渐扩大,由于集群内物理机软硬件问题导致物理机故障的概率也逐渐增大。物理机故障会直接影响其上所运行的虚拟机服务。为了保证虚拟机业务的正常运行,需要及时发现其所在的物理机故障并迅速处理以恢复虚拟机业务;否则,虚拟机用户会受到物理机故障的影响,无法保证业务的连续性。现有技术可以定时监控物理机状态,当发生物理机故障时,则会对其上的虚拟机进行关机,然后再开机操作;或者是关闭故障物理机,将其上的虚拟机迁移到集群内其他物理机上。
然而,物理机故障通常是由不同的原因而导致的,且物理机故障的现象也会有很多种,而现有技术并未对物理机故障进行精细划分,并未针对性的进行分类处理,因此在实际商业化用途中会存在较多的误判和漏判的情况,从而无法实现物理机故障后其上的虚拟机高可用(High Availability,HA)。
因此,如何更准确、高效、有针对性地进行物理机故障分类修复处理,成为亟需本领域技术人员解决的技术问题。
发明内容
鉴于上述问题,提出了本申请实施例以便提供一种克服上述问题或者至少部分地解决上述问题的一种应用于虚拟化集群系统的物理机机故障分类处理方法、装置及虚拟机恢复方法、系统。
本申请公开一种集群物理机故障分类处理方法,包括:
从物理机故障信息存储中心获取物理机故障信息列表;
若在所述物理机故障信息列表中检测到因遭受网络攻击而导致物理机故障,则触发所述集群外部的安全攻击防护中心处理;
若在所述物理机故障信息列表中检测到因物理机自身不能修复的软硬件故障,则向故障物理机发送关闭故障物理机的指令;及通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内其他健康物理机上。
本申请还公开了一种集群物理机故障分类处理装置,包括:
获取模块,用于从物理机故障信息存储中心获取物理机故障信息列表;
第一处理模块,用于若在所述物理机故障信息列表中检测到因遭受网络攻击而导致物理机故障,则触发所述集群外部的安全攻击防护中心处理;
第二处理模块,进一步包括:
关闭处理单元,用于若在所述物理机故障信息列表中检测到因物理机自身不能修复的软硬件故障,则向故障物理机发送关闭故障物理机的指令;
迁移处理单元,用于通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内其他健康物理机上。
本申请还公开了一种虚拟机恢复方法,应用于虚拟化集群系统,所述方法包括:
虚拟化集群系统内的物理机自主检测自身的故障动态;
若自主检测到物理机自身能容错修复的软硬件故障,通过容错方式修复;
若自主检测到物理机自身能重启修复的软硬件故障,通过重启物理机方式修复;
从物理机故障信息存储中心获取物理机故障信息列表;
若在所述物理机故障信息列表中检测到因遭受网络攻击而导致物理机故障,则触发所述集群外部的安全攻击防护中心处理;
若在所述物理机故障信息列表中检测到因物理机自身不能修复的软硬件故障,则向故障物理机发送关闭故障物理机的指令;及通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内其他健康物理机上。
相应的,本申请公开了一种虚拟机恢复系统,包括:
物理机故障修复装置,应用于虚拟化集群系统内的物理机上自主检测物理机自身的故障动态,若自主检测到物理机自身能容错修复的软硬件故障,通过容错方式修复;若自主检测到物理机自身能重启修复的软硬件故障,通过重启物理机方式修复;
物理机故障信息存储中心,用于将所有上报的物理故障信息汇集成物理机故障信息列表;
物理机故障分类处理装置,用于从所述物理机故障信息存储中心获取物理机故障信息列表,若在所述物理机故障信息列表中检测到因遭受网络攻击而导致物理机故障,则触发所述集群外部的安全攻击防护中心处理;若在所述物理机故障信息列表中检测到因物理机自身不能修复的软硬件故障,则向故障物理机发送关闭故障物理机的指令,及通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内其他健康物理机上。
根据本申请提供的具体实施例,本申请公开了以下技术效果:
本申请实施例可以在大规模的云计算集群中,通过对多种物理机故障 场景,进行精细化故障快速、准确的识别,并有针对性的进行分类处理,从而实现快速、高可靠的物理机故障修复处理,以保证其上的虚拟机服务的快速恢复。
进一步的,本申请实施例通过物理机自主检测自身的故障动态,并对物理机自身能修复的物理机故障情况有针对性的进行分类修复处理;对物理机自身不能修复的物理机故障情况,通过集群外部的物理机故障分类处理模块有针对性的进行分类修复处理,从而有效降低物理机故障的误判和漏判情况的发生,更安全、稳定、快速的进行虚拟机自动恢复。
另外,本申请实施例针对物理机自身不能修复的物理机故障情况,除了可以通过故障物理机上的带外管理模块关闭故障物理机之外,还可以通过集群外部的物理机故障分类处理模块,指示故障物理机自主关机,从而弥补带外管理模块调用关机操作的可用性无法达到商用标准的问题,同时也确保自动化物理机隔离的有效性。
此外,本申请实施例也同时考虑到大规模云计算集群内发生物理机规模故障情况的可能性,通过判断故障物理机的数量是否构成机房级别,并有针对性的采取不同的修复处理方式。尤其是针对大规模物理机故障的情况,采用人工处理的方式修复,从而有效避免由于故障物理机上的虚拟机的频繁迁移而影响系统性能情况的发生。
当然,实施本申请的任一产品并不一定需要同时达到以上所述的所有优点。
附图说明
为了更清楚地说明本申请实施例或现有技术中的技术方案,下面将对实施例中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是本申请的一种集群物理机故障分类处理方法实施例的步骤流程图;
图2是本申请的另一种集群物理机故障分类处理方法实施例的步骤流程图;
图3是本申请的一种虚拟机恢复方法实施例的步骤流程图;
图4是本申请的另一种虚拟机恢复方法实施例的步骤流程图;
图5是本申请的一种物理机故障修复装置实施例的结构框图;
图6是本申请的一种集群物理机故障分类处理装置实施例的结构框图;
图7是本申请的一种虚拟机恢复系统实施例的结构框图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员所获得的所有其他实施例,都属于本申请保护的范围。
为使本申请的上述目的、特征和优点能够更加明显易懂,下面结合附图和具体实施方式对本申请作进一步详细的说明。
为了方便理解本发明实施例,首先在此介绍本发明实施例描述中会涉及的几个要素:
A、云计算
云计算是一种基于互联网技术的相关服务的增加、使用和交付模式,是在所有使用的服务器上实践分布式计算的服务器集群。也就是说,云计算提供了一个虚拟化的按需动态供应硬件、软件和数据集的弹性资源平台。
B、虚拟集群
在云计算平台上进行集群管理就构成了虚拟集群。所谓的虚拟集群就是通过采用虚拟化技术来虚拟出多台计算节点,从而构建出与物理集群相似而且规模巨大的一个集群系统。也就是说,虚拟集群就是将那些协同完成特定任务的多台同构或异构的计算机连接起来的系统。
C、物理机
虚拟集群系统内协同完成特定任务的多台计算机即为集群物理计算机,简称集群物理机。其中,一台物理机上可以模拟出一台或者多台虚拟的计算机。
D、虚拟机
通过虚拟机软件可以在一台物理机上模拟出一台或者多台虚拟的计算机,而这些虚拟机就像真正的计算机那样进行工作,虚拟机上可以安装操作系统和应用程序,虚拟机还可访问网络资源。对于在虚拟机中运行的应用程序而言,虚拟机就像是在真正的计算机中进行工作。
本申请实施例可以应用在大规模的云计算虚拟化集群系统中,可以通过集群系统内的物理机自主检测自身的故障动态,进而对物理机自身能修复的物理机故障情况有针对性的进行分类修复处理;而对物理机自身不能修复的物理机故障情况,通过集群外部的物理机故障分类处理模块有针对性的进行分类修复处理,从而有效降低物理机故障的误判和漏判情况的发生,更安全、稳定、快速的进行虚拟机自动恢复。
影响虚拟机运行和管理的物理机故障现象可以归纳如下几种:
1、物理机网络不通
其原因主要包括:物理机宕机、网卡异常、上联交换机故障、硬件异常、内核模块异常、物理机重启、网络分布式拒绝服务攻击(Distributed Denial of Service,DDoS)等。
2、物理机丢包
其原因主要包括:物理机负载高、上联网络设备切换、网络DdoS攻击等。
3、物理机硬件系统故障
例如,物理机磁盘、内存、中央处理器(Central Processing Unit,CPU)故障等。
4、物理机软件异常
例如,物理机的文件系统、虚拟化相关模块、操作系统内核模块等操作系统层面的软件异常等。
5、物理机远程访问通道不通
其原因主要包括:网络丢包、系统服务异常、文件系统异常等。
6、物理机性能异常
例如,可能表现为物理机输入输出(Input/Output,I/O)卡顿、负载高等。其原因主要包括:物理机硬件故障、物理机内核模块异常、物理机用户态进程异常等。
可以看出,以上物理机故障的现象并不是一成不变的,而是在一定时间内可以相互转化的,甚至是相关关联、相互交织的。并且,相同的物理机现象其背后的原因可能不一样,因此故障物理机的修复处理方式需要具体区分,例如,对于因网络DDoS攻击而导致的某台物理机网络不通与因物理机宕机而导致的物理机网络不通是需要区别对待的,如果在物理机正遭受网络DDoS攻击时将其上的虚拟机迁移至其他物理机,会产生骨牌效应,导致扩大故障风险,即其他物理机陆续被攻击而不可用,最终可能造成全集群网络设备的泛洪(flooding),导致全集群物理机故障风险。
基于上述物理机故障现象和异常的深层原因分析,本发明实施例中,可以将物理机故障归纳为如下几类:
A、物理机自身能容错修复的软硬件故障类型
例如,存储数据的磁盘故障、虚拟化相关内核模块异常、存储数据的文件系统异常等。
B、物理机自身能重启修复的软硬件故障类型
例如,根文件系统只读等异常、网卡驱动重启可修复的异常、操作系统内核模块异常等。
C、物理机自身不能修复的软硬件故障类型
例如,物理机宕机、物理机CPU异常、物理机内存异常、物理机电源模块等各类硬件问题异常。
另外,还包括未知原因的故障类型,例如,系统负载类、系统网络类、硬件故障类等。这类故障虽然本质的原因比较难查,但是这类故障的现象却很明确,主要是:物理机网络丢包、物理机管理通道访问异常、物理机性能使用异常。
D、物理机遭受网络攻击而导致物理机故障类型
例如,网络DDoS类型安全攻击,从而造成网络大量丢包甚至网络不通。这类故障的现象主要包括:物理机网络不通、网络丢包、管理通道不通等。
因此,本申请实施例通过对多种物理机故障场景,进行精细化故障快速、准确的识别,并有针对性的进行分类处理,从而实现快速、高可靠的物理机故障修复处理,以保证其上的虚拟机服务的快速恢复。例如,本申请实施例可以在十几分钟内处理完成故障物理机上的虚拟机恢复且该虚拟机的功能具备超过99.95%的商用可用性标准。
实施例一
参照图1,示出了本申请的一种集群物理机故障分类处理方法实施例的步骤流程图,所述物理机故障分类处理方法可以应用于虚拟化集群系统,具体可以包括如下步骤:
步骤210,从物理机故障信息存储中心获取物理机故障信息列表;
需要说明的是,所述物理机故障信息列表包括:由所述集群外部的物理机故障探测模块从故障物理机处探测到并上报给所述物理机故障信息存储中心的物理机故障信息,及由所述集群外部的物理机故障收集模块从故障物理机处收集到并上报给所述物理机故障信息存储中心的物理机故 障信息。
步骤220,若在所述物理机故障信息列表中检测到因遭受网络攻击而导致物理机故障,则触发所述集群外部的安全攻击防护中心处理;
可以理解的是,在实际应用中,所述集群外部的安全攻击防护中心被触发后,会启动安全清洗程序,例如进行流量清洗等,从而使得故障物理机恢复健康。需要说明的是,对于因网络DDoS攻击而导致的某台物理机网络不通与因物理机宕机而导致的物理机网络不通是需要区别对待的,如果在物理机正遭受网络DDoS攻击时将其上的虚拟机迁移至其他物理机,会产生骨牌效应,导致扩大故障风险,即其他物理机陆续被攻击而不可用,最终可能造成全集群网络设备的泛洪(flooding),导致全集群物理机故障风险。
步骤230,若在所述物理机故障信息列表中检测到因物理机自身不能修复的软硬件故障,则向故障物理机发送关闭故障物理机的指令;及通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内其他健康物理机上;
优选的,若在所述物理机故障信息列表中检测到因物理机自身不能修复的软硬件故障,则向故障物理机发送关闭故障物理机的指令以指示所述故障物理机自主关闭故障物理机或通过所述物理机上的带外管理模块。
需要说明的是,所述的物理机自身不能修复的软硬件故障类型可以包括:物理机宕机、物理机CPU异常、物理机内存异常、物理机电源模块等各类硬件问题异常。这类故障会直接导致物理机不可用,且需要更换硬件模块方可修复,因此,本申请实施例通过从集群中将故障物理机隔离后再对故障物理机进行硬件更换或者维护。
此外,针对物理机自身不能修复的软硬件故障的情况下,传统物理机上的带外管控系统由于硬件故障率和成本问题,通常可用性在90%左右甚至更低,在云计算服务本身至少99.95%的商用可用性要求下,全年的不可用性时长共计262.8分钟,如果一台故障物理机无法得到及时修复,则由于一台物理机故障就会直接导致几十分钟的人工处理时耗,因此,现有技 术中的带外管控系统的可用性指标无法匹配商用云计算服务的故障恢复服务等级协议(Service-Level Agreement,SLA)。而本申请实施例提供的技术方案,对传统的带外管控系统进行改进,在带外管理模块可用性达不到商用标准时,可以通过所述集群外部的物理机故障分类处理模块的指令指示故障物理机自主关闭,再由所述集群外部的物理机故障分类处理模块通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内其他健康物理机上;从而大量缩短故障物理机的修复时间,进而提高系统的商用可用性。
本申请实施例可以在大规模的云计算集群中,通过对多种物理机故障场景,进行精细化故障快速、准确的识别,并有针对性的进行分类处理,从而实现快速、高可靠的物理机故障修复处理,以保证其上的虚拟机服务的快速恢复。
另外,本申请实施例针对物理机自身不能修复的物理机故障情况,除了可以通过故障物理机上的带外管理模块关闭故障物理机之外,还可以通过集群外部的物理机故障分类处理模块,指示故障物理机自主关机,从而弥补带外管理模块调用关机操作的可用性无法达到商用标准的问题,同时也确保自动化物理机隔离的有效性。
实施例二
参照图2,示出了本申请的另一种集群物理机故障分类处理方法实施例的步骤流程图,具体可以包括如下步骤:
步骤210,从物理机故障信息存储中心获取物理机故障信息列表;
需要说明的是,所述物理机故障信息列表包括:由所述集群外部的物理机故障探测模块从故障物理机处探测到并上报给所述物理机故障信息存储中心的物理机故障信息,及由所述集群外部的物理机故障收集模块从故障物理机处收集到并上报给所述物理机故障信息存储中心的物理机故障信息。
步骤220,若在所述物理机故障信息列表中检测到因遭受网络攻击而导致物理机故障,则触发所述集群外部的安全攻击防护中心处理;
可以理解的是,在实际应用中,所述集群外部的安全攻击防护中心被触发后,会启动安全清洗程序,例如进行流量清洗等,从而使得故障物理机恢复健康。需要说明的是,对于因网络DDoS攻击而导致的某台物理机网络不通与因物理机宕机而导致的物理机网络不通是需要区别对待的,如果在物理机正遭受网络DDoS攻击时将其上的虚拟机迁移至其他物理机,会产生骨牌效应,导致扩大故障风险,即其他物理机陆续被攻击而不可用,最终可能造成全集群网络设备的泛洪(flooding),导致全集群物理机故障风险。
步骤230,若在所述物理机故障信息列表中检测到因物理机自身不能修复的软硬件故障,则向故障物理机发送关闭故障物理机的指令;及通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内其他健康物理机上;
优选的,若在所述物理机故障信息列表中检测到因物理机自身不能修复的软硬件故障,则向故障物理机发送关闭故障物理机的指令以指示所述故障物理机自主关闭故障物理机或通过所述物理机上的带外管理模块。
需要说明的是,所述的物理机自身不能修复的软硬件故障类型可以包括:物理机宕机、物理机CPU异常、物理机内存异常、物理机电源模块等各类硬件问题异常。这类故障会直接导致物理机不可用,且需要更换硬件模块方可修复,因此,本申请实施例通过从集群中将故障物理机隔离后再对故障物理机进行硬件更换或者维护。
此外,针对物理机自身不能修复的软硬件故障的情况下,传统物理机上的带外管控系统由于硬件故障率和成本问题,通常可用性在90%左右甚至更低,在云计算服务本身至少99.95%的商用可用性要求下,全年的不可用性时长共计262.8分钟,如果一台故障物理机无法得到及时修复,则由于一台物理机故障就会直接导致几十分钟的人工处理时耗,因此,现有技术中的带外管控系统的可用性指标无法匹配商用云计算服务的故障恢复 服务等级协议(Service-Level Agreement,SLA)。而本申请实施例提供的技术方案,对传统的带外管控系统进行改进,在带外管理模块可用性达不到商用标准时,可以通过所述集群外部的物理机故障分类处理模块的指令指示故障物理机自主关闭,再由所述集群外部的物理机故障分类处理模块通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内其他健康物理机上;从而大量缩短故障物理机的修复时间,进而提高系统的商用可用性。
步骤240,若在所述物理机故障信息列表中检测到物理机网络完全不通且网络不通持续时间达到预设时间;判断网络不通的物理机数量是否超过预设数量,如果是则通知运营维修人员人工修复;否则通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内到其他健康物理机上;
其中,所述预设时间可以依据实际情况设定为3分钟、5分钟等适合的时间段。
需要说明的是,在检测到物理机网络完全不通且网络不通持续时间达到预设时间的情况下,本申请实施例需要进一步检查网络不通的故障物理机的数量是否超过一个机柜的物理机数量或者一个交换机下联物理机数量,如果超过,则认为是集群规模性网络故障,则需要采取电话报警通运营维修人员人工修复,而不再自动处理。这是由于对于大规模物理机故障,在进行隔离物理机迁移虚拟机时,会导致大量物理机被关闭,当机房设备(网络设备或者电力设备等)恢复后,还需要再次重启物理机,然后恢复虚拟机,这一系列的操作将直接导致人工处理时间加倍甚至更多,从而大大加大虚拟机的不可用时长。因此,本申请实施例提供的方法,对此种物理机故障类型加以区分处理,可以大量缩短故障物理机的修复时间,从而大大缩短其上的虚拟机不可用的时长,进而提高系统的商用可用性。
优选的,本申请实施例所述方法还可以进一步包括:
步骤250,若在所述物理机故障信息列表中检测到物理机网络不通但网络不通持续时间未达到预设时间后网络又恢复正常,且确定物理机网络不通是物理机重启所导致的,则判断当前的物理机是否健康,如果健康则 通过虚拟化接口重启所述物理机上的虚拟机,如果不健康则通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群内其他健康物理机上;
步骤260,若在所述物理机故障信息列表中检测到物理机网络不稳定且网络不稳定持续时间达到预设时间,则向故障物理机发送指令以指示所述故障物理机自主关闭故障物理机或通过所述物理机上的带外管理模块关闭故障物理机;及通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内其他健康物理机上;
需要说明的是,所述物理机网络不稳定且网络不稳定持续时间达到预设时间的情况主要是一些未知原因造成物理机故障,例如,系统负载类、系统网络类、硬件故障类等。这类故障虽然本质原因比较难查,但是这类故障的现象却很明确,主要是:物理机网络丢包、物理机管理通道访问异常、物理机性能使用异常。对于这类物理机故障,可以采用相同的处理方式,即向故障物理机发送指令以指示所述故障物理机自主关闭故障物理机或通过所述物理机上的带外管理模块关闭故障物理机;及通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内其他健康物理机上。
优选的,在本申请的一个实施例中,通过以下方式确定所述健康物理机:
在所述物理机故障信息列表中匹配所述集群内的所有物理机;
将没有匹配成功的物理机确定为健康物理机。
本申请实施例可以在大规模的云计算集群中,通过对多种物理机故障场景,进行精细化故障快速、准确的识别,并有针对性的进行分类处理,从而实现快速、高可靠的物理机故障修复处理,以保证其上的虚拟机服务的快速恢复。
进一步的,本申请实施例通过物理机自主检测自身的故障动态,并对物理机自身能修复的物理机故障情况有针对性的进行分类修复处理;对物理机自身不能修复的物理机故障情况,通过集群外部的物理机故障分类处理模块有针对性的进行分类修复处理,从而有效降低物理机故障的误判和 漏判情况的发生,更安全、稳定、快速的进行虚拟机自动恢复。
另外,本申请实施例针对物理机自身不能修复的物理机故障情况,除了可以通过故障物理机上的带外管理模块关闭故障物理机之外,还可以通过集群外部的物理机故障分类处理模块,指示故障物理机自主关机,从而弥补带外管理模块调用关机操作的可用性无法达到商用标准的问题,同时也确保自动化物理机隔离的有效性。
此外,本申请实施例也同时考虑到大规模云计算集群内发生物理机规模故障情况的可能性,通过判断故障物理机的数量是否构成机房级别,并有针对性的采取不同的修复处理方式。尤其是针对大规模物理机故障的情况,采用人工处理的方式修复,从而有效避免由于故障物理机上的虚拟机的频繁迁移而影响系统性能情况的发生。
实施例三
参照图3,示出了本申请的一种虚拟机恢复方法的实施例示意图,具体可以包括如下步骤:
步骤310,虚拟化集群系统内的物理机自主检测自身的故障动态;
优选的,每台物理机可以以固定的时间间隔定期自主检测自身的故障动态,例如每隔30秒自主检测一次。
步骤320,若自主检测到物理机自身能容错修复的软硬件故障,通过容错方式修复;
可以理解的是,本申请实施例所述的物理机自身能容错修复的软硬件故障,可以包括:存储数据的磁盘故障、虚拟化相关内核模块异常、存储数据的文件系统异常等。例如,针对存储数据的磁盘故障,容错修复方式具体是,首先隔离磁盘,然后利用集群分布式存储多份数据的机制,实现该磁盘上数据自动复制至其他健康磁盘上,这样可以有效保证该故障磁盘隔离后不会影响系统稳定运行。同样,针对存储数据的文件系统损坏,也 可以通过隔离该文件系统挂载的磁盘达到容错修复的目的。
步骤330,若自主检测到物理机自身能重启修复的软硬件故障,通过重启物理机方式修复;
可以理解的是,本申请实施例所述的物理机自身能修复的软硬件故障,可以包括:根文件系统只读等异常、网卡驱动重启可修复的异常、操作系统内核模块异常等。这类软硬件故障都可以通过重启物理机的方式予以修复。
步骤340,从物理机故障信息存储中心获取物理机故障信息列表;
需要说明的是,由物理机故障分类处理模块从物理机故障信息存储中心获取物理机故障信息列表。所述物理机故障信息列表包括:由所述集群外部的物理机故障探测模块从故障物理机处探测到并上报给所述物理机故障信息存储中心的物理机故障信息,及由所述集群外部的物理机故障收集模块从故障物理机处收集到并上报给所述物理机故障信息存储中心的物理机故障信息。
步骤350,若在所述物理机故障信息列表中检测到因遭受网络攻击而导致物理机故障,则触发所述集群外部的安全攻击防护中心处理;
可以理解的是,在实际应用中,所述集群外部的安全攻击防护中心被触发后,会启动安全清洗程序,例如进行流量清洗等,从而使得故障物理机恢复健康。需要说明的是,对于因网络DDoS攻击而导致的某台物理机网络不通与因物理机宕机而导致的物理机网络不通是需要区别对待的,如果在物理机正遭受网络DDoS攻击时将其上的虚拟机迁移至其他物理机,会产生骨牌效应,导致扩大故障风险,即其他物理机陆续被攻击而不可用,最终可能造成全集群网络设备的泛洪(flooding),导致全集群物理机故障风险。
步骤360,若在所述物理机故障信息列表中检测到因物理机自身不能修复的软硬件故障,则向故障物理机发送指令以指示所述故障物理机自主关闭故障物理机或通过所述物理机上的带外管理模块关闭故障物理机;及 通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内其他健康物理机上。
需要说明的是,所述的物理机自身不能修复的软硬件故障类型可以包括:物理机宕机、物理机CPU异常、物理机内存异常、物理机电源模块等各类硬件问题异常。这类故障会直接导致物理机不可用,且需要更换硬件模块方可修复,因此,本申请实施例通过从集群中将故障物理机隔离后再对故障物理机进行硬件更换或者维护。
此外,针对物理机自身不能修复的软硬件故障的情况下,传统物理机上的带外管控系统由于硬件故障率和成本问题,通常可用性在90%左右甚至更低,在云计算服务本身至少99.95%的商用可用性要求下,全年的不可用性时长共计262.8分钟,如果一台故障物理机无法得到及时修复,则由于一台物理机故障就会直接导致几十分钟的人工处理时耗,因此,现有技术中的带外管控系统的可用性指标无法匹配商用云计算服务的故障恢复服务等级协议(Service-Level Agreement,SLA)。而本申请实施例提供的技术方案,对传统的带外管控系统进行改进,在带外管理模块可用性达不到商用标准时,可以通过所述集群外部的物理机故障分类处理模块的指令指示故障物理机自主关闭,再由所述集群外部的物理机故障分类处理模块通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内其他健康物理机上;从而大量缩短故障物理机的修复时间,进而提高系统的商用可用性。
优选的,本申请实施例所述方法还可以进一步包括:
步骤370,若在所述物理机故障信息列表中检测到物理机网络完全不通且网络不通持续时间达到预设时间;判断网络不通的物理机数量是否超过预设数量,如果是则通知运营维修人员人工修复;否则通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内到其他健康物理机上。
其中,所述预设时间可以依据实际情况设定为3分钟、5分钟等适合的时间段。
需要说明的是,在检测到物理机网络完全不通且网络不通持续时间达 到预设时间的情况下,本申请实施例需要进一步检查网络不通的故障物理机的数量是否超过一个机柜的物理机数量或者一个交换机下联物理机数量,如果超过,则认为是集群规模性网络故障,则需要采取电话报警通运营维修人员人工修复,而不再自动处理。这是由于对于大规模物理机故障,在进行隔离物理机迁移虚拟机时,会导致大量物理机被关闭,当机房设备(网络设备或者电力设备等)恢复后,还需要再次重启物理机,然后恢复虚拟机,这一系列的操作将直接导致人工处理时间加倍甚至更多,从而大大加大虚拟机的不可用时长。因此,本申请实施例提供的方法,对此种物理机故障类型加以区分处理,可以大量缩短故障物理机的修复时间,从而大大缩短其上的虚拟机不可用的时长,进而提高系统的商用可用性。
优选的,本申请实施例所述方法还可以进一步包括:
步骤380,若在所述物理机故障信息列表中检测到物理机网络不通但网络不通持续时间未达到预设时间后网络又恢复正常,且确定物理机网络不通是物理机重启所导致的,则判断当前的物理机是否健康,如果健康则通过虚拟化接口重启所述物理机上的虚拟机,如果不健康则通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群内其他健康物理机上。
优选的,本申请实施例所述方法还可以进一步包括:
步骤390,若在所述物理机故障信息列表中检测到物理机网络不稳定且网络不稳定持续时间达到预设时间,则向故障物理机发送指令以指示所述故障物理机自主关闭故障物理机或通过所述物理机上的带外管理模块关闭故障物理机;及通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内其他健康物理机上。
需要说明的是,所述物理机网络不稳定且网络不稳定持续时间达到预设时间的情况主要是一些未知原因造成物理机故障,例如,系统负载类、系统网络类、硬件故障类等。这类故障虽然本质原因比较难查,但是这类故障的现象却很明确,主要是:物理机网络丢包、物理机管理通道访问异常、物理机性能使用异常。对于这类物理机故障,可以采用相同的处理方式,即向故障物理机发送指令以指示所述故障物理机自主关闭故障物理机 或通过所述物理机上的带外管理模块关闭故障物理机;及通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内其他健康物理机上。
优选的,在本申请的一个实施例中,通过以下方式确定所述健康物理机:
在所述物理机故障信息列表中匹配所述集群内的所有物理机;
将没有匹配成功的物理机确定为健康物理机。
本申请实施例可以在大规模的云计算集群中,通过对多种物理机故障场景,进行精细化故障快速、准确的识别,并有针对性的进行分类处理,从而实现快速、高可靠的物理机故障修复处理,以保证其上的虚拟机服务的快速恢复。
进一步的,本申请实施例通过物理机自主检测自身的故障动态,并对物理机自身能修复的物理机故障情况有针对性的进行分类修复处理;对物理机自身不能修复的物理机故障情况,通过集群外部的物理机故障分类处理模块有针对性的进行分类修复处理,从而有效降低物理机故障的误判和漏判情况的发生,更安全、稳定、快速的进行虚拟机自动恢复。
另外,本申请实施例针对物理机自身不能修复的物理机故障情况,除了可以通过故障物理机上的带外管理模块关闭故障物理机之外,还可以通过集群外部的物理机故障分类处理模块,指示故障物理机自主关机,从而弥补带外管理模块调用关机操作的可用性无法达到商用标准的问题,同时也确保自动化物理机隔离的有效性。
此外,本申请实施例也同时考虑到大规模云计算集群内发生物理机规模故障情况的可能性,通过判断故障物理机的数量是否构成机房级别,并有针对性的采取不同的修复处理方式。尤其是针对大规模物理机故障的情况,采用人工处理的方式修复,从而有效避免由于故障物理机上的虚拟机的频繁迁移而影响系统性能情况的发生。
实施例四
参照图4,示出了本申请的另一种虚拟机恢复方法的实施例示意图,具体可以包括如下步骤:
物理机故障探测模块每隔30秒检查集群内每台物理机的网络情况,并更新至物理机故障信息存储中心;集群系统内的每台物理机自主检测自身的故障情况,并通过物理机故障收集模块更新至物理机故障信息存储中心。
对于物理机自身能容错修复的软硬件故障的场景,则由物理机自身通过容错方式修复处理;对于物理机自身能重启修复的软硬件故障,则由物理机自身通过重启物理机方式修复处理;如果是物理机自身不能修复的软硬件故障,则进行关机处理。
物理机故障分类处理模块每隔1分钟从物理机故障信息存储中心获取物理机故障信息列表;判断该物理机故障信息列表是否为空,如果是则返回循环;否则继续判断所述物理机故障信息列表中是否有因遭受网络攻击而导致物理机故障的情况,如果有,则触发所述集群外部的安全攻击防护中心处理;否则继续判断在所述物理机故障信息列表中是否有因物理机自身不能修复的软硬件故障的情况,如果有,则向故障物理机发送指令以指示所述故障物理机自主关闭故障物理机或通过所述物理机上的带外管理模块关闭故障物理机;再通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内其他健康物理机上。
如果确定所述物理机故障信息列表中没有因遭受网络攻击而导致物理机故障的情况,则继续判断在所述物理机故障信息列表中是否有物理机网络完全不通且网络不通持续时间达到预设时间,例如3分钟;如果有则再判断网络不通的物理机数量是否超过预设数量,例如,故障物理机的数量是否超过一个机柜的物理机数量或者一个交换机下联物理机数量,如果超过,则认为是集群规模性网络故障,则需要采取电话报警通运营维修人员人工修复,而不再自动处理。否则通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内到其他健康物理机上。
判断在所述物理机故障信息列表中是否有检测到物理机网络不通但网络不通持续时间未达到预设时间后网络又恢复正常,且确定物理机网络不通是物理机重启所导致的,则判断当前的物理机是否健康,如果健康则通过虚拟化接口重启所述物理机上的虚拟机,如果不健康则通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群内其他健康物理机上。
如果在所述物理机故障信息列表中检测到物理机网络不稳定且网络不稳定持续时间达到预设时间,则向故障物理机发送指令以指示所述故障物理机自主关闭故障物理机或通过所述物理机上的带外管理模块关闭故障物理机;及通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内其他健康物理机上。
需要说明的是,所述物理机网络不稳定且网络不稳定持续时间达到预设时间的情况主要是一些未知原因造成物理机故障,例如,系统负载类、系统网络类、硬件故障类等。这类故障虽然本质原因比较难查,但是这类故障的现象却很明确,主要是:物理机网络丢包、物理机管理通道访问异常、物理机性能使用异常。对于这类物理机故障,可以采用相同的处理方式。
优选的,在本申请的一个实施例中,通过以下方式确定所述健康物理机:
在所述物理机故障信息列表中匹配所述集群内的所有物理机;
将没有匹配成功的物理机确定为健康物理机。
本申请实施例可以在大规模的云计算集群中,通过对多种物理机故障场景,进行精细化故障快速、准确的识别,并有针对性的进行分类处理,从而实现快速、高可靠的物理机故障修复处理,以保证其上的虚拟机服务的快速恢复。
进一步的,本申请实施例通过物理机自主检测自身的故障动态,并对物理机自身能修复的物理机故障情况有针对性的进行分类修复处理;对物理机自身不能修复的物理机故障情况,通过集群外部的物理机故障分类处 理模块有针对性的进行分类修复处理,从而有效降低物理机故障的误判和漏判情况的发生,更安全、稳定、快速的进行虚拟机自动恢复。
另外,本申请实施例针对物理机自身不能修复的物理机故障情况,除了可以通过故障物理机上的带外管理模块关闭故障物理机之外,还可以通过集群外部的物理机故障分类处理模块,指示故障物理机自主关机,从而弥补带外管理模块调用关机操作的可用性无法达到商用标准的问题,同时也确保自动化物理机隔离的有效性。
此外,本申请实施例也同时考虑到大规模云计算集群内发生物理机规模故障情况的可能性,通过判断故障物理机的数量是否构成机房级别,并有针对性的采取不同的修复处理方式。尤其是针对大规模物理机故障的情况,采用人工处理的方式修复,从而有效避免由于故障物理机上的虚拟机的频繁迁移而影响系统性能情况的发生。
需要说明的是,对于方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本申请实施例并不受所描述的动作顺序的限制,因为依据本申请实施例,某些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于优选实施例,所涉及的动作并不一定是本申请实施例所必须的。
实施例五
参照图5,示出了本申请的一种物理机故障修复装置实施例的结构框图,所述物理机故障修复装置500应用于虚拟化集群系统内的物理机上,具体可以包括:自主检测模块(selfChecker)510、自主处理模块(selfHnadler)520;其中:
自主检测模块510具体包括:检测单元511,用于自主检测物理机自身的故障动态;优选的,检测单元511可以以固定的时间间隔定期自主检测物理机自身的故障动态,例如每隔30秒自主检测一次。
自主处理模块520,具体包括:
容错单元521,用于若所述检测单元511检测到物理机自身能容错修复的软硬件故障,则通过容错方式修复;
可以理解的是,本申请实施例所述的物理机自身能容错修复的软硬件故障,可以包括:存储数据的磁盘故障、虚拟化相关内核模块异常、存储数据的文件系统异常等。例如,针对存储数据的磁盘故障,容错修复方式具体是,首先隔离磁盘,然后利用集群分布式存储多份数据的机制,实现该磁盘上数据自动复制至其他健康磁盘上,这样可以有效保证该故障磁盘隔离后不会影响系统稳定运行。同样,针对存储数据的文件系统损坏,也可以通过隔离该文件系统挂载的磁盘达到容错修复的目的。
重启单元522,用于若所述检测单元511检测到物理机自身能重启修复的软硬件故障,则通过重启物理机方式修复。
可以理解的是,本申请实施例所述的物理机自身能重启修复的软硬件故障,可以包括:根文件系统只读等异常、网卡驱动重启可修复的异常、操作系统内核模块异常等。这类软硬件故障都可以通过重启物理机的方式予以修复。
优选的,所述自主处理模块520还可以进一步包括:
关机单元523,用于若所述检测单元511检测到物理机自身不能修复的软硬件故障,则根据所述集群外部的物理机故障分类处理模块的指令或通过所述物理机上的带外管理模块530关闭故障物理机,由所述集群外部的物理机故障分类处理模块通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内其他健康物理机上。
需要说明的是,针对物理机自身不能修复的软硬件故障的情况下,传统物理机上的带外管控系统由于硬件故障率和成本问题,通常可用性在90%左右甚至更低,在云计算服务本身至少99.95%的商用可用性要求下,全年的不可用性时长共计262.8分钟,如果一台故障物理机无法得到及时修复,则由于一台物理机故障就会直接导致几十分钟的人工处理时耗,因此,现 有技术中的带外管控系统的可用性指标无法匹配商用云计算服务的故障恢复服务等级协议(Service-Level Agreement,SLA)。而本申请实施例提供的技术方案,对传统的带外管控系统进行改进,在带外管理模块530可用性达不到商用标准时,可以通过所述集群外部的物理机故障分类处理模块的指令指示故障物理机自主关闭,再由所述集群外部的物理机故障分类处理模块通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内其他健康物理机上;从而大量缩短故障物理机的修复时间,进而提高系统的商用可用性。
优选的,所述自主检测模块510还可以进一步包括:
上报单元512,用于当检测单元511自主检测到因遭受网络攻击而导致物理机故障时,通过物理机故障收集模块上报物理机故障信息到物理机故障信息存储中心,由所述集群外部的物理机故障分类处理模块触发所述集群外部的安全攻击防护中心处理。
其中,所述集群外部的安全攻击防护中心被触发后,会启动安全清洗程序,例如进行流量清洗等,从而使得故障物理机恢复健康。需要说明的是,对于因网络DDoS攻击而导致的某台物理机网络不通与因物理机宕机而导致的物理机网络不通是需要区别对待的,如果在物理机正遭受网络DDoS攻击时将其上的虚拟机迁移至其他物理机,会产生骨牌效应,导致扩大故障风险,即其他物理机陆续被攻击而不可用,最终可能造成全集群网络设备的泛洪(flooding),导致全集群物理机故障风险。
本申请实施例可以在大规模的云计算集群中,通过对多种物理机故障场景,进行精细化故障快速、准确的识别,并有针对性的进行分类处理,从而实现快速、高可靠的物理机故障修复处理,以保证其上的虚拟机服务的快速恢复。
进一步的,本申请实施例通过物理机自主检测自身的故障动态,并对物理机自身能修复的物理机故障情况有针对性的进行分类修复处理;对物理机自身不能修复的物理机故障情况,通过集群外部的物理机故障分类处理模块有针对性的进行分类修复处理,从而有效降低物理机故障的误判和 漏判情况的发生,更安全、稳定、快速的进行虚拟机自动恢复。
另外,本申请实施例针对物理机自身不能修复的物理机故障情况,除了可以通过故障物理机上的带外管理模块关闭故障物理机之外,还可以通过集群外部的物理机故障分类处理模块,指示故障物理机自主关机,从而弥补带外管理模块调用关机操作的可用性无法达到商用标准的问题,同时也确保自动化物理机隔离的有效性。
实施例六
参照图6,示出了本申请的一种集群物理机故障分类处理装置实施例的结构框图,所述物理机故障分类处理装置600具体可以包括如下模块:
获取模块610,用于从物理机故障信息存储中心获取物理机故障信息列表;需要说明的是,所述物理机故障信息列表包括:由所述集群外部的物理机故障探测模块从故障物理机处探测到并上报给所述物理机故障信息存储中心的物理机故障信息,及由所述集群外部的物理机故障收集模块从故障物理机处收集到并上报给所述物理机故障信息存储中心的物理机故障信息。
第一处理模块620,用于若在所述物理机故障信息列表中检测到因遭受网络攻击而导致物理机故障,则触发所述集群外部的安全攻击防护中心处理;
可以理解的是,在实际应用中,所述集群外部的安全攻击防护中心被触发后,会启动安全清洗程序,例如进行流量清洗等,从而使得故障物理机恢复健康。需要说明的是,对于因网络DDoS攻击而导致的某台物理机网络不通与因物理机宕机而导致的物理机网络不通是需要区别对待的,如果在物理机正遭受网络DDoS攻击时将其上的虚拟机迁移至其他物理机,会产生骨牌效应,导致扩大故障风险,即其他物理机陆续被攻击而不可用,最终可能造成全集群网络设备的泛洪(flooding),导致全集群物理机故障风险。
第二处理模块630,进一步包括:
关闭处理单元,用于若在所述物理机故障信息列表中检测到因物理机自身不能修复的软硬件故障,则向故障物理机发送关闭故障物理机的指令;优选的,所述指令可以指示所述故障物理机自主关闭故障物理机或通过所述物理机上的带外管理模块关闭故障物理机;
迁移处理单元,用于通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内其他健康物理机上。
需要说明的是,所述的物理机自身不能修复的软硬件故障类型可以包括:物理机宕机、物理机CPU异常、物理机内存异常、物理机电源模块等各类硬件问题异常。这类故障会直接导致物理机不可用,且需要更换硬件模块方可修复,因此,本申请实施例通过从集群中将故障物理机隔离后再对故障物理机进行硬件更换或者维护。
优选的,所述物理机故障分类处理装置600还可以进一步包括第三处理模块640,该第三处理模块640具体包括:
通知处理单元,用于若在所述物理机故障信息列表中检测到物理机网络完全不通且网络不通持续时间达到预设时间,并且网络不通的物理机数量超过一台,则通知运营维修人员人工修复;
迁移处理单元,用于若在所述物理机故障信息列表中检测到物理机网络完全不通且网络不通持续时间达到预设时间,并且网络不通的物理机数量未超过预设数量,则通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内到其他健康物理机上。
其中,所述预设时间可以依据实际情况设定为3分钟、5分钟等适合的时间段。
需要说明的是,在检测到物理机网络完全不通且网络不通持续时间达到预设时间的情况下,本申请实施例需要进一步检查网络不通的故障物理机的数量是否超过一个机柜的物理机数量或者一个交换机下联物理机数量,如果超过,则认为是集群规模性网络故障,则需要采取电话报警通运 营维修人员人工修复,而不再自动处理。这是由于对于大规模物理机故障,在进行隔离物理机迁移虚拟机时,会导致大量物理机被关闭,当机房设备(网络设备或者电力设备等)恢复后,还需要再次重启物理机,然后恢复虚拟机,这一系列的操作将直接导致人工处理时间加倍甚至更多,从而大大加大虚拟机的不可用时长。因此,本申请实施例提供的方法,对此种物理机故障类型加以区分处理,可以大量缩短故障物理机的修复时间,从而大大缩短其上的虚拟机不可用的时长,进而提高系统的商用可用性。
优选的,所述物理机故障分类处理装置600还可以进一步包括第四处理模块650,所述第四处理模块650具体包括:
重启处理单元,用于若在所述物理机故障信息列表中检测到物理机网络不通但网络不通持续时间未达到预设时间后网络又恢复正常,且确定物理机网络不通是物理机重启所导致的,则在确定当前的物理机是健康的情况下,通过虚拟化接口重启所述物理机上的虚拟机;
迁移处理单元,用于若在所述物理机故障信息列表中检测到物理机网络不通但网络不通持续时间未达到预设时间后网络又恢复正常,且确定物理机网络不通是物理机重启所导致的,则在确定当前的物理机是不健康的情况下,通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群内其他健康物理机上。
优选的,所述物理机故障分类处理装置600还可以进一步包括第五处理模块660,所述第五处理模块660具体包括:
关机处理单元,用于若在所述物理机故障信息列表中检测到物理机网络不稳定且网络不稳定持续时间达到预设时间,则向故障物理机发送指令以指示所述故障物理机自主关闭故障物理机或通过所述物理机上的带外管理模块关闭故障物理机;
迁移处理单元,用于通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内其他健康物理机上。
需要说明的是,所述物理机网络不稳定且网络不稳定持续时间达到预 设时间的情况主要是一些未知原因造成物理机故障,例如,系统负载类、系统网络类、硬件故障类等。这类故障虽然本质原因比较难查,但是这类故障的现象却很明确,主要是:物理机网络丢包、物理机管理通道访问异常、物理机性能使用异常。对于这类物理机故障,可以采用相同的处理方式,即向故障物理机发送指令以指示所述故障物理机自主关闭故障物理机或通过所述物理机上的带外管理模块关闭故障物理机;及通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内其他健康物理机上。
优选的,所述物理机故障分类处理装置600还可以进一步包括:
确定模块670,用于在所述物理机故障信息列表中匹配所述集群内的所有物理机,将没有匹配成功的物理机确定为健康物理机。
本申请实施例可以在大规模的云计算集群中,通过对多种物理机故障场景,进行精细化故障快速、准确的识别,并有针对性的进行分类处理,从而实现快速、高可靠的物理机故障修复处理,以保证其上的虚拟机服务的快速恢复。
进一步的,本申请实施例通过物理机自主检测自身的故障动态,并对物理机自身能修复的物理机故障情况有针对性的进行分类修复处理;对物理机自身不能修复的物理机故障情况,通过集群外部的物理机故障分类处理模块有针对性的进行分类修复处理,从而有效降低物理机故障的误判和漏判情况的发生,更安全、稳定、快速的进行虚拟机自动恢复。
另外,本申请实施例针对物理机自身不能修复的物理机故障情况,除了可以通过故障物理机上的带外管理模块关闭故障物理机之外,还可以通过集群外部的物理机故障分类处理模块,指示故障物理机自主关机,从而弥补带外管理模块调用关机操作的可用性无法达到商用标准的问题,同时也确保自动化物理机隔离的有效性。
此外,本申请实施例也同时考虑到大规模云计算集群内发生物理机规模故障情况的可能性,通过判断故障物理机的数量是否构成机房级别,并有针对性的采取不同的修复处理方式。尤其是针对大规模物理机故障的情况,采用人工处理的方式修复,从而有效避免由于故障物理机上的虚拟机 的频繁迁移而影响系统性能情况的发生。
实施例七
参照图7,示出了本申请的一种虚拟机恢复系统实施例的架构图,该虚拟机恢复系统包括:物理机故障修复装置710,其应用于虚拟化集群系统700内的每台物理机上;物理机故障分类处理装置720及物理机故障信息存储中心730;其中:
所述物理机故障修复装置710具体可以包括:自主检测模块711、自主处理模块712;其中:自主检测模块711用于自主检测物理机自身的故障动态;自主处理模块712用于若所述自主检测模块711检测到物理机自身能容错修复的软硬件故障,则通过容错方式修复;还用于若自主检测模块711检测到物理机自身能重启修复的软硬件故障,通过重启物理机方式修复。
优选的,所述自主处理模块712还可以用于若所述自主检测模块711检测到物理机自身不能修复的软硬件故障,则根据所述集群外部的物理机故障分类处理模块720的指令或通过所述物理机上的带外管理模块713关闭故障物理机,由所述集群外部的物理机故障分类处理模块720通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内其他健康物理机上。
优选的,所述自主检测模块712还可以用于当自主检测模块711自主检测到因遭受网络攻击而导致物理机故障时,通过物理机故障收集模块760上报物理机故障信息到物理机故障信息存储中心730,由所述集群外部的物理机故障分类处理模块720触发所述集群外部的安全攻击防护中心740处理。
需要说明的是,在本申请另一实施例中,该自主检测模块711和自主处理模块712可以是部署在集群每台物理机上的软件模块,在物理机开机时自动启动,该自主检测模块711和自主处理模块712的运行不依赖文件 系统,仅仅依赖CPU、内存。
所述物理机故障信息存储中心730,用于将所有上报的物理故障信息汇集成物理机故障信息列表;其中,所述物理机故障信息列表包括:由所述集群外部的物理机故障探测模块750从故障物理机处探测到并上报给所述物理机故障信息存储中心730的物理机故障信息,及由所述集群外部的物理机故障收集模块760从故障物理机处收集到并上报给所述物理机故障信息存储中心730的物理机故障信息。
所述物理机故障分类处理装置720,用于通过获取模块721从所述物理机故障信息存储中心730获取物理机故障信息列表,若在所述物理机故障信息列表中检测到因遭受网络攻击而导致物理机故障,则通过第一处理模块722触发所述集群外部的安全攻击防护中心740处理;若在所述物理机故障信息列表中检测到因物理机自身不能修复的软硬件故障,则通过第二处理模块723向故障物理机发送指令以指示所述故障物理机自主关闭故障物理机或通过所述物理机上的带外管理模块713关闭故障物理机,及通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内其他健康物理机上。
优选的,所述物理机故障分类处理装置720还可以进一步包括第三处理模块724,用于若在所述物理机故障信息列表中检测到物理机网络完全不通且网络不通持续时间达到预设时间;判断网络不通的物理机数量是否超过预设数量,如果是则通知运营维修人员人工修复;否则通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内到其他健康物理机上。
优选的,所述物理机故障分类处理装置720还可以进一步包括第四处理模块725,用于若在所述物理机故障信息列表中检测到物理机网络不通但网络不通持续时间未达到预设时间后网络又恢复正常,且确定物理机网络不通是物理机重启所导致的,则判断当前的物理机是否健康,如果健康则通过虚拟化接口重启所述物理机上的虚拟机,如果不健康则通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群内其他健康物理机上。
优选的,所述物理机故障分类处理装置720还可以进一步包括第五处理模块726,用于若在所述物理机故障信息列表中检测到物理机网络不稳定且网络不稳定持续时间达到预设时间,则向故障物理机发送指令以指示所述故障物理机自主关闭故障物理机或通过所述物理机上的带外管理模块关闭故障物理机;及通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内其他健康物理机上。
优选的,所述物理机故障分类处理装置720还可以进一步包括确定模块727,用于在所述物理机故障信息列表中匹配所述集群内的所有物理机,将没有匹配成功的物理机确定为健康物理机。
需要说明的是,所述物理机故障修复装置710以及物理机故障分类处理装置720的具体结构请参见前述实施例的详细说明,此处不再赘述。
需要说明的是,在本申请另一个实施例中,虚拟机恢复系统中的物理机故障分类处理装置720、物理机故障探测模块750、物理机故障收集模块760均为部署在虚拟化集群系统700以外的物理机上的软件模块,其可以各自独立部署在不同的物理机上,也可以合并部署在同一台物理机上。此外,物理机故障信息存储中心730是部署在虚拟化集群系统700以外的一套数据库系统。安全攻击防护中心740可以直接采用现有的安全攻击防护系统。本申请实施例对此不做限制。
本申请实施例,具备以下优点:
本申请实施例可以在大规模的云计算集群中,通过对多种物理机故障场景,进行精细化故障快速、准确的识别,并有针对性的进行分类处理,从而实现快速、高可靠的物理机故障修复处理,以保证其上的虚拟机服务的快速恢复。
进一步的,本申请实施例通过物理机自主检测自身的故障动态,并对物理机自身能修复的物理机故障情况有针对性的进行分类修复处理;对物理机自身不能修复的物理机故障情况,通过集群外部的物理机故障分类处理模块有针对性的进行分类修复处理,从而有效降低物理机故障的误判和漏判情况的发生,更安全、稳定、快速的进行虚拟机自动恢复。
另外,本申请实施例针对物理机自身不能修复的物理机故障情况,除了可以通过故障物理机上的带外管理模块关闭故障物理机之外,还可以通过集群外部的物理机故障分类处理模块,指示故障物理机自主关机,从而弥补带外管理模块调用关机操作的可用性无法达到商用标准的问题,同时也确保自动化物理机隔离的有效性。
此外,本申请实施例也同时考虑到大规模云计算集群内发生物理机规模故障情况的可能性,通过判断故障物理机的数量是否构成机房级别,并有针对性的采取不同的修复处理方式。尤其是针对大规模物理机故障的情况,采用人工处理的方式修复,从而有效避免由于故障物理机上的虚拟机的频繁迁移而影响系统性能情况的发生。
对于装置实施例而言,由于其与方法实施例基本相似,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。
本说明书中的各个实施例均采用递进的方式描述,每个实施例重点说明的都是与其他实施例的不同之处,各个实施例之间相同相似的部分互相参见即可。
本领域内的技术人员应明白,本申请实施例的实施例可提供为方法、装置、或计算机程序产品。因此,本申请实施例可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本申请实施例可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD@ROM、光学存储器等)上实施的计算机程序产品的形式。
在一个典型的配置中,所述计算机设备包括一个或多个处理器(CPU)、输入/输出接口、网络接口和内存。内存可能包括计算机可读介质中的非永久性存储器,随机存取存储器(RAM)和/或非易失性内存等形式,如只读存储器(ROM)或闪存(flash RAM)。内存是计算机可读介质的示例。计算机可读介质包括永久性和非永久性、可移动和非可移动媒体可以由任何方法或技术来实现信息存储。信息可以是计算机可读指令、 数据结构、程序的模块或其他数据。计算机的存储介质的例子包括,但不限于相变内存(PRAM)、静态随机存取存储器(SRAM)、动态随机存取存储器(DRAM)、其他类型的随机存取存储器(RAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、快闪记忆体或其他内存技术、只读光盘只读存储器(CD@ROM)、数字多功能光盘(DVD)或其他光学存储、磁盒式磁带,磁带磁磁盘存储或其他磁性存储设备或任何其他非传输介质,可用于存储可以被计算设备访问的信息。按照本文中的界定,计算机可读介质不包括非持续性的电脑可读媒体(transitory media),如调制的数据信号和载波。
本申请实施例是参照根据本申请实施例的方法、终端设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理终端设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理终端设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。
这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理终端设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品,该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。
这些计算机程序指令也可装载到计算机或其他可编程数据处理终端设备上,使得在计算机或其他可编程终端设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其他可编程终端设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。
尽管已描述了本申请实施例的优选实施例,但本领域内的技术人员一旦得知了基本创造性概念,则可对这些实施例做出另外的变更和修改。所 以,所附权利要求意欲解释为包括优选实施例以及落入本申请实施例范围的所有变更和修改。
最后,还需要说明的是,在本文中,诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者终端设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者终端设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者终端设备中还存在另外的相同要素。
以上对本申请所提供的一种应用于虚拟化集群系统的物理机故障修复方法、装置和集群物理机故障分类处理方法、装置及虚拟机恢复方法、系统,进行了详细介绍,本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的方法及其核心思想;同时,对于本领域的一般技术人员,依据本申请的思想,在具体实施方式及应用范围上均可有改变之处,综上所述,本说明书内容不应理解为对本申请的限制。

Claims (32)

  1. 一种集群物理机故障分类处理方法,其特征在于,包括:
    从物理机故障信息存储中心获取物理机故障信息列表;
    若在所述物理机故障信息列表中检测到因遭受网络攻击而导致物理机故障,则触发所述集群外部的安全攻击防护中心处理;
    若在所述物理机故障信息列表中检测到因物理机自身不能修复的软硬件故障,则向故障物理机发送关闭故障物理机的指令;及通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内其他健康物理机上。
  2. 如权利要求1所述的方法,其特征在于,所述方法还包括:
    若在所述物理机故障信息列表中检测到物理机网络完全不通且网络不通持续时间达到预设时间;
    判断网络不通的物理机数量是否超过预设数量,如果是则通知运营维修人员人工修复;
    否则通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内到其他健康物理机上。
  3. 如权利要求1所述的方法,其特征在于,所述方法还包括:
    若在所述物理机故障信息列表中检测到物理机网络不通但网络不通持续时间未达到预设时间后网络又恢复正常,且确定物理机网络不通是物理机重启所导致的,则判断当前的物理机是否健康,如果健康则通过虚拟化接口重启所述物理机上的虚拟机,如果不健康则通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群内其他健康物理机上。
  4. 如权利要求1所述的方法,其特征在于,所述方法还包括:
    若在所述物理机故障信息列表中检测到物理机网络不稳定且网络不稳定持续时间达到预设时间,则向故障物理机发送指令以指示所述故障物 理机自主关闭故障物理机或通过所述物理机上的带外管理模块关闭故障物理机;及通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内其他健康物理机上。
  5. 如权利要求1所述的方法,其特征在于,所述若在所述物理机故障信息列表中检测到因物理机自身不能修复的软硬件故障,则向故障物理机发送关闭故障物理机的指令的步骤包括:
    向故障物理机发送关闭故障物理机的指令以指示所述故障物理机自主关闭故障物理机或通过所述物理机上的带外管理模块关闭故障物理机。
  6. 如权利要求1所述的方法,其特征在于,通过以下方式确定所述健康物理机:
    在所述物理机故障信息列表中匹配所述集群内的所有物理机;
    将没有匹配成功的物理机确定为健康物理机。
  7. 如权利要求1所述的方法,其特征在于,所述物理机故障信息列表包括:由所述集群外部的物理机故障探测模块从故障物理机处探测到并上报给所述物理机故障信息存储中心的物理机故障信息,及由所述集群外部的物理机故障收集模块从故障物理机处收集到并上报给所述物理机故障信息存储中心的物理机故障信息。
  8. 一种集群物理机故障分类处理装置,其特征在于,包括:
    获取模块,用于从物理机故障信息存储中心获取物理机故障信息列表;
    第一处理模块,用于若在所述物理机故障信息列表中检测到因遭受网络攻击而导致物理机故障,则触发所述集群外部的安全攻击防护中心处理;
    第二处理模块,进一步包括:
    关闭处理单元,用于若在所述物理机故障信息列表中检测到因物理机自身不能修复的软硬件故障,则向故障物理机发送关闭故障物理机的指令;
    迁移处理单元,用于通过虚拟化接口迁移所述故障物理机上的虚拟机 到所述集群系统内其他健康物理机上。
  9. 如权利要求8所述的装置,其特征在于,所述装置还包括第三处理模块,所述第三处理模块包括:
    通知处理单元,用于若在所述物理机故障信息列表中检测到物理机网络完全不通且网络不通持续时间达到预设时间,并且网络不通的物理机数量超过一台,则通知运营维修人员人工修复;
    迁移处理单元,用于若在所述物理机故障信息列表中检测到物理机网络完全不通且网络不通持续时间达到预设时间,并且网络不通的物理机数量未超过预设数量,则通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内到其他健康物理机上。
  10. 如权利要求8所述的装置,其特征在于,所述装置还包括第四处理模块,所述第四处理模块包括:
    重启处理单元,用于若在所述物理机故障信息列表中检测到物理机网络不通但网络不通持续时间未达到预设时间后网络又恢复正常,且确定物理机网络不通是物理机重启所导致的,则在确定当前的物理机是健康的情况下,通过虚拟化接口重启所述物理机上的虚拟机;
    迁移处理单元,用于若在所述物理机故障信息列表中检测到物理机网络不通但网络不通持续时间未达到预设时间后网络又恢复正常,且确定物理机网络不通是物理机重启所导致的,则在确定当前的物理机是不健康的情况下,通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群内其他健康物理机上。
  11. 如权利要求8所述的装置,其特征在于,所述装置还包括第五处理模块,所述第五处理模块包括:
    关机处理单元,用于若在所述物理机故障信息列表中检测到物理机网络不稳定且网络不稳定持续时间达到预设时间,则向故障物理机发送指令以指示所述故障物理机自主关闭故障物理机或通过所述物理机上的带外管理模块关闭故障物理机;
    迁移处理单元,用于通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内其他健康物理机上。
  12. 如权利要求8-11任一项所述的装置,其特征在于,所述关闭处理单元,用于若在所述物理机故障信息列表中检测到因物理机自身不能修复的软硬件故障,则向故障物理机发送关闭故障物理机的指令以指示所述故障物理机自主关闭故障物理机或通过所述物理机上的带外管理模块关闭故障物理机。
  13. 如权利要求8-11任一项所述的装置,其特征在于,所述装置还包括:
    确定模块,用于在所述物理机故障信息列表中匹配所述集群内的所有物理机,将没有匹配成功的物理机确定为健康物理机。
  14. 如权利要求8所述的装置,其特征在于,所述物理机故障信息列表包括:由所述集群外部的物理机故障探测模块从故障物理机处探测到并上报给所述物理机故障信息存储中心的物理机故障信息,及由所述集群外部的物理机故障收集模块从故障物理机处收集到并上报给所述物理机故障信息存储中心的物理机故障信息。
  15. 一种虚拟机恢复方法,其特征在于,应用于虚拟化集群系统,所述方法包括:
    虚拟化集群系统内的物理机自主检测自身的故障动态;
    若自主检测到物理机自身能容错修复的软硬件故障,通过容错方式修复;
    若自主检测到物理机自身能重启修复的软硬件故障,通过重启物理机方式修复;
    从物理机故障信息存储中心获取物理机故障信息列表;
    若在所述物理机故障信息列表中检测到因遭受网络攻击而导致物理机故障,则触发所述集群外部的安全攻击防护中心处理;
    若在所述物理机故障信息列表中检测到因物理机自身不能修复的软硬件故障,则向故障物理机发送关闭故障物理机的指令;及通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内其他健康物理机上。
  16. 如权利要求15所述的方法,其特征在于,所述方法还包括:
    若在所述物理机故障信息列表中检测到物理机网络完全不通且网络不通持续时间达到预设时间;判断网络不通的物理机数量是否超过预设数量,如果是则通知运营维修人员人工修复;否则通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内到其他健康物理机上。
  17. 如权利要求15所述的方法,其特征在于,所述方法还包括:
    若在所述物理机故障信息列表中检测到物理机网络不通但网络不通持续时间未达到预设时间后网络又恢复正常,且确定物理机网络不通是物理机重启所导致的,则判断当前的物理机是否健康,如果健康则通过虚拟化接口重启所述物理机上的虚拟机,如果不健康则通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群内其他健康物理机上。
  18. 如权利要求15所述的方法,其特征在于,所述方法还包括:
    若在所述物理机故障信息列表中检测到物理机网络不稳定且网络不稳定持续时间达到预设时间,则向故障物理机发送指令以指示所述故障物理机自主关闭故障物理机或通过所述物理机上的带外管理模块关闭故障物理机;及通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内其他健康物理机上。
  19. 如权利要求15所述的方法,其特征在于,通过以下方式确定所述健康物理机:
    在所述物理机故障信息列表中匹配所述集群内的所有物理机;
    将没有匹配成功的物理机确定为健康物理机。
  20. 如权利要求15所述的方法,其特征在于,所述从物理机故障信息存储中心获取物理机故障信息列表的步骤包括:
    物理机故障分类处理模块从物理机故障信息存储中心获取物理机故障信息列表。
  21. 如权利要求15所述的方法,其特征在于,所述若在所述物理机故障信息列表中检测到因物理机自身不能修复的软硬件故障,则向故障物理机发送关闭故障物理机的指令的步骤包括:
    向故障物理机发送关闭故障物理机的指令以指示所述故障物理机自主关闭故障物理机或通过所述物理机上的带外管理模块关闭故障物理机。
  22. 如权利要求15所述的方法,其特征在于,所述物理机故障信息列表包括:由所述集群外部的物理机故障探测模块从故障物理机处探测到并上报给所述物理机故障信息存储中心的物理机故障信息,及由所述集群外部的物理机故障收集模块从故障物理机处收集到并上报给所述物理机故障信息存储中心的物理机故障信息。
  23. 一种虚拟机恢复系统,其特征在于,所述系统包括:
    物理机故障修复装置,应用于虚拟化集群系统内的物理机上自主检测物理机自身的故障动态,若自主检测到物理机自身能容错修复的软硬件故障,通过容错方式修复;若自主检测到物理机自身能重启修复的软硬件故障,通过重启物理机方式修复;
    物理机故障信息存储中心,用于将所有上报的物理故障信息汇集成物理机故障信息列表;
    物理机故障分类处理装置,用于从所述物理机故障信息存储中心获取物理机故障信息列表,若在所述物理机故障信息列表中检测到因遭受网络攻击而导致物理机故障,则触发所述集群外部的安全攻击防护中心处理;若在所述物理机故障信息列表中检测到因物理机自身不能修复的软硬件故障,则向故障物理机发送关闭故障物理机的指令,及通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内其他健康物理机上。
  24. 如权利要求23所述的系统,其特征在于,所述物理机故障分类处理装置还用于:
    若在所述物理机故障信息列表中检测到物理机网络完全不通且网络不通持续时间达到预设时间;判断网络不通的物理机数量是否超过预设数量,如果是则通知运营维修人员人工修复;否则通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内到其他健康物理机上。
  25. 如权利要求23所述的系统,其特征在于,所述物理机故障分类处理装置还用于:
    若在所述物理机故障信息列表中检测到物理机网络不通但网络不通持续时间未达到预设时间后网络又恢复正常,且确定物理机网络不通是物理机重启所导致的,则判断当前的物理机是否健康,如果健康则通过虚拟化接口重启所述物理机上的虚拟机,如果不健康则通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群内其他健康物理机上。
  26. 如权利要求23所述的系统,其特征在于,所述物理机故障分类处理装置还用于:
    若在所述物理机故障信息列表中检测到物理机网络不稳定且网络不稳定持续时间达到预设时间,则向故障物理机发送指令以指示所述故障物理机自主关闭故障物理机或通过所述物理机上的带外管理模块关闭故障物理机;及通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内其他健康物理机上。
  27. 如权利要求23所述的系统,其特征在于,所述物理机故障分类处理装置还用于:
    在所述物理机故障信息列表中匹配所述集群内的所有物理机,将没有匹配成功的物理机确定为健康物理机。
  28. 如权利要求23所述的系统,其特征在于,所述物理机故障分类处理装置还用于:
    向故障物理机发送关闭故障物理机的指令以指示所述故障物理机自主关闭故障物理机或通过所述物理机上的带外管理模块关闭故障物理机。
  29. 如权利要求23所述的系统,其特征在于,所述物理机故障信息列表包括:由所述集群外部的物理机故障探测模块从故障物理机处探测到并上报给所述物理机故障信息存储中心的物理机故障信息,及由所述集群外部的物理机故障收集模块从故障物理机处收集到并上报给所述物理机故障信息存储中心的物理机故障信息。
  30. 如权利要求23所述的系统,其特征在于,所述物理机故障修复装置包括:
    自主检测模块,包括:
    检测单元,用于自主检测物理机自身的故障动态;
    自主处理模块,包括:
    容错单元,用于若所述检测单元检测到物理机自身能容错修复的软硬件故障,则通过容错方式修复;
    重启单元,用于若所述检测单元检测到物理机自身能重启修复的软硬件故障,则通过重启物理机方式修复。
  31. 如权利要求30所述的系统,其特征在于,所述自主处理模块还包括:
    关机单元,用于若所述检测单元检测到物理机自身不能修复的软硬件故障,则根据所述集群外部的物理机故障分类处理模块的指令或通过所述物理机上的带外管理模块关闭故障物理机,由所述集群外部的物理机故障分类处理模块通过虚拟化接口迁移所述故障物理机上的虚拟机到所述集群系统内其他健康物理机上。
  32. 如权利要求30所述的系统,其特征在于,所述自主检测模块还包括:
    上报单元,用于当自主检测模块自主检测到因遭受网络攻击而导致物理机故障时,通过物理机故障收集模块上报物理机故障信息到物理机故障信息存储中心,由所述集群外部的物理机故障分类处理模块触发所述集群外部的安全攻击防护中心处理。
PCT/CN2017/074618 2016-03-10 2017-02-23 物理机故障分类处理方法、装置和虚拟机恢复方法、系统 WO2017152763A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610136817.0A CN107179957B (zh) 2016-03-10 2016-03-10 物理机故障分类处理方法、装置和虚拟机恢复方法、系统
CN201610136817.0 2016-03-10

Publications (1)

Publication Number Publication Date
WO2017152763A1 true WO2017152763A1 (zh) 2017-09-14

Family

ID=59790073

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/074618 WO2017152763A1 (zh) 2016-03-10 2017-02-23 物理机故障分类处理方法、装置和虚拟机恢复方法、系统

Country Status (3)

Country Link
CN (1) CN107179957B (zh)
TW (1) TWI746512B (zh)
WO (1) WO2017152763A1 (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109144765A (zh) * 2018-08-21 2019-01-04 平安科技(深圳)有限公司 报表生成方法、装置、计算机设备及存储介质
CN111666170A (zh) * 2020-05-29 2020-09-15 中国工商银行股份有限公司 基于分布式框架的故障节点处理方法及装置
CN111984969A (zh) * 2020-08-20 2020-11-24 北京金山云网络技术有限公司 虚拟机的故障报警方法、装置及电子设备
CN112148485A (zh) * 2020-09-16 2020-12-29 杭州安恒信息技术股份有限公司 超融合平台故障恢复方法、装置、电子装置和存储介质
CN118316981A (zh) * 2024-06-11 2024-07-09 山东怡然信息技术有限公司 物联网设备数据处理的云边协同系统

Families Citing this family (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108062471B (zh) * 2017-12-19 2021-07-20 龙岩学院 一种云计算网络运行过程中的风险处理方法及设备
CN108153618B (zh) * 2017-12-22 2021-12-03 国网浙江杭州市萧山区供电有限公司 硬盘数据恢复方法、装置及硬盘数据恢复设备
US11842149B2 (en) * 2018-03-02 2023-12-12 General Electric Company System and method for maintenance of a fleet of machines
CN108763039B (zh) * 2018-04-02 2021-09-21 创新先进技术有限公司 一种业务故障模拟方法、装置及设备
TWI686696B (zh) * 2018-08-14 2020-03-01 財團法人工業技術研究院 計算節點及其失效偵測方法與雲端資料處理系統
CN109587331B (zh) * 2018-11-26 2021-02-02 广州微算互联信息技术有限公司 云手机故障自动修复的方法与系统
CN109614260B (zh) * 2018-11-28 2022-06-03 北京小米移动软件有限公司 通信故障判断方法、装置、电子设备和存储介质
CN110262917A (zh) * 2019-05-15 2019-09-20 平安科技(深圳)有限公司 宿主机自愈方法、装置、计算机设备及存储介质
CN110247821B (zh) * 2019-06-04 2022-10-18 平安科技(深圳)有限公司 一种故障检测方法及相关设备
CN110377396A (zh) * 2019-07-04 2019-10-25 深圳先进技术研究院 一种虚拟机自动迁移方法、系统及电子设备
CN111224989A (zh) * 2020-01-09 2020-06-02 武汉思普崚技术有限公司 一种虚拟微隔离网络的攻击面防护方法及系统
CN111277568A (zh) * 2020-01-09 2020-06-12 武汉思普崚技术有限公司 一种分布式虚拟网络的隔离攻击方法及系统
CN111258711B (zh) * 2020-01-09 2022-05-03 武汉思普崚技术有限公司 一种多协议的网络微隔离方法及系统
CN111224990B (zh) * 2020-01-09 2022-05-03 武汉思普崚技术有限公司 一种分布式微隔离网络的流量牵引方法及系统
CN111262841B (zh) * 2020-01-09 2022-05-03 武汉思普崚技术有限公司 一种虚拟微隔离网络的资源调度方法及系统
CN111262840A (zh) * 2020-01-09 2020-06-09 武汉思普崚技术有限公司 一种虚拟网络的攻击面转移方法及系统
CN111273995A (zh) * 2020-01-09 2020-06-12 武汉思普崚技术有限公司 一种虚拟微隔离网络的安全调度方法及系统
CN111176795B (zh) * 2020-01-09 2022-05-03 武汉思普崚技术有限公司 一种分布式虚拟网络的动态迁移方法及系统
CN111212079B (zh) * 2020-01-09 2022-05-03 武汉思普崚技术有限公司 一种基于业务的微隔离流量牵引方法及系统
CN111399978A (zh) * 2020-03-02 2020-07-10 中铁信弘远(北京)软件科技有限责任公司 一种基于OpenStack的故障迁移系统及迁移方法
CN111796959B (zh) * 2020-06-30 2023-08-08 中国工商银行股份有限公司 宿主机容器自愈方法、装置及系统
CN112165495B (zh) * 2020-10-13 2023-05-09 北京计算机技术及应用研究所 一种基于超融合架构防DDoS攻击方法、装置及超融合集群
US11693694B2 (en) 2021-03-29 2023-07-04 Red Hat, Inc. Migrating quantum services from quantum computing devices to quantum simulators
CN113157476B (zh) * 2021-04-10 2024-08-27 作业帮教育科技(北京)有限公司 虚拟云环境中显卡故障的处理方法及装置
TWI847064B (zh) * 2021-10-13 2024-07-01 中華電信股份有限公司 設備檢測裝置及設備檢測方法
CN114780272B (zh) * 2022-04-18 2023-03-17 北京亚康万玮信息技术股份有限公司 基于共享存储和虚拟化的智能故障自愈调度方法和装置
CN114884836A (zh) * 2022-04-28 2022-08-09 济南浪潮数据技术有限公司 一种虚拟机高可用方法、装置及介质
CN115080211A (zh) * 2022-06-30 2022-09-20 济南浪潮数据技术有限公司 一种虚拟化平台系统的任务调度方法、系统及相关组件
CN115484267B (zh) * 2022-09-15 2024-09-17 中国联合网络通信集团有限公司 多集群部署处理方法、装置、电子设备和存储介质
CN116074184B (zh) * 2023-03-21 2023-06-27 云南莱瑞科技有限公司 一种电力调度中心网络故障预警系统
CN116401009A (zh) * 2023-03-28 2023-07-07 北京益安在线科技股份有限公司 一种基于kvm虚拟化的智能管理系统

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5949759A (en) * 1995-12-20 1999-09-07 International Business Machines Corporation Fault correlation system and method in packet switching networks
US8524457B2 (en) * 2009-09-22 2013-09-03 William Patterson Method for the selection of specific affinity binders by homogeneous noncompetitive assay
CN103167004A (zh) * 2011-12-15 2013-06-19 中国移动通信集团上海有限公司 云平台主机系统故障修复方法及云平台前端控制服务器
EP2687982A1 (en) * 2012-07-16 2014-01-22 NTT DoCoMo, Inc. Hierarchical system for managing a plurality of virtual machines, method and computer program
US9141487B2 (en) * 2013-01-15 2015-09-22 Microsoft Technology Licensing, Llc Healing cloud services during upgrades
CN103152419B (zh) * 2013-03-08 2016-04-20 中标软件有限公司 一种云计算平台的高可用集群管理方法
CN103607296B (zh) * 2013-11-01 2017-08-22 新华三技术有限公司 一种虚拟机故障处理方法和设备
CN106537354B (zh) * 2014-07-22 2020-01-07 日本电气株式会社 虚拟化基础设施管理装置、系统、方法和记录介质
CN104392175B (zh) * 2014-11-26 2018-05-29 华为技术有限公司 一种云计算系统中云应用攻击行为处理方法、装置及系统
CN105306225B (zh) * 2015-11-03 2018-09-07 国云科技股份有限公司 一种基于Openstack的物理机远程关机方法

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110060722A1 (en) * 2009-09-07 2011-03-10 Icon Business Systems Limited Centralized management mode backup disaster recovery system
CN102984739A (zh) * 2011-09-07 2013-03-20 中兴通讯股份有限公司 故障信息处理方法及装置
CN102394774A (zh) * 2011-10-31 2012-03-28 广东电子工业研究院有限公司 云计算操作系统的控制器服务状态监控和故障恢复方法
CN102629224A (zh) * 2012-04-26 2012-08-08 广东电子工业研究院有限公司 一种基于云平台的一体化数据容灾方法及其装置
CN103095506A (zh) * 2013-02-06 2013-05-08 浪潮电子信息产业股份有限公司 一种云环境下基于设备健康状态的资源调整方法

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109144765A (zh) * 2018-08-21 2019-01-04 平安科技(深圳)有限公司 报表生成方法、装置、计算机设备及存储介质
CN109144765B (zh) * 2018-08-21 2024-02-02 平安科技(深圳)有限公司 报表生成方法、装置、计算机设备及存储介质
CN111666170A (zh) * 2020-05-29 2020-09-15 中国工商银行股份有限公司 基于分布式框架的故障节点处理方法及装置
CN111666170B (zh) * 2020-05-29 2024-04-12 中国工商银行股份有限公司 基于分布式框架的故障节点处理方法及装置
CN111984969A (zh) * 2020-08-20 2020-11-24 北京金山云网络技术有限公司 虚拟机的故障报警方法、装置及电子设备
CN112148485A (zh) * 2020-09-16 2020-12-29 杭州安恒信息技术股份有限公司 超融合平台故障恢复方法、装置、电子装置和存储介质
CN118316981A (zh) * 2024-06-11 2024-07-09 山东怡然信息技术有限公司 物联网设备数据处理的云边协同系统

Also Published As

Publication number Publication date
TWI746512B (zh) 2021-11-21
TW201738747A (zh) 2017-11-01
CN107179957B (zh) 2020-08-25
CN107179957A (zh) 2017-09-19

Similar Documents

Publication Publication Date Title
WO2017152763A1 (zh) 物理机故障分类处理方法、装置和虚拟机恢复方法、系统
US10152382B2 (en) Method and system for monitoring virtual machine cluster
CN102231681B (zh) 一种高可用集群计算机系统及其故障处理方法
CN105187249B (zh) 一种故障恢复方法及装置
US8910172B2 (en) Application resource switchover systems and methods
JP4942835B2 (ja) 仮想インフラストラクチャを用いた情報技術リスク管理
US9450700B1 (en) Efficient network fleet monitoring
Panda et al. IASO: A Fail-Slow Detection and Mitigation Framework for Distributed Storage Services
US10489232B1 (en) Data center diagnostic information
US9841986B2 (en) Policy based application monitoring in virtualized environment
CN106789306B (zh) 通信设备软件故障检测收集恢复方法和系统
CN106856489A (zh) 一种分布式存储系统的服务节点切换方法和装置
CN104486100B (zh) 故障处理装置及方法
CN105302661A (zh) 一种实现虚拟化管理平台高可用的系统和方法
WO2020167463A1 (en) Interface for fault prediction and detection using time-based distributed data
WO2023092772A1 (zh) 一种虚拟化集群高可用性的实现方法和设备
CN113825164A (zh) 网络故障修复方法、装置、存储介质及电子设备
CN105068763A (zh) 一种针对存储故障的虚拟机容错系统和方法
CN114064217B (zh) 一种基于OpenStack的节点虚拟机迁移方法及装置
CN115766405A (zh) 一种故障处理方法、装置、设备和存储介质
US9886070B2 (en) Method, system, and computer program product for taking an I/O enclosure offline
US10365934B1 (en) Determining and reporting impaired conditions in a multi-tenant web services environment
CN104683131A (zh) 一种应用级虚拟化高可靠性方法及装置
CN113760459A (zh) 虚拟机故障检测方法、存储介质和虚拟化集群
JP6984119B2 (ja) 監視装置、監視プログラム、及び監視方法

Legal Events

Date Code Title Description
NENP Non-entry into the national phase
Ref country code: DE
121 Ep: the epo has been informed by wipo that ep was designated in this application
Ref document number: 17762449
Country of ref document: EP
Kind code of ref document: A1
122 Ep: pct application non-entry in european phase
Ref document number: 17762449
Country of ref document: EP
Kind code of ref document: A1