WO2018054081A1 - Fault processing method, virtual architecture management system, and service management system - Google Patents


Info

Publication number
WO2018054081A1
Authority
WO
WIPO (PCT)
Prior art keywords
management system
virtual machine
virtual
service
fault
Prior art date
Application number
PCT/CN2017/085356
Other languages
English (en)
French (fr)
Inventor
李候青
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2018054081A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications

Definitions

  • the present invention relates to the field of cloud computing, and in particular, to a fault processing method, a virtual architecture management system, a service management system, and a virtualized computer system.
  • the fault of the faulty device is sent to the virtual infrastructure management system through the fault alarm message, and the virtual infrastructure management system sends the fault alarm message to the service management system.
  • The service management system determines the affected virtual machine and service application according to the fault alarm message, and performs fault processing operations on the affected service application. This requires the service management system to be aware of the hardware and of the correspondence between hardware faults and service applications before it can perform fault processing on the service application. As a result, the service management system cannot quickly notify the service application affected by the faulty device, which reduces the reliability of the service application.
  • The present invention provides a fault processing method, a virtual architecture management system, a service management system, and a virtualized computer system, which can quickly notify the affected virtual machine of the impact of the hardware failure on the virtual machine, thereby improving service reliability.
  • the present invention provides a method of fault handling.
  • the fault processing method is used for fault processing in a virtualized computer system.
  • the virtualized computer system includes: a virtual infrastructure management system, a service management system, and at least one virtual machine, where at least one virtual machine runs on at least one physical device.
  • At least one virtual machine is used to execute a business application
  • a business management system is used to manage a business application
  • a virtual infrastructure management system is used to manage at least one virtual machine and at least one physical device.
  • The fault processing method includes: the virtual architecture management system obtains a fault alarm message, where the fault alarm message carries the identifier information of the faulty device and the fault type; the virtual architecture management system determines the first virtual machine set according to the fault alarm message, where the first virtual machine set includes at least one first virtual machine affected by the faulty device; and the virtual infrastructure management system sends a status alarm message to the service management system, where the status alarm message carries information of the first virtual machine set.
  • After obtaining the fault alarm message of the faulty device, the virtual infrastructure management system directly analyzes and processes the fault alarm message, determines the one or more virtual machines affected by the faulty device, and sends the information of these virtual machines to the service management system. This enables the service management system to determine the affected business applications directly from the information of the virtual machines, and then process the affected business applications.
  • Because the virtual architecture management system directly determines the information of the virtual machine affected by the faulty device according to the fault alarm message of the faulty device, the service management system can determine the affected business application directly from the status alarm message of the first virtual machine set, instead of first determining the affected virtual machine from the alarm message of the faulty device and then determining the affected business application. Therefore, the service management system does not need to directly sense the hardware fault, and thus can quickly trigger the impact processing of the service application, reduce service loss, and improve the reliability of the service application.
  • The fault processing method further includes: the virtual architecture management system determines impact information of the first virtual machine set according to the fault alarm message of the faulty device, where the impact information indicates the type and/or level of the impact of the faulty device on the first virtual machine in the first virtual machine set; accordingly, the status alarm message may also carry the impact information of the first virtual machine set.
  • The virtual architecture management system can derive the type and/or level of the impact of the fault on the virtual machine from the fault alarm information of the faulty device.
  • The status alarm message sent to the service management system then also carries the impact information indicating the type and/or level of the impact of the faulty device on the first virtual machine in the first virtual machine set, so that the service management system or the service system can process the service application according to the impact information, further improving the reliability of the service application.
  • The status alarm information may further include identifier information of the first virtual machine in the first virtual machine set, alarm identifier information, alarm name information, alarm object type information, alarm type information, alarm generation time information, alarm component type information, alarm component identification information, and alarm component name information.
  • the status alarm information may include fault type information of the faulty device.
  • The type of the impact of the faulty device on the first virtual machine in the first virtual machine set includes one or more of: fault, high risk, medium risk, low risk, or no impact.
  • The level of the impact of the faulty device on the first virtual machine in the first virtual machine set includes urgent, important, or not important.
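The impact types and levels listed above can be modeled as two small enumerations. The following is a minimal sketch, not the patent's implementation: the class names, the fault type strings, and the rule table that maps a fault type to impact information are all assumptions made for illustration, since the text does not say how the virtual architecture management system derives the impact.

```python
from dataclasses import dataclass
from enum import Enum


class ImpactType(Enum):
    FAULT = "fault"
    HIGH_RISK = "high risk"
    MEDIUM_RISK = "medium risk"
    LOW_RISK = "low risk"
    NO_IMPACT = "no impact"


class ImpactLevel(Enum):
    URGENT = "urgent"
    IMPORTANT = "important"
    NOT_IMPORTANT = "not important"


@dataclass(frozen=True)
class ImpactInfo:
    impact_type: ImpactType
    impact_level: ImpactLevel


# Hypothetical rule table: the fault type names and the chosen impacts
# are illustrative assumptions, not values taken from the source.
IMPACT_RULES = {
    "whole_machine": ImpactInfo(ImpactType.FAULT, ImpactLevel.URGENT),
    "memory": ImpactInfo(ImpactType.HIGH_RISK, ImpactLevel.IMPORTANT),
    "disk": ImpactInfo(ImpactType.MEDIUM_RISK, ImpactLevel.IMPORTANT),
    "network_card": ImpactInfo(ImpactType.LOW_RISK, ImpactLevel.NOT_IMPORTANT),
}


def impact_for(fault_type: str) -> ImpactInfo:
    """Return the impact information for a fault type, defaulting to no impact."""
    default = ImpactInfo(ImpactType.NO_IMPACT, ImpactLevel.NOT_IMPORTANT)
    return IMPACT_RULES.get(fault_type, default)
```

A status alarm message could then carry one `ImpactInfo` per affected virtual machine alongside the virtual machine identifier.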
  • The fault processing method further includes: the virtual architecture management system receives a first request message sent by the service management system, where the first request message indicates the virtual machine to be restored, and the virtual machine to be restored is a subset of the first virtual machine set; the virtual infrastructure management system preferentially restores the virtual machine to be restored according to the first request message.
  • At the request of the service management system, the virtual infrastructure management system may perform recovery processing on at least one virtual machine in the first virtual machine set affected by the faulty device, in the priority order indicated by the service management system.
  • the recovery process performed by the virtual architecture management system on the virtual machine may include: virtual machine hot migration.
  • The fault processing method further includes: if the virtual infrastructure management system does not receive the first request message sent by the service management system within a preset time threshold, it restores the first virtual machine in the first virtual machine set according to a preset virtual machine recovery policy.
  • This ensures that when the service management system gives no indication of how the virtual infrastructure management system should recover the virtual machines in the first virtual machine set, the virtual infrastructure management system can still actively restore the first virtual machine in the first virtual machine set according to the pre-configured recovery policy.
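The wait-then-fall-back behavior described above can be sketched with a blocking queue read that has a timeout. This is an illustrative sketch only: the function name, the use of a `queue.Queue` as the channel for the first request message, and the choice to restore the remaining affected virtual machines after the requested ones are assumptions, not details from the source.

```python
import queue
from typing import Callable, List


def choose_recovery_order(
    first_vm_set: List[str],
    requests: "queue.Queue[List[str]]",
    timeout_s: float,
    default_policy: Callable[[List[str]], List[str]],
) -> List[str]:
    """Decide the order in which affected virtual machines are restored."""
    try:
        # Wait up to the preset time threshold for the first request message.
        requested = requests.get(timeout=timeout_s)
    except queue.Empty:
        # No request arrived in time: use the preset recovery policy.
        return default_policy(first_vm_set)

    # The virtual machines to be restored must be a subset of the first set.
    affected = set(first_vm_set)
    prioritized = [vm for vm in requested if vm in affected]
    # Assumption: the remaining affected machines are restored afterwards.
    rest = [vm for vm in first_vm_set if vm not in prioritized]
    return prioritized + rest
```

Here `default_policy` stands in for the preset virtual machine recovery policy; any callable that orders the affected set would do.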
  • the fault processing method further includes: the virtual architecture management system sends a status alarm clear message to the service management system.
  • The virtual infrastructure management system sends a status alarm clear message to the service management system after the virtual machine is restored, so that the service management system can clear the related status alarm message it received earlier, thereby preventing the service management system from analyzing and processing status alarm messages related to a virtual machine that has already been restored.
  • The present invention provides a virtual architecture management system, where the virtual architecture management system includes modules for performing the fault handling method in the first aspect or any possible implementation of the first aspect.
  • After obtaining the fault alarm message of the faulty device, the virtual infrastructure management system of the present invention directly analyzes and processes the fault alarm message, determines the one or more virtual machines affected by the faulty device, and sends the information of these virtual machines to the service management system. This enables the service management system to determine the affected business applications directly from the information of the virtual machines, and then process the affected business applications.
  • Because the virtual architecture management system directly determines the information of the virtual machine affected by the faulty device according to the fault alarm message of the faulty device, the service management system can determine the affected business application directly from the status alarm message of the first virtual machine set, instead of first determining the affected virtual machine from the alarm message of the faulty device and then determining the affected business application. Therefore, the service management system does not need to directly sense the hardware fault, and thus can quickly trigger the impact processing of the service application, reduce service loss, and improve the reliability of the service application.
  • the present invention provides a virtual infrastructure management system including a processor, a memory, a communication interface, and a bus.
  • the processor, the memory, and the communication interface communicate through a bus, and may also implement communication by other means such as wireless transmission.
  • The memory is for storing instructions, and the processor is for executing the instructions stored in the memory.
  • The memory stores program code, and the processor can invoke the program code stored in the memory to perform the fault handling method in the first aspect or any possible implementation of the first aspect.
  • The present invention provides a computer readable medium storing program code for execution by a virtual infrastructure management system, the program code comprising instructions for performing the fault handling method in the first aspect or any possible implementation of the first aspect.
  • The present invention further provides a fault processing method for performing fault processing in a virtualized computer system. The virtualized computer system comprises a virtual architecture management system, a service management system, and at least one virtual machine. The at least one virtual machine runs on at least one physical device and is configured to execute a service application; the service management system is configured to manage the service application; and the virtual infrastructure management system is configured to manage the at least one virtual machine and the at least one physical device.
  • The fault processing method includes: the service management system receives a status alarm message sent by the virtual infrastructure management system, where the status alarm message carries information about the first virtual machine set affected by the faulty device, and the first virtual machine set includes at least one first virtual machine; the service management system determines a service application associated with the at least one first virtual machine according to the status alarm message; and the service management system performs a processing operation on the associated service application.
  • Because the status alarm message carries the information of the affected virtual machines, the affected service application can be determined directly from that information and then processed.
  • The service management system can determine the affected service application directly from the status alarm message of the first virtual machine set, instead of first determining the affected virtual machine from the alarm message of the faulty device and then determining the affected business application. Therefore, the service management system does not need to directly sense the hardware fault, and thus can quickly trigger the impact processing of the service application, reduce service loss, and improve the reliability of the service application.
  • The status alarm message of the first virtual machine set further carries the impact information of the first virtual machine set, where the impact information indicates the type and/or level of the impact of the faulty device on at least one first virtual machine in the first virtual machine set.
  • The service management system performing a processing operation on the service application then includes: the service management system performs the processing operation on the service application according to the impact information of the first virtual machine set.
  • The status alarm message of the first virtual machine set that the service management system receives from the virtual architecture management system also carries the impact information indicating the type and/or level of the impact of the faulty device on the first virtual machine in the first virtual machine set, so that the service management system or the service system can process the service application according to the impact information, further improving the reliability of the service application.
  • The status alarm information may further include identifier information of the first virtual machine in the first virtual machine set, alarm identifier information, alarm name information, alarm object type information, alarm type information, alarm generation time information, alarm component type information, alarm component identification information, and alarm component name information.
  • the status alarm information may include fault type information of the faulty device.
  • The type of the impact of the faulty device on the first virtual machine in the first virtual machine set includes one or more of: fault, high risk, medium risk, low risk, or no impact.
  • The level of the impact of the faulty device on the first virtual machine in the first virtual machine set includes urgent, important, or not important.
  • the processing operation includes at least one of the following manners:
  • the service management system switches the service application associated with the at least one first virtual machine to a virtual machine that is not affected by the faulty device; or
  • the service management system marks the application state information of the at least one first virtual machine as an isolated state, where the isolated state indicates that the at least one first virtual machine stops executing the service application associated with it; or
  • the service management system sends a first request message to the virtual infrastructure management system, where the first request message indicates the virtual machine to be restored, and the virtual machine to be restored is a subset of the first virtual machine set; or
  • the service management system sends a status alarm message to the control node of the service application associated with the at least one first virtual machine, so that the control node, according to the status alarm message, either switches the service application associated with the at least one first virtual machine to a virtual machine that is not affected by the faulty device, or marks the application state information of the at least one first virtual machine as an isolated state.
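Two of the processing operations listed above, switching the service application to unaffected virtual machines and marking affected virtual machines as isolated, can be sketched as follows. This is a simplified illustration under stated assumptions: `AppState`, the spare virtual machine list, and the decision to isolate only the machines that could not be switched are hypothetical and are not prescribed by the source.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class AppState:
    vms: List[str]                                   # machines executing the application
    isolated: Set[str] = field(default_factory=set)  # machines marked as isolated


def switch_over(app: AppState, affected: Set[str], spares: List[str]) -> None:
    """Move the application from affected machines to unaffected spare machines."""
    remaining = [vm for vm in spares if vm not in affected]
    new_vms = []
    for vm in app.vms:
        if vm in affected and remaining:
            new_vms.append(remaining.pop(0))  # replace with an unaffected machine
        else:
            new_vms.append(vm)                # keep as is (healthy, or no spare left)
    app.vms = new_vms


def isolate(app: AppState, affected: Set[str]) -> None:
    """Mark affected machines still in use as isolated so they stop executing the application."""
    app.isolated |= {vm for vm in app.vms if vm in affected}


def handle_status_alarm(app: AppState, affected: Set[str], spares: List[str]) -> None:
    # Assumed policy: switch where a spare exists, isolate whatever is left.
    switch_over(app, affected, spares)
    isolate(app, affected)
```

A control node that receives the forwarded status alarm message could apply the same two functions to the service application it manages.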
  • the fault processing method further includes: the service management system determining the first request message according to the impact information of the first virtual machine set.
  • The service management system may determine, according to the impact information of the first virtual machine set, the recovery priority of the virtual machines in the first virtual machine set that the virtual infrastructure management system is to restore, and send to the virtual infrastructure management system a first request message indicating that recovery priority, so that the virtual infrastructure management system can perform recovery processing on at least one virtual machine in the first virtual machine set affected by the faulty device according to the priority indicated by the service management system.
  • A specific implementation in which the service management system sends to the virtual infrastructure management system a first request message indicating the recovery priority of the virtual machine to be restored may be: the service management system sends the first request message to the virtual infrastructure management system based on the priority of the service application.
  • In other words, the service management system instructs the virtual infrastructure management system to restore the virtual machine to be restored in the first virtual machine set according to the priority of the service application associated with the first virtual machine in the first virtual machine set, that is, the priority of the service application affected by the faulty device. In this way the high-priority service application can be restored first, which further ensures the reliability of the service application.
  • the service management system may send the first request message to the virtual infrastructure management system according to the impact information of the first virtual machine set and the priority of the associated service application.
  • Another specific implementation in which the service management system sends the first request message to the virtual architecture management system is: the service management system sends the first request message to the virtual infrastructure management system according to the deployment mode of the service application, where the deployment mode of the service application includes at least one of a primary/standby mode, a load sharing mode, and a single virtual machine mode.
  • the service management system instructs the virtual infrastructure management system to perform recovery processing on the virtual machine to be restored in the first virtual machine set according to the deployment mode of the service application, that is, according to the deployment mode of the service application affected by the faulty device.
  • The service management system may send the first request message to the virtual architecture management system according to the impact information of the first virtual machine set and the deployment mode of the service application; or according to the deployment mode of the service application and the priority of the service application; or according to the impact information of the first virtual machine set, the deployment mode of the service application, and the priority of the service application.
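One way to combine the three criteria named above (impact information, service application priority, and deployment mode) is a composite sort key. The sketch below is an assumption-heavy illustration: the source does not define how the criteria are weighted, so the ranking tables, the convention that a smaller `app_priority` means a more important application, and the idea that a single virtual machine deployment is restored before redundant deployments are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Assumed rankings; lower rank means "restore earlier".
IMPACT_RANK = {"fault": 0, "high risk": 1, "medium risk": 2, "low risk": 3, "no impact": 4}
DEPLOYMENT_RANK = {"single_vm": 0, "primary_standby": 1, "load_sharing": 2}


@dataclass(frozen=True)
class AffectedVm:
    vm_id: str
    impact_type: str      # one of the keys of IMPACT_RANK
    app_priority: int     # assumption: smaller value = more important application
    deployment_mode: str  # one of the keys of DEPLOYMENT_RANK


def build_first_request(vms: List[AffectedVm]) -> List[str]:
    """Return the identifiers of the machines to be restored, most urgent first."""
    # Machines with no impact need no recovery, so the result is a subset.
    candidates = [vm for vm in vms if vm.impact_type != "no impact"]
    ordered = sorted(
        candidates,
        key=lambda vm: (
            vm.app_priority,
            IMPACT_RANK[vm.impact_type],
            DEPLOYMENT_RANK[vm.deployment_mode],
        ),
    )
    return [vm.vm_id for vm in ordered]
```

Dropping an element from the sort key gives the variants that use only one or two of the criteria.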
  • the fault processing method further includes: the service management system receives a status alarm clear message sent by the virtual infrastructure management system; and the service management system clears the related status alarm message received before according to the status alarm clear message.
  • The service management system can clear the previously received related status alarm message according to the status alarm clear message sent by the virtual infrastructure management system, so as to avoid analyzing and processing a status alarm message related to a virtual machine that has already been restored.
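The clearing step can be sketched as a small store of active status alarms on the service management system side. This is a hypothetical sketch: the class name and the assumption that the status alarm clear message refers to the original alarm by its alarm identifier are not stated in the source.

```python
from typing import Dict, List


class StatusAlarmStore:
    """Keeps the status alarm messages that still need analysis."""

    def __init__(self) -> None:
        self._active: Dict[str, List[str]] = {}

    def record(self, alarm_id: str, vm_ids: List[str]) -> None:
        """Store a received status alarm message."""
        self._active[alarm_id] = list(vm_ids)

    def clear(self, alarm_id: str) -> bool:
        """Apply a status alarm clear message; return True if an alarm was removed."""
        return self._active.pop(alarm_id, None) is not None

    def vms_needing_attention(self) -> List[str]:
        """Machines that still appear in at least one uncleared alarm."""
        return sorted({vm for vms in self._active.values() for vm in vms})
```

Once an alarm is cleared, its virtual machines no longer show up in `vms_needing_attention()` unless another active alarm still lists them.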
  • the present invention provides a service management system, the service management system comprising various modules for performing a fault processing method in any of the possible implementations of the fifth aspect or the fifth aspect.
  • the present invention provides a service management system including a processor, a memory, a communication interface, and a bus.
  • the processor, the memory, and the communication interface communicate through a bus, and may also implement communication by other means such as wireless transmission.
  • The memory is for storing instructions, and the processor is for executing the instructions stored in the memory.
  • The memory stores program code, and the processor can call the program code stored in the memory to perform the fault processing method in the fifth aspect or any possible implementation of the fifth aspect.
  • The present invention provides a computer readable medium storing program code for execution by a service management system, the program code comprising instructions for performing the fault handling method in the fifth aspect or any possible implementation of the fifth aspect.
  • The present invention provides a virtualized computer system, including a virtual management node and a service management node. The virtual management node is configured to perform the fault processing method in the first aspect or any possible implementation of the first aspect, and the service management node is configured to perform the fault processing method in the fifth aspect or any possible implementation of the fifth aspect.
  • FIG. 1A is a schematic system configuration diagram of a fault processing method to which an embodiment of the present invention is applied.
  • FIG. 1B is another schematic system structural diagram of a fault processing method to which an embodiment of the present invention is applied.
  • FIG. 2 is a schematic flowchart of a fault processing method according to an embodiment of the present invention.
  • FIG. 3 is a schematic flowchart of a fault processing method according to another embodiment of the present invention.
  • FIG. 4 is a schematic flowchart of a fault processing method according to another embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a virtual architecture management system according to an embodiment of the present invention.
  • FIG. 6 is a schematic structural diagram of a service management system according to an embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram of a virtual architecture management system according to another embodiment of the present invention.
  • FIG. 8 is a schematic structural diagram of a service management system according to another embodiment of the present invention.
  • The components shown in FIG. 1A and FIG. 1B may be hardware, functionally divided software, or a combination of the two.
  • the hardware resources 110 may include one or more devices, and each device may be a hardware device resource such as an X86 server, a storage device, or a network device, and may be used to provide hardware functions such as computing, storage, and networking.
  • the virtualization layer 120 virtualizes hardware resources such as computing, storage, and network through virtualization technology.
  • the virtualization technology may use Xen, HyperV, or KVM, and the present invention is not limited.
  • the virtual resource (Virtual Resources) 130 refers to a virtual resource formed by virtualizing the hardware resource 110 by a virtualization technology, such as virtual computing, virtual networking, virtual storage, and the like.
  • the hardware resource 110, the virtualization layer 120, and the virtual resource 130 can be a virtualized infrastructure layer, and provide an infrastructure layer such as a virtual resource or a virtual resource pool for the upper layer service.
  • One or more business application functions are deployed in the business system 140, and each business application is deployed on one or more virtual machines, that is, these virtual machines are used to execute business applications.
  • the virtual machine is deployed on a device in hardware resource 110.
  • Each business application has a corresponding control node.
  • the control node is used to manage the corresponding business application.
  • a control node can also be referred to as an arbitration node.
  • the control node can be deployed in the service system, and one control node can separately manage a corresponding service application, as shown in FIG. 1A; one control node can also manage multiple service applications, as shown in FIG. 1B.
  • a control node may refer to a hardware device used to manage a corresponding service application, or a virtual machine in a plurality of virtual machines running by a service application.
  • The virtualized infrastructure management system 150 manages the virtualized infrastructure. It is responsible for unified management, monitoring, resource scheduling, and fault handling of the physical hardware (i.e., hardware resources 110) and of the virtual machines deployed on the hardware resources 110, provides resource support for business system operation, and provides open interfaces. The virtualized infrastructure management system 150 may also be regarded as part of the virtualization layer.
  • the service management system 160 is configured to manage a service application running on a virtual machine, such as creating a service application, issuing a service application, scheduling a virtual resource in a service application, and closing a service application.
  • a business management system can manage one or more business applications.
  • the service management system invokes the interface provided by the virtual infrastructure management system to provide resources for the operation of the service application, and implements service application release and deployment.
  • the service management system 160 and the service system 140 are collectively referred to as an application layer.
  • The service management system 160 and the service system 140 may be logically separate systems, as shown in FIG. 1A and FIG. 1B, or the functions of both may be implemented by one system.
  • the following detailed description of the embodiments of the present invention is specifically described by taking the virtualized computer system shown in FIG. 1A as an example.
  • the service system runs on the virtual machine in the virtual resource 130.
  • The service system does not need to care about the specific hardware device, and does not need to know which hardware device the virtual machine hosting the service application runs on. The service management system and the service system do not need to directly sense the devices or the impact of a faulty device on the business application.
  • The present invention proposes a new fault processing method, a virtual architecture management system, a service management system, and a virtualized computer system, so that the service management system does not directly sense the devices or the impact of a device failure on the business application, but instead learns from the virtual architecture management system the impact of the device failure on the virtual machine. The affected business applications can thus be identified quickly and processed quickly.
  • the fault processing method of the embodiment of the present invention is described in detail below by taking the virtualized computer system shown in FIG. 1A as an example.
  • FIG. 2 is a schematic flowchart of a fault processing method according to an embodiment of the present invention. It should be understood that FIG. 2 illustrates the steps or operations of the fault handling method, but these steps or operations are merely examples, and other embodiments of the present invention may also perform other operations or variations of the various operations in FIG. 2. Moreover, the various steps in FIG. 2 may be performed in a different order than that presented in FIG. 2, and it is possible that not all operations in FIG. 2 are to be performed.
  • the virtual architecture management system acquires a fault alarm message, where the fault alarm message carries the identifier information of the faulty device and the fault type.
  • the faulty device may be any one or more of the hardware resources 110 shown in FIG. 1A, and the fault type includes a whole machine fault or a partial hardware fault.
  • the fault type may be a fault of the X86 server, or at least one hardware fault in the CPU, memory, network card, or disk of the X86 server.
  • When a device such as a server or a storage device fails, the virtual infrastructure management system can obtain the fault alarm message of the faulty device in multiple manners or through multiple protocols.
  • For example, the faulty device can report its fault alarm message to the virtual infrastructure management system through the Simple Network Management Protocol (SNMP), or the virtual infrastructure management system can query the fault alarm message of the faulty device through a Representational State Transfer (REST) interface.
  • the virtual architecture management system determines, according to the fault alarm message of the faulty device, the first set of virtual machines, where the first set of virtual machines includes at least one first virtual machine affected by the faulty device.
  • the virtual infrastructure management system determines the first virtual machine set affected by the faulty device according to the fault alarm message.
  • A specific implementation of determining the first virtual machine set according to the fault alarm message may be: according to the identifier information of the faulty device and the fault type, the virtual architecture management system queries its database for information about all or some of the virtual machines that are deployed on the faulty device and affected by the fault.
  • each virtual machine in the affected virtual machine may be referred to as a first virtual machine, and all first virtual machines constitute a first virtual machine set.
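  • The database query described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `FaultAlarm` class, the in-memory `INVENTORY` table standing in for the database, and the fault type strings are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultAlarm:
    device_id: str   # identifier information of the faulty device
    fault_type: str  # "whole_machine" or a component name such as "nic"


# Hypothetical stand-in for the management database:
# device -> list of (vm_id, hardware components the VM depends on).
INVENTORY = {
    "node1": [("VM1", {"cpu", "memory", "nic"}), ("VM5", {"cpu", "memory"})],
    "node2": [("VM2", {"cpu", "memory", "nic"})],
}


def first_vm_set(alarm, inventory=INVENTORY):
    """Return the set of VM ids affected by the fault reported in `alarm`."""
    vms = inventory.get(alarm.device_id, [])
    if alarm.fault_type == "whole_machine":
        # A whole-machine fault affects every VM deployed on the device.
        return {vm for vm, _ in vms}
    # A partial hardware fault affects only the VMs that use that component.
    return {vm for vm, parts in vms if alarm.fault_type in parts}
```

  • With this table, a whole-machine fault on `node1` yields both of its virtual machines, while a `nic` fault yields only the one that uses the network card.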
  • S230 The virtual architecture management system sends a status alarm message to the service management system, where the status alarm message carries information of the first virtual machine set.
  • the virtual infrastructure management system may send the status alarm message to the service management system at one time, or may send the status alarm message to the service management system multiple times.
  • the virtual architecture management system may also generate a single status alarm message for all the affected virtual machines, that is, for all the first virtual machines in the first virtual machine set; this is not limited by the present invention.
  • the service management system may store the status alarm message, such as recording or saving the status alarm message in a database of the service management system.
  • the service management system determines, according to the status alarm message of the first virtual machine set, a service application associated with at least one first virtual machine in the first virtual machine set.
  • After receiving the status alarm message of the first virtual machine set sent by the virtual infrastructure management system, the service management system associates the status alarm information with the service applications to identify the specific affected service applications.
  • A specific implementation may be: according to the information of the affected first virtual machines carried in the status alarm message of the first virtual machine set, the service management system queries the correspondence between the first virtual machines and the service applications from its database or configuration file, and thereby identifies the specific affected service applications.
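  • The lookup from affected virtual machines to service applications might look like the sketch below. The `VM_TO_APP` dictionary is a hypothetical stand-in for the database or configuration file of the service management system; its contents are illustrative only.

```python
# Hypothetical correspondence between virtual machines and service applications.
VM_TO_APP = {"VM1": "App1", "VM2": "App1", "VM3": "App2", "VM4": "App2"}


def affected_apps(first_vm_ids, vm_to_app=VM_TO_APP):
    """Group the affected VM ids by the service application they belong to."""
    apps = {}
    for vm in first_vm_ids:
        app = vm_to_app.get(vm)
        if app is not None:  # VMs without a known application are skipped
            apps.setdefault(app, []).append(vm)
    return apps
```

  • Grouping by application keeps the affected virtual machines alongside each application, which is the information a control node would need later.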
  • the service management system performs a processing operation on the service application associated with the at least one first virtual machine in the first virtual machine set.
  • An implementation in which the service management system performs a processing operation on the service application associated with the first virtual machine in the first virtual machine set may be: the service management system sends the information of the first virtual machine set to the control node corresponding to the service application. The information of the first virtual machine set is used to instruct the control node to perform recovery processing on the service application.
  • performing, by the service management system, a processing operation on the service application associated with the at least one first virtual machine in the first virtual machine set includes at least one of the following manners:
  • Manner 1 The service management system switches the service application associated with the at least one first virtual machine to the virtual machine that is not affected by the faulty device.
  • Manner 2 The service management system identifies the application state information of the at least one first virtual machine as an isolated state, where the isolated state is used to instruct the at least one first virtual machine to stop executing the service application associated with it, thereby isolating the affected virtual machines within the service application.
  • Manner 3 The service management system sends a first request message to the virtual infrastructure management system, where the first request message is used to indicate the virtual machine to be restored, and the virtual machine to be restored is a subset of the first virtual machine set.
  • Manner 4 The service management system sends a status alarm message to the control node of the service application associated with the at least one first virtual machine, so that the control node, according to the status alarm message, either switches the service application associated with the at least one first virtual machine to a virtual machine that is not affected by the faulty device, or identifies the application state information of the at least one first virtual machine as the isolated state.
  • In the embodiment of the present invention, after obtaining the fault alarm message of the faulty device, the virtual infrastructure management system directly analyzes and processes the fault alarm message, determines the one or more virtual machines affected by the faulty device, and sends the information of these virtual machines to the service management system.
  • The service management system can then determine the affected service applications directly from the information of these virtual machines and process them.
  • That is, the virtual architecture management system determines the information of the virtual machines affected by the faulty device directly from the fault alarm message of the faulty device, so that the service management system can determine the affected service applications directly from the status alarm message of the first virtual machine set, instead of first deriving the affected virtual machines from the alarm message of the faulty device and then deriving the affected service applications. Therefore, the service management system does not need to sense the hardware fault directly, and can quickly trigger processing of the affected service applications, reduce service loss, and improve the reliability of the service applications.
  • the virtual architecture management system may determine the impact information of the first virtual machine set according to the fault alarm message of the faulty device, where the impact information is used to indicate the type and/or level of the impact of the faulty device on at least one first virtual machine in the first virtual machine set.
  • the status alarm message may further carry the impact information.
  • Correspondingly, the status alarm message received by the service management system from the virtual architecture management system may carry the impact information.
  • the service management system then performs a processing operation on the service application associated with the first virtual machine in the first virtual machine set according to the impact information of the first virtual machine set.
  • the user can define the type and/or level of the virtual machine affected by the failure of the faulty device according to requirements.
  • the following is an example of the type and level of the virtual machine affected by the faulty device in the embodiment of the present invention.
  • When a physical server fault (including power-off of the physical server, a host operating system fault, or another fault that leaves the server unable to provide computing resources), a storage device fault (such as power-off of the storage device or interruption of all its links), or another hardware fault causes the virtual machine to fail, the affected type of the virtual machine can be set to fault and the level can be set to urgent. For a NIC or other hardware failure, if the virtual machine cannot work properly, the affected type of the virtual machine can likewise be set to fault and the level to urgent.
  • When a component failure occurs in a physical server, such as a failure of a central processing unit (CPU), memory, or part of the network adapters, the affected type of the virtual machine can be set to high risk and the level can be set to important.
  • When a component failure occurs on the storage device, such as partial link interruption or partial controller failure, the affected type of the virtual machine can be set to medium risk and the level can be set to secondary.
  • the type of the affected virtual machine can be set to fault, and the level can be set to urgent.
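  • The example classifications above can be condensed into a small rule function. This is only a sketch of one possible user-defined policy: the function name, the argument values, and the level label `secondary` (our reading of the third example) are assumptions, and the source states that users may define types and levels as required.

```python
def classify_impact(device_kind, fault_scope, vm_can_work=True):
    """Return (impact type, impact level) for a VM on a faulty device.

    device_kind: "server" or "storage"
    fault_scope: "whole" for a whole-device fault, "component" for a partial one
    vm_can_work: False when the fault leaves the VM unable to work properly
    """
    if fault_scope == "whole" or not vm_can_work:
        # Whole-device faults, or any fault that stops the VM from working.
        return ("fault", "urgent")
    if device_kind == "server":
        # CPU, memory, or partial network adapter failure on a physical server.
        return ("high risk", "important")
    if device_kind == "storage":
        # Partial link interruption or partial controller failure.
        return ("medium risk", "secondary")
    raise ValueError(f"no rule defined for device kind {device_kind!r}")
```

  • A real deployment would load such rules from configuration rather than hard-code them, since the mapping is meant to be adjustable.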
  • the status alarm message sent by the virtual infrastructure management system to the service management system may include information of the first virtual machine set, that is, the identifier of the at least one first virtual machine that is affected.
  • the impact information of the first set of virtual machines may also be included, ie the type and/or level of impact of the faulty device on at least one of the first virtual machines in the first set of virtual machines.
  • the status alarm message may further include a generation time, a clearing time, an alarm synchronization number, an alarm name, an alarm object type, and the like.
  • the status alarm message of the virtual machine may also carry information such as the fault cause of the faulty device.
  • the information included in the status alarm message of the virtual machine is not limited to the contents listed above.
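  • One way to picture the contents of a status alarm message is as a record with one mandatory field and several optional ones. The class below is a hypothetical sketch; the field names and the JSON encoding are assumptions, since the source does not specify a wire format.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class StatusAlarm:
    vm_ids: List[str]                   # identifiers of the affected first VMs
    impact_type: Optional[str] = None   # e.g. "fault", "high risk"
    impact_level: Optional[str] = None  # e.g. "urgent", "important"
    generated_at: Optional[str] = None  # generation time
    cleared_at: Optional[str] = None    # clearing time
    sync_number: Optional[int] = None   # alarm synchronization number
    alarm_name: Optional[str] = None
    object_type: Optional[str] = None   # alarm object type
    fault_cause: Optional[str] = None   # fault cause of the faulty device

    def to_json(self):
        """Serialize the alarm, omitting optional fields that are unset."""
        body = {k: v for k, v in asdict(self).items() if v is not None}
        return json.dumps(body, sort_keys=True)
```

  • Only `vm_ids` is required here, mirroring the text: the virtual machine information is always carried, and everything else is optional.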
  • In the method described above, the virtual architecture management system only analyzes and processes the alarm message of the faulty device and provides the analyzed information to the service management system; the virtual architecture management system may, however, perform further fault processing afterwards.
  • the service management system may invoke an interface provided by the virtual architecture management system to request the virtual infrastructure management system to process the affected virtual machines. Specifically, the service management system may determine, according to the impact information of the first virtual machine set, a first request message that indicates the virtual machines to be restored preferentially, where the virtual machines to be restored are a subset of the first virtual machine set. The service management system then sends the first request message to the virtual infrastructure management system.
  • In other words, the service management system determines, according to the impact information of the first virtual machine set, the priority with which the virtual infrastructure management system should restore the virtual machines in the first virtual machine set, and sends the virtual architecture management system the first request message indicating the recovery priority of the virtual machines to be restored, so that the virtual infrastructure management system can perform recovery processing on at least one virtual machine in the first virtual machine set affected by the faulty device according to the priority indicated by the service management system.
  • Another specific implementation in which the service management system sends the virtual infrastructure management system the first request message indicating the recovery priority of the virtual machines to be restored may be: the service management system sends the first request message to the virtual infrastructure management system according to the priority of the service application.
  • That is, according to the priority of the service application associated with the first virtual machine in the first virtual machine set, in other words the priority of the service application affected by the faulty device, the service management system instructs the virtual infrastructure management system to perform recovery processing on the virtual machines to be restored in the first virtual machine set, which ensures that high-priority service applications are restored first and further ensures the reliability of the service applications.
  • the service management system may instruct the virtual machine architecture management system to preferentially restore the first virtual machine with a higher priority in the first virtual machine set by using the first request message.
  • the service management system may send the first request message to the virtual infrastructure management system according to the impact information of the first virtual machine set and the priority of the associated service application.
  • Another specific implementation in which the service management system sends the first request message to the virtual infrastructure management system is: the service management system sends the first request message to the virtual architecture management system according to the deployment mode of the service application, where the deployment mode of the service application includes at least one of an active/standby mode, a load sharing mode, and a single virtual machine mode.
  • the service management system instructs the virtual infrastructure management system to perform recovery processing on the virtual machine to be restored in the first virtual machine set according to the deployment mode of the service application, that is, according to the deployment mode of the service application affected by the faulty device.
  • the service management system may instruct the virtual infrastructure management system to first restore the primary virtual machine in the active and standby virtual machines of the service application in the active/standby mode.
  • the service management system may send the first request message to the virtual architecture management system according to the impact information of the first virtual machine set and the deployment mode of the service application, according to the deployment mode of the service application and the priority of the service application, or according to the impact information of the first virtual machine set, the deployment mode of the service application, and the priority of the service application.
  • the virtual architecture management system may perform recovery processing on the virtual machine to be restored in the first virtual machine set according to a certain priority according to the indication of the first request message.
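  • The criteria discussed above (impact level, service application priority, and deployment role) can be combined into a single sort key. The ordering below is only one plausible combination chosen for illustration: the dictionary keys, the rank table, and the convention that a lower `app_priority` number means a more important application are all assumptions.

```python
# Assumed ranking of impact levels; a lower rank is recovered earlier.
LEVEL_RANK = {"urgent": 0, "important": 1, "secondary": 2}


def recovery_order(vms):
    """Order VM ids for recovery.

    Each item in `vms` is a dict with keys:
      id, level, app_priority (lower = more important), role.
    """
    def key(vm):
        return (
            LEVEL_RANK.get(vm["level"], len(LEVEL_RANK)),  # unknown levels last
            vm["app_priority"],
            0 if vm["role"] == "primary" else 1,  # primary VM before others
        )

    return [vm["id"] for vm in sorted(vms, key=key)]
```

  • The resulting list could form the body of the first request message, with the virtual architecture management system restoring virtual machines in that order.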
  • the specific recovery form of the virtual machine management system for the virtual machine may be virtual machine migration, that is, the virtual machine is migrated from the faulty device to other normal devices; or the virtual machine snapshot may be used to restore the virtual machine on other normal devices.
  • If the virtual infrastructure management system does not receive, within a preset time threshold, the first request message sent by the service management system to indicate the virtual machines in the first virtual machine set that need to be restored preferentially, it restores the first virtual machines in the first virtual machine set according to a preset virtual machine recovery policy.
  • the virtual infrastructure management system may actively restore at least one first virtual machine in the first virtual machine set according to the preset virtual machine recovery policy.
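  • The timeout fallback can be sketched with a blocking queue. In this illustration, a `queue.Queue` stands in for the channel on which the first request message would arrive, and `preset_policy` is any callable that orders the affected virtual machines; both are assumptions, not part of the source.

```python
import queue


def choose_recovery_plan(requests, first_vm_set, preset_policy, timeout_s):
    """Return the list of VM ids to restore, in order.

    Waits up to `timeout_s` seconds for a first request message on `requests`.
    If none arrives, falls back to `preset_policy(first_vm_set)`.
    """
    try:
        requested = requests.get(timeout=timeout_s)
    except queue.Empty:
        # No request within the threshold: apply the preset recovery policy.
        return preset_policy(first_vm_set)
    # The VMs to be restored must be a subset of the first VM set.
    return [vm for vm in requested if vm in first_vm_set]
```

  • For instance, passing `sorted` as the policy restores virtual machines in identifier order when no request arrives in time.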
  • After restoring a virtual machine, the virtual infrastructure management system can send a status alarm clear message to the service management system to indicate that the service management system can clear the previously received status alarm message corresponding to the processed virtual machine.
  • After receiving the status alarm clear message sent by the virtual infrastructure management system, the service management system can clear the status alarm message of the corresponding virtual machine. This spares the service management system from maintaining alarms that have already been resolved, thereby saving resources and improving efficiency.
  • the service management system clears the status alarm message in a specific form, which may be: deleting the stored status alarm message, or modifying a certain information in the status alarm message, so that the information indicates that the virtual machine corresponding to the status alarm message has been restored.
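  • Both clearing forms described above can be shown on a simple in-memory alarm store. The store layout (a dictionary keyed by virtual machine identifier) and the `recovered` flag are hypothetical names used only for this sketch.

```python
def clear_alarm(store, vm_id, delete=True):
    """Clear the stored status alarm for `vm_id`.

    delete=True  removes the stored status alarm message.
    delete=False keeps it but marks the VM as recovered.
    Returns False when no alarm is stored for the VM.
    """
    if vm_id not in store:
        return False
    if delete:
        del store[vm_id]
    else:
        store[vm_id]["recovered"] = True
    return True
```

  • Marking instead of deleting keeps a history of past alarms, at the cost of having to filter out recovered entries when the store is read.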
  • After determining, according to the information of the first virtual machine set, the service application associated with the first virtual machine in the first virtual machine set, the service management system may send the information of the first virtual machine set to the control node associated with the service application.
  • control node of the service application may process the affected service application according to the information of the first virtual machine in the first virtual machine set.
  • control node of the service application may also process the service application according to the deployment mode of the service application. For example, when the service is deployed in the active/standby mode, if the primary VM fails, the control node needs to perform the active/standby switchover. If the standby VM fails, the control node does not need to perform the active/standby switchover. For example, when the business application is deployed in load sharing mode, the control node isolates the affected VMs.
  • the control node of the service application may process the service application according to the deployment mode of the service application and the impact information of the first virtual machine set. For example, when the impact information of the first virtual machine set indicates that the type of impact on the first virtual machine is fault and the level is urgent, and the service application is deployed in the active/standby mode, the control node needs to perform the active/standby switchover if the primary VM fails. If the standby VM is faulty or the service application is not important, the control node may take no action, that is, it does not need to perform the active/standby switchover. It should be understood that processing the service application according to the type and level of impact on the virtual machine and the deployment mode is only an example; the specific implementation may be defined according to the requirements of the user, which is not limited by the present invention.
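  • The deployment-mode examples given for the control node reduce to a short decision function. The mode and action names below are illustrative labels, not terms from the source, and the single virtual machine mode raises an error because the text gives no rule for it.

```python
def control_action(deployment_mode, failed_vm_role=None):
    """Pick the control node's action for a failed VM.

    Returns "switchover", "isolate", or "none".
    """
    if deployment_mode == "active_standby":
        # Only a primary VM failure requires an active/standby switchover.
        return "switchover" if failed_vm_role == "primary" else "none"
    if deployment_mode == "load_sharing":
        # Affected VMs are isolated from the load-sharing group.
        return "isolate"
    raise ValueError(f"no rule defined for deployment mode {deployment_mode!r}")
```

  • Impact type and level could be added as further inputs, as the paragraph above suggests, but the source leaves those rules to the user.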
  • control node of the service application may send a service processing feedback message to the service management system to notify the service management system of the processing result of the service application.
  • Compute node 1, compute node 2, and compute node 3 may be device 1, device 2, and device 3, respectively, of FIG. 1A or FIG. 1B.
  • Application (APP) App1 is associated with VM1 and VM2 and is deployed in active/standby mode.
  • VM1 is deployed on compute node 1 and is the primary virtual machine of App1.
  • VM2 is deployed on compute node 2 and is the standby virtual machine of App1.
  • App2 is associated with VM3 and VM4 and is deployed in load sharing mode.
  • VM3 is deployed on compute node 2, and VM4 is deployed on compute node 3.
  • When computing node 1 is powered off, the fault alarm message of computing node 1 is reported to the virtual infrastructure management system through SNMP.
  • The virtual infrastructure management system receives the fault alarm message, determines the virtual machine affected by the fault according to the fault alarm message, and generates a status alarm message for the virtual machine.
  • the specific steps are as follows.
  • the virtual architecture management system receives the hardware fault alarm message of computing node 1, queries the list of virtual machines running on computing node 1 from the database of the virtual infrastructure management system, determines that the affected virtual machine is VM1, and obtains the ID and other information of VM1.
  • the virtual architecture management system generates a status alarm message for VM1 that carries: the ID of VM1, the type of impact on VM1 (fault), the generation time, the level of impact on VM1 (urgent), the fault type of the faulty device (whole-machine fault of computing node 1), and so on.
  • the virtual architecture management system sends a status alarm message of the VM1 to the service management system.
  • the service management system receives the status alarm message of the virtual machine sent by the virtual architecture management system, obtains information such as the ID of VM1, queries the correspondence between VM1 and the service applications from the database of the service management system, and determines that the affected service application is App1.
  • the service management system sends a notification message to the control node of App1 to notify it of the failure of VM1.
  • the control node determines, based on the notification message, to promote VM2 to the primary virtual machine.
  • S410 The service management system invokes an interface provided by the virtual architecture management system, and sends a first request message to the virtual architecture management system, requesting the virtual architecture management system to quickly restore the VM1.
  • the virtual infrastructure management system migrates VM1 to the computing node 3, and at this time, the VM1 becomes the standby virtual machine of App1.
  • the virtual infrastructure management system may also perform fault isolation on the computing node 1.
  • the deployment of the application in the service system is as shown in Figure 4.
  • App1 is deployed in active/standby mode
  • VM2 is deployed on compute node 2 as the primary virtual machine
  • VM1 is deployed on compute node 3 as the standby virtual machine
  • App2 is deployed in load sharing mode
  • VM3 is deployed on compute node 2
  • VM4 is deployed on compute node 3.
  • Compute node 1 is faulty and is isolated from the resource pool.
  • In this example, the faulty computing node 1 sends an alarm message to the virtual infrastructure management system, and the virtual infrastructure management system determines, according to the alarm message, that the affected virtual machine is VM1 and determines the type and level of the impact on VM1.
  • the service management system does not need to process the hardware alarm message directly; it obtains the information of the affected VM1 and the impact information of VM1 directly from the virtual architecture management system, and then determines that the service application running on VM1 is App1.
  • The service management system then notifies the control node of App1 to process App1 and requests the virtual infrastructure management system to recover VM1.
  • the virtual infrastructure management system migrates VM1 to compute node 3 according to the request of the service management system.
  • After acquiring the information of VM1 and the impact information of VM1 from the service management system, the control node of App1 switches the original standby virtual machine VM2 of App1 to the primary virtual machine and sets VM1, now migrated to computing node 3, as the standby virtual machine, which keeps App1 running and improves the reliability of App1.
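  • The walkthrough above can be replayed as a small simulation. The function below is a toy model written for this example only: it folds the actions of the virtual infrastructure management system (migration) and of the App1 control node (role switch) into one function, and all names are hypothetical.

```python
def fail_over(placement, app1_roles, failed_node, spare_node):
    """Simulate the handling of a whole-node fault for App1.

    placement:  VM id -> compute node
    app1_roles: VM id -> "primary" or "standby" (App1 VMs only)
    Returns (affected VM ids, new placement, new roles).
    """
    placement, roles = dict(placement), dict(app1_roles)
    affected = sorted(vm for vm, node in placement.items() if node == failed_node)
    for vm in affected:
        if roles.get(vm) == "primary":
            # Control node: promote the standby VM, demote the failed primary.
            for other, role in list(roles.items()):
                if role == "standby":
                    roles[other] = "primary"
            roles[vm] = "standby"
        # Virtual infrastructure management system: migrate the affected VM.
        placement[vm] = spare_node
    return affected, placement, roles
```

  • Starting from the initial deployment and failing compute node 1 with compute node 3 as the target, the result places VM1 on compute node 3 as the standby and makes VM2 the primary, matching the deployment listed after the fault.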
  • FIG. 5 is a schematic structural diagram of a virtual architecture management system according to an embodiment of the present invention. It should be understood that the virtual architecture management system 500 illustrated in FIG. 5 is merely an example, and the virtual architecture management system of the embodiment of the present invention may further include other modules or units, or include modules similar in function to the respective modules in FIG. 5, or It is not necessary to include all the modules in Figure 5.
  • the obtaining module 510 is configured to obtain a fault alarm message, where the fault alarm message carries the identifier information of the faulty device and the fault type.
  • the determining module 520 is configured to determine, according to the fault alarm message, a first set of virtual machines, where the first set of virtual machines includes at least one first virtual machine that is affected by the faulty device.
  • the sending module 530 is configured to send a status alarm message to the service management system, where the status alarm message carries information of the first virtual machine set.
  • In the embodiment of the present invention, after obtaining the fault alarm message of the faulty device, the virtual infrastructure management system directly analyzes and processes the fault alarm message, determines the one or more virtual machines affected by the faulty device, and sends the information of these virtual machines to the service management system, so that the service management system can determine the affected service applications directly from the information of the virtual machines and then process them.
  • That is, the virtual architecture management system determines the information of the virtual machines affected by the faulty device directly from the fault alarm message of the faulty device, so that the service management system can determine the affected service applications directly from the status alarm message of the first virtual machine set, instead of first deriving the affected virtual machines from the alarm message of the faulty device and then deriving the affected service applications. Therefore, the service management system does not need to sense the hardware fault directly, and can quickly trigger processing of the affected service applications, reduce service loss, and improve the reliability of the service applications.
  • the determining module is further configured to determine impact information of the first virtual machine set according to the fault alarm message, where the impact information is used to indicate the type and/or level of the impact of the faulty device on the at least one first virtual machine. The status alarm message sent by the virtual infrastructure management system to the service management system further carries the impact information of the first virtual machine set.
  • In the embodiment of the present invention, the virtual architecture management system may determine, according to the fault alarm information of the faulty device, the type and/or level of the impact of the fault on the virtual machines, and the status alarm message it sends to the service management system further carries the impact information indicating the type and/or level of the impact of the faulty device on the first virtual machine in the first virtual machine set, so that the service management system or the service system can process the service application according to the impact information, further improving the reliability of the service application.
  • the type of the impact that the faulty device generates on the at least one first virtual machine includes at least one of the following: fault, high risk, medium risk, low risk, or no impact.
  • the virtual architecture management system further includes a receiving module and a recovery module.
  • the receiving module is configured to receive a first request message sent by the service management system, where the first request message is used to indicate the virtual machines to be restored preferentially, and the virtual machines to be restored are a subset of the first virtual machine set.
  • the recovery module is configured to preferentially restore the virtual machines to be restored according to the first request message.
  • the virtual architecture management system may perform recovery processing on at least one virtual machine in the first virtual machine set affected by the fault of the faulty device according to the priority indicated by the service management system according to the request of the service management system.
  • the recovery module is further configured to: when the first request message sent by the service management system is not received within a preset time threshold, recover the at least one first virtual machine according to a preset virtual machine recovery policy.
  • The embodiment of the present invention ensures that, when the service management system provides no information indicating how the virtual architecture management system should recover the first virtual machines, the virtual infrastructure management system can actively restore the first virtual machines in the first virtual machine set according to the pre-configured recovery policy.
  • the sending module is further configured to send a status alarm clear message to the service management system, where the status alarm clear message is used to instruct the service management system to clear the status alarm message.
  • In the embodiment of the present invention, after the virtual infrastructure management system performs recovery processing on the virtual machine, it sends a status alarm clear message to the service management system, so that the service management system can clear the related status alarm message received earlier according to the status alarm clear message, which spares the service management system from analyzing and processing status alarm messages related to the restored virtual machine.
  • the virtual architecture management system 500 of the embodiment of the present invention may be implemented by an Application Specific Integrated Circuit (ASIC) or a Programmable Logic Device (PLD), and the PLD may be a Complex Programmable Logic Device (CPLD), a Field-Programmable Gate Array (FPGA), or Generic Array Logic (GAL).
  • the virtual infrastructure management system 500 illustrated in FIG. 5 may correspond to the virtual infrastructure management system in the fault processing method illustrated in FIG. 2, and the above and other operations and/or functions of the respective units in the virtual infrastructure management system 500 are intended to implement the corresponding processes of the fault processing method in FIG. 2. For brevity, details are not described herein again.
  • FIG. 6 is a schematic structural diagram of a service management system according to an embodiment of the present invention. It should be understood that the service management system 600 shown in FIG. 6 is only an example, and the service system of the embodiment of the present invention may further include other modules or units, or include modules similar to those of the modules in FIG. 6, or may not include All the modules in Figure 6.
  • the receiving module 610 is configured to receive a status alarm message sent by the virtual infrastructure management system, where the status alarm message carries information about a first virtual machine set affected by the faulty device, and the first virtual machine set includes at least one first virtual machine.
  • the determining module 620 is configured to determine, according to the status alarm message, a service application associated with the at least one first virtual machine.
  • the processing module 630 is configured to perform a processing operation on the service application associated with the at least one first virtual machine.
  • In the embodiment of the present invention, after the service management system receives, from the virtual infrastructure management system, the information of the virtual machines in the first virtual machine set affected by the faulty device, it can determine the affected service applications directly from the information of the virtual machines and then process them.
  • That is, the service management system can determine the affected service applications directly from the status alarm message of the first virtual machine set, instead of first deriving the affected virtual machines from the alarm message of the faulty device and then deriving the affected service applications. Therefore, the service management system does not need to sense the hardware fault directly, and can quickly trigger processing of the affected service applications, reduce service loss, and improve the reliability of the service applications.
  • the status alarm message further carries the impact information of the first virtual machine set, where the impact information is used to indicate the impact of the faulty device on the at least one first virtual machine.
  • the processing module is specifically configured to perform a processing operation on the service application associated with the at least one first virtual machine according to the impact information of the first virtual machine set.
  • The status alarm message of the first virtual machine set that the service management system receives from the virtual infrastructure management system also carries impact information indicating the type and/or level of the impact of the faulty device on the first virtual machines in the first virtual machine set, so that the service management system or the service system can process the service application more appropriately according to the impact information, further improving the reliability of the service application.
  • The type of the impact on the first virtual machine set includes at least one of the following: fault, high risk, medium risk, low risk, or no impact.
  • the processing operation includes at least one of the following manners:
  • the service management system switches the service application associated with the at least one first virtual machine to a virtual machine that is not affected by the faulty device for execution; or
  • the service management system marks the application state information of the at least one first virtual machine as an isolated state, where the isolated state is used to instruct the at least one first virtual machine to stop executing the service application associated with the at least one first virtual machine; or
  • the service management system sends a first request message to the virtual infrastructure management system, where the first request message is used to indicate a virtual machine to be restored, and the virtual machine to be restored is a subset of the first virtual machine set; or
  • the service management system sends the status alarm message to a control node of the service application associated with the at least one first virtual machine, so that the control node, according to the status alarm message, switches the service application associated with the at least one first virtual machine to a virtual machine that is not affected by the faulty device for execution, or marks the application state information of the at least one first virtual machine as the isolated state.
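The four processing operations listed above could be selected as in the following sketch. The text does not say which operation applies to which situation, so the selection policy here is purely an assumption for illustration, as are the names `Impact`, `Operation`, and `choose_operation`.

```python
from enum import Enum


class Impact(Enum):
    FAULT = "fault"
    HIGH_RISK = "high risk"
    MEDIUM_RISK = "medium risk"
    LOW_RISK = "low risk"
    NO_IMPACT = "no impact"


class Operation(Enum):
    SWITCH_OVER = "switch the application to an unaffected VM"
    ISOLATE = "mark the VM application state as isolated"
    REQUEST_RECOVERY = "send a first request message to the infrastructure manager"
    FORWARD_TO_CONTROL_NODE = "forward the status alarm to the control node"
    NONE = "no action"


def choose_operation(impact, has_unaffected_vm, has_control_node):
    """Assumed policy: pick one of the listed operations from the impact type."""
    if impact is Impact.NO_IMPACT:
        return Operation.NONE
    if has_control_node:
        # Let the application's own control node decide between switch-over and isolation.
        return Operation.FORWARD_TO_CONTROL_NODE
    if impact in (Impact.FAULT, Impact.HIGH_RISK):
        return Operation.SWITCH_OVER if has_unaffected_vm else Operation.ISOLATE
    # Medium or low risk: keep running and ask for the VM to be restored.
    return Operation.REQUEST_RECOVERY
```

Since the text allows "at least one" of the operations, a real system might combine several of them, for example isolating a virtual machine and also requesting its recovery.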
  • The determining module is further configured to determine a first request message according to the impact information of the first virtual machine set, where the first request message is used to indicate the virtual machine to be restored that needs to be restored preferentially, and the virtual machine to be restored is a subset of the first virtual machine set.
  • the service management system further includes a sending module, configured to send the first request message to the virtual infrastructure management system.
  • The service management system may determine, according to the impact information of the first virtual machine set, the priority of the virtual machines in the first virtual machine set that need to be restored by the virtual infrastructure management system, and send to the virtual infrastructure management system a first request message indicating the recovery priority of these virtual machines, so that the virtual infrastructure management system can restore at least one virtual machine in the first virtual machine set affected by the fault of the faulty device according to the priority indicated by the service management system.
  • the sending module is further configured to send the first request message to the virtual architecture management system according to a priority of the service application that is associated by the at least one first virtual machine.
  • The service management system instructs the virtual infrastructure management system to restore the virtual machines to be restored in the first virtual machine set according to the priority of the service applications associated with the first virtual machines in the first virtual machine set, that is, according to the priority of the service applications affected by the faulty device. High-priority service applications can thus be restored first, which further ensures the reliability of the service applications.
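One way to turn application priority and impact information into a first request message is sketched below. The message shape (`vms_to_restore`), the numeric priorities, and the tie-breaking by impact severity are assumptions made for illustration; the text only states that priority is taken into account.

```python
# Lower number = restore sooner. The ranking of impact types is an assumption.
SEVERITY = {"fault": 0, "high risk": 1, "medium risk": 2, "low risk": 3}


def build_first_request(affected, app_priority):
    """Order the affected VMs for recovery.

    affected:     dict mapping vm_id -> (impact_type, application_name)
    app_priority: dict mapping application_name -> priority (0 is highest)
    """
    candidates = [
        (app_priority.get(app, 99), SEVERITY[impact], vm_id)
        for vm_id, (impact, app) in affected.items()
        if impact in SEVERITY  # VMs with "no impact" need no recovery
    ]
    # Sort by application priority first, then by impact severity, then by VM id.
    return {"vms_to_restore": [vm_id for _, _, vm_id in sorted(candidates)]}


request = build_first_request(
    {
        "vm-1": ("low risk", "portal"),
        "vm-2": ("fault", "billing"),
        "vm-3": ("no impact", "billing"),
        "vm-4": ("high risk", "billing"),
    },
    {"billing": 0, "portal": 1},
)
```

The resulting list is a subset of the first virtual machine set, as the text requires, because unaffected virtual machines are filtered out.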
  • The sending module is further configured to send the first request message to the virtual infrastructure management system according to a deployment mode of the service application associated with the at least one first virtual machine, where the deployment mode of the service application associated with the at least one first virtual machine includes at least one of an active/standby mode, a load sharing mode, and a single virtual machine mode.
  • The service management system instructs the virtual infrastructure management system to restore the virtual machines to be restored in the first virtual machine set according to the deployment mode of the service application, that is, according to the deployment mode of the service application affected by the faulty device.
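How the deployment mode might influence recovery can be illustrated with a small scoring function. The scoring rule is an assumption, not something the text specifies: an application with no redundancy left is treated as most urgent, and a load-sharing application is scored by the fraction of its virtual machines that are affected.

```python
def recovery_urgency(mode, affected_vms, total_vms):
    """Return an urgency score in [0, 1]; higher means restore sooner (assumed policy)."""
    if total_vms <= 0 or not 0 <= affected_vms <= total_vms:
        raise ValueError("affected_vms must be between 0 and total_vms")
    if mode == "single_vm":
        # No redundancy: any affected VM takes the whole application down.
        return 1.0 if affected_vms else 0.0
    if mode == "active_standby":
        if affected_vms == 0:
            return 0.0
        # One healthy VM can still carry the service; redundancy is lost, though.
        return 1.0 if affected_vms == total_vms else 0.5
    if mode == "load_sharing":
        # Capacity degrades in proportion to the number of affected VMs.
        return affected_vms / total_vms
    raise ValueError(f"unknown deployment mode: {mode}")
```

A score like this could be combined with the application priority when the first request message is built.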
  • The receiving module is further configured to receive a status alarm clear message sent by the virtual infrastructure management system, and the processing module is further configured to clear the status alarm message according to the status alarm clear message.
  • The service management system can clear the previously received related status alarm message according to the status alarm clear message sent by the virtual infrastructure management system, which avoids analyzing and processing status alarm messages related to virtual machines that have already been restored.
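The clearing behaviour can be sketched with a small in-memory store. The class `AlarmStore` and the idea of correlating an alarm and its clear message through a shared alarm identifier are assumptions for illustration.

```python
class AlarmStore:
    """Keeps active status alarms until a matching clear message arrives."""

    def __init__(self):
        self._active = {}  # alarm_id -> list of affected VM ids

    def on_status_alarm(self, alarm_id, vm_ids):
        self._active[alarm_id] = list(vm_ids)

    def on_status_alarm_clear(self, alarm_id):
        """Drop the alarm; return True if an active alarm was actually cleared."""
        return self._active.pop(alarm_id, None) is not None

    def vms_needing_attention(self):
        """VMs that still appear in at least one uncleared status alarm."""
        seen = []
        for vm_ids in self._active.values():
            for vm_id in vm_ids:
                if vm_id not in seen:
                    seen.append(vm_id)
        return seen


store = AlarmStore()
store.on_status_alarm("alarm-1", ["vm-1", "vm-2"])
store.on_status_alarm("alarm-2", ["vm-3"])
cleared = store.on_status_alarm_clear("alarm-1")
```

After the clear message, only the virtual machines of `alarm-2` remain subject to analysis, so restored virtual machines are not processed again.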
  • The service management system 600 of the embodiment of the present invention may be implemented by an application-specific integrated circuit (ASIC) or a programmable logic device (PLD), where the PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
  • the service management system 600 and its respective modules may also be software modules.
  • The service management system 600 illustrated in FIG. 6 may correspond to the service management system in the fault processing method illustrated in FIG. 2, and the above and other operations and/or functions of the respective units in the service management system 600 are intended to implement the corresponding flow of the fault processing method in FIG. 2; for brevity, details are not repeated here.
  • FIG. 7 is a schematic structural diagram of a virtual architecture management system 700 according to another embodiment of the present invention.
  • the virtual infrastructure management system 700 includes a processor 710, a memory 720, a communication interface 730, and a bus 740.
  • the processor 710, the memory 720, and the communication interface 730 communicate via the bus 740, and may also implement communication by other means such as wireless transmission.
  • the memory 720 is for storing instructions, and the processor 710 is configured to execute instructions stored by the memory 720.
  • the memory 720 stores program code, and the processor 710 can call the program code stored in the memory 720 to perform the following operations:
  • Obtaining a fault alarm message, where the fault alarm message carries identification information of the faulty device and a fault type; determining, according to the fault alarm message, a first virtual machine set, where the first virtual machine set includes at least one first virtual machine affected by the faulty device; and sending a status alarm message to the service management system, where the status alarm message carries information about the first virtual machine set.
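The three operations above (obtain the fault alarm, determine the first virtual machine set, build the status alarm) can be sketched as follows. The `PLACEMENT` inventory that maps devices to the virtual machines depending on them is a hypothetical stand-in for whatever resource model the virtual infrastructure management system keeps; all names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class FaultAlarm:
    device_id: str   # identification information of the faulty device
    fault_type: str


@dataclass
class StatusAlarm:
    vm_ids: list     # information about the first virtual machine set
    fault_type: str  # optionally carried along, per the description


# Hypothetical inventory: device -> virtual machines that depend on it.
PLACEMENT = {
    "host-1": ["vm-1", "vm-2"],
    "disk-array-7": ["vm-2", "vm-3"],
}


def determine_first_vm_set(alarm, placement):
    """Look up the virtual machines affected by the faulty device."""
    return list(placement.get(alarm.device_id, []))


def to_status_alarm(alarm, placement):
    """Translate a device-level fault alarm into a VM-level status alarm."""
    vm_ids = determine_first_vm_set(alarm, placement)
    if not vm_ids:
        return None  # no virtual machine is affected, nothing to report
    return StatusAlarm(vm_ids=vm_ids, fault_type=alarm.fault_type)


status = to_status_alarm(FaultAlarm("host-1", "power_loss"), PLACEMENT)
```

The translation from device to virtual machines happens entirely on the infrastructure side, so the recipient of `status` never needs the device inventory.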
  • After obtaining the fault alarm message of the faulty device, the virtual infrastructure management system analyzes and processes the fault alarm message directly, identifies the one or more virtual machines affected by the faulty device, and sends information about these virtual machines to the service management system, so that the service management system can determine the affected service applications directly from this information and then process them.
  • Compared with the prior art, the virtual infrastructure management system determines the information about the virtual machines affected by the faulty device directly from the fault alarm message of the faulty device, so that the service management system can derive the affected service applications directly from the status alarm message of the first virtual machine set, instead of first deriving the affected virtual machines from the alarm message of the faulty device and then deriving the affected service applications. The service management system therefore does not need to perceive the hardware fault directly, so it can quickly trigger handling of the impact on the service applications, reduce service loss, and improve the reliability of the service applications.
  • The processor 710 may further invoke the program code stored in the memory 720 to perform the following operation: determining, according to the fault alarm message, impact information of the first virtual machine set, where the impact information is used to indicate the type and/or level of the impact of the faulty device on the at least one first virtual machine.
  • the status alarm message further carries the impact information.
  • In addition to identifying the affected virtual machines, the virtual infrastructure management system may obtain, according to the fault alarm information of the faulty device, the type and/or level of the impact of the fault on these virtual machines.
  • The status alarm message sent to the service management system then also carries the impact information indicating the type and/or level of the impact of the faulty device on the first virtual machines in the first virtual machine set, so that the service management system or the service system can process the service application more appropriately according to the impact information, further improving the reliability of the service application.
  • the type of the impact that the faulty device generates on the at least one first virtual machine includes at least one of the following: fault, high risk, medium risk, low risk, or no impact.
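Deriving impact information from a fault alarm could look like the sketch below. The fault type names and the mapping table are invented for illustration; only the impact types (fault, high risk, medium risk, low risk, no impact) and the levels (urgent, important, unimportant) come from the description.

```python
# Hypothetical mapping from device fault type to (impact type, impact level).
IMPACT_BY_FAULT = {
    "host_power_loss": ("fault", "urgent"),
    "disk_degraded": ("high risk", "important"),
    "fan_failure": ("low risk", "unimportant"),
}

# Assumed fallback for fault types the table does not know.
DEFAULT_IMPACT = ("medium risk", "important")


def impact_info(fault_type, vm_ids):
    """Attach the same impact type and level to every VM hosted on the faulty device."""
    impact_type, level = IMPACT_BY_FAULT.get(fault_type, DEFAULT_IMPACT)
    return {vm_id: {"type": impact_type, "level": level} for vm_id in vm_ids}


info = impact_info("disk_degraded", ["vm-2", "vm-3"])
```

A finer-grained implementation might assign different impact types to different virtual machines on the same device, for example depending on whether a virtual machine has redundant storage paths.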
  • The processor 710 may invoke the program code stored in the memory 720 to perform the following operation: receiving a first request message sent by the service management system, where the first request message is used to indicate the virtual machine to be restored that needs to be restored preferentially, and the virtual machine to be restored is a subset of the first virtual machine set.
  • The processor is further configured to preferentially restore the virtual machine to be restored according to the first request message.
  • At the request of the service management system, the virtual infrastructure management system may restore at least one virtual machine in the first virtual machine set affected by the fault of the faulty device according to the priority indicated by the service management system.
  • The processor 710 may invoke the program code stored in the memory 720 to perform the following operation: when the first request message sent by the service management system is not received within a preset time threshold, restoring the at least one first virtual machine according to a preset virtual machine recovery policy.
  • This embodiment ensures that, when the service management system provides no information indicating how the virtual infrastructure management system should restore the virtual machines in the first virtual machine set, the virtual infrastructure management system can proactively restore the first virtual machines in the first virtual machine set according to the pre-configured recovery policy.
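The timeout fallback can be sketched with a blocking queue. The use of `queue.Queue` as the channel for first request messages, the default policy (restore in sorted order), and the choice to restore the remaining virtual machines after the requested ones are all assumptions for illustration.

```python
import queue


def pick_recovery_order(requests, affected_vm_ids, timeout_s, default_policy=sorted):
    """Wait up to timeout_s for a first request message, else use the preset policy."""
    try:
        requested = requests.get(timeout=timeout_s)
    except queue.Empty:
        # No instruction arrived within the threshold: apply the preset policy.
        return list(default_policy(affected_vm_ids))
    # Restore the requested subset first, then the rest of the affected VMs.
    requested_first = [vm for vm in requested if vm in affected_vm_ids]
    rest = [vm for vm in affected_vm_ids if vm not in requested_first]
    return requested_first + rest


inbox = queue.Queue()
fallback_order = pick_recovery_order(inbox, ["vm-2", "vm-1"], timeout_s=0.01)

inbox.put(["vm-2"])
requested_order = pick_recovery_order(inbox, ["vm-1", "vm-2", "vm-3"], timeout_s=0.01)
```

Filtering the request against `affected_vm_ids` enforces that the virtual machines to be restored are a subset of the first virtual machine set.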
  • The processor 710 may invoke the program code stored in the memory 720 to perform the following operation: sending a status alarm clear message to the service management system, where the status alarm clear message is used to instruct the service management system to clear the status alarm message in the service management system.
  • After restoring the virtual machine, the virtual infrastructure management system sends a status alarm clear message to the service management system, so that the service management system can clear the previously received related status alarm message according to the status alarm clear message, which prevents the service management system from analyzing and processing status alarm messages related to virtual machines that have already been restored.
  • The virtual infrastructure management system shown in FIG. 7 may correspond to the virtual infrastructure management system shown in FIG. 5, and the foregoing and other operations and/or functions of the units in the virtual infrastructure management system of the embodiment of the present invention are intended to implement the corresponding flow of the fault processing method shown in FIG. 2; for brevity, details are not described here again.
  • FIG. 8 is a schematic structural diagram of a service management system 800 according to another embodiment of the present invention.
  • the service management system 800 includes a processor 810, a memory 820, a communication interface 830, and a bus 840.
  • the processor 810, the memory 820, and the communication interface 830 communicate via the bus 840, and may also implement communication by other means such as wireless transmission.
  • the memory 820 is for storing instructions, and the processor 810 is configured to execute instructions stored by the memory 820.
  • the memory 820 stores program code, and the processor 810 can call the program code stored in the memory 820 to perform the following operations:
  • Receiving a status alarm message sent by the virtual infrastructure management system, where the status alarm message carries information about a first virtual machine set affected by the faulty device, and the first virtual machine set includes at least one first virtual machine; determining, according to the status alarm message, a service application associated with the at least one first virtual machine; and performing a processing operation on the service application associated with the at least one first virtual machine.
  • After receiving from the virtual infrastructure management system the information about the virtual machines in the first virtual machine set affected by the faulty device, the service management system can determine the affected service applications directly from that information and then process them. Compared with the prior art, the service management system derives the affected service applications directly from the status alarm message of the first virtual machine set, instead of first deriving the affected virtual machines from the alarm message of the faulty device and then deriving the affected service applications. The service management system therefore does not need to perceive the hardware fault directly, so it can quickly trigger handling of the impact on the service applications, reduce service loss, and improve the reliability of the service applications.
  • the status alarm message further carries the impact information of the first virtual machine set, where the impact information is used to indicate the impact of the faulty device on the at least one first virtual machine.
  • the processor is specifically configured to perform a processing operation on the service application associated with the at least one first virtual machine according to the impact information of the first virtual machine set.
  • The status alarm message of the first virtual machine set that the service management system receives from the virtual infrastructure management system further carries impact information indicating the type and/or level of the impact of the faulty device on the first virtual machines in the first virtual machine set, so that the service management system or the service system can process the service application more appropriately according to the impact information, further improving the reliability of the service application.
  • The type of the impact on the first virtual machine set includes at least one of the following: fault, high risk, medium risk, low risk, or no impact.
  • the processing operation includes at least one of the following manners:
  • the service management system switches the service application associated with the at least one first virtual machine to a virtual machine that is not affected by the faulty device for execution; or
  • the service management system marks the application state information of the at least one first virtual machine as an isolated state, where the isolated state is used to instruct the at least one first virtual machine to stop executing the service application associated with the at least one first virtual machine; or
  • the service management system sends a first request message to the virtual infrastructure management system, where the first request message is used to indicate a virtual machine to be restored, and the virtual machine to be restored is a subset of the first virtual machine set; or
  • the service management system sends the status alarm message to a control node of the service application associated with the at least one first virtual machine, so that the control node, according to the status alarm message, switches the service application associated with the at least one first virtual machine to a virtual machine that is not affected by the faulty device for execution, or marks the application state information of the at least one first virtual machine as the isolated state.
  • The processor 810 may invoke the program code stored in the memory 820 to perform the following operation: determining the first request message according to the impact information of the first virtual machine set.
  • the transmitter 840 is configured to send the first request message to the virtual infrastructure management system.
  • The service management system may determine, according to the impact information of the first virtual machine set, the priority of the virtual machines in the first virtual machine set that need to be restored by the virtual infrastructure management system, and send to the virtual infrastructure management system a first request message indicating the recovery priority of these virtual machines, so that the virtual infrastructure management system can restore at least one virtual machine in the first virtual machine set affected by the fault of the faulty device according to the priority indicated by the service management system.
  • The processor 810 may invoke the program code stored in the memory 820 to perform the following operation: sending the first request message to the virtual infrastructure management system according to a priority of the service application associated with the at least one first virtual machine.
  • The service management system instructs the virtual infrastructure management system to restore the virtual machines to be restored in the first virtual machine set according to the priority of the service applications associated with the first virtual machines in the first virtual machine set, that is, according to the priority of the service applications affected by the faulty device. High-priority service applications can thus be restored first, which further ensures the reliability of the service applications.
  • The processor 810 may invoke the program code stored in the memory 820 to perform the following operation: sending the first request message to the virtual infrastructure management system according to the deployment mode of the service application associated with the at least one first virtual machine.
  • The deployment mode of the service application associated with the at least one first virtual machine includes at least one of an active/standby mode, a load sharing mode, and a single virtual machine mode.
  • The service management system instructs the virtual infrastructure management system to restore the virtual machines to be restored in the first virtual machine set according to the deployment mode of the service application, that is, according to the deployment mode of the service application affected by the faulty device.
  • The processor 810 may invoke the program code stored in the memory 820 to perform the following operations: receiving a status alarm clear message sent by the virtual infrastructure management system, and clearing the status alarm message according to the status alarm clear message.
  • The service management system can clear the previously received related status alarm message according to the status alarm clear message sent by the virtual infrastructure management system, which avoids analyzing and processing status alarm messages related to virtual machines that have already been restored.
  • the service management system of the embodiment of the present invention shown in FIG. 8 may correspond to the service management system shown in FIG. 6, and the foregoing and other operations and/or functions of the respective units in the service management system of the embodiment of the present invention are respectively In order to implement the corresponding process executed by the service management system in the fault processing method shown in FIG. 2, for brevity, details are not described herein again.
  • the processor in the embodiment of the present invention may be an integrated circuit chip with signal processing capability.
  • each step of the foregoing method embodiment may be completed by an integrated logic circuit of hardware in a processor or an instruction in a form of software.
  • The processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • The processor may implement or perform the methods, steps, and logical block diagrams disclosed in the embodiments of the present invention.
  • The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
  • the steps of the method disclosed in the embodiments of the present invention may be directly implemented by the hardware decoding processor, or may be performed by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in a conventional storage medium such as random access memory, flash memory, read only memory, programmable read only memory or electrically erasable programmable memory, registers, and the like.
  • the storage medium is located in the memory, and the processor reads the information in the memory and combines the hardware to complete the steps of the above method.
  • the memory in the embodiments of the present invention may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory.
  • The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or a flash memory.
  • The volatile memory may be a random access memory (RAM) that acts as an external cache.
  • Many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchlink dynamic random access memory (SLDRAM), and direct Rambus random access memory (DR RAM).
  • The terms "system" and "network" are used interchangeably herein.
  • The term "and/or" herein merely describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" may indicate three cases: A exists alone, both A and B exist, and B exists alone.
  • the character "/" in this article generally indicates that the contextual object is an "or" relationship.
  • B corresponding to A means that B is associated with A, and B can be determined according to A.
  • Determining B from A does not mean that B is determined based on A only; B may also be determined based on A and/or other information.
  • the disclosed systems, devices, and methods may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • The division into units is only a logical function division.
  • In actual implementation there may be other division manners; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be in an electrical, mechanical or other form.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • The functions, if implemented in the form of a software functional unit and sold or used as a standalone product, may be stored in a computer-readable storage medium.
  • Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention.
  • The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Debugging And Monitoring (AREA)

Abstract

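The present invention provides a fault processing method, a virtual infrastructure management system, a service management system, and a virtualized computer system. The fault processing method includes: the virtual infrastructure management system obtains a fault alarm message, where the fault alarm message carries identification information of a faulty device and a fault type; the virtual infrastructure management system determines a first virtual machine set according to the fault alarm message, where the first virtual machine set includes at least one first virtual machine affected by the faulty device; and the virtual infrastructure management system sends a status alarm message to the service management system, where the status alarm message carries information about the first virtual machine set. The technical solution of the present invention can quickly notify affected services of the impact of a hardware fault, thereby improving service reliability.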

Description

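Fault processing method, virtual infrastructure management system, and service management system
Technical Field
The present invention relates to the field of cloud computing, and in particular, to a fault processing method, a virtual infrastructure management system, a service management system, and a virtualized computer system.
Background
In the field of cloud computing, service systems in various industries are being deployed in virtualized or cloud-based form. Currently, the services in a service system run on virtual machines, and the virtual machines are deployed on hardware devices that act as a shared resource pool; that is, services are no longer deployed on traditional dedicated hardware or physical servers, which decouples software from hardware and improves resource utilization.
Currently, when a device (such as a physical host or a storage device) fails, the fault of the faulty device is sent to the virtual infrastructure management system in a fault alarm message, and the virtual infrastructure management system forwards the fault alarm message to the service management system. The service management system determines the affected virtual machines and service applications according to the fault alarm message, and performs fault processing operations on the affected service applications. The service management system therefore has to be aware of the hardware and of the service applications corresponding to hardware faults before it can perform fault processing on the service applications. As a result, the service management system cannot quickly notify the service applications affected by the faulty device, which impairs the reliability of the service applications.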
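Summary
The present invention provides a fault processing method, a virtual infrastructure management system, a service management system, and a virtualized computer system, which can quickly notify the services affected by the affected virtual machines of the impact of a hardware fault on those virtual machines, thereby improving service reliability.
In a first aspect, the present invention provides a fault processing method. The fault processing method is used for fault processing in a virtualized computer system. The virtualized computer system includes a virtual infrastructure management system, a service management system, and at least one virtual machine, where the at least one virtual machine runs on at least one physical device and is used to execute service applications, the service management system is used to manage the service applications, and the virtual infrastructure management system is used to manage the at least one virtual machine and the at least one physical device. The fault processing method includes: the virtual infrastructure management system obtains a fault alarm message, where the fault alarm message carries identification information of a faulty device and a fault type; the virtual infrastructure management system determines a first virtual machine set according to the fault alarm message, where the first virtual machine set includes at least one first virtual machine affected by the faulty device; and the virtual infrastructure management system sends a status alarm message to the service management system, where the status alarm message carries information about the first virtual machine set.
In this fault processing method, after obtaining the fault alarm message of the faulty device, the virtual infrastructure management system analyzes and processes the fault alarm message directly, identifies the one or more virtual machines affected by the faulty device, and sends information about these virtual machines to the service management system, so that the service management system can determine the affected service applications directly from this information and then process them. Compared with the prior art, the virtual infrastructure management system determines the information about the virtual machines affected by the faulty device directly from the fault alarm message of the faulty device, so that the service management system can derive the affected service applications directly from the status alarm message of the first virtual machine set, instead of first deriving the affected virtual machines from the alarm message of the faulty device and then deriving the affected service applications. The service management system therefore does not need to perceive the hardware fault directly, so it can quickly trigger handling of the impact on the service applications, reduce service loss, and improve the reliability of the service applications.
In a possible implementation, the fault processing method further includes: the virtual infrastructure management system determines impact information of the first virtual machine set according to the fault alarm message of the faulty hardware, where the impact information is used to indicate the type and/or level of the impact of the faulty device on the first virtual machines in the first virtual machine set; correspondingly, the status alarm message may further carry the impact information of the first virtual machine set.
In this fault processing method, in addition to identifying the at least one affected virtual machine according to the fault alarm information of the faulty device, the virtual infrastructure management system can obtain the type and/or level of the impact of the fault on these virtual machines, and then also carry, in the status alarm message sent to the service management system, the impact information indicating the type and/or level of the impact of the faulty device on the first virtual machines in the first virtual machine set. The service management system or the service system can thus process the service application more appropriately according to the impact information, further improving the reliability of the service application.
Optionally, the status alarm information may further include identification information of the first virtual machines in the first virtual machine set, alarm identification information, alarm name information, alarm object type information, alarm type information, alarm generation time information, alarm component type information, alarm component identification information, and alarm component name information.
Optionally, the status alarm information may include fault type information of the faulty device.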
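The fields listed above can be collected into a single record, as in the following sketch. The field names, the JSON encoding, and the sample values are assumptions for illustration; the description names the pieces of information but does not define a wire format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class StatusAlarmInfo:
    vm_ids: list            # identification information of the first virtual machines
    alarm_id: str           # alarm identification information
    alarm_name: str         # alarm name information
    alarm_object_type: str  # alarm object type information
    alarm_type: str         # alarm type information
    raised_at: str          # alarm generation time information
    component_type: str     # alarm component type information
    component_id: str       # alarm component identification information
    component_name: str     # alarm component name information
    fault_type: str = ""    # optional fault type of the faulty device


alarm = StatusAlarmInfo(
    vm_ids=["vm-1", "vm-2"],
    alarm_id="alarm-1",
    alarm_name="VM affected by host fault",
    alarm_object_type="virtual machine",
    alarm_type="status alarm",
    raised_at="2017-05-22T08:00:00Z",
    component_type="physical host",
    component_id="host-1",
    component_name="rack-3-host-1",
    fault_type="power_loss",
)

# Serialize for transmission and parse it back on the receiving side.
payload = json.dumps(asdict(alarm), sort_keys=True)
received = StatusAlarmInfo(**json.loads(payload))
```

Keeping the component fields alongside the virtual machine identifiers lets an operator trace a status alarm back to the device, even though the service management system does not need them to act.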
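In a possible implementation, the type of the impact of the faulty device on the first virtual machines in the first virtual machine set includes one or more of fault, high risk, medium risk, low risk, or no impact.
Optionally, the level of the impact of the faulty device on the first virtual machines in the first virtual machine set includes urgent, important, or unimportant.
In a possible implementation, the fault processing method further includes: the virtual infrastructure management system receives a first request message sent by the service management system, where the first request message is used to indicate a virtual machine to be restored, and the virtual machine to be restored is a subset of the first virtual machine set; and the virtual infrastructure management system preferentially restores the virtual machine to be restored according to the first request message.
In this fault processing method, the virtual infrastructure management system may, at the request of the service management system, restore at least one virtual machine in the first virtual machine set affected by the fault of the faulty device according to the priority indicated by the service management system.
Optionally, the recovery processing performed by the virtual infrastructure management system on a virtual machine may include live migration of the virtual machine.
In a possible implementation, the fault processing method further includes: if the virtual infrastructure management system does not receive the first request message sent by the service management system within a preset time threshold, restoring the first virtual machines in the first virtual machine set according to a preset virtual machine recovery policy.
This fault processing method ensures that, when the service management system provides no information indicating how the virtual infrastructure management system should restore the virtual machines in the first virtual machine set, the virtual infrastructure management system can proactively restore the first virtual machines in the first virtual machine set according to the pre-configured recovery policy.
In a possible implementation, the fault processing method further includes: the virtual infrastructure management system sends a status alarm clear message to the service management system.
In this fault processing method, after restoring the virtual machine, the virtual infrastructure management system sends a status alarm clear message to the service management system, so that the service management system can clear the previously received related status alarm message according to the status alarm clear message, which prevents the service management system from analyzing and processing status alarm messages related to virtual machines that have already been restored.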
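In a second aspect, the present invention provides a virtual infrastructure management system, which includes modules for performing the fault processing method in the first aspect or any possible implementation of the first aspect.
After obtaining the fault alarm message of the faulty device, the virtual infrastructure management system provided by the present invention analyzes and processes the fault alarm message directly, identifies the one or more virtual machines affected by the faulty device, and sends information about these virtual machines to the service management system, so that the service management system can determine the affected service applications directly from this information and then process them. Compared with the prior art, the virtual infrastructure management system determines the information about the virtual machines affected by the faulty device directly from the fault alarm message of the faulty device, so that the service management system can derive the affected service applications directly from the status alarm message of the first virtual machine set, instead of first deriving the affected virtual machines from the alarm message of the faulty device and then deriving the affected service applications. The service management system therefore does not need to perceive the hardware fault directly, so it can quickly trigger handling of the impact on the service applications, reduce service loss, and improve the reliability of the service applications.
In a third aspect, the present invention provides a virtual infrastructure management system, which includes a processor, a memory, a communication interface, and a bus. The processor, the memory, and the communication interface communicate through the bus, or may communicate by other means such as wireless transmission. The memory is used to store instructions, and the processor is used to execute the instructions stored in the memory. The memory stores program code, and the processor can invoke the program code stored in the memory to perform the fault processing method in the first aspect or any possible implementation of the first aspect.
In a fourth aspect, the present invention provides a computer-readable medium that stores program code to be executed by a virtual infrastructure management system, where the program code includes instructions for performing the fault processing method in the first aspect or any possible implementation of the first aspect.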
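In a fifth aspect, the present invention further provides a fault processing method. The fault processing method is used for fault processing in a virtualized computer system. The virtualized computer system includes a virtual infrastructure management system, a service management system, and at least one virtual machine, where the at least one virtual machine runs on at least one physical device and is used to execute service applications, the service management system is used to manage the service applications, and the virtual infrastructure management system is used to manage the at least one virtual machine and the at least one physical device. The fault processing method includes: the service management system receives a status alarm message sent by the virtual infrastructure management system, where the status alarm message carries information about a first virtual machine set affected by a faulty device, and the first virtual machine set includes at least one first virtual machine; the service management system determines, according to the status alarm message, the service application associated with the at least one first virtual machine; and the service management system performs a processing operation on the associated service application.
In this fault processing method, after receiving from the virtual infrastructure management system the information about the virtual machines in the first virtual machine set affected by the faulty device, the service management system can determine the affected service applications directly from that information and then process them. Compared with the prior art, the service management system derives the affected service applications directly from the status alarm message of the first virtual machine set, instead of first deriving the affected virtual machines from the alarm message of the faulty device and then deriving the affected service applications. The service management system therefore does not need to perceive the hardware fault directly, so it can quickly trigger handling of the impact on the service applications, reduce service loss, and improve the reliability of the service applications.
In a possible implementation, the status alarm message of the first virtual machine set further carries impact information of the first virtual machine set, where the impact information is used to indicate the type and/or level of the impact of the faulty device on the at least one first virtual machine in the first virtual machine set. Correspondingly, the service management system performing a processing operation on the service application includes: the service management system performs the processing operation on the service application according to the impact information of the first virtual machine set.
In this fault processing method, the status alarm message of the first virtual machine set that the service management system receives from the virtual infrastructure management system further carries impact information indicating the type and/or level of the impact of the faulty device on the first virtual machines in the first virtual machine set, so that the service management system or the service system can process the service application more appropriately according to the impact information, further improving the reliability of the service application.
Optionally, the status alarm information may further include identification information of the first virtual machines in the first virtual machine set, alarm identification information, alarm name information, alarm object type information, alarm type information, alarm generation time information, alarm component type information, alarm component identification information, and alarm component name information.
Optionally, the status alarm information may include fault type information of the faulty device.
In a possible implementation, the type of the impact on the first virtual machine set includes one or more of fault, high risk, medium risk, low risk, or no impact.
Optionally, the level of the impact of the faulty device on the first virtual machines in the first virtual machine set includes urgent, important, or unimportant.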
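In a possible implementation, the processing operation includes at least one of the following:
the service management system switches the service application associated with the at least one first virtual machine to a virtual machine that is not affected by the faulty device for execution; or
the service management system marks the application state information of the at least one first virtual machine as an isolated state, where the isolated state is used to instruct the at least one first virtual machine to stop executing the service application associated with the at least one first virtual machine; or
the service management system sends a first request message to the virtual infrastructure management system, where the first request message is used to indicate a virtual machine to be restored, and the virtual machine to be restored is a subset of the first virtual machine set; or
the service management system sends the status alarm message to a control node of the service application associated with the at least one first virtual machine, so that the control node, according to the status alarm message, switches the service application associated with the at least one first virtual machine to a virtual machine that is not affected by the faulty device for execution, or marks the application state information of the at least one first virtual machine as the isolated state.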
在一种可能的实现方式中,该故障处理方法还包括:业务管理系统根据第一虚拟机集合的影响信息确定第一请求消息。
该故障处理方法中,业务管理系统可以根据第一虚拟机的影响信息确定第一虚拟机集合中需要虚拟架构管理系统恢复的待恢复的虚拟机的优先级,并向虚拟机架构管理发送用于指示这些待恢复的虚拟机的恢复优先级的第一请求消息,使得虚拟架构管理系统可以根据业务管理系统指示的优先级,对受故障设备的故障所影响的第一虚拟机集合中的至少一个虚拟机进行恢复处理。
在一种可能的实现方式中,业务管理系统向虚拟架构管理系统发送用于指示待恢复的虚拟机的恢复优先级的第一请求消息的一种具体实现方式可以为:业务管理系统根据业务应用的优先级向虚拟架构管理系统发送第一请求消息。
该故障处理方法中,业务管理系统根据第一虚拟机集合中的第一虚拟机相关联的业务应用的优先级,即根据故障设备影响的业务应用的优先级指示虚拟架构管理系统对第一虚拟机集合中的待恢复的虚拟机进行恢复处理,从而可以保证高优先级的业务应用可以优先得到恢复,进一步保证业务应用的可靠性。
可选地,业务管理系统可以根据第一虚拟机集合的影响信息和相关联的业务应用的优先级向虚拟架构管理系统发送第一请求消息。
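In a possible implementation, one specific way in which the service management system sends the first request message to the virtual infrastructure management system is as follows: the service management system sends the first request message according to the deployment mode of the service application, where the deployment mode includes at least one of active/standby mode, load-sharing mode, and single-virtual-machine mode.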
该故障处理方法中,业务管理系统根据业务应用的部署模式,即根据故障设备影响的业务应用的部署模式指示虚拟架构管理系统对第一虚拟机集合中的待恢复的虚拟机进行恢复处理。
可选地,业务管理系统可以根据第一虚拟机集合的影响信息和业务应用的部署模式向虚拟架构管理系统发送第一请求消息,或可以根据业务应用的部署模式和业务应用的优先级向虚拟架构管理系统发送第一请求消息,或可以根据第一虚拟机集合的影响信息、业务应用的部署模式和业务应用的优先级向虚拟架构管理系统发送第一请求消息。
在一种可能的实现方式中,该故障处理方法还包括:业务管理系统接收虚拟架构管理系统发送的状态告警清除消息;业务管理系统根据该状态告警清除消息清除之前接收的相关的状态告警消息。
该故障处理方法中,业务管理系统可以根据虚拟架构管理系统发送的状态告警清除消息清除之前接收的相关的状态告警消息,从而避免对已经恢复的虚拟机相关的状态告警消息进行分析处理。
第六方面,本发明提供了一种业务管理系统,所述业务管理系统包括用于执行第五方面或第五方面的任一可能的实现方式中的故障处理方法的各个模块。
第七方面,本发明提供了一种业务管理系统,所述业务管理系统包括处理器、存储器、通信接口和总线。其中,处理器、存储器、通信接口通过总线进行通信,也可以通过无线传输等其他手段实现通信。该存储器用于存储指令,该处理器用于执行该存储器存储的指令。该存储器存储程序代码,且处理器可以调用存储器中存储的程序代码执行第五方面及第五方面任一种可能实现方式中的故障处理方法。
第八方面,本发明提供了一种计算机可读介质,所述计算机可读介质存储用于业务管理系统执行的程序代码,所述程序代码包括用于执行第五方面或第五方面的任一可能的实现方式中的故障处理方法的指令。
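According to a ninth aspect, the present invention provides a virtualized computer system that includes a virtualization management node and a service management node. The virtualization management node is configured to perform the fault processing method in the first aspect or any possible implementation of the first aspect, and the service management node is configured to perform the fault processing method in the fifth aspect or any possible implementation of the fifth aspect.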
附图说明
为了更清楚地说明本发明实施例的技术方案,下面将对本发明实施例中所需要使用的附图作简单地介绍,显而易见地,下面所描述的附图仅仅是本发明的一些实施例的附图。
图1A是应用本发明实施例的故障处理方法的示意性系统结构图。
图1B是应用本发明实施例的故障处理方法的另一种示意性系统结构图。
图2是本发明一个实施例的故障处理方法的示意性流程图。
图3是本发明另一个实施例的故障处理方法的示意性流程图。
图4是本发明另一个实施例的故障处理方法的示意性流程图。
图5是本发明一个实施例的虚拟架构管理系统的示意性结构图。
图6是本发明一个实施例的业务管理系统的示意性结构图。
图7是本发明另一个实施例的虚拟架构管理系统的示意性结构图。
图8是本发明另一个实施例的业务管理系统的示意性结构图。
具体实施方式
下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本发明一部分实施例,而不是全部的实施例。
为了便于理解,先从整体上描述能够实施本发明实施例的故障处理方法的系统架构的示例图。应理解,本发明实施例并不限于图1A和图1B所示的虚拟化计算机系统中,此外,图1A和图1B中的装置可以是硬件,也可以是从功能上划分的软件或者以上二者的结合。
硬件资源(Hardware Resources)110可以包括一个或多个设备,每个设备可以为X86服务器、存储设备、网络设备等硬件设备资源,可用于提供计算、存储、网络等硬件功能。
虚拟化层(Virtualization Layer)120通过虚拟化技术对计算、存储、网络等硬件资源进行虚拟化,其中,虚拟化技术可以使用Xen,HyperV,也可以使用KVM,本发明不作限制。
虚拟资源(Virtual Resources)130是指通过虚拟化技术对硬件资源110进行虚拟化形成的虚拟资源,如虚拟计算、虚拟网络、虚拟存储等。
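The hardware resources 110, the virtualization layer 120, and the virtual resources 130 may together be referred to as the virtualized infrastructure layer (Virtualized Infrastructure Layer), which serves as the infrastructure layer that provides virtual resources or virtual resource pools to upper-layer services.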
业务系统140中部署一个或多个业务应用功能,每个业务应用部署在一个或多个虚拟机上,即这些虚拟机用于执行业务应用。虚拟机部署在硬件资源110中的设备上。
每个业务应用有对应的控制节点。控制节点用于对对应的业务应用进行管理。控制节点也可称为仲裁节点。控制节点可以部署在业务系统中,一个控制节点可以分别管理对应的一个业务应用,如图1A所示;一个控制节点也可以管理多个业务应用,如图1B所示。控制节点可以指用于对对应的业务应用进行管理的硬件装置,也可以指业务应用运行的多个虚拟机中的一个虚拟机。
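The virtual infrastructure management (Virtualized Infrastructure Manager) system 150 manages the virtualized infrastructure. It is responsible for unified management, monitoring, resource scheduling, and fault handling of the physical hardware (that is, the hardware resources 110), the virtualized resources, and the virtual machines deployed on the devices in the hardware resources 110. It provides resource support for running the service system and exposes open interfaces. The virtual infrastructure management system 150 may also be regarded as a component of the virtualization layer.
The service management system 160 manages the service applications that run on virtual machines, for example creating service applications, provisioning service applications, scheduling virtual resources within service applications, and shutting service applications down. The service management system may manage one or more service applications. It calls the interfaces provided by the virtual infrastructure management system to supply resources for running service applications and to provision and deploy them. The service management system 160 interfaces with the virtual infrastructure management system 150, and it may also interface with multiple virtual infrastructure management systems.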
其中,业务管理系统160和业务系统140又可统称为应用层。业务管理系统160和业务系统140可以是逻辑分开的系统,如图1A和1B所示,也可以由一个系统实现二者的功能。本发明实施例的以下具体描述中以图1A所示虚拟化计算机系统为例进行具体描述。
由上述内容可知,业务系统运行在虚拟资源130中的虚拟机上,业务系统不需要关心具体的硬件设备,也不需要知道业务应用所在的虚拟机具体在哪个硬件设备上运行,业务管理系统和业务系统均不需要直接感知设备及故障设备对业务应用的影响。
因此本发明提出新的故障处理方法、虚拟架构管理系统、业务管理系统和虚拟化计算机系统,使得业务管理系统不用直接感知设备以及设备故障对业务应用的影响,而是可以从虚拟架构管理系统获知设备故障对VM的影响,从而可以快速地获知受影响的业务应用,进而使得受影响的业务应用能够快速地得到处理。
下面以图1A所示的虚拟化计算机系统为例对本发明实施例的故障处理方法进行详细的介绍。
图2为本发明实施例的故障处理方法的示意性流程图。应理解,图2示出了故障处理方法的步骤或操作,但这些步骤或操作仅是示例,本发明实施例还可以执行其他操作或者图2中的各个操作的变形。此外,图2中的各个步骤可以按照与图2呈现的不同的顺序来执行,并且有可能并非要执行图2中的全部操作。
S210,虚拟架构管理系统获取故障告警消息,故障告警消息携带故障设备的标识信息和故障类型。
其中,故障设备可以是图1A中所示硬件资源110中任意一种或多种设备,故障类型包括整机故障或部分硬件故障。
例如,若故障设备为X86服务器,则故障类型可以为X86服务器整机故障,也可以是X86服务器中CPU、内存、网卡、磁盘中至少一种硬件故障。
本发明实施例中,故障设备(如服务器、存储设备等)可以快速检测自身故障,然后虚拟架构管理系统可以通过多种方式或协议获取故障设备的故障告警消息,如故障设备可以通过简单网络管理协议(Simple Network Management Protocol,SNMP)向虚拟架构管理系统上报故障设备的故障告警消息,或者虚拟架构管理系统可以通过表述性状态传递(Representational State Transfer,REST)接口查询故障设备的故障告警消息。
S220,虚拟架构管理系统根据故障设备的故障告警消息确定第一虚拟机集合,第一虚拟机集合包括受故障设备影响的至少一个第一虚拟机。
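After obtaining the fault alarm message of the faulty device, the virtual infrastructure management system determines, according to the message, the first virtual machine set affected by the faulty device. One specific implementation is as follows: based on the identification information of the faulty device and the fault type, the virtual infrastructure management system queries its database for information about all or some of the virtual machines that are deployed on the faulty device and affected by the fault. For convenience of the subsequent description, each affected virtual machine is referred to as a first virtual machine, and all first virtual machines together form the first virtual machine set.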
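The database lookup in S220 can be sketched as follows. This is a minimal illustration only: the record layout, the `component_ids` field, the fault-type strings, and the sample data are all assumptions made for the example, not details given in the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmRecord:
    vm_id: str
    host_id: str                 # physical device the VM is deployed on
    component_ids: tuple = ()    # hypothetical: host components the VM depends on


# Hypothetical stand-in for the virtual infrastructure management system's database.
VM_DB = [
    VmRecord("VM1", "node1", ("nic0",)),
    VmRecord("VM2", "node2", ("nic0",)),
    VmRecord("VM3", "node2", ("nic1",)),
    VmRecord("VM4", "node3", ("nic0",)),
]


def first_vm_set(device_id, fault_type, component_id=None, db=VM_DB):
    """Return the IDs of the VMs affected by the fault (the first VM set)."""
    on_host = [vm for vm in db if vm.host_id == device_id]
    if fault_type == "whole_machine":
        # A whole-machine fault affects every VM deployed on the device.
        return [vm.vm_id for vm in on_host]
    # Partial hardware fault: only VMs that depend on the failed component.
    return [vm.vm_id for vm in on_host if component_id in vm.component_ids]
```

A whole-machine fault therefore selects all virtual machines on the device, while a partial hardware fault selects only some of them, which matches the "all or some" wording above.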
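S230: The virtual infrastructure management system sends a status alarm message to the service management system, where the status alarm message carries information about the first virtual machine set.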
当虚拟架构管理系统为多个第一虚拟机分别生成一条状态告警消息时,虚拟架构管理系统可以一次将这些状态告警消息发送给业务管理系统,也可以分多次发送给业务管理系统。
当然,虚拟架构管理系统也可以为所有受影响的虚拟机生成一个状态告警消息,即第一虚拟机集合中所有第一虚拟机生成一个状态告警消息,本发明对此不作限制。
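After receiving the status alarm message of the first virtual machine set from the virtual infrastructure management system, the service management system may store the status alarm message, for example by recording or saving it in the database of the service management system.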
S240,业务管理系统根据第一虚拟机集合的状态告警消息确定第一虚拟机集合中至少一个第一虚拟机关联的业务应用。
业务管理系统接收到虚拟架构管理系统发送的第一虚拟机集合的状态告警消息后,将该状态告警信息和业务应用关联,识别具体受影响的业务应用,具体实现方式可以为:根据第一虚拟机集合的状态告警消息中携带的受影响的第一虚拟机的信息,从业务管理系统的数据库或配置文件中,查询第一虚拟机和业务应用的对应关系,识别出具体受影响的业务应用。
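The association step in S240 amounts to a lookup from virtual machine to service application. The sketch below assumes the status alarm message is a dictionary with a `vm_ids` key and that the mapping is held in memory. Both are illustrative choices, since the source only says the mapping comes from a database or configuration file.

```python
# Hypothetical VM-to-application mapping, as it might be loaded from the
# service management system's database or configuration file.
VM_TO_APP = {"VM1": "App1", "VM2": "App1", "VM3": "App2", "VM4": "App2"}


def affected_apps(status_alarm, vm_to_app=VM_TO_APP):
    """Group the affected VM IDs in a status alarm by service application."""
    result = {}
    for vm_id in status_alarm["vm_ids"]:
        app = vm_to_app.get(vm_id)
        if app is not None:          # ignore VMs that run no known application
            result.setdefault(app, []).append(vm_id)
    return result
```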
S250,业务管理系统对第一虚拟机集合中的至少一个第一虚拟机关联的业务应用执行处理操作。
具体而言,业务管理系统对第一虚拟机集合中的第一虚拟机关联的业务应用执行处理操作的一种实现方式可以是:业务管理系统向业务应用对应的控制节点发送第一虚拟机集合的信息。其中,第一虚拟机集合的信息用于指示控制节点对该业务应用进行恢复处理。
可选地,业务管理系统对第一虚拟机集合中的至少一个第一虚拟机关联的业务应用执行处理操作包括以下方式中的至少一种:
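Manner 1: The service management system switches the service application associated with the at least one affected first virtual machine to a virtual machine that is not affected by the faulty device.
Manner 2: The service management system marks the application state information of the at least one first virtual machine as an isolated state. The isolated state indicates that the at least one first virtual machine stops executing the service application associated with it. In other words, the affected virtual machine is isolated within the service application.
Manner 3: The service management system sends a first request message to the virtual infrastructure management system. The first request message indicates the to-be-recovered virtual machines, which are a subset of the first virtual machine set.
Manner 4: The service management system sends the status alarm message to the control node of the service application associated with the at least one first virtual machine, so that the control node, according to the status alarm message, either switches that service application to a virtual machine that is not affected by the faulty device or marks the application state information of the at least one first virtual machine as the isolated state.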
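In this embodiment of the present invention, after obtaining the fault alarm message from the faulty device, the virtual infrastructure management system directly analyzes the fault alarm message, determines the one or more virtual machines affected by the faulty device, and sends information about these virtual machines to the service management system. The service management system can determine the affected service applications directly from that virtual machine information and then process them. Compared with the prior art, the virtual infrastructure management system determines the information about the affected virtual machines directly from the fault alarm message of the faulty device, so the service management system can identify the affected service applications directly from the status alarm message of the first virtual machine set, rather than first deriving the affected virtual machines from the device's alarm message and then deriving the affected service applications. The service management system therefore does not need to be directly aware of hardware faults, which allows handling of the impact on service applications to be triggered quickly, reduces service loss, and improves the reliability of the service applications.
In this embodiment of the present invention, optionally, the virtual infrastructure management system may determine impact information of the first virtual machine set according to the fault alarm message of the faulty device. The impact information indicates the type and/or level of the impact of the faulty device on at least one first virtual machine in the first virtual machine set. Correspondingly, the status alarm message that the virtual infrastructure management system sends to the service management system may also carry the impact information, and so may the status alarm message that the service management system receives from the virtual infrastructure management system. The service management system then performs a processing operation on the service applications associated with the first virtual machines in the first virtual machine set according to the impact information of the first virtual machine set.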
用户可以根据需求定义虚拟机受到故障设备的故障所影响的类型和/或级别,下面是本发明实施例的虚拟机受到故障设备的影响的类型和级别的示例。
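When a whole-machine fault occurs on a physical server (including faults that prevent it from providing computing resources, such as server power-off or a host operating system failure), when a storage device fails (for example, the storage device is powered off or all of its links are down), or when another hardware fault leaves a virtual machine unable to run or unable to provide service, the impact type of the virtual machine may be set to fault and the level to urgent. For a network adapter fault or another hardware fault, if it prevents the virtual machine from working normally, the impact type may likewise be set to fault and the level to urgent.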
当物理服务器发生部件故障,如中央处理器(Central Processing Unit,CPU)、内存、部分网卡发生故障时,若暂时不影响虚拟机运行,但存在运行风险的情况,则虚拟机受影响的类型可以设置为高风险,级别可设置为重要。
当存储设备发生部件故障,如部分链路中断、部分控制器故障等,若暂时不影响虚拟机运行,但存在运行风险的情况,则虚拟机受影响的类型可以设置为中风险,级别可以设置为次要。
通常情况下,凡是硬件故障导致虚拟机无法运行或无法对外提供服务时,虚拟机受影响的类型均可以设置为故障,级别均可以设置为紧急。
而对于不影响任何虚拟机运行的硬件故障,则可以不设置虚拟机的受影响的类型和级别,或者可以设置虚拟机受影响的类型为低风险或无风险,级别为提示。
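The example rules above can be written as a small classification function. This is only a sketch of the examples given in this description: the string labels and the parameters `device_kind`, `vm_can_run`, and `at_risk` are names chosen for illustration, and users may define other types and levels as noted above.

```python
def classify_impact(device_kind, vm_can_run, at_risk):
    """Map a hardware fault to an (impact_type, impact_level) pair.

    device_kind: "server" or "storage" (other kinds fall through to the default)
    vm_can_run:  False if the fault stops the VM from running or serving
    at_risk:     True if the VM still runs but is exposed to a running risk
    """
    if not vm_can_run:
        # Any hardware fault that stops the VM: type fault, level urgent.
        return ("fault", "urgent")
    if at_risk and device_kind == "server":
        # Server component fault (CPU, memory, some NICs): high risk, important.
        return ("high_risk", "important")
    if at_risk and device_kind == "storage":
        # Storage component fault (some links or controllers): medium risk, minor.
        return ("medium_risk", "minor")
    # Fault that does not affect any VM; the description also allows leaving
    # the type and level unset in this case.
    return ("low_risk", "notice")
```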
通过上面内容的描述可知,虚拟架构管理系统向业务管理系统发送的状态告警消息可以包括第一虚拟机集合的信息,即受影响的至少一个第一虚拟机的标识。还可以包括第一虚拟机集合的影响信息,即故障设备对第一虚拟机集合中至少一个第一虚拟机产生的影响的类型和/或级别。
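Optionally, the status alarm message may further include the generation time, the clearing time, an alarm synchronization number, the alarm name, the alarm object type, and so on. In addition, the status alarm message of a virtual machine may carry information such as the cause of the fault of the faulty device. The information included in the status alarm message of a virtual machine is not limited to the items listed above.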
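One possible in-memory shape for such a status alarm message is sketched below. The field names and the JSON encoding are assumptions made for illustration, since the source does not prescribe a wire format.

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Optional


@dataclass
class StatusAlarm:
    vm_ids: List[str]                        # information about the first VM set
    impact_type: Optional[str] = None        # e.g. "fault", "high_risk"
    impact_level: Optional[str] = None       # e.g. "urgent", "important"
    raised_at: Optional[str] = None          # generation time
    cleared_at: Optional[str] = None         # clearing time
    alarm_name: Optional[str] = None
    device_fault_type: Optional[str] = None  # fault type of the faulty device

    def to_json(self):
        # Optional fields that were never set are left out of the message.
        body = {k: v for k, v in asdict(self).items() if v is not None}
        return json.dumps(body, sort_keys=True)


alarm = StatusAlarm(["VM1"], impact_type="fault", impact_level="urgent",
                    device_fault_type="whole_machine")
```

Only `vm_ids` is mandatory here, which mirrors the description: the virtual machine information is always carried, and everything else is optional.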
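In the method performed by the virtual infrastructure management system described above, the virtual infrastructure management system only analyzes the alarm message of the faulty device and provides the resulting information to the service management system. Nevertheless, this method is an effective step that precedes the subsequent processing of the virtual machines affected by the fault and of the service applications affected by the fault, so it can without doubt be called a fault processing method.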
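In this embodiment of the present invention, optionally, the service management system may call an interface provided by the virtual infrastructure management system to request it to process the affected virtual machines. Specifically, the service management system may determine, according to the impact information of the first virtual machine set, a first request message indicating the to-be-recovered virtual machines that need to be recovered with priority, where the to-be-recovered virtual machines are a subset of the first virtual machine set. The service management system then sends the first request message to the virtual infrastructure management system.
The service management system determines, from the impact information of the first virtual machines, the priority of the virtual machines in the first virtual machine set that the virtual infrastructure management system needs to recover, and sends the virtual infrastructure management system a first request message indicating the recovery priority of the to-be-recovered virtual machines. The virtual infrastructure management system can then recover at least one virtual machine in the first virtual machine set affected by the fault of the faulty device, according to the priority indicated by the service management system.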
可选地,业务管理系统向虚拟架构管理系统发送用于指示待恢复的虚拟机的恢复优先级的第一请求消息的另一种具体实现方式可以为:业务管理系统根据业务应用的优先级向虚拟架构管理系统发送第一请求消息。
具体而言,业务管理系统根据第一虚拟机集合中的第一虚拟机相关联的业务应用的优先级,即根据故障设备影响的业务应用的优先级指示虚拟架构管理系统对第一虚拟机集合中的待恢复的虚拟机进行恢复处理,从而可以保证高优先级的业务应用可以优先得到恢复,进一步保证业务应用的可靠性。
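For example, the service management system may use the first request message to instruct the virtual infrastructure management system to preferentially recover the first virtual machines with higher priority in the first virtual machine set.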
可选地,业务管理系统可以根据第一虚拟机集合的影响信息和相关联的业务应用的优先级向虚拟架构管理系统发送第一请求消息。
可选地,业务管理系统向虚拟架构管理系统发送第一请求消息的一种具体实现方式为:业务管理系统根据业务应用的部署模式向虚拟架构管理系统发送第一请求消息,业务应用的部署模式包括主备模式、负荷分担模式和单虚拟机模式中的至少一种。
具体而言,业务管理系统根据业务应用的部署模式,即根据故障设备影响的业务应用的部署模式指示虚拟架构管理系统对第一虚拟机集合中的待恢复的虚拟机进行恢复处理。
如业务管理系统可以通过第一请求消息指示虚拟架构管理系统优先恢复部署模式为主备模式的业务应用的主备虚拟机中的主虚拟机。
可选地,业务管理系统可以根据第一虚拟机集合的影响信息和业务应用的部署模式向虚拟架构管理系统发送第一请求消息,或可以根据业务应用的部署模式和业务应用的优先级向虚拟架构管理系统发送第一请求消息,或可以根据第一虚拟机集合的影响信息、业务应用的部署模式和业务应用的优先级向虚拟架构管理系统发送第一请求消息。
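The combination of impact information, application priority, and deployment mode can be illustrated as a sort key that the service management system applies before building the first request message. The relative weight of the three criteria is not specified in this description, so the ordering below (impact level first, then application priority, then active-before-standby) is purely an assumption, as are the dictionary keys.

```python
# Smaller rank means "recover earlier". Labels are illustrative.
LEVEL_RANK = {"urgent": 0, "important": 1, "minor": 2}


def recovery_order(vms):
    """Order affected VMs for the first request message.

    Each item is a dict with keys: vm_id, impact_level, app_priority
    (1 = highest), mode, and role.
    """
    def key(vm):
        is_active = vm["mode"] == "active_standby" and vm["role"] == "active"
        return (
            LEVEL_RANK.get(vm["impact_level"], len(LEVEL_RANK)),
            vm["app_priority"],
            0 if is_active else 1,   # active VM of an active/standby pair first
        )
    return [vm["vm_id"] for vm in sorted(vms, key=key)]


example = [
    {"vm_id": "VM3", "impact_level": "urgent", "app_priority": 2,
     "mode": "load_sharing", "role": None},
    {"vm_id": "VM2", "impact_level": "urgent", "app_priority": 1,
     "mode": "active_standby", "role": "standby"},
    {"vm_id": "VM1", "impact_level": "urgent", "app_priority": 1,
     "mode": "active_standby", "role": "active"},
    {"vm_id": "VM5", "impact_level": "important", "app_priority": 1,
     "mode": "single_vm", "role": None},
]
```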
虚拟架构管理系统接收业务管理系统发送的第一请求消息后,可以根据第一请求消息的指示对第一虚拟机集合中的待恢复的虚拟机按照一定的优先级进行恢复处理。虚拟架构管理系统对虚拟机的具体恢复形式可以是虚拟机迁移,即将虚拟机从故障设备迁移到其他正常设备;还可以是利用虚拟机快照在其他正常设备上恢复该虚拟机。
可选地,若虚拟架构管理系统在预置时间阈值内未接收到业务管理系统发送的用于指示第一虚拟机集合中需要优先恢复的虚拟机的第一请求信息,则按照预置虚拟机恢复策略恢复第一虚拟机集合中的第一虚拟机。
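This ensures that, when the service management system provides no information indicating how the virtual infrastructure management system should recover the virtual machines in the first virtual machine set, the virtual infrastructure management system can proactively recover at least one first virtual machine in the first virtual machine set according to the preset virtual machine recovery policy.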
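The fallback behaviour can be sketched as a pure decision function. The function name, its parameters, and the use of `sorted` as a placeholder for the preset recovery policy are assumptions for illustration. A real system would more likely use a timer than an explicit elapsed-time argument.

```python
def plan_recovery(affected_vm_ids, first_request, elapsed_s, threshold_s,
                  default_policy=sorted):
    """Decide which VMs the virtual infrastructure manager recovers, in order.

    first_request: ordered VM IDs from the service management system, or None
                   if no first request message has arrived yet.
    Returns a list of VM IDs, or None if it should keep waiting.
    """
    if first_request is not None:
        # Follow the requested order, restricted to VMs that are actually affected.
        return [vm for vm in first_request if vm in affected_vm_ids]
    if elapsed_s >= threshold_s:
        # No request within the preset time threshold: apply the preset policy.
        return list(default_policy(affected_vm_ids))
    return None
```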
可选地,无论是业务管理系统请求虚拟架构管理系统对受影响的虚拟机进行处理,还是虚拟架构管理系统主动对受影响的虚拟机进行处理,虚拟架构管理系统对受影响的虚拟机处理完后,均可以给业务管理系统发送状态告警清除消息,以指示业务管理系统可以清除之前接收到的、与该进行处理的虚拟机对应的状态告警消息。
业务管理系统收到虚拟架构管理系统发送的状态告警清除消息后,可以将对应的虚拟机的状态告警消息清除,减少业务管理系统对已恢复告警的维护工作,从而可以节省资源,提高效率。
业务管理系统清除状态告警消息的具体形式可以是将存储的状态告警消息删掉,也可以是修改状态告警消息中的某个信息,使得该信息指示该状态告警消息对应的虚拟机已经恢复了。
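Both ways of clearing a stored status alarm, deleting it or marking it as recovered, are shown in the sketch below. The `AlarmStore` class and its method names are hypothetical and stand in for the service management system's database.

```python
class AlarmStore:
    """Minimal in-memory store of status alarms, keyed by VM ID."""

    def __init__(self):
        self._alarms = {}

    def record(self, vm_id, alarm):
        self._alarms[vm_id] = dict(alarm, cleared=False)

    def clear(self, vm_id, delete=True):
        """Handle a status alarm clear message; return False if nothing matched."""
        if vm_id not in self._alarms:
            return False
        if delete:
            del self._alarms[vm_id]                 # option 1: drop the alarm
        else:
            self._alarms[vm_id]["cleared"] = True   # option 2: mark as recovered
        return True

    def active(self):
        """VM IDs whose alarms still need analysis."""
        return sorted(v for v, a in self._alarms.items() if not a["cleared"])
```

Marking rather than deleting keeps a history of recovered alarms while still excluding them from further analysis.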
本发明实施例中,可选地,业务管理系统根据第一虚拟机集合的信息确定第一虚拟机集合中的第一虚拟机关联的业务应用后,可以向业务应用关联的控制节点发送第一虚拟机集合的信息。
当业务应用的控制节点接收到业务管理系统发送的第一虚拟机集合的信息后,可以根据第一虚拟机集合中的第一虚拟机的信息对受影响的业务应用进行处理。
可选地,业务应用的控制节点还可以根据业务应用的部署模式对业务应用进行处理。如当业务应用为主备模式部署时,若主虚拟机故障,则控制节点需要进行主备切换;若备VM故障,控制节点不需要主备切换。如当业务应用为负荷分担模式部署时,控制节点将受影响的VM隔离。
可选地,业务应用的控制节点可以根据业务应用的部署模式和第一虚拟机集合的影响信息对业务应用进行处理。如当第一虚拟机集合的影响信息指示故障设备对第一虚拟机的影响的类型为故障、级别为紧急,且业务应用为主备模式部署,若主VM故障,则控制节点需要进行主备切换,若备VM故障或业务应用不重要,则控制节点可以不作处理,即控制节点不需要主备切换。应了解,上述根据虚拟机受影响的类型、级别及部署模式等对业务应用进行处理的方式只是示例性说明,其具体实现可以根据用户的需求来定义,本发明对此不作限制。
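The control node's decision in the examples above can be summarized in a few lines. As the description stresses, these rules are illustrative and user-definable. The return labels, the parameter names, and the choice to do nothing for non-fault impacts or for single-VM deployments are assumptions of this sketch.

```python
def control_node_action(mode, vm_role=None, impact_type="fault",
                        app_important=True):
    """Return the control node's action for one affected VM.

    Possible results: "switchover", "isolate", or "none".
    """
    if impact_type != "fault":
        return "none"            # assumption: risk-only impacts need no action
    if mode == "active_standby":
        # Switch over only when the active VM of an important application fails.
        if vm_role == "active" and app_important:
            return "switchover"
        return "none"
    if mode == "load_sharing":
        return "isolate"         # take the affected VM out of the sharing group
    return "none"                # single-VM mode: not covered by the examples
```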
可选地,业务应用的控制节点对业务应用处理完成后,可以向业务管理系统发送业务处理反馈消息,告知业务管理系统其对业务应用的处理结果。
下面结合图3,以物理主机故障为例,详细介绍本发明实施例的故障处理方法。如图3所示,其中包括三个设备,分别为计算节点1、计算节点2和计算节点3。计算节点1、计算节点2和计算节点3可以分别为图1A或图1B中的设备1、设备2和设备3。
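Two service applications (Application, App) are deployed in the virtualized computer system. One application, App1, is associated with VM1 and VM2 and is deployed in active/standby mode: VM1 is deployed on compute node 1 and is the active virtual machine of App1, and VM2 is deployed on compute node 2 and is the standby virtual machine of App1. The other application, App2, is associated with VM3 and VM4 and is deployed in load-sharing mode: VM3 is deployed on compute node 2, and VM4 is deployed on compute node 3.
S402: When a power failure occurs on compute node 1, compute node 1 reports its fault alarm message to the virtual infrastructure management system through SNMP.
S404: The virtual infrastructure management system receives the fault alarm message, determines the virtual machines affected by the fault according to the message, and generates a status alarm message for the virtual machine. The specific steps are as follows.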
(1)虚拟架构管理系统收到计算节点1的硬件故障告警消息,从虚拟架构管理系统的数据库中查询计算节点1上运行的虚拟机列表,获取到受影响的虚拟机有VM1,得到VM1的ID等信息。
(2)由于计算节点1掉电故障导致VM1故障,VM1无法运行提供服务,因此可以将VM1受影响的类型设置为故障,VM1受影响的级别设置为紧急。
(3)虚拟架构管理系统产生VM1的状态告警消息,其携带信息包括:VM1ID、VM1受影响的类型(为故障)、产生时间、VM1受影响的级别(为紧急)、故障设备的故障类型(为计算节点1整机故障)等。
S406,虚拟架构管理系统向业务管理系统发送VM1的状态告警消息。
S408,业务管理系统接收虚拟架构管理系统发送的虚拟机的状态告警消息,获得VM1的ID等信息,从业务管理系统的数据库查询出VM1和业务应用的对应关系,得到受影响的业务应用为App1。
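The service management system sends a notification message to the control node of App1 to notify it of the VM1 fault. Based on the notification message, the control node then decides to promote VM2 to the active virtual machine.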
S410,业务管理系统调用虚拟架构管理系统提供的接口,向虚拟架构管理系统发送第一请求消息,请求虚拟架构管理系统快速恢复VM1。
S412,虚拟架构管理系统将VM1迁移到计算节点3中,此时,VM1变为App1的备用虚拟机。
此时,在具体实施过程中,虚拟架构管理系统还可以对计算节点1进行故障隔离。
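S414: After recovering VM1, the virtual infrastructure management system sends a VM1 status alarm clear message to the service management system.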
经过故障处理后,业务系统中的应用的部署情况如图4所示。其中,App1采用主备模式部署,VM2部署在计算节点2上为主用虚拟机,VM1部署在计算节点3上为备用虚拟机。App2采用负荷分担模式部署,VM3部署在计算节点2上,VM4部署在计算节点3上。计算节点1故障,从资源池隔离。
上述实施例中,计算节点1发生故障后,向虚拟架构管理系统发送告警消息,虚拟架构管理系统根据告警消息确定受影响的虚拟机为VM1,且确定VM1受到的影响的类型和级别。业务管理系统不用直接对硬件的告警消息进行处理,即可直接从虚拟架构管理系统处获取受影响的VM1的信息和VM1受到的影响信息,进而确定VM1上运行的业务应用为App1,业务管理系统通知App1的控制节点对App1进行处理,并请求虚拟架构管理系统对VM1进行恢复。虚拟架构管理系统根据业务管理系统的请求将VM1迁移到计算节点3上。App1的控制节点从业务管理系统处获取VM1的信息及VM1受到的影响信息后,将App1原来的备虚拟机VM2切换为主虚拟机,并将迁移到计算节点3上的VM1设置为备用虚拟机,从而保证App1的运行,提高App1的可靠性。
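The state change in this example, from the deployment before the fault to the deployment after recovery, can be replayed with a small simulation. The data structures and the function name are illustrative only. The node and VM names follow the example.

```python
# Deployment before the fault: VM -> compute node, and App1's active/standby roles.
placement = {"VM1": "node1", "VM2": "node2", "VM3": "node2", "VM4": "node3"}
app1 = {"active": "VM1", "standby": "VM2"}


def handle_node_failure(failed_node, target_node, placement, app):
    """Return (affected VMs, new placement, new roles) after a node failure."""
    placement, app = dict(placement), dict(app)   # leave the inputs untouched
    affected = [vm for vm, node in placement.items() if node == failed_node]
    for vm in affected:
        if app["active"] == vm:
            # Control node: promote the standby VM; the failed VM becomes standby.
            app["active"], app["standby"] = app["standby"], vm
        # Virtual infrastructure manager: migrate the VM to a healthy node.
        placement[vm] = target_node
    return affected, placement, app


affected, new_placement, new_app1 = handle_node_failure(
    "node1", "node3", placement, app1)
```

After the call, VM2 is the active virtual machine on compute node 2 and VM1 is the standby virtual machine on compute node 3, as in the deployment described for Figure 4.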
上面结合图2至图4介绍了本发明实施例的故障处理方法,下面结合图5至图8介绍本发明实施例的虚拟架构管理系统和业务管理系统。
图5为本发明一个实施例的虚拟架构管理系统的示意性结构图。应理解,图5示出的虚拟架构管理系统500仅是示例,本发明实施例的虚拟架构管理系统还可包括其他模块或单元,或者包括与图5中的各个模块的功能相似的模块,或者并非要包括图5中的所有模块。
获取模块510,用于获取故障告警消息,所述故障告警消息携带故障设备的标识信息和故障类型。
确定模块520,用于根据所述故障告警消息确定第一虚拟机集合,所述第一虚拟机集合包括受所述故障设备影响的至少一个第一虚拟机。
发送模块530,用于向所述业务管理系统发送状态告警消息,所述状态告警消息携带所述第一虚拟机集合的信息。
本发明实施例中,虚拟架构管理系统获取到故障设备上的故障告警消息后,直接对该故障告警消息进行分析处理,获取故障设备影响的一个或多个虚拟机,并向业务管理系统发送这些虚拟机的信息,使得业务管理系统可以直接根据这些虚拟机的信息分析得到受影响的业务应用,进而可以对受影响的业务应用进行处理。与现有技术相比,由虚拟架构管理系统直接根据故障设备的故障告警消息确定受故障设备影响的虚拟机的信息,使得业务管理系统可以直接根据第一虚拟机集合的状态告警消息分析得到受影响的业务应用,而不是根据故障设备的告警消息去分析得到受影响的虚拟机、再分析受影响的业务应用。从而使得业务管理系统不需要直接感知硬件故障,进而可以快速触发业务应用的影响处理,降低业务损失,提高业务应用的可靠性。
可选地,作为一个实施例,所述确定模块还用于根据所述故障告警消息确定所述第一虚拟机集合的影响信息,所述影响信息用于指示所述故障设备对所述至少一个第一虚拟机产生的影响的类型和/或级别。则所述虚拟架构管理系统向所述业务管理系统发送状态告警消息还携带所述第一虚拟机集合的影响信息。
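In this embodiment of the present invention, from the fault alarm information of the faulty device the virtual infrastructure management system can obtain not only the at least one affected virtual machine but also the type and/or level of the impact of the fault on these virtual machines. The status alarm message that it sends to the service management system then also carries impact information indicating the type and/or level of the impact of the faulty device on the first virtual machines in the first virtual machine set, so that the service management system or the service system can process the service applications more precisely based on this impact information, further improving their reliability.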
可选地,作为一个实施例,所述故障设备对所述至少一个第一虚拟机产生的影响的类型包括以下至少一种:故障、高风险、中风险、低风险或无影响。
可选地,作为一个实施例,所述虚拟架构管理系统还包括接收模块和恢复模块。所述接收模块用于接收所述业务管理系统发送的第一请求消息,所述第一请求消息用于指示需要优先恢复的待恢复的虚拟机,所述待恢复的虚拟机为所述第一虚拟机集合中一个子集。所述恢复模块用于根据所述第一请求信息优先恢复待恢复的虚拟机。
本发明实施例中,虚拟架构管理系统可以根据业务管理系统的请求,根据业务管理系统指示的优先级,对受故障设备的故障所影响的第一虚拟机集合中的至少一个虚拟机进行恢复处理。
可选地,作为一个实施例,所述恢复模块还用于在预置时间阈值内未接收到所述业务管理系统发送的所述第一请求信息时,按照预置虚拟机恢复策略恢复所述至少一个第一虚拟机。
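This embodiment of the present invention ensures that, when the service management system provides no information indicating how the virtual infrastructure management system should recover the virtual machines in the first virtual machine set, the virtual infrastructure management system can proactively recover the first virtual machines in the first virtual machine set according to a preconfigured recovery policy.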
可选地,作为一个实施例,所述发送模块还用于向所述业务管理系统发送状态告警清除消息,所述状态告警清除消息用于指示所述业务管理系统清除所述业务管理系统中的所述状态告警消息。
本发明实施例中,虚拟架构管理系统对虚拟机进行恢复处理后,向业务管理系统发送状态告警清除消息,使得业务管理系统可以根据该状态告警清除消息清除之前接收的相关的状态告警消息,从而避免业务管理系统对已经恢复的虚拟机相关的状态告警消息进行分析处理。
应理解的是,本发明实施例的虚拟架构管理系统500可以通过专用集成电路(Application Specific Integrated Circuit,ASIC)实现,或可编程逻辑器件(Programmable Logic Device,PLD)实现,上述PLD可以是复杂程序逻辑器件(Complex Programmable Logic Device,CPLD),现场可编程门阵列(Field-Programmable Gate Array,FPGA),通用阵列逻辑(Generic Array Logic,GAL)或其任意组合。通过软件实现图2所示的故障处理方法中由虚拟架构管理系统执行的步骤时,虚拟架构管理系统500及其各个模块也可以为软件模块。
应理解,图5所示的虚拟架构管理系统500可对应于图2所示故障处理方法中的虚拟架构管理系统,并且虚拟架构管理系统500中的各个单元的上述和其它操作和/或功能分别为了实现图2中的故障处理方法的相应流程,为了简洁,在此不再赘述。
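FIG. 6 is a schematic structural diagram of a service management system according to an embodiment of the present invention. It should be understood that the service management system 600 shown in FIG. 6 is merely an example. The service management system in this embodiment of the present invention may further include other modules or units, may include modules whose functions are similar to those of the modules in FIG. 6, or need not include all the modules in FIG. 6.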
接收模块610,用于接收所述虚拟架构管理系统发送的状态告警消息,所述状态告警消息携带受故障设备影响的第一虚拟机集合的信息,所述第一虚拟机集合中包括至少一个第一虚拟机。
确定模块620,用于根据所述状态告警消息确定所述至少一个第一虚拟机关联的业务应用。
处理模块630,用于对所述至少一个第一虚拟机关联的业务应用执行处理操作。
本发明实施例中,业务管理系统从虚拟架构管理系统接收到受故障设备影响的第一虚拟机集合中的虚拟机的信息后,可以直接根据这些虚拟机的信息分析得到受影响的业务应用,进而可以对受影响的业务应用进行处理。与现有技术相比,业务管理系统可以直接根据第一虚拟机集合的状态告警消息分析得到受影响的业务应用,而不是根据故障设备的告警消息去分析得到受影响的虚拟机、再分析受影响的业务应用。从而使得业务管理系统不需要直接感知硬件故障,进而可以快速触发业务应用的影响处理,降低业务损失,提高业务应用的可靠性。
可选地,作为一个实施例,所述状态告警消息还携带所述第一虚拟机集合的影响信息,所述影响信息用于指示所述故障设备对所述至少一个第一虚拟机产生的影响的类型和/或级别。其中,所述处理模块具体用于根据所述第一虚拟机集合的影响信息对所述至少一个第一虚拟机关联的业务应用执行处理操作。
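In this embodiment of the present invention, the status alarm message of the first virtual machine set that the service management system receives from the virtual infrastructure management system also carries impact information indicating the type and/or level of the impact of the faulty device on the first virtual machines in the first virtual machine set. The service management system or the service system can therefore process the service applications more precisely based on this impact information, further improving their reliability.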
可选地,作为一个实施例,所述第一虚拟机集合产生的影响的类型包括以下至少一种:故障、高风险、中风险、低风险或无影响。
可选地,作为一个实施例,所述处理操作包括以下方式中的至少一种:
所述业务管理系统将所述至少一个第一虚拟机关联的业务应用切换至未受所述故障设备影响的虚拟机执行;或
所述业务管理系统将所述至少一个第一虚拟机的应用状态信息标识为隔离状态,所述隔离状态用于指示所述至少一个第一虚拟机停止执行所述至少一个第一虚拟机关联的业务应用;或
所述业务管理系统向所述虚拟架构管理系统发送第一请求消息,所述第一请求消息用于指示待恢复的虚拟机,所述待恢复的虚拟机为所述第一虚拟机集合中一个子集;或
所述业务管理系统向所述至少一个第一虚拟机关联的业务应用的控制节点发送所述状态告警消息,以使得所述控制节点根据所述状态告警消息将所述至少一个第一虚拟机关联的业务应用切换至所述未受所述故障设备影响的虚拟机执行或将所述至少一个第一虚拟机的应用状态信息标识为所述隔离状态。
可选地,作为一个实施例,所述确定模块还用于根据所述第一虚拟机集合的影响信息确定第一请求消息,所述第一请求消息用于指示需要优先恢复的待恢复的虚拟机,所述待恢复的虚拟机为所述第一虚拟机集合中一个子集。所述业务管理系统还包括发送模块,用于向所述虚拟架构管理系统发送所述第一请求消息。
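In this embodiment of the present invention, the service management system can determine, from the impact information of the first virtual machines, the priority of the to-be-recovered virtual machines in the first virtual machine set that the virtual infrastructure management system needs to recover, and send the virtual infrastructure management system a first request message indicating the recovery priority of these virtual machines. The virtual infrastructure management system can then recover at least one virtual machine in the first virtual machine set affected by the fault of the faulty device, according to the priority indicated by the service management system.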
可选地,作为一个实施例,所述发送模块还用于根据所述至少一个第一虚拟机关联的业务应用的优先级向所述虚拟架构管理系统发送所述第一请求消息。
本发明实施例中,业务管理系统根据第一虚拟机集合中的第一虚拟机相关联的业务应用的优先级,即根据故障设备影响的业务应用的优先级指示虚拟架构管理系统对第一虚拟机集合中的待恢复的虚拟机进行恢复处理,从而可以保证高优先级的业务应用可以优先得到恢复,进一步保证业务应用的可靠性。
可选地,作为一个实施例,所述发送模块还用于根据所述至少一个第一虚拟机关联的业务应用的部署模式向所述虚拟架构管理系统发送所述第一请求消息,所述至少一个第一虚拟机关联的业务应用的部署模式包括主备模式、负荷分担模式和单虚拟机模式中的至少一种。
本发明实施例中,业务管理系统根据受影响的业务应用的部署模式,即根据故障设备影响的业务应用的部署模式指示虚拟架构管理系统对第一虚拟机集合中的待恢复的虚拟机进行恢复处理。
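Optionally, in an embodiment, the receiving module is further configured to receive a status alarm clear message sent by the virtual infrastructure management system, and the processing module is further configured to clear the status alarm message according to the status alarm clear message.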
本发明实施例中,业务管理系统可以根据虚拟架构管理系统发送的状态告警清除消息清除之前接收的相关的状态告警消息,从而避免对已经恢复的虚拟机相关的状态告警消息进行分析处理。
应理解的是,本发明实施例的业务管理系统600可以通过专用集成电路实现,或可编程逻辑器件实现,上述PLD可以是复杂程序逻辑器件,现场可编程门阵列,通用阵列逻辑或其任意组合。通过软件实现图2所示的故障处理方法中由业务管理系统执行的步骤时,业务管理系统600及其各个模块也可以为软件模块。
应理解,图6所示的业务管理系统600可对应于图2所示故障处理方法中的业务管理系统,并且业务管理系统600中的各个单元的上述和其它操作和/或功能分别为了实现图2中的故障处理方法的相应流程,为了简洁,在此不再赘述。
图7是本发明另一个实施例的虚拟架构管理系统700的示意性结构图。虚拟架构管理系统700包括处理器710、存储器720、通信接口730和总线740。其中,处理器710、存储器720、通信接口730通过总线740进行通信,也可以通过无线传输等其他手段实现通信。该存储器720用于存储指令,该处理器710用于执行该存储器720存储的指令。该存储器720存储程序代码,且处理器710可以调用存储器720中存储的程序代码执行以下操作:
获取故障告警消息,所述故障告警消息携带故障设备的标识信息和故障类型;根据所述故障告警消息确定第一虚拟机集合,所述第一虚拟机集合包括受所述故障设备影响的至少一个第一虚拟机;向所述业务管理系统发送状态告警消息,所述状态告警消息携带所述第一虚拟机集合的信息。
本发明实施例中,虚拟架构管理系统获取到故障设备上的故障告警消息后,直接对该故障告警消息进行分析处理,获取故障设备影响的一个或多个虚拟机,并向业务管理系统发送这些虚拟机的信息,使得业务管理系统可以直接根据这些虚拟机的信息分析得到受影响的业务应用,进而可以对受影响的业务应用进行处理。与现有技术相比,由虚拟架构管理系统直接根据故障设备的故障告警消息确定受故障设备影响的虚拟机的信息,使得业务管理系统可以直接根据第一虚拟机集合的状态告警消息分析得到受影响的业务应用,而不是根据故障设备的告警消息去分析得到受影响的虚拟机、再分析受影响的业务应用。从而使得业务管理系统不需要直接感知硬件故障,进而可以快速触发业务应用的影响处理,降低业务损失,提高业务应用的可靠性。
可选地,作为一个实施例,处理器710还可以调用存储器720中存储的程序代码执行以下操作:根据所述故障告警消息确定所述第一虚拟机集合的影响信息,所述影响信息用于指示所述故障设备对所述至少一个第一虚拟机产生的影响的类型和/或级别。其中,所述状态告警消息还携带所述影响信息。
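In this embodiment of the present invention, from the fault alarm information of the faulty device the virtual infrastructure management system can obtain not only the at least one affected virtual machine but also the type and/or level of the impact of the fault on these virtual machines. The status alarm message that it sends to the service management system then also carries impact information indicating the type and/or level of the impact of the faulty device on the first virtual machines in the first virtual machine set, so that the service management system or the service system can process the service applications more precisely based on this impact information, further improving their reliability.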
可选地,作为一个实施例,所述故障设备对所述至少一个第一虚拟机产生的影响的类型包括以下至少一种:故障、高风险、中风险、低风险或无影响。
可选地,作为一个实施例,处理器710可以调用存储器720中存储的程序代码执行以下操作:接收所述业务管理系统发送的第一请求消息,所述第一请求消息用于指示需要优先恢复的待恢复的虚拟机,所述待恢复的虚拟机为所述第一虚拟机集合中一个子集。所述处理器还用于根据所述第一请求信息优先恢复所述待恢复的虚拟机。
本发明实施例中,虚拟架构管理系统可以根据业务管理系统的请求,根据业务管理系统指示的优先级,对受故障设备的故障所影响的第一虚拟机集合中的至少一个虚拟机进行恢复处理。
可选地,作为一个实施例,处理器710可以调用存储器720中存储的程序代码执行以下操作:在预置时间阈值内未接收到所述业务管理系统发送的所述第一请求信息时,按照预置虚拟机恢复策略恢复所述至少一个第一虚拟机。
本发明实施例可以保证在业务管理系统没有信息指示虚拟架构管理系统如何恢复第一虚拟机集合中的虚拟机时,虚拟架构管理系统可以主动根据预先配置的恢复策略对第一虚拟机集合中的第一虚拟机进行恢复。
可选地,作为一个实施例,处理器710可以调用存储器720中存储的程序代码执行以下操作:向所述业务管理系统发送状态告警清除消息,所述状态告警清除消息用于指示所述业务管理系统清除所述业务管理系统中的所述状态告警消息。
本发明实施例中,虚拟架构管理系统对虚拟机进行恢复处理后,向业务管理系统发送状态告警清除消息,使得业务管理系统可以根据该状态告警清除消息清除之前接收的相关的状态告警消息,从而避免业务管理系统对已经恢复的虚拟机相关的状态告警消息进行分析处理。
应理解,图7所示本发明实施例的虚拟架构管理系统可对应于图5所示的虚拟架构管理系统,并且本发明实施例的虚拟架构管理系统中的各个单元的上述和其它操作和/或功能分别为了实现图2所示的故障处理方法中由虚拟架构管理系统执行的相应流程,为了简洁,在此不再赘述。
图8是本发明另一个实施例的业务管理系统800的示意性结构图。业务管理系统800包括处理器810、存储器820、通信接口830和总线840。其中,处理器810、存储器820、通信接口830通过总线840进行通信,也可以通过无线传输等其他手段实现通信。该存储器820用于存储指令,该处理器810用于执行该存储器820存储的指令。该存储器820存储程序代码,且处理器810可以调用存储器820中存储的程序代码执行以下操作:
接收所述虚拟架构管理系统发送的状态告警消息,所述状态告警消息携带受故障设备影响的第一虚拟机集合的信息,所述第一虚拟机集合中包括至少一个第一虚拟机;根据所述状态告警消息确定所述至少一个第一虚拟机关联的业务应用;对所述至少一个第一虚拟机关联的业务应用执行处理操作。
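In this embodiment of the present invention, after receiving from the virtual infrastructure management system the information about the virtual machines in the first virtual machine set affected by the faulty device, the service management system can determine the affected service applications directly from that information and then process them. Compared with the prior art, the service management system can identify the affected service applications directly from the status alarm message of the first virtual machine set, rather than first deriving the affected virtual machines from the faulty device's alarm message and then deriving the affected service applications. The service management system therefore does not need to be directly aware of hardware faults, which allows handling of the impact on service applications to be triggered quickly, reduces service loss, and improves the reliability of the service applications.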
可选地,作为一个实施例,所述状态告警消息还携带所述第一虚拟机集合的影响信息,所述影响信息用于指示所述故障设备对所述至少一个第一虚拟机产生的影响的类型和/或级别。其中,所述处理器具体用于根据所述第一虚拟机集合的影响信息对所述至少一个第一虚拟机关联的业务应用执行处理操作。
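In this embodiment of the present invention, the status alarm message of the first virtual machine set that the service management system receives from the virtual infrastructure management system also carries impact information indicating the type and/or level of the impact of the faulty device on the first virtual machines in the first virtual machine set. The service management system or the service system can therefore process the service applications more precisely based on this impact information, further improving their reliability.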
可选地,作为一个实施例,所述第一虚拟机集合产生的影响的类型包括以下至少一种:故障、高风险、中风险、低风险或无影响。
可选地,作为一个实施例,所述处理操作包括以下方式中的至少一种:
所述业务管理系统将所述至少一个第一虚拟机关联的业务应用切换至未受所述故障设备影响的虚拟机执行;或
所述业务管理系统将所述至少一个第一虚拟机的应用状态信息标识为隔离状态,所述隔离状态用于指示所述至少一个第一虚拟机停止执行所述至少一个第一虚拟机关联的业务应用;或
所述业务管理系统向所述虚拟架构管理系统发送第一请求消息,所述第一请求消息用于指示待恢复的虚拟机,所述待恢复的虚拟机为所述第一虚拟机集合中一个子集;或
所述业务管理系统向所述至少一个第一虚拟机关联的业务应用的控制节点发送所述状态告警消息,以使得所述控制节点根据所述状态告警消息将所述至少一个第一虚拟机关联的业务应用切换至所述未受所述故障设备影响的虚拟机执行或将所述至少一个第一虚拟机的应用状态信息标识为所述隔离状态。
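Optionally, in an embodiment, the processor 810 may call the program code stored in the memory 820 to perform the following operation: determining a first request message according to the impact information of the first virtual machine set. The communication interface 830 is configured to send the first request message to the virtual infrastructure management system.
In this embodiment of the present invention, the service management system can determine, from the impact information of the first virtual machines, the priority of the to-be-recovered virtual machines in the first virtual machine set that the virtual infrastructure management system needs to recover, and send the virtual infrastructure management system a first request message indicating the recovery priority of these virtual machines. The virtual infrastructure management system can then recover at least one virtual machine in the first virtual machine set affected by the fault of the faulty device, according to the priority indicated by the service management system.
Optionally, in an embodiment, the processor 810 may call the program code stored in the memory 820 to perform the following operation: sending the first request message to the virtual infrastructure management system according to the priority of the service application associated with the at least one first virtual machine.
In this embodiment of the present invention, the service management system instructs the virtual infrastructure management system to recover the to-be-recovered virtual machines in the first virtual machine set according to the priority of the service applications associated with the first virtual machines in the first virtual machine set, that is, according to the priority of the service applications affected by the faulty device. This ensures that high-priority service applications are recovered first and further safeguards the reliability of the service applications.
Optionally, in an embodiment, the processor 810 may call the program code stored in the memory 820 to perform the following operation: sending the first request message to the virtual infrastructure management system according to the deployment mode of the service application associated with the at least one first virtual machine, where the deployment mode includes at least one of active/standby mode, load-sharing mode, and single-virtual-machine mode.
In this embodiment of the present invention, the service management system instructs the virtual infrastructure management system to recover the to-be-recovered virtual machines in the first virtual machine set according to the deployment mode of the service applications, that is, according to the deployment mode of the service applications affected by the faulty device.
Optionally, in an embodiment, the processor 810 may call the program code stored in the memory 820 to perform the following operation: receiving a status alarm clear message sent by the virtual infrastructure management system. The processor is further configured to clear the status alarm message according to the status alarm clear message.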
本发明实施例中,业务管理系统可以根据虚拟架构管理系统发送的状态告警清除消息清除之前接收的相关的状态告警消息,从而避免对已经恢复的虚拟机相关的状态告警消息进行分析处理。
应理解,图8所示本发明实施例的业务管理系统可对应于图6所示的业务管理系统,并且本发明实施例的业务管理系统中的各个单元的上述和其它操作和/或功能分别为了实现图2所示的故障处理方法中由业务管理系统执行的相应流程,为了简洁,在此不再赘述。
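It can be understood that the processor in the embodiments of the present invention may be an integrated circuit chip with signal processing capability. In an implementation, the steps of the foregoing method embodiments may be completed by an integrated logic circuit of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and it can implement or perform the methods, steps, and logical block diagrams disclosed in the embodiments of the present invention. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed with reference to the embodiments of the present invention may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. A software module may be located in a storage medium that is mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory. The processor reads the information in the memory and completes the steps of the foregoing methods in combination with its hardware.
It can be understood that the memory in the embodiments of the present invention may be a volatile memory or a non-volatile memory, or may include both. The non-volatile memory may be a read-only memory (Read-Only Memory, ROM), a programmable read-only memory (Programmable ROM, PROM), an erasable programmable read-only memory (Erasable PROM, EPROM), an electrically erasable programmable read-only memory (Electrically EPROM, EEPROM), or a flash memory. The volatile memory may be a random access memory (Random Access Memory, RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (Static RAM, SRAM), dynamic random access memory (Dynamic RAM, DRAM), synchronous dynamic random access memory (Synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (Double Data Rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (Enhanced SDRAM, ESDRAM), synchlink dynamic random access memory (Synchlink DRAM, SLDRAM), and direct Rambus random access memory (Direct Rambus RAM, DR RAM). It should be noted that the memory of the systems and methods described herein is intended to include, without being limited to, these and any other suitable types of memory.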
另外,本文中术语“系统”和“网络”在本文中常被可互换使用。本文中术语“和/或”,仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,本文中字符“/”,一般表示前后关联对象是一种“或”的关系。
应理解,在本发明实施例中,“与A相应的B”表示B与A相关联,根据A可以确定B。但还应理解,根据A确定B并不意味着仅仅根据A确定B,还可以根据A和/或其它信息确定B。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本发明的范围。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本发明各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本发明的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本发明各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。

Claims (30)

  1. 一种故障处理方法,用于在虚拟化计算机系统中进行故障处理,所述虚拟化计算机系统包括:虚拟架构管理系统、业务管理系统以及至少一个虚拟机,所述至少一个虚拟机运行在至少一台物理设备上,所述至少一个虚拟机用于执行业务应用,所述业务管理系统用于管理所述业务应用,所述虚拟架构管理系统用于管理所述至少一个虚拟机和所述至少一台物理设备,其特征在于,所述故障处理方法包括:
    所述虚拟架构管理系统获取故障告警消息,所述故障告警消息携带故障设备的标识信息和故障类型;
    所述虚拟架构管理系统根据所述故障告警消息确定第一虚拟机集合,所述第一虚拟机集合包括受所述故障设备影响的至少一个第一虚拟机;
    所述虚拟架构管理系统向所述业务管理系统发送状态告警消息,所述状态告警消息携带所述第一虚拟机集合的信息。
  2. 根据权利要求1所述的故障处理方法,其特征在于,所述故障处理方法还包括:
    所述虚拟架构管理系统根据所述故障告警消息确定所述第一虚拟机集合的影响信息,所述影响信息用于指示所述故障设备对所述至少一个第一虚拟机产生的影响的类型和/或级别;
    则所述虚拟架构管理系统向所述业务管理系统发送状态告警消息还携带所述第一虚拟机集合的影响信息。
  3. 根据权利要求2所述的故障处理方法,其特征在于,所述故障设备对所述至少一个第一虚拟机产生的影响的类型包括以下至少一种:故障、高风险、中风险、低风险或无影响。
  4. 根据权利要求1至3中任一所述故障处理方法,其特征在于,所述故障处理方法还包括:
    所述虚拟架构管理系统接收所述业务管理系统发送的第一请求消息,所述第一请求消息用于指示待恢复的虚拟机,所述待恢复的虚拟机为所述第一虚拟机集合中一个子集;
    所述虚拟架构管理系统根据所述第一请求信息优先恢复所述待恢复的虚拟机。
  5. 根据权利要求1至3中任一项所述故障处理方法,其特征在于,所述故障处理方法还包括:
    若所述虚拟架构管理系统在预置时间阈值内未接收到所述业务管理系统发送的所述第一请求信息,则按照预置虚拟机恢复策略恢复所述至少一个第一虚拟机。
  6. 根据权利要求1至5中任一项所述的故障处理方法,其特征在于,所述故障处理方法还包括:
    所述虚拟架构管理系统向所述业务管理系统发送状态告警清除消息。
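  7. A virtual infrastructure management system, configured to perform fault processing in a virtualized computer system, where the virtualized computer system includes a virtual infrastructure management system, a service management system, and at least one virtual machine, the at least one virtual machine runs on at least one physical device, the at least one virtual machine is configured to execute a service application, the service management system is configured to manage the service application, and the virtual infrastructure management system is configured to manage the at least one virtual machine and the at least one physical device, characterized in that the virtual infrastructure management system includes:
    an obtaining module, configured to obtain a fault alarm message, where the fault alarm message carries identification information of a faulty device and a fault type;
    a determining module, configured to determine a first virtual machine set according to the fault alarm message, where the first virtual machine set includes at least one first virtual machine affected by the faulty device;
    a sending module, configured to send a status alarm message to the service management system, where the status alarm message carries information about the first virtual machine set.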
  8. 根据权利要求7所述的虚拟架构管理系统,其特征在于,所述确定模块还用于根据所述故障告警消息确定所述第一虚拟机集合的影响信息,所述影响信息用于指示所述故障设备对所述至少一个第一虚拟机产生的影响的类型和/或级别;
    则所述发送模块向所述业务管理系统发送的所述状态告警消息还携带所述第一虚拟机集合的影响信息。
  9. 根据权利要求8所述的虚拟架构管理系统,其特征在于,所述故障设备对所述至少一个第一虚拟机产生的影响的类型包括以下至少一种:故障、高风险、中风险、低风险或无影响。
  10. 根据权利要求7至9中任一项所述的虚拟架构管理系统,其特征在于,所述虚拟架构管理系统还包括接收模块和恢复模块;
    所述接收模块,用于接收所述业务管理系统发送的第一请求消息,所述第一请求消息用于指示待恢复的虚拟机,所述待恢复的虚拟机为所述第一虚拟机集合中一个子集;
    所述恢复模块,用于根据所述第一请求信息优先恢复所述待恢复的虚拟机。
  11. 根据权利要求7至9中任一项所述的虚拟架构管理系统,其特征在于,所述恢复模块还用于在预置时间阈值内未接收到所述业务管理系统发送的所述第一请求信息时,按照预置虚拟机恢复策略恢复所述至少一个第一虚拟机。
  12. 根据权利要求7至11中任一项所述的虚拟架构管理系统,其特征在于,所述发送模块还用于向所述业务管理系统发送状态告警清除消息。
  13. 一种故障处理方法,用于在虚拟化计算机系统中进行故障处理,所述虚拟化计算机系统包括:虚拟架构管理系统、业务管理系统以及至少一个虚拟机,所述至少一个虚拟机运行在至少一台物理设备上,所述至少一个虚拟机用于执行业务应用,所述业务管理系统用于管理所述业务应用,所述虚拟架构管理系统用于管理所述至少一个虚拟机和所述至少一台物理设备;其特征在于,所述故障处理方法包括:
    所述业务管理系统接收所述虚拟架构管理系统发送的状态告警消息,所述状态告警消息携带受故障设备影响的第一虚拟机集合的信息,所述第一虚拟机集合中包括至少一个第一虚拟机;
    所述业务管理系统根据所述状态告警消息确定所述至少一个第一虚拟机关联的业务应用;
    所述业务管理系统对所述至少一个第一虚拟机关联的业务应用执行处理操作。
  14. 根据权利要求13所述的故障处理方法,其特征在于,所述状态告警消息还携带所述第一虚拟机集合的影响信息,所述影响信息用于指示所述故障设备对所述至少一个第一虚拟机产生的影响的类型和/或级别;
    则所述业务管理系统对所述至少一个第一虚拟机关联的业务应用执行处理操作,包括:
    所述业务管理系统根据所述第一虚拟机集合的影响信息对所述至少一个第一虚拟机关联的业务应用执行处理操作。
  15. 根据权利要求14所述的故障处理方法,其特征在于,所述第一虚拟机集合产生的影响的类型包括以下至少一种:故障、高风险、中风险、低风险或无影响。
  16. 根据权利要求13至15中任一所述故障处理方法,其特征在于,所述业务管理系统对所述至少一个第一虚拟机关联的业务应用执行处理操作包括以下方式中的至少一种:
    所述业务管理系统将所述至少一个第一虚拟机关联的业务应用切换至未受所述故障设备影响的虚拟机执行;或
    所述业务管理系统将所述至少一个第一虚拟机的应用状态信息标识为隔离状态,所述隔离状态用于指示所述至少一个第一虚拟机停止执行所述至少一个第一虚拟机关联的业务应用;或
    所述业务管理系统向所述虚拟架构管理系统发送第一请求消息,所述第一请求消息用于指示待恢复的虚拟机,所述待恢复的虚拟机为所述第一虚拟机集合中一个子集;或
    所述业务管理系统向所述至少一个第一虚拟机关联的业务应用的控制节点发送所述状态告警消息,以使得所述控制节点根据所述状态告警消息将所述至少一个第一虚拟机关联的业务应用切换至所述未受所述故障设备影响的虚拟机执行或将所述至少一个第一虚拟机的应用状态信息标识为所述隔离状态。
  17. 根据权利要求16所述的故障处理方法,其特征在于,所述业务管理系统向所述虚拟架构管理系统发送第一请求消息包括:
    所述业务管理系统根据所述第一虚拟机集合的影响信息确定第一请求消息;
    所述业务管理系统向所述虚拟架构管理系统发送所述第一请求消息。
  18. 根据权利要求17所述的故障处理方法,其特征在于,所述业务管理系统向所述虚拟架构管理系统发送所述第一请求消息,包括:
    所述业务管理系统根据所述至少一个第一虚拟机关联的业务应用的优先级向所述虚拟架构管理系统发送所述第一请求消息。
  19. 根据权利要求17或18所述的故障处理方法,其特征在于,所述业务管理系统向所述虚拟架构管理系统发送所述第一请求消息,包括:
    所述业务管理系统根据所述至少一个第一虚拟机关联的业务应用的部署模式向所述虚拟架构管理系统发送所述第一请求消息,所述至少一个第一虚拟机关联的业务应用的部署模式包括主备模式、负荷分担模式和单虚拟机模式中的至少一种。
  20. 根据权利要求13至19中任一项所述的故障处理方法,其特征在于,所述故障处理方法还包括:
    所述业务管理系统接收所述虚拟架构管理系统发送的状态告警清除消息;
    所述业务管理系统根据所述状态告警清除消息清除所述状态告警消息。
  21. 一种业务管理系统,用于在虚拟化计算机系统中进行故障处理,所述虚拟化计算机系统包括:虚拟架构管理系统、业务管理系统以及至少一个虚拟机,所述至少一个虚拟机运行在至少一台物理设备上,所述至少一个虚拟机用于执行业务应用,所述业务管理系统用于管理所述业务应用,所述虚拟架构管理系统用于管理所述至少一个虚拟机和所述至少一台物理设备,其特征在于,所述业务管理系统包括:
    接收模块,用于接收所述虚拟架构管理系统发送的状态告警消息,所述状态告警消息携带受故障设备影响的第一虚拟机集合的信息,所述第一虚拟机集合中包括至少一个第一虚拟机;
    确定模块,用于根据所述状态告警消息确定所述至少一个第一虚拟机关联的业务应用;
    处理模块,用于对所述至少一个第一虚拟机关联的业务应用执行处理操作。
  22. 根据权利要求21所述的业务管理系统,其特征在于,所述状态告警消息还携带所述第一虚拟机集合的影响信息,所述影响信息用于指示所述故障设备对所述至少一个第一虚拟机产生的影响的类型和/或级别;
    则所述处理模块对所述至少一个第一虚拟机关联的业务应用执行处理操作,包括根据所述第一虚拟机集合的影响信息对所述至少一个第一虚拟机关联的业务应用执行处理操作。
  23. 根据权利要求22所述的业务管理系统,其特征在于,所述第一虚拟机集合产生的影响的类型包括以下至少一种:故障、高风险、中风险、低风险或无影响。
  24. 根据权利要求21至23中任一所述业务管理系统,其特征在于,所述处理模块对所述至少一个第一虚拟机关联的业务应用执行处理操作包括以下方式中的至少一种:
    将所述至少一个第一虚拟机关联的业务应用切换至未受所述故障设备影响的虚拟机执行;或
    将所述至少一个第一虚拟机的应用状态信息标识为隔离状态,所述隔离状态用于指示所述至少一个第一虚拟机停止执行所述至少一个第一虚拟机关联的业务应用;或
    向所述虚拟架构管理系统发送第一请求消息,所述第一请求消息用于指示待恢复的虚拟机,所述待恢复的虚拟机为所述第一虚拟机集合中一个子集;或
    向所述至少一个第一虚拟机关联的业务应用的控制节点发送所述状态告警消息,以使得所述控制节点根据所述状态告警消息将所述至少一个第一虚拟机关联的业务应用切换至所述未受所述故障设备影响的虚拟机执行或将所述至少一个第一虚拟机的应用状态信息标识为所述隔离状态。
  25. 根据权利要求21至24任一所述的业务管理系统,其特征在于,所述确定模块还用于根据所述第一虚拟机集合的影响信息确定第一请求消息;
    其中,所述业务管理系统还包括发送模块,用于向所述虚拟架构管理系统发送所述第一请求消息。
  26. 根据权利要求25所述的业务管理系统,其特征在于,所述发送模块还用于根据所述至少一个第一虚拟机关联的业务应用的优先级向所述虚拟架构管理系统发送所述第一请求消息。
  27. 根据权利要求25或26所述的业务管理系统,其特征在于,所述发送模块还用于根据所述至少一个第一虚拟机关联的业务应用的部署模式向所述虚拟架构管理系统发送所述第一请求消息,所述至少一个第一虚拟机关联的业务应用的部署模式包括主备模式、负荷分担模式和单虚拟机模式中的至少一种。
  28. 根据权利要求21至27中任一项所述的业务管理系统,其特征在于,
    所述接收模块还用于接收所述虚拟架构管理系统发送的状态告警清除消息;
    所述处理模块还用于根据所述状态告警清除消息清除所述状态告警消息。
  29. 一种虚拟架构管理系统,其特征在于,所述虚拟架构管理系统包括处理器、存储器、通信接口和总线。其中,处理器、存储器、通信接口通过总线进行通信;所述存储器用于存储指令,所述虚拟架构管理系统运行时,所述处理器执行所述存储器存储的指令以利用所述虚拟架构管理系统中的硬件资源执行权利要求1至6中任一所述方法。
  30. 一种业务管理系统,其特征在于,所述业务管理系统包括处理器、存储器、通信接口和总线。其中,处理器、存储器、通信接口通过总线进行通信;所述存储器用于存储指令,所述业务管理系统运行时,所述处理器执行所述存储器存储的指令以利用所述业务管理系统中的硬件资源执行权利要求13至20中任一所述方法。
PCT/CN2017/085356 2016-09-22 2017-05-22 故障处理方法、虚拟架构管理系统和业务管理系统 WO2018054081A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610841392.3A CN106452846A (zh) 2016-09-22 2016-09-22 故障处理方法、虚拟架构管理系统和业务管理系统
CN201610841392.3 2016-09-22

Publications (1)

Publication Number Publication Date
WO2018054081A1 true WO2018054081A1 (zh) 2018-03-29

Family

ID=58166295

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/085356 WO2018054081A1 (zh) 2016-09-22 2017-05-22 故障处理方法、虚拟架构管理系统和业务管理系统

Country Status (2)

Country Link
CN (1) CN106452846A (zh)
WO (1) WO2018054081A1 (zh)

Also Published As

Publication number Publication date
CN106452846A (zh) 2017-02-22


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17852155

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17852155

Country of ref document: EP

Kind code of ref document: A1