CN112395047A - Virtual machine fault evacuation method, system and computer readable medium - Google Patents

Publication number
CN112395047A
Authority
CN
China
Prior art keywords
node
evacuation
nodes
monitoring
virtual machines
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011308545.0A
Other languages
Chinese (zh)
Inventor
朱从林
杨帅麒
雷准富
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huayun Data Holding Group Co Ltd
Original Assignee
Huayun Data Holding Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huayun Data Holding Group Co Ltd filed Critical Huayun Data Holding Group Co Ltd
Priority to CN202011308545.0A priority Critical patent/CN112395047A/en
Publication of CN112395047A publication Critical patent/CN112395047A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/302 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a software system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3051 Monitoring arrangements for monitoring the configuration of the computing system or of the computing system component, e.g. monitoring the presence of processing resources, peripherals, I/O links, software programs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G06F2009/45562 Creating, deleting, cloning virtual machine instances
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G06F2009/45591 Monitoring or debugging support
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G06F2009/45595 Network integration; Enabling network access in virtual machine instances

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Mathematical Physics (AREA)
  • Hardware Redundancy (AREA)

Abstract

The invention provides a virtual machine fault evacuation method, a virtual machine fault evacuation system and a computer readable medium. The method comprises: selecting at least two nodes and deploying a monitoring service on them, so that the monitoring service monitors the service state of the existing monitoring agents in each node; when a node in which virtual machines are deployed is detected to have failed, issuing evacuation priorities to the virtual machines deployed in the failed node and acquiring the configuration information of those virtual machines; and polling the health state of the nodes by the monitoring service, determining, when a node fails, at least one healthy target node among the remaining nodes that is capable of receiving the evacuated virtual machines, evacuating the virtual machines to the target node one by one in order of evacuation priority, and rebuilding them there. The method effectively avoids resource contention among virtual machines during evacuation and reduces the impact of fault evacuation on services.

Description

Virtual machine fault evacuation method, system and computer readable medium
Technical Field
The invention relates to the technical field of cloud computing, in particular to a virtual machine fault evacuation method, a virtual machine fault evacuation system and a computer readable medium.
Background
The cloud platform is deployed in a data center and virtualizes the hardware capability of the data center in order to provide it to customers. Common cloud platforms include OpenStack, VMware vSphere, XenServer, oVirt, etc. If a node goes down or another abnormal condition occurs while an OpenStack cloud platform is running, clients may be unable to connect, services may stop, and clients may be unable to operate normally. The current handling mechanism is as follows: if the node is still running, its virtual machines are manually migrated to other healthy nodes; if the node is down, recovery is mainly achieved in one of two ways. Method 1): according to a distributed node health check mechanism, core virtual machines are automatically evacuated, following a preset policy, to healthy nodes with sufficient reserved resources so as to reduce resource contention, while non-core virtual machines are automatically evacuated to other healthy nodes. Method 2): the virtual machine is restored from backup data on a designated healthy node. These prior-art approaches to evacuating and migrating failed virtual machines still have the following drawbacks.
The healthy node (group) or cluster reserved for evacuated core virtual machines cannot receive other newly evacuated virtual machines, so the resources of that node (group) or cluster are wasted. If non-core virtual machines are instead allowed to evacuate to the destination node (group) or cluster used by the core virtual machines, resource contention reappears and the evacuation order cannot be coordinated. As a result, global resources cannot be distributed evenly and resource utilization cannot be maximized, and the evacuation of some service virtual machines is prone to fail because the target node (group) or cluster has insufficient resources, leading to prolonged service downtime and affecting service continuity.
In view of the above, there is a need to improve an evacuation method of a virtual machine in a cloud platform in the prior art when a failure occurs, so as to solve the above problem.
Disclosure of Invention
The invention aims to disclose a virtual machine fault evacuation method, a virtual machine fault evacuation system and a computer readable medium, which overcome the defects of the prior art in evacuating the virtual machines deployed in a node of a server cluster when that node fails, and which solve the technical problems that originally healthy nodes cannot accept the evacuated virtual machines and that resource contention occurs among the nodes of the server cluster as a result of the evacuation.
In order to achieve the first object, the present invention provides a method for evacuating a virtual machine failure, including:
selecting at least two nodes and deploying a monitoring service on them, so that the monitoring service monitors the service state of the existing monitoring agents in each node;
when a node in which virtual machines are deployed is detected to have failed, issuing evacuation priorities to the virtual machines deployed in the failed node, and acquiring configuration information of the virtual machines deployed in the failed node;
and polling the health state of the nodes by the monitoring service, determining, when a node fails, at least one healthy target node among the remaining nodes that is capable of receiving the evacuated virtual machines, evacuating the virtual machines to the target node one by one in order of evacuation priority, and rebuilding the evacuated virtual machines in the target node according to the configuration information.
As a further improvement of the present invention, the existing monitoring agents include: a storage network monitoring agent deployed in all nodes for monitoring the storage network, a management network monitoring agent deployed in all nodes for monitoring the management network, and a service network monitoring agent deployed in all nodes for monitoring the service network.
As a further improvement of the present invention, after reconstructing the evacuated virtual machine according to the configuration information in the target node, the method further includes:
and updating the configuration information to obtain updated configuration information.
As a further improvement of the present invention, the method further comprises:
and when the failed node recovers, reviewing the evacuated virtual machines, and deleting the configuration information and the residual data corresponding to the evacuated virtual machines in the failed node.
As a further improvement of the present invention, the method further comprises:
and storing the acquired configuration information of the virtual machines deployed in the failed nodes in a database, and executing the operation of deleting the configuration information and the residual data corresponding to the evacuated virtual machines in the failed nodes according to the configuration information stored in the database.
As a further improvement of the present invention, each node is a computing node, a control node with computing node functionality, or a storage node with computing node functionality; the monitoring service deployed in a node forwards the health state information of that node to the monitoring service of any node; the monitoring services in all the nodes form a monitoring service cluster, which elects a master node; and the master node obtains the health state information of the remaining nodes and synchronizes it to any other node.
As a further improvement of the invention, the monitoring services in the nodes communicate with each other through the API interface of the node to which each belongs; the monitoring services acquire the health state information of each node in the server cluster and, if evacuation is needed, call the Nova interface to acquire the evacuation priorities of the virtual machines in the failed node and issue them to the virtual machines to be evacuated.
As a further improvement of the present invention, the evacuation priorities comprise several levels. Virtual machines that receive a high evacuation priority are evacuated first; the remaining virtual machines, which have a low evacuation priority, are evacuated to any target node whose resources are sufficient to receive them. The target nodes for both the high-priority virtual machines and the remaining low-priority virtual machines are determined from the remaining nodes based on Nova; each target node is healthy and its remaining resources meet the resources required to receive the evacuated virtual machines.
As a further improvement of the present invention, the method further comprises: and the target node determines a reconstruction priority according to the evacuation priority and determines the reconstruction sequence of the evacuated virtual machines in the target node according to the reconstruction priority.
Meanwhile, based on the same inventive concept, to achieve the second object, the present application further discloses a virtual machine fault evacuation system, including:
the evacuation priority generation component is used for issuing evacuation priorities to the virtual machines deployed in the failed nodes when determining that the nodes with the virtual machines deployed therein are detected to be failed, and acquiring configuration information of the virtual machines deployed in the failed nodes;
the monitoring component is deployed in the nodes of the server cluster and monitors the service state of the existing monitoring agents in the nodes; the monitoring component polls the health state of the nodes and, when a node fails, determines at least one healthy target node among the remaining nodes that is capable of receiving the evacuated virtual machines; the virtual machines are evacuated to the target node one by one in order of evacuation priority, and the evacuated virtual machines are rebuilt in the target node by the rebuilding component according to the configuration information.
As a further improvement of the present invention, the system further comprises:
a review component, which reviews the evacuated virtual machines after the failed node recovers and deletes the configuration information and the residual data corresponding to the evacuated virtual machines in the failed node.
As a further improvement of the present invention, the monitoring agent comprises: a storage network monitoring agent for monitoring the storage network, a management network monitoring agent for monitoring the management network, and a service network monitoring agent for monitoring the service network;
the monitoring components in the nodes communicate with each other through the API interfaces of their nodes; the monitoring components acquire the health state information of each node in the server cluster and, if evacuation is needed, call the Nova interface to acquire the evacuation priorities of the virtual machines in the failed node and issue them to the virtual machines to be evacuated.
As a further improvement of the present invention, the system further comprises:
a database,
wherein the evacuation priority generation component stores the acquired configuration information of the virtual machines deployed in the failed node in the database, and the review component deletes the configuration information and the residual data corresponding to the evacuated virtual machines in the failed node according to the configuration information stored in the database.
Finally, based on the same inventive concept, to achieve the third objective, the present application further discloses a computer-readable medium, in which computer program instructions are stored, and the computer program instructions are read and executed by a processor to perform the steps of the virtual machine fault evacuation method according to any one of the above inventions.
Compared with the prior art, the invention has the beneficial effects that:
In the present application, when any node in the server cluster fails, the virtual machines in the failed node can be evacuated, in order of evacuation priority, to one or more target nodes that are healthy and whose remaining resources meet the resources required to receive the evacuated virtual machines. This solves the technical problem that originally healthy nodes cannot accept evacuated virtual machines, effectively avoids resource contention among the nodes of the server cluster during evacuation, and reduces the impact of fault evacuation on services.
Drawings
FIG. 1 is a flowchart illustrating a method for evacuating a virtual machine failure according to the present invention;
FIG. 2 is a topology diagram of monitoring tools deployed on various node servers in a server cluster;
FIG. 3 is a diagram of an example of a specific node server in a server cluster with virtual machines having different evacuation priorities being evacuated to other node servers;
FIG. 4 is an example diagram in which the virtual machines in a failed node server are evacuated to another node server and rebuilt on that healthy node server, and in which the failed node server, after recovering, reviews the virtual machines that remained on it from before the failure;
FIG. 5 is a diagram illustrating a detailed example of the reconstruction and recovery process;
FIG. 6 is a topology diagram of a virtual machine fault evacuation system according to the present invention;
FIG. 7 is a topology diagram of a computer readable medium of the present invention.
Detailed Description
The present invention is described in detail with reference to the embodiments shown in the drawings, but it should be understood that these embodiments are not intended to limit the present invention, and those skilled in the art should understand that functional, methodological, or structural equivalents or substitutions made by these embodiments are within the scope of the present invention.
Before explaining the various embodiments of the present application in detail, the meanings of technical terms appearing in the present application are briefly described.
The term "Resource(s)" includes, but is not limited to, one or any combination of virtual computing resources, virtual storage resources, virtual network resources, interface resources, IP address resources, MAC address resources, or hardware resources formed in a scenario based on a cloud platform, a data center, a server cluster, and the like.
The term "Cloud host" refers to the business processing unit that is formed by configuring the aforementioned resources and that responds to the user; it can be understood as identical or equivalent to "Virtual machine" (or "VM"). A cloud host or virtual machine (VM) deployed in a particular node provides various services or responses, directly or indirectly, to a user.
The first embodiment is as follows:
Referring to FIG. 1 to FIG. 5, this embodiment discloses a specific implementation of a virtual machine fault evacuation method (hereinafter referred to as the "method").
Referring to FIG. 2, the method may be applied to a cloud platform based server cluster 100. The cloud platform virtualizes the hardware capability of the data center in order to provide it to customers. Examples of cloud platforms include OpenStack, VMware vSphere, XenServer, oVirt and the like. One or more virtual machines are deployed in the cloud platform, together with nodes that provide computing services, storage services and network services.
The server cluster 100 includes node 1, node 2 and node 3, whose resources are formed into various virtualized resources by the virtualization system 200 and mounted to the virtual machines. Node 1, node 2 and node 3 are connected to each other through the IPMI network 40. The server cluster 100 may also include a greater number of nodes, or only two nodes. The database 45 is accessible to the nodes via the IPMI network 40. The configuration information of the virtual machines deployed in the nodes is determined by accessing the database 45 of the server cluster 100 over the management network 20. Node 1 deploys an API interface 401, node 2 deploys an API interface 402, and node 3 deploys an API interface 403, each of which accesses the IPMI network 40. Any pair of nodes among node 1, node 2 and node 3 (for example, node 1 and node 2) can access each other's data, and each can receive the virtual machines evacuated from the other when a fault occurs. In the present application, "node server" and "node" have the same meaning and may be an x86 computer, a physical server, or the like having resources.
Referring to FIG. 5, when node 1 fails while node 2 and node 3 have not failed (i.e. are in a healthy state), VM11, VM12 and VM13 in node 1 need to be evacuated; if all three virtual machines are evacuated to node 3, node 3 is called the target node. In the present embodiment, the high, medium and low levels of the evacuation priority issued to the virtual machines in each node are relative. The applicant sets VM11 to a high evacuation priority, VM12 to a medium evacuation priority, and VM13 to a low evacuation priority, so VM11, VM12 and VM13 are evacuated in that order. Whether VM11 is evacuated to node 2 or node 3 may be random; preferably, however, it is first determined which of node 2 and node 3 has remaining resources sufficient to receive the evacuated virtual machine, and that node is used as the target node. The evacuation process described above can be seen in FIG. 4. Of course, if a large number of virtual machines in node 1 are to be evacuated, VM11, VM12 and VM13 may each be migrated to one of several target nodes. The node to which a virtual machine is evacuated depends on which nodes in the server cluster 100 are healthy and on the real-time status of their remaining resources.
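The preferred target selection described above can be illustrated with a minimal sketch. The `Node` and `VM` records, their field names, and the resource figures are assumptions made for illustration only; the application does not specify how remaining resources are represented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    healthy: bool
    free_vcpus: int    # hypothetical measure of remaining CPU
    free_ram_mb: int   # hypothetical measure of remaining memory


@dataclass
class VM:
    name: str
    vcpus: int
    ram_mb: int


def candidate_targets(vm: VM, nodes: List[Node]) -> List[Node]:
    """Return the healthy nodes whose remaining resources can hold the VM."""
    return [
        n for n in nodes
        if n.healthy and n.free_vcpus >= vm.vcpus and n.free_ram_mb >= vm.ram_mb
    ]


# Node 1 has failed; node 2 is healthy but too small for VM11.
nodes = [
    Node("node1", False, 16, 32768),
    Node("node2", True, 2, 4096),
    Node("node3", True, 8, 16384),
]
vm11 = VM("VM11", 4, 8192)
print([n.name for n in candidate_targets(vm11, nodes)])  # ['node3']
```

With these figures only node 3 qualifies, which matches the example in which node 3 becomes the target node.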
Referring to FIG. 3, existing monitoring agents are configured in each of nodes 1 to 3. The existing monitoring agents include: a storage network monitoring agent deployed in all nodes for monitoring the storage network 10, a management network monitoring agent deployed in all nodes for monitoring the management network 20, and a service network monitoring agent deployed in all nodes for monitoring the service network 30. The storage network monitoring agent, the management network monitoring agent and the service network monitoring agent in each node are managed by the monitoring service (411 to 413) of the node to which they belong.
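One way a monitoring service might combine the reports of its three agents is sketched below. The rule used here (a node counts as healthy only when all three agents report success, and a missing report counts as a failure) is an assumption; the application does not state how agent states are combined.

```python
from enum import Enum
from typing import Dict


class Network(Enum):
    STORAGE = "storage network 10"
    MANAGEMENT = "management network 20"
    SERVICE = "service network 30"


def node_is_healthy(agent_states: Dict[Network, bool]) -> bool:
    """Assumed rule: every agent must report OK; a missing report is a failure."""
    return all(agent_states.get(net, False) for net in Network)


all_ok = {Network.STORAGE: True, Network.MANAGEMENT: True, Network.SERVICE: True}
storage_down = dict(all_ok)
storage_down[Network.STORAGE] = False

print(node_is_healthy(all_ok))        # True
print(node_is_healthy(storage_down))  # False
```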
Referring to fig. 1, in the present embodiment, the method includes the following steps S1 to S3.
First, step S1 is executed: at least two nodes are selected and a monitoring service is deployed on them, so that the monitoring service monitors the service state of the existing monitoring agents in each node.
Then, step S2 is executed: when a node in which virtual machines are deployed is detected to have failed, evacuation priorities are issued to the virtual machines deployed in the failed node, and the configuration information of those virtual machines is acquired. The evacuation priority can be custom-configured by a user on a terminal device that has a UI.
In this embodiment, the virtual machines in a single node are issued different evacuation priorities, and a high evacuation priority is issued to the core virtual machine, so that the core virtual machine is evacuated first when a specific node fails or goes down. Alternatively, within a single node one virtual machine can be given a high evacuation priority while several other virtual machines to be evacuated are given the same priority as one another. For example, a high evacuation priority is issued to VM11, and VM12 and VM13 are issued evacuation priorities of the same level as each other. Whether VM12 and VM13 are both evacuated to node 2, both to node 3, or one to each depends on the resources required by VM12 and VM13.
Finally, step S3 is executed: the monitoring service polls the health status of its own node; when that node fails, at least one healthy target node capable of receiving the evacuated virtual machines is determined from the remaining nodes, the virtual machines VM11 to VM13 are evacuated to the target node (i.e., node 3) one by one in order of evacuation priority, and the evacuated virtual machines are rebuilt in the target node according to the configuration information. The "evacuation priority order" refers to the relative level, high or low, of the evacuation priority issued to each virtual machine in the same node. Issuing evacuation priorities to the virtual machines improves the robustness of evacuating the virtual machines in a node when that node fails.
For example, when node 1 fails, the monitoring service 411 of node 1 detects by polling that the health status of node 1 is abnormal and triggers the virtual machine evacuation process. At this time, VM21, VM22 and VM23 are deployed in node 2, and VM31, VM32 and VM33 are deployed in node 3. The monitoring service deployed in each node monitors the IPMI network 40, the storage network 10, the management network 20 and the service network 30. The monitoring services of the three nodes (monitoring service 411, monitoring service 412 and monitoring service 413) perform distributed health checks on these four networks and, when one or more nodes fail, promptly notify the other nodes and evacuate the virtual machines in the failed node(s).
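The polling behaviour described above might look like the following sketch. The probe callables and the rule of declaring a node failed after a fixed number of consecutive failed polls are assumptions added for illustration; the application only says that the monitoring service polls the health state.

```python
from typing import Callable, Dict, List


def poll_once(
    probes: Dict[str, Callable[[], bool]],
    failures: Dict[str, int],
    threshold: int = 3,
) -> List[str]:
    """Run every probe once and return the nodes that have just reached
    `threshold` consecutive failed polls (assumed failure criterion)."""
    newly_failed = []
    for node, probe in probes.items():
        if probe():
            failures[node] = 0  # a successful poll resets the counter
        else:
            failures[node] = failures.get(node, 0) + 1
            if failures[node] == threshold:
                newly_failed.append(node)
    return newly_failed


# Hypothetical probes: node 1 never answers, node 2 always answers.
probes = {"node1": lambda: False, "node2": lambda: True}
failures: Dict[str, int] = {}
results = [poll_once(probes, failures) for _ in range(4)]
print(results)  # [[], [], ['node1'], []]
```

Reporting a node only at the moment it crosses the threshold means the evacuation flow would be triggered once per failure rather than on every later poll.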
Each of nodes 1 to 3 is a computing node, a control node with computing node functionality, or a storage node with computing node functionality. The monitoring service deployed in a node forwards the health state information of that node to the monitoring service of any node. The monitoring services in all the nodes (i.e., the monitoring service 411, the monitoring service 412 and the monitoring service 413) form a monitoring service cluster, which elects a master node; the master node acquires the health state information of the remaining nodes and synchronizes it to any other node. For example, when node 1 fails, the monitoring service cluster elects node 3 as the master node and node 2 becomes a slave node.
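The election and synchronization step can be sketched as follows. The election rule (the healthy node with the highest name wins) is an arbitrary assumption, chosen only so that the sketch reproduces the example in which node 3 becomes the master; the application does not state which election algorithm the monitoring service cluster uses.

```python
from typing import Dict, Optional


def elect_master(health: Dict[str, bool]) -> Optional[str]:
    """Assumed rule: pick the healthy node whose name sorts last."""
    healthy = sorted(name for name, ok in health.items() if ok)
    return healthy[-1] if healthy else None


def sync_health(master: str, health: Dict[str, bool]) -> Dict[str, Dict[str, bool]]:
    """The master hands a copy of the cluster health view to every other healthy node."""
    return {
        name: dict(health)
        for name, ok in health.items()
        if ok and name != master
    }


health = {"node1": False, "node2": True, "node3": True}
master = elect_master(health)
views = sync_health(master, health)
print(master)         # node3
print(sorted(views))  # ['node2']
```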
The monitoring services in the nodes communicate with each other through the API interfaces 401 to 403 of their nodes. The monitoring services acquire the health state information of each node in the server cluster 100 and, if evacuation is needed, call the Nova interface to acquire the evacuation priorities of the virtual machines in the failed node and issue them to the virtual machines to be evacuated, for example to VM11, VM12 and VM13 in node 1. VM11 has a high evacuation priority, VM12 a medium evacuation priority, and VM13 a low evacuation priority, so VM11 is evacuated first, VM12 second, and VM13 last. Nova is the compute fabric controller of the OpenStack cloud platform. All activities that support the lifecycle of instances in the OpenStack cloud platform are handled by Nova, which makes Nova the platform responsible for managing computing resources, networks, authentication, and the required scalability.
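The ordering of evacuation requests by priority can be sketched as below. `FakeCompute` and its `get_priority` and `evacuate` methods are hypothetical stand-ins invented for this sketch; they are not the Nova API, and the three priority labels are taken from the example above.

```python
from typing import Dict, List, Tuple

PRIORITY_RANK = {"high": 0, "medium": 1, "low": 2}


class FakeCompute:
    """Hypothetical stand-in for the compute service (not the Nova API)."""

    def __init__(self, priorities: Dict[str, str]) -> None:
        self._priorities = priorities
        self.evacuated: List[Tuple[str, str]] = []

    def get_priority(self, vm: str) -> str:
        return self._priorities[vm]

    def evacuate(self, vm: str, target: str) -> None:
        self.evacuated.append((vm, target))


def evacuate_failed_node(compute: FakeCompute, vms: List[str], target: str) -> List[str]:
    """Evacuate the VMs one by one, highest evacuation priority first."""
    ordered = sorted(vms, key=lambda vm: PRIORITY_RANK[compute.get_priority(vm)])
    for vm in ordered:
        compute.evacuate(vm, target)
    return ordered


compute = FakeCompute({"VM11": "high", "VM12": "medium", "VM13": "low"})
order = evacuate_failed_node(compute, ["VM13", "VM11", "VM12"], "node3")
print(order)  # ['VM11', 'VM12', 'VM13']
```

Because Python's `sorted` is stable, virtual machines that share the same priority keep the order in which they were listed.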
In this embodiment, the method further includes: storing the acquired configuration information of the virtual machines deployed in the failed node in the database 45, and deleting the configuration information and the residual data corresponding to the evacuated virtual machines in the failed node according to the configuration information stored in the database 45. After the evacuated virtual machines are rebuilt in the target node according to the configuration information, the method further comprises: updating the configuration information to obtain updated configuration information. In addition, both before and after VM11 to VM13 are evacuated, their current configuration information needs to be written to the database 45 again.
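The bookkeeping around the database 45 might be sketched with an in-memory SQLite table as follows. The table layout, the column names, and the helper function names are all assumptions for illustration; the application does not describe the schema of the database 45.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vm_config ("
    "vm TEXT PRIMARY KEY, origin_host TEXT, current_host TEXT, config TEXT)"
)


def record_before_evacuation(vm: str, host: str, config: dict) -> None:
    """Save the VM's configuration before it leaves the failed node."""
    conn.execute(
        "INSERT OR REPLACE INTO vm_config VALUES (?, ?, ?, ?)",
        (vm, host, host, json.dumps(config)),
    )


def record_after_rebuild(vm: str, target: str) -> None:
    """Update the record once the VM has been rebuilt on the target node."""
    conn.execute("UPDATE vm_config SET current_host = ? WHERE vm = ?", (target, vm))


def residual_vms(recovered_host: str) -> list:
    """VMs that originated on the recovered node but now run elsewhere;
    their leftovers on the recovered node are candidates for deletion."""
    rows = conn.execute(
        "SELECT vm FROM vm_config "
        "WHERE origin_host = ? AND current_host != ? ORDER BY vm",
        (recovered_host, recovered_host),
    )
    return [row[0] for row in rows]


record_before_evacuation("VM11", "node1", {"vcpus": 4, "ram_mb": 8192})
record_before_evacuation("VM12", "node1", {"vcpus": 2, "ram_mb": 4096})
record_after_rebuild("VM11", "node3")
print(residual_vms("node1"))  # ['VM11']
```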
As shown in fig. 5, when it is determined that the VM11, which has the high evacuation priority, is to be evacuated to the node 3, a query for the configuration information of the VM11 is first issued to the database 45 (see DB in fig. 5). Then, when the node 3 is in a healthy state and its remaining resources are larger than the resources required by the VM11 according to the queried configuration information, a rebuild operation is performed in the node 3 according to that configuration information to form the VM11; the VM12 and the VM13 are formed in the same manner. To perform the rebuild operation, the node 3 calls the configuration information that the VM11 saved in the database 45 before the evacuation operation was performed, so that the rebuild of the VM11 in the node 3 is guided accurately.
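The query-then-rebuild sequence of fig. 5 can be sketched as below. This is an illustration under stated assumptions: resources are reduced to vCPUs and RAM, a plain dictionary stands in for the database 45, and `Node` and `rebuild_on` are hypothetical names. The strict comparison follows the wording that the remaining resources must be larger than the required resources.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool
    free_vcpus: int
    free_ram_mb: int


def rebuild_on(node, vm, config_db):
    """Query the saved configuration of a VM and rebuild it on the target node."""
    config = config_db[vm]  # query step ("DB" in fig. 5)
    if not node.healthy:
        raise RuntimeError(f"{node.name} is not healthy")
    # The remaining resources must be strictly larger than the requirement.
    if node.free_vcpus <= config["vcpus"] or node.free_ram_mb <= config["ram_mb"]:
        raise RuntimeError(f"{node.name} lacks resources for {vm}")
    node.free_vcpus -= config["vcpus"]
    node.free_ram_mb -= config["ram_mb"]
    # The rebuilt VM keeps its saved configuration but now lives on the target node.
    return {**config, "vm": vm, "host": node.name}


node3 = Node("node3", healthy=True, free_vcpus=8, free_ram_mb=16384)
config_db = {
    "VM11": {"vcpus": 4, "ram_mb": 8192},
    "VM12": {"vcpus": 4, "ram_mb": 8192},
}
rebuilt = rebuild_on(node3, "VM11", config_db)
```

In this example a second call for VM12 would be refused, because node 3 is left with exactly the resources VM12 needs rather than more than it needs.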
Preferably, the method disclosed in this embodiment further includes: the target node (namely the node 3) determines a reconstruction priority according to the evacuation priority, and determines the order in which the evacuated virtual machines are rebuilt in the target node according to the reconstruction priority. Therefore, VM11 is rebuilt first in the node 3, VM12 second, and VM13 last. Generally, the virtual machine with the high evacuation priority may be a service virtual machine that supports upper-layer services. Restoring the service system as early as possible shortens the interruption of the service that the VM11 provides to users and reduces service downtime.
The evacuation priorities comprise evacuation priorities of different levels. The virtual machines that receive a high evacuation priority are evacuated preferentially, and the remaining virtual machines with a low evacuation priority are evacuated to any target node whose resources satisfy the resources required by those virtual machines. The target nodes that receive the virtual machines with the high evacuation priority and the remaining virtual machines with the low evacuation priority are determined from the remaining nodes based on Nova; each target node is healthy, and its remaining resources satisfy the resources required by the evacuated virtual machines it receives. Therefore, when the virtual machines of the failed node are evacuated to the target node, the resources of the server cluster are allocated reasonably, the situation in which the target node cannot support the operation of the evacuated virtual machines after receiving them is prevented, and the resources of the nodes in the server cluster 100 are used reasonably and balanced among the nodes.
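The target selection described above can be sketched as a small placement routine. In the patent the target nodes are determined through Nova; the code below does not call Nova and only illustrates the stated constraints (healthy target, sufficient remaining resources, higher priority first). The single scalar resource unit and the tie-breaking rule of preferring the node with the most free resources are assumptions, and `plan_evacuation` is a hypothetical name.

```python
def plan_evacuation(vms, nodes):
    """Map each VM to a target node, or to None if no node can accept it.

    vms:   list of (name, priority, required_units); lower priority value first.
    nodes: dict of node name -> {"healthy": bool, "free": int}.
    """
    # Only healthy nodes are candidates; track their remaining resources.
    free = {n: info["free"] for n, info in nodes.items() if info["healthy"]}
    plan = {}
    for name, _priority, need in sorted(vms, key=lambda v: v[1]):
        # Assumption: prefer the candidate with the most free resources.
        candidates = sorted(free, key=free.get, reverse=True)
        target = next((n for n in candidates if free[n] >= need), None)
        plan[name] = target
        if target is not None:
            free[target] -= need
    return plan


nodes = {
    "node1": {"healthy": False, "free": 0},
    "node2": {"healthy": True, "free": 6},
    "node3": {"healthy": True, "free": 8},
}
vms = [("VM13", 2, 4), ("VM11", 0, 4), ("VM12", 1, 4)]
plan = plan_evacuation(vms, nodes)
```

Because each placement reduces the free resources of the chosen node before the next virtual machine is considered, a node is never assigned more than it can support.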
In this embodiment, the method further includes: when the failed node is recovered, reviewing the evacuated virtual machines, and deleting the configuration information and the residual data corresponding to the evacuated virtual machines in the failed node. After the node 1 is restored, the VMs 11 to 13 evacuated to the node 3 need to be migrated to the node 1 again. Deleting the configuration information and the residual data corresponding to the evacuated virtual machines in the failed node means deleting the residual data of the VMs 11 to 13 in the node 1, that is, the data of the already rebuilt virtual machines that remains on the node itself (i.e., the node 1), thereby preventing the virtual machine data that the node 1 held before the failure from being evacuated repeatedly.
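The cleanup performed during the review can be sketched as follows. The source does not describe how residual data is identified, so this sketch assumes that a virtual machine counts as evacuated when its stored configuration names a host other than the recovered node; dictionaries stand in for the database 45 and for the data left on disk, and `review_after_recovery` is a hypothetical name.

```python
def review_after_recovery(recovered_node, residual, config_db):
    """Delete leftovers of VMs that were evacuated away from a recovered node.

    residual:  dict of node name -> set of VM names with data left on that node.
    config_db: dict of VM name -> {"host": current host, ...}.
    Returns the sorted list of VM names whose residual data was deleted.
    """
    leftovers = residual.get(recovered_node, set())
    removed = sorted(
        vm for vm in leftovers
        if config_db.get(vm, {}).get("host") != recovered_node
    )
    residual[recovered_node] = leftovers - set(removed)
    return removed


residual = {"node1": {"VM11", "VM12", "VM13", "VM14"}}
config_db = {
    "VM11": {"host": "node3"},
    "VM12": {"host": "node3"},
    "VM13": {"host": "node3"},
    "VM14": {"host": "node1"},  # hypothetical VM that was never evacuated
}
removed = review_after_recovery("node1", residual, config_db)
```

Removing the stale copies before the node rejoins the cluster is what keeps the same virtual machine data from being picked up for evacuation a second time.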
With the method disclosed in this embodiment, when a cloud platform node fails, the affected virtual machines are automatically evacuated to other healthy nodes and rebuilt there without manual intervention at any stage. This solves the problems that an existing cloud platform cannot automatically detect the health state of its nodes and cannot automatically evacuate the virtual machines of a failed node to a healthy node. At the same time, the risk of prolonged service downtime and the cost of manual intervention in virtual machine evacuation are reduced, service continuity is enhanced, and the user experience of the cloud platform is improved.
More importantly, the method disclosed in this embodiment solves the technical problem that, during virtual machine evacuation, an originally healthy node may be unable to accept the evacuated virtual machines. It effectively avoids resource contention among the nodes of the server cluster during evacuation, and reduces adverse effects on services, such as interruptions and timeout delays, during virtual machine fault evacuation.
Example two:
Referring to fig. 6, the present embodiment discloses a specific implementation of a virtual machine fault evacuation system. The virtual machine fault evacuation system 400 (hereinafter referred to as the "system") performs the steps of the virtual machine fault evacuation method disclosed in embodiment one. The system 400 communicates with the server cluster 100 so that, when a node fails, the virtual machines deployed in that node of the server cluster 100 are evacuated to any one or more healthy nodes according to their evacuation priorities and rebuilt there, and are reviewed after the failed node is recovered. The server cluster 100 may be understood as a data center or a cloud platform deployed in a data center. When one or more nodes fail, the system 400 performs the evacuation operations and the subsequent rebuild operations on the virtual machines of those nodes according to the evacuation priorities issued to the virtual machines.
In this embodiment, a virtual machine fault evacuation system 400 includes:
The evacuation priority generation component 41 is configured to, when it is determined that a node in which virtual machines are deployed has failed, issue evacuation priorities to the virtual machines deployed in the failed node and acquire the configuration information of the virtual machines deployed in the failed node.
The monitoring component 42 is deployed in the nodes (node 1 to node 3) of the server cluster and monitors the service state of the existing monitoring agents in each node. The monitoring component 42 polls and detects the health state of the nodes and, when a node fails, determines from the remaining nodes at least one healthy target node capable of receiving the evacuated virtual machines. The virtual machines are evacuated to the target node one by one in the order of their evacuation priorities, and the evacuated virtual machines are rebuilt in the target node by the reconstruction component 43 according to the configuration information. The monitoring component 42 is implemented as computer-executable logic, namely the monitoring service 411 to the monitoring service 413 of embodiment one.
In this embodiment, the system 400 further includes: a review component 44, configured to review the evacuated virtual machines after the failed node is recovered and to delete the configuration information and the residual data corresponding to the evacuated virtual machines in the failed node.
The monitoring agents include: a storage network monitoring agent for monitoring the storage network, a management network monitoring agent for monitoring the management network, and a service network monitoring agent for monitoring the service network. The monitoring components in the nodes communicate with each other through the API interfaces of their respective nodes and acquire the health state information of each node in the server cluster. If evacuation is needed, the monitoring components call the Nova interface to acquire the evacuation priorities of the virtual machines in the failed node and send each evacuation priority to the virtual machine to be evacuated.
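The polling of the three monitoring agents can be sketched as below. The source does not say how the agent states are combined or how many failed polls mark a node as failed, so both rules here (all three agents must report healthy, and a configurable number of consecutive failed polls triggers the failure verdict) are assumptions; `HealthPoller` and the agent keys are hypothetical names.

```python
AGENTS = ("storage_network", "management_network", "service_network")


class HealthPoller:
    """Track consecutive failed polls per node (threshold is an assumption)."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.misses = {}

    def record(self, node, agent_status):
        """Record one poll result and return True if the node is deemed failed."""
        # Assumption: a poll succeeds only if every agent reports healthy.
        ok = all(agent_status.get(agent, False) for agent in AGENTS)
        self.misses[node] = 0 if ok else self.misses.get(node, 0) + 1
        return self.misses[node] >= self.threshold


poller = HealthPoller(threshold=2)
down = {"storage_network": True, "management_network": False, "service_network": True}
first = poller.record("node1", down)   # one failed poll, below the threshold
second = poller.record("node1", down)  # second consecutive failed poll
```

Requiring several consecutive failed polls is one common way to avoid evacuating virtual machines because of a single transient network glitch.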
The system 400 further comprises: a database 45. The evacuation priority generation component 41 stores the acquired configuration information of the virtual machines deployed in the failed node in the database 45, and the review component 44 deletes the configuration information and the residual data corresponding to the evacuated virtual machines in the failed node according to the configuration information stored in the database 45.
Meanwhile, the nodes 1 to 3 in the server cluster 100 disclosed in this embodiment may also be hyper-converged nodes. Each hyper-converged node can be built from one hyper-converged server.
For the technical solutions of the system disclosed in this embodiment that are the same as those of embodiment one, refer to embodiment one; they are not described here again.
Example three:
Referring to FIG. 7, the present embodiment discloses a specific implementation of a computer readable medium 700. The computer-readable medium 700 may be disposed, in whole or in part, in a physical computer, server, server cluster, or data center. The computer readable medium 700 stores computer program instructions 701, and when the computer program instructions 701 are read and executed by a processor 702, the steps of the virtual machine fault evacuation method disclosed in embodiment one are performed.
Alternatively, the computer-readable medium 700 may be configured as a server that runs on a physical device used to build a private cloud, a hybrid cloud, or a public cloud. The computer readable medium 700 may also be configured as a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The computer readable medium 700 is used for storing a program, and the processor 702, upon receiving an execution instruction, executes the virtual machine fault evacuation method disclosed in embodiment one.
Meanwhile, the processor 702 of the present embodiment may be an integrated circuit chip having signal processing capability. The processor 702 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor 702 may implement or perform the various methods, steps and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor or any conventional processor.
For the technical solutions of the computer-readable medium disclosed in this embodiment that are the same as those of embodiment one and/or embodiment two, refer to embodiment one and/or embodiment two; they are not described here again.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and various other media capable of storing program code.
The detailed description above is only a specific description of feasible embodiments of the present invention and is not intended to limit the scope of protection of the present invention; equivalent embodiments or modifications made without departing from the technical spirit of the present invention shall fall within the scope of protection of the present invention.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although this description is organized by embodiments, not every embodiment contains only a single independent technical solution. The description is written this way for clarity only; those skilled in the art should take the description as a whole, and the technical solutions of the embodiments may be combined as appropriate to form other embodiments understandable to those skilled in the art.

Claims (14)

1. A virtual machine fault evacuation method is characterized by comprising the following steps:
selecting at least two nodes and deploying a monitoring service in them, so as to monitor the service state of the existing monitoring agents in each node through the monitoring service;
when it is determined that a node in which virtual machines are deployed has failed, issuing evacuation priorities to the virtual machines deployed in the failed node, and acquiring configuration information of the virtual machines deployed in the failed node;
polling and detecting, by the monitoring service, the health state of the nodes; when a node fails, determining from the remaining nodes at least one healthy target node capable of receiving the evacuated virtual machines; evacuating the virtual machines to the target node one by one in the order of their evacuation priorities; and rebuilding the evacuated virtual machines in the target node according to the configuration information.
2. The method of claim 1, wherein the existing monitoring agent comprises: a storage network monitoring agent deployed in all nodes for monitoring the storage network, a management network monitoring agent deployed in all nodes for monitoring the management network, and a traffic network monitoring agent deployed in all nodes for monitoring the traffic network.
3. The method of claim 1, further comprising, after rebuilding the evacuated virtual machines in the target node according to the configuration information:
and updating the configuration information to obtain updated configuration information.
4. The method of claim 3, further comprising:
when the failed node is recovered, reviewing the evacuated virtual machines, and deleting the configuration information and the residual data corresponding to the evacuated virtual machines in the failed node.
5. The method of claim 4, further comprising:
and storing the acquired configuration information of the virtual machines deployed in the failed nodes in a database, and executing the operation of deleting the configuration information and the residual data corresponding to the evacuated virtual machines in the failed nodes according to the configuration information stored in the database.
6. The method according to claim 1, wherein the node is defined as a computing node, a control node with a computing node function, or a storage node with a computing node function, the monitoring service deployed in the node forwards the health status information of the node to the monitoring service of any node, the monitoring services in all the nodes form a monitoring service cluster, and a master node is elected by the monitoring service cluster, and the master node obtains the health status information of the remaining nodes to synchronize the health status information to any other node.
7. The method according to claim 6, wherein the monitoring services in each node communicate with each other through the API (application program interface) of the node to which the monitoring services belong, the monitoring services acquire the health state information of each node in the server cluster, and if evacuation is needed, the Nova interface is called to acquire the evacuation priority of the virtual machine in the failed node and send the evacuation priority to the virtual machine which is subjected to evacuation.
8. The method of claim 7, wherein the evacuation priorities comprise different levels of evacuation priorities, the virtual machines receiving a high evacuation priority are preferentially evacuated, the remaining virtual machines having a low evacuation priority are evacuated to any target node that meets the resources required by the virtual machines receiving the evacuated low evacuation priority, the target nodes receiving the virtual machines having the high evacuation priority and the virtual machines having the remaining low evacuation priority are determined from the remaining nodes based on Nova, the target nodes are healthy, and the remaining resources of the target nodes meet the resources required by the virtual machines receiving the evacuated.
9. The method according to any one of claims 1 to 8, further comprising: and the target node determines a reconstruction priority according to the evacuation priority and determines the reconstruction sequence of the evacuated virtual machines in the target node according to the reconstruction priority.
10. A virtual machine fault evacuation system, comprising:
an evacuation priority generation component, configured to, when it is determined that a node in which virtual machines are deployed has failed, issue evacuation priorities to the virtual machines deployed in the failed node and acquire configuration information of the virtual machines deployed in the failed node;
a monitoring component, deployed in the nodes of the server cluster and configured to monitor the service state of the existing monitoring agents in the nodes, wherein the monitoring component polls and detects the health state of the nodes and, when a node fails, determines from the remaining nodes at least one healthy target node capable of receiving the evacuated virtual machines, the virtual machines are evacuated to the target node one by one in the order of their evacuation priorities, and the evacuated virtual machines are rebuilt in the target node by a reconstruction component according to the configuration information.
11. The system of claim 10, further comprising:
a review component, configured to review the evacuated virtual machines after the failed node is recovered and to delete the configuration information and the residual data corresponding to the evacuated virtual machines in the failed node.
12. The system of claim 11, wherein the monitoring agent comprises: a storage network monitoring agent for monitoring the storage network, a management network monitoring agent for monitoring the management network, and a service network monitoring agent for monitoring the service network;
the monitoring components in the nodes communicate with each other through the API (application program interface) interfaces of their respective nodes and acquire the health state information of each node in the server cluster, and if evacuation is needed, the monitoring components call the Nova interface to acquire the evacuation priorities of the virtual machines in the failed node and send each evacuation priority to the virtual machine to be evacuated.
13. The system of claim 11, further comprising:
a database,
the evacuation priority generation component stores the acquired configuration information of the virtual machines deployed in the failed node in a database, and the review component deletes the configuration information and the residual data corresponding to the evacuated virtual machines in the failed node according to the configuration information stored in the database.
14. A computer readable medium having stored thereon computer program instructions which, when read and executed by a processor, perform the steps of the virtual machine fault evacuation method of any one of claims 1-9.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011308545.0A CN112395047A (en) 2020-11-20 2020-11-20 Virtual machine fault evacuation method, system and computer readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011308545.0A CN112395047A (en) 2020-11-20 2020-11-20 Virtual machine fault evacuation method, system and computer readable medium

Publications (1)

Publication Number Publication Date
CN112395047A true CN112395047A (en) 2021-02-23

Family

ID=74607600

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011308545.0A Pending CN112395047A (en) 2020-11-20 2020-11-20 Virtual machine fault evacuation method, system and computer readable medium

Country Status (1)

Country Link
CN (1) CN112395047A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102929715A (en) * 2012-10-31 2013-02-13 曙光云计算技术有限公司 Method and system for scheduling network resources based on virtual machine migration
CN103701627A (en) * 2012-09-27 2014-04-02 北京搜狐新媒体信息技术有限公司 Cloud computing platform fault detection method, cloud computing platform fault detection method, solving method and solving device
CN103856548A (en) * 2012-12-07 2014-06-11 华为技术有限公司 Dynamic resource scheduling method and dynamic resource scheduler
CN109818785A (en) * 2019-01-15 2019-05-28 无锡华云数据技术服务有限公司 A kind of data processing method, server cluster and storage medium
CN111181780A (en) * 2019-12-21 2020-05-19 苏州浪潮智能科技有限公司 HA cluster-based host pool switching method, system, terminal and storage medium

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113765709A (en) * 2021-08-23 2021-12-07 中国人寿保险股份有限公司上海数据中心 Openstack cloud platform-based multi-dimensional monitoring-based high-availability realization system and method for virtual machine
CN113765709B (en) * 2021-08-23 2022-09-20 中国人寿保险股份有限公司上海数据中心 Openstack cloud platform-based multi-dimensional monitoring-based high-availability realization system and method for virtual machine
CN113923215A (en) * 2021-09-09 2022-01-11 深信服科技股份有限公司 Virtual machine scheduling method, electronic device and storage medium
CN113923215B (en) * 2021-09-09 2023-12-29 深信服科技股份有限公司 Virtual machine scheduling method, electronic equipment and storage medium
CN114048004A (en) * 2021-11-22 2022-02-15 北京志凌海纳科技有限公司 High-availability batch scheduling method, device, equipment and storage medium for virtual machines
CN114257601A (en) * 2021-12-16 2022-03-29 杭州谐云科技有限公司 Cloud edge cooperative cluster construction method and system
CN114257601B (en) * 2021-12-16 2023-11-17 杭州谐云科技有限公司 Cloud-edge cooperative cluster construction method and system
CN114598665A (en) * 2022-01-19 2022-06-07 锐捷网络股份有限公司 Resource scheduling method and device, computer readable storage medium and electronic equipment
EP4274176A1 (en) * 2022-05-04 2023-11-08 Red Hat, Inc. Data preservation for node evacuation in unstable nodes within a mesh
US11868219B2 (en) 2022-05-04 2024-01-09 Red Hat, Inc. Data preservation for node evacuation in unstable nodes within a mesh
CN115733734A (en) * 2022-10-11 2023-03-03 北京市建筑设计研究院有限公司 Service node repairing method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN112395047A (en) Virtual machine fault evacuation method, system and computer readable medium
US10860439B2 (en) Failover and recovery for replicated data instances
US9817721B1 (en) High availability management techniques for cluster resources
US20210326167A1 (en) Vnf service instantiation method and apparatus
US10372565B2 (en) Method and apparatus for failover processing
US10771318B1 (en) High availability on a distributed networking platform
US8910160B1 (en) Handling of virtual machine migration while performing clustering operations
CN109656742B (en) Node exception handling method and device and storage medium
US11593234B2 (en) Cloud restart for VM failover and capacity management
CN103595801B (en) Cloud computing system and real-time monitoring method for virtual machine in cloud computing system
CN105681077A (en) Fault processing method, device and system
CN108347339B (en) Service recovery method and device
CN113872997B (en) Container group POD reconstruction method based on container cluster service and related equipment
US11153173B1 (en) Dynamically updating compute node location information in a distributed computing environment
CN111431980A (en) Distributed storage system and path switching method thereof
CN111181780A (en) HA cluster-based host pool switching method, system, terminal and storage medium
CN111935244A (en) Service request processing system and super-integration all-in-one machine
CN106612314A (en) System for realizing software-defined storage based on virtual machine
CN109818785B (en) Data processing method, server cluster and storage medium
CN104052799B (en) A kind of method that High Availabitity storage is realized using resource ring
CN116112569B (en) Micro-service scheduling method and management system
CN109284169B (en) Big data platform process management method based on process virtualization and computer equipment
CN116192885A (en) High-availability cluster architecture artificial intelligent experiment cloud platform data processing method and system
CN114363356A (en) Data synchronization method, system, device, computer equipment and storage medium
CN107783855B (en) Fault self-healing control device and method for virtual network element

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination