CN111897626A - Cloud computing scene-oriented virtual machine high-reliability system and implementation method - Google Patents

Info

Publication number
CN111897626A
Authority
CN
China
Prior art keywords
control component
mode
virtual machine
abnormal condition
recovery
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010644139.5A
Other languages
Chinese (zh)
Inventor
王昊
高泽旭
肖丁
吴江
李涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fiberhome Telecommunication Technologies Co Ltd
Original Assignee
Fiberhome Telecommunication Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fiberhome Telecommunication Technologies Co Ltd filed Critical Fiberhome Telecommunication Technologies Co Ltd
Priority to CN202010644139.5A priority Critical patent/CN111897626A/en
Publication of CN111897626A publication Critical patent/CN111897626A/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G06F2009/45575 Starting, stopping, suspending or resuming virtual machine instances
    • G06F2009/45591 Monitoring or debugging support
    • G06F2009/45595 Network integration; Enabling network access in virtual machine instances
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network

Abstract

The invention discloses a cloud computing scene-oriented virtual machine high-reliability system and an implementation method thereof. The virtual machine high-reliability system comprises a control node and a plurality of computing nodes, wherein the control node is provided with at least one control component and each computing node is provided with an execution component. The execution component performs anomaly detection on the computing node where it is located and reports the corresponding detection information to the control component. The control component judges from the detection information whether the corresponding computing node is abnormal and, if so, triggers an alarm and recovers from the abnormal condition according to the HA mode of the system. The invention adopts a multi-HA-mode design: the HA mode can be flexibly configured for different application scenarios or requirements, and the system can be flexibly switched between HA modes.

Description

Cloud computing scene-oriented virtual machine high-reliability system and implementation method
Technical Field
The invention belongs to the field of cloud computing, and particularly relates to a cloud computing scene-oriented virtual machine high-reliability system and an implementation method.
Background
In the field of cloud computing, the services provided are typically divided into three layers: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). At the infrastructure service layer, the service offered to consumers is mainly the use of physical infrastructure, which can be divided into computing, storage and network resources, and is generally delivered in the form of virtual machines.
In an IaaS cloud platform system, virtual machine HA (High Availability, abbreviated as HA) is a key technology for ensuring stable and continuous operation of virtual machines and is one of the basic core capabilities of a cloud platform. With virtual machine high-reliability technology, when a computing server or network system used by the cloud platform has a software or hardware fault, the anomaly is identified by a detection mechanism and, according to the action policy preset for that abnormal scenario, the virtual machines running on the computing server are automatically migrated to other normal servers to ensure that they continue to operate normally.
Currently, HA in the industry is generally fully automatic: the system decides from the abnormal condition whether to execute an HA operation. However, as cloud platforms are applied in more and more industries, the differing usage scenarios and operation and maintenance levels of cloud data centers place higher demands on HA capability, and a fully automatic mode alone is no longer sufficient. The main problems of the current technology are as follows: only a single HA mode is supported; and, because of the high degree of automation, defects in design or implementation are easily amplified (for example, misjudging an abnormal scenario can affect the normal operation of the whole system), which also limits further adoption of HA technology.
Disclosure of Invention
In view of the defects of, or the need to improve on, the prior art, the invention provides a cloud computing scene-oriented virtual machine high-reliability system and an implementation method. The aim is to adopt a multi-HA-mode design in which the HA mode can be flexibly configured for different application scenarios or requirements and flexibly switched, thereby solving the technical problem of abnormal-scenario misjudgment caused by a high degree of automation.
In order to achieve the above object, according to one aspect of the present invention, a cloud computing scene-oriented virtual machine high-reliability system is provided, where the virtual machine high-reliability system includes a control node and a plurality of computing nodes, the control node is provided with at least one control component, and the computing nodes are provided with execution components;
the execution component is used for performing anomaly detection on the computing node where the execution component is located and reporting corresponding detection information to the control component;
the control component is used for judging whether the corresponding computing node is abnormal according to the detection information and, if so, triggering an alarm and recovering from the abnormal condition according to the HA mode of the system;
the HA modes comprise a manual mode, a semi-automatic mode and a full-automatic mode, and can be selectively configured according to actual requirements;
when the HA mode is the manual mode, the control component is used for receiving a recovery strategy from operation and maintenance personnel and performing recovery according to the recovery strategy;
when the HA mode is the semi-automatic mode, the control component is used for receiving a recovery strategy from operation and maintenance personnel and performing recovery according to the recovery strategy, and if the abnormal condition is not relieved within a preset time threshold, the control component automatically recovers from the abnormal condition;
and when the HA mode is the full-automatic mode, the control component automatically recovers from the abnormal condition.
Preferably, the control component is further configured to obtain host names corresponding to all the computing nodes, and perform HA modeling for each computing node according to the host name and parameters required by the model table.
Preferably, the control node is provided with one master control component and two standby control components, and when the master control component fails, a standby control component provides the high-reliability service externally.
Preferably, the master control component is configured to register a service lock with the ETCD cluster and periodically renew the service lock;
after the service lock exceeds its set life cycle, the service lock is automatically released, a standby control component automatically grabs the lock, and that standby control component is thereby promoted to master control component.
Preferably, the virtual machine high-reliability system comprises a plurality of application program interfaces, and the plurality of application program interfaces are arranged on the control node to provide corresponding services for the outside;
the application program interface comprises an HA object list query interface, an HA object query interface, an HA trigger interface, an HA pause interface, an HA recovery interface and an HA historical task query interface.
According to another aspect of the present invention, a cloud computing scene-oriented virtual machine high-reliability implementation method is provided, where the implementation method is applied to a virtual machine high-reliability system, the virtual machine high-reliability system includes a control node and a plurality of computing nodes, the control node is provided with at least one control component, and the computing nodes are provided with execution components;
the implementation method comprises the following steps:
the execution component carries out anomaly detection on the computing node where the execution component is located and reports corresponding detection information to the control component;
the control component judges whether the corresponding computing node is abnormal or not according to the detection information;
if the corresponding computing node is abnormal, triggering an alarm and recovering the abnormal condition according to the HA mode of the system;
when the HA mode is a manual mode, the control component receives a recovery strategy of operation and maintenance personnel, and system recovery is carried out according to the recovery strategy;
when the HA mode is a semi-automatic mode, the control component receives a recovery strategy of operation and maintenance personnel, the recovery is carried out according to the recovery strategy, and if the abnormal condition is not eliminated within a preset time threshold, the control component automatically carries out the recovery of the abnormal condition;
and when the HA mode is a full-automatic mode, the control component automatically recovers the abnormal condition.
Preferably, before the control component determines whether the corresponding computing node is abnormal according to the detection information, the method further includes:
and the control component acquires host names corresponding to all the computing nodes, and performs HA modeling according to the host names and parameters required by the model table for each computing node.
Preferably, the control node is provided with one master control component and two standby control components;
the master control component registers a service lock with the ETCD cluster and periodically renews the service lock;
when the master control component fails, the service lock can no longer be renewed; after the service lock exceeds its set life cycle, it is automatically released, a standby control component automatically grabs the lock, and that standby control component is thereby promoted to master control component.
Preferably, when the HA mode is the semi-automatic mode, the steps in which the control component receives a recovery strategy from operation and maintenance personnel, performs recovery according to the recovery strategy, and automatically recovers from the abnormal condition if it is not eliminated within a preset time threshold include:
judging whether the virtual machine on the abnormal computing node needs HA operation;
if HA operation is needed, the control component receives a recovery strategy from operation and maintenance personnel and performs recovery according to the recovery strategy;
judging whether the abnormal condition is relieved;
if the abnormal condition is relieved, clearing the alarm;
if the abnormal condition is not relieved, judging whether the manual recovery time exceeds the preset time threshold;
if the preset time threshold is exceeded, judging whether an HA pause request for a specified computing node has been received;
if no HA pause request has been received, reporting an alarm notification and starting the automatic process, in which the control component automatically recovers from the abnormal condition;
and if an HA pause request has been received, stopping the automatic process on the specified computing node and updating the state of the corresponding HA object on the specified host to a paused state.
Preferably, when the HA mode is the full-automatic mode, the automatic recovery from the abnormal condition by the control component includes:
judging whether the virtual machine on the abnormal computing node needs HA operation;
if the HA operation is needed, reporting an alarm, detecting all the computing nodes, and recording the total number of hosts needing the HA operation;
calculating the occupation proportion of the total number of the hosts needing HA operation relative to the total number of the hosts of the system;
judging whether the occupation proportion is larger than a set proportion threshold value or not;
if the ratio is larger than the set ratio threshold, stopping issuing HA operation to the corresponding computing node;
and if the ratio is not greater than the set ratio threshold, performing HA action according to a host list needing HA operation so as to eliminate the alarm.
In general, compared with the prior art, the technical scheme of the invention has the following beneficial effects. The invention provides a cloud computing scene-oriented virtual machine high-reliability system and an implementation method thereof. The virtual machine high-reliability system comprises a control node and a plurality of computing nodes, wherein the control node is provided with at least one control component and the computing nodes are provided with execution components; the execution component performs anomaly detection on the computing node where it is located and reports corresponding detection information to the control component; the control component judges whether the corresponding computing node is abnormal according to the detection information and, if so, triggers an alarm and recovers from the abnormal condition according to the HA mode of the system. The HA modes comprise a manual mode, a semi-automatic mode and a full-automatic mode, and can be selectively configured according to actual requirements: in the manual mode, the control component receives a recovery strategy from operation and maintenance personnel and performs recovery according to it; in the semi-automatic mode, the control component does the same, and if the abnormal condition is not relieved within a preset time threshold, the control component automatically recovers from the abnormal condition; in the full-automatic mode, the control component automatically recovers from the abnormal condition.
The invention adopts a multi-HA-mode design: the HA mode can be flexibly configured for different application scenarios or requirements, and the system can be flexibly switched between HA modes. The manual mode effectively reduces the additional overhead and the probability of misjudgment caused by full-automatic actions, and suits scenarios where the hardware environment is relatively stable and operation and maintenance staffing is assured; the semi-automatic mode retains the automatic HA capability while providing a configurable time window for operation and maintenance personnel, so that if the abnormal condition is simple and can be repaired quickly, the automatic process need not be triggered to rebuild the virtual machine.
Drawings
Fig. 1 is a schematic structural diagram of a virtual machine high-reliability system oriented to a cloud computing scenario according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of a virtual machine high-reliability implementation method oriented to a cloud computing scenario according to an embodiment of the present invention;
Fig. 3 is a schematic flowchart of the implementation method in the manual mode according to an embodiment of the present invention;
Fig. 4 is a schematic flowchart of the implementation method in the semi-automatic mode according to an embodiment of the present invention;
Fig. 5 is a schematic flowchart of the implementation method in the full-automatic mode according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
In order to facilitate understanding of the technical solution of the present invention, the following terms are first explained:
HA: High Availability, abbreviated HA, usually describes a system that is specially designed to reduce downtime and so keep its services highly reliable.
HASTACK Server: the control component of the high-reliability system. It is responsible for providing the high-reliability service externally and acts as the brain of HA management: it manages global HA behavior, analyzes system anomalies, and applies different execution strategies to recover from abnormal conditions depending on the HA mode.
HASTACK Agent: the execution component of the high-reliability system, deployed on each computing node. It works together with the HASTACK Server, is responsible for detecting abnormal conditions and raising alarms, and carries out the actual HA operations in each HA mode.
ETCD: a highly available distributed key-value database implemented in the Go language, which guarantees strong consistency through a consensus algorithm.
API: Application Programming Interface, abbreviated API. The system exposes its core capabilities through APIs for external access and invocation.
Example 1:
Referring to Fig. 1, this embodiment provides a cloud computing scene-oriented virtual machine high-reliability system. The system includes a control node and a plurality of computing nodes; the control node is provided with at least one control component (HASTACK Server), each computing node is provided with an execution component (HASTACK Agent), and the control node and the computing nodes are, specifically, hosts.
The execution component is used for performing anomaly detection on the computing node where it is located and reporting the corresponding detection information to the control component. The detection information includes whether services are running normally, whether the storage network is normal, whether storage system reads and writes are normal, and the like.
The control component is used for judging whether the corresponding computing node is abnormal according to the detection information and, if so, triggering an alarm and recovering from the abnormal condition according to the HA mode of the system. The control component makes this judgment at the global level, based on the detection information reported by the computing nodes, to decide whether a computing node needs an HA operation. An HA policy matrix is built into the system, and the control component performs the HA operation that the matrix specifies for the abnormal condition.
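As an illustration of how reported detection information might be mapped to an action through a policy matrix, the following Python sketch may help. The patent does not disclose the contents of the HA policy matrix, so every entry, field name and action label below is an assumption chosen only to show the lookup pattern.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionReport:
    """One report from an execution component (field names are assumptions)."""
    host: str
    service_ok: bool
    storage_network_ok: bool
    storage_io_ok: bool


# Hypothetical policy matrix keyed by
# (service_ok, storage_network_ok, storage_io_ok).
# The real matrix is not given in the source; these rows are illustrative.
POLICY_MATRIX = {
    (True, True, True): "none",
    (False, True, True): "restart_service",
}

# Assumed fallback for every combination not listed above.
DEFAULT_ACTION = "ha_operation"


def decide(report: DetectionReport) -> str:
    """Look up the action for a report; unknown combinations use the fallback."""
    key = (report.service_ok, report.storage_network_ok, report.storage_io_ok)
    return POLICY_MATRIX.get(key, DEFAULT_ACTION)
```

Keeping the matrix as data rather than branching logic means the response to a given combination of symptoms can be changed without touching the decision code.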
In this embodiment, the virtual machine high-reliability system supports three HA modes: full-automatic mode, semi-automatic mode and manual mode. In actual use, the mode can be selected according to requirements, with manual mode as the default configuration. The system supports Ceph (a distributed file system) and SAN (Storage Area Network) storage devices.
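A minimal Python sketch of a configurable, switchable HA mode with manual mode as the default is shown here. The class and function names are hypothetical and not taken from the patent; the mapping of modes to who drives recovery follows the three mode definitions in this document.

```python
from enum import Enum


class HAMode(Enum):
    MANUAL = "manual"
    SEMI_AUTO = "semi-automatic"
    FULL_AUTO = "full-automatic"


class HAConfig:
    """Holds the system-wide HA mode; the default is manual, as in the text."""

    def __init__(self, mode: HAMode = HAMode.MANUAL):
        self.mode = mode

    def switch(self, new_mode: HAMode) -> HAMode:
        """Switch to another mode and return the previous one."""
        if not isinstance(new_mode, HAMode):
            raise ValueError(f"unsupported HA mode: {new_mode!r}")
        old_mode, self.mode = self.mode, new_mode
        return old_mode


def recovery_driver(mode: HAMode) -> str:
    """Describe who drives recovery in each mode."""
    return {
        HAMode.MANUAL: "operator",
        HAMode.SEMI_AUTO: "operator, then control component after timeout",
        HAMode.FULL_AUTO: "control component",
    }[mode]
```

For example, `HAConfig().mode` is `HAMode.MANUAL`, and calling `switch(HAMode.SEMI_AUTO)` changes the mode at run time without recreating the configuration object.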
When the HA mode is the manual mode, the control component is used for receiving a recovery strategy from operation and maintenance personnel and performing system recovery according to it. In manual mode the automatic HA capability of the system is disabled: when an anomaly occurs an alarm is reported, and operation and maintenance personnel handle it and decide whether to trigger a host-granularity HA action. The ability to automatically bring virtual machines back up in place after the host restarts is still supported.
When the HA mode is the semi-automatic mode, the control component is used for receiving a recovery strategy from operation and maintenance personnel and performing system recovery according to it; if the abnormal condition is not relieved within a preset time threshold, the control component automatically recovers from the abnormal condition. In detail: the control component receives a recovery strategy from operation and maintenance personnel and performs recovery according to it; it then judges whether the abnormal condition is relieved and, if so, clears the alarm. If the abnormal condition is not relieved, it judges whether the manual recovery time exceeds the preset time threshold. If the threshold is exceeded, it judges whether an HA pause request for a specified computing node has been received. If no such request has been received, it reports an alarm notification and starts the automatic process, in which the control component automatically recovers from the abnormal condition. If an HA pause request has been received, it stops the automatic process on the specified computing node and updates the state of the corresponding HA object on the specified host to a paused state.
In semi-automatic mode, a configurable period for manual intervention is inserted before HA actions are handled automatically. When an anomaly occurs, an alarm is raised first, and the control component then starts a timer and waits for manual handling. If the abnormal scenario is eliminated within the time threshold, no automatic HA processing is performed; if it still exists when the time threshold is exceeded, processing continues with the automatic flow. In addition, to give operation and maintenance personnel more flexibility in handling anomalies, the system provides functions for pausing and resuming automatic HA. Used together, these functions let the automatic HA capability of the system be stopped temporarily so that personnel have enough time to resolve the problem; once it is resolved, the HA capability only needs to be resumed.
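The semi-automatic flow can be read as a small decision function evaluated each time the control component re-checks an abnormal host. The Python sketch below is illustrative only: the function name, parameter names and returned step labels are assumptions, not identifiers from the patent.

```python
def semi_auto_next_step(resolved: bool,
                        elapsed_s: float,
                        threshold_s: float,
                        pause_requested: bool) -> str:
    """Return the next step for one abnormal host in semi-automatic mode.

    resolved        -- the abnormal condition has been relieved
    elapsed_s       -- seconds spent so far on manual recovery
    threshold_s     -- configurable manual-intervention window, in seconds
    pause_requested -- an HA pause request was received for this host
    """
    if resolved:
        return "clear_alarm"
    if elapsed_s <= threshold_s:
        # Still inside the manual window: keep waiting for the operator.
        return "wait_for_operator"
    if pause_requested:
        # Window exceeded, but automation is paused for this host.
        return "mark_paused"
    # Window exceeded and no pause request: hand over to the automatic flow.
    return "start_automatic_recovery"
```

Note that the pause request is only consulted once the time window has been exceeded, which matches the order of the checks in the flow described above.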
When the HA mode is the full-automatic mode, the control component automatically recovers from the abnormal condition. The specific implementation is as follows: judge whether the virtual machines on the abnormal computing node need HA operation; if so, report an alarm, check all computing nodes, and record the total number of hosts needing HA operation; calculate the proportion of hosts needing HA operation relative to the total number of hosts in the system and judge whether it is greater than a set proportion threshold (configurable, for example 50%); if the proportion is greater than the threshold, stop issuing HA operations to the corresponding computing nodes; if it is not greater than the threshold, perform the HA action according to the list of hosts needing HA operation, so as to clear the alarm.
In full-automatic mode, this global HA pre-judgment logic is added to ensure that large-area virtual machine HA does not occur in the live network.
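The pre-judgment step amounts to a ratio check before any HA operation is issued. A minimal Python sketch follows, assuming a hypothetical function name and an empty list as the signal for "withhold all HA operations"; the 50% default mirrors the example value in the text.

```python
def plan_full_auto(hosts_needing_ha, total_hosts, ratio_threshold=0.5):
    """Return the hosts on which HA should be executed, or [] to withhold.

    If the share of hosts needing HA is greater than the threshold, the
    anomaly is treated as possibly global and no HA operation is issued.
    """
    if total_hosts <= 0:
        raise ValueError("total_hosts must be positive")
    ratio = len(hosts_needing_ha) / total_hosts
    if ratio > ratio_threshold:
        return []  # too many hosts affected: stop issuing HA operations
    return list(hosts_needing_ha)
```

With four hosts in the system, one or two abnormal hosts are recovered automatically, while three abnormal hosts (75%) cause the HA operations to be withheld.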
In this embodiment, the control node is provided with one master control component and two standby control components; when the master control component fails, a standby control component provides the high-reliability service externally.
Although the master/standby mode improves the reliability of the system, an abnormal network partition can cause a dual-master split-brain among the control components, which affects the judgment mechanism of the whole high-reliability system and can lead to a large-area fault of the cloud platform system.
In a preferred embodiment, a service lock layer is added to avoid the dual-master split-brain problem. Specifically, the control node is provided with a service lock service, which is an additional function of the overall high-reliability system. The service lock is a piece of information stored in the ETCD cluster: the control components store their service lock information in the ETCD cluster, and this information determines which control component acts as master and which as standby. The service lock service is responsible for ensuring that, when the HASTACK service itself suffers a system fault, a multi-master condition cannot arise, thereby guaranteeing the reliability of the HASTACK service itself.
The specific implementation is as follows: the master control component registers a service lock in the ETCD cluster and periodically renews it; after the service lock exceeds its set life cycle, it is automatically released, a standby control component automatically grabs the lock, and that standby control component is thereby promoted to master control component.
In actual use, when the main service process of a control component starts, it first checks whether a service lock exists in the ETCD cluster and, if not, registers the service lock in the ETCD cluster. When another, standby service tries to start, it likewise checks whether a service lock exists in the ETCD cluster; if it finds an existing lock, it returns immediately and takes no action.
Furthermore, each service lock has a life cycle (time-to-live) setting. Under normal conditions, the master control component must keep renewing the life cycle of the lock so that the lock stays valid. If the master control component becomes abnormal, the service lock is automatically released once it exceeds its life cycle, and a standby control component automatically grabs the lock, which triggers its promotion to master control component.
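The register, renew and expire behavior of the service lock can be sketched with a small in-memory stand-in written in Python. This is not the etcd client API: `LockStore` and its methods are hypothetical, and time is passed in explicitly so the behavior is easy to follow. In a real deployment the same role would be played by a key in the ETCD cluster with a limited lifetime.

```python
class LockStore:
    """In-memory stand-in for the ETCD cluster holding one service lock."""

    def __init__(self):
        self.holder = None      # name of the current master, if any
        self.expires_at = 0.0   # time at which the lock is released

    def _expire(self, now):
        # Release the lock automatically once its life cycle has elapsed.
        if self.holder is not None and now >= self.expires_at:
            self.holder = None

    def acquire(self, name, ttl, now):
        """Try to register the lock; succeeds only if no valid lock exists."""
        self._expire(now)
        if self.holder is None:
            self.holder, self.expires_at = name, now + ttl
            return True
        return False

    def renew(self, name, ttl, now):
        """Extend the life cycle; only the current holder may renew."""
        self._expire(now)
        if self.holder == name:
            self.expires_at = now + ttl
            return True
        return False
```

A short walk-through: `server-1` acquires the lock at time 0 with a life cycle of 10 and renews it at time 8, so it is valid until time 18. If `server-1` then stops renewing, an attempt by `server-2` at time 17 fails, while an attempt at time 18 succeeds and makes `server-2` the master. A later renewal attempt by `server-1` is rejected, so two masters never hold the lock at once.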
In this embodiment, the virtual machine high-reliability system further includes a plurality of application program interfaces (APIs), which are disposed on the control node to provide corresponding services externally; the application program interfaces may be standard REST (Representational State Transfer) APIs.
The application program interface comprises an HA object list query interface, an HA object query interface, an HA trigger interface, an HA pause interface, an HA recovery interface and an HA historical task query interface.
Specifically, the HA object list query interface is configured to query HA object list information of all hosts in the system; the HA object query interface is used for querying HA detailed information of a single host in the system; the HA trigger interface (host granularity) is used for triggering HA actions of one or a plurality of hosts, the HA pause interface (host granularity) is used for pausing the system automation HA actions, the HA recovery interface (host granularity) is used for recovering the system automation HA actions, and the HA historical task query interface is used for querying all historical HA action records in the system.
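To make the six interfaces concrete, the Python sketch below models them as methods of an in-memory service. It is only a shape illustration: the patent does not give URL paths, payloads or state names, so the method names, the `"active"`/`"paused"` states and the history record format are all assumptions.

```python
class HAService:
    """In-memory model of the six HA interfaces (all names hypothetical)."""

    def __init__(self, hosts):
        self.objects = {h: {"host": h, "state": "active"} for h in hosts}
        self.history = []

    def list_ha_objects(self):
        """HA object list query: HA objects of all hosts."""
        return list(self.objects.values())

    def get_ha_object(self, host):
        """HA object query: details for a single host."""
        return self.objects[host]

    def trigger_ha(self, hosts):
        """HA trigger (host granularity): record an HA action per host."""
        unknown = [h for h in hosts if h not in self.objects]
        if unknown:
            raise KeyError(f"unknown hosts: {unknown}")
        for h in hosts:
            self.history.append({"host": h, "action": "ha_triggered"})

    def pause_ha(self, host):
        """HA pause (host granularity): suspend automatic HA for a host."""
        self.objects[host]["state"] = "paused"

    def resume_ha(self, host):
        """HA recovery (host granularity): resume automatic HA for a host."""
        self.objects[host]["state"] = "active"

    def list_history(self):
        """HA historical task query: all recorded HA actions."""
        return list(self.history)
```

In a REST deployment each method would typically sit behind its own endpoint, with the host name carried in the request path or body.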
The virtual machine high-reliability system of this embodiment has at least the following advantages. Because it is designed with multiple HA modes, the HA mode can be configured flexibly for different application scenarios or requirements, and the system can switch between HA modes. The manual mode reduces the additional overhead and the probability of misjudgment caused by full-automatic actions, and suits scenarios in which the hardware environment is relatively stable and operation and maintenance staffing is guaranteed. The semi-automatic mode retains the automated HA capability while giving operation and maintenance personnel a configurable time window: if the abnormal condition is simple and can be repaired quickly within that window, there is no need to trigger the automated process and rebuild the virtual machines.
In addition, the high-reliability control component uses a distributed service lock design, which avoids the multi-master split-brain problem, prevents misjudged HA actions, and effectively improves the overall reliability of the virtual machines in the system. Moreover, the system exposes APIs externally, making it easy for other systems to integrate with and call it.
Example 2:
With reference to embodiment 1, this embodiment provides a cloud computing scenario-oriented virtual machine high-reliability implementation method. The method is applied to a virtual machine high-reliability system that includes a control node and a plurality of computing nodes; the control node is provided with at least one control component, and each computing node is provided with an execution component.
Referring to fig. 2, the implementation method includes the following steps:
Step 101: The execution component performs anomaly detection on the computing node where it is located and reports the corresponding detection information to the control component.
The control component obtains the host names of all computing nodes and performs HA modeling for each computing node according to its host name and the parameters required by the model table. The modeling process is as follows: for each computing node, the control component checks whether a corresponding HA object already exists in the database; if not, it generates the model table of the HA object and stores it in the database; if so, it moves on to the next computing node. This continues until HA modeling of all computing nodes is complete.
The information of a computing node includes its host name, and the parameters required by the model table are shown in Table 1 below. The HAStack Server performs object modeling according to the host name and the parameters required by the model table, and stores the result in the corresponding HA object table of the database.
Table 1: Schematic of the data format of the HA object table (Figure BDA0002572510040000121; table image not reproduced in this text)
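The modeling loop described above (create an HA object only for hosts that do not yet have one) might look like the following Python sketch. It is an illustration under assumptions: the database is modeled as a plain dictionary keyed by host name, and `HAObject` carries only the `status` field and `semi_auto_pause` flag that the description names elsewhere; the remaining columns of Table 1 are not available here and are omitted. The default values and the state names "active" and "suspended" are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class HAObject:
    hostname: str
    status: str = "active"         # assumed default state name
    semi_auto_pause: bool = False  # flag named in the description; default assumed


def model_compute_nodes(hostnames: Iterable[str], db: Dict[str, HAObject]) -> List[str]:
    """Create an HA object for every host that is not yet in the database.

    Returns the host names that were newly modeled.
    """
    created: List[str] = []
    for hostname in hostnames:
        if hostname in db:
            continue  # HA object already exists: move on to the next node
        db[hostname] = HAObject(hostname=hostname)
        created.append(hostname)
    return created


# compute-1 was modeled earlier and is currently suspended; it must not be overwritten.
db = {"compute-1": HAObject("compute-1", status="suspended", semi_auto_pause=True)}
new_hosts = model_compute_nodes(["compute-1", "compute-2", "compute-3"], db)
print(new_hosts)  # ['compute-2', 'compute-3']
```

Skipping existing objects makes the modeling step safe to repeat at every start-up, since state already recorded for a host survives a restart of the control component.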
Step 102: The control component determines, according to the detection information, whether the corresponding computing node is abnormal.
Step 103: If the corresponding computing node is abnormal, an alarm is triggered and the abnormal condition is recovered according to the HA mode of the system.
In this embodiment, the virtual machine high-reliability system supports three HA modes: full-automatic mode, semi-automatic mode, and manual mode. In practice, the mode can be configured as required.
Step 104: When the HA mode is the manual mode, the control component receives a recovery strategy from operation and maintenance personnel and performs system recovery according to that strategy.
In the manual mode, the automated HA capability of the system is disabled. When an abnormality occurs, an alarm is reported, and operation and maintenance personnel handle it and decide whether host-granularity HA actions need to be triggered. The system can still automatically restart virtual machines in place after the host reboots. See example 3 for a specific implementation.
Step 105: When the HA mode is the semi-automatic mode, the control component receives a recovery strategy from operation and maintenance personnel and performs recovery according to that strategy; if the abnormal condition is not eliminated within a preset time threshold, the control component automatically recovers the abnormal condition.
Specifically, the control component first determines whether the virtual machines on the abnormal computing node need HA operation. If so, it receives a recovery strategy from operation and maintenance personnel, performs recovery according to that strategy, and then determines whether the abnormal condition has been eliminated. If it has, the alarm is cleared. If it has not, the control component determines whether the manual recovery time has exceeded the preset time threshold. Once the threshold is exceeded, it determines whether a suspend-HA request for a specified computing node has been received: if no such request has been received, it reports an alarm notification, starts the automated process, and automatically recovers the abnormal condition; if a suspend-HA request has been received, it stops the automated process on the specified computing node and updates the state of the corresponding HA object on the specified host to the suspended state. See example 4 for a specific implementation.
Step 106: When the HA mode is the full-automatic mode, the control component automatically recovers the abnormal condition.
Specifically, the control component determines whether the virtual machines on the abnormal computing node need HA operation. If so, it reports an alarm, checks all computing nodes, records the total number of hosts that need HA operation, and calculates the proportion of that number to the total number of hosts in the system. If the proportion is greater than a set proportion threshold (configurable, for example 50%), the control component stops issuing HA operations to the corresponding computing nodes. If the proportion is not greater than the threshold, it performs HA actions according to the list of hosts that need HA operation, so as to eliminate the alarm. See example 5 for a specific implementation.
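The proportion check in the full-automatic mode can be sketched in Python as follows. The function name `plan_ha_actions` and its return shape are hypothetical; the 50% default is the example value given in the text, and the comparison is strict ("greater than"), so a proportion exactly equal to the threshold still proceeds.

```python
from typing import List, Tuple


def plan_ha_actions(
    hosts_needing_ha: List[str],
    total_hosts: int,
    ratio_threshold: float = 0.5,  # example value from the text; configurable
) -> Tuple[bool, List[str]]:
    """Decide whether HA actions may be issued in full-automatic mode.

    Returns (proceed, hosts). When too large a share of the system needs HA,
    nothing is issued, on the assumption that the detection itself is suspect.
    """
    if total_hosts <= 0:
        raise ValueError("total_hosts must be positive")
    ratio = len(hosts_needing_ha) / total_hosts
    if ratio > ratio_threshold:
        return False, []  # stop issuing HA operations
    return True, list(hosts_needing_ha)


proceed, hosts = plan_ha_actions(["compute-1", "compute-2"], total_hosts=10)
print(proceed, hosts)  # True ['compute-1', 'compute-2']

blocked, skipped = plan_ha_actions([f"compute-{i}" for i in range(6)], total_hosts=10)
print(blocked, skipped)  # False []
```

A failure that appears to affect most of the hosts at once is more likely to be a network or monitoring fault than a genuine simultaneous host failure, which is why the check blocks mass evacuation rather than performing it.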
In this embodiment, the control node is provided with one main control component and two standby control components; when the main control component fails, a standby control component provides the high-reliability service externally.
Although the main/standby arrangement improves the reliability of the system, an abnormal network partition can leave two control components acting as master at once (the dual-master split-brain problem), which disrupts the decision mechanism of the whole high-reliability system and can cause a large-scale failure of the cloud platform system.
In a preferred embodiment, a service lock layer is added to avoid the dual-master split-brain problem. The main control component registers a service lock with the ETCD cluster and renews it periodically. When the main control component fails, the service lock can no longer be renewed and is automatically released once it exceeds its set life cycle; a standby control component then automatically executes a lock-grabbing action and is promoted to main control component.
Example 3:
Referring to fig. 3, this embodiment describes the cloud computing scenario-oriented virtual machine high-reliability implementation method in the manual mode, which includes the following steps:
Step 201: When the system starts, determine the HA mode of the system.
If the mode is the manual mode, go to step 203; otherwise, go to step 202.
Step 202: Enter the implementation process of the corresponding HA mode.
Step 203: Acquire the information of all computing nodes.
In this embodiment, the Nova service list interface is called to obtain the information of all computing nodes.
Step 204: Perform HA modeling according to the information of the computing node.
Step 205: Determine whether a corresponding HA object already exists in the database.
If not, go to step 206; if so, go to step 207.
Step 206: Build the model of the HA object and store it in the database.
After the modeling of the current computing node is completed, go to step 207.
Step 207: Perform HA modeling on the next computing node.
Step 208: After all computing nodes have been modeled, the system enters the cyclic detection process.
Step 209: The control component determines, according to the detection information reported by the execution component, whether a computing node is abnormal.
If so, go to step 210; if not, return to step 208.
Step 210: Determine whether the virtual machines on the abnormal computing node need HA operation.
If so, go to step 211; if not, return to step 208.
Step 211: Report an alarm to notify operation and maintenance personnel.
Step 212: Receive the recovery strategy of the operation and maintenance personnel and perform system recovery according to it.
Step 213: Determine whether the abnormal condition has been eliminated.
If not, return to step 212; if so, go to step 214.
Step 214: Clear the alarm.
In this embodiment, operation and maintenance personnel handle the abnormality and decide whether to trigger host-granularity HA actions, and the system can still automatically restart virtual machines in place after the host reboots.
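One pass through steps 209 to 214 can be sketched in Python as below. This is a simplified illustration, not the patent's code: `manual_mode_cycle`, the event names, and the `apply_operator_strategy` callback (which stands for "apply the operator's recovery strategy and report whether the abnormality is gone") are all hypothetical. The flow described above repeats step 212 without limit; the `max_attempts` bound is an added assumption so that the sketch always terminates.

```python
from typing import Callable, List


def manual_mode_cycle(
    node_abnormal: bool,
    vm_needs_ha: bool,
    apply_operator_strategy: Callable[[], bool],
    max_attempts: int = 10,  # assumption: the original loop is unbounded
) -> List[str]:
    """Run one detection cycle in manual mode and return the events produced."""
    events: List[str] = []
    if not node_abnormal or not vm_needs_ha:
        return events                       # steps 209/210: back to detection
    events.append("alarm_raised")           # step 211
    for _ in range(max_attempts):
        events.append("operator_recovery")  # step 212
        if apply_operator_strategy():       # step 213
            events.append("alarm_cleared")  # step 214
            break
    return events


# The operator's first attempt fails and the second succeeds.
outcomes = iter([False, True])
events = manual_mode_cycle(True, True, lambda: next(outcomes))
print(events)
# ['alarm_raised', 'operator_recovery', 'operator_recovery', 'alarm_cleared']
```

Note that no automated HA action appears anywhere in this path: in manual mode every recovery step originates from the operator.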
Example 4:
Referring to fig. 4, this embodiment describes the cloud computing scenario-oriented virtual machine high-reliability implementation method in the semi-automatic mode, which includes the following steps:
Step 301: When the system starts, determine the HA mode of the system.
If the mode is the semi-automatic mode, go to step 303; otherwise, go to step 302.
Step 302: Enter the implementation process of the corresponding HA mode.
Step 303: Acquire the information of all computing nodes.
Step 304: Perform HA modeling according to the information of the computing node.
Step 305: Determine whether a corresponding HA object already exists in the database.
Step 306: Build the model of the HA object and store it in the database.
Step 307: Perform HA modeling on the next computing node.
Step 308: After all computing nodes have been modeled, the system enters the cyclic detection process.
Step 309: The control component determines, according to the detection information reported by the execution component, whether a computing node is abnormal.
Step 310: Determine whether the virtual machines on the abnormal computing node need HA operation.
Steps 303 to 310 are the same as the corresponding steps of embodiment 3 and are not described again here.
Step 311: Report an alarm to notify operation and maintenance personnel, and record the start time of manual recovery.
Step 312: Receive the recovery strategy of the operation and maintenance personnel and perform system recovery according to it.
Step 313: Determine whether the abnormal condition has been eliminated.
If not, go to step 315; if so, go to step 314.
Step 314: Clear the alarm.
Step 315: Determine whether the manual recovery has timed out.
If not, return to step 312; if so, go to step 316.
Step 316: Determine whether a suspend-HA request has been received.
If not, go to step 317; if so, go to step 319.
Step 317: Report an alarm and start the automated HA process.
Step 318: The automated HA ends, and the alarm is cleared.
Step 319: Stop the automated HA process on the specified computing node.
Step 320: Update the status and flag bits of the HA object on the specified computing node.
With reference to Table 1, when a suspend-HA request is received, the HAStack service updates the status field of the HA object corresponding to the host in the database to the suspended state and sets the semi_auto_pause flag to TRUE. After receiving an HA resume request, the control component likewise updates the status and flag bits of the HA object on the specified computing node according to Table 1.
This embodiment provides a function for suspending the automated HA and a function for resuming it. Used together, they temporarily stop the automated HA capability of the system so that operation and maintenance personnel have enough time to troubleshoot; once the problem is solved, the HA capability only needs to be resumed.
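The decision made in steps 313 to 320, together with the suspend and resume functions, can be sketched in Python as follows. This is a simplified reading rather than the patent's implementation: the function names, the returned action names, and the state names "active" and "suspended" are hypothetical, and the sketch records the suspend request on the HA object when it arrives and reads the `semi_auto_pause` flag at decision time, whereas the flow above updates the object in step 320.

```python
from dataclasses import dataclass


@dataclass
class HAObject:
    hostname: str
    status: str = "active"         # assumed state name
    semi_auto_pause: bool = False  # flag named in the description


def pause_ha(obj: HAObject) -> None:
    """Handle a suspend-HA request for one host."""
    obj.status = "suspended"
    obj.semi_auto_pause = True


def resume_ha(obj: HAObject) -> None:
    """Handle an HA resume request for one host."""
    obj.status = "active"
    obj.semi_auto_pause = False


def semi_auto_next_step(obj: HAObject, recovered: bool,
                        started_at: float, now: float, timeout: float) -> str:
    """Return the next action for one abnormal host in semi-automatic mode."""
    if recovered:
        return "clear_alarm"         # step 314
    if now - started_at <= timeout:
        return "wait_for_operator"   # step 315: not timed out, back to step 312
    if obj.semi_auto_pause:
        return "automation_paused"   # steps 319-320
    return "start_automated_ha"      # step 317


host = HAObject("compute-1")
print(semi_auto_next_step(host, False, started_at=0, now=30, timeout=60))  # wait_for_operator
print(semi_auto_next_step(host, False, started_at=0, now=90, timeout=60))  # start_automated_ha
pause_ha(host)
print(semi_auto_next_step(host, False, started_at=0, now=90, timeout=60))  # automation_paused
resume_ha(host)
print(semi_auto_next_step(host, False, started_at=0, now=90, timeout=60))  # start_automated_ha
```

The time window only delays automation; the suspend flag is what actually withholds it, so an operator who needs longer than the window must suspend HA for that host explicitly.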
Example 5:
Referring to fig. 5, this embodiment describes the cloud computing scenario-oriented virtual machine high-reliability implementation method in the full-automatic mode, which includes the following steps:
Step 401: When the system starts, determine the HA mode of the system.
If the mode is the full-automatic mode, go to step 403; otherwise, go to step 402.
Step 402: Enter the implementation process of the corresponding HA mode.
Step 403: Acquire the information of all computing nodes.
Step 404: Perform HA modeling according to the information of the computing node.
Step 405: Determine whether a corresponding HA object already exists in the database.
Step 406: Build the model of the HA object and store it in the database.
Step 407: Perform HA modeling on the next computing node.
Step 408: After all computing nodes have been modeled, the system enters the cyclic detection process.
Step 409: The control component determines, according to the detection information reported by the execution component, whether a computing node is abnormal.
Step 410: Determine whether the virtual machines on the abnormal computing node need HA operation.
Steps 403 to 410 are the same as the corresponding steps of embodiment 3 and are not described again here.
Step 411: Report an alarm, check all computing nodes, and record the total number of hosts that need HA operation.
Step 412: Calculate the proportion of the total number of hosts that need HA operation to the total number of hosts in the system.
Step 413: Determine whether the proportion is greater than the set proportion threshold.
If so, go to step 414; if not, go to step 415.
Step 414: Stop issuing HA operations to the corresponding computing nodes, and report an alarm.
Step 415: Perform HA actions according to the list of hosts that need HA operation.
In this embodiment, the HA actions are performed according to the list of hosts that need HA operation, so as to recover from the abnormal condition.
Step 416: The automated HA ends, and the alarm is cleared.
In the full-automatic mode, global pre-judgment logic for HA actions is added to ensure that large-scale virtual machine HA is not triggered across the live network.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A cloud computing scene-oriented virtual machine high-reliability system, characterized by comprising a control node and a plurality of computing nodes, wherein the control node is provided with at least one control component, and the computing nodes are provided with execution components;
the execution component is used for carrying out abnormity detection on the computing node where the execution component is located and reporting corresponding detection information to the control component;
the control component is used for judging whether the corresponding computing node is abnormal or not according to the detection information, if so, triggering an alarm and recovering the abnormal condition according to the HA mode of the system;
the HA modes comprise a manual mode, a semi-automatic mode and a full-automatic mode, and can be selectively configured according to actual requirements;
when the HA mode is a manual mode, the control component is used for receiving a recovery strategy of operation and maintenance personnel and recovering according to the recovery strategy;
when the HA mode is a semi-automatic mode, the control component is used for receiving a recovery strategy of operation and maintenance personnel, recovering according to the recovery strategy, and if the abnormal condition is not relieved within a preset time threshold, automatically recovering the abnormal condition by the control component;
and when the HA mode is a full-automatic mode, the control component is used for automatically recovering the abnormal condition.
2. The virtual machine high-reliability system according to claim 1, wherein the control component is further configured to obtain host names corresponding to all the compute nodes, and perform HA modeling for each compute node according to the host name and parameters required by the model table.
3. The virtual machine high-reliability system according to claim 1, wherein a main control component and two standby control components are disposed on the control node, and when the main control component fails, the standby control components provide high-reliability service to the outside.
4. The virtual machine high reliability system of claim 3 wherein the master control component is configured to register a service lock with the ETCD cluster and periodically update the service lock;
after the service lock exceeds a set life cycle, the service lock is automatically released, a standby control component automatically executes a lock-grabbing action, and that standby control component is thereby promoted to main control component.
5. The virtual machine high-reliability system according to claim 1, wherein the virtual machine high-reliability system comprises a plurality of application program interfaces, and the plurality of application program interfaces are arranged on the control node to provide corresponding services to the outside;
the application program interface comprises an HA object list query interface, an HA object query interface, an HA trigger interface, an HA pause interface, an HA recovery interface and an HA historical task query interface.
6. A cloud computing scene-oriented virtual machine high-reliability implementation method, characterized in that the implementation method is applied to a virtual machine high-reliability system, the virtual machine high-reliability system comprises a control node and a plurality of computing nodes, the control node is provided with at least one control component, and the computing nodes are provided with execution components;
the implementation method comprises the following steps:
the execution component carries out anomaly detection on the computing node where the execution component is located and reports corresponding detection information to the control component;
the control component judges whether the corresponding computing node is abnormal or not according to the detection information;
if the corresponding computing node is abnormal, triggering an alarm and recovering the abnormal condition according to the HA mode of the system;
when the HA mode is a manual mode, the control component receives a recovery strategy of operation and maintenance personnel, and system recovery is carried out according to the recovery strategy;
when the HA mode is a semi-automatic mode, the control component receives a recovery strategy of operation and maintenance personnel, the recovery is carried out according to the recovery strategy, and if the abnormal condition is not eliminated within a preset time threshold, the control component automatically carries out the recovery of the abnormal condition;
and when the HA mode is a full-automatic mode, the control component automatically recovers the abnormal condition.
7. The method according to claim 6, wherein before the control component determines whether the corresponding computing node is abnormal according to the detection information, the method further comprises:
the control component acquires the host names corresponding to all the computing nodes, and performs HA modeling for each computing node according to the host name and the parameters required by the model table.
8. The implementation method of claim 6, wherein the control node is provided with a main control component and two standby control components;
the master control component registers a service lock with the ETCD cluster and periodically updates the service lock;
when the main control component fails, the service lock cannot be updated and is automatically released after it exceeds a set life cycle, a standby control component automatically executes a lock-grabbing action, and that standby control component is thereby promoted to main control component.
9. The method according to claim 6, wherein the control component receives a recovery strategy of the operation and maintenance personnel, performs recovery according to the recovery strategy, and if the abnormal condition is not eliminated within a preset time threshold, the control component automatically performs recovery of the abnormal condition, including:
judging whether the virtual machine on the abnormal computing node needs HA operation;
if HA operation is needed, the control component receives a recovery strategy of operation and maintenance personnel and recovers according to the recovery strategy;
judging whether the abnormal condition is relieved or not;
if the abnormal condition is relieved, eliminating the alarm;
if the abnormal condition is not relieved, judging whether the manual recovery time exceeds a preset time threshold value;
if the manual recovery time exceeds the preset time threshold, judging whether a suspend-HA request for a specified computing node has been received;
if no suspend-HA request has been received, reporting an alarm notification and starting the automated process, the control component automatically recovering the abnormal condition;
and if a suspend-HA request has been received, stopping the automated process on the specified computing node, and updating the state of the corresponding HA object on the specified host to a suspended state.
10. The implementation method of claim 6, wherein when the HA mode is a fully automatic mode, the control component automatically performing recovery of the abnormal condition comprises:
judging whether the virtual machine on the abnormal computing node needs HA operation;
if the HA operation is needed, reporting an alarm, detecting all the computing nodes, and recording the total number of hosts needing the HA operation;
calculating the occupation proportion of the total number of the hosts needing HA operation relative to the total number of the hosts of the system;
judging whether the occupation proportion is larger than a set proportion threshold value or not;
if the occupation proportion is greater than the set proportion threshold, stopping issuing HA operations to the corresponding computing nodes;
and if the occupation proportion is not greater than the set proportion threshold, performing HA actions according to the list of hosts needing HA operation, so as to eliminate the alarm.
CN202010644139.5A 2020-07-07 2020-07-07 Cloud computing scene-oriented virtual machine high-reliability system and implementation method Pending CN111897626A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010644139.5A CN111897626A (en) 2020-07-07 2020-07-07 Cloud computing scene-oriented virtual machine high-reliability system and implementation method

Publications (1)

Publication Number Publication Date
CN111897626A true CN111897626A (en) 2020-11-06

Family

ID=73191845

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010644139.5A Pending CN111897626A (en) 2020-07-07 2020-07-07 Cloud computing scene-oriented virtual machine high-reliability system and implementation method

Country Status (1)

Country Link
CN (1) CN111897626A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201106
