CN111865695A

CN111865695A - Method and system for automatic fault handling in cloud environment

Info

Publication number: CN111865695A
Application number: CN202010737436.4A
Authority: CN
Inventors: 陈玉林; 蔡卫卫; 宋伟; 申嘉童
Original assignee: Inspur Cloud Information Technology Co Ltd
Current assignee: Inspur Cloud Information Technology Co Ltd
Priority date: 2020-07-28
Filing date: 2020-07-28
Publication date: 2020-10-30

Abstract

The invention discloses a method and a system for automatic fault handling in a cloud environment, belonging to the technical field of cloud computing. The system comprises an index acquisition module, an index acquisition and storage module, an abnormality detection module, an abnormality notification module, an abnormality processing module, a recovery detection module, a result feedback module and a log module; a virtual environment is built on a physical environment, and the modules are integrated into the system. The invention ensures that faults in the cloud environment are not perceived at the user level and do not affect user services, thereby safeguarding the user experience and reducing the operation and maintenance burden.

Description

Method and system for automatic fault handling in cloud environment

Technical Field

The invention relates to the technical field of cloud computing, in particular to a method and a system for automatic fault handling in a cloud environment.

Background

In a cloud environment, using virtual machines offers great convenience. Resource allocation is simple: a user can apply for a virtual machine of a suitable specification according to actual needs, and resources can be allocated dynamically. Developing services on the cloud avoids the inconvenience of a traditional self-built environment, the user does not need to consider operation and maintenance, development efficiency is improved and cost is reduced.

In the prior art there are many high-availability schemes, covering both basic physical facilities (such as switches, routers, network cards and power supplies) and software-level facilities (such as databases, message queues and proxy services). Even with these technologies, however, it cannot be guaranteed that a failure in the environment will not affect the user, so a mechanism is required to ensure that the user-level environment is not affected.

Disclosure of Invention

The technical task of the invention is to provide, in view of the above shortcomings, a method and a system for automatic fault handling in a cloud environment, so that faults in the cloud environment are not perceived at the user level and user services are not affected.

The technical scheme adopted by the invention for solving the technical problems is as follows:

A method for automatic fault handling in a cloud environment, in which a virtual environment is built at the user level in the cloud environment, and automatic fault handling is achieved through index acquisition, index acquisition and storage, abnormality detection, abnormality notification, abnormality handling, recovery detection and result feedback.

With this method, an abnormal environment can be recovered in as short a time as possible, or, when it cannot be recovered in the short term, the cloud resources can be transferred to a normal physical environment, so that the abnormality is not perceived at the user level and user services are not affected.

By introducing an automatic handling mechanism, the workload of operation and maintenance personnel is reduced: most faults do not need to be handled by them immediately because the automatic handling program is executed first, which extends the time window available to operation and maintenance personnel for restoring the environment.

Specifically, the index acquisition collects data to obtain environment information; the environment index information is of various types, including instantaneous values, accumulated values, variance values and absolute values;

the index acquisition and storage periodically requests the index information obtained by the index acquisition and stores it in a time sequence database for a period of time, which facilitates later data processing and history tracking;

the abnormality detection checks, through computation, whether the collected indexes are abnormal;

the abnormality notification exports the abnormality information and sends out alarm information by means of a message queue;

the exception processing subscribes to the alarm information: when the abnormality notification finds an abnormality it sends an exception message, and the exception processing captures that message, extracts the useful information from it and performs the corresponding processing according to its type;

the recovery detection exists because, for efficiency, the exception processing is generally executed asynchronously: it does not check the processing result and instead sends a message to a queue once processing is finished; the recovery detection determines whether recovery has occurred through a long-cycle task and, if the environment is still abnormal, can choose to trigger fault processing again or to feed back a message;

the result feedback subscribes to the feedback messages of exception processing, which generally are messages containing exceptions, and interfaces with common message notification means.

The log module records the fault details and the key steps of the fault recovery process, which facilitates follow-up maintenance.

Further, the method comprises the following concrete implementation steps:

1) acquiring index data;

2) the acquired data are sent to a data storage end, and the data storage end processes the information and then samples and stores the information;

3) state inference is carried out on the specified indexes to judge whether the environment has a fault and, if so, the fault type and fault severity level, and an alarm message is sent to a message queue;

4) different alarm information is sent to different processing units by means of routing distribution, or part of the information is reported directly to operation and maintenance personnel;

5) virtual machine operation faults are handled using virtual machine processing techniques including hot migration, cold migration and/or evacuation, ensuring that the impact on the tenant side is minimized;

6) the result of the abnormality handling is checked for recovery using a long loop, and failed handling is requested again or information is sent to a feedback module;

7) the feedback module records the processing result information and, according to the configuration, sends a notification to the handler, for interaction with an administrator or fault handler;

8) the operation flow and the processing result are recorded, and the log level can be specified.

Preferably, the index data acquisition uses interface-based programming and adopts a universal acquisition interface (a collection interface), so that the module can be extended through the interface; acquisition requests are initiated passively, which lowers the system resource usage of the acquisition module and effectively reduces the resource consumption caused by monitoring.
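As an illustration only, the universal acquisition interface might look like the following Python sketch. It assumes that "passively initiated" means the acquisition module does no work until another module requests data. The names `Collector`, `CpuCollector` and `CollectorRegistry`, and the placeholder reading, are hypothetical and do not come from the patent.

```python
from abc import ABC, abstractmethod
from typing import Dict


class Collector(ABC):
    """Hypothetical universal acquisition interface; one subclass per index source."""

    name: str

    @abstractmethod
    def collect(self) -> Dict[str, float]:
        """Return current index values. Called only when a request arrives."""


class CpuCollector(Collector):
    name = "cpu"

    def collect(self) -> Dict[str, float]:
        # Placeholder reading (assumption); a real collector would query the host.
        return {"cpu_usage_percent": 42.0}


class CollectorRegistry:
    """Holds registered collectors and answers pull requests."""

    def __init__(self) -> None:
        self._collectors: Dict[str, Collector] = {}

    def register(self, collector: Collector) -> None:
        # Extending the module means registering another Collector subclass.
        self._collectors[collector.name] = collector

    def handle_request(self) -> Dict[str, Dict[str, float]]:
        # No background thread or timer: collection runs only on request.
        return {name: c.collect() for name, c in self._collectors.items()}


registry = CollectorRegistry()
registry.register(CpuCollector())
snapshot = registry.handle_request()
```

Adding a new index source in this sketch requires only a new `Collector` subclass and one `register` call, which is one way to read the extensibility claimed for the interface.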

Preferably, the storage back end of the data storage end uses a time sequence database, and a centralized module performs unified data acquisition and storage.
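The patent does not name a specific time sequence database. The sketch below therefore uses a small in-memory stand-in to show the idea of keeping samples "for a period of time". The class `TimeSeriesStore`, its methods and the 60-second retention value are assumptions made for illustration.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, value)


class TimeSeriesStore:
    """In-memory stand-in for a time sequence database with a retention window."""

    def __init__(self, retention_seconds: float) -> None:
        self.retention_seconds = retention_seconds
        self._series: Dict[str, Deque[Sample]] = defaultdict(deque)

    def write(self, index_name: str, timestamp: float, value: float) -> None:
        series = self._series[index_name]
        series.append((timestamp, value))
        # Drop samples that are older than the retention window.
        cutoff = timestamp - self.retention_seconds
        while series and series[0][0] < cutoff:
            series.popleft()

    def query(self, index_name: str, since: float) -> List[Sample]:
        # Return samples at or after `since`, oldest first.
        return [s for s in self._series[index_name] if s[0] >= since]


store = TimeSeriesStore(retention_seconds=60.0)
for t, v in [(0.0, 10.0), (30.0, 20.0), (90.0, 30.0)]:
    store.write("cpu_usage_percent", t, v)
recent = store.query("cpu_usage_percent", since=0.0)
```

After the third write the sample taken at time 0 falls outside the 60-second window and is discarded, so history tracking is bounded by the retention setting.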

Preferably, faults are handled through a scheduling mechanism: faults of different priorities are handled in a graded manner, and the fault handling logic is executed asynchronously.
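One possible reading of priority-graded, asynchronous handling is a priority queue drained by an asynchronous worker. The sketch below uses Python's standard `asyncio.PriorityQueue`; the fault names, the convention that a lower number means higher priority, and the placeholder `handle_fault` body are assumptions for illustration.

```python
import asyncio
from typing import List, Tuple


async def handle_fault(name: str, handled: List[str]) -> None:
    # Placeholder for real handling logic, such as a migration request.
    await asyncio.sleep(0)
    handled.append(name)


async def worker(pending: asyncio.PriorityQueue, handled: List[str]) -> None:
    while True:
        # The queue always yields the entry with the smallest priority number.
        _priority, name = await pending.get()
        try:
            await handle_fault(name, handled)
        finally:
            pending.task_done()


async def run_scheduler(faults: List[Tuple[int, str]]) -> List[str]:
    pending: asyncio.PriorityQueue = asyncio.PriorityQueue()
    handled: List[str] = []
    for fault in faults:
        pending.put_nowait(fault)
    task = asyncio.create_task(worker(pending, handled))
    await pending.join()  # wait until every queued fault has been handled
    task.cancel()
    await asyncio.gather(task, return_exceptions=True)
    return handled


order = asyncio.run(
    run_scheduler([(2, "disk-warning"), (0, "host-down"), (1, "nic-flap")])
)
```

With a single worker the faults are handled strictly in priority order; adding workers would increase throughput at the cost of that strict ordering.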

The invention also claims a system for automatic fault handling in a cloud environment, which comprises an index acquisition module, an index acquisition and storage module, an abnormality detection module, an abnormality notification module, an abnormality handling module, a recovery detection module, a result feedback module and a log module, wherein a virtual environment is established in a physical environment, and the modules are integrated into the system. Therefore, user experience is guaranteed, operation and maintenance burden is reduced, the framework is designed to be expandable, and function expansion can be easily achieved.

Specifically:

the index acquisition module is used for acquiring environment information, wherein the information comprises an instantaneous value, an accumulated value, a variance value and an absolute value;

the index acquisition and storage module periodically requests the index acquisition module to obtain index information, and the acquired information is stored in the time sequence database for a period of time, so that the later data processing and history tracking are facilitated;

the abnormality detection module checks, through computation, whether the collected indexes are abnormal;

the abnormality notification module exports the abnormality information and sends the alarm information out by means of a message queue;

the abnormality processing module subscribes to the alarm information; when the abnormality notification module discovers an abnormality it sends an abnormality message, which the abnormality processing module captures, extracting the useful information and performing different processing according to the type;

the recovery detection module determines whether recovery has occurred through a long-cycle task and, if the environment is still abnormal, can choose to trigger fault processing again or to feed back a message;

the result feedback module subscribes to the feedback information of exception handling;

the log module is used for recording fault details and key steps in the fault recovery process.

Further, the system performs automatic fault handling specifically as follows:

1) the index acquisition module is used to acquire data; the module provides a universal acquisition interface, and acquisition is initiated passively;

2) The data acquired by the index acquisition module is sent to a data storage end, and the data storage end processes the information and then samples and stores the information;

3) the abnormal detection module deduces the state of the specified index and sends alarm information to a message queue;

4) the abnormal notification module sends different alarm information to different processing units, or directly informs operation and maintenance personnel of part of the information;

5) the abnormality processing module handles common faults; for faults that cause a virtual machine to run abnormally, virtual machine processing techniques including hot migration, cold migration and/or evacuation are used;

6) the recovery detection module uses a long loop to check whether the abnormality handling has resulted in recovery, and re-requests failed handling or sends information to the feedback module;

7) the feedback module records the processing result information and can send a notification to the handler according to the configuration;

8) the log module records the operation flow and the processing result.

The invention also claims a computer readable medium having stored thereon computer instructions which, when executed by a processor, cause the processor to perform the above-described method.

Compared with the prior art, the method and the system for automatically processing the fault in the cloud environment have the following beneficial effects:

the method and the system divide the system into modules by function, which is convenient for development and testing, and can effectively improve the fault tolerance of the cloud platform by handling common platform faults through various technical means: for hardware abnormalities (CPU, memory, network card, power supply), cloud resources can be migrated (evacuated) to a normal environment through virtualization technology; for unbalanced resource distribution, highly loaded resources are transferred in real time to a lightly loaded environment through automatic coordination; for basic software faults, availability is checked by means of probes, and an alarm is raised and recovery is carried out promptly when an abnormality is found.
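The three fault categories above can be mapped to handling actions roughly as in the following sketch. The enum, the action names and in particular the rule for choosing between hot migration and evacuation (hot migration while the source host is still reachable, evacuation otherwise) are assumptions added for illustration; the patent does not specify the selection rule.

```python
from enum import Enum


class FaultCategory(Enum):
    HARDWARE = "hardware"              # CPU, memory, network card, power supply
    LOAD_IMBALANCE = "load_imbalance"  # unbalanced resource distribution
    BASE_SOFTWARE = "base_software"    # database, message queue, proxy service


def choose_action(category: FaultCategory, host_reachable: bool) -> str:
    """Pick a handling action for a fault (hypothetical decision rule)."""
    if category is FaultCategory.HARDWARE:
        # Assumption: hot migration needs a reachable source host;
        # otherwise the virtual machines are evacuated to another host.
        return "hot_migrate" if host_reachable else "evacuate"
    if category is FaultCategory.LOAD_IMBALANCE:
        # Move highly loaded resources to a lightly loaded environment.
        return "hot_migrate"
    # Basic software faults found by probes: raise an alarm and recover.
    return "alarm_and_recover_service"


action_for_dead_host = choose_action(FaultCategory.HARDWARE, host_reachable=False)
```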

Drawings

Fig. 1 is a system configuration diagram of automatic failure processing in a cloud environment according to an embodiment of the present invention.

Detailed Description

The invention is further described with reference to the following figures and specific examples.

While the "cloud" provides great convenience, it also poses challenges for cloud development and for operation and maintenance: the actual "cloud" is built on a physical environment, which may still suffer various problems or failures. Existing automatic fault handling frameworks for cloud environments are not mature enough; this embodiment provides a usable automatic fault handling method whose ultimate aim is to ensure that faults in the cloud environment are not perceived at the user level.

The index collection aims to obtain environment information through data collection; the environment index information is of various types, including instantaneous values, accumulated values, variance values and absolute values;

as to the recovery detection: for efficiency, the exception processing is generally executed asynchronously, so it does not check the processing result and instead sends a message to a queue once processing is finished; the recovery detection determines whether recovery has occurred through a long-cycle task and, if the environment is still abnormal, can choose to trigger fault processing again or to feed back a message;

The method comprises the following concrete implementation steps:

1) index data is acquired using interface-based programming with a universal acquisition interface (collection), so that the module can be extended through the interface; acquisition requests are initiated passively, which lowers the system resource usage of the acquisition module and effectively reduces the resource consumption caused by monitoring;

2) the acquired data are sent to a data storage end, which processes the information and then samples and stores it; the storage back end of the data storage end uses a time sequence database, and a centralized module performs unified data acquisition and storage;

5) virtual machine operation faults are handled using virtual machine processing techniques including hot migration, cold migration and/or evacuation, ensuring that the impact on the tenant side is minimized; faults are handled through a scheduling mechanism, faults of different priorities are handled in a graded manner, and the fault handling logic is executed asynchronously;

The invention also claims a system for automatic fault handling in a cloud environment, which comprises an index acquisition module, an index acquisition and storage module, an abnormality detection module, an abnormality notification module, an abnormality handling module, a recovery detection module, a result feedback module and a log module, wherein a virtual environment is built on a physical environment, and the modules are integrated into the system, so that the user experience is ensured, the operation and maintenance burden is reduced, and the design framework is expandable, so that the expansion of functions can be easily realized.

The index acquisition module has a single function: acquiring environment information. The index information is of various types, including instantaneous values, accumulated values, variance values, absolute values and the like.
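The patent does not define the four index types precisely. The sketch below shows one plausible interpretation over a window of raw samples: the latest sample as the instantaneous value, the running sum as the accumulated value, the population variance as the variance value, and the magnitude of the latest sample as the absolute value. The function name `summarize` and these definitions are assumptions.

```python
import statistics
from typing import Dict, Sequence


def summarize(samples: Sequence[float]) -> Dict[str, float]:
    """Derive the four index types from raw samples (assumed definitions)."""
    latest = samples[-1]
    return {
        "instantaneous": latest,                       # most recent reading
        "accumulated": sum(samples),                   # total over the window
        "variance": statistics.pvariance(samples),     # spread over the window
        "absolute": abs(latest),                       # magnitude of latest reading
    }


summary = summarize([2.0, 4.0, -6.0])
```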

The index acquisition and storage module periodically requests the index acquisition module to obtain index information, and the acquired information is stored in the time sequence database for a period of time, so that the later-stage data processing and history tracking are facilitated.

The abnormality detection module relies on the index acquisition and storage module; its aim is to check, through computation, whether the collected indexes are abnormal.
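The check can be sketched as a simple threshold rule that reports whether a fault exists, its type and its severity level. The index names, thresholds, fault type labels and the two severity levels below are hypothetical; the patent only states that a state inference yields these three results.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Inference:
    faulty: bool
    fault_type: Optional[str]
    severity: Optional[str]


# Hypothetical rules: index name -> (fault type, warning threshold, critical threshold)
RULES: Dict[str, Tuple[str, float, float]] = {
    "cpu_usage_percent": ("cpu_overload", 80.0, 95.0),
    "memory_usage_percent": ("memory_pressure", 85.0, 97.0),
}


def infer_state(index_name: str, value: float) -> Inference:
    """Classify one index reading against its thresholds."""
    fault_type, warn_at, critical_at = RULES[index_name]
    if value >= critical_at:
        return Inference(True, fault_type, "critical")
    if value >= warn_at:
        return Inference(True, fault_type, "warning")
    return Inference(False, None, None)


result = infer_state("cpu_usage_percent", 96.5)
```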

The abnormality notification module exports the abnormality information and sends the alarm information out by means of a message queue.

The abnormality processing module subscribes to the alarm information. When the abnormality notification module discovers an abnormality, it sends an abnormality message, which is captured by the abnormality processing module; the processing module extracts the useful information from the message and performs different processing according to the type.
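The subscribe-and-dispatch pattern described here can be sketched with Python's standard `queue.Queue` standing in for a real message queue. The alarm fields (`type`, `host`), the handler names and the fallback that reports unrecognized alarm types to an operator are assumptions for illustration.

```python
import queue
from typing import Callable, Dict, List

alarm_queue: queue.Queue = queue.Queue()  # stand-in for a real message broker
actions: List[str] = []                   # records what each handler decided


def handle_host_down(alarm: dict) -> None:
    actions.append(f"evacuate:{alarm['host']}")


def handle_overload(alarm: dict) -> None:
    actions.append(f"hot_migrate:{alarm['host']}")


# Hypothetical mapping from alarm type to handler.
HANDLERS: Dict[str, Callable[[dict], None]] = {
    "host_down": handle_host_down,
    "overload": handle_overload,
}


def notify(alarm: dict) -> None:
    """Notification side: publish an alarm message."""
    alarm_queue.put(alarm)


def process_pending() -> None:
    """Processing side: consume alarms and dispatch by type."""
    while True:
        try:
            alarm = alarm_queue.get_nowait()
        except queue.Empty:
            return
        handler = HANDLERS.get(alarm["type"])
        if handler is None:
            # Assumption: unknown types go straight to an operator.
            actions.append(f"notify_operator:{alarm['type']}")
        else:
            handler(alarm)


notify({"type": "host_down", "host": "node-1"})
notify({"type": "fan_speed", "host": "node-2"})
process_pending()
```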

In the exception handling module, for efficiency, execution is generally asynchronous: the processing result is not checked, and instead a message is sent to a queue once processing is finished. The recovery detection module therefore determines whether recovery has occurred through a long-cycle task and, if the environment is still abnormal, can choose to trigger fault processing again or to feed back a message.
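A minimal sketch of such a long-cycle recovery check follows, assuming a bounded polling loop. The function `detect_recovery`, its parameters, the retry limits and the feedback message texts are all hypothetical; the patent only says that the task can re-trigger handling or feed back a message while the environment remains abnormal.

```python
import time
from typing import Callable, List


def detect_recovery(
    is_recovered: Callable[[], bool],
    retrigger: Callable[[], None],
    feedback: Callable[[str], None],
    max_rounds: int = 3,
    checks_per_round: int = 5,
    interval_seconds: float = 0.0,
) -> bool:
    """Poll for recovery; re-trigger handling between rounds, then give feedback."""
    for round_number in range(1, max_rounds + 1):
        for _ in range(checks_per_round):
            if is_recovered():
                feedback(f"recovered in round {round_number}")
                return True
            time.sleep(interval_seconds)
        if round_number < max_rounds:
            retrigger()  # ask the handling module to try again
    feedback("not recovered; manual intervention needed")
    return False


# Simulated environment that reports recovery on the seventh check.
state = {"checks": 0, "retriggers": 0}
messages: List[str] = []


def fake_is_recovered() -> bool:
    state["checks"] += 1
    return state["checks"] >= 7


def fake_retrigger() -> None:
    state["retriggers"] += 1


recovered = detect_recovery(fake_is_recovered, fake_retrigger, messages.append)
```

In the simulated run the first round of five checks fails, handling is re-triggered once, and the second round observes recovery.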

The result feedback module subscribes to the feedback messages of exception handling, which generally are messages containing exceptions, and interfaces with common message notification means.

The log module is used for recording fault details and key steps in the fault recovery process, and facilitates subsequent maintenance.

Fig. 1 shows the system architecture for automatic fault handling in a cloud environment. The bottom layer of the system uses infrastructure including a database and a message queue, for index data storage and inter-component message delivery respectively. Each component is responsible for an independent function; in addition, the system ultimately serves operation and maintenance personnel and provides an interface for viewing and follow-up handling.

The system carries out automatic fault processing and comprises the following specific steps:

1) the index acquisition module is used to acquire data; the module provides a universal acquisition interface through which it can be extended; in addition, acquisition is initiated passively, so the system resource usage of the acquisition module is low;

4) the abnormality notification module acts as a router: it sends different alarm information to different processing units, or reports part of the information directly to operation and maintenance personnel;

5) the abnormality processing module is the core of the whole system and handles common faults; for faults that cause a virtual machine to run abnormally, it uses virtual machine processing techniques such as hot migration, cold migration and evacuation, so that the impact on the user side is minimized;

8) the log module records the operation flow and the processing result.

In the system, the index acquisition module uses interface-based programming and initiates requests passively, which effectively reduces the system resource consumption caused by monitoring; the storage back end of the index acquisition and storage module uses a time sequence database, and a centralized module acquires and stores data uniformly; the abnormality detection module uses state inference to judge whether the environment has a fault and, if so, the fault type and fault severity level; the abnormality notification module sends different messages to different processing back ends by means of routing distribution; the exception handling module handles faults through a scheduling mechanism, handles faults of different priorities in a graded manner, and executes the fault handling logic asynchronously; the recovery detection module uses a long loop to check whether the asynchronous handling in the exception handling module has actually restored normal operation; the feedback module supports interaction with an administrator or fault handler; and the log module records the processing information, with a configurable log level.
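The configurable log level can be illustrated with Python's standard `logging` module. The logger name `fault_handling`, the message texts and the helper `build_fault_logger` are assumptions; the patent does not prescribe a logging library or format.

```python
import io
import logging
from typing import TextIO


def build_fault_logger(level: str, stream: TextIO) -> logging.Logger:
    """Create a logger for fault handling steps with the given level name."""
    logger = logging.getLogger("fault_handling")
    logger.handlers.clear()       # avoid duplicate handlers on repeated calls
    logger.propagate = False      # keep output confined to this handler
    logger.setLevel(getattr(logging, level.upper()))
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_fault_logger("warning", buffer)
log.info("hot migration started for vm-1")   # below WARNING, so not recorded
log.error("hot migration failed for vm-1")   # recorded
output = buffer.getvalue()
```

Raising the level to `warning` keeps routine steps out of the log while still recording failures; lowering it to `info` would record every key step of the recovery process.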

An embodiment of the present invention further provides a computer-readable medium, where the computer-readable medium stores computer instructions, and when the computer instructions are executed by a processor, the processor is caused to execute the method for automatic fault handling in a cloud environment described in the above embodiment of the present invention. Specifically, a system or an apparatus equipped with a storage medium on which software program codes that realize the functions of any of the above-described embodiments are stored may be provided, and a computer (or a CPU or MPU) of the system or the apparatus is caused to read out and execute the program codes stored in the storage medium.

In this case, the program code itself read from the storage medium can realize the functions of any of the above-described embodiments, and thus the program code and the storage medium storing the program code constitute a part of the present invention.

Examples of the storage medium for supplying the program code include a floppy disk, a hard disk, a magneto-optical disk, an optical disk (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD+RW), a magnetic tape, a nonvolatile memory card, and a ROM. Alternatively, the program code may be downloaded from a server computer via a communications network.

Further, it should be clear that the functions of any one of the above-described embodiments may be implemented not only by executing the program code read out by the computer, but also by causing an operating system or the like operating on the computer to perform a part or all of the actual operations based on instructions of the program code.

Further, it is to be understood that the program code read out from the storage medium is written to a memory provided in an expansion board inserted into the computer or to a memory provided in an expansion unit connected to the computer, and then causes a CPU or the like mounted on the expansion board or the expansion unit to perform part or all of the actual operations based on instructions of the program code, thereby realizing the functions of any of the above-described embodiments.

While the invention has been shown and described in detail in the drawings and in the preferred embodiments, it is not intended to limit the invention to the embodiments disclosed, and it will be apparent to those skilled in the art that various combinations of the code auditing means in the various embodiments described above may be used to obtain further embodiments of the invention, which are also within the scope of the invention.

Claims

1. A method for automatic fault handling in a cloud environment is characterized in that a virtual environment is built on a user level in the cloud environment, and automatic fault handling is achieved through index collection, index acquisition and storage, abnormality detection, abnormality notification, abnormality handling, recovery detection and result feedback.

2. The method for automatic fault handling in the cloud environment according to claim 1, wherein the index collection is performed to collect data to obtain environment information, which includes instantaneous values, accumulated values, variance values and absolute values;

the index acquisition and storage is to periodically request the index information acquired by the index acquisition, and the index information is stored in a time sequence database for a period of time;

the exception handling is to subscribe the alarm information, capture the exception information, extract useful information from the exception information and carry out corresponding handling according to the type;

the recovery detection is to realize the judgment of recovery through a long-cycle task, and can select to trigger fault processing again or feed back a message when abnormal;

the result feedback subscribes to the feedback information of exception handling.

3. The method for automatic fault handling in the cloud environment according to claim 1 or 2, wherein the method is implemented by the following steps:

1) acquiring index data;

3) sending an alarm message to a message queue by deducing the state of the specified index;

4) different alarm information is sent to different processing units, or part of information is directly informed to operation and maintenance personnel;

5) handling virtual machine operation faults by using virtual machine processing techniques, including hot migration, cold migration and/or evacuation;

7) the feedback module records the processing result information and, according to the configuration, sends a notification to the handler;

8) recording the operation flow and the processing result.

4. The method according to claim 3, wherein the index data acquisition uses interface-based programming and initiates acquisition requests passively.

5. The method for automatic fault handling in the cloud environment according to claim 3, wherein a time sequence database is used by a storage back end of the data storage end, and a centralized module is used for unified data acquisition and storage.

6. The method of claim 3, wherein the fault is handled by a scheduling mechanism, the faults of different priorities are handled in a hierarchical manner, and the fault handling logic is executed in an asynchronous manner.

7. A system for automatic fault handling in a cloud environment, characterized by comprising an index acquisition module, an index acquisition and storage module, an abnormality detection module, an abnormality notification module, an abnormality processing module, a recovery detection module, a result feedback module and a log module, wherein a virtual environment is built on a physical environment, and the modules are integrated into the system.

8. The system for automatic failure handling in cloud environment according to claim 7,

the index acquisition and storage module periodically requests the index acquisition module to obtain index information, and the acquired information is stored in the time sequence database for a period of time;

9. The system for automatic fault handling in cloud environment according to claim 7 or 8, wherein the system performs automatic fault handling specifically as follows:

8) the log module records the operation flow and the processing result.

10. A computer readable medium, characterized in that it has stored thereon computer instructions which, when executed by a processor, cause the processor to carry out the method of any one of claims 1 to 6.