CN111865695A - Method and system for automatic fault handling in cloud environment - Google Patents

Method and system for automatic fault handling in cloud environment

Info

Publication number
CN111865695A
Authority
CN
China
Prior art keywords
information
module
processing
index
abnormal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010737436.4A
Other languages
Chinese (zh)
Inventor
陈玉林
蔡卫卫
宋伟
申嘉童
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Cloud Information Technology Co Ltd
Original Assignee
Inspur Cloud Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Cloud Information Technology Co Ltd
Priority to CN202010737436.4A
Publication of CN111865695A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/069 Management of faults, events, alarms or notifications using logs of notifications; Post-processing of notifications
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1001 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L67/1004 Server selection for load balancing
    • H04L67/1008 Server selection for load balancing based on parameters of servers, e.g. available memory or workload
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G06F2009/4557 Distribution of virtual machine instances; Migration and load balancing

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a method and a system for automatic fault handling in a cloud environment, belonging to the technical field of cloud computing. The system comprises an index acquisition module, an index acquisition and storage module, an abnormality detection module, an abnormality notification module, an abnormality processing module, a recovery detection module, a result feedback module and a log module; a virtual environment is built on a physical environment, and the modules are integrated into the system. The invention ensures that faults in the cloud environment are not perceived at the user level and do not affect user services, thereby safeguarding the user experience and reducing the operation and maintenance burden.

Description

Method and system for automatic fault handling in cloud environment
Technical Field
The invention relates to the technical field of cloud computing, in particular to a method and a system for automatic fault handling in a cloud environment.
Background
In a cloud environment, virtual machines offer great convenience. Resource allocation is simple: a user can apply for a virtual machine of a suitable specification according to actual needs, and resources can be allocated dynamically. Developing services on the cloud avoids the inconvenience of a traditional self-built environment, the user does not need to consider operation and maintenance, and development efficiency and cost are effectively improved.
In the prior art there are many high-availability schemes, covering both basic physical facilities (such as switches, routers, network cards and power supplies) and software-level facilities (such as databases, message queues and proxy services). Even with these technologies, however, it cannot be guaranteed that a failure in the environment will not affect the user, so a mechanism is required to ensure that the user-level environment is not affected.
Disclosure of Invention
The technical task of the invention is to address the above deficiencies by providing a method and a system for automatic fault handling in a cloud environment, so that faults in the cloud environment are not perceived at the user level and user services are not affected.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a method for automatic fault handling in a cloud environment, in which a virtual environment is built at the user level in the cloud environment, and automatic fault handling is achieved through index acquisition, index acquisition and storage, abnormality detection, abnormality notification, abnormality handling, recovery detection and result feedback.
With this method, an abnormal environment can be recovered in as short a time as possible, or, when it cannot be recovered in the short term, cloud resources can be moved to a normal physical environment, so that the abnormality is not perceived at the user level and user services are not affected;
by introducing an automatic handling mechanism, the workload of operation and maintenance personnel is reduced: most faults do not require their immediate attention because the automatic handling program is executed first, which extends the time window available to them for restoring the environment.
Specifically, the index acquisition collects data to obtain environment information; the environment index information is of several types, including instantaneous values, accumulated values, variance values and absolute values;
the index acquisition and storage periodically requests the index information collected by the index acquisition and stores it in a time-series database for a period of time, which facilitates later data processing and history tracking;
the abnormality detection checks, through computation, whether the collected indexes are abnormal;
the abnormality notification exports the abnormal information and sends out alarm information by means of a message queue;
the abnormality handling subscribes to the alarm information: when the abnormality notification finds an abnormality it sends an abnormality message, and the abnormality handling captures that message, extracts the useful information from it and performs the corresponding handling according to its type;
the recovery detection exists because abnormality handling is generally executed asynchronously for efficiency, so it does not check its own result and instead sends a message to a queue once handling is finished; the recovery detection judges whether recovery has succeeded by means of a long-loop task and, if the abnormality persists, can either trigger fault handling again or send a feedback message;
the result feedback subscribes to the feedback messages of abnormality handling, which typically contain only abnormality messages, and interfaces with common message notification means.
The log module records the fault details and the key steps in the fault recovery process, which facilitates subsequent maintenance.
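The message flow between notification, handling and recovery detection can be pictured with a small publish/subscribe sketch. The source names no message queue product, so the in-process `Broker` class, the topic names `alarm` and `handled`, and the message fields are all hypothetical stand-ins.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Message = Dict[str, str]


class Broker:
    """In-process stand-in for a message queue (assumption, not the patent's component)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[Message], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[Message], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: Message) -> None:
        # Deliver the message to every subscriber of the topic, in subscription order.
        for callback in list(self._subscribers[topic]):
            callback(message)


broker = Broker()
pending_checks: List[Message] = []  # work items for recovery detection


def handle_alarm(alarm: Message) -> None:
    # Abnormality handling: extract the useful fields and act according to type.
    action = "evacuate" if alarm["type"] == "host_down" else "restart_service"
    # The handler does not verify its own result; it only reports that it finished.
    broker.publish("handled", {"target": alarm["target"], "action": action})


broker.subscribe("alarm", handle_alarm)            # abnormality handling subscribes to alarms
broker.subscribe("handled", pending_checks.append)  # recovery detection picks up finished work

# Abnormality notification exports an alarm.
broker.publish("alarm", {"type": "host_down", "target": "node-1"})
```

After the final `publish`, `pending_checks` holds one entry describing the evacuation of `node-1`, which recovery detection would then verify.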
Further, the method is implemented through the following steps:
1) index data are acquired;
2) the acquired data are sent to a data storage end, which processes the information and then samples and stores it;
3) state inference is performed on the specified indexes to judge whether the environment has a fault and, if so, the fault type and fault severity level, and an alarm message is sent to a message queue;
4) different alarm information is sent to different processing units by means of routing distribution, or part of the information is reported directly to operation and maintenance personnel;
5) virtual machine operation faults are handled using virtual machine techniques including hot migration, cold migration and/or evacuation, so that the impact on the tenant side is minimized;
6) the result of the abnormality handling is checked for recovery in a long-loop manner, and failed handling is either requested again or reported to a feedback module;
7) the feedback module records the handling result and, according to the configuration, sends a notification to a handler, allowing an administrator or fault handler to intervene;
8) the operation flow and the handling result are recorded, and the log level can be specified.
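Step 6 can be illustrated with a minimal long-loop check. This is a sketch under stated assumptions: the function name `recovery_check`, its callbacks, the round limit and the polling interval are hypothetical, and the source does not say how many rounds are attempted before feedback is sent.

```python
import time
from typing import Callable


def recovery_check(
    is_recovered: Callable[[], bool],
    retry_handling: Callable[[], None],
    send_feedback: Callable[[str], None],
    max_rounds: int = 3,
    interval_s: float = 0.0,
) -> bool:
    """Poll for recovery; re-request handling while the fault persists.

    Feedback is sent only when recovery ultimately fails, mirroring the
    statement that feedback messages typically carry abnormal results.
    """
    for _ in range(max_rounds):
        if is_recovered():
            return True
        retry_handling()        # trigger fault handling again
        time.sleep(interval_s)  # wait before the next round of the loop
    send_feedback(f"not recovered after {max_rounds} check(s)")
    return False
```

A caller would pass a probe for `is_recovered`, for example a health check against the affected virtual machine, and a function that re-publishes the alarm for `retry_handling`.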
Preferably, the index data acquisition uses interface-based programming and adopts a universal acquisition interface (a collection interface), so that new acquisition modules can be added by implementing the interface; acquisition requests are initiated passively, that is, the acquisition module collects only when it is requested to, which lowers the module's use of system resources and effectively reduces the resource consumption caused by monitoring.
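One way to read the universal acquisition interface with passive collection is sketched here. The class names `Collector`, `DiskCollector` and `CollectorRegistry`, and the index name `disk_used_ratio`, are hypothetical; the source specifies only that an interface exists and that collection is initiated passively.

```python
import os
import shutil
from abc import ABC, abstractmethod
from typing import Dict


class Collector(ABC):
    """Universal acquisition interface; a new collector extends the module by implementing it."""

    name: str

    @abstractmethod
    def collect(self) -> Dict[str, float]:
        """Return the current index values, keyed by index name."""


class DiskCollector(Collector):
    name = "disk"

    def collect(self) -> Dict[str, float]:
        usage = shutil.disk_usage(os.getcwd())
        return {"disk_used_ratio": usage.used / usage.total}


class CollectorRegistry:
    """Holds collectors and runs one only when a request arrives (passive collection)."""

    def __init__(self) -> None:
        self._collectors: Dict[str, Collector] = {}

    def register(self, collector: Collector) -> None:
        self._collectors[collector.name] = collector

    def handle_request(self, name: str) -> Dict[str, float]:
        # No background polling: work happens only while serving a request.
        return self._collectors[name].collect()
```

Because nothing runs between requests, the polling period is decided entirely by the storage side, and the collector consumes resources only while answering.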
Preferably, the storage back end of the data storage end is a time-series database, and a centralized module performs unified data acquisition and storage.
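The sampling and time-limited storage described for the data storage end can be mimicked in memory. This is an illustrative stand-in for a real time-series database; the class `TimeSeriesStore` and its parameters `retention_s` and `sample_every` are hypothetical, and keeping every Nth point is only one possible sampling policy.

```python
from collections import defaultdict, deque
from typing import DefaultDict, Deque, List, Tuple

Point = Tuple[float, float]  # (timestamp, value)


class TimeSeriesStore:
    def __init__(self, retention_s: float, sample_every: int = 1) -> None:
        self.retention_s = retention_s
        self.sample_every = sample_every
        self._series: DefaultDict[str, Deque[Point]] = defaultdict(deque)
        self._seen: DefaultDict[str, int] = defaultdict(int)

    def write(self, metric: str, ts: float, value: float) -> None:
        self._seen[metric] += 1
        # Sampling: keep the 1st, (1 + N)th, (1 + 2N)th ... point of each series.
        if (self._seen[metric] - 1) % self.sample_every != 0:
            return
        points = self._series[metric]
        points.append((ts, value))
        # Retention: drop points older than the retention window.
        while points and points[0][0] < ts - self.retention_s:
            points.popleft()

    def query(self, metric: str, since: float = float("-inf")) -> List[Point]:
        return [(t, v) for t, v in self._series[metric] if t >= since]
```

Writing timestamps 0 through 29 with `retention_s=10` and `sample_every=2` leaves the points at 18, 20, 22, 24, 26 and 28.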
Preferably, faults are handled through a scheduling mechanism: faults of different priorities are handled in graded order, and the fault handling logic is executed asynchronously.
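Priority-graded, asynchronous handling might look like the following sketch built on `asyncio.PriorityQueue`. The function name `handle_faults`, the convention that a lower number means a more urgent fault, and the single worker are assumptions made for the illustration.

```python
import asyncio
from typing import List, Tuple


async def handle_faults(faults: List[Tuple[int, str]]) -> List[str]:
    """Handle (priority, name) faults in priority order; lower number = more urgent."""
    pending: asyncio.PriorityQueue = asyncio.PriorityQueue()
    for fault in faults:
        pending.put_nowait(fault)

    handled: List[str] = []

    async def worker() -> None:
        while not pending.empty():
            _priority, name = pending.get_nowait()
            await asyncio.sleep(0)  # placeholder for the real remediation call
            handled.append(name)

    # The handling logic runs as a task on the event loop rather than inline.
    await asyncio.create_task(worker())
    return handled
```

Running `asyncio.run(handle_faults([(2, "disk"), (0, "host-down"), (1, "nic")]))` handles `host-down` first, then `nic`, then `disk`. With several workers the start order would still follow priority, but completion order would depend on how long each remediation takes.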
The invention also claims a system for automatic fault handling in a cloud environment, comprising an index acquisition module, an index acquisition and storage module, an abnormality detection module, an abnormality notification module, an abnormality handling module, a recovery detection module, a result feedback module and a log module; a virtual environment is built on a physical environment, and the modules are integrated into the system. User experience is thereby guaranteed and the operation and maintenance burden reduced; the framework is designed to be extensible, so functions can easily be added.
Specifically:
the index acquisition module acquires environment information, including instantaneous values, accumulated values, variance values and absolute values;
the index acquisition and storage module periodically requests index information from the index acquisition module and stores it in the time-series database for a period of time, which facilitates later data processing and history tracking;
the abnormality detection module checks, through computation, whether the collected indexes are abnormal;
the abnormality notification module exports the abnormal information and sends out alarm information by means of a message queue;
the abnormality handling module subscribes to the alarm information; when the abnormality notification module discovers an abnormality it sends an abnormality message, and the abnormality handling module captures the message, extracts the useful information from it, and handles it differently according to its type;
the recovery detection module judges whether recovery has succeeded by means of a long-loop task and, if the abnormality persists, can either trigger fault handling again or send a feedback message;
the result feedback module subscribes to the feedback messages of abnormality handling;
the log module records fault details and the key steps in the fault recovery process.
Further, the system performs automatic fault handling as follows:
1) the index acquisition module acquires data; the module provides a universal acquisition interface, and acquisition is initiated passively;
2) the data acquired by the index acquisition module are sent to a data storage end, which processes the information and then samples and stores it;
3) the abnormality detection module infers the state of the specified indexes and sends alarm information to a message queue;
4) the abnormality notification module sends different alarm information to different processing units, or reports part of the information directly to operation and maintenance personnel;
5) the abnormality handling module handles common faults; for faults that cause a virtual machine to run abnormally, it uses virtual machine techniques including hot migration, cold migration and/or evacuation;
6) the recovery detection module checks, in a long-loop manner, whether the abnormality handling has achieved recovery, and either re-requests failed handling or sends information to the feedback module;
7) the feedback module records the handling result and can send a notification to a handler according to the configuration;
8) the log module records the operation flow and the handling result.
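The routing role of the abnormality notification module in step 4 can be sketched as a lookup from alarm type to processing unit, with operators as the fallback. The class `AlarmRouter`, the `type` field and the rule that unrouted alarms go to operators are assumptions; the source does not state which alarms bypass automatic handling.

```python
from typing import Callable, Dict

Alarm = Dict[str, str]
Unit = Callable[[Alarm], None]


class AlarmRouter:
    def __init__(self, notify_operator: Unit) -> None:
        self._routes: Dict[str, Unit] = {}
        self._notify_operator = notify_operator

    def add_route(self, fault_type: str, unit: Unit) -> None:
        self._routes[fault_type] = unit

    def dispatch(self, alarm: Alarm) -> str:
        """Send the alarm to its processing unit; return where it went."""
        unit = self._routes.get(alarm["type"])
        if unit is None:
            # No automatic handler is registered: inform operators directly.
            self._notify_operator(alarm)
            return "operator"
        unit(alarm)
        return alarm["type"]
```

New processing units are added with `add_route`, so routing stays independent of the handling logic.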
The invention also claims a computer readable medium having stored thereon computer instructions which, when executed by a processor, cause the processor to perform the above-described method.
Compared with the prior art, the method and the system for automatically processing the fault in the cloud environment have the following beneficial effects:
The method and the system divide the system into modules by function, which is convenient for development and testing, and can effectively improve the fault tolerance of the cloud platform. Common faults on the platform are handled by various technical means: for hardware abnormalities (CPU, memory, network card, power supply), cloud resources are migrated (evacuated) to a normal environment through virtualization technology; for unbalanced resource distribution, highly loaded resources are moved to a lightly loaded environment in real time through automatic coordination; for basic software faults, availability is checked by probes, and when an abnormality is found an alarm is raised and recovery is performed promptly.
Drawings
Fig. 1 is a system configuration diagram of automatic failure processing in a cloud environment according to an embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following figures and specific examples.
While the "cloud" provides great convenience, it also poses challenges for cloud development and for operation and maintenance: an actual "cloud" is built on a physical environment, which may still suffer various problems or failures. Existing automatic fault handling frameworks for cloud environments are not mature enough. This embodiment provides a usable automatic fault handling method whose ultimate aim is to ensure that faults in the cloud environment are not perceived at the user level.
In this method for automatic fault handling in a cloud environment, a virtual environment is built at the user level in the cloud environment, and automatic fault handling is achieved through index acquisition, index acquisition and storage, abnormality detection, abnormality notification, abnormality handling, recovery detection and result feedback.
With this method, an abnormal environment can be recovered in as short a time as possible, or, when it cannot be recovered in the short term, cloud resources can be moved to a normal physical environment, so that the abnormality is not perceived at the user level and user services are not affected;
by introducing an automatic handling mechanism, the workload of operation and maintenance personnel is reduced: most faults do not require their immediate attention because the automatic handling program is executed first, which extends the time window available to them for restoring the environment.
The index acquisition collects data to obtain environment information; the environment index information is of several types, including instantaneous values, accumulated values, variance values and absolute values;
the index acquisition and storage periodically requests the index information collected by the index acquisition and stores it in a time-series database for a period of time, which facilitates later data processing and history tracking;
the abnormality detection checks, through computation, whether the collected indexes are abnormal;
the abnormality notification exports the abnormal information and sends out alarm information by means of a message queue;
the abnormality handling subscribes to the alarm information: when the abnormality notification finds an abnormality it sends an abnormality message, and the abnormality handling captures that message, extracts the useful information from it and performs the corresponding handling according to its type;
the recovery detection exists because abnormality handling is generally executed asynchronously for efficiency, so it does not check its own result and instead sends a message to a queue once handling is finished; the recovery detection judges whether recovery has succeeded by means of a long-loop task and, if the abnormality persists, can either trigger fault handling again or send a feedback message;
the result feedback subscribes to the feedback messages of abnormality handling, which typically contain only abnormality messages, and interfaces with common message notification means.
The log module records the fault details and the key steps in the fault recovery process, which facilitates subsequent maintenance.
The method is implemented through the following steps:
1) index data are acquired using interface-based programming with a universal acquisition interface (a collection interface), so that new acquisition modules can be added by implementing the interface; acquisition requests are initiated passively, which lowers the acquisition module's use of system resources and effectively reduces the resource consumption caused by monitoring;
2) the acquired data are sent to a data storage end, which processes the information and then samples and stores it; the storage back end of the data storage end is a time-series database, and a centralized module performs unified data acquisition and storage;
3) state inference is performed on the specified indexes to judge whether the environment has a fault and, if so, the fault type and fault severity level, and an alarm message is sent to a message queue;
4) different alarm information is sent to different processing units by means of routing distribution, or part of the information is reported directly to operation and maintenance personnel;
5) virtual machine operation faults are handled using virtual machine techniques including hot migration, cold migration and/or evacuation, so that the impact on the tenant side is minimized; faults are handled through a scheduling mechanism, faults of different priorities are handled in graded order, and the fault handling logic is executed asynchronously;
6) the result of the abnormality handling is checked for recovery in a long-loop manner, and failed handling is either requested again or reported to a feedback module;
7) the feedback module records the handling result and, according to the configuration, sends a notification to a handler, allowing an administrator or fault handler to intervene;
8) the operation flow and the handling result are recorded, and the log level can be specified.
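The state inference in step 3 outputs three things: whether a fault exists, its type, and its severity. A threshold-based sketch is given here; the source does not define the inference rule, so the thresholds in `RULES`, the index names, and the requirement that a level be sustained over a window of samples are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Inference:
    fault_type: str
    severity: str


# Hypothetical (warning, critical) thresholds per index.
RULES: Dict[str, Tuple[float, float]] = {
    "cpu_util": (0.85, 0.95),
    "mem_util": (0.90, 0.97),
}


def infer_state(metric: str, samples: Sequence[float], window: int = 3) -> Optional[Inference]:
    """Return None when healthy, otherwise the inferred fault type and severity."""
    if metric not in RULES or len(samples) < window:
        return None
    warning, critical = RULES[metric]
    # Require the level to hold for the whole window so one spike raises no alarm.
    floor = min(samples[-window:])
    if floor >= critical:
        return Inference(metric, "critical")
    if floor >= warning:
        return Inference(metric, "warning")
    return None
```

A non-`None` result is what would be serialized into the alarm message sent to the queue.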
The invention also claims a system for automatic fault handling in a cloud environment, comprising an index acquisition module, an index acquisition and storage module, an abnormality detection module, an abnormality notification module, an abnormality handling module, a recovery detection module, a result feedback module and a log module; a virtual environment is built on a physical environment, and the modules are integrated into the system, which guarantees the user experience and reduces the operation and maintenance burden. The framework is designed to be extensible, so functions can easily be added.
The index acquisition module is a single-purpose component whose aim is to acquire environment information; the index information is of several types, including instantaneous values, accumulated values, variance values and absolute values.
The index acquisition and storage module periodically requests index information from the index acquisition module and stores it in the time-series database for a period of time, which facilitates later data processing and history tracking.
The abnormality detection module works on the data held by the index acquisition and storage module; its aim is to check, through computation, whether the collected indexes are abnormal.
The abnormality notification module exports the abnormal information and sends out alarm information by means of a message queue.
The abnormality handling module subscribes to the alarm information. When the abnormality notification module discovers an abnormality it sends an abnormality message, which the abnormality handling module captures; the module then extracts the useful information from the message and handles it differently according to its type.
Within the abnormality handling module, handling is generally executed asynchronously for efficiency, so the module does not check the result itself and instead sends a message to a queue once handling is finished; the recovery detection module judges whether recovery has succeeded by means of a long-loop task and, if the abnormality persists, can either trigger fault handling again or send a feedback message.
The result feedback module subscribes to the feedback messages of abnormality handling, which generally contain only abnormality messages, and interfaces with common message notification means.
The log module records fault details and the key steps in the fault recovery process, which facilitates subsequent maintenance.
Fig. 1 shows the architecture of the system for automatic fault handling in a cloud environment. At the bottom layer the system uses infrastructure comprising a database and a message queue, for index data storage and inter-component message delivery respectively. Each component is responsible for an independent function. In addition, the system ultimately serves operation and maintenance personnel and provides an interface for viewing and follow-up handling.
The system performs automatic fault handling through the following steps:
1) the index acquisition module acquires data; the module provides a universal acquisition interface, so that new acquisition modules can be added by implementing the interface; in addition, acquisition is initiated passively, so the acquisition module's use of system resources is low;
2) the data acquired by the index acquisition module are sent to a data storage end, which processes the information and then samples and stores it;
3) the abnormality detection module infers the state of the specified indexes and sends alarm information to a message queue;
4) the abnormality notification module acts as a router: it sends different alarm information to different processing units, or reports part of the information directly to operation and maintenance personnel;
5) the abnormality handling module is the core of the whole system; it handles common faults and, for faults that cause a virtual machine to run abnormally, uses virtual machine techniques such as hot migration, cold migration and evacuation, so that the impact on the user side is minimized;
6) the recovery detection module checks, in a long-loop manner, whether the abnormality handling has achieved recovery, and either re-requests failed handling or sends information to the feedback module;
7) the feedback module records the handling result and can send a notification to a handler according to the configuration;
8) the log module records the operation flow and the handling result.
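The source lists hot migration, cold migration and evacuation but does not say when each is chosen. The selection rule sketched here is therefore an assumption, modelled on how these operations are commonly distinguished in virtualization platforms: evacuation when the host itself is down, hot (live) migration when the host is up and the virtual machine is running, cold migration when the host is up and the virtual machine is stopped. The names `Action` and `choose_action` are hypothetical.

```python
from enum import Enum


class Action(Enum):
    HOT_MIGRATION = "hot_migration"    # move a running VM without stopping it
    COLD_MIGRATION = "cold_migration"  # move a stopped VM
    EVACUATION = "evacuation"          # rebuild the VM elsewhere after host failure


def choose_action(host_up: bool, vm_running: bool) -> Action:
    """Pick a VM recovery technique (assumed rule, not stated in the source)."""
    if not host_up:
        # The source host cannot take part in a migration, so evacuate.
        return Action.EVACUATION
    if vm_running:
        return Action.HOT_MIGRATION
    return Action.COLD_MIGRATION
```

In a real deployment the decision would likely also weigh fault severity and the capacity of candidate target hosts, which this sketch leaves out.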
In the system, the index acquisition module uses interface-based programming and initiates requests passively, which effectively reduces the system resource consumption caused by monitoring; the index acquisition and storage module uses a time-series database as its storage back end, with a centralized module acquiring and storing data uniformly; the abnormality detection module uses state inference to judge whether the environment has a fault and, if so, its type and severity level; the abnormality notification module sends different messages to different processing back ends by routing distribution; the abnormality handling module handles faults through a scheduling mechanism, grading faults of different priorities and executing the fault handling logic asynchronously; the recovery detection module uses a long loop to check whether the recovery performed asynchronously by the abnormality handling module has completed normally; the feedback module supports interaction with an administrator or fault handler; and the log module records the handling information at a specifiable log level.
An embodiment of the present invention further provides a computer-readable medium, where the computer-readable medium stores computer instructions, and when the computer instructions are executed by a processor, the processor is caused to execute the method for automatic fault handling in a cloud environment described in the above embodiment of the present invention. Specifically, a system or an apparatus equipped with a storage medium on which software program codes that realize the functions of any of the above-described embodiments are stored may be provided, and a computer (or a CPU or MPU) of the system or the apparatus is caused to read out and execute the program codes stored in the storage medium.
In this case, the program code itself read from the storage medium can realize the functions of any of the above-described embodiments, and thus the program code and the storage medium storing the program code constitute a part of the present invention.
Examples of the storage medium for supplying the program code include a floppy disk, a hard disk, a magneto-optical disk, an optical disk (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD+RW), a magnetic tape, a nonvolatile memory card, and a ROM. Alternatively, the program code may be downloaded from a server computer via a communications network.
Further, it should be clear that the functions of any one of the above-described embodiments may be implemented not only by executing the program code read out by the computer, but also by causing an operating system or the like operating on the computer to perform a part or all of the actual operations based on instructions of the program code.
Further, it is to be understood that the program code read out from the storage medium may be written to a memory provided on an expansion board inserted into the computer or in an expansion unit connected to the computer, after which a CPU or the like mounted on the expansion board or the expansion unit performs part or all of the actual operations based on the instructions of the program code, thereby realizing the functions of any of the above-described embodiments.
While the invention has been shown and described in detail in the drawings and in the preferred embodiments, it is not intended to limit the invention to the embodiments disclosed, and it will be apparent to those skilled in the art that various combinations of the code auditing means in the various embodiments described above may be used to obtain further embodiments of the invention, which are also within the scope of the invention.
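The graded, asynchronous fault handling and the loop-based recovery detection described in the embodiments can be sketched as follows. This is a hedged illustration only: the function `handle_faults`, its parameters, the numeric priorities, and the outcome labels `"recovered"` and `"feedback"` are assumptions for the example, and Python's `asyncio` stands in for whatever scheduling mechanism an actual system would use.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple


async def handle_faults(
    faults: List[Tuple[int, str]],
    repair: Callable[[str], Awaitable[None]],
    is_recovered: Callable[[str], bool],
    max_checks: int = 5,
    interval: float = 0.01,
) -> List[Tuple[str, str]]:
    """Handle faults in priority order (lower number is handled first).

    Returns (fault_id, outcome) pairs. The outcome is "recovered" when the
    recovery check succeeds, otherwise "feedback", meaning the fault would be
    passed on to a feedback module or re-requested.
    """
    queue: asyncio.PriorityQueue = asyncio.PriorityQueue()
    for priority, fault_id in faults:
        queue.put_nowait((priority, fault_id))

    results: List[Tuple[str, str]] = []
    while not queue.empty():
        _, fault_id = queue.get_nowait()
        # Run the repair as a background task; the loop below polls for
        # recovery instead of awaiting the repair directly.
        task = asyncio.create_task(repair(fault_id))
        outcome = "feedback"
        # Recovery detection: a bounded polling loop.
        for _ in range(max_checks):
            await asyncio.sleep(interval)
            if task.done() and is_recovered(fault_id):
                outcome = "recovered"
                break
        if task.done():
            task.exception()  # retrieve any repair error so it is not left unhandled
        else:
            task.cancel()  # give up on a repair that is still running
        results.append((fault_id, outcome))
    return results
```

In this sketch `repair` might wrap a hot migration, cold migration, or evacuation call, and `is_recovered` might query the monitoring data again; both are supplied by the caller because the source does not define their interfaces.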

Claims (10)

1. A method for automatic fault handling in a cloud environment, characterized in that a virtual environment is built at the user level in the cloud environment, and automatic fault handling is achieved through index collection, index acquisition and storage, abnormality detection, abnormality notification, abnormality handling, recovery detection and result feedback.
2. The method for automatic fault handling in a cloud environment according to claim 1, wherein the index collection collects data to obtain environment information, the information including instantaneous values, accumulated values, variance values and absolute values;
the index acquisition and storage periodically requests the index information collected by the index collection and stores it in a time-series database for a period of time;
the abnormality detection checks, through computation, whether the collected indexes are abnormal;
the abnormality notification exports the abnormal information and sends out alarm information by means of a message queue;
the abnormality handling subscribes to the alarm information, captures the abnormal information, extracts the useful information from it and handles it according to its type;
the recovery detection judges recovery by means of a long-running loop task and, when the result is abnormal, can either trigger fault handling again or feed back a message;
the result feedback subscribes to the feedback information of the abnormality handling.
3. The method for automatic fault handling in a cloud environment according to claim 1 or 2, wherein the method is implemented by the following steps:
1) acquiring index data;
2) sending the acquired data to a data storage end, which processes the information and then samples and stores it;
3) inferring the state of the specified index and sending alarm information to a message queue;
4) sending different alarm information to different processing units, or notifying operation and maintenance personnel of part of the information directly;
5) handling virtual machine operation faults using virtual machine techniques, including hot migration, cold migration and/or evacuation;
6) checking, in a long-running loop, whether the abnormality handling achieved recovery, and either re-requesting handling that failed or sending information to a feedback module;
7) recording, by the feedback module, the processing result information and sending a notice to the handler according to the configuration;
8) recording the operation flow and the processing result.
4. The method according to claim 3, wherein the index data is acquired through interface-oriented programming, with acquisition requests initiated passively.
5. The method for automatic fault handling in a cloud environment according to claim 3, wherein the storage back end of the data storage end uses a time-series database, and a centralized module acquires and stores the data uniformly.
6. The method of claim 3, wherein faults are handled through a scheduling mechanism, faults of different priorities are handled by grade, and the fault handling logic is executed asynchronously.
7. A system for automatic fault handling in a cloud environment, characterized by comprising an index acquisition module, an index acquisition and storage module, an abnormality detection module, an abnormality notification module, an abnormality processing module, a recovery detection module, a result feedback module and a log module, wherein a virtual environment is built in a physical environment and the modules are integrated into the system.
8. The system for automatic fault handling in a cloud environment according to claim 7, wherein
the index acquisition module acquires environment information, the information including instantaneous values, accumulated values, variance values and absolute values;
the index acquisition and storage module periodically requests index information from the index acquisition module and stores the acquired information in a time-series database for a period of time;
the abnormality detection module checks, through computation, whether the collected indexes are abnormal;
the abnormality notification module exports the abnormal information and sends out alarm information by means of a message queue;
the abnormality processing module subscribes to the alarm information; when the abnormality notification module sends abnormal information upon discovering an abnormality, the abnormality processing module captures the information, extracts the useful information from it and handles it differently according to its type;
the recovery detection module judges recovery by means of a long-running loop task and, when the result is abnormal, can either trigger fault handling again or feed back a message;
the result feedback module subscribes to the feedback information of the abnormality processing;
the log module records fault details and the key steps of the fault recovery process.
9. The system for automatic fault handling in a cloud environment according to claim 7 or 8, wherein the system performs automatic fault handling as follows:
1) the index acquisition module acquires data, the module providing a universal acquisition interface through which acquisition is initiated passively;
2) the data acquired by the index acquisition module is sent to a data storage end, which processes the information and then samples and stores it;
3) the abnormality detection module infers the state of the specified index and sends alarm information to a message queue;
4) the abnormality notification module sends different alarm information to different processing units, or notifies operation and maintenance personnel of part of the information directly;
5) the abnormality processing module handles common faults and, for faults that cause a virtual machine to run abnormally, uses virtual machine techniques including hot migration, cold migration and/or evacuation to handle the virtual machine operation fault;
6) the recovery detection module checks, in a long-running loop, whether the abnormality processing achieved recovery, and either re-requests processing that failed or sends information to the feedback module;
7) the feedback module records the processing result information and can send a notice to the handler according to the configuration;
8) the log module records the operation flow and the processing result.
10. A computer-readable medium, characterized in that it has stored thereon computer instructions which, when executed by a processor, cause the processor to carry out the method of any one of claims 1 to 6.
CN202010737436.4A 2020-07-28 2020-07-28 Method and system for automatic fault handling in cloud environment Pending CN111865695A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010737436.4A CN111865695A (en) 2020-07-28 2020-07-28 Method and system for automatic fault handling in cloud environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010737436.4A CN111865695A (en) 2020-07-28 2020-07-28 Method and system for automatic fault handling in cloud environment

Publications (1)

Publication Number Publication Date
CN111865695A true CN111865695A (en) 2020-10-30

Family

ID=72948774

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010737436.4A Pending CN111865695A (en) 2020-07-28 2020-07-28 Method and system for automatic fault handling in cloud environment

Country Status (1)

Country Link
CN (1) CN111865695A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112579335A (en) * 2020-12-18 2021-03-30 歌尔光学科技有限公司 Intelligent equipment fault processing method, device, equipment and storage medium
CN113900888A (en) * 2021-09-22 2022-01-07 山东新一代信息产业技术研究院有限公司 Indoor distribution robot state monitoring system
CN113965459A (en) * 2021-10-08 2022-01-21 浪潮云信息技术股份公司 Consul-based method for monitoring host network to realize high availability of computing nodes
CN115858324A (en) * 2023-02-02 2023-03-28 北京神州光大科技有限公司 IT equipment fault processing method, device, equipment and medium based on AI
CN116155686A (en) * 2023-01-30 2023-05-23 浪潮云信息技术股份公司 Method for judging node faults in cloud environment
CN117130965A (en) * 2023-02-24 2023-11-28 荣耀终端有限公司 Sensor Hub data management method and electronic equipment
CN116155686B (en) * 2023-01-30 2024-05-31 浪潮云信息技术股份公司 Method for judging node faults in cloud environment

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105550012A (en) * 2015-12-07 2016-05-04 国云科技股份有限公司 Method for custom recovery of malfunctioning virtual machine
CN107547273A (en) * 2017-08-18 2018-01-05 国网山东省电力公司信息通信公司 A kind of support method and system of power system virtual instance High Availabitity
CN109240246A (en) * 2018-10-31 2019-01-18 特变电工南京智能电气有限公司 A kind of charging station intelligence operational system and method
CN110290012A (en) * 2019-07-03 2019-09-27 浪潮云信息技术有限公司 The detection recovery system and method for RabbitMQ clustering fault
CN110809017A (en) * 2019-08-16 2020-02-18 云南电网有限责任公司玉溪供电局 Data analysis application platform system based on cloud platform and micro-service framework
CN110912755A (en) * 2019-12-16 2020-03-24 浪潮云信息技术有限公司 System and method for network card fault monitoring and automatic recovery in cloud environment
CN111181767A (en) * 2019-12-10 2020-05-19 中国航空工业集团公司成都飞机设计研究所 Monitoring and fault self-healing system and method for complex system
CN111239853A (en) * 2020-03-06 2020-06-05 南京云狐信息科技有限公司 Automatic meteorological observation system

Similar Documents

Publication Publication Date Title
CN111865695A (en) Method and system for automatic fault handling in cloud environment
CN105357038B (en) Monitor the method and system of cluster virtual machine
CN103201724B (en) Providing application high availability in highly-available virtual machine environments
CN102656565B (en) Failover and recovery for replicated data instances
Zheng et al. Co-analysis of RAS log and job log on Blue Gene/P
CN102640108B (en) The monitoring of replicated data
CN109714202B (en) Client off-line reason distinguishing method and cluster type safety management system
CN106789306B (en) Method and system for detecting, collecting and recovering software fault of communication equipment
US10545807B2 (en) Method and system for acquiring parameter sets at a preset time interval and matching parameters to obtain a fault scenario type
CN112506702B (en) Disaster recovery method, device, equipment and storage medium for data center
JP2001188765A (en) Technique for referring to fault information showing plural related fault under distributed computing environment
JP2008217735A (en) Fault analysis system, method and program
CN111046011A (en) Log collection method, system, node, electronic device and readable storage medium
CN110618864A (en) Interrupt task recovery method and device
CN111382008B (en) Virtual machine data backup method, device and system
CN108199901B (en) Hardware repair reporting method, system, device, hardware management server and storage medium
CN108762886A (en) The fault detect restoration methods and system of virtual machine
CN112000504A (en) Fault processing method and device for computing node and electronic equipment
CN111193643A (en) Cloud server state monitoring system and method
CN106875018B (en) Method and device for automatic maintenance of super-large-scale machine
CN110912755A (en) System and method for network card fault monitoring and automatic recovery in cloud environment
CN113672452A (en) Method and system for monitoring operation of data acquisition task
CN105025179A (en) Method and system for monitoring service agents of call center
JP4102592B2 (en) Failure information notification system with an aggregation function and a program for causing a machine to function as a failure information notification means with an aggregation function
CN111104266A (en) Access resource allocation method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201030
