CN109144829B - Fault processing method and device, computer equipment and storage medium - Google Patents

Fault processing method and device, computer equipment and storage medium

Info

Publication number
CN109144829B
CN109144829B (application CN201811002316.9A)
Authority
CN
China
Prior art keywords
fault
fault alarm
processing
database
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811002316.9A
Other languages
Chinese (zh)
Other versions
CN109144829A (en)
Inventor
冷迪
陈瑞
黄建华
庞宁
吕志宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Power Supply Bureau Co Ltd
Original Assignee
Shenzhen Power Supply Bureau Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Power Supply Bureau Co Ltd filed Critical Shenzhen Power Supply Bureau Co Ltd
Priority to CN201811002316.9A priority Critical patent/CN109144829B/en
Publication of CN109144829A publication Critical patent/CN109144829A/en
Application granted granted Critical
Publication of CN109144829B publication Critical patent/CN109144829B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring
    • G06F11/32Monitoring with visual or acoustical indication of the functioning of the machine
    • G06F11/324Display of status information
    • G06F11/327Alarm or error message display

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application relates to a fault processing method and apparatus, a computer device, and a storage medium. The method comprises the following steps: acquiring fault alarm information; performing convergence processing on the fault alarm information and converting it into a fault alarm problem; acquiring a corresponding fault processing rule from a database according to the fault alarm problem; and performing fault processing according to the fault processing rule. The method avoids repeated reception of alarm information, saves fault processing time and storage resources, remedies insufficient automation, and processes fault alarm information automatically.

Description

Fault processing method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer applications, and in particular, to a fault handling method and apparatus, a computer device, and a computer storage medium.
Background
With the development of computer technology, operation and maintenance technology has emerged. Operation and maintenance means monitoring the running state of a service so that abnormal operation and abnormal resource consumption are discovered in time. When a fault occurs, an operation and maintenance engineer handles any service abnormality promptly to keep the problem from growing, or even from stopping the service. The engineer must also prepare plans for handling the various service abnormalities so that, when a problem occurs, a plan can be executed manually to stop the loss.
However, the current operation and maintenance method has the problem of insufficient automation.
Disclosure of Invention
In view of the above, it is necessary to provide a fault handling method, apparatus, computer device, and computer storage medium that can reduce alarm storms and handle fault alarm information automatically, addressing the technical problem of insufficient automation.
A method of fault handling, the method comprising: acquiring fault alarm information; carrying out convergence processing on the fault alarm information and converting the fault alarm information into a fault alarm problem; acquiring a corresponding fault processing rule from a database according to the fault alarm problem; and performing fault processing according to the fault processing rule.
In one embodiment, before acquiring the fault warning information, the method further includes: acquiring a fault alarm problem and a script for solving the fault alarm problem; splitting the script into sub-processing operations; recombining the sub-processing operations to form a fault handling rule; the fault alarm problem and the fault handling rule are stored in the database.
In one embodiment, the converging processing of the fault alarm information and the conversion into the fault alarm problem include: converging the same fault warning information which appears in the preset time into a piece of fault warning information; and converting the fault alarm information into a corresponding fault alarm problem.
In one embodiment, obtaining the corresponding fault handling rule from the database according to the fault alarm problem includes: matching the fault alarm problem in the database according to the fault alarm problem; and when the fault alarm problem is successfully matched with the fault alarm problem in the database, calling a fault processing rule in the database corresponding to the fault alarm problem in the database.
In one embodiment, the method further comprises: when the fault alarm problem is matched with the fault alarm problem in the database unsuccessfully, reporting the fault alarm problem; acquiring an input fault processing rule according to the fault alarm problem; and correspondingly storing the fault alarm problem and the input fault processing rule in the database.
In one embodiment, the obtaining the corresponding fault handling rule from the database according to the fault alarm problem further includes: when the fault alarm problem is matched with the fault alarm problem in the database unsuccessfully, identifying the fault alarm problem to obtain a fault sub-problem; matching a corresponding fault processing rule according to the fault subproblem; and forming the fault processing rule corresponding to the fault sub-problem into the fault processing rule corresponding to the fault alarm problem.
In one embodiment, the data in the database is stored in a Redis manner.
A fault handling apparatus, the apparatus comprising: an acquisition module for acquiring fault alarm information and for acquiring a corresponding fault processing rule from a database according to a fault alarm problem; a conversion module for performing convergence processing on the fault alarm information and converting it into the fault alarm problem; and a processing module for performing fault processing according to the fault processing rule.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program: acquiring fault alarm information; carrying out convergence processing on the fault alarm information and converting the fault alarm information into a fault alarm problem; acquiring a corresponding fault processing rule from a database according to the fault alarm problem; and performing fault processing according to the fault processing rule.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of: acquiring fault alarm information; carrying out convergence processing on the fault alarm information and converting the fault alarm information into a fault alarm problem; acquiring a corresponding fault processing rule from a database according to the fault alarm problem; and performing fault processing according to the fault processing rule.
According to the fault processing method and apparatus, the computer device, and the storage medium, acquiring the fault alarm information and performing convergence processing on it effectively reduces alarm storms, avoids repeated reception of alarm information, and saves fault processing time and storage resources; converting the fault alarm information into a fault alarm problem, obtaining the fault processing rule according to that problem, and performing fault processing remedies insufficient automation and processes the fault alarm information automatically.
Drawings
FIG. 1 is a diagram of an application environment of a fault handling method in one embodiment;
FIG. 2 is a flow diagram illustrating a method for fault handling in one embodiment;
FIG. 3 is a flow diagram that illustrates the generation of fault handling rules, according to one embodiment;
FIG. 4 is a flow chart illustrating a fault handling method according to another embodiment;
FIG. 5 is a flow chart illustrating a fault handling method according to yet another embodiment;
FIG. 6 is a flow chart illustrating a fault handling method according to yet another embodiment;
FIG. 7 is a diagram of an application scenario of the fault handling method in another embodiment;
FIG. 8 is a block diagram showing the structure of a failure processing apparatus according to an embodiment;
fig. 9 is a block diagram showing the structure of a failure processing apparatus according to another embodiment;
FIG. 10 is a diagram showing an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The fault handling method provided by the application can be applied to the application environment shown in fig. 1. Wherein the first server 102 communicates with the second server 104 over a network. The first server 102 and the second server 104 may be implemented by separate servers or a server cluster composed of a plurality of servers, and the first server 102 may be a heterogeneous database system, which is a collection of related database systems, and can implement sharing and transparent access of data. Each component of the heterogeneous database has autonomy, and each database system still has application characteristics, integrity control and security control of the database while realizing data sharing.
In one embodiment, as shown in fig. 2, a failure handling method is provided, which is described by taking the method as an example applied to the second server 104 in fig. 1, and includes the following steps:
step 202, obtaining fault alarm information.
The fault alarm information is the warning information generated when the first server 102 fails, and may include server downtime, an abnormal service stop, insufficient disk space, a server ping failure, and the like.
Specifically, the second server 104 acquires, from the first server 102, the fault alarm information generated when the first server 102 fails. The second server 104 includes a pre-receiver that provides an interface for automatically pulling from, or receiving pushes from, the alarm information sources; the pre-receiver receives fault alarm information from these sources and pushes it to the fault converter.
The alarm information sources include Zabbix, Open-Falcon, Nagios, a CMDB (Configuration Management Database), and the like, and are used for monitoring network systems, terminals, databases, services, processes, and so on.
Zabbix is an enterprise-level open source solution providing distributed system monitoring and network monitoring functions, and can monitor various network parameters and ensure the safe operation of a server system.
Open-Falcon is an enterprise-level, highly available, scalable open-source monitoring solution.
Nagios is an open-source computer-system and network monitoring tool that can effectively monitor the host states of Windows, Linux, and Unix systems.
The Linux system is a Unix-like operating system that supports multiple users, multiple tasks, multiple threads, and multiple CPUs (Central Processing Units).
The Unix operating system is a powerful multi-user, multi-tasking operating system that supports various processor architectures and, by operating-system classification, is a time-sharing operating system.
The CMDB (Configuration Management Database) is used for storing and managing various Configuration information of devices in the enterprise architecture, and is closely associated with all service support and service delivery processes, supports the operation of these processes, and exerts the value of the Configuration information.
By acquiring the fault alarm information from different alarm information sources, the consistency and accuracy of the fault alarm information can be ensured, the redundancy of data is reduced, the fault alarm information does not need to be stored in different places in the second server 104, and the management cost is reduced.
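As a minimal sketch of the pre-receiver described above, the following Python snippet pulls alarms from several sources and pushes them onto a queue for the fault converter. The endpoint URLs and payload shapes are assumptions for illustration, since the patent does not specify any source's API:

```python
import queue

import requests  # assumed available; any HTTP client would do

# Hypothetical pull endpoints; the patent does not specify each source's
# API, so these URLs and field names are illustrative only.
ALARM_SOURCES = {
    "zabbix": "http://zabbix.example.com/api/alerts",
    "open-falcon": "http://falcon.example.com/api/alerts",
}

alarm_queue: queue.Queue = queue.Queue()  # feeds the fault converter


def pull_alarms() -> None:
    """Pre-receiver role: pull raw fault alarm information from each
    configured source and push it onto the converter queue."""
    for source, url in ALARM_SOURCES.items():
        try:
            for alarm in requests.get(url, timeout=5).json():
                alarm_queue.put({"source": source, "raw": alarm})
        except requests.RequestException:
            # a failed pull from one source must not block the others
            continue
```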
Step 204, the fault alarm information is converged and converted into a fault alarm problem.
Wherein, the convergence processing refers to converging the same fault alarm information into a piece of fault alarm information.
Specifically, the second server 104 converges identical fault alarm information obtained from the first server 102 into a single piece of fault alarm information, and converts it into the corresponding fault alarm problem.
And step 206, acquiring a corresponding fault processing rule from the database according to the fault alarm problem.
Wherein, the fault processing rule refers to a scheme for solving the fault alarm problem.
Specifically, the second server 104 obtains the stored solution for solving the fault alarm problem from a database according to the fault alarm problem, and the database may be located in the second server 104 or may be independent of the second server 104.
And step 208, performing fault processing according to the fault processing rule.
Specifically, the second server 104 performs fault processing on the fault alarm problem according to the fault processing rule acquired from the database.
In this embodiment, when the pre-receiver receives fault alarm information from the Zabbix alarm system indicating that an application server is down and a large number of ping-unreachable messages have arrived, the fault converter converts the fault alarm information into a "server ping fault", invokes the fault processing rule in the database, executes the "restart server" script, and performs fault processing according to the rule.
In this embodiment, when the second server 104 receives the fault alarm problem of abnormal service stop, the fault processing rule in the database is called, the "restart service" script is executed, and fault processing is performed according to the fault processing rule.
In this embodiment, when the second server 104 receives a failure alarm problem of insufficient disk space or a performance problem, the second server invokes a failure handling rule in the database, executes a "clean log file and kill process" script, and performs failure handling according to the failure handling rule.
In this embodiment, after the fault processing is completed, the second server 104 may also feed back the fault processing result to the user side in real time, so as to facilitate the secondary checking processing.
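The dispatch step in the embodiments above can be sketched as a simple lookup-and-execute loop. This is an illustrative Python sketch; the script paths in the mapping are assumptions rather than part of the patent:

```python
import subprocess

# Illustrative problem -> script mapping mirroring the examples above;
# the script paths are assumptions, not part of the patent.
FAULT_RULES = {
    "server ping fault": ["/opt/ops/restart_server.sh"],
    "abnormal service stop": ["/opt/ops/restart_service.sh"],
    "insufficient disk space": ["/opt/ops/clean_log_file.sh",
                                "/opt/ops/kill_process.sh"],
}


def handle_fault(problem: str) -> str:
    """Look up the fault processing rule for a fault alarm problem and
    execute its scripts in order; unmatched problems are reported."""
    scripts = FAULT_RULES.get(problem)
    if scripts is None:
        return "reported"
    for script in scripts:
        subprocess.run([script], check=True)
    return "handled"
```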
In this fault processing method, acquiring the fault alarm information and performing convergence processing on it effectively reduces alarm storms, avoids repeated reception of alarm information, and saves fault processing time and storage resources; converting the fault alarm information into a fault alarm problem, obtaining the fault processing rule according to that problem, and performing fault processing remedies insufficient automation and processes the fault alarm information automatically.
In one embodiment, as shown in fig. 3, before acquiring the fault warning information, the fault handling method further includes:
step 302, a fault alarm problem and a script for solving the fault alarm problem are obtained.
Wherein, the script for solving the fault alarm problem is a solution for solving the fault alarm problem.
Specifically, the second server 104 obtains a failure alarm problem input by the user terminal or imported from another terminal or system and a script for solving the failure alarm problem.
Step 304, the script is split into sub-processing operations.
The sub-processing operation refers to a processing step in the script, and the processing step corresponds to solving a basic problem and is also the smallest fault processing unit.
Specifically, the second server 104 breaks the script solving the failure alarm problem into individual sub-processing operations.
In this embodiment, in the field of operation and maintenance technology, most operations can be completed by executing a script. The second server 104 treats each sub-processing operation of a script as an atom, and stores a large number of these sub-processing operations in an atom library located in the second server 104.
At step 306, the sub-processing operations are reassembled to form the fault handling rule.
Specifically, the second server 104 forms a solution to the fault alarm problem by recombining the processing steps in the script.
In this embodiment, the second server 104 may obtain sub-processing operations from the atom library to form script processing steps, scheduling the required sub-processing operations according to the fault alarm problem and organizing them into a fault processing rule. Throughout this process many atoms are reused, so for a specific fault alarm problem the required atoms can be combined from the atom library into a fault processing rule.
Step 308, the fault alarm problem and the fault handling rule are correspondingly stored in the database.
Specifically, the second server 104 stores the fault alarm problem and the fault handling rules for solving the fault alarm problem in the database in a one-to-one correspondence. The database is also used for storing configuration information, execution logs of fault processing and the like.
In this embodiment, the second server 104 stores the fault alarm problems in the form of a fault alarm problem table (but is not limited thereto), defines a fault handling rule table, configures the correspondence between fault alarm problems and fault handling rules, and stores both in the database.
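A minimal Python sketch of the atom library and rule recombination just described; the atom names and commands are illustrative assumptions:

```python
# Each sub-processing operation ("atom") is the smallest reusable
# fault-handling step; a rule is an ordered recombination of atoms.
# The atom names and commands below are illustrative assumptions.
ATOM_LIBRARY = {
    "stop_service": "systemctl stop {service}",
    "start_service": "systemctl start {service}",
    "clean_log_file": "find /var/log -name '*.log' -mtime +7 -delete",
}


def compose_rule(atom_names: list[str]) -> list[str]:
    """Recombine stored atoms into a fault processing rule."""
    return [ATOM_LIBRARY[name] for name in atom_names]


# a "restart service" rule reuses two existing atoms:
restart_rule = compose_rule(["stop_service", "start_service"])
```

Because atoms are reused across rules, adding a new fault processing rule is a matter of listing existing atoms in order rather than writing a new script.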
In this fault processing method, the fault alarm problem and the script for solving it are acquired, and the script is split into sub-processing operations, each solving a basic problem, that can be reused; recombining the sub-processing operations into different fault processing rules reduces the space occupied, solves more fault alarm problems, and saves writing and response time, since a script need not be rewritten for every fault alarm problem; and storing the fault alarm problems and fault processing rules in the database in correspondence makes invocation quicker and more intuitive.
In one embodiment, the converging the fault warning information and converting the fault warning information into the fault warning problem includes: converging the same fault warning information which appears in the preset time into a piece of fault warning information; and converting the fault alarm information into a corresponding fault alarm problem.
The preset time may range from 0 to 24 hours, but is not limited thereto; in this embodiment, 5 minutes is used as the preset time.
Specifically, the second server 104 converges and merges identical fault alarm information appearing within the preset time (here 5 minutes, though not limited thereto) into a single piece of fault alarm information, and converts it, through the fault converter, into the corresponding fault alarm problem.
In this embodiment, when the alarm system of the first server 102 reports that an application server is down and sends a large number of ping-unreachable messages, several hundred alarms may arrive within 5 minutes, causing an alarm storm. The alarms all reflect the same problem, so after convergence processing the second server 104 receives only one piece of fault alarm information. The fault converter converts it into a "server ping fault" and calls the corresponding fault processing rule, such as a "restart server" script, from the database to resolve the fault.
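The time-window convergence can be sketched as follows; an assumption-laden Python illustration in which identical alarms are keyed by their text and merged within the 5-minute window:

```python
import time

WINDOW_SECONDS = 300  # the 5-minute preset time used in this embodiment
_last_seen: dict = {}


def converge(alarm_key: str) -> bool:
    """Return True only for the first occurrence of identical alarm
    information within the preset window; repeats inside the window are
    merged into that single piece of fault alarm information."""
    now = time.monotonic()
    last = _last_seen.get(alarm_key)
    _last_seen[alarm_key] = now
    return last is None or now - last >= WINDOW_SECONDS
```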
In this embodiment, the fault converter further includes a fault category library with three categories: directly processable, remind-then-process, and requiring manual intervention. Specifically, when the fault alarm problem is already stored in the database, it is classified as directly processable; when it is a more complex fault alarm problem that cannot be solved immediately, it is classified as remind-then-process; and when it is a novel fault alarm problem that cannot be solved immediately, it is classified as requiring manual intervention.
In this fault processing method, identical fault alarm information is converged into a single piece of fault alarm information, which reduces alarm storms, avoids repeated reception of alarm information, and saves fault processing time and storage resources.
In one embodiment, obtaining the corresponding fault handling rule from the database according to the fault alarm problem includes: matching the fault alarm problem in the database according to the fault alarm problem; and when the fault alarm problem is successfully matched with the fault alarm problem in the database, calling a fault processing rule in the database corresponding to the fault alarm problem in the database.
Specifically, the second server 104 matches the fault alarm problem against the fault alarm problem table in the database and, when the matching succeeds, invokes the fault handling rule corresponding to that fault alarm problem from the fault handling rule table in the database.
In this fault processing method, looking up the fault alarm problem in the database and invoking the corresponding fault processing rule achieves fault self-healing, makes the fault response rapid, and reduces the error rate of manual operation.
In one embodiment, as shown in fig. 4, the fault handling method further includes:
step 402, reporting the fault alarm problem when the fault alarm problem fails to match with the fault alarm problem in the database.
Specifically, when the fault alarm problem in the second server 104 cannot be matched with the fault alarm problem in the database, the second server 104 reports the fault alarm problem.
Step 404, obtaining the input fault processing rule according to the fault alarm problem.
Specifically, the user creates a fault handling rule corresponding to the fault warning problem according to the fault warning problem, and uploads the fault handling rule to the second server 104, and the second server 104 obtains the fault handling rule input by the user.
Step 406, correspondingly storing the fault alarm problem and the input fault handling rule in the database.
Specifically, the second server 104 stores the fault alarm problem in the form of, but not limited to, a fault alarm problem table in the database in correspondence with the input fault handling rule.
In this fault processing method, acquiring the input fault processing rule effectively resolves fault alarm problems that the database could not solve, and storing the fault alarm problem with its corresponding fault processing rule expands the database's store and strengthens its problem-solving capability.
In one embodiment, as shown in fig. 5, acquiring the corresponding fault handling rule from the database according to the fault alarm problem further includes:
step 502, when the matching between the fault alarm problem and the fault alarm problem in the database fails, identifying the fault alarm problem to obtain a fault sub-problem.
The failure sub-problem refers to a smaller unit in the failure alarm problem, that is, the failure alarm problem can be split into a plurality of failure sub-problems.
Specifically, when the fault alarm problem fails to match the fault alarm problem in the database, i.e., the fault alarm problem is not stored in the database, the second server 104 identifies the fault alarm problem, and obtains two or more fault sub-problems.
And step 504, matching the corresponding fault processing rule according to the fault subproblem.
Specifically, the second server 104 matches the fault sub-problems split from the fault alarm problem against the fault alarm problems in the database, and invokes the fault handling rule corresponding to each fault sub-problem when the matching succeeds.
Step 506, the fault processing rule corresponding to the fault subproblem is combined into the fault processing rule corresponding to the fault alarm problem.
Specifically, the second server 104 arranges the fault handling rules corresponding to the fault sub-problems in the order in which the sub-problems were decomposed, thereby obtaining the fault handling rule corresponding to the fault alarm problem.
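A Python sketch of this decompose-and-recombine fallback, assuming an illustrative " and " separator and hypothetical sub-problem names:

```python
# Sketch of handling an unmatched compound problem: split it into known
# fault sub-problems and concatenate their rules in decomposition order.
# The " and " separator and the problem names are assumptions.
SUB_RULES = {
    "insufficient disk space": ["clean log file"],
    "high load": ["kill process"],
}


def compose_from_subproblems(problem: str) -> list[str]:
    """Match each fault sub-problem and join the rules in order."""
    rules: list[str] = []
    for sub in problem.split(" and "):
        if sub not in SUB_RULES:
            raise KeyError(f"unmatched fault sub-problem: {sub}")
        rules.extend(SUB_RULES[sub])
    return rules


# "insufficient disk space and high load"
#     -> ["clean log file", "kill process"]
```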
In this fault processing method, splitting the fault alarm problem into two or more fault sub-problems refines the steps for solving it, and the fault processing rule for the fault alarm problem need not be written from scratch, which saves time and reduces the resources occupied.
In one embodiment, the data in the database is stored in a Redis manner.
Redis is a storage system that periodically writes updated data to disk or appends modification operations to an append-only record file, and supports master-slave synchronization on this basis.
Specifically, Redis storage is adopted because the fault alarm information is not structured data and must be stored by an unstructured storage engine, and because it must be cleaned and computed into fault alarm problems, which requires caching.
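A minimal sketch of the problem-to-rule mapping kept in Redis, using the redis-py client; the key names are assumptions:

```python
import json

import redis  # redis-py client

r = redis.Redis(host="localhost", port=6379, decode_responses=True)


def store_rule(problem: str, rule: list) -> None:
    """Persist a fault alarm problem -> fault processing rule mapping."""
    r.hset("fault_rules", problem, json.dumps(rule))


def lookup_rule(problem: str):
    """Return the stored rule, or None when matching fails."""
    raw = r.hget("fault_rules", problem)
    return json.loads(raw) if raw is not None else None
```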
In one embodiment, the fault handling method employs plug-in management. Each alarm information source corresponds to a fault alarm plug-in; when the alarm information source changes, for example from a Zabbix source to an Open-Falcon source, fault processing continues after the adapter interface in the pre-receiver is reconfigured. Plug-in management lets the fault processing method adapt to various alarm information sources and enhances extensibility.
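Such plug-in management might look like the following registry of source adapters, sketched in Python; the raw-alarm field names are illustrative assumptions, not the patent's API:

```python
# Plug-in style adapters: each alarm information source registers its own
# parser, so switching sources only changes the configured adapter. The
# raw-alarm field names used here are illustrative assumptions.
ADAPTERS = {}


def register_adapter(source: str):
    def wrap(fn):
        ADAPTERS[source] = fn
        return fn
    return wrap


@register_adapter("zabbix")
def parse_zabbix(raw: dict) -> str:
    return raw.get("trigger_name", "unknown")


@register_adapter("open-falcon")
def parse_falcon(raw: dict) -> str:
    return raw.get("metric", "unknown")


def normalize(source: str, raw: dict) -> str:
    """Dispatch the raw alarm to the adapter configured for its source."""
    return ADAPTERS[source](raw)
```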
In one embodiment, the fault handling method further comprises the step of providing a system management interface through the management and control console, and performing system regulation and control and configuration of the plug-in through the system management interface.
In one embodiment, as shown in fig. 6, there is provided a fault handling method including the steps of:
step 602, a fault alarm problem and a script for solving the fault alarm problem are obtained.
Specifically, the second server 104 obtains a failure alarm problem input by the user terminal or imported from another terminal or system and a script for solving the failure alarm problem.
Step 604, the script is split into sub-processing operations.
Specifically, the second server 104 breaks the script solving the failure alarm problem into individual sub-processing operations.
In this embodiment, the second server 104 refines the sub-processing operations of each script into atoms, and stores a large number of sub-processing operations in an atom library, which is located in a database in the second server 104.
At step 606, the sub-processing operations are reassembled to form the fault handling rules.
Specifically, the second server 104 forms a solution to the fault alarm problem by recombining the processing steps in the script.
In this embodiment, the second server 104 may obtain sub-processing operations from the atom library to form script processing steps, scheduling the required sub-processing operations according to the fault alarm problem and organizing them into a fault processing rule. Throughout this process many atoms are reused, so for a specific fault alarm problem the required atoms can be combined from the atom library into a fault processing rule.
Step 608, the fault alarm problem and the fault handling rule are correspondingly stored in the database.
Specifically, the second server 104 stores the fault alarm problem and the fault handling rules for solving the fault alarm problem in the database in a one-to-one correspondence. The database is also used for storing configuration information, execution logs of fault processing and the like.
In this embodiment, the second server 104 stores the fault alarm problems in the form of a fault alarm problem table (but is not limited thereto), defines a fault handling rule table, configures the correspondence between fault alarm problems and fault handling rules, and stores both in the database.
Step 610, obtaining fault alarm information from an alarm information source.
Specifically, the second server 104 acquires, from the first server 102, the fault alarm information generated when the first server 102 fails. The second server 104 includes a pre-receiver that provides an interface for automatically pulling from, or receiving pushes from, alarm information sources including Zabbix, Open-Falcon, Nagios, a CMDB, and the like, which may be used to monitor network systems, terminals, databases, services, processes, and so on. The second server 104 receives the fault alarm information from the alarm information sources and pushes it to the fault converter.
Step 612, the fault alarm information is converged and converted into a fault alarm problem.
Specifically, the second server 104 converges identical fault alarm information obtained from the first server 102 into a single piece of fault alarm information, and converts it, through the fault converter, into the corresponding fault alarm problem.
In this embodiment, when the alarm system of the first server 102 reports that an application server is down and sends a large number of ping-unreachable messages, several hundred alarms may arrive within 5 minutes, causing an alarm storm. The alarms all reflect the same problem, so after convergence processing the second server 104 receives only one piece of fault alarm information. The fault converter converts it into a "server ping fault" and calls the corresponding fault processing rule, such as a "restart server" script, from the database to resolve the fault.
In this embodiment, the fault converter further includes a fault category library with three categories: directly processable, remind-then-process, and requiring manual intervention. Specifically, when the fault alarm problem is already stored in the database, it is classified as directly processable; when it is a more complex fault alarm problem that cannot be solved immediately, it is classified as remind-then-process; and when it is a novel fault alarm problem that cannot be solved immediately, it is classified as requiring manual intervention.
Step 614, obtaining the corresponding fault processing rule from the database according to the fault alarm problem.
Specifically, the second server 104 matches the fault alarm problem against the fault alarm problem table in the database and, when the matching succeeds, invokes the fault handling rule corresponding to that fault alarm problem from the fault handling rule table in the database.
In this embodiment, when the fault alarm problem in the second server 104 cannot be matched against the fault alarm problems in the database, the second server 104 reports the fault alarm problem. The user side creates a fault processing rule corresponding to the fault alarm problem and uploads it to the second server 104; the second server 104 obtains the fault processing rule input by the user side and stores the fault alarm problem, in the form of a fault alarm problem table (but not limited thereto), in the database in correspondence with the input fault processing rule.
In this embodiment, when the fault alarm problem fails to match the fault alarm problems in the database, that is, when the fault alarm problem is not stored in the database, the second server 104 identifies the fault alarm problem to obtain two or more fault sub-problems. The second server 104 matches the fault sub-problems split from the fault alarm problem against the fault alarm problems in the database and, when the matching succeeds, calls the fault processing rule corresponding to each fault sub-problem. The second server 104 then arranges the fault processing rules corresponding to the sub-problems in the order in which the sub-problems were decomposed, obtaining the fault processing rule corresponding to the fault alarm problem.
In step 616, the fault is processed according to the fault processing rule.
Specifically, the second server 104 performs fault processing on the fault alarm problem according to the fault processing rule acquired from the database.
In this embodiment, when the pre-receiver receives fault alarm information from the Zabbix alarm system indicating that an application server is down and a large number of ping-unreachable messages have arrived, the fault converter converts the fault alarm information into a "server ping fault", invokes the fault processing rule in the database, executes the "restart server" script, and performs fault processing according to the rule.
In this embodiment, when the second server 104 receives the fault alarm problem of abnormal service stop, the fault processing rule in the database is called, the "restart service" script is executed, and fault processing is performed according to the fault processing rule.
In this embodiment, when the second server 104 receives a failure alarm problem of insufficient disk space or a performance problem, the second server invokes a failure handling rule in the database, executes a "clean log file and kill process" script, and performs failure handling according to the failure handling rule.
In this embodiment, after the fault processing is completed, the second server 104 may also feed back the fault processing result to the user side in real time, so as to facilitate the secondary checking processing.
In this embodiment, the fault handling method further includes providing a system management interface through the management and control console, and performing system regulation and control and configuration of the plug-in through the system management interface.
In one embodiment, the fault handling method is applied to the application scenario in fig. 7 as an example. The second server 104 includes a pre-receiver, a fault converter, a processor, and a database that also contains an atom library.
Specifically, the pre-receiver of the second server 104 pulls or receives alarm information from the alarm information sources, receives the fault alarm information, and pushes it to the fault converter in a queue. The fault converter converges identical fault alarm information into a single piece of fault alarm information and converts it into the corresponding fault alarm problem. The processor processes the fault alarm problem according to the fault processing rule acquired from the database. The database contains the atom library, which stores sub-processing operations; the database also stores template content, configuration information, execution logs, and the like. In addition, the management console provides a system management interface through which the system and the plug-in configuration are regulated.
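The queue-based hand-off in this scenario can be wired up as below; a self-contained Python sketch in which the toy converter and the printed dispatch stand in for the components described above:

```python
import queue
import threading

alarm_q: queue.Queue = queue.Queue()


def fault_converter(raw: str) -> str:
    # hypothetical conversion of raw alarm text to a fault alarm problem
    return "server ping fault" if "ping" in raw else "unknown problem"


def processor_loop() -> None:
    """Drain the queue, convert each alarm, and hand it to the processor."""
    while True:
        raw = alarm_q.get()
        problem = fault_converter(raw)
        print(f"processing '{problem}' per its fault processing rule")
        alarm_q.task_done()


threading.Thread(target=processor_loop, daemon=True).start()
alarm_q.put("application server down, ping unreachable")  # pre-receiver push
alarm_q.join()
```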
In this embodiment, when the pre-receiver receives fault alarm information from the Zabbix alarm system indicating that an application server is down and a large number of ping-unreachable messages have arrived, the fault converter converts the fault alarm information into a "server ping fault", and the processor calls the fault processing rule in the database, executes the "restart server" script, and performs fault processing according to the rule.
In this embodiment, when the alarm information source is changed, for example, from Zabbix alarm source to Open-falcon alarm source, the failure processing may be performed by reconfiguring the adapter interface in the pre-receiver.
It should be understood that although the various steps in the flow charts of fig. 2-6 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated otherwise, the order of execution of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in fig. 2-6 may include multiple sub-steps or stages that are not necessarily performed at the same moment but may be performed at different moments, and these sub-steps or stages are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 8, there is provided a fault handling apparatus including: an obtaining module 802, a converting module 804, and a processing module 806, wherein:
an obtaining module 802, configured to obtain fault alarm information and further configured to acquire a corresponding fault processing rule from a database according to the fault alarm problem.
Specifically, the obtaining module 802 is configured to acquire, from the first server 102, the fault alarm information generated when the first server 102 fails, to automatically pull from or receive other alarm information sources, to receive the fault alarm information from those sources, and to push it to the fault converter. The alarm information sources include Zabbix, Open-Falcon, Nagios, a CMDB, and the like, and are used for monitoring network systems, terminals, databases, services, processes, and so on.
Specifically, the obtaining module 802 is further configured to obtain, from the database, a stored solution for solving the fault alarm problem according to the fault alarm problem.
A conversion module 804, configured to perform convergence processing on the fault alarm information, and convert the fault alarm information into a fault alarm problem.
Specifically, the conversion module 804 is configured to converge identical fault alarm information obtained from the first server 102 into a single piece of fault alarm information, and to convert it into the corresponding fault alarm problem.
A processing module 806, configured to perform fault processing according to the fault processing rule.
Specifically, the processing module 806 is configured to perform fault processing on the fault alarm problem according to the fault processing rule obtained from the database.
In this embodiment, when the pre-receiver receives fault alarm information from the Zabbix alarm system indicating that an application server is down and a large number of ping-unreachable messages have arrived, the fault converter converts the fault alarm information into a "server ping fault", and the processing module 806 is configured to call the fault processing rule in the database, execute the "restart server" script, and perform fault processing according to the rule.
In this embodiment, when the second server 104 receives the fault alarm problem of abnormal service stop, the processing module 806 is configured to invoke a fault processing rule in the database, execute a "restart service" script, and perform fault processing according to the fault processing rule.
In this embodiment, when the second server 104 receives a failure alarm problem of insufficient disk space or a performance problem, the processing module 806 is configured to call a failure processing rule in the database, execute a "clean log file and kill process" script, and perform failure processing according to the failure processing rule.
In this embodiment, after the fault processing is completed, the processing module 806 may be configured to feed back the fault processing result to the user side in real time, so as to facilitate the secondary checking processing.
In one embodiment, as shown in fig. 9, the fault handling apparatus further includes a splitting module 808, a reorganization module 810, and a storage module 812, whose operations are not limited to the order shown in fig. 9. Before the obtaining module 802 obtains the fault alarm information, it is further configured to obtain the fault alarm problem and a script for solving the fault alarm problem.
Specifically, the obtaining module 802 is configured to obtain a fault alarm problem input by a user side or imported from another terminal or system, and a script for solving the fault alarm problem.
A splitting module 808 configured to split the script into sub-processing operations. In particular, the splitting module 808 is configured to split the script that solves the problem of the fault alarm into individual sub-processing operations. In this embodiment, the splitting module 808 is further configured to refine the sub-processing operations of each script into atoms, and store a large number of sub-processing operations in an atom library.
And a restructuring module 810 for restructuring the sub-processing operation to form the fault handling rule. Specifically, the restructuring module 810 is configured to form a solution to the fault alarm problem by restructuring the processing steps in the script.
In this embodiment, the restructuring module 810 is configured to obtain sub-processing operations from the atom library to form script processing steps, scheduling the required sub-processing operations according to the corresponding fault alarm problem and organizing them into a fault handling rule.
A storage module 812, configured to store the fault alarm problem and the fault handling rule in the database in correspondence. Specifically, the storage module 812 is configured to store the fault alarm problem and the fault handling rule for solving it in the database in one-to-one correspondence. The data in the database is stored in a Redis manner: because the fault alarm information is not structured data, it must be stored by an unstructured storage engine, and because it must be cleaned and computed into fault alarm problems, caching is required, so Redis is used for storage. In this embodiment, the storage module 812 is configured to store the fault alarm problems in the form of a fault alarm problem table (but is not limited thereto), define a fault handling rule table, configure the correspondence between fault alarm problems and fault handling rules, and store both in the database.
In one embodiment, the conversion module 804 is further configured to converge the same fault warning information that occurs within a preset time into a piece of fault warning information; and converting the fault alarm information into a corresponding fault alarm problem.
Specifically, the conversion module 804 is further configured to converge and merge identical fault alarm information appearing within the preset time (here 5 minutes, though not limited thereto) into a single piece of fault alarm information, and to convert it, through the fault converter, into the corresponding fault alarm problem.
In this embodiment, when the obtaining module 802 receives fault alarm information from the first server 102 indicating that an application server is down and a large number of ping-unreachable messages have arrived, hundreds of alarms may occur within 5 minutes, causing an alarm storm. The conversion module 804 is further configured to converge the plurality of fault alarm messages into one fault alarm message through convergence processing, and to convert that fault alarm information into a "server ping fault".
In this embodiment, the conversion module 804 is further configured to classify the fault into one of three categories: directly processable, remind-then-process, and requiring manual intervention. Specifically, when the fault alarm problem is already stored in the database, the conversion module 804 classifies it as directly processable; when it is a more complex fault alarm problem that cannot be solved immediately, the conversion module 804 classifies it as remind-then-process; and when it is a novel fault alarm problem that cannot be solved immediately, the conversion module 804 classifies it as requiring manual intervention.
In one embodiment, the obtaining module 802 is further configured to match a fault alarm problem in a database according to the fault alarm problem; and when the fault alarm problem is successfully matched with the fault alarm problem in the database, calling a fault processing rule in the database corresponding to the fault alarm problem in the database.
Specifically, the obtaining module 802 is further configured to match the fault alarm problem against the fault alarm problem table in the database and, when the matching succeeds, to call the fault handling rule corresponding to that fault alarm problem from the fault handling rule table in the database.
In one embodiment, the obtaining module 802 is further configured to report the fault alarm problem when the fault alarm problem fails to match the fault alarm problem in the database; acquiring an input fault processing rule according to the fault alarm problem; and correspondingly storing the fault alarm problem and the input fault processing rule in the database.
Specifically, when the fault alarm problem cannot be matched with the fault alarm problem in the database, the obtaining module 802 is further configured to report the fault alarm problem, obtain a fault processing rule corresponding to the fault alarm problem input by the user side, and correspondingly store the fault processing rule in the database.
In one embodiment, the obtaining module 802 is further configured to identify the fault alarm problem to obtain fault sub-problems when the fault alarm problem fails to match the fault alarm problems in the database; to match the corresponding fault processing rules according to the fault sub-problems; and to combine the fault processing rules corresponding to the fault sub-problems into the fault processing rule corresponding to the fault alarm problem.
Specifically, when the fault alarm problem fails to match the fault alarm problems in the database, that is, when the fault alarm problem is not stored in the database, the obtaining module 802 is further configured to identify the fault alarm problem to obtain two or more fault sub-problems, to match the fault sub-problems split from the fault alarm problem against the fault alarm problems in the database, and to invoke the fault processing rule corresponding to each fault sub-problem when the matching succeeds. The obtaining module 802 is further configured to arrange the fault handling rules corresponding to the sub-problems in the order in which they were decomposed, obtaining the fault handling rule corresponding to the fault alarm problem.
For the specific definition of the fault handling apparatus, reference may be made to the above definition of the fault handling method, which is not described herein again. The respective modules in the fault handling apparatus described above may be implemented in whole or in part by software, hardware, and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 10. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing fault handling data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a fault handling method.
Those skilled in the art will appreciate that the architecture shown in fig. 10 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program: acquiring fault alarm information; carrying out convergence processing on the fault alarm information and converting the fault alarm information into a fault alarm problem; acquiring a corresponding fault processing rule from a database according to the fault alarm problem; and performing fault processing according to the fault processing rule.
In one embodiment, the processor, when executing the computer program, further performs the steps of: acquiring a fault alarm problem and a script for solving the fault alarm problem; splitting the script into sub-processing operations; recombining the sub-processing operations to form a fault handling rule; the fault alarm problem and the fault handling rule are stored in the database.
In one embodiment, the processor, when executing the computer program, further performs the steps of: converging the same fault warning information which appears in the preset time into a piece of fault warning information; and converting the fault alarm information into a corresponding fault alarm problem.
In one embodiment, the processor, when executing the computer program, further performs the steps of: matching the fault alarm problem in the database according to the fault alarm problem; and when the fault alarm problem is successfully matched with the fault alarm problem in the database, calling a fault processing rule in the database corresponding to the fault alarm problem in the database.
In one embodiment, the processor, when executing the computer program, further performs the steps of: when the fault alarm problem is matched with the fault alarm problem in the database unsuccessfully, reporting the fault alarm problem; acquiring an input fault processing rule according to the fault alarm problem; and correspondingly storing the fault alarm problem and the input fault processing rule in the database.
In one embodiment, the processor, when executing the computer program, further performs the steps of: when the fault alarm problem is matched with the fault alarm problem in the database unsuccessfully, identifying the fault alarm problem to obtain a fault sub-problem; matching a corresponding fault processing rule according to the fault subproblem; and forming the fault processing rule corresponding to the fault sub-problem into the fault processing rule corresponding to the fault alarm problem.
In one embodiment, the data in the database is stored in a Redis manner.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon which, when executed by a processor, performs the steps of: acquiring fault alarm information; carrying out convergence processing on the fault alarm information and converting the fault alarm information into a fault alarm problem; acquiring a corresponding fault processing rule from a database according to the fault alarm problem; and performing fault processing according to the fault processing rule.
In one embodiment, the computer program when executed by the processor further performs the steps of: acquiring a fault alarm problem and a script for solving the fault alarm problem; splitting the script into sub-processing operations; recombining the sub-processing operations to form a fault handling rule; the fault alarm problem and the fault handling rule are stored in the database.
In one embodiment, the computer program when executed by the processor further performs the steps of: converging the same fault warning information which appears in the preset time into a piece of fault warning information; and converting the fault alarm information into a corresponding fault alarm problem.
In one embodiment, the computer program when executed by the processor further performs the steps of: matching the fault alarm problem in the database according to the fault alarm problem; and when the fault alarm problem is successfully matched with the fault alarm problem in the database, calling a fault processing rule in the database corresponding to the fault alarm problem in the database.
In one embodiment, the computer program when executed by the processor further performs the steps of: when the fault alarm problem is matched with the fault alarm problem in the database unsuccessfully, reporting the fault alarm problem; acquiring an input fault processing rule according to the fault alarm problem; and correspondingly storing the fault alarm problem and the input fault processing rule in the database.
In one embodiment, the computer program when executed by the processor further performs the steps of: when the fault alarm problem is matched with the fault alarm problem in the database unsuccessfully, identifying the fault alarm problem to obtain a fault sub-problem; matching a corresponding fault processing rule according to the fault subproblem; and forming the fault processing rule corresponding to the fault sub-problem into the fault processing rule corresponding to the fault alarm problem.
In one embodiment, the data in the database is stored in a Redis manner.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, any such combination should be considered within the scope of this specification as long as it contains no contradiction.
The above-mentioned embodiments merely express several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method of fault handling, the method comprising:
acquiring a fault alarm problem input by a user side or imported from another terminal or system, together with a script for solving the fault alarm problem; splitting the script to obtain sub-processing operations corresponding to the script; recombining the sub-processing operations to form a fault processing rule corresponding to the fault alarm problem; and correspondingly storing the fault alarm problem and the fault processing rule in a database; wherein the script for solving the fault alarm problem refers to a solution to the fault alarm problem; a sub-processing operation refers to a processing step in the script, solves one basic problem, and is the smallest fault processing unit;
acquiring fault alarm information;
carrying out convergence processing on the fault alarm information, and converting the fault alarm information into a fault alarm problem corresponding to the fault alarm information;
acquiring a fault processing rule corresponding to the fault alarm information from the database according to the fault alarm problem corresponding to the fault alarm information; when the fault alarm problem corresponding to the fault alarm information fails to match any fault alarm problem in the database, identifying the fault alarm problem corresponding to the fault alarm information to obtain fault sub-problems, matching a corresponding fault processing rule for each fault sub-problem, and arranging the fault processing rules corresponding to the fault sub-problems in the order in which the fault sub-problems were obtained by decomposition, so as to form the fault processing rule corresponding to the fault alarm information; wherein the fault alarm problem is decomposed into a plurality of fault sub-problems;
and executing the script corresponding to the fault processing rule, thereby performing fault processing according to the fault processing rule corresponding to the fault alarm information.
2. The method of claim 1, wherein the convergence processing converges identical fault alarm information into a single piece of fault alarm information.
3. The method according to claim 1, wherein carrying out convergence processing on the fault alarm information and converting the fault alarm information into the fault alarm problem corresponding to the fault alarm information comprises:
converging identical fault alarm information that appears within a preset time into a single piece of fault alarm information;
and converting the fault alarm information into a fault alarm problem corresponding to the fault alarm information.
4. The method of claim 1, wherein acquiring the fault processing rule corresponding to the fault alarm information from the database according to the fault alarm problem corresponding to the fault alarm information comprises:
matching the fault alarm problem against the fault alarm problems in the database;
and, when the match succeeds, invoking the fault processing rule stored in the database for the matched fault alarm problem.
5. The method of claim 4, further comprising:
when the fault alarm problem fails to match any fault alarm problem in the database, reporting the fault alarm problem;
acquiring an input fault processing rule for the fault alarm problem;
and correspondingly storing the fault alarm problem and the input fault processing rule in the database.
6. The method of claim 1, further comprising:
and feeding back the fault processing result to the user side.
7. The method of claim 1, wherein the data in the database is stored using Redis.
8. A fault handling apparatus, characterized in that the apparatus comprises:
the fault processing rule forming module is used for acquiring a fault alarm problem input by a user side or imported from another terminal or system, together with a script for solving the fault alarm problem, splitting the script to obtain sub-processing operations corresponding to the script, recombining the sub-processing operations to form a fault processing rule corresponding to the fault alarm problem, and correspondingly storing the fault alarm problem and the fault processing rule in a database; wherein the script for solving the fault alarm problem refers to a solution to the fault alarm problem; a sub-processing operation refers to a processing step in the script, solves one basic problem, and is the smallest fault processing unit;
the acquisition module is used for acquiring fault alarm information;
the conversion module is used for carrying out convergence processing on the fault alarm information and converting the fault alarm information into a fault alarm problem corresponding to the fault alarm information;
the fault processing rule acquisition module is used for acquiring a fault processing rule corresponding to the fault alarm information from the database according to the fault alarm problem corresponding to the fault alarm information; when the fault alarm problem corresponding to the fault alarm information fails to match any fault alarm problem in the database, identifying the fault alarm problem corresponding to the fault alarm information to obtain fault sub-problems, matching a corresponding fault processing rule for each fault sub-problem, and arranging the fault processing rules corresponding to the fault sub-problems in the order in which the fault sub-problems were obtained by splitting, so as to form the fault processing rule corresponding to the fault alarm information; wherein the fault alarm problem is split into a plurality of fault sub-problems;
and the processing module is used for executing the script corresponding to the fault processing rule and performing fault processing according to the fault processing rule corresponding to the fault alarm information.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
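By way of illustration and not limitation, the earlier sketches can be combined into the end-to-end flow of claim 1; everything here inherits the assumptions stated above.

    # Hypothetical sketch: converge the alarm, convert it to a problem,
    # try an exact match, then sub-problem composition, then reporting.
    def process_alarm(alarm: str) -> None:
        if not converge(alarm):
            return                            # duplicate inside the window: merged
        problem = to_problem(alarm)
        rule = lookup_rule(problem)           # exact match against the database
        if rule is None:
            rule = compose_rule(problem)      # claim 1 decomposition path
        if rule is None:
            rule = report_and_learn(problem)  # claim 5 reporting path
        execute_rule(rule)

    # Example: register a rule, then process a converging alarm.
    store_rule("disk full", "rm -rf /tmp/app-cache\nsystemctl restart app")
    process_alarm("Disk Full")   # converges, matches "disk full", executes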
CN201811002316.9A 2018-08-30 2018-08-30 Fault processing method and device, computer equipment and storage medium Active CN109144829B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811002316.9A CN109144829B (en) 2018-08-30 2018-08-30 Fault processing method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN109144829A CN109144829A (en) 2019-01-04
CN109144829B 2022-03-22

Family

ID=64829393

Country Status (1)

Country Link
CN (1) CN109144829B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109871305B (en) * 2019-01-18 2022-11-04 深圳壹账通智能科技有限公司 Alarm information processing method and device, computer equipment and storage medium
CN109757343A (en) * 2019-03-22 2019-05-17 厦门鑫天兴科技发展有限公司 A kind of irrigation method and system
CN110086682B (en) * 2019-05-22 2022-06-24 四川新网银行股份有限公司 Service link calling relation view and fault root cause positioning method based on TCP
CN110728498A (en) * 2019-10-21 2020-01-24 北京百度网讯科技有限公司 Information interaction method and device
CN111769977A (en) * 2020-06-17 2020-10-13 广州嘉为科技有限公司 Processing method based on enterprise monitoring alarm event
CN111835760B (en) * 2020-07-10 2023-03-24 广州博冠信息科技有限公司 Alarm information processing method and device, computer storage medium and electronic equipment
CN112306794A (en) * 2020-09-28 2021-02-02 国网吉林省电力有限公司信息通信公司 Automatic processing method and device for typical fault scene of database
CN113434327B (en) * 2021-07-13 2022-11-25 上海浦东发展银行股份有限公司 Fault processing system, method, equipment and storage medium
CN114879636A (en) * 2022-05-15 2022-08-09 浙江工业大学 Chemical process fault isolation method based on fault interpreter

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101425924A (en) * 2008-06-12 2009-05-06 广东高新兴通信股份有限公司 Centralized monitoring system alarm data processing method
CN101605346A (en) * 2008-06-10 2009-12-16 中兴通讯股份有限公司 The fault restoration method and apparatus
CN105095523A (en) * 2015-09-28 2015-11-25 浪潮(北京)电子信息产业有限公司 Alarm event handling method and system
CN105262616A (en) * 2015-09-21 2016-01-20 浪潮集团有限公司 Failure repository-based automated failure processing system and method
CN107562556A (en) * 2017-08-14 2018-01-09 腾讯科技(深圳)有限公司 Restoration methods, recovery device and the storage medium of failure


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant