CN112446511A - Fault handling method, device, medium and equipment - Google Patents

Publication number
CN112446511A
Authority
CN
China
Prior art keywords
fault
target
instance
handling
root cause
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011311725.4A
Other languages
Chinese (zh)
Inventor
苑志云
陈倩
梁晓东
杨西锋
张岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp
Priority to CN202011311725.4A
Publication of CN112446511A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/20 Administration of product repair or maintenance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9027 Trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Engineering & Computer Science (AREA)
  • Marketing (AREA)
  • Databases & Information Systems (AREA)
  • General Business, Economics & Management (AREA)
  • Educational Administration (AREA)
  • Development Economics (AREA)
  • Operations Research (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Game Theory and Decision Science (AREA)
  • Data Mining & Analysis (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a fault handling method, device, medium and equipment. The method comprises the following steps: acquiring alarm information, and determining a target fault instance according to the alarm information; acquiring a fault analysis model, performing root cause analysis on the target fault instance through the fault analysis model, and determining the root cause, loss stopping point and/or loss stopping suggestion of the target fault instance; matching a target handling policy according to the root cause, the loss stopping point and/or the loss stopping suggestion, and generating a fault solution for the target fault instance according to the target handling policy; calling a preset standard handling flow to execute the fault solution and obtain an execution result; and checking the execution result to complete the handling of the target fault instance. With the technical scheme provided by the invention, the root cause of a fault can be located in finer detail and more quickly, a loss stopping point can be determined, a loss stopping suggestion can be provided, a handling policy can be matched automatically, and the fault can be handled automatically through a standard flow, which improves the efficiency and accuracy of fault handling.

Description

Fault handling method, device, medium and equipment
Technical Field
The invention relates to the field of operation and maintenance, in particular to a fault handling method, device, medium and equipment.
Background
At present, system faults are mainly handled manually by operation and maintenance personnel. If the fault is a recurring event, or a corresponding fault type exists in the emergency plan, the personnel act according to the handling measures in the emergency plan, whose policies are generally arranged through a cloud platform. If the fault has not occurred before, that is, no corresponding fault exists in the emergency plan, the cause of the alarm is found with a root cause analysis tool, then analyzed on the basis of human experience and handled.
However, a service is generally completed jointly by multiple technical or service components whose calling relationships are complex. A fault usually involves anomalies in several systems, so the troubleshooting range is wide and the difficulty is high. In a complex fault scenario it is difficult to locate the cause quickly and accurately by analysis based on personal experience, and the large body of historically accumulated fault analysis and decision experience is difficult to capture and standardize. Existing event analysis tools do not analyze the root cause deeply enough and are limited to locating the cause. Fault handling itself consists largely of personalized and specialized operations, with no unified standard flow.
Disclosure of Invention
In order to solve the problems of the prior art, the invention provides a fault handling method, a fault handling device, a fault handling medium and equipment. The technical scheme is as follows:
in a first aspect, the present invention provides a fault handling method, including:
acquiring alarm information, and determining a target fault instance according to the alarm information;
acquiring a fault analysis model, carrying out root cause analysis on the target fault instance through the fault analysis model, and determining root causes, loss stopping points and/or loss stopping suggestions of the target fault instance;
matching a target handling policy according to the root cause, the loss stopping point and/or the loss stopping suggestion, and generating a fault solution for the target fault instance according to the target handling policy;
calling a preset standard handling flow to execute the fault solution to obtain an execution result;
and checking the execution result to finish the treatment of the target fault instance.
Further, the determining a target fault instance according to the alarm information includes:
based on preset logic, performing aggregation and duplicate removal on the alarm information;
and combining at least one piece of alarm information that meets a preset association degree, among the aggregated and de-duplicated alarm information, into a target fault instance.
Further, the obtaining the fault analysis model includes:
establishing an index association relation between at least one index of the operation and maintenance object according to the historical fault handling record;
establishing a fault propagation relation between at least one operation and maintenance object according to a configuration and calling relation between the at least one operation and maintenance object;
and generating a fault tree based on the index association relation and the fault propagation relation, and taking the fault tree as the fault analysis model.
Further, the root cause analysis of the target fault instance by the fault analysis model, and the determining root cause, loss stopping point and/or loss stopping suggestion of the target fault instance includes:
calling an event platform, and creating a root cause analysis event list aiming at the target fault instance;
executing the root cause analysis event list based on the event platform, and carrying out root cause analysis on the target fault instance by using the fault tree to obtain a troubleshooting analysis result;
and determining the root cause, loss stopping point and/or loss stopping suggestion of the target fault instance according to the troubleshooting analysis result.
Further, the matching a target handling policy according to the root cause, the loss stopping point and/or the loss stopping suggestion, and generating a fault solution for the target fault instance according to the target handling policy includes:
acquiring a preset fault standard handling strategy set, wherein each standard handling strategy in the fault standard handling strategy set is preset with an association relation with a root cause, a loss stopping point and/or a loss stopping suggestion of a fault;
matching a target handling strategy in the preset fault standard handling strategy set according to the root cause, the loss stopping point and/or the loss stopping suggestion of the target fault instance;
generating a fault solution for the target fault instance in accordance with the target handling policy.
Further, the calling a preset standard handling flow to execute the fault solution to obtain an execution result includes:
instantiating the fault solution according to the preset standard handling flow to obtain a fault handling scheme, wherein the fault handling scheme comprises a handling process instance;
and executing the handling process instance to obtain an execution result.
Further, the executing the handling process instance comprises:
judging whether the fault handling scheme is of a preset automatic handling type;
if yes, executing the handling process instance by using an automation platform, and acquiring an execution result returned by the automation platform;
if not, forwarding the handling process instance to a node of a target processing object by using the automation platform, so that the target processing object handles the target fault instance according to the handling process instance, and acquiring an execution result returned by the target processing object.
Further, the method further comprises:
generating a fault list according to at least one fault instance;
performing root cause analysis on the at least one fault instance to generate a fault root cause list;
generating a fault solution list according to the at least one fault instance;
and checking fault information and fault propagation relation according to the fault list, the fault root cause list and/or the fault solution list.
In a second aspect, the present invention provides a fault handling apparatus, the apparatus comprising:
the first determining module is used for acquiring alarm information and determining a target fault instance according to the alarm information;
the second determining module is used for acquiring a fault analysis model, performing root cause analysis on the target fault instance through the fault analysis model, and determining root causes, loss stopping points and/or loss stopping suggestions of the target fault instance;
a fault solution generation module, configured to match a target handling policy according to the root cause, the loss stopping point and/or the loss stopping suggestion, and generate a fault solution for the target fault instance according to the target handling policy;
the execution module is used for calling a preset standard handling process to execute the fault solution and obtaining an execution result;
and the checking module is used for checking the execution result to finish the treatment of the target fault instance.
In a third aspect, the present invention provides a computer device comprising a processor and a memory, wherein at least one instruction or at least one program is stored in the memory, and the at least one instruction or at least one program is loaded by the processor and executes the fault handling method.
In a fourth aspect, the present invention provides a computer-readable storage medium having at least one instruction or at least one program stored therein, the at least one instruction or the at least one program being loaded and executed by a processor to implement the fault handling method.
The invention provides a fault handling method, a device, a medium and equipment, which have the following technical effects:
(1) the scheme provided by the invention establishes a fault analysis model based on all operation and maintenance objects and indexes, can realize comprehensive and ordered root cause analysis of full link and full coverage, and further determines a fault node or a loss stopping point by combining with application scene characteristics of fault occurrence;
(2) the fault analysis model in the scheme provided by the invention also introduces the index association relation and the fault propagation relation of the operation and maintenance objects, so that the root cause of a fault instance is traced more deeply, faults in complex scenes are analyzed more thoroughly, and clearer guidance can be provided for handling the fault;
(3) according to the scheme provided by the invention, the historically accumulated fault analysis treatment experience is converted, loss stopping suggestions can be provided when the root cause is analyzed, and meanwhile, the treatment strategies pre-arranged in the cloud platform are automatically matched according to the loss stopping suggestions so as to save the time for manually retrieving the treatment strategies;
(4) according to the scheme provided by the invention, an external automation platform is used to actively analyze fault instances, automatically call the handling flow of the fault solution and/or automatically check the fault handling, so that fault instances are handled intelligently, the efficiency of fault handling is improved, and the negative impact of faults is reduced;
(5) according to the scheme provided by the invention, the information such as the fault instance, the root cause thereof, the solution scheme and the like is generated into a visual list, real-time and uniform troubleshooting and disposal information is provided for operation and maintenance personnel, and the cooperative work efficiency among different teams is improved.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions and advantages of the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flowchart of a fault handling method provided by an embodiment of the present invention;
FIG. 2 is a flowchart of troubleshooting analysis for a fault instance provided by an embodiment of the present invention;
FIG. 3 is a flowchart of a method for generating a fault solution provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of a fault instance list provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of a fault root cause list provided by an embodiment of the present invention;
FIG. 6 is a schematic diagram of a fault solution list provided by an embodiment of the present invention;
FIG. 7 is a schematic diagram of a fault handling scheme list provided by an embodiment of the present invention;
FIG. 8 is a schematic diagram of a handling process instance list provided by an embodiment of the present invention;
FIG. 9 is a flowchart of another fault handling method provided by an embodiment of the present invention;
FIG. 10 is a schematic diagram of a fault handling apparatus provided by an embodiment of the present invention;
FIG. 11 is a block diagram of the hardware structure of a server that runs a fault handling method provided by an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. The described embodiments are only a part of the embodiments of the present application, and not all of them. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort shall fall within the protection scope of the present application. Examples of the embodiments are illustrated in the accompanying drawings, in which like reference numerals refer to the same or similar elements or elements having the same or similar function throughout.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In order to make the objects, technical solutions and advantages disclosed in the embodiments of the present invention more clearly apparent, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings and the embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the embodiments of the invention and are not intended to limit the embodiments of the invention.
In the following, the terms "first", "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present embodiment, "a plurality" means two or more unless otherwise specified. In order to facilitate understanding of the technical solutions and the technical effects thereof described in the embodiments of the present invention, the embodiments of the present invention first explain related terms:
Root cause analysis: abbreviated RCA (Root Cause Analysis), refers to a systematic logical method, together with a set of corresponding tools, whose goal is to thoroughly solve or explain problems in fields such as modern management and scientific research. Root cause analysis comprises two steps: first, the various causes that may have led to the problem are found by causal tracing and reasoning; then the root cause is determined from the relationships among those causes as needed.
Fault tree: also called a troubleshooting tree, is a special inverted tree-shaped logical cause-and-effect diagram that describes the causal relationships among the events in a system using event symbols, logic gate symbols and transfer symbols. The input events of a logic gate are the "cause" of its output event, and the output event is the "effect" of the input events. Fault Tree Analysis (FTA) identifies and evaluates the risks of various systems by logical reasoning; it can analyze not only the direct cause of an accident but also reveal its latent causes.
Examples
Fig. 1 is a flowchart of a fault handling method according to an embodiment of the present invention, and referring to fig. 1, a fault handling method according to an embodiment of the present disclosure may include the following steps:
S101: acquiring alarm information, and determining a target fault instance according to the alarm information.
In an embodiment of the present specification, specifically, the determining a target fault instance according to the alarm information may include the following steps:
S201: based on preset logic, performing aggregation and de-duplication on the alarm information.
In a feasible implementation, alarm information reported by an external monitoring system that monitors system performance in real time is obtained, and the alarm information is aggregated and de-duplicated based on simple temporal and spatial logic.
S202: combining at least one piece of alarm information that meets a preset association degree, among the aggregated and de-duplicated alarm information, into a target fault instance.
Specifically, the information of the target fault instance includes, but is not limited to, a fault number, an operation and maintenance object name, an application system, a host name, an index name, a fault category, and a processing state. It can be understood that the operation and maintenance object name, the application system, the host name and the index name can represent the fault scenario of the fault instance, in particular its application scenario or business scenario, which makes it easier to analyze the root cause of the fault more accurately in combination with that scenario. In an application scenario or business scenario of a banking system, the application system includes, but is not limited to, a payment settlement system, a cloud management platform, a proxy payment system, and a middleware system.
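Steps S201 and S202 can be illustrated with a minimal Python sketch. The patent does not specify the "preset logic" or the "preset association degree", so everything here is an assumption made for illustration: the class names `Alarm` and `FaultInstance`, the rule that a repeat of the same host and index within 60 seconds is a duplicate, and the rule that alarms from the same host within 300 seconds belong to one fault instance.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Alarm:
    host: str          # "space" dimension: where the alarm was raised
    index_name: str    # monitored index, e.g. "avg_processing_time"
    timestamp: float   # "time" dimension, in seconds


@dataclass
class FaultInstance:
    fault_no: int
    host: str
    alarms: list = field(default_factory=list)


def deduplicate(alarms, window=60.0):
    """Drop an alarm if the same (host, index) was already kept within `window` seconds."""
    last_kept = {}
    kept = []
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        key = (alarm.host, alarm.index_name)
        if key in last_kept and alarm.timestamp - last_kept[key] <= window:
            continue  # hypothetical rule: a close repeat is a duplicate
        last_kept[key] = alarm.timestamp
        kept.append(alarm)
    return kept


def merge_into_instances(alarms, window=300.0):
    """Group alarms from one host that start within `window` seconds of the group's first alarm."""
    instances = []
    open_instance = {}  # host -> fault instance currently collecting alarms
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        current = open_instance.get(alarm.host)
        if current is None or alarm.timestamp - current.alarms[0].timestamp > window:
            current = FaultInstance(fault_no=len(instances) + 1, host=alarm.host)
            instances.append(current)
            open_instance[alarm.host] = current
        current.alarms.append(alarm)
    return instances
```

Calling `merge_into_instances(deduplicate(alarms))` yields one fault instance per host and time window; a real association degree would likely also consider the application system and calling relations.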
S103: acquiring a fault analysis model, performing root cause analysis on the target fault instance through the fault analysis model, and determining the root cause, loss stopping point and/or loss stopping suggestion of the target fault instance.
Root cause analysis based on the fault analysis model uses artificial intelligence algorithms for big data analysis and mining, which enables full-link, full-coverage, comprehensive and ordered analysis and location. The analysis granularity is finer, root causes are traced more deeply, and root cause analysis of faults in complex scenes is more accurate. In addition, by drawing on historically accumulated fault analysis and handling experience, loss stopping suggestions can be provided while the root cause is analyzed, giving clearer guidance for handling the fault.
In one embodiment of the present specification, specifically, the obtaining of the fault analysis model may include the following steps:
S301: establishing an index association relation between at least one index of the operation and maintenance object according to the historical fault handling record.
In particular, the at least one index may include, but is not limited to, transaction amount, average processing duration, memory space, computing resources, and the like. Once a fault has been detected through an index anomaly, the index association relation can be used to classify the fault cause along different dimensions, and thus to further determine the root cause probability that the fault lies on the service side, the hardware side or the network side. The index association relation can also be used to determine the association path of abnormal indexes when a target fault instance occurs.
S303: establishing a fault propagation relation between at least one operation and maintenance object according to the configuration and calling relation between the at least one operation and maintenance object.
It can be understood that a service is usually completed by multiple systems or service components with complex calling relationships, and a fault usually involves several systems, so the troubleshooting range is wide and troubleshooting is more difficult. Therefore, the configuration and calling relations among multiple operation and maintenance applications, components and services are introduced into the fault analysis model, so that the root cause of a complex fault scene or of cascading faults can be analyzed more accurately, which improves fault handling efficiency.
S305: generating a fault tree based on the index association relation and the fault propagation relation, and taking the fault tree as the fault analysis model.
The setting of the fault tree in the embodiment of the present disclosure is only one logical inference manner adopted in the fault analysis model, and the present invention is not limited to this.
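As a rough illustration of steps S301 to S305, the sketch below builds a tree from two plain dictionaries: one standing in for the calling relations between operation and maintenance objects, the other for the index association relation. The data layout, the names `FaultTreeNode` and `build_fault_tree`, and the assumption that faults propagate from a called object back to its caller are all hypothetical, and the logic gates of a formal fault tree are omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class FaultTreeNode:
    name: str
    kind: str                      # "object" or "index"
    children: list = field(default_factory=list)


def build_fault_tree(root_object, calls, index_links, _path=None):
    """Build a tree rooted at `root_object`.

    calls:       maps an object to the objects it calls; a fault is assumed
                 to propagate from the called object back to the caller.
    index_links: maps an object to the indexes associated with its faults.
    """
    path = (_path or frozenset()) | {root_object}
    node = FaultTreeNode(root_object, "object")
    for index_name in index_links.get(root_object, []):
        node.children.append(FaultTreeNode(index_name, "index"))
    for callee in calls.get(root_object, []):
        if callee in path:         # guard against cyclic calling relations
            continue
        node.children.append(build_fault_tree(callee, calls, index_links, path))
    return node


def leaves(node):
    """Return leaf names, i.e. candidate basic events for root cause analysis."""
    if not node.children:
        return [node.name]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found
```

In this simplified form, each path from the root to a leaf is one candidate propagation path; the same leaf may appear more than once when several paths lead to the same index.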
In one embodiment of the present specification, specifically, as shown in fig. 2, step S103 may include the steps of:
S302: calling an event platform, and creating a root cause analysis event list for the target fault instance.
In one embodiment of the present description, for a newly generated fault instance, a script is invoked to automatically create a root cause analysis event sheet; or in a visual operation interface of a fault instance list, aiming at a target fault instance, obtaining a user operation command, and configuring a root cause analysis event of the target fault instance.
S304: executing the root cause analysis event list based on the event platform, and performing root cause analysis on the target fault instance by using the fault tree to obtain a troubleshooting analysis result.
In one possible embodiment, root cause analysis of the target fault instance is performed by passing the information of the target fault instance into the fault tree. The troubleshooting analysis result includes, but is not limited to: fault number, operation and maintenance object name, application system, host name, index name, troubleshooting time, root cause type, root cause probability and loss stopping suggestion. It can be understood that, for one target fault instance, the root cause found by the analysis may lie in the service layer, the hardware layer or the network layer, and a root cause probability is calculated for each candidate root cause. Compared with committing to a single root cause, the method in the embodiment of the present invention weighs the candidate root causes more comprehensively and can also improve the accuracy of fault handling.
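The text does not say how the root cause probability is calculated. Purely as an illustration, the sketch below assumes each root cause type has a non-negative evidence score (for example, the number of abnormal indexes attributed to that layer) and normalizes the scores so that they sum to one. The function names and the scoring rule are assumptions, not the patented method.

```python
def root_cause_probabilities(evidence):
    """Normalise non-negative evidence scores per root cause type into probabilities."""
    total = sum(evidence.values())
    if total == 0:
        # No evidence at all: report zero for every candidate.
        return {cause: 0.0 for cause in evidence}
    return {cause: score / total for cause, score in evidence.items()}


def rank_root_causes(evidence):
    """Return (root cause type, probability) pairs, most likely first."""
    probabilities = root_cause_probabilities(evidence)
    return sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
```

Keeping the whole ranked list, rather than only the top entry, mirrors the idea of reporting a probability for every candidate layer instead of a single verdict.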
S306: determining the root cause, loss stopping point and/or loss stopping suggestion of the target fault instance according to the troubleshooting analysis result.
S105: matching a target handling policy according to the root cause, loss stopping point and/or loss stopping suggestion, and generating a fault solution for the target fault instance according to the target handling policy.
It is understood that the existing fault handling is mainly based on manual implementation by operation and maintenance personnel, and if the fault is a repeatedly occurring event or a fault type existing in an emergency plan, the operation is performed according to experience or handling measures in the emergency plan. The emergency plan strategy is generally stored in the cloud platform, operation and maintenance personnel need to query the specific strategy in the cloud platform and then execute the strategy, and the response speed to fault disposal is low.
In one embodiment of the present specification, specifically, as shown in fig. 3, the step S105 may include the steps of:
S501: acquiring a preset fault standard handling strategy set, wherein each standard handling strategy in the fault standard handling strategy set is preset with an association relation with a root cause, a loss stopping point and/or a loss stopping suggestion of a fault.
Specifically, the historically accumulated operation and maintenance handling experience is distilled into standardized and normalized handling strategies for different types of systems, including but not limited to an internet connection class, a sales promotion class, a basic component class and an office class. Each strategy is given a normalized name and a standardized flow; the strategies include but are not limited to a restart policy, a quarantine policy, a flow control policy, and a handover policy. The approach in the embodiment of the specification not only makes effective use of historical operation and maintenance experience, but also, by standardizing the handling strategies, facilitates unified search and automatic association matching.
S503: matching a target handling policy in the preset fault standard handling policy set according to the root cause, loss stopping point and/or loss stopping suggestion of the target fault instance.
In the embodiment of the present specification, for example, according to the loss stopping suggestion for the target fault instance, the associated handling policy is automatically searched in the cloud platform by adopting character matching, so that the time for manually searching the handling policy in the cloud platform is saved, and the efficiency of fault handling can be improved.
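The character matching mentioned above could look roughly like the following sketch. The policy names come from the description, while the keyword lists, the `STANDARD_POLICIES` table and the `match_policy` function are hypothetical stand-ins for whatever association the cloud platform actually stores.

```python
# Hypothetical policy set: each standard handling policy is associated with
# keywords that may appear in a loss stopping suggestion.
STANDARD_POLICIES = {
    "restart policy": ["restart", "reboot"],
    "quarantine policy": ["quarantine", "isolate"],
    "flow control policy": ["flow control", "throttle", "rate limit"],
    "handover policy": ["handover", "switch over", "failover"],
}


def match_policy(suggestion, policies=STANDARD_POLICIES):
    """Return the first policy whose keyword occurs in the suggestion, else None."""
    text = suggestion.lower()
    for name, keywords in policies.items():
        if any(keyword in text for keyword in keywords):
            return name
    return None
```

For example, `match_policy("Restart the application service")` selects the restart policy. Returning `None` when nothing matches leaves room for the manual lookup path described in the background.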
S505: generating a fault solution for the target fault instance in accordance with the target handling policy.
In an embodiment of the present specification, the information of the fault solution includes, but is not limited to, a root cause number, a root cause, a plan name, a plan content, an operation and maintenance object name, an application system, a host name, a target name, and a related troubleshooting tree model. The plan content may consist of the target handling policy. The fault solution is a solution for a specific fault instance that is recommended or provided to operation and maintenance personnel during intelligent fault handling; in another feasible implementation, the operation and maintenance personnel can also modify and customize the handling strategy in the fault solution.
Preferably, in a visual operation interface, an emergency fault solution can be quickly generated by retrieving a disposal strategy, and the fault solution can be quickly executed in a parametric manner to improve the disposal efficiency of the emergency fault.
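As an illustrative sketch of step S505, the fault solution can be modeled as a record carrying the fields listed above, with the plan content filled from the matched handling policies. The field names, the `generate_fault_solution` function, and the sample values are assumptions introduced here and are not taken from the specification.

```python
from dataclasses import dataclass, field


@dataclass
class FaultSolution:
    root_cause_number: str
    root_cause: str
    plan_name: str
    plan_content: list = field(default_factory=list)  # matched handling policies
    om_object_name: str = ""
    application_system: str = ""
    host_name: str = ""
    target_name: str = ""
    troubleshooting_tree_model: str = ""


def generate_fault_solution(instance, policy_names):
    """Build a fault solution for one fault instance from matched policy names."""
    return FaultSolution(
        root_cause_number=instance["root_cause_number"],
        root_cause=instance["root_cause"],
        plan_name="plan for " + instance["root_cause"],
        plan_content=list(policy_names),
        om_object_name=instance["om_object_name"],
        application_system=instance["application_system"],
        host_name=instance["host_name"],
        target_name=instance["target_name"],
        troubleshooting_tree_model=instance["troubleshooting_tree_model"],
    )


# Hypothetical fault instance after root cause analysis.
instance = {
    "root_cause_number": "RC-001",
    "root_cause": "thread pool exhausted",
    "om_object_name": "order-service",
    "application_system": "online banking",
    "host_name": "app-01",
    "target_name": "active threads",
    "troubleshooting_tree_model": "application-tier tree",
}
solution = generate_fault_solution(instance, ["restart policy"])

# Operation and maintenance personnel may customize the recommended plan.
solution.plan_content.append("flow control policy")
```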
S107: calling a preset standard handling flow to execute the fault solution and obtain an execution result.
It can be understood that when fault handling is carried out manually by operation and maintenance personnel, it involves many personalized and specialized operations, and an effective automatic handling mechanism is lacking for highly repetitive fault scenarios. In the embodiments of the present description, for example, an external automation platform is linked in to handle faults automatically, which can effectively reduce the time lost to manual processing.
In one embodiment of the present specification, specifically, the step S107 may include the steps of:
S701: instantiating the fault solution according to a preset standard handling flow to obtain a fault handling scheme, wherein the fault handling scheme comprises a handling flow instance.
In one possible embodiment, the fault handling scheme includes, but is not limited to, a handling scheme name, creation time, creator, whether to trigger automatically, an associated loss stopping suggestion, and a processing status; the handling flow instance includes, but is not limited to, basic information, operation policies, notification policies, and exception handling.
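The instantiation in step S701 can be sketched as follows. The content of the preset standard handling flow is not given in the specification, so `STANDARD_FLOW`, the class names, the initial status value, and the `instantiate` function are hypothetical; the sketch only shows a fault solution being combined with flow defaults into a scheme that contains a handling flow instance.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class HandlingFlowInstance:
    basic_info: dict
    operation_policies: list
    notification_policies: list
    exception_handling: str


@dataclass
class FaultHandlingScheme:
    name: str
    created_at: datetime
    creator: str
    auto_trigger: bool
    loss_stopping_suggestion: str
    status: str
    flow_instance: HandlingFlowInstance


# Hypothetical preset standard handling flow: defaults shared by every scheme.
STANDARD_FLOW = {
    "notification_policies": ["notify operation and maintenance manager"],
    "exception_handling": "stop and escalate",
}


def instantiate(solution, creator, auto_trigger):
    """Instantiate a fault solution (a dict here) into a fault handling scheme."""
    flow = HandlingFlowInstance(
        basic_info={"host_name": solution["host_name"], "root_cause": solution["root_cause"]},
        operation_policies=list(solution["plan_content"]),
        notification_policies=list(STANDARD_FLOW["notification_policies"]),
        exception_handling=STANDARD_FLOW["exception_handling"],
    )
    return FaultHandlingScheme(
        name=solution["plan_name"],
        created_at=datetime.now(timezone.utc),
        creator=creator,
        auto_trigger=auto_trigger,
        loss_stopping_suggestion=solution["loss_stopping_suggestion"],
        status="pending",  # assumed initial processing status
        flow_instance=flow,
    )


scheme = instantiate(
    {
        "plan_name": "restart order-service",
        "plan_content": ["restart policy"],
        "host_name": "app-01",
        "root_cause": "thread pool exhausted",
        "loss_stopping_suggestion": "restart the application service",
    },
    creator="operator-a",
    auto_trigger=True,
)
```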
S702: executing the handling flow instance to obtain an execution result.
In a possible implementation, the step S702 may include the following steps:
S7021: judging whether the fault handling scheme is of a preset automatic handling type.
S7023: if so, executing the handling flow instance by using an automation platform, and acquiring an execution result returned by the automation platform.
S7025: if not, forwarding the handling flow instance to a node of a target processing object by using the automation platform, so that the target processing object handles the target fault instance according to the handling flow instance, and acquiring an execution result returned by the target processing object.
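Steps S7021 to S7025 amount to a two-way dispatch, sketched below. The interface of the automation platform is not described in the specification; the `run` and `forward` methods, the `FakeAutomationPlatform` stand-in, and the dictionary keys are assumptions made for this example.

```python
class FakeAutomationPlatform:
    """Stand-in for an external automation platform (hypothetical interface)."""

    def run(self, flow):
        # The platform executes the handling flow instance itself.
        return {"executed_by": "automation platform", "steps": list(flow), "ok": True}

    def forward(self, node, flow):
        # The platform forwards the flow to the node of the target processing object.
        return {"executed_by": node, "steps": list(flow), "ok": True}


def execute_flow_instance(scheme, platform):
    """S7021: branch on the handling type; S7023: run; S7025: forward."""
    flow = scheme["flow_instance"]
    if scheme["auto_handling"]:
        return platform.run(flow)
    return platform.forward(scheme["handler_node"], flow)


platform = FakeAutomationPlatform()
auto_result = execute_flow_instance(
    {"auto_handling": True, "flow_instance": ["restart service"], "handler_node": None},
    platform,
)
manual_result = execute_flow_instance(
    {"auto_handling": False, "flow_instance": ["restart service"], "handler_node": "dba-team"},
    platform,
)
```

In both branches the caller receives an execution result of the same shape, so the later checking step does not need to know which path was taken.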
S109: checking the execution result to complete the handling of the target fault instance.
In one embodiment of the present specification, the execution result may be verified by, specifically, comparing indexes before and after fault handling, performing a system health check, and re-running the fault tree troubleshooting analysis. If the fault is confirmed to be eliminated, the fault handling process ends; if the fault is not eliminated, the process may return to the root cause analysis step to re-determine the root cause and the fault solution. After the handling result has been checked automatically, a handling report and evaluation are generated at the same time and sent to the operation and maintenance manager.
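The check-and-retry behaviour described above can be sketched as a bounded loop. Only the index comparison is modeled; the threshold rule, the `max_rounds` bound, and all function names are assumptions added for illustration, since the specification does not state how many times root cause analysis may be repeated.

```python
def compare_indexes(before, after, thresholds):
    """Index comparison: an index is healthy if it is within its threshold after handling."""
    return {
        name: {"before": before[name], "after": after[name], "healthy": after[name] <= limit}
        for name, limit in thresholds.items()
    }


def handle_until_verified(analyze, handle, collect_indexes, thresholds, max_rounds=3):
    """Repeat root cause analysis and handling until the check passes (bounded, by assumption)."""
    report = {}
    root_cause = None
    for round_no in range(1, max_rounds + 1):
        root_cause = analyze()
        before = collect_indexes()
        handle(root_cause)
        after = collect_indexes()
        report = compare_indexes(before, after, thresholds)
        if all(entry["healthy"] for entry in report.values()):
            return {"eliminated": True, "rounds": round_no, "root_cause": root_cause, "report": report}
    return {"eliminated": False, "rounds": max_rounds, "root_cause": root_cause, "report": report}


# Simulated environment: the first root cause candidate is wrong, the second is right.
state = {"cpu_usage": 95.0}
candidates = iter(["slow SQL", "thread leak"])


def analyze():
    return next(candidates)


def handle(root_cause):
    if root_cause == "thread leak":
        state["cpu_usage"] = 40.0


def collect_indexes():
    return dict(state)


result = handle_until_verified(analyze, handle, collect_indexes, {"cpu_usage": 80.0})
print(result["eliminated"], result["rounds"], result["root_cause"])  # prints True 2 thread leak
```

The returned report holds the before and after values per index, which is the kind of data a handling report for the operation and maintenance manager could be built from.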
In another embodiment of the present specification, specifically, the fault handling method may further include:
S901: generating a fault list from at least one fault instance.
In one possible implementation, as shown in fig. 4, all fault instances are managed in the fault list, where the fault list includes, but is not limited to, a serial number, an operation and maintenance object name, an application system, a host name, an index name, a fault classification, a processing status, a recommended loss stopping point, and a recommended root cause. Specific faults can also be queried by conditions such as the fault number, the aggregation time, and the number of alarms contained.
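The query conditions mentioned above (fault number, aggregation time, number of contained alarms) can be sketched as a simple filter over the fault list. The record keys, the sample rows, and the `query_faults` function are hypothetical.

```python
from datetime import datetime


def query_faults(fault_list, fault_number=None, aggregated_since=None, min_alarm_count=None):
    """Return fault records that satisfy every condition that was supplied."""
    matches = []
    for fault in fault_list:
        if fault_number is not None and fault["fault_number"] != fault_number:
            continue
        if aggregated_since is not None and fault["aggregation_time"] < aggregated_since:
            continue
        if min_alarm_count is not None and fault["alarm_count"] < min_alarm_count:
            continue
        matches.append(fault)
    return matches


# Hypothetical fault list rows.
FAULT_LIST = [
    {"fault_number": "F-001", "aggregation_time": datetime(2020, 11, 20, 9, 0), "alarm_count": 12},
    {"fault_number": "F-002", "aggregation_time": datetime(2020, 11, 20, 10, 30), "alarm_count": 3},
]

large_faults = query_faults(FAULT_LIST, min_alarm_count=5)
recent_faults = query_faults(FAULT_LIST, aggregated_since=datetime(2020, 11, 20, 10, 0))
```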
S903: performing root cause analysis on the at least one fault instance to generate a fault root cause list.
In a possible implementation, as shown in fig. 5, a root cause analysis operation is performed on a fault instance in the fault list, and after the analysis is completed, the system jumps to a visual interface of the fault root cause list, where the fault root cause list includes, but is not limited to, root cause, operation and maintenance object name, application system, host name, index name, troubleshooting time, root cause type, and root cause probability. The list can also be queried by root cause, which gives a comprehensive view of the faults sharing the same root cause and an intuitive view of how the faults propagate.
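Querying by root cause can be illustrated by grouping the rows of the fault root cause list, so that all operation and maintenance objects affected by the same root cause are seen together. The row layout and the `objects_by_root_cause` function are assumptions made for this sketch.

```python
from collections import defaultdict


def objects_by_root_cause(root_cause_list):
    """Map each root cause to the operation and maintenance objects it affects."""
    groups = defaultdict(list)
    for row in root_cause_list:
        groups[row["root_cause"]].append(row["om_object_name"])
    return dict(groups)


# Hypothetical rows of a fault root cause list.
ROOT_CAUSE_LIST = [
    {"root_cause": "database connection pool full", "om_object_name": "order-service", "root_cause_probability": 0.82},
    {"root_cause": "database connection pool full", "om_object_name": "payment-service", "root_cause_probability": 0.77},
    {"root_cause": "disk full", "om_object_name": "log-server", "root_cause_probability": 0.91},
]

grouped = objects_by_root_cause(ROOT_CAUSE_LIST)
print(grouped["database connection pool full"])  # prints ['order-service', 'payment-service']
```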
S905: generating a fault solution list according to the at least one fault instance.
In one possible implementation, as shown in fig. 6, the fault solution list may include, but is not limited to, a root cause number, a root cause, a plan name, plan content, an associated troubleshooting tree model, an operation and maintenance object name, an application system, a host name, and a target name, and may provide a configuration entry for fault handling to meet the need to configure the fault handling plan manually.
S907: checking fault information and fault propagation relations according to the fault list, the fault root cause list and/or the fault solution list.
In another possible embodiment, the instantiated fault solution is visualized to generate a list of fault handling schemes and a list of handling flow instances, as shown in fig. 7 and 8. It can be understood that visualizing the information of the analysis and handling stages provides real-time information for analysis, decision-making, and handling personnel, reduces the time overhead between different teams, and improves the coordination efficiency of fault handling.
In another embodiment of the present specification, the fault handling procedure shown in fig. 9 may be adopted. In fig. 9, generating the loss stopping suggestion is treated as part of the main processing work of the fault analysis model and is therefore not shown separately in the figure; handling model matching covers step S105 and step S107 of this embodiment. This example is for reference only and does not specifically limit the content of the present invention; the remaining parts are the same as the corresponding parts of this embodiment and are not described again here.
An embodiment of the present invention further provides a fault handling apparatus. As shown in fig. 10, the apparatus may include:
a first determining module 1010, configured to obtain alarm information, and determine a target fault instance according to the alarm information;
a second determining module 1020, configured to obtain a fault analysis model, perform root cause analysis on the target fault instance through the fault analysis model, and determine a root cause, a loss stopping point, and/or a loss stopping suggestion of the target fault instance;
a fault solution generating module 1030, configured to match a target handling policy according to the root cause, the loss stopping point, and/or the loss stopping suggestion, and generate a fault solution for the target fault instance according to the target handling policy;
an executing module 1040, configured to invoke a preset standard handling flow to execute the fault solution, and obtain an execution result;
a checking module 1050, configured to check the execution result, and complete handling of the target fault instance.
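The way the five modules cooperate can be sketched as a pipeline in which each module is a pluggable callable. The `FaultHandlingApparatus` class and the stub functions below are hypothetical; they show only the order of the data flow between modules 1010 to 1050, not any disclosed implementation.

```python
class FaultHandlingApparatus:
    """Chains the five modules; each one is supplied as a callable (an assumption)."""

    def __init__(self, determine, analyze, generate, execute, check):
        self.determine = determine  # first determining module 1010
        self.analyze = analyze      # second determining module 1020
        self.generate = generate    # fault solution generating module 1030
        self.execute = execute      # executing module 1040
        self.check = check          # checking module 1050

    def handle(self, alarms):
        instance = self.determine(alarms)
        analysis = self.analyze(instance)
        solution = self.generate(analysis)
        result = self.execute(solution)
        return self.check(result)


# Stub modules, for illustration only.
apparatus = FaultHandlingApparatus(
    determine=lambda alarms: {"alarms": sorted(set(alarms))},
    analyze=lambda inst: {"instance": inst, "root_cause": "thread leak", "suggestion": "restart service"},
    generate=lambda analysis: {"plan_content": ["restart policy"], "root_cause": analysis["root_cause"]},
    execute=lambda solution: {"ok": True, "steps": solution["plan_content"]},
    check=lambda result: "fault eliminated" if result["ok"] else "return to root cause analysis",
)
outcome = apparatus.handle(["cpu high", "cpu high", "response slow"])
print(outcome)  # prints fault eliminated
```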
The fault handling apparatus embodiment and the method embodiments of the present invention are based on the same inventive concept; for details, refer to the method embodiments, which are not repeated here.
The embodiment of the present invention further provides a computer device, which includes a processor and a memory, where the memory stores at least one instruction or at least one program, and the at least one instruction or the at least one program is loaded and executed by the processor to implement a fault handling method as provided in the above method embodiment.
The memory may be used to store software programs and modules, and the processor may execute various functional applications and data processing by running the software programs and modules stored in the memory. The memory may mainly comprise a program storage area and a data storage area, wherein the program storage area may store an operating system, application programs required by functions, and the like; the data storage area may store data created according to use of the apparatus, and the like. Further, the memory may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory may also include a memory controller to provide the processor with access to the memory.
The method embodiments provided by the embodiments of the present invention may be executed in a mobile terminal, a computer terminal, a server, or a similar computing device; that is, the computer device may include a mobile terminal, a computer terminal, a server, or a similar computing device. Taking execution on a server as an example, fig. 11 is a hardware structure block diagram of a server for the fault handling method according to the embodiment of the present invention. As shown in fig. 11, the server 1100 may vary considerably with configuration or performance, and may include one or more Central Processing Units (CPUs) 1110 (the processor 1110 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 1130 for storing data, and one or more storage media 1120 (e.g., one or more mass storage devices) for storing applications 1123 or data 1122. The memory 1130 and the storage medium 1120 may be transient storage or persistent storage. The program stored in the storage medium 1120 may include one or more modules, each of which may include a series of instruction operations for the server. Further, the CPU 1110 may be arranged to communicate with the storage medium 1120 and to execute, on the server 1100, the series of instruction operations in the storage medium 1120. The server 1100 may also include one or more power supplies 1160, one or more wired or wireless network interfaces 1150, one or more input-output interfaces 1140, and/or one or more operating systems 1121, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
It will be understood by those skilled in the art that the structure shown in fig. 11 is only an illustration and is not intended to limit the structure of the electronic device. For example, server 1100 may also include more or fewer components than shown in FIG. 11, or have a different configuration than shown in FIG. 11.
Embodiments of the present invention also provide a computer-readable storage medium, which may be disposed in a server to store at least one instruction or at least one program for implementing a fault handling method in the method embodiments, where the at least one instruction or the at least one program is loaded and executed by the processor to implement a fault handling method provided in the method embodiments.
Optionally, in this embodiment, the storage medium may be located in at least one network server of a plurality of network servers of a computer network. Optionally, in this embodiment, the storage medium may include, but is not limited to: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and various other media capable of storing program code.
As can be seen from the above embodiments, the fault handling method, apparatus, medium, and device provided by the present invention have the following technical effects:
(1) the scheme provided by the invention establishes a fault analysis model based on all operation and maintenance objects and indexes, can realize comprehensive and ordered full-link, full-coverage root cause analysis, and determines a fault node or a loss stopping point in combination with the characteristics of the application scenario in which the fault occurs;
(2) the fault analysis model in the scheme provided by the invention also introduces the index association relation and the fault propagation relation of the operation and maintenance objects, so that the root cause of a fault instance is traced more deeply, root cause analysis of faults in complex scenarios is more accurate, and clearer guidance can be provided for handling the fault;
(3) the scheme provided by the invention converts historically accumulated fault analysis and handling experience so that loss stopping suggestions can be provided when the root cause is analyzed, and the handling policies preset in the cloud platform are automatically matched according to the loss stopping suggestions, saving the time of manually retrieving handling policies;
(4) the scheme provided by the invention uses the external automation platform to realize active analysis of fault instances, automatic invocation of the handling flow for the fault solution, and/or automatic checking of fault handling, so that intelligent handling of fault instances is realized, the efficiency of fault handling is improved, and the negative impact of faults is reduced;
(5) the scheme provided by the invention generates visual lists from information such as fault instances, their root causes, and their solutions, providing operation and maintenance personnel with real-time, unified troubleshooting and handling information and improving the efficiency of cooperation among different teams.
It should be noted that the order of the above embodiments of the present invention is only for description and does not represent the relative merits of the embodiments. Specific embodiments of this specification have been described above. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus, device and storage medium embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference may be made to some descriptions of the method embodiments for relevant points.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (11)

1. A method of fault handling, the method comprising:
acquiring alarm information, and determining a target fault instance according to the alarm information;
acquiring a fault analysis model, carrying out root cause analysis on the target fault instance through the fault analysis model, and determining root causes, loss stopping points and/or loss stopping suggestions of the target fault instance;
matching a target handling policy according to the root cause, the loss stopping point and/or the loss stopping suggestion, and generating a fault solution for the target fault instance according to the target handling policy;
calling a preset standard handling flow to execute the fault solution to obtain an execution result;
and checking the execution result to complete the handling of the target fault instance.
2. The method of claim 1, wherein the determining a target fault instance according to the alarm information comprises:
based on preset logic, performing aggregation and duplicate removal on the alarm information;
and combining at least one alarm information meeting the preset association degree in the alarm information after the aggregation and the de-duplication into a target fault instance.
3. A method according to claim 1, wherein said obtaining a fault analysis model comprises:
establishing an index association relation between at least one index of the operation and maintenance object according to the historical fault handling record;
establishing a fault propagation relation between at least one operation and maintenance object according to a configuration and calling relation between the at least one operation and maintenance object;
and generating a fault tree based on the index association relation and the fault propagation relation, and taking the fault tree as a fault analysis model.
4. A method according to claim 3, wherein the performing root cause analysis on the target fault instance through the fault analysis model and determining root causes, loss stopping points and/or loss stopping suggestions of the target fault instance comprises:
calling an event platform, and creating a root cause analysis event list aiming at the target fault instance;
executing the root cause analysis event list based on the event platform, and carrying out root cause analysis on the target fault instance by using the fault tree to obtain a troubleshooting analysis result;
and determining root causes, loss stopping points and/or loss stopping suggestions of the target fault instance according to the troubleshooting analysis result.
5. A method of fault handling according to claim 1, wherein said matching a target handling policy according to the root cause, the loss stopping point and/or the loss stopping suggestion, and generating a fault solution for the target fault instance according to the target handling policy comprises:
acquiring a preset fault standard handling strategy set, wherein a standard handling strategy in the fault standard handling strategy set is preset with an incidence relation with a root cause, a loss stopping point and/or a loss stopping suggestion of a fault;
matching a target handling strategy in the preset fault standard handling strategy set according to the root cause, the loss stopping point and/or the loss stopping suggestion of the target fault instance;
generating a fault solution for the target fault instance in accordance with the target handling policy.
6. The method according to claim 1, wherein the invoking of the preset standard handling procedure executes the fault solution, and the obtaining of the execution result includes:
instantiating the fault solution to obtain a fault handling scheme according to a preset standard handling flow, wherein the fault handling scheme comprises a handling flow instance;
and executing the handling flow instance to obtain an execution result.
7. The method of claim 6, wherein the executing the handling flow instance comprises:
judging whether the fault handling scheme is of a preset automatic handling type;
if yes, executing the handling flow instance by using an automation platform, and acquiring an execution result returned by the automation platform;
if not, forwarding the handling flow instance to a node of a target processing object by using the automation platform, so that the target processing object handles the target fault instance according to the handling flow instance, and acquiring an execution result returned by the target processing object.
8. A method of fault handling according to claim 1, the method further comprising:
generating a fault list according to at least one fault instance;
performing root cause analysis on the at least one fault instance to generate a fault root cause list;
generating a fault solution list according to the at least one fault instance;
and checking fault information and fault propagation relation according to the fault list, the fault root cause list and/or the fault solution list.
9. A fault handling apparatus, characterized in that the apparatus comprises:
the first determining module is used for acquiring alarm information and determining a target fault instance according to the alarm information;
the second determining module is used for acquiring a fault analysis model, performing root cause analysis on the target fault instance through the fault analysis model, and determining root causes, loss stopping points and/or loss stopping suggestions of the target fault instance;
a fault solution generation module, configured to match a target handling policy according to the root cause, the loss stopping point, and/or the loss stopping suggestion, and generate a fault solution for the target fault instance according to the target handling policy;
the execution module is used for calling a preset standard handling flow to execute the fault solution and obtaining an execution result;
and the checking module is used for checking the execution result to complete the handling of the target fault instance.
10. A computer storage medium having stored therein at least one instruction or at least one program, the at least one instruction or the at least one program being loaded and executed by a processor to implement a fault handling method as claimed in any one of claims 1 to 8.
11. A computer device, characterized in that it comprises a processor and a memory, in which at least one instruction or at least one program is stored, the at least one instruction or the at least one program being loaded and executed by the processor to implement a fault handling method according to any one of claims 1 to 8.
CN202011311725.4A 2020-11-20 2020-11-20 Fault handling method, device, medium and equipment Pending CN112446511A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011311725.4A CN112446511A (en) 2020-11-20 2020-11-20 Fault handling method, device, medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011311725.4A CN112446511A (en) 2020-11-20 2020-11-20 Fault handling method, device, medium and equipment

Publications (1)

Publication Number Publication Date
CN112446511A (en) 2021-03-05

Family

ID=74737079

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011311725.4A Pending CN112446511A (en) 2020-11-20 2020-11-20 Fault handling method, device, medium and equipment

Country Status (1)

Country Link
CN (1) CN112446511A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination