CN116414609A - Fault analysis method, device, electronic equipment and storage medium - Google Patents

Fault analysis method, device, electronic equipment and storage medium

Info

Publication number
CN116414609A
Authority
CN
China
Prior art keywords
fault
fault analysis
monitored object
analysis
monitored
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310442661.9A
Other languages
Chinese (zh)
Inventor
张孝龙
朱二夫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qianxin Technology Group Co Ltd
Original Assignee
Qianxin Technology Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qianxin Technology Group Co Ltd filed Critical Qianxin Technology Group Co Ltd
Priority to CN202310442661.9A priority Critical patent/CN116414609A/en
Publication of CN116414609A publication Critical patent/CN116414609A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/02 Standardisation; Integration
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application provides a fault analysis method, a fault analysis device, electronic equipment and a storage medium, which are used for improving the efficiency of fault analysis. The method comprises the following steps: if the monitored object is monitored to be faulty, acquiring a target scene arrangement flow corresponding to the monitored object; the target scene arrangement flow is generated in advance and comprises a plurality of fault analysis nodes and execution logic relations of the fault analysis nodes; and sequentially executing the fault analysis nodes according to the execution logic relationship based on the target scene arrangement flow so as to perform fault analysis. According to the method and the device for analyzing the faults, the scene arrangement flow corresponding to various monitored objects is generated in advance, and when the monitored objects are in faults, the corresponding scene arrangement flow is utilized to realize automatic analysis of the faults, so that the efficiency of fault analysis is improved.

Description

Fault analysis method, device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a fault analysis method, a fault analysis device, an electronic device, and a storage medium.
Background
In order to timely find out the faults of the services or components in the system, a monitoring system can be used for monitoring the service system, specifically, the monitoring system can be used for collecting relevant data of the service system in the running process, determining whether the services or components in the service system are faulty or not based on the collected data, and giving a prompt when the faults occur.
When the monitoring system detects a fault in the service system, the current common practice is to check and analyze the fault manually, which makes fault analysis inefficient.
Disclosure of Invention
An embodiment of the application aims to provide a fault analysis method, a device, electronic equipment and a storage medium, which are used for improving the efficiency of fault analysis.
In a first aspect, an embodiment of the present application provides a fault analysis method, including:
if the monitored object is monitored to be faulty, acquiring a target scene arrangement flow corresponding to the monitored object; the target scene arrangement flow is generated in advance and comprises a plurality of fault analysis nodes and execution logic relations of the fault analysis nodes;
and sequentially executing the fault analysis nodes according to the execution logic relationship based on the target scene arrangement flow so as to perform fault analysis.
According to the method and the device for analyzing the faults, the scene arrangement flow corresponding to various monitored objects is generated in advance, and when the monitored objects are in faults, the corresponding scene arrangement flow is utilized to realize automatic analysis of the faults, so that the efficiency of fault analysis is improved.
In any embodiment, the fault analysis node includes a fault analysis script, and sequentially executes the fault analysis nodes according to the execution logic relationship, including:
and sequentially executing the fault analysis scripts corresponding to the fault analysis nodes according to the execution logic relation.
According to the embodiment of the application, the fault analysis scripts are sequentially executed according to the execution logic relationship, so that automation of fault investigation can be realized, and the manual participation degree is reduced.
In any embodiment, the target scene orchestration procedure further comprises a failure recovery node; the fault recovery node comprises fault recovery scripts corresponding to each fault cause respectively; the method further comprises the steps of:
determining a fault reason, and determining a target fault recovery script from the fault recovery nodes according to the fault reason;
and executing the target fault recovery script to perform fault recovery on the monitored object.
In the embodiment of the application, after the fault reason is determined, the fault recovery script corresponding to the fault reason can be used for realizing automatic recovery of the fault, so that the influence of an invalid recovery command on the load of the service system is reduced.
In any embodiment, the target scene arrangement includes a dependency component troubleshooting and a self-troubleshooting, and the execution of the failure analysis node corresponding to the dependency component troubleshooting precedes the failure analysis node corresponding to the self-troubleshooting; sequentially executing the fault analysis nodes according to the execution logic relationship, including:
acquiring dependency index data of a dependent component corresponding to a monitored object; the dependent component refers to a component of the monitored object with a dependent relationship in the running process;
based on the dependency index data, executing a fault analysis node corresponding to the dependency component troubleshooting so as to perform fault analysis on the dependency component;
if the dependent component has no fault, executing a fault analysis node corresponding to self-checking so as to perform fault analysis on the monitored object.
According to the embodiment of the application, whether the dependent component fails or not is firstly checked, so that the failure cause analysis efficiency is improved.
In any embodiment, before monitoring that the monitored object fails, the method further comprises:
and collecting dependency index data of the dependency component corresponding to the monitored object.
According to the embodiment of the application, the dependency index data of the dependency component are collected in advance, so that the efficiency of fault analysis is further improved.
In any embodiment, before monitoring that the monitored object fails, the method further comprises:
and collecting index data of the monitored object, and performing fault monitoring based on the index data.
According to the embodiment of the application, index data of the monitored object is collected so as to monitor whether the monitored object is running normally.
In any embodiment, the method further comprises:
acquiring fault analysis scripts corresponding to each monitored object from a tool set; the tool set centrally stores a plurality of executable fault analysis scripts;
setting an execution sequence for the fault analysis script;
and generating a scene arrangement flow of the corresponding monitored object based on the fault analysis script and the corresponding execution sequence.
According to the method and the device for analyzing the fault, the scene arrangement flow corresponding to various monitored objects is generated in advance, and when the monitored objects are in fault, the corresponding scene arrangement flow is utilized to realize automatic analysis of the fault reason, so that the efficiency of fault analysis is improved.
In a second aspect, an embodiment of the present application provides a fault analysis apparatus, including:
the flow acquisition module is used for acquiring a target scene arrangement flow corresponding to the monitored object if the monitored object is monitored to be faulty; the target scene arrangement flow is generated in advance and comprises a plurality of fault analysis nodes and execution logic relations of the fault analysis nodes;
and the fault analysis module is used for sequentially executing the fault analysis nodes according to the execution logic relationship based on the target scene arrangement flow so as to perform fault analysis.
In a third aspect, an embodiment of the present application provides an electronic device, including: a processor, a memory, and a bus, wherein,
the processor and the memory complete communication with each other through the bus;
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of the first aspect.
In a fourth aspect, embodiments of the present application provide a non-transitory computer readable storage medium comprising:
the non-transitory computer-readable storage medium stores computer instructions that cause the computer to perform the method of the first aspect.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be apparent from the description, or may be learned by practice of the embodiments of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a fault analysis method provided in an embodiment of the present application;
Fig. 2 is a schematic flow chart of a fault analysis method for alarm health detection provided in an embodiment of the present application;
Fig. 3 is a schematic diagram of fault analysis provided in an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a fault analysis device provided in an embodiment of the present application;
Fig. 5 is a schematic diagram of the physical structure of an electronic device provided in an embodiment of the present application.
Detailed Description
Embodiments of the technical solutions of the present application will be described in detail below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical solutions of the present application, and thus are only examples, and are not intended to limit the scope of protection of the present application.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "comprising" and "having" and any variations thereof in the description and claims of the present application and in the description of the figures above are intended to cover non-exclusive inclusions.
In the description of the embodiments of the present application, the technical terms "first," "second," etc. are used merely to distinguish between different objects and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated, a particular order or a primary or secondary relationship. In the description of the embodiments of the present application, the meaning of "plurality" is two or more unless explicitly defined otherwise.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
In the description of the embodiments of the present application, the term "and/or" merely describes an association relationship between associated objects and means that three relationships may exist. For example, A and/or B may mean: A exists alone, A and B exist together, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
In the description of the embodiments of the present application, the term "plurality" refers to two or more (including two), and similarly, "plural sets" refers to two or more (including two), and "plural sheets" refers to two or more (including two).
In the description of the embodiments of the present application, unless explicitly specified and limited otherwise, the terms "mounted," "connected," "secured" and the like are to be construed broadly and may be, for example, fixedly connected, detachably connected, or integrally formed; or may be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communicated with the inside of two elements or the interaction relationship of the two elements. The specific meaning of the above terms in the embodiments of the present application will be understood by those of ordinary skill in the art according to the specific circumstances.
At present, when a business system fails during operation, one of two approaches is generally adopted: directly executing a recovery command so that the business system continues to run, or manually investigating the fault cause and restoring operation once the cause has been identified. The first approach does not actually resolve the fault; the second requires considerable manpower and is inefficient.
In order to solve the above technical problems, embodiments of the present application provide a fault analysis method, apparatus, electronic device, and storage medium, where by generating scene arrangement flows corresponding to multiple monitored objects in advance, when a fault occurs, the fault may be automatically analyzed by using the corresponding scene arrangement flows, so that efficiency of fault analysis is improved.
Fig. 1 is a schematic flow chart of a fault analysis method provided in an embodiment of the present application. As shown in fig. 1, the fault analysis method is applied to a monitoring system. The monitoring system may run on the business system corresponding to the monitored object, or may be an independent electronic device; in the latter case, the monitoring system is communicatively connected to the business system. It can be understood that the monitoring system may be Prometheus, an open-source system monitoring and alerting system that features a multidimensional data model, a flexible query language, direct local deployment, efficient storage, and the like.
The method comprises the following steps:
step 101: if the monitored object is monitored to be faulty, acquiring a target scene arrangement flow corresponding to the monitored object; the target scene arrangement flow is generated in advance and comprises a plurality of fault analysis nodes and execution logic relations of the fault analysis nodes;
step 102: and sequentially executing the fault analysis nodes according to the execution logic relation based on the target scene arrangement flow so as to perform fault analysis.
In step 101, the monitored object may be a service or a component of the business system. While the business system is running, the monitoring system collects data generated by the business system; which data to collect can be configured in the monitoring system in advance. Different monitored objects require different data. For example, when the monitored object is a Java service, the data to be collected includes the latest heartbeat time, the occupied memory size, the JVM garbage collection frequency, key content of exception logs, and the like. When at least one item of the collected data is abnormal, the Java service can be determined to be faulty. For example, if the monitoring system finds that no heartbeat has been returned in the most recent period, the Java service is faulty.
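The detection rule described above (any one abnormal item marks the service as faulty) might be sketched as follows. This is a minimal illustration, not the applicant's implementation: the `JavaServiceMetrics` type, the `detect_faults` function, and all threshold values are hypothetical.

```python
import time
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JavaServiceMetrics:
    last_heartbeat: float   # Unix timestamp of the latest heartbeat
    memory_bytes: int       # memory currently occupied by the service
    gc_per_minute: float    # JVM garbage collection frequency
    error_log_hits: int     # number of exception-log keywords matched


# Hypothetical thresholds; a real deployment would configure these per service.
HEARTBEAT_TIMEOUT_S = 60.0
MAX_MEMORY_BYTES = 2 * 1024 ** 3
MAX_GC_PER_MINUTE = 30.0


def detect_faults(m: JavaServiceMetrics, now: Optional[float] = None) -> List[str]:
    """Return the abnormal items; a non-empty list means the service is faulty."""
    now = time.time() if now is None else now
    faults = []
    if now - m.last_heartbeat > HEARTBEAT_TIMEOUT_S:
        faults.append("heartbeat_missing")
    if m.memory_bytes > MAX_MEMORY_BYTES:
        faults.append("memory_high")
    if m.gc_per_minute > MAX_GC_PER_MINUTE:
        faults.append("gc_frequent")
    if m.error_log_hits > 0:
        faults.append("error_log")
    return faults
```

The returned item names could then serve as fault identifiers when selecting a scene arrangement flow.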
The scene arrangement flow refers to a flow for automatically analyzing faults of the corresponding monitored object; it includes a plurality of fault analysis nodes and the execution logic relationship among them. A fault analysis node is a troubleshooting operation performed during fault investigation, for example: checking whether data is acquired successfully, checking whether the network times out, checking whether a port exists, checking whether a process exists, checking dependent components, and so on. The execution logic relationship represents the execution order and dependencies of the fault analysis nodes. For example, suppose the scene arrangement flow includes three fault analysis nodes: checking whether data acquisition succeeds, checking whether the network times out, and checking whether the port exists. The data acquisition check is executed first; if data acquisition fails, the network timeout check is executed; if the network times out, the flow ends; if the network is normal, the port check is executed. The execution logic relationship therefore covers not only the execution order but also branching: different execution results of a fault analysis node lead to different subsequent fault analysis nodes.
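One way to picture fault analysis nodes and their execution logic relationship is a small graph in which each node names its successor for each outcome. The sketch below encodes the three-node example from the preceding paragraph; `AnalysisNode`, `run_flow`, and `build_example_flow` are hypothetical names, and the boolean inputs stand in for real checks.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class AnalysisNode:
    name: str
    check: Callable[[], bool]      # True when the check passes
    on_pass: Optional[str] = None  # next node if the check passes (None = end)
    on_fail: Optional[str] = None  # next node if the check fails (None = end)


def run_flow(nodes: Dict[str, AnalysisNode], start: str) -> List[Tuple[str, bool]]:
    """Execute nodes by following the branch chosen by each result.

    Assumes the flow is acyclic; a cyclic flow would loop forever.
    """
    trace = []
    current = start
    while current is not None:
        node = nodes[current]
        passed = node.check()
        trace.append((node.name, passed))
        current = node.on_pass if passed else node.on_fail
    return trace


def build_example_flow(data_ok: bool, network_ok: bool, port_ok: bool) -> Dict[str, AnalysisNode]:
    """Data check first; on failure check the network; if the network is fine, check the port."""
    return {
        "data": AnalysisNode("data", lambda: data_ok, on_fail="network"),
        "network": AnalysisNode("network", lambda: network_ok, on_pass="port"),
        "port": AnalysisNode("port", lambda: port_ok),
    }
```

With data acquisition failing and the network timing out, `run_flow(build_example_flow(False, False, True), "data")` visits only the data and network nodes, matching the "end if the network times out" branch.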
The monitored objects that may fail in the business system may be predetermined and a corresponding scene orchestration flow may be generated based on the failure of each monitored object. When the monitoring system monitors that the monitored object has faults, the target scene arranging process corresponding to the monitored object can be determined from a plurality of scene arranging processes according to the types of the faults. It can be understood that when the monitoring system monitors that the monitored object has a fault, the fault identifier corresponding to the monitored object can be output, and the scene arrangement flow and the fault identifier corresponding to the monitored object are correspondingly stored in the monitoring system, so that the target arrangement flow can be determined according to the fault identifier.
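The stored correspondence between fault identifiers and scene arrangement flows could be as simple as a keyed lookup. The following sketch assumes a flow is represented by an ordered list of node names; the `FlowRegistry` class and the example identifiers are hypothetical.

```python
from typing import Dict, List, Tuple

FlowKey = Tuple[str, str]  # (monitored object, fault identifier)


class FlowRegistry:
    """Stores pre-generated scene arrangement flows keyed by object and fault identifier."""

    def __init__(self) -> None:
        self._flows: Dict[FlowKey, List[str]] = {}

    def register(self, monitored_object: str, fault_id: str, node_names: List[str]) -> None:
        self._flows[(monitored_object, fault_id)] = list(node_names)

    def lookup(self, monitored_object: str, fault_id: str) -> List[str]:
        """Return the target flow, or raise LookupError if none was generated in advance."""
        try:
            return self._flows[(monitored_object, fault_id)]
        except KeyError:
            raise LookupError(
                f"no scene arrangement flow for {monitored_object!r} / {fault_id!r}"
            ) from None
```

Raising on a missing entry, rather than returning an empty flow, makes an unplanned fault type visible so that it can be escalated for manual handling.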
In step 102, the same type of fault may have multiple possible causes. Therefore, after the monitoring system acquires the target scene arrangement flow, it executes the fault analysis nodes in sequence according to their execution logic relationship in the target scene arrangement flow, thereby analyzing the fault automatically.
According to the method and the device for analyzing the faults, the scene arrangement flow corresponding to various monitored objects is generated in advance, and when the monitored objects are in faults, the corresponding scene arrangement flow is utilized to realize automatic analysis of the faults, so that the efficiency of fault analysis is improved.
On the basis of the above embodiment, each fault analysis node in the scene arrangement flow includes a corresponding fault analysis script, which is an executable script. When fault analysis is performed using the target scene arrangement flow, the script of each fault analysis node is executed to obtain that node's execution result, the next fault analysis node is determined according to the result, and execution continues node by node until the fault cause is obtained.
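A script-driven version of this loop might look like the sketch below. It assumes, purely for illustration, that each fault analysis script reports its result through its exit code (0 meaning the check passed); the source does not specify how results are reported. `ScriptNode` and `run_script_flow` are hypothetical names, while `subprocess.run` is the standard-library call.

```python
import subprocess
import sys
from typing import Dict, List, Optional, Tuple

# node name -> (command, next node on exit code 0, next node on non-zero exit code)
ScriptNode = Tuple[List[str], Optional[str], Optional[str]]


def run_script_flow(
    nodes: Dict[str, ScriptNode], start: str, timeout_s: float = 30.0
) -> List[Tuple[str, int]]:
    """Run each node's script and pick the next node from its exit code."""
    results = []
    current = start
    while current is not None:
        command, next_ok, next_fail = nodes[current]
        completed = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
        results.append((current, completed.returncode))
        current = next_ok if completed.returncode == 0 else next_fail
    return results


# Stand-in scripts: tiny Python one-liners that exit with a fixed code.
EXAMPLE_NODES: Dict[str, ScriptNode] = {
    "check_data": ([sys.executable, "-c", "raise SystemExit(1)"], None, "check_network"),
    "check_network": ([sys.executable, "-c", "pass"], None, None),
}
```

The timeout guards against a hung analysis script blocking the whole flow; `subprocess.run` raises `subprocess.TimeoutExpired` in that case, which a caller would need to handle.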
According to the embodiment of the application, the fault analysis scripts are sequentially executed according to the execution logic relationship, so that automation of fault investigation can be realized, and the manual participation degree is reduced.
When the monitored object fails, the comparative scheme only executes the corresponding recovery command and does not analyze the fault accurately. The fault may be caused by the component itself or by other components, so simply executing a recovery command is often ineffective. For example, when a service is detected to be abnormal and no service process exists on the port corresponding to the service, the system directly issues a service restart command. This scheme does not analyze the service fault, and in many cases the abnormality recurs some time after the restart command succeeds. The service is then restarted frequently, which puts greater load on the business system.
Based on the above problems, the embodiment of the present application further provides a fault automatic recovery function, that is, the target scene arrangement flow further includes a fault recovery node; the fault recovery node comprises fault recovery scripts corresponding to each fault cause respectively; after determining the fault reason, determining a target fault recovery script from the fault recovery nodes according to the fault reason; and executing the target fault recovery script to perform fault recovery on the monitored object.
It can be understood that the fault recovery node may be disposed behind the fault analysis node, and for the same type of fault, fault recovery methods corresponding to different fault causes are different, so that different fault recovery methods may be disposed behind different execution results of the fault analysis node, where the different fault recovery methods correspond to different fault recovery scripts, and the fault recovery scripts are executed to implement fault processing, so that the monitored object can operate normally.
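Selecting a recovery script by fault cause can be illustrated with a plain mapping. In this sketch the recovery scripts are modeled as Python callables that return a status string; the cause names and the `recover` function are hypothetical.

```python
from typing import Callable, Dict

RecoveryScript = Callable[[], str]


def recover(fault_cause: str, recovery_scripts: Dict[str, RecoveryScript]) -> str:
    """Run the recovery script registered for this cause, if any.

    No generic restart is attempted for unknown causes, since blind recovery
    commands are what the described method is trying to avoid.
    """
    script = recovery_scripts.get(fault_cause)
    if script is None:
        return f"no recovery script for cause {fault_cause!r}; notify operator"
    return script()


# Hypothetical fault recovery node with two cause-specific scripts.
EXAMPLE_RECOVERY_NODE: Dict[str, RecoveryScript] = {
    "disk_full": lambda: "cleaned temporary files",
    "process_missing": lambda: "restarted service",
}
```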
Fig. 2 is a schematic flow chart of a fault analysis method for alarm health detection provided in an embodiment of the present application. As shown in fig. 2, the method includes:
(1) Failure analysis of data acquisition: executing a fault analysis script for data acquisition to judge whether the data can be acquired, if so, analyzing the acquired data, and if the analysis fails, carrying out fault reminding, for example: reminding can be carried out through mails or short messages; if the data cannot be acquired, carrying out fault analysis of the network;
(2) Failure analysis of the network: executing a fault analysis script of the network to judge whether the network is overtime, and if so, determining that the fault cause is a network problem; if the network is normal, carrying out fault analysis of the port;
(3) Port failure analysis: executing a fault analysis script of the port to judge whether the port exists or not, and performing fault analysis of the process;
(4) Fault analysis of the process: executing a fault analysis script of the process to judge whether the process exists; the result falls into one of the following four cases:
Case 1: if both the port and the process exist and httpcode=500 is returned, investigate thread data: execute the fault analysis script for thread data query and collect the threads that are currently not closed. If such threads exist, send an email notification; if no such threads exist and the fault has not recovered for a long time, execute a restart command.
Case 2: if the port exists and the process does not exist, the port is deleted, the component inspection is performed, and a restart command is executed after the component inspection is completed.
Case 3: if neither the port nor the process exists, the dependent components are checked; if a dependent component has failed, fault handling is performed on it, and the restart command is executed again after the handling is completed.
Case 4: if the port does not exist but the process exists, wait and determine whether the restart succeeds. If the restart succeeds, send a notification that fault recovery is complete and record a restart log. If the restart fails, check the dependent components; the check can cover the database, Redis cache, es database, registry, userstervers, configuration center, HDFS, disk space, memory space, log write permission, port occupation, and the like.
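The four cases above amount to a decision table over two observations (port present, process present) plus the HTTP status code. A minimal sketch of that classification follows; `classify_case` and `NEXT_STEPS` are hypothetical names. The source does not say what happens when both port and process exist but the status code is not 500, so the sketch returns `None` there rather than guessing.

```python
from typing import Dict, List, Optional


def classify_case(
    port_exists: bool, process_exists: bool, http_code: Optional[int] = None
) -> Optional[int]:
    """Map the port/process observations to case 1-4 of the Fig. 2 flow."""
    if port_exists and process_exists:
        # Only the httpcode=500 situation is described for this combination.
        return 1 if http_code == 500 else None
    if port_exists:
        return 2  # port exists, process does not
    if not process_exists:
        return 3  # neither port nor process exists
    return 4      # process exists, port does not


# Follow-up actions per case, paraphrased from the description.
NEXT_STEPS: Dict[int, List[str]] = {
    1: ["query thread data", "email if unclosed threads exist, else restart if not recovered"],
    2: ["delete port", "inspect components", "restart"],
    3: ["check dependent components", "handle dependent faults", "restart"],
    4: ["wait for restart result", "notify and log on success, else check dependent components"],
}
```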
In the embodiment of the application, after the fault reason is determined, the fault recovery script corresponding to the fault reason can be used for realizing automatic recovery of the fault, so that the influence of an invalid recovery command on the load of the service system is reduced.
On the basis of the embodiment, the target scene arrangement comprises the relying component investigation and the self-investigation, and the execution of the fault analysis node corresponding to the relying component investigation precedes the fault analysis node corresponding to the self-investigation; sequentially executing the fault analysis nodes according to the execution logic relationship, including:
acquiring dependency index data of a dependent component corresponding to a monitored object; the dependent component refers to a component of the monitored object with a dependent relationship in the running process;
based on the dependency index data, executing a fault analysis node corresponding to the dependency component troubleshooting so as to perform fault analysis on the dependency component;
if the dependent component has no fault, executing a fault analysis node corresponding to self-checking so as to perform fault analysis on the monitored object.
In a specific implementation, after the monitoring system detects that a monitored object is abnormal, it first performs fault analysis on the dependent components of the abnormal service or component. If a dependent component is faulty, fault handling is performed on that dependent component; if the dependent components are normal, fault analysis is performed on the abnormal service or component itself.
When fault analysis is performed on a dependent component, the analysis can be based on the dependency index data of that component during operation. As shown in Fig. 2, when the alarm service starts, it calls the configuration center and userstervers to initialize data, and it also relies on ports and disk space in the server; the configuration center, userstervers, ports and disk space are therefore dependent components of the alarm service. When the alarm service fails to start, the dependency index data of these four types of dependent components can be examined for fault detection, and any abnormality found is handled. This avoids issuing an invalid recovery command while the problem in a dependent component remains unrepaired, which would otherwise strain the resources of the service system.
If the dependent components have no fault, fault analysis may be performed on the alarm service using the index data of the alarm service.
It can be understood that detection and fault handling of the dependent components can also be configured in the scene arrangement flow, specifically before the self-checking node: after the troubleshooting of the dependent components is completed and they are determined to be fault-free, the self-checking is executed.
In the embodiment of the application, the dependent components are checked first. If a dependent component is faulty, its problem is handled first, which avoids issuing an invalid recovery command while that problem remains unrepaired; this improves the efficiency of analyzing the fault cause and reduces the occupation of service system resources.
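The dependency-first ordering described above can be illustrated with a minimal Python sketch. It is not the implementation of the application: the function name `analyze`, the check functions and the metric fields (`reachable`, `in_use`, `free_percent`, `process_up`) are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical check type: returns True when the component looks healthy.
Check = Callable[[dict], bool]


def analyze(dependency_checks: Dict[str, Check],
            self_check: Check,
            dependency_metrics: Dict[str, dict],
            own_metrics: dict) -> Tuple[str, List[str]]:
    """Check dependent components first; self-check only if all are healthy."""
    faulty = [name for name, check in dependency_checks.items()
              if not check(dependency_metrics.get(name, {}))]
    if faulty:
        # A dependent component is faulty: report it and skip the self-check.
        return "dependency_fault", faulty
    if not self_check(own_metrics):
        return "self_fault", []
    return "no_fault_found", []


# Illustrative data only; the thresholds are assumptions.
checks = {
    "config_center": lambda m: m.get("reachable", False),
    "port": lambda m: not m.get("in_use", True),
    "disk_space": lambda m: m.get("free_percent", 0) > 10,
}
metrics = {
    "config_center": {"reachable": True},
    "port": {"in_use": False},
    "disk_space": {"free_percent": 3},
}
own_check = lambda m: m.get("process_up", False)

result = analyze(checks, own_check, metrics, {"process_up": False})

healthy_metrics = dict(metrics, disk_space={"free_percent": 50})
result_healthy_deps = analyze(checks, own_check, healthy_metrics, {"process_up": False})
```

In this sketch the low disk space is reported as a dependency fault and the self-check is not run; once all dependencies are healthy, the same call falls through to the self-check of the monitored object.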
On the basis of the above embodiment, before the monitored object is monitored to fail, the method further includes:
and collecting dependency index data of the dependency component corresponding to the monitored object.
In a specific implementation process, the dependency index data refers to data required for performing fault analysis on a dependency component corresponding to the monitored object. The index data required by fault analysis of different dependent components are different, so that the dependent component corresponding to the monitored object can be preset, and the dependent index data corresponding to each dependent component can also be preset.
The dependency index data may be collected before the monitored object fails, and may be collected in real time. After the monitored object fails, fault analysis can be performed on the dependent component using the dependency index data collected in advance, so that whether the dependent component is faulty can be determined quickly.
It will be appreciated that the failure analysis of the dependent components may also be automated analysis by pre-generated scene orchestration procedures.
According to the embodiment of the application, the dependency index data of the dependency component are collected in advance, so that the efficiency of fault analysis is further improved.
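As an illustration of collecting dependency index data before any fault occurs, the following Python sketch keeps the most recent sample per dependent component so that it is immediately available when analysis starts. The class `DependencyMetricCache` and the probe functions are hypothetical and are not taken from the application.

```python
import time
from typing import Callable, Dict, Optional


class DependencyMetricCache:
    """Keeps the latest collected sample for each dependent component."""

    def __init__(self) -> None:
        self._latest: Dict[str, dict] = {}

    def collect(self, probes: Dict[str, Callable[[], dict]]) -> None:
        # Run every probe once and remember its result with a timestamp.
        for name, probe in probes.items():
            sample = dict(probe())
            sample["collected_at"] = time.time()
            self._latest[name] = sample

    def latest(self, name: str) -> Optional[dict]:
        # Returns None if nothing has been collected for this component.
        return self._latest.get(name)


# Illustrative probes; real probes would query the actual components.
cache = DependencyMetricCache()
cache.collect({
    "disk_space": lambda: {"free_percent": 42},
    "config_center": lambda: {"reachable": True},
})
disk_sample = cache.latest("disk_space")
```

A scheduler would call `collect` periodically; the fault analysis then reads `latest(...)` instead of probing the components after the failure has already happened.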
On the basis of the above embodiment, index data of the monitored objects can also be collected in real time. Because the index data required for fault analysis differs between monitored objects, the index data corresponding to each monitored object can be preset. The monitoring system monitors the monitored object according to the preset index data to determine whether it has failed. When it is determined that the monitored object has failed, the fault analysis flow provided by the above embodiments is executed.
Fig. 3 is a schematic diagram of fault analysis provided in an embodiment of the present application. As shown in Fig. 3, index data is collected by Prometheus and fault detection is performed on the collected index data. After a fault is detected, the scene arrangement flow of the faulty monitored object is obtained, and fault analysis, fault location and fault handling are performed using that flow, thereby implementing automatic operation and maintenance of the service system. The knowledge base contains explanatory documents, that is, descriptions of the possible causes of each fault event and the corresponding handling methods, and mainly serves as a reference for generating scene arrangement flows. The tool set includes a plurality of executable fault analysis scripts, for example limiting access for one IP and cleaning log files on the system disk. When a scene arrangement flow is generated, these small tools can be referenced, and the scene arrangement flow for a fault event is generated once the precedence relation of the fault analysis scripts has been set.
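Fault detection against preset index data can be sketched as a set of threshold rules per monitored object. The rule format, the metric names and the thresholds below are assumptions made for illustration; the application does not specify how the preset index data is evaluated.

```python
import operator
from typing import Dict, List, Tuple

# Hypothetical rule: (metric name, comparison, threshold). A rule "fires"
# when the comparison between the sampled value and the threshold is true.
Rule = Tuple[str, str, float]

OPS = {">": operator.gt, "<": operator.lt, "==": operator.eq}


def detect_faults(rules: List[Rule], sample: Dict[str, float]) -> List[str]:
    """Return the names of the metrics whose preset rule is violated."""
    violated = []
    for metric, op, threshold in rules:
        value = sample.get(metric)
        if value is not None and OPS[op](value, threshold):
            violated.append(metric)
    return violated


# Illustrative preset rules for one monitored object.
alarm_service_rules = [
    ("cpu_percent", ">", 90.0),
    ("free_disk_percent", "<", 10.0),
]
violations = detect_faults(alarm_service_rules,
                           {"cpu_percent": 95.0, "free_disk_percent": 40.0})
```

A non-empty result would be treated as a detected fault and would trigger the fault analysis flow for that monitored object.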
On the basis of the above embodiment, the scene arrangement flow corresponding to each monitored object may be generated in advance, where the method for generating the scene arrangement flow is as follows:
acquiring fault analysis scripts corresponding to each monitored object from the tool set; the tool set stores various executable scripts;
acquiring the execution sequence of each fault analysis script;
and generating a scene arrangement flow of the corresponding monitored object based on the fault analysis script and the corresponding execution sequence.
In a specific implementation, for the explanation of the tool set, refer to the above embodiments; it is not repeated here. The fault analysis scripts required when each monitored object fails can be determined in advance, and the corresponding fault analysis scripts are obtained from the tool set. After the monitoring system obtains the fault analysis scripts, a corresponding execution sequence can be set for each of them. The monitoring system then establishes connection relations among the fault analysis scripts according to the execution sequence, thereby generating a scene arrangement flow.
According to the embodiment of the application, scene arrangement flows corresponding to the various monitored objects are generated in advance; when a monitored object fails, the corresponding scene arrangement flow is used to analyze the fault cause automatically, which improves the efficiency of fault analysis.
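The generation of a scene arrangement flow from fault analysis scripts and their execution sequence can be sketched as a linked chain of nodes. This is a minimal sketch under stated assumptions: the names `Node`, `build_flow` and `run_flow`, and the scripts in `tool_set`, are hypothetical, and a real flow may contain branching logic rather than a purely linear order.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Node:
    """One fault analysis node: a script plus a link to the next node."""
    name: str
    script: Callable[[], str]
    next: Optional["Node"] = None


def build_flow(tool_set: Dict[str, Callable[[], str]],
               order: List[str]) -> Optional[Node]:
    """Link the selected scripts from the tool set in the given order."""
    head: Optional[Node] = None
    for name in reversed(order):
        head = Node(name, tool_set[name], head)
    return head


def run_flow(head: Optional[Node]) -> List[Tuple[str, str]]:
    """Execute the nodes in sequence and collect their outputs."""
    outputs = []
    node = head
    while node is not None:
        outputs.append((node.name, node.script()))
        node = node.next
    return outputs


# Illustrative tool set; real entries would be executable scripts.
tool_set = {
    "check_disk": lambda: "disk ok",
    "check_port": lambda: "port free",
    "clean_logs": lambda: "logs cleaned",
}
flow = build_flow(tool_set, ["check_port", "check_disk"])
outputs = run_flow(flow)
```

Building the chain once and storing it per monitored object corresponds to generating the flow in advance; `run_flow` is what would be invoked when a fault is detected.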
Fig. 4 is a schematic structural diagram of a fault analysis device provided in an embodiment of the present application, where the device may be a module, a program segment, or a code on an electronic device. It should be understood that the apparatus corresponds to the embodiment of the method of fig. 1 described above, and is capable of performing the steps involved in the embodiment of the method of fig. 1, and specific functions of the apparatus may be referred to in the foregoing description, and detailed descriptions thereof are omitted herein as appropriate to avoid redundancy. The device comprises: a flow acquisition module 401 and a fault analysis module 402, wherein:
the flow acquisition module 401 is configured to obtain a target scene arrangement flow corresponding to the monitored object if the monitored object is monitored to fail; the target scene arrangement flow is generated in advance and comprises a plurality of fault analysis nodes and execution logic relations of the fault analysis nodes;
the fault analysis module 402 is configured to sequentially execute the fault analysis nodes according to the execution logic relation based on the target scene arrangement flow, so as to perform fault analysis.
On the basis of the above embodiment, the fault analysis node includes a fault analysis script, and the fault analysis module 402 is specifically configured to:
and sequentially executing the fault analysis scripts corresponding to the fault analysis nodes according to the execution logic relation.
On the basis of the above embodiment, the target scene arrangement flow further includes a fault recovery node; the fault recovery node comprises fault recovery scripts corresponding to each fault cause respectively; the apparatus further comprises a fault recovery module for:
determining a fault reason, and determining a target fault recovery script from the fault recovery nodes according to the fault reason;
and executing the target fault recovery script to recover the fault of the monitored object.
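The behaviour of the fault recovery module described above, selecting a target fault recovery script according to the determined fault cause and executing it, can be sketched as a lookup table. The cause names and scripts below are hypothetical examples, loosely modeled on the tool set entries mentioned earlier (limiting access for one IP, cleaning log files on the system disk).

```python
from typing import Callable, Dict


def recover(cause: str, recovery_scripts: Dict[str, Callable[[], str]]) -> str:
    """Pick the recovery script registered for the fault cause and run it."""
    script = recovery_scripts.get(cause)
    if script is None:
        # No script is registered for this cause; leave it to an operator.
        return f"no recovery script for cause: {cause}"
    return script()


# Illustrative fault recovery node: one script per fault cause.
recovery_node = {
    "disk_full": lambda: "cleaned log files on system disk",
    "abusive_ip": lambda: "limited access for one IP",
}
recovery_result = recover("disk_full", recovery_node)
```

Returning a message for an unknown cause, rather than raising, is a design choice of this sketch so that an unmatched fault cause does not interrupt the rest of the flow.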
On the basis of the above embodiment, the target scene arrangement flow comprises dependent component troubleshooting and self-checking, wherein the fault analysis node corresponding to the dependent component troubleshooting is executed before the fault analysis node corresponding to the self-checking; the fault analysis module 402 is specifically configured to:
acquiring dependency index data of a dependency component corresponding to the monitored object; the dependent component refers to a component of the monitored object with a dependent relationship in the running process;
based on the dependency index data, executing a fault analysis node corresponding to the dependency component troubleshooting to perform fault analysis on the dependency component;
and if the dependent component has no fault, executing a fault analysis node corresponding to the self-checking so as to perform fault analysis on the monitored object.
On the basis of the above embodiment, the apparatus further includes a first data acquisition module configured to:
and collecting the dependency index data of the dependency component corresponding to the monitored object.
On the basis of the above embodiment, the apparatus further includes a second data acquisition module configured to:
and collecting index data of the monitored object, and performing fault monitoring based on the index data.
On the basis of the above embodiment, the apparatus further includes a scene arrangement module for:
acquiring fault analysis scripts corresponding to each monitored object from the tool set; the tool set stores a plurality of executable fault analysis scripts;
acquiring the execution sequence of the fault analysis script;
and generating a scene arrangement flow of the corresponding monitored object based on the fault analysis script and the corresponding execution sequence.
Fig. 5 is a schematic diagram of the physical structure of an electronic device according to an embodiment of the present application. As shown in Fig. 5, the electronic device includes: a processor (processor) 501, a memory (memory) 502, and a bus 503; wherein,
the processor 501 and the memory 502 complete communication with each other via the bus 503;
the processor 501 is configured to invoke the program instructions in the memory 502 to perform the methods provided in the above method embodiments, for example, including: if the monitored object is monitored to be faulty, acquiring a target scene arrangement flow corresponding to the monitored object; the target scene arrangement flow is generated in advance and comprises a plurality of fault analysis nodes and execution logic relations of the fault analysis nodes; and sequentially executing the fault analysis nodes according to the execution logic relation based on the target scene arrangement flow so as to perform fault analysis.
The processor 501 may be an integrated circuit chip having signal processing capabilities. The processor 501 may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It may implement or perform the various methods, steps, and logical blocks disclosed in embodiments of the present application. A general-purpose processor may be a microprocessor or any conventional processor or the like.
Memory 502 may include, but is not limited to, random access memory (Random Access Memory, RAM), read-only memory (Read-Only Memory, ROM), programmable read-only memory (Programmable Read-Only Memory, PROM), erasable programmable read-only memory (Erasable Programmable Read-Only Memory, EPROM), electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), and the like.
The present embodiment discloses a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, are capable of performing the methods provided by the above-described method embodiments, for example comprising: if the monitored object is monitored to be faulty, acquiring a target scene arrangement flow corresponding to the monitored object; the target scene arrangement flow is generated in advance and comprises a plurality of fault analysis nodes and execution logic relations of the fault analysis nodes; and sequentially executing the fault analysis nodes according to the execution logic relation based on the target scene arrangement flow so as to perform fault analysis.
The present embodiment provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the methods provided by the above-described method embodiments, for example, including: if the monitored object is monitored to be faulty, acquiring a target scene arrangement flow corresponding to the monitored object; the target scene arrangement flow is generated in advance and comprises a plurality of fault analysis nodes and execution logic relations of the fault analysis nodes; and sequentially executing the fault analysis nodes according to the execution logic relation based on the target scene arrangement flow so as to perform fault analysis.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The above-described apparatus embodiments are merely illustrative. For example, the division of the units is merely a logical function division, and there may be other manners of division in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some communication interfaces, devices or units, and may be in electrical, mechanical or other form.
Further, the units described as separate units may or may not be physically separate, and units displayed as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Furthermore, functional modules in various embodiments of the present application may be integrated together to form a single portion, or each module may exist alone, or two or more modules may be integrated to form a single portion.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application, and various modifications and variations may be suggested to one skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application should be included in the protection scope of the present application.

Claims (10)

1. A method of fault analysis, comprising:
if the monitored object is monitored to be faulty, acquiring a target scene arrangement flow corresponding to the monitored object; the target scene arrangement flow is generated in advance and comprises a plurality of fault analysis nodes and execution logic relations of the fault analysis nodes;
and sequentially executing the fault analysis nodes according to the execution logic relation based on the target scene arrangement flow so as to perform fault analysis.
2. The method of claim 1, wherein the fault analysis node comprises a fault analysis script, and the sequentially executing the fault analysis nodes according to the execution logic relation comprises:
and sequentially executing the fault analysis scripts corresponding to the fault analysis nodes according to the execution logic relation.
3. The method of claim 1, wherein the target scene arrangement flow further comprises a fault recovery node; the fault recovery node comprises fault recovery scripts corresponding to each fault cause respectively; the method further comprises the steps of:
determining a fault reason, and determining a target fault recovery script from the fault recovery nodes according to the fault reason;
and executing the target fault recovery script to recover the fault of the monitored object.
4. The method of claim 1, wherein the target scene arrangement flow comprises dependent component troubleshooting and self-checking, the fault analysis node corresponding to the dependent component troubleshooting being executed before the fault analysis node corresponding to the self-checking; the sequentially executing the fault analysis nodes according to the execution logic relation comprises the following steps:
acquiring dependency index data of a dependency component corresponding to the monitored object; the dependent component refers to a component of the monitored object with a dependent relationship in the running process;
based on the dependency index data, executing a fault analysis node corresponding to the dependency component troubleshooting to perform fault analysis on the dependency component;
and if the dependent component has no fault, executing a fault analysis node corresponding to the self-checking so as to perform fault analysis on the monitored object.
5. The method of claim 4, wherein prior to monitoring that the monitored object is malfunctioning, the method further comprises:
and collecting the dependency index data of the dependency component corresponding to the monitored object.
6. The method of claim 1, wherein prior to monitoring that the monitored object is malfunctioning, the method further comprises:
and collecting index data of the monitored object, and performing fault monitoring based on the index data.
7. The method according to any one of claims 1-6, further comprising:
acquiring fault analysis scripts corresponding to each monitored object from the tool set; the tool set stores a plurality of executable fault analysis scripts;
acquiring the execution sequence of the fault analysis script;
and generating a scene arrangement flow of the corresponding monitored object based on the fault analysis script and the corresponding execution sequence.
8. A fault analysis apparatus, comprising:
the flow acquisition module is used for acquiring a target scene arrangement flow corresponding to the monitored object if the monitored object is monitored to be faulty; the target scene arrangement flow is generated in advance and comprises a plurality of fault analysis nodes and execution logic relations of the fault analysis nodes;
and the fault analysis module is used for sequentially executing the fault analysis nodes according to the execution logic relationship based on the target scene arrangement flow so as to perform fault analysis.
9. An electronic device, comprising: a processor, a memory, and a bus, wherein,
the processor and the memory complete communication with each other through the bus;
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1-7.
10. A non-transitory computer readable storage medium storing computer instructions which, when executed by a computer, cause the computer to perform the method of any of claims 1-7.
CN202310442661.9A 2023-04-23 2023-04-23 Fault analysis method, device, electronic equipment and storage medium Pending CN116414609A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310442661.9A CN116414609A (en) 2023-04-23 2023-04-23 Fault analysis method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310442661.9A CN116414609A (en) 2023-04-23 2023-04-23 Fault analysis method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116414609A true CN116414609A (en) 2023-07-11

Family

ID=87056156

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310442661.9A Pending CN116414609A (en) 2023-04-23 2023-04-23 Fault analysis method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116414609A (en)

Similar Documents

Publication Publication Date Title
CN106789306B (en) Method and system for detecting, collecting and recovering software fault of communication equipment
US6996751B2 (en) Method and system for reduction of service costs by discrimination between software and hardware induced outages
CN111814999B (en) Fault work order generation method, device and equipment
US7814369B2 (en) System and method for detecting combinations of perfomance indicators associated with a root cause
CN110716842B (en) Cluster fault detection method and device
JPH0644242B2 (en) How to solve problems in computer systems
WO2016188100A1 (en) Information system fault scenario information collection method and system
CN112529223A (en) Equipment fault repair method and device, server and storage medium
CN113672456A (en) Modular self-monitoring method, system, terminal and storage medium of application platform
CN113312200A (en) Event processing method and device, computer equipment and storage medium
CN110011854B (en) MDS fault processing method, device, storage system and computer readable storage medium
CN108710545A (en) A kind of remote monitoring fault self-recovery system
CN111752741A (en) System performance detection method and device
CN115766402B (en) Method and device for filtering server fault root cause, storage medium and electronic device
CN111159051B (en) Deadlock detection method, deadlock detection device, electronic equipment and readable storage medium
CN116414609A (en) Fault analysis method, device, electronic equipment and storage medium
CN111813872B (en) Method, device and equipment for generating fault troubleshooting model
CN112131090B (en) Service system performance monitoring method, device, equipment and medium
CN116089141A (en) Database fault repairing method and device, emergency library system equipment and storage medium
CN116264541A (en) Multi-dimension-based database disaster recovery method and device
CN115408271A (en) One-stop closed loop test method, system, equipment and medium
CN113676356A (en) Alarm information processing method and device, electronic equipment and readable storage medium
CN111835566A (en) System fault management method, device and system
CN112068935A (en) Method, device and equipment for monitoring deployment of kubernets program
CN113138872A (en) Abnormal processing device and method for database system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination