CN114629776A - Fault analysis method and device based on graph model - Google Patents

Info

Publication number
CN114629776A
Authority
CN
China
Prior art keywords
log
equipment
fault
graph
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011453509.3A
Other languages
Chinese (zh)
Other versions
CN114629776B (en)
Inventor
张勉知
刘惜吾
程亚锋
叶晓斌
陈孟尝
曾昭才
张园
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China United Network Communications Group Co Ltd
Original Assignee
China United Network Communications Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China United Network Communications Group Co Ltd filed Critical China United Network Communications Group Co Ltd
Priority to CN202011453509.3A priority Critical patent/CN114629776B/en
Publication of CN114629776A publication Critical patent/CN114629776A/en
Application granted granted Critical
Publication of CN114629776B publication Critical patent/CN114629776B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/069 Management of faults, events, alarms or notifications using logs of notifications; Post-processing of notifications
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/12 Discovery or management of network topologies
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/145 Network analysis or design involving simulating, designing, planning or modelling of a network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/147 Network analysis or design for predicting network behaviour
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Abstract

The invention provides a fault analysis method and a fault analysis device based on a graph model, and the fault analysis method based on the graph model provided by the embodiment comprises the following steps: acquiring a first real-time log, and preprocessing the first real-time log according to a preset processing rule to obtain a first log sequence; performing anomaly detection on the first log sequence according to a preset anomaly detection mechanism to determine a log sequence to be processed; performing fault analysis on the log sequence to be processed according to a preset graph model, and determining root cause equipment with faults and root cause faults of the root cause equipment, wherein the graph model is used for representing equipment topological relation and transfer relation; and determining the predicted fault equipment and the predicted fault information according to the root cause equipment, the root cause fault and the graph model. By the fault analysis method based on the graph model, provided by the embodiment of the invention, the fault occurrence details are accurately positioned, and a foundation is laid for fault root cause diagnosis and rapid service recovery.

Description

Fault analysis method and device based on graph model
Technical Field
The invention relates to the technical field of communication, in particular to a fault analysis method and device based on a graph model.
Background
As networks continue to grow, more and more network devices operate in different scenarios. While running, these devices generate system logs that reflect real-time changes in their operating state and service state, and fault points can be located from those logs. As the volume of system logs grows explosively, an experience base built by experts in the traditional way can no longer support efficient and accurate fault diagnosis and analysis. Using Artificial Intelligence (AI) algorithms to find hidden hazards and locate faults from system logs has therefore become a research hotspot in the field of mobile communication.
In the prior art, AI algorithms are mainly applied to the fault analysis of logs, and no complete framework has been formed for mining fault propagation relationships from massive logs. For example, faults in which the logs of several network events are interleaved at the same time are difficult to locate in a timely and accurate manner.
Therefore, how to locate, in a timely and accurate manner, faults in which the logs of multiple network events are interleaved is a problem that urgently needs to be solved.
Disclosure of Invention
The invention provides a fault analysis method based on a graph model, which aims to realize accurate positioning of fault occurrence details and lay a foundation for fault root cause diagnosis and rapid service recovery.
In a first aspect, the present invention provides a fault analysis method based on a graph model, including:
acquiring a first real-time log, and preprocessing the first real-time log according to a preset processing rule to obtain a first log sequence, wherein the first real-time log is used for recording real-time change information of an equipment running state and a service state;
performing anomaly detection on the first log sequence according to a preset anomaly detection mechanism, and determining a log sequence to be processed, wherein the log sequence to be processed is used for recording the anomalous log sequence screened out by the anomaly detection mechanism;
performing fault analysis on the log sequence to be processed according to a preset graph model, and determining root cause equipment with faults and root cause faults of the root cause equipment, wherein the graph model is used for representing equipment topological relation and transfer relation;
and determining the predicted fault equipment and the predicted fault information according to the root cause equipment, the root cause fault and the graph model.
In one possible design, performing fault analysis on the log sequence to be processed according to a preset graph model, and determining the root cause device with a fault and the root cause fault of the root cause device, includes:
the graph model comprises a device topological graph, a fault transfer graph and a device transfer graph;
determining a first device topological graph corresponding to the log sequence to be processed according to the log sequence to be processed and the device topological graph; the device topological graph corresponds to all devices in a preset network coverage range;
determining a first fault transfer diagram and a first equipment transfer diagram corresponding to the first equipment topological diagram according to the first equipment topological diagram, the fault transfer diagram and the equipment transfer diagram; the equipment transfer graph corresponds to all equipment in a preset network coverage range, and the fault transfer graph corresponds to the equipment transfer graph;
and determining the root cause device and the root cause fault according to the first fault transfer diagram and the first equipment transfer diagram.
In one possible design, the pre-processing the first real-time log according to a preset processing rule to obtain a first log sequence, including:
extracting information of the first real-time log according to a preset key field, and determining the first real-time log after the information is extracted;
according to a preset filtering rule, filtering the first real-time log, and determining the filtered first real-time log;
and dividing a fault analysis domain and segmenting a time sequence of the equipment corresponding to the first real-time log according to a preset network planning rule and a clustering algorithm to obtain a first log sequence.
In one possible design, performing anomaly detection on the first log sequence according to a preset anomaly detection mechanism, and determining a log sequence to be processed, includes:
and performing key field matching processing on the first log sequence according to a preset abnormal log level, and determining a log sequence to be processed containing the abnormal log level.
In one possible design, before obtaining the first real-time log, the method further includes:
acquiring a historical log training set;
preprocessing the historical logs in the historical log training set according to a preset processing rule to obtain a historical log sequence;
performing anomaly detection on the historical log sequence according to a preset anomaly detection mechanism, and determining the historical log sequence to be processed; the history log sequence to be processed is used for recording the abnormal history log sequence screened by the abnormal detection mechanism;
determining an equipment transfer graph according to a historical log sequence to be processed and a preset equipment topological graph;
and determining a fault transfer diagram according to the historical log sequence to be processed, the equipment topological diagram and the equipment transfer diagram.
In a second aspect, the present invention provides a failure analysis apparatus based on a graph model, including:
the first processing module is used for acquiring a first real-time log and preprocessing the first real-time log according to a preset processing rule to obtain a first log sequence, wherein the first real-time log is used for recording real-time change information of an equipment running state and a service state;
the second processing module is used for performing anomaly detection on the first log sequence according to a preset anomaly detection mechanism and determining a log sequence to be processed, wherein the log sequence to be processed is used for recording the log sequence containing the abnormal log level; the abnormal log level is determined by the anomaly detection mechanism;
the first determining module is used for performing fault analysis on the log sequence to be processed according to a preset graph model, determining root cause equipment with faults and root cause faults of the root cause equipment, wherein the graph model is used for representing equipment topological relation and transfer relation;
and the second determining module is used for determining the predicted fault equipment and the predicted fault information according to the root cause equipment, the root cause fault and the graph model.
In one possible design, the first determining module is specifically configured to:
the graph model comprises an equipment topological graph, a fault transfer graph and an equipment transfer graph;
determining a first device topological graph corresponding to the log sequence to be processed according to the log sequence to be processed and the device topological graph; the device topological graph corresponds to all devices in a preset network coverage range;
determining a first fault transfer diagram and a first equipment transfer diagram corresponding to the first equipment topological diagram according to the first equipment topological diagram, the fault transfer diagram and the equipment transfer diagram; the equipment transfer graph corresponds to all equipment in a preset network coverage range, and the fault transfer graph corresponds to the equipment transfer graph;
and determining the root cause device and the root cause fault according to the first fault transfer diagram and the first equipment transfer diagram.
In one possible design, a first processing module is to:
extracting information of the first real-time log according to a preset key field, and determining the first real-time log after the information is extracted;
according to a preset filtering rule, filtering the first real-time log, and determining the filtered first real-time log;
and dividing a fault analysis domain and segmenting a time sequence of the equipment corresponding to the first real-time log according to a preset network planning rule and a clustering algorithm to obtain a first log sequence.
In one possible design, the second processing module is specifically configured to:
and performing key field matching processing on the first log sequence according to a preset abnormal log level, and determining a log sequence to be processed containing the abnormal log level.
In one possible design, the first processing module is further configured to:
acquiring a historical log training set;
preprocessing the historical logs in the historical log training set according to a preset processing rule to obtain a historical log sequence;
performing anomaly detection on the historical log sequence according to a preset anomaly detection mechanism, and determining the historical log sequence to be processed; the history log sequence to be processed is used for recording the abnormal history log sequence screened by the abnormal detection mechanism;
determining an equipment transfer graph according to a historical log sequence to be processed and a preset equipment topological graph;
and determining a fault transfer diagram according to the historical log sequence to be processed, the equipment topological diagram and the equipment transfer diagram.
In a third aspect, the present invention further provides a fault analysis platform, including:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform any one of the graph model based fault analysis methods of the first aspect via execution of executable instructions.
In a fourth aspect, an embodiment of the present invention further provides a storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements any one of the graph model-based fault analysis methods in the first aspect.
In a fifth aspect, an embodiment of the present invention further provides a computer program product, where the computer program product includes a computer program, and when the computer program is executed by a processor, the computer program implements any one of the failure analysis methods based on the graph model in the first aspect.
The invention provides a fault analysis method and a fault analysis device based on a graph model. A first log sequence is obtained by acquiring a first real-time log and preprocessing the first real-time log according to a preset processing rule, wherein the first real-time log is used for recording real-time change information of an equipment running state and a service state; anomaly detection is performed on the first log sequence according to a preset anomaly detection mechanism to determine a log sequence to be processed, wherein the log sequence to be processed is used for recording the anomalous log sequence screened out by the anomaly detection mechanism; fault analysis is performed on the log sequence to be processed according to a preset graph model to determine the root cause equipment with faults and the root cause faults of the root cause equipment, wherein the graph model is used for representing equipment topological relation and transfer relation; and the predicted fault equipment and the predicted fault information are determined according to the root cause equipment, the root cause fault and the graph model, so that faults in which the logs of multiple network events are interleaved can be located in a timely and accurate manner.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a diagram illustrating an application scenario of a graph model-based fault analysis method according to an exemplary embodiment of the present invention;
FIG. 2 is a flow diagram illustrating a graph model-based fault analysis method in accordance with an exemplary embodiment of the present invention;
FIG. 3 is a schematic flow chart illustrating log preprocessing in a graph model-based fault analysis method according to an exemplary embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating a domain partitioning of a graph model based fault analysis method according to an exemplary embodiment of the present invention;
FIG. 5 is a schematic diagram illustrating log sequence segmentation in a graph model-based fault analysis method according to an example embodiment of the present invention;
FIG. 6 is a schematic diagram illustrating a device topology in a graph model-based fault analysis method according to an example embodiment of the present invention;
FIG. 7 is a schematic diagram illustrating a device transition in a graph model-based failure analysis method according to an example embodiment of the present invention;
FIG. 8 is a schematic diagram illustrating a failover in a graph model-based failure analysis method according to an example embodiment of the present invention;
FIG. 9 is a schematic diagram of a graph model-based fault analysis apparatus according to an exemplary embodiment of the present invention;
FIG. 10 is a schematic structural diagram of a fault analysis platform according to an exemplary embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The following describes the technical solutions of the present invention and how to solve the above technical problems with specific embodiments. Embodiments of the present invention will be described below with reference to the accompanying drawings.
Fig. 1 is a diagram of an application scenario of a graph model-based fault analysis method according to an exemplary embodiment of the present invention. As shown in fig. 1, a history log 101 passes through log preprocessing 103 to obtain a history log sequence, and anomaly detection 104 then determines the abnormal history log sequences to be processed. The inference model 105 processes these abnormal history log sequences: it performs fault correlation statistics on them according to a preset device topology graph, that is, it counts the probability that each related device is the next to fail when a device in an abnormal history log sequence fails, which yields a device transfer graph 106. A fault transfer graph 107 is then determined from the abnormal history log sequences, the device topology graph, and the device transfer graph 106. The device topology graph, the device transfer graph 106, and the fault transfer graph 107 together form the graph model. When fault location is performed on a real-time log 102, the real-time log 102 passes through log preprocessing 103 to obtain a real-time log sequence, and anomaly detection 104 then determines the abnormal log sequence to be processed. Different key fields are selected to perform event separation 108 on the abnormal log sequence, and root cause inference 109 is performed according to the preset graph model to determine the root cause device with a fault and the root cause fault of that device. Root cause prediction 110 can further be performed according to the root cause device, the root cause fault, and the graph model, so as to predict the device that will fail next and the fault cause information of that device.
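The fault correlation statistics performed by the inference model 105 can be illustrated with a minimal sketch. The function name `build_device_transfer_graph`, the representation of each abnormal history log sequence as a time-ordered list of device names, and the choice to count only transitions between distinct, topologically adjacent devices are all assumptions made for illustration; the description does not fix these details.

```python
from collections import Counter, defaultdict


def build_device_transfer_graph(sequences, topology):
    """Estimate P(next failing device | current failing device).

    sequences: abnormal history log sequences, each a time-ordered list of
               device names (hypothetical representation).
    topology:  dict mapping each device to the set of its neighbours.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            # Assumption: only transitions between distinct devices that are
            # adjacent in the topology count as fault propagation.
            if cur != nxt and nxt in topology.get(cur, set()):
                counts[cur][nxt] += 1

    graph = {}
    for cur, nxt_counts in counts.items():
        total = sum(nxt_counts.values())
        # Normalise raw counts into transfer probabilities per source device.
        graph[cur] = {nxt: n / total for nxt, n in nxt_counts.items()}
    return graph


# Hypothetical three-device topology and history sequences.
topology = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B"}}
history = [["A", "B", "C"], ["A", "B"], ["A", "C"], ["B", "C"]]
transfer = build_device_transfer_graph(history, topology)
```

With this sample data, device A is followed by device B in two of its three observed transitions, so the edge A->B receives a higher weight than A->C.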
Fig. 2 is a schematic flowchart of a fault analysis method based on a graph model according to an exemplary embodiment of the present invention, and as shown in fig. 2, the fault analysis method based on a graph model according to the present embodiment includes:
step 201, obtaining a first real-time log, and preprocessing the first real-time log according to a preset processing rule to obtain a first log sequence, where the first real-time log is used to record real-time change information of an equipment running state and a service state.
Specifically, fig. 3 is a schematic flowchart illustrating a log preprocessing process in a graph model-based fault analysis method according to an exemplary embodiment of the present invention, and as shown in fig. 3, key field extraction, data filtering, and data sequence processing are performed on a first real-time log, and a detailed processing procedure is as follows.
Because log formats differ across manufacturers and devices, subsequent unified processing and analysis is inconvenient. Key information in each log therefore needs to be identified according to the log specification, a log structure constructed, and the log converted into a common format. The first real-time log is acquired, information is extracted from it according to preset key fields, and the first real-time log after information extraction is determined. Taking the device log of one manufacturer as an example, the structure of the parsed log is shown in Table 1.
Table 1
Time stamp | Host name | Module name | Log level | Information abstract | Log identification | Information count | Detailed information
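As a rough illustration of key field extraction, the sketch below parses one raw line into the eight fields of the parsed log structure. The raw line layout, the regular expression, and the names `ParsedLog` and `parse_line` are hypothetical; real device logs follow each manufacturer's own specification, which the description does not reproduce.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParsedLog:
    timestamp: str
    host: str
    module: str
    level: int
    digest: str
    log_id: str
    count: int
    detail: str


# Hypothetical raw layout, for illustration only:
#   <timestamp> <host> <module>/<level>/<digest>(<log id>)[<count>]: <detail>
LINE_RE = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<host>\S+)\s+"
    r"(?P<module>[^/\s]+)/(?P<level>\d)/(?P<digest>[^(\s]+)"
    r"\((?P<log_id>[^)]+)\)\[(?P<count>\d+)\]:\s*(?P<detail>.*)$"
)


def parse_line(line: str) -> Optional[ParsedLog]:
    """Return the structured log, or None when the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    f = m.groupdict()
    return ParsedLog(
        f["timestamp"], f["host"], f["module"], int(f["level"]),
        f["digest"], f["log_id"], int(f["count"]), f["detail"],
    )


record = parse_line(
    "2020-12-11T08:00:00 host1 IFNET/3/LINK_DOWN(l)[12]: Interface down"
)
```

Lines that do not match the expected layout return `None`, so they can be handed to the filtering step rather than raising an error.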
The first real-time log is filtered according to a preset filtering rule, and the filtered first real-time log is determined; the filtering rules are compiled in advance from expert experience. For example, because the volume of raw log data is large, it contains a certain amount of invalid data, such as alarms whose occurrence time is an illegal value or whose source information is undefined. In addition, some logs are irrelevant to the fault root causes to be analyzed. The raw logs therefore need to be cleaned to filter out interference items, noise, and the like.
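A minimal sketch of such data cleaning is given below. The three rules (an unparsable occurrence time, an undefined source, and a module on an ignore list), the dictionary record layout, and the `IRRELEVANT_MODULES` set are assumptions chosen to mirror the examples in the text; actual filtering rules come from expert experience.

```python
from datetime import datetime

# Hypothetical ignore list; real rules are collected from expert experience.
IRRELEVANT_MODULES = {"SHELL"}


def is_valid(record):
    """Return True when a parsed log passes the illustrative filtering rules."""
    try:
        # Rule 1: the occurrence time must be a legal ISO 8601 value.
        datetime.fromisoformat(record["timestamp"])
    except (KeyError, TypeError, ValueError):
        return False
    # Rule 2: the alarm source must be defined.
    if not record.get("host"):
        return False
    # Rule 3: drop modules assumed to be irrelevant to root cause analysis.
    return record.get("module") not in IRRELEVANT_MODULES


raw = [
    {"timestamp": "2020-12-11T08:00:00", "host": "host1", "module": "IFNET"},
    {"timestamp": "not-a-time", "host": "host1", "module": "IFNET"},
    {"timestamp": "2020-12-11T08:00:05", "host": "", "module": "IFNET"},
    {"timestamp": "2020-12-11T08:00:09", "host": "host2", "module": "SHELL"},
]
cleaned = [r for r in raw if is_valid(r)]
```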
According to a preset network planning rule and a clustering algorithm, the devices corresponding to the first real-time log are divided into fault analysis domains and their logs are segmented in time to obtain the first log sequence. Specifically, according to the preset network planning rule, related devices whose topological relations belong to the same minimum management domain are grouped into one fault analysis domain, and the logs of all devices in the domain are sorted by fault occurrence time so that fault propagation relations among the devices can be mined later. For example, fig. 4 is a schematic diagram illustrating fault analysis domain division in a fault analysis method based on a graph model according to an exemplary embodiment of the present invention. As shown in fig. 4, aggregation device 1 and aggregation device 2 are shared aggregation devices. Aggregation device 1, aggregation device 2, and access devices 1 to 4 form access ring 1; aggregation device 1, aggregation device 2, and access devices 5 to 8 form access ring 2. Apart from the shared aggregation devices 1 and 2, access ring 1 and access ring 2 do not intersect and are not associated, so a fault on a device in one access ring does not propagate to the other ring. Access ring 1 and access ring 2 are therefore independent fault analysis domains.
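The domain division of fig. 4 can be sketched as a lookup from device to fault analysis domain followed by a sort on occurrence time. The device names, the `DOMAINS` mapping, and the decision to place logs of the shared aggregation devices in both domains are assumptions; the description does not state how logs of shared devices are assigned.

```python
from collections import defaultdict

# Hypothetical device names mirroring the two access rings of fig. 4.
DOMAINS = {
    "ring1": {"agg1", "agg2", "acc1", "acc2", "acc3", "acc4"},
    "ring2": {"agg1", "agg2", "acc5", "acc6", "acc7", "acc8"},
}


def split_by_domain(logs, domains):
    """Group logs by fault analysis domain and sort each group by time."""
    per_domain = defaultdict(list)
    for log in logs:
        for name, devices in domains.items():
            # Assumption: a shared aggregation device contributes its logs
            # to every domain it belongs to.
            if log["host"] in devices:
                per_domain[name].append(log)
    return {
        name: sorted(items, key=lambda item: item["time"])
        for name, items in per_domain.items()
    }


logs = [
    {"host": "acc5", "time": 30},
    {"host": "acc1", "time": 20},
    {"host": "agg1", "time": 10},
]
by_domain = split_by_domain(logs, DOMAINS)
```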
A clustering algorithm, such as Density-Based Spatial Clustering of Applications with Noise (DBSCAN), is then used to derive the maximal density-connected sets of log samples from the density-reachability relation. Algorithm parameters such as the sample neighborhood distance threshold and the sample number threshold are adjusted iteratively according to evaluation criteria such as the silhouette coefficient, the number of clusters, and the actual segmentation effect, until suitable values are determined. For example, fig. 5 is a schematic diagram illustrating log sequence segmentation in a graph model-based fault analysis method according to an exemplary embodiment of the present invention; as shown in fig. 5, each dot in the figure can be regarded as an abstraction of one log, and the logs are arranged in order of printing time. With the DBSCAN algorithm, the logs inside one circle are grouped into one cluster, which segments the logs in time and separates logs from different time periods, so that the root cause device with a fault and the root cause fault can subsequently be located more accurately.
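To make the time sequence segmentation concrete, the following is a small pure-Python DBSCAN over one-dimensional log timestamps. It is a sketch under stated assumptions: timestamps are plain numbers in seconds, `eps` stands for the sample neighborhood distance threshold, `min_samples` stands for the sample number threshold (counting the point itself), and the sample values are invented. Parameter tuning with the silhouette coefficient is not shown.

```python
def dbscan_1d(times, eps, min_samples):
    """Label each timestamp with a cluster id, or -1 for noise."""
    n = len(times)
    labels = [None] * n

    def neighbours(i):
        # All points within eps of point i, including i itself.
        return [j for j in range(n) if abs(times[i] - times[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point: keep expanding
    return labels


# Hypothetical printing times in seconds: two bursts and one stray log.
times = [0, 1, 2, 50, 51, 52, 200]
labels = dbscan_1d(times, eps=5, min_samples=2)
```

Logs that share a label form one time segment, and logs labelled -1 are treated as noise rather than being attached to a neighbouring segment.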
And step 202, performing anomaly detection on the first log sequence according to a preset anomaly detection mechanism, and determining a log sequence to be processed, wherein the log sequence to be processed is used for recording the anomaly log sequence screened out by the anomaly detection mechanism.
Specifically, key field matching is performed on the first log sequence according to a preset abnormal log level, and a log sequence to be processed that contains the abnormal log level is determined. For example, different vendors define device log levels in a standardized way. Taking one vendor as an example, log levels 0 (Emergency), 1 (Alert), 2 (Critical), and 3 (Errors) represent an extremely urgent error, an error that needs to be corrected immediately, a more serious error, and an error, respectively. One sign of equipment failure is the generation of a large number of high-severity logs, that is, logs at levels 0 to 3. On this basis, the log level is selected as the key field that triggers an anomaly: when the first log sequence contains logs at these levels, it is determined to be a log sequence to be processed, in which a fault has most likely occurred.
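The level-based matching described above amounts to keeping every log sequence that contains at least one log at levels 0 to 3. The sketch below assumes each log is a dictionary with an integer `level` field; the names `ABNORMAL_LEVELS` and `select_pending` are illustrative.

```python
# Levels 0 (Emergency) to 3 (Errors) are treated as abnormal, as in the text.
ABNORMAL_LEVELS = {0, 1, 2, 3}


def select_pending(sequences, abnormal_levels=ABNORMAL_LEVELS):
    """Keep the log sequences that contain at least one abnormal-level log."""
    return [
        seq for seq in sequences
        if any(log["level"] in abnormal_levels for log in seq)
    ]


sequences = [
    [{"level": 6}, {"level": 5}],                # informational only
    [{"level": 6}, {"level": 3}, {"level": 2}],  # contains Errors and Critical
]
pending = select_pending(sequences)
```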
Step 203, performing fault analysis on the log sequence to be processed according to a preset graph model, and determining root cause equipment with faults and root cause faults of the root cause equipment, wherein the graph model is used for representing equipment topological relation and transfer relation.
Specifically, the graph model comprises an equipment topological graph, a fault transfer graph and an equipment transfer graph. A first equipment topological graph corresponding to the log sequence to be processed is determined according to the log sequence to be processed and the equipment topological graph; the equipment topological graph corresponds to all equipment in a preset network coverage range. A first fault transfer graph and a first equipment transfer graph corresponding to the first equipment topological graph are determined according to the first equipment topological graph, the fault transfer graph and the equipment transfer graph; the equipment transfer graph corresponds to all equipment in a preset network coverage range, and the fault transfer graph corresponds to the equipment transfer graph. The root cause equipment and the root cause fault are then determined according to the first fault transfer graph and the first equipment transfer graph.
For example, fig. 6 is a schematic diagram of a device topology in the graph model-based fault analysis method according to an exemplary embodiment of the present invention. As shown in fig. 6, the first device topology graph corresponding to the log sequence to be processed contains 7 devices: device A, device B, device C, device D, device E, device F, and device G. Fig. 7 is a schematic diagram of the device transfer graph corresponding to the topology of these 7 devices, and fig. 8 is a schematic diagram of the corresponding fault transfer graph. The abnormal log set of the log sequence to be processed is denoted {A:a, A:b, A:c, B:a, B:b, B:c, B:d, B:e, C:d, C:e}; that is, the network event generates 10 logs in total: device A has faults a, b, and c; device B has faults a, b, c, d, and e; and device C has faults d and e. The faults of these 3 devices are analyzed according to the fault transfer graph of fig. 8. For the fault set {a, b, c} of device A, the probability values of the directed edges are compared. For example, the probability values of the directed edges leaving each fault are: a->a is 0.5, a->b is 0.1, a->c is 0.2, a->d is 0.1, and a->e is 0.1; b->a is 0.1, b->b is 0.2, b->c is 0.4, b->d is 0.1, and b->e is 0.2; c->a is 0.2, c->b is 0.1, c->c is 0.2, c->d is 0.3, and c->e is 0.2. The directed edge with the highest probability value is a->a, so the root cause fault of device A is determined to be fault a.
Similarly, the root cause fault of device B is determined to be fault a, and the root cause fault of device C is determined to be fault e. The fault transfer relationship between devices is then analyzed according to the device transfer graph of fig. 7 by comparing the probability values of the directed edges of the device set {A, B, C}; if the directed edge with the maximum probability value is A->B, the root cause device is determined to be device A. Combining the above results, the root cause log of the network event is {A:a}; that is, the root cause device is device A, the root cause fault is fault a, and the remaining 9 logs are associated logs.
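The selection rule for device A in the example above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `root_cause` and the dict-of-dicts graph layout are introduced here, and breaking ties in favor of the first edge encountered is an assumption, since the description does not say how ties are resolved. The probability values are the ones quoted in the example.

```python
from typing import Dict, Iterable, Tuple

# transfer[src][dst] is the probability of the directed edge src -> dst.
Graph = Dict[str, Dict[str, float]]


def root_cause(observed: Iterable[str], transfer: Graph) -> Tuple[str, str, float]:
    """Return (source, target, probability) of the strongest directed edge
    leaving any observed node; the source is taken as the root cause."""
    best = None
    for src in observed:
        for dst, prob in transfer.get(src, {}).items():
            # Strict '>' keeps the first edge on ties (an assumption).
            if best is None or prob > best[2]:
                best = (src, dst, prob)
    if best is None:
        raise ValueError("no directed edges found for the observed nodes")
    return best


# Edge probabilities quoted in the example for faults a, b and c.
FAULT_TRANSFER: Graph = {
    "a": {"a": 0.5, "b": 0.1, "c": 0.2, "d": 0.1, "e": 0.1},
    "b": {"a": 0.1, "b": 0.2, "c": 0.4, "d": 0.1, "e": 0.2},
    "c": {"a": 0.2, "b": 0.1, "c": 0.2, "d": 0.3, "e": 0.2},
}

edge = root_cause(["a", "b", "c"], FAULT_TRANSFER)
print(edge)  # ('a', 'a', 0.5)
```

The same function could be applied to the device set {A, B, C} with the device transfer graph of fig. 7, whose probability values are not given in the text.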
In one possible design, the log sequence to be processed after anomaly detection and screening may be a log set covering a plurality of events, each of which produces its own root cause log and set of associated logs. Separating the logs within a time slice according to the characteristics of the events allows each network event to be analyzed in greater depth, so that the root cause log of each event is obtained by reasoning and the possible faults of each event are predicted. For example, different keywords are selected as indexes for separating logs into events according to the types of event features in the network, so that the cause of a failure can be located more accurately; for common network failures, the keywords are usually selected from ports, Internet Protocol (IP) addresses, and the like.
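One simple way to separate the logs of a time slice into events is to group them by the value of a chosen keyword field. The sketch below assumes each log is a dict that may carry a hypothetical `port` field; the function name `split_by_keyword` and the record layout are illustrative and are not taken from the description.

```python
from collections import defaultdict
from typing import Dict, Hashable, List


def split_by_keyword(entries: List[dict], key: str = "port") -> Dict[Hashable, List[dict]]:
    """Group log entries by the value of one keyword field.
    Entries that lack the field are collected under the key None."""
    events = defaultdict(list)
    for entry in entries:
        events[entry.get(key)].append(entry)
    return dict(events)


# Hypothetical logs from one time slice, touching two different ports.
logs = [
    {"device": "A", "fault": "a", "port": "GE0/0/1"},
    {"device": "B", "fault": "b", "port": "GE0/0/1"},
    {"device": "C", "fault": "d", "port": "GE0/0/2"},
]
events = split_by_keyword(logs, key="port")
```

Each resulting group can then be analyzed as a separate event with its own root cause log and associated logs.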
Step 204, determining the predicted fault equipment and the predicted fault information according to the root cause equipment, the root cause fault and the graph model.
Specifically, for example, on the basis of the root cause analysis result {A:a}, the directed edge with the maximum probability value from device A to other devices is obtained from the device transfer graph; when the device pointed to is device B, the directed edge with the maximum probability value from fault a to other faults is then determined from the fault transfer graph; and when the fault pointed to is fault b, the predicted faulty device is determined to be device B and the predicted fault information is determined to be fault b.
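The prediction step can be sketched as a search for the most probable outgoing edge to a different node, applied once to the device transfer graph and once to the fault transfer graph. The function name `most_likely_next` and all probability values below are illustrative assumptions chosen so that the outcome matches the case described in the text (device B, fault b); they are not taken from the figures.

```python
from typing import Dict

Graph = Dict[str, Dict[str, float]]


def most_likely_next(node: str, transfer: Graph) -> str:
    """Return the target of the highest-probability edge from `node`
    to a different node (self-loops are skipped)."""
    candidates = {dst: p for dst, p in transfer.get(node, {}).items() if dst != node}
    if not candidates:
        raise ValueError(f"no outgoing edge to another node from {node!r}")
    return max(candidates, key=candidates.get)


# Hypothetical edge probabilities, for illustration only.
DEVICE_TRANSFER: Graph = {"A": {"A": 0.3, "B": 0.5, "C": 0.2}}
FAULT_TRANSFER: Graph = {"a": {"a": 0.3, "b": 0.5, "c": 0.2}}

root_device, root_fault = "A", "a"           # root cause analysis result {A:a}
predicted_device = most_likely_next(root_device, DEVICE_TRANSFER)
predicted_fault = most_likely_next(root_fault, FAULT_TRANSFER)
print(predicted_device, predicted_fault)     # B b
```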
By preprocessing the real-time log, performing anomaly detection, and then performing fault root cause reasoning and fault prediction on the log according to a graph model based on the topological relation and transfer relation of the equipment, the processing method of steps 201-204 can locate the specific information of a fault more accurately.
The graph model is obtained by training on historical logs. Specifically, a historical log training set is acquired; the historical logs in the historical log training set are preprocessed according to a preset processing rule to obtain a historical log sequence; anomaly detection is performed on the historical log sequence according to a preset anomaly detection mechanism to determine a to-be-processed historical log sequence, which records the abnormal historical log sequence screened out by the anomaly detection mechanism; an equipment transfer graph is determined according to the to-be-processed historical log sequence and a preset equipment topology graph; and a fault transfer graph is determined according to the to-be-processed historical log sequence, the equipment topology graph, and the equipment transfer graph. The preprocessing and the anomaly detection can be performed according to the processing method of steps 201-202.
After the to-be-processed historical log sequence is determined, the probability that the next fault occurs on a related device when a given device has a fault is counted for each device, yielding a relation graph between faulty devices in which each directed edge carries a probability value. Taking devices A and B in fig. 6 and fig. 7 as an example, the edge from A to B represents the historical statistical probability that any fault on device A leads to any fault on device B; the edge from A to A represents the historical statistical probability that a fault on device A leads to another fault on device A.
The fault transfer graph is determined with reference to fig. 8. When considering whether two faults are connected, the fault transfer relationship between the two devices in the device transfer graph of fig. 7 needs to be taken into account. For example, if device A is topologically connected only to devices B and C, the next fault adjacent to a fault on device A can only be a fault on one of the three devices (A | B | C). Taking faults a and b as an example, the edge from a to b represents the historical statistical probability, counted over all devices (i.e., A/B/C/D/E/F/G), that a device has fault a and another device then has fault b.
For example, the historical frequency with which the next fault is B when a fault A occurs is counted, and the historical statistical probability value is obtained from that frequency. Suppose the historical frequencies of the other faults caused by the occurrence of fault A are as follows: A → B occurs n times, A → C occurs m times, and A → D occurs k times. Then the historical statistical probability values of the directed edges of fault A in these 3 cases are:
$$P(A \to B) = \frac{n}{n + m + k}$$
$$P(A \to C) = \frac{m}{n + m + k}$$
$$P(A \to D) = \frac{k}{n + m + k}$$
Similarly, the historical statistical probability values of all directed edges of all devices are obtained by counting the transfer frequencies between devices when faults occur.
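The frequency counting can be sketched as follows, assuming each probability is the normalized frequency, i.e. the count of one transition divided by the total count of all transitions leaving the same source (n + m + k in the example). The function name `transfer_probabilities` and the representation of the history as a list of (source, target) pairs are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def transfer_probabilities(
    transitions: Iterable[Tuple[str, str]],
) -> Dict[str, Dict[str, float]]:
    """Turn observed (source, target) transitions into edge probabilities
    by normalizing the counts of each source's outgoing transitions."""
    counts = defaultdict(Counter)
    for src, dst in transitions:
        counts[src][dst] += 1
    graph = {}
    for src, targets in counts.items():
        total = sum(targets.values())
        graph[src] = {dst: count / total for dst, count in targets.items()}
    return graph


# Hypothetical history with n = 3, m = 2, k = 5.
history = [("A", "B")] * 3 + [("A", "C")] * 2 + [("A", "D")] * 5
probs = transfer_probabilities(history)
print(probs["A"])  # {'B': 0.3, 'C': 0.2, 'D': 0.5}
```

The same routine applies to both graphs: fed with device-to-device transitions it yields a device transfer graph, and fed with fault-to-fault transitions it yields a fault transfer graph.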
With the method of steps 201-204, a first real-time log is acquired and preprocessed according to a preset processing rule to obtain a first log sequence, where the first real-time log is used to record real-time change information of the equipment running state and service state; anomaly detection is performed on the first log sequence according to a preset anomaly detection mechanism to determine a log sequence to be processed, where the log sequence to be processed is used to record the abnormal log sequence screened out by the anomaly detection mechanism; fault analysis is performed on the log sequence to be processed according to a preset graph model to determine the root cause equipment with faults and the root cause faults of the root cause equipment, where the graph model is used to represent the equipment topological relation and transfer relation; and the predicted fault equipment and the predicted fault information are determined according to the root cause equipment, the root cause fault and the graph model. In this way, the details of a fault can be accurately located, laying a foundation for fault root cause diagnosis and rapid service recovery.
FIG. 9 is a schematic diagram of a graph model-based fault analysis apparatus according to an exemplary embodiment of the present invention; as shown in fig. 9, the fault analysis apparatus 90 based on a graph model according to the present embodiment includes:
a first processing module 901, configured to obtain a first real-time log, and pre-process the first real-time log according to a preset processing rule to obtain a first log sequence, where the first real-time log is used to record real-time change information of an operating state and a service state of a device;
a second processing module 902, configured to perform anomaly detection on the first log sequence according to a preset anomaly detection mechanism, and determine a log sequence to be processed, where the log sequence to be processed is used to record a log sequence including an anomaly log level; the anomaly log level is determined by an anomaly detection mechanism;
a first determining module 903, configured to perform fault analysis on a log sequence to be processed according to a preset graph model, and determine a root cause device that has a fault and a root cause fault of the root cause device, where the graph model is used to represent a device topology relationship and a transfer relationship;
a second determining module 904, configured to determine a predicted failure device and predicted failure information according to the root cause device, the root cause failure, and the graph model.
In one possible design, the first determining module 903 is specifically configured to:
the graph model comprises an equipment topological graph, a fault transfer graph and an equipment transfer graph;
determining a first device topological graph corresponding to the log sequence to be processed according to the log sequence to be processed and the device topological graph; the device topological graph corresponds to all devices in a preset network coverage range;
determining a first fault transfer diagram and a first equipment transfer diagram corresponding to the first equipment topological diagram according to the first equipment topological diagram, the fault transfer diagram and the equipment transfer diagram; the equipment transfer graph corresponds to all equipment in a preset network coverage range, and the fault transfer graph corresponds to the equipment transfer graph;
and determining the root cause device and the root cause fault according to the first fault transfer diagram and the first equipment transfer diagram.
In one possible design, the first processing module 901 is specifically configured to:
extracting information of the first real-time log according to a preset key field, and determining the first real-time log after the information is extracted;
according to a preset filtering rule, filtering the first real-time log, and determining the filtered first real-time log;
and dividing a fault analysis domain and segmenting a time sequence of the equipment corresponding to the first real-time log according to a preset network planning rule and a clustering algorithm to obtain a first log sequence.
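The three preprocessing operations listed above (key field extraction, filtering, and division into fault analysis domains with time-series segmentation) can be sketched as follows. Everything concrete here is an assumption made for illustration: the log line format, the `heartbeat` filter word, the device-to-domain mapping, and the 5-minute gap. In particular, a fixed time-gap split is used in place of the clustering algorithm, which the description does not specify.

```python
import re
from datetime import datetime, timedelta

# Hypothetical line format: "<ISO timestamp> <device> <level> <message>"
LINE_RE = re.compile(r"^(?P<ts>\S+) (?P<device>\S+) (?P<level>\d) (?P<msg>.*)$")


def extract(line):
    """Extract the key fields from one raw log line, or None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return {
        "ts": datetime.fromisoformat(m["ts"]),
        "device": m["device"],
        "level": int(m["level"]),
        "msg": m["msg"],
    }


def preprocess(lines, domain_of, ignored=("heartbeat",), gap=timedelta(minutes=5)):
    """Return {domain: [sequence, ...]}; a new sequence starts whenever the
    time since the previous log in the same domain exceeds `gap`."""
    entries = [e for e in map(extract, lines) if e is not None]
    # Filtering rule (assumed): drop messages containing an ignored word.
    entries = [e for e in entries if not any(w in e["msg"] for w in ignored)]
    entries.sort(key=lambda e: e["ts"])
    sequences = {}
    for e in entries:
        domain = domain_of.get(e["device"])
        if domain is None:
            continue  # device outside every planned fault analysis domain
        seqs = sequences.setdefault(domain, [])
        if seqs and e["ts"] - seqs[-1][-1]["ts"] <= gap:
            seqs[-1].append(e)
        else:
            seqs.append([e])
    return sequences


raw = [
    "2020-12-11T10:00:00 A 3 link down",
    "2020-12-11T10:01:00 B 6 heartbeat ok",
    "2020-12-11T10:02:00 B 2 neighbor lost",
    "2020-12-11T10:30:00 A 3 link down",
]
result = preprocess(raw, domain_of={"A": "domain-1", "B": "domain-1"})
```

With this input, `result["domain-1"]` holds two sequences: the first with the 10:00 and 10:02 logs, the second with the 10:30 log.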
In one possible design, the second processing module 902 is specifically configured to:
and performing key field matching processing on the first log sequence according to a preset abnormal log level, and determining a log sequence to be processed containing the abnormal log level.
In one possible design, the first processing module 901 is further configured to:
acquiring a historical log training set;
preprocessing the historical logs in the historical log training set according to a preset processing rule to obtain a historical log sequence;
performing anomaly detection on the historical log sequence according to a preset anomaly detection mechanism, and determining the historical log sequence to be processed; the history log sequence to be processed is used for recording the abnormal history log sequence screened by the abnormal detection mechanism;
determining an equipment transfer graph according to a historical log sequence to be processed and a preset equipment topological graph;
and determining a fault transfer diagram according to the historical log sequence to be processed, the equipment topological diagram and the equipment transfer diagram.
Fig. 10 is a schematic structural diagram of a fault analysis platform according to an exemplary embodiment of the present invention. As shown in fig. 10, the fault analysis platform 10 provided in this embodiment includes:
a processor 1001; and
a memory 1002 for storing executable instructions of the processor, where the memory may also be a flash memory;
wherein the processor 1001 is configured to perform the various steps of the above-described method via execution of executable instructions. Reference may be made in particular to the description relating to the previous method embodiments.
Optionally, the memory 1002 may be separate from or integrated with the processor 1001.
When the memory 1002 is a device independent of the processor 1001, the fault analysis platform 10 may further include:
a bus 1003 for connecting the processor 1001 and the memory 1002.
In addition, embodiments of the present application further provide a computer-readable storage medium, in which computer-executable instructions are stored, and when at least one processor of the user equipment executes the computer-executable instructions, the user equipment performs the above-mentioned various possible methods.
Computer-readable media include both computer storage media and communication media, the latter including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC. Additionally, the ASIC may reside in user equipment. Of course, the processor and the storage medium may also reside as discrete components in a communication device.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and these modifications or substitutions do not depart from the spirit of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A fault analysis method based on a graph model is characterized by comprising the following steps:
acquiring a first real-time log, and preprocessing the first real-time log according to a preset processing rule to obtain a first log sequence, wherein the first real-time log is used for recording real-time change information of an equipment running state and a service state;
performing anomaly detection on the first log sequence according to a preset anomaly detection mechanism, and determining a log sequence to be processed, wherein the log sequence to be processed is used for recording the anomaly log sequence screened out by the anomaly detection mechanism;
performing fault analysis on the log sequence to be processed according to a preset graph model, and determining root cause equipment with faults and root cause faults of the root cause equipment, wherein the graph model is used for representing equipment topological relation and transfer relation;
and determining predicted fault equipment and predicted fault information according to the root cause equipment, the root cause fault and the graph model.
2. The method according to claim 1, wherein the performing fault analysis on the log sequence to be processed according to a preset graph model to determine a root cause device with a fault and a root cause fault of the root cause device comprises:
the graph model comprises an equipment topological graph, a fault transfer graph and an equipment transfer graph;
determining a first device topological graph corresponding to the log sequence to be processed according to the log sequence to be processed and the device topological graph; the device topological graph corresponds to all devices in a preset network coverage range;
determining a first fault transfer diagram and a first equipment transfer diagram corresponding to the first equipment topological diagram according to the first equipment topological diagram, the fault transfer diagram and the equipment transfer diagram; the equipment transfer graph corresponds to all equipment in the preset network coverage range, and the fault transfer graph corresponds to the equipment transfer graph;
determining the root cause device and the root cause fault according to the first fault transfer diagram and the first equipment transfer diagram.
3. The method according to claim 1, wherein the preprocessing the first real-time log according to a preset processing rule to obtain a first log sequence comprises:
extracting information of the first real-time log according to a preset key field, and determining the first real-time log after information extraction;
according to a preset filtering rule, filtering the first real-time log, and determining the first real-time log after filtering;
and dividing a fault analysis domain and segmenting a time sequence of the equipment corresponding to the first real-time log according to a preset network planning rule and a clustering algorithm to obtain a first log sequence.
4. The method according to claim 3, wherein the performing anomaly detection on the first log sequence according to a preset anomaly detection mechanism to determine a log sequence to be processed comprises:
and performing key field matching processing on the first log sequence according to a preset abnormal log level, and determining a log sequence to be processed containing the abnormal log level.
5. The method of any of claims 1-4, wherein prior to obtaining the first real-time log, further comprising:
acquiring a historical log training set;
preprocessing the historical logs in the historical log training set according to a preset processing rule to obtain a historical log sequence;
performing anomaly detection on the historical log sequence according to a preset anomaly detection mechanism, and determining a historical log sequence to be processed; the history log sequence to be processed is used for recording the abnormal history log sequence screened by the abnormal detection mechanism;
determining the equipment transfer graph according to the historical log sequence to be processed and the preset equipment topological graph;
and determining the fault transfer diagram according to the historical log sequence to be processed, the equipment topological diagram and the equipment transfer diagram.
6. A graph model-based fault analysis apparatus, comprising:
the first processing module is used for acquiring a first real-time log and preprocessing the first real-time log according to a preset processing rule to obtain a first log sequence, wherein the first real-time log is used for recording real-time change information of an equipment running state and a service state;
the second processing module is used for carrying out abnormity detection on the first log sequence according to a preset abnormity detection mechanism and determining a log sequence to be processed, wherein the log sequence to be processed is used for recording the log sequence containing the abnormal log level; the anomaly log level is determined by the anomaly detection mechanism;
the first determining module is used for performing fault analysis on the log sequence to be processed according to a preset graph model, determining root cause equipment with faults and root cause faults of the root cause equipment, wherein the graph model is used for representing equipment topological relation and transfer relation;
and the second determining module is used for determining the predicted fault equipment and the predicted fault information according to the root cause equipment, the root cause fault and the graph model.
7. The apparatus of claim 6, wherein the first determining module is specifically configured to:
the graph model comprises an equipment topological graph, a fault transfer graph and an equipment transfer graph;
determining a first device topological graph corresponding to the log sequence to be processed according to the log sequence to be processed and the device topological graph; the device topological graph corresponds to all devices in a preset network coverage range;
determining a first fault transfer diagram and a first equipment transfer diagram corresponding to the first equipment topological diagram according to the first equipment topological diagram, the fault transfer diagram and the equipment transfer diagram; the equipment transfer graph corresponds to all equipment in the preset network coverage range, and the fault transfer graph corresponds to the equipment transfer graph;
determining the root cause device and the root cause fault according to the first fault transfer diagram and the first equipment transfer diagram.
8. A fault analysis platform, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the graph model based fault analysis method of any of claims 1-5 via execution of the executable instructions.
9. A storage medium on which a computer program is stored, the program, when executed by a processor, implementing the graph model-based fault analysis method of any one of claims 1 to 5.
10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the graph model based failure analysis method according to any one of claims 1 to 5.
CN202011453509.3A 2020-12-11 2020-12-11 Fault analysis method and device based on graph model Active CN114629776B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011453509.3A CN114629776B (en) 2020-12-11 2020-12-11 Fault analysis method and device based on graph model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011453509.3A CN114629776B (en) 2020-12-11 2020-12-11 Fault analysis method and device based on graph model

Publications (2)

Publication Number Publication Date
CN114629776A true CN114629776A (en) 2022-06-14
CN114629776B CN114629776B (en) 2023-05-30

Family

ID=81895792

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011453509.3A Active CN114629776B (en) 2020-12-11 2020-12-11 Fault analysis method and device based on graph model

Country Status (1)

Country Link
CN (1) CN114629776B (en)

Also Published As

Publication number Publication date
CN114629776B (en) 2023-05-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant