CN114629776B - Fault analysis method and device based on graph model - Google Patents

Fault analysis method and device based on graph model

Info

Publication number
CN114629776B
Authority
CN
China
Prior art keywords
log
fault
equipment
graph
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011453509.3A
Other languages
Chinese (zh)
Other versions
CN114629776A (en)
Inventor
张勉知
刘惜吾
程亚锋
叶晓斌
陈孟尝
曾昭才
张园
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China United Network Communications Group Co Ltd
Original Assignee
China United Network Communications Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China United Network Communications Group Co Ltd filed Critical China United Network Communications Group Co Ltd
Priority to CN202011453509.3A priority Critical patent/CN114629776B/en
Publication of CN114629776A publication Critical patent/CN114629776A/en
Application granted granted Critical
Publication of CN114629776B publication Critical patent/CN114629776B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L41/069 Management of faults, events, alarms or notifications using logs of notifications; Post-processing of notifications
    • H04L41/12 Discovery or management of network topologies
    • H04L41/14 Network analysis or design
    • H04L41/145 Network analysis or design involving simulating, designing, planning or modelling of a network
    • H04L41/147 Network analysis or design for predicting network behaviour
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Debugging And Monitoring (AREA)
  • Test And Diagnosis Of Digital Computers (AREA)

Abstract

The invention provides a fault analysis method and device based on a graph model. The method comprises the following steps: acquiring a first real-time log, and preprocessing the first real-time log according to a preset processing rule to obtain a first log sequence; performing anomaly detection on the first log sequence according to a preset anomaly detection mechanism to determine a log sequence to be processed; performing fault analysis on the log sequence to be processed according to a preset graph model to determine the faulty root cause device and the root cause fault of that device, wherein the graph model represents the device topology relation and the transfer relations; and determining the predicted faulty device and the predicted fault information according to the root cause device, the root cause fault and the graph model. The method accurately locates the details of a fault occurrence and lays a foundation for fault root cause diagnosis and rapid service recovery.

Description

Fault analysis method and device based on graph model
Technical Field
The present invention relates to the field of communications technologies, and in particular, to a fault analysis method and apparatus based on a graph model.
Background
As network scale continues to expand, more and more network devices run in different scenarios. During operation, network devices generate system logs that reflect real-time changes in device running state and service state, and fault points can be located through these logs. When the number of system logs grows explosively, an experience base built by experts in the traditional way can no longer support efficient and accurate fault diagnosis and analysis over such massive logs. Using artificial intelligence (AI) algorithms to find hidden risks and locate faults from system logs has therefore become a research hotspot in the field of mobile communication.
In the prior art, AI-based fault analysis of logs mainly targets single network events, and no complete framework has been formed for mining fault propagation relations from massive logs. For example, faults that occur at the same moment and at the intersection of several network event logs are difficult to resolve in a timely and accurate manner.
Therefore, how to locate faults at the intersection of multiple network event logs in a timely and accurate manner is a problem to be solved.
Disclosure of Invention
The invention provides a fault analysis method based on a graph model, which is used for realizing the accurate positioning of fault occurrence details and laying a foundation for fault root cause diagnosis and service quick recovery.
In a first aspect, the present invention provides a fault analysis method based on a graph model, including:
acquiring a first real-time log, preprocessing the first real-time log according to a preset processing rule to obtain a first log sequence, wherein the first real-time log is used for recording real-time change information of equipment operation state and service state;
performing anomaly detection on the first log sequence according to a preset anomaly detection mechanism, and determining a log sequence to be processed, wherein the log sequence to be processed is used for recording the anomaly log sequence screened by the anomaly detection mechanism;
performing fault analysis on the log sequence to be processed according to a preset graph model, and determining root cause equipment with faults and root cause faults of the root cause equipment, wherein the graph model is used for representing equipment topological relation and transfer relation;
and determining the predicted fault equipment and the predicted fault information according to the root cause equipment, the root cause fault and the graph model.
In one possible design, performing fault analysis on the log sequence to be processed according to a preset graph model, and determining the faulty root cause device and the root cause fault of the root cause device, includes:
the graph model comprises a device topology graph, a fault transfer graph and a device transfer graph;
determining a first device topology graph corresponding to the log sequence to be processed according to the log sequence to be processed and the device topology graph; the device topology graph corresponds to all devices in a preset network coverage area;
determining a first fault transfer graph and a first device transfer graph corresponding to the first device topology graph according to the first device topology graph, the fault transfer graph and the device transfer graph; the device transfer graph corresponds to all devices in the preset network coverage area, and the fault transfer graph corresponds to the device transfer graph;
determining the root cause device and the root cause fault according to the first fault transfer graph and the first device transfer graph.
In one possible design, preprocessing the first real-time log according to a preset processing rule to obtain a first log sequence, including:
information extraction is carried out on the first real-time log according to a preset key field, and the first real-time log after information extraction is determined;
according to a preset filtering rule, filtering the first real-time log, and determining the first real-time log after the filtering;
and dividing fault analysis domains and segmenting time sequences of equipment corresponding to the first real-time log according to a preset network planning rule and a clustering algorithm to obtain a first log sequence.
In one possible design, the detecting the abnormality of the first log sequence according to a preset abnormality detection mechanism, and determining the log sequence to be processed includes:
and carrying out key field matching processing on the first log sequence according to the preset abnormal log level, and determining a log sequence to be processed containing the abnormal log level.
In one possible design, before the first real-time log is obtained, the method further includes:
acquiring a history log training set;
preprocessing a history log in a history log training set according to a preset processing rule to obtain a history log sequence;
performing anomaly detection on the history log sequence according to a preset anomaly detection mechanism, and determining a history log sequence to be processed; the history log sequence to be processed is used for recording the abnormal history log sequence screened out by the abnormality detection mechanism;
determining a device transfer diagram according to the history log sequence to be processed and a preset device topological diagram;
and determining a fault transfer diagram according to the history log sequence to be processed, the equipment topological diagram and the equipment transfer diagram.
In a second aspect, the present invention provides a fault analysis apparatus based on a graph model, including:
the first processing module is used for acquiring a first real-time log, preprocessing the first real-time log according to a preset processing rule to obtain a first log sequence, and recording real-time change information of equipment operation state and service state by the first real-time log;
the second processing module is used for carrying out anomaly detection on the first log sequence according to a preset anomaly detection mechanism, determining a log sequence to be processed, and the log sequence to be processed is used for recording log sequences containing anomaly log levels; the exception log level is determined by an exception detection mechanism;
the first determining module is used for carrying out fault analysis on the log sequence to be processed according to a preset graph model, determining the root cause equipment with faults and the root cause faults of the root cause equipment, and the graph model is used for representing the topological relation and the transfer relation of the equipment;
and the second determining module is used for determining the predicted fault equipment and the predicted fault information according to the root cause equipment, the root cause fault and the graph model.
In one possible design, the first determining module is specifically configured to:
the graph model comprises a device topology graph, a fault transfer graph and a device transfer graph;
determining a first device topology graph corresponding to the log sequence to be processed according to the log sequence to be processed and the device topology graph; the device topology graph corresponds to all devices in a preset network coverage area;
determining a first fault transfer graph and a first device transfer graph corresponding to the first device topology graph according to the first device topology graph, the fault transfer graph and the device transfer graph; the device transfer graph corresponds to all devices in the preset network coverage area, and the fault transfer graph corresponds to the device transfer graph;
determining the root cause device and the root cause fault according to the first fault transfer graph and the first device transfer graph.
In one possible design, the first processing module is configured to:
information extraction is carried out on the first real-time log according to a preset key field, and the first real-time log after information extraction is determined;
according to a preset filtering rule, filtering the first real-time log, and determining the first real-time log after the filtering;
and dividing fault analysis domains and segmenting time sequences of equipment corresponding to the first real-time log according to a preset network planning rule and a clustering algorithm to obtain a first log sequence.
In one possible design, the second processing module is specifically configured to:
and carrying out key field matching processing on the first log sequence according to the preset abnormal log level, and determining a log sequence to be processed containing the abnormal log level.
In one possible design, the first processing module is further configured to:
acquiring a history log training set;
preprocessing a history log in a history log training set according to a preset processing rule to obtain a history log sequence;
performing anomaly detection on the history log sequence according to a preset anomaly detection mechanism, and determining a history log sequence to be processed; the history log sequence to be processed is used for recording the abnormal history log sequence screened out by the abnormality detection mechanism;
determining a device transfer diagram according to the history log sequence to be processed and a preset device topological diagram;
and determining a fault transfer diagram according to the history log sequence to be processed, the equipment topological diagram and the equipment transfer diagram.
In a third aspect, the present invention also provides a fault analysis platform, including:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform any of the graph model based fault analysis methods of the first aspect via execution of the executable instructions.
In a fourth aspect, an embodiment of the present invention further provides a storage medium having stored thereon a computer program, which when executed by a processor implements any one of the graph model-based fault analysis methods of the first aspect.
In a fifth aspect, an embodiment of the present invention further provides a computer program product, which includes a computer program, where the program when executed by a processor implements any one of the fault analysis methods based on the graph model in the first aspect.
The invention provides a fault analysis method and device based on a graph model, which are characterized in that a first real-time log is obtained, and is preprocessed according to a preset processing rule to obtain a first log sequence, wherein the first real-time log is used for recording real-time change information of equipment running state and service state; performing anomaly detection on the first log sequence according to a preset anomaly detection mechanism, and determining a log sequence to be processed, wherein the log sequence to be processed is used for recording the anomaly log sequence screened by the anomaly detection mechanism; performing fault analysis on the log sequence to be processed according to a preset graph model, and determining root cause equipment with faults and root cause faults of the root cause equipment, wherein the graph model is used for representing equipment topological relation and transfer relation; according to the root cause equipment, the root cause fault and the graph model, the prediction fault equipment and the prediction fault information are determined, so that the fault problem caused by the intersection of a plurality of network event logs can be timely and accurately positioned.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings that are needed in the embodiments or the description of the prior art will be briefly described below, it will be obvious that the drawings in the following description are some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort to a person skilled in the art.
FIG. 1 is an application scenario diagram of a graph model-based fault analysis method according to an example embodiment of the present invention;
FIG. 2 is a flow chart of a graph model-based fault analysis method according to an example embodiment of the present invention;
FIG. 3 is a flow chart illustrating log preprocessing in a graph model-based fault analysis method according to an example embodiment of the present invention;
FIG. 4 is a schematic diagram of fault analysis domain partitioning in a graph model-based fault analysis method according to an example embodiment of the present invention;
FIG. 5 is a log sequence segmentation schematic diagram of a graph model-based fault analysis method according to an example embodiment of the present invention;
FIG. 6 is a schematic diagram of a device topology in a graph model-based fault analysis method according to an example embodiment of the present invention;
FIG. 7 is a schematic diagram of a device transfer graph in a graph model-based fault analysis method according to an example embodiment of the present invention;
FIG. 8 is a schematic diagram of a fault transfer graph in a graph model-based fault analysis method according to an example embodiment of the present invention;
FIG. 9 is a schematic diagram of a graph model-based fault analysis apparatus according to an example embodiment of the present invention;
FIG. 10 is a schematic structural diagram of a fault analysis platform according to an example embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented, for example, in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The following describes the technical scheme of the present invention and how the technical scheme of the present application solves the above technical problems in detail with specific embodiments. Embodiments of the present invention will be described below with reference to the accompanying drawings.
Fig. 1 is an application scenario diagram of a graph model-based fault analysis method according to an exemplary embodiment of the present invention. As shown in Fig. 1, a history log 101 undergoes log preprocessing 103 to obtain a history log sequence, and anomaly detection 104 then determines the abnormal history log sequences to be processed. An inference model 105 is built from the abnormal history log sequences: fault correlation statistics are computed over the abnormal history log sequences according to a preset device topology graph, that is, for each device that fails in the abnormal history log sequences, the probability of each related device failing next is counted, which yields a device transfer graph 106; a fault transfer graph 107 is then determined from the abnormal history log sequences, the device topology graph and the device transfer graph 106. The device topology graph, the device transfer graph 106 and the fault transfer graph 107 together form the graph model. When fault location is performed on a real-time log 102, log preprocessing 103 is first applied to the real-time log 102 to obtain a real-time log sequence, and anomaly detection 104 is then applied to the real-time log sequence to determine the abnormal log sequence to be processed. Event separation 108 is performed on the abnormal log sequence according to the different key fields selected, and root cause reasoning 109 is performed according to the preset graph model to determine the faulty root cause device and the root cause fault of that device. Root cause prediction 110 can then be performed according to the root cause device, the root cause fault and the graph model, so as to predict the device that will fail next and the fault cause information of that device.
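The fault correlation statistics behind the device transfer graph can be sketched as follows. This is a minimal illustration and not the patented procedure itself: the function name `build_device_transfer_graph`, the input shapes, and the rule of counting only transitions between distinct, topologically adjacent devices are all assumptions made for the example.

```python
from collections import defaultdict

def build_device_transfer_graph(sequences, topology):
    """Estimate P(next failing device | current failing device).

    sequences: abnormal history log sequences, each a time-ordered list of device names.
    topology:  dict mapping a device to the set of devices it is linked to.
    Both shapes are hypothetical, chosen only for this sketch.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            # Assumption: count only transitions between distinct, adjacent devices.
            if cur != nxt and nxt in topology.get(cur, set()):
                counts[cur][nxt] += 1
    graph = {}
    for cur, nxts in counts.items():
        total = sum(nxts.values())
        graph[cur] = {nxt: n / total for nxt, n in nxts.items()}
    return graph

topology = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B"}}
history = [["A", "A", "B", "C"], ["A", "B"], ["A", "C"], ["B", "C"]]
transfer_graph = build_device_transfer_graph(history, topology)
```

With this toy history, device A is followed by B twice and by C once, so the edge A->B receives probability 2/3 and A->C receives 1/3.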
Fig. 2 is a flow chart of a graph model-based fault analysis method according to an exemplary embodiment of the present invention, as shown in fig. 2, where the graph model-based fault analysis method provided in the present embodiment includes:
step 201, a first real-time log is obtained, and is preprocessed according to a preset processing rule, so that a first log sequence is obtained, and the first real-time log is used for recording real-time change information of equipment running state and service state.
Specifically, fig. 3 is a schematic flow chart of log preprocessing in a fault analysis method based on a graph model according to an exemplary embodiment of the present invention, and as shown in fig. 3, key field extraction, data filtering and data sequence processing are performed on a first real-time log, and detailed processing procedures are described below.
Because log formats differ across vendors and devices, subsequent unified processing and analysis is inconvenient. The key information in each log therefore needs to be identified according to the log specification, a log structure constructed, and the key information converted into a common format. A first real-time log is acquired, information is extracted from it according to preset key fields, and the first real-time log after information extraction is determined. For example, taking the log of one vendor's device, the log structure after parsing is shown in Table 1.
Table 1
Timestamp | Host name | Module name | Log level | Information digest | Log identifier | Information count | Detailed information
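As an illustration of key field extraction, the sketch below parses one raw line into the eight fields of the table. The raw line layout (space-separated, ISO timestamp first), the `LogRecord` type and the `parse_line` helper are hypothetical; real vendor formats differ and would need their own parsing rules.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class LogRecord:
    timestamp: datetime
    host: str
    module: str
    level: int
    digest: str
    log_id: str
    count: int
    detail: str

def parse_line(line):
    """Split a line of the assumed layout into the eight key fields."""
    # maxsplit=7 keeps the free-text detail field (which contains spaces) intact.
    ts, host, module, level, digest, log_id, count, detail = line.split(" ", 7)
    return LogRecord(datetime.fromisoformat(ts), host, module, int(level),
                     digest, log_id, int(count), detail)

record = parse_line(
    "2020-12-11T10:15:30 host-A IFNET 3 LINK_DOWN 1001 1 Interface GE0/0/1 went down"
)
```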
The first real-time log is then filtered according to a preset filtering rule, and the filtered first real-time log is determined; the filtering rules are compiled in advance from expert experience. For example, because the volume of original log data is large, it contains a certain amount of invalid data, such as alarms whose occurrence time is an invalid value or whose source information is undefined. In addition, some logs are irrelevant to the fault root cause being analyzed. The original logs therefore need to be cleaned to filter out interference items, noise and the like.
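A filtering pass of this kind might look like the following sketch. The three rules (invalid timestamp, undefined source, irrelevant module), the dictionary keys and the `IRRELEVANT_MODULES` set are assumptions for illustration; the actual rules come from expert experience and are not listed in the text.

```python
from datetime import datetime

IRRELEVANT_MODULES = {"SHELL"}  # hypothetical: modules unrelated to fault root causes

def parse_time(value):
    """Return a datetime, or None when the timestamp is missing or invalid."""
    try:
        return datetime.fromisoformat(value)
    except (TypeError, ValueError):
        return None

def keep(log):
    """Apply the assumed filtering rules to one parsed log."""
    if parse_time(log.get("time")) is None:
        return False  # invalid occurrence time
    if not log.get("host"):
        return False  # undefined alarm source
    return log.get("module") not in IRRELEVANT_MODULES

raw_logs = [
    {"time": "2020-12-11T10:15:30", "host": "host-A", "module": "IFNET"},
    {"time": "not-a-time", "host": "host-B", "module": "IFNET"},
    {"time": "2020-12-11T10:15:31", "host": "", "module": "IFNET"},
    {"time": "2020-12-11T10:15:32", "host": "host-C", "module": "SHELL"},
]
cleaned = [log for log in raw_logs if keep(log)]
```

Only the first sample log survives: the others fail the timestamp, source and relevance checks respectively.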
Next, according to a preset network planning rule and a clustering algorithm, the devices corresponding to the first real-time log are divided into fault analysis domains and their logs are segmented into time sequences, which yields the first log sequence. Specifically, according to the preset network planning rule, related devices whose topological relation places them in the same minimum management domain are assigned to one fault analysis domain, and the logs of all devices in the domain are sorted by fault occurrence time, which facilitates the subsequent mining of fault propagation relations among the devices. For example, Fig. 4 is a schematic diagram of fault analysis domain division in a graph model-based fault analysis method according to an exemplary embodiment of the present invention. As shown in Fig. 4, aggregation device 1 and aggregation device 2 are shared aggregation devices. Aggregation device 1, aggregation device 2 and access devices 1 to 4 form access ring 1; aggregation device 1, aggregation device 2 and access devices 5 to 8 form access ring 2. Apart from the shared aggregation devices 1 and 2, access ring 1 and access ring 2 have no intersection or association, so when a device in one access ring fails, the fault of that ring does not propagate to other rings. Access ring 1 and access ring 2 are therefore independent fault analysis domains.
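The domain division for the two access rings of Fig. 4 can be sketched as below. The device identifiers, the tuple layout of a log, and the choice to copy logs of the shared aggregation devices into both domains are assumptions for this example; the text does not say how shared-device logs are assigned.

```python
from collections import defaultdict

# Domain membership following the Fig. 4 example; names are hypothetical.
DOMAINS = {
    "ring1": {"agg1", "agg2", "acc1", "acc2", "acc3", "acc4"},
    "ring2": {"agg1", "agg2", "acc5", "acc6", "acc7", "acc8"},
}

def split_by_domain(logs, domains):
    """Group (timestamp, device, message) logs per domain, ordered by time."""
    grouped = defaultdict(list)
    for log in logs:
        for name, members in domains.items():
            if log[1] in members:
                # Assumption: a shared device's log is placed in every domain it belongs to.
                grouped[name].append(log)
    return {name: sorted(items, key=lambda item: item[0])
            for name, items in grouped.items()}

logs = [(30, "acc5", "port down"), (10, "acc1", "link loss"), (20, "agg1", "cpu high")]
by_domain = split_by_domain(logs, DOMAINS)
```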
A clustering algorithm, such as the density-based clustering algorithm DBSCAN (Density-Based Spatial Clustering of Applications with Noise), is then used to derive maximal density-connected sets of log samples from the density-reachability relation. Algorithm parameters such as the sample neighborhood distance threshold and the sample count threshold are adjusted iteratively according to evaluation measures such as the silhouette coefficient, the number of clusters and the actual segmentation effect, so as to determine optimal values. For example, Fig. 5 is a log sequence segmentation schematic diagram in a graph model-based fault analysis method according to an exemplary embodiment of the present invention. As shown in Fig. 5, each dot can be regarded as an abstraction of one log, and the logs are arranged in the order in which they were printed. According to the DBSCAN algorithm, the logs inside one circle are clustered together, which realizes the time-sequence segmentation of the logs: logs from different time periods are separated, which supports more accurate subsequent location of the root cause device and the root cause fault.
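To make the time-sequence segmentation concrete, here is a small pure-Python DBSCAN over one-dimensional log timestamps. It is a sketch under assumptions: the function name `dbscan_1d`, the sample timestamps and the parameter values are illustrative, and a library implementation such as scikit-learn's `DBSCAN` could be substituted. `eps` plays the role of the sample neighborhood distance threshold and `min_samples` the sample count threshold.

```python
def dbscan_1d(times, eps, min_samples):
    """Label each timestamp with a cluster id; -1 marks noise."""
    n = len(times)
    labels = [None] * n  # None means "not visited yet"

    def neighbors(i):
        # A point counts as its own neighbor, as in standard DBSCAN.
        return [j for j in range(n) if abs(times[i] - times[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbors(j)
            if len(more) >= min_samples:  # core point: keep expanding the cluster
                queue.extend(more)
    return labels

timestamps = [0, 1, 2, 3, 60, 61, 62, 200]  # seconds, hypothetical
labels = dbscan_1d(timestamps, eps=5, min_samples=3)
# labels == [0, 0, 0, 0, 1, 1, 1, -1]
```

Each cluster id corresponds to one time segment of the log sequence; the isolated log at 200 s is treated as noise.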
Step 202, performing anomaly detection on the first log sequence according to a preset anomaly detection mechanism, and determining a log sequence to be processed, wherein the log sequence to be processed is used for recording the anomaly log sequence screened by the anomaly detection mechanism.
Specifically, key field matching is performed on the first log sequence according to a preset abnormal log level, and a log sequence to be processed that contains the abnormal log level is determined. For example, vendors define standardized log levels for their devices. Taking one vendor as an example, log levels 0 (Emergency), 1 (Alert), 2 (Critical) and 3 (Error) respectively denote extremely urgent errors, errors that must be corrected immediately, relatively serious errors, and ordinary errors. One symptom of a device fault is the generation of a large number of logs at these higher severity levels, that is, levels 0 to 3. On this basis, the log level is selected as the key field that triggers an anomaly: when the first log sequence contains logs at these levels, it is determined to be a log sequence to be processed, that is, a log sequence in which a fault has most likely occurred.
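The level-based matching can be sketched in a few lines. The data layout (a sequence is a list of dictionaries with a `level` key) and the name `select_abnormal` are assumptions for this example; the level set 0 to 3 follows the text.

```python
ABNORMAL_LEVELS = {0, 1, 2, 3}  # higher-severity levels named in the text

def select_abnormal(sequences, abnormal_levels=ABNORMAL_LEVELS):
    """Keep the log sequences that contain at least one log at an abnormal level."""
    return [
        seq for seq in sequences
        if any(log["level"] in abnormal_levels for log in seq)
    ]

sequences = [
    [{"host": "A", "level": 6}, {"host": "A", "level": 5}],  # informational only
    [{"host": "B", "level": 6}, {"host": "B", "level": 2}],  # contains a level-2 log
]
to_process = select_abnormal(sequences)
```

Only the second sequence is kept, because it contains a level-2 log.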
Step 203, performing fault analysis on the log sequence to be processed according to a preset graph model, and determining the root cause equipment with faults and the root cause faults of the root cause equipment, wherein the graph model is used for representing the topological relation and the transfer relation of the equipment;
Specifically, the graph model includes a device topology graph, a fault transfer graph and a device transfer graph. A first device topology graph corresponding to the log sequence to be processed is determined according to the log sequence to be processed and the device topology graph; the device topology graph corresponds to all devices in a preset network coverage area. A first fault transfer graph and a first device transfer graph corresponding to the first device topology graph are determined according to the first device topology graph, the fault transfer graph and the device transfer graph; the device transfer graph corresponds to all devices in the preset network coverage area, and the fault transfer graph corresponds to the device transfer graph. The root cause device and the root cause fault are then determined according to the first fault transfer graph and the first device transfer graph.
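One way to read the derivation of the "first" graphs is as a restriction of the full graphs to the devices that matter for the current log sequence. The sketch below shows that reading; the nested-dictionary graph representation, the helper `restrict_graph`, the sample probabilities, and the decision not to renormalize the remaining edge probabilities are all assumptions.

```python
def restrict_graph(graph, devices):
    """Keep only nodes and directed edges whose endpoints are both in `devices`."""
    return {
        src: {dst: p for dst, p in edges.items() if dst in devices}
        for src, edges in graph.items()
        if src in devices
    }

# Hypothetical full device transfer graph covering the whole network area.
device_transfer = {
    "A": {"B": 0.6, "H": 0.4},
    "B": {"C": 0.7},
    "H": {"A": 1.0},
}
first_device_transfer = restrict_graph(device_transfer, {"A", "B", "C"})
```

Device H and the edge A->H are dropped because H does not belong to the selected device set.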
For example, FIG. 6 is a schematic diagram of a device topology in the graph model-based fault analysis method according to an exemplary embodiment of the present invention. As shown in FIG. 6, the first device topology graph corresponding to the log sequence to be processed contains 7 devices: device A, device B, device C, device D, device E, device F, and device G. The device transfer graph corresponding to the topology of these 7 devices is shown in FIG. 7, which is a schematic diagram of device transfer in the method; the fault transfer graph corresponding to the topology of these 7 devices is shown in FIG. 8, which is a schematic diagram of fault transfer in the method. The abnormal log set of the log sequence to be processed is {A:a, A:b, A:c, B:a, B:b, B:c, B:d, B:e, C:d, C:e}; that is, the network event generated 10 logs in total: device A reported faults a, b, c; device B reported faults a, b, c, d, e; and device C reported faults d, e. Fault analysis is performed on the 3 devices according to the fault transfer graph of FIG. 8, and the probability values of the directed edges for the fault set {a, b, c} of device A are compared. For example, the probability values of the directed edges are: a->a 0.5, a->b 0.1, a->c 0.2, a->d 0.1, a->e 0.1; b->a 0.1, b->b 0.2, b->c 0.4, b->d 0.1, b->e 0.2; c->a 0.2, c->b 0.1, c->c 0.2, c->d 0.3, c->e 0.2. Since the directed edge with the highest probability value is a->a, the root cause fault of device A is determined to be fault a.
Similarly, the root cause fault of device B is determined to be fault a, and the root cause fault of device C to be fault e. The fault transfer relation between devices is then analyzed according to the device transfer graph of FIG. 7 by comparing the probability values of the directed edges for the device set {A, B, C}; if the directed edge with the maximum probability value is A->B, the root cause device is determined to be device A. In summary, the root cause log of the network event is {A:a}: the root cause device is device A, the root cause fault is fault a, and the other 9 logs are associated logs.
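The selection rule in this example (take the directed edge with the highest probability and report its source node) can be sketched as follows. The fault-transfer probabilities are the ones quoted in the example for edges leaving a, b, and c. The device-transfer probabilities are hypothetical, since the text only states that A->B is the maximum. Restricting the comparison to edges whose two endpoints both appear in the abnormal log set is also an assumption of this sketch.

```python
from typing import Dict, Iterable, Tuple

Edge = Tuple[str, str]


def root_of(members: Iterable[str], edge_prob: Dict[Edge, float]) -> str:
    """Return the source node of the highest-probability edge among `members`."""
    member_set = set(members)
    # Assumption: only edges with both endpoints in the observed set compete.
    candidates = {
        edge: p
        for edge, p in edge_prob.items()
        if edge[0] in member_set and edge[1] in member_set
    }
    if not candidates:
        raise ValueError("no edge connects the given members")
    best_edge = max(candidates, key=candidates.get)
    return best_edge[0]


# Fault-transfer probabilities quoted in the example.
fault_edges = {
    ("a", "a"): 0.5, ("a", "b"): 0.1, ("a", "c"): 0.2, ("a", "d"): 0.1, ("a", "e"): 0.1,
    ("b", "a"): 0.1, ("b", "b"): 0.2, ("b", "c"): 0.4, ("b", "d"): 0.1, ("b", "e"): 0.2,
    ("c", "a"): 0.2, ("c", "b"): 0.1, ("c", "c"): 0.2, ("c", "d"): 0.3, ("c", "e"): 0.2,
}

# Hypothetical device-transfer probabilities in which A->B is the maximum.
device_edges = {
    ("A", "A"): 0.2, ("A", "B"): 0.5, ("A", "C"): 0.3,
    ("B", "A"): 0.3, ("B", "B"): 0.4, ("B", "C"): 0.3,
    ("C", "A"): 0.3, ("C", "B"): 0.3, ("C", "C"): 0.4,
}

root_fault_of_device_a = root_of({"a", "b", "c"}, fault_edges)
root_device = root_of({"A", "B", "C"}, device_edges)
```

The results for devices B and C cannot be reproduced from the text alone, because the probabilities of edges leaving faults d and e are not given.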
In one possible design, the log sequence to be processed after anomaly-detection screening may be a log set covering multiple events, each of which yields its own root cause log and associated log set. Separating the logs within a time slice according to event characteristics allows each network event to be analyzed in depth, so that the root cause log of each event is obtained and the faults each event may cause are predicted. For example, for different types of events in the network, different keywords are selected as the index for separating logs, so as to locate the source of a fault more accurately; for common network faults, the keywords are typically the port, the Internet Protocol (IP) address, and so on.
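As a rough illustration of separating one time slice into events by keyword, the sketch below groups log entries by the values of chosen key fields. The field names `ip` and `port` and the function name `split_by_event` are hypothetical; the source only says that keywords such as port and IP are typical choices.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def split_by_event(
    logs: List[Dict],
    key_fields: Sequence[str] = ("ip", "port"),
) -> Dict[Tuple, List[Dict]]:
    """Group log entries that share the same values for the chosen key fields."""
    events: Dict[Tuple, List[Dict]] = defaultdict(list)
    for entry in logs:
        # Missing fields map to None so that incomplete entries still group.
        key = tuple(entry.get(field) for field in key_fields)
        events[key].append(entry)
    return dict(events)


# Hypothetical time slice containing logs from two unrelated events.
time_slice = [
    {"device": "A", "ip": "10.0.0.1", "port": "ge-0/0/1", "fault": "a"},
    {"device": "B", "ip": "10.0.0.1", "port": "ge-0/0/1", "fault": "b"},
    {"device": "C", "ip": "10.0.9.9", "port": "ge-0/0/7", "fault": "d"},
]
events = split_by_event(time_slice)
```

Each resulting group can then be analyzed separately for its own root cause log and associated logs.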
Step 204, according to the root cause device, the root cause fault and the graph model, the predicted fault device and the predicted fault information are determined.
Specifically, for example, on the basis of the root cause analysis result {A:a}, the directed edge with the maximum probability value from device A to other devices is obtained from the device transfer graph; if it points to device B, the directed edge with the maximum probability value from fault a to other faults is then determined from the fault transfer graph; if it points to fault b, the predicted fault device is device B and the predicted fault information is fault b.
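The prediction step can be sketched as two look-ups of the most probable outgoing edge, one in each graph. The probabilities below are hypothetical values chosen so that the outcome matches the illustration in the text (device B, fault b); self-loops are excluded because the text speaks of edges pointing to other devices and other faults.

```python
from typing import Dict, Optional, Tuple

Edge = Tuple[str, str]


def most_likely_successor(node: str, edge_prob: Dict[Edge, float]) -> Optional[str]:
    """Return the target of the highest-probability edge leaving `node`, ignoring self-loops."""
    outgoing = {
        dst: p for (src, dst), p in edge_prob.items() if src == node and dst != node
    }
    if not outgoing:
        return None
    return max(outgoing, key=outgoing.get)


def predict_next(
    root_device: str,
    root_fault: str,
    device_edges: Dict[Edge, float],
    fault_edges: Dict[Edge, float],
) -> Tuple[Optional[str], Optional[str]]:
    """Predict the next faulty device and the next fault from the root cause."""
    return (
        most_likely_successor(root_device, device_edges),
        most_likely_successor(root_fault, fault_edges),
    )


# Hypothetical probabilities for the edges leaving device A and fault a.
device_edges = {("A", "A"): 0.2, ("A", "B"): 0.5, ("A", "C"): 0.3}
fault_edges = {("a", "a"): 0.3, ("a", "b"): 0.4, ("a", "c"): 0.3}

predicted_device, predicted_fault = predict_next("A", "a", device_edges, fault_edges)
```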
The processing method of steps 201-204 can more accurately locate specific information of faults by preprocessing the real-time log, detecting abnormality, and then carrying out fault root cause reasoning and fault prediction on the log according to a graph model based on the topological relation and the transfer relation of the equipment.
The graph model is obtained by training on history logs. Specifically, a history log training set is acquired; the history logs in the training set are preprocessed according to a preset processing rule to obtain a history log sequence; anomaly detection is performed on the history log sequence according to a preset anomaly detection mechanism to determine a history log sequence to be processed, which records the abnormal history log sequences screened out by the anomaly detection mechanism; the device transfer graph is determined according to the history log sequence to be processed and a preset device topology graph; and the fault transfer graph is determined according to the history log sequence to be processed, the device topology graph, and the device transfer graph. For the preprocessing and anomaly detection, refer to the processing of steps 201-202. After the history log sequence to be processed is determined, the probability that the next fault occurs on a related device when each device fails is counted, which yields a relation graph among the faulty devices in which each directed edge carries a probability value. Taking devices A and B in FIGS. 6 and 7 as an example, the edge from A to B represents the historical statistical probability that any fault of device A leads to any fault of device B; the edge from A to itself represents the historical statistical probability that a fault of device A leads to other faults of device A. The fault transfer graph is then determined in combination with FIG. 8: when considering whether two faults are connected, the transfer relation between the two devices in the device transfer graph of FIG. 7 is taken into account.
For example, if device A is topologically connected only to devices B and C, then the fault following a fault of device A can only occur on one of the three devices A, B, or C. Taking faults a and b as an example, the edge from a to b represents the historical statistical probability that, across all devices (i.e., A/B/C/D/E/F/G), an occurrence of fault a leads to fault b on other devices.
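One way to picture the training of the device transfer graph is to count consecutive device-to-device transitions in an abnormal history sequence, keep only transitions allowed by the topology (the device itself or a direct neighbour), and normalise the counts per source device. This is a minimal sketch under those assumptions; the input layout, the per-source normalisation, and the name `build_device_transfer_graph` are illustrative and not taken from the source.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def build_device_transfer_graph(
    sequence: List[str],
    topology: Dict[str, Set[str]],
) -> Dict[Edge, float]:
    """Estimate edge probabilities from consecutive faulty devices in time order."""
    counts: Counter = Counter()
    for cur, nxt in zip(sequence, sequence[1:]):
        # A fault may be followed by one on the same device or on a neighbour.
        if nxt == cur or nxt in topology.get(cur, set()):
            counts[(cur, nxt)] += 1

    totals: Dict[str, int] = defaultdict(int)
    for (src, _), n in counts.items():
        totals[src] += n

    return {edge: n / totals[edge[0]] for edge, n in counts.items()}


# Hypothetical topology and time-ordered list of devices reporting faults.
topology = {"A": {"B", "C"}, "B": {"A"}, "C": {"A"}, "D": set()}
history = ["A", "B", "A", "A", "C", "A", "B", "D"]
graph = build_device_transfer_graph(history, topology)
```

In this sample the transition B->D is discarded because D is not a neighbour of B in the assumed topology.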
For example, when fault a occurs, the historical frequency with which each fault occurs next is counted, and the historical statistical probability value is obtained from that frequency. Suppose the historical counts of the faults that follow fault a are: a->b occurs n times, a->c occurs m times, and a->d occurs k times. The historical statistical probability values of the directed edges leaving fault a for these 3 cases are as follows:
Figure BDA0002832439040000111
Figure BDA0002832439040000112
Figure BDA0002832439040000113
Similarly, the historical statistical probability value of each directed edge of each device is obtained by counting the transition frequencies between devices when faults occur.
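The three formula images referenced above are not reproduced in this text. Assuming the probabilities are simply the relative frequencies of the observed successors of fault a (an assumption, not confirmed by the source), with n, m, and k the counts given above, they would read:

```latex
P(a \to b) = \frac{n}{n + m + k}, \qquad
P(a \to c) = \frac{m}{n + m + k}, \qquad
P(a \to d) = \frac{k}{n + m + k}
```

Under this reading the probabilities of the edges leaving fault a sum to 1, which agrees with the example values quoted earlier for the edges leaving faults a, b, and c.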
With the method of steps 201-204, a first real-time log is acquired and preprocessed according to a preset processing rule to obtain a first log sequence, where the first real-time log records real-time change information of the equipment running state and service state; anomaly detection is performed on the first log sequence according to a preset anomaly detection mechanism to determine a log sequence to be processed, which records the abnormal log sequences screened out by the anomaly detection mechanism; fault analysis is performed on the log sequence to be processed according to a preset graph model to determine the faulty root cause device and its root cause fault, where the graph model represents the topological relation and transfer relation of the devices; and the predicted fault device and predicted fault information are determined according to the root cause device, the root cause fault, and the graph model. In this way, the details of a fault are located accurately, laying a foundation for fault root cause diagnosis and rapid service recovery.
FIG. 9 is a schematic diagram of a graph model-based fault analysis apparatus according to an example embodiment of the present invention; as shown in fig. 9, the fault analysis apparatus 90 based on the graph model provided in the present embodiment includes:
the first processing module 901 is configured to obtain a first real-time log, and pre-process the first real-time log according to a preset processing rule to obtain a first log sequence, where the first real-time log is used for recording real-time change information of an equipment running state and a service state;
the second processing module 902 is configured to perform anomaly detection on the first log sequence according to a preset anomaly detection mechanism, determine a log sequence to be processed, where the log sequence to be processed is used to record a log sequence including an anomaly log level; the exception log level is determined by an exception detection mechanism;
the first determining module 903 is configured to perform fault analysis on the log sequence to be processed according to a preset graph model, determine a root cause device with a fault and a root cause fault of the root cause device, where the graph model is used to represent a device topology relationship and a transfer relationship;
a second determining module 904, configured to determine a predicted failure device and predicted failure information according to the root cause device, the root cause failure, and the graph model.
In one possible design, the first determining module 903 is specifically configured to:
the graph model comprises a device topology graph, a failover graph and a device transfer graph;
determining a first equipment topological graph corresponding to the log sequence to be processed according to the log sequence to be processed and the equipment topological graph; the equipment topological graph corresponds to all equipment in a preset network coverage area;
determining a first fault transfer diagram and a first equipment transfer diagram corresponding to the first equipment topological diagram according to the first equipment topological diagram, the fault transfer diagram and the equipment transfer diagram; the equipment transfer diagram corresponds to all equipment in a preset network coverage area, and the fault transfer diagram corresponds to the equipment transfer diagram;
root cause devices and root cause faults are determined from the first failover graph and the first device transfer graph.
In one possible design, the first processing module 901 is specifically configured to:
information extraction is carried out on the first real-time log according to a preset key field, and the first real-time log after information extraction is determined;
according to a preset filtering rule, filtering the first real-time log, and determining the first real-time log after the filtering;
and dividing fault analysis domains and segmenting time sequences of equipment corresponding to the first real-time log according to a preset network planning rule and a clustering algorithm to obtain a first log sequence.
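The three preprocessing operations listed above (key-field extraction, filtering, and division into analysis domains and time slices) could be sketched as below. Everything concrete here is an assumption: the log line format, the regular expression, the `heartbeat` filter keyword, the device-to-domain map, and the fixed 60-second window, which stands in for the clustering algorithm that the source mentions but does not detail.

```python
import re
from collections import defaultdict
from datetime import datetime
from typing import Dict, List, Sequence

# Hypothetical line format: "<ISO timestamp> <device> <level> <message>"
LINE_RE = re.compile(r"^(?P<ts>\S+)\s+(?P<device>\S+)\s+(?P<level>\d)\s+(?P<msg>.*)$")
EPOCH = datetime(1970, 1, 1)


def preprocess(
    lines: List[str],
    domain_of: Dict[str, str],
    window_s: int = 60,
    drop_keywords: Sequence[str] = ("heartbeat",),
) -> List[List[Dict]]:
    """Extract key fields, filter noise, and bucket logs by (domain, time slice)."""
    buckets: Dict[tuple, List[Dict]] = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # line does not carry the expected key fields
        if any(keyword in match["msg"] for keyword in drop_keywords):
            continue  # filtering rule: drop routine messages
        ts = datetime.fromisoformat(match["ts"])
        slot = int((ts - EPOCH).total_seconds()) // window_s
        domain = domain_of.get(match["device"], "unknown")
        buckets[(domain, slot)].append(
            {
                "time": ts,
                "device": match["device"],
                "level": int(match["level"]),
                "msg": match["msg"],
            }
        )
    # One log sequence per (domain, time slice), in sorted key order.
    return [buckets[key] for key in sorted(buckets)]


lines = [
    "2020-12-11T10:00:05 devA 3 port down",
    "2020-12-11T10:00:40 devA 6 heartbeat ok",
    "2020-12-11T10:00:50 devB 2 fan failure",
    "2020-12-11T10:02:10 devA 4 link flap",
    "garbage",
]
log_sequences = preprocess(lines, {"devA": "metro-1", "devB": "metro-1"})
```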
In one possible design, the second processing module 902 is specifically configured to:
and carrying out key field matching processing on the first log sequence according to the preset abnormal log level, and determining a log sequence to be processed containing the abnormal log level.
In one possible design, the first processing module 901 is further configured to:
acquiring a history log training set;
preprocessing a history log in a history log training set according to a preset processing rule to obtain a history log sequence;
performing anomaly detection on the history log sequence according to a preset anomaly detection mechanism, and determining a history log sequence to be processed; the history log sequence to be processed is used for recording the abnormal history log sequence screened out by the abnormality detection mechanism;
determining a device transfer diagram according to the history log sequence to be processed and a preset device topological diagram;
and determining a fault transfer diagram according to the history log sequence to be processed, the equipment topological diagram and the equipment transfer diagram.
Fig. 10 is a schematic structural diagram of a fault analysis platform according to an exemplary embodiment of the present invention. As shown in fig. 10, the fault analysis platform 10 provided in this embodiment includes:
a processor 1001; and
a memory 1002 for storing executable instructions of the processor, where the memory may also be a flash memory;
wherein the processor 1001 is configured to perform the steps of the above-described method by executing the executable instructions. For details, refer to the description of the method embodiments above.
Optionally, the memory 1002 may be separate from or integrated with the processor 1001.
When the memory 1002 is a device separate from the processor 1001, the fault analysis platform 10 may further include:
a bus 1003 for connecting the processor 1001 and the memory 1002.
In addition, an embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions; when at least one processor of a user equipment executes the computer-executable instructions, the user equipment performs the methods described above.
Computer-readable media include computer storage media and communication media, the latter including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Alternatively, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC, and the ASIC may reside in a user device. The processor and the storage medium may also reside as discrete components in a communication device.
Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the method embodiments described above may be performed by hardware associated with program instructions. The foregoing program may be stored in a computer readable storage medium. The program, when executed, performs steps including the method embodiments described above; and the aforementioned storage medium includes: various media that can store program code, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims (9)

1. A graph model-based fault analysis method, comprising:
acquiring a first real-time log, preprocessing the first real-time log according to a preset processing rule to obtain a first log sequence, wherein the first real-time log is used for recording real-time change information of equipment running state and service state;
performing anomaly detection on the first log sequence according to a preset anomaly detection mechanism, and determining a log sequence to be processed, wherein the log sequence to be processed is used for recording the anomaly log sequence screened by the anomaly detection mechanism;
performing fault analysis on the log sequence to be processed according to a preset graph model, and determining root cause equipment with faults and root cause faults of the root cause equipment, wherein the graph model is used for representing equipment topological relation and transfer relation;
determining predicted fault equipment and predicted fault information according to the root cause equipment, the root cause fault and the graph model;
the fault analysis is performed on the log sequence to be processed according to a preset graph model, and the determination of the root cause equipment with fault and the root cause fault of the root cause equipment comprises the following steps:
the graph model comprises a device topology graph, a fault transfer graph and a device transfer graph;
determining a first equipment topological graph corresponding to the log sequence to be processed according to the log sequence to be processed and the equipment topological graph; the device topological graph corresponds to all devices in a preset network coverage area;
determining a first failover graph and a first equipment transfer graph corresponding to the first equipment topological graph according to the first equipment topological graph, the failover graph and the equipment transfer graph; the equipment transfer diagram corresponds to all equipment within the preset network coverage range, and the fault transfer diagram corresponds to the equipment transfer diagram;
determining the root cause device and the root cause fault according to the first failover graph and the first device transfer graph;
the method for determining root cause equipment and root cause faults according to the first fault transfer diagram and the first equipment transfer diagram specifically comprises the following steps:
determining the root cause fault of any one device according to the fault set of the any one device and the probability value of each directed edge of each fault in the first fault transfer graph aiming at any one device contained in the log sequence to be processed;
according to the probability values of the directed edges of the devices contained in the log sequence to be processed and the devices in the first device transfer diagram, determining the device with the maximum probability value directed to the directed edge as the root cause device;
the first device transfer graph is obtained by counting the probability that the next fault occurs in the related device when each device fails.
2. The method of claim 1, wherein preprocessing the first real-time log according to a preset processing rule to obtain a first log sequence comprises:
information extraction is carried out on the first real-time log according to a preset key field, and the first real-time log after information extraction is determined;
according to a preset filtering rule, filtering the first real-time log, and determining the first real-time log after the filtering;
and dividing fault analysis domains and segmenting time sequences of the equipment corresponding to the first real-time log according to a preset network planning rule and a clustering algorithm to obtain a first log sequence.
3. The method according to claim 2, wherein the performing anomaly detection on the first log sequence according to a preset anomaly detection mechanism, and determining a log sequence to be processed includes:
and carrying out key field matching processing on the first log sequence according to a preset abnormal log level, and determining a log sequence to be processed containing the abnormal log level.
4. A method according to any one of claims 1-3, wherein prior to said obtaining the first real-time log, further comprising:
acquiring a history log training set;
preprocessing the history logs in the history log training set according to a preset processing rule to obtain a history log sequence;
performing anomaly detection on the history log sequence according to a preset anomaly detection mechanism, and determining a history log sequence to be processed; the history log sequence to be processed is used for recording the abnormal history log sequence screened out by the abnormal detection mechanism;
determining the equipment transfer diagram according to the history log sequence to be processed and the preset equipment topological diagram;
and determining the fault transfer diagram according to the history log sequence to be processed, the equipment topological diagram and the equipment transfer diagram.
5. A graph model-based fault analysis apparatus, comprising:
the first processing module is used for acquiring a first real-time log and preprocessing the first real-time log according to a preset processing rule to obtain a first log sequence, wherein the first real-time log is used for recording real-time change information of equipment running state and service state;
the second processing module is used for carrying out anomaly detection on the first log sequence according to a preset anomaly detection mechanism and determining a log sequence to be processed, wherein the log sequence to be processed is used for recording the log sequence containing the anomaly log level; the anomaly log level is determined by the anomaly detection mechanism;
the first determining module is used for carrying out fault analysis on the log sequence to be processed according to a preset graph model, determining root cause equipment with faults and root cause faults of the root cause equipment, wherein the graph model is used for representing equipment topological relation and transfer relation;
the second determining module is used for determining prediction fault equipment and prediction fault information according to the root cause equipment, the root cause fault and the graph model; the first determining module is specifically configured to:
the graph model comprises a device topology graph, a fault transfer graph and a device transfer graph;
determining a first equipment topological graph corresponding to the log sequence to be processed according to the log sequence to be processed and the equipment topological graph; the device topological graph corresponds to all devices in a preset network coverage area;
determining a first failover graph and a first equipment transfer graph corresponding to the first equipment topological graph according to the first equipment topological graph, the failover graph and the equipment transfer graph; the equipment transfer diagram corresponds to all equipment within the preset network coverage range, and the fault transfer diagram corresponds to the equipment transfer diagram;
determining the root cause device and the root cause fault according to the first failover graph and the first device transfer graph;
the first determining module is further configured to:
determining the root cause fault of any one device according to the fault set of the any one device and the probability value of each directed edge of each fault in the first fault transfer graph aiming at any one device contained in the log sequence to be processed;
according to the probability values of the directed edges of the devices contained in the log sequence to be processed and the devices in the first device transfer diagram, determining the device with the maximum probability value directed to the directed edge as the root cause device;
the first device transfer graph is obtained by counting the probability that the next fault occurs in the related device when each device fails.
6. The apparatus of claim 5, wherein the first determining module is further configured to:
information extraction is carried out on the first real-time log according to a preset key field, and the first real-time log after information extraction is determined;
according to a preset filtering rule, filtering the first real-time log, and determining the first real-time log after the filtering;
and dividing fault analysis domains and segmenting time sequences of the equipment corresponding to the first real-time log according to a preset network planning rule and a clustering algorithm to obtain a first log sequence.
7. The apparatus of claim 6, wherein the second determining module is specifically configured to:
and carrying out key field matching processing on the first log sequence according to a preset abnormal log level, and determining a log sequence to be processed containing the abnormal log level.
8. A fault analysis platform, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the graph model-based fault analysis method of any one of claims 1 to 4 via execution of the executable instructions.
9. A storage medium having stored thereon a computer program, which when executed by a processor implements the graph model based fault analysis method of any one of claims 1 to 4.
CN202011453509.3A 2020-12-11 2020-12-11 Fault analysis method and device based on graph model Active CN114629776B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011453509.3A CN114629776B (en) 2020-12-11 2020-12-11 Fault analysis method and device based on graph model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011453509.3A CN114629776B (en) 2020-12-11 2020-12-11 Fault analysis method and device based on graph model

Publications (2)

Publication Number Publication Date
CN114629776A CN114629776A (en) 2022-06-14
CN114629776B true CN114629776B (en) 2023-05-30

Family

ID=81895792

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011453509.3A Active CN114629776B (en) 2020-12-11 2020-12-11 Fault analysis method and device based on graph model

Country Status (1)

Country Link
CN (1) CN114629776B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115225460B (en) * 2022-07-15 2023-11-28 北京天融信网络安全技术有限公司 Fault determination method, electronic device, and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019221461A1 (en) * 2018-05-18 2019-11-21 주식회사 케이티 Apparatus and method for analyzing cause of network failure
WO2020168675A1 (en) * 2019-02-21 2020-08-27 烽火通信科技股份有限公司 Sample data processing method, and system and apparatus

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100456687C (en) * 2003-09-29 2009-01-28 华为技术有限公司 Network failure real-time relativity analysing method and system
CN103001811B (en) * 2012-12-31 2016-01-06 北京启明星辰信息技术股份有限公司 Fault locating method and device
CN109840157A (en) * 2017-11-28 2019-06-04 中国移动通信集团浙江有限公司 Method, apparatus, electronic equipment and the storage medium of fault diagnosis
US10795753B2 (en) * 2017-12-08 2020-10-06 Nec Corporation Log-based computer failure diagnosis
CN110493025B (en) * 2018-05-15 2022-06-14 中国移动通信集团浙江有限公司 Fault root cause diagnosis method and device based on multilayer digraphs
CN109617715A (en) * 2018-11-27 2019-04-12 中盈优创资讯科技有限公司 Network fault diagnosis method, system
CN110855502A (en) * 2019-11-22 2020-02-28 叶晓斌 Fault cause determination method and system based on time-space analysis log
CN110855503A (en) * 2019-11-22 2020-02-28 叶晓斌 Fault cause determining method and system based on network protocol hierarchy dependency relationship
CN112052151B (en) * 2020-10-09 2022-02-18 腾讯科技(深圳)有限公司 Fault root cause analysis method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
彦逸."基于因果规则的电力营销系统故障定位算法".《计算机与现代化》.2020,全文. *

Also Published As

Publication number Publication date
CN114629776A (en) 2022-06-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant