CN103001811B

CN103001811B - Fault locating method and device

Info

Publication number: CN103001811B
Application number: CN201210594148.3A
Authority: CN
Inventors: 张延佳; 韩三田; 胡盛华
Original assignee: Beijing Venus Information Security Technology Co Ltd; Beijing Venus Information Technology Co Ltd
Current assignee: Beijing Venus Information Security Technology Co Ltd; Beijing Venus Information Technology Co Ltd
Priority date: 2012-12-31
Filing date: 2012-12-31
Publication date: 2016-01-06
Anticipated expiration: 2032-12-31
Also published as: CN103001811A

Abstract

The invention provides a fault locating method and device, relates to the field of computer network applications, and solves the problems of poor timeliness and low efficiency in existing alarm association rule mining systems. The method comprises: constructing a network element topology constraint model; detecting the running state of each network element device in the managed network to find fault events; collecting the fault events; and, using the network element topology constraint model, performing time layer association and space layer association on the collected fault events to determine the fault location. The technical scheme provided by the invention is applicable to fault diagnosis and achieves efficient and accurate fault location.

Description

Fault positioning method and device

Technical Field

The invention relates to the field of computer network application, in particular to a fault positioning method and device.

Background

The application of computer networks has reached every corner of people's life and work, and computers have become an indispensable tool for modern people. In order for the network to serve people effectively, reliably, safely and economically, network management requires that a network management node perform the corresponding fault management in time when the network fails, so that the network can be repaired quickly and continue to provide services. Fault management generally comprises four steps: fault detection, fault diagnosis, fault repair, and fault recording, of which fault diagnosis is one of the most critical links. If network fault diagnosis can quickly and accurately locate the fault source, the fault can be repaired quickly, so that the loss caused by the network fault is reduced, the reliability and availability of the network are ensured, and faults can be prevented to a certain extent.

The network is composed of various devices and subsystems, which are mutually associated and tightly coupled. The failure of one device affects many of the devices or subsystems connected to it and can even bring down the network, a phenomenon known as fault propagation. Fault propagation can cause a large number of fault events to be triggered simultaneously, creating a fault event storm that makes fault diagnosis difficult. Another cause of fault event storms is that, in increasingly complex networks and in order to keep responding to new security challenges, enterprises and organizations have successively deployed anti-virus systems, firewalls, intrusion detection systems, vulnerability scanning systems, UTMs, and the like; when one device fails, the whole set of security systems is triggered, generating a large number of security events. Therefore, in a complex network environment, a device failure easily causes a fault event storm, and it is difficult for a network manager to quickly find the fault source among a large pile of fault phenomena.

Data mining is the process of extracting hidden, previously unknown but potentially useful information and knowledge from large amounts of incomplete, noisy, fuzzy, and random data. In a complex network environment, when a device fault is accompanied by a fault event storm, data mining can be introduced into alarm association: rule-based correlation analysis can reduce many alarms to a few and filter out a large number of redundant alarms, thereby helping network managers locate the fault.

However, most of the conventional alarm association rule mining systems directly perform simple preprocessing on original alarm data and then mine the original alarm data by using a mining algorithm, so as to obtain the association relationship between alarms. Although effective alarm association rules can be mined by the method, the timeliness and the efficiency of the alarm association rule mining system are not high for massive alarm data. In addition, if the original alarm data is only passively acquired from the existing security system, the effectiveness and comprehensiveness of the original alarm data are difficult to guarantee.

Disclosure of Invention

The invention provides a fault positioning method and a fault positioning device, and solves the problems of poor timeliness and low efficiency of an existing alarm association rule mining system.

A fault location method, comprising:

constructing a network element topology constraint model;

detecting the running state of each network element device in the managed network to find out a fault event;

collecting fault events;

and carrying out time layer association and space layer association on the collected fault events by utilizing the network element topological constraint model, and determining the fault position.

Preferably, in an unauthorized network environment, the constructing a network element topology constraint model includes:

taking a management center network element node of a managed network as a detection origin, and sending a designed detection data packet from the detection origin to a target node in the managed network;

acquiring feedback data packets of each target node to the detection data packet, analyzing the feedback data packets, and acquiring detection feedback data information of each target node, wherein the detection feedback data information comprises an array formed by a detection target address and detection path jumping point information;

and traversing and de-duplicating the paths in the detection feedback data information to obtain the network element topology constraint model.

Preferably, in an authorized network environment, the constructing a network element topology constraint model includes:

taking out an IP address from the IP address range of the managed network, and acquiring the ipForwarding value of the IP address by using SNMP;

when the ipForwarding value is 1, judging that the network element corresponding to the IP address is a router;

using SNMP to inquire the IP address table of the router, acquiring all IP addresses and corresponding subnet masks in the IP address table, and determining all subnet addresses connected with the router;

acquiring a variable ifType from the interface table, and determining the network type of the subnet;

and inquiring a routing table of the router to obtain a next hop IP address of the router which is not directly connected, and discovering all active IP nodes in the subnet by using ICMP.

Preferably, the detecting the operation status of each network element device in the managed network to find the fault event includes:

detecting downtime faults of each network element in the managed network by using the error detection and reporting mechanism of ICMP (Internet Control Message Protocol);

detecting performance faults of each network element in the managed network by using SNMP (Simple Network Management Protocol) and/or SSH (Secure Shell);

after a fault is found, reporting the fault event via the SYSLOG protocol.

Preferably, the collecting the fault event comprises:

collecting fault events reported by SYSLOG protocol;

collecting general log information of the managed network, states of network security equipment, network equipment, host server equipment, an operating system, a database and middleware, logs and network data packets;

normalizing the collected fault events to form a uniform fault event according to the fault event and the general log information;

and putting the fault event formed after normalization into a fault event cache.

Preferably, the step of normalizing the collected fault events according to the fault events and the general log information to form a unified fault event specifically includes:

according to the general log information, normalizing the collected fault events into the following categories:

server down fault, server performance fault, link interruption fault, service interruption fault, threshold alarm fault, general equipment fault.

Preferably, the fault event includes the following information:

module name, source IP address, source port, destination IP address, destination port, protocol type, attack type, message, and specific action.

Preferably, the performing, by using the network element topology constraint model, temporal layer association and spatial layer association on the collected fault event, and determining the fault location includes:

in the time correlation layer, according to the alarm severity, the alarm time and the event type, carrying out de-duplication correlation on the fault events, removing non-fault information and aggregating the fault events which are concurrent at the same time;

acquiring the latest network element topology association model in the memory, and converting the network element topology association model into an association rule script file;

storing all rules in the associated rule script file into a rule cache;

acquiring the latest fault event from the fault event cache, performing multi-event correlation, and storing all fault events meeting the rules in the cache;

and when the fault events stored in the cache can be matched with the rules in the associated rule script file, all the fault events matched with the rules are moved out of the cache, and an alarm for all the fault events is generated.

Preferably, the step of performing temporal layer correlation and spatial layer correlation on the collected fault event by using the network element topology constraint model, and determining the fault location further includes:

and displaying the fault alarm information through network topology visualization of a tree structure.

The invention provides a fault locating device, comprising:

the topology constraint model building layer is used for building a network element topology constraint model;

the network state measurement layer is used for detecting the running state of each network element device in the managed network so as to find out a fault event;

the fault event acquisition layer is used for acquiring fault events;

and the event correlation analysis layer is used for performing time layer correlation and space layer correlation on the collected fault events by utilizing the network element topology constraint model to determine the fault positions.

Preferably, the fault locating device further includes:

and the fault positioning display layer is used for displaying the fault alarm information through the network topology visualization of the tree structure.

The invention provides a fault locating method and device. A network element topology constraint model is constructed; the running state of each network element device in the managed network is detected to find fault events; the fault events are collected; and, using the network element topology constraint model, time layer association and space layer association are performed on the collected fault events to determine the fault location. Because the alarm data is mined through the network topology model and association rules that have no topological connection are filtered out, mining efficiency and correctness are improved, which solves the problems of poor timeliness and low efficiency in existing alarm association rule mining systems.

Drawings

Fig. 1 is a flowchart of a fault location method according to an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of a fault locating device according to the second embodiment of the present invention;

Fig. 3 is a schematic diagram of an association rule mining algorithm based on topological constraints in the third embodiment of the present invention.

Detailed Description

Most of the traditional alarm association rule mining systems directly carry out simple preprocessing on original alarm data and then mine the original alarm data by using a mining algorithm, so that the association relation between alarms is obtained. Although effective alarm association rules can be mined by the method, the timeliness and the efficiency of the alarm association rule mining system are not high for massive alarm data. In addition, if the original alarm data is only passively acquired from the existing security system, the effectiveness and comprehensiveness of the original alarm data are difficult to guarantee.

Therefore, a network fault positioning method with higher efficiency and accuracy needs to be found, and the requirement of fault positioning in a complex network environment is met.

In order to solve the above problem, an embodiment of the present invention provides a fault location method. Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. It should be noted that the embodiments and features of the embodiments in the present application may be arbitrarily combined with each other without conflict.

First, a first embodiment of the present invention will be described with reference to the drawings.

The embodiment of the invention provides a fault locating method that comprehensively applies an association rule mining algorithm based on topological constraints, automatic topology discovery in both authorized and unauthorized networks, heterogeneous mass log collection, alarm visualization based on tree topology, a blackboard-model driving process, and modern communication technologies, in order to rapidly diagnose and locate network faults in ultra-large-scale and ultra-complex networks. The method can handle networks with up to 5000 network elements, and the network element types may include network security devices, network devices, host server devices, operating systems, databases, middleware, and the like.

The process of completing the fault analysis by using the fault positioning method provided by the embodiment of the invention is shown in fig. 1, and comprises the following steps:

step 101, constructing a network element topological constraint model;

The embodiment of the invention is based on in-depth research into network fault propagation characteristics and integrates the characteristics of various network protocols. It constructs a network element topology constraint model using automatic topology discovery in a complex network environment; rapidly diagnoses faults in a large-scale network using asynchronous detection and forms fault events; collects fault events from all sources across the network using heterogeneous mass log collection; performs correlation analysis on the fault events using a topology-constrained association rule algorithm to reach a final fault location conclusion; and presents the result visually using a tree-shaped network topology. A blackboard model framework drives these parts so that they form an organic whole.

The embodiment of the invention studies the propagation characteristics of network faults. Faults propagate through the network along two main kinds of path: lateral propagation and longitudinal propagation. Lateral propagation is the horizontal propagation of a fault between devices that are physically or logically connected. Longitudinal propagation is the propagation of a fault from a lower layer to a higher layer along the protocol stack within a device. According to the fault propagation path, the fault locating method provided by the embodiment of the invention divides fault diagnosis into two parts: lateral diagnosis and longitudinal diagnosis. This improves the efficiency and accuracy of fault diagnosis. The times at which a series of faults occur are a clue to how the faults propagate, and combining those times with the network topology gives a fuller understanding of the propagation.

The topology constraint model construction layer constructs the network element topology association model of the network. It intelligently explores the topological connection relationships of the network in various network environments by combining protocols such as SNMP, CDP and ICMP with methods such as TRACEROUTE. The topology association model is a tree-shaped logical network structure rooted at the management center. The topology model construction engine runs in an independent thread, periodically performs topology discovery on the managed network, and writes the model data into the network element association model cache. The topology constraint model construction layer can also convert an existing asset topology into the network element association model.

The network element topology constraint model constructed in this step is centered on the management center, and all subnets are treated as a tree structure. The tree is stored using the child-list representation: the whole tree is represented as a node table, and each element of the node table contains a list, called the child list, that records the positions of all child nodes of that node. The length of the node table equals the number of nodes in the tree, and the table is generally stored sequentially in a one-dimensional array. The length of each child list depends on the degree of its node and therefore varies, so the child list is generally represented as a singly linked list, in which the nodes are linked in their left-to-right order in the tree. Thus, in addition to the information of the element itself, each entry of the node table also stores the head pointer of its child list.
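The child-list storage described above can be illustrated with a minimal sketch. The class and method names (`NodeEntry`, `ChildListTree`, `add_child`, `children_of`) are hypothetical and do not come from the patent, and a Python list stands in for the singly linked child list.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class NodeEntry:
    """One element of the node table."""
    address: str                                        # information of the element itself
    children: List[int] = field(default_factory=list)   # child list: positions in the node table


class ChildListTree:
    """Tree stored as a node table whose entries each carry a child list."""

    def __init__(self, root_address: str) -> None:
        # Position 0 of the node table holds the management center.
        self.nodes: List[NodeEntry] = [NodeEntry(root_address)]
        self.position: Dict[str, int] = {root_address: 0}

    def add_child(self, parent_address: str, child_address: str) -> int:
        """Add a child under an existing parent and return the child's position."""
        parent_pos = self.position[parent_address]
        if child_address in self.position:      # already known: keep the first placement
            return self.position[child_address]
        pos = len(self.nodes)
        self.nodes.append(NodeEntry(child_address))
        self.position[child_address] = pos
        # Appending preserves the left-to-right order of the children.
        self.nodes[parent_pos].children.append(pos)
        return pos

    def children_of(self, address: str) -> List[str]:
        """Return the addresses of the direct children, left to right."""
        entry = self.nodes[self.position[address]]
        return [self.nodes[p].address for p in entry.children]
```

The length of `nodes` is the number of nodes in the tree, and each child list has as many entries as the degree of its node, which matches the description above.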

In order to adapt to a complex network environment, different technical schemes are adopted in the unauthorized network environment and the authorized network environment in the step, and the aim of the step is to construct the network element topology constraint model. The following are listed and described respectively:

firstly, under the environment of an unauthorized network, an improved TRACEROUTE method is adopted to construct a network element topology constraint model. The specific method comprises the following steps:

1) taking a management center network element node of the managed network as the detection origin, and sending ICMP echo request packets with different IP time-to-live (TTL) values from the detection origin to the target nodes in the managed network as detection data packets;

2) acquiring the feedback data packets returned for each target node, and parsing them to obtain the detection feedback data information of each target node, wherein the detection feedback data information comprises an array structure formed by the detection target address and the hop information of the detection path;

3) traversing and de-duplicating the paths in the detection feedback data information to obtain the network element topology constraint model.
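Step 3) can be sketched as follows, assuming the detection feedback of step 2) is already available as a list of (target address, hop list) pairs and that every hop answered. The function name `build_topology` and the sample addresses are hypothetical, and the sketch does not send any packets.

```python
from typing import Dict, List, Tuple


def build_topology(origin: str,
                   probe_results: List[Tuple[str, List[str]]]) -> Dict[str, List[str]]:
    """Merge per-target hop paths into one parent -> children mapping."""
    children: Dict[str, List[str]] = {origin: []}
    for target, hops in probe_results:
        path = [origin] + list(hops)
        if path[-1] != target:              # make sure the target closes the path
            path.append(target)
        # Traverse the path edge by edge; shared prefixes are stored only once.
        for parent, child in zip(path, path[1:]):
            children.setdefault(parent, [])
            children.setdefault(child, [])
            if child not in children[parent]:   # de-duplication
                children[parent].append(child)
    return children
```

Paths to different targets usually share their first hops, so de-duplicating the edges is what turns the individual paths into a single tree rooted at the detection origin.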

And secondly, under the authorized network environment, the SNMP protocol and the ICMP are comprehensively adopted to construct a network element topology constraint model in the step. The specific method comprises the following steps:

1) An IP address is taken from the IP address range of the managed network (for example, the first address of each network segment), and SNMP is used to obtain its ipForwarding value. If the value is 1, the device forwards IP data packets and is a router. If a router is found, go to step 2); if no router is found (that is, every IP address in the range has been checked and none is a router, so it is concluded that the managed network address range may contain no router), the algorithm ends.

2) The router's IP address table (ipAddrTable) is queried using SNMP to obtain all IP addresses in the table (ipAdEntAddr) and the corresponding subnet masks (ipAdEntNetMask). Each ipAdEntAddr is ANDed with its ipAdEntNetMask to determine all subnet addresses connected to the router. If these subnets are not within the managed network, the algorithm ends; otherwise, the variable ifType is obtained from the interface table (ifTable) to determine the network type of each subnet.

3) After the subnet information is obtained, the router's routing table (ipRouteTable) is queried to obtain the next-hop IP addresses (ipRouteNextHop) of routers that are not directly connected, that is, of entries whose route type (ipRouteType) has the value 4 (indirect). If there is no such router, the algorithm ends; otherwise, go to step 2). All subnets determined by the above algorithm are processed in a loop, and ICMP is used to discover all active IP nodes within the network (if the network contains multiple subnets, all subnets are traversed).
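The subnet computation and the next-hop selection described above can be sketched offline, assuming the SNMP tables have already been fetched into plain Python structures. The function names and the dictionary keys `next_hop` and `route_type` are hypothetical; no SNMP library is used here.

```python
import ipaddress
from typing import Dict, List, Tuple

IP_ROUTE_TYPE_INDIRECT = 4   # ipRouteType value "indirect" in MIB-II


def connected_subnets(addr_table: List[Tuple[str, str]]) -> List[ipaddress.IPv4Network]:
    """AND each interface address with its mask to get the attached subnets."""
    subnets: List[ipaddress.IPv4Network] = []
    for addr, mask in addr_table:
        net_int = int(ipaddress.IPv4Address(addr)) & int(ipaddress.IPv4Address(mask))
        network = ipaddress.IPv4Network(f"{ipaddress.IPv4Address(net_int)}/{mask}")
        if network not in subnets:          # two interfaces may share a subnet
            subnets.append(network)
    return subnets


def next_hop_routers(route_table: List[Dict[str, object]]) -> List[str]:
    """Return the next hops of indirect routes, i.e. routers not directly connected."""
    hops = {str(row["next_hop"]) for row in route_table
            if row["route_type"] == IP_ROUTE_TYPE_INDIRECT}
    return sorted(hops)
```

Each address returned by `next_hop_routers` would then be fed back into step 2), which is the loop the algorithm describes.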

The two construction technologies provide an effective and efficient network element topology constraint model discovery under a complex network environment, and lay a foundation for subsequent fault location analysis.

Step 102, detecting the running state of each network element device in the managed network to find fault events;

In this step, the network state measurement layer uses asynchronous network detection and diagnosis, relying on the error detection and reporting mechanism of the ICMP protocol to detect the connectivity of the network. Diagnostic information on network device faults is obtained by sending and receiving ICMP messages asynchronously, and diagnostic information on network service faults is obtained by establishing a TCP connection to the specified service port. The diagnostic information forms fault events, which are transmitted to the fault event acquisition layer through the syslog protocol.

The detection conclusions are uniformly marked in the form of fault events, and the event format is as follows:

mod=%s sa=%s sport=%d da=%s dport=%d proto=%d type="%s" count=%d msg="%s" act="%s"

The meaning of each parameter in the fault event is shown in Table 1.

TABLE 1

This step is carried out by the network state measurement layer, which reports the fault events to the fault event acquisition layer. Specifically, inside the system the network state measurement layer puts the fault events directly into the fault event cache in the form of Java objects.
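As an illustration of the event format given above, the following sketch fills the template with Python's `%` formatting. It assumes the fields are separated by single spaces, as in a typical key=value syslog message; the function name and the sample values (`probe`, `host_down`, `report`) are hypothetical.

```python
# Format string of the fault event; space-separated fields are an assumption.
EVENT_FORMAT = ('mod=%s sa=%s sport=%d da=%s dport=%d proto=%d '
                'type="%s" count=%d msg="%s" act="%s"')


def format_fault_event(mod: str, sa: str, sport: int, da: str, dport: int,
                       proto: int, type_: str, count: int, msg: str, act: str) -> str:
    """Render one fault event as a single text line."""
    return EVENT_FORMAT % (mod, sa, sport, da, dport, proto, type_, count, msg, act)


# Hypothetical event for a host that did not answer an ICMP probe.
line = format_fault_event("probe", "10.0.0.1", 0, "10.0.1.20", 0, 1,
                          "host_down", 1, "no ICMP echo reply", "report")
```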

Step 103, collecting fault events;

In this step, fault event collection by the fault event collection layer comprises three sub-steps: event reception, event normalization, and event caching. The fault event collection layer can receive fault events reported by the network state measurement layer, and can also receive security logs actively reported by the various network element devices via the syslog protocol.

A security log actively reported by a network element device via the syslog protocol may take the following form:

devid=0 date="2011/07/12 16:28:10" dname="Guard8000" logtype=6 pri=5 mod=attack sa=189.16.100.9 sport=2582 da=189.16.100.180 dport=8888 proto=6 type="synflood" count=1 msg="protectsynconnect" act="drop"

The fault event collection engine (the concrete implementation of the fault event collection layer in the framework) uses an independent thread to receive the fault events reported by the network state measurement layer and the security logs reported by the various network element devices. It extracts device fault events, converts the data messages directly into syslog data classes, normalizes the contents of the security logs to generate fault event classes in a uniform format, and finally puts the fault events into the fault event cache.

Fault events reported by the network state measurement layer follow roughly the same data flow. Message collection first converts each syslog protocol message into a syslog data class; because the syslog data formats of different devices are inconsistent, normalization processing then converts the syslog data class into a fault event class with a uniform final format. Normalization is applied directly to the contents of the security log.
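The normalization flow can be sketched as two stages: parsing a key=value syslog line into a dictionary (standing in for the syslog data class), then mapping it onto a uniform fault event. The sketch assumes whitespace-separated fields and optional double quotes around values; `FaultEvent`, `parse_syslog_fields` and `normalize` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Dict

# key=value pairs; the value is either double-quoted or a run of non-space characters.
PAIR = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')


@dataclass
class FaultEvent:
    """Uniform fault event carrying the fields of the event format."""
    mod: str
    sa: str
    sport: int
    da: str
    dport: int
    proto: int
    type: str
    count: int
    msg: str
    act: str


def parse_syslog_fields(line: str) -> Dict[str, str]:
    """Stage 1: turn one syslog text line into a raw field dictionary."""
    return {key: quoted if quoted else bare
            for key, quoted, bare in PAIR.findall(line)}


def normalize(fields: Dict[str, str]) -> FaultEvent:
    """Stage 2: keep the common fields and convert numeric ones to int."""
    return FaultEvent(
        mod=fields["mod"], sa=fields["sa"], sport=int(fields["sport"]),
        da=fields["da"], dport=int(fields["dport"]), proto=int(fields["proto"]),
        type=fields["type"], count=int(fields["count"]),
        msg=fields["msg"], act=fields["act"],
    )
```

Device-specific fields such as `devid` or `dname` remain available in the raw dictionary but are not part of the uniform event in this sketch.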

Step 104, performing time layer correlation and spatial layer correlation on the collected fault events by using the network element topological constraint model to determine fault positions;

Based on the propagation characteristics of network faults, the event correlation analysis layer adopts an association rule mining algorithm based on topological constraints. It obtains the hierarchical relationship between network elements from the established topology association model and assigns a hierarchical code to the device of each alarm event. The hierarchical code is the routing forwarding level: taking the fault locating system as the center, it is the number of routing layers that must be traversed to reach a given address. The propagation path of a fault is determined from the connection relationships between network elements represented by the topology, which yields the constraint condition for the association rule mining process: network element devices on a fault propagation path are physically connected, so if a higher-level device fails, the lower-level network fails as well. Whether two or more fault events may be joined into one set during association rule mining is limited by this condition. Because the topological constraint is applied before the join step, the number of candidate combinations to be tested is greatly reduced, which improves both the timeliness of fault location and the accuracy of the results.
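One way to read the hierarchical coding and the topological constraint is sketched below. It assumes that the hierarchical code of a device is its depth in the topology tree, and that two fault events may be joined only if one device lies on the path from the management center to the other. Both the interpretation and the function names are assumptions, not the patent's exact algorithm.

```python
from collections import deque
from itertools import combinations
from typing import Dict, List, Optional, Tuple


def hierarchy_codes(children: Dict[str, List[str]], root: str):
    """Breadth-first walk that assigns each device its forwarding level (depth)."""
    depth: Dict[str, int] = {root: 0}
    parent: Dict[str, Optional[str]] = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in depth:
                depth[child] = depth[node] + 1
                parent[child] = node
                queue.append(child)
    return depth, parent


def on_same_path(a: str, b: str, depth: Dict[str, int],
                 parent: Dict[str, Optional[str]]) -> bool:
    """True if the shallower device is an ancestor of the deeper one."""
    if depth[a] > depth[b]:
        a, b = b, a
    node: Optional[str] = b
    while node is not None and depth[node] > depth[a]:
        node = parent[node]
    return node == a


def candidate_pairs(devices: List[str], depth: Dict[str, int],
                    parent: Dict[str, Optional[str]]) -> List[Tuple[str, str]]:
    """Keep only the device pairs that satisfy the topological constraint."""
    return [(a, b) for a, b in combinations(devices, 2)
            if on_same_path(a, b, depth, parent)]
```

With four devices in the example topology used for testing, six unconstrained pairs shrink to two, which illustrates how the constraint cuts down the combinations that need to be examined.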

In this step, a double-layer event association strategy of temporal layer association and spatial layer association based on topological constraint is adopted.

In the time-layer association part, the following information in the fault event is subjected to de-duplication association:

1. the severity of the alarm;

2. the time of the alarm;

3. the type of event.

The non-fault information can be removed through time layer association, and simultaneously, concurrent events at the same time are aggregated, specifically, events which are sent by the same equipment in a large amount in a short time are aggregated. The subsequent spatial layer association is to perform association processing on events sent by different devices within a certain time interval.
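A minimal sketch of the time layer is given below, assuming that events below a severity threshold count as non-fault information and that repeated events with the same device, type and severity are merged when they arrive within a short window. The `Event` fields, the 60-second window and the threshold are hypothetical parameters.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Event:
    device: str
    time: float      # seconds since an arbitrary origin
    severity: int    # 0 is treated as informational in this sketch
    type: str


def time_layer(events: List[Event], window: float = 60.0,
               min_severity: int = 1) -> List[dict]:
    """Drop non-fault events and aggregate bursts from the same device."""
    aggregated: List[dict] = []
    latest: Dict[Tuple[str, str, int], int] = {}   # key -> index of the open aggregate
    for ev in sorted(events, key=lambda e: e.time):
        if ev.severity < min_severity:
            continue                                # non-fault information is removed
        key = (ev.device, ev.type, ev.severity)
        idx = latest.get(key)
        if idx is not None and ev.time - aggregated[idx]["last_time"] <= window:
            aggregated[idx]["count"] += 1           # duplicate within the window
            aggregated[idx]["last_time"] = ev.time
        else:
            latest[key] = len(aggregated)
            aggregated.append({"device": ev.device, "type": ev.type,
                               "severity": ev.severity, "first_time": ev.time,
                               "last_time": ev.time, "count": 1})
    return aggregated
```

The aggregates, one per device and burst, are what the spatial layer would then correlate across different devices.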

The spatial layer association based on topological constraint specifically comprises the following steps:

1. The latest network element topology association model data in memory is obtained and converted into an association rule script file. The association rule script file contains multiple association rules. For space-based association rules, each network segment generally has its own set of association rules that conform to its fault propagation paths, and the script file contains the rules of every network segment. Examples of rule classes are: host device fault rules within the same network segment, security device fault rules within the same network segment, and alarms from network devices of serious severity. A time-based association rule is, for example, a device that alarms repeatedly within a certain time period.

2. And mapping the associated rule script file into a memory. All the rules are stored in the rule cache, and when the rules are added, deleted or changed, the rule cache is updated simultaneously. The rule cache used by the association analysis thread pool is updated immediately when the association rule script changes.

3. The latest fault events are obtained from the fault event cache for multi-event correlation. Specifically, the correlation analysis engine fetches events from the fault event cache at a fixed period of 1 minute and analyzes them. All events that satisfy a rule are kept in the cache. For example, for the time-based rule "repeated host alarms from the same device within a certain time period", an event that meets the host-event condition is said to satisfy the rule and is cached; when the cached events reach the configured time period, for example 2 minutes, the rule is said to be matched. Some rules, such as a "serious host fault alarm" rule, have no time period or event count constraint, so an event that satisfies such a rule matches it immediately. Each rule corresponds to one type of fault and may consist of several constraints; for "repeated host alarms from the same device within a certain time period", the constraints are: same device, a certain time period, and host-type events. Once a rule is matched, the matching events are removed from the cache and merged into an alarm. To guarantee performance, at most 30 alarm actions may execute concurrently (an alarm action is the handling of a fault alarm produced by correlation analysis, such as sending an email or a short message), at most 5000 alarm actions that have not yet executed may wait in a queue, and actions beyond this limit are discarded.
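The difference between an event satisfying a rule and a rule being matched can be sketched for the "repeated host alarms from the same device" example. This sketch treats the rule as matched once a minimum number of satisfying events from one device fall within the period; the event count threshold, the class name `RepeatAlarmRule` and the dictionary keys are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Optional


class RepeatAlarmRule:
    """Constraints: same device, a given event type, a bounded time period."""

    def __init__(self, event_type: str, period: float, min_count: int) -> None:
        self.event_type = event_type
        self.period = period
        self.min_count = min_count
        self.cache: Dict[str, List[dict]] = defaultdict(list)   # device -> satisfying events

    def feed(self, event: dict) -> Optional[dict]:
        """Cache a satisfying event; return an alarm once the rule is matched."""
        if event["type"] != self.event_type:
            return None                          # the event does not satisfy the rule
        device = event["device"]
        bucket = self.cache[device]
        bucket.append(event)
        # Keep only the events that still fall inside the time period.
        bucket[:] = [e for e in bucket if event["time"] - e["time"] <= self.period]
        if len(bucket) < self.min_count:
            return None                          # satisfied, but not yet matched
        matched = list(bucket)
        del self.cache[device]                   # matched events leave the cache
        return {"device": device, "rule": "repeat_" + self.event_type,
                "events": matched}
```

A rule without time or count constraints corresponds to `min_count=1`, in which case every satisfying event is turned into an alarm immediately.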

Step 105, visually displaying fault alarm information through a network topology with a tree structure;

In the embodiment of the invention, after a fault alarm is triggered and seen by the user, the fault still needs to be traced back to its source. Specifically, fault alarms are displayed as a fault tree: the related events and alarms are assembled into an alarm tree, so that the user can clearly see which alarms gave rise to a fault, and the tree reflects the reasoning process behind the alarm. On the display interface, each alarm generated by the fault locating module can be traced back to a fault tree.

The second embodiment of the present invention will be described below with reference to the drawings.

An embodiment of the present invention provides a fault locating apparatus, which is structurally shown in fig. 2 and includes:

a topology constraint model construction layer 201, configured to construct a network element topology constraint model;

a network status measurement layer 202, configured to detect an operating status of each network element device in the managed network to discover a fault event;

a fault event collection layer 203 for collecting fault events;

and the event correlation analysis layer 204 is configured to perform temporal layer correlation and spatial layer correlation on the collected fault event by using the network element topology constraint model, and determine a fault location.

Preferably, the apparatus further comprises:

and the fault positioning display layer 205 is used for displaying fault alarm information through network topology visualization of a tree structure.

The system flow is driven by a blackboard-model framework: an in-memory database with a composite architecture serves as the blackboard, and several groups of engines are established according to the business logic to update and analyze the cached data. Because there are multiple groups of engines and multiple data cache regions, the system does not use a publish-subscribe "push mode", which offers better real-time performance, but a "pull mode" in which each engine accesses the blackboard region at a fixed period according to its own service conditions. The in-memory data of the blackboard region comprises:

1. network element association model cache (caches the association relationship data of the network elements in the network);

2. event engine cache pool (caches the raw syslog data that the system passively receives from network element devices);

3. fault event cache (normalized security event objects related to device faults);

4. asset topology cache (network element topology data discovered or configured by the network management system);

5. fault alarm cache (accurate device fault alarm information obtained through correlation analysis).

The event engine cache pool specifically holds the syslog data sent by network element devices and received by the fault event collection engine; these data are filtered, and the selected data enter the fault event cache. The other source of data in the fault event cache is fault events reported directly by the network state measurement layer.
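
The pull mode described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the invention's implementation: the `Blackboard` class, its region names, `collection_cycle` and `run_periodically` are hypothetical, and the substring filter stands in for whatever selection logic the collection engine actually applies.

```python
import threading
from collections import deque


class Blackboard:
    """In-memory blackboard with one region per cache named in the text."""

    REGIONS = ("association_model", "event_pool", "fault_events",
               "asset_topology", "fault_alarms")

    def __init__(self):
        self._lock = threading.Lock()
        self._regions = {name: deque() for name in self.REGIONS}

    def put(self, region, item):
        with self._lock:
            self._regions[region].append(item)

    def drain(self, region):
        """Pull everything currently in a region and leave it empty."""
        with self._lock:
            items = list(self._regions[region])
            self._regions[region].clear()
        return items


def collection_cycle(board):
    """One poll of a hypothetical collection engine: keep fault-related lines."""
    for line in board.drain("event_pool"):
        if "fault" in line.lower():  # placeholder filter, an assumption
            board.put("fault_events", line)


def run_periodically(cycle, board, period_seconds, stop_event):
    """Pull mode: call cycle(board) every period_seconds until stop_event is set."""
    while not stop_event.wait(period_seconds):
        cycle(board)
```

Each engine would run `run_periodically` in its own thread with its own period, for example 60 seconds for the correlation analysis engine, so no engine has to be notified when another one writes to the blackboard.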

The topology constraint model construction layer constructs the topology association model of all network elements in the network. It explores the topological connections of the network under various network environments by combining protocols such as SNMP, CDP and ICMP with methods such as TRACEROUTE. The topology association model is a tree-shaped logical network structure rooted at the management center. The topology model construction engine runs in an independent thread, periodically performs topology discovery on the managed network, and writes the model data into the network element association model cache. The topology constraint model construction layer can also convert an existing asset topology into the network element association model.
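
A tree rooted at the management center can be assembled from hop sequences such as those a traceroute produces. The sketch below assumes the discovery step has already yielded, for each device, the ordered list of hops from the management center; the function name and the sample device names are hypothetical, and the actual SNMP, CDP or ICMP probing is not shown.

```python
def build_topology_tree(root, paths):
    """Build a child -> parent map from hop lists that start below the root.

    paths: iterable of hop lists, each ordered from the first hop after the
    root to the target device. The first parent seen for a node is kept, so
    the result is always a tree.
    """
    parent = {root: None}
    for hops in paths:
        previous = root
        for hop in hops:
            parent.setdefault(hop, previous)
            previous = hop
    return parent


# Hypothetical discovery result for two hosts behind different switches.
discovered_paths = [
    ["r1", "sw1", "host-a"],
    ["r1", "sw2", "host-b"],
]
topology = build_topology_tree("center", discovered_paths)
```

Here `topology["host-a"]` is `"sw1"` and `topology["r1"]` is `"center"`, so following parents from any device leads back to the management center.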

The network state measurement layer uses asynchronous network detection and diagnosis techniques and relies on the error detection and reporting mechanism of the ICMP protocol to detect the connection state of the network. Diagnostic information about network device faults is obtained by sending and receiving ICMP messages asynchronously, and diagnostic information about network service faults is obtained by establishing a TCP connection to a specified service port. The diagnostic information is formed into fault events and transmitted to the fault event collection layer through the syslog protocol.
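
The service check by TCP connection can be sketched with Python's `asyncio`. This is an illustration under assumptions: the function names, the 2-second timeout and the shape of the fault event dictionary are hypothetical, and the ICMP part is omitted because raw ICMP sockets usually require elevated privileges.

```python
import asyncio


async def probe_service(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        _reader, writer = await asyncio.wait_for(
            asyncio.open_connection(host, port), timeout)
    except (OSError, asyncio.TimeoutError):
        return False
    writer.close()
    await writer.wait_closed()
    return True


async def probe_all(targets, timeout=2.0):
    """Probe many (host, port) pairs concurrently; failures become fault events."""
    results = await asyncio.gather(
        *(probe_service(host, port, timeout) for host, port in targets))
    return [
        {"host": host, "port": port, "event": "service unreachable"}
        for (host, port), ok in zip(targets, results)
        if not ok
    ]
```

A measurement engine could call `asyncio.run(probe_all(targets))` once per detection period and forward the returned events over syslog.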

The fault event collection layer performs three steps: event receiving, event normalization and event caching. It can receive the fault events reported by the network state measurement layer as well as the security logs that various network element devices actively report through the syslog protocol. The fault event collection engine receives, in independent threads, the data reported by the network state measurement layer and by the network element devices, extracts the device fault events, generates syslog data objects directly from the data messages, normalizes the log contents according to the field description information in the log normalization configuration file to produce fault event objects in a uniform format, and finally puts the fault events into the cache pool.
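
The normalization step can be sketched as a mapping from unified field names to extraction patterns. The `FIELD_DESCRIPTIONS` table, the `key=value` log layout and the field names are assumptions made for illustration; the text does not give the format of the log normalization configuration file.

```python
import re
from collections import deque

# Hypothetical field descriptions: unified field name -> extraction pattern.
FIELD_DESCRIPTIONS = {
    "device": re.compile(r"device=(\S+)"),
    "kind": re.compile(r"type=(\S+)"),
    "severity": re.compile(r"level=(\S+)"),
}

fault_event_cache = deque()  # stands in for the fault event cache pool


def normalize(raw_message, descriptions=FIELD_DESCRIPTIONS):
    """Turn one raw syslog message into a fault event with a uniform format."""
    event = {"raw": raw_message}
    for field_name, pattern in descriptions.items():
        match = pattern.search(raw_message)
        event[field_name] = match.group(1) if match else None
    return event


def receive(raw_message):
    """Receive, normalize and cache one message: the three steps in the text."""
    event = normalize(raw_message)
    fault_event_cache.append(event)
    return event
```

Supporting a new device type in this sketch only requires another set of patterns, which is the benefit of keeping the field descriptions in configuration rather than in code.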

Based on the propagation characteristics of network faults, the event correlation analysis layer uses an association rule mining algorithm with topological constraints. It obtains the hierarchical relationship between network elements from the established topology association model and assigns a hierarchical code to the device of each alarm event. The constraint conditions of the association rule mining process are derived from the connections between network elements reflected by the topology and from the propagation paths of faults. These conditions determine whether two or more items may be joined into one set during association rule mining. Because the topological constraints are applied before the join step, the number of candidate combinations to be examined is greatly reduced, which improves both the timeliness of fault positioning and the accuracy of the results.
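
One possible form of hierarchical coding is sketched below. The coding scheme (a tuple of child indices along the path from the management center) and the join condition (one device lies on the path from the root to the other) are assumptions chosen for illustration; the text does not define the encoding or the exact constraint.

```python
def hierarchical_codes(children, root):
    """Assign tuple codes: the root gets (), its k-th child gets (k,), and so on."""
    codes = {root: ()}
    pending = [root]
    while pending:
        node = pending.pop()
        for index, child in enumerate(children.get(node, []), start=1):
            codes[child] = codes[node] + (index,)
            pending.append(child)
    return codes


def may_join(code_a, code_b):
    """True if one device lies on the path from the root to the other."""
    shorter, longer = sorted((code_a, code_b), key=len)
    return longer[:len(shorter)] == shorter


# Hypothetical topology: parent -> ordered list of children.
children = {
    "center": ["r1"],
    "r1": ["sw1", "sw2"],
    "sw1": ["host-a"],
    "sw2": ["host-b"],
}
codes = hierarchical_codes(children, "center")
```

With these codes, `may_join(codes["sw1"], codes["host-a"])` holds, while `sw1` and `host-b` sit on different branches and would never be combined into one candidate set.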

The third embodiment of the present invention will be described below with reference to the accompanying drawings.

The embodiment of the invention provides a fault positioning method, wherein the principle of an association rule mining algorithm based on topological constraint is shown in figure 3.

Based on the network element topology constraint model produced by the topology constraint model construction layer, the network element topology matching algorithm can, for any input network element sequence, query the network topology database to determine whether the sequence is contained in a network element cluster. If so, it returns true, indicating that the network elements in the sequence are topologically related, that is, an alarm propagation path exists between them. If not, it returns false: the input sequence is not contained in any network element cluster, so the network elements in it are not topologically related and no alarm propagation path exists between them.
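
The containment query can be sketched in a few lines. The in-memory list of sets stands in for the network topology database, and the cluster contents are hypothetical.

```python
def topology_match(sequence, clusters):
    """Return True if every element of the sequence lies in one common cluster."""
    wanted = set(sequence)
    return any(wanted <= cluster for cluster in clusters)


# Hypothetical network element clusters, one per alarm propagation path.
clusters = [
    {"center", "r1", "sw1", "host-a"},
    {"center", "r1", "sw2", "host-b"},
]
```

For example, `topology_match(["sw1", "host-a"], clusters)` returns true because both elements share the first cluster, while `topology_match(["sw1", "host-b"], clusters)` returns false because no single cluster contains both.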

The association rule mining algorithm screens the frequent patterns according to the returned result, filtering out spurious frequent patterns for which no alarm propagation path exists. The FP-Growth algorithm mines with a tree structure, and frequent patterns can only be generated after the tree has been built. Therefore, the check against the network topology constraints is performed once, after the frequent patterns have been mined, and the non-conforming patterns are then deleted from the final set of frequent patterns in a single pass.
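
The post-mining filter can be illustrated as follows. Note that `mine_frequent_patterns` is a brute-force stand-in that simply counts itemsets; it is not FP-Growth and is only there to make the sketch self-contained. All names, the sample alarm groups and the support threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations


def mine_frequent_patterns(transactions, min_support):
    """Brute-force stand-in for FP-Growth: count every itemset of size >= 2."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(2, len(items) + 1):
            for itemset in combinations(items, size):
                counts[frozenset(itemset)] += 1
    return {p: c for p, c in counts.items() if c >= min_support}


def has_propagation_path(pattern, clusters):
    """True if all devices of the pattern lie in one network element cluster."""
    return any(pattern <= cluster for cluster in clusters)


def filter_by_topology(patterns, clusters):
    """Delete, in one pass, the patterns that violate the topology constraint."""
    return {p: c for p, c in patterns.items()
            if has_propagation_path(p, clusters)}


# Hypothetical alarm groups (devices alarming in the same time window).
transactions = [["r1", "sw1", "host-b"], ["r1", "sw1"], ["sw1", "host-b"]]
clusters = [{"r1", "sw1", "host-a"}, {"r1", "sw2", "host-b"}]
frequent = mine_frequent_patterns(transactions, min_support=2)
valid = filter_by_topology(frequent, clusters)
```

In this data both `{r1, sw1}` and `{sw1, host-b}` are frequent, but only `{r1, sw1}` survives the filter, because `sw1` and `host-b` never share a cluster and so have no alarm propagation path between them.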

The embodiment of the invention provides a fault positioning method and device that construct a network element topology constraint model, detect the operating state of each network element device in the managed network to discover fault events, collect the fault events, and perform time layer association and space layer association on the collected fault events using the network element topology constraint model to determine the fault position. The alarm data are mined with the help of the network topology model, and association rules without a topological connection are filtered out, thereby improving mining efficiency and correctness.

According to the propagation characteristics of network faults, most network security faults are not determined by a single network security event but by the interaction of multiple network security alarms from different times and different sources, so merely recording and simply analyzing individual network security alarms cannot meet the requirements of network security. The technical scheme provided by the embodiment of the invention builds on traditional network alarm association mining and uses a topology-constrained association rule mining algorithm to combine alarm association with the specific nodes of the network topology, which greatly improves fault positioning efficiency and adaptability to complex networks.

The embodiment of the invention also constructs a network constraint topological model, and processes the alarm data in mining through the network constraint topological model, and filters out the association rule without the topological connection relation, thereby improving the mining efficiency and correctness.

The fault positioning method and device provided by the embodiment of the invention address the main problems that frequently arise in fault positioning under complex network environments, such as real-time performance, stability and extensibility. They classify and accurately locate the various fault problems of complex IT resources in a computer network in detail, reflect the security condition of the computer network truly and accurately, and are suitable for computer networks.

By adopting an open fault diagnosis strategy and using the log string matching and fast dynamic parsing mechanism deployed at the event collection layer, fault events from various security devices can be parsed quickly, device faults can be detected actively and non-invasively, and the various faults that occur during the operation of network security devices, network devices, host servers, operating systems, databases and middleware can be analyzed, diagnosed and located.

In addition, the fault positioning analysis results are displayed graphically in real time: fault alarm information is visualized through a tree-structured network topology, the spatial and timing information of network faults is shown visually, and the event information related to a fault is shown in a paged table.

It will be understood by those of ordinary skill in the art that all or part of the steps of the above embodiments may be implemented using a computer program flow, which may be stored in a computer readable storage medium and executed on a corresponding hardware platform (e.g., system, apparatus, device, etc.), and when executed, includes one or a combination of the steps of the method embodiments.

Alternatively, all or part of the steps of the above embodiments may be implemented by using an integrated circuit, and the steps may be respectively manufactured as an integrated circuit module, or a plurality of the blocks or steps may be manufactured as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The devices/functional modules/functional units in the above embodiments may be implemented by general-purpose computing devices, and they may be centralized on a single computing device or distributed on a network formed by a plurality of computing devices.

Each device/function module/function unit in the above embodiments may be implemented in the form of a software function module and may be stored in a computer-readable storage medium when being sold or used as a separate product. The computer readable storage medium mentioned above may be a read-only memory, a magnetic disk or an optical disk, etc.

Any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present disclosure, and all such changes or substitutions are included in the scope of the present disclosure. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method of fault location, comprising:

constructing a network element topology constraint model;

collecting fault events;

carrying out time layer association and space layer association on the collected fault events by utilizing the network element topological constraint model, and determining the fault position;

wherein, the performing time layer association and spatial layer association on the collected fault event by using the network element topological constraint model, and determining the fault position comprises:

storing all rules in the associated rule script file into a rule cache;

acquiring the latest fault event from the fault event cache, performing multi-event association, and storing all fault events meeting the rules in the cache;

2. The method of claim 1, wherein in an unauthorized network environment, the constructing the topology constraint model of the network element comprises:

3. The method of claim 1, wherein in an authorized network environment, the constructing the topology constraint model of the network element comprises:

4. The method of claim 1, wherein the detecting the operation status of each network element device in the managed network to discover the fault event comprises:

after the fault is found, the fault event is reported in a SYSLOG protocol.

5. The fault localization method of claim 1, wherein the collecting fault events comprises:

collecting fault events reported by SYSLOG protocol;

6. The method according to claim 5, wherein the step of normalizing the collected fault events to form a unified fault event according to the fault event and the general log information specifically comprises:

7. The fault localization method of claim 5, wherein the fault event comprises the following information:

8. The method according to claim 1, wherein the step of performing temporal layer correlation and spatial layer correlation on the collected fault event by using the network element topology constraint model, and determining the fault location further comprises:

9. A fault locating device, comprising:

the topology constraint model construction module is used for constructing a network element topology constraint model;

the network state measurement module is used for detecting the running state of each network element device in the managed network so as to find out a fault event;

the fault event acquisition module is used for acquiring fault events;

the event correlation analysis module is used for carrying out time layer correlation and space layer correlation on the collected fault events by utilizing the network element topological constraint model to determine fault positions;

wherein,

the determining the fault position by using the network element topological constraint model to perform time layer association and spatial layer association on the collected fault event comprises:

storing all rules in the associated rule script file into a rule cache;

10. The fault locating device of claim 9, further comprising:

and the fault positioning display module is used for displaying the fault alarm information through the network topology visualization of the tree structure.