WO2001008016A1 - Network managing system - Google Patents
Network managing system
- Publication number
- WO2001008016A1 (PCT/JP1999/004041)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- failure
- fault
- cause
- value
- event
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/079—Root cause analysis, i.e. error or fault diagnosis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/0709—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0631—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/08—Configuration management of networks or network elements
- H04L41/0803—Configuration setting
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/08—Configuration management of networks or network elements
- H04L41/085—Retrieval of network configuration; Tracking network configuration history
- H04L41/0853—Retrieval of network configuration; Tracking network configuration history by actively collecting configuration information or by backing up configuration information
- H04L41/0856—Retrieval of network configuration; Tracking network configuration history by actively collecting configuration information or by backing up configuration information by backing up or archiving configuration information
Definitions
- the present invention generally relates to a network management system for managing faults on a network, and more particularly, to a function of identifying a root cause of a fault from various types of fault symptoms observed on the network.
- the present invention relates to a network management system having such a function.
- Background art
- An “event” is an exceptional condition that occurs in a network. Events include hardware and software failures, outages, performance bottlenecks, inconsistent network configurations, unintended consequences of poor design, and malicious damage such as computer viruses.
- Symptoms refer to observable events; “symptom” is used in the same sense as “symptom event”. Examples include: “communication with a certain destination A always takes a long time and retransmission is required”, “characters are always garbled for a certain destination B”, and “an acknowledgment is never returned from a certain destination C”.
- “Problem” refers to the root cause of a failure.
- the problem is not always observable. For example, damage to the transmitter of a communication device, disconnection of a communication cable, and lack of communication line capacity are examples of “problems”. The words “problem” and “failure cause” are used in the same sense.
- An “object” is something that has a clear boundary and meaning for a concept, abstraction, or problem of interest.
- An "object class” is a group of objects that have similar properties (attributes), common behavior (operations), common relationships with other objects, and common meanings.
- Class has the same meaning as object class.
- An “object instance” is a specific object belonging to a certain object class.
- An “object instance” is simply called an “instance”.
- One problem event on one resource in the network can cause many symptom events on multiple resources involved. Some problems are observable, but generally not always observable. Therefore, it is necessary to identify the problem that is the root cause of the failure from multiple symptoms. Network administrators must be able to correlate the various observed symptom events with the problem in order to identify the root cause problem.
- the static aspects of the modeling of managed objects and the modeling of event propagation are abstracted, and an object-oriented concept is introduced to perform the modeling efficiently.
- various managed objects are first modeled as classes. Then define the relationships between the classes.
- certain events are modeled as propagating along the relationships between classes.
- the network to be managed is modeled based on the class system defined in this way. In other words, the managed object in the network is abstracted as one instance of a certain class. Next, the network is modeled as an event that propagates these instances (managed objects) according to the relationship established between the class to which the instance belongs and the class to which other instances belong. . Based on the network modeled in this way, the correlation between the problem and the symptoms is specified in advance.
- a symptom event propagation rule is prepared in advance.
- This propagation rule formalizes the relationships by which the problem event that is the root cause of a failure propagates to symptom events of the failure, and by which a symptom event propagates to other symptom events.
- This set of propagation rules is called a propagation model.
- event propagation is modeled such that each event propagates between instances according to the relationship defined between the classes of managed objects.
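The class/relation/propagation modeling described above can be sketched as plain data. This is a purely illustrative sketch, not the patent's actual data structures; the class and relation names (Application, TcpNode, Router, Underly, ConnectedTo, RoutingError, TcpDisconnect) are taken from the example given later in this document.

```python
# Class relations: (class, relation) -> related class
RELATIONS = {
    ("Application", "Underly"): "TcpNode",
    ("TcpNode", "ConnectedTo"): "Router",
}

# Propagation rules: a problem on the source class causes a symptom on the
# class reached via the named relation (src class, problem, relation,
# destination class, symptom).
PROPAGATION = [
    ("Router", "RoutingError", "ConnectedTo", "TcpNode", "TcpDisconnect"),
]

def propagate(cls, problem):
    """Return the (class, symptom) pairs that a problem propagates to."""
    return [(dst, symptom)
            for src, prob, rel, dst, symptom in PROPAGATION
            if src == cls and prob == problem]
```

Under this sketch, a RoutingError on a Router instance propagates along the connection relation to a TcpDisconnect symptom on TcpNode instances.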
- U.S. Pat. No. 5,528,516 particularly relates to a fault management function based on an event correlation table approach, which takes as input the symptoms of the large number of faults observed when a network failure occurs and quickly identifies the root cause.
- U.S. Pat. No. 5,661,668 relates to a technique for generating an event correlation table used in the above-mentioned U.S. Pat. No. 5,528,516. That is, the method disclosed in US Patent No. 5,661,668 includes the following steps.
- the problem that is the root cause of a failure can be identified by the relatively simple task of comparing the symptom pattern of the actual failure with the symptom patterns in the event correlation table. Therefore, this conventional technique greatly facilitates identifying the root cause of a problem.
- this conventional technique still has the following problems to be solved.
- the event correlation table used by the above-mentioned conventional technology requires a memory capacity proportional to (the number of problems) × (the number of symptoms). A large network therefore requires an enormous amount of storage, which limits the applicability of this method to large-scale networks.
- an object of the present invention is to provide a network management system capable of effectively specifying a problem with a small amount of memory usage.
- Still another object of the present invention is to provide a network management system that can specify a problem with a small amount of memory and a small amount of calculation, and that does not cause any omission in specifying a root problem.
- the present invention relates to a network management system and a machine-readable recording medium storing a program for causing a computer to operate as such a network management system.
- This network management system includes a configuration management information schema definition storage device for holding a configuration management information schema definition that describes classes representing devices on the network and failure events observed in those classes;
- a propagation model storage device for storing a propagation model between classes, a configuration information storage device for storing the actual configuration information of devices on the network, and an inference device that includes a counter, provided for each failure event occurring in each device, based on the configuration management information schema definition, the propagation model, and the configuration information, together with a comparison device for inferring the cause of the failure event by a predetermined method based on the contents of the counter.
- the counter has a storage device including a plurality of storage areas for holding each count value, a device for setting an upper limit value for each storage area based on the propagation model, and a logic circuit for updating the value of the storage area corresponding to an input failure event.
- By the counter updating the contents of the storage areas corresponding to each failure cause, the inference apparatus infers the failure cause based on the contents of the storage areas.
- the storage space required for this is proportional to the number of instances included in the configuration information, which is clearly advantageous compared to approaches such as an event correlation table that require a storage capacity proportional to the square of the number of instances.
- the counter further includes a device for holding counting configuration information, composed only of the connection relations between devices; in response to the input of a failure event, the logic circuit determines from predetermined rules and this configuration information which counters should operate, and operates them.
- In the operation of the counter, if device connection information is available, the logic circuit can trace the propagation of events, and the processing can be sped up.
- the counter generates a plurality of storage areas when the network management system is started, and sets an upper limit value for each of the storage areas. If a storage area is created at the time of startup, the subsequent counting process for a failure event by the counter can be executed quickly.
- the counter reserves a necessary storage area and sets the corresponding upper limit value in response to the input of a failure event, securing the required storage space only when it is needed.
- the storage device can be used efficiently without using a storage area that is rarely used.
- the inference device infers the true cause of the failure in response to the input of each failure event. Inference is thus performed whenever a failure event occurs, even if the network administrator does not request it, so inference can be performed at an appropriate time without depending on the administrator, and the time lag from failure occurrence to inference is reduced compared to inference performed at regular intervals. Alternatively, the inference device may infer the true cause of the failure at predetermined time intervals; since inferences are then made at regular intervals, they can be made reliably at appropriate times without action by the network administrator.
- the inference apparatus calculates, for each failure cause, a distance from the true failure cause, defined based on the corresponding count value, and presents a predetermined number of failure cause candidates starting from the one with the smallest distance. Since a predetermined number of candidates is presented starting from the shortest distance, the possibility that the true cause is missing from the candidates is small.
- Alternatively, the inference device calculates, for each failure cause, a distance from the true failure cause, defined based on the corresponding count value, and presents as candidates only those causes whose distance is smaller than a predetermined threshold value. Since only causes with a distance below the threshold are presented, those with a high possibility of being the failure can be presented to the network administrator.
- the inference circuit presents the candidates for the cause of the failure in the sorted order according to the respective distances.
- candidates can be presented to the network administrator according to the order of possible failures. The network administrator can reliably remove the fault by checking for possible faults in this order.
- the inference apparatus calculates a certainty factor according to the count value of each fault counter, and attaches it to each of the presented failure cause candidates. With the certainty factors, the network administrator can intuitively grasp how likely each candidate is to be the cause of the failure.
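The distance-based ranking with certainty factors can be sketched as follows. The concrete formulas are assumptions for illustration only: here distance is taken as (upper limit − count value) and certainty as count/upper_limit; the patent states only that both are computed from the count values by a predetermined method.

```python
def rank_candidates(counters, top_n=3):
    """counters: {cause: (count, upper_limit)}.
    Returns up to top_n (cause, distance, certainty) tuples,
    smallest distance first."""
    scored = []
    for cause, (count, limit) in counters.items():
        distance = limit - count               # 0 means all expected symptoms seen
        certainty = count / limit if limit else 0.0
        scored.append((distance, cause, certainty))
    scored.sort()                              # sort by distance, then cause name
    return [(cause, d, cf) for d, cause, cf in scored[:top_n]]
```

A cause whose counter has reached its upper limit has distance 0 and certainty 1.0, and is ranked first.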
- the counter includes a counted-area storage device for storing information specifying the storage areas updated by the logic circuit, and the inference apparatus includes in its calculation only the storage areas recorded in the counted-area storage device.
- the storage area that has not been updated is not included in the calculation for inference by the inference device. Since only the storage areas that need to be calculated are to be calculated, the calculation can be sped up.
- the storage device includes an area for storing a flag indicating whether or not each storage area has been updated in response to a given failure event; the logic circuit maintains the flags and determines, according to the content of each flag, whether to update the value in the corresponding storage area.
- the same fault event may reach a given cause through more than one causal path. In that case the storage area corresponding to that cause may be updated more than once, and the calculation result would not be correct. Therefore a flag is set, and a storage area that has already been updated for a given failure event is not updated again. By taking such measures, the storage areas can be updated without error even under such causal relationships.
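The role of the update flag can be shown with a minimal sketch (class and method names are illustrative, not from the patent): when one fault event can reach the same counter via two causal paths, the flag ensures the counter is incremented at most once per event.

```python
class FlaggedCounter:
    def __init__(self):
        self.count = 0
        self.updated = False        # the update flag

    def increment(self):
        if not self.updated:        # increment at most once per fault event
            self.count += 1
            self.updated = True

    def clear_flag(self):           # called when processing of the next event begins
        self.updated = False
```

Two increments triggered by the same event (via two causal paths) raise the count only once; clearing the flag readies the counter for the next event.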
- the update by the logic circuit is a process of incrementing the value in each storage area within the propagation range of the input fault event by a predetermined value, where the predetermined value is greater than 0 and at most 1.
- the counter further includes a cache device for holding information specifying the storage areas updated by the logic circuit in response to the input of a certain failure event; when the same failure event is input again, the logic circuit updates the storage areas specified based on the information stored in the cache device.
- By retaining information that identifies the storage areas updated in response to a failure event input, when the same failure event is input again, the target storage areas can be accessed and updated directly without following the propagation path. Therefore, the processing can be sped up.
- the cache device holds information indicating the time when the information specifying the storage areas was last referenced, and the counter further includes a device for deleting from the cache device information that has not been referenced for a certain period of time.
- the storage area can be used effectively by erasing from the cache device information that is not referenced for a certain period of time.
- the counter further includes a propagation rule detection device for accumulating the input fault events, detecting cross-correlations between fault events, and feeding back to the logic circuit new fault propagation rules not described in the propagation model; the logic circuit receives this feedback and reconstructs its predetermined rules. Since the rules for inference can be changed dynamically based on the history of actual failure events, inference of the failure cause becomes more reliable.
- FIG. 1 is a block diagram of a network management system according to one embodiment of the present invention.
- FIG. 2 is a block diagram of the fault management unit 34 shown in FIG.
- FIG. 3 is a block diagram of the fault counter 54 shown in FIG.
- FIG. 4 is a diagram schematically showing a configuration of the counter value storage area 70 shown in FIG.
- FIG. 5 is a diagram for explaining the concept of a causal loop between a cause and a symptom.
- FIG. 6 is an external view of a computer for realizing the network management system according to the present invention.
- FIG. 7 is a block diagram of a computer for realizing the network management system according to the present invention.
- FIG. 8 is a flowchart showing a failure event increment process in one embodiment of the present invention.
- FIG. 9 is a flowchart of a process for identifying a cause of a failure started by a timer in one embodiment of the present invention.
- FIG. 10 is a flowchart of the cache clearing process started by the timer in one embodiment of the present invention.
- FIG. 11 is a flowchart showing the processing contents of the failure counter generation unit 56.
- FIG. 12 is a flowchart of the failure cause identification processing.
- FIG. 13 is a flowchart illustrating another example of the failure cause identification processing.
- FIG. 14 is a flowchart illustrating still another example of the failure cause identification processing.
- FIG. 15 is a flowchart of the process of incrementing the fault counter and specifying the cause of the fault in another embodiment of the present invention.
- The present invention, without preparing an event correlation table in advance, increments, each time a symptom event occurs, the counter provided for each fault that can be associated with the symptom event (referred to as a “fault counter”), based on the propagation model, the managed object model, and the configuration information of the actual network. Since the number of propagation rules is finite and the types of symptom events are finite, each fault counter should count up to a certain number when all of its symptom events are considered to have occurred.
- This number is obtained in advance as the upper limit of the count of each fault counter; the obtained number is compared with the actual number of symptom event occurrences and the count of each fault counter, to identify the failure cause corresponding to the occurrence pattern of the symptom events.
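The core counting idea above can be sketched in a few lines. This is a hedged illustration, not the patent's implementation: the model here maps each cause directly to the set of symptoms it would generate, the upper limit is the size of that set, and the cause with the smallest (upper limit − count) distance is reported.

```python
# cause -> set of symptoms that cause generates on the propagation model
MODEL = {
    "P0": {"s1", "s2", "s3"},   # upper limit of P0's fault counter is 3
    "P1": {"s2", "s4"},         # upper limit of P1's fault counter is 2
}

def identify(observed):
    """Return the cause whose counter comes closest to its upper limit."""
    counts = {cause: len(symptoms & observed)
              for cause, symptoms in MODEL.items()}
    # distance = upper limit - count; the smallest distance wins
    return min(MODEL, key=lambda c: len(MODEL[c]) - counts[c])
```

When all three of P0's symptoms are observed, P0's distance is 0 and it is identified; an observation matching P1's full symptom set identifies P1 instead.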
- a network management system 20 includes: a configuration management information schema definition 40 that holds a database schema for defining network configuration information; a configuration management unit 30 for managing network configuration information using this schema; an event database 44 for storing configuration information data and failure information data as events; a fault management unit 34 that collects network failure events and estimates the cause of the failure; and a user interface unit 36 that receives information indicating the inferred root cause problem from the fault management unit 34 and presents it to the network administrator.
- the configuration management information schema definition 40, the configuration information section 38, and the propagation model 42 are stored in a storage device such as a memory. The storage device in which these are stored may be the same or separate.
- the fault management section 34 includes: an event database section 62 for storing fault events from the network in the event database 44; a propagation model unit 58 for managing, for reference, the rules of fault propagation (propagation model 42); a fault counter generation unit 56 that generates the fault counter described later from the configuration information, its schema definition information, and the propagation model 42; a fault counter 54 that provides a counter corresponding to each failure cause and increments the counter corresponding to the cause when a fault event is input; and a count comparator 52 that compares the counter values of the fault counter 54.
- the fault counter 54 includes: counter configuration information 78, consisting only of the information required to identify the device type and device from the configuration information, together with the connection information between devices; the counter value storage area 70, composed of an array of counters corresponding to the failures of each device, generated from the propagation model 42 and the configuration information section 38; the counter value increment logic 74, which increments the counters corresponding to each failure in the counter value storage area 70; the counter increment rule 80, generated from the propagation model 42 and referenced by the counter increment logic 74, which describes, for the device types and connection relations, which device class's fault counter should be incremented for a fault event input to the fault counter; the incremented counter index storage area 76, where the indexes of the incremented fault counters are recorded; and the cache information 72, which records which fault counters were incremented as a result of applying the rules to the fault event of a specific device.
- the counter value storage area 70 includes, for each failure cause (problem P0, P1, P2, ...): an index 90 (a serial number in this embodiment); an upper limit value 92, set in advance to the number of fault events that the failure cause corresponding to the counter is considered to give rise to on the propagation model; an area that holds the count value; and an update flag 96. All of these areas are maintained by the counter value increment logic 74.
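One entry of the counter value storage area 70 can be sketched as a simple record (field names are illustrative; the upper limit values shown are arbitrary example numbers, not taken from the patent):

```python
from dataclasses import dataclass

@dataclass
class CounterEntry:
    index: int             # index 90: serial number identifying the fault cause
    upper_limit: int       # upper limit 92: symptom count expected on the model
    count: int = 0         # area holding the current count value
    updated: bool = False  # update flag 96

# The storage area as an array of such entries, one per failure cause.
storage_area = [
    CounterEntry(index=0, upper_limit=3),   # problem P0
    CounterEntry(index=1, upper_limit=2),   # problem P1
]
```

The counter value increment logic would locate an entry by its index and raise `count` (once per event, guarded by `updated`).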
- Fig. 5 shows the Causality Mapping when the same symptom occurs along two causal paths from a certain failure cause. Now, it is assumed that the following causal relationships exist for the failures.
- the network management system shown in FIG. 1 is actually realized by software running on a computer such as a personal computer or a workstation.
- Figure 6 shows the appearance of the computer that implements the network management system. Referring to FIG. 6, this computer has a computer main unit 100 equipped with a CD-ROM (Compact Disc Read-Only Memory) drive 110 and an FD (Flexible Disk) drive 112, a display 102, a printer 104, a keyboard 106, and a mouse 108.
- Figure 7 shows the configuration of this computer in block diagram form. As shown in FIG. 7, the computer main unit 100 constituting the network management system 20 includes the CD-ROM drive 110 and the FD drive 112, as well as a CPU 116, each connected to a bus 126.
- CD-ROM: Compact Disc Read-Only Memory
- FD: Flexible Disk
- This network management system is realized by the computer hardware and by software executed by the CPU 116.
- Such software is stored in a storage medium such as a CD-ROM 122 or an FD 124 and distributed, is read from the storage medium by the CD-ROM drive 110 or FD drive 112, and is stored once on the hard disk 114. It is then read from the hard disk 114 into the RAM 120 and executed by the CPU 116.
- the computer hardware itself shown in FIGS. 6 and 7 is general. Therefore, the most essential part of the present invention is the software stored in the storage medium such as the CD-ROM 122, the FD 124 and the hard disk 114.
- the operation of the computer itself shown in FIGS. 6 and 7 is well known and will be apparent to those skilled in the art.
- the software is composed of a part that, after the device is started, receives fault events and increments the fault counters (FIG. 8), a part that is started periodically by the timer 50 to identify the cause of the fault (FIG. 9), and a part that is activated by the timer to clear the cache information 72 (FIG. 10).
- the software structure of the part that increments the failure counter will be described.
- When this device is started, it first secures the counter value storage area 70, sets its upper limits, and generates the counter value increment logic 74 (140). The system then waits for a failure event to be input (142); when a failure event is input, the failure counter increment process (144) is executed, after which the system again waits for the input of a failure event (142). Details of these processes will be described later.
- failure cause identification processing 150 is started periodically by timer 50.
- the user interface unit 36 presents the problem corresponding to the failure counter having the minimum value to the network administrator as the root cause of the failure.
- This processing corresponds to the processing performed by the count comparator 52 in FIG.
- timer 50 clears cache information 72 periodically.
- the cache information 72 holds the date when the cache was last referenced as an attribute, and the timer 50 periodically checks this date and discards cache information that has not been referenced for a long time.
- the storage capacity of the cache information 72 can be saved, and it is possible to cope with changes in the network configuration.
- the process for specifying the cause of the failure and the process for clearing the cache information 72 are started by the timer independently of the process for incrementing the failure counter.
- the present invention is not limited to this.
- these processes may be started in response to a request from the network administrator, or in response to the satisfaction of some condition in the process of incrementing the failure counters.
- Their implementation will depend on the design requirements and will be apparent to those skilled in the art.
- the process 140 for generating a fault counter will be described with reference to FIG. This processing corresponds to the processing of the failure counter generation unit 56 in FIG.
- an index unique to each class is assigned to the cause of failure in each class (180). For example, ApplicationDown, TcpDisconnect, RoutingError, etc. described in the examples below correspond to this.
- a counter value storage area is generated by pairing the index of each managed object with the index of the cause of the failure (184).
- a failure counter in the counter value storage area 70 shown in FIG. 4 is specified and accessed by the pair of the managed object index and the failure cause index.
- For the fault counters corresponding to problems P0, P1, etc., which are the failure causes shown in Fig. 4, a counter is actually assigned for each managed object.
- the range of the failure is predicted, and the upper limit is set (186).
- the counter value increment logic 74 shown in FIG. 3 is generated.
- the generation of the counter value increment logic 74 is also performed by software. Prior to this processing, it is assumed that the following logic template is prepared in advance.
- appendCache(mo, symptom, mo1, symptom1); return;
- mo2 = search(mo1, Relation, rClassName); if (mo2 == NOT_FOUND) return;
- the counter value increment logic 74 is generated as follows.
- the first class is selected for processing (188).
- it is determined whether or not all classes have been processed (190). If all classes are finished, this routine is finished. If any classes have not been completed, control proceeds to step 200.
- In step 200, the macro (b) is expanded at the position indicated as “addition position of case clause macro for Class” in the counter logic template (a).
- “ClassName” in (b) is replaced with the name of the class being processed (for example, “Application”). Subsequent positions in this processing move after the expanded part.
- At the position indicated as “addition position of case clause macro for SymptomName” in the copy of (b) expanded in step 200, the macro (c) is expanded in order for all failure events defined for the class (202). At this time, “ClassName” in the macro (c) is replaced with the name of the class being processed, “SymptomName” with the name of each fault event defined in the class, “Relation” with the relation defined for the class in the propagation model, “rClassName” with the name of the class associated with this class by that relation, and “ProblemName” with the failure cause assigned in step 180 for that class. Thus, in step 202, the macro (c) is expanded for all the fault events defined in the class.
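This per-class, per-symptom template expansion can be illustrated with simple string templating. The template text and the `expand` helper are assumptions standing in for the patent's macro mechanism; only the placeholder names (ClassName, SymptomName, Relation, rClassName, ProblemName) come from the description above.

```python
# A stand-in for the case clause macro (c), with the five placeholders.
CASE_TEMPLATE = ("case {SymptomName}: increment({ClassName}, {ProblemName}); "
                 "propagate_via({Relation}, {rClassName});")

def expand(class_name, events):
    """Expand the case-clause macro once per fault event defined for a class.
    events: list of dicts with SymptomName, Relation, rClassName, ProblemName."""
    return [CASE_TEMPLATE.format(ClassName=class_name, **e) for e in events]
```

For a class with one fault event, one case clause is produced with every placeholder substituted.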
- If it is determined in step 190 that the processing for all classes has been completed, the counter value increment logic 74, obtained by expanding (b) and (c) for all classes defined in the class definition, is complete. An example is shown below.
- appendCache(mo, symptom, mo1, symptom1); return;
- An instance of the class Application has an Underly relationship with an instance of the class TcpNode.
- An instance of class TcpNode has a ConnectedTo relationship to an instance of class Router.
- The failure RoutingError of class Router propagates to the failure TcpDisconnect of a TcpNode connected to it.
- executeCache(): a function that increments the fault counters according to the cache data related to the specified failure event of the specified managed object.
- appendCache(): a function that records, as cache data, which fault counters were incremented for the specified failure event of the specified managed object.
- the present invention does not store and hold the Causality Mapping itself; instead it stores only the counter value increment logic, the rules, and the counter configuration information 78 consisting solely of the connection relationships between devices.
- failure counter increment processing shown in step 144 of FIG. 8 will be described.
- Failure events are input to the counter in time series.
- To each failure event, a device identifier and device type information for specifying the device in the actual configuration information are added as attributes.
- the counter value increment logic 74 queries the counter configuration information 78 and increments the value of the fault counter for the problem corresponding to the fault in the managed object corresponding to the device.
- For example, the counter logic first finds the "TcpNode" management object corresponding to the device and increments its fault counter.
- the counter logic stores the incremented index of the failure counter in the incremented counter index storage area 76.
- The counter logic then takes this management object as an input again and searches the counter configuration information 78 for a router connected to it, and increments the RoutingError counter of that router. In this way, the counter logic increments fault counters one after another, and the processing for the fault event terminates when there is nothing left to propagate to. When the counter logic increments a fault counter for a certain fault event, it sets a flag associating that fault counter with the fault event; the counter is incremented only if the flag is not yet set, so each counter is incremented at most once per fault event.
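The counter-increment walk just described can be sketched as follows. All names here (`FaultCounters`, the dictionary shapes) are assumptions chosen to mirror the TcpNode/Router example, not identifiers from the patent.

```python
from collections import defaultdict

# Sketch of the increment walk: starting from the device that reported the
# fault, increment the matching fault counter, then follow the configured
# relations upstream to increment counters of causes the fault could have
# propagated from. A per-event flag ensures each counter is incremented at
# most once per fault event.
class FaultCounters:
    def __init__(self, device_class, connected_to, rules):
        self.device_class = device_class  # device -> class name
        self.connected_to = connected_to  # device -> upstream device
        self.rules = rules                # (class, symptom) -> upstream (class, symptom)
        self.counters = defaultdict(float)
        self.flags = set()                # (event_id, counter key) pairs already counted

    def increment(self, event_id, device, symptom):
        while device is not None:
            key = (device, symptom)
            if (event_id, key) not in self.flags:  # at most once per event
                self.flags.add((event_id, key))
                self.counters[key] += 1.0
            upstream = self.rules.get((self.device_class[device], symptom))
            if upstream is None:
                break                              # nothing left to propagate to
            device = self.connected_to.get(device)
            symptom = upstream[1]
```

A TcpDisconnect on a TcpNode thus increments the node's own counter and then the RoutingError counter of the router it is connected to.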
- The templates shown in "(b) Case clause macro for Class" and "(c) Case clause macro for Symptom name" realize this processing: words in each template are replaced with specific class names, and the result is expanded.
- The network management system notifies the network administrator of the failure cause corresponding to the smallest of the values calculated in this way, as a candidate for the true cause of the failure.
- The failure cause identification process will now be described. Referring to FIG. 12, when the process of identifying the cause of the failure is started, all the input events are first acquired (300). As an initial setting, a predetermined large constant (for example, the maximum number that the work area can hold) is substituted into the work area for obtaining the minimum value (302). This prepares the work area to hold, at each point in the calculation described below, the minimum value found so far.
- One index number is extracted from the indexes stored in the incremented counter index storage area 76 (304). It is then determined whether the processing has been completed for all the index numbers stored there (306). If so, control proceeds to step 308, the details of which are described later. Note that the calculation described below is performed not on all indexes but only on those stored in the incremented counter index storage area 76; the other counters have not been incremented, so omitting them has no effect on the result. This reduces the amount of calculation and increases the processing speed.
- The upper limit value and the current value of the counter indicated by the index number extracted in step 304 are read (320). The difference between the total number of input failure events and the read counter value is calculated (322). The difference between the upper limit value of the counter and the counter value is also calculated (324). The two differences thus obtained are summed and retained (326). In the network management system 20 of this embodiment, the sum calculated in this manner is used as a value (distance) indicating the likelihood of each failure cause. Of course, various other calculation methods are possible, but the method of this example is the simplest.
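The distance of steps 322 through 326 reduces to a one-line calculation. This is a sketch of that arithmetic; the function name and parameter names are assumptions.

```python
def distance(total_events, counter_value, upper_limit):
    """Distance used as the likelihood measure: the number of input failure
    events this cause does not explain (total_events - counter_value), plus
    the number of expected symptoms of this cause that were not observed
    (upper_limit - counter_value). Smaller means more likely."""
    return (total_events - counter_value) + (upper_limit - counter_value)
```

A cause whose counter reached its upper limit and whose symptoms account for every input event has distance 0, the best possible score.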
- Next, it is determined whether the sum thus calculated is smaller than the minimum value held so far (328). If not, the processing advances to the next index number (332) and control returns to step 304. If the sum is smaller than the minimum value, the index number corresponding to the sum is stored and the sum is stored as the new minimum value (330). The processing then advances to the next index number (332) and control returns to step 304.
- In step 308, it is determined whether the minimum value obtained by the above series of processing is smaller than a predetermined threshold value. If it is, the failure cause indicated by the index number corresponding to this minimum value is estimated to be the root cause of the failure and is presented to the network administrator via the user interface unit 36.
- Otherwise, the process waits for the input of the next failure event (312), or terminates after displaying that the cause could not be identified.
- When a certain failure event is input, information indicating which managed objects were finally reached and which failure counters were incremented is stored in the cache information 72. For example, when a failure event S4 is input under the causal relationships shown in FIG. 5, the failure counters of P0, P1, and P2 are finally incremented. If this information is stored in the cache information 72, then the next time the failure event S4 is input, the corresponding failure counters can be incremented immediately, without following the causal relationships through the configuration information, and the processing speed is increased.
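The caching behavior just described can be sketched as follows. The names (`CachedIncrementer`, `resolve`) are illustrative assumptions: `resolve` stands in for the slow walk through the causal relationships, and the cache corresponds to the cache information 72.

```python
# Sketch: the first time an event is processed, the causal relations are
# walked once and the reached counter indexes are cached; later arrivals of
# the same event increment those counters directly from the cache.
class CachedIncrementer:
    def __init__(self, resolve):
        self.resolve = resolve  # event -> list of counter indexes (slow causal walk)
        self.cache = {}         # plays the role of cache information 72
        self.counters = {}

    def on_event(self, event):
        if event not in self.cache:
            self.cache[event] = self.resolve(event)  # walk causality only once
        for idx in self.cache[event]:
            self.counters[idx] = self.counters.get(idx, 0) + 1
```

With the FIG. 5 example, the first S4 resolves to P0, P1, and P2; every later S4 skips the walk entirely.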
- The network management system 20 of the above-described embodiment has a propagation rule detection unit 60, which detects cross-correlations between the failure events stored in the event database 44. If a new fault propagation not yet described in the propagation model 42 is detected, it is fed back (added) to the propagation model 42. By reconstructing the counter value increment logic 74 and the counter increment rules 80 from the propagation model 42 updated in this way, a more accurate failure cause can be estimated.
- This value C(S1, S2) is calculated between all pairs of fault events that occurred within the time window. When this value exceeds a certain threshold, it is estimated that there is a propagation relationship between the two events.
- rules not yet described in the propagation model 42 may be given to the propagation model unit 58 and fed back to the fault counter generation unit 56.
- The fault counter generation unit 56 reconstructs the fault counters and the increment rules based on the rules updated in this manner, so that more accurate inference can be performed.
- the equation for calculating the degree of cross-correlation is not limited to the above equation, and various equations can be used.
- The "confidence" z that a certain failure cause is the true cause of the failure can be calculated as follows.
- n indicates the number of fault events, among those assumed for the fault cause, that match the actually input fault events.
- m indicates the larger of the total number of input failure events and the maximum value of the upper limit of the counter.
- Alternatively, the upper limit value set for each individual counter may be used.
- z in the above equation takes a value of 0 to 100% (0 to 1). This confidence z is attached to each candidate for the cause of the failure when it is notified.
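Given the definitions of n and m above and the stated 0 to 100% range, the confidence appears to be the ratio n/m expressed as a percentage. This sketch assumes that reading; the function name is an assumption.

```python
def confidence(n, m):
    """Confidence z (0-100%) that a cause is the true cause, assuming z = n/m:
    n = number of the cause's expected fault events that actually occurred,
    m = the larger of the total input events and the counter upper limit."""
    return 100.0 * n / m
```

For example, a cause with 3 of its expected events observed, against m = 4, would be reported with 75% confidence.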
- The event correlation table approach requires a relatively large amount of storage for the event correlation table. According to the present embodiment, the failure cause estimation method is realized with a compact logic obtained from the class descriptions and the propagation model, together with the configuration information, so the required memory is proportional to the number of instances. This system is therefore advantageous compared to the event correlation table approach, which requires a memory capacity proportional to the square of the number of instances.
- The counter value increment logic runs at the timing when a failure event is notified; no calculation occurs when no fault has occurred. Also, since only the fault counters related to the fault event that occurred are incremented, the calculation load at increment time is small.
- In the embodiment described above, only the one failure cause whose index number corresponds to the smallest distance calculated from the failure counter values is presented to the network administrator. However, the present invention is not limited to this. For example, the distance (the sum of the difference between the total number of failure events and the counter value, and the difference between the upper limit value of the counter and the counter value) may be calculated for all index numbers, and the failure causes corresponding to the top N index numbers (where N is an arbitrary natural number, smaller distances first) may be presented to the network administrator in ascending order of distance.
- FIG. 13 shows a flowchart in this case.
- the same processes as those in FIG. 12 are denoted by the same reference numerals, and the detailed description thereof will not be repeated.
- the processing of steps 302, 308, 328, and 330 of FIG. 12 is omitted.
- Instead, a step 340 is provided in which the count values are sorted in ascending order and only the top N are presented as candidates for the cause of the failure.
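The sorting and selection of step 340 can be sketched in a few lines. The function name is an assumption.

```python
def top_n_causes(distances, n):
    """Sort (cause, distance) pairs in ascending order of distance and
    return the N most suspicious candidates (sketch of step 340)."""
    return sorted(distances, key=lambda pair: pair[1])[:n]
```

The administrator then inspects the returned candidates in order, starting with the smallest distance.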
- the network administrator can check the candidates for the cause of the failure in order from the most suspicious.
- Network managers can efficiently identify and eliminate the cause of the failure.
- the possibility that the true cause of failure is omitted from the candidates is reduced.
- Alternatively, in step 340, the count values may be sorted in ascending order, and the failure causes corresponding to every value below a predetermined threshold may be presented to the network administrator as candidates, together with the above-mentioned confidence.
- the processing configuration may be such that the failure cause is estimated each time a failure event is input.
- Figure 15 shows a flowchart of the entire process in that case.
- steps for performing the same processing as the processing steps shown in FIG. 8 or FIG. 9 are denoted by the same reference numerals. A detailed description of them will not be repeated here.
- In this configuration, the failure cause identification processing (150) is performed immediately each time a failure event is input. When the minimum distance obtained is determined to be smaller than a predetermined threshold, the failure cause corresponding to that counter is estimated to be the root cause.
- As a result, the cause of the failure is presented to the network administrator as soon as the information necessary for its estimation is available. Compared to starting the failure cause identification processing from a timer, this reduces the time lag from the occurrence of a failure to the estimation of its cause.
- In the embodiments described above, the value to be incremented has been described as 1.0. However, the present invention is not limited to this; the counter may be incremented by an arbitrary value greater than 0 and at most 1.0. By doing so, stochastic failure propagation can be expressed by the same logic as in the above-described embodiment.
- For example, when the probability of propagation from fault event S3 to fault event S4 is known, incrementing the counter by that probability allows the network management system to handle stochastic failure propagation.
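The stochastic extension amounts to replacing the fixed increment of 1.0 with the propagation probability. This is an illustrative sketch; the function name and counter shape are assumptions.

```python
# Sketch: each propagation step increments its fault counter by the
# propagation probability (0 < p <= 1.0) instead of a fixed 1.0, so
# uncertain propagations contribute proportionally to the counter value.
def increment_stochastic(counters, key, probability=1.0):
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1.0]")
    counters[key] = counters.get(key, 0.0) + probability
    return counters[key]
```

With probability 1.0 this degenerates to the deterministic behavior of the earlier embodiment, so both cases share the same counter logic.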
- As described above, the network management system of the present invention can identify the cause of a failure with a small amount of memory and a small amount of calculation. It is therefore suitable for efficiently identifying and resolving the cause of a failure in a large or complex network.
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU49287/99A AU4928799A (en) | 1999-07-28 | 1999-07-28 | Network managing system |
PCT/JP1999/004041 WO2001008016A1 (en) | 1999-07-28 | 1999-07-28 | Network managing system |
GB0114407A GB2363286B (en) | 1999-07-28 | 1999-07-28 | Network managing system |
CA002348294A CA2348294A1 (en) | 1999-07-28 | 1999-07-28 | Network management system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP1999/004041 WO2001008016A1 (en) | 1999-07-28 | 1999-07-28 | Network managing system |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2001008016A1 true WO2001008016A1 (en) | 2001-02-01 |
Family
ID=14236326
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP1999/004041 WO2001008016A1 (en) | 1999-07-28 | 1999-07-28 | Network managing system |
Country Status (4)
Country | Link |
---|---|
AU (1) | AU4928799A (en) |
CA (1) | CA2348294A1 (en) |
GB (1) | GB2363286B (en) |
WO (1) | WO2001008016A1 (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8122106B2 (en) | 2003-03-06 | 2012-02-21 | Microsoft Corporation | Integrating design, deployment, and management phases for systems |
US8489728B2 (en) | 2005-04-15 | 2013-07-16 | Microsoft Corporation | Model-based system monitoring |
CN109669844B (en) * | 2018-11-27 | 2022-08-23 | 平安科技(深圳)有限公司 | Equipment fault processing method, device, equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH0730540A (en) * | 1993-07-08 | 1995-01-31 | Hitachi Ltd | Network fault monitor equipment |
JPH0818593A (en) * | 1994-06-27 | 1996-01-19 | Internatl Business Mach Corp <Ibm> | Limited plurality of fault management method and diagnostic system |
JPH09247145A (en) * | 1996-03-05 | 1997-09-19 | Nippon Telegr & Teleph Corp <Ntt> | Network management system |
1999
- 1999-07-28 AU AU49287/99A patent/AU4928799A/en not_active Abandoned
- 1999-07-28 CA CA002348294A patent/CA2348294A1/en not_active Abandoned
- 1999-07-28 GB GB0114407A patent/GB2363286B/en not_active Expired - Fee Related
- 1999-07-28 WO PCT/JP1999/004041 patent/WO2001008016A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
GB2363286B (en) | 2003-08-27 |
GB2363286A (en) | 2001-12-12 |
GB0114407D0 (en) | 2001-08-08 |
AU4928799A (en) | 2001-02-13 |
CA2348294A1 (en) | 2001-02-01 |
Legal Events
- AK (Designated states): Kind code of ref document: A1; Designated state(s): AU CA CN GB IN KR US
- DFPE: Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
- WWE (Wipo information: entry into national phase): Ref document number: 09830172; Country of ref document: US
- ENP (Entry into the national phase): Ref document number: 2348294; Country of ref document: CA; Kind code of ref document: A; Format of ref document f/p: F
- WWE (Wipo information: entry into national phase): Ref document number: 49287/99; Country of ref document: AU
- ENP (Entry into the national phase): Ref country code: GB; Ref document number: 200114407; Kind code of ref document: A; Format of ref document f/p: F