WO2001008016A1 - Network managing system - Google Patents
Network managing system
- Publication number
- WO2001008016A1 (PCT/JP1999/004041)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- failure
- fault
- cause
- value
- event
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/079—Root cause analysis, i.e. error or fault diagnosis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/0709—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0631—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/08—Configuration management of networks or network elements
- H04L41/0803—Configuration setting
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/08—Configuration management of networks or network elements
- H04L41/085—Retrieval of network configuration; Tracking network configuration history
- H04L41/0853—Retrieval of network configuration; Tracking network configuration history by actively collecting configuration information or by backing up configuration information
- H04L41/0856—Retrieval of network configuration; Tracking network configuration history by actively collecting configuration information or by backing up configuration information by backing up or archiving configuration information
Definitions
- the present invention generally relates to a network management system for managing faults on a network, and more particularly, to a function of identifying a root cause of a fault from various types of fault symptoms observed on the network.
- the present invention relates to a network management system having such a function.
- Background art
- An “event” is an exceptional condition that occurs in a network. Events include hardware and software failures, outages, performance bottlenecks, inconsistent network configurations, unintended consequences of poor design, and malicious damage such as computer viruses.
- Symptoms refer to observable events; “symptom” is used in the same sense as “symptom event”. Examples include: “communication with a certain destination A always takes a long time and retransmission is required”, “characters are always garbled for a certain destination B”, and “an acknowledgment is never returned from a certain destination C”.
- “Problem” refers to the root cause of a failure.
- the problem is not always observable. For example, damage to the transmitter of a communication device, disconnection of a communication cable, and lack of communication line capacity are examples of “problems”. The words “problem” and “failure cause” are used in the same sense.
- An “object” is something that has a clear boundary and meaning for a concept, abstraction, or problem of interest.
- An "object class” is a group of objects that have similar properties (attributes), common behavior (operations), common relationships with other objects, and common meanings.
- Class has the same meaning as object class.
- An “object instance” is a specific object belonging to a certain object class.
- An “object instance” is simply called an “instance”.
- One problem event on one resource in the network can cause many symptom events on multiple resources involved. Some problems are observable, but generally not always observable. Therefore, it is necessary to identify the problem that is the root cause of the failure from multiple symptoms. Network administrators must be able to correlate the various observed symptom events with the problem in order to identify the root cause problem.
- the static aspects of the modeling of managed objects and the modeling of event propagation are abstracted, and an object-oriented concept is introduced to perform the modeling efficiently.
- various managed objects are first modeled as classes. Then define the relationships between the classes.
- certain events are modeled as propagating along the relationships between classes.
- the network to be managed is modeled based on the class system defined in this way. In other words, the managed object in the network is abstracted as one instance of a certain class. Next, the network is modeled as an event that propagates these instances (managed objects) according to the relationship established between the class to which the instance belongs and the class to which other instances belong. . Based on the network modeled in this way, the correlation between the problem and the symptoms is specified in advance.
- a symptom event propagation rule is prepared in advance.
- This propagation rule formalizes the relationships by which the problem event that is the root cause of a failure propagates to symptom events of the failure, and by which a symptom event propagates to other symptom events.
- This set of propagation rules is called a propagation model.
- event propagation is modeled such that each event propagates between instances according to the relationship defined between the classes of managed objects.
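The class/relation/propagation modeling described above can be sketched as plain data. This is a purely illustrative sketch, not the patent's actual data structures; the class and relation names (Application, TcpNode, Router, Underly, ConnectedTo, RoutingError, TcpDisconnect) are taken from the example given later in this document.

```python
# Class relations: (class, relation) -> related class
RELATIONS = {
    ("Application", "Underly"): "TcpNode",
    ("TcpNode", "ConnectedTo"): "Router",
}

# Propagation rules: a problem on the source class causes a symptom on the
# class reached via the named relation (src class, problem, relation,
# destination class, symptom).
PROPAGATION = [
    ("Router", "RoutingError", "ConnectedTo", "TcpNode", "TcpDisconnect"),
]

def propagate(cls, problem):
    """Return the (class, symptom) pairs that a problem propagates to."""
    return [(dst, symptom)
            for src, prob, rel, dst, symptom in PROPAGATION
            if src == cls and prob == problem]
```

Under this sketch, a RoutingError on a Router instance propagates along the connection relation to a TcpDisconnect symptom on TcpNode instances.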
- U.S. Pat. No. 5,528,516 particularly relates to a fault management function based on an event correlation table approach, which takes as input the symptoms of the large number of faults observed when a network failure occurs and quickly identifies the root cause.
- U.S. Pat. No. 5,661,668 relates to a technique for generating an event correlation table used in the above-mentioned U.S. Pat. No. 5,528,516. That is, the method disclosed in US Patent No. 5,661,668 includes the following steps.
- the problem that is the root cause of a failure can be identified by the relatively simple task of comparing the symptom pattern of the actual failure with the symptom patterns in the event correlation table. Therefore, this conventional technique greatly facilitates identifying the root cause of a problem.
- this conventional technique still has the following problems to be solved.
- the event correlation table used by the above-mentioned conventional technology requires a memory capacity proportional to (the number of problems) × (the number of symptoms). A large network therefore requires an enormous amount of storage, which limits the applicability of this method to large-scale networks.
- an object of the present invention is to provide a network management system capable of effectively specifying a problem with a small amount of memory usage.
- Still another object of the present invention is to provide a network management system that can specify a problem with a small amount of memory and a small amount of calculation, and that does not cause any omission in specifying a root problem.
- the present invention relates to a network management system and a machine-readable recording medium storing a program for causing a computer to operate as such a network management system.
- This network management system includes a configuration management information schema definition storage device for holding a configuration management information schema definition that describes classes representing devices on the network and failure events observed in those classes;
- a propagation model storage device for storing a propagation model between classes, a configuration information storage device for storing the actual configuration information of devices on the network, and an inference device that includes a counter, provided for each failure event occurring in each device, based on the configuration management information schema definition, the propagation model, and the configuration information, together with a comparison device for inferring the cause of the failure event by a predetermined method based on the contents of the counter.
- the counter has a storage device including a plurality of storage areas for holding each count value, a device for setting an upper limit value for each storage area based on the propagation model, and a logic circuit for updating the value of the storage area corresponding to an input failure event.
- By the counter updating the contents of the storage areas corresponding to each failure cause, the inference apparatus infers the failure cause based on the contents of the storage areas.
- the storage space required for this is proportional to the number of instances included in the configuration information, which is clearly advantageous compared to approaches such as an event correlation table that require a storage capacity proportional to the square of the number of instances.
- the counter further includes a device for holding counting configuration information, composed only of the connection relations between devices; in response to the input of a failure event, the logic circuit determines from predetermined rules and this configuration information which counters should operate, and operates them.
- In the operation of the counter, if device connection information is available, the logic circuit can trace the propagation of events, and the processing can be sped up.
- the counter generates a plurality of storage areas when the network management system is started, and sets an upper limit value for each of the storage areas. If a storage area is created at the time of startup, the subsequent counting process for a failure event by the counter can be executed quickly.
- the counter reserves a necessary storage area and sets the corresponding upper limit value in response to the input of a failure event, securing the required storage space only when it is needed.
- the storage device can be used efficiently without using a storage area that is rarely used.
- the inference device infers the true cause of the failure in response to the input of each failure event. Inference is thus performed whenever a failure event occurs, even if the network administrator does not request it, so inference can be performed at an appropriate time without depending on the administrator, and the time lag from failure occurrence to inference is reduced compared to inference performed at regular intervals. Alternatively, the inference device may infer the true cause of the failure at predetermined time intervals; since inferences are then made at regular intervals, they can be made reliably at appropriate times without action by the network administrator.
- the inference apparatus calculates, for each failure cause, a distance from the true failure cause, defined based on the corresponding count value, and presents a predetermined number of failure cause candidates starting from the one with the smallest distance. Since a predetermined number of candidates is presented starting from the shortest distance, the possibility that the true cause is missing from the candidates is small.
- Alternatively, the inference device calculates, for each failure cause, a distance from the true failure cause, defined based on the corresponding count value, and presents as candidates only those causes whose distance is smaller than a predetermined threshold value. Since only causes with a distance below the threshold are presented, those with a high possibility of being the failure can be presented to the network administrator.
- the inference circuit presents the candidates for the cause of the failure in the sorted order according to the respective distances.
- candidates can be presented to the network administrator according to the order of possible failures. The network administrator can reliably remove the fault by checking for possible faults in this order.
- the inference apparatus calculates a certainty factor according to the count value of each fault counter, and attaches it to each of the presented failure cause candidates. With the certainty factors, the network administrator can intuitively grasp how likely each candidate is to be the cause of the failure.
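The distance-based ranking with certainty factors can be sketched as follows. The concrete formulas are assumptions for illustration only: here distance is taken as (upper limit − count value) and certainty as count/upper_limit; the patent states only that both are computed from the count values by a predetermined method.

```python
def rank_candidates(counters, top_n=3):
    """counters: {cause: (count, upper_limit)}.
    Returns up to top_n (cause, distance, certainty) tuples,
    smallest distance first."""
    scored = []
    for cause, (count, limit) in counters.items():
        distance = limit - count               # 0 means all expected symptoms seen
        certainty = count / limit if limit else 0.0
        scored.append((distance, cause, certainty))
    scored.sort()                              # sort by distance, then cause name
    return [(cause, d, cf) for d, cause, cf in scored[:top_n]]
```

A cause whose counter has reached its upper limit has distance 0 and certainty 1.0, and is ranked first.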
- the counter includes a counted-area storage device for storing information specifying the storage areas updated by the logic circuit, and the inference apparatus includes in its calculation only the storage areas recorded in the counted-area storage device.
- the storage area that has not been updated is not included in the calculation for inference by the inference device. Since only the storage areas that need to be calculated are to be calculated, the calculation can be sped up.
- the storage device includes an area for storing a flag indicating whether or not each storage area has been updated in response to a given failure event; the logic circuit maintains the flags and determines, according to the content of each flag, whether to update the value in the corresponding storage area.
- the same fault event may reach a given cause through more than one causal path. In that case the storage area corresponding to that cause may be updated more than once, and the calculation result would not be correct. Therefore a flag is set, and a storage area that has already been updated for a given failure event is not updated again. By taking such measures, the storage areas can be updated without error even under such causal relationships.
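The role of the update flag can be shown with a minimal sketch (class and method names are illustrative, not from the patent): when one fault event can reach the same counter via two causal paths, the flag ensures the counter is incremented at most once per event.

```python
class FlaggedCounter:
    def __init__(self):
        self.count = 0
        self.updated = False        # the update flag

    def increment(self):
        if not self.updated:        # increment at most once per fault event
            self.count += 1
            self.updated = True

    def clear_flag(self):           # called when processing of the next event begins
        self.updated = False
```

Two increments triggered by the same event (via two causal paths) raise the count only once; clearing the flag readies the counter for the next event.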
- the update by the logic circuit is a process of incrementing the value in each storage area within the propagation range of the input fault event by a predetermined value, where the predetermined value is greater than 0 and at most 1.
- the counter further includes a cache device for holding information specifying the storage areas updated by the logic circuit in response to the input of a certain failure event; when the same failure event is input again, the logic circuit updates the storage areas specified based on the information stored in the cache device.
- By retaining information that identifies the storage areas updated in response to a failure event input, when the same failure event is input again, the target storage areas can be accessed and updated directly without following the propagation path. Therefore, the processing can be sped up.
- the cache device holds information indicating the time when the information specifying the storage areas was last referenced, and the counter further includes a device for deleting from the cache device information that has not been referenced for a certain period of time.
- the storage area can be used effectively by erasing from the cache device information that is not referenced for a certain period of time.
- the counter further includes a propagation rule detection device for accumulating the input fault events, detecting cross-correlations between fault events, and feeding back to the logic circuit new fault propagation rules not described in the propagation model; the logic circuit receives this feedback and reconstructs its predetermined rules. Since the rules for inference can be changed dynamically based on the history of actual failure events, inference of the failure cause becomes more reliable.
- FIG. 1 is a block diagram of a network management system according to one embodiment of the present invention.
- FIG. 2 is a block diagram of the fault management unit 34 shown in FIG.
- FIG. 3 is a block diagram of the fault counter 54 shown in FIG.
- FIG. 4 is a diagram schematically showing a configuration of the counter value storage area 70 shown in FIG.
- FIG. 5 is a diagram for explaining the concept of a causal loop between a cause and a symptom.
- FIG. 6 is an external view of a computer for realizing the network management system according to the present invention.
- FIG. 7 is a block diagram of a computer for realizing the network management system according to the present invention.
- FIG. 8 is a flowchart showing a failure event increment process in one embodiment of the present invention.
- FIG. 9 is a flowchart of a process for identifying a cause of a failure started by a timer in one embodiment of the present invention.
- FIG. 10 is a flowchart of the cache clearing process started by the timer in one embodiment of the present invention.
- FIG. 11 is a flowchart showing the processing contents of the failure counter generation unit 56.
- FIG. 12 is a flowchart of the failure cause identification processing.
- FIG. 13 is a flowchart illustrating another example of the failure cause identification processing.
- FIG. 14 is a flowchart illustrating still another example of the failure cause identification processing.
- FIG. 15 is a flowchart of the process of incrementing the fault counter and specifying the cause of the fault in another embodiment of the present invention.
- The present invention, without preparing an event correlation table in advance, increments, each time a symptom event occurs, the counter provided for each fault that can be associated with the symptom event (referred to as a “fault counter”), based on the propagation model, the managed object model, and the configuration information of the actual network. Since the number of propagation rules is finite and the types of symptom events are finite, each fault counter should count up to a certain number when all of its symptom events are considered to have occurred.
- This number is obtained in advance as the upper limit of the count of each fault counter; the obtained number is compared with the actual number of symptom event occurrences and the count of each fault counter, to identify the failure cause corresponding to the occurrence pattern of the symptom events.
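The core counting idea above can be sketched in a few lines. This is a hedged illustration, not the patent's implementation: the model here maps each cause directly to the set of symptoms it would generate, the upper limit is the size of that set, and the cause with the smallest (upper limit − count) distance is reported.

```python
# cause -> set of symptoms that cause generates on the propagation model
MODEL = {
    "P0": {"s1", "s2", "s3"},   # upper limit of P0's fault counter is 3
    "P1": {"s2", "s4"},         # upper limit of P1's fault counter is 2
}

def identify(observed):
    """Return the cause whose counter comes closest to its upper limit."""
    counts = {cause: len(symptoms & observed)
              for cause, symptoms in MODEL.items()}
    # distance = upper limit - count; the smallest distance wins
    return min(MODEL, key=lambda c: len(MODEL[c]) - counts[c])
```

When all three of P0's symptoms are observed, P0's distance is 0 and it is identified; an observation matching P1's full symptom set identifies P1 instead.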
- a network management system 20 includes: a configuration management information schema definition 40 that holds a database schema for defining network configuration information; a configuration management unit 30 for managing network configuration information using this schema; an event database 44 for storing configuration information data and failure information data as events; a fault management unit 34 that collects network failure events and estimates the cause of the failure; and a user interface unit 36 that receives information indicating the inferred root cause problem from the fault management unit 34 and presents it to the network administrator.
- the configuration management information schema definition 40, the configuration information section 38, and the propagation model 42 are stored in a storage device such as a memory. The storage device in which these are stored may be the same or separate.
- the fault management section 34 includes: an event database section 62 for storing fault events from the network in the event database 44; a propagation model unit 58 for managing, for reference, the rules of fault propagation (propagation model 42); a fault counter generation unit 56 that generates the fault counter described later from the configuration information, its schema definition information, and the propagation model 42; a fault counter 54 that provides a counter corresponding to each failure cause and increments the counter corresponding to the cause when a fault event is input; and a count comparator 52 that compares the counter values of the fault counter 54.
- the fault counter 54 includes: counter configuration information 78, consisting only of the information required to identify the device type and device from the configuration information, together with the connection information between devices; the counter value storage area 70, composed of an array of counters corresponding to the failures of each device, generated from the propagation model 42 and the configuration information section 38; the counter value increment logic 74, which increments the counters corresponding to each failure in the counter value storage area 70; the counter increment rule 80, generated from the propagation model 42 and referenced by the counter increment logic 74, which describes, for the device types and connection relations, which device class's fault counter should be incremented for a fault event input to the fault counter; the incremented counter index storage area 76, where the indexes of the incremented fault counters are recorded; and the cache information 72, which records which fault counters were incremented as a result of applying the rules to the fault event of a specific device.
- the counter value storage area 70 includes, for each failure cause (problem P0, P1, P2, ...): an index 90 (a serial number in this embodiment); an upper limit value 92, set in advance to the number of fault events that the failure cause corresponding to the counter is considered to give rise to on the propagation model; an area that holds the count value; and an update flag 96. All of these areas are maintained by the counter value increment logic 74.
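One entry of the counter value storage area 70 can be sketched as a simple record (field names are illustrative; the upper limit values shown are arbitrary example numbers, not taken from the patent):

```python
from dataclasses import dataclass

@dataclass
class CounterEntry:
    index: int             # index 90: serial number identifying the fault cause
    upper_limit: int       # upper limit 92: symptom count expected on the model
    count: int = 0         # area holding the current count value
    updated: bool = False  # update flag 96

# The storage area as an array of such entries, one per failure cause.
storage_area = [
    CounterEntry(index=0, upper_limit=3),   # problem P0
    CounterEntry(index=1, upper_limit=2),   # problem P1
]
```

The counter value increment logic would locate an entry by its index and raise `count` (once per event, guarded by `updated`).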
- Fig. 5 shows the Causality Mapping when the same symptom occurs along two causal paths from a certain failure cause. Now, it is assumed that the following causal relationships exist for the failures.
- the network management system shown in FIG. 1 is actually realized by software running on a computer such as a personal computer or a workstation.
- Figure 6 shows the appearance of the computer that implements the network management system. Referring to FIG. 6, this computer has a computer main unit 100 equipped with a CD-ROM (Compact Disc Read-Only Memory) drive 110 and an FD (Flexible Disk) drive 112, a display 102, a printer 104, a keyboard 106, and a mouse 108.
- Figure 7 shows the configuration of this computer in block diagram form. As shown in FIG. 7, the computer main unit 100 constituting the network management system 20 includes the CD-ROM drive 110 and the FD drive 112, as well as a CPU 116, each connected to a bus 126.
- CD-ROM: Compact Disc Read-Only Memory
- FD: Flexible Disk
- This network management system is realized by the computer hardware and by software executed by the CPU 116.
- Such software is stored in a storage medium such as a CD-ROM 122 or an FD 124 and distributed, is read from the storage medium by the CD-ROM drive 110 or FD drive 112, and is stored once on the hard disk 114. It is then read from the hard disk 114 into the RAM 120 and executed by the CPU 116.
- the computer hardware itself shown in FIGS. 6 and 7 is general. Therefore, the most essential part of the present invention is the software stored in the storage medium such as the CD-ROM 122, the FD 124 and the hard disk 114.
- the operation of the computer itself shown in FIGS. 6 and 7 is well known and will be apparent to those skilled in the art.
- the software is composed of a part that, after the device is started, receives fault events and increments the fault counters (FIG. 8), a part that is started periodically by the timer 50 to identify the cause of the fault (FIG. 9), and a part that is activated by the timer to clear the cache information 72 (FIG. 10).
- the software structure of the part that increments the failure counter will be described.
- When this device is started, it first secures the counter value storage area 70, sets its upper limits, and generates the counter value increment logic 74 (140). The system then waits for a failure event to be input (142); when a failure event is input, the failure counter increment process (144) is executed, after which the system again waits for the input of a failure event (142). Details of these processes will be described later.
- failure cause identification processing 150 is started periodically by timer 50.
- the user interface unit 36 presents the problem corresponding to the failure counter having the minimum value to the network administrator as the root cause of the failure.
- This processing corresponds to the processing performed by the count comparator 52 in FIG.
- timer 50 clears cache information 72 periodically.
- the cache information 72 holds the date when the cache was last referenced as an attribute, and the timer 50 periodically checks this date and discards cache information that has not been referenced for a long time.
- the storage capacity of the cache information 72 can be saved, and it is possible to cope with changes in the network configuration.
- the process for specifying the cause of the failure and the process for clearing the cache information 72 are started by the timer independently of the process for incrementing the failure counter.
- the present invention is not limited to this.
- these processes may be started in response to a request from the network administrator, or in response to the satisfaction of some condition in the process of incrementing the failure counters.
- Their implementation will depend on the design requirements and will be apparent to those skilled in the art.
- the process 140 for generating a fault counter will be described with reference to FIG. This processing corresponds to the processing of the failure counter generation unit 56 in FIG.
- an index unique to each class is assigned to the cause of failure in each class (180). For example, ApplicationDown, TcpDisconnect, RoutingError, etc. described in the examples below correspond to this.
- a counter value storage area is generated by pairing the index of each managed object with the index of the cause of the failure (184).
- a failure counter in the counter value storage area 70 shown in FIG. 4 is specified and accessed by the pair of the managed object index and the failure cause index.
- For the fault counters corresponding to problems P0, P1, etc., which are the failure causes shown in Fig. 4, a counter is actually assigned for each managed object.
- the range of the failure is predicted, and the upper limit is set (186).
- the counter value increment logic 74 shown in FIG. 3 is generated.
- the generation of the counter value increment logic 74 is also performed by software. Prior to this processing, it is assumed that the following logic template is prepared in advance.
- appendCache(mo, symptom, mo1, symptom1); return;
- mo2 = search(mo1, Relation, rClassName); if (mo2 == NOT_FOUND) return;
- the counter value increment logic 74 is generated as follows.
- the first class is selected for processing (188).
- it is determined whether or not all classes have been processed (190). If all classes are finished, this routine is finished. If any classes have not been completed, control proceeds to step 200.
- In step 200, the macro (b) is expanded at the position indicated as “addition position of case clause macro for Class” in the counter logic template (a).
- “ClassName” in (b) is replaced with the name of the class being processed (for example, “Application”). Subsequent positions in this processing move after the expanded part.
- At the position indicated as “addition position of case clause macro for SymptomName” in the copy of (b) expanded in step 200, the macro (c) is expanded in order for all failure events defined for the class (202). At this time, “ClassName” in the macro (c) is replaced with the name of the class being processed, “SymptomName” with the name of each fault event defined in the class, “Relation” with the relation defined for the class in the propagation model, “rClassName” with the name of the class associated with this class by that relation, and “ProblemName” with the failure cause assigned in step 180 for that class. Thus, in step 202, the macro (c) is expanded for all the fault events defined in the class.
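This per-class, per-symptom template expansion can be illustrated with simple string templating. The template text and the `expand` helper are assumptions standing in for the patent's macro mechanism; only the placeholder names (ClassName, SymptomName, Relation, rClassName, ProblemName) come from the description above.

```python
# A stand-in for the case clause macro (c), with the five placeholders.
CASE_TEMPLATE = ("case {SymptomName}: increment({ClassName}, {ProblemName}); "
                 "propagate_via({Relation}, {rClassName});")

def expand(class_name, events):
    """Expand the case-clause macro once per fault event defined for a class.
    events: list of dicts with SymptomName, Relation, rClassName, ProblemName."""
    return [CASE_TEMPLATE.format(ClassName=class_name, **e) for e in events]
```

For a class with one fault event, one case clause is produced with every placeholder substituted.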
- If it is determined in step 190 that the processing for all classes has been completed, the counter value increment logic 74, obtained by expanding (b) and (c) for all classes defined in the class definition, is complete. An example is shown below.
- appendCache(mo, symptom, mo1, symptom1); return;
- An instance of the class Application has an Underly relationship with an instance of the class TcpNode.
- An instance of class TcpNode has a ConnectedTo relationship to an instance of class Router.
- The failure RoutingError of class Router propagates to the failure TcpDisconnect of a TcpNode connected to it.
- executeCache(): a function that increments the fault counters according to the cache data related to the specified failure event of the specified managed object.
- appendCache(): a function that records, as cache data, which fault counters were incremented for the specified failure event of the specified managed object.
- the present invention does not store and hold the Causality Mapping itself; instead it stores only the counter value increment logic, the rules, and the counter configuration information 78 consisting solely of the connection relationships between devices.
- failure counter increment processing shown in step 144 of FIG. 8 will be described.
- Failure events are input to the counter in time series.
- To each failure event, a device identifier and device type information for specifying the device in the actual configuration information are added as attributes.
- the counter value increment logic 74 queries the counter configuration information 78 and increments the value of the fault counter for the problem corresponding to the fault in the managed object corresponding to the device.
- For example, the counter logic first finds the "TcpNode" management object corresponding to the device and increments its fault counter.
- the counter logic stores the incremented index of the failure counter in the incremented counter index storage area 76.
- The counter logic then takes this management object as an input again and searches the counter configuration information 78 for a router connected to it, and increments the RoutingError counter of that router. In this way, the counter logic increments fault counters one after another, and the processing for the fault event terminates when there is nothing left to propagate to. When the counter logic increments a fault counter for a certain fault event, it sets a flag associating that fault counter with the fault event; the counter is incremented only if the flag is not yet set, so each counter is incremented at most once per fault event.
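The counter-increment walk just described can be sketched as follows. All names here (`FaultCounters`, the dictionary shapes) are assumptions chosen to mirror the TcpNode/Router example, not identifiers from the patent.

```python
from collections import defaultdict

# Sketch of the increment walk: starting from the device that reported the
# fault, increment the matching fault counter, then follow the configured
# relations upstream to increment counters of causes the fault could have
# propagated from. A per-event flag ensures each counter is incremented at
# most once per fault event.
class FaultCounters:
    def __init__(self, device_class, connected_to, rules):
        self.device_class = device_class  # device -> class name
        self.connected_to = connected_to  # device -> upstream device
        self.rules = rules                # (class, symptom) -> upstream (class, symptom)
        self.counters = defaultdict(float)
        self.flags = set()                # (event_id, counter key) pairs already counted

    def increment(self, event_id, device, symptom):
        while device is not None:
            key = (device, symptom)
            if (event_id, key) not in self.flags:  # at most once per event
                self.flags.add((event_id, key))
                self.counters[key] += 1.0
            upstream = self.rules.get((self.device_class[device], symptom))
            if upstream is None:
                break                              # nothing left to propagate to
            device = self.connected_to.get(device)
            symptom = upstream[1]
```

A TcpDisconnect on a TcpNode thus increments the node's own counter and then the RoutingError counter of the router it is connected to.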
- The templates shown in "(b) Case clause macro for Class" and "(c) Case clause macro for Symptom name" realize this processing: words in each template are replaced with specific class names, and the result is expanded.
- The network management system notifies the network administrator of the failure cause corresponding to the smallest of the values calculated in this way, as a candidate for the true cause of the failure.
- The failure cause identification process will now be described. Referring to FIG. 12, when the process of identifying the cause of the failure is started, all the input events are first acquired (300). As an initial setting, a predetermined large constant (for example, the maximum number that the work area can hold) is substituted into the work area for obtaining the minimum value (302). This prepares the work area to hold, at each point in the calculation described below, the minimum value found so far.
- One index number is extracted from the indexes stored in the incremented counter index storage area 76 (304). It is then determined whether the processing has been completed for all the index numbers stored there (306). If so, control proceeds to step 308, the details of which are described later. Note that the calculation described below is performed not on all indexes but only on those stored in the incremented counter index storage area 76; the other counters have not been incremented, so omitting them has no effect on the result. This reduces the amount of calculation and increases the processing speed.
- The upper limit value and the current value of the counter indicated by the index number extracted in step 304 are read (320). The difference between the total number of input failure events and the read counter value is calculated (322). The difference between the upper limit value of the counter and the counter value is also calculated (324). The two differences thus obtained are summed and retained (326). In the network management system 20 of this embodiment, the sum calculated in this manner is used as a value (distance) indicating the likelihood of each failure cause. Of course, various other calculation methods are possible, but the method of this example is the simplest.
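The distance of steps 322 through 326 reduces to a one-line calculation. This is a sketch of that arithmetic; the function name and parameter names are assumptions.

```python
def distance(total_events, counter_value, upper_limit):
    """Distance used as the likelihood measure: the number of input failure
    events this cause does not explain (total_events - counter_value), plus
    the number of expected symptoms of this cause that were not observed
    (upper_limit - counter_value). Smaller means more likely."""
    return (total_events - counter_value) + (upper_limit - counter_value)
```

A cause whose counter reached its upper limit and whose symptoms account for every input event has distance 0, the best possible score.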
- Next, it is determined whether the sum thus calculated is smaller than the minimum value held so far (328). If not, the processing advances to the next index number (332) and control returns to step 304. If the sum is smaller than the minimum value, the index number corresponding to the sum is stored and the sum is stored as the new minimum value (330). The processing then advances to the next index number (332) and control returns to step 304.
- In step 308, it is determined whether the minimum value obtained by the above series of processing is smaller than a predetermined threshold value. If it is, the failure cause indicated by the index number corresponding to this minimum value is estimated to be the root cause of the failure and is presented to the network administrator via the user interface unit 36.
- Otherwise, the process waits for the input of the next failure event (312), or terminates after displaying that the cause could not be identified.
- When a certain failure event is input, information indicating which managed objects were finally reached and which failure counters were incremented is stored in the cache information 72. For example, when a failure event S4 is input under the causal relationships shown in FIG. 5, the failure counters of P0, P1, and P2 are finally incremented. If this information is stored in the cache information 72, then the next time the failure event S4 is input, the corresponding failure counters can be incremented immediately, without following the causal relationships through the configuration information, and the processing speed is increased.
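The caching behavior just described can be sketched as follows. The names (`CachedIncrementer`, `resolve`) are illustrative assumptions: `resolve` stands in for the slow walk through the causal relationships, and the cache corresponds to the cache information 72.

```python
# Sketch: the first time an event is processed, the causal relations are
# walked once and the reached counter indexes are cached; later arrivals of
# the same event increment those counters directly from the cache.
class CachedIncrementer:
    def __init__(self, resolve):
        self.resolve = resolve  # event -> list of counter indexes (slow causal walk)
        self.cache = {}         # plays the role of cache information 72
        self.counters = {}

    def on_event(self, event):
        if event not in self.cache:
            self.cache[event] = self.resolve(event)  # walk causality only once
        for idx in self.cache[event]:
            self.counters[idx] = self.counters.get(idx, 0) + 1
```

With the FIG. 5 example, the first S4 resolves to P0, P1, and P2; every later S4 skips the walk entirely.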
- The network management system 20 of the above-described embodiment has a propagation rule detection unit 60, which detects cross-correlations between the failure events stored in the event database 44. If a new fault propagation not yet described in the propagation model 42 is detected, it is fed back (added) to the propagation model 42. By reconstructing the counter value increment logic 74 and the counter increment rules 80 from the propagation model 42 updated in this way, a more accurate failure cause can be estimated.
- This value C(S1, S2) is calculated between all pairs of fault events that occurred within the time window. When this value exceeds a certain threshold, it is estimated that there is a propagation relationship between the two events.
- rules not yet described in the propagation model 42 may be given to the propagation model unit 58 and fed back to the fault counter generation unit 56.
- The fault counter generation unit 56 reconstructs the fault counters and the increment rules based on the rules updated in this manner, so that more accurate inference can be performed.
- the equation for calculating the degree of cross-correlation is not limited to the above equation, and various equations can be used.
- The "confidence" z that a certain failure cause is the true cause of the failure can be calculated as follows.
- n indicates the number of fault events, among those assumed for the fault cause, that match the actually input fault events.
- m indicates the larger of the total number of input failure events and the maximum value of the upper limit of the counter.
- Alternatively, the upper limit value set for each individual counter may be used.
- z in the above equation takes a value of 0 to 100% (0 to 1). This confidence z is attached to each candidate for the cause of the failure when it is notified.
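Given the definitions of n and m above and the stated 0 to 100% range, the confidence appears to be the ratio n/m expressed as a percentage. This sketch assumes that reading; the function name is an assumption.

```python
def confidence(n, m):
    """Confidence z (0-100%) that a cause is the true cause, assuming z = n/m:
    n = number of the cause's expected fault events that actually occurred,
    m = the larger of the total input events and the counter upper limit."""
    return 100.0 * n / m
```

For example, a cause with 3 of its expected events observed, against m = 4, would be reported with 75% confidence.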
- The event correlation table approach requires a relatively large amount of storage for the event correlation table. According to the present embodiment, the failure cause estimation method is realized with a compact logic obtained from the class descriptions and the propagation model, together with the configuration information, so the required memory is proportional to the number of instances. This system is therefore advantageous compared to the event correlation table approach, which requires a memory capacity proportional to the square of the number of instances.
- The counter value increment logic runs at the timing when a failure event is notified; no calculation occurs when no fault has occurred. Also, since only the fault counters related to the fault event that occurred are incremented, the calculation load at increment time is small.
- In the embodiment described above, only the one failure cause whose index number corresponds to the smallest distance calculated from the failure counter values is presented to the network administrator. However, the present invention is not limited to this. For example, the distance (the sum of the difference between the total number of failure events and the counter value, and the difference between the upper limit value of the counter and the counter value) may be calculated for all index numbers, and the failure causes corresponding to the top N index numbers (where N is an arbitrary natural number, smaller distances first) may be presented to the network administrator in ascending order of distance.
- FIG. 13 shows a flowchart in this case.
- the same processes as those in FIG. 12 are denoted by the same reference numerals, and the detailed description thereof will not be repeated.
- the processing of steps 302, 308, 328, and 330 of FIG. 12 is omitted.
- Instead, a step 340 is provided in which the count values are sorted in ascending order and only the top N are presented as candidates for the cause of the failure.
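The sorting and selection of step 340 can be sketched in a few lines. The function name is an assumption.

```python
def top_n_causes(distances, n):
    """Sort (cause, distance) pairs in ascending order of distance and
    return the N most suspicious candidates (sketch of step 340)."""
    return sorted(distances, key=lambda pair: pair[1])[:n]
```

The administrator then inspects the returned candidates in order, starting with the smallest distance.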
- the network administrator can check the candidates for the cause of the failure in order from the most suspicious.
- Network managers can efficiently identify and eliminate the cause of the failure.
- the possibility that the true cause of failure is omitted from the candidates is reduced.
- Alternatively, in step 340, the count values may be sorted in ascending order, and the failure causes corresponding to every value below a predetermined threshold may be presented to the network administrator as candidates, together with the above-mentioned confidence.
- the processing configuration may be such that the failure cause is estimated each time a failure event is input.
- Figure 15 shows a flowchart of the entire process in that case.
- steps for performing the same processing as the processing steps shown in FIG. 8 or FIG. 9 are denoted by the same reference numerals. A detailed description of them will not be repeated here.
- In this configuration, the failure cause identification processing (150) is performed immediately each time a failure event is input. When the minimum distance obtained is determined to be smaller than a predetermined threshold, the failure cause corresponding to that counter is estimated to be the root cause.
- As a result, the cause of the failure is presented to the network administrator as soon as the information necessary for its estimation is available. Compared to starting the failure cause identification processing from a timer, this reduces the time lag from the occurrence of a failure to the estimation of its cause.
- In the embodiments described above, the value to be incremented has been described as 1.0. However, the present invention is not limited to this; the counter may be incremented by an arbitrary value greater than 0 and at most 1.0. By doing so, stochastic failure propagation can be expressed by the same logic as in the above-described embodiment.
- For example, when the probability of propagation from fault event S3 to fault event S4 is known, incrementing the counter by that probability allows the network management system to handle stochastic failure propagation.
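The stochastic extension amounts to replacing the fixed increment of 1.0 with the propagation probability. This is an illustrative sketch; the function name and counter shape are assumptions.

```python
# Sketch: each propagation step increments its fault counter by the
# propagation probability (0 < p <= 1.0) instead of a fixed 1.0, so
# uncertain propagations contribute proportionally to the counter value.
def increment_stochastic(counters, key, probability=1.0):
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1.0]")
    counters[key] = counters.get(key, 0.0) + probability
    return counters[key]
```

With probability 1.0 this degenerates to the deterministic behavior of the earlier embodiment, so both cases share the same counter logic.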
- As described above, the network management system of the present invention can identify the cause of a failure with a small amount of memory and a small amount of calculation. It is therefore suitable for efficiently identifying and resolving the cause of a failure in a large or complex network.
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU49287/99A AU4928799A (en) | 1999-07-28 | 1999-07-28 | Network managing system |
PCT/JP1999/004041 WO2001008016A1 (en) | 1999-07-28 | 1999-07-28 | Network managing system |
GB0114407A GB2363286B (en) | 1999-07-28 | 1999-07-28 | Network managing system |
CA002348294A CA2348294A1 (en) | 1999-07-28 | 1999-07-28 | Network management system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP1999/004041 WO2001008016A1 (en) | 1999-07-28 | 1999-07-28 | Network managing system |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2001008016A1 true WO2001008016A1 (en) | 2001-02-01 |
Family
ID=14236326
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP1999/004041 WO2001008016A1 (en) | 1999-07-28 | 1999-07-28 | Network managing system |
Country Status (4)
Country | Link |
---|---|
AU (1) | AU4928799A (en) |
CA (1) | CA2348294A1 (en) |
GB (1) | GB2363286B (en) |
WO (1) | WO2001008016A1 (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8122106B2 (en) | 2003-03-06 | 2012-02-21 | Microsoft Corporation | Integrating design, deployment, and management phases for systems |
US8489728B2 (en) | 2005-04-15 | 2013-07-16 | Microsoft Corporation | Model-based system monitoring |
CN109669844B (en) * | 2018-11-27 | 2022-08-23 | 平安科技(深圳)有限公司 | Equipment fault processing method, device, equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH0730540A (en) * | 1993-07-08 | 1995-01-31 | Hitachi Ltd | Network fault monitor equipment |
JPH0818593A (en) * | 1994-06-27 | 1996-01-19 | Internatl Business Mach Corp <Ibm> | Limited plurality of fault management method and diagnostic system |
JPH09247145A (en) * | 1996-03-05 | 1997-09-19 | Nippon Telegr & Teleph Corp <Ntt> | Network management system |
1999
- 1999-07-28 AU AU49287/99A patent/AU4928799A/en not_active Abandoned
- 1999-07-28 CA CA002348294A patent/CA2348294A1/en not_active Abandoned
- 1999-07-28 GB GB0114407A patent/GB2363286B/en not_active Expired - Fee Related
- 1999-07-28 WO PCT/JP1999/004041 patent/WO2001008016A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
GB2363286B (en) | 2003-08-27 |
GB2363286A (en) | 2001-12-12 |
GB0114407D0 (en) | 2001-08-08 |
AU4928799A (en) | 2001-02-13 |
CA2348294A1 (en) | 2001-02-01 |
Legal Events
- AK (Designated states): Kind code of ref document: A1; Designated state(s): AU CA CN GB IN KR US
- DFPE: Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
- WWE (Wipo information: entry into national phase): Ref document number: 09830172; Country of ref document: US
- ENP (Entry into the national phase): Ref document number: 2348294; Country of ref document: CA; Kind code of ref document: A; Format of ref document f/p: F
- WWE (Wipo information: entry into national phase): Ref document number: 49287/99; Country of ref document: AU
- ENP (Entry into the national phase): Ref country code: GB; Ref document number: 200114407; Kind code of ref document: A; Format of ref document f/p: F