WO2011007394A1 - Management system that outputs information representing a recovery method corresponding to the root cause of a failure - Google Patents
- Publication number
- WO2011007394A1 (PCT/JP2009/003358; JP2009003358W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- event
- information
- rule
- meta
- failure history
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/079—Root cause analysis, i.e. error or fault diagnosis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/0748—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a remote unit communicating with a single-box computer node experiencing an error/fault
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3055—Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/32—Monitoring with visual or acoustical indication of the functioning of the machine
- G06F11/324—Display of status information
- G06F11/327—Alarm or error message display
Definitions
- the present invention relates to output of information representing a recovery method from a failure.
- As a system for supporting identification of a recovery method, there is, for example, the failure history database system disclosed in Patent Document 1.
- the system administrator registers the failure that has occurred in the monitoring target node and the method of actually recovering from the failure as a failure history in the database system.
- the database system holds a plurality of failure histories.
- the administrator of the monitoring target node (hereinafter sometimes referred to as “system administrator”) inputs a desired keyword when a new failure occurs.
- the database system searches for a failure history that matches the input keyword from a plurality of failure histories.
- the monitoring system receives a change in the operating state of the monitoring target node (for example, an input/output (I/O) error with respect to a disk device, or a decrease in processor throughput) from the monitoring target node as an event.
- the system administrator learns the contents of the event by receiving the event as a message or via a warning indicator.
- the administrator knows the failure (for example, service stoppage or performance degradation) of the monitored node from the contents of the event, and predicts the root cause of the failure.
- There is Root Cause Analysis (hereinafter referred to as RCA) as a technique for predicting the root cause of a failure.
- the monitoring system holds in advance a combination of an event group and a root cause as a rule.
- the monitoring system estimates the root cause of the event from the rule including the event.
- An inconsistency amount is calculated for each of the cases where the generated event is known and where it is unknown, and the calculated inconsistency amounts are taken into account in estimating the root cause of the failure.
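The rule-based root-cause estimation described above can be sketched as follows. This is an illustrative sketch only, not the patented algorithm: the rule structure and the certainty calculation (fraction of a rule's expected events actually observed) are assumptions introduced for the example.

```python
# Illustrative sketch of rule-based root-cause analysis (RCA).
# A rule pairs an expected event group with a root cause; the
# certainty factor used here (share of a rule's events actually
# observed) is an illustrative assumption, not the patented formula.

def estimate_root_causes(observed_events, rules):
    """Return (root_cause, certainty) pairs for rules matching at
    least one observed event, sorted by descending certainty."""
    observed = set(observed_events)
    results = []
    for rule in rules:
        expected = set(rule["events"])
        matched = expected & observed
        if matched:
            certainty = len(matched) / len(expected)
            results.append((rule["cause"], certainty))
    return sorted(results, key=lambda rc: rc[1], reverse=True)

rules = [
    {"events": {"server.io_throughput_abnormal",
                "switch.traffic_abnormal"},
     "cause": "server NIC failure"},
    {"events": {"storage.volume_io_error",
                "server.io_throughput_abnormal",
                "switch.traffic_abnormal"},
     "cause": "storage controller failure"},
]

events = ["server.io_throughput_abnormal", "switch.traffic_abnormal"]
for cause, certainty in estimate_root_causes(events, rules):
    print(cause, certainty)
```

With both events of the first rule observed, "server NIC failure" scores 1.0, while the storage rule scores only 2/3 because one of its three expected events is missing.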
- In Patent Document 3, information representing the environmental relationships between monitored nodes is constructed. When estimating the root cause of a failure, that information is used to identify which monitoring target nodes are affected by a failure occurring in a certain monitoring target node.
- Suppose the monitoring target nodes are a switch A and a server A having a communication interface device (communication I/F) connected to the switch A, and that the server A performs I/O to a storage device via the switch A.
- a failure occurs in the communication I / F (for example, NIC (Network Interface Card)) of the server A. Due to the failure, a first event in which the I / O throughput of the server A reaches an abnormal value and a second event in which the network traffic of the switch A reaches an abnormal value occur.
- the monitoring system detects an event group including the first and second events, and the contents of the event group are transmitted to the system administrator. At this time, the failure history database system may store not the same case but a similar case.
- A "same case" is a failure history that includes information representing the same failure as the failure that occurred (the failure corresponding to the event group). A "similar case" is a failure history that includes information representing a failure different from the one that occurred, but also includes information representing the same recovery method as the recovery method for the failure that occurred.
- In Patent Document 1, a keyword chosen by the system administrator is used to search for a failure history. Depending on the keyword, the target failure history may therefore not be hit, or many irrelevant failure histories may be hit.
- In Patent Document 2, when the root cause of a failure is used as the search query, the same case may be hit but a similar case may not be.
- In Patent Document 3, when the monitoring target node in which the root-cause failure occurred, or a monitoring target node affected by that failure, is used as the search query, many unrelated failure histories may be hit.
- an object of the present invention is to enable a system administrator to quickly identify an appropriate recovery method according to the root cause of a failure.
- the management server holds meta rules, each identifying the event that is the root cause of events that can occur in a plurality of node devices, and a failure recovery method corresponding to each meta rule; the management server identifies the root cause of a detected event and displays a recovery method for that cause event.
- the recovery method may be information created or updated based on a recovery method that was input by an administrator using the management server when the above-described plurality of node devices were previously recovered.
- FIG. 1 is a block diagram illustrating the configuration of the computer system according to the first embodiment.
- FIG. 2 is a block diagram showing the configuration of the management server.
- FIG. 3 is a block diagram showing the configuration of the display computer.
- FIG. 4 is a block diagram showing a configuration of server information.
- FIG. 5 is a block diagram showing a configuration of switch information.
- FIG. 6 is a block diagram showing the configuration of storage information.
- FIG. 7 is a block diagram showing a configuration of topology information.
- FIG. 8 is a block diagram showing a configuration of the meta RCA rule information.
- FIG. 9 is a block diagram showing the configuration of the expanded RCA rule information.
- FIG. 10 is a block diagram showing a configuration of event information.
- FIG. 11 is a block diagram illustrating a configuration of the failure analysis context.
- FIG. 12A is a block diagram illustrating a configuration of a failure history entry.
- FIG. 12B is a block diagram illustrating a configuration of server weight information.
- FIG. 12C is a block diagram illustrating a configuration of switch weight information.
- FIG. 12D is a block diagram illustrating a configuration of storage weight information.
- FIG. 13 is a flowchart for creating an expanded RCA rule.
- FIG. 14 is a flowchart for determining a root cause candidate and its certainty factor.
- FIG. 15 is a flowchart for creating a failure analysis context.
- FIG. 16 is a flowchart for selecting a root cause.
- FIG. 17 is a flowchart for registering a failure history.
- FIG. 18A is a flowchart for matching failure analysis contexts.
- FIG. 18B is a flowchart showing details of step 1026 in FIG. 18A.
- FIG. 18C is a flowchart showing details of step 1031 in FIG. 18B.
- FIG. 18D is a flowchart showing details of step 1034 in FIG. 18B.
- FIG. 18E is a diagram showing an overview of failure analysis context matching.
- FIG. 18F is a flowchart showing details of step 1035 in FIG. 18B.
- FIG. 18G is a diagram showing an overview of failure analysis context matching.
- FIG. 19 shows an example of the candidate/certainty-factor screen.
- FIG. 20 shows an example of a failure history search result screen.
- FIG. 21 shows an example of a failure history registration screen.
- FIG. 22A shows an example of the meta recovery method registration screen displayed in the second embodiment.
- FIG. 22B shows another example of the display area e13 on the meta recovery method registration screen.
- FIG. 23 shows an example of the candidate/certainty-factor screen displayed in the second embodiment.
- FIG. 24A shows a first example of the matching degree comparison screen.
- FIG. 24B shows a second example of the matching degree comparison screen.
- FIG. 1 is a block diagram relating to the configuration of the computer system 1 according to the first embodiment of the present invention.
- the computer system 1 includes a management server 10, a display computer 20, and a monitoring target node 30. Note that one management server 10, one display computer 20, and one monitoring target node 30 are shown, but any number may be provided.
- the monitored node 30 is a device managed by the management server 10.
- Examples of the monitoring target node 30 include a server computer, a storage device (for example, a disk array device having a RAID configuration), a network switch (for example, an FC (Fibre Channel) switch or a router), and a proxy server. Other devices may be used.
- the management server 10 is a computer that manages one or more monitoring target nodes 30.
- the display computer 20 is a computer having a display screen for displaying information output from the management server 10.
- the management server 10, the display computer 20, and the monitoring target node 30 are connected to each other via a network 50.
- the network 50 that connects the management server 10 and the display computer 20 and the network 50 that connects the management server 10 and the monitoring target node 30 are the same network, but may be separate networks.
- the management server 10 and the display computer 20 may be integrated.
- the management server 10 may be composed of a plurality of computers, and the plurality of computers may have the functions of the management server 10.
- Hereinafter, the one or more computers constituting the management server 10 and the display computer 20 may be referred to as a "management system".
- For example, the management server 10 alone is a management system, and the combination of the management server 10 and the display computer 20 is also a management system.
- FIG. 2 shows the configuration of the management server 10.
- the management server 10 is a computer including a memory 110, a memory interface 161, a processor 140, and a network interface 150.
- the memory interface 161, the processor 140, and the network interface 150 are connected to each other by an internal network (for example, a bus) 160.
- the processor 140 accesses the memory 110 via the memory interface 161.
- the processor 140 performs various processes by executing a program stored in the memory 110.
- In the following description, a program may be used as the grammatical subject. However, since a program performs predetermined processing using the memory 110 and the network interface 150 by being executed by the processor 140, the description may equally use the processor as the subject. Processing described with a program as the subject may be regarded as processing performed by a computer such as the management server 10. Part or all of a program may be realized by dedicated hardware.
- Various programs may be installed in each computer from a program source (for example, a program distribution server or a computer-readable storage medium (for example, a portable medium)).
- the memory 110 stores programs executed by the processor 140, information required by the processor 140, and the like. Specifically, for example, the memory 110 stores server information 111, switch information 112, storage information 113, topology information 114, meta RCA rule information 115, expanded RCA rule information 116, event information 117, failure history information 119, a topology application program 121, a rule matching analysis program 122, a generation program 123, a context matching analysis program 124, and a failure history management program 125. The memory 110 also stores an application program (hereinafter referred to as AP) 131 and an OS (Operating System) 132.
- the AP 131 is a program that implements various processes.
- For example, the AP 131 provides a database management function or a Web server function.
- the OS 132 is a program that controls the entire processing of the management server 10.
- the server information 111 is information for managing configuration information of a server that is a kind of monitoring target node.
- the switch information 112 is information for managing configuration information of a switch that is a kind of monitoring target node.
- Storage information 113 is information for managing configuration information of a storage device that is a kind of monitoring target node.
- the topology information 114 is information for managing information on the connection configuration (topology) of servers, switches, and storages that are monitoring target nodes.
- the meta RCA rule information 115 is information for managing meta RCA rules. Details of the meta RCA rule will be described later in <1-1: Definition of Terms>.
- the expanded RCA rule information 116 is information for managing expanded RCA rules. Details of the expanded RCA rule will be described later in <1-1: Definition of Terms>.
- the event information 117 is information for managing event records of events that have occurred in the monitoring target node.
- the failure history information 119 includes one or more failure history entries.
- One failure history entry includes information indicating the cause of a failure that has occurred in the past, information indicating a recovery method, and a failure analysis context.
- At least the failure history information 119 may be stored in an external storage resource (for example, an external storage device). In that case, for example, the processor 140 can access the failure history information 119 via the network interface 150.
- the topology application program 121 creates the expanded RCA rule information 116 using the meta RCA rule information 115, the server information 111, the switch information 112, the storage information 113, and the topology information 114.
- the rule matching analysis program 122 uses the expanded RCA rule information 116 and the event information 117 to determine the meta RCA rule and the expanded RCA rule associated with the events, together with a certainty factor.
- the generation program 123 generates a failure analysis context using the meta RCA rule information 115, the expanded RCA rule information 116, the server information 111, the switch information 112, the storage information 113, and the topology information 114.
- the context matching analysis program 124 matches the generated failure analysis context with the failure analysis context in each failure history entry.
- the failure history management program 125 generates a failure history entry including the generated failure analysis context, information representing the recovery method, and information representing the content of the failure that occurred, and includes that entry in the failure history information 119.
- the network interface 150 transmits and receives data to and from another computer (for example, a monitoring target node) via the network 50.
- the various programs stored in the memory 110 are not necessarily separate program codes, and one or more program codes may realize program processing.
- the management server 10 may have an input/output device.
- Examples of the input/output device include a display, a keyboard, and a pointing device, but other devices may be used.
- As an alternative to the input/output device, a serial interface or an Ethernet interface may be used: a display computer having a display, a keyboard, or a pointing device is connected to that interface, display information is transmitted to the display computer so that the display computer performs the display, and input is received from the display computer in place of the input/output device.
- FIG. 3 shows the configuration of the display computer 20.
- the display computer 20 includes a memory 210, a processor 240, a network interface 250, and an input/output device 260 (a memory interface such as the one shown in FIG. 2 is omitted from the drawing).
- the memory 210, the processor 240, the network interface 250, and the input / output device 260 are connected to each other by an internal network 270.
- the processor 240 performs various processes by executing a program stored in the memory 210.
- the memory 210 stores a program executed by the processor 240, information required by the processor 240, and the like. Specifically, for example, the screen display program 211 is stored in the memory 210. Further, the memory 210 stores an application program (hereinafter referred to as AP) 221 and an OS (Operating System) 222.
- the AP 221 is a program that implements various processes. For example, the AP 221 provides a WEB client function.
- the OS 222 is a program that controls the entire processing of the display computer 20.
- the screen display program 211 is a program for displaying information on the input / output device 260, for example, a display device.
- the network interface 250 transmits and receives data to and from another computer (for example, the management server 10) via the network 50.
- Examples of the input/output device 260 include a display, a keyboard, and a pointing device, but other devices may be used.
- Instead, a serial interface or an Ethernet interface may be provided, and a display computer having a display, a keyboard, or a pointing device may be connected to that interface as the input/output device.
- the display computer 20 may receive display information from the management server 10 or transmit input information to the management server 10.
- For example, the management server 10 may comprise first and second computers: the first computer executes the topology application program 121, the rule matching analysis program 122, and the generation program 123, while the second computer executes the context matching analysis program 124 and the failure history management program 125.
- In that case, the first computer may hold the server information 111, the switch information 112, the storage information 113, the topology information 114, the meta RCA rule information 115, and the expanded RCA rule information 116, while the second computer may hold the event information 117 and the failure history information 119.
- An "event" is an occurrence of a change in the operating state of the monitoring target node 30.
- Event record is information for identifying an event.
- the event record includes, for example, an event type that is information indicating the type of the event, an identifier of the monitoring target node 30 that is the generation source, information that represents the content of the event, and information that represents the occurrence date and time of the event. There is one event record for each event.
- RCA is an abbreviation of Root Cause Analysis, and is a function for identifying the monitoring target node that is the root cause of events, based on the event records of the monitoring target nodes (for example, servers, switches, and storage devices).
- A "meta RCA rule" is a rule that defines a certain failure and the group of events expected to occur due to that failure. It is used by RCA. By using a meta RCA rule, the root-cause failure can be derived from an event group.
- the meta RCA rule is information (meta information) that does not include topology information representing a topology composed of one or more monitoring target nodes.
- An "expanded RCA rule" is a rule in which a meta RCA rule has been expanded for each monitoring target node. It is used by RCA.
- A "failure analysis context" is information used when analyzing a failure.
- the failure analysis context associates a record in the meta RCA rule information 115, a record in the expanded RCA rule information 116, a record in the server information 111, a record in the switch information 112, a record in the storage information 113, and a record in the topology information 114. Details will be described later with reference to FIG. 11.
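As an illustration of how a meta RCA rule relates to expanded RCA rules, the sketch below instantiates one topology-independent meta rule into one expanded rule per concrete topology. The record shapes and field names here are assumptions for illustration only; the actual layouts are the subject of FIGs. 8 and 9.

```python
# Illustrative expansion of a topology-independent meta RCA rule
# into per-topology expanded RCA rules. Field names are assumed
# for the example; they loosely mirror FIGs. 8 and 9.

def expand_meta_rule(meta_rule, topologies):
    """Create one expanded rule per topology, binding the meta
    rule's abstract cause-node type to a concrete node ID."""
    expanded = []
    for topo in topologies:
        expanded.append({
            "meta_rule_id": meta_rule["id"],
            "topology_id": topo["id"],
            # Pick the concrete node of the causal type in this topology.
            "cause_node_id": topo[meta_rule["cause_node_type"]],
            "cause_detail": meta_rule["cause_content"],
        })
    return expanded

meta_rule = {"id": "MetaRule1",
             "cause_node_type": "server",
             "cause_content": "communication I/F failure"}
topologies = [
    {"id": "T1", "server": "SvA", "switch": "SwA", "storage": "StA"},
    {"id": "T2", "server": "SvB", "switch": "SwA", "storage": "StA"},
]

for rule in expand_meta_rule(meta_rule, topologies):
    print(rule["topology_id"], rule["cause_node_id"])
```

Because the meta rule carries no topology information, the same rule yields two expanded rules here, one naming SvA and one naming SvB as the potential root-cause node.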
- FIG. 4 is a diagram showing the server information 111.
- the server information 111 is a table having one record per server (hereinafter referred to as server record).
- the server record is a record having a server ID 501, a server name 502, a server vendor 503, a server IP address 504, a server OS 505, and a server continuous operation time 506 as attribute values.
- the information elements 501 to 506 will be described by taking one server (hereinafter referred to as “target server” in the description of FIG. 4) as an example.
- the server ID 501 is an identifier assigned to the target server that is the monitoring target node 30 by the topology application program 121.
- the server name 502 is a computer name of the target server.
- the server vendor 503 is the manufacturer name of the target server.
- the server IP address 504 is an identifier assigned to the target server on the network.
- the server OS 505 is an OS name installed in the target server.
- the continuous operation time 506 of the server is a continuous operation time from the last start of the target server to the present.
- the server information 111 may have a data structure other than a table as long as it has an attribute value related to the server, or may have an attribute value other than the attribute values described above. Further, the server information 111 may not have at least one attribute value other than the server ID 501.
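For illustration, one server record of the kind described above could be modeled in memory as a simple structure; the attribute names below mirror the fields 501 to 506, and the concrete values are invented for the example.

```python
# Illustrative in-memory model of one server record from the
# server information table (FIG. 4). Attribute names mirror
# fields 501-506; the values are made up for the example.
from dataclasses import dataclass

@dataclass
class ServerRecord:
    server_id: str       # 501: identifier assigned by the topology program
    name: str            # 502: computer name
    vendor: str          # 503: manufacturer name
    ip_address: str      # 504: identifier assigned on the network
    os: str              # 505: installed OS name
    uptime_hours: float  # 506: continuous operation time since last start

rec = ServerRecord("SvA", "web-01", "VendorX",
                   "192.0.2.10", "ExampleOS", 312.5)
print(rec.server_id, rec.ip_address)
```

The switch and storage records of FIGs. 5 and 6 have the same shape, differing only in their type-specific fields (switch type, storage firmware).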
- FIG. 5 is a diagram showing the switch information 112.
- the switch information 112 is a table having one record per switch (hereinafter referred to as switch record).
- the switch record is a record having the switch ID 511, the switch name 512, the switch vendor 513, the switch IP address 514, the switch type 515, and the switch continuous operation time 516 as attribute values.
- the information elements 511 to 516 will be described by taking one switch (hereinafter referred to as “target switch” in the description of FIG. 5) as an example.
- the switch ID 511 is an identifier assigned to the target switch that is the monitoring target node 30 by the topology application program 121.
- the switch name 512 is a computer name of the target switch.
- the switch vendor 513 is the manufacturer name of the target switch.
- the switch IP address 514 is an identifier assigned to the target switch on the network.
- Switch type 515 is the model name of the target switch.
- the continuous operation time 516 of the switch is a continuous operation time from the last activation of the target switch to the present.
- the switch information 112 may have a data structure other than a table as long as it has an attribute value related to a switch, and may have an attribute value other than the attribute values described above.
- the switch information 112 may not have at least one attribute value other than the switch ID 511.
- FIG. 6 is a diagram showing the storage information 113.
- Storage information 113 is a table having one record (hereinafter referred to as storage record) for each storage device.
- the storage record is a record having a storage ID 521, a storage name 522, a storage vendor 523, a storage IP address 524, a storage firmware 525, and a continuous operation time 526 of the storage as attribute values.
- the information elements 521 to 526 will be described by taking one storage apparatus (hereinafter referred to as “target storage” in the description of FIG. 6) as an example.
- the storage ID 521 is an identifier assigned to the target storage that is the monitoring target node 30 by the topology application program 121.
- Storage name 522 is the computer name of the target storage.
- the storage vendor 523 is the manufacturer name of the target storage.
- the storage IP address 524 is an identifier assigned to the target storage on the network.
- Storage firmware 525 is the name of the firmware installed in the target storage.
- the continuous operation time 526 of the storage is the continuous operation time from the last start of the target storage to the present.
- the storage information 113 may have a data structure other than a table as long as it has an attribute value related to the storage device, and may have an attribute value other than the attribute values described above. In addition, the storage information 113 may not have at least one attribute value other than the storage ID 521.
- FIG. 7 is a diagram showing the topology information 114.
- the topology information 114 is a table having one record for each topology (hereinafter, topology record).
- the topology record is a record having the topology ID 531, server ID 532, switch ID 533, and storage ID 534 as attribute values.
- the information elements 531 to 534 will be described by taking one topology (hereinafter referred to as “target topology” in the description of FIG. 7) as an example.
- the topology ID 531 is an identifier of the target topology.
- the “topology” is a connection form between the monitoring target nodes 30, in other words, a combination of the monitoring target nodes 30. Specifically, the types and arrangement of monitoring target nodes are defined as the topology.
- the server ID 532 is the server ID 501 of the server that the target topology has.
- Switch ID 533 is the switch ID 511 of the switch that the target topology has.
- the storage ID 534 is the storage ID 521 of the storage device that the target topology has.
- the topology information 114 may have a data structure other than the table as long as it has an attribute value related to the connection form of the monitoring target node 30, and may have an attribute value other than the attribute values described above.
- the topology is typically a connection form in which a server (computer) is connected to a storage device via a switch (network switch).
- the server issues an I / O command (write command or read command) specifying a logical volume provided from the storage apparatus.
- the I / O command reaches the storage device via the switch.
- the storage apparatus performs I / O with respect to the logical volume specified by the command.
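A topology record ties together exactly one server, one switch, and one storage device by their IDs, so the I/O path just described can be resolved from it. The sketch below is illustrative; the record fields mirror FIG. 7 and the IDs are invented.

```python
# Illustrative topology record (FIG. 7): each record links one
# server, one switch, and one storage device by ID. The helper
# resolves the I/O path server -> switch -> storage.

topology_records = [
    {"topology_id": "T1", "server_id": "SvA",
     "switch_id": "SwA", "storage_id": "StA"},
]

def io_path(topology_id, records):
    """Return the (server, switch, storage) IDs along the I/O path
    of the given topology, or None if the topology is unknown."""
    for rec in records:
        if rec["topology_id"] == topology_id:
            return (rec["server_id"], rec["switch_id"], rec["storage_id"])
    return None

print(io_path("T1", topology_records))  # ('SvA', 'SwA', 'StA')
```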
- FIG. 8 is a diagram showing the meta RCA rule information 115.
- the meta RCA rule information 115 is a table having one record for each meta RCA rule (hereinafter, “meta RCA record”).
- the meta RCA record is a record having a meta RCA rule ID 541, a server event 542, a switch event 543, a storage event 544, a cause node 545, and cause content 546 as attribute values.
- the information elements 541 to 546 will be described by taking one meta RCA rule (hereinafter referred to as “target meta RCA rule” in the description of FIG. 8) as an example.
- the meta RCA rule ID 541 is an identifier assigned by the rule matching analysis program 122 to the target meta RCA rule.
- the server event 542 is information representing the content of the event at the server included in the target meta RCA rule.
- the switch event 543 is information indicating the content of the event at the switch of the target meta RCA rule.
- the storage event 544 is information indicating the content of the event in the storage device possessed by the target meta RCA rule.
- the cause node 545 is information indicating the type of node that is the root cause of the event included in the target meta RCA rule.
- the cause content 546 is information indicating the content of the root cause of the event included in the target meta RCA rule.
- a combination of the cause content 546 and the above-described cause node 545 represents the root cause of the event group.
- the meta RCA rule information 115 may have a data structure other than a table as long as it has an attribute value related to the meta RCA rule, and may have an attribute value other than the attribute values described above.
- FIG. 9 is a diagram showing the expanded RCA rule information 116.
- the expanded RCA rule information 116 is a table having one record (hereinafter referred to as an expanded RCA record) for each expanded RCA rule.
- the expanded RCA record is a record having expanded RCA rule ID 551, meta RCA rule ID 552, topology ID 553, cause node ID 554, and cause details 555 as attribute values.
- the information elements 551 to 555 will be described by taking one expanded RCA rule (hereinafter referred to as “target expanded RCA rule” in the description of FIG. 9) as an example.
- the expanded RCA rule ID 551 is an identifier assigned by the rule matching analysis program 122 to the target expanded RCA rule.
- the meta RCA rule ID 552 is the meta RCA rule ID 541 of the meta RCA rule to which the target expanded RCA rule belongs.
- the topology ID 553 is the topology ID 531 of the topology to which the target expanded RCA rule belongs.
- the cause node ID 554 is the server ID 501, switch ID 511, or storage ID 521 identifying the monitoring target node 30 that is the root cause of the target expanded RCA rule.
- the cause detail 555 is the cause content 546 representing the content of the root cause of the target expanded RCA rule.
- the expanded RCA rule information 116 may have a data structure other than a table as long as it has an attribute value related to the expanded RCA rule, and may have an attribute value other than the attribute values described above.
- FIG. 10 is a diagram showing the event information 117.
- the event information 117 is a table having one event record for each event.
- the event record is a record having an event ID 561, an event type 562, a target node type 563, a target node ID 564, an event content 565, an occurrence date and time 566, and a state 567 as attribute values.
- the information elements 561 to 567 will be described by taking one event (hereinafter referred to as “target event” in the description of FIG. 10) as an example.
- the event ID 561 is an identifier assigned to the event record of the target event by the rule matching analysis program 122.
- the event type 562 is information indicating the type of target event. Specific values of the event type 562 include, for example, “Critical”, “Warning”, and “Information”.
- the target node type 563 is information indicating the node type (for example, server, switch, or storage device) of the monitoring target node 30 that is the source of the target event.
- the target node ID 564 is a server ID 501, a switch ID 511, or a storage ID 521 that represents the monitoring target node 30 that is the source of the target event.
- the event content 565 is information representing the content of the target event.
- the occurrence date and time 566 is information representing the occurrence date and time of the target event.
- the state 567 is information indicating whether or not the target event has been resolved.
- the event information 117 may have a data structure other than a table as long as it has an attribute value related to the event, and may have an attribute value other than the attribute values described above. Further, the event information 117 may not have at least one attribute value other than the event ID 561, the target node ID 564, the event content 565, and the occurrence date and time 566.
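An event record with the attribute values 561 to 567 can be sketched as follows; the state 567 is set to "unresolved" when the record is created. Field names and sample values are illustrative assumptions.

```python
from datetime import datetime

# Hypothetical sketch of adding one event record (FIG. 10) from an event
# entry received from a monitoring target node.
def add_event_record(event_info, event_entry, new_event_id):
    record = {
        "event_id": new_event_id,                      # 561
        "event_type": event_entry["type"],             # 562: Critical / Warning / Information
        "target_node_type": event_entry["node_type"],  # 563
        "target_node_id": event_entry["node_id"],      # 564
        "event_content": event_entry["content"],       # 565
        "occurred_at": event_entry["occurred_at"],     # 566
        "state": "unresolved",                         # 567: set on arrival
    }
    event_info.append(record)
    return record

event_info = []
entry = {"type": "Critical", "node_type": "Server", "node_id": "SvA",
         "content": "I/O error", "occurred_at": datetime(2009, 7, 16, 10, 0)}
rec = add_event_record(event_info, entry, "Ev1")
```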
- FIG. 11 is a diagram showing the failure analysis context 120.
- the failure analysis context 120 is data having a failure analysis context ID 601, a meta RCA rule ID 602, a deployment RCA rule ID 603, a topology ID 604, a server ID 605, a switch ID 606, and a storage ID 607 as attribute values.
- the failure analysis context ID 601 is an identifier assigned to the failure analysis context 120 by the generation program 123.
- the meta RCA rule ID 602 is a meta RCA rule ID 541 for identifying the meta RCA rule associated with the failure analysis context 120.
- the expanded RCA rule ID 603 is an expanded RCA rule ID 551 for identifying the expanded RCA rule associated with the failure analysis context 120.
- the topology ID 604 is a topology ID 531 for identifying the topology associated with the failure analysis context 120.
- the server ID 605 is a server ID 501 for identifying a server associated with the failure analysis context 120.
- the switch ID 606 is a switch ID 511 for identifying a switch associated with the failure analysis context 120.
- the storage ID 607 is a storage ID 521 for identifying the storage device associated with the failure analysis context 120.
- failure analysis context 120 may have an attribute value other than the attribute values described above.
- FIG. 12A is a diagram showing a failure history entry 1191 included in the failure history information 119.
- Failure history entry 1191 is data having failure history ID 701, meta RCA rule ID 702, expanded RCA rule ID 703, topology ID 704, server ID 705, switch ID 706, storage ID 707, server weight ID 708, switch weight ID 709, storage weight ID 710, cause 711, and recovery method 712 as attribute values.
- the failure history ID 701 is an identifier assigned to the failure history entry 1191 by the failure history management program 125.
- the meta RCA rule ID 702 is a meta RCA rule ID 541 for identifying the meta RCA rule associated with the failure history entry 1191.
- the expanded RCA rule ID 703 is an expanded RCA rule ID 551 for identifying the expanded RCA rule associated with the failure history entry 1191.
- the topology ID 704 is a topology ID 531 for identifying the topology associated with the failure history entry 1191.
- the server ID 705 is a server ID 501 for identifying the server associated with the failure history entry 1191.
- the switch ID 706 is a switch ID 511 for identifying the switch associated with the failure history entry 1191.
- the storage ID 707 is a storage ID 521 for identifying the storage device associated with the failure history entry 1191.
- the server weight ID 708 is a server weight ID 801 (see FIG. 12B) for identifying the server weight record associated with the failure history entry 1191.
- the server weight record is a record that the server weight information 800 has.
- the switch weight ID 709 is a switch weight ID 811 (see FIG. 12C) for identifying the switch weight record associated with the failure history entry 1191.
- the switch weight record is a record that the switch weight information 810 has.
- the storage weight ID 710 is a storage weight ID 821 (see FIG. 12D) for identifying the storage weight record associated with the failure history entry 1191.
- the storage weight record is a record that the storage weight information 820 has.
- the cause 711 is information indicating the cause of the failure corresponding to the failure history entry 1191.
- the recovery method 712 is information representing a recovery method from a failure corresponding to the failure history entry 1191.
- IDs 702 to 707 included in the failure history entry 1191 are duplicates of IDs 602 to 607 included in the failure analysis context 120 (see FIG. 11). That is, as described above, the failure history entry 1191 contains the failure analysis context 120. According to FIG. 12A, the failure analysis context ID 601 is not included in the failure history entry 1191, but the ID 601 may be included in the entry 1191.
- the failure history information 119 may have a data structure other than the data structure described above, or may have an attribute value other than the attribute values described above. Further, the failure history information 119 may not include the server weight ID 708, the switch weight ID 709, and the storage weight ID 710.
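How a failure history entry duplicates the IDs of a failure analysis context and adds the weight IDs, cause, and recovery method can be sketched as follows; all record layouts, identifiers, and sample values are illustrative assumptions.

```python
# Hypothetical sketch of building a failure history entry (FIG. 12A) from
# a failure analysis context (FIG. 11).
def make_failure_history_entry(history_id, context, weight_ids, cause, recovery):
    entry = {"failure_history_id": history_id}  # 701
    # IDs 702-707 duplicate IDs 602-607 of the failure analysis context
    for key in ("meta_rca_rule_id", "expanded_rca_rule_id", "topology_id",
                "server_id", "switch_id", "storage_id"):
        entry[key] = context[key]
    entry.update(weight_ids)             # 708-710: weight record identifiers
    entry["cause"] = cause               # 711
    entry["recovery_method"] = recovery  # 712
    return entry

context = {"meta_rca_rule_id": "MetaRule1", "expanded_rca_rule_id": "ExRule1",
           "topology_id": "Topology1", "server_id": "SvA",
           "switch_id": "SwA", "storage_id": "StA"}
weights = {"server_weight_id": "SvW1", "switch_weight_id": "SwW1",
           "storage_weight_id": "StW1"}
entry = make_failure_history_entry(
    "Hist1", context, weights,
    cause="Volume blockage",
    recovery="Replace the failed volume and restore from backup",
)
```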
- FIG. 12B is a diagram showing server weight information 800.
- the server weight information 800 is a table having one record (server weight record) for each server weight.
- the server weight record is a record having the server weight ID 801, the server vendor 802, the server IP address 803, the server OS 804, and the server continuous operation time 805 as attribute values.
- the information elements 801 to 805 will be described by taking one server weight (referred to as “target server weight” in the description of FIG. 12B) as an example.
- the server weight ID 801 is an identifier assigned to the target server weight.
- the server vendor 802 is a kind of weight belonging to the target server weight, and is a value representing how much importance is given to the item of server vendor.
- the server IP address 803 is a kind of weight belonging to the target server weight, and is a value representing how important the item of the server IP address is.
- the server OS 804 is a kind of weight belonging to the target server weight, and is a value representing how much importance is given to the item of server OS.
- the continuous operation time 805 of the server is a kind of weight belonging to the target server weight, and represents how much importance is given to the item of continuous operation time of the server.
- server weight is defined by the weights of a plurality of types of items related to the server.
- server weight information 800 may have a data structure other than a table as long as it has an attribute value related to the server weight, and may also have an attribute value other than the attribute values described above. Further, the server weight information 800 may not have at least one attribute value other than the server weight ID 801.
- FIG. 12C is a diagram showing switch weight information 810.
- the switch weight information 810 is a table having one record (switch weight record) for one switch weight.
- the switch weight record is a record having a switch weight ID 811, a switch vendor 812, a switch IP address 813, a switch type 814, and a switch continuous operation time 815 as attribute values.
- the information elements 811 to 815 will be described by taking one switch weight (referred to as “target switch weight” in the description of FIG. 12C) as an example.
- the switch weight ID 811 is an identifier assigned to the target switch weight.
- the switch vendor 812 is a kind of weight belonging to the target switch weight, and is a value indicating how much importance is given to the item of switch vendor.
- the switch IP address 813 is a kind of weight belonging to the target switch weight, and is a value representing how much importance is given to the item of the switch IP address.
- the switch type 814 is a kind of weight belonging to the target switch weight, and is a value representing how much importance is given to the item of switch type.
- the continuous operation time 815 of the switch is a kind of weight belonging to the target switch weight, and represents how much importance is given to the item of continuous operation time of the switch.
- switch weight is defined by the weights of a plurality of types of items related to the switch.
- the switch weight information 810 may have a data structure other than a table as long as it has an attribute value related to the switch weight, and may also have an attribute value other than the attribute values described above.
- the switch weight information 810 may not have at least one attribute value other than the switch weight ID 811.
- FIG. 12D is a diagram showing the storage weight information 820.
- the storage weight information 820 is a table having one record (storage weight record) for one storage weight.
- the storage weight record is a record having a storage weight ID 821, storage vendor 822, storage IP address 823, storage firmware 824, and storage continuous operation time 825 as attribute values.
- the information elements 821 to 825 will be described by taking one storage weight (referred to as “target storage weight” in the description of FIG. 12D) as an example.
- the storage weight ID 821 is an identifier assigned to the target storage weight.
- the storage vendor 822 is a kind of weight belonging to the target storage weight, and is a value indicating how much importance is given to the item of storage vendor.
- the storage IP address 823 is a kind of weight belonging to the target storage weight, and is a value indicating how much importance is given to the item of storage IP address.
- the storage firmware 824 is a kind of weight belonging to the target storage weight, and is a value indicating how much importance is given to the item of storage firmware.
- the continuous operation time 825 of the storage is a kind of weight belonging to the target storage weight, and represents how much importance is given to the item of continuous operation time of the storage.
- “storage weight” is defined by the weights of a plurality of types of items related to storage.
- the storage weight information 820 may have a data structure other than the table as long as it has an attribute value related to the storage weight, and may also have an attribute value other than the attribute values described above. Further, the storage weight information 820 may not have at least one attribute value other than the storage weight ID 821.
- the server weight information 800, the switch weight information 810, and the storage weight information 820 described above are included in the failure history information, for example.
- a topology composed of servers, switches, and storage devices is described as an example.
- the present invention is not limited to such a topology but can be applied to other types of topologies.
- More abstractly, a service providing node device (an example is a storage device) provides a network service to a service using node device (an example is a server).
- the server information is, more abstractly, service using node device information.
- the service using node device information can include the following information (a1) to (a3): (a1) a network identifier such as the IP address of the service using node device; (a2) information indicating the hardware or software configuration of the node device; and (a3) information indicating setting contents.
- the switch information (see FIG. 5) is, more abstractly, relay device information (or relay node device information).
- the relay device information can include the following information (b1) and (b2): (b1) information indicating the hardware or software configuration of a node device (an example is a switch) that mediates communication between the service using node device and the service providing node device; and (b2) information indicating setting contents.
- the service providing node device information can include the following information (c1) to (c3): (c1) a network identifier such as the IP address of the service providing node device; (c2) information indicating the hardware or software configuration of the node device; and (c3) information indicating setting contents. Further, the service providing node device information may include information indicating the type of network service provided by the service providing node device.
- the topology information can include information representing a set (or correspondence) of an identifier of a service using node device and an identifier of the service providing node device used by that service using node device. If the service using node device communicates with the service providing node device via one or more relay devices, the identifiers of the one or more relay devices may also be included in the topology information.
- the meta RCA rule information can include the following information (d1) and (d2) for each network service to be monitored by the management server: (d1) information indicating the combination of the type of a first event (service using node device occurrence event) that can occur in the service using node device with the type of a second event (service providing node device occurrence event) that can occur in the service providing node device (or relay device); and (d2) information indicating the cause (or cause type) that can occur in the service providing node device or the relay device and that can be determined (or estimated) to be the cause when the first event and the second event occur.
- the expanded RCA rule information can include the following information (e1) to (e3) for each monitored node that uses or provides a network service: (e1) information indicating the combination of the type of the first event that can occur in the service using node device and the identifier of that service using node device with the type of the second event that can occur in the service providing node device (or relay device) and the identifier of that service providing node device (or relay device); (e2) the identifier of the service providing node device (or relay device) that can be determined (or estimated) to be the cause when the first event and the second event occur; and (e3) information indicating the cause (or cause type) that may occur in the service providing node device (or relay device).
- the failure analysis context can include the identifier of the meta RCA rule used to identify the root cause of the failure. The identifier of the expanded RCA rule, the topology identifier, or the identifier of the monitoring target node used to identify the root cause of the failure may also be included.
- the failure history entry can include the content of the failure analysis context and information indicating a recovery method (for example, a recovery procedure) from the failure corresponding to the context.
- the failure history entry may include an evaluation value for evaluating the matching degree of the failure analysis context included in the entry, or an identifier of information in which the evaluation value is recorded.
- the server weight information can include a value for evaluating the matching degree, which is distributed to the hardware or software configuration of the service using node device and the setting content element.
- the switch weight information (see FIG. 12C) can include a value for evaluating the matching degree, which is distributed to the hardware or software configuration of the relay device and the setting content elements.
- the storage weight information can include a value for evaluating the matching degree, which is distributed to the hardware or software configuration of the service providing node device and the setting content element.
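One way the weight values could evaluate the matching degree between a past failure and a current one is a weighted agreement score, sketched below. The formula (matched weight over total weight) and the item names are assumptions for illustration, not the patented calculation.

```python
# Hypothetical sketch of evaluating the matching degree of a current node
# configuration against the node recorded in a past failure history entry,
# using a weight record in the shape of FIGS. 12B-12D.
def matching_degree(past_node, current_node, weight_record):
    """Sum the weights of the items on which the two nodes agree,
    normalized by the total weight (0.0 = no match, 1.0 = full match)."""
    total = sum(weight_record.values())
    matched = sum(w for item, w in weight_record.items()
                  if past_node.get(item) == current_node.get(item))
    return matched / total if total else 0.0

# illustrative server weight record: vendor and OS matter most (802-805)
server_weight = {"vendor": 3, "ip_address": 1, "os": 2, "uptime": 1}
past = {"vendor": "VendorA", "ip_address": "192.168.0.1", "os": "OS-X1", "uptime": "30d"}
now = {"vendor": "VendorA", "ip_address": "192.168.0.9", "os": "OS-X1", "uptime": "2d"}
score = matching_degree(past, now, server_weight)  # vendor and OS match: (3 + 2) / 7
```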
- FIG. 13 shows a flow for creating an expanded RCA rule.
- the topology application program 121 acquires information from the monitoring target node 30 through the network 50. If the monitoring target node 30 is a server, information including the server name, the server vendor name, the server IP address, the server OS name, and the continuous operation time of the server (hereinafter, server acquisition information) is acquired. The topology application program 121 creates or updates the server information 111 based on the server acquisition information received from each monitoring target node (each server).
- the topology application program 121 performs the following processes (A) and (B):
- (A) When the identifier in the server acquisition information is not stored in the server information 111, a server ID 501 (for example, the identifier in the server acquisition information) is assigned to the server record in the server information 111 corresponding to the server acquisition information (hereinafter referred to as the "target server record" in the description of FIG. 13), and the server ID 501 is stored in the target server record;
- (B) The server name 502, the vendor name 503, the IP address 504, the OS name 505, and the continuous operation time 506 in the server acquisition information are stored in the target server record.
- step 1001 may be omitted for a monitoring target node that is clearly not a server in advance.
- the topology application program 121 acquires information from the monitoring target node 30 through the network 50. If the monitoring target node 30 is a switch, information including the switch name, the switch vendor name, the switch IP address, the switch type, and the continuous operation time of the switch (hereinafter, switch acquisition information) is acquired. The topology application program 121 creates or updates the switch information 112 based on the switch acquisition information received from each monitoring target node (each switch).
- the topology application program 121 performs the following processes (A) and (B):
- (A) When the identifier in the switch acquisition information is not stored in the switch information 112, a switch ID 511 (for example, the identifier in the switch acquisition information) is assigned to the switch record in the switch information 112 corresponding to the switch acquisition information (hereinafter referred to as the "target switch record" in the description of FIG. 13), and the switch ID 511 is stored in the target switch record;
- (B) The switch name 512, the vendor name 513, the IP address 514, the type 515, and the continuous operation time 516 in the switch acquisition information are stored in the target switch record.
- step 1002 may be omitted for a monitoring target node that is clearly not a switch in advance.
- the topology application program 121 acquires information from the monitoring target node 30 through the network 50. If the monitoring target node 30 is a storage device, information including the storage name, the storage vendor name, the storage IP address, the storage firmware name, and the continuous operation time of the storage (hereinafter, storage acquisition information) is acquired. The topology application program 121 creates or updates the storage information 113 based on the storage acquisition information received from each monitoring target node (each storage device).
- the topology application program 121 performs the following processes (A) and (B):
- (A) When the identifier in the storage acquisition information is not stored in the storage information 113, a storage ID 521 (for example, the identifier in the storage acquisition information) is assigned to the storage record in the storage information 113 corresponding to the storage acquisition information (hereinafter referred to as the "target storage record" in the description of FIG. 13), and the storage ID 521 is stored in the target storage record;
- (B) The storage name 522, the vendor name 523, the IP address 524, the firmware 525, and the continuous operation time 526 in the storage acquisition information are stored in the target storage record.
- step 1003 may be omitted for a monitoring target node that is clearly not a storage in advance.
- the topology application program 121 receives the topology acquisition information of the monitoring target node 30 through the network 50.
- the topology acquisition information includes the ID of a switch and the IDs of the servers and storage devices connected to that switch.
- the topology application program 121 performs the following processes (A) and (B):
- (A) When the identifier in the topology acquisition information is not stored in the topology information 114, a topology ID 531 (for example, the identifier in the topology acquisition information) is assigned to the topology record in the topology information 114 corresponding to the topology acquisition information (hereinafter referred to as the "target topology record" in the description of FIG. 13), and the topology ID 531 is stored in the target topology record;
- (B) The switch ID 533, the server ID 532, and the storage ID 534 in the topology acquisition information are stored in the target topology record.
- the data structure of the topology acquisition information is not limited to the above structure as long as the topology information 114 can be updated.
- step 1004 may be omitted for nodes to be monitored that are clearly not switches, servers, or storage devices in advance.
- Alternatively, the topology record may be updated as follows: connection destination information indicating which monitoring target node is directly connected to each monitoring target node is acquired, path information indicating which logical volume is accessed from which server is acquired from the server or the storage device, and the target topology record is updated based on the connection destination information and the path information.
- the topology application program 121 can perform the following (Step A) to (Step D): (Step A) acquire, as node acquisition information, at least one value included in each piece of information acquired from the monitoring target node; (Step B) update the service using node device information, the service providing node device information, or the relay node device information based on the node acquisition information; (Step C) include, in the topology information, the correspondence between the identifier of the service providing node device for a predetermined network service and the identifier of the service using node device that uses that node device, based on the topology acquisition information; (Step D) update the expanded RCA rule information based on the topology information and the meta RCA rule information.
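(Step A) to (Step D) above can be sketched end to end as follows; the record layouts, field names, and the simple cross-product expansion in (Step D) are illustrative assumptions, not the patented implementation.

```python
# Hypothetical sketch of (Step A)-(Step D) of the topology application
# program 121; all structures and identifiers are illustrative.
def update_and_expand(node_acquisition, topology_acquisition, meta_rules):
    # (Step A / Step B) update node device information from the acquired values
    nodes = {n["id"]: n for n in node_acquisition}
    # (Step C) record which service using node device uses which service
    # providing node device (here: server, switch, and storage per topology row)
    topologies = list(topology_acquisition)
    # (Step D) expand every meta RCA rule over every topology row
    node_key = {"Server": "server_id", "Switch": "switch_id", "Storage": "storage_id"}
    expanded = []
    for meta in meta_rules:
        for topo in topologies:
            expanded.append({
                "expanded_rca_rule_id": f"ExRule{len(expanded) + 1}",
                "meta_rca_rule_id": meta["meta_rca_rule_id"],
                "topology_id": topo["topology_id"],
                "cause_node_id": topo[node_key[meta["cause_node"]]],
                "cause_details": meta["cause_content"],
            })
    return nodes, topologies, expanded

nodes, topologies, expanded = update_and_expand(
    [{"id": "SvA", "vendor": "VendorA"}, {"id": "StA", "vendor": "VendorB"}],
    [{"topology_id": "Topology1", "server_id": "SvA",
      "switch_id": "SwA", "storage_id": "StA"}],
    [{"meta_rca_rule_id": "MetaRule1", "cause_node": "Storage",
      "cause_content": "Volume blockage"}],
)
```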
- one record of the expanded RCA rule information is created from one record of the meta RCA rule information, but the present invention is not limited to this.
- One example is multi-stage reasoning: a new rule can be derived from a plurality of rules using a syllogism or the like.
- That is, one record of the expanded RCA rule information may actually be created from one or more meta RCA rule records and the topology information.
- An example of deriving a new rule from a plurality of rules is as follows.
- For a first network service (for example, WWW (World Wide Web)), assume the following rule: when a first type of event (hereinafter referred to as event A) occurs in the service using node device and a second type of event (hereinafter referred to as event B) occurs in the service providing node device, the root cause of event A is the occurrence of event B.
- For a second network service, assume the following rule: when a third type of event (hereinafter referred to as event C) occurs in the service using node device and a fourth type of event (hereinafter referred to as event D) occurs in the service providing node device, the root cause of event C is the occurrence of event D.
- For the first network service, assume that the node device A is the service using node device and the node device B is the service providing node device.
- For the second network service, assume that the node device B is the service using node device and the node device C is the service providing node device.
- That is, the first network service in the node device B is provided using the second network service.
- First expanded RCA rule to be generated: when event A occurring in the node device A is detected and event B occurring in the node device B is detected, the root cause of event A occurring in the node device A is the occurrence of event B in the node device B.
- Second expanded RCA rule to be generated: when event C occurring in the node device B is detected and event D occurring in the node device C is detected, the root cause of event C occurring in the node device B is the occurrence of event D in the node device C.
- Third expanded RCA rule derived from the first and second rules: when event A occurring in the node device A is detected and event D occurring in the node device C is detected, the root cause of event A occurring in the node device A is the occurrence of event D in the node device C.
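The derivation of a new expanded RCA rule from the first two can be sketched as follows. Chaining two rules purely on node identity (the cause node of one rule equals the observed node of the other) is a simplifying assumption for illustration; the rule fields are hypothetical.

```python
# Hypothetical sketch of syllogistic (multi-stage) rule derivation.
def derive_rules(rules):
    derived = []
    for r1 in rules:
        for r2 in rules:
            if r1["cause_node"] == r2["observed_node"]:
                # trace the root cause of r1's observed event one hop
                # further, to r2's cause event and cause node
                derived.append({
                    "observed_event": r1["observed_event"],
                    "observed_node": r1["observed_node"],
                    "cause_event": r2["cause_event"],
                    "cause_node": r2["cause_node"],
                })
    return derived

rule1 = {"observed_event": "A", "observed_node": "NodeA",
         "cause_event": "B", "cause_node": "NodeB"}
rule2 = {"observed_event": "C", "observed_node": "NodeB",
         "cause_event": "D", "cause_node": "NodeC"}
third = derive_rules([rule1, rule2])
# one derived rule: event A at node A is root-caused by event D at node C
```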
- When multi-stage reasoning is used, in addition to the dependency between physical devices (for example, between nodes), information representing the dependency between network services or logical objects may be included in the topology information.
- The topology information described above is merely an example.
- In other words, the root cause is identified while referring to the meta RCA rule represented by the meta RCA rule information 115 and the topology represented by the topology information, based on the meta RCA rule ID 552 and the topology ID 553.
- Alternatively, information representing the rules expanded from the meta RCA rules based on the topology may be included in the expanded RCA rule information. With this method, the consumption of the memory 110 of the management server 10 increases, but the root cause is identified faster.
- Even in that case, the meta RCA rule ID 552 is required in the expanded RCA rule information.
- FIG. 14 shows a flow from detection of an event to identification of the root cause of the event. This flow is executed at regular time intervals (for example, every 10 minutes) or simply repeatedly.
- Step 1011: The rule matching analysis program 122 requests an event entry, which is information including an event type, a target node type, a target node ID, an event content, and an occurrence date and time, from all the monitoring target nodes 30.
- Each information element included in the event entry is as follows: (event type) the type of the event (for example, Critical, Warning, or Information) to which the event entry belongs; (target node type) the node type (for example, server, switch, or storage device) of the monitoring target node 30 in which the event occurred; (target node ID) an identifier (server ID 501, switch ID 511, or storage ID 521) indicating the monitoring target node 30 in which the event occurred; (event content) the content of the event that occurred; (occurrence date and time) the date and time at which the event occurred.
- the event entry may be transmitted from the monitoring target node 30 without a request from the rule matching analysis program 122. Moreover, the information representing the occurrence date and time does not necessarily have to be included; in that case, the date and time when the management server 10 received the event entry can be adopted instead of the occurrence date and time.
- Step 1012: When the rule matching analysis program 122 has received an event entry from the monitoring target node 30 in Step 1011, the program performs Step 1013. If no event entry has been received from the monitoring target node 30, Step 1011 is performed again.
- Step 1013: The rule matching analysis program 122 adds information to the event information 117 based on the event entry. Specifically, for example, the program 122 executes the following processes (A) to (C): (A) a new event ID 561 is acquired, and the ID 561 is stored in a blank record in the event information 117 (hereinafter referred to as the "target record" in the description of Step 1013); (B) the event type, the target node type, the target node ID, the event content, and the occurrence date and time in the event entry are stored in the target record; (C) the value "unresolved" is stored as the state 567 in the target record.
- event entry may include other values as long as the event record of the event information 117 (record in the event information 117) can be added or updated.
- Step 1014: Based on the event records whose state 567 represents "unresolved", the topology information 114, and the expanded RCA rule information 116, the rule matching analysis program 122 identifies the expanded RCA records (records in the expanded RCA rule information 116) associated with those event records.
- the rule matching analysis program 122 performs the following processes (A) to (H): (A) Identifying the event record (first event record) with the latest occurrence date and time 565 from among the event records whose state 556 is “unresolved”; (B) One or more second event records are identified based on the first event record identified in the immediately preceding step (the occurrence date and time 565 in the first event record and the second event record) The difference from the occurrence date and time 565 is within a predetermined time (for example, around 10 minutes)); (C) With reference to the target node types 563 in all the second event records obtained in (B) above, the target node types are different based on all the target node IDs in the second event records.
- node ID sets composed of target node IDs (for example, there are four event records, two of which are records related to servers A and B, and the remaining two are switches A and In the case of the record relating to B, the server A ID—the switch A ID, the server A ID—the switch B ID, the server B ID—the switch A ID, and the server B ID—the switch B ID Create node ID set); (D-1) There is a second event record that includes a target node ID that is not included in any ID set obtained in (C) above (an event record that includes a state 556 representing “unresolved”).
- the second event entry with the latest occurrence date and time 565 is identified, and (B) is performed with the second event entry as the first event entry; (D-2) There is no second event record including the target node ID that is not included in any of the ID sets obtained in (C) (an event record including a state 556 indicating “unresolved”).
- (F) For each node ID set, searching for a meta RCA record (a record of the meta RCA rule information 115) that satisfies the following conditions: (Condition F1) having a server event 542 that matches the event content 564 in the event record having the target node ID of the server in the node ID set; (Condition F2) having a switch event 543 that matches the event content 564 in the event record having the target node ID of the switch in the node ID set; (Condition F3) having a storage event 544 that matches the event content 564 in the event record having the target node ID of the storage device in the node ID set. If such a meta RCA record is found, the meta RCA rule ID 541 of the meta RCA record is extracted, and the ID 541 is associated with the corresponding node ID set; (G) Performing the following processing (g1) to (g4) for each node ID set (each set associated with a meta RCA rule ID 541) obtained in (F) above: (g1) extracting the cause node 545 from the meta RCA record having the meta RCA rule ID 541 associated with the node ID set; (g2) identifying an
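- The node ID set creation of step (C), with the server/switch example above, can be sketched as a cross product of target node IDs grouped by node type. The node IDs and the Python representation below are purely illustrative:

```python
from itertools import product

# Hypothetical event records: (target node type, target node ID)
events = [
    ("server", "server_A"), ("server", "server_B"),
    ("switch", "switch_A"), ("switch", "switch_B"),
]

# Group target node IDs by target node type (cf. target node type 563)
by_type = {}
for node_type, node_id in events:
    by_type.setdefault(node_type, []).append(node_id)

# Each node ID set takes one target node ID per distinct node type,
# so two servers and two switches yield four node ID sets
node_id_sets = [set(combo) for combo in product(*by_type.values())]
```

With the four example records, this yields exactly the four sets (server A, switch A), (server A, switch B), (server B, switch A), and (server B, switch B) described above.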
- Step 1015 The rule matching analysis program 122 compiles the plurality of expanded RCA records obtained in Step 1014 into records having the same meta RCA rule ID 552. As a result, one or a plurality of expanded RCA record groups having the same meta RCA rule ID 552 can be created.
- Step 1016 The program 122 further groups the expanded RCA records belonging to each group obtained in Step 1015 by cause node ID 554. As a result, one or a plurality of subgroups of expanded RCA records having the same cause node ID 554 can be generated per group of expanded RCA records having the same meta RCA rule ID 552.
- the monitoring target node 30 indicated by the cause node ID 554 is a root cause candidate.
- Step 1017 The program 122 calculates the certainty of each root cause candidate obtained in Step 1016 as a certainty factor.
- As a certainty factor calculation method, for example, there is a method based on the number of expanded RCA records with matching cause node IDs 554. For example, a certainty factor corresponding to the number of expanded RCA records with the matching cause node ID 554 is assigned to each root cause candidate. The certainty factor allocated to a root cause candidate having a large number of expanded RCA records with the same cause node ID 554 is higher than the certainty factor allocated to a root cause candidate having a small number of such records.
- the certainty factor may be calculated by another method.
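- The grouping of steps 1015 and 1016 and the count-based certainty factor of step 1017 can be sketched as follows. The record tuples and the proportional percentage formula are illustrative assumptions; as noted above, other calculation methods are possible:

```python
from collections import defaultdict

# Hypothetical expanded RCA records: (meta RCA rule ID 552, cause node ID 554)
expanded_records = [
    ("M1", "server_A"), ("M1", "server_A"), ("M1", "switch_B"),
    ("M2", "server_A"),
]

# Step 1015: group the records by meta RCA rule ID 552
groups = defaultdict(list)
for rule_id, cause_node in expanded_records:
    groups[rule_id].append(cause_node)

# Steps 1016-1017: subgroup each group by cause node ID 554; candidates
# backed by more records receive a higher certainty factor
certainty = {}
for rule_id, cause_nodes in groups.items():
    counts = defaultdict(int)
    for node in cause_nodes:
        counts[node] += 1
    certainty[rule_id] = {
        node: round(100 * n / len(cause_nodes)) for node, n in counts.items()
    }
```

Here "server_A" is backed by two of the three records in group "M1", so it receives a higher certainty factor than "switch_B".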
- FIG. 15 is a flowchart for creating a failure analysis context. This flow is started immediately after step 1017, for example.
- Step 1018 The generation program 123 creates a failure analysis context 118. Specifically, for example, the generation program 123 performs the following processes (A) to (G): (A) Including, in the failure analysis context, the meta RCA rule ID 552 obtained in step 1015; (B) Of the one or more expanded RCA records having the one or more expanded RCA rule IDs 551 obtained in step 1014, including in the failure analysis context the IDs 551 of the records expanded from the meta RCA record having an ID 552 that matches the meta RCA rule ID 552 obtained in (A) above;
- (D) The server ID 532 is extracted from the topology record having the ID 531 that matches the topology ID 553 (605) obtained in (C) above, and the ID 532 is included in the failure analysis context;
- (E) The switch ID 533 is extracted from the topology record having the ID 531 that matches the topology ID 553 (605) obtained in (C) above, and the ID 533 is included in the failure analysis context;
- (F) The storage ID 534 is extracted from the topology record having the ID 531 that matches the topology ID 553 (605) obtained in (C) above, and the ID 534 is included in the failure analysis context;
- (G) The generation program 123 allocates a failure analysis context ID 601 and includes the ID 601 in the failure analysis context.
- FIG. 16 shows a flow for selecting the root cause. This flow is started immediately after step 1018, for example.
- Step 1019 The generation program 123 transmits, to the display computer 20 through the network 50, first display information including the following elements (a) to (c): (a) The server name 502 in the server record having the server ID 501 that matches the cause node ID 554 in step 1016, the switch name 512 in the switch record having the switch ID 511 that matches the cause node ID 554 in step 1016, or the storage name 522 in the storage record having the storage ID 521 that matches the cause node ID 554 in step 1016; (b) The cause details 555 in the expanded RCA record corresponding to the cause node ID 554 in (a) above (the expanded RCA record compiled in step 1015); (c) The certainty factor corresponding to the cause node ID 554 in (a) above (the certainty factor obtained in step 1017).
- Step 1020 The screen display program 211 receives the first display information transmitted in Step 1019.
- Step 1021 The screen display program 211 displays the first display information received in Step 1020 on the input / output device 260 (for example, a display device).
- FIG. 19 shows a candidate / confidence screen 2010.
- a screen 2010 is an example of a display screen of first display information.
- Candidate ID 2011 is an identifier of a root cause candidate. Each root cause candidate is assigned a candidate ID by, for example, the display program 211.
- the cause node name 2012 is an element included in the first display information, and is the server name 502, switch name 512, or storage name 522 of the root cause candidate (monitored node 30).
- the cause details 2013 are cause details 555 included in the first display information.
- the certainty factor 2014 is the certainty factor included in the first display information.
- Step 1022 The screen display program 211 transmits, to the management server via the network 50, information (for example, the cause node ID) for identifying the root cause candidate selected by the system administrator using the input / output device 260 (for example, a mouse).
- Step 1023 The generation program 123 receives the information transmitted in Step 1022.
- Step 1024 The generation program 123 determines the failure analysis context 118 corresponding to the information received in Step 1023.
- This failure analysis context 118 is the failure analysis context created in step 1018.
- FIG. 17 shows the flow of failure registration. When the number of failure history entries is zero, this flow is started after the flow of FIG. When there are one or more failure history entries, this flow is started after the flow of FIG. 16 and the flow of FIG. 18A.
- Step 1040 The display computer 20 displays a failure history registration screen.
- FIG. 21 shows a registration screen 2030.
- This screen 2030 is an example of a failure history registration screen.
- the root cause 2031 is the server name 502, the switch name 512, or the storage name 522 indicating the root cause candidate (monitored node 30) corresponding to the cause node ID in step 1016.
- Failure analysis context ID 2032 to storage ID 2038 are failure analysis context ID 601 to storage ID 607 in the failure analysis context (context determined in step 1024) corresponding to the cause node ID in step 1016.
- The screen shown in FIG. 19 may be closed between step 1024 and this step. In this case, it is necessary to record the failure analysis context obtained in step 1024 in a storage device such as a memory before closing the screen of FIG. 19, and to read it in this step.
- Cause 2039 is a form in which the system administrator registers, in a natural language using the input / output device 260, the content of the cause of the failure.
- The recovery method 2040 is a form in which the system administrator registers, in a natural language using the input / output device 260, the content of the recovery method from the failure.
- the system administrator transmits the meta RCA rule ID 2033 to storage ID 2038, the cause 2039, and the recovery method 2040 to the failure history management program 125 by pressing the registration button after inputting the fields of the cause 2039 and the recovery method 2040.
- Step 1041 The failure history management program 125 receives the meta RCA rule ID 2033 to storage ID 2038, the cause 2039, and the recovery method 2040 transmitted in Step 1040.
- Step 1042 The failure history management program 125 registers the meta RCA rule ID 2033 to storage ID 2038, the cause 2039, and the recovery method 2040 received in Step 1041 in the failure history entry.
- the program 125 allocates a failure history ID 701 to this record.
- Step 1043 The failure history management program 125 creates a new record in the server weight information 800.
- An initial value (for example, 100) is assigned to the server vendor 802 to the server continuous operation time 805 of the record, and the server weight ID is stored in the record.
- the initial value may be another value as long as it indicates the weight of each element.
- Step 1044 The failure history management program 125 creates a new record in the switch weight information 810.
- An initial value (for example, 100) is assigned to the switch vendor 812 to the switch continuous operation time 815 in the record, and the switch weight ID is stored in the record.
- the initial value may be another value as long as it indicates the weight of each element.
- Step 1045 The failure history management program 125 creates a new record in the storage weight information 820.
- An initial value (for example, 100) is assigned to the storage vendor 822 to the continuous operation time 825 of the record, and the storage weight ID is stored in the record.
- the initial value may be another value as long as it indicates the weight of each element.
- Steps 1043 to 1045 only need to be processing that allocates an evaluation value, used for matching failure analysis contexts, to an arbitrary element of the hardware or software configuration and the setting contents of the monitoring target node.
- FIG. 18A is a flow for acquiring the same and / or similar fault history entries from the fault history information.
- Step 1025 If the number of failure history entries is 0, the context matching analysis program 124 ends this flow.
- The program 124 executes Step 1026 when the number of failure history entries is one or more.
- Step 1026 The program 124 searches for failure history information using the failure analysis context. Details of step 1026 will be described later with reference to FIG. 18B.
- Step 1027 The program 124 transmits the search result information obtained in Step 1026 to the display computer 20.
- the search result information includes, for example, failure history ID 701, meta RCA rule ID 702, expanded RCA rule ID 703, topology ID 704, server ID 705, switch ID 706, storage ID 707, server weight ID 708, switch weight ID 709, storage weight ID 710, cause 711, recovery method 712, and matching rate.
- Other information may also be included in the search result information.
- Step 1028 The screen display program 211 (display computer 20) receives the information transmitted in Step 1027 and displays it on the input / output device 260 (for example, a display device). At that time, the program 211 preferentially displays information with a high matching rate (for example, displays information in descending order of the matching rate).
- FIG. 20 shows the search result screen 2020 displayed in step 1028.
- This screen 2020 is an example of a search result screen.
- The history ID is an identifier (for example, a serial number) assigned to each failure history entry found by the search.
- the failure history ID 2022 is the failure history ID 701 of the hit failure history entry.
- The failure history node name 2023 is the server name 502 in the server record, the switch name 512 in the switch record, or the storage name 522 in the storage record whose ID matches the cause node ID 554.
- The cause node ID 554 is in the expanded RCA record having the expanded RCA rule ID 551 that matches the expanded RCA rule ID 703 of the hit failure history entry.
- Cause 2024 is a cause 711 of the hit failure history entry.
- the recovery method 2025 is a recovery method 712 included in the hit failure history entry.
- the matching rate 2026 indicates the matching rate transmitted from the context matching analysis program 124 in step 1027. Search results are displayed in descending order of the matching rate.
- search result screen instead of or in addition to the information elements shown in FIG. 20, other types of information elements related to the failure history search results may be displayed.
- FIG. 24A shows a first example of the matching degree comparison screen.
- Details of the information regarding the failure that has occurred this time are displayed in the display area e01.
- Specifically, the meta RCA rule ID 541 corresponding to the current failure, the node name 502, 512 or 522 of each node on which an event has occurred, and the event content 564 are displayed.
- the details of the selected failure history are displayed in the display area e02.
- Specifically, the failure history meta RCA rule ID 541, the node name 502, 512 or 522 of each node on which an event occurred, and the event content 564 are displayed.
- a matching rate 2026 between the current failure and the failure history is displayed.
- a failure history recovery method 2025 is displayed.
- FIG. 24B shows a second example of the matching degree comparison screen.
- a diagram based on event information, topology information, and node information regarding the current failure is displayed.
- The figure shows how the nodes are connected to one another and which events occurred on which nodes.
- Specifically, for example, there are three blocks in the display area e05; each block corresponds to one of the nodes, and the connections between the blocks follow the topology specified from the topology information.
- The character string displayed in each block represents the node name of the node corresponding to the block and the contents of the event (failure) that occurred in the node.
- a diagram based on event information, topology information, and node information is displayed regarding the failure history. Specifically, for example, three blocks are displayed in the display area e06, and each block corresponds to one of the nodes as in the display area e05.
- The portions of the information displayed in the display area e05 and the information displayed in the display area e06 that match each other (the portions that match the meta RCA rule) are indicated by a method such as enclosing them with a broken line.
- the system administrator can visually grasp the difference between the failure history selected by the system administrator and the current failure. Specifically, it can be seen that the failure that has occurred this time does not cause an IO error in the node with the node name “BOTAN”, compared to the selected failure history.
- FIG. 18B shows details of step 1026 of FIG. 18A.
- Step 1031 The context matching analysis program 124 performs meta RCA rule matching as the processing of Step 1031. Details of step 1031 will be described later with reference to FIG. 18C.
- Step 1101 The context matching analysis program 124 transmits a failure history entry search request including a specific failure analysis context to the failure history management program 125.
- the “specific failure analysis context” is a failure analysis context having a meta RCA rule ID equal to the meta RCA rule ID of the failure analysis context 119 obtained in step 1024.
- Step 1102 The failure history management program 125 receives the search request transmitted in Step 1101.
- Step 1103 In response to the search request received in Step 1102, the failure history management program 125 searches for a failure history entry having the specific failure analysis context.
- the program 125 transmits information representing the search result to the context matching analysis program 124.
- the transmitted information includes information registered in the failure history entry including a specific failure analysis context.
- Step 1104 The context matching analysis program 124 receives the information transmitted in Step 1103.
- Step 1033 The context matching analysis program 124 executes Step 1034 when the number of failure history entries obtained in Step 1031 is less than a first threshold (for example, 10). On the other hand, when the number of failure history entries obtained in step 1031 exceeds the second threshold (for example, 50), the program 124 executes step 1035.
- the second threshold value is equal to or greater than the first threshold value. If the number of failure history entries obtained in step 1031 is an appropriate number (for example, not less than the first threshold value and not more than the second threshold value), this flow ends.
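- The branching of step 1033 can be sketched as a small decision function. The threshold values follow the examples given above; the returned step labels are illustrative names, not identifiers from the text:

```python
FIRST_THRESHOLD = 10   # too few hits below this value
SECOND_THRESHOLD = 50  # too many hits above this value

def choose_next_step(num_entries):
    """Decide what to do with the hit count obtained in step 1031."""
    if num_entries < FIRST_THRESHOLD:
        return "step_1034_relax_conditions"    # widen the search (FIG. 18D)
    if num_entries > SECOND_THRESHOLD:
        return "step_1035_matching_evaluation" # narrow by weights (FIG. 18F)
    return "done"  # an appropriate number of hits; the flow ends
```

A count on either boundary (exactly 10 or exactly 50) counts as appropriate, matching "not less than the first threshold value and not more than the second threshold value".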
- Step 1034 The program 124 performs processing for obtaining more fault history entries than the step 1031 by relaxing the search conditions. Specifically, the process shown in FIG. 18D is performed.
- When the failure analysis context serving as the search query has a plurality of meta RCA rules (that is, when an expanded RCA rule is composed of multi-stage inference of meta RCA rules), the following processing is performed.
- Step 1111 The context matching analysis program 124 transmits, to the failure history management program 125, a failure history entry search request for entries having k meta RCA rule IDs 702 equal to the plurality of meta RCA rule IDs 602 included in the failure analysis context (the failure analysis context serving as the search key) 119 (k is a natural number). Note that the value of k can be arbitrarily set by the system administrator.
- Step 1112 The failure history management program 125 receives the search request transmitted in Step 1111.
- Step 1114 The context matching analysis program 124 receives the information transmitted in Step 1113. Note that the number of pieces of information to be transmitted (the number of failure history entries to be search hits) may be suppressed to an appropriate number (for example, a first number and / or a second number described later) or less.
- the search method is not limited to the method described above, and other methods may be employed.
- Hereinafter, the meta RCA rule identified from the meta RCA rule ID in the failure analysis context of the search source is referred to as the “first meta RCA rule”.
- The failure history management program 125 may treat, as a search hit target, a failure history entry having the ID of a second meta RCA rule whose matching rate with the first meta RCA rule is X% or more (X is a natural number).
- The matching rate is based on the degree of overlap between the event group belonging to the first meta RCA rule and the event group belonging to the second meta RCA rule. Specifically, for example, the matching rate is calculated based on at least one of a first ratio, the ratio of the number of duplicated events to the total number of events belonging to the first meta RCA rule, and a second ratio, the ratio of the number of duplicated events to the total number of events belonging to the second meta RCA rule.
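- This event-overlap matching rate can be sketched as follows. The event names are hypothetical, and averaging the two ratios is one illustrative choice; the text only requires that at least one of the ratios be used:

```python
def matching_rate(events_rule1, events_rule2):
    """Matching rate (%) between two meta RCA rules, based on how much
    their event groups overlap."""
    overlap = len(set(events_rule1) & set(events_rule2))
    ratio1 = overlap / len(events_rule1)  # share of rule 1's events duplicated
    ratio2 = overlap / len(events_rule2)  # share of rule 2's events duplicated
    return 100 * (ratio1 + ratio2) / 2    # illustrative: average of the two

# Hypothetical example: the rules share two of their events
rate = matching_rate(
    ["server_io_error", "switch_link_down", "storage_io_error"],
    ["server_io_error", "switch_link_down"],
)
```

With two of three events duplicated in the first rule and both events duplicated in the second, the averaged rate is about 83%.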
- the first meta RCA rule displayed in the display area e05 and the second meta RCA rule displayed in the display area e06 partially match.
- Step 1035 The context matching analysis program 124 performs the process shown in FIG. 18F.
- The matching evaluation is based, for example, on the degree of agreement between the following (A) and (B): (A) the elements of the hardware or software configuration and the setting contents of the monitoring target node specified from the failure analysis context of the search source; and (B) the elements of the hardware or software configuration and the setting contents of the monitoring target node identified from the failure history entry.
- Step 1121 The context matching analysis program 124 transmits, to the failure history management program 125, a search request including the meta RCA rule ID (first meta RCA rule ID) of the failure analysis context 119 obtained in step 1024.
- Step 1122 The program 125 receives the search request transmitted in Step 1121.
- Step 1123 The program 125 performs a search in response to the search request received in Step 1122, and transmits, to the context matching analysis program 124, the information recorded in the failure history entries having a meta RCA rule ID equal to the first meta RCA rule ID.
- Step 1124 The context matching analysis program 124 receives the information transmitted in Step 1123.
- Step 1125 The program 124 performs the following processing (A) to (D): (A) Values that match or approximate each other are extracted from at least one of the server record, switch record, and storage record identified from the IDs in the failure analysis context of the search source, and at least one of the server record, switch record, and storage record identified from the IDs in the failure history entries obtained in step 1124 (for example, continuous operation times whose difference is within 3000 are regarded as approximate to each other).
- each item corresponding to each value obtained in (A) is extracted from the server weight information 800, switch weight information 810, and storage weight information 820 included in the failure history information;
- A matching rate corresponding to the cumulative weight value is allocated to each failure history entry obtained in step 1124 (for example, a failure history entry with a high cumulative weight value is assigned a high matching rate, and a failure history entry with a low cumulative weight value is assigned a low matching rate).
- other elements may be considered instead of or in addition to the cumulative weight value.
- Step 1126 The program 124 rearranges the failure history entries in descending order of the matching rate obtained in Step 1125. By performing this process, the system administrator can refer, in order, to the failure histories having a high matching rate with the failure that occurred this time.
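- Steps 1125 and 1126 can be sketched as follows, reusing the weight values from the worked example later in this description (vendor 50, OS 80, IP address 20, continuous operation time 10). The attribute names, records, and approximation rule are illustrative assumptions:

```python
# Hypothetical per-attribute weights (cf. server weight information 800)
weights = {"vendor": 50, "os": 80, "ip_address": 20, "uptime": 10}

# Attributes of the node in the search-source failure analysis context
source = {"vendor": "V1", "os": "OS-X1", "ip_address": "10.0.0.1", "uptime": 5000}

# Hypothetical failure history entries obtained in step 1124
history_entries = [
    {"id": 1, "vendor": "V1", "os": "OS-X1", "ip_address": "10.0.0.9", "uptime": 20000},
    {"id": 2, "vendor": "V2", "os": "OS-Y1", "ip_address": "10.0.0.1", "uptime": 6000},
]

def cumulative_weight(entry):
    """Step 1125: accumulate the weights of matching/approximate attributes."""
    total = 0
    for attr, w in weights.items():
        if attr == "uptime":
            # example approximation rule: uptimes within 3000 count as a match
            if abs(entry[attr] - source[attr]) <= 3000:
                total += w
        elif entry[attr] == source[attr]:
            total += w
    return total

# Step 1126: rearrange entries so that higher cumulative weights come first
ranked = sorted(history_entries, key=cumulative_weight, reverse=True)
```

Entry 1 matches on vendor and OS (50 + 80 = 130), entry 2 on IP address and approximate uptime (20 + 10 = 30), so entry 1 is displayed first.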
- Step 1127 Based on the comparison process of Step 1125, the program 124 relatively increases the weight corresponding to each item (hereinafter, in the description of FIGS. 18F and 18G, the “target item”) whose value was extracted in Step 1125 from the information 800, 810 and 820 included in the failure history information. “To relatively increase” may mean increasing the weight corresponding to the target item, or decreasing the weight corresponding to each non-target item.
- Step 1128 The program 124 transmits an update request including the identification information (for example, name) of each item whose weight has been changed and the updated weight (and / or change amount) to the failure history management program 125.
- Step 1129 The failure history management program 125 updates at least one of the information 800, 810, and 820 in the failure history information in response to the update request. That is, the weight calculated in step 1127 is reflected in the corresponding record in the information 800, 810, and 820 in the failure history information.
- The failure analysis context (or search query) of the search source includes, for each node device belonging to the expanded RCA rule (or topology) specified from the context, a value representing the weight of each attribute other than the type of the node device.
- Assume, for example, that among the multiple types of attributes, the vendor and the OS match for the first failure history entry. The cumulative value for the first failure history entry is therefore the sum “130” of the weight “50” for the vendor and the weight “80” for the OS.
- The cumulative value for the second failure history entry is the sum “30” of the weight “20” for the IP address and the weight “10” for the continuous operation time.
- the first failure history entry has a higher similarity with the search source failure analysis context than the second failure history entry.
- the weights of these attributes are set to higher values in step 1127.
- Alternatively, the weight of the attribute corresponding to each value included in the failure analysis record that includes the information indicating the recovery method selected by the system administrator may be set to a higher value.
- the system administrator identifies the recovery method for the failure that occurred this time from the failure history information as described above. After finishing the recovery of the failure that has occurred this time, the system administrator performs the flow of FIG. 17 using this event as the failure history. As a result, the failure analysis context corresponding to the failure that has occurred this time, information that represents the root cause of the failure that has occurred this time, and information that represents the recovery method that has been taken this time are associated.
- Step 1124 and the subsequent steps may be performed based on the information obtained in step 1031 of FIG. 18B.
- the failure history entry includes a failure analysis context corresponding to the failure that has occurred, in addition to the information indicating the root cause of the failure that has occurred and the information that represents the recovery method according to the root cause.
- the failure analysis context is information including information (hereinafter, rule specifying information) for specifying the cause / result rule that is the basis of the root cause of the failure among the plurality of cause / result rules.
- The cause / result rules represent the correspondence relationship between the following (x) and (y): (x) the type of the node device and the content of the event that occurred as the root cause; and (y) the type of the node device and the content of the event that occurred as a result (that is, what type of event occurred in which type of node device).
- a search query including a failure analysis context corresponding to the occurred failure is input from the system administrator to the management server.
- The management server compares the failure analysis context (first failure analysis context) included in the search query with the failure analysis context (second failure analysis context) in each failure history entry included in the failure history information.
- the management server displays information registered in the specified second failure history entry (information including information indicating the recovery method). As a result, the system administrator can quickly specify a recovery method from the failure that has occurred.
- a new failure history entry including the first failure analysis context corresponding to the failure that has occurred and information indicating the identified recovery method can be registered.
- This registration operation may be performed manually by the system administrator or automatically by the management server.
- the management server includes a failure history entry including the first failure analysis context used in the search, information indicating the specified root cause, and information indicating the specified recovery method. Can be registered.
- In the search for a failure history entry including a second failure analysis context similar to the first failure analysis context, what type of event has occurred in which type of node device is specified based on the rule specifying information in the first and second failure analysis contexts. That is, the node device types are compared with each other. For this reason, even if the node devices in which events with the same content occurred are different, the second failure analysis context is similar to the first failure analysis context as long as the types of those node devices are the same. Thus, for example, if a previous event occurred on server A and the same event occurred on server B this time, the failure history entry including the second failure analysis context corresponding to the previous failure may become a search hit target. In other words, similar entries can be hit.
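- The type-based comparison described above can be sketched by reducing a context to (node type, event content) pairs and discarding concrete node instances. The node names and event contents below are hypothetical:

```python
def rule_signature(context):
    """Reduce a failure analysis context to (node type, event content)
    pairs; concrete node instances are deliberately dropped."""
    return frozenset((node_type, event) for node_type, _name, event in context)

# Previous failure: the events occurred on server A and switch A
previous = [("server", "server_A", "io_error"), ("switch", "switch_A", "link_down")]
# Current failure: the same kinds of events, but on server B and switch B
current = [("server", "server_B", "io_error"), ("switch", "switch_B", "link_down")]

# Only node types and event contents are compared, so the previous
# failure history entry is a search hit despite the different nodes
is_hit = rule_signature(previous) == rule_signature(current)
```

Because "server_A" and "server_B" both reduce to the type "server", the two signatures are equal and the previous entry hits.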
- Normally, a failure history entry including a second failure analysis context associated with a cause / result rule that completely matches the cause / result rule specified from the first failure analysis context becomes the target of a search hit.
- The search may be performed again with relaxed conditions. Specifically, for example, if the cause / result rules are similar to each other with a predetermined similarity or more (but less than 100%), the failure history entry becomes a hit.
- The management system includes a function that assists in abstracting a recovery procedure and registering it as a meta recovery method, a function that associates the meta recovery method with a meta RCA rule, and a function that displays the meta recovery method together with the root cause when the root cause is referred to.
- When the management system displays an identifier, such as an IP address, of a node that failed in the past, the system administrator must read the displayed recovery method information while substituting the node that has failed this time for that node.
- the management system displays the recovery method using the identifier of the node where the failure has occurred this time.
- the system administrator can specify a recovery method candidate that can be taken when referring to the root cause.
- <2-1 Difference in Configuration of Embodiment 2 from Embodiment 1> Information indicating a meta recovery method (described later) is associated with the meta RCA rule information 115 (meta RCA record) of the first embodiment.
- a meta recovery method registration screen (FIG. 22A) is added to the failure history registration screen (FIG. 21), and a step of registering the meta recovery method is added.
- In step 1020 of the first embodiment, information indicating the meta recovery method is added to the root cause candidate list and certainty factor screen (FIG. 19) (FIG. 23).
- A meta recovery method is a recovery method defined by a combination of finite elements (objects) provided by the management system.
- the meta recovery method is a method that does not depend on a specific node, and can be registered in association with the meta RCA rule.
- the format of the information is not limited as long as the recovery method can be defined.
- the meta-recovery method is defined by a combination of one or more of three elements, Arc, Branch, and Command.
- Arc indicates a transition between Branch or Command.
- Branch indicates a conditional branch.
- Command indicates processing.
- Registration of the meta-recovery method is performed, for example, at a timing immediately before transmitting the failure history registration information in step 1040 in the first embodiment.
- FIG. 22A shows an example of a meta recovery method registration screen.
- icons of Arc, Branch, and Command are installed in the display area e11.
- the system administrator can place an icon in the display area e12 by dragging and dropping any icon to the display area e12.
- The display area e12 in FIG. 22A is an edit screen for defining the meta recovery method. By arranging the icons in the display area e12, the configuration of the meta recovery method can be defined.
- the display area e13 is a window for performing detailed setting of each icon installed in the display area e12. This figure shows an example of the Branch setting screen.
- the display area e14 shows the identifier of the icon.
- the display area e15 is a form for selecting a condition target in a conditional branch. Selection items are finite elements provided by the system.
- the display area e16 is a form for selecting the content of the condition in the conditional branch. Selection items are finite elements provided by the system.
- the display area e17 defines a transition destination when the condition defined in the display area e16 is true and a transition destination when the condition is false.
- the display area e18 is a form for inputting details of branch contents that cannot be expressed only by the display area e16.
- the information is registered in a natural language by the system administrator.
- the display area e13 shown in FIG. 22B shows an example in which the display area e13 is a Command setting screen.
- the display area e14 shown in FIG. 22B indicates the identifier of the icon.
- the display area e15 shown in FIG. 22B is a form for selecting the processing target. The selection items are finite elements provided by the system.
- the display area e16 shown in FIG. 22B is a form for selecting the content of the processing. The selection items are finite elements provided by the system.
- the display area e17 shown in FIG. 22B is a form for entering details of processing contents that cannot be expressed by the display area e16 alone.
- this information is registered in natural language by the system administrator.
- The meta recovery method definition defines the flow of object transitions from the start to the end of recovery; specifically, it defines which object (conditional branch or processing action) transitions to which object.
- the acquisition of the meta recovery method is performed, for example, immediately after extracting the meta RCA rule in step 1015 in the first embodiment. Since the meta-recovery method is registered in association with the meta-RCA rule, when the meta-RCA rule is determined, the meta-recovery method is also determined.
- In step 1019 of the first embodiment, the meta recovery method is also transmitted.
- In step 1020 of the first embodiment, the meta recovery method is displayed in addition to the root cause and the certainty factor.
- FIG. 23 is an example of a candidate / confidence screen displayed in the second embodiment.
- A display area e21 is added, which is a table listing the command processes of the meta recovery method and their accumulated processing contents.
- A column e22, indicating the selected entry, is added.
- a meta recovery method defined by a series of flows using common parts such as conditional branch (Branch) and treatment (Command) is prepared.
- the meta recovery method is associated with the meta RCA rule of the combination of the event group and the root cause. As a result, it is possible to define from the detection of the occurred failure to the recovery method as one rule.
- the present embodiment is an embodiment of the present invention to which the general-purpose rule base system of Non-Patent Document 1 described above is applied.
- Non-Patent Document 1 discloses a general-purpose rule base system having a rule memory and a fact memory on a rule base system.
- the rule memory stores general-purpose rules described without depending on a specific individual.
- the fact memory stores specific information of a specific individual.
- the rule base system is a system for deriving new facts using the rules and the information.
- Causality Rule is a rule that describes the relationship between an event and its cause without depending on a specific topology. Specific examples of the Causality Rule are as follows.
- Topology Rule is a rule that describes the connection state of a node without depending on a specific topology.
- a specific example of Topology Rule is as follows.
- T-RULE-200 IF FC-connected (x, y) & FC-connected (z, y) THEN FC-connected (x, z).
- the topology application program 121 stores the Causality Rule and Topology Rule in the rule memory on the rule base system.
- the topology application program 121 detects the following topology facts by applying the Topology Rule to the monitoring target nodes 30, and stores them in the fact memory of the rule base system.
- TF1 Server (“ServerA”)
- TF2 Storage (“StorageA”)
- TF3 Switch (“SwitchA”)
- TF4 FC-Connected (“ServerA”, “ABC”)
- TF5 FC-Connected (“AMS1000”, “ABC”).
- the rule base system creates an instance like the following example by combining Causality Rule and topology fact.
- C-RULE-100-INSTANCE-1 IF EventHappensOn (IO_ERROR, “ServerA”, t1) & EventHappensOn (CTRL_FAIL, “StorageA”, t2) & WithinTimeWindow (t1, t2, “10 minutes”) THEN IdentifyRootCause (CTRL_FAIL, “StorageA”).
- C-RULE-100-INSTANCE-1 is also stored in the memory.
- When the topology application program 121 monitors the monitoring target node 30 and observes that the “ServerA” IO_ERROR event and the “StorageA” CTRL_FAIL event occur within the event correlation processing time width, the topology application program 121 stores the following event facts in the fact memory of the rule base system.
- EF1 EventHappensOn (IO_ERROR, “ServerA”, “12:32:12 2009/03/10”)
- EF2 EventHappensOn (CTRL_FAIL, “AMS1000”, “12:32:10 2009/03/10”)
- EF3 WithinTimeWindow (“12:32:10 2009/03/10”, “12:32:12 2009/03/10”, “10 minutes”).
- the rule-based system derives IdentifyRootCause (CTRL_FAIL, “StorageA”) from C-RULE-100-INSTANCE-1 and event facts, thereby identifying the root cause.
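A minimal sketch of this derivation follows. The predicate and rule names are taken from the example above, but the matching logic is a deliberate simplification of a rule base engine (timestamps are folded into the facts), not the engine of Non-Patent Document 1.

```python
# Hypothetical sketch: the instantiated rule C-RULE-100-INSTANCE-1 fires
# when all of its condition facts are present in the fact memory, and the
# fired rule derives the root cause fact.

fact_memory = {
    ("EventHappensOn", "IO_ERROR", "ServerA"),
    ("EventHappensOn", "CTRL_FAIL", "StorageA"),
    ("WithinTimeWindow", "10 minutes"),
}

c_rule_100_instance_1 = {
    "if": [
        ("EventHappensOn", "IO_ERROR", "ServerA"),
        ("EventHappensOn", "CTRL_FAIL", "StorageA"),
        ("WithinTimeWindow", "10 minutes"),
    ],
    "then": ("IdentifyRootCause", "CTRL_FAIL", "StorageA"),
}

def derive(rule, facts):
    # Fire the rule only if every condition fact is in the fact memory.
    if all(cond in facts for cond in rule["if"]):
        return rule["then"]
    return None

print(derive(c_rule_100_instance_1, fact_memory))
# -> ('IdentifyRootCause', 'CTRL_FAIL', 'StorageA')
```

If any event fact is missing, the rule does not fire and no root cause is derived, which is why the event facts must first be observed within the time window.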
- C-RULE-100-INSTANCE-1 is the expanded RCA rule
- C-RULE-100 (Causality Rule) corresponds to the meta RCA rule
- C-RULE-100 becomes the meta RCA rule ID 541.
- Multi-stage inference may be performed using a plurality of Causality Rules, and there may be a plurality of meta RCA rules.
- the Causality Rule corresponding to the meta RCA rule used to derive the root cause and the instance corresponding to the expanded RCA rule are acquired and handled as a fault analysis context.
- the effects of the invention can be obtained.
- a general rule-based system can be applied.
- The following storage format may be employed as the data structure of the expanded RCA rule information.
- (A) For events that occur at monitoring target nodes and are managed by the management system, distinguished by occurrence site (including the node device) and event content, all combination patterns are stored.
- (B) For those combinations in (A) whose root cause can be identified, the occurrence site (including the node device) and event content regarded as the root cause are stored in association.
- A management system comprising: an interface that communicates with a plurality of node devices; a processor that detects, via the interface, events occurring in the plurality of node devices; a storage resource that stores event information, meta rule information, and failure history information; a display device that displays information about the plurality of node devices; and an input device:
- the event information includes an event entry indicating information for identifying a node device that has generated the event and a type of the event that has occurred.
- the meta rule information includes meta rules indicating the types of latent events that may potentially occur at a node device and the type of event that can be identified as the root cause when an event corresponding to such a latent event type occurs.
- the failure history information includes a failure history entry including information indicating a recovery method and information for identifying the meta-rule corresponding to the recovery method.
- the processor: (A) based on the meta rule information, identifies a first cause event that is the root cause of a first event identified by an event entry stored in the event information, and identifies the first meta rule used to identify the first cause event; (B) receives, via the input device, a first recovery method that is a method of recovering from the first cause event, and, based on the first recovery method, adds to the failure history information a first failure history entry corresponding to the first meta rule; (C) based on the meta rule information, identifies a second cause event that is the root cause of a second event identified by an event entry stored in the event information, and identifies the second meta rule used to identify the second cause event; (D) based on the failure history information, identifies a predetermined failure history entry corresponding to the second meta rule.
- the display device: (X) displays information about the second cause event as the root cause of the second event; and (Y) displays, based on the predetermined failure history entry, a recovery method from the second cause event.
- A management system characterized by the above was explained.
- The failure history entry includes the identifier of the node device to which the recovery method was applied, and the display device: (Z) may display the identifier of the node device indicated by the predetermined failure history entry as the identifier of the node device to which the recovery method indicated by the predetermined failure history entry of (Y) was applied.
- When the node device at which the first cause event occurred and the node device at which the second cause event occurred are different node devices, the display device: (a) displays, as the display of information about the second cause event in (X), information including the identifier of the node device at which the second cause event occurred; and (b) may display, as the display of the identifier of the node device indicated by the predetermined failure history entry in (Z), the identifier of the node device at which the first cause event occurred.
- The identification of (D) is as follows: (D1) select the failure history entries indicating the same meta rule as the second meta rule; (D2) when the number of failure history entries selected in (D1) is less than a first threshold, identify the predetermined failure history entry based on the matching rate between the meta rule to which each failure history entry corresponds and the second meta rule; (D3) the failure history entry selected in (D1) may be identified as the predetermined failure history entry.
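The selection logic of (D1) through (D3) can be sketched as follows. The entry fields and the matching-rate function are hypothetical placeholders; the patent leaves the exact matching-rate computation open.

```python
# Hypothetical sketch of (D1)-(D3): select failure history entries whose
# meta rule equals the second meta rule; if fewer than the first threshold
# are found, fall back to ranking all entries by meta-rule matching rate.

def matching_rate(meta_rule_a, meta_rule_b):
    # Toy matching rate (assumption): fraction of shared rule elements.
    a, b = set(meta_rule_a), set(meta_rule_b)
    return len(a & b) / max(len(a | b), 1)

def select_entries(history, second_meta_rule, first_threshold):
    same = [e for e in history if e["meta_rule"] == second_meta_rule]   # (D1)
    if len(same) < first_threshold:                                     # (D2)
        ranked = sorted(history,
                        key=lambda e: matching_rate(e["meta_rule"], second_meta_rule),
                        reverse=True)
        return ranked[:1]
    return same                                                         # (D3)

history = [
    {"id": 1, "meta_rule": ("IO_ERROR", "CTRL_FAIL")},
    {"id": 2, "meta_rule": ("IO_ERROR", "LINK_DOWN")},
]
print([e["id"] for e in select_entries(history, ("IO_ERROR", "CTRL_FAIL"), 1)])  # [1]
```

The fallback in (D2) matters when the failure history is sparse: a partially matching meta rule can still surface a usable recovery method.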
- The storage resource stores configuration setting information of the plurality of node devices, and the failure history entry includes past configuration setting information of the plurality of node devices corresponding to the time the entry was created. The identification of (D) may then include: (D4) when the number of failure history entries selected in (D1) is equal to or greater than a second threshold, identifying the predetermined failure history entry based on the matching rate between the past configuration setting information contained in the failure history entries and the configuration setting information. The storage resource may also store weight information indicating weight values for items of configuration setting information, and the identification of (D4) may be performed based on the weight information.
- The first recovery method of (B) may be a meta recovery method, i.e., a recovery method that does not include the identifier of the node device at which the first cause event occurred, and the display of the recovery method from the second cause event in (Y) may be a display of the meta recovery method together with the identifier of the node device at which the second cause event occurred.
- Storage resources may be inside or outside the management system. When inside, the storage resource is, for example, a memory. When outside, the storage resource is, for example, a storage device (for example, a disk array device).
Description
(a1) a network identifier, such as an IP address, of the service-using node device;
(a2) information representing the hardware or software configuration of that node device;
(a3) information representing its settings,
can be included.
(b1) information representing the hardware or software configuration of a node device (for example, a switch) that mediates communication between the service-using node device and the service-providing node device;
(b2) information representing its settings,
can be included.
(c1) a network identifier, such as an IP address, of the service-providing node device;
(c2) information representing the hardware or software configuration of that node device;
(c3) information representing its settings,
can be included. The service-providing node device information may also include information such as the type of the provided network service.
(d1) information representing a combination of the type of a first event (a service-using node device event) that can occur at the service-using node device and the type of a second event (a service-providing node device event) that can occur at the service-providing node device (or a relay device);
(d2) information indicating a cause (or cause type) that can occur at the service-providing node device or relay device and that can be determined (or presumed) to be the cause when the first event and the second event occur,
can be included.
(e1) information representing a combination of the type of a first event that can occur at the node device serving as the service-using node device together with the identifier of the service-using node device, and the type of a second event that can occur at the service-providing node device (or relay device) together with the identifier of the service-providing node device (or relay device);
(e2) the identifier of the service-providing node device (or relay device) that can be determined (or presumed) to be the cause when the first event and the second event occur;
(e3) information indicating a cause (or cause type) that can occur at that service-providing node device (or relay device),
can be included.
(A) if the identifier in the server-acquired information is not stored in the server information 111, a server ID 501 (for example, the identifier in the server-acquired information) is assigned to the server record in the server information 111 corresponding to that server-acquired information (hereinafter called the "target server record" in the description of FIG. 13), and that server ID 501 is stored in the target record;
(B) the server name 502, vendor name 503, IP address 504, OS name 505, and continuous operation time 506 in the server-acquired information are stored in the target server record.
These are performed.
(A) if the identifier in the switch-acquired information is not stored in the switch information 112, a switch ID 511 (for example, the identifier in the switch-acquired information) is assigned to the switch record in the switch information 112 corresponding to that switch-acquired information (hereinafter called the "target switch record" in the description of FIG. 13), and that switch ID 511 is stored in the target switch record;
(B) the switch name 512, vendor name 513, IP address 514, type 515, and continuous operation time 516 in the switch-acquired information are stored in the target switch record.
These are performed.
(A) if the identifier in the storage-acquired information is not stored in the storage information 113, a storage ID 521 (for example, the identifier in the storage-acquired information) is assigned to the storage record in the storage information 113 corresponding to that storage-acquired information (hereinafter called the "target storage record" in the description of FIG. 13), and that storage ID 521 is stored in the target storage record;
(B) the storage name 522, vendor name 523, IP address 524, firmware 525, and continuous operation time 526 in the storage-acquired information are stored in the target storage record.
These are performed.
(A) if the identifier in the topology-acquired information is not stored in the topology information 114, a topology ID 531 (for example, the identifier in the topology-acquired information) is assigned to the topology record in the topology information 114 corresponding to that topology-acquired information (hereinafter called the "target topology record" in the description of FIG. 13), and that topology ID 531 is stored in the target topology record;
(B) the switch ID 533, server ID 532, and storage ID 534 in the topology-acquired information are stored in the target topology record.
These are performed.
(x) all combinations of the topology IDs 531 in the topology information 114 and the meta RCA rule IDs 541 in the meta RCA rule information 115 are created (for example, given two topology IDs 531 and three meta RCA rule IDs 541, six (2 × 3 = 6) combinations are created);
(y) for each combination, an expanded RCA rule ID 551 is assigned, and the expanded RCA rule ID 551 together with the topology ID and meta RCA rule ID constituting the combination are stored in an expanded RCA record (a record in the expanded RCA rule information 116).
These are performed. Note that the processing of (x) need not be performed for the topology IDs of topologies containing combinations of storage devices and servers that are never actually used. Likewise, the expanded RCA rule information may be created by other processing. Considered more abstractly, the topology application program 121 can, for example, perform the following (Step A) through (Step D):
(Step A) acquire, from the monitoring target nodes, at least one value contained in each of the above pieces of acquired information as node-acquired information;
(Step B) update the service-using node device information, service-providing node device information, or relay node device information based on the node-acquired information;
(Step C) based on the topology-acquired information, include in the topology information the correspondence between the identifier of the service-providing node device for a given network service and the identifiers of the service-using node devices that use it;
(Step D) update the expanded RCA rule information based on the topology information and the meta RCA rule information.
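The cross-product step (x)/(y) above can be sketched as follows. The record layouts are simplified to just the ID fields named in the text, and the ID strings are illustrative.

```python
import itertools

# Hypothetical sketch of (x)/(y): every (topology ID 531, meta RCA rule
# ID 541) pair yields one expanded RCA record with a freshly assigned
# expanded RCA rule ID 551.

topology_ids = ["TOPO-1", "TOPO-2"]                  # topology IDs 531
meta_rca_rule_ids = ["META-1", "META-2", "META-3"]   # meta RCA rule IDs 541

expanded_rca_records = [
    {"expanded_rca_rule_id": f"EXP-{n}",   # ID 551, assigned per combination
     "topology_id": topo,                  # ID 531
     "meta_rca_rule_id": meta}             # ID 541
    for n, (topo, meta) in enumerate(
        itertools.product(topology_ids, meta_rca_rule_ids), start=1)
]

print(len(expanded_rca_records))  # 2 x 3 = 6 combinations
```

In practice the text notes that combinations never used by any real topology can be skipped, so a production implementation would filter the product rather than materialize every pair.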
(First meta RCA rule) For a first network service (for example, WWW (World Wide Web)), when an event of a first type occurring at the service-using node device (hereinafter, event A) and an event of a second type occurring at the service-providing node device (hereinafter, event B) are detected, the root cause of the occurrence of event A is the occurrence of event B.
(Second meta RCA rule) For a second network service (for example, DNS (Domain Name System)), when an event of a third type occurring at the service-using node device (hereinafter, event C) and an event of a fourth type occurring at the service-providing node device (hereinafter, event D) are detected, the root cause of the occurrence of event C is the occurrence of event D.
(First topology information) For the first network service, node device A is the service-using node device and node device B is the service-providing node device.
(Second topology information) For the second network service, node device B is the service-using node device and node device C is the service-providing node device.
(Third topology information) Node device B provides the first network service by using the second network service.
(First generated expanded RCA rule) When event A occurring at node device A is detected and event B occurring at node device B is detected, the root cause of event A at node device A is the occurrence of event B at node device B.
(Second generated expanded RCA rule) When event C occurring at node device B is detected and event D occurring at node device C is detected, the root cause of event C at node device B is the occurrence of event D at node device C.
(Third generated expanded RCA rule) When event A occurring at node device A is detected and event D occurring at node device C is detected, the root cause of event A at node device A is the occurrence of event D at node device C.
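The derivation of the third expanded rule, which chains the service dependency through node device B, can be sketched as follows. The service names, event names, and data structures are illustrative assumptions, not the patent's format.

```python
# Hypothetical sketch: expand meta RCA rules over topology information,
# following the dependency "B provides service 1 using service 2" one hop
# so that event A on node A is also traced to event D on node C.

# (service, service-using node, service-providing node)
topology = [("WWW", "A", "B"), ("DNS", "B", "C")]
# service -> (user-side event type, provider-side event type)
meta_rules = {"WWW": ("EventA", "EventB"), "DNS": ("EventC", "EventD")}
# node B's WWW service is provided by using DNS
provided_using = {("B", "WWW"): "DNS"}

expanded = []  # (user node, user event, cause node, cause event)
for service, user, provider in topology:
    ev_user, ev_provider = meta_rules[service]
    expanded.append((user, ev_user, provider, ev_provider))
    # Follow the dependency chain one hop: a failure upstream of the
    # provider is also a root-cause candidate for the user's event.
    upstream = provided_using.get((provider, service))
    if upstream:
        for s2, u2, p2 in topology:
            if s2 == upstream and u2 == provider:
                expanded.append((user, ev_user, p2, meta_rules[s2][1]))

print(("A", "EventA", "C", "EventD") in expanded)  # True
```

The three tuples produced correspond to the three generated expanded RCA rules above; the third exists only because the topology records the dependency between the two services at node device B.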
(Event type) indicates the kind of event to which the event entry belongs (for example, Critical, Warning, or Information);
(Target node type) indicates the kind of monitoring target node 30 at which the event occurred (for example, server, switch, or storage device);
(Target node ID) is the identifier (server ID 501, switch ID 511, or storage ID 521) of the monitoring target node 30 at which the event occurred;
(Event content) is the content of the event that occurred;
(Occurrence date and time) is the date and time at which the event occurred.
(A) a new event ID 561 is obtained and stored in a blank record in the event information 117 (hereinafter called the "target record" in the description of step 1013);
(B) the event type, target node type, target node ID, event content, and occurrence date and time in the event entry are stored in the target record;
(C) the value "unresolved" is stored in the target record as the status 567.
These are performed.
(A) among the event records whose status 556 is "unresolved", the event record with the latest occurrence date and time 565 (the first event record) is identified;
(B) based on the first event record identified in the preceding step, one or more second event records are identified (the difference between the occurrence date and time 565 in the first event record and the occurrence date and time 565 in each second event record is within a predetermined time, for example around 10 minutes);
(C) the target node types 563 in all the second event records obtained in (B) are referenced, and based on all the target node IDs in those second event records, all combinations of target node IDs with differing target node types (hereinafter, node ID sets) are created (for example, given four event records of which two concern servers A and B and the remaining two concern switches A and B, four node ID sets are created: server A's ID with switch A's ID, server A's ID with switch B's ID, server B's ID with switch A's ID, and server B's ID with switch B's ID);
(D-1) if there is a second event record (an event record whose status 556 is "unresolved") containing a target node ID not included in any of the ID sets obtained in (C), the second event entry among them with the latest occurrence date and time 565 is identified, and (B) is performed treating that second event entry as the first event entry;
(D-2) if there is no such second event record, the following processing (E) is performed;
(E) for each node ID set obtained through (D-1) and (D-2), a topology record (a record of the topology information 114) satisfying all of the following (Condition E1) through (Condition E3) is searched for:
(Condition E1) it has a server ID 532 that matches the target node ID of the server in the node ID set;
(Condition E2) it has a switch ID 533 that matches the target node ID of the switch in the node ID set;
(Condition E3) it has a storage ID 534 that matches the target node ID of the storage device in the node ID set;
if such a topology record is found, the topology ID 531 of that topology record is extracted and associated with the node ID set corresponding to that topology record;
(F) for each node ID set obtained in (E) (node ID sets for which a topology record satisfying conditions E1 through E3 was identified), a meta RCA record (a record of the meta RCA rule information 115) satisfying all of the following (Condition F1) through (Condition F3) is searched for:
(Condition F1) it has a server event 542 that matches the event content 564 in the event record having the target node ID of the server in the event ID set;
(Condition F2) it has a switch event 543 that matches the event content 564 in the event record having the target node ID of the switch in the event ID set;
(Condition F3) it has a storage event 544 that matches the event content 564 in the event record having the target node ID of the storage device in the event ID set;
if such a meta RCA record is found, the meta RCA rule ID 541 of that meta RCA record is extracted and associated with the corresponding node ID set;
(G) for each node ID set obtained in (F) (sets associated with a meta RCA rule ID 541), the following processing (g1) through (g4) is performed:
(g1) the cause node 545 is extracted from the meta RCA record having the meta RCA rule ID 541 associated with the node ID set;
(g2) the event record having a target node type 563 that matches the extracted cause node 545 is identified;
(g3) the target node ID 564 is extracted from the identified event record;
(g4) the extracted target node ID 564 is associated with the node ID set of (g1);
(H) an expanded RCA record (a record of the expanded RCA rule information 116) having the following elements (h1) through (h3) is extracted:
(h1) a topology ID 553 that matches the topology ID 531 obtained in (E);
(h2) a meta RCA rule ID 552 that matches the meta RCA rule ID 541 obtained in (F);
(h3) a cause node ID 554 that matches the target node ID 564 obtained in (G).
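Steps (A) and (B) of the correlation above can be sketched as follows. The record fields are simplified to what the text names, and the concrete events are illustrative.

```python
from datetime import datetime, timedelta

# Hypothetical sketch of steps (A) and (B): starting from the latest
# unresolved event record, collect the other unresolved records whose
# occurrence times fall within the predetermined window (~10 minutes).

events = [
    {"node": "ServerA", "type": "server", "content": "IO_ERROR",
     "time": datetime(2009, 3, 10, 12, 32, 12), "status": "unresolved"},
    {"node": "StorageA", "type": "storage", "content": "CTRL_FAIL",
     "time": datetime(2009, 3, 10, 12, 32, 10), "status": "unresolved"},
    {"node": "SwitchB", "type": "switch", "content": "PORT_DOWN",
     "time": datetime(2009, 3, 10, 11, 0, 0), "status": "unresolved"},
]

window = timedelta(minutes=10)
unresolved = [e for e in events if e["status"] == "unresolved"]
first = max(unresolved, key=lambda e: e["time"])            # step (A)
correlated = [e for e in unresolved                         # step (B)
              if e is not first and abs(first["time"] - e["time"]) <= window]

print([e["node"] for e in correlated])  # ['StorageA']
```

The old SwitchB event falls outside the window and is excluded, so only the ServerA/StorageA pair proceeds to the node ID set construction of step (C).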
(A) the meta RCA rule ID 552 obtained in step 1015 is included in the failure analysis context;
(B) from among the one or more expanded RCA records having the one or more expanded RCA rule IDs 551 obtained in step 1014, the expanded RCA rule ID 551 is extracted from the record having an ID 552 that matches the meta RCA rule ID 552 obtained in (A), and the extracted ID 551 is included in the failure analysis context;
(C) the topology ID 553 is extracted from the expanded RCA record having an ID 551 that matches the expanded RCA rule ID 551 (603) obtained in (B), and that ID 553 is included in the failure analysis context;
(D) the server ID 532 is extracted from the topology record having an ID 531 that matches the topology ID 553 (605) obtained in (C), and that ID 532 is included in the failure analysis context;
(E) the switch ID 533 is extracted from the topology record having an ID 531 that matches the topology ID 553 (605) obtained in (C), and that ID 533 is included in the failure analysis context;
(F) the storage ID 534 is extracted from the topology record having an ID 531 that matches the topology ID 553 (605) obtained in (C), and that ID 534 is included in the failure analysis context;
(G) the generation program 123 assigns a failure analysis context ID 601 and includes that ID 601 in the failure analysis context.
These are performed. The failure analysis context 118 may also be created using only the failure analysis context ID 601 and the meta RCA rule ID 603.
(a) the server name 502 in the server record having a server ID 501 that matches the cause node ID 554 from step 1016, the switch name 512 in the switch record having a switch ID 511 that matches the cause node ID 554 from step 1016, or the storage name 522 in the storage record having a storage ID 521 that matches the cause node ID 554 from step 1016;
(b) the cause detail 555 in the expanded RCA record corresponding to the cause node ID 554 of (a) (the expanded RCA record compiled in step 1015);
(c) the certainty factor corresponding to the cause node ID 554 of (a) (the certainty factor obtained in step 1017),
are transmitted to the display computer 20 via the network 50.
(A) the elements of the hardware or software configuration and settings of the monitoring target node identified from the source failure analysis context; and
(B) the elements of the hardware or software configuration and settings of the monitoring target node identified from the failure history entry.
The matching is performed based on the degree of agreement between these two.
(A) values that match or approximate each other are extracted from at least one of the server, switch, and storage records identified from the IDs in the source failure analysis context and at least one of the server, switch, and storage records identified from the IDs in the failure history entry obtained in step 1124 (for example, continuous operation times approximate each other if they differ by no more than 3000);
(B) the weight of each item corresponding to each value obtained in (A) is extracted from the server weight information 800, switch weight information 810, and storage weight information 820 held in the failure history information;
(C) for each failure history entry obtained in step 1124, the cumulative value of the weights obtained in (B) is calculated;
(D) each failure history entry obtained in step 1124 is assigned a matching rate according to its cumulative weight value (for example, a failure history entry with a high cumulative weight value is assigned a high matching rate, and one with a low cumulative weight value is assigned a low matching rate).
These are performed. In calculating the matching rate, other factors may be taken into account instead of or in addition to the cumulative weight value.
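Steps (A) through (D) can be sketched as follows. The item names, weight values, and records are illustrative assumptions; only the "error of 3000" approximation for continuous operation time and the weighted-cumulation idea come from the text.

```python
# Hypothetical sketch of (A)-(D): sum per-item weights for fields that
# match (or, for continuous operation time, lie within an error of 3000),
# then rank failure history entries by the cumulative weight value.

weights = {"vendor": 3, "os": 2, "uptime": 1}   # e.g. server weight information 800

def cumulative_weight(source, entry):
    total = 0
    if source["vendor"] == entry["vendor"]:       # exact match -> weight counts
        total += weights["vendor"]
    if source["os"] == entry["os"]:
        total += weights["os"]
    if abs(source["uptime"] - entry["uptime"]) <= 3000:  # approximate match
        total += weights["uptime"]
    return total

source = {"vendor": "VendorX", "os": "OS1", "uptime": 50000}
history = [
    {"id": 1, "vendor": "VendorX", "os": "OS1", "uptime": 49000},
    {"id": 2, "vendor": "VendorY", "os": "OS1", "uptime": 10000},
]
ranked = sorted(history, key=lambda e: cumulative_weight(source, e), reverse=True)
print([e["id"] for e in ranked])  # [1, 2]
```

Weighting lets the administrator make decisive items (for example, vendor or firmware) dominate the matching rate while still giving partial credit for approximate matches such as similar uptimes.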
(x) the type of node device and the content of the event that occurred, as the root cause; and
(y) the type of node device and the content of the event that occurred, as the result (what event occurred at which type of node device).
It represents the correspondence between these. A search query containing the failure analysis context corresponding to the failure that occurred is input to the management server by the system administrator. In response to the search query, the management server compares the failure analysis context in the query (the first failure analysis context) with the failure analysis context in each failure history entry of the failure history information (the second failure analysis context), thereby identifying failure history entries containing failure analysis contexts highly similar to the source failure analysis context. The management server displays the information registered in the identified second failure history entry (information including information representing the recovery method). This enables the system administrator to quickly identify a method for recovering from the failure that occurred.
Information representing a meta recovery method (described later) is associated with the meta RCA rule information 115 (meta RCA records) of Embodiment 1.
IF Server(x) & Storage(y) & FC-Connected(x, y) & EventHappensOn(IO_ERROR, x, t1) & EventHappensOn(CTRL_FAIL, y, t2) & WithinTimeWindow(t1, t2, “10 minutes”)
THEN IdentifyRootCause(CTRL_FAIL, y)
A Topology Rule is a rule that describes the connection states of nodes without depending on a specific topology. A concrete example of a Topology Rule is as follows.
IF FC-connected(x, y) & FC-connected(z, y)
THEN FC-connected(x, z).
TF1: Server(“ServerA”)
TF2: Storage(“StorageA”)
TF3: Switch(“SwitchA”)
TF4: FC-Connected(“ServerA”, “ABC”)
TF5: FC-Connected(“AMS1000”, “ABC”)。
IF EventHappensOn(IO_ERROR, “ServerA”,t1) & EventHappensOn(CTRL_FAIL, “StorageA”,t2) & WithinTimeWindow(t1, t2, “10 minutes”)
THEN IdentifyRootCause(CTRL_FAIL, “StorageA”)。
EF1: EventHappensOn(IO_ERROR, "ServerA", "12:32:12 2009/03/10")
EF2: EventHappensOn(CTRL_FAIL, "AMS1000", "12:32:10 2009/03/10")
EF3: WithinTimeWindow("12:32:10 2009/03/10", "12:32:12 2009/03/10", "10 minutes").
The intermediate form C-RULE-100-INSTANCE-1 is the expanded RCA rule, C-RULE-100 (the Causality Rule) corresponds to the meta RCA rule, and "C-RULE-100" serves as the meta RCA rule ID 541.
(A) For events that occur at monitoring target nodes and are managed by the management system, distinguished by occurrence site (including the node device) and event content, all combination patterns are stored.
(B) For those combinations in (A) whose root cause can be identified, the occurrence site (including the node device) and event content regarded as the root cause are stored in association.
* The event information includes an event entry indicating information identifying the node device at which an event occurred and the type of the event that occurred.
* The meta rule information includes meta rules indicating the types of latent events that can potentially occur at a node device and the type of event that can be identified as the root cause when an event corresponding to such a latent event type occurs.
* The failure history information includes a failure history entry containing information indicating a recovery method and information identifying the meta rule to which the recovery method corresponds.
(A) based on the meta rule information, a first cause event that is the root cause of a first event identified by an event entry stored in the event information is identified, and the first meta rule used to identify the first cause event is identified;
(B) a first recovery method, which is a method for recovering from the first cause event, is received via the input device, and, based on the first recovery method, a first failure history entry corresponding to the first meta rule is added to the failure history information;
(C) based on the meta rule information, a second cause event that is the root cause of a second event identified by an event entry stored in the event information is identified, and the second meta rule used to identify the second cause event is identified;
(D) based on the failure history information, a predetermined failure history entry corresponding to the second meta rule is identified.
(X) information about the second cause event is displayed as the root cause of the second event;
(Y) based on the predetermined failure history entry, a recovery method from the second cause event is displayed.
A management system characterized by the above has been described.
(Z) the identifier of the node device indicated by the predetermined failure history entry may be displayed as the identifier of the node device to which the recovery method indicated by the predetermined failure history entry of (Y) was applied.
(a) as the display of information about the second cause event in (X), information including the identifier of the node device at which the second cause event occurred is displayed, and
(b) as the display of the identifier of the node device indicated by the predetermined failure history entry in (Z), the identifier of the node device at which the first cause event occurred may be displayed.
(D1) the failure history entries indicating the same meta rule as the second meta rule are selected;
(D2) when the number of failure history entries selected in (D1) is less than a first threshold, the predetermined failure history entry is identified based on the matching rate between the meta rule to which each failure history entry corresponds and the second meta rule;
(D3) the failure history entry selected in (D1) may be identified as the predetermined failure history entry.
(D4) when the number of failure history entries selected in (D1) is equal to or greater than a second threshold, the predetermined failure history entry may be identified based on the matching rate between the past configuration setting information contained in the failure history entries and the configuration setting information.
Claims (15)
- A computer system comprising: a plurality of node devices; and a management system that detects events occurring in the plurality of node devices, wherein
the management system stores event information, meta rule information, and failure history information,
the event information includes an event entry indicating information identifying the node device at which an event occurred and the type of the event that occurred,
the meta rule information includes meta rules indicating the types of latent events that can potentially occur at a node device and the type of event that can be identified as the root cause when an event corresponding to such a latent event type occurs,
the failure history information includes a failure history entry containing information indicating a recovery method and information identifying the meta rule to which the recovery method corresponds, and
the management system:
(A) based on the meta rule information, identifies a first cause event that is the root cause of a first event identified by an event entry stored in the event information, and identifies the first meta rule used to identify the first cause event;
(B) receives, via the input device, a first recovery method that is a method for recovering from the first cause event, and, based on the first recovery method, adds to the failure history information a first failure history entry corresponding to the first meta rule;
(C) based on the meta rule information, identifies a second cause event that is the root cause of a second event identified by an event entry stored in the event information, and identifies the second meta rule used to identify the second cause event;
(D) based on the failure history information, identifies a predetermined failure history entry corresponding to the second meta rule;
(X) displays information about the second cause event as the root cause of the second event; and
(Y) displays, based on the predetermined failure history entry, a recovery method from the second cause event.
- The computer system according to claim 1, wherein
the failure history entry includes the identifier of the node device to which the recovery method was applied, and
the management system:
(Z) displays the identifier of the node device indicated by the predetermined failure history entry as the identifier of the node device to which the recovery method indicated by the predetermined failure history entry of (Y) was applied.
- The computer system according to claim 2, wherein,
when the first meta rule and the second meta rule are identical, the recovery method indicated by the predetermined failure history entry of (Y) is the first recovery method indicated by the first failure history entry, and,
when the node device at which the first cause event occurred and the node device at which the second cause event occurred are different node devices, the management system:
(a) displays, as the display of information about the second cause event in (X), information including the identifier of the node device at which the second cause event occurred, and
(b) displays, as the display of the identifier of the node device indicated by the predetermined failure history entry in (Z), the identifier of the node device at which the first cause event occurred.
- The computer system according to claim 2, wherein the identification of (D):
(D1) selects the failure history entries indicating the same meta rule as the second meta rule,
(D2) when the number of failure history entries selected in (D1) is less than a first threshold, identifies the predetermined failure history entry based on the matching rate between the meta rule to which each failure history entry corresponds and the second meta rule, and
(D3) identifies the failure history entry selected in (D1) as the predetermined failure history entry.
- The computer system according to claim 4, wherein
the storage resource stores configuration setting information of the plurality of node devices,
the failure history entry includes past configuration setting information of the plurality of node devices corresponding to the time the entry was created, and
the identification of (D):
(D4) when the number of failure history entries selected in (D1) is equal to or greater than a second threshold, identifies the predetermined failure history entry based on the matching rate between the past configuration setting information contained in the failure history entries and the configuration setting information.
- The computer system according to claim 5, wherein
the storage resource stores weight information indicating weight values for items of configuration setting information, and
the identification of (D4) is performed based on the weight information.
- The computer system according to claim 1, wherein
the first recovery method of (B) is a meta recovery method, i.e., a recovery method that does not include the identifier of the node device at which the first cause event occurred, and
the display of the recovery method from the second cause event in (Y) is a display of the meta recovery method and the identifier of the node device at which the second cause event occurred.
- A management system comprising: an interface that communicates with a plurality of node devices; a processor that detects, via the interface, events occurring in the plurality of node devices; a storage resource that stores event information, meta rule information, and failure history information; a display device that displays information about the plurality of node devices; and an input device, wherein
the event information includes an event entry indicating information identifying the node device at which an event occurred and the type of the event that occurred,
the meta rule information includes meta rules indicating the types of latent events that can potentially occur at a node device and the type of event that can be identified as the root cause when an event corresponding to such a latent event type occurs,
the failure history information includes a failure history entry containing information indicating a recovery method and information identifying the meta rule to which the recovery method corresponds,
the processor:
(A) based on the meta rule information, identifies a first cause event that is the root cause of a first event identified by an event entry stored in the event information, and identifies the first meta rule used to identify the first cause event;
(B) receives, via the input device, a first recovery method that is a method for recovering from the first cause event, and, based on the first recovery method, adds to the failure history information a first failure history entry corresponding to the first meta rule;
(C) based on the meta rule information, identifies a second cause event that is the root cause of a second event identified by an event entry stored in the event information, and identifies the second meta rule used to identify the second cause event;
(D) based on the failure history information, identifies a predetermined failure history entry corresponding to the second meta rule; and
the display device:
(X) displays information about the second cause event as the root cause of the second event; and
(Y) displays, based on the predetermined failure history entry, a recovery method from the second cause event.
- The management system according to claim 8, wherein
the failure history entry includes the identifier of the node device to which the recovery method was applied, and
the display device:
(Z) displays the identifier of the node device indicated by the predetermined failure history entry as the identifier of the node device to which the recovery method indicated by the predetermined failure history entry of (Y) was applied.
- The management system according to claim 9, wherein,
when the node device at which the first cause event occurred and the node device at which the second cause event occurred are different node devices, the display device:
(a) displays, as the display of information about the second cause event in (X), information including the identifier of the node device at which the second cause event occurred, and
(b) displays, as the display of the identifier of the node device indicated by the predetermined failure history entry in (Z), the identifier of the node device at which the first cause event occurred.
- The management system according to claim 9, wherein the identification of (D):
(D1) selects the failure history entries indicating the same meta rule as the second meta rule,
(D2) when the number of failure history entries selected in (D1) is less than a first threshold, identifies the predetermined failure history entry based on the matching rate between the meta rule to which each failure history entry corresponds and the second meta rule, and
(D3) identifies the failure history entry selected in (D1) as the predetermined failure history entry.
- The management system according to claim 11, wherein
the storage resource stores configuration setting information of the plurality of node devices,
the failure history entry includes past configuration setting information of the plurality of node devices corresponding to the time the entry was created, and
the identification of (D):
(D4) when the number of failure history entries selected in (D1) is equal to or greater than a second threshold, identifies the predetermined failure history entry based on the matching rate between the past configuration setting information contained in the failure history entries and the configuration setting information.
- The management system according to claim 12, wherein
the storage resource stores weight information indicating weight values for items of configuration setting information, and
the identification of (D4) is performed based on the weight information.
- The management system according to claim 8, wherein
the first recovery method of (B) is a meta recovery method, i.e., a recovery method that does not include the identifier of the node device at which the first cause event occurred, and
the display of the recovery method from the second cause event in (Y) is a display of the meta recovery method and the identifier of the node device at which the second cause event occurred.
- A management method for a management system that manages a plurality of node devices, wherein
the management system has, for events that can occur at the plurality of node devices, meta rules that identify the root-cause phenomenon and failure recovery methods associated with the meta rules, and
the management system displays the cause event that is the root cause of an event detected by the management server, and a recovery method from the cause event.
Priority Applications (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2011522628A JP5385982B2 (ja) | 2009-07-16 | 2009-07-16 | 障害の根本原因に対応した復旧方法を表す情報を出力する管理システム |
PCT/JP2009/003358 WO2011007394A1 (ja) | 2009-07-16 | 2009-07-16 | 障害の根本原因に対応した復旧方法を表す情報を出力する管理システム |
US12/529,522 US8429453B2 (en) | 2009-07-16 | 2009-07-16 | Management system for outputting information denoting recovery method corresponding to root cause of failure |
EP09847293A EP2455863A4 (en) | 2009-07-16 | 2009-07-16 | MANAGEMENT SYSTEM FOR PROVIDING INFORMATION DESCRIBING A RECOVERY METHOD CORRESPONDING TO A FUNDAMENTAL CAUSE OF FAILURE |
CN200980160965.4A CN102473129B (zh) | 2009-07-16 | 2009-07-16 | 输出表示与故障的根本原因对应的恢复方法的信息的管理系统 |
US13/845,992 US9189319B2 (en) | 2009-07-16 | 2013-03-18 | Management system for outputting information denoting recovery method corresponding to root cause of failure |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2009/003358 WO2011007394A1 (ja) | 2009-07-16 | 2009-07-16 | 障害の根本原因に対応した復旧方法を表す情報を出力する管理システム |
Related Child Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/529,522 A-371-Of-International US8429453B2 (en) | 2009-07-16 | 2009-07-16 | Management system for outputting information denoting recovery method corresponding to root cause of failure |
US13/845,992 Continuation US9189319B2 (en) | 2009-07-16 | 2013-03-18 | Management system for outputting information denoting recovery method corresponding to root cause of failure |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2011007394A1 true WO2011007394A1 (ja) | 2011-01-20 |
Family
ID=43449016
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2009/003358 WO2011007394A1 (ja) | 2009-07-16 | 2009-07-16 | 障害の根本原因に対応した復旧方法を表す情報を出力する管理システム |
Country Status (5)
Country | Link |
---|---|
US (2) | US8429453B2 (ja) |
EP (1) | EP2455863A4 (ja) |
JP (1) | JP5385982B2 (ja) |
CN (1) | CN102473129B (ja) |
WO (1) | WO2011007394A1 (ja) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2012174232A (ja) * | 2011-02-24 | 2012-09-10 | Fujitsu Ltd | 監視装置、監視システムおよび監視方法 |
WO2014141460A1 (ja) * | 2013-03-15 | 2014-09-18 | 株式会社日立製作所 | 管理システム |
WO2014162595A1 (ja) | 2013-04-05 | 2014-10-09 | 株式会社日立製作所 | 管理システム及び管理プログラム |
JP5719974B2 (ja) * | 2012-09-03 | 2015-05-20 | 株式会社日立製作所 | 複数の監視対象デバイスを有する計算機システムの管理を行う管理システム |
JP2015156225A (ja) * | 2015-03-23 | 2015-08-27 | 株式会社日立製作所 | 複数の監視対象デバイスを有する計算機システムの管理を行う管理システム |
EP2674865A4 (en) * | 2011-09-26 | 2016-06-01 | Hitachi Ltd | ADMINISTRATIVE COMPUTERS AND METHODS OF BASIC ANALYSIS |
JPWO2014068659A1 (ja) * | 2012-10-30 | 2016-09-08 | 株式会社日立製作所 | 管理計算機およびルール生成方法 |
JPWO2015079564A1 (ja) * | 2013-11-29 | 2017-03-16 | 株式会社日立製作所 | イベントの根本原因の解析を支援する管理システム及び方法 |
Families Citing this family (40)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2455863A4 (en) * | 2009-07-16 | 2013-03-27 | Hitachi Ltd | MANAGEMENT SYSTEM FOR PROVIDING INFORMATION DESCRIBING A RECOVERY METHOD CORRESPONDING TO A FUNDAMENTAL CAUSE OF FAILURE |
JP5419746B2 (ja) * | 2010-02-23 | 2014-02-19 | 株式会社日立製作所 | 管理装置及び管理プログラム |
US8451739B2 (en) | 2010-04-15 | 2013-05-28 | Silver Spring Networks, Inc. | Method and system for detecting failures of network nodes |
US8943364B2 (en) * | 2010-04-30 | 2015-01-27 | International Business Machines Corporation | Appliance for storing, managing and analyzing problem determination artifacts |
US8429455B2 (en) * | 2010-07-16 | 2013-04-23 | Hitachi, Ltd. | Computer system management method and management system |
US8572434B2 (en) | 2010-09-29 | 2013-10-29 | Sepaton, Inc. | System health monitor |
US8386850B2 (en) * | 2010-09-29 | 2013-02-26 | Sepaton, Inc. | System health monitor |
WO2013057790A1 (ja) * | 2011-10-18 | 2013-04-25 | 富士通株式会社 | 情報処理装置、時刻補正値決定方法、およびプログラム |
CN103176873A (zh) * | 2011-12-23 | 2013-06-26 | 鸿富锦精密工业(深圳)有限公司 | 计数卡 |
US8977886B2 (en) * | 2012-02-14 | 2015-03-10 | Alcatel Lucent | Method and apparatus for rapid disaster recovery preparation in a cloud network |
JP5684946B2 (ja) * | 2012-03-23 | 2015-03-18 | 株式会社日立製作所 | イベントの根本原因の解析を支援する方法及びシステム |
US8996532B2 (en) | 2012-05-21 | 2015-03-31 | International Business Machines Corporation | Determining a cause of an incident based on text analytics of documents |
WO2014001841A1 (en) * | 2012-06-25 | 2014-01-03 | Kni Műszaki Tanácsadó Kft. | Methods of implementing a dynamic service-event management system |
US9576318B2 (en) * | 2012-09-25 | 2017-02-21 | Mx Technologies, Inc. | Automatic payment and deposit migration |
JP5839133B2 (ja) * | 2012-12-12 | 2016-01-06 | 三菱電機株式会社 | 監視制御装置及び監視制御方法 |
US10169122B2 (en) * | 2013-04-29 | 2019-01-01 | Moogsoft, Inc. | Methods for decomposing events from managed infrastructures |
CN103440174B (zh) * | 2013-08-02 | 2016-05-25 | 杭州华为数字技术有限公司 | 一种错误信息处理方法、装置及应用该装置的电子设备 |
JP6190468B2 (ja) * | 2013-10-30 | 2017-08-30 | 株式会社日立製作所 | 管理システム、プラン生成方法、およびプラン生成プログラム |
CN104035849B (zh) * | 2014-06-19 | 2017-02-15 | 浪潮电子信息产业股份有限公司 | 一种防止Rack机柜风扇管理失效的方法 |
DE112015006084T5 (de) * | 2015-01-30 | 2017-10-12 | Hitachi, Ltd. | Systemverwaltungsvorrichtung und systemverwaltungssystem |
US9692815B2 (en) | 2015-11-12 | 2017-06-27 | Mx Technologies, Inc. | Distributed, decentralized data aggregation |
US9830150B2 (en) | 2015-12-04 | 2017-11-28 | Google Llc | Multi-functional execution lane for image processor |
US10180869B2 (en) * | 2016-02-16 | 2019-01-15 | Microsoft Technology Licensing, Llc | Automated ordering of computer system repair |
CN105786635B (zh) * | 2016-03-01 | 2018-10-12 | 国网江苏省电力公司电力科学研究院 | 一种面向故障敏感点动态检测的复杂事件处理系统及方法 |
US9922539B1 (en) * | 2016-08-05 | 2018-03-20 | Sprint Communications Company L.P. | System and method of telecommunication network infrastructure alarms queuing and multi-threading |
JP6885193B2 (ja) * | 2017-05-12 | 2021-06-09 | Fujitsu Limited | Parallel processing device, job management method, and job management program |
US10977154B2 (en) * | 2018-08-03 | 2021-04-13 | Dynatrace Llc | Method and system for automatic real-time causality analysis of end user impacting system anomalies using causality rules and topological understanding of the system to effectively filter relevant monitoring data |
US10282248B1 (en) * | 2018-11-27 | 2019-05-07 | Capital One Services, Llc | Technology system auto-recovery and optimality engine and techniques |
US10824528B2 (en) | 2018-11-27 | 2020-11-03 | Capital One Services, Llc | Techniques and system for optimization driven by dynamic resilience |
US11093319B2 (en) * | 2019-05-29 | 2021-08-17 | Microsoft Technology Licensing, Llc | Automated recovery of webpage functionality |
US11907087B2 (en) | 2019-07-10 | 2024-02-20 | International Business Machines Corporation | Remote health monitoring in data replication environments |
US11281694B2 (en) | 2019-07-10 | 2022-03-22 | International Business Machines Corporation | Remote data capture in data replication environments |
US10686645B1 (en) | 2019-10-09 | 2020-06-16 | Capital One Services, Llc | Scalable subscriptions for virtual collaborative workspaces |
EP3823215A1 (en) * | 2019-11-18 | 2021-05-19 | Juniper Networks, Inc. | Network model aware diagnosis of a network |
CN113206749B (zh) | 2020-01-31 | 2023-11-17 | Juniper Networks, Inc. | Programmable diagnosis model for the correlation of network events |
CN113328872B (zh) * | 2020-02-29 | 2023-03-28 | Huawei Technologies Co., Ltd. | Fault repair method, device, and storage medium |
US11765015B2 (en) * | 2020-03-19 | 2023-09-19 | Nippon Telegraph And Telephone Corporation | Network management apparatus, method, and program |
US11269711B2 (en) | 2020-07-14 | 2022-03-08 | Juniper Networks, Inc. | Failure impact analysis of network events |
US20220182278A1 (en) * | 2020-12-07 | 2022-06-09 | Citrix Systems, Inc. | Systems and methods to determine root cause of connection failures |
JP2022115316A (ja) * | 2021-01-28 | 2022-08-09 | Hitachi, Ltd. | Log search support device and log search support method |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2004061681A1 (ja) * | 2002-12-26 | 2004-07-22 | Fujitsu Limited | Operation management method and operation management server |
JP2006526842A (ja) | 2003-03-31 | 2006-11-24 | System Management Arts, Inc. | Method and apparatus for system management using codebook correlation with symptom exclusion |
JP2006338305A (ja) * | 2005-06-01 | 2006-12-14 | Toshiba Corp | Monitoring device and monitoring program |
US7478404B1 (en) | 2004-03-30 | 2009-01-13 | Emc Corporation | System and methods for event impact analysis |
JP2009043029A (ja) | 2007-08-09 | 2009-02-26 | Hitachi Ltd | Related DB creation device |
JP2009064101A (ja) * | 2007-09-04 | 2009-03-26 | Toshiba Corp | Remote monitoring system and remote monitoring method |
Family Cites Families (68)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4779208A (en) * | 1983-09-28 | 1988-10-18 | Hitachi, Ltd. | Information processing system and method for use in computer systems suitable for production system |
US5261086A (en) * | 1987-10-26 | 1993-11-09 | Nec Corporation | Performance analyzing and diagnosing system for computer systems |
US5214653A (en) * | 1990-10-22 | 1993-05-25 | Harris Corporation | Fault finder expert system |
US5572670A (en) * | 1994-01-10 | 1996-11-05 | Storage Technology Corporation | Bi-directional translator for diagnostic sensor data |
US5557765A (en) * | 1994-08-11 | 1996-09-17 | Trusted Information Systems, Inc. | System and method for data recovery |
US6072777A (en) * | 1996-06-28 | 2000-06-06 | Mci Communications Corporation | System and method for unreported root cause analysis |
US6226659B1 (en) * | 1996-09-16 | 2001-05-01 | Oracle Corporation | Method and apparatus for processing reports |
EP1021804A4 (en) * | 1997-05-06 | 2002-03-20 | Speechworks Int Inc | SYSTEM AND METHOD FOR DEVELOPING INTERACTIVE LANGUAGE APPLICATIONS |
US7500143B2 (en) * | 2000-05-05 | 2009-03-03 | Computer Associates Think, Inc. | Systems and methods for managing and analyzing faults in computer networks |
US7752024B2 (en) * | 2000-05-05 | 2010-07-06 | Computer Associates Think, Inc. | Systems and methods for constructing multi-layer topological models of computer networks |
WO2002087152A1 (en) * | 2001-04-18 | 2002-10-31 | Caveo Technology, Llc | Universal, customizable security system for computers and other devices |
EP1405187B1 (en) | 2001-07-06 | 2019-04-10 | CA, Inc. | Method and system for correlating and determining root causes of system and enterprise events |
US6792393B1 (en) * | 2001-12-03 | 2004-09-14 | At&T Corp. | System and method for diagnosing computer system operational behavior |
US20040153692A1 (en) * | 2001-12-28 | 2004-08-05 | O'brien Michael | Method for managing faults in a computer system environment |
US7194445B2 (en) * | 2002-09-20 | 2007-03-20 | Lenovo (Singapore) Pte. Ltd. | Adaptive problem determination and recovery in a computer system |
US7254515B1 (en) * | 2003-03-31 | 2007-08-07 | Emc Corporation | Method and apparatus for system management using codebook correlation with symptom exclusion |
US20050091356A1 (en) * | 2003-10-24 | 2005-04-28 | Matthew Izzo | Method and machine-readable medium for using matrices to automatically analyze network events and objects |
JP2005165847A (ja) * | 2003-12-04 | 2005-06-23 | Fujitsu Ltd | Policy rule scenario control device and control method |
US7965620B2 (en) * | 2004-05-25 | 2011-06-21 | Telcordia Licensing Company, Llc | Method, computer product and system for correlating events in a network |
JP3826940B2 (ja) * | 2004-06-02 | 2006-09-27 | NEC Corporation | Failure recovery device, failure recovery method, manager device, and program |
US7536370B2 (en) * | 2004-06-24 | 2009-05-19 | Sun Microsystems, Inc. | Inferential diagnosing engines for grid-based computing systems |
US20060112061A1 (en) * | 2004-06-24 | 2006-05-25 | Masurkar Vijay B | Rule based engines for diagnosing grid-based computing systems |
US7631222B2 (en) * | 2004-08-23 | 2009-12-08 | Cisco Technology, Inc. | Method and apparatus for correlating events in a network |
US7373552B2 (en) * | 2004-09-30 | 2008-05-13 | Siemens Aktiengesellschaft | Model based diagnosis and repair for event logs |
US7275017B2 (en) | 2004-10-13 | 2007-09-25 | Cisco Technology, Inc. | Method and apparatus for generating diagnoses of network problems |
US7954090B1 (en) * | 2004-12-21 | 2011-05-31 | Zenprise, Inc. | Systems and methods for detecting behavioral features of software application deployments for automated deployment management |
DE112005003530T5 (de) * | 2005-04-08 | 2008-03-27 | Hewlett-Packard Development Company, L.P., Houston | Error code system |
US7426654B2 (en) * | 2005-04-14 | 2008-09-16 | Verizon Business Global Llc | Method and system for providing customer controlled notifications in a managed network services system |
US7571150B2 (en) * | 2005-04-15 | 2009-08-04 | Microsoft Corporation | Requesting, obtaining, and processing operational event feedback from customer data centers |
JP4672722B2 (ja) * | 2005-04-25 | 2011-04-20 | Fujitsu Limited | Network design processing device, method, and program |
US7949904B2 (en) * | 2005-05-04 | 2011-05-24 | Microsoft Corporation | System and method for hardware error reporting and recovery |
US8392236B2 (en) * | 2005-05-13 | 2013-03-05 | The Boeing Company | Mobile network dynamic workflow exception handling system |
JP4701148B2 (ja) * | 2006-03-02 | 2011-06-15 | Alaxala Networks Corporation | Failure recovery system and server |
US8326969B1 (en) * | 2006-06-28 | 2012-12-04 | Emc Corporation | Method and apparatus for providing scalability in resource management and analysis system- three way split architecture |
US8284675B2 (en) * | 2006-06-28 | 2012-10-09 | Rockstar Bidco, L.P. | Method and system for automated call troubleshooting and resolution |
JP4859558B2 (ja) * | 2006-06-30 | 2012-01-25 | Hitachi, Ltd. | Computer system control method and computer system |
US7924733B2 (en) * | 2006-09-28 | 2011-04-12 | Avaya Inc. | Root cause analysis of network performance based on exculpation or inculpation sets |
JP2008084242A (ja) * | 2006-09-29 | 2008-04-10 | Omron Corp | Database creation device and database utilization support device |
US7872982B2 (en) * | 2006-10-02 | 2011-01-18 | International Business Machines Corporation | Implementing an error log analysis model to facilitate faster problem isolation and repair |
US20080140817A1 (en) * | 2006-12-06 | 2008-06-12 | Agarwal Manoj K | System and method for performance problem localization |
US7757117B2 (en) * | 2007-04-17 | 2010-07-13 | International Business Machines Corporation | Method and apparatus for testing of enterprise systems |
US8421614B2 (en) * | 2007-09-19 | 2013-04-16 | International Business Machines Corporation | Reliable redundant data communication through alternating current power distribution system |
US7941707B2 (en) * | 2007-10-19 | 2011-05-10 | Oracle International Corporation | Gathering information for use in diagnostic data dumping upon failure occurrence |
US7788534B2 (en) * | 2007-12-11 | 2010-08-31 | International Business Machines Corporation | Method for monitoring and managing a client device in a distributed autonomic computing environment |
US8826077B2 (en) * | 2007-12-28 | 2014-09-02 | International Business Machines Corporation | Defining a computer recovery process that matches the scope of outage including determining a root cause and performing escalated recovery operations |
US8341014B2 (en) * | 2007-12-28 | 2012-12-25 | International Business Machines Corporation | Recovery segments for computer business applications |
US20090172674A1 (en) * | 2007-12-28 | 2009-07-02 | International Business Machines Corporation | Managing the computer collection of information in an information technology environment |
US20090210745A1 (en) * | 2008-02-14 | 2009-08-20 | Becker Sherilyn M | Runtime Error Correlation Learning and Guided Automatic Recovery |
US7835307B2 (en) * | 2008-03-14 | 2010-11-16 | International Business Machines Corporation | Network discovery tool |
US7870441B2 (en) * | 2008-03-18 | 2011-01-11 | International Business Machines Corporation | Determining an underlying cause for errors detected in a data processing system |
US8086905B2 (en) * | 2008-05-27 | 2011-12-27 | Hitachi, Ltd. | Method of collecting information in system network |
US7814369B2 (en) * | 2008-06-12 | 2010-10-12 | Honeywell International Inc. | System and method for detecting combinations of performance indicators associated with a root cause |
US8112378B2 (en) * | 2008-06-17 | 2012-02-07 | Hitachi, Ltd. | Methods and systems for performing root cause analysis |
US20110125544A1 (en) * | 2008-07-08 | 2011-05-26 | Technion-Research & Development Foundation Ltd | Decision support system for project managers and associated method |
US8310931B2 (en) * | 2008-07-18 | 2012-11-13 | International Business Machines Corporation | Discovering network topology from routing information |
US8370466B2 (en) * | 2008-07-23 | 2013-02-05 | International Business Machines Corporation | Method and system for providing operator guidance in network and systems management |
US7877636B2 (en) * | 2008-08-28 | 2011-01-25 | Honeywell International Inc. | System and method for detecting temporal relationships uniquely associated with an underlying root cause |
US7962472B2 (en) * | 2008-09-29 | 2011-06-14 | International Business Machines Corporation | Self-optimizing algorithm for real-time problem resolution using historical data |
JP5237034B2 (ja) * | 2008-09-30 | 2013-07-17 | Hitachi, Ltd. | Root cause analysis method, device, and program for IT devices outside event information acquisition |
US8166351B2 (en) * | 2008-10-21 | 2012-04-24 | At&T Intellectual Property I, L.P. | Filtering redundant events based on a statistical correlation between events |
US7877642B2 (en) * | 2008-10-22 | 2011-01-25 | International Business Machines Corporation | Automatic software fault diagnosis by exploiting application signatures |
US7954010B2 (en) * | 2008-12-12 | 2011-05-31 | At&T Intellectual Property I, L.P. | Methods and apparatus to detect an error condition in a communication network |
US8055945B2 (en) * | 2009-02-02 | 2011-11-08 | International Business Machines Corporation | Systems, methods and computer program products for remote error resolution reporting |
US7979747B2 (en) * | 2009-02-20 | 2011-07-12 | International Business Machines Corporation | Interactive problem resolution presented within the context of major observable application behaviors |
WO2010112960A1 (en) | 2009-03-30 | 2010-10-07 | Hitachi, Ltd. | Method and apparatus for cause analysis involving configuration changes |
US8527328B2 (en) * | 2009-04-22 | 2013-09-03 | Bank Of America Corporation | Operational reliability index for the knowledge management system |
JP5325981B2 (ja) * | 2009-05-26 | 2013-10-23 | Hitachi, Ltd. | Management server and management system |
EP2455863A4 (en) * | 2009-07-16 | 2013-03-27 | Hitachi Ltd | MANAGEMENT SYSTEM FOR PROVIDING INFORMATION DESCRIBING A RECOVERY METHOD CORRESPONDING TO A FUNDAMENTAL CAUSE OF FAILURE |
2009
- 2009-07-16 EP EP09847293A patent/EP2455863A4/en not_active Withdrawn
- 2009-07-16 US US12/529,522 patent/US8429453B2/en not_active Expired - Fee Related
- 2009-07-16 WO PCT/JP2009/003358 patent/WO2011007394A1/ja active Application Filing
- 2009-07-16 JP JP2011522628A patent/JP5385982B2/ja not_active Expired - Fee Related
- 2009-07-16 CN CN200980160965.4A patent/CN102473129B/zh not_active Expired - Fee Related
2013
- 2013-03-18 US US13/845,992 patent/US9189319B2/en not_active Expired - Fee Related
Non-Patent Citations (2)
Title |
---|
Communications of the ACM, September 1985 (1985-09-01), pages 921-932 |
See also references of EP2455863A4 * |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2012174232A (ja) * | 2011-02-24 | 2012-09-10 | Fujitsu Ltd | Monitoring device, monitoring system, and monitoring method |
US8996924B2 (en) | 2011-02-24 | 2015-03-31 | Fujitsu Limited | Monitoring device, monitoring system and monitoring method |
EP2674865A4 (en) * | 2011-09-26 | 2016-06-01 | Hitachi Ltd | ADMINISTRATIVE COMPUTERS AND METHODS OF BASIC ANALYSIS |
JP5719974B2 (ja) * | 2012-09-03 | 2015-05-20 | Hitachi, Ltd. | Management system for managing a computer system having a plurality of monitored devices |
JPWO2014068659A1 (ja) * | 2012-10-30 | 2016-09-08 | Hitachi, Ltd. | Management computer and rule generation method |
WO2014141460A1 (ja) * | 2013-03-15 | 2014-09-18 | Hitachi, Ltd. | Management system |
JP5946583B2 (ja) * | 2013-03-15 | 2016-07-06 | Hitachi, Ltd. | Management system |
US9628360B2 (en) | 2013-03-15 | 2017-04-18 | Hitachi, Ltd. | Computer management system based on meta-rules |
WO2014162595A1 (ja) | 2013-04-05 | 2014-10-09 | Hitachi, Ltd. | Management system and management program |
US9619314B2 (en) | 2013-04-05 | 2017-04-11 | Hitachi, Ltd. | Management system and management program |
JPWO2015079564A1 (ja) * | 2013-11-29 | 2017-03-16 | Hitachi, Ltd. | Management system and method for supporting analysis of the root cause of an event |
JP2015156225A (ja) * | 2015-03-23 | 2015-08-27 | Hitachi, Ltd. | Management system for managing a computer system having a plurality of monitored devices |
Also Published As
Publication number | Publication date |
---|---|
US9189319B2 (en) | 2015-11-17 |
JP5385982B2 (ja) | 2014-01-08 |
US20130219225A1 (en) | 2013-08-22 |
JPWO2011007394A1 (ja) | 2012-12-20 |
CN102473129A (zh) | 2012-05-23 |
US20110264956A1 (en) | 2011-10-27 |
CN102473129B (zh) | 2015-12-02 |
EP2455863A1 (en) | 2012-05-23 |
EP2455863A4 (en) | 2013-03-27 |
US8429453B2 (en) | 2013-04-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP5385982B2 (ja) | Management system that outputs information representing a recovery method corresponding to the root cause of a failure | |
US11614990B2 (en) | Automatic correlation of dynamic system events within computing devices | |
US10467084B2 (en) | Knowledge-based system for diagnosing errors in the execution of an operation | |
US8782039B2 (en) | Generating a semantic graph relating information assets using feedback re-enforced search and navigation | |
US7536370B2 (en) | Inferential diagnosing engines for grid-based computing systems | |
US9071535B2 (en) | Comparing node states to detect anomalies | |
US8892705B2 (en) | Information processing system, operation management method for computer systems, and program in a distributed network environment | |
WO2015079564A1 (ja) | Management system and method for supporting analysis of the root cause of an event | |
US8751856B2 (en) | Determining recovery time for interdependent resources in heterogeneous computing environment | |
US20120030346A1 (en) | Method for inferring extent of impact of configuration change event on system failure | |
US20120221898A1 (en) | System and method for determination of the root cause of an overall failure of a business application service | |
US20120284262A1 (en) | Managing information assets using feedback re-enforced search and navigation | |
US20120221558A1 (en) | Identifying information assets within an enterprise using a semantic graph created using feedback re-enforced search and navigation | |
JP2011076292A (ja) | Design method and computer for failure cause analysis rules according to obtainable device information | |
JP2014199579A (ja) | Detection method, detection program, and detection device | |
JP6988304B2 (ja) | Operation management system, monitoring server, method, and program | |
US20150032776A1 (en) | Cross-cutting event correlation | |
WO2006117833A1 (ja) | Monitoring simulation device, method, and program | |
US20200073781A1 (en) | Systems and methods of injecting fault tree analysis data into distributed tracing visualizations | |
JP5514643B2 (ja) | Failure cause determination rule change detection device and program | |
JP6280862B2 (ja) | Event analysis system and method | |
WO2014068705A1 (ja) | Monitoring system and monitoring program | |
US20150242416A1 (en) | Management computer and rule generation method | |
US20230325294A1 (en) | Models for detecting and managing excessive log patterns | |
Kobayashi et al. | amulog: A general log analysis framework for comparison and combination of diverse template generation methods |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WWE | Wipo information: entry into national phase |
Ref document number: 200980160965.4 Country of ref document: CN |
WWE | Wipo information: entry into national phase |
Ref document number: 12529522 Country of ref document: US |
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 09847293 Country of ref document: EP Kind code of ref document: A1 |
ENP | Entry into the national phase |
Ref document number: 2011522628 Country of ref document: JP Kind code of ref document: A |
NENP | Non-entry into the national phase |
Ref country code: DE |
WWE | Wipo information: entry into national phase |
Ref document number: 2009847293 Country of ref document: EP |