WO2015079564A1 - Management system and method for supporting analysis of the root cause of an event - Google Patents
Management system and method for supporting analysis of the root cause of an event
- Publication number
- WO2015079564A1 PCT/JP2013/082207 JP2013082207W
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- information
- diagnostic procedure
- deployment
- diagnosis
- program
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/079—Root cause analysis, i.e. error or fault diagnosis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/0709—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/0727—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a storage system, e.g. in a DASD or network based storage system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0751—Error or fault detection not based on redundancy
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/32—Monitoring with visual or acoustical indication of the functioning of the machine
- G06F11/321—Display for diagnostics, e.g. diagnostic result display, self-test user interface
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0631—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
- H04L41/0645—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis by additionally acting on or stimulating the network after receiving notifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0631—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
- H04L41/065—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis involving logical or physical relationship, e.g. grouping and hierarchies
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3409—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3466—Performance evaluation by tracing or monitoring
- G06F11/349—Performance evaluation by tracing or monitoring for interfaces, buses
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2201/00—Indexing scheme relating to error detection, to error correction, and to monitoring
- G06F2201/86—Event-based monitoring
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2201/00—Indexing scheme relating to error detection, to error correction, and to monitoring
- G06F2201/875—Monitoring of systems including the internet
Definitions
- the present invention generally relates to support for analyzing the root cause of an event that has occurred in a managed component.
- a cause event is detected from a plurality of faults detected in the system or its signs.
- various failures in a management target device or a component constituting the management target device are converted into events, and management software accumulates event occurrence information in an event DB (database).
- The management software also has an analysis engine for analyzing the causal relationships among a plurality of events that have occurred in the management target devices. This analysis engine accesses the configuration management DB, which holds the configuration information of the management target devices, and recognizes the components that lie on a given I / O (input / output) path across one or more management target devices as one group called a “topology”.
- The analysis engine applies a meta rule, which consists of a predetermined conditional statement and an analysis result, to each topology that includes the component in which the event has occurred, and thereby builds an expansion rule (deployment rule) for analyzing failures in that topology.
- The expansion rule includes a conclusion event that can be a root cause and a group of condition events that are caused when the conclusion event occurs. Specifically, an event described in the THEN part of the rule is a conclusion event that can be the root cause, and an event described in the IF part is a condition event.
- When the condition event group of the expansion rule matches the detected event group, the analysis engine displays the conclusion event described in the expansion rule as the root cause of the plurality of failures that occurred in the IT system.
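- The matching step described above can be sketched as follows. This is only a minimal illustration: it assumes that an event is a (component ID, event type) pair and that a rule applies when every condition event has been detected. The names `ExpansionRule` and `find_root_causes` and the event type names are hypothetical and do not come from this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpansionRule:
    """Hypothetical in-memory form of an expansion rule (IF ... THEN ...)."""
    condition_events: frozenset  # IF part: events caused by the conclusion event
    conclusion_event: tuple      # THEN part: candidate root cause


def find_root_causes(rules, detected_events):
    """Return the conclusion event of every rule whose IF part is fully detected."""
    detected = set(detected_events)
    return [r.conclusion_event for r in rules if r.condition_events <= detected]


# Illustrative data only: the event type names are made up for this sketch.
rules = [
    ExpansionRule(
        condition_events=frozenset({("DRIVE1", "IoError"), ("SVIF1", "LinkDown")}),
        conclusion_event=("SVIF1", "LinkDown"),
    ),
    ExpansionRule(
        condition_events=frozenset({("DRIVE1", "IoError"), ("stoC1", "PortError")}),
        conclusion_event=("stoC1", "PortError"),
    ),
]
detected = [("DRIVE1", "IoError"), ("SVIF1", "LinkDown")]
root_causes = find_root_causes(rules, detected)
print(root_causes)
```

Only the first rule applies here, because the second rule's IF part contains an event that was not detected.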
- a failure that occurs in one device may cause a plurality of device failures that have a dependency.
- the technique disclosed in Patent Document 1 can identify a failure that is a propagation source from a plurality of detected failures.
- the technology for analyzing the cause of the failure based on the pattern of the event that occurred in the component can narrow down the failure that is the origin of a plurality of failures that occurred in the IT system.
- the storage device stores configuration management information, a plurality of rules, and a plurality of general-purpose diagnostic procedures.
- the configuration management information is information related to the configuration of the plurality of managed components.
- Each of the plurality of rules is a rule indicating an association between one or more condition events corresponding to one or more events and a conclusion event that is a cause when the one or more condition events occur.
- Each of the plurality of general-purpose diagnosis procedures is a general-purpose diagnosis procedure that is associated with any one of the plurality of rules, is defined using one or a plurality of component types, and does not depend on the managed component.
- The processor identifies one or more cause candidates based on one or more target rules, that is, those rules among the plurality of rules whose condition events relate to one or more events that have occurred.
- From the plurality of general-purpose diagnostic procedures, the processor identifies the general-purpose diagnostic procedure associated with the target rule on which the cause candidate selected from the one or more cause candidates is based.
- Based on the identified general-purpose diagnostic procedure and the configuration management information, the processor generates a deployment diagnostic procedure, that is, a diagnostic procedure to be executed on one or more managed components in order to identify a more specific cause of the selected cause candidate or to update the probability of the selected cause candidate.
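- As a rough sketch of how a general-purpose procedure defined on component types might be deployed onto concrete managed components, consider the following. All names (`GeneralProcedure`, `deploy`, the rule ID `Rule1`, the action names, the `SwitchPort` type, and the switch port IDs) are assumptions made for illustration; the document does not specify a data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneralProcedure:
    """Hypothetical general-purpose diagnostic procedure, written against component types."""
    rule_id: str
    steps: tuple  # each step is (action, component_type)


def deploy(procedure, components_by_type):
    """Expand each type-level step into one step per concrete managed component."""
    deployed = []
    for action, component_type in procedure.steps:
        for component_id in components_by_type.get(component_type, []):
            deployed.append((action, component_id))
    return deployed


# Assumed repository: one general-purpose procedure per rule ID.
procedures = {
    "Rule1": GeneralProcedure(
        "Rule1",
        (("collect_error_count", "ServerIF"), ("collect_link_status", "SwitchPort")),
    ),
}
# Assumed slice of configuration management information for the selected candidate.
config = {"ServerIF": ["SVIF1"], "SwitchPort": ["SWP1", "SWP2"]}

selected_rule_id = "Rule1"  # rule underlying the selected cause candidate
deployed_procedure = deploy(procedures[selected_rule_id], config)
print(deployed_procedure)
```

The general-purpose procedure stays independent of any particular component; only the deployed list refers to concrete component IDs.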
- A configuration example of an IT system and a management computer according to the first embodiment is shown.
- A configuration example of the device table in the configuration management DB is shown.
- A configuration example of the iSCSI disk table in the configuration management DB is shown.
- A configuration example of the network I / F table in the configuration management DB is shown.
- A configuration example of the switch port table in the configuration management DB is shown.
- A configuration example of the iSCSI target table in the configuration management DB is shown.
- A configuration example of the storage port table in the configuration management DB is shown.
- A configuration example of the performance table is shown.
- A configuration example of the event queue table is shown.
- A configuration example of a meta rule is shown.
- An example of a deployment rule is shown.
- A configuration example of a meta-diagnosis procedure is shown.
- A configuration example of topology conditions is shown.
- A configuration example of a meta collection means is shown.
- A configuration example of the deployment diagnosis procedure is shown.
- An example of a deployment collection means is shown.
- A flowchart of an example of failure cause analysis processing executed by the failure analysis program is shown.
- An example of an event analysis result screen is shown.
- deployment program is shown.
- deployment program is shown.
- A flowchart of an example of processing performed by the display program is shown.
- An example of a diagnostic result screen is shown.
- A flowchart of an example of failure cause analysis processing executed by the failure analysis program in the second embodiment is shown.
- These quantities take the form of electrical or magnetic signals that can be stored, transferred, combined, compared, and otherwise manipulated. It has at times proven convenient, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, items, numbers, instructions, or the like. It should be noted, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels attached to these quantities.
- An apparatus for performing the operations herein may be specially constructed for the required purposes, or may include one or more general-purpose computers that are selectively activated or reconfigured by one or more computer programs.
- Such a computer program can be stored on a computer-readable storage medium such as, but not limited to, an optical disk, a magnetic disk, a read-only memory, a random access memory, a solid-state device or drive, or any other medium suitable for storing electronic information.
- the processor may be the subject.
- the processing disclosed with the program as the subject may be processing performed by a computer such as a management computer.
- part or all of the program may be realized by dedicated hardware.
- Various programs may be installed in the computer by a program distribution server or a computer-readable storage medium.
- the management computer has input / output devices.
- input / output devices include a display, a keyboard, and a pointer device, but other devices may be used.
- Alternatively, a serial interface or an Ethernet (registered trademark) interface may be used as the input / output device, with a display computer having a display, keyboard, or pointer device connected to that interface. In that case, display information is transmitted to the display computer and input information is received from it, so that display and input at the display computer substitute for display and input at the input / output device.
- a set of one or more computers that manage an IT system (information processing system) and display display information may be referred to as a management system.
- When the management computer displays the display information, the management computer may be a management system.
- The management system may also be a combination of the management computer and the display computer.
- Multiple computers may perform processing equivalent to that of the management computer; in that case, these multiple computers (including the display computer when the display computer performs the display) may be a management system.
- “Displaying display information” by the management computer may mean displaying the display information on a display device included in the management computer, or may mean that the management computer (for example, a server) transmits the display information to a remote display computer (for example, a client).
- In the description, “server 202” may be used when servers are not particularly distinguished, and “servers 202a and 202b” when individual servers are described separately.
- a diagnostic procedure for identifying a cause event of a failure that has occurred in the IT system is derived, and a cause event of the failure is identified based on the diagnostic procedure.
- An apparatus, method, and computer program for performing diagnosis are provided.
- the management computer 201 is a computer that manages a plurality of devices to be managed.
- the types of devices to be managed include, for example, computers (for example, servers), network devices (for example, IP (Internet Protocol) switches, routers, or FC (Fibre Channel) switches), and storage devices (for example, NAS (Network Attached Storage)).
- Examples of logical or physical elements (such as devices) included in one managed apparatus include at least one of ports, processors, storage resources, physical storage devices, programs, virtual machines, logical volumes (logical storage devices), and RAID (Redundant Arrays of Inexpensive (Independent) Disks) groups.
- each of the managed device and the elements included in the managed device may be collectively referred to as a “managed component”.
- the managed device can also be called a node device.
- FIG. 1 shows an outline of the first embodiment.
- the event analysis program result display screen 111 displays the event analysis result 101.
- the event analysis result 101 represents a failure that is a propagation source of a failure that has occurred in a plurality of devices as a cause failure candidate.
- the event analysis result 101 is a result derived by an event analysis program described later.
- the event analysis result 101 may be derived by a method disclosed in Patent Document 1, for example.
- the management computer 201 has a meta-diagnosis procedure repository 234 that stores a diagnosis procedure for identifying a cause event of an IT system failure, and a configuration management DB (database) 232 that stores configuration information of managed components.
- the meta diagnosis procedure stored in the meta diagnosis procedure repository 234 describes a diagnosis procedure to be executed for a certain configuration pattern in the IT system.
- The configuration information stored in the configuration management DB 232 includes information on each managed component, connection relationship information representing the connection relationships between managed components, and dependency relationship information representing the dependency relationships between managed components.
- When the user or the management computer 201 selects one cause failure candidate from the one or more cause failure candidates represented by the event analysis result 101, the management computer 201 executes the diagnostic procedure deployment program 223 to perform a more detailed failure cause analysis.
- The diagnostic procedure deployment program 223 acquires a meta diagnostic procedure related to the event analysis result 101 from the meta diagnostic procedure repository 234. Next, based on the configuration pattern defined in the acquired meta diagnostic procedure and the selected cause failure candidate, the diagnostic procedure deployment program 223 acquires configuration information related to the managed component to be diagnosed from the configuration management DB 232. Then, the diagnostic procedure deployment program 223 generates a deployment diagnostic procedure 124 from the acquired meta diagnostic procedure and the acquired configuration information.
- The deployment diagnostic procedure 124 includes an information collection step 131 for collecting information necessary for diagnosis, a determination step 132 for making a determination based on the collected information, and a conclusion 133 indicating a failure cause event derived from the determination result.
- The diagnosis execution program 224 executes each step defined in the generated deployment diagnostic procedure 124 and takes the obtained conclusion as the failure cause event of the IT system.
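- The collect, determine, conclude sequence of a deployed procedure can be illustrated with the short sketch below. It is not the diagnosis execution program itself: the function names, the CRC error metric, the threshold of 100, and the conclusion texts are all hypothetical, and a real collection step would query the managed component rather than return fixed values.

```python
def run_deployed_procedure(collect, determine, conclusions):
    """Run one collect -> determine -> conclude pass and return the conclusion."""
    collected = collect()            # information collection step
    outcome = determine(collected)   # determination step
    return conclusions[outcome]      # conclusion for that outcome


# Hypothetical collection step: a real system would query the managed component.
def collect_port_stats():
    return {"component": "SVIF1", "crc_errors": 120}


# Hypothetical determination step with an assumed threshold of 100 errors.
def too_many_crc_errors(stats):
    return stats["crc_errors"] > 100


conclusions = {
    True: "Suspected physical-layer fault on SVIF1",
    False: "No fault found on SVIF1; diagnose the next component",
}
result = run_deployed_procedure(collect_port_stats, too_many_crc_errors, conclusions)
print(result)
```

Keeping the collection and determination steps as separate callables mirrors the separation between the collection means and the determination programs described in this section.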
- The diagnosis result display screen 113 displays a diagnosis result 141 according to the failure cause event.
- By automatically deploying the diagnostic procedure necessary to identify the cause of the propagation-source failure and executing the diagnosis, the cause of the failure can be identified quickly.
- failure recovery measures can be quickly determined based on the identified cause event, and IT system downtime can be shortened. As a result, it is possible to reduce economic damage such as business opportunity loss caused by the stoppage of the IT system.
- FIG. 2 shows a configuration example of the IT system and the management computer 201 according to the first embodiment.
- the management computer 201 is a computer that manages the IT system.
- The IT system includes one or more servers (or other computers) 202a, 202b, and 202c, one or more storage devices 204, and one or more network switches (or other network devices such as IP switches) 203.
- The servers 202a, 202b, and 202c, the network switch 203, and the storage device 204 are communicably connected via a network 205 such as a LAN (local area network) (the network switch 203 in the example of FIG. 2).
- The management computer 201 includes a CPU 211, a memory 212, a disk 213, an input device 214, an output device 217, and a network interface device (network I / F) 215, and these devices may be connected via a system bus 216.
- the disk 213 is, for example, an HDD (Hard Disk Drive), but another nonvolatile storage device such as an SSD (Solid State Drive) may be employed instead.
- A single determination program 226 may be provided, or one may be provided for each determination in the meta-diagnosis procedure.
- The term “means” in “meta collection means” and “deployment collection means” in the present embodiment (and in the second embodiment) may be replaced with “method”, “definition”, or “command”.
- The deployment diagnostic procedure repository 235 and the deployment collection means repository 237 store information so that it can be reused once it has been generated; the management computer 201 need not have these repositories.
- the performance table 238 is a database that stores performance information of managed components collected from managed devices by the performance acquisition program 229.
- The performance acquisition program 229 and the performance table 238 are a program and information used to illustrate an example of the “diagnosis procedure” described in the present embodiment; the management computer 201 need not have them.
- When the performance table 238 is not included in the management computer 201, the management computer 201 may acquire the performance information by accessing the management target device via the network 205.
- The failure analysis program 221, event analysis program 222, diagnostic procedure deployment program 223, diagnosis execution program 224, display program 225, one or more determination programs 226, event reception program 227, configuration acquisition program 228, and performance acquisition program 229 are stored in the memory 212 and executed by the CPU 211.
- The meta rule repository 231, configuration management DB 232, event queue table 233, meta diagnostic procedure repository 234, deployment diagnostic procedure repository 235, meta collection means repository 236, deployment collection means repository 237, and performance table 238 are stored in the disk 213. At least one of these programs or at least one of these pieces of data may be stored in another appropriate storage area that the CPU 211 can refer to.
- the network I / F 215 acquires component-related information such as configuration information and performance information from managed devices such as the server 202, the network switch 203, and the storage device 204 connected via the network 205.
- the output device 217 is a device that outputs (typically displays) information from the display program 225.
- the input device 214 is a device for inputting a user instruction. For example, a keyboard, a pointer device, or the like can be used as the input device 214, and a display, a printer, or the like can be used as the output device 217, but other devices may be used.
- Each server 202a, 202b, 202c may be a managed device that executes a program such as an application.
- the server 202a may be a general-purpose computer including a memory 242, a network I / F 243, and a CPU 246 connected thereto.
- the server 202a may have a nonvolatile storage device such as an HDD in addition to the memory 242.
- The server 202a may include a monitoring agent (program) 245 that monitors the state of the server 202a and, when a specific state change (event) is detected, transmits event information representing the event to the management computer 201 via the network 205.
- the monitoring agent 245 may be executed by the CPU 241. Notifying an event may be transmitting event information representing the event.
- the server 202a may include an iSCSI (Internet Small Computer System Interface) initiator 244.
- The server 202a can use the iSCSI disk 251, which is realized by the iSCSI initiator 244 and the storage capacity of the storage device 204, virtually like a local HDD.
- Other communication and storage protocols may be used instead of or in addition to iSCSI.
- Although the configuration of the server 202a has been described, the servers 202b and 202c may have the same configuration as the server 202a.
- Each storage device 204 may be a management target device for providing a storage capacity (logical volume) for an application operating on the server 202 (or for other purposes).
- the storage apparatus 204 includes an I / O port 263, a disk 262, and a storage controller (for example, CPU) 261 connected to them. There may be a plurality of I / O ports 263.
- The disk 262 may be a single HDD or a RAID group composed of a plurality of HDDs, and the nonvolatile storage device in the disk 262 may be another storage device such as an SSD.
- the storage device 204 may be configured to provide an iSCSI logical volume as a storage capacity to the servers 202a and 202b.
- the two servers 202a and 202b may be connected to the storage apparatus 204 via the network switch 203, and the storage apparatus 204 may provide the iSCSI logical volume to each server 202a and 202b.
- the storage apparatus 204 may include a monitoring agent (program) 264 that monitors the state of the storage apparatus 204 and transmits event information to the management computer 201.
- the monitoring agent 264 may be executed by the storage controller 261.
- the monitoring agent 245 of the server 202 may be able to monitor the state of the storage apparatus 204.
- the network switch 203 has ports 271a to 271d that receive data transmitted from the server 202 or the storage apparatus 204 and transmit received data.
- The network switch 203 may also include a monitoring agent (program) 272 that monitors the state of the network switch 203 and, when a specific state change (event) is detected, sends event information to the management computer 201 via the network 205.
- the monitoring agent 272 may be executed by a CPU (not shown) in the network switch 203.
- the monitoring agent 245 of the server 202 may monitor the state of the network switch 203.
- the configuration management DB 232 stores configuration information of managed devices acquired by the configuration acquisition program 228 from a monitoring agent or the like.
- the configuration information includes information indicating connection relations, dependency relations, and the like between managed components. Examples of configuration information of the server 202, the network switch 203, and the storage device 204 are shown in FIGS. Note that the configuration management DB 232 may not include some of the tables in FIGS. 3 to 9, or may not include some items in at least one table.
- the data representation format and data structure of each item stored in the configuration management DB 232 may not be the same as the data representation format and data structure of the managed device.
- the management computer 201 may receive them according to the data structure and expression format of the management target device.
- information in the table in the configuration management DB 232 may be updated as the configuration of the managed component is changed.
- a log related to the update may be stored as history information.
- the past configuration management DB 232 may be restored based on the log.
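- One possible way to restore a past state from such a log is to replay the logged updates up to the desired point in time. The sketch below assumes a log of (timestamp, operation, row ID, values) entries and a table held as a dictionary; the log format, the operation names, the function name `restore_table`, and the device names are hypothetical, since the document does not describe how the log is structured.

```python
import copy


def restore_table(initial_rows, update_log, as_of):
    """Rebuild a table's rows as they were at time `as_of` by replaying the log."""
    rows = copy.deepcopy(initial_rows)
    for timestamp, op, row_id, values in sorted(update_log, key=lambda e: e[0]):
        if timestamp > as_of:
            break  # later updates are not part of the requested past state
        if op == "upsert":
            rows[row_id] = values
        elif op == "delete":
            rows.pop(row_id, None)
    return rows


# Assumed starting state and update history of a device table.
initial = {"SvA": {"device_name": "ServerA", "type": "Server"}}
log = [
    (10, "upsert", "SvB", {"device_name": "ServerB", "type": "Server"}),
    (20, "delete", "SvA", None),
]
print(restore_table(initial, log, as_of=15))
```

The starting rows are copied before replay, so the current table is left untouched while a past state is reconstructed.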
- FIG. 3 shows a configuration example of the device table in the configuration management DB 232.
- the device table 300 has a record for each device to be managed, and each record has three fields, that is, a device ID 301, a device name 302, and a type 303.
- the ID 301 stores a value that uniquely identifies the management target device.
- the device name 302 stores a value that allows the administrator to uniquely identify the device.
- the type 303 stores an identifier indicating the type of device.
- FIG. 4 shows a configuration example of the iSCSI disk table in the configuration management DB 232.
- the iSCSI disk table 400 is a table showing the configuration of the iSCSI disk 251 used by the server 202.
- The iSCSI disk table 400 has a record for each iSCSI disk 251, and each record has seven fields, that is, ID 401, disk drive name 402, device ID 403, iSCSI initiator name 404, connection destination iSCSI target 405, LUN ID 406, and type 407.
- the ID 401 stores a value that uniquely identifies the iSCSI disk (managed component) 251.
- the disk drive name 402 stores a value that allows the server 202 to uniquely identify the iSCSI disk 251.
- the device ID 403 stores an identifier indicating the server 202 that uses the iSCSI disk 251.
- the iSCSI initiator name 404 stores the identifier of the network I / F 243 on the server 202 that is used for communication with the storage apparatus 204 in which the actual iSCSI disk 251 exists.
- the connection destination iSCSI target 405 stores the identifier of the I / O port 263 on the storage apparatus 204 used for communication with the storage apparatus 204 in which the substance of the iSCSI disk 251 exists.
- the LUN ID 406 stores an identifier of a logical volume (logical volume in the storage apparatus 204) as an entity of the iSCSI disk 251.
- the type 407 stores an identifier indicating the type of managed component (iSCSI disk).
- the record on the first line means the following.
- the iSCSI disk indicated by the disk drive name “D:” on the server identified by the identifier “SvA” is identified by the identifier “DRIVE1”, and the component type is “iScsiDisk”.
- a logical volume having a LUN ID of 0 is provided from the storage apparatus to the server via a storage port (port of the storage apparatus) indicated by the iSCSI target name of stoC1.
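- For illustration, one record of the iSCSI disk table could be modeled as follows. The class and attribute names are assumptions chosen for this sketch; the values come from the first-line record described above, except the iSCSI initiator name, which is not stated for this record and is assumed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IscsiDiskRecord:
    """One row of the iSCSI disk table; comments give the reference numerals."""
    id: str                    # ID 401
    disk_drive_name: str       # disk drive name 402
    device_id: str             # device ID 403
    iscsi_initiator_name: str  # iSCSI initiator name 404
    iscsi_target: str          # connection destination iSCSI target 405
    lun_id: int                # LUN ID 406
    type: str                  # type 407


first_row = IscsiDiskRecord(
    id="DRIVE1",
    disk_drive_name="D:",
    device_id="SvA",
    iscsi_initiator_name="com.hitachi.sva",  # assumed; not stated for this row
    iscsi_target="stoC1",
    lun_id=0,
    type="iScsiDisk",
)


def describe(row):
    """Render a row as a sentence similar to the description above."""
    return (f"iSCSI disk {row.id} ({row.disk_drive_name}) on server {row.device_id} "
            f"is backed by LUN {row.lun_id} via target {row.iscsi_target}")


print(describe(first_row))
```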
- FIG. 5 shows a configuration example of the network I / F table in the configuration management DB 232.
- the network I / F table 500 has a record for each network I / F 243, and each record has five fields, that is, an ID 501, an I / F name 502, a device ID 503, an iSCSI initiator name 504, and a type 505.
- the ID 501 stores a value that uniquely identifies the network I / F 243 (managed component).
- the I / F name 502 stores a value that serves as an identifier of the network I / F 243 in the server 202.
- the device ID 503 stores the identifier of the server 202 having the network I / F 243.
- the iSCSI initiator name 504 stores the identifier of the network I / F 243 on the server 202 used for communication with the storage apparatus in which the iSCSI disk entity exists.
- the type 505 stores an identifier indicating the type of managed component.
- the record on the first line means the following.
- the network I / F indicated by the I / F name “eth0” exists in the server identified by the identifier “SvA”, is identified by the identifier “SVIF1”, and the component type is “ServerIF”.
- the iSCSI initiator name used as an identifier during communication with the storage apparatus is “com.hitachi.sva”.
- FIG. 6 shows a configuration example of the switch port table in the configuration management DB 232.
- the switch port table 600 has a record for each I / O port 271 that the network switch 203 has, and each record has five fields, that is, ID 601, port number 602, device ID 603, connection destination port 604, and type 605.
- the ID 601 stores a value that uniquely identifies the I / O port 271 (managed component).
- the port number 602 stores a value that uniquely identifies the I / O port 271 in the network switch 203.
- the device ID 603 stores the identifier of the network switch 203 having the I / O port 271.
- the connection destination port 604 stores the identifier of the network I / F 243 of the server 202 connected to the I / O port 271 or the I / O port 263 of the storage apparatus 204.
- because data output from the network I / Fs of a plurality of servers or from the I / O ports of the storage apparatus may pass through one port of the network switch, a plurality of identifiers may be stored in the connection destination port 604.
- the type 605 stores an identifier indicating the type of managed component. For example, the record on the first line means the following.
- the I / O port indicated by the number “0” is in the network switch identified by the identifier “SwD”, is identified by the identifier “SWPORT1”, has the component type “NWSswitchPort”, and is connected to the I / O port identified by “STPORT1”.
- FIG. 7 shows a configuration example of the iSCSI target table in the configuration management DB 232.
- the iSCSI target table 700 has a record for each iSCSI target, and each record has two fields, that is, an iSCSI target name 701 and a connection permitted iSCSI initiator 702.
- the iSCSI target name 701 stores the iSCSI target name possessed by each iSCSI target.
- the connection-permitted iSCSI initiator 702 stores an iSCSI initiator name that serves as an identifier of the network I / F 243 on the server that is permitted to access the logical volume belonging to the iSCSI target.
- the record on the first line means the following.
- the network I / Fs 243 on the servers identified by “com.hitachi.sva” and “com.hitachi.svb” are permitted to access the logical volumes belonging to this iSCSI target.
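
The permission lookup implied by the iSCSI target table 700 can be sketched as follows. This is an illustration only: the dictionary layout and the function name are assumptions, and the target name “com.hitachi.stoC1” is borrowed from the storage port example rather than stated for this record.

```python
# iSCSI target name 701 -> set of connection-permitted iSCSI initiators 702.
# The key is an assumed target name; the initiators come from the text.
ISCSI_TARGET_TABLE = {
    "com.hitachi.stoC1": {"com.hitachi.sva", "com.hitachi.svb"},
}


def is_connection_permitted(target_name, initiator_name, table=ISCSI_TARGET_TABLE):
    """Return True if the initiator may access volumes behind the target."""
    return initiator_name in table.get(target_name, set())
```
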
- FIG. 8 shows a configuration example of the storage port table in the configuration management DB 232.
- the storage port table 800 has a record for each I / O port 263 that the storage apparatus 204 has, and each record has five fields, that is, an ID 801, a port number 802, an apparatus ID 803, an iSCSI target ID 804, and a type 805.
- the ID 801 stores a value that uniquely identifies the I / O port 263 (managed component).
- the port number 802 stores a value that uniquely identifies the I / O port 263 in the storage apparatus 204.
- the device ID 803 stores the identifier of the storage device 204 having the I / O port 263.
- the iSCSI target ID 804 stores the identifier of the iSCSI target that uses the I / O port 263.
- the type 805 stores an identifier indicating the type of managed component.
- the record on the first line means the following.
- the I / O port indicated by the number “0” is in the storage device identified by the identifier “StoC”, is identified by the identifier “STPORT1”, has the component type “StorageiSCIPort”, and is used for the iSCSI target identified by “com.hitachi.stoC1”.
- the performance table 238 stores the performance information of the managed component that constitutes the managed device acquired by the performance acquisition program 229 from the monitoring agent or the like.
- FIG. 9 shows a configuration example of the performance table 238.
- the performance table 238 has a record for each piece of performance information, and each record has five fields, that is, a component ID 901, a metric 902, a time 903, a value 904, and a unit 905.
- the component ID 901 stores a value that uniquely identifies the management target component from which the performance information is acquired.
- the metric 902 stores a value for identifying an observation item (metric) of the performance of the managed component.
- the time 903 stores the time when the performance of the managed component was observed. The time is expressed here in units of year, month, and hour, but a coarser or finer unit may be used.
- the value 904 stores a value observed as the performance of the management target component.
- a unit 905 stores a unit for the observed value.
- the record on the first line means the following.
- for the metric TxDropPacketNum of the managed component identified by the identifier “SWPORT1” (here, port 0 of the network switch D), the value “0 Packets/sec” was observed.
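
A record of the performance table 238 and a lookup of the most recent observation might be sketched as below. The class name, the helper function, the observation times, and the second sample value are all assumptions made for illustration; only the component ID, metric, value 0, and unit of the first record come from the text.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class PerformanceRecord:
    component_id: str  # component ID 901
    metric: str        # metric 902
    time: datetime     # time 903
    value: float       # value 904
    unit: str          # unit 905


def latest_value(records: List[PerformanceRecord], component_id: str,
                 metric: str) -> Optional[PerformanceRecord]:
    """Return the newest record for the component and metric, or None."""
    matches = [r for r in records
               if r.component_id == component_id and r.metric == metric]
    return max(matches, key=lambda r: r.time, default=None)


# Sample data: times and the second value are hypothetical.
records = [
    PerformanceRecord("SWPORT1", "TxDropPacketNum",
                      datetime(2013, 1, 1, 0), 0.0, "Packets/sec"),
    PerformanceRecord("SWPORT1", "TxDropPacketNum",
                      datetime(2013, 1, 1, 1), 5.0, "Packets/sec"),
]
```
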
- FIG. 10 shows a configuration example of the event queue table 233.
- the event queue table 233 stores event information acquired by the event reception program 227 from the monitoring agent of the management target device.
- the event queue table 233 has a record for each event information, and each record has five fields, that is, an event ID 1001, a device ID 1002, a component ID 1003, an event type 1004, and an occurrence time 1005.
- the event ID 1001 stores an identifier for uniquely identifying event information.
- the device ID 1002 stores an identifier for uniquely identifying a management target device from which event information is acquired.
- the component ID 1003 stores an identifier for uniquely identifying the managed component from which the event information is acquired.
- the event type 1004 stores an identifier indicating the type of event that has occurred in the managed component.
- the occurrence time 1005 stores the time when the event occurred (the time included in the acquired event information).
- the occurrence time 1005 may store the time when the management computer 201 receives the event information.
- the value of the component ID 1003 may be equal to the value of the device ID 1002.
- the record on the first line means the following. “TxDropPacketNumError (transmission drop packet number error)” occurred at 0:00 on January 1, 2013 at the I / O port 271, whose component ID is SWPORT1, of the network switch 203, whose device ID is SwD.
- the event analysis program 222 executes failure cause analysis.
- the failure cause analysis may be the same as the analysis described in Patent Document 1, for example. Then, the event analysis program 222 narrows down the faults that are the propagation sources of a plurality of faults that have occurred in the IT system, and then performs a diagnosis to identify the cause of the fault that has become the propagation source.
- the meta rule is information used by the event analysis program 222 during analysis.
- a meta-rule is information that indicates the correspondence between a combination of events that can occur in a certain topology pattern (a group of one or more managed components that exist on a certain I / O path) and a failure cause candidate for the case where those events occur at the same time.
- the cause candidate defined in the meta rule indicates a failure that is a propagation source of the system failure.
- the meta-rule has information for identifying a meta-diagnosis procedure used when executing a detailed diagnosis for a failure cause event indicated by the meta-rule and information on a managed component that is a starting point of a topology to be diagnosed.
- the meta-rule is described here in the IF-THEN format. However, the meta-rule may be in another format as long as it describes the cause event of the system failure and the observation events (observed events) caused by that cause event.
- FIG. 11A shows a configuration example of the metarule 1100 that resides in the metarule repository 231.
- a rule can be divided into two parts (fields), a first part called “IF” part 1111 and a second part called “THEN” part 1112.
- the IF unit 1111 may include one or more condition elements.
- the meta-rule 1100 indicates that when an event (conditional event) of the IF unit 1111 is detected, an event (conclusion event) of the THEN unit 1112 is a cause of failure. Therefore, if the status of the management target component represented by the THEN unit 1112 becomes normal, the problem represented by the IF unit 1111 is expected to be solved.
- the event analysis program 222 analyzes the events represented by the event information stored in the event queue table 233 of FIG. 10 as observation events. Therefore, the IF unit 1111 has an entry for each condition element, and each entry has a device type 1101, a component type 1102, and an event type 1103. That is, managed devices and their elements are classified into several types in the management computer 201, and a condition element of the IF unit 1111 indicates that a managed component of the specified type is in the state indicated by the specified event type.
- when a condition element indicates an event related to the apparatus itself rather than an element of the apparatus, the value of the component type 1102 for that condition element may be equal to the device type 1101.
- the metarule 1100 includes a metarule ID 1113, which is a field for storing a metarule ID that uniquely identifies each metarule, and a topology condition 1114, which is a field for storing the condition of the topology to which the metarule 1100 is applied when the metarule 1100 is applied to the actual configuration of the IT system to be managed to generate an expansion rule.
- here, a method of acquiring topology information from the configuration management DB 232 is taken as an example. In the illustrated topology condition example, the topology to which the meta-rule is applied is the combination of the iSCSI disk, the network I / F of the server used to provide the storage capacity of the iSCSI disk, the I / O port of the storage apparatus, and the I / O port of the network switch between that network I / F and that I / O port.
- to execute a diagnosis that specifies the cause event in more detail based on the conclusion derived using the meta-rule, the meta-rule 1100 also includes a field 1115 for storing the identifier of the meta-diagnosis procedure and the condition of the managed component that is the starting point of the topology to be diagnosed.
- in the diagnosis, the meta-diagnosis procedure identified from the meta-diagnosis procedure ID associated with the meta-rule (the meta-diagnosis procedure ID described in the field 1115 of the meta-rule) is used.
- the field 1115 may store a plurality of combinations (each a combination of a meta-diagnosis procedure identifier and a starting point condition).
- the identifier of one meta-diagnosis procedure may be stored in the field 1115 of each of a plurality of meta-rules 1100.
- the topology to be diagnosed may be different from the topology to which the metarule 1100 is applied. The topology to be diagnosed is described later.
- the meta-rule “MetaRule1” in FIG. 11A indicates that, when the two observation events “abnormal disk access response time of the iSCSI disk 251 on the server 202” and “abnormal number of transmission drop packets of the I / O port 271 in the network switch 203” are detected, it is concluded that “abnormal number of transmission drop packets of the I / O port 271 in the network switch 203” is the bottleneck. Further, when analysis is performed using the meta-rule “MetaRule1”, the information of the topology to which the meta-rule is applied is acquired from the configuration management DB 232 or the like based on the condition stored in the topology condition 1114.
- by defining the diagnosis target topology separately from the managed components in the topology analyzed by the event analysis program 222, managed components in the periphery of that topology can also be included as diagnosis targets.
- as a condition element included in the IF unit 1111, it may be defined that a certain component is normal (a failure event has not occurred). Further, the event type represented by the event type 1103 of the THEN unit 1112 may be newly defined, and need not be the event type of an event received by the event reception program 227.
- the expansion rule is information indicating a correspondence relationship between a combination of events that can occur in the IT system and an event that is a cause of a failure when those events occur.
- the cause candidate defined in the expansion rule indicates a failure that is a propagation source of the system failure.
- the expansion rule is a rule generated by searching the managed IT system for a topology to which the meta rule 1100 can be applied, based on the topology condition 1114 of the meta rule 1100, and applying the meta rule 1100 to the topology found.
- the expansion rule is information used by the event analysis program 222 during analysis.
- the expansion rule is described in the IF-THEN format as in the case of the meta rule, but may be in other formats as long as the cause event of the system failure and the observation event caused by the cause event are described.
- FIG. 11B shows a configuration example of an expansion rule.
- like the metarule 1100, the expansion rule 1150 can be divided into two parts (fields), that is, a first part called an IF part 1151 and a second part called a THEN part 1152.
- the IF unit 1151 may include one or more condition elements.
- the expansion rule 1150 indicates that when an event (condition event) of the IF unit 1151 is detected, an event (conclusion event) of the THEN unit 1152 causes a failure. Therefore, if the status of the managed component represented by the THEN unit 1152 becomes normal, it is expected that the problem represented by the IF unit 1151 will be solved.
- the IF unit 1151 of the expansion rule 1150 has an entry for each condition element, and each entry has fields of a device ID 1161, a component ID 1162, an event type 1163, and a reception flag 1164. That is, the condition element of the IF unit 1151 indicates that the state indicated by the information of the event type 1163 occurs in the management target component specified by the device ID 1161 and the component ID 1162.
- the reception flag 1164 stores the result of whether or not the event indicated by the condition element is actually received.
- the values stored in the device ID 1161 and the component ID 1162 are, among the device IDs and component IDs specified from the configuration management DB 232 based on the topology condition 1114 of the metarule 1100, the values corresponding to the types defined in the device type 1101 and the component type 1102.
- the expansion rule 1150 includes an expansion rule ID 1153, which is a field for storing an expansion rule ID that uniquely identifies the expansion rule 1150. Further, to execute a diagnosis that specifies the cause event in more detail based on the conclusion derived using the expansion rule 1150, the expansion rule 1150 includes a field 1155 for storing the identifier of the meta diagnosis procedure and the identifier of the managed component that is the starting point of the topology to be diagnosed. Among the values stored in the field 1155, the meta diagnosis procedure ID is equal to the value stored in the field 1115 of the meta rule 1100 used when generating the expansion rule 1150.
- the device ID and component ID stored as the starting point are, among the device IDs and component IDs specified from the configuration management DB 232 based on the topology condition 1114 of the meta rule 1100, the IDs corresponding to the “starting point condition” stored in the field 1115.
- FIG. 11B shows expanded rules 1150a to 1150d generated by expanding the meta-rule 1100 of FIG. 11A based on the configuration management DB 232 shown in FIGS.
- in the example, the meta diagnosis procedure identified by “MetaDiagnosticProc1” is used, and diagnosis is performed on the topology starting from the managed component whose device ID is SwD and whose component ID is SWPORT1. Note that, as a condition element included in the IF unit 1151, it may be defined that a certain component is normal (a failure event has not occurred).
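
The following sketch illustrates how an expansion rule with reception flags could be matched against received events. It is a simplification under stated assumptions: the class and function names, the rule ID “ExpandedRule1-1”, and the event type name “DiskAccessResponseTimeError” are hypothetical, and the rule is treated as satisfied only when every condition element has been received, which the text does not prescribe.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Condition:
    device_id: str          # device ID 1161
    component_id: str       # component ID 1162
    event_type: str         # event type 1163
    received: bool = False  # reception flag 1164


@dataclass
class ExpansionRule:
    rule_id: str                 # expansion rule ID 1153
    if_part: List[Condition]     # IF unit 1151
    then_part: Condition         # THEN unit 1152
    meta_diagnostic_proc_id: str # part of field 1155
    origin: Tuple[str, str]      # (device ID, component ID) in field 1155


def apply_event(rule, device_id, component_id, event_type):
    """Set the reception flag of every condition element the event matches."""
    for cond in rule.if_part:
        if (cond.device_id, cond.component_id, cond.event_type) == (
                device_id, component_id, event_type):
            cond.received = True


def is_satisfied(rule):
    """Assumed policy: all condition elements must have been received."""
    return all(cond.received for cond in rule.if_part)


rule = ExpansionRule(
    rule_id="ExpandedRule1-1",  # hypothetical ID
    if_part=[
        Condition("SvA", "DRIVE1", "DiskAccessResponseTimeError"),  # hypothetical name
        Condition("SwD", "SWPORT1", "TxDropPacketNumError"),
    ],
    then_part=Condition("SwD", "SWPORT1", "TxDropPacketNumError"),
    meta_diagnostic_proc_id="MetaDiagnosticProc1",
    origin=("SwD", "SWPORT1"),
)
```

Once `is_satisfied(rule)` holds, the THEN unit names the cause candidate, and the procedure ID and origin in the rule tell the diagnosis step where to start.
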
- the meta-diagnosis procedure is a series of diagnosis procedures executed to identify the failure cause event after narrowing down the failure that becomes the propagation source of the failure of the IT system by the event analysis program 222.
- the meta-diagnosis procedure includes a step of collecting information necessary for diagnosis, a step of making a determination based on the collected information, and a conclusion derived based on one or a plurality of determination results.
- in the meta-diagnosis procedure, the specific managed component on which the procedure is executed is not defined; instead, the topology pattern and configuration pattern on which the procedure is executed are defined.
- FIG. 12 shows a configuration example of the meta diagnosis procedure 1200 resident in the meta diagnosis procedure repository 234.
- the meta diagnosis procedure 1200 includes a basic object 1201 that stores information related to the meta diagnosis procedure 1200, an information collection object 1202 that stores a means for collecting information necessary for diagnosis, a determination object 1203 that stores a means for making a determination based on the collected information, and a conclusion object 1204 that stores conclusion information derived based on one or a plurality of determination results.
- the meta-diagnosis procedure 1200 has an object structure here, but other data structures may be used as long as they are composed of a combination of information on the means for collecting information, information on the determination steps, and information on the conclusions derived based on the determination results.
- among the objects 1201 to 1204, a plurality of each object other than the basic object 1201 can exist.
- the meta diagnosis procedure 1200 illustrated in FIG. 12 includes a basic object 1201, two information collection objects 1202a and 1202b, two determination objects 1203a and 1203b, and three conclusion objects 1204a, 1204b, and 1204c.
- the basic object 1201 has five fields, that is, a type 1211, an ID 1212, a meta diagnosis procedure ID 1213, a topology condition ID 1214, and a Next ID 1215.
- the type 1211 stores an identifier for identifying the type of object (for example, “Start” indicating basic information).
- the ID 1212 stores an identifier for uniquely identifying the object.
- the meta diagnosis procedure ID 1213 stores an identifier for uniquely identifying the meta diagnosis procedure 1200.
- the topology condition ID 1214 stores an identifier for uniquely identifying a topology condition to which the meta-diagnosis procedure 1200 is applied.
- NextID 1215 stores the identifier of the object storing the step to be executed first.
- the information collection object 1202 has four fields, that is, a type 1221, an ID 1222, a means ID 1223, and a NextID 1224.
- the type 1221 stores an identifier for identifying the type of the object (for example, “CollectInfo” indicating that the information collecting unit is stored).
- the ID 1222 stores an identifier for uniquely identifying an object, like the ID 1212.
- the means ID 1223 stores an identifier for uniquely identifying the meta collection means. Based on the identifier stored in the means ID 1223, the meta collection means necessary for diagnosis is searched for in the meta collection means repository 236.
- the NextID 1224 stores the identifier of the object that stores the step to be executed next.
- the information collection object 1202a indicates that, at the time of diagnosis execution, the meta collection means identified by the identifier “GetInfo1” is acquired from the meta collection means repository 236, information is collected based on that means, and then the step indicated by the object whose ID is “2” is executed.
- the determination object 1203 has five fields, that is, a type 1231, an ID 1232, a determination program ID 1233, an argument 1234, and a Decision Map 1235.
- the type 1231 stores an identifier for identifying the type of the object (for example, “Decision” indicating that information regarding the determination step is stored).
- the ID 1232 stores an identifier for uniquely identifying the object.
- the determination program ID 1233 stores an identifier for uniquely identifying a program that performs determination based on the collected information. Based on the identifier stored in the determination program ID, the determination program 226 resident in the memory 212 is called.
- the argument 1234 stores identification information of information used when the determination is executed by the determination program 226.
- the Decision Map 1235 stores a list of combinations of the key 1236 and the NextID 1237.
- the key 1236 stores a value that can be a return value of the determination program 226, and the NextID 1237 stores an identifier of the object. That is, the Decision Map 1235 stores information for determining the next step to be executed according to the return value of the determination program 226 at the time of diagnosis execution.
- the determination object 1203a indicates that, at the time of diagnosis execution, the determination program 226 identified by the identifier “determination program 1” is started and the information collected by the object 1202a identified by the identifier “1” is passed as an argument to “determination program 1”; if the return value is “YES”, the step indicated by the object 1202b identified by the identifier “3” is executed, and if the return value is “NO”, the step indicated by the object 1204a identified by the identifier “4” is executed.
- “determination program 1” may be, for example, a program that determines whether the rate of increase in the performance information given as an argument is greater than or equal to a predefined value, and returns “YES” if it is and “NO” if it is less than that value.
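
A determination program of the kind just described could look like the sketch below. The text does not define how the rate of increase is computed, so this example assumes the relative change between the first and last observed values, and the threshold of 0.5 is an arbitrary illustrative default.

```python
def determination_program_1(values, threshold=0.5):
    """Return "YES" if the rate of increase is at least the threshold, else "NO".

    Assumption: rate of increase = (last - first) / first over the samples.
    """
    if len(values) < 2:
        return "NO"  # not enough samples to compute a rate
    first, last = values[0], values[-1]
    if first == 0:
        # The relative increase is undefined; treat any growth from zero
        # as exceeding the threshold (an assumption of this sketch).
        return "YES" if last > 0 else "NO"
    rate = (last - first) / first
    return "YES" if rate >= threshold else "NO"
```
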
- the conclusion object 1204 has three fields: type 1241, ID 1242, and Conclusion 1243.
- the type 1241 stores an identifier (for example, “End” indicating that information regarding a conclusion is stored) for identifying the type of the object.
- the ID 1242 stores an identifier for uniquely identifying the object, like the ID 1212.
- the Conclusion 1243 stores information that is the conclusion of the diagnosis when the diagnosis is executed. For example, information stored in the Conclusion 1243 may be displayed on the output device 217. For example, when the conclusion object 1204a is selected as the conclusion based on the determination result of the determination object 1203a when the diagnosis is executed, “insufficient bandwidth of ‘network switch port’” is displayed on the output device 217 as the diagnosis result. Here, in place of ‘network switch port’, the identification information of the network switch port acquired from the configuration management DB 232 based on the topology condition indicated by the topology condition ID 1214 is displayed.
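
The object structure described above can be read as a small graph that is walked from the basic object to a conclusion object. The sketch below shows one way to do that. The dictionary keys, the function name, the Start object ID “0”, the topology condition ID, the means ID “GetInfo2”, “determination program 2”, and the objects with IDs “5” to “7” are assumptions; the IDs “1” to “4”, “GetInfo1”, and the Decision Map of object “2” follow the text.

```python
META_DIAGNOSTIC_PROC_1 = [
    {"type": "Start", "id": "0", "meta_diagnostic_proc_id": "MetaDiagnosticProc1",
     "topology_condition_id": "TopologyCondition1", "next_id": "1"},
    {"type": "CollectInfo", "id": "1", "means_id": "GetInfo1", "next_id": "2"},
    {"type": "Decision", "id": "2", "program_id": "determination program 1",
     "argument": "1", "decision_map": {"YES": "3", "NO": "4"}},
    {"type": "CollectInfo", "id": "3", "means_id": "GetInfo2", "next_id": "5"},
    {"type": "End", "id": "4",
     "conclusion": "insufficient bandwidth of network switch port"},
    {"type": "Decision", "id": "5", "program_id": "determination program 2",
     "argument": "3", "decision_map": {"YES": "6", "NO": "7"}},
    {"type": "End", "id": "6", "conclusion": "hypothetical conclusion B"},
    {"type": "End", "id": "7", "conclusion": "hypothetical conclusion C"},
]


def run_procedure(objects, collectors, programs):
    """Walk the procedure and return (conclusion, list of visited object IDs)."""
    by_id = {obj["id"]: obj for obj in objects}
    start = next(obj for obj in objects if obj["type"] == "Start")
    collected = {}          # object ID -> information collected by that object
    route = [start["id"]]
    current = by_id[start["next_id"]]
    while True:
        route.append(current["id"])
        if current["type"] == "CollectInfo":
            collected[current["id"]] = collectors[current["means_id"]]()
            current = by_id[current["next_id"]]
        elif current["type"] == "Decision":
            result = programs[current["program_id"]](collected[current["argument"]])
            current = by_id[current["decision_map"][result]]
        elif current["type"] == "End":
            return current["conclusion"], route
        else:
            raise ValueError("unknown object type: " + current["type"])
```

The collectors and determination programs are passed in as plain callables keyed by their IDs, so the same walker can run any procedure that uses these three object types.
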
- FIG. 13 shows a configuration example of the topology condition to which the meta diagnosis procedure 1200 is applied.
- the topology condition 1300 has two fields, that is, a topology condition ID 1301 and a condition 1302.
- the topology condition ID 1301 stores an identifier for uniquely identifying the topology condition.
- the value stored in the topology condition ID 1301 is equal to the identifier stored in the topology condition ID 1214 of the basic object 1201 in FIG.
- the condition 1302 stores information regarding the condition of the topology to which the meta diagnosis procedure 1200 is applied.
- here, a method for acquiring topology information from the configuration management DB 232 is taken as an example. For example, when topology information is acquired based on the illustrated condition 1302, a combination of the following records is acquired: (1) a record in which the value of the device ID 603 in the switch port table 600 is equal to the device ID of the starting point stored in the field 1155 of the expansion rule, and (2) a record in which the value of the ID 501 in the network I / F table 500 is equal to the value of the connection destination port in the record of the switch port table 600 in (1).
- in this way, the topology including the starting managed component represented by the condition 1302 and the managed components associated with the starting managed component in the condition 1302 is specified.
- the topology condition stored in the condition 1302 does not have to be in the format shown in FIG. 13 as long as a method for acquiring topology information is described.
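
The record combination described for the condition 1302 amounts to a join between the switch port table 600 and the network I / F table 500. A minimal sketch follows, assuming in-memory lists of dictionaries; the key names and the function name are illustrative, and the “SWPORT2” row is hypothetical sample data added so that the join has a match.

```python
SWITCH_PORT_TABLE = [
    {"id": "SWPORT1", "port_number": 0, "device_id": "SwD",
     "connection_destination_ports": ["STPORT1"], "type": "NWSswitchPort"},
    # Hypothetical row: a switch port connected to the server's network I/F.
    {"id": "SWPORT2", "port_number": 1, "device_id": "SwD",
     "connection_destination_ports": ["SVIF1"], "type": "NWSswitchPort"},
]

NETWORK_IF_TABLE = [
    {"id": "SVIF1", "if_name": "eth0", "device_id": "SvA",
     "iscsi_initiator_name": "com.hitachi.sva", "type": "ServerIF"},
]


def resolve_topology(origin_device_id, switch_ports, network_ifs):
    """Return (switch port ID, network I/F ID) pairs matching conditions (1) and (2)."""
    pairs = []
    for port in switch_ports:
        # (1) switch port records whose device ID equals the starting device ID
        if port["device_id"] != origin_device_id:
            continue
        # (2) network I/F records whose ID equals a connection destination port
        for nif in network_ifs:
            if nif["id"] in port["connection_destination_ports"]:
                pairs.append((port["id"], nif["id"]))
    return pairs
```
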
- FIG. 14 shows an example of the configuration of the meta collection means stored in the meta collection means repository 236.
- the meta collection means 1400 has two fields, that is, a means ID 1401 and a collection means 1402.
- the means ID 1401 stores an identifier for uniquely identifying the meta collection means 1400.
- the value stored in the means ID 1401 is equal to the identifier stored in the means ID 1223 of the information collection object 1202 in FIG.
- the collection means 1402 stores the means for collecting the information necessary for diagnosis.
- one example of information necessary for diagnosis is performance information of managed components that can be acquired from the performance table 238. Therefore, for example, the collection means 1402a stores a query for acquiring information from that table.
- which managed component's performance information is collected depends on the conclusion derived by the event analysis program 222, and therefore the identifier of the managed component is a variable.
- in the query, the portion enclosed by double quotation marks is a variable (the same applies to the collection means 1402b).
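
To illustrate how a query whose variables are written in double quotation marks could be bound to a concrete managed component, the sketch below substitutes each quoted variable name with a value. The query text, the table and column names, and the variable names “ComponentID” and “Metric” are assumptions, since the actual query of the collection means is not reproduced in the text.

```python
import re

# Hypothetical meta collection query; "ComponentID" and "Metric" are variables.
META_QUERY = ('SELECT time, value FROM performance '
              'WHERE component_id = "ComponentID" AND metric = "Metric"')


def expand_query(template, bindings):
    """Replace every double-quoted variable name with a quoted SQL literal."""
    def substitute(match):
        value = bindings[match.group(1)]  # raises KeyError if a variable is unbound
        return "'" + value.replace("'", "''") + "'"
    return re.sub(r'"([A-Za-z_]\w*)"', substitute, template)


expanded_query = expand_query(
    META_QUERY, {"ComponentID": "SWPORT1", "Metric": "TxDropPacketNum"}
)
```

String substitution is used here only to mirror the variable notation; in production code a parameterized query would usually be the safer choice.
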
- the deployment diagnostic procedure is a diagnostic procedure that is expanded by the diagnosis procedure expansion program 223 based on the meta diagnosis procedure and the topology information. Like the meta diagnosis procedure, the deployment diagnostic procedure consists of a step of collecting information necessary for diagnosis, a step of making a determination based on the collected information, and a conclusion derived based on the result of one or more determinations. In the meta diagnosis procedure, the specific component on which the procedure is executed is not defined, whereas in the deployment diagnostic procedure, the component on which the procedure is executed is defined based on the topology information.
- FIG. 15 shows a configuration example of the deployment diagnostic procedure 1500 stored in the deployment diagnostic procedure repository 235.
- the deployment diagnostic procedure repository 235 is a repository that stores a deployment diagnostic procedure once generated for reuse in another diagnosis, and the repository does not necessarily exist in the management computer 201.
- in FIG. 1, the reference numeral “124” is attached to the deployment diagnostic procedure.
- because the deployment diagnostic procedure shown in FIG. 15 differs in configuration from the deployment diagnostic procedure in FIG. 1, it uses the reference numeral “1500”, which differs from that of the deployment diagnostic procedure of FIG. 1.
- however, the deployment diagnostic procedure of FIG. 1 and the deployment diagnostic procedure of FIG. 15 may be procedures generated by the same method.
- the deployment diagnostic procedure 1500 includes a basic object 1501 that stores information related to the deployment diagnostic procedure, an information collection object 1502 that stores a means for collecting information necessary for diagnosis, a determination object 1503 that stores a means for making a determination based on the collected information, and a conclusion object 1504 that stores conclusion information derived based on one or a plurality of determination results.
- the deployment diagnostic procedure has an object structure here, but any other data structure may be used as long as it is composed of a combination of information on the means for collecting information, information on the determination steps, and information on the conclusions derived based on the determination results.
- among the objects 1501 to 1504, a plurality of each object other than the basic object 1501 can exist.
- the deployment diagnostic procedure 1500 illustrated in FIG. 15 includes a basic object 1501, two information collection objects 1502a and 1502b, two determination objects 1503a and 1503b, and three conclusion objects 1504a, 1504b, and 1504c.
- the basic object 1501 has six fields, that is, a type 1511, an ID 1512, a meta diagnosis procedure ID 1513, a deployment diagnosis procedure ID 1514, a route list 1515, and a NextID 1516.
- the type 1511 stores an identifier (for example, “Start” indicating basic information) for identifying the type of the object, similar to the type 1211 of the meta-diagnosis procedure 1200.
- the ID 1512 stores an identifier for uniquely identifying the object.
- the meta diagnosis procedure ID 1513 stores the identifier of the meta diagnosis procedure 1200 used when the development diagnosis procedure 1500 is generated.
- the deployment diagnosis procedure ID 1514 stores an identifier for uniquely identifying the deployment diagnosis procedure 1500.
- the route list 1515 stores a list of the IDs of the objects of the deployment diagnostic procedure 1500 that were referenced at the time of diagnosis execution. That is, the route list 1515 may have any data structure from which the information collected for diagnosis, the determination results, and the conclusion derived based on the determination results can be acquired after the diagnosis is executed.
- NextID 1516 stores the identifier of the object that stores the step to be executed first.
- the information collection object 1502 has four fields, that is, a type 1521, an ID 1522, an expansion means ID 1523, and a NextID 1524.
- the type 1521 stores an identifier (for example, “CollectInfo” indicating that the information collecting unit is stored) for identifying the type of the object, similarly to the type 1221 of the meta diagnosis procedure 1200.
- ID 1522 similarly to ID 1512, stores an identifier for uniquely identifying an object.
- the expansion means ID 1523 stores an identifier for uniquely identifying the expansion collection means. Based on the identifier stored in the expansion means ID 1523, the expansion collection means necessary for diagnosis is searched for in the expansion collection means repository 237.
- the NextID 1524 stores the identifier of the object that stores the step to be executed next.
- the information collection object 1502a indicates that, at the time of diagnosis execution, the information collection means identified by the identifier “ExpandedGetInfo1-1” is acquired from the expansion collection means repository 237, information is collected based on that means, and then the step indicated by the object whose ID is “Proc1-1-2” is executed.
- the determination object 1503 has five fields, that is, a type 1531, an ID 1532, a determination program ID 1533, an argument 1534, and a Decision Map 1535.
- the type 1531 stores an identifier for identifying the type of the object (for example, “Decision” indicating that information related to the determination step is stored), similar to the type 1231 of the meta diagnosis procedure 1200.
- the ID 1532 stores an identifier for uniquely identifying the object.
- the determination program ID 1533 stores an identifier that uniquely identifies a program that performs determination based on the collected information.
- the determination program ID 1533 stores a value equal to the determination program ID 1233 of the meta diagnosis procedure 1200.
- At the time of diagnosis execution, the determination program 226 identified by this ID and resident in the memory 212 is called.
- the argument 1534 stores identification information of information used when the determination program 226 executes determination.
- the Decision Map 1535 stores a list of combinations of the key 1536 and the NextID 1537 in the same manner as the Decision Map 1235 of the meta diagnosis procedure 1200.
- the key 1536 stores a value that can be a return value of the determination program 226, and the NextID 1537 stores an identifier of the object. That is, the Decision Map 1535 stores information for determining the next step to be executed in accordance with the return value of the determination program 226 at the time of diagnosis execution.
- For example, the determination object 1503a indicates that, at the time of diagnosis execution, the determination program 226 identified by the identifier “determination program 1” is started and the information collected by the object 1502a identified by the identifier “Proc1-1-1” is passed to “determination program 1” as an argument. If the return value of “determination program 1” is “YES”, the step indicated by the object 1502b identified by the identifier “Proc1-1-3” is executed; if the return value is “NO”, the step indicated by the object 1504a identified by the identifier “Proc1-1-4” is executed.
- the conclusion object 1504 has three fields: a type 1541, an ID 1542, and a Conclusion 1543.
- the type 1541 stores an identifier for identifying the type of the object (for example, “Conclusion” indicating that information related to the conclusion is stored), similar to the type 1241 of the meta diagnostic procedure 1200.
- the ID 1542 stores an identifier for uniquely identifying the object, like the ID 1512.
- the Conclusion 1543 stores information that is the conclusion of the diagnosis at the time of diagnosis execution. For example, the information stored in the Conclusion 1543 may be displayed on the output device 217.
- For example, when the conclusion object 1504a is selected as the conclusion based on the determination result of the determination object 1503 at the time of diagnosis execution, “insufficient bandwidth of SWPORT1 (port 0 of the network switch D)” is displayed on the output device 217 as the diagnosis result.
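The four object kinds described above can be pictured as small record types. The following Python sketch is illustrative only: the class and attribute names are assumptions chosen to mirror the field names in the text, the type value "Start" for the basic object and the basic object ID "Proc1-1-0" are taken from the later description of diagnosis execution, and the instance values are the ones quoted for the objects 1502a, 1503a, and 1504a.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BasicObject:
    """Basic object 1501: entry point plus the route list 1515."""
    id: str
    next_id: str                        # NextID 1516: first step to execute
    route_list: List[str] = field(default_factory=list)
    type: str = "Start"


@dataclass
class CollectInfoObject:
    """Information collection object 1502."""
    id: str                             # ID 1522
    means_id: str                       # deployment means ID 1523
    next_id: str                        # NextID 1524
    type: str = "CollectInfo"           # type 1521


@dataclass
class DecisionObject:
    """Determination object 1503."""
    id: str                             # ID 1532
    program_id: str                     # determination program ID 1533
    argument: str                       # argument 1534 (ID of the collecting object)
    decision_map: Dict[str, str]        # Decision Map 1535: key 1536 -> NextID 1537
    type: str = "Decision"              # type 1531


@dataclass
class ConclusionObject:
    """Conclusion object 1504."""
    id: str                             # ID 1542
    conclusion: str                     # Conclusion 1543
    type: str = "Conclusion"            # type 1541


objects = [
    BasicObject(id="Proc1-1-0", next_id="Proc1-1-1"),
    CollectInfoObject(id="Proc1-1-1", means_id="ExpandedGetInfo1-1",
                      next_id="Proc1-1-2"),
    DecisionObject(id="Proc1-1-2", program_id="determination program 1",
                   argument="Proc1-1-1",
                   decision_map={"YES": "Proc1-1-3", "NO": "Proc1-1-4"}),
    ConclusionObject(id="Proc1-1-4",
                     conclusion="insufficient bandwidth of SWPORT1 "
                                "(port 0 of the network switch D)"),
]

# Index the objects by ID so that NextID values can be followed directly.
procedure = {obj.id: obj for obj in objects}
```

Indexing by ID is a design choice of this sketch; it makes each NextID lookup a single dictionary access.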
- the deployment collection means is an information collection means generated by the diagnostic procedure expansion program 223 by expanding the meta collection means based on the topology information.
- the meta collection means does not define a specific component that is a target of information collection, and is expressed by a variable in this embodiment.
- components to be collected are defined based on the topology information.
- FIG. 16 shows a configuration example of the deployment collection means stored in the deployment collection means repository 237.
- the development collection means 1600 has two fields, that is, a development means ID 1601 and a development collection means 1602.
- the expansion means ID 1601 stores an identifier for uniquely identifying the expansion collection means.
- the value stored in the expansion means ID 1601 is equal to the identifier stored in the expansion means ID 1523 of the information collection object 1502 in FIG.
- the deployment collection means 1602 stores information collection means necessary for diagnosis.
- the development collection unit 1602a stores a query for acquiring information from the table.
- the deployment collection unit 1602 defines information collection targets.
- FIG. 16 shows an example of the deployment collection means 1600a to 1600d generated by expanding the meta collection means 1400 of FIG. 14 based on the topology condition 1300a of FIG. 13.
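The expansion of a meta collection means into deployment collection means can be sketched as variable substitution. This is a minimal sketch under stated assumptions: the query text, the table name `performance`, the variable names, and the naming scheme for the generated IDs are all hypothetical (only “GetInfo1”, “GetInfo2”, “ExpandedGetInfo1-1”, and the three topologies appear in the text). Identical queries are merged, which is one possible way to obtain a single result from “GetInfo1” and three results from “GetInfo2”.

```python
from string import Template

# Hypothetical meta collection means: queries with variables instead of concrete IDs.
META_COLLECTION_MEANS = {
    "GetInfo1": Template("SELECT * FROM performance WHERE component_id = '$switch_port'"),
    "GetInfo2": Template("SELECT * FROM performance WHERE component_id = '$server_if'"),
}

# The three topologies quoted in the text, with assumed variable names.
TOPOLOGIES = [
    {"switch_port": "SWPORT1", "peer_port": "SWPORT2", "server_if": "SVIF1"},
    {"switch_port": "SWPORT1", "peer_port": "SWPORT3", "server_if": "SVIF2"},
    {"switch_port": "SWPORT1", "peer_port": "SWPORT4", "server_if": "SVIF3"},
]


def expand_collection_means(meta_id, topologies):
    """Substitute topology IDs for the variables and drop duplicate queries."""
    template = META_COLLECTION_MEANS[meta_id]
    expanded = {}
    for topology in topologies:
        query = template.substitute(topology)  # unused keys are ignored
        if query not in expanded.values():
            expanded[f"Expanded{meta_id}-{len(expanded) + 1}"] = query
    return expanded


expanded_1 = expand_collection_means("GetInfo1", TOPOLOGIES)
expanded_2 = expand_collection_means("GetInfo2", TOPOLOGIES)
```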
- the diagnosis is executed based on the result in order to further specify the failure cause event.
- FIG. 17 shows a flowchart of an example of failure cause analysis processing executed by the failure analysis program 221.
- the failure analysis program 221 may be configured to start this process when a failure occurs in the IT system and an event related to the failure is detected by the event reception program 227. Further, this process may be started when an administrator detects the occurrence of a failure in the IT system and is activated by an instruction from the input device 214 by the administrator.
- the failure analysis program 221 executes the event analysis program 222.
- the event analysis program 222 executes processing for narrowing down failure cause events based on the pattern of events that have occurred.
- the event analysis program 222 narrows down the failure candidates that are the source of failure propagation based on the event information group stored in the event queue table 233, the metarule stored in the metarule repository 231, and the configuration information stored in the configuration management DB 232. For example, when the event reception program 227 receives the event information group of the event queue table 233 shown in FIG. 10, and the event analysis program 222 performs analysis based on the metarule 1100 shown in FIG. 11A and the tables shown in FIGS.
- the expansion rules 1150a, 1150b, 1150c, and 1150d are generated. Then, for example, based on the information in each THEN unit 1152 of the expansion rules 1150a and 1150b, the event analysis program 222 derives the conclusion that “abnormal number of transmission drop packets on port 0 (ID is SWPORT1) of the network switch D (ID is SwD1) (the event type identifier is TxDropPacketNumError)” is the propagation source of the failure.
- FIG. 18 shows an example of the event analysis result screen 1800.
- the event analysis result screen 1800 is a screen that presents a conclusion derived by the event analysis program 222 as a cause candidate for a failure that is a propagation source of a plurality of failures that have occurred in the IT system.
- the event analysis result screen 1800 has an entry for each failure cause candidate that is a propagation source, and each entry has a cause candidate field 1801 for displaying the failure cause candidate and a certainty factor field 1802 for displaying the certainty factor (confidence level) of the cause candidate shown in the field 1801.
- the certainty factor displayed in the certainty factor field 1802 may be, for example, the event reception rate of the expansion rule 1150 related to the cause candidate 1811.
- values based on a plurality of event reception rates respectively corresponding to the plurality of expansion rules may be displayed in the confidence field 1802.
- For example, the event reception rate may be calculated from the total number of condition elements of all the expansion rules related to the cause candidate 1811 and the number of those condition elements whose reception flag 1164 is “1”, and the calculated value may be displayed in the certainty factor field 1802.
- a plurality of cause candidates may be displayed in descending order of confidence based on the conclusion derived by the event analysis program 222.
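The certainty factor calculation and the descending ordering described above can be sketched as follows. The candidate names and reception flags are hypothetical sample data; the only thing taken from the text is the ratio itself (condition elements whose reception flag is “1” divided by all condition elements of the related expansion rules).

```python
def event_reception_rate(rules):
    """rules: one list of reception flags (1 = event received) per expansion rule."""
    total = sum(len(flags) for flags in rules)
    received = sum(flag for flags in rules for flag in flags)
    return received / total if total else 0.0


# Hypothetical cause candidates, each with the flags of its related expansion rules.
candidates = {
    "candidate A": [[1, 1], [1, 0]],   # 3 of 4 condition elements received
    "candidate B": [[1, 0, 0]],        # 1 of 3 condition elements received
}

# Present the candidates in descending order of certainty factor.
ranked = sorted(candidates,
                key=lambda name: event_reception_rate(candidates[name]),
                reverse=True)
```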
- the process proceeds to step S1702 in FIG. 17 to execute the diagnosis procedure expansion program 223 to execute detailed diagnosis of the corresponding cause candidate.
- the input interface for executing detailed diagnosis by the administrator is not limited to a button, and any input interface that instructs the management computer 201 to execute diagnosis can be employed.
- the start of the diagnostic procedure development program 223 may be automatically executed for each derived cause candidate after the cause candidate is derived by the event analysis program 222 instead of an instruction from the administrator.
- When the diagnostic procedure expansion program 223 is executed automatically, it may be executed only for those cause candidates derived by the event analysis program 222 whose certainty factor is equal to or greater than a certain value.
- the conclusion derived by the event analysis program 222 indicates a failure that is a propagation source of a plurality of failures that occurred in the IT system, and the administrator presses the diagnosis execution button 1803 in response to the failure. Then, the diagnostic procedure expansion program 223 is started to execute a diagnosis that identifies the cause of the failure that has become the propagation source.
- In step S1702, the failure analysis program 221 starts the diagnostic procedure expansion program 223 with the information on the cause candidate selected in step S1701 as an input.
- the diagnostic procedure expansion program 223 generates a deployment diagnosis procedure 1500 based on the input cause candidate information (that is, the information of the THEN unit 1152 of the expansion rule 1150), the expansion rule 1150, the meta diagnosis procedure 1200, the meta collection means 1400, and the configuration information stored in the configuration management DB 232.
- An example of detailed processing of the diagnostic procedure development program 223 is shown in FIG.
- In step S1703, the failure analysis program 221 starts the diagnosis execution program 224 with the deployment diagnosis procedure 1500 as an input.
- the diagnosis execution program 224 executes diagnosis based on the deployment diagnosis procedure 1500 and identifies a failure cause event of the IT system.
- An example of detailed processing of the diagnosis execution program 224 is shown in FIG.
- In step S1704, the failure analysis program 221 starts the display program 225 with the deployment diagnosis procedure 1500 executed in step S1703 as an input.
- the display program 225 displays information on the cause of the failure derived in step S1703 on the output device 217 based on the input expansion diagnosis procedure 1500 and its route list 1515.
- the diagnostic procedure expansion program 223 is executed after the event analysis program 222 is executed. However, the diagnostic procedure expansion program 223 may be executed before the event analysis program 222 is executed.
- In this case, the diagnostic procedure expansion program 223 may list all the cause candidates that can be derived by the event analysis program 222 based on the configuration information of the configuration management DB 232 and the metarule 1100, generate the deployment diagnosis procedures 1500 and the deployment collection means 1600 necessary for diagnosing those cause candidates based on the meta diagnosis procedure 1200, the meta collection means 1400, and the configuration information of the configuration management DB 232, and store them in the deployment diagnosis procedure repository 235 and the deployment collection means repository 237.
- the failure analysis program 221 may then acquire, from the deployment diagnosis procedure repository 235, the deployment diagnosis procedure 1500 for the cause candidate derived by the event analysis program 222, and start the diagnosis execution program 224 with the acquired deployment diagnosis procedure 1500 as an input.
- In the above description, the diagnosis execution program 224 collects the information necessary for diagnosis and the determination program 226 executes the determination. However, after step S1702 is executed, the generated deployment diagnosis procedure 1500 may instead be passed to the display program 225, the display program 225 may display the deployment diagnosis procedure 1500 on the output device 217, and the administrator may perform processing according to the deployment diagnosis procedure 1500.
- FIG. 19 shows a flowchart of an example of processing executed by the diagnostic procedure development program 223 (step S1702).
- In step S1901, the diagnostic procedure expansion program 223 receives the conclusion information that the event analysis program 222 derived as the cause of the failure.
- the conclusion information may be a combination of information stored in the THEN unit 1152 of the expansion rule 1150.
- the diagnostic procedure development program 223 receives information “abnormal number of transmission drop packets of the port 0 (ID is SWPORT1) of the network switch D (ID is SwD) (event type identifier is TxDropPacketNumError)”.
- In step S1902, the diagnostic procedure expansion program 223 acquires the expansion rules 1150 related to the conclusion information received in step S1901. That is, the diagnostic procedure expansion program 223 acquires the expansion rules 1150 that have the received conclusion in the THEN unit 1152.
- the diagnostic procedure expansion program 223 performs the processing of steps S1904 to S1912 for each of the expansion rules 1150 acquired in step S1902.
- In the following, one expansion rule 1150 (hereinafter, the “target expansion rule” in the description of FIG. 19) is taken as an example.
- In step S1904, the diagnostic procedure expansion program 223 acquires, from the meta diagnostic procedure repository 234, the meta diagnostic procedure 1200 identified by the meta diagnostic procedure ID stored in the field 1155 of the target expansion rule 1150.
- the diagnostic procedure expansion program 223 performs the processing of steps S1906 to S1912 for each of the meta diagnostic procedures 1200 acquired in step S1904.
- In the following, one meta diagnostic procedure 1200 (hereinafter, the “target meta diagnostic procedure” in the description of FIG. 19) is taken as an example.
- In step S1906, the diagnostic procedure expansion program 223 determines whether or not the target meta diagnostic procedure 1200 has already been expanded with respect to the starting point indicated by the field 1155 of the target expansion rule 1150. If the result of this determination is true (S1906: YES), the process proceeds to step S1907. If the result of this determination is false (S1906: NO), the process proceeds to step S1908.
- In step S1907, the diagnostic procedure expansion program 223 acquires, from the deployment diagnostic procedure repository 235, the deployment diagnostic procedure 1500 that was expanded based on the target meta diagnostic procedure and the starting point indicated by the field 1155 of the target expansion rule 1150.
- In step S1908, the diagnostic procedure expansion program 223 acquires the topology condition 1300 identified by the identifier stored in the topology condition ID 1214 of the basic object 1201 of the target meta diagnostic procedure 1200.
- In step S1909, the diagnostic procedure expansion program 223 acquires topology information from the configuration management DB 232 based on the information stored in the condition 1302 of the topology condition 1300 acquired in step S1908.
- the topology represented by the acquired topology information starts from the management target component (a device or an element thereof) indicated by the “starting point” in the field 1155 of the target expansion rule 1150.
- For example, when the target expansion rule 1150 is the expansion rule 1150a of FIG. 11B, the starting point is the management target component with the device ID “SwD” and the component ID “SWPORT1”. When the topology condition 1300 is the topology condition 1300a of FIG. 13, the diagnostic procedure expansion program 223 refers to the records (the first to fourth lines) in which the device ID 603 of the switch port table 600 is “SwD”, and acquires the IDs in the referenced records (three sets: SWPORT1-SWPORT2-SVIF1, SWPORT1-SWPORT3-SVIF2, and SWPORT1-SWPORT4-SVIF3) as the topology information.
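The topology acquisition in this example can be sketched as a lookup over the switch port records. The column layout and the row contents below are assumptions made for illustration (the text only states that four records have the device ID “SwD” and lists the three resulting sets); the sketch also assumes that the topology condition asks for paths from the starting port, through another port of the same switch, to the interface connected to that port.

```python
# Hypothetical rows of the switch port table 600:
# (port ID, device ID, ID of the connected server interface or None).
SWITCH_PORT_TABLE = [
    ("SWPORT1", "SwD", None),
    ("SWPORT2", "SwD", "SVIF1"),
    ("SWPORT3", "SwD", "SVIF2"),
    ("SWPORT4", "SwD", "SVIF3"),
    ("SWPORT5", "SwE", "SVIF4"),
]


def acquire_topologies(device_id, start_port):
    """Return (start port, peer port, connected interface) sets for one switch."""
    topologies = []
    for port_id, dev, connected in SWITCH_PORT_TABLE:
        # Skip other devices, the starting port itself, and unconnected ports.
        if dev != device_id or port_id == start_port or connected is None:
            continue
        topologies.append((start_port, port_id, connected))
    return topologies


topology_info = acquire_topologies("SwD", "SWPORT1")
```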
- In step S1909, topologies that include no management target component in which a failure event has occurred may be excluded from the acquired topology information. Whether or not a failure event has occurred in a management target component may be judged by whether a failure event occurred within a certain period from the time when the event reception program 227 detected the failure event that triggered the analysis. Thereby, the targets of diagnosis can be limited to the topologies in which a failure has occurred. Further, a deployment diagnosis procedure 1500 may be generated for each topology, or a single one may be generated for all the topologies acquired based on a set of topology conditions and a starting point.
- In step S1910, the diagnostic procedure expansion program 223 acquires, from the meta collection means repository 236, the meta collection means 1400 identified by the identifier stored in the means ID 1223 of the information collection object 1202 of the meta diagnosis procedure 1200. Then, the diagnostic procedure expansion program 223 generates the deployment collection means 1600 by expanding the meta collection means 1400 based on the topology information acquired in step S1909. That is, the IDs in the topology information are substituted for the variables in the meta collection means 1400 to generate the deployment collection means 1600 (the deployment collection means 1602 is, for example, as shown in FIG. 16).
- In step S1911, the diagnostic procedure expansion program 223 generates a deployment diagnosis procedure 1500 based on the meta diagnostic procedure 1200, the topology information acquired in step S1909, and the deployment collection means 1600 generated in step S1910.
- In step S1912, the diagnostic procedure expansion program 223 registers the deployment diagnosis procedure 1500 generated in step S1911 in the deployment diagnostic procedure repository 235.
- In step S1913, the diagnostic procedure expansion program 223 returns the deployment diagnosis procedure 1500, whether newly generated or acquired from the deployment diagnostic procedure repository 235, to the calling program.
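The reuse check of steps S1906 and S1907, followed by generation and registration in steps S1908 to S1912, amounts to a get-or-create lookup keyed by the meta diagnostic procedure and the starting point. The sketch below is a simplification under assumptions: the repository is modeled as a dictionary, `expand` stands in for steps S1908 to S1911, and the ID “MetaProc1” is hypothetical.

```python
def get_or_expand(repository, meta_procedure_id, starting_point, expand):
    """Return an already expanded procedure if present, otherwise expand and register."""
    key = (meta_procedure_id, starting_point)
    if key in repository:                                   # S1906: YES -> S1907
        return repository[key]
    procedure = expand(meta_procedure_id, starting_point)   # S1908 to S1911
    repository[key] = procedure                             # S1912
    return procedure                                        # S1913


calls = []


def fake_expand(meta_procedure_id, starting_point):
    """Stand-in for the real expansion; records how often it is invoked."""
    calls.append((meta_procedure_id, starting_point))
    return {"meta": meta_procedure_id, "start": starting_point}


repository = {}
first = get_or_expand(repository, "MetaProc1", ("SwD", "SWPORT1"), fake_expand)
second = get_or_expand(repository, "MetaProc1", ("SwD", "SWPORT1"), fake_expand)
```

The second call returns the stored procedure without invoking the expansion again, which is the effect the check in step S1906 is meant to achieve.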
- In step S1904, when the event reception rate of the target expansion rule 1150 is equal to or less than a predetermined value, the target expansion rule may be excluded from both the expansion of the related meta diagnostic procedure and the diagnosis execution.
- Thereby, the deployment diagnosis procedures executed by the diagnosis execution program 224 are limited to those related to expansion rules whose event reception rate is equal to or greater than a certain value, and unnecessary diagnosis execution can be reduced.
- For example, assume that in step S1901 the diagnostic procedure expansion program 223 receives, as the conclusion of the event analysis program 222, the information “abnormal number of transmission drop packets of port 0 (ID is SWPORT1) of the network switch D (ID is SwD) (the event type identifier is TxDropPacketNumError)”.
- In step S1902, the diagnostic procedure expansion program 223 acquires the expansion rules 1150a and 1150b in FIG. 11B. Taking the expansion rule 1150a as an example, the diagnostic procedure expansion program 223 acquires the meta diagnostic procedure 1200 of FIG. 12 in step S1904. If it is determined in step S1906 that the procedure has not been expanded, the diagnostic procedure expansion program 223 acquires the topology condition 1300a of FIG. 13 in step S1908.
- In step S1909, the diagnostic procedure expansion program 223 acquires three pieces of topology information (SWPORT1-SWPORT2-SVIF1, SWPORT1-SWPORT3-SVIF2, SWPORT1-SWPORT4-SVIF3). Since “GetInfo1” and “GetInfo2” are respectively stored in the means IDs 1223 of the two information collection objects 1202 of the meta diagnosis procedure 1200, in step S1910 the diagnostic procedure expansion program 223 generates the deployment collection means 1600a based on the meta collection means 1400a of FIG. 14 and the topology information, and generates the deployment collection means 1600b, 1600c, and 1600d based on the meta collection means 1400b and the topology information. In step S1911, the diagnostic procedure expansion program 223 generates the deployment diagnosis procedure 1500 shown in FIG. 15.
- In step S1912, the diagnostic procedure expansion program 223 stores the deployment diagnosis procedure 1500 in the deployment diagnostic procedure repository 235.
- In step S1913, the diagnostic procedure expansion program 223 returns the generated deployment diagnosis procedure 1500 to the failure analysis program 221.
- FIG. 20 shows a flowchart of an example of processing executed by the diagnosis execution program 224 (step S1703).
- In step S2001, the diagnosis execution program 224 receives the deployment diagnosis procedure 1500.
- the diagnosis execution program 224 repeats the processes in steps S2003 to S2014 for each of the deployment diagnosis procedures received in step S2001.
- In the following, one deployment diagnosis procedure 1500 (hereinafter, the “target deployment diagnosis procedure” in the description of FIG. 20) is taken as an example.
- In step S2003, the diagnosis execution program 224 refers to the basic object 1501, whose type is “Start”, among the objects constituting the target deployment diagnosis procedure 1500.
- In step S2004, the diagnosis execution program 224 adds the ID of the referenced object to the route list 1515 of the basic object 1501.
- In step S2005, the diagnosis execution program 224 refers to the object that follows the object currently being referred to.
- Specifically, the diagnosis execution program 224 refers to the object having the ID stored in the NextID 1516 or the NextID 1524. If the determination object 1503 is being referred to, the next object is the one determined based on the Decision Map 1535 in step S2013 described later.
- In step S2006, the diagnosis execution program 224 determines whether or not the type of the object referred to in step S2005 is “End”. If this determination result is true (S2006: YES), the process proceeds to step S2014. If this determination result is false (S2006: NO), the process proceeds to step S2007.
- In step S2007, the diagnosis execution program 224 determines whether or not the type of the object referred to in step S2005 is “CollectInfo”. If the result of this determination is true (S2007: YES), the process proceeds to step S2008. If the result of this determination is false (S2007: NO), the process proceeds to step S2010.
- In step S2008, the diagnosis execution program 224 acquires, from the deployment collection means repository 237, the deployment collection means 1600 identified by the identifier stored in the deployment means ID 1523 of the referenced object.
- In step S2009, the diagnosis execution program 224 acquires information from the repository of the management target device or the management computer 201 based on the deployment collection means acquired in step S2008.
- In step S2010, the diagnosis execution program 224 acquires the information collected in step S2009 based on the information stored in the argument 1534 of the referenced object.
- In step S2011, the diagnosis execution program 224 starts, with the information acquired in step S2010 as an input, the determination program 226 identified by the identifier stored in the determination program ID 1533 of the referenced object.
- In step S2012, the diagnosis execution program 224 receives the determination result from the determination program 226 executed in step S2011.
- In step S2013, the diagnosis execution program 224 acquires the NextID 1537 stored in the Decision Map 1535 of the referenced object, using the determination result received in step S2012 as a key, and thereby determines the object to be referenced next.
- In step S2014, the diagnosis execution program 224 adds the ID of the referenced object to the route list 1515 of the basic object 1501.
- In step S2015, the diagnosis execution program 224 returns the received deployment diagnosis procedure 1500 to the calling program.
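The loop of steps S2003 to S2015 can be sketched as a small interpreter over the objects of one deployment diagnosis procedure. This is an illustrative sketch, not the claimed implementation: the dictionary layout, the function name, and the stand-in collector and determination program are assumptions, and the terminal type is written as “End” as in the description of step S2006. The control flow follows the worked example, in which an object of type “End” leads to step S2014.

```python
def execute_diagnosis(procedure, start_id, collectors, programs):
    """Walk the objects of a procedure and record the visited IDs in the route list."""
    route = procedure[start_id]["route_list"]
    collected = {}                                      # results keyed by object ID
    current = procedure[start_id]                       # S2003
    next_id = current["next_id"]
    while True:
        route.append(current["id"])                     # S2004
        current = procedure[next_id]                    # S2005
        if current["type"] == "End":                    # S2006: YES
            route.append(current["id"])                 # S2014
            return procedure                            # S2015
        if current["type"] == "CollectInfo":            # S2007: YES
            collect = collectors[current["means_id"]]   # S2008
            collected[current["id"]] = collect()        # S2009
            next_id = current["next_id"]
        else:                                           # S2007: NO
            info = collected[current["argument"]]       # S2010
            result = programs[current["program_id"]](info)  # S2011, S2012
            next_id = current["decision_map"][result]   # S2013


procedure = {
    "Proc1-1-0": {"id": "Proc1-1-0", "type": "Start",
                  "next_id": "Proc1-1-1", "route_list": []},
    "Proc1-1-1": {"id": "Proc1-1-1", "type": "CollectInfo",
                  "means_id": "ExpandedGetInfo1-1", "next_id": "Proc1-1-2"},
    "Proc1-1-2": {"id": "Proc1-1-2", "type": "Decision",
                  "program_id": "determination program 1", "argument": "Proc1-1-1",
                  "decision_map": {"YES": "Proc1-1-3", "NO": "Proc1-1-4"}},
    "Proc1-1-4": {"id": "Proc1-1-4", "type": "End",
                  "conclusion": "insufficient bandwidth of SWPORT1 "
                                "(port 0 of the network switch D)"},
}
# Hypothetical stand-ins: a collector returning dummy data and a program answering "NO".
collectors = {"ExpandedGetInfo1-1": lambda: {"sample": 1}}
programs = {"determination program 1": lambda info: "NO"}

executed = execute_diagnosis(procedure, "Proc1-1-0", collectors, programs)
route = executed["Proc1-1-0"]["route_list"]
```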
- For example, when the diagnosis execution program 224 receives the deployment diagnosis procedure 1500 in step S2001, it refers to the basic object 1501a in step S2003 and, in step S2004, adds the object ID “Proc1-1-0” to the route list 1515.
- In step S2005, the diagnosis execution program 224 refers to the information collection object 1502a based on the identifier “Proc1-1-1” indicated by the NextID 1516. Since the type of the information collection object 1502a is “CollectInfo”, the process proceeds to step S2008.
- In step S2008, the diagnosis execution program 224 acquires the deployment collection means 1600a of FIG. 16 based on the deployment means ID “ExpandedGetInfo1-1”.
- In step S2009, the diagnosis execution program 224 collects information from the performance table 238 based on the SQL query described in the deployment collection means 1602. Then, returning to step S2004, the diagnosis execution program 224 adds the object ID “Proc1-1-1” to the route list 1515 and, in step S2005, refers to the determination object 1503a identified by “Proc1-1-2”.
- In step S2010, the diagnosis execution program 224 acquires the performance information collected based on the deployment collection means 1600a.
- In step S2011, the diagnosis execution program 224 starts “determination program 1” with the performance information as an input.
- In step S2012, when the value “NO” is received from “determination program 1”, the diagnosis execution program 224 determines in step S2013, based on the Decision Map 1535, that the object to be referred to next is the conclusion object 1504a with the ID “Proc1-1-4”. Returning again to step S2004, the diagnosis execution program 224 adds the object ID “Proc1-1-2” of the determination object 1503a to the route list 1515, and refers to the conclusion object 1504a in step S2005. Since the conclusion object 1504a has the type “End”, the process proceeds to step S2014, and the diagnosis execution program 224 adds the object ID “Proc1-1-4” to the route list 1515. Then, the diagnosis execution program 224 returns the deployment diagnosis procedure 1500, in which the route list 1515 has been updated, to the failure analysis program 221 that is the caller.
- the diagnosis execution program 224 can execute diagnosis in order to identify the cause event of the failure that has occurred in the IT system.
- the diagnosis execution program 224 may display the collected information on the output device 217 in step S2009, and the determination program 226 executed in step S2011 may display, on the output device 217, the determination criteria and an input interface (e.g., buttons) through which the administrator inputs the determination result. In this case, the determination result received in step S2012 may be the determination result input by the administrator via the input interface.
- Further, when the diagnosis execution program 224 fails to acquire the information used for determination in step S2010, the determination program 226 may return a plurality of determination results in step S2011, and the diagnosis execution program 224 may continue the diagnostic procedure for each of those results, referring to a plurality of conclusion objects 1504. The display program 225 may then display a plurality of cause events based on the plurality of conclusion objects 1504.
- the diagnosis execution program 224 may execute the information collection processing based on the information collection objects 1502 and the determination by the determination program 226 based on the determination objects 1503 in parallel, instead of executing the objects in the deployment diagnosis procedure one after another.
- FIG. 21 shows a flowchart of an example of processing executed by the display program 225 (step S1704).
- In step S2101, the display program 225 receives the deployment diagnosis procedure 1500.
- In step S2102, the display program 225 acquires the conclusion object 1504 that was last referred to by the diagnosis execution program 224, based on the received deployment diagnosis procedure 1500 and the list stored in the route list 1515 of the basic object 1501, and displays it as the diagnosis result.
- In step S2103, the display program 225 displays the diagnostic procedure that was used, based on the received deployment diagnosis procedure.
- In step S2104, the display program 225 displays, among the diagnostic procedures used by the diagnosis execution program 224, the procedure that was actually executed, based on the route list 1515 of the basic object 1501 of the received deployment diagnosis procedure 1500.
- In steps S2101 to S2104, information is displayed sequentially. Instead, the display program 225 may write the information to be displayed to the memory 212 and, after all the display objects have been written to the memory 212, display a screen including those display objects (for example, the screen of FIG. 22).
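How the display steps can derive their output from the route list may be sketched as follows. The function name, the dictionary layout, and the returned structure are assumptions for illustration; the only behavior taken from the text is that the last ID in the route list identifies the conclusion to show (step S2102) and that the route list tells which objects of the procedure were actually executed (step S2104).

```python
def summarize_diagnosis(procedure, start_id):
    """Derive the diagnosis result and the executed path from the route list."""
    route = procedure[start_id]["route_list"]
    final_object = procedure[route[-1]]          # S2102: last referenced object
    return {
        "diagnosis_result": final_object.get("conclusion"),
        # S2103/S2104: every object of the procedure, flagged if it was executed.
        "steps": {obj_id: obj_id in route for obj_id in procedure},
    }


# Hypothetical executed procedure, reduced to the fields this sketch needs.
procedure = {
    "Proc1-1-0": {"route_list": ["Proc1-1-0", "Proc1-1-1", "Proc1-1-2", "Proc1-1-4"]},
    "Proc1-1-1": {},
    "Proc1-1-2": {},
    "Proc1-1-3": {},
    "Proc1-1-4": {"conclusion": "insufficient bandwidth of SWPORT1 "
                                "(port 0 of the network switch D)"},
}
summary = summarize_diagnosis(procedure, "Proc1-1-0")
```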
- FIG. 22 shows an example of the diagnosis result screen.
- the diagnosis result screen 2200 is a screen that displays the diagnosis procedure executed by the diagnosis execution program 224 and the diagnosis result, and is displayed on the output device 217. Specifically, this screen 2200 shows the development diagnosis procedure of FIG. 15 and the result of executing the procedure.
- the diagnosis result screen 2200 may include a diagnosis result field 2201 for displaying the diagnosis result derived by the diagnosis execution program 224 and a diagnosis procedure field 2202 for displaying information on the deployment diagnosis procedure 1500 used by the diagnosis execution program 224. Further, the diagnosis result screen 2200 may include a diagnosis target topology field 2203 for displaying information on the topology on which the diagnosis was performed, and a diagnosis target data field 2204 for displaying the information that was collected and used for the determination when the diagnosis was executed.
- the information displayed in the diagnosis result field 2201 is an example of information (diagnosis result) displayed by the display program 225 in step S2102.
- That is, the conclusion object 1504 last referred to by the diagnosis execution program 224 is acquired based on the route list 1515 of the received deployment diagnosis procedure 1500, and the conclusion object 1504 is displayed as the diagnosis result in the field 2201.
- the information displayed in the diagnostic procedure field 2202 is an example of information (diagnostic procedure) displayed by the display program 225 in step S2103.
- the diagnostic procedure used by the diagnostic execution program 224 is acquired based on the received information on the deployment diagnostic procedure 1500, and the diagnostic procedure is displayed in the field 2202.
- In FIG. 22, as an example of the display of the diagnostic procedure, the value indicated by the argument 1534 of the determination object 1503, the determination criterion of the determination program 226 identified from the determination object 1503, and the conclusion information derived from the conclusion object 1504 are displayed.
- a path 2223 in FIG. 22 is an example of the “executed procedure” displayed by the display program 225 based on the path list 1515 in step S2104. As shown in FIG. 22, a portion (arrow) indicating the flow of “executed procedure” may be highlighted for the diagnosis procedure 2221, or a list of executed procedures may be displayed.
- the information displayed in the diagnosis target topology field 2203 is information representing the topology that is the target of the deployment diagnosis procedure 1500.
- For example, the diagnostic procedure expansion program 223 may save the topology information obtained in the processing of FIG. 19 in a storage area such as the memory 212 of the management computer 201 in association with the deployment diagnosis procedure 1500, and, when the display program 225 is started, the display program 225 may display the saved topology information in the field 2203.
- In the diagnosis target data field 2204, the information acquired when the diagnosis execution program 224 referred to the information collection object 1502 of the deployment diagnosis procedure 1500 is displayed.
- For example, the diagnosis execution program 224 may store the information acquired in step S2009 in the processing of FIG. 20 in a storage area such as the memory 212 of the management computer 201 in association with the deployment diagnosis procedure 1500, and, when the display program 225 is started, the display program 225 may display the stored information in the field 2204.
- Information on the managed component that is the target of a determination may be displayed for each determination procedure. For example, in the display example of FIG. 22, when the administrator selects the determination display 2222 that displays the determination criterion of the determination object 1503, the information on the managed component that is judged by the determination program 226 related to the determination object 1503 may be highlighted.
- For example, when the administrator selects the determination display 2222a, which displays the determination criterion of the determination object 1503a, and the information indicated by the argument 1534 of the determination object 1503a is the “return value of Proc1-1-1”, “Port 0 of network switch D” may be highlighted.
- In the diagnosis target topology field 2203, information on the managed component that was the deciding element of the determination result may be displayed for each determination procedure. For example, in the display example of FIG. 22, when the administrator selects the determination display 2222 that displays the determination criterion of the determination object 1503 of the expanded diagnostic procedure 1500, the managed component that decided the determination result may be highlighted among the managed components displayed in the diagnosis target topology field 2203.
- The determination object 1503b related to the determination display 2222b is “the increase rate of the number of dropped transmission packets at port 0 of network switch D and the increase rate of the number of transmitted packets at eth0 of server A, eth0 of server B, and eth0 of server C”.
- The diagnosis execution program 224 refers to the conclusion object 1504c.
- The identifier of eth0 of server B is SVIF2.
- The identifier of port 0 of network switch D is SWPORT1.
- Such information can be displayed if the diagnosis execution program 224, when executed, saves the information acquired in step S2010 and the determination result of step S2012 in a storage area such as the memory 212 of the management computer 201.
- The “determination program 2” indicated by the determination program ID 1533 is called to make the determination.
- If “determination program 2” is a program that returns the combination of component IDs whose performance information has the same rate of increase, the return value of “determination program 2” may be stored in a storage area such as the memory 212 of the management computer 201, and the display program 225 may display the information of the managed components having those IDs.
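- As a rough illustration of the kind of determination program described above, the following Python sketch returns the component IDs whose counters grew at the same rate as a reference counter. The function names, the sample format, the `tolerance` parameter, the identifier `SVIF1`, and all counter values are assumptions made for this example; the text names only SVIF2 and SWPORT1 and does not specify how the rates are compared.

```python
def increase_rate(samples):
    """Relative growth between the first and last sample of a counter."""
    first, last = samples[0], samples[-1]
    if first == 0:
        return None  # rate undefined; treated as "no match" below
    return (last - first) / first


def same_increase_rate(reference_id, counters, tolerance=0.05):
    """Return [reference_id, *matching_ids] for counters whose increase
    rate is within `tolerance` of the reference counter's rate."""
    ref_rate = increase_rate(counters[reference_id])
    if ref_rate is None:
        return []
    matches = []
    for component_id, samples in counters.items():
        if component_id == reference_id:
            continue
        rate = increase_rate(samples)
        if rate is not None and abs(rate - ref_rate) <= tolerance:
            matches.append(component_id)
    return [reference_id, *matches] if matches else []


# Made-up sample data: first and last counter readings per component ID.
counters = {
    "SWPORT1": [100, 300],   # dropped packets, port 0 of network switch D
    "SVIF1": [1000, 1100],   # hypothetical ID for another server interface
    "SVIF2": [500, 1500],    # transmitted packets, eth0 of server B
}
result = same_increase_rate("SWPORT1", counters)
print(result)
```

- A display routine could then highlight the components whose IDs appear in the returned list, which is the role the text assigns to the stored return value.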
- Information that is the target of a determination may be displayed for each determination procedure. For example, in the display example of FIG. 22, when the administrator selects the determination display 2222 that displays the determination criterion of the determination object 1503, the information indicated by the argument 1534 of the determination object 1503 may be highlighted. For example, when the administrator selects the determination display 2222a that displays the determination criterion of the determination object 1503a, the information 2241b indicated by the argument 1534 of the determination object 1503a may be highlighted.
- In the diagnosis target data field 2204, information that was the deciding element of the determination result may be displayed for each determination procedure.
- Among the information displayed in the diagnosis target data field 2204, the information that decided the determination result may be highlighted.
- The determination object 1503b related to the determination display 2222b is “the increase rate of the number of dropped transmission packets at port 0 of network switch D and the increase rate of the number of transmitted packets at eth0 of server A, eth0 of server B, and eth0 of server C”.
- The diagnosis execution program 224 refers to the conclusion object 1504c.
- “Performance information on the number of dropped packets” may be highlighted. Such information can be displayed if the diagnosis execution program 224, when executed, saves the information acquired in step S2010 and the determination result of step S2012 in a storage area such as the memory 212 of the management computer 201.
- A diagnosis result screen may be displayed for each expanded diagnostic procedure.
- The diagnosis execution program 224 may save the information collected in step S2009 in a storage area such as the memory 212 of the management computer 201 for a certain period. When another diagnosis would collect the same information for the same managed component, the information already stored in the storage area may be used instead of collecting it again. When the collected information is displayed on the output device 217, the time of collection may also be displayed.
- Likewise, the diagnosis execution program 224 may store the determination result received in step S2012 in a storage area such as the memory 212 of the management computer 201 for a certain period. When another diagnosis would make a determination based on the same information of the same managed component, the stored determination result may be used without executing the determination program. When the determination result is displayed on the output device 217, the time of determination may also be displayed.
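- The reuse of collected information and determination results for a certain period can be pictured as a small time-limited cache. The Python sketch below is only an illustration under assumptions: the class name `DiagnosisCache`, the key layout, the injectable clock, and the item name `tx_drop_packets` are not taken from the source.

```python
import time


class DiagnosisCache:
    """Keeps collected values (or determination results) for a fixed period."""

    def __init__(self, ttl_seconds, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}  # (component_id, item) -> (value, stored_at)

    def get_or_collect(self, component_id, item, collect):
        """Return (value, stored_at); call collect() only on a miss or expiry."""
        key = (component_id, item)
        now = self.clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[1] < self.ttl:
            return hit  # still fresh: reuse the value and its original time
        value = collect()
        self._entries[key] = (value, now)
        return value, now


# Usage with a fake clock so the expiry behaviour is easy to follow.
now = [0.0]
calls = []


def collect():
    calls.append(1)  # count how often real collection happens
    return 42


cache = DiagnosisCache(ttl_seconds=60, clock=lambda: now[0])
first = cache.get_or_collect("SWPORT1", "tx_drop_packets", collect)
now[0] = 30.0
second = cache.get_or_collect("SWPORT1", "tx_drop_packets", collect)  # reused
now[0] = 90.0
third = cache.get_or_collect("SWPORT1", "tx_drop_packets", collect)  # collected again
```

- Returning the stored timestamp together with the value matches the remark that the collection or determination time may be shown next to the displayed result.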
- In the first embodiment, a diagnosis related to a failure cause candidate derived by the event analysis program 222 is executed, the information necessary for the diagnosis is collected, and a determination is made on the collected information. The cause of the failure can then be identified from the conclusion obtained as a result of the determination. Thereby, the administrator can quickly identify the cause event of the failure, and the downtime of the IT system due to the failure can be reduced.
- Example 2 will now be described.
- The description focuses on the differences from the first embodiment; descriptions of equivalent components, programs having equivalent functions, and tables having equivalent items are omitted or simplified.
- In the first embodiment, diagnosis is performed on a failure that is the propagation source of a plurality of failures, as derived by the event analysis program, and the conclusion obtained by the diagnosis is presented as the cause of that propagation-source failure.
- The method illustrated in the first embodiment is effective for investigating a more detailed cause after the cause has been narrowed down as far as the event analysis program can determine.
- Another effective use of diagnosis is to improve the accuracy of the certainty factor of a cause candidate derived by the event analysis program (for example, to increase the value of the certainty factor).
- Example 2 describes a case in which diagnosis is performed after the event analysis program derives a cause candidate, and the diagnosis result is reflected in the certainty factor of that cause candidate.
- FIG. 23 shows a configuration example of the meta-rule 2300 in the second embodiment.
- The configuration of the meta-rule 2300 in the second embodiment is substantially the same as that of the meta-rule 1100 in the first embodiment.
- In the meta-rule 1100 of the first embodiment, the condition element 1121 constituting the IF unit 1111 includes a device type 1101, a component type 1102, and an event type 1103 in order to store the type of event received by the event reception program 227.
- In contrast, the meta-rule 2300 according to the second embodiment may include a field 2311 for storing the identifier of the meta-diagnostic procedure 1200 as a condition element of the IF unit 1111, so that the diagnosis result can be reflected.
- FIG. 24 shows a configuration example of the expansion rule 2400 in the second embodiment.
- The configuration of the expansion rule 2400 in the second embodiment is substantially the same as that of the expansion rule 1150 in the first embodiment. As with the meta-rule, the expansion rule 1150 of the first embodiment includes the device ID 1161, the component ID 1162, and the event type 1163 in the IF unit 1151 in order to store events that can be received by the event reception program 227. In contrast, the expansion rule 2400 in the second embodiment may include a field 2411 for storing the identifier of the expanded diagnostic procedure as a condition element of the IF unit 1151, so that the diagnosis result can be reflected.
- FIG. 25 shows a configuration example of the expanded diagnostic procedure in the second embodiment.
- The configuration of the expanded diagnostic procedure 2500 in the second embodiment is substantially the same as that of the expanded diagnostic procedure 1500 in the first embodiment.
- To reflect the result of the diagnosis, the Conclusion 1543 of the conclusion object 1504 may store an instruction to update the reception flag 1164 corresponding to the field 2411 of the expansion rule 2400 in which the identifier of the expanded diagnostic procedure is stored.
- FIG. 26 shows a flowchart of an example of the failure cause analysis processing executed by the failure analysis program 221 in the second embodiment.
- The failure analysis program 221 may be started at the timing described in the first embodiment.
- In step S1701, the failure analysis program 221 executes the event analysis program 222.
- The processing executed is the same as that of step S1701 described in the first embodiment.
- In step S1702, the failure analysis program 221 starts the diagnostic procedure expansion program 223 with the information on the cause candidate selected in step S1701 as input.
- The processing executed is substantially the same as step S1702 described in the first embodiment or the processing of FIG.
- After generating the expanded diagnostic procedure 2500 in step S1909, the diagnostic procedure expansion program 223 acquires the expansion rule 2400 obtained in step S1902 and the meta-rule 2300 on which that expansion rule 2400 is based. If the generated expanded diagnostic procedure 2500 has the same meta-diagnostic procedure ID as the identifier of the meta-diagnostic procedure stored in the condition element field 2311 of the meta-rule 2300, the diagnostic procedure expansion program 223 stores the expanded diagnostic procedure ID in the field 2411 of the condition element of the expansion rule 2400 related to the meta-rule 2300.
- The diagnostic procedure expansion program 223 may store the expanded diagnostic procedure ID in the field 2411 of the condition element of the expansion rule that has the ID of the component serving as the starting point.
- The diagnostic procedure expansion program 223 may store the expanded diagnostic procedure ID in the field 2411 of the expansion rule only when the topology information acquired when generating the expanded diagnostic procedure and the topology information acquired when generating the expansion rule are the same.
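- The linking performed in step S1702 of the second embodiment can be sketched as follows. This Python example is a simplified illustration, not the patented implementation: the class and attribute names, the use of a `frozenset` to stand in for topology information, and every identifier other than `ExpandedDiagnosticProc10-1`, `SVIF2`, and `SWPORT1` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetaRule:
    rule_id: str
    meta_diag_proc_id: Optional[str] = None  # plays the role of field 2311


@dataclass
class ExpansionRule:
    rule_id: str
    meta_rule: MetaRule
    topology: frozenset
    diag_proc_id: Optional[str] = None       # plays the role of field 2411
    diag_reception_flag: int = 0             # reception flag for that element


@dataclass
class ExpandedDiagProc:
    proc_id: str
    meta_diag_proc_id: str
    topology: frozenset


def link_procedure(rule, proc, require_same_topology=True):
    """Store proc.proc_id in the rule when both derive from the same
    meta-diagnostic procedure (and, optionally, from the same topology)."""
    if rule.meta_rule.meta_diag_proc_id != proc.meta_diag_proc_id:
        return False
    if require_same_topology and rule.topology != proc.topology:
        return False
    rule.diag_proc_id = proc.proc_id
    return True


topo = frozenset({"SVIF2", "SWPORT1"})
meta = MetaRule("MetaRule10", meta_diag_proc_id="MetaDiagnosticProc10")
rule = ExpansionRule("ExpansionRule10-1", meta, topo)
other = ExpansionRule("ExpansionRule10-2", meta, frozenset({"SWPORT1"}))
proc = ExpandedDiagProc("ExpandedDiagnosticProc10-1", "MetaDiagnosticProc10", topo)

linked = link_procedure(rule, proc)
skipped = link_procedure(other, proc)  # different topology, so not linked
```

- The `require_same_topology` switch reflects the optional restriction described above; with it disabled, a matching meta-diagnostic procedure ID alone is enough to link the two.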
- In step S1703, the failure analysis program 221 starts the diagnosis execution program 224 with the expanded diagnostic procedure as input.
- The processing executed is the same as that of step S1703 described in the first embodiment.
- In step S2601, the failure analysis program 221 receives the expanded diagnostic procedure from the diagnosis execution program 224 and, based on the path list 1515 of the expanded diagnostic procedure, refers to the conclusion object 1504 of the expanded diagnostic procedure 2500 that the diagnosis execution program 224 referenced.
- In step S2602, the failure analysis program 221 searches for an expansion rule that has, as a condition element, the expanded diagnostic procedure ID of the expanded diagnostic procedure 2500 received from the diagnosis execution program 224. It then updates the reception flag 1164 of the condition element 2411 of the expansion rule 2400 according to the instruction stored in the Conclusion 1543 of the conclusion object 1504 referred to in step S2601.
- For example, the failure analysis program 221 finds the expansion rule 2400 that includes the ID of the expanded diagnostic procedure 2500 in a condition element, and updates to “1” the reception flag 1164 corresponding to the field 2411 of the condition element having “ExpandedDiagnosticProc10-1”.
- In step S2603, the failure analysis program 221 calculates the event reception rate of each expansion rule.
- In step S2604, the failure analysis program 221 activates the display program 225.
- The display program 225 updates the certainty factor of the cause candidate selected in step S1701 on the event analysis result screen 1800 based on the event reception rate calculated in step S2603.
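- To make the effect of the flag update concrete, the Python sketch below computes an event reception rate before and after the diagnostic condition element is marked as received. It assumes that the rate is the share of condition elements whose reception flag is 1; the source states only that a rate is calculated and that the certainty factor is updated from it. The event labels are hypothetical.

```python
def reception_rate(flags):
    """Share of condition elements whose reception flag is set (1)."""
    return sum(flags.values()) / len(flags) if flags else 0.0


# Condition elements of one expansion rule, keyed by a label, with 0/1 flags.
flags = {
    "event:ServerA/eth0/error": 1,         # hypothetical event conditions
    "event:ServerB/eth0/error": 1,
    "event:ServerC/eth0/error": 0,
    "diag:ExpandedDiagnosticProc10-1": 0,  # diagnostic condition (field 2411)
}
before = reception_rate(flags)

# Step S2602: the conclusion instructs setting the diagnostic element's flag.
flags["diag:ExpandedDiagnosticProc10-1"] = 1
after = reception_rate(flags)

print(before, after)
```

- Under this assumption, a diagnosis that confirms the candidate raises the rate, so the candidate would move up when the display is refreshed.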
- According to the second embodiment, by performing a related diagnosis on the cause candidate derived by the event analysis program and updating the certainty factor of the cause candidate based on the result, a more probable failure cause candidate can be presented to the administrator with priority. As a result, the administrator can quickly identify the cause of the failure.
- The present invention is not limited to these embodiments.
- For example, instead of or in addition to the meta-rule 1100 including the meta-diagnostic procedure ID and starting point of the meta-diagnostic procedure 1200 associated with that meta-rule 1100, the meta-diagnostic procedure 1200 may include the meta-rule ID of the meta-rule 1100 associated with that meta-diagnostic procedure 1200 and the starting point.
- The meta-rule 1100 and the meta-diagnostic procedure 1200 can also be associated in a many-to-many manner.
Claims (15)
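- A management system that performs cause analysis of one or more occurred events, which are one or more events that have occurred in one or more managed components among a plurality of managed components, the management system comprising:
a storage device; and
a processor connected to the storage device,
wherein the storage device stores configuration management information, a plurality of rules, and a plurality of general-purpose diagnostic procedures,
the configuration management information is information on the configuration of the plurality of managed components,
each of the plurality of rules is a rule indicating an association between one or more condition events and a conclusion event that is the cause when the one or more condition events occur,
each of the plurality of general-purpose diagnostic procedures is a general-purpose diagnostic procedure that is associated with one of the plurality of rules, is defined using one or more component types, and does not depend on any managed component, and
the processor
identifies one or more cause candidates based on one or more target rules, which are one or more rules, among the plurality of rules, with which one or more condition events related to the one or more occurred events are associated, and
identifies, among the plurality of general-purpose diagnostic procedures, the general-purpose diagnostic procedure associated with the target rule underlying a cause candidate selected from the one or more cause candidates, and generates, based on the identified general-purpose diagnostic procedure and the configuration management information, an expanded diagnostic procedure, which is a diagnostic procedure to be executed on one or more managed components and is for identifying a more specific cause of the selected cause candidate or for updating the likelihood of the selected cause candidate.
- The management system according to claim 1, wherein the processor displays information representing the generated expanded diagnostic procedure.
- The management system according to claim 1, wherein the processor generates the expanded diagnostic procedure for the identified general-purpose diagnostic procedure and a topology that is identified based on the configuration management information and that starts from a managed component targeted by one or more condition events in the one or more target rules or a managed component targeted by one or more conclusion events in the one or more target rules.
- The management system according to claim 1, wherein the processor generates the expanded diagnostic procedure based on information on the one or more occurred events in addition to the identified general-purpose diagnostic procedure and the configuration management information.
- The management system according to claim 1, wherein each of the plurality of general-purpose diagnostic procedures is a combination of one or more information collection definitions, one or more determination definitions, and a plurality of conclusion definitions,
each of the one or more information collection definitions represents information collection and the component type of the information collection source,
each of the one or more determination definitions represents making a determination based on collected information and corresponds, as a result of the determination, to at least one of at least one conclusion definition and at least one information collection definition,
each of the one or more conclusion definitions represents a conclusion, and
at least one determination definition is associated with at least one conclusion definition.
- The management system according to claim 5, wherein the expanded diagnostic procedure is generated by associating, based on the configuration management information, a managed component corresponding to a component type with that component type in the identified general-purpose diagnostic procedure, and
the processor determines a conclusion based on the expanded diagnostic procedure and displays the determined conclusion.
- The management system according to claim 1, wherein the processor uses the general-purpose diagnostic procedure associated with the target rule underlying the selected cause candidate as the basis for generating an expanded diagnostic procedure only when the proportion of condition events matching occurred events, among the one or more condition events associated with that target rule, is equal to or greater than a certain value.
- The management system according to claim 6, wherein the processor displays at least one of the executed definitions and the collected information.
- The management system according to claim 1, wherein the processor calculates a certainty factor for each of the one or more cause candidates based on the target rule underlying the selected cause candidate and the one or more occurred events, and
the processor selects a cause candidate to be diagnosed from among the one or more cause candidates based on the calculated one or more certainty factors.
- The management system according to claim 5, wherein the processor calculates a certainty factor for each of the one or more cause candidates based on the target rule underlying the selected cause candidate and the one or more occurred events,
some of the plurality of conclusion definitions represent updating the calculated certainty factor, and
the processor determines a conclusion based on the expanded diagnostic procedure and, if the determined conclusion is an update of the certainty factor, updates the certainty factor of the selected cause candidate.
- The management system according to claim 5, wherein the processor displays the expanded diagnostic procedure, thereafter accepts input of information representing the result of a determination represented by the expanded diagnostic procedure, and decides the definition to be executed based on the determination result represented by the accepted information.
- The management system according to claim 5, wherein the processor displays the expanded diagnostic procedure and thereafter displays, among the information collected based on the expanded diagnostic procedure, the information that satisfies the determination result.
- The management system according to claim 5, wherein the processor writes to the storage device at least one of (a) information collected in the execution of the expanded diagnostic procedure together with its collection time and (b) a determination result in the execution of the expanded diagnostic procedure together with its determination time, and, in the execution of another expanded diagnostic procedure, if the information collection or determination concerns the same managed component as the information or determination result written in the storage device and a certain time has not elapsed since the collection time or determination time written in the storage device, treats the information or determination result stored in the storage device as the collected information or determination result in the other expanded diagnostic procedure.
- A method for supporting cause analysis of one or more occurred events, which are one or more events that have occurred in one or more managed components among a plurality of managed components, the method comprising:
identifying one or more cause candidates based on one or more target rules, which are one or more rules with which one or more condition events related to the one or more occurred events are associated, among a plurality of rules each indicating an association between one or more condition events and a conclusion event that is the cause when the one or more condition events occur;
identifying, among a plurality of general-purpose diagnostic procedures each of which is associated with one of the plurality of rules, is defined using one or more component types, and does not depend on any managed component, the general-purpose diagnostic procedure associated with the target rule underlying a cause candidate selected from the one or more cause candidates; and
generating, based on the identified general-purpose diagnostic procedure and configuration management information, which is information on the configuration of the plurality of managed components, an expanded diagnostic procedure, which is a diagnostic procedure to be executed on one or more managed components and is for identifying a more specific cause of the selected cause candidate or for updating the likelihood of the selected cause candidate.
- A computer program for causing a computer to execute:
identifying one or more cause candidates based on one or more target rules, which are one or more rules with which one or more condition events related to the one or more occurred events are associated, among a plurality of rules each indicating an association between one or more condition events and a conclusion event that is the cause when the one or more condition events occur;
identifying, among a plurality of general-purpose diagnostic procedures each of which is associated with one of the plurality of rules, is defined using one or more component types, and does not depend on any managed component, the general-purpose diagnostic procedure associated with the target rule underlying a cause candidate selected from the one or more cause candidates; and
generating, based on the identified general-purpose diagnostic procedure and configuration management information, which is information on the configuration of a plurality of managed components, an expanded diagnostic procedure, which is a diagnostic procedure to be executed on one or more managed components and is for identifying a more specific cause of the selected cause candidate or for updating the likelihood of the selected cause candidate.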
Priority Applications (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2015550292A JP6208770B2 (ja) | 2013-11-29 | 2013-11-29 | イベントの根本原因の解析を支援する管理システム及び方法 |
GB1513880.3A GB2536317A (en) | 2013-11-29 | 2013-11-29 | Management system and method for assisting event root cause analysis |
DE112013006475.8T DE112013006475T5 (de) | 2013-11-29 | 2013-11-29 | Verwaltungssystem und Verfahren zur Unterstützung einer Analyse in Bezug auf eine Hauptursache eines Ereignisses |
CN201380070015.9A CN104903866B (zh) | 2013-11-29 | 2013-11-29 | 对事件根本原因的分析予以支援的管理系统以及方法 |
US14/765,988 US20150378805A1 (en) | 2013-11-29 | 2013-11-29 | Management system and method for supporting analysis of event root cause |
PCT/JP2013/082207 WO2015079564A1 (ja) | 2013-11-29 | 2013-11-29 | イベントの根本原因の解析を支援する管理システム及び方法 |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2013/082207 WO2015079564A1 (ja) | 2013-11-29 | 2013-11-29 | イベントの根本原因の解析を支援する管理システム及び方法 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2015079564A1 (ja) | 2015-06-04 |
Family
ID=53198550
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2013/082207 WO2015079564A1 (ja) | 2013-11-29 | 2013-11-29 | イベントの根本原因の解析を支援する管理システム及び方法 |
Country Status (6)
Country | Link |
---|---|
US (1) | US20150378805A1 (ja) |
JP (1) | JP6208770B2 (ja) |
CN (1) | CN104903866B (ja) |
DE (1) | DE112013006475T5 (ja) |
GB (1) | GB2536317A (ja) |
WO (1) | WO2015079564A1 (ja) |
Families Citing this family (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2015112150A1 (en) * | 2014-01-23 | 2015-07-30 | Hewlett-Packard Development Company, L.P. | Volume migration for a storage area network |
US10306490B2 (en) | 2016-01-20 | 2019-05-28 | Netscout Systems Texas, Llc | Multi KPI correlation in wireless protocols |
EP3403152B1 (en) | 2016-03-09 | 2023-07-12 | Siemens Aktiengesellschaft | Smart embedded control system for a field device of an automation system |
US11132620B2 (en) | 2017-04-20 | 2021-09-28 | Cisco Technology, Inc. | Root cause discovery engine |
JP2019009726A (ja) * | 2017-06-28 | 2019-01-17 | 株式会社日立製作所 | 障害切り分け方法および管理サーバ |
US11995518B2 (en) | 2017-12-20 | 2024-05-28 | AT&T Intellect al P Property I, L.P. | Machine learning model understanding as-a-service |
CN109905270B (zh) * | 2018-03-29 | 2021-09-14 | 华为技术有限公司 | 定位根因告警的方法、装置和计算机可读存储介质 |
US10977154B2 (en) * | 2018-08-03 | 2021-04-13 | Dynatrace Llc | Method and system for automatic real-time causality analysis of end user impacting system anomalies using causality rules and topological understanding of the system to effectively filter relevant monitoring data |
US10931542B2 (en) * | 2018-08-10 | 2021-02-23 | Futurewei Technologies, Inc. | Network embedded real time service level objective validation |
JP7221644B2 (ja) * | 2018-10-18 | 2023-02-14 | 株式会社日立製作所 | 機器故障診断支援システムおよび機器故障診断支援方法 |
US11520678B2 (en) * | 2020-02-24 | 2022-12-06 | International Business Machines Corporation | Set diagnostic parameters command |
US11169949B2 (en) | 2020-02-24 | 2021-11-09 | International Business Machines Corporation | Port descriptor configured for technological modifications |
US11327868B2 (en) | 2020-02-24 | 2022-05-10 | International Business Machines Corporation | Read diagnostic information command |
US11169946B2 (en) | 2020-02-24 | 2021-11-09 | International Business Machines Corporation | Commands to select a port descriptor of a specific version |
WO2021250873A1 (ja) * | 2020-06-12 | 2021-12-16 | 日本電信電話株式会社 | ルール生成装置、ルール生成方法およびプログラム |
US11329933B1 (en) * | 2020-12-28 | 2022-05-10 | Drift.com, Inc. | Persisting an AI-supported conversation across multiple channels |
JP2022170275A (ja) * | 2021-04-28 | 2022-11-10 | 富士通株式会社 | ネットワークマップ作成支援プログラム、情報処理装置およびネットワークマップ作成支援方法 |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH05114899A (ja) * | 1991-10-22 | 1993-05-07 | Hitachi Ltd | ネツトワーク障害診断方式 |
JP2010086115A (ja) * | 2008-09-30 | 2010-04-15 | Hitachi Ltd | イベント情報取得外のit装置を対象とする根本原因解析方法、装置、プログラム。 |
JP2011076293A (ja) * | 2009-09-30 | 2011-04-14 | Hitachi Ltd | 障害の根本原因解析結果表示方法、装置、及びシステム |
WO2012053104A1 (ja) * | 2010-10-22 | 2012-04-26 | 株式会社日立製作所 | 管理システム、及び管理方法 |
Family Cites Families (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7107185B1 (en) * | 1994-05-25 | 2006-09-12 | Emc Corporation | Apparatus and method for event correlation and problem reporting |
US6675315B1 (en) * | 2000-05-05 | 2004-01-06 | Oracle International Corp. | Diagnosing crashes in distributed computing systems |
CN1300694C (zh) * | 2003-06-08 | 2007-02-14 | 华为技术有限公司 | 基于故障树分析的系统故障定位方法及装置 |
US7434099B2 (en) * | 2004-06-21 | 2008-10-07 | Spirent Communications Of Rockville, Inc. | System and method for integrating multiple data sources into service-centric computer networking services diagnostic conclusions |
JP2006060762A (ja) * | 2004-07-21 | 2006-03-02 | Hitachi Communication Technologies Ltd | 無線通信システム、および、その診断方法、ならびに、無線通信システムの診断に用いる無線端末 |
CN100393048C (zh) * | 2006-01-13 | 2008-06-04 | 武汉大学 | 一种建立网络故障诊断规则库的方法 |
JP4873985B2 (ja) * | 2006-04-24 | 2012-02-08 | 三菱電機株式会社 | 設備機器用故障診断装置 |
US20090144214A1 (en) * | 2007-12-04 | 2009-06-04 | Aditya Desaraju | Data Processing System And Method |
US8112378B2 (en) * | 2008-06-17 | 2012-02-07 | Hitachi, Ltd. | Methods and systems for performing root cause analysis |
JP2011008375A (ja) * | 2009-06-24 | 2011-01-13 | Hitachi Ltd | 原因分析支援装置および原因分析支援方法 |
EP2455863A4 (en) * | 2009-07-16 | 2013-03-27 | Hitachi Ltd | MANAGEMENT SYSTEM FOR PROVIDING INFORMATION DESCRIBING A RECOVERY METHOD CORRESPONDING TO A FUNDAMENTAL CAUSE OF FAILURE |
CN101710359B (zh) * | 2009-11-03 | 2011-11-16 | 中国科学院计算技术研究所 | 一种集成电路故障诊断系统及方法 |
US8429455B2 (en) * | 2010-07-16 | 2013-04-23 | Hitachi, Ltd. | Computer system management method and management system |
US8819220B2 (en) * | 2010-09-09 | 2014-08-26 | Hitachi, Ltd. | Management method of computer system and management system |
JP5432867B2 (ja) * | 2010-09-09 | 2014-03-05 | 株式会社日立製作所 | 計算機システムの管理方法、及び管理システム |
US9065728B2 (en) * | 2011-03-03 | 2015-06-23 | Hitachi, Ltd. | Failure analysis device, and system and method for same |
JP5684946B2 (ja) * | 2012-03-23 | 2015-03-18 | 株式会社日立製作所 | イベントの根本原因の解析を支援する方法及びシステム |
US9667473B2 (en) * | 2013-02-28 | 2017-05-30 | International Business Machines Corporation | Recommending server management actions for information processing systems |
-
2013
- 2013-11-29 DE DE112013006475.8T patent/DE112013006475T5/de not_active Withdrawn
- 2013-11-29 JP JP2015550292A patent/JP6208770B2/ja active Active
- 2013-11-29 CN CN201380070015.9A patent/CN104903866B/zh not_active Expired - Fee Related
- 2013-11-29 US US14/765,988 patent/US20150378805A1/en not_active Abandoned
- 2013-11-29 GB GB1513880.3A patent/GB2536317A/en not_active Withdrawn
- 2013-11-29 WO PCT/JP2013/082207 patent/WO2015079564A1/ja active Application Filing
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2018528529A (ja) * | 2015-08-05 | 2018-09-27 | フェイスブック,インク. | コネクテッド・デバイスのルール・エンジン |
JP2018530035A (ja) * | 2015-08-13 | 2018-10-11 | ブル・エス・アー・エス | トポロジカルデータを用いたスーパーコンピュータのための監視システム |
US11436121B2 (en) | 2015-08-13 | 2022-09-06 | Bull Sas | Monitoring system for supercomputer using topological data |
WO2017051453A1 (ja) * | 2015-09-24 | 2017-03-30 | 株式会社日立製作所 | ストレージシステム、及び、ストレージシステムの管理方法 |
JP2017097879A (ja) * | 2015-11-24 | 2017-06-01 | 株式会社日立製作所 | クラウド環境における障害原因解析システムのルール検証のための方法及びシステム |
JP2021174459A (ja) * | 2020-04-30 | 2021-11-01 | Necプラットフォームズ株式会社 | 障害処理装置、障害処理方法及びコンピュータプログラム |
JP7007025B2 (ja) | 2020-04-30 | 2022-01-24 | Necプラットフォームズ株式会社 | 障害処理装置、障害処理方法及びコンピュータプログラム |
Also Published As
Publication number | Publication date |
---|---|
DE112013006475T5 (de) | 2015-10-08 |
JPWO2015079564A1 (ja) | 2017-03-16 |
CN104903866B (zh) | 2017-12-15 |
GB201513880D0 (en) | 2015-09-23 |
GB2536317A (en) | 2016-09-14 |
US20150378805A1 (en) | 2015-12-31 |
CN104903866A (zh) | 2015-09-09 |
JP6208770B2 (ja) | 2017-10-04 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 13898220; Country of ref document: EP; Kind code of ref document: A1 |
 | WWE | Wipo information: entry into national phase | Ref document number: 14765988; Country of ref document: US |
 | ENP | Entry into the national phase | Ref document number: 201513880; Country of ref document: GB; Kind code of ref document: A; Free format text: PCT FILING DATE = 20131129 |
 | WWE | Wipo information: entry into national phase | Ref document number: 1513880.3; Country of ref document: GB |
 | WWE | Wipo information: entry into national phase | Ref document number: 112013006475; Country of ref document: DE; Ref document number: 1120130064758; Country of ref document: DE |
 | ENP | Entry into the national phase | Ref document number: 2015550292; Country of ref document: JP; Kind code of ref document: A |
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 13898220; Country of ref document: EP; Kind code of ref document: A1 |
 | ENPC | Correction to former announcement of entry into national phase, pct application did not enter into the national phase | Ref country code: GB |