WO2016120989A1 - Management computer and rule test method - Google Patents

Management computer and rule test method

Info

Publication number
WO2016120989A1
Authority
WO
WIPO (PCT)
Prior art keywords
event
rule
information
configuration
rca
Prior art date
Application number
PCT/JP2015/052164
Other languages
French (fr)
Japanese (ja)
Inventor
淳 北脇
峰義 増田
裕 工藤
Original Assignee
株式会社日立製作所
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 株式会社日立製作所 filed Critical 株式会社日立製作所
Priority to PCT/JP2015/052164 priority Critical patent/WO2016120989A1/en
Publication of WO2016120989A1 publication Critical patent/WO2016120989A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/28Error detection; Error correction; Monitoring by checking the correct order of processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring

Definitions

  • the present invention relates to a test apparatus for testing a rule for monitoring a computer system.
  • RCA: Root Cause Analysis
  • the administrator can detect failures with RCA, but failures that cannot be detected with RCA also occur. Therefore, by creating a new rule or modifying an existing rule with RCA, it is possible to improve the accuracy of detecting a failure and construct a system that can respond to the failure more quickly.
  • To confirm that a new or modified rule can detect a failure that could not be detected before, the rule must be executed in the environment where the failure occurred.
  • However, because the configuration changes from moment to moment while the system is in operation, the exact same environment cannot be prepared.
  • testing new or modified rules on a running system can affect the services being offered. Therefore, a technique for preparing a test environment that is the same as the production environment separately from the production environment and testing the rules has been proposed.
  • Patent Document 1 describes an information processing system comprising: a work information storage unit that stores work information including one or more pieces of execution information, each including a command executed for program modification in a test environment and result information related to the execution result of that command; a program storage unit that stores a program in a production environment; a command execution unit that modifies the program stored in the program storage unit by executing each command included in the work information; a determination unit that determines whether the result of a command executed by the command execution unit matches the result information included in the same execution information as the command; and an output unit that outputs the determination result of the determination unit.
  • Patent Document 2 describes a system for determining an event by using only an event log accumulated in the past.
  • In a small environment, it is possible to physically construct a test environment identical to the production environment, but in a large and complex environment this is difficult.
  • In general, the test environment has a smaller configuration than the production environment and differs from it. Because the load on the system and the system behavior differ from those of the production environment, the absence of a problem in the test environment does not guarantee the absence of a problem in the production environment.
  • A typical example of the invention disclosed in the present application is as follows. That is, a management computer for testing a rule for monitoring a system composed of a plurality of devices comprises a storage unit and a processor that refers to the storage unit. The storage unit stores event information for managing failure events that occurred in the system in the past and configuration change events of the system, configuration information for managing the current configuration of the system, and a topology snapshot representing a past configuration of the system.
  • The management computer includes a configuration information reproducing unit that reconstructs the past configuration of the system using the configuration information and the configuration change events and records the reconstructed configuration in the topology snapshot, and a rule test unit that applies a cause analysis algorithm to the reconstructed past configuration, estimates the cause of the failure using the rule to be tested, and outputs whether the failure can be detected.
  • the same test environment as the production environment can be logically configured, and a new rule can be tested using event information generated in the production environment.
  • the embodiment of the present invention may be implemented by software running on a general-purpose computer, or may be implemented by dedicated hardware or a combination of software and hardware.
  • each information of the present invention will be described in a “table” format.
  • the information does not necessarily have to be expressed in a data structure by a table. It may be expressed in other ways. Therefore, “table”, “matrix”, “list”, “DB”, “queue”, and the like may be simply referred to as “information” in order to indicate that they do not depend on the data structure.
  • In the following description, processing may be described with a “program” as the subject (operation subject). Because a program is executed by a processor and performs its defined processing while using a memory and a communication port (communication control device), the description may also be made with the processor as the subject.
  • processing disclosed with the program as the subject may be processing performed by a computer such as a management server or an information processing apparatus. Part or all of the program may be realized by dedicated hardware, or may be modularized.
  • Various programs may be installed in each computer by a program distribution server or a storage medium.
  • FIG. 1 is a block diagram showing the configuration of the rule test apparatus 100 of the first embodiment and the relationship between the rule test apparatus 100, the test environment, and the production environment.
  • the test environment of the present embodiment has a rule test apparatus 100.
  • the rule test apparatus 100 is connected to the RCA server 500 through the test network 200.
  • the rule test apparatus 100 is connected to the RCA server 500 via the test network 200, but may be connected to the RCA server 500 via the management network 300.
  • the rule test apparatus 100 is connected to the terminal 800 via the test network 200.
  • the terminal 800 displays an input screen and an output screen, which will be described later, in accordance with instructions from the rule testing apparatus 100.
  • the rule test apparatus 100 includes a CPU 110, a memory 111, a communication device 112, an input device 113, an output device 114, a media reading device 115, and an auxiliary storage device 120.
  • the CPU 110 is a processor that performs various calculations by executing a program.
  • the memory 111 includes a ROM that is a nonvolatile storage element and a RAM that is a volatile storage element.
  • the ROM stores an immutable program (for example, BIOS).
  • The RAM is a high-speed, volatile storage element such as a DRAM (Dynamic Random Access Memory). It temporarily stores programs stored in the auxiliary storage device 120 and data used when the programs are executed, and serves as a work area.
  • the communication device 112 is a network interface device that controls communication with the RCA server 500 in accordance with a predetermined protocol.
  • the input device 113 is a user interface (for example, a keyboard, a mouse, etc.) for the user to input data and instructions to the rule testing device 100.
  • the output device 114 is a user interface (for example, a display device or a printer) for presenting the execution result of the program to the user.
  • The media reading device 115 is an interface device, such as a DVD drive, that reads data stored in a storage medium. Note that the media reading device 115 may not be provided.
  • the auxiliary storage device 120 is a large-capacity and nonvolatile storage device such as a magnetic storage device (HDD) or a flash memory (SSD).
  • the auxiliary storage device 120 stores a configuration information table 130, an event table 140, a co-occurrence waiting event table 150, an RCA rule table 160, a topology snapshot 170, and an RCA rule test result table 180.
  • the configuration information table 130 stores configuration information of the server 600 and the storage device 700 that are connected to and monitored by the RCA server 500 via the management network 300.
  • the event table 140 stores event information created based on performance information and configuration information collected from the server 600 and storage device 700 monitored by the RCA server 500 (see FIG. 6).
  • the co-occurrence wait event table 150 stores events within the retention period (see FIG. 7).
  • the RCA rule table 160 stores RCA rules held by the RCA server 500 (see FIG. 9).
  • the topology snapshot 170 is information that reproduces the configuration information of the server 600 and the storage apparatus 700 at an arbitrary time.
  • the RCA rule test result table 180 is a table in which RCA rules and information on events that have occurred are registered (see FIG. 12). A configuration example of each table 140, 150, 160, 180 stored in the auxiliary storage device 120 will be described later.
  • The memory 111 stores an input reception program 102 that performs external input processing, an output control program 103 that performs external output processing, an event management program 104 that processes event information retrieved from the event table 140, a rule test program 105 that tests RCA rules, and a configuration information reproduction program 107 that reproduces the configuration information of the production environment based on the configuration information extracted from the configuration information table 130. These programs are stored in the auxiliary storage device 120, read from the auxiliary storage device 120 at the time of execution, copied into the memory 111, and executed by the CPU 110.
  • the program executed by the CPU 110 is provided to the rule test apparatus 100 via a removable medium (CD-ROM, flash memory, etc.) or a network, and stored in the auxiliary storage device 120 which is a non-temporary storage medium.
  • The rule test apparatus 100 is a computer system configured on a single physical computer or on a plurality of logically or physically configured computers. Its programs may operate in separate threads on the same computer, or may operate on a virtual computer constructed on a plurality of physical computer resources.
  • the user changes the RCA rule provided by the RCA server 500 (S1).
  • the change of the RCA rule includes editing and addition of the RCA rule, new application of the RCA rule that has not been applied until now, and exemption of application of the RCA rule that has been applied until now.
  • the user determines the configuration of past data used for the rule test, and the rule testing apparatus 100 accepts the determined configuration of past data (S2).
  • the rule test apparatus 100 executes a test of the target RCA rule and presents the result to the user (S3).
  • the RCA rule editing (S1) performed by the user will be described later.
  • FIG. 3 is a detailed flowchart of the process of determining the configuration of past data used for the test (S2 in FIG. 2).
  • In the process of determining the configuration of past data used for the test (S2), it is first determined whether the change of the RCA rule by the user is the creation of a new rule or the improvement of an existing rule (S11). When it is the improvement of an existing rule, the rule test apparatus 100 makes an inquiry to the RCA server 500 and determines whether an RCA snapshot remains (S12). The RCA snapshot is a result of analyzing the root cause using the same RCA rule in the past. When the RCA server 500 holds the RCA snapshot, the process proceeds to step S14. On the other hand, when the RCA server 500 does not hold the RCA snapshot, the process proceeds to step S13.
  • In step S13, it is determined whether the conditional expression of the RCA rule newly created or edited by the user includes an already defined event. If the conditional expression does not include an already defined event, the process proceeds to step S17. If the conditional expression includes an already defined event, the process proceeds to step S14. In step S14, the corresponding event is added to the test candidates. For example, the event may be added to a test candidate event table temporarily created in the memory.
  • In step S15, the target system and the events that are insufficient in the conditional expression are presented to the user.
  • An event that is insufficient in the conditional expression of the RCA rule is an event that constitutes the conditional expression of the rule but for which no co-occurring event is found in the event table 140 between the start date and time and the end date and time.
  • The start date and time is the date and time obtained by subtracting the event holding time from the date and time when the first event occurred.
  • The end date and time is the date and time obtained by adding the event holding time to the date and time when the last event occurred.
  • The target system is a part of the system to be tested: specifically, a physical or logical unit that operates on the system, such as a VM (virtual machine), an HV (hypervisor), or a storage device. It is the unit to which the RCA rule is applied, and corresponds to a device type related to the events defined by the rule.
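The start and end date-and-time calculation for step S15 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the `(occurrence_time, event_type)` tuple layout, and the choice to derive the window from the rule events that were actually found are all assumptions made for the example.

```python
from datetime import datetime, timedelta

def co_occurrence_window(occurrence_times, holding_time):
    """Start = first occurrence minus holding time; end = last occurrence plus holding time."""
    start = min(occurrence_times) - holding_time
    end = max(occurrence_times) + holding_time
    return start, end

def missing_event_types(rule_event_types, events, holding_time):
    """Return the rule's event types that have no occurrence inside the window.

    events: list of (occurrence_time, event_type) tuples (hypothetical layout
    standing in for rows of the event table 140).
    """
    matched = [(t, e) for t, e in events if e in rule_event_types]
    if not matched:
        # Nothing in the rule was observed, so every event type is insufficient.
        return set(rule_event_types)
    start, end = co_occurrence_window([t for t, _ in matched], holding_time)
    seen = {e for t, e in events if start <= t <= end}
    return set(rule_event_types) - seen
```

The returned set corresponds to the events that would be presented to the user as insufficient, and that step S18 may later create as test events.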
  • After step S15, the user selects the RCA rule to be tested (S16). Then, the user is prompted to input whether to automatically create the events lacking for the test (S17). If the user chooses automatic creation, the process proceeds to step S18 to create test events. Otherwise, the process ends.
  • the test event is an event in which an event that is insufficient in the conditional expression presented in step S15 is added as a temporary event.
  • In step S18, for the n events constituting the rule to be applied, a test event is created at each of the following occurrence timings: (1) before the occurrence of the first event, (2) between the first event and the next event, and so on up to (n) between the (n−1)-th event and the last event, and (n+1) after the occurrence of the last event.
  • In step S18, the rule test program 105 creates the event with an event ID 141 whose value is unique in the event table 140, which is described later. As described above, the rule test program 105 sets the event occurrence time 142 to times that test the various orderings relative to the other events constituting the rule. The rule test program 105 sets, in the target device ID 143, the identification information of a device that is topologically related to the devices indicated by the remaining events, which are already held in the event table 140 and were correctly obtained from the production environment. Further, the rule test program 105 sets the target device type 144 to the value corresponding to the type obtained from the target device ID 143.
  • the rule test program 105 sets the value assigned at the time of rule creation as the event type 145.
  • The rule test program 105 also sets the failure flag 146 and the configuration change flag 147 to the values given by the user when creating the rule.
  • The user sets the values of the failure flag 146 and the configuration change flag 147 by specifying them in the rule file that is uploaded when uploading a rule (for example, on the input screen 10 of FIG. 10 described later).
  • the user may specify the values of the failure flag 146 and the configuration change flag 147 using a separately provided GUI.
  • the rule test program 105 sets a unique value that does not overlap with the co-occurrence wait event ID 151 of the co-occurrence wait event table 150 in the co-occurrence wait event ID 148.
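The occurrence timings used in step S18 can be illustrated with a short sketch. It assumes, purely for illustration, that a test event is placed a fixed `margin` before the first and after the last existing event and at the midpoint of each gap; the source only states that one timing is tried in each of these positions, not how the exact time is chosen.

```python
from datetime import datetime, timedelta

def candidate_timings(event_times, margin):
    """Return one candidate time per slot: before the first event, inside
    each gap between consecutive events, and after the last event."""
    if not event_times:
        raise ValueError("at least one existing event time is required")
    times = sorted(event_times)
    slots = [times[0] - margin]              # before the first event
    for earlier, later in zip(times, times[1:]):
        slots.append(earlier + (later - earlier) / 2)  # midpoint of the gap
    slots.append(times[-1] + margin)         # after the last event
    return slots
```

For k existing event times the function returns k + 1 candidate times, so the temporary event can be tested in every ordering relative to the events already obtained from the production environment.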
  • FIG. 4 is a detailed flowchart of the process for testing the RCA rule (S3 in FIG. 2).
  • the position of the event to start reading, the reading direction, and the reading period are determined (S21).
  • the position of the event at which reading is started is a line in the event table 140 where analysis of the RCA rule is started. A method of determining a line for starting the analysis of the RCA rule will be described later.
  • the reading direction indicates whether reading is to be performed upward or downward from the position of the event at which reading of the event table 140 is started. That is, whether the event is read back in time or the event is read as time elapses.
  • the reading period 84 (see FIG. 13) is a time for reading an event from the position where the reading of the event is started.
  • a set of failure events to which the RCA rules should be applied can be obtained by reading events that occurred during the reading period from the position of the event at which reading is started in the reading direction.
  • FIG. 13 is a diagram showing a method for determining a line for starting the analysis of the RCA rule.
  • Setting the data backup point (81a or 81b) closest to the failure event occurrence time 82 to be tested as the line where the analysis of the RCA rule is started, and extracting event information from there, can speed up processing. Further, the number of events 85 that must be analyzed to reach the failure event occurrence time 82 to be tested may be smaller from the data backup time 81b, which is newer than the failure event occurrence time 82, than from the data backup time 81a, which is older. In this case, the event information may be extracted by setting the newer data backup time 81b as the line where the analysis of the RCA rule is started.
  • The event reading start time and reading direction are determined using the relationship between the backup point 81, the event holding time 83, the failure event occurrence time 82, and the number of events 85 to be analyzed.
  • In this way, the processing time for reproducing the configuration can be shortened.
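The selection of the starting backup point and reading direction can be sketched as below. This is a simplified illustration: times are plain numbers, the event holding time 83 is ignored, and `choose_start` is a hypothetical name, not one used in the source.

```python
from bisect import bisect_left, bisect_right

def choose_start(backup_times, event_times, target_time):
    """Pick the backup point with the fewest events between it and the
    failure event occurrence time, and the direction to read from it."""
    events = sorted(event_times)
    best = None
    for backup in backup_times:
        lo, hi = min(backup, target_time), max(backup, target_time)
        # Number of events whose time lies in the closed interval [lo, hi].
        count = bisect_right(events, hi) - bisect_left(events, lo)
        direction = "forward" if backup <= target_time else "backward"
        if best is None or count < best[0]:
            best = (count, backup, direction)
    return best[1], best[2]
```

With backups at times 0 and 100, events at 10, 20, 30, 40, and 95, and a failure at time 90, the newer backup is chosen and events are read backward, because only one event separates it from the failure.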
  • The configuration information reproduction program 107 creates a base of the topology information necessary for RCA analysis (S22). Specifically, the configuration information reproduction program 107 selects, from the backups of the topology information held by the RCA server 500, the backup time with the smallest number of events between it and the position of the event at which reading starts, and applies the configuration change events one by one up to that event position to create the base of the topology information. When going back in time, the configuration information reproduction program 107 applies the reverse operation of each configuration change event one by one to update the configuration information and create the topology information base. The created topology information base is stored in the topology snapshot 170.
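A minimal sketch of replaying configuration change events forward or in reverse is shown below. The topology model (a mapping from a VM to its host) and the `(vm, old_host, new_host)` event layout are assumptions for illustration; the source does not specify how configuration change events are encoded.

```python
def replay(topology, change_events, backward=False):
    """Return a new topology after applying change events.

    topology: dict mapping a VM name to its host (hypothetical model).
    change_events: chronologically ordered (vm, old_host, new_host) tuples.
    backward: if True, undo the events from newest to oldest.
    """
    snapshot = dict(topology)  # leave the backup itself untouched
    ordered = reversed(change_events) if backward else change_events
    for vm, old_host, new_host in ordered:
        # The reverse operation of a move restores the previous host.
        snapshot[vm] = old_host if backward else new_host
    return snapshot
```

Recording both the old and the new value in each event is what makes the reverse operation possible; an event that stored only the new host could be applied forward but not undone.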
  • Steps S23 to S28 are processes in a loop for analyzing events one by one.
  • the RCA rule test process S3 compares the event reading period set in step S21 with the event occurrence time 142 of the event table 140. If the event occurrence time 142 of the read event is outside the event reading period, it is not necessary to read further events from the event table 140, and the process proceeds to step S29. If the event occurrence time 142 of the read event is within the event reading period, the event needs to be analyzed, and the process proceeds to step S25 (S24).
  • In step S25, the failure flag 146 of the read event is evaluated. If the failure flag 146 is ON, the failure event process S26 is executed. On the other hand, if the failure flag 146 is OFF, the process proceeds to step S27 (S25). Details of the processing in step S26 will be described later.
  • In step S27, the configuration change flag 147 of the read event is evaluated. If the configuration change flag 147 is ON, the configuration change event process S28 is executed. On the other hand, if the configuration change flag 147 is OFF, the current iteration of the loop ends and the process returns to step S23 (S27). Details of the processing in step S28 will be described later.
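The loop of steps S23 to S28 can be sketched as follows. The dictionary keys and the two callbacks are hypothetical stand-ins for the event table 140 columns and for the failure event process (S26) and configuration change event process (S28); the sketch assumes events are already ordered in the reading direction.

```python
def run_test_loop(events, period_start, period_end, on_failure, on_config_change):
    """Read events one by one and dispatch them by flag.

    Returns the number of events processed before leaving the reading period.
    """
    processed = 0
    for event in events:                       # S23: read the next event
        if not (period_start <= event["time"] <= period_end):
            break                              # S24: outside the reading period
        if event["failure"]:                   # S25: failure flag check
            on_failure(event)                  # S26: failure event process
        if event["config_change"]:             # S27: configuration change flag check
            on_config_change(event)            # S28: configuration change event process
        processed += 1
    return processed
```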
  • FIG. 5 is a detailed flowchart of the failure event process (S26 in FIG. 4).
  • In the failure event process S26, it is determined whether the occurrence time of the failure event is within the event holding period 83 (S31). If the occurrence time of the failure event is within the event holding period of the RCA, the corresponding information in the event table 140 is registered in the co-occurrence waiting event table 150 (S32). Specifically, the value of the event occurrence time 142 in the event table 140 is registered in the occurrence time 152 of the co-occurrence waiting event table 150, the value of the target device ID 143 is registered in the node ID 153, and the value of the event type 145 is registered in the event type 154. Then, the value of the co-occurrence wait event ID 151 of the co-occurrence wait event table 150 is registered in the co-occurrence wait event ID 148 of the event table 140.
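The registration in steps S31 and S32 can be sketched as below. It assumes that "within the event holding period" means within the holding time of a reference time such as the failure event under test, uses numeric times, and represents both tables as lists of dictionaries; these choices and all names are illustrative only.

```python
def register_if_held(event, reference_time, holding_time, wait_table):
    """Copy an event into the co-occurrence wait table if it is still held.

    Returns the assigned wait event ID, or None if the event is outside the
    holding period and is therefore not registered.
    """
    if abs(event["time"] - reference_time) > holding_time:   # S31
        return None
    wait_id = max((row["id"] for row in wait_table), default=0) + 1
    wait_table.append({                                       # S32
        "id": wait_id,                      # co-occurrence wait event ID 151
        "time": event["time"],              # occurrence time 152
        "node_id": event["node_id"],        # node ID 153
        "event_type": event["event_type"],  # event type 154
    })
    event["wait_id"] = wait_id              # back-reference (ID 148)
    return wait_id
```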
  • FIG. 6 is a diagram illustrating a configuration example of the event table 140.
  • the event table 140 is a table for storing events.
  • The event table 140 includes an event ID 141, an event occurrence time 142, a node ID 143, a node type 144, an event type 145, a failure flag 146, a configuration change flag 147, and a co-occurrence wait event ID 148.
  • the event ID 141 is identification information for uniquely identifying an event.
  • the event occurrence time 142 is the occurrence time of the event.
  • the node ID 143 is identification information of the device that is the target of the event.
  • the node type 144 is the type of the device that is the target of the event. For example, “HV” indicating that the device is a hypervisor and “VM” indicating that the device is a virtual machine are set. In addition to the illustrated types, there may be types representing storage devices, fiber channel switches, network switches, and routers.
  • the event type 145 is identification information representing the event type.
  • the failure flag 146 is a flag indicating whether or not the event is a failure event.
  • the configuration change flag 147 is a flag indicating whether the event is a configuration change event.
  • the co-occurrence wait event ID 148 is identification information for uniquely identifying the corresponding event in the co-occurrence wait event table 150.
  • The rule test apparatus 100 receives update information from the event database of the RCA server 500 at the timing when that database is updated, and synchronizes the event table 140 with the event database of the RCA server 500.
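One row of the event table 140 could be modeled as follows. The field names, the Python types, and the sample values are assumptions for illustration; the source defines only the meaning of each column.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class EventRow:
    event_id: str                        # event ID 141
    occurred_at: datetime                # event occurrence time 142
    node_id: str                         # node ID 143
    node_type: str                       # node type 144, e.g. "HV" or "VM"
    event_type: str                      # event type 145
    is_failure: bool                     # failure flag 146
    is_config_change: bool               # configuration change flag 147
    wait_event_id: Optional[str] = None  # co-occurrence wait event ID 148

# Hypothetical sample row: a failure event on a hypervisor.
row = EventRow("E001", datetime(2015, 1, 27, 10, 0), "hv-01", "HV",
               "CPU_OVERLOAD", True, False)
```

The co-occurrence wait event ID defaults to `None` here because it is filled in only when the event is registered in the co-occurrence wait event table 150.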
  • FIG. 7 is a diagram showing a configuration example of the RCA co-occurrence waiting event table 150.
  • the co-occurrence wait event table 150 includes a co-occurrence wait event ID 151, a time 152, a node ID 153, and an event type 154.
  • the co-occurrence wait event ID 151 is identification information for uniquely identifying the co-occurrence wait event.
  • Time 152 is the time when the co-occurrence waiting event occurs.
  • the node ID 153 is identification information of the device that is the target of the event.
  • the event type 154 is identification information representing the event type.
  • FIG. 8 is a detailed flowchart of the configuration change event process (S28 in FIG. 4).
  • the configuration information of the topology snapshot 170 is changed according to the contents of the configuration change event (S41).
  • Next, it is determined whether the topology has changed (S42). A change in the topology means that the relationship between devices changes, whether physical or virtual; for example, a virtual server operating on a physical server moves to another physical server. If there is a change in the topology, the co-occurrence wait event table 150 is reset (S43). On the other hand, if there is no change in the topology, the process is terminated.
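The configuration change event process (S41 to S43) can be sketched as below. The snapshot layout (a `relations` mapping for device relationships and an `attributes` mapping for everything else) and the two change kinds are hypothetical; they exist only to distinguish a change that alters the topology from one that does not.

```python
def process_config_change(snapshot, wait_table, change):
    """Apply one configuration change and reset waiting events if needed.

    Returns True when the relationships between devices changed.
    """
    before = dict(snapshot["relations"])
    if change["kind"] == "move":             # S41: e.g. a VM moves to another host
        snapshot["relations"][change["node"]] = change["to"]
    elif change["kind"] == "attribute":      # S41: a change that keeps relations intact
        attrs = snapshot["attributes"].setdefault(change["node"], {})
        attrs[change["key"]] = change["value"]
    topology_changed = snapshot["relations"] != before   # S42
    if topology_changed:
        wait_table.clear()                   # S43: reset the co-occurrence wait table
    return topology_changed
```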
  • FIG. 9 is a diagram illustrating a configuration example of the RCA rule table 160.
  • the RCA rule table 160 is generated based on the RCA rule copied from the RCA server 500, and is updated when the rule is edited on the rule test apparatus 100.
  • The RCA rule table 160 includes a rule ID 161 for uniquely identifying a rule, a rule usage state 162 indicating whether the user is using the rule, a device type 163 related to the rule, and a rule content 164.
  • the failure event is analyzed based on the RCA rule (S29). Specifically, first, the information of the topology snapshot 170 is fetched. Thereafter, the RCA rule is applied to the remaining co-occurrence waiting event. By applying the same processing algorithm as that of the RCA server 500 to the RCA rule, the same root cause location as that in the production environment is estimated. If there is an event that conforms to the RCA rule, the RCA rule and information on the event that has occurred are registered in the RCA rule test result table 180.
  • An event occurring at the corresponding occurrence time in the event table 140 is compared with the RCA rule test result table 180, and each event is classified as an event that occurred in both the RCA server 500 and the rule testing apparatus 100, an event that occurred in the rule testing apparatus 100 but did not occur in the RCA server 500, or an event that occurred in the RCA server 500 but did not occur in the rule testing apparatus 100.
  • the event change information is represented as characters, but may be displayed by another method such as changing the color of the displayed event or changing the frame line.
  • FIG. 12 is a diagram illustrating a configuration example of the RCA rule test result table 180.
  • The RCA rule test result table 180 is obtained by adding the event change information and the ID of the applied RCA rule to the contents of the event table 140, and deleting the failure flag 146, the configuration change flag 147, and the ID 148 of the corresponding event in the co-occurrence waiting event table 150.
  • the RCA rule test result table 180 includes event change information 181, RCA rule ID 182, event occurrence time 183, target device ID 184, target device type 185, and ID 186 representing an event.
  • The event change information 181 is information for discriminating among events that exist in the event table 140 but not in the RCA rule test result table 180 (“-” in this embodiment), events that exist in both the event table 140 and the RCA rule test result table 180 (blank in this embodiment), and events that do not exist in the event table 140 but exist in the RCA rule test result table 180 (“+” in this embodiment).
  • the RCA rule ID 182 is information for uniquely specifying the RCA rule.
  • the event occurrence time 183 is the time when the event occurred.
  • the target device ID 184 is identification information of a device that is an event generation source.
  • the target device type 185 is the type of the device that is the source of the event.
  • ID 186 representing an event is information for uniquely identifying the event type.
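The three values of the event change information 181 amount to a set comparison between the events recorded in production and those produced by the rule test. A minimal sketch, assuming each event can be represented by a hashable key such as an `(occurrence_time, device_id, event_type)` tuple (an assumption for this example):

```python
def event_change_info(production_events, test_events):
    """Map each event key to "-", "" or "+".

    "-": in the event table but not in the test result
    "" : in both
    "+": in the test result but not in the event table
    """
    production, tested = set(production_events), set(test_events)
    result = {}
    for key in production | tested:
        if key in production and key in tested:
            result[key] = ""
        elif key in tested:
            result[key] = "+"
        else:
            result[key] = "-"
    return result
```

A "+" entry indicates an event that the rule under test detected but the production RCA server did not, which is the outcome a newly created or improved rule is meant to produce.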
  • FIG. 10 is a diagram illustrating an example of the input screen 10 displayed on the terminal 800 by the output control program 103.
  • the input screen 10 illustrated in FIG. 10 is a screen configured as a Web application, but may be a screen configured by a native application executed on the rule test apparatus 100.
  • The input screen 10 includes a title bar 11 that displays a GUI name, an address bar 12 that displays a URL (Uniform Resource Locator), an event list area 13 that displays events read from the event database of the RCA server 500, a rule list area 21 that displays RCA rules read from the RCA server 500, a “details” button 31 used to determine details of the RCA rule test method, and an “Apply” button 41 used to actually apply the RCA rule to the events.
  • the address bar 12 may not be provided.
  • The event list area 13 includes fields for displaying an event message 16, an event occurrence time 17, a device (source) 18 in which the event has occurred, and a device type 19 of the device in which the event has occurred.
  • Event information stored in the event table 140 is set in the event list area 13.
  • In the message 16, the name of the event corresponding to the event type 145 is set.
  • For the message 16, a table that associates event names with identification information may be prepared so that events are easy for a person to understand, and the name obtained by converting the identification information using the table may be set; alternatively, the event identification information stored in the event type 145 may be set directly.
  • In the occurrence time 17, the value stored in the event occurrence time 142 of the event table 140 is set.
  • In the source 18, the name of the device in which the event has occurred, which is identified by the node ID 143, is set.
  • For the source 18, a table that associates device names with identification information may be prepared so that devices are easy for a person to understand, and the name obtained by converting the identification information using the table may be set; alternatively, the identification information of the target device stored in the node ID 143 may be set directly.
  • In the device type 19, the value of the type (node type 144) of the device in which the event has occurred is set.
  • the rule list area 21 includes fields for displaying a rule usage state 22, a rule name 23, a target 24 to which the rule is applied, and a rule detail 25 indicating the details of the rule. Further, the rule list area 21 includes an “upload rule” button 26 that is used when a new rule that has not been read in the RCA rule table 160 is uploaded to the rule testing apparatus 100 and an existing rule that is read in the RCA rule table 160. And a “delete rule” button 27 used when deleting the rule.
  • In the rule usage state 22, the value of the rule usage state 162 of the RCA rule table 160 is set.
  • In the rule name 23, the name of the rule corresponding to the rule ID 161 is set.
  • For the rule name 23, a table that associates rule names with identification information may be prepared so that rules are easy for a person to understand, and the name obtained by converting the identification information using the table may be set; alternatively, the identification information stored in the rule ID 161 may be set directly.
  • In the target 24, the value of the device type 163 related to the rule is set.
  • In the rule detail 25, the rule content 164 of the currently selected rule is set.
  • the field of the rule details 25 has a function of editing text, and the user can edit the rule.
  • When the “upload rule” button 26 is operated, a screen for selecting a file is displayed, and the user can select a rule definition file to be added to the rule list.
  • When the “delete rule” button 27 is operated, the user can delete the currently selected rule from the rule list.
  • the operations using the rule details 25, the “upload rule” button 26, and the “delete rule” button 27 correspond to the editing of the RCA rule (S1 in FIG. 2).
  • FIG. 11 shows an example of the output screen 50 displayed on the terminal 800 by the output control program 103.
  • the output screen 50 illustrated in FIG. 11 is a screen configured as a Web application similarly to the input screen 10, but may be a screen configured by a native application executed on the rule testing apparatus 100.
  • The output screen 50 includes a title bar 51 that displays the GUI name, an address bar 52 that displays the URL, an RCA rule application result area 53 that displays a list of events obtained as the test result of the RCA rule, and a “return to input screen” button 61. As with the input screen 10, the address bar 52 may be omitted.
  • The RCA rule application result area 53 includes event change information 54, an applied rule ID 55, an event message 56, an event occurrence time 57, a device (source) 58 in which the event occurred, and a type 59 of the device in which the event occurred. In the RCA rule application result area 53, the information stored in the RCA rule test result table 180, which extends the event table 140, is displayed in time series.
  • the event change information 54 displays the value of event change information in the RCA rule test result table 180.
  • The applied rule ID 55 displays the name of the rule.
  • For the applied rule ID 55, a table associating rule names with identification information may be prepared so that rules are easy for a person to identify, and the name obtained by converting the identification information using the table may be displayed.
  • Alternatively, the identification information may be displayed as it is.
  • The message 56 displays a character string corresponding to the value of the event ID 145 representing the event.
  • For the message 56, a table associating event names with identification information may be prepared so that events are easy for a person to identify, and the name obtained by converting the identification information using the table may be displayed.
  • Alternatively, the identification information (event ID 145) may be displayed as it is.
  • The source 58 displays the name of the device in which the event occurred, that is, the name of the target device in the event table 140.
  • For the source 58, a table associating device names with identification information may be prepared so that devices are easy for a person to identify, and the name obtained by converting the identification information using the table may be displayed. Alternatively, the identification information may be displayed as it is.
  • The source device type 59 displays the value of the device type 144 of the device in which the event occurred.
  • Before an RCA rule is introduced into the production environment, it can be confirmed whether the rule detects the expected failure event.
  • Because an event occurring within a predetermined period is treated as an event causing a failure, the event causing the failure can be selected accurately with simple processing.
  • The root cause is estimated by RCA when an event occurs, but it may instead be estimated when a change in configuration information is detected.
  • FIG. 14 is a block diagram illustrating the configuration of the rule test apparatus 100 according to the second embodiment and the relationship between the rule test apparatus 100, the test environment, and the production environment.
  • the test environment of the second embodiment has the rule test apparatus 100 as in the test environment of the first embodiment.
  • The rule test apparatus 100 of the second embodiment has almost the same configuration as the rule test apparatus 100 of the first embodiment, but additionally stores the RCA rule history management program 108 in the memory 111 and the rule history table 190 in the auxiliary storage device 120.
  • When an RCA rule used on the RCA server 500 is changed (added, deleted, or edited), the RCA server 500 receives a change event transmitted from the database holding the rule. As a result, the RCA rule history management program 108 of the rule testing apparatus 100 records the rule change history in the rule history table 190.
  • FIG. 15 is a diagram illustrating a configuration example of the rule history table 190.
  • The rule history table 190 is a table for storing the RCA rule change history, and includes the time 191 at which the RCA rule was changed, the identification information (rule ID) 192 of the changed RCA rule, and the change type 193, such as addition, deletion, or editing.
  • FIG. 16 shows an example of the output screen 50 displayed on the terminal 800 by the output control program 103.
  • the output screen 50 illustrated in FIG. 16 is a screen configured as a Web application similarly to the input screen 10, but may be a screen configured by a native application executed on the rule testing apparatus 100.
  • the same components as those of the output screen 50 of the first embodiment are denoted by the same reference numerals, description thereof is omitted, and only different portions are described.
  • In the RCA rule application result area 53, failure events and RCA rule change events, which are not failures, are displayed in time series. For this reason, an event type 71 is provided in the RCA rule application result area 53 in place of the event change information 54.
  • The event type 71 displays a character (“!” in this embodiment) indicating that the event is not a failure.
  • The ID of the changed RCA rule is displayed in the applied rule ID 55, and the content of the change to the RCA rule (in this example, “RCA rule added”) is displayed in the message 56.
  • The time at which the RCA rule was changed is displayed in the time 57. Because the source 58 and the device type 59 have no corresponding information, an empty character string is displayed for each. In this embodiment, a server resource shortage event occurs after the RCA rule change event, followed by a server down event. If only the correct RCA rule had been applied, only the resource shortage event would be displayed as the root cause, allowing the user to quickly understand the cause and deal with the problem.
  • the RCA rule changed before the time when the failure event occurred is displayed along with the failure event based on the past event information and the RCA rule change information.
  • the user can be given information to determine whether the reason why the event could not be successfully applied is related to a change in the RCA rules.
  • the present invention is not limited to the above-described embodiments, and includes various modifications and equivalent configurations within the scope of the appended claims.
  • the above-described embodiments have been described in detail for easy understanding of the present invention, and the present invention is not necessarily limited to those having all the configurations described.
  • a part of the configuration of one embodiment may be replaced with the configuration of another embodiment.
  • another configuration may be added, deleted, or replaced.
  • Each of the above-described configurations, functions, processing units, processing means, and the like may be realized in hardware by designing some or all of them as, for example, an integrated circuit, or may be realized in software by having a processor interpret and execute a program that realizes each function.
  • Information such as programs, tables, and files that realize each function can be stored in a storage device such as a memory, a hard disk, and an SSD (Solid State Drive), or a recording medium such as an IC card, an SD card, and a DVD.
  • The control lines and information lines shown are those considered necessary for the explanation; not all control lines and information lines required in an implementation are necessarily shown. In practice, almost all of the components may be considered to be interconnected.
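The time-series display described for the second embodiment, in which failure events are interleaved with RCA rule change events, can be sketched roughly as follows. This is only an illustration: the dictionary keys, the `DisplayRow` type, the `build_timeline` function, and the use of an empty event type for failure rows are assumptions made here, not names or behavior taken from the specification.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class DisplayRow:
    event_type: str   # "!" marks a non-failure row (rule change); "" is assumed for failures
    rule_id: str      # applied rule ID 55
    message: str      # message 56
    time: datetime    # time 57
    source: str       # source 58 (empty for rule changes)
    device_type: str  # device type 59 (empty for rule changes)


def build_timeline(failure_events, rule_changes):
    """Merge failure events and rule history rows into one list ordered by time."""
    rows = [
        DisplayRow("", ev["rule_id"], ev["message"], ev["time"],
                   ev["source"], ev["device_type"])
        for ev in failure_events
    ]
    for ch in rule_changes:
        # Rule changes have no source or device type, so empty strings are shown.
        rows.append(DisplayRow("!", ch["rule_id"],
                               f"RCA rule {ch['change_type']}", ch["time"], "", ""))
    return sorted(rows, key=lambda row: row.time)
```

With a rule change recorded shortly before two failure events, the rule change row appears first, which is the ordering that lets a user relate a failed detection to a recent rule change.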

Abstract

A management computer for testing a rule for monitoring a system comprising a plurality of devices, wherein a storage unit of said management computer stores event information for managing past failure events and configuration change events in the system, configuration information for managing the current configuration of the system, and a topology snapshot representing a past configuration of the system, and wherein said management computer is provided with: a configuration information reproduction unit which reproduces a past configuration of the system by use of the aforementioned configuration information and information about the configuration change events, and causes the reproduced past configuration to be reflected in the topology snapshot; and a rule testing unit which applies a cause analysis algorithm to the reproduced past configuration, deduces the cause of a failure using a rule to be tested, and outputs a determination as to whether the rule to be tested can be used to detect the failure.

Description

Management computer and rule test method
The present invention relates to a test apparatus for testing a rule for monitoring a computer system.
Because a large-scale environment such as a data center has many nodes to manage, failures are in many cases monitored using RCA (Root Cause Analysis). Using RCA makes it possible to find failures, or to detect them easily.
An administrator can detect failures with RCA, but failures that RCA cannot detect also occur. By creating a new RCA rule or modifying an existing one, the accuracy of failure detection can be improved, and a system that responds to failures more quickly can be built.
To determine whether a new or modified rule can detect a failure that previously went undetected, the rule must be executed in the environment in which the failure occurred. However, because the configuration changes from moment to moment while the system is running, exactly the same environment cannot be prepared. In addition, testing a new or modified rule on a running system may affect the services being provided. Techniques have therefore been proposed in which a test environment identical to the production environment is prepared separately from the production environment and used to test rules.
Background art for the present technology includes JP 2012-185599 A (Patent Document 1) and JP 2010-231568 A (Patent Document 2). Patent Document 1 describes an information processing system comprising: a work information storage unit that stores work information having one or more pieces of execution information, each of which has a command executed for program modification in a test environment and result information, which is information about the execution result of the command; a program storage unit that stores a program in a production environment; a command execution unit that modifies the program stored in the program storage unit by executing each command included in the work information; a determination unit that determines whether the result of executing a command by the command execution unit is consistent with the result information included in the same execution information as the command; and an output unit that outputs the result of the determination by the determination unit. Patent Document 2 describes a system that determines events by using only event logs accumulated in the past.
Patent Document 1: JP 2012-185599 A
Patent Document 2: JP 2010-231568 A
In a small-scale environment it is possible to physically build a test environment identical to the production environment, but in a large-scale, complex environment this is difficult. In general, a test environment has a smaller configuration than the production environment, and both the load on the system and the behavior of the system differ from those of the production environment. The absence of problems in the test environment therefore does not guarantee the absence of problems in the production environment.
Consequently, in a large-scale, complex environment, preparing an environment in which to test whether a new or modified rule can detect a failure that previously went undetected is a technical challenge.
In addition, a rule for a failure that was never recorded as an event in the past cannot be tested using past events.
A representative example of the invention disclosed in the present application is as follows. It is a management computer for testing a rule for monitoring a system composed of a plurality of devices, the management computer comprising a storage unit and a processor that refers to the storage unit. The storage unit holds event information for managing failure events that occurred in the system in the past and configuration change events of the system, configuration information for managing the current configuration of the system, and a topology snapshot representing a past configuration of the system. The management computer comprises a configuration information reproduction unit that reproduces a past configuration of the system using the configuration information and the configuration change events and records it in the topology snapshot, and a rule test unit that applies a cause analysis algorithm to the reproduced past configuration, estimates the cause of a failure using the rule under test, and outputs whether the failure can be detected.
According to one aspect of the present invention, a test environment identical to the production environment can be configured logically, and a new rule can be tested using event information generated in the production environment. Problems, configurations, and effects other than those described above will become apparent from the following description of the embodiments.
FIG. 1 is a block diagram showing the configuration of the rule test apparatus of the first embodiment of the present invention and the relationship between the rule test apparatus, the test environment, and the production environment.
FIG. 2 is a flowchart of the rule test operation in the rule test apparatus of the first embodiment of the present invention.
FIG. 3 is a detailed flowchart of the process of determining the configuration of past data used for the test in the first embodiment of the present invention.
FIG. 4 is a detailed flowchart of the process of testing an RCA rule in the first embodiment of the present invention.
FIG. 5 is a detailed flowchart of the failure event processing of the first embodiment of the present invention.
FIG. 6 is a diagram showing a configuration example of the event table of the first embodiment of the present invention.
FIG. 7 is a diagram showing a configuration example of the RCA co-occurrence waiting event table of the first embodiment of the present invention.
FIG. 8 is a detailed flowchart of the configuration event processing of the first embodiment of the present invention.
FIG. 9 is a diagram showing a configuration example of the RCA rule table of the first embodiment of the present invention.
FIG. 10 is a diagram showing an example of the input screen of the first embodiment of the present invention.
FIG. 11 is a diagram showing an example of the output screen of the first embodiment of the present invention.
FIG. 12 is a diagram showing a configuration example of the RCA rule test result table of the first embodiment of the present invention.
FIG. 13 is a diagram showing a method of determining the row at which analysis of an RCA rule starts in the first embodiment of the present invention.
FIG. 14 is a block diagram showing the configuration of the rule test apparatus of the second embodiment of the present invention and the relationship between the rule test apparatus, the test environment, and the production environment.
FIG. 15 is a diagram showing a configuration example of the rule history table of the second embodiment of the present invention.
FIG. 16 is a diagram showing an example of the output screen of the second embodiment of the present invention.
Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings. In the accompanying drawings, functionally identical elements may be denoted by the same numbers. The accompanying drawings show specific embodiments and implementation examples in accordance with the principle of the present invention, but they are provided to aid understanding of the present invention and must not be used to interpret the present invention restrictively.
The embodiments are described in sufficient detail for those skilled in the art to practice the present invention, but it should be understood that other implementations and forms are possible, and that the configuration and structure can be changed and various elements can be replaced without departing from the scope and spirit of the technical idea of the present invention. The following description should therefore not be interpreted restrictively.
Furthermore, as described later, the embodiments of the present invention may be implemented as software running on a general-purpose computer, as dedicated hardware, or as a combination of software and hardware.
In the following description, each piece of information of the present invention is described in a “table” format. However, the information need not be expressed as a table data structure, and may be expressed as a data structure such as a matrix, list, DB, or queue, or in some other form. To indicate that the information does not depend on the data structure, a “table”, “matrix”, “list”, “DB”, “queue”, or the like may be referred to simply as “information”.
When describing the content of each piece of information, the expressions “identification information”, “identifier”, “name”, and “ID” may be used, and these are interchangeable.
In the following, each process in the embodiments of the present invention is described with a “program” as the subject (the acting entity). Because a program, when executed by a processor, performs predetermined processing using memory and a communication port (communication control device), the description could equally take the processor as the subject. A process disclosed with a program as the subject may also be regarded as a process performed by a computer such as a management server or by an information processing apparatus. Part or all of a program may be realized by dedicated hardware, and programs may be modularized. The various programs may be installed on each computer from a program distribution server or a storage medium.
<First embodiment>
First, the test environment of the first embodiment will be described with reference to FIGS. 1 to 13.
FIG. 1 is a block diagram showing the configuration of the rule test apparatus 100 of the first embodiment and the relationship between the rule test apparatus 100, the test environment, and the production environment.
As shown in FIG. 1, the test environment of this embodiment includes the rule test apparatus 100.
The rule test apparatus 100 is connected to the RCA server 500 through the test network 200. In this embodiment the connection to the RCA server 500 is made via the test network 200, but it may instead be made via the management network 300. The rule test apparatus 100 is also connected to the terminal 800 via the test network 200. The terminal 800 displays an input screen and an output screen, described later, in accordance with instructions from the rule test apparatus 100.
The rule test apparatus 100 includes a CPU 110, a memory 111, a communication device 112, an input device 113, an output device 114, a media reading device 115, and an auxiliary storage device 120.
The CPU 110 is a processor that performs various operations by executing programs. The memory 111 includes a ROM, which is a nonvolatile storage element, and a RAM, which is a volatile storage element. The ROM stores immutable programs (for example, a BIOS). The RAM is a high-speed, volatile storage element such as a DRAM (Dynamic Random Access Memory); it temporarily stores the programs stored in the auxiliary storage device 120 and the data used when the programs are executed, and serves as the work area of the CPU 110. The communication device 112 is a network interface device that controls communication with the RCA server 500 in accordance with a predetermined protocol.
The input device 113 is a user interface (for example, a keyboard or a mouse) through which the user inputs data and instructions to the rule test apparatus 100. The output device 114 is a user interface (for example, a display device or a printer) for presenting the execution results of programs to the user. The media reading device 115 is an interface device, such as a DVD drive, that reads data stored in a storage medium. The media reading device 115 may be omitted.
The auxiliary storage device 120 is a large-capacity, nonvolatile storage device such as a magnetic storage device (HDD) or a flash memory (SSD). The auxiliary storage device 120 stores a configuration information table 130, an event table 140, a co-occurrence waiting event table 150, an RCA rule table 160, a topology snapshot 170, and an RCA rule test result table 180.
The configuration information table 130 stores the configuration information of the server 600 and the storage apparatus 700, which are connected to and monitored by the RCA server 500 via the management network 300. The event table 140 stores event information created from the performance information and configuration information collected from the server 600 and the storage apparatus 700 monitored by the RCA server 500 (see FIG. 6). The co-occurrence waiting event table 150 stores events that are within the retention period (see FIG. 7). The RCA rule table 160 stores the RCA rules held by the RCA server 500 (see FIG. 9). The topology snapshot 170 is information that reproduces the configuration information of the server 600 and the storage apparatus 700 at an arbitrary time. The RCA rule test result table 180 is a table in which RCA rules and information on the events that occurred are registered (see FIG. 12). Configuration examples of the tables 140, 150, 160, and 180 stored in the auxiliary storage device 120 are described later.
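One way to picture how a topology snapshot for an arbitrary past time might be derived from the current configuration and the recorded configuration change events is the following minimal sketch. It assumes, purely for illustration, that the configuration is a set of (parent, child) links and that each change event records a time, a change kind ("add" or "remove"), and the affected link; the rollback strategy, the function name `reproduce_topology`, and the field names are assumptions and are not taken from this description.

```python
from datetime import datetime


def reproduce_topology(current_links, change_events, target_time):
    """Roll the current configuration back to target_time.

    Assumption: every change newer than target_time is undone in reverse
    chronological order ("add" is undone by removing the link, "remove"
    by restoring it).
    """
    links = set(current_links)
    for event in sorted(change_events, key=lambda e: e["time"], reverse=True):
        if event["time"] <= target_time:
            break  # older changes are already reflected at target_time
        if event["change"] == "add":
            links.discard(event["link"])
        elif event["change"] == "remove":
            links.add(event["link"])
    return links
```

The returned set plays the role of the topology snapshot 170 for the chosen time; a real implementation would also need to handle attribute changes, which this sketch ignores.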
The memory 111 stores an input reception program 102 that processes input from outside, an output control program 103 that processes output to the outside, an event management program 104 that processes the event information retrieved from the event table 140, a rule test program 105 that tests RCA rules, and a configuration information reproduction program 107 that reproduces the configuration information of the production environment from the configuration information retrieved from the configuration information table 130. These programs are stored in the auxiliary storage device 120; at execution time they are read from the auxiliary storage device 120, copied into the memory 111, and executed by the CPU 110.
The programs executed by the CPU 110 are provided to the rule test apparatus 100 via removable media (a CD-ROM, flash memory, or the like) or a network, and are stored in the auxiliary storage device 120, which is a non-transitory storage medium.
The rule test apparatus 100 of this embodiment is a computer system configured on one physical computer, or on a plurality of logically or physically configured computers. It may operate in separate threads on the same computer, or on virtual computers built on a plurality of physical computer resources.
Next, the rule test operation of the rule test apparatus 100 of this embodiment will be described with reference to the flowchart shown in FIG. 2.
First, the user changes an RCA rule provided by the RCA server 500 (S1). Changes to RCA rules include editing and adding RCA rules, newly applying an RCA rule that has not been applied so far, and ceasing to apply an RCA rule that has been applied so far. Next, the user determines the configuration of the past data to be used for testing the rule, and the rule test apparatus 100 accepts the determined configuration of the past data (S2). The rule test apparatus 100 then executes the test of the target RCA rule and presents the result to the user (S3). The editing of RCA rules by the user (S1) is described later.
FIG. 3 is a detailed flowchart of the process of determining the configuration of the past data used for the test (S2 in FIG. 2).
In the process of determining the configuration of the past data used for the test (S2), it is first determined whether the user's change to the RCA rule is the creation of a new rule or the improvement of an existing rule (S11). If it is the improvement of an existing rule, the rule test apparatus 100 queries the RCA server 500 and determines whether an RCA snapshot remains (S12). An RCA snapshot is the result of a past root cause analysis performed using the same RCA rule. If the RCA server 500 holds an RCA snapshot, the process proceeds to step S14. If the RCA server 500 does not hold an RCA snapshot, the process proceeds to step S13.
In step S13, it is determined whether the conditional expression of the RCA rule newly created or edited by the user includes an already defined event (S13). If the conditional expression does not include an already defined event, the process proceeds to step S17; if it does, the process proceeds to step S14. In step S14, the corresponding event is added to the test candidates. For example, the event may be added to a test candidate event table created temporarily in memory (S14).
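The branching in steps S11 to S14 can be summarized as a small decision function. This is a sketch only: the function name and parameters are hypothetical, and because the text does not state where a newly created rule goes after S11, the sketch assumes it proceeds to the check in S13.

```python
def next_step(is_new_rule, has_rca_snapshot, uses_defined_event):
    """Return the step reached after the checks in S11 to S13.

    Assumption: a newly created rule skips the snapshot query (S12)
    and is evaluated by S13 directly.
    """
    # S11/S12: an improved existing rule with a remaining RCA snapshot
    # goes straight to S14 (add events to the test candidates).
    if not is_new_rule and has_rca_snapshot:
        return "S14"
    # S13: does the rule's conditional expression include an already defined event?
    return "S14" if uses_defined_event else "S17"
```

For example, an improved rule without a snapshot whose conditional expression references no defined event ends up at S17, where the user is asked whether missing events should be created automatically.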
Next, the target system and the events missing from the conditional expression are presented to the user (S15). An event missing from the conditional expression of an RCA rule is, among the events constituting the conditional expression, an event whose co-occurrence was not detected between the start date and time and the end date and time when one of the events could not be found in the event table 140. Here, the start date and time is "the date and time at which the first event occurred minus the event retention time", and the end date and time is "the date and time at which the last event occurred plus the event retention time". The target system is the part of the system that is subject to the test. Specifically, it is a physical or logical unit operating on the system, such as a VM (virtual machine), an HV (hypervisor), or a storage apparatus, that serves as the unit to which an RCA rule is applied, and it is a device type related to the events defined in the rule.
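The start and end of this search window can be illustrated with a minimal Python sketch. The function name `cooccurrence_window` and its parameters are hypothetical and do not appear in the specification; only the arithmetic (first event minus retention time, last event plus retention time) comes from the text above.

```python
from datetime import datetime, timedelta

def cooccurrence_window(event_times, retention):
    """Return (start, end) of the period searched for co-occurring events.

    start = time of the first event minus the event retention time
    end   = time of the last event plus the event retention time
    """
    first, last = min(event_times), max(event_times)
    return first - retention, last + retention

# Illustrative values only: two events 20 minutes apart, 30-minute retention.
times = [datetime(2015, 1, 27, 10, 0), datetime(2015, 1, 27, 10, 20)]
start, end = cooccurrence_window(times, timedelta(minutes=30))
```

An event of the rule that has no occurrence between `start` and `end` would then be reported as missing in step S15.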
Next, in the selection field of the screen displayed in step S15 for selecting the test target system and the RCA rules related to it, the user selects the RCA rule to be tested (S16). The user is then prompted to indicate whether the events missing for the test should be created automatically (S17). If the user wants the events to be created automatically, the process proceeds to step S18 and test events are created. If the events are not to be created automatically, the process ends.
A test event is an event added as a provisional event in place of an event that is missing from the conditional expression presented in step S15. In step S18, for the n events constituting the rule to be applied, an event is created at each of the following occurrence timings: (1) before the first event occurs, (2) between the first event and the next event, ..., (n) between the (n-1)th event and the last event, and (n+1) after the last event occurs.
In step S18, the rule test program 105 creates each event with an event ID 141 whose value is unique within the event table 140, which is described later with reference to FIG. 6. As described above, the rule test program 105 sets the event occurrence time 142 to times that exercise the various orderings relative to the other events constituting the rule. The rule test program 105 sets, as the target device ID 143, the identification information of the devices indicated by the remaining events, which are already held in the event table 140 and were correctly obtained from the production environment, and the identification information of devices that are topologically related to them. The rule test program 105 also sets the target device type 144 to the value corresponding to the type obtained from the target device ID 143.
The rule test program 105 also sets the event type 145 to the value assigned when the rule was created. Likewise, for the failure flag 146 and the configuration change flag 147, the rule test program 105 sets the values that the user assigned when creating the rule. The user may specify the values of the failure flag 146 and the configuration change flag 147 in the rule file being uploaded (for example, on the input screen 10 of FIG. 10, described later), or through a separately provided GUI.
The rule test program 105 also sets the co-occurrence waiting event ID 148 to a unique value that does not duplicate any co-occurrence waiting event ID 151 in the co-occurrence waiting event table 150.
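The timing and ID assignment of step S18 might look roughly like the following Python sketch. It assumes that the known events of the rule are available as a list of timestamps and that a fixed `margin` is used for the slots before the first and after the last event. The helper names, the dictionary layout, and the choice of the midpoint between two events are illustrative assumptions, not details given in the specification.

```python
from datetime import datetime, timedelta

def candidate_times(known_times, margin):
    """Return n+1 candidate times for n known events: one before the first,
    one between each consecutive pair, and one after the last."""
    ts = sorted(known_times)
    slots = [ts[0] - margin]
    # Midpoint is an arbitrary choice; any time strictly between would do.
    slots += [a + (b - a) / 2 for a, b in zip(ts, ts[1:])]
    slots.append(ts[-1] + margin)
    return slots

def make_test_events(known_times, margin, used_ids, template):
    """Create one provisional event per candidate time with IDs unused so far."""
    next_id = max(used_ids, default=0) + 1
    events = []
    for offset, t in enumerate(candidate_times(known_times, margin)):
        events.append({**template, "event_id": next_id + offset, "time": t})
    return events
```

Each generated event copies the device and type fields from `template`, so a separate test run can be made for every ordering of the provisional event relative to the known events.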
FIG. 4 is a detailed flowchart of the process of testing the RCA rule (S3 in FIG. 2).
When an RCA rule is tested, the position of the event at which reading starts, the reading direction, and the reading period are first determined as initialization (S21). The position of the event at which reading starts is the row of the event table 140 at which analysis of the RCA rule begins. The method of determining this row is described later. The reading direction indicates whether the event table 140 is read upward or downward from the position of the event at which reading starts, that is, whether events are read going back in time or following the passage of time. The reading period 84 (see FIG. 13) is the length of time over which events are read from the position at which reading starts. By reading the events that occurred during the reading period, starting from the start position and moving in the reading direction, the set of failure events to which the RCA rule should be applied is obtained.
FIG. 13 is a diagram showing a method of determining the row at which analysis of the RCA rule starts.
At operations management sites, all data of the operations management software is often backed up periodically. For this reason, rather than starting the analysis from the oldest event in the event table 140, processing can be faster if the data backup point (81a or 81b) closest to the occurrence time 82 of the failure event under test is set as the row at which analysis of the RCA rule starts, and event information is extracted from there. In addition, to reach the occurrence time 82 of the failure event under test, the number of events 85 to be analyzed may be smaller when starting from the data backup time 81b, which is newer than the occurrence time 82, than when starting from the data backup time 81a, which is older than the occurrence time 82. In that case, the newer data backup time 81b may be set as the row at which analysis of the RCA rule starts, and event information may be extracted from there.
In this way, the event reading start time and the reading direction are determined using the relationship among the backup point 81, the event retention time 83, the occurrence time of the failure event 85, and the number of events 85 to be analyzed, so the processing time needed to reproduce a past state can be shortened.
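The selection of the starting backup point and the reading direction could be sketched as follows in Python. Times are plain numbers for brevity, and `choose_start` is a hypothetical helper. The criterion used here (fewest events between the backup point and the failure time) follows the description of FIG. 13, but the tie-breaking and the inclusive bounds are assumptions.

```python
def choose_start(backup_times, event_times, failure_time):
    """Pick the backup point with the fewest events between it and the
    failure time, and report the direction in which events must be read."""
    def events_between(a, b):
        lo, hi = min(a, b), max(a, b)
        return sum(1 for t in event_times if lo <= t <= hi)

    best = min(backup_times, key=lambda b: events_between(b, failure_time))
    # A backup older than the failure is replayed forward in time;
    # a newer backup is replayed backward.
    direction = "forward" if best <= failure_time else "backward"
    return best, direction
```

With backups at times 0 and 100, events at 10, 20, 30, 40, and 95, and a failure at 90, the newer backup is chosen and events are read backward, because only one event lies between 90 and 100.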
Returning to FIG. 4, in the RCA rule test process S3, the configuration information reproduction program 107 next creates the base topology information needed for the RCA analysis (S22). Specifically, the configuration information reproduction program 107 selects, from the topology information backups held by the RCA server 500, the backup time with the fewest events between the backup and the position of the event at which reading starts, and applies the configuration change events one by one up to that position to create the base topology information. When going back in time, the configuration information reproduction program 107 applies the inverse operation of each configuration change event one by one, updating the configuration information to create the base topology information. The created base topology information is stored in the topology snapshot 170.
In this way, the past state is reproduced by applying the change history from the backup point 81 in the event reading direction, so the processing time needed to reproduce past configuration information can be shortened.
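The forward and backward replay of configuration changes can be sketched in Python as below. The topology is reduced to a mapping from a virtual machine to its host, and each change is a tuple `(time, vm, old_host, new_host)`; both representations are assumptions made for illustration, since the specification does not define the format of a configuration change event.

```python
def reproduce_topology(backup, backup_time, changes, target_time):
    """Rebuild the topology at target_time from a backup taken at backup_time.

    changes: iterable of (time, vm, old_host, new_host) tuples.
    """
    topo = dict(backup)  # never modify the backup itself
    if backup_time <= target_time:
        # Backup is older: apply each change in chronological order.
        for t, vm, old_host, new_host in sorted(changes):
            if backup_time < t <= target_time:
                topo[vm] = new_host
    else:
        # Backup is newer: undo each change, newest first.
        for t, vm, old_host, new_host in sorted(changes, reverse=True):
            if target_time < t <= backup_time:
                topo[vm] = old_host
    return topo
```

Both directions should yield the same topology for the same target time, which is what makes it safe to start from whichever backup is closer.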
Next, in the RCA rule test process S3, one event is read, and the position of the event to be read is moved by one, to a newer or an older event according to the reading direction (S23). Steps S23 to S28 form a loop in which the events are analyzed one by one.
Each event has the information shown in the event table 140 (see FIG. 6). The contents of the event table 140 are described in detail later. The RCA rule test process S3 compares the event reading period set in step S21 with the event occurrence time 142 in the event table 140 (S24). If the event occurrence time 142 of the read event is outside the event reading period, no further events need to be read from the event table 140, so the process proceeds to step S29. If the event occurrence time 142 of the read event is within the event reading period, the event needs to be analyzed, so the process proceeds to step S25.
Next, in the RCA rule test process S3, the failure flag 146 of the read event is evaluated (S25). If the failure flag 146 is ON, the failure event process S26 is executed. If the failure flag 146 is OFF, the process proceeds to step S27. The process of step S26 is described in detail later with reference to FIG. 5.
Similarly, in the RCA rule test process S3, the configuration change flag 147 of the read event is evaluated (S27). If the configuration change flag 147 is ON, the configuration change event process S28 is executed. If the configuration change flag 147 is OFF, the current iteration of the loop ends and the process returns to step S23. The process of step S28 is described in detail later with reference to FIG. 8.
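The loop of steps S23 to S28 can be summarized by the following Python sketch. It assumes that the events are already ordered in the reading direction and represented as dictionaries with `time`, `failure`, and `config_change` keys, and that the two handlers are passed in as callables. None of these names come from the specification.

```python
def run_test_loop(events, period, on_failure, on_config_change):
    """Read events one by one and dispatch them until the period is left.

    Returns the number of events that were analyzed.
    """
    start, end = period
    analyzed = 0
    for event in events:                      # S23: read the next event
        if not (start <= event["time"] <= end):
            break                             # S24: outside the period, go to S29
        analyzed += 1
        if event["failure"]:                  # S25
            on_failure(event)                 # S26
        if event["config_change"]:            # S27
            on_config_change(event)           # S28
    return analyzed
```

After the loop ends, the caller would continue with the rule analysis of step S29 using whatever the handlers have accumulated.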
FIG. 5 is a detailed flowchart of the failure event process (S26 in FIG. 4).
First, in the failure event process S26, it is determined whether the occurrence time of the failure event is within the event retention period 83 (S31). If the occurrence time of the failure event is within the RCA event retention period, the corresponding information in the event table 140 is registered in the co-occurrence waiting event table 150 (S32). Specifically, the value of the event occurrence time 142 in the event table 140 is registered in the occurrence time 152 of the co-occurrence waiting event table 150, the value of the target device ID 143 is registered in the node ID 153, and the value of the event type 145 is registered in the event type 154. The value of the co-occurrence waiting event ID 151 in the co-occurrence waiting event table 150 is then registered in the co-occurrence waiting event ID 148 of the event table 140.
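Steps S31 and S32 amount to a guarded copy between the two tables, which might be sketched in Python as follows. The dictionary keys and the use of `itertools.count` as an ID source are illustrative assumptions. The specification only states which fields are copied and that the co-occurrence waiting event ID is written back to the event.

```python
import itertools

def register_waiting_event(event, waiting_table, retention_period, id_source):
    """Copy a failure event into the co-occurrence waiting table (S31, S32).

    Returns the new waiting event ID, or None if the event is out of period.
    """
    start, end = retention_period
    if not (start <= event["time"] <= end):   # S31
        return None
    waiting_id = next(id_source)
    waiting_table[waiting_id] = {             # S32
        "time": event["time"],                # 142 -> 152
        "node_id": event["node_id"],          # 143 -> 153
        "event_type": event["event_type"],    # 145 -> 154
    }
    event["waiting_event_id"] = waiting_id    # 151 -> 148
    return waiting_id
```

A caller would create the ID source once, for example with `itertools.count(1)`, and reuse it for the whole test run so that IDs stay unique.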
FIG. 6 is a diagram showing a configuration example of the event table 140.
The event table 140 is a table that stores events, and includes an event ID 141, an event occurrence time 142, a node ID 143, a node type 144, an event type 145, a failure flag 146, a configuration change flag 147, and a co-occurrence waiting event ID 148.
The event ID 141 is identification information that uniquely identifies an event. The event occurrence time 142 is the time at which the event occurred. The node ID 143 is identification information of the device that is the target of the event. The node type 144 is the type of the device that is the target of the event; for example, "HV" is set for a hypervisor and "VM" for a virtual machine. In addition to those illustrated, there may be types representing storage apparatuses, Fibre Channel switches, network switches, and routers. The event type 145 is identification information representing the type of the event. The failure flag 146 indicates whether the event is a failure event. The configuration change flag 147 indicates whether the event is a configuration change event. The co-occurrence waiting event ID 148 is identification information that uniquely identifies the corresponding event in the co-occurrence waiting event table 150.
When the database of the RCA server 500 is updated, the rule test apparatus 100 receives the update information from the event database of the RCA server 500 and synchronizes the event table 140 with that event database.
FIG. 7 is a diagram showing a configuration example of the RCA co-occurrence waiting event table 150.
The co-occurrence waiting event table 150 includes a co-occurrence waiting event ID 151, a time 152, a node ID 153, and an event type 154.
The co-occurrence waiting event ID 151 is identification information that uniquely identifies a co-occurrence waiting event. The time 152 is the time at which the co-occurrence waiting event occurred. The node ID 153 is identification information of the device that is the target of the event. The event type 154 is identification information representing the type of the event.
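For orientation, the two tables of FIG. 6 and FIG. 7 could be modeled as the following Python data classes. The field names and Python types are assumptions chosen for readability; the specification defines only the columns and their reference numerals, which are noted in the comments.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Event:
    """One row of the event table 140."""
    event_id: int                            # 141
    occurred_at: datetime                    # 142
    node_id: str                             # 143
    node_type: str                           # 144, e.g. "HV" or "VM"
    event_type: str                          # 145
    is_failure: bool                         # 146
    is_config_change: bool                   # 147
    waiting_event_id: Optional[int] = None   # 148, unset until registered

@dataclass
class WaitingEvent:
    """One row of the co-occurrence waiting event table 150."""
    waiting_event_id: int                    # 151
    occurred_at: datetime                    # 152
    node_id: str                             # 153
    event_type: str                          # 154
```

Keeping `waiting_event_id` optional reflects that an event is linked to the waiting table only after the failure event process has registered it.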
FIG. 8 is a detailed flowchart of the configuration change event process (S28 in FIG. 4).
First, in the configuration change event process S28, the configuration information of the topology snapshot 170 is changed according to the contents of the configuration change event (S41).
Next, it is determined whether the current configuration change alters the topology (S42). Here, altering the topology means that the relationships between devices, whether physical or virtual, change. One example is a virtual server running on one physical server moving to another physical server. If the topology changes, the co-occurrence waiting event table 150 is reset (S43). If the topology does not change, the process ends.
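Steps S41 to S43 can be sketched in Python as follows. The sketch assumes a topology snapshot made of a `relations` mapping (virtual machine to host) and an `attributes` mapping, and two hypothetical kinds of change event, `move` and `set_attribute`. Only a change to `relations` counts as a topology change here, which is one possible reading of "the relationships between devices change".

```python
def handle_config_change(event, topology, waiting_table):
    """Apply a configuration change and reset the waiting table if the
    relationships between devices changed. Returns True on a topology change."""
    relations_before = dict(topology["relations"])

    # S41: update the snapshot according to the event contents.
    if event["kind"] == "move":
        topology["relations"][event["node"]] = event["host"]
    elif event["kind"] == "set_attribute":
        attrs = topology["attributes"].setdefault(event["node"], {})
        attrs[event["name"]] = event["value"]

    # S42: did any relationship between devices change?
    changed = topology["relations"] != relations_before
    if changed:
        waiting_table.clear()   # S43: reset the co-occurrence waiting events
    return changed
```

Resetting the table on a topology change discards waiting events whose device relationships are no longer valid, so they cannot be matched against a rule under the new topology.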
FIG. 9 is a diagram showing a configuration example of the RCA rule table 160.
The RCA rule table 160 is generated from the RCA rules copied from the RCA server 500 and is updated when a rule is edited on the rule test apparatus 100. The RCA rule table 160 includes a rule ID 161 that uniquely identifies a rule, a rule usage state 162 that indicates whether the user is using the rule, a device type 163 related to the rule, and rule content 164.
Returning to FIG. 4, in the RCA rule test process S3, the failure events are next analyzed using the RCA rule (S29). Specifically, the information of the topology snapshot 170 is first loaded. The RCA rule is then applied to the remaining co-occurrence waiting events. By applying the same processing algorithm as the RCA server 500 to the RCA rule, the same root cause location as in the production environment is estimated. If there is an event that matches the RCA rule, the RCA rule and the information on the event that occurred are registered in the RCA rule test result table 180.
In the failure event analysis S29 using the RCA rule, the events that occurred at the corresponding occurrence times in the event table 140 are first compared with the RCA rule test result table 180 to determine whether each event occurred on both the RCA server 500 and the rule test apparatus 100, occurred on the rule test apparatus 100 but not on the RCA server 500, or occurred on the RCA server 500 but not on the rule test apparatus 100. One of the following values is then set in the event change information 181: a value indicating that the event was newly generated by the new RCA rule ("+" in this embodiment), a value indicating that the event occurred previously and also occurred with the new RCA rule (blank in this embodiment), or a value indicating that the event occurred previously but did not occur with the new RCA rule ("-" in this embodiment). In this embodiment the event change information is represented by characters, but it may be displayed in another way, such as by changing the color or the border of the displayed event.
FIG. 12 is a diagram showing a configuration example of the RCA rule test result table 180.
The RCA rule test result table 180 is obtained by adding the event change information and the ID of the applied RCA rule to the contents of the event table 140 and removing the failure flag 146, the configuration change flag 147, and the ID 148 of the corresponding event in the co-occurrence waiting event table 150. The RCA rule test result table 180 includes event change information 181, an RCA rule ID 182, an event occurrence time 183, a target device ID 184, a target device type 185, and an ID 186 representing the event.
The event change information 181 is information for distinguishing among events that exist in the event table 140 but not in the RCA rule test result table 180 ("-" in this embodiment), events that exist in both the event table 140 and the RCA rule test result table 180 (blank in this embodiment), and events that do not exist in the event table 140 but exist in the RCA rule test result table 180 ("+" in this embodiment).
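The three-way classification behind the event change information 181 is essentially a set comparison, as the following Python sketch shows. It assumes that an event can be identified by a hashable key such as `(time, node_id, event_type)`; this choice of key and the function name are illustrative and are not specified in the text.

```python
def event_change_marks(production_events, test_events):
    """Map each event key to "+", "" or "-".

    "+" : detected only with the new rule (test run only)
    ""  : detected both in production and in the test run
    "-" : detected in production but not with the new rule
    """
    production, test = set(production_events), set(test_events)
    marks = {}
    for key in production | test:
        if key in test and key not in production:
            marks[key] = "+"
        elif key in test:
            marks[key] = ""
        else:
            marks[key] = "-"
    return marks
```

A display layer could map the same three values to colors or borders instead of characters, as the embodiment allows.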
The RCA rule ID 182 is information that uniquely identifies an RCA rule. The event occurrence time 183 is the time at which the event occurred. The target device ID 184 is identification information of the device that is the source of the event. The target device type 185 is the type of the device that is the source of the event. The ID 186 representing the event is information that uniquely identifies the event type.
Returning to FIG. 4, the RCA rule test process S3 finally outputs the analysis result (S30).
FIG. 10 is a diagram showing an example of the input screen 10 that the output control program 103 displays on the terminal 800. The input screen 10 shown in FIG. 10 is configured as a Web application, but it may instead be a screen of a native application executed on the rule test apparatus 100.
The input screen 10 includes a title bar 11 that displays the name of the GUI, an address bar 12 that displays a URL (Uniform Resource Locator), an event list area 13 that displays the events read from the event database of the RCA server 500, a rule list area 21 that displays the RCA rules read from the RCA server 500, a "Details" button 31 used to determine the details of the RCA rule test method, and an "Apply" button 41 used to actually apply the RCA rule to the events. The address bar 12 may be omitted when no URL is used, as in a native application rather than a Web application, or when the URL is not displayed even in a Web application.
The event list area 13 includes fields that display an event message 16, an event occurrence time 17, the device (source) 18 in which the phenomenon indicated by the event occurred, and the device type 19 of that device.
The event list area 13 is populated with the event information stored in the event table 140. The event message 16 is set to the name of the event stored in the event type 145. For the event type 145, a table that associates event names with identification information may be prepared so that people can understand it easily, and the identification information obtained by converting the name with that table may be stored; alternatively, the event identification information stored in the event type 145 may be set directly. The time 17 is set to the value stored in the event occurrence time 142 of the event table 140. The source 18 is set to the name of the device, stored in the node ID 143, in which the phenomenon indicated by the event occurred. For the source 18, a table that associates device names with identification information may be prepared so that people can understand it easily, and the identification information obtained by converting the name with that table may be stored; alternatively, the identification information of the target device stored in the node ID 143 may be set. The device type 19 is set to the value of the type (node type 144) of the device in which the phenomenon indicated by the event occurred.
The rule list area 21 includes fields that display a rule usage state 22, a rule name 23, a target 24 to which the rule is applied, and rule details 25 that show the details of the rule. The rule list area 21 further includes an "Upload rule" button 26, used to upload to the rule test apparatus 100 a new rule that has not been loaded into the RCA rule table 160, and a "Delete rule" button 27, used to delete an existing rule that has been loaded into the RCA rule table 160.
The rule list area 21 is populated with the information stored in the RCA rule table 160. The rule usage state 22 is set to the value of the rule usage state 162 in the RCA rule table 160. The rule name 23 is set to the name of the rule stored in the rule ID 161. For the rule name 23, a table that associates rule names with identification information may be prepared so that people can understand it easily, and the identification information obtained by converting the name with that table may be stored; alternatively, the identification information stored in the rule ID 161 may be set directly. The target 24 is set to the value of the device type 163 related to the rule. The rule details 25 are set to the content of the currently selected rule, which is stored in the rule content 164. The rule details 25 field has a text editing function that allows the user to edit the rule. When the "Upload rule" button 26 is operated, a file selection screen is displayed, and the user can select a rule definition file to add to the rule list. When the "Delete rule" button 27 is operated, the user can delete the currently selected rule from the rule list. The operations using the rule details 25, the "Upload rule" button 26, and the "Delete rule" button 27 correspond to the editing of the RCA rule (S1 in FIG. 2).
When the "Details" button 31 is operated, a screen for making detailed settings for testing the rule is displayed. On the detailed settings screen, the length of time for which event information is kept in the co-occurrence waiting event table 150 can be adjusted, and the event reading period can be set. If the event reading period is changed, the reading period determined in step S21 of the RCA rule test process S3 changes accordingly.
When the "Apply" button 41 is operated, a test using the rules currently in use is executed on the basis of the occurrence time of the event selected in the event list area 13 or the event reading period set on the detailed settings screen.
FIG. 11 shows an example of the output screen 50 that the output control program 103 displays on the terminal 800. Like the input screen 10, the output screen 50 shown in FIG. 11 is configured as a Web application, but it may instead be a screen of a native application executed on the rule test apparatus 100.
The output screen 50 includes a title bar 51 that displays the name of the GUI, an address bar 52 that displays a URL, an RCA rule application result area 53 that displays the list of events obtained as the test result of the RCA rule, and a "Return to input screen" button 61. As with the input screen 10, the address bar 52 may be omitted.
The RCA rule application result area 53 includes event change information 54, an applied rule ID 55, an event message 56, the time 57 at which the event occurred, the device (source) 58 in which the event occurred, and the device type 59 of that device. The RCA rule application result area 53 displays, in chronological order, the information stored in the RCA rule test result table 180, which is an extension of the event table 140.
 The event change information 54 displays the value of the event change information in the RCA rule test result table 180. The applied rule ID 55 displays the name of the rule. A table that associates rule names with identification information may be prepared so that the identification information is converted into a name that is easy for a person to understand, or the identification information of the rule may be displayed as is. The message 56 displays a character string corresponding to the value of the event ID 145. Likewise, a table that associates event names with identification information may be used to display a converted name, or the identification information of the event (event ID 145) may be displayed as is. The time 57 displays the value of the event occurrence time 142 in the RCA rule test result table 180. The source 58 displays the name of the device on which the condition indicated by the event occurred, that is, the name of the target device in the event table 140. Here too, a table that associates device names with identification information may be used to display a converted name, or the identification information of the device may be displayed as is. The source device type 59 displays the value of the type 144 of the device on which the condition indicated by the event occurred.
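 The name lookup described above, with its fallback to the raw identifier, can be sketched as follows. This is a minimal illustration only: the lookup tables, their keys, and the function name `display_label` are hypothetical and do not appear in the original text.

```python
from typing import Mapping


def display_label(identifier: str, names: Mapping[str, str]) -> str:
    """Return the human-readable name for an identifier.

    If the lookup table has no entry, the identifier is shown as is,
    which corresponds to displaying the identification information directly.
    """
    return names.get(identifier, identifier)


# Hypothetical lookup tables; the IDs and names are illustrative assumptions.
RULE_NAMES = {"rule-001": "Server resource shortage"}
DEVICE_NAMES = {"dev-17": "web-server-01"}

print(display_label("rule-001", RULE_NAMES))   # converted name
print(display_label("dev-99", DEVICE_NAMES))   # no entry, so the raw ID is shown
```

 The same helper can serve the applied rule ID 55, the message 56, and the source 58, because all three follow the same "converted name or raw identifier" pattern.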
 As described above, according to the first embodiment, a new RCA rule is tested with the same configuration information as the production environment, using event information that occurred in the production environment. The operation of the RCA rule can therefore be checked before the rule is introduced into the production environment, and it can be confirmed whether the expected failure events are detected.
 In addition, because events that occurred within a predetermined period are treated as the events that caused the failure, the causal events can be selected accurately with simple processing.
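 The period-based selection can be illustrated with a short sketch. It assumes that the predetermined period is a window ending at the time of the failure event; the text does not state how the window is positioned, and the `Event` fields and the function name are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class Event:
    event_id: str          # hypothetical identifier, e.g. "SERVER_DOWN"
    occurred_at: datetime
    device: str


def candidate_cause_events(failure: Event, events: Iterable[Event],
                           period: timedelta) -> List[Event]:
    """Return the events that occurred within `period` before the failure.

    Assumption: the window ends at the failure time and the failure event
    itself is excluded from the result.
    """
    start = failure.occurred_at - period
    return sorted(
        (e for e in events
         if e != failure and start <= e.occurred_at <= failure.occurred_at),
        key=lambda e: e.occurred_at,
    )
```

 Because the selection is a single time-range filter, it needs no knowledge of the topology, which is consistent with the "simple processing" noted above.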
 Furthermore, according to the first embodiment, even if no event matching the conditions of an RCA rule has occurred in the past, an event matching those conditions is created automatically, so that the operation of the RCA rule can be checked reliably.
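 One way to picture the automatic creation of missing events is the sketch below. It assumes that a rule's conditions can be reduced to a list of event IDs and that generated events are flagged so they can be told apart from recorded ones; both the data layout and the `generated` flag are assumptions for illustration, not part of the original description.

```python
from datetime import datetime
from typing import Dict, Iterable, List, Set


def create_missing_events(rule_condition_ids: Iterable[str],
                          recorded_ids: Set[str],
                          occurred_at: datetime) -> List[Dict[str, object]]:
    """Create one event for each rule condition that has no recorded event."""
    return [
        # "generated" is a hypothetical marker for synthesized events.
        {"event_id": cid, "occurred_at": occurred_at, "generated": True}
        for cid in rule_condition_ids
        if cid not in recorded_ids
    ]
```

 For example, if a rule under test refers to events `A` and `B` and only `A` has been recorded, the function returns a single generated event for `B`.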
 In the first embodiment, the root cause is estimated by RCA when an event occurs, but the root cause may instead be estimated by RCA when a change in the configuration information is detected.
 <Second Embodiment>
 Next, a second embodiment of the test environment according to the present invention is described. In the second embodiment, configurations and processes identical to those of the first embodiment are given the same reference numerals and their description is omitted; only the differences are described.
 FIG. 14 is a block diagram showing the configuration of the rule test apparatus 100 of the second embodiment and the relationship among the rule test apparatus 100, the test environment, and the production environment.
 Like the test environment of the first embodiment, the test environment of the second embodiment includes the rule test apparatus 100.
 The rule test apparatus 100 of the second embodiment has almost the same configuration as the rule test apparatus 100 of the first embodiment, but differs in that an RCA rule history management program 108 is stored in the memory 111 and a rule history table 190 is stored in the auxiliary storage device 120.
 In the second embodiment, when an RCA rule used on the RCA server 500 is changed (for example, added, deleted, or edited), the RCA server 500 receives a change event sent from the database that holds the rules, and the RCA rule history management program 108 of the rule test apparatus 100 records the rule change history in the rule history table 190.
 FIG. 15 is a diagram showing a configuration example of the rule history table 190.
 The rule history table 190 stores the change history of RCA rules. It includes the time 191 at which an RCA rule was changed, the identification information (rule ID) 192 of the changed RCA rule, and the change type 193, such as addition, deletion, or editing.
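 The three columns of the rule history table 190 map naturally onto a small record type. The following is a minimal in-memory sketch under that reading; the class names, the enum values, and the `record` method are hypothetical, and the actual table may well be held in a database.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import List


class ChangeType(Enum):
    ADDED = "added"
    DELETED = "deleted"
    EDITED = "edited"


@dataclass(frozen=True)
class RuleHistoryEntry:
    changed_at: datetime      # corresponds to time 191
    rule_id: str              # corresponds to rule ID 192
    change_type: ChangeType   # corresponds to change type 193


class RuleHistoryTable:
    """In-memory stand-in for the rule history table 190."""

    def __init__(self) -> None:
        self._entries: List[RuleHistoryEntry] = []

    def record(self, rule_id: str, change_type: ChangeType,
               changed_at: datetime) -> None:
        """Append one change event received for an RCA rule."""
        self._entries.append(RuleHistoryEntry(changed_at, rule_id, change_type))

    def entries(self) -> List[RuleHistoryEntry]:
        """Return the history in chronological order."""
        return sorted(self._entries, key=lambda e: e.changed_at)
```

 Keeping entries immutable and sorting on read means change events that arrive out of order still produce a consistent chronological history.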
 FIG. 16 shows an example of the output screen 50 that the output control program 103 displays on the terminal 800. Like the input screen 10, the output screen 50 shown in FIG. 16 is configured as a Web application, but it may instead be a screen provided by a native application executed on the rule test apparatus 100.
 In the output screen 50 of the second embodiment, components identical to those of the output screen 50 of the first embodiment are given the same reference numerals and their description is omitted; only the differences are described. The RCA rule application result area 53 displays, in chronological order, failure events together with RCA rule change events, which are events that are not failures. For this purpose, an event type 71 is added to the RCA rule application result area 53 in place of the event change information 54. In the second embodiment, when an RCA rule is changed, the event type 71 displays a character indicating that the event is not a failure ("!" in this embodiment). The ID of the changed RCA rule is displayed in the applied rule ID 55, the content of the change (in this embodiment, "RCA rule added") is displayed in the message 56, and the time at which the change was made is displayed in the time 57. Because the source 58 and the device type 59 have no corresponding information, they display empty strings. In this example, a server resource shortage event is generated after the RCA rule change event, followed by a server down event. If only correct RCA rules had been applied, only the resource shortage event would be displayed as the root cause, allowing the user to understand the cause quickly and address the problem. As it stands, however, the server down event is displayed as well, so the user must also investigate possible causes other than the server's resource shortage. From the RCA rule change event, the user can see that a rule was added before this situation arose and can consider the possibility that the added RCA rule is inappropriate.
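 The chronological mix of failure events and RCA rule change events can be sketched as a merge of two lists into display rows. The row keys below mirror the columns of the RCA rule application result area 53, but the dictionary layout, the input field names, and the label used for failure rows are assumptions made for illustration.

```python
from datetime import datetime
from typing import Dict, Iterable, List


def build_timeline(failure_events: Iterable[Dict[str, object]],
                   rule_changes: Iterable[Dict[str, object]]) -> List[Dict[str, object]]:
    """Merge failure events and rule change events into time-ordered rows."""
    rows: List[Dict[str, object]] = []
    for ev in failure_events:
        rows.append({
            "type": "failure",            # hypothetical label for failure rows
            "rule_id": ev["rule_id"],
            "message": ev["message"],
            "time": ev["time"],
            "source": ev["source"],
            "device_type": ev["device_type"],
        })
    for ch in rule_changes:
        rows.append({
            "type": "!",                  # marks an event that is not a failure
            "rule_id": ch["rule_id"],
            "message": f"RCA rule {ch['change']}",
            "time": ch["time"],
            "source": "",                 # no corresponding information
            "device_type": "",
        })
    return sorted(rows, key=lambda r: r["time"])
```

 With one rule change at 12:00 followed by a resource shortage event and a server down event, the "!" row appears first, which is what lets a user connect the later unexpected result to the earlier rule change.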
 As described above, according to the second embodiment, RCA rules that were changed before the time at which a failure event occurred are displayed together with the failure event, based on past event information and RCA rule change information. This gives the user material for judging whether the cause of a failure event not being handled correctly is related to a change in the RCA rules.
 The present invention is not limited to the embodiments described above and includes various modifications and equivalent configurations within the spirit of the appended claims. For example, the embodiments above are described in detail to make the present invention easy to understand, and the present invention is not necessarily limited to configurations that include every element described. Part of the configuration of one embodiment may be replaced with the configuration of another embodiment, and the configuration of another embodiment may be added to the configuration of one embodiment. For part of the configuration of each embodiment, other configurations may be added, deleted, or substituted.
 Some or all of the configurations, functions, processing units, processing means, and the like described above may be implemented in hardware, for example by designing them as integrated circuits, or in software, by having a processor interpret and execute programs that implement the respective functions.
 Information such as the programs, tables, and files that implement each function can be stored in a storage device such as a memory, a hard disk, or an SSD (Solid State Drive), or on a recording medium such as an IC card, an SD card, or a DVD.
 The control lines and information lines shown are those considered necessary for the explanation and do not necessarily represent all the control lines and information lines required in an implementation. In practice, almost all of the components may be considered to be connected to one another.

Claims (12)

  1.  A management computer that tests rules for monitoring a system composed of a plurality of devices, the management computer comprising:
     a storage unit; and
     a processor that refers to the storage unit,
     wherein the storage unit holds event information for managing failure events that occurred in the system in the past and configuration change events of the system, configuration information for managing the current configuration of the system, and a topology snapshot representing a past configuration of the system, and
     wherein the management computer comprises:
     a configuration information reproduction unit that reproduces a past configuration of the system using the configuration information and the configuration change events, and records the reproduced configuration in the topology snapshot; and
     a rule test unit that applies a cause analysis algorithm to the reproduced past configuration, estimates the cause of a failure using a rule under test, and outputs whether the failure can be detected.
  2.  The management computer according to claim 1,
     wherein the storage unit holds rule history information for managing a change history of rules, and
     wherein the rule test unit outputs data for displaying, in chronological order, a failure event whose cause could not be estimated and the change history of the rule under test so that the two are associated with each other.
  3.  The management computer according to claim 1,
     wherein the management computer manages a retention period for related events, and
     wherein the rule test unit determines the time at which to start acquiring events from the event information, and the method of acquiring the events, based on the relationship among the time at which information on the state of the system was backed up, the retention period for related events, and the failure event whose cause is to be estimated.
  4.  The management computer according to claim 3,
     wherein the configuration information reproduction unit reproduces the configuration of the system at the time the failure event occurred by applying the configuration change events to the information on the configuration of the system backed up at the time at which acquisition of the events starts.
  5.  The management computer according to claim 1,
     wherein the storage unit holds co-occurrence event information for managing information on events that occurred within the retention period for related events, and
     wherein the rule test unit outputs an event managed in the co-occurrence event information as an event that caused the failure event.
  6.  The management computer according to claim 5,
     wherein the configuration information reproduction unit creates an event that is described in the rule under test but is not recorded in the co-occurrence event information, and records the created event in the event information.
  7.  A method of testing, using a management computer, rules for monitoring a system composed of a plurality of devices,
     wherein the management computer includes a storage unit and a processor that refers to the storage unit,
     wherein the storage unit holds event information for managing failure events that occurred in the system in the past and configuration change events of the system, configuration information for managing the current configuration of the system, and a topology snapshot representing a past configuration of the system, and
     wherein the method comprises:
     a configuration information reproduction step in which the management computer reproduces a past configuration of the system using the configuration information and the configuration change events, records the reproduced configuration in the topology snapshot, and stores it in the memory; and
     a rule test step in which the management computer applies a cause analysis algorithm to the reproduced past configuration, estimates the cause of a failure using a rule under test, and outputs whether the failure can be detected.
  8.  The test method according to claim 7,
     wherein the storage unit holds rule history information for managing a change history of rules, and
     wherein, in the rule test step, data is output for displaying, in chronological order, a failure event whose cause could not be estimated and the change history of the rule under test so that the two are associated with each other.
  9.  The test method according to claim 7,
     wherein the management computer manages a retention period for related events, and
     wherein, in the rule test step, the time at which to start acquiring events from the event information, and the method of acquiring the events, are determined based on the relationship among the time at which information on the state of the system was backed up, the retention period for related events, and the failure event whose cause is to be estimated.
  10.  The test method according to claim 9,
     wherein, in the configuration information reproduction step, the configuration of the system at the time the failure event occurred is reproduced by applying the configuration change events to the information on the configuration of the system backed up at the time at which acquisition of the events starts.
  11.  The test method according to claim 7,
     wherein the storage unit holds co-occurrence event information for managing information on events that occurred within the retention period for related events, and
     wherein, in the rule test step, an event managed in the co-occurrence event information is output as an event that caused the failure event.
  12.  The test method according to claim 11,
     wherein, in the configuration information reproduction step, an event that is described in the rule under test but is not recorded in the co-occurrence event information is created and recorded in the event information.
PCT/JP2015/052164 2015-01-27 2015-01-27 Management computer and rule test method WO2016120989A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2015/052164 WO2016120989A1 (en) 2015-01-27 2015-01-27 Management computer and rule test method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2015/052164 WO2016120989A1 (en) 2015-01-27 2015-01-27 Management computer and rule test method

Publications (1)

Publication Number Publication Date
WO2016120989A1 true WO2016120989A1 (en) 2016-08-04

Family

ID=56542647

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2015/052164 WO2016120989A1 (en) 2015-01-27 2015-01-27 Management computer and rule test method

Country Status (1)

Country Link
WO (1) WO2016120989A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109726111A (en) * 2018-08-17 2019-05-07 平安普惠企业管理有限公司 Test order customization method, unit and computer readable storage medium
CN110609535A (en) * 2018-06-14 2019-12-24 横河电机株式会社 Test information management device, test information management method, and computer-readable non-transitory recording medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006011702A (en) * 2004-06-24 2006-01-12 Hitachi Ltd Method of policy verification and policy verification system
JP2012003406A (en) * 2010-06-15 2012-01-05 Hitachi Solutions Ltd Failure cause determination rule verification device and program therefor
JP2013206368A (en) * 2012-03-29 2013-10-07 Hitachi Solutions Ltd Virtual environment operation support system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15879886

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15879886

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP