WO2018235310A1 - Switching management device, monitoring control system, switching management method, and switching management program

Info

Publication number
WO2018235310A1
Authority
WO
WIPO (PCT)
Prior art keywords
monitoring
unit
switching
load
standby
Prior art date
Application number
PCT/JP2017/039778
Other languages
French (fr)
Japanese (ja)
Inventor
山田 耕一
明 半田
Original Assignee
三菱電機株式会社
Priority date
Filing date
Publication date
Application filed by 三菱電機株式会社
Publication of WO2018235310A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units

Definitions

  • The present invention relates to a switching management device, a monitoring control system, a switching management method, and a switching management program that manage system switching in a multiplex monitoring system.
  • Patent Document 1 discloses a technique for preventing a standby system from stopping by using operation data synchronized using a failure flag and a stop flag when switching a system in a multiplex system.
  • Patent Document 2 discloses a technology for stabilizing a web server by temporarily limiting the number of clients communicating with the web server.
  • In a multiplex monitoring system, the operating (active) system may stop responding due to overload, causing a switch to the standby system.
  • After such a switch, the standby system must in turn handle the same large number of failures. The standby system may therefore also become overloaded and stop responding, so that both the operating system and the standby system go down.
  • Such a situation can arise, for example, when a failure occurs in a main network device such as a switch, a router, or a firewall, or in a carrier line connecting the multiplex monitoring system and a monitoring target.
  • The present invention aims to prevent the standby system from being overloaded when the system is switched to the standby system because of an overload.
  • A switching management device according to the present invention manages system switching of a multiplex monitoring system that includes an operating system monitoring unit, which performs monitoring processing on a plurality of monitoring devices, and a standby system monitoring unit, which executes the monitoring processing in place of the operating system monitoring unit when a failure occurs in the operating system monitoring unit.
  • The switching management device includes: a switching unit that switches execution of the monitoring processing from the operating system monitoring unit to the standby system monitoring unit when acquiring a failure notification indicating that a failure has occurred in the operating system monitoring unit; a connection control unit that disconnects communication between the multiplex monitoring system and each of the plurality of monitoring devices when acquiring, from the switching unit, a switching notification indicating that the operating system monitoring unit has been switched to the standby system monitoring unit; and a load determination unit that acquires load information representing the load of the standby system monitoring unit executing the monitoring processing and compares the load with a threshold.
  • The connection control unit controls communication between each of the plurality of monitoring devices and the standby system monitoring unit based on the result of comparing the load with the threshold.
  • In the switching management device according to the present invention, when the system is switched to the standby system monitoring unit, the connection control unit disconnects communication between the multiplex monitoring system and each of the plurality of monitoring devices. The connection control unit then controls communication between each of the plurality of monitoring devices and the standby system monitoring unit based on the result of comparing the load of the standby system monitoring unit with the threshold. Therefore, the switching management device of the present invention can prevent the standby system monitoring unit from becoming overloaded after the switch.
  • FIG. 1 is a block diagram of a monitoring control system 500 according to a first embodiment.
  • FIG. 2 is a block diagram of a switching management device 130 according to the first embodiment.
  • FIG. 3 is a flowchart of switch management processing S130 by the switch management device 130 according to the first embodiment.
  • FIG. 4 shows connection information 138 according to the first embodiment.
  • FIG. 5 shows threshold information 137 according to the first embodiment.
  • FIG. 10 is a flowchart of connection control processing S106 by the connection control unit 136 according to the first embodiment.
  • FIG. 7 is a block diagram of a switching management device 130 according to a modification of the first embodiment.
  • FIG. 16 is a diagram showing a connection table 138a according to Embodiment 2.
  • FIG. 16 is a view showing an example of a user interface for setting of the connection table 138a according to the second embodiment.
  • FIG. 16 is a view showing another example of the user interface for setting of the connection table 138a according to the second embodiment.
  • FIG. 10 is a block diagram of a monitoring control system 500b according to a third embodiment.
  • FIG. 13 shows an example of transfer setting information 143 according to the third embodiment.
  • FIG. 16 shows another example of transfer setting information 143 according to the third embodiment.
  • the monitoring control system 500 includes a multiple system monitoring system 119, a monitoring target system 200, and a switching management device 130.
  • the multiplex system monitoring system 119 and the monitoring target system 200 are connected via the network 106, the FW 107, and the switching management device 130.
  • the network 106 connects between the multisystem monitoring system 119 and the system 200 to be monitored.
  • the network 106 is configured by the Internet, an intranet, or another network depending on requirements such as the installation site of the monitored system 200.
  • the monitoring target system 200 is a system that provides a service to a user.
  • the monitoring target system 200 includes an FW 202, a monitoring device 203, a SW 204, and a server 205.
  • the FW 202 is a firewall that connects the inside and the outside of the monitored system 200. Depending on the configuration of the monitored system 200, one or more FWs 202 exist. Alternatively, depending on the configuration of the monitored system 200, the FW 202 may not exist.
  • the monitoring device 203 monitors a server or a network device configuring the monitoring target system 200.
  • the monitoring device 203 transmits failure data 231 to the multiplex monitoring system 119 when an abnormality is detected.
  • One or more monitoring devices 203 exist in one monitoring target system 200.
  • the SW 204 is a network device for connecting the monitoring device 203 and a server or device existing in the monitoring target system 200.
  • the SW 204 is a device such as a router, a switch, or a hub.
  • One or more SWs 204 exist in one monitored system 200.
  • One or more servers 205 exist in one monitoring target system 200.
  • the FW 107 is a firewall that connects the multiplex monitoring system 119 and the network 106.
  • One or more FWs 107 exist depending on the system configuration. Alternatively, the FW 107 may not exist depending on the system configuration.
  • the multiple system monitoring system 119 includes a monitoring unit 191 that performs monitoring processing on a plurality of monitoring devices 203.
  • the monitoring unit 191 includes an operating system monitoring unit 192 and a standby system monitoring unit 193 that executes monitoring processing in place of the operating system monitoring unit 192 when a failure occurs in the operating system monitoring unit 192.
  • the operating system monitoring unit 192 includes an event aggregation (operating system) 111 and an incident management (operating system) 115.
  • the standby monitoring unit 193 further includes an event aggregation (standby system) 112 and an incident management (standby system) 116.
  • the multiple system monitoring system 119 manages a failure that has occurred in the monitored system 200.
  • the event aggregation unit 110 receives failure data 231 reported from the plurality of monitoring devices 203 in the plurality of monitored systems 200.
  • the event aggregation unit 110 centrally manages the received failure data 231 as an event.
  • the event aggregation unit 110 links information of the monitoring target system 200 to event information. Also, the event aggregation unit 110 determines the severity of the failure. Event information is stored in the event database 113.
  • the event aggregation unit 110 is duplicated.
  • the event aggregation unit 110 includes an event aggregation (operating system) 111 and an event aggregation (standby system) 112.
  • Event information to be processed is stored in the event database 113.
  • the event database 113 can be accessed from both the event aggregation (operating system) 111 and the event aggregation (standby system) 112. Therefore, the processing of event information stored in the event database 113 can be taken over by the event aggregation (standby system) 112.
  • the incident management unit 114 receives event information from the event aggregation unit 110.
  • the incident management unit 114 stores incident information based on the event information in the incident database 117.
  • the incident information includes information such as the content of the notification sent by the operator to the user of the monitoring target system 200, information collected for failure recovery, or processing performed for failure recovery. Incident information is stored in the incident database 117 until the failure is recovered.
  • the incident management unit 114 is duplicated.
  • the incident management unit 114 includes an incident management (operating system) 115 and an incident management (standby system) 116.
  • Normally, the incident management (operating system) 115 is operating. When a failure occurs in the incident management (operating system) 115 and processing cannot be continued, the incident management (standby system) 116 takes over the processing.
  • the incident database 117 can be accessed from both the incident management (operating system) 115 and the incident management (standby system) 116.
  • Incident management (standby system) 116 can take over processing using the incident information stored in the incident database 117.
  • the fault display unit 118 displays a fault that occurs in the monitoring target system 200 and is detected.
  • the fault display unit 118 may notify the operator not only by displaying the fault content on the screen, but also by lighting a lamp or emitting a sound. When a fault is displayed on the fault display unit 118, the operator 121 starts handling the fault.
  • the switching management device 130 is a computer.
  • the switch management device 130 includes a processor 910 and other hardware such as a memory 921, an auxiliary storage device 922, an input interface 930, an output interface 940, and a communication device 950.
  • the processor 910 is connected to other hardware via signal lines to control these other hardware.
  • the switching management device 130 includes, as functional elements, a connection disconnection unit 131, a failure acquisition unit 132, a switching unit 133, a load acquisition unit 134, a load determination unit 135, a connection control unit 136, and a storage unit 139.
  • the storage unit 139 stores threshold information 137 and connection information 138.
  • the functions of the connection disconnection unit 131, the fault acquisition unit 132, the switching unit 133, the load acquisition unit 134, the load determination unit 135, and the connection control unit 136 are realized by software.
  • the storage unit 139 is included in the memory 921.
  • the processor 910 is a device that executes a switching management program.
  • the switching management program is a program for realizing the functions of the connection disconnection unit 131, the failure acquisition unit 132, the switching unit 133, the load acquisition unit 134, the load determination unit 135, and the connection control unit 136.
  • the processor 910 is an IC (Integrated Circuit) that performs arithmetic processing.
  • a specific example of the processor 910 is a central processing unit (CPU), a digital signal processor (DSP), or a graphics processing unit (GPU).
  • the memory 921 is a storage device that temporarily stores data.
  • a specific example of the memory 921 is a static random access memory (SRAM) or a dynamic random access memory (DRAM).
  • the auxiliary storage device 922 is a storage device for storing data.
  • a specific example of the auxiliary storage device 922 is an HDD.
  • the auxiliary storage device 922 may also be a portable storage medium such as an SD (registered trademark) memory card, a CF, a NAND flash, a flexible disk, an optical disk, a compact disk, a Blu-ray (registered trademark) disk, and a DVD.
  • HDD is an abbreviation of Hard Disk Drive.
  • SD (registered trademark) is an abbreviation of Secure Digital.
  • CF is an abbreviation of Compact Flash.
  • DVD is an abbreviation of Digital Versatile Disk.
  • the input interface 930 is a port connected to an input device such as a mouse, a keyboard, or a touch panel. Specifically, the input interface 930 is a USB (Universal Serial Bus) terminal. The input interface 930 may be a port connected to a LAN (Local Area Network).
  • the output interface 940 is a port to which a cable of a display device such as a display is connected.
  • the output interface 940 is a USB terminal or an HDMI (registered trademark) (High Definition Multimedia Interface) terminal.
  • the display is specifically an LCD (Liquid Crystal Display).
  • the communication device 950 is a device that communicates with other devices via a network.
  • Communication device 950 has a receiver and a transmitter.
  • the communication device 950 is connected to a communication network such as a LAN, the Internet, or a telephone line by wire or wirelessly.
  • the communication device 950 is specifically a communication chip or a NIC (Network Interface Card).
  • the switching management program is read into the processor 910 and executed by the processor 910.
  • the memory 921 stores not only the switching management program but also the OS (Operating System).
  • the processor 910 executes the switching management program while executing the OS.
  • the switching management program and the OS may be stored in the auxiliary storage device 922.
  • the switching management program and the OS stored in the auxiliary storage device 922 are loaded into the memory 921 and executed by the processor 910. Note that part or all of the switching management program may be incorporated into the OS.
  • the switch management device 130 may include a plurality of processors that replace the processor 910.
  • the plurality of processors share the execution of the switching management program.
  • Each processor is a device that executes the switching management program in the same manner as the processor 910.
  • Data, information, signal values and variable values used, processed or output by the switching management program are stored in the memory 921, the auxiliary storage device 922, or a register or cache memory in the processor 910.
  • The switching management program causes a computer to execute each process or each procedure obtained by reading the "unit" of each of the connection disconnection unit 131, the failure acquisition unit 132, the switching unit 133, the load acquisition unit 134, the load determination unit 135, and the connection control unit 136 as "process" or "procedure". Further, the switching management method is the method performed by the switching management device 130 executing the switching management program.
  • the switching management program may be provided by being recorded on a computer readable recording medium, or may be provided as a program product.
  • the switching management device 130 manages system switching of the multiple system monitoring system 119.
  • the switch management device 130 switches the system when the event aggregation (operating system) 111 or the incident management (operating system) 115 can not continue the process in the multi-system monitoring system 119.
  • the fault acquisition unit 132 monitors the state of the active system of the event aggregation unit 110 and the incident management unit 114.
  • the failure acquisition unit 132 outputs the failure information 91 of the multisystem monitoring system 119 to the switching unit 133 when a failure occurs and the process can not be continued.
  • When the switching unit 133 acquires the failure information 91 indicating that a failure has occurred in the operating system monitoring unit 192, the switching unit 133 switches the execution of the monitoring process from the operating system monitoring unit 192 to the standby system monitoring unit 193.
  • That is, when the switching unit 133 receives the failure information 91 of the multiplex monitoring system 119, the switching unit 133 switches from the operating system to the standby system so that the processing of the event aggregation unit 110 or the incident management unit 114 can be continued.
  • the switching unit 133 transmits, to the connection control unit 136, a switching notification 92 indicating that the operation monitoring unit 192 has switched to the standby monitoring unit 193.
  • the switching unit 133 transmits, to the connection control unit 136, a process enable notification 94 indicating that the standby system monitoring unit 193 has become processable.
  • the load acquisition unit 134 acquires load information 93 of the event aggregation unit 110 and the incident management unit 114.
  • the load information 93 is information such as the CPU usage rate, the number of IOs per second, that is, the IOPS, and the number of processing waits.
  • the load acquisition unit 134 outputs the acquired load information 93 to the load determination unit 135.
  • the load determination unit 135 acquires load information 93 representing the load of the standby monitoring unit 193 that executes the monitoring process, and compares the load with a threshold. The load determination unit 135 compares the load included in the load information 93 with the threshold included in the threshold information 137 to determine whether the load exceeds the threshold. Then, the load determination unit 135 outputs the determination result to the connection control unit 136.
  • the connection control unit 136 acquires, from the switching unit 133, a switching notification 92 indicating that the operation monitoring unit 192 has switched to the standby monitoring unit 193.
  • the connection control unit 136 disconnects the communication between the multiplex monitoring system 119 and each of the plurality of monitoring devices 203.
  • the connection control unit 136 also controls communication between each of the plurality of monitoring devices 203 and the standby monitoring unit 193 based on the comparison and determination result of the load and the threshold.
  • the storage unit 139 stores connection information 138 in which the state of communication between each of the plurality of monitoring devices 203 and the standby monitoring unit 193 is set.
  • the connection control unit 136 controls communication using the connection information 138.
  • The connection control unit 136 refers to the connection information 138 and instructs the connection disconnection unit 131 to connect or disconnect each monitoring device 203.
  • When the connection disconnection unit 131 receives an instruction from the connection control unit 136, it connects or disconnects the network between the monitoring device 203 and the multiplex monitoring system 119. Control of communication by the connection control unit 136 will be described later.
  • the monitoring device 203 monitors servers and network devices in the monitoring target system 200 to determine whether a failure has occurred. As a specific example, the monitoring device 203 acquires the CPU usage rate and the PING response information for the host srv1. The monitoring device 203 determines that a failure has occurred when a failure condition is met, such as a CPU usage rate of 90% or more, or, in the case of PING, no response three or more consecutive times. When a failure occurs, the monitoring device 203 transmits the failure data 231 to the multiplex monitoring system 119.
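The failure conditions in this example can be sketched as follows. This is only an illustrative sketch, not the monitoring device's actual implementation: the names `HostSample` and `FailureDetector` are hypothetical, and the two constants simply encode the "90% or more" and "three consecutive times" values quoted above.

```python
from dataclasses import dataclass

CPU_FAILURE_THRESHOLD = 90.0  # percent; "90% or more" in the example above
PING_FAILURE_STREAK = 3       # "no response three or more consecutive times"


@dataclass
class HostSample:
    """One observation of a monitored host (hypothetical structure)."""
    host: str
    cpu_usage: float  # percent
    ping_ok: bool     # True if the host answered the PING


class FailureDetector:
    """Hypothetical helper that tracks consecutive PING misses per host."""

    def __init__(self) -> None:
        self._missed = {}  # host name -> current streak of missed PINGs

    def check(self, sample: HostSample) -> list:
        """Return the failure conditions met by this sample (empty if none)."""
        reasons = []
        if sample.cpu_usage >= CPU_FAILURE_THRESHOLD:
            reasons.append("cpu")
        # A successful PING resets the streak; a miss extends it.
        missed = 0 if sample.ping_ok else self._missed.get(sample.host, 0) + 1
        self._missed[sample.host] = missed
        if missed >= PING_FAILURE_STREAK:
            reasons.append("ping")
        return reasons
```

A non-empty result would correspond to the point at which the monitoring device 203 transmits the failure data 231.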
  • the event aggregation unit 110 receives the failure data 231 sent from the plurality of monitoring devices 203.
  • the event aggregation unit 110 processes the received fault data 231 and stores the process as event information in the event database 113.
  • In the initial state, the event aggregation (operating system) 111 is operating. When a failure occurs in the operating system and processing cannot be continued, the event aggregation (standby system) 112 is activated and processing continues.
  • the event aggregation unit 110 sends event information to the incident management unit 114.
  • the incident management unit 114 manages event information as incident information. Further, the incident management unit 114 displays the incident information on the failure display unit 118.
  • the operator 121 responds to the fault displayed on the fault display unit 118. Specifically, the operator 121 contacts the customer who uses the failed server or network device, obtains detailed information such as logs from the server or network device in which the failure occurred, and attempts failure recovery according to a predetermined procedure such as a restart or a setting change. When parts replacement is necessary, the operator 121 notifies the customer accordingly. When the response is completed, information indicating that the fault has been resolved is recorded in the incident information, and the fault is removed from the display on the fault display unit 118. In the initial state, the incident management (operating system) 115 is operating. When a failure occurs in the operating system and processing cannot be continued, the incident management (standby system) 116 is activated and processing continues.
  • FIG. 3 is a flowchart of switch management processing S130 by the switch management apparatus 130 according to the present embodiment.
  • FIG. 4 is a diagram showing connection information 138 according to the present embodiment.
  • FIG. 5 is a diagram showing threshold information 137 according to the present embodiment. The switch management process S130 will be described with reference to FIGS. 3 to 5.
  • The connection disconnection unit 131 connects or disconnects the communication between each monitoring device 203 and the multiplex monitoring system 119 according to the connection information 138. As shown in FIG. 4, in the initial state of the switching management device 130, all the monitoring devices described in the connection information 138 are in the connected state.
  • In step S102, the fault acquisition unit 132 monitors the states of the event aggregation (operating system) 111 and the incident management (operating system) 115. If the fault acquisition unit 132 determines that processing cannot be continued, for example because a program has terminated abnormally or has stopped responding, it outputs the failure information 91 to the switching unit 133.
  • In step S103, the switching unit 133 disconnects the operating system that cannot continue processing, activates the standby system, and switches the connection so that processing can continue in the standby system.
  • As one method of switching between the operating system and the standby system, the event aggregation unit 110 is given a single virtual IP address, which is translated to the physical IP address of either the event aggregation (operating system) 111 or the event aggregation (standby system) 112.
  • The switching unit 133 sends the switching notification 92 to the connection control unit 136 when the operating system stops responding and is disconnected. The switching unit 133 also sends the process enable notification 94 to the connection control unit 136 when the standby system has been activated and can process.
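The virtual-address switching and the two notifications described above can be illustrated with a small sketch. The class `VirtualEndpoint`, the callback-based `notify` parameter, and the example addresses (taken from a documentation address range) are all assumptions made for illustration; the source does not specify how the switching unit 133 is implemented.

```python
from enum import Enum
from typing import Callable


class Notification(Enum):
    SWITCHING = 92       # corresponds to the switching notification 92
    PROCESS_ENABLE = 94  # corresponds to the process enable notification 94


class VirtualEndpoint:
    """Hypothetical mapping of one virtual IP to the currently active physical IP."""

    def __init__(self, virtual_ip: str, operating_ip: str, standby_ip: str) -> None:
        self.virtual_ip = virtual_ip
        self._standby_ip = standby_ip
        self._current_ip = operating_ip  # the operating system is active initially

    def resolve(self) -> str:
        """Return the physical address that the virtual IP currently maps to."""
        return self._current_ip

    def switch_to_standby(self, notify: Callable[[Notification], None]) -> None:
        """Fail over to the standby system, emitting both notifications in order."""
        notify(Notification.SWITCHING)       # operating system disconnected
        self._current_ip = self._standby_ip  # standby system takes over the address
        notify(Notification.PROCESS_ENABLE)  # standby system can now process
```

In this sketch the receiver of `notify` plays the role of the connection control unit 136, which reacts differently to each of the two notifications.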
  • In step S104, the load acquisition unit 134 periodically acquires the load of the event aggregation (operating system) 111 and the incident management (operating system) 115.
  • The threshold information 137 specifies, for each entry, a target from which the load is acquired, a load item, a threshold, and a condition.
  • the load acquisition unit 134 acquires information for objects corresponding to the load items described in the threshold information 137. Specifically, it is load information such as CPU utilization, IOPS, number of processing waits, or response time.
  • the load acquisition unit 134 sends the acquired load information to the load determination unit 135.
  • the load determination unit 135 compares the acquired load with the threshold based on the threshold information 137.
  • the load determination unit 135 determines, based on the comparison result, whether the acquired load matches the threshold and the condition. Specifically, in the threshold information 137 of FIG. 5, when incident management is targeted, the threshold of the CPU usage rate which is a load item is “80%”, and the condition is “or more”. Here, it is assumed that the CPU usage rate acquired from the incident management (operating system) 115 is 70% as load information. At this time, the load determination unit 135 determines that the threshold value is not exceeded because the load information does not become “80% or more”, which is a combination of the threshold value and the condition. The comparison determination result determined by the load determination unit 135 is sent to the connection control unit 136.
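The comparison in this example can be written out as a minimal sketch. The `ThresholdEntry` structure and the `exceeds_threshold` function are hypothetical names; only the "or more" condition appears in the text, so the "or less" entry in the table below is an assumption added to show how other conditions might be expressed.

```python
import operator
from dataclasses import dataclass

# Maps a condition label to a comparison. Only "or more" appears in the
# example above; "or less" is an assumed additional condition.
CONDITIONS = {
    "or more": operator.ge,
    "or less": operator.le,
}


@dataclass(frozen=True)
class ThresholdEntry:
    """One row of threshold information (hypothetical structure)."""
    target: str       # e.g. "incident management"
    load_item: str    # e.g. "CPU usage rate"
    threshold: float  # e.g. 80.0 (percent)
    condition: str    # e.g. "or more"


def exceeds_threshold(entry: ThresholdEntry, load: float) -> bool:
    """Return True when the load satisfies the entry's threshold and condition."""
    return CONDITIONS[entry.condition](load, entry.threshold)


# The example from the text: threshold 80%, condition "or more".
entry = ThresholdEntry("incident management", "CPU usage rate", 80.0, "or more")
```

With this entry, a CPU usage rate of 70% does not satisfy "80% or more", matching the determination described above, while a value of exactly 80% does.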
  • Connection control processing: the connection control process S106 by the connection control unit 136 according to the present embodiment will be described with reference to the flowchart of the connection control process S106.
  • Step S601 is processing in the initial state.
  • the connection control unit 136 permits the connection disconnection unit 131 to perform all communications.
  • the connection control unit 136 receives a notification from the switching unit 133.
  • The notification from the switching unit 133 is either the switching notification 92, sent when the operating system stops responding and is disconnected, or the process enable notification 94, sent when the standby system has been activated and can process.
  • the connection control unit 136 determines whether the notification is the switching notification 92. If the notification is the switching notification 92, the process proceeds to step S604.
  • step S604 the connection control unit 136 instructs the connection disconnection unit 131 to disconnect all communications. Specifically, the connection control unit 136 sets the disconnection state for all the monitoring devices described in the connection information 138. Then, the connection control unit 136 instructs the connection disconnection unit 131 to disconnect communication with all the monitoring devices.
  • In step S605, the connection control unit 136 waits until it receives, from the switching unit 133, a process enable notification 94 indicating that the standby system has become able to process.
  • In step S605a, when the connection control unit 136 acquires the process enable notification 94 from the switching unit 133, it connects communication for at least one of the plurality of monitoring devices.
  • Specifically, on receiving from the switching unit 133 the process enable notification 94 indicating that the standby system has been activated, the connection control unit 136 changes one of the monitoring devices listed in the connection information 138 to the connected state. Then, the connection control unit 136 instructs the connection disconnection unit 131 to connect communication with that monitoring device.
  • In step S606, the connection control unit 136 acquires the comparison determination result from the load determination unit 135.
  • In step S607, the connection control unit 136 determines whether the load exceeds the threshold based on the comparison determination result. If the load exceeds the threshold, the connection control unit 136 proceeds to step S608. If the load is equal to or less than the threshold, the connection control unit 136 proceeds to step S609.
  • In step S608, the connection control unit 136 disconnects communication for at least one of the plurality of monitoring devices. That is, the connection control unit 136 instructs the connection disconnection unit 131 to disconnect one of the connected monitoring devices.
  • Specifically, the connection control unit 136 disconnects one of the monitoring devices in the connected state by rewriting the monitoring device disconnection table 138.
  • In step S609, the connection control unit 136 connects communication for at least one of the plurality of monitoring devices. That is, the connection control unit 136 instructs the connection disconnection unit 131 to connect one of the disconnected monitoring devices.
  • Specifically, the connection control unit 136 connects one of the disconnected monitoring devices by rewriting the monitoring device disconnection table 138.
  • In step S610, the connection control unit 136 waits for a predetermined time.
  • In step S611, the connection control unit 136 determines whether all the monitoring devices are in the connected state. Specifically, the connection control unit 136 makes this determination by referring to the monitoring device disconnection table 138. If all the monitoring devices are in the connected state, the connection control unit 136 proceeds to step S612. If any monitoring device is not yet connected, the connection control unit 136 returns to step S606. In step S612, the switching to the standby system is complete. The connection control unit 136 swaps the standby system and the operating system if necessary. As described above, the connection control unit 136 ends the control of communication once communication is connected for all of the plurality of monitoring devices.
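  • The flow of steps S605a to S612 can be sketched as follows. This is only an illustrative sketch, not the implementation of the embodiment: the function names, the dictionary used in place of the connection information 138, the callbacks for the load determination and the wait, and the `max_rounds` safety limit are all assumptions introduced here.

```python
from typing import Callable, Dict


def _set_one(connected: Dict[str, bool], want: bool) -> None:
    """Flip the first device whose state differs from `want`; do nothing if none."""
    for name, state in connected.items():
        if state != want:
            connected[name] = want
            return


def control_connections(
    connected: Dict[str, bool],
    load_exceeds_threshold: Callable[[], bool],
    wait: Callable[[], None],
    max_rounds: int = 1000,  # assumed safety limit, not part of the source flow
) -> int:
    """Run the S605a-S612 loop and return the number of S606-S611 rounds used."""
    # On system switching, communication with every monitoring device is cut.
    for name in connected:
        connected[name] = False
    # S605a: the standby system is ready, so connect one monitoring device.
    _set_one(connected, want=True)
    for rounds in range(1, max_rounds + 1):
        if load_exceeds_threshold():          # S606, S607
            _set_one(connected, want=False)   # S608: disconnect one device
        else:
            _set_one(connected, want=True)    # S609: connect one device
        wait()                                # S610: wait a predetermined time
        if all(connected.values()):           # S611: all devices connected?
            return rounds                     # S612: switching is complete
    raise RuntimeError("monitoring devices were not all connected in time")
```

  • In this sketch the device to connect or disconnect is simply the first candidate found; the text only requires that one device be chosen.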
  • The connection control unit 136 may change one of the monitoring devices in the disconnected state to the connected state.
  • That is, the connection control unit 136 may refer to the connection information 138, change one of the monitoring devices in the disconnected state to the connected state, and instruct the connection disconnection unit 131 to connect communication with that monitoring device. Also, if the comparison determination result sent from the load determination unit 135 indicates that the load has exceeded the threshold for a certain period of time, the connection control unit 136 may change one of the monitoring devices in the connected state to the disconnected state. That is, the connection control unit 136 may refer to the connection information 138, change one of the monitoring devices in the connected state to the disconnected state, and instruct the connection disconnection unit 131 to disconnect communication with that monitoring device.
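  • A determination that holds "for a certain period of time" could, for example, be approximated by requiring the same comparison result over several consecutive load samples. The following is a minimal sketch under that assumption; the class name, the sample-count window, and the returned labels are hypothetical and do not come from the embodiment.

```python
from collections import deque


class SustainedLoadJudge:
    """Report a result only after it has held for `window` consecutive samples."""

    def __init__(self, threshold: float, window: int) -> None:
        if window < 1:
            raise ValueError("window must be at least 1")
        self.threshold = threshold
        # Keeps only the most recent `window` comparison results.
        self.recent = deque(maxlen=window)

    def observe(self, load: float) -> str:
        """Record one load sample and return 'over', 'under', or 'undecided'."""
        self.recent.append(load > self.threshold)
        if len(self.recent) < self.recent.maxlen:
            return "undecided"   # not enough samples yet
        if all(self.recent):
            return "over"        # load exceeded the threshold for the whole window
        if not any(self.recent):
            return "under"       # load stayed at or below the threshold
        return "undecided"       # mixed results within the window
```

  • With this sketch, an "over" result would correspond to disconnecting one monitoring device and an "under" result to connecting one.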
  • The connection disconnection unit 131 connects or disconnects the communication between the monitoring device 203 and the multi-system monitoring system 119 according to the connection or disconnection instruction sent from the connection control unit 136.
  • Methods of connection or disconnection include changing a firewall policy to permit or deny communication, and changing a VLAN (Virtual LAN) setting.
  • The above-mentioned collection of fault information and load information may use the monitoring data or load data formats used in tools and systems such as commercially available server monitoring or network monitoring. In that case, each operation is similar to that of such server monitoring or network monitoring tools and systems.
  • A commercially available clustering tool may be used to acquire fault data of the multi-system monitoring system and to switch the multi-system monitoring system.
  • In that case, failure detection and switching are performed by the methods that the clustering tool provides.
  • Although the event aggregation unit and the incident management unit are separate in the present embodiment, they may be realized by one functional block.
  • The functions used in a general multi-system monitoring system, such as a configuration management unit, a response history storage unit, a response history search unit, an event correlation analysis unit, and an automatic failure handling unit, may also be present. Even in this case, the operation of switching to the standby system when the operating system of the multiplexed part becomes unable to respond is the same.
  • The functions of the switching management device 130 are realized by software; as a modification, they may be realized by hardware.
  • FIG. 7 is a diagram showing a configuration of a switching management device 130 according to a modification of the present embodiment.
  • The switching management device 130 includes an electronic circuit 909, a memory 921, an auxiliary storage device 922, an input interface 930, an output interface 940, and a communication device 950.
  • The electronic circuit 909 is a dedicated electronic circuit that implements the functions of the connection disconnection unit 131, the fault acquisition unit 132, the switching unit 133, the load acquisition unit 134, the load determination unit 135, and the connection control unit 136.
  • Specifically, the electronic circuit 909 is a single circuit, a composite circuit, a programmed processor, a parallel-programmed processor, a logic IC, a GA, an ASIC, or an FPGA.
  • GA is an abbreviation of Gate Array.
  • ASIC is an abbreviation of Application Specific Integrated Circuit.
  • FPGA is an abbreviation of Field-Programmable Gate Array.
  • The functions of the components of the switching management device 130 may be realized by one electronic circuit or may be distributed across a plurality of electronic circuits. As another modification, some of the functions of the components of the switching management device 130 may be realized by an electronic circuit, and the remaining functions may be realized by software.
  • The processor and the electronic circuit are each also referred to as processing circuitry. That is, in the switching management device 130, the functions of the connection disconnection unit 131, the fault acquisition unit 132, the switching unit 133, the load acquisition unit 134, the load determination unit 135, and the connection control unit 136 are realized by processing circuitry.
  • In the switching management device 130, the “unit” of the connection disconnection unit 131, the fault acquisition unit 132, the switching unit 133, the load acquisition unit 134, the load determination unit 135, and the connection control unit 136 may be read as “process”.
  • Likewise, the “processing” of the connection disconnection processing, fault acquisition processing, switching processing, load acquisition processing, load determination processing, and connection control processing may be read as “program”, “program product”, or “computer-readable recording medium on which a program is recorded”.
  • According to the monitoring control system, it is possible to prevent the standby system from being overloaded when an operating system that has become unresponsive due to overload is switched to the standby system.
  • When the multiplexed monitoring unit becomes unresponsive due to overload and is switched to the standby system, communication with all the monitoring devices is first disconnected. Therefore, the load on the standby system does not increase.
  • In the monitoring control system, if the load is equal to or less than the threshold, the monitoring devices are connected one by one based on the connection information. If the threshold is exceeded, the monitoring devices are disconnected one by one.
  • This prevents the load on the standby system from exceeding the threshold and the standby system from becoming unresponsive due to overload.
  • The failure of the network backbone device or of the carrier line that is the root cause of the large amount of failure information can be dealt with in various ways, such as restarting the device, changing its settings, contacting the carrier, and dispatching repair personnel to the site where the device is installed. Therefore, the root cause will eventually be eliminated. Once the root cause is eliminated, the monitoring devices no longer send a large amount of failure information and the load on the multi-system monitoring system decreases, so the number of connected monitoring devices increases until communication with all the monitoring devices is connected.
  • The connection control unit 136 controls communication based on the degree of importance in addition to the comparison determination result between the load and the threshold.
  • The connection information 138a according to the present embodiment also manages the importance of each monitoring device.
  • FIG. 8 is a diagram showing connection information 138a according to the present embodiment.
  • The connection information 138a according to the present embodiment holds importance information in addition to the contents of the connection information 138 according to the first embodiment.
  • The degree of importance indicates how important the monitoring device is and is represented by a real value from 0 to 1, with 1 being the most important. Specifically, a monitoring device that monitors main network devices is considered highly important, so its importance is set high. The importance may also be varied according to the monitoring contract with the customer.
  • FIG. 9 is a diagram showing an example of a user interface for setting connection information 138a according to the present embodiment.
  • FIG. 10 is a diagram showing a state in which the drop-down list is displayed in FIG. 9. This user interface is displayed on the display device via the output interface 940.
  • The administrator of the system makes settings with the input device via the input interface 930.
  • The importance field is a drop-down list, and its drop-down arrow is selected with an input device such as a mouse.
  • A list of importance values is then displayed. The importance is set by selecting a value from this list with the input device.
  • The system administrator sets the importance and then selects the “OK” button, whereby the setting is stored in the connection information 138a in the memory 921. If the “Cancel” button is selected, the setting is not stored.
  • Although three monitoring device name fields are shown in FIGS. 9 and 10, there may be more, or the number of fields may be changed.
  • FIG. 11 is a diagram showing another example of the user interface for setting the connection information 138a according to the present embodiment.
  • An input device is used to move a label or icon indicating each monitoring device on the user interface. The importance of each monitoring device is set by placing its label or icon at a position indicating the level of importance.
  • When the connection control unit 136 receives from the switching unit 133 the process enable notification 94 indicating that the standby system has been activated, the connection control unit 136 refers to the connection information 138a and changes the monitoring device with the highest importance among those in the disconnected state to the connected state. Then, the connection control unit 136 instructs the connection disconnection unit 131 to connect communication with that monitoring device. If the comparison determination result sent from the load determination unit 135 remains at or below the threshold for a certain period of time, the connection control unit 136 refers to the connection information 138a and changes the monitoring device with the highest importance among those in the disconnected state to the connected state.
  • Then, the connection control unit 136 instructs the connection disconnection unit 131 to connect communication with that monitoring device. If the comparison determination result sent from the load determination unit 135 is greater than or equal to the threshold, the connection control unit 136 refers to the connection information 138a and changes the one monitoring device with the lowest importance among those in the connected state to the disconnected state. Then, the connection control unit 136 instructs the connection disconnection unit 131 to disconnect communication with that monitoring device.
  • In this way, monitoring devices can be connected in descending order of importance.
  • By giving priority to restoring the monitoring of core system components such as backbone network devices, the root cause of a large-scale failure can be investigated more quickly.
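  • The importance-based selection described above can be illustrated with a short sketch. The `DeviceEntry` record stands in for one row of the connection information 138a; its field names, the helper functions, and the example device names are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DeviceEntry:
    name: str
    importance: float        # real value from 0 to 1, 1 being the most important
    connected: bool = False


def next_to_connect(entries: List[DeviceEntry]) -> Optional[DeviceEntry]:
    """Pick the disconnected device with the highest importance, if any."""
    candidates = [e for e in entries if not e.connected]
    return max(candidates, key=lambda e: e.importance, default=None)


def next_to_disconnect(entries: List[DeviceEntry]) -> Optional[DeviceEntry]:
    """Pick the connected device with the lowest importance, if any."""
    candidates = [e for e in entries if e.connected]
    return min(candidates, key=lambda e: e.importance, default=None)


def adjust(entries: List[DeviceEntry], load_over_threshold: bool) -> Optional[DeviceEntry]:
    """Disconnect one device when overloaded, otherwise connect one device."""
    if load_over_threshold:
        target = next_to_disconnect(entries)
    else:
        target = next_to_connect(entries)
    if target is not None:
        target.connected = not load_over_threshold
    return target
```

  • For example, with entries for a backbone monitor (importance 1.0) and a branch monitor (importance 0.3), two calls with the load under the threshold would connect the backbone monitor first and the branch monitor second, and a later call with the load over the threshold would disconnect the branch monitor.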
  • FIG. 12 is a diagram showing a configuration of a monitoring control system 500b according to the present embodiment.
  • The monitoring control system 500b according to the present embodiment includes a transfer unit 140 that transfers fault data 231 transmitted from each of the plurality of monitoring devices 203 to the multi-system monitoring system 119.
  • The switching unit 133 transmits the switching notification 92 to the transfer unit 140.
  • The load determination unit 135 transmits the comparison determination result to the transfer unit 140.
  • The transfer unit 140 controls the number of transfers of the failure data 231 based on the switching notification 92 and the comparison determination result.
  • The transfer unit 140 transfers the fault data 231 transmitted from the monitoring device 203.
  • The transfer unit 140 includes a transfer control unit 141 and a transfer buffer 142.
  • The transfer control unit 141 controls transfer of the fault data 231 from the monitoring device 203.
  • The transfer buffer 142 temporarily stores the fault data 231.
  • FIG. 13 shows an example of the transfer setting information 143 of the fault data 231 according to the present embodiment.
  • The transfer unit 140 has transfer setting information 143.
  • The transfer setting information 143 is stored in advance in the transfer unit 140.
  • The transfer setting information 143 sets the transfer interval, the initial value of the transfer number, the number by which to reduce the transfer number, and the number by which to increase the transfer number.
  • The number by which to reduce the transfer number is the amount by which the number of transfers is reduced when the load on the standby system exceeds the threshold after switching occurs. The number by which to increase the transfer number is the amount by which the number of transfers is increased when the load on the standby system falls below the threshold.
  • Transfer of the fault data 231 falls into two main types.
  • One is Pull type transfer.
  • In Pull type transfer, the monitoring device 203 has an internal database.
  • The monitoring device 203 detects a failure and temporarily stores failure data 231 in the database.
  • The information is transferred to the multi-system monitoring system 119 through access to an API (application programming interface) provided by the monitoring device 203 or to the database.
  • The transfer protocol may be a protocol using a Web API, an access protocol provided by the database, or a protocol unique to the monitoring device.
  • The multi-system monitoring system 119 retrieves the failure data 231 through the API or database access, and can specify the number of pieces of failure data 231 to acquire at one time.
  • The other is Push type transfer.
  • In Push type transfer, the monitoring device 203 does not have a database.
  • The monitoring device 203 immediately transfers to the multi-system monitoring system 119 the failure data 231 of a failure that has occurred in a monitored server 205 or network device.
  • As the transfer protocol, a protocol such as SNMP Trap is used.
  • Fault data 231 is transferred each time the monitoring device 203 detects a fault. Therefore, when a network backbone device fails, many servers and devices may become unreachable, and a large amount of failure data 231 may be transferred at one time.
  • The switching unit 133 transmits the switching notification 92 to the transfer unit 140 when system switching is performed. Further, the load determination unit 135 transmits to the transfer unit 140 the comparison determination result obtained by comparing the load with the threshold.
  • In Pull type transfer, the transfer unit 140 transfers all failure data 231 detected by the monitoring device 203 to the multi-system monitoring system 119 at regular intervals via the connection disconnection unit 131.
  • On receiving the switching notification 92, the transfer control unit 141 reduces the number of pieces of failure data 231 acquired from the monitoring device 203 at regular intervals, based on the transfer setting information 143. If the load determination unit 135 determines that the load has remained at or below the threshold for a predetermined time, the transfer control unit 141 increases the number of pieces of failure data 231 acquired from the monitoring device 203 at regular intervals, based on the transfer setting information 143. When the load is determined to be at or above the threshold, the transfer control unit 141 reduces that number. Specifically, the transfer control unit 141 controls the number of pieces of failure data 231 acquired at regular intervals as follows.
  • New number = current number - (Pull type transfer-number reduction on overload)
  • When the load is not exceeded, the number of pieces of failure data 231 acquired at regular intervals is increased by the Pull type transfer-number increase for the non-overloaded case.
  • New number = current number + (Pull type transfer-number increase when not overloaded)
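  • The two update rules for Pull type transfer can be written out as a small sketch. The settings class mirrors the items named for the transfer setting information 143, but its field names and the lower bound `minimum` (which keeps the count from dropping to zero or below) are assumptions added here, not part of the described embodiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PullTransferSettings:
    initial_count: int                 # initial value of the transfer number
    decrease_on_overload: int          # number by which to reduce the transfer number
    increase_when_not_overloaded: int  # number by which to increase the transfer number


def next_pull_count(
    current: int,
    overloaded: bool,
    settings: PullTransferSettings,
    minimum: int = 1,  # assumed floor so that some fault data is always fetched
) -> int:
    """Return the number of fault data items to fetch in the next interval."""
    if overloaded:
        # new number = current number - reduction
        return max(minimum, current - settings.decrease_on_overload)
    # new number = current number + increase
    return current + settings.increase_when_not_overloaded
```

  • With hypothetical settings of an initial count of 100, a reduction of 30, and an increase of 10, an overloaded interval lowers the count from 100 to 70, and the following non-overloaded interval raises it to 80.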
  • In Push type transfer, the transfer unit 140 immediately transfers all fault data 231 detected and reported by the monitoring device 203 to the multi-system monitoring system 119 via the connection disconnection unit 131.
  • On receiving the switching notification 92, the transfer control unit 141 temporarily stores all the failure data 231 sent from the monitoring device 203 in the transfer buffer 142.
  • The transfer buffer 142 performs no processing on the fault data 231 and only stores it temporarily. Because this is lightweight, all fault data 231 can be stored in the transfer buffer 142 even if a large amount is received from the monitoring device 203 at one time.
  • The transfer control unit 141 sends the fault data 231 to the multi-system monitoring system 119 oldest first. In doing so, the transfer control unit 141 limits the number of transfers per fixed interval to a fixed number, based on the transfer setting information 143. If the load determination unit 135 determines that the load has remained at or below the threshold for a predetermined time, the transfer control unit 141 increases the number of transfers per interval from the transfer buffer 142. When the load is determined to be at or above the threshold, the transfer control unit 141 reduces the number of transfers per interval from the transfer buffer 142.
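  • The buffering behaviour for Push type transfer can be pictured with a first-in, first-out queue. The sketch below is illustrative only: the class and method names are hypothetical, and the lower bound of one transfer per interval is an assumption.

```python
from collections import deque
from typing import Any, List


class TransferBuffer:
    """Store fault data as received and release it oldest first at a limited rate."""

    def __init__(self, per_interval: int) -> None:
        self.queue: deque = deque()
        self.per_interval = per_interval  # transfers allowed in one interval

    def store(self, fault: Any) -> None:
        # No processing is done on arrival; the item is only queued.
        self.queue.append(fault)

    def drain_once(self) -> List[Any]:
        """Return the items to forward in the current interval, oldest first."""
        count = min(self.per_interval, len(self.queue))
        return [self.queue.popleft() for _ in range(count)]

    def adjust(self, delta: int, minimum: int = 1) -> None:
        """Raise or lower the per-interval limit (assumed floor of `minimum`)."""
        self.per_interval = max(minimum, self.per_interval + delta)
```

  • A controller would call `drain_once` once per transfer interval and call `adjust` with a positive or negative value depending on the comparison determination result.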
  • According to the monitoring control system 500b of the present embodiment, the number of failure data transfers can be temporarily limited when switching occurs in the multi-system monitoring system. Further, if the load is equal to or less than the threshold, the number of transfers is increased gradually, so the standby system of the multi-system monitoring system is not overloaded. For Pull type fault data transfer, fault data is stored in the database of the monitoring device, so no fault data is lost during or after switching. For Push type fault data transfer, the transfer buffer 142 temporarily stores fault data, so no fault data is lost during or after switching.
  • FIG. 14 is a diagram showing another example of the transfer setting information 143 according to the present embodiment.
  • In the description above, the number of transfers is increased or decreased based on whether the threshold is exceeded.
  • Alternatively, the increase and decrease of the number of transfers may be controlled in stages according to the load.
  • In FIG. 14, the horizontal axis represents the load level and the vertical axis represents the increase or decrease of the transfer number, and a curve is set so that the increase or decrease changes continuously with the load level.
  • With the transfer setting information 143 of FIG. 14, the number of transfers can be set according to the type of load, such as CPU usage rate or IOPS.
  • The transfer control unit 141 obtains the increase or decrease of the transfer number for the current load value from the transfer setting information 143 of FIG. 14. Then, as in the case of using the transfer setting information of FIG. 13, the transfer control unit 141 calculates a new transfer number using that increase or decrease.
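  • One way to look up an increase or decrease from a load curve is linear interpolation between a few configured points. This is only a sketch under that assumption: the text does not say how the curve of FIG. 14 is represented, and the function name, the point-list format, and the example values below are hypothetical.

```python
from bisect import bisect_right
from typing import List, Tuple


def transfer_delta(load: float, curve: List[Tuple[float, int]]) -> int:
    """Interpolate the transfer-number change for `load` from sorted (load, delta) points."""
    xs = [x for x, _ in curve]
    if load <= xs[0]:
        return curve[0][1]       # clamp below the first point
    if load >= xs[-1]:
        return curve[-1][1]      # clamp above the last point
    i = bisect_right(xs, load)   # index of the first point to the right of `load`
    (x0, y0), (x1, y1) = curve[i - 1], curve[i]
    return round(y0 + (y1 - y0) * (load - x0) / (x1 - x0))


# Hypothetical curve for one load type (for example, CPU usage rate from 0 to 1):
# a large increase at low load, no change at 0.8, a sharp decrease near saturation.
CPU_CURVE = [(0.0, 20), (0.5, 10), (0.8, 0), (1.0, -40)]
```

  • A separate point list could be held for each load type, such as CPU usage rate and IOPS, and the resulting value added to the current transfer number in the same way as with the fixed increase and decrease.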
  • In the above embodiments, each part of the monitoring control system has been described as an independent functional block.
  • However, the configuration of the monitoring control system need not be the same as in the above-described embodiments.
  • The functional blocks of the monitoring control system may have any configuration as long as the functions described in the above-described embodiments can be realized.
  • A plurality of parts of Embodiments 1 to 3 may be combined and implemented. Alternatively, only one part of these embodiments may be implemented. In addition, these embodiments may be implemented in any combination, in whole or in part.
  • The embodiments described above are essentially preferable examples and are not intended to limit the scope of the present invention, the scope of its applications, or the scope of its uses. The embodiments described above can be variously modified as needed.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Hardware Redundancy (AREA)
  • Computer And Data Communications (AREA)
  • Debugging And Monitoring (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

When a switching unit (133) acquires a failure notification indicating that a failure has occurred in an operation system monitoring unit (192), system switching from the operation system monitoring unit (192) to a standby system monitoring unit (193) is performed. When acquiring a switching notification from the switching unit (133), a connection control unit (136) disconnects communication between a multi-system monitoring system (119) and each of a plurality of monitoring devices (203). A load determination unit (135) compares a load on the standby system monitoring unit (193) with a threshold. On the basis of the result of comparison between the load and the threshold, the connection control unit (136) controls communication between each of the plurality of monitoring devices (203) and the standby system monitoring unit (193).

Description

Switching management device, monitoring control system, switching management method, and switching management program
The present invention relates to a switching management device, a monitoring control system, a switching management method, and a switching management program. In particular, the present invention relates to a switching management device, a monitoring control system, a switching management method, and a switching management program that manage system switching in a multi-system monitoring system.
A monitoring system monitors monitoring targets such as servers or network devices. A monitoring system must handle every failure that occurs in its monitoring targets. For this reason, monitoring systems are commonly built as multiple (redundant) systems to provide fault tolerance.
Patent Document 1 discloses a technique for preventing a standby system from stopping by using operation data synchronized with a failure flag and a stop flag when switching systems in a multiple system.
Patent Document 2 discloses a technique for stabilizing a web server by temporarily limiting the number of clients communicating with the web server.
Japanese Patent No. 5342701
JP 2008-250669 A
In a multi-system monitoring system, when a large number of failures are detected in the monitoring targets, the operating system may stop responding due to overload and be switched to the standby system. After the switch, however, the standby system must now process the large number of failures. The standby system may therefore also become overloaded and stop responding, so that both the operating system and the standby system go down together. This situation arises when a failure occurs in a backbone device of the network, such as a switch, a router, or a firewall, or in the carrier line connecting the multi-system monitoring system and the monitoring targets.
When such a failure occurs, every device located beyond the failure point in the network topology becomes unreachable from the multi-system monitoring system. Failures reporting that all of these monitoring targets are unreachable therefore occur at once, and the multi-system monitoring system must process all of that failure information. As a result, it becomes overloaded, and both the operating system and the standby system go down together.
An object of the present invention is to prevent the standby system from being overloaded when the system is switched to the standby system due to an overload.
The switching management device according to the present invention manages system switching of a multi-system monitoring system that includes an operating system monitoring unit, which executes monitoring processing for a plurality of monitoring devices, and a standby system monitoring unit, which executes the monitoring processing in place of the operating system monitoring unit when a failure occurs in the operating system monitoring unit. The switching management device includes:
a switching unit that, upon acquiring a failure notification indicating that a failure has occurred in the operating system monitoring unit, switches execution of the monitoring processing from the operating system monitoring unit to the standby system monitoring unit;
a connection control unit that, upon acquiring from the switching unit a switching notification indicating that switching has been made from the operating system monitoring unit to the standby system monitoring unit, disconnects communication between the multi-system monitoring system and each of the plurality of monitoring devices; and
a load determination unit that acquires load information representing the load on the standby system monitoring unit executing the monitoring processing, and compares the load with a threshold.
The connection control unit controls the communication between each of the plurality of monitoring devices and the standby system monitoring unit based on the result of comparing the load with the threshold.
In the switching management device according to the present invention, when switching to the standby system monitoring unit occurs, the connection control unit disconnects communication between the multi-system monitoring system and each of the plurality of monitoring devices. The connection control unit then controls communication between each of the plurality of monitoring devices and the standby system monitoring unit based on the result of comparing the load on the standby system monitoring unit with the threshold. Therefore, the switching management device according to the present invention can prevent the standby system monitoring unit from becoming overloaded after the switch.
FIG. 1 is a block diagram of a monitoring control system 500 according to Embodiment 1.
FIG. 2 is a block diagram of a switching management device 130 according to Embodiment 1.
FIG. 3 is a flowchart of switching management processing S130 by the switching management device 130 according to Embodiment 1.
FIG. 4 is a diagram showing connection information 138 according to Embodiment 1.
FIG. 5 is a diagram showing threshold information 137 according to Embodiment 1.
FIG. 6 is a flowchart of connection control processing S106 by the connection control unit 136 according to Embodiment 1.
FIG. 7 is a block diagram of a switching management device 130 according to a modification of Embodiment 1.
FIG. 8 is a diagram showing a connection table 138a according to Embodiment 2.
FIG. 9 is a diagram showing an example of a user interface for setting the connection table 138a according to Embodiment 2.
FIG. 10 is a diagram showing a state in which the drop-down list is displayed in FIG. 9.
FIG. 11 is a diagram showing another example of the user interface for setting the connection table 138a according to Embodiment 2.
FIG. 12 is a block diagram of a monitoring control system 500b according to Embodiment 3.
FIG. 13 is a diagram showing an example of transfer setting information 143 according to Embodiment 3.
FIG. 14 is a diagram showing another example of the transfer setting information 143 according to Embodiment 3.
Hereinafter, embodiments of the present invention will be described with reference to the drawings. In the drawings, the same or corresponding parts are denoted by the same reference numerals. In the description of the embodiments, the description of the same or corresponding parts is omitted or simplified as appropriate.
Embodiment 1
*** Description of the configuration ***
The configuration of the monitoring control system 500 according to the present embodiment will be described with reference to FIG. 1.
The monitoring control system 500 includes a multiplex monitoring system 119, a monitoring target system 200, and a switching management device 130.
The multiplex monitoring system 119 and the monitoring target system 200 are connected via a network 106, an FW 107, and the switching management device 130.
The network 106 connects the multiplex monitoring system 119 and the monitoring target system 200. Depending on requirements such as the installation site of the monitoring target system 200, the network 106 is the Internet, an intranet, or another network.
<Monitoring target system 200>
The monitoring target system 200 is a system that provides a service to users. There are one or more monitoring target systems 200.
The monitoring target system 200 includes an FW 202, a monitoring device 203, an SW 204, and a server 205.
The FW 202 is a firewall that connects the inside of the monitoring target system 200 to the outside. Depending on the configuration of the monitoring target system 200, there may be one FW 202, a plurality of FWs 202, or none.
The monitoring device 203 monitors the servers and network devices that constitute the monitoring target system 200. When it detects an abnormality, the monitoring device 203 transmits failure data 231 to the multiplex monitoring system 119. There are one or more monitoring devices 203 in each monitoring target system 200.
The SW 204 is a network device that connects the monitoring device 203 to the servers and devices in the monitoring target system 200. Specifically, the SW 204 is a device such as a router, a switch, or a hub. There are one or more SWs 204 in each monitoring target system 200.
There are one or more servers 205 in each monitoring target system 200.
The FW 107 is a firewall that connects the multiplex monitoring system 119 to the network 106. Depending on the system configuration, there may be one FW 107, a plurality of FWs 107, or none.
<Multiplex monitoring system 119>
The multiplex monitoring system 119 includes a monitoring unit 191 that executes monitoring processing for the plurality of monitoring devices 203. The monitoring unit 191 includes an active monitoring unit 192 and a standby monitoring unit 193, which executes the monitoring processing in place of the active monitoring unit 192 when a failure occurs in the active monitoring unit 192. The active monitoring unit 192 includes an event aggregation (active) 111 and an incident management (active) 115. The standby monitoring unit 193 includes an event aggregation (standby) 112 and an incident management (standby) 116.
The multiplex monitoring system 119 manages failures that have occurred in the monitoring target systems 200.
The event aggregation unit 110 receives the failure data 231 reported by the plurality of monitoring devices 203 in the plurality of monitoring target systems 200. The event aggregation unit 110 centrally manages the received failure data 231 as events. The event aggregation unit 110 associates information on the monitoring target system 200 with the event information. The event aggregation unit 110 also determines the severity of each failure. The event information is stored in the event database 113.
The event aggregation unit 110 is duplexed. The event aggregation unit 110 includes the event aggregation (active) 111 and the event aggregation (standby) 112. Normally the event aggregation (active) 111 is running. When a failure occurs in the event aggregation (active) 111 and it can no longer continue processing, the event aggregation (standby) 112 takes over the processing. The event information to be processed is stored in the event database 113, which is accessible from both the event aggregation (active) 111 and the event aggregation (standby) 112. The event aggregation (standby) 112 can therefore take over the processing of the event information stored in the event database 113.
The incident management unit 114 receives the event information from the event aggregation unit 110. The incident management unit 114 stores incident information based on the event information in the incident database 117. The incident information includes information such as the content of notifications that the operator sent to users of the monitoring target system 200, information collected for failure recovery, and actions taken for failure recovery. The incident information is kept in the incident database 117 until the failure is resolved.
The incident management unit 114 is duplexed. The incident management unit 114 includes the incident management (active) 115 and the incident management (standby) 116. Normally the incident management (active) 115 is running. When a failure occurs in the incident management (active) 115 and it can no longer continue processing, the incident management (standby) 116 takes over the processing. The incident database 117 is accessible from both the incident management (active) 115 and the incident management (standby) 116. The incident management (standby) 116 can take over the processing by using the incident information stored in the incident database 117.
The failure display unit 118 displays failures that have occurred in the monitoring target system 200 and have been detected. In addition to displaying the failure content on a screen, the failure display unit 118 may give notice by lighting a lamp or sounding an alarm. When a failure is displayed on the failure display unit 118, the operator 121 begins responding to the failure.
<Switching management device 130>
The configuration of the switching management device 130 according to the present embodiment will be described with reference to FIG. 2.
The switching management device 130 is a computer. The switching management device 130 includes a processor 910 and other hardware such as a memory 921, an auxiliary storage device 922, an input interface 930, an output interface 940, and a communication device 950. The processor 910 is connected to the other hardware via signal lines and controls the other hardware.
The switching management device 130 includes, as functional elements, a connection disconnection unit 131, a failure acquisition unit 132, a switching unit 133, a load acquisition unit 134, a load determination unit 135, a connection control unit 136, and a storage unit 139. The storage unit 139 stores threshold information 137 and connection information 138. The functions of the connection disconnection unit 131, the failure acquisition unit 132, the switching unit 133, the load acquisition unit 134, the load determination unit 135, and the connection control unit 136 are implemented in software. The storage unit 139 is provided in the memory 921.
The processor 910 is a device that executes a switching management program. The switching management program is a program that implements the functions of the connection disconnection unit 131, the failure acquisition unit 132, the switching unit 133, the load acquisition unit 134, the load determination unit 135, and the connection control unit 136.
The processor 910 is an IC (Integrated Circuit) that performs arithmetic processing. Specific examples of the processor 910 are a CPU (Central Processing Unit), a DSP (Digital Signal Processor), and a GPU (Graphics Processing Unit).
The memory 921 is a storage device that stores data temporarily. Specific examples of the memory 921 are an SRAM (Static Random Access Memory) and a DRAM (Dynamic Random Access Memory).
The auxiliary storage device 922 is a storage device that retains data. A specific example of the auxiliary storage device 922 is an HDD. The auxiliary storage device 922 may also be a portable storage medium such as an SD (registered trademark) memory card, a CF, a NAND flash, a flexible disk, an optical disc, a compact disc, a Blu-ray (registered trademark) disc, or a DVD. HDD is an abbreviation of Hard Disk Drive. SD (registered trademark) is an abbreviation of Secure Digital. CF is an abbreviation of CompactFlash. DVD is an abbreviation of Digital Versatile Disc.
The input interface 930 is a port connected to an input device such as a mouse, a keyboard, or a touch panel. Specifically, the input interface 930 is a USB (Universal Serial Bus) terminal. The input interface 930 may instead be a port connected to a LAN (Local Area Network).
The output interface 940 is a port to which the cable of a display device is connected. Specifically, the output interface 940 is a USB terminal or an HDMI (registered trademark) (High Definition Multimedia Interface) terminal. The display device is, specifically, an LCD (Liquid Crystal Display).
The communication device 950 is a device that communicates with other devices via a network. The communication device 950 has a receiver and a transmitter. The communication device 950 is connected, by wire or wirelessly, to a communication network such as a LAN, the Internet, or a telephone line. Specifically, the communication device 950 is a communication chip or a NIC (Network Interface Card).
The switching management program is read into the processor 910 and executed by the processor 910. The memory 921 stores not only the switching management program but also an OS (Operating System). The processor 910 executes the switching management program while executing the OS. The switching management program and the OS may be stored in the auxiliary storage device 922. In that case, the switching management program and the OS stored in the auxiliary storage device 922 are loaded into the memory 921 and executed by the processor 910. Part or all of the switching management program may be incorporated into the OS.
The switching management device 130 may include a plurality of processors in place of the processor 910. These processors share the execution of the switching management program. Each of them is, like the processor 910, a device that executes the switching management program.
Data, information, signal values, and variable values that are used, processed, or output by the switching management program are stored in the memory 921, in the auxiliary storage device 922, or in a register or cache memory in the processor 910.
The switching management program causes a computer to execute the processes, procedures, or steps obtained by reading the "unit" in each of the connection disconnection unit 131, the failure acquisition unit 132, the switching unit 133, the load acquisition unit 134, the load determination unit 135, and the connection control unit 136 as "process", "procedure", or "step". The switching management method is carried out by the switching management device 130 executing the switching management program.
The switching management program may be provided recorded on a computer-readable recording medium, or may be provided as a program product.
*** Description of the functions of the switching management device 130 ***
The switching management device 130 manages system switching of the multiplex monitoring system 119. The switching management device 130 switches systems when, in the multiplex monitoring system 119, the event aggregation (active) 111 or the incident management (active) 115 can no longer continue processing.
The failure acquisition unit 132 monitors the state of the active systems of the event aggregation unit 110 and the incident management unit 114. When a failure occurs and processing can no longer continue, the failure acquisition unit 132 outputs failure information 91 on the multiplex monitoring system 119 to the switching unit 133.
When the switching unit 133 acquires the failure information 91, which indicates that a failure has occurred in the active monitoring unit 192, it switches execution of the monitoring processing from the active monitoring unit 192 to the standby monitoring unit 193. That is, on receiving the failure information 91 on the multiplex monitoring system 119, the switching unit 133 switches from the active system to the standby system so that the processing of the event aggregation unit 110 or the incident management unit 114 can continue. The switching unit 133 transmits to the connection control unit 136 a switching notification 92 indicating that it has switched from the active monitoring unit 192 to the standby monitoring unit 193. The switching unit 133 also transmits to the connection control unit 136 a processable notification 94 indicating that the standby monitoring unit 193 has become able to process.
The load acquisition unit 134 acquires load information 93 on the event aggregation unit 110 and the incident management unit 114. Specifically, the load information 93 is information such as the CPU usage rate, the number of I/O operations per second (IOPS), and the number of operations waiting to be processed. The load acquisition unit 134 outputs the acquired load information 93 to the load determination unit 135.
The load determination unit 135 acquires the load information 93, which represents the load of the standby monitoring unit 193 executing the monitoring processing, and compares the load with a threshold. Specifically, the load determination unit 135 compares the load included in the load information 93 with the threshold included in the threshold information 137 and determines whether the load exceeds the threshold. The load determination unit 135 then outputs the determination result to the connection control unit 136.
The connection control unit 136 acquires from the switching unit 133 the switching notification 92 indicating the switch from the active monitoring unit 192 to the standby monitoring unit 193. On acquiring the switching notification 92, the connection control unit 136 disconnects communication between the multiplex monitoring system 119 and each of the plurality of monitoring devices 203. The connection control unit 136 also controls communication between each of the plurality of monitoring devices 203 and the standby monitoring unit 193 based on the result of comparing the load with the threshold.
The storage unit 139 stores the connection information 138, in which the state of communication between each of the plurality of monitoring devices 203 and the standby monitoring unit 193 is set. The connection control unit 136 controls communication using the connection information 138.
When the switching unit 133 has performed switching, the connection control unit 136 refers to the connection information 138 and instructs the connection disconnection unit 131 to connect or disconnect the monitoring devices 203. On receiving an instruction from the connection control unit 136, the connection disconnection unit 131 connects or disconnects the network between the monitoring device 203 and the multiplex monitoring system 119. Control of communication by the connection control unit 136 is described later.
*** Description of operation ***
<Operation of the monitoring target system 200>
The operation of the monitoring target system 200 will be described. The monitoring device 203 monitors the servers and network devices in the monitoring target system 200 and determines whether a failure has occurred. As a specific example, the monitoring device 203 acquires the CPU usage rate and PING response information for a host named srv1. The monitoring device 203 determines that a failure has occurred when a failure condition is met, for example a CPU usage rate of 90% or more. Likewise, for PING, the monitoring device 203 determines that a failure has occurred when a failure condition is met, for example no response three or more times in a row.
When a failure occurs, the monitoring device 203 transmits the failure data 231 to the multiplex monitoring system 119.
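The failure conditions in the srv1 example can be sketched as follows. This is only an illustration under stated assumptions: the names `HostSample`, `consecutive_ping_misses`, and `detect_failures` are hypothetical, and counting missed PING replies from the most recent probe backwards is one possible reading of "three or more times in a row".

```python
from dataclasses import dataclass
from typing import List

CPU_THRESHOLD_PERCENT = 90.0  # "90% or more" in the example above
PING_MISS_LIMIT = 3           # "no response three or more times in a row"


@dataclass
class HostSample:
    """Hypothetical container for what the monitoring device collects."""
    host: str
    cpu_percent: float
    ping_replies: List[bool]  # oldest first; True means a reply arrived


def consecutive_ping_misses(replies: List[bool]) -> int:
    """Count missed replies at the end of the history (assumed reading)."""
    misses = 0
    for ok in reversed(replies):
        if ok:
            break
        misses += 1
    return misses


def detect_failures(sample: HostSample) -> List[str]:
    """Return the names of the failure conditions that the sample meets."""
    reasons = []
    if sample.cpu_percent >= CPU_THRESHOLD_PERCENT:
        reasons.append("cpu")
    if consecutive_ping_misses(sample.ping_replies) >= PING_MISS_LIMIT:
        reasons.append("ping")
    return reasons
```

A non-empty result corresponds to the case in which the monitoring device 203 would send failure data 231.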
<Operation of the multiplex monitoring system 119>
In the multiplex monitoring system 119, the event aggregation unit 110 receives the failure data 231 sent from the plurality of monitoring devices 203. The event aggregation unit 110 processes the received failure data 231 and stores it in the event database 113 as event information.
In the initial state, the event aggregation (active) 111 is running. When a failure occurs in the active system and processing can no longer continue, the event aggregation (standby) 112 starts up and continues the processing. The event aggregation unit 110 sends the event information to the incident management unit 114.
The incident management unit 114 manages the event information as incident information. The incident management unit 114 also displays the incident information on the failure display unit 118. The operator 121 responds to the failure displayed on the failure display unit 118. Specifically, the operator 121 contacts the customer who uses the failed server or network device. The operator 121 obtains detailed information, such as logs, from the failed server or network device. The operator 121 attempts failure recovery according to a predetermined procedure such as a restart or a setting change. If a part must be replaced, the operator 121 informs the customer. In this way the operator 121 carries out a variety of responses. When the response is finished, information that the failure has been resolved is recorded in the incident information, and the failure is removed from the display on the failure display unit 118.
In the initial state, the incident management (active) 115 is running. When a failure occurs in the active system and processing can no longer continue, the incident management (standby) 116 starts up and continues the processing.
<Operation of the switching management device 130>
FIG. 3 is a flowchart of the switching management process S130 performed by the switching management device 130 according to the present embodiment. FIG. 4 is a diagram showing the connection information 138 according to the present embodiment. FIG. 5 is a diagram showing the threshold information 137 according to the present embodiment.
The switching management process S130 will be described with reference to FIGS. 3 to 5.
<< Connection disconnection process >>
In step S101, the connection disconnection unit 131 connects or disconnects communication between the monitoring device 203 and the multiplex monitoring system 119 in accordance with the connection information 138. As shown in FIG. 4, in the initial state of the switching management device 130, all the monitoring devices listed in the connection information 138 are in the connected state.
<< Failure acquisition process >>
In step S102, the failure acquisition unit 132 monitors the states of the event aggregation (active) 111 and the incident management (active) 115. When the failure acquisition unit 132 determines that processing cannot continue, for example because a program has terminated abnormally or has stopped responding, it outputs the failure information 91 to the switching unit 133.
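The check in step S102 can be sketched as below. The text names two symptoms, abnormal program termination and loss of response, but gives no timeout or data format, so `RESPONSE_TIMEOUT_S`, `ActiveStatus`, and the dictionary returned by `acquire_failure` are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

RESPONSE_TIMEOUT_S = 10.0  # hypothetical value; the text specifies no timeout


@dataclass
class ActiveStatus:
    """Hypothetical snapshot of one active component, e.g. 111 or 115."""
    name: str
    exited_abnormally: bool
    seconds_since_response: float


def acquire_failure(status: ActiveStatus,
                    timeout_s: float = RESPONSE_TIMEOUT_S) -> Optional[dict]:
    """Return failure information if processing cannot continue, else None."""
    if status.exited_abnormally:
        return {"component": status.name, "reason": "abnormal termination"}
    if status.seconds_since_response > timeout_s:
        return {"component": status.name, "reason": "no response"}
    return None
```

A returned dictionary plays the role of the failure information 91 that is passed to the switching unit 133.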
<< Switching process >>
In step S103, the switching unit 133 detaches the active system that can no longer continue processing, starts the standby system, and switches the connection so that processing can continue on the standby system. One switching method is to give the event aggregation unit 110 a single virtual IP address and to translate between it and the physical IP addresses of the event aggregation (active) 111 and the event aggregation (standby) 112, thereby switching between the active system and the standby system. When the active system stops responding and is detached, the switching unit 133 sends the switching notification 92 to the connection control unit 136. When the standby system has started and is able to process, the switching unit 133 sends the processable notification 94 to the connection control unit 136.
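The virtual IP address method mentioned in step S103 can be illustrated with a minimal in-memory sketch. The class `VirtualIpSwitch` and its methods are hypothetical, the addresses come from the documentation range 192.0.2.0/24, and a real deployment would perform the translation in network equipment or the OS rather than in a Python object.

```python
class VirtualIpSwitch:
    """Maps one virtual IP address to the physical address now serving it."""

    def __init__(self, virtual_ip: str, active_ip: str, standby_ip: str):
        self.virtual_ip = virtual_ip
        self._standby_ip = standby_ip
        self._target = active_ip  # initially the active system serves requests

    def resolve(self, ip: str) -> str:
        """Translate the virtual address to the current physical address."""
        if ip != self.virtual_ip:
            raise KeyError(ip)
        return self._target

    def switch_to_standby(self) -> None:
        """Point the virtual address at the standby system."""
        self._target = self._standby_ip


vip = VirtualIpSwitch("192.0.2.10", active_ip="192.0.2.11", standby_ip="192.0.2.12")
before = vip.resolve("192.0.2.10")
vip.switch_to_standby()
after = vip.resolve("192.0.2.10")
```

Because the monitoring devices only ever address the virtual IP, they need no reconfiguration when the standby system takes over.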
<< Load acquisition process >>
In step S104, the load acquisition unit 134 periodically acquires the loads of the event aggregation (active) 111 and the incident management (active) 115.
As shown in FIG. 5, the threshold information 137 specifies the target from which a load is acquired, the load item, the threshold, and the condition. The load acquisition unit 134 acquires information for whatever corresponds to the load items listed in the threshold information 137. Specifically, this is load information such as the CPU usage rate, IOPS, the number of operations waiting to be processed, or the response time. The load acquisition unit 134 sends the acquired load information to the load determination unit 135.
<< Load determination process >>
In step S105, the load determination unit 135 compares the acquired load with the threshold based on the threshold information 137. Based on the comparison, the load determination unit 135 determines whether the acquired load matches the threshold and condition. Specifically, in the threshold information 137 of FIG. 5, when incident management is the target, the threshold for the load item CPU usage rate is "80%" and the condition is "or more". Suppose that the CPU usage rate acquired from the incident management (active) 115 as load information is 70%. In this case the load does not satisfy "80% or more", the combination of threshold and condition, so the load determination unit 135 determines that the threshold is not exceeded. The comparison determination result produced by the load determination unit 135 is sent to the connection control unit 136.
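The determination in step S105 can be sketched as a lookup of the condition followed by a comparison. The four fields mirror the columns described for the threshold information 137; the class name `ThresholdEntry`, the function `matches`, and the "or less" condition are assumptions added for illustration, since the text only shows "or more".

```python
import operator
from dataclasses import dataclass

# Only "or more" appears in the text; "or less" is a hypothetical addition.
CONDITIONS = {"or more": operator.ge, "or less": operator.le}


@dataclass(frozen=True)
class ThresholdEntry:
    """One row of the threshold information (field names are assumed)."""
    target: str
    load_item: str
    threshold: float
    condition: str


def matches(entry: ThresholdEntry, load: float) -> bool:
    """True when the load satisfies the entry's threshold and condition."""
    return CONDITIONS[entry.condition](load, entry.threshold)


incident_cpu = ThresholdEntry("incident management", "CPU usage rate", 80.0, "or more")
result_70 = matches(incident_cpu, 70.0)  # the 70% case from the text
```

With the values from the text, `result_70` is `False`, which corresponds to "threshold not exceeded".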
<< Connection control process >>
The connection control process S106 performed by the connection control unit 136 according to the present embodiment will be described with reference to FIG. 6.
Step S601 is the processing in the initial state. In the initial state, the connection control unit 136 has the connection disconnection unit 131 permit all communication.
In step S602, the connection control unit 136 receives a notification from the switching unit 133. As described above, the notifications from the switching unit 133 are the switching notification 92, sent when the active system stops responding and is detached, and the processable notification 94, sent when the standby system has started and is able to process.
In step S603, the connection control unit 136 determines whether the notification is the switching notification 92. If the notification is the switching notification 92, the process proceeds to step S604.
In step S604, the connection control unit 136 instructs the connection disconnection unit 131 to disconnect all communication. Specifically, the connection control unit 136 sets the disconnected state for all the monitoring devices listed in the connection information 138. The connection control unit 136 then instructs the connection disconnection unit 131 to disconnect communication with all the monitoring devices.
In step S605, the connection control unit 136 waits until it receives from the switching unit 133 the processable notification 94 indicating that the standby system is able to process.
In step S605a, on acquiring the processable notification 94 from the switching unit 133, the connection control unit 136 puts communication with at least one of the plurality of monitoring devices into the connected state. Specifically, on receiving the processable notification 94 indicating that the standby system has started, the connection control unit 136 changes one of the monitoring devices listed in the connection information 138 to the connected state. The connection control unit 136 then instructs the connection disconnection unit 131 to connect communication with that monitoring device.
In step S606, the connection control unit 136 acquires the comparison determination result from the load determination unit 135.
In step S607, the connection control unit 136 determines, from the comparison determination result, whether the load exceeds the threshold. If the load exceeds the threshold, the connection control unit 136 proceeds to step S608. If the load is equal to or less than the threshold, the connection control unit 136 proceeds to step S609.
In step S608, the connection control unit 136 disconnects communication with at least one of the plurality of monitoring devices. That is, the connection control unit 136 instructs the connection disconnection unit 131 to disconnect one of the monitoring devices currently in the connected state. The connection control unit 136 does this by rewriting the monitoring device disconnection table 138.
In step S609, the connection control unit 136 sets communication with at least one of the plurality of monitoring devices to the connected state. That is, the connection control unit 136 instructs the connection disconnection unit 131 to connect one of the monitoring devices currently in the disconnected state. The connection control unit 136 does this by rewriting the monitoring device disconnection table 138.
In step S610, the connection control unit 136 waits for a fixed time.
In step S611, the connection control unit 136 determines whether all of the monitoring devices are in the connected state. Specifically, the connection control unit 136 makes this determination by referring to the monitoring device disconnection table 138. If all of the monitoring devices are in the connected state, the connection control unit 136 proceeds to step S612. If any monitoring device is still unconnected, the connection control unit 136 returns to step S606.
In step S612, the switching to the standby system is fully complete. If necessary, the connection control unit 136 thereafter treats the standby system as the active system. In this way, the connection control unit 136 ends the control of communication once communication with all of the plurality of monitoring devices is in the connected state.
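The flow of steps S604 to S612 can be sketched in Python as follows. This is only an illustrative sketch under assumptions: the class `ConnectionController`, its methods, and the `load_exceeds` callback are hypothetical names that do not appear in the specification, a plain dictionary stands in for the connection information 138, and the choice of "the first device found" as the one to connect or disconnect is arbitrary.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class ConnectionController:
    """Hypothetical stand-in for the connection control unit 136."""
    # Device name -> True (connected) / False (disconnected).
    # Plays the role of the connection information 138 (assumed layout).
    connected: Dict[str, bool] = field(default_factory=dict)

    def disconnect_all(self) -> None:
        # S604: set every monitoring device to the disconnected state.
        for name in self.connected:
            self.connected[name] = False

    def flip_one(self, target_state: bool) -> None:
        # Change the first device that is not yet in target_state, if any.
        for name, state in self.connected.items():
            if state != target_state:
                self.connected[name] = target_state
                return

    def all_connected(self) -> bool:
        # S611: check whether every monitoring device is connected.
        return all(self.connected.values())


def run_reconnection(ctrl: ConnectionController,
                     load_exceeds: Callable[[], bool],
                     wait: Callable[[], None] = lambda: None,
                     max_rounds: int = 1000) -> int:
    """Run S604 to S612 and return the number of S606-S611 rounds used."""
    ctrl.disconnect_all()        # S604
    ctrl.flip_one(True)          # S605a: standby is ready, connect one device
    for rounds in range(1, max_rounds + 1):
        exceeds = load_exceeds()         # S606, S607
        ctrl.flip_one(not exceeds)       # S608 (disconnect) or S609 (connect)
        wait()                           # S610
        if ctrl.all_connected():         # S611
            return rounds                # S612
    raise RuntimeError("not all monitoring devices were reconnected")
```

In a real system `wait` would sleep for the fixed time of step S610 and `load_exceeds` would query the load determination unit 135; both are injected here so that the loop can be exercised without timing dependencies.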
If the comparison determination result sent from the load determination unit 135 has remained equal to or less than the threshold for a certain period of time, the connection control unit 136 may change one of the disconnected monitoring devices to the connected state. That is, the connection control unit 136 may refer to the connection information 138, change one of the monitoring devices in the disconnected state to the connected state, and instruct the connection disconnection unit 131 to connect communication with that monitoring device.
If the comparison determination result sent from the load determination unit 135 has remained above the threshold for a certain period of time, the connection control unit 136 may change one of the connected monitoring devices to the disconnected state. That is, the connection control unit 136 may refer to the connection information 138, change one of the monitoring devices in the connected state to the disconnected state, and instruct the connection disconnection unit 131 to disconnect communication with that monitoring device.
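One way to realize the "for a certain period of time" condition is to act only after the comparison result has stayed on the same side of the threshold for a hold time. The sketch below is an assumption about how this could be done, not the method of the specification; the class name `SustainedComparison`, the `hold_seconds` parameter, and the returned strings are all hypothetical.

```python
from typing import Optional


class SustainedComparison:
    """Hypothetical helper: report an action only after the comparison
    result has been stable for hold_seconds."""

    def __init__(self, hold_seconds: float) -> None:
        self.hold_seconds = hold_seconds
        self._state: Optional[bool] = None   # last seen "load exceeds threshold"
        self._since: float = 0.0             # time at which _state was first seen

    def update(self, now: float, exceeds: bool) -> Optional[str]:
        """Feed one comparison result; return 'connect', 'disconnect' or None."""
        if exceeds != self._state:
            # The result changed sides: restart the observation window.
            self._state = exceeds
            self._since = now
            return None
        if now - self._since >= self.hold_seconds:
            # Stable long enough: act once, then wait a full window again.
            self._since = now
            return "disconnect" if exceeds else "connect"
        return None
```

Restarting the window after each action means at most one monitoring device changes state per hold time, which matches the one-at-a-time behavior described above.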
The connection disconnection unit 131 connects or disconnects communication between the monitoring device 203 and the multiplex monitoring system 119 in accordance with the connection or disconnection instruction sent from the connection control unit 136. Methods of connecting or disconnecting include changing a firewall policy to permit or deny the communication, and changing VLAN (Virtual LAN) settings.
*** Other configurations ***
<Modification 1>
The collection of fault information and load information described above may use the monitoring data or load data formats employed by commercially available server monitoring or network monitoring tools and systems. In that case, each operation is the same as in those tools and systems. As for switching from the active system to the standby system, once the switching is finished the active system may be shut down and terminated, after which the former standby system is treated as the active system and the former active system as the standby system. In this case, the cause of the active system becoming unresponsive must have been eliminated by the shutdown.
A commercially available clustering tool that provides multiplexing may also be used to acquire fault data from the multiplex monitoring system and to switch the multiplex monitoring system. In that case, fault detection and switching are performed by the method that the clustering tool provides.
In the configuration of the multiplex monitoring system, the present embodiment separates the event aggregation unit from the incident management unit, but the two may be realized as a single functional block. Functions commonly used in a multiplex monitoring system may also be present, such as a configuration management unit, a response history storage unit, a response history search unit, an event correlation analysis unit, and an automatic fault response unit. Even then, the operation is the same for the multiplexed portion: when the active system becomes unresponsive, the switching management device switches to the standby system.
<Modification 2>
In the present embodiment, the functions of the switching management device 130 are realized by software. As a modification, the functions of the switching management device 130 may be realized by hardware.
FIG. 7 is a diagram showing the configuration of the switching management device 130 according to a modification of the present embodiment.
The switching management device 130 includes an electronic circuit 909, a memory 921, an auxiliary storage device 922, an input interface 930, an output interface 940, and a communication device 950.
The electronic circuit 909 is a dedicated electronic circuit that realizes the functions of the connection disconnection unit 131, the fault acquisition unit 132, the switching unit 133, the load acquisition unit 134, the load determination unit 135, and the connection control unit 136.
Specifically, the electronic circuit 909 is a single circuit, a composite circuit, a programmed processor, a parallel-programmed processor, a logic IC, a GA, an ASIC, or an FPGA. GA is an abbreviation of Gate Array. ASIC is an abbreviation of Application Specific Integrated Circuit. FPGA is an abbreviation of Field-Programmable Gate Array.
The functions of the components of the switching management device 130 may be realized by a single electronic circuit or may be distributed across a plurality of electronic circuits.
As another modification, some of the functions of the components of the switching management device 130 may be realized by an electronic circuit and the remaining functions by software.
Each of the processor and the electronic circuit is also referred to as processing circuitry. That is, in the switching management device 130, the functions of the connection disconnection unit 131, the fault acquisition unit 132, the switching unit 133, the load acquisition unit 134, the load determination unit 135, and the connection control unit 136 are realized by processing circuitry.
In the switching management device 130, the word "unit" in the connection disconnection unit 131, the fault acquisition unit 132, the switching unit 133, the load acquisition unit 134, the load determination unit 135, and the connection control unit 136 may be read as "step". Likewise, the word "process" in the connection disconnection process, fault acquisition process, switching process, load acquisition process, load determination process, and connection control process may be read as "program", "program product", or "computer-readable recording medium on which a program is recorded".
*** Description of the effects of the present embodiment ***
As described above, the monitoring control system according to the present embodiment can prevent the standby system from being overloaded when switching occurs from an active system that has become unresponsive due to overload.
In the monitoring control system according to the present embodiment, when the duplexed monitoring unit becomes unresponsive due to overload and is switched to the standby system, communication with all of the monitoring devices is first disconnected. The load on the standby system therefore does not become high. Further, while the load is equal to or less than the threshold, monitoring devices are connected one at a time based on the connection information. When the load exceeds the threshold, monitoring devices are disconnected one at a time. This prevents the load on the standby system from staying above the threshold and the standby system from becoming unresponsive due to overload.
The root cause of a large volume of fault information, such as a failure of a network backbone device or a fault on a carrier line, is handled by an operator using the fault data received while the active system was operating normally and the fault data received after switching to the standby system. The operator can respond in various ways, such as restarting a device, changing device settings, contacting the carrier, or dispatching repair personnel to the site where the device is installed. The root cause is therefore eventually eliminated. Once it is eliminated, the monitoring devices no longer send large volumes of fault information and the load on the multiplex monitoring system falls, so the number of connected monitoring devices grows until communication with all of the monitoring devices is connected.
Failures such as a network backbone device failure or a carrier fault occur only rarely. When one does occur, however, a large volume of fault data is generated, and both the active system and the standby system after switching become overloaded. Even in such a case, the monitoring control system according to the present embodiment can switch to the standby system without provisioning resources sized for the maximum load.
Second Embodiment
In the present embodiment, the differences from the first embodiment are described. Components that are the same as in the first embodiment are given the same reference numerals, and their description may be omitted.
In the present embodiment, the importance of each of the plurality of monitoring devices is set in the connection information 138a. The connection control unit 136 controls communication based on both the importance and the result of comparing the load with the threshold.
The connection information 138a according to the present embodiment also manages the importance of each monitoring device.
FIG. 8 is a diagram showing the connection information 138a according to the present embodiment. In addition to the contents of the connection information 138 according to the first embodiment, the connection information 138a holds importance information. The importance indicates how important a monitoring device is and is expressed as a real number from 0 to 1, with 1 being the most important. For example, a monitoring device that monitors a backbone network device is considered highly important, so its importance is set high. The importance may also be changed according to the monitoring contract with a customer.
FIG. 9 is a diagram showing an example of a user interface for setting the connection information 138a according to the present embodiment. FIG. 10 is a diagram showing FIG. 9 with the drop-down list displayed. The user interface is displayed on a display device via the output interface 940, and the system administrator makes settings with an input device via the input interface 930. In the example of FIG. 9, there is a field for setting the importance of each monitoring device. The importance field is a drop-down list; when the "▽" part is selected with an input device such as a mouse, a list of importance values is displayed as shown in FIG. 10. The importance is set by selecting a value from this list with the input device. After setting the importance, the system administrator selects the "OK" button, and the settings are stored in the connection information 138a in the memory 921. If the "Cancel" button is selected, the settings are not stored.
Although FIGS. 9 and 10 show three monitoring device name fields, there may be more, and the number of fields may be made changeable.
FIG. 11 is a diagram showing another example of a user interface for setting the connection information 138a according to the present embodiment. In FIG. 11, the input device is used to move a label or icon representing each monitoring device on the user interface. The importance of each monitoring device is set by placing its label or icon at a position that indicates how high or low its importance is.
Next, the operation will be described. Only the parts that differ from the first embodiment are described.
Upon receiving from the switching unit 133 the processable notification 94 indicating that the standby system has started up, the connection control unit 136 refers to the connection information 138a and changes the single most important monitoring device among those in the disconnected state to the connected state. The connection control unit 136 then instructs the connection disconnection unit 131 to connect communication with that monitoring device.
If the comparison determination result sent from the load determination unit 135 has been equal to or less than the threshold for a certain period of time, the connection control unit 136 refers to the connection information 138a and changes the single most important monitoring device among those in the disconnected state to the connected state. The connection control unit 136 then instructs the connection disconnection unit 131 to connect communication with that monitoring device. If the comparison determination result sent from the load determination unit 135 is equal to or greater than the threshold, the connection control unit 136 refers to the connection information 138a and changes the single least important monitoring device among those in the connected state to the disconnected state. The connection control unit 136 then instructs the connection disconnection unit 131 to disconnect communication with that monitoring device.
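The importance-based selection can be illustrated with a short Python sketch. It assumes that each entry of the connection information 138a carries a device name, an importance between 0 and 1, and a connection state; the names `DeviceEntry`, `next_to_connect`, `next_to_disconnect`, and `adjust` are hypothetical, and ties between equal importance values are broken simply by list order.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DeviceEntry:
    """Hypothetical row of the connection information 138a."""
    name: str
    importance: float        # 0.0 to 1.0, where 1.0 is the most important
    connected: bool = False


def next_to_connect(entries: List[DeviceEntry]) -> Optional[DeviceEntry]:
    # Most important device among those currently disconnected.
    candidates = [e for e in entries if not e.connected]
    return max(candidates, key=lambda e: e.importance, default=None)


def next_to_disconnect(entries: List[DeviceEntry]) -> Optional[DeviceEntry]:
    # Least important device among those currently connected.
    candidates = [e for e in entries if e.connected]
    return min(candidates, key=lambda e: e.importance, default=None)


def adjust(entries: List[DeviceEntry], load_high: bool) -> Optional[DeviceEntry]:
    """Connect or disconnect exactly one device and return it (or None)."""
    target = next_to_disconnect(entries) if load_high else next_to_connect(entries)
    if target is not None:
        target.connected = not load_high
    return target
```

For example, with devices of importance 0.2, 0.9, and 0.5 all disconnected, two calls with `load_high=False` connect the 0.9 device and then the 0.5 device, and a following call with `load_high=True` disconnects the 0.5 device again.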
As described above, the switching management device according to the present embodiment can connect monitoring devices in descending order of importance after switching to the standby system. This makes it possible to give priority to restoring the monitoring of backbone network devices and of devices designated as important under customer contracts. Restoring the monitoring of structurally important parts of the system first, such as backbone network devices, allows the root cause of a large-scale fault to be investigated more quickly. It also allows the monitoring of contractually important systems to be restored first.
Third Embodiment
In the present embodiment, the differences from the first embodiment are described. Components that are the same as in the first embodiment are given the same reference numerals, and their description may be omitted.
FIG. 12 is a diagram showing the configuration of a monitoring control system 500b according to the present embodiment.
The monitoring control system 500b according to the present embodiment includes a transfer unit 140 that transfers the fault data 231 sent from each of the plurality of monitoring devices 203 to the multiplex monitoring system 119. The switching unit 133 sends the switching notification 92 to the transfer unit 140. The load determination unit 135 sends the comparison determination result to the transfer unit 140. The transfer unit 140 controls the number of items of fault data 231 transferred, based on the switching notification 92 and the comparison determination result.
Only the parts that differ from FIG. 1 are described below.
The transfer unit 140 transfers the fault data 231 sent from the monitoring device 203. The transfer unit 140 includes a transfer control unit 141 and a transfer buffer 142. The transfer control unit 141 controls the transfer of the fault data 231 from the monitoring device 203. The transfer buffer 142 temporarily stores the fault data 231.
FIG. 13 shows an example of the transfer setting information 143 for the fault data 231 according to the present embodiment. The transfer unit 140 holds the transfer setting information 143, which is stored in the transfer unit 140 in advance. For each of pull-type transfer and push-type transfer, the transfer setting information 143 specifies the transfer interval, the initial number of transfers, the amount by which to decrease the number of transfers, and the amount by which to increase the number of transfers. The decrease amount is applied when the load on the standby system exceeds the threshold after switching has occurred. The increase amount is applied when the load on the standby system falls below the threshold.
There are several ways for the monitoring device 203 to send the fault data 231, depending on the type of the monitoring device 203 and the monitoring method. They fall broadly into two categories.
One is pull-type transfer. In pull-type transfer, the monitoring device 203 has an internal database. When a fault occurs in a monitored server 205 or network device, the monitoring device 203 detects the fault and first stores the fault data 231 in the database. This information is transferred to the multiplex monitoring system 119 through an API (application interface) provided by the monitoring device 203 or through access to the database. The transfer protocol may be one that uses a Web API, an access protocol provided by the database, or a protocol specific to the monitoring device. In this case, the multiplex monitoring system 119 retrieves the fault data 231 through the API or database access, so the number of items of fault data 231 acquired at one time can be specified.
The other is push-type transfer. In push-type transfer, the monitoring device 203 has no database. When a fault occurs in a monitored server 205 or network device, the monitoring device 203 immediately transfers the fault data 231 for that fault to the multiplex monitoring system 119. A protocol such as SNMP Trap is used for the transfer. In this case, the fault data 231 is transferred every time the monitoring device 203 detects a fault. Consequently, when a network monitoring backbone device fails, many servers and devices become unreachable, and a large volume of fault data 231 may be transferred at once.
Next, the operation will be described. Only the parts that differ from the first embodiment are described.
When the switching unit 133 switches systems, it sends the switching notification 92 to the transfer unit 140.
The load determination unit 135 sends the transfer unit 140 the comparison determination result obtained by comparing the load with the threshold.
First, the case of pull-type transfer will be described.
In the initial state, the transfer unit 140 transfers, at fixed intervals, all of the fault data 231 detected by the monitoring device 203 to the multiplex monitoring system 119 via the connection disconnection unit 131.
When the transfer unit 140 receives the switching notification 92, the transfer control unit 141 reduces, based on the transfer setting information 143, the number of items of fault data 231 acquired from the monitoring device 203 at each interval. If the load determination unit 135 determines that the load has been equal to or less than the threshold for a certain period of time, the transfer control unit 141 increases, based on the transfer setting information 143, the number of items of fault data 231 acquired from the monitoring device 203 at each interval. If the load is determined to be equal to or greater than the threshold, the transfer control unit 141 reduces the number of items of fault data 231 acquired from the monitoring device 203 at each interval.
Specifically, the transfer control unit 141 controls the number of items of fault data 231 acquired at each interval as follows.
When the load exceeds the threshold, the number of items of fault data 231 acquired at each interval is reduced by the pull-type transfer decrease for load excess.
New number = current number - pull-type transfer decrease for load excess
When the load falls below the threshold, the number of items of fault data 231 acquired at each interval is increased by the pull-type transfer increase for load non-excess.
New number = current number + pull-type transfer increase for load non-excess
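The two pull-type update rules can be written as a small Python function. This is a sketch under assumptions: `PullSettings` and `next_pull_count` are hypothetical names, and the lower and upper bounds `min_count` and `max_count` are additions not stated in the specification, included only so that the count can never become zero, negative, or unbounded.

```python
from dataclasses import dataclass


@dataclass
class PullSettings:
    """Hypothetical pull-type part of the transfer setting information 143."""
    initial_count: int     # initial number of items fetched per interval
    decrease: int          # amount subtracted when the load exceeds the threshold
    increase: int          # amount added when the load does not exceed it
    min_count: int = 1     # assumed lower bound (not in the specification)
    max_count: int = 1000  # assumed upper bound (not in the specification)


def next_pull_count(current: int, load_exceeds: bool, s: PullSettings) -> int:
    """Return the number of fault data items to fetch in the next interval."""
    if load_exceeds:
        # New number = current number - decrease, clamped at min_count.
        return max(s.min_count, current - s.decrease)
    # New number = current number + increase, clamped at max_count.
    return min(s.max_count, current + s.increase)
```

A larger decrease than increase makes the controller back off quickly and recover slowly, which is one plausible way to choose the two values; the specification itself leaves them as settings.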
 次に、Push型転送を用いる場合について説明する。
 転送部140は、初期状態において、監視装置203で検知して通知された障害データ231を、即時に全て接続切断部131を経由して多重系監視システム119へ転送する。
 転送部140が切替通知92を受信すると、転送制御部141は、監視装置203から送られた全ての障害データ231を一旦、転送バッファ142に保管する。転送バッファ142では、障害データ231に対する処理は何もせず、一時的に保管するのみとする。そのため、監視装置203側から大量の障害データ231を一度に受け取っても、処理が軽いため、障害データ231を全て転送バッファ142に保管することができる。転送制御部141は、転送バッファ142に障害データ231が入っていれば、古い物から順に多重系監視システム119へ送付する。そのとき、転送制御部141は、転送設定情報143に基づいて、一定時間における転送数を一定数に抑える。一定時間の間、負荷判定部135で負荷が閾値以下であると判定された場合は、転送制御部141は、転送バッファ142からの障害データ231の転送について、一定間隔ごとの転送数を増やす。負荷が閾値以上であると判定された場合は、転送制御部141は、転送バッファ142からの障害データ231の転送について、一定間隔ごとの転送数を減らす。転送制御部141は、転送設定情報143に基づいて、一定間隔ごとの転送数を決定する。
 具体的には、転送制御部141は、以下のように転送数を制御する。
 負荷が閾値を超えた場合に、一定間隔ごとの、転送バッファ142からの障害データ231の転送数を、Push型負荷超過時転送数減の分だけ減らす。
新しい転送数=現在の転送数-Push型負荷超過時転送数減
 負荷が閾値を超えた場合に、一定間隔ごとの、転送バッファ142からの障害データ231の転送数を、Push型負荷非超過時転送数増の分だけ増やす。
新しい転送数=現在の転送数+Push型負荷非超過時転送数増
Next, the case of using Push type transfer will be described.
In the initial state, the transfer unit 140 immediately transfers all fault data 231 detected and notified by the monitoring device 203 to the multiplex monitoring system 119 via the connection disconnection unit 131.
When the transfer unit 140 receives the switching notification 92, the transfer control unit 141 temporarily stores all the failure data 231 sent from the monitoring device 203 in the transfer buffer 142. The transfer buffer 142 does not perform any processing for the fault data 231, and only temporarily stores it. Therefore, even if a large amount of fault data 231 is received at one time from the monitoring device 203 side, all fault data 231 can be stored in the transfer buffer 142 because the processing is light. If the fault data 231 is stored in the transfer buffer 142, the transfer control unit 141 sends the fault data 231 to the multiplex monitoring system 119 in order from the old one. At that time, based on the transfer setting information 143, the transfer control unit 141 suppresses the number of transfers in a fixed time to a fixed number. If it is determined by the load determination unit 135 that the load is equal to or less than the threshold for a predetermined time, the transfer control unit 141 increases the number of transfers at regular intervals for transfer of the fault data 231 from the transfer buffer 142. When it is determined that the load is equal to or higher than the threshold, the transfer control unit 141 reduces the number of transfers at regular intervals for transfer of the fault data 231 from the transfer buffer 142. The transfer control unit 141 determines the number of transfers at constant intervals based on the transfer setting information 143.
Specifically, the transfer control unit 141 controls the number of transfers as follows.
When the load exceeds the threshold, the number of fault data 231 transferred from the transfer buffer 142 per fixed interval is reduced by the value of the setting "Push type transfer count decrease when load is exceeded".
New number of transfers = Current number of transfers - Push type transfer count decrease when load is exceeded
When the load does not exceed the threshold, the number of fault data 231 transferred from the transfer buffer 142 per fixed interval is increased by the value of the setting "Push type transfer count increase when load is not exceeded".
New number of transfers = Current number of transfers + Push type transfer count increase when load is not exceeded
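The Push type behaviour described above (buffer everything, send oldest first, adjust the per-interval count from the load comparison) can be illustrated with a minimal Python sketch. It is not taken from the specification: the class name, the field names, and the `min_rate` and `max_rate` bounds are assumptions made for illustration, and the "predetermined time" condition for increasing the count is omitted for brevity.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class PushTransferControl:
    """Hypothetical model of transfer control unit 141 plus transfer buffer 142."""
    rate: int              # current number of transfers per fixed interval
    decrease: int          # assumed decrement applied when the load exceeds the threshold
    increase: int          # assumed increment applied when the load does not exceed it
    min_rate: int = 1      # assumption: floor so the buffer always keeps draining
    max_rate: int = 1000   # assumption: ceiling on the per-interval count
    buffer: deque = field(default_factory=deque)

    def store(self, fault_data):
        """Store incoming fault data without processing it (FIFO order)."""
        self.buffer.append(fault_data)

    def adjust(self, load, threshold):
        """Apply: new number of transfers = current number -/+ configured step."""
        if load > threshold:
            self.rate = max(self.min_rate, self.rate - self.decrease)
        else:
            self.rate = min(self.max_rate, self.rate + self.increase)
        return self.rate

    def drain(self):
        """Return up to `rate` items for this interval, oldest first."""
        count = min(self.rate, len(self.buffer))
        return [self.buffer.popleft() for _ in range(count)]


if __name__ == "__main__":
    control = PushTransferControl(rate=3, decrease=2, increase=1)
    for n in range(10):
        control.store(f"fault-{n}")
    print(control.drain())                 # three oldest items
    control.adjust(load=90, threshold=80)  # load exceeded, rate shrinks
    print(control.drain())
```

Because the buffer only appends and pops, storing stays cheap even during a burst, which is the property the description relies on to avoid losing fault data.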
As described above, the monitoring control system 500b according to the present embodiment can temporarily limit the number of fault data transfers when switching occurs in the multiplex monitoring system. Further, by gradually increasing the number of transfers while the load is equal to or less than the threshold, the monitoring control system 500b avoids overloading the standby system of the multiplex monitoring system. For Pull type fault data transfer, the fault data is kept in the database of the monitoring device, so no fault data is lost during or after switching. For Push type fault data transfer, the transfer buffer 142 temporarily stores the fault data, so no fault data is lost during or after switching.
FIG. 14 is a diagram showing another example of the transfer setting information 143 according to the present embodiment.
With the transfer setting information 143 of FIG. 13, the number of transfers is increased or decreased according to whether or not the load exceeds the threshold. However, as shown in FIG. 14, the increase or decrease in the number of transfers may instead be controlled in steps according to the load. In FIG. 14, the horizontal axis represents the load level and the vertical axis represents the increase or decrease in the number of transfers; by setting a curve, the increase or decrease in the number of transfers changes continuously with the load level. With the transfer setting information 143 of FIG. 14, the number of transfers can be set separately for each type of load, such as CPU usage rate and IOPS. An overall load index that combines a plurality of loads may also be defined, and the number of transfers set according to that index.
When the number of transfers is changed continuously as in FIG. 14, the transfer control unit 141 obtains the increase or decrease in the number of transfers from the load value by using the transfer setting information 143 of FIG. 14. Then, as in the case of the transfer setting information of FIG. 13, the transfer control unit 141 calculates the new number of transfers using that increase or decrease.
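The continuous setting of FIG. 14 can be illustrated with a minimal Python sketch, assuming the curve is stored as a list of (load, increase/decrease) points and evaluated by piecewise-linear interpolation. The curve values, the function names, and the `floor` parameter are hypothetical; the specification does not state how the curve is represented.

```python
from bisect import bisect_right

# Hypothetical curve for one load type (CPU usage rate in percent):
# each point is (load, increase or decrease in transfers per interval).
CPU_CURVE = [(0.0, 5.0), (50.0, 2.0), (80.0, 0.0), (100.0, -4.0)]


def transfer_delta(curve, load):
    """Look up the increase/decrease for a load value by linear interpolation."""
    loads = [x for x, _ in curve]
    if load <= loads[0]:
        return curve[0][1]
    if load >= loads[-1]:
        return curve[-1][1]
    i = bisect_right(loads, load)          # first point strictly above `load`
    (x0, y0), (x1, y1) = curve[i - 1], curve[i]
    return y0 + (y1 - y0) * (load - x0) / (x1 - x0)


def next_transfer_count(current, curve, load, floor=0):
    """New number of transfers = current number + looked-up increase/decrease."""
    return max(floor, current + round(transfer_delta(curve, load)))


if __name__ == "__main__":
    print(next_transfer_count(10, CPU_CURVE, load=65.0))
    print(next_transfer_count(10, CPU_CURVE, load=90.0))
```

A separate curve per load type (for example one for CPU usage rate and one for IOPS) would fit the same lookup function; how the per-type results are combined is left open here, as it is in the text.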
In Embodiments 1 to 3, each unit of the monitoring control system has been described as an independent functional block. However, the monitoring control system need not be configured as in the embodiments described above. The functional blocks of the monitoring control system may have any configuration as long as they can realize the functions described in the above embodiments.
A plurality of parts of Embodiments 1 to 3 may be implemented in combination. Alternatively, only one part of these embodiments may be implemented. In addition, these embodiments may be implemented in any combination, in whole or in part.
The embodiments described above are essentially preferred examples and are not intended to limit the scope of the present invention, the scope of the things to which it is applied, or the scope of its uses. The embodiments described above can be modified in various ways as needed.
91 failure information, 92 switching notification, 93 load information, 94 processable notification, 106 network, 107 FW, 110 event aggregation unit, 111 event aggregation (active system), 112 event aggregation (standby system), 113 event database, 114 incident management unit, 115 incident management (active system), 116 incident management (standby system), 117 incident database, 118 failure display unit, 119 multiplex monitoring system, 121 operator, 130 switching management device, 131 connection disconnection unit, 132 failure acquisition unit, 133 switching unit, 134 load acquisition unit, 135 load determination unit, 136 connection control unit, 137 threshold information, 138, 138a connection information, 139 storage unit, 140 transfer unit, 141 transfer control unit, 142 transfer buffer, 143 transfer setting information, 191 monitoring unit, 192 active monitoring unit, 193 standby monitoring unit, 200 monitored system, 202 FW, 203 monitoring device, 204 SW, 205 server, 231 fault data, 500, 500b monitoring control system, 909 electronic circuit, 910 processor, 921 memory, 922 auxiliary storage device, 930 input interface, 940 output interface, 950 communication device, S130 switching management processing.

Claims (11)

  1.  A switching management device that manages system switching of a multiplex monitoring system including an active monitoring unit that executes monitoring processing on a plurality of monitoring devices and a standby monitoring unit that executes the monitoring processing in place of the active monitoring unit when a failure occurs in the active monitoring unit, the switching management device comprising:
    a switching unit that, upon acquiring failure information indicating that a failure has occurred in the active monitoring unit, switches execution of the monitoring processing from the active monitoring unit to the standby monitoring unit;
    a connection control unit that, upon acquiring from the switching unit a switching notification indicating that switching has been made from the active monitoring unit to the standby monitoring unit, disconnects communication between the multiplex monitoring system and each of the plurality of monitoring devices; and
    a load determination unit that acquires load information representing a load of the standby monitoring unit executing the monitoring processing, and compares the load with a threshold,
    wherein the connection control unit controls the communication between each of the plurality of monitoring devices and the standby monitoring unit based on a result of the comparison between the load and the threshold.
  2.  The switching management device according to claim 1, wherein, when the load is equal to or less than the threshold, the connection control unit brings the communication into a connected state for at least one of the plurality of monitoring devices.
  3.  The switching management device according to claim 2, wherein
    the switching unit outputs a processable notification indicating that the standby monitoring unit has become able to perform processing, and
    the connection control unit, upon acquiring the processable notification from the switching unit, brings the communication into a connected state for at least one of the plurality of monitoring devices.
  4.  The switching management device according to claim 3, wherein, when the load exceeds the threshold, the connection control unit disconnects the communication for at least one of the plurality of monitoring devices.
  5.  The switching management device according to claim 4, wherein the connection control unit ends control of the communication when the communication has been brought into a connected state for all of the plurality of monitoring devices.
  6.  The switching management device according to any one of claims 1 to 5, further comprising
    a storage unit that stores connection information in which a state of the communication between each of the plurality of monitoring devices and the standby monitoring unit is set,
    wherein the connection control unit controls the communication using the connection information.
  7.  The switching management device according to claim 6, wherein
    an importance of each of the plurality of monitoring devices is set in the connection information, and
    the connection control unit controls the communication based on the importance and the result of the comparison between the load and the threshold.
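The connection control recited in claims 1 to 7 can be illustrated with a minimal Python sketch. It is an interpretation under stated assumptions, not the claimed implementation: the class and method names are hypothetical, exactly one monitoring device is connected or disconnected per step, and the policy of reconnecting the most important device first and disconnecting the least important device first is an assumption (the claims only say that the importance is used).

```python
from dataclasses import dataclass


@dataclass
class Connection:
    """One entry of the connection information (hypothetical layout)."""
    device: str
    importance: int          # assumption: a larger value means a more important device
    connected: bool = True


class ConnectionControl:
    """Hypothetical model of the connection control unit."""

    def __init__(self, connections, threshold):
        self.connections = list(connections)
        self.threshold = threshold
        self.controlling = False     # True while reconnection is being controlled

    def _connect_next(self):
        pending = [c for c in self.connections if not c.connected]
        if pending:
            max(pending, key=lambda c: c.importance).connected = True

    def on_switching_notification(self):
        """Claim 1: disconnect every monitoring device when switching occurs."""
        for c in self.connections:
            c.connected = False
        self.controlling = True

    def on_processable_notification(self):
        """Claim 3: connect at least one device once the standby side is ready."""
        if self.controlling:
            self._connect_next()

    def on_load(self, load):
        """Claims 2, 4 and 5: connect or disconnect based on the comparison."""
        if not self.controlling:
            return
        if load <= self.threshold:
            self._connect_next()
        else:
            live = [c for c in self.connections if c.connected]
            if live:
                min(live, key=lambda c: c.importance).connected = False
        if all(c.connected for c in self.connections):
            self.controlling = False     # all devices connected: end control

    def connected_devices(self):
        return [c.device for c in self.connections if c.connected]
```

Calling `on_load` once per load measurement reconnects devices step by step while the standby monitoring unit stays at or below the threshold, and backs off by one device whenever it does not.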
  8.  A monitoring control system comprising:
    a plurality of monitoring devices;
    a multiplex monitoring system including an active monitoring unit that executes monitoring processing on the plurality of monitoring devices and a standby monitoring unit that executes the monitoring processing in place of the active monitoring unit when a failure occurs in the active monitoring unit; and
    a switching management device that manages system switching of the multiplex monitoring system,
    wherein the switching management device includes:
    a switching unit that, upon acquiring failure information indicating that a failure has occurred in the active monitoring unit, switches execution of the monitoring processing from the active monitoring unit to the standby monitoring unit;
    a connection control unit that, upon acquiring from the switching unit a switching notification indicating that switching has been made from the active monitoring unit to the standby monitoring unit, disconnects communication between the multiplex monitoring system and each of the plurality of monitoring devices; and
    a load determination unit that acquires load information representing a load of the standby monitoring unit executing the monitoring processing, and compares the load with a threshold, and
    wherein the connection control unit controls the communication between each of the plurality of monitoring devices and the standby monitoring unit based on a result of the comparison between the load and the threshold.
  9.  The monitoring control system according to claim 8, further comprising
    a transfer unit that transfers fault data transmitted from each of the plurality of monitoring devices to the multiplex monitoring system, wherein
    the switching unit transmits the switching notification to the transfer unit,
    the load determination unit transmits the result of the comparison to the transfer unit, and
    the transfer unit controls the number of transfers of the fault data based on the switching notification and the result of the comparison.
  10.  A switching management method of a switching management device that manages system switching of a multiplex monitoring system including an active monitoring unit that executes monitoring processing on a plurality of monitoring devices and a standby monitoring unit that executes the monitoring processing in place of the active monitoring unit when a failure occurs in the active monitoring unit, the switching management method comprising:
    switching, by a switching unit upon acquiring failure information indicating that a failure has occurred in the active monitoring unit, execution of the monitoring processing from the active monitoring unit to the standby monitoring unit;
    disconnecting, by a connection control unit upon acquiring from the switching unit a switching notification indicating that switching has been made from the active monitoring unit to the standby monitoring unit, communication between the multiplex monitoring system and each of the plurality of monitoring devices;
    acquiring, by a load determination unit, load information representing a load of the standby monitoring unit executing the monitoring processing, and comparing the load with a threshold; and
    controlling, by the connection control unit, the communication between each of the plurality of monitoring devices and the standby monitoring unit based on a result of the comparison between the load and the threshold.
  11.  A switching management program of a switching management device that manages system switching of a multiplex monitoring system including an active monitoring unit that executes monitoring processing on a plurality of monitoring devices and a standby monitoring unit that executes the monitoring processing in place of the active monitoring unit when a failure occurs in the active monitoring unit, the switching management program causing the switching management device, which is a computer, to execute:
    switching processing of switching, upon acquiring failure information indicating that a failure has occurred in the active monitoring unit, execution of the monitoring processing from the active monitoring unit to the standby monitoring unit;
    disconnection processing of disconnecting, upon acquiring from the switching processing a switching notification indicating that switching has been made from the active monitoring unit to the standby monitoring unit, communication between the multiplex monitoring system and each of the plurality of monitoring devices;
    load determination processing of acquiring load information representing a load of the standby monitoring unit executing the monitoring processing, and comparing the load with a threshold; and
    communication control processing of controlling the communication between each of the plurality of monitoring devices and the standby monitoring unit based on a result of the comparison between the load and the threshold.
PCT/JP2017/039778 2017-06-23 2017-11-02 Switching management device, monitoring control system, switching management method, and switching management program WO2018235310A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2017123594A JP2019008548A (en) 2017-06-23 2017-06-23 Switching management device, monitoring control system, switching management method and switching management program
JP2017-123594 2017-06-23

Publications (1)

Publication Number Publication Date
WO2018235310A1 (en) 2018-12-27

Family

ID=64737781

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2017/039778 WO2018235310A1 (en) 2017-06-23 2017-11-02 Switching management device, monitoring control system, switching management method, and switching management program

Country Status (2)

Country Link
JP (1) JP2019008548A (en)
WO (1) WO2018235310A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115053493A (en) * 2020-03-12 2022-09-13 欧姆龙株式会社 Information processing device, host device, information processing system, notification method, and information processing program

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007207219A (en) * 2006-01-06 2007-08-16 Hitachi Ltd Computer system management method, management server, computer system, and program
JP2011248735A (en) * 2010-05-28 2011-12-08 Hitachi Ltd Server computer changeover method, management computer and program
JP2015011472A (en) * 2013-06-27 2015-01-19 富士通株式会社 Control method, control program, and information processing system

Also Published As

Publication number Publication date
JP2019008548A (en) 2019-01-17

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 17914345; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 17914345; Country of ref document: EP; Kind code of ref document: A1)
Kind code of ref document: A1