WO2018235310A1

WO2018235310A1 - Switching management device, monitoring control system, switching management method, and switching management program

Info

Publication number: WO2018235310A1
Application number: PCT/JP2017/039778
Authority: WO
Inventors: 山田　耕一; 明半田
Original assignee: 三菱電機株式会社
Priority date: 2017-06-23
Filing date: 2017-11-02
Publication date: 2018-12-27
Also published as: JP2019008548A

Abstract

When a switching unit (133) acquires a failure notification indicating that a failure has occurred in an operation system monitoring unit (192), system switching from the operation system monitoring unit (192) to a standby system monitoring unit (193) is performed. When acquiring a switching notification from the switching unit (133), a connection control unit (136) disconnects communication between a multi-system monitoring system (119) and each of a plurality of monitoring devices (203). A load determination unit (135) compares a load on the standby system monitoring unit (193) with a threshold. On the basis of the result of comparison between the load and the threshold, the connection control unit (136) controls communication between each of the plurality of monitoring devices (203) and the standby system monitoring unit (193).

Description

Switching management device, monitoring control system, switching management method and switching management program

The present invention relates to a switching management device, a monitoring control system, a switching management method, and a switching management program. In particular, the present invention relates to a switching management device, a monitoring control system, a switching management method, and a switching management program that manage system switching in a multi-system monitoring system.

A monitoring system monitors monitoring targets such as servers and network devices. A monitoring system must be able to handle every failure that occurs in a monitoring target. For this reason, monitoring systems are commonly made redundant as multi-system configurations in order to provide fault tolerance.
Patent Document 1 discloses a technique for preventing a standby system from stopping by using operation data synchronized using a failure flag and a stop flag when switching a system in a multiplex system.
Further, Patent Document 2 discloses a technology for stabilizing a web server by temporarily limiting the number of clients communicating with the web server.

Patent Document 1: Japanese Patent No. 5342701
Patent Document 2: JP 2008-250669 A

In a multi-system monitoring system, when a large number of failures are detected in the monitoring targets, the operating system (that is, the active system) may stop responding due to overload, causing a switch to the standby system. After the switch, however, the standby system must in turn handle the same large number of failures. The standby system may therefore also become overloaded and stop responding, so that both the operating system and the standby system go down. Such a situation arises when a failure occurs in a core network device such as a switch, a router, or a firewall, or in a carrier line connecting the multi-system monitoring system and the monitoring targets.

When such a failure occurs, all devices beyond the failure point in the network topology are cut off from the multi-system monitoring system. Failures indicating that these monitoring targets are disconnected therefore occur all at once, and the multi-system monitoring system must process all of the failure information. As a result, an overload occurs and both the operating system and the standby system go down.

The present invention aims to prevent the standby system from becoming overloaded when the system is switched to the standby system because of an overload.

The switching management device according to the present invention manages system switching of a multi-system monitoring system that includes an operating system monitoring unit that performs monitoring processing on a plurality of monitoring devices, and a standby system monitoring unit that executes the monitoring processing in place of the operating system monitoring unit when a failure occurs in the operating system monitoring unit. The switching management device includes:
a switching unit that switches execution of the monitoring processing from the operating system monitoring unit to the standby system monitoring unit when acquiring a failure notification indicating that a failure has occurred in the operating system monitoring unit;
a connection control unit that disconnects communication between the multi-system monitoring system and each of the plurality of monitoring devices when acquiring, from the switching unit, a switching notification indicating that switching has been made from the operating system monitoring unit to the standby system monitoring unit; and
a load determination unit that acquires load information representing a load of the standby system monitoring unit that executes the monitoring processing, and compares the load with a threshold,
wherein the connection control unit controls communication between each of the plurality of monitoring devices and the standby system monitoring unit based on the result of comparing the load with the threshold.

In the switching management device according to the present invention, when switched to the standby monitoring unit, the connection control unit disconnects communication between the multi-system monitoring system and each of the plurality of monitoring devices. Then, the connection control unit controls communication between each of the plurality of monitoring devices and the standby system monitoring unit based on the comparison determination result of the load of the standby system monitoring unit and the threshold value. Therefore, according to the switching management device of the present invention, it is possible to prevent the standby monitoring unit from becoming overloaded after switching to the standby monitoring unit.

FIG. 1 is a block diagram of a monitoring control system 500 according to a first embodiment.
FIG. 2 is a block diagram of a switching management device 130 according to the first embodiment.
FIG. 3 is a flowchart of switching management processing S130 by the switching management device 130 according to the first embodiment.
FIG. 4 is a diagram showing connection information 138 according to the first embodiment.
FIG. 5 is a diagram showing threshold information 137 according to the first embodiment.
FIG. 6 is a flowchart of connection control processing S106 by the connection control unit 136 according to the first embodiment.
FIG. 7 is a block diagram of a switching management device 130 according to a modification of the first embodiment.
FIG. 8 is a diagram showing a connection table 138a according to a second embodiment.
FIG. 9 is a diagram showing an example of a user interface for setting the connection table 138a according to the second embodiment.
FIG. 10 is a diagram showing a state in which a drop-down list is displayed in the user interface.
FIG. 11 is a diagram showing another example of the user interface for setting the connection table 138a according to the second embodiment.
FIG. 12 is a block diagram of a monitoring control system 500b according to a third embodiment.
FIG. 13 is a diagram showing an example of transfer setting information 143 according to the third embodiment.
FIG. 14 is a diagram showing another example of the transfer setting information 143 according to the third embodiment.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. In the drawings, the same or corresponding parts are denoted by the same reference numerals. In the description of the embodiment, the description of the same or corresponding parts will be omitted or simplified as appropriate.

Embodiment 1
*** Description of the configuration ***
The configuration of the monitoring control system 500 according to the present embodiment will be described using FIG. 1.
The monitoring control system 500 includes a multi-system monitoring system 119, a monitoring target system 200, and a switching management device 130.
The multi-system monitoring system 119 and the monitoring target system 200 are connected via the network 106, the FW 107, and the switching management device 130.
The network 106 connects the multi-system monitoring system 119 and the monitoring target system 200. The network 106 is the Internet, an intranet, or another network, depending on requirements such as the installation site of the monitoring target system 200.

<Monitored system 200>
The monitoring target system 200 is a system that provides a service to a user. One or more monitoring target systems 200 exist.
The monitoring target system 200 includes an FW 202, a monitoring device 203, a SW 204, and a server 205.

The FW 202 is a firewall that connects the inside and the outside of the monitored system 200. Depending on the configuration of the monitored system 200, one or more FWs 202 exist. Alternatively, depending on the configuration of the monitored system 200, the FW 202 may not exist.

The monitoring device 203 monitors a server or a network device configuring the monitoring target system 200. The monitoring device 203 transmits failure data 231 to the multiplex monitoring system 119 when an abnormality is detected. One or more monitoring devices 203 exist in one monitoring target system 200.

The SW 204 is a network device for connecting the monitoring device 203 and a server or device existing in the monitoring target system 200. Specifically, the SW 204 is a device such as a router, a switch, or a hub. One or more SWs 204 exist in one monitored system 200.
One or more servers 205 exist in one monitoring target system 200.

The FW 107 is a firewall that connects the multiplex monitoring system 119 and the network 106. One or more FWs 107 exist depending on the system configuration. Alternatively, the FW 107 may not exist depending on the system configuration.

<Multiple monitoring system 119>
The multi-system monitoring system 119 includes a monitoring unit 191 that performs monitoring processing on a plurality of monitoring devices 203. The monitoring unit 191 includes an operating system monitoring unit 192 and a standby system monitoring unit 193 that executes the monitoring processing in place of the operating system monitoring unit 192 when a failure occurs in the operating system monitoring unit 192. The operating system monitoring unit 192 includes an event aggregation (operating system) 111 and an incident management (operating system) 115. The standby system monitoring unit 193 includes an event aggregation (standby system) 112 and an incident management (standby system) 116.

The multiple system monitoring system 119 manages a failure that has occurred in the monitored system 200.
The event aggregation unit 110 receives failure data 231 reported from the plurality of monitoring devices 203 in the plurality of monitored systems 200. The event aggregation unit 110 centrally manages the received failure data 231 as an event. The event aggregation unit 110 links information of the monitoring target system 200 to event information. Also, the event aggregation unit 110 determines the severity of the failure. Event information is stored in the event database 113.
The event aggregation unit 110 is duplicated. The event aggregation unit 110 includes an event aggregation (operating system) 111 and an event aggregation (standby system) 112. Normally, the event aggregation (operating system) 111 is operating. When a failure occurs in the event aggregation (operating system) 111 and processing cannot be continued, the event aggregation (standby system) 112 takes over the processing. Event information to be processed is stored in the event database 113. The event database 113 can be accessed from both the event aggregation (operating system) 111 and the event aggregation (standby system) 112. Therefore, the event aggregation (standby system) 112 can take over the processing of the event information stored in the event database 113.

The incident management unit 114 receives event information from the event aggregation unit 110. The incident management unit 114 stores incident information based on the event information in the incident database 117. The incident information includes information such as the content of the notification sent by the operator to the user of the monitoring target system 200, information collected for failure recovery, or processing performed for failure recovery. Incident information is stored in the incident database 117 until the failure is recovered.
The incident management unit 114 is duplicated. The incident management unit 114 includes an incident management (operating system) 115 and an incident management (standby system) 116. Normally, the incident management (operating system) 115 is operating. When a failure occurs in the incident management (operating system) 115 and processing cannot be continued, the incident management (standby system) 116 takes over the processing. The incident database 117 can be accessed from both the incident management (operating system) 115 and the incident management (standby system) 116. The incident management (standby system) 116 can therefore take over processing using the incident information stored in the incident database 117.

The fault display unit 118 displays a fault that occurs in the monitoring target system 200 and is detected. The fault display unit 118 may notify not only by displaying the fault content on the screen, but also by a method of lighting a lamp or sounding a sound. By displaying a fault on the fault display unit 118, the operator 121 starts handling the fault.

<Switching Management Device 130>
The configuration of the switching management device 130 according to the present embodiment will be described using FIG. 2.
The switching management device 130 is a computer. The switch management device 130 includes a processor 910 and other hardware such as a memory 921, an auxiliary storage device 922, an input interface 930, an output interface 940, and a communication device 950. The processor 910 is connected to other hardware via signal lines to control these other hardware.

The switching management device 130 includes, as functional elements, a connection disconnection unit 131, a failure acquisition unit 132, a switching unit 133, a load acquisition unit 134, a load determination unit 135, a connection control unit 136, and a storage unit 139. The storage unit 139 stores threshold information 137 and connection information 138. The functions of the connection disconnection unit 131, the failure acquisition unit 132, the switching unit 133, the load acquisition unit 134, the load determination unit 135, and the connection control unit 136 are realized by software. The storage unit 139 is included in the memory 921.

The processor 910 is a device that executes a switching management program. The switching management program is a program for realizing the functions of the connection disconnection unit 131, the failure acquisition unit 132, the switching unit 133, the load acquisition unit 134, the load determination unit 135, and the connection control unit 136.
The processor 910 is an IC (Integrated Circuit) that performs arithmetic processing. A specific example of the processor 910 is a central processing unit (CPU), a digital signal processor (DSP), or a graphics processing unit (GPU).

The memory 921 is a storage device that temporarily stores data. A specific example of the memory 921 is a static random access memory (SRAM) or a dynamic random access memory (DRAM).

The auxiliary storage device 922 is a storage device for storing data. A specific example of the auxiliary storage device 922 is an HDD. The auxiliary storage device 922 may also be a portable storage medium such as an SD (registered trademark) memory card, a CF, a NAND flash, a flexible disk, an optical disk, a compact disk, a Blu-ray (registered trademark) disk, and a DVD. HDD is an abbreviation of Hard Disk Drive. SD (registered trademark) is an abbreviation of Secure Digital. CF is an abbreviation of Compact Flash. DVD is an abbreviation of Digital Versatile Disk.

The input interface 930 is a port connected to an input device such as a mouse, a keyboard, or a touch panel. Specifically, the input interface 930 is a USB (Universal Serial Bus) terminal. The input interface 930 may be a port connected to a LAN (Local Area Network).

The output interface 940 is a port to which a cable of a display device such as a display is connected. Specifically, the output interface 940 is a USB terminal or an HDMI (registered trademark) (High Definition Multimedia Interface) terminal. The display is specifically an LCD (Liquid Crystal Display).

The communication device 950 is a device that communicates with other devices via a network. Communication device 950 has a receiver and a transmitter. The communication device 950 is connected to a communication network such as a LAN, the Internet, or a telephone line by wire or wirelessly. The communication device 950 is specifically a communication chip or a NIC (Network Interface Card).

The switching management program is read into the processor 910 and executed by the processor 910. The memory 921 stores not only the switching management program but also the OS (Operating System). The processor 910 executes the switching management program while executing the OS. The switching management program and the OS may be stored in the auxiliary storage device 922. The switching management program and the OS stored in the auxiliary storage device 922 are loaded into the memory 921 and executed by the processor 910. Note that part or all of the switching management program may be incorporated into the OS.

The switching management device 130 may include a plurality of processors in place of the processor 910. The plurality of processors share the execution of the switching management program. Like the processor 910, each processor is a device that executes the switching management program.

Data, information, signal values and variable values used, processed or output by the switching management program are stored in the memory 921, the auxiliary storage device 922, or a register or cache memory in the processor 910.

The switching management program causes a computer to execute each process, each procedure, or each step obtained by reading the "unit" of each of the connection disconnection unit 131, the failure acquisition unit 132, the switching unit 133, the load acquisition unit 134, the load determination unit 135, and the connection control unit 136 as "process", "procedure", or "step". Further, the switching management method is the method performed when the switching management device 130 executes the switching management program.
The switching management program may be provided by being recorded on a computer readable recording medium, or may be provided as a program product.

*** Explanation of the function of the switching management device 130 ***
The switching management device 130 manages system switching of the multi-system monitoring system 119. The switching management device 130 switches the system when the event aggregation (operating system) 111 or the incident management (operating system) 115 in the multi-system monitoring system 119 cannot continue processing.
The failure acquisition unit 132 monitors the state of the operating system of the event aggregation unit 110 and the incident management unit 114. When a failure occurs and processing cannot be continued, the failure acquisition unit 132 outputs the failure information 91 of the multi-system monitoring system 119 to the switching unit 133.

When the switching unit 133 acquires the failure information 91 indicating that a failure has occurred in the operating system monitoring unit 192, the switching unit 133 switches the execution of the monitoring process from the operating system monitoring unit 192 to the standby system monitoring unit 193. That is, on receiving the failure information 91 of the multi-system monitoring system 119, the switching unit 133 switches from the operating system to the standby system so that the processing of the event aggregation unit 110 or the incident management unit 114 can continue. The switching unit 133 transmits, to the connection control unit 136, a switching notification 92 indicating that switching has been made from the operating system monitoring unit 192 to the standby system monitoring unit 193. In addition, the switching unit 133 transmits, to the connection control unit 136, a process enable notification 94 indicating that the standby system monitoring unit 193 has become able to process.

The load acquisition unit 134 acquires load information 93 of the event aggregation unit 110 and the incident management unit 114. Specifically, the load information 93 is information such as the CPU usage rate, the number of IOs per second, that is, the IOPS, and the number of processing waits. The load acquisition unit 134 outputs the acquired load information 93 to the load determination unit 135.

The load determination unit 135 acquires load information 93 representing the load of the standby monitoring unit 193 that executes the monitoring process, and compares the load with a threshold. The load determination unit 135 compares the load included in the load information 93 with the threshold included in the threshold information 137 to determine whether the load exceeds the threshold. Then, the load determination unit 135 outputs the determination result to the connection control unit 136.

The connection control unit 136 acquires, from the switching unit 133, a switching notification 92 indicating that switching has been made from the operating system monitoring unit 192 to the standby system monitoring unit 193. When the connection control unit 136 acquires the switching notification 92, the connection control unit 136 disconnects the communication between the multi-system monitoring system 119 and each of the plurality of monitoring devices 203. The connection control unit 136 also controls communication between each of the plurality of monitoring devices 203 and the standby system monitoring unit 193 based on the result of comparing the load with the threshold.
The storage unit 139 stores connection information 138 in which the state of communication between each of the plurality of monitoring devices 203 and the standby system monitoring unit 193 is set. The connection control unit 136 controls communication using the connection information 138.
When the switching unit 133 performs switching, the connection control unit 136 refers to the connection information 138 and instructs the connection disconnection unit 131 to connect or disconnect each monitoring device 203. When the connection disconnection unit 131 receives an instruction from the connection control unit 136, the connection disconnection unit 131 connects or disconnects the network between the monitoring device 203 and the multi-system monitoring system 119. Control of communication by the connection control unit 136 will be described later.

*** Description of operation ***
<Operation of Monitored System 200>
The operation of the monitoring target system 200 will be described. The monitoring device 203 monitors servers and network devices in the monitoring target system 200 to determine whether a failure has occurred. As a specific example, the monitoring device 203 acquires the CPU usage rate and the PING response information for the host srv1. The monitoring device 203 determines that a failure has occurred when a failure condition is met. For the CPU usage rate, an example of such a condition is a usage rate of 90% or more. For PING, an example is no response three or more consecutive times.
When a failure occurs, the monitoring device 203 transmits the failure data 231 to the multisystem monitoring system 119.
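
The failure conditions in this example can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation in the specification: the names `HostSample`, `trailing_misses`, and `detect_failures`, and the dictionary layout standing in for the failure data 231, are hypothetical. Only the 90% CPU condition and the three consecutive missed PING responses come from the text.

```python
from dataclasses import dataclass
from typing import Dict, List

CPU_LIMIT_PERCENT = 90.0  # "90% or more" in the text
PING_MISS_LIMIT = 3       # "three or more consecutive times" in the text


@dataclass
class HostSample:
    """One observation of a monitored host (hypothetical structure)."""
    host: str
    cpu_percent: float
    ping_replies: List[bool]  # oldest first; True means a reply was received


def trailing_misses(replies: List[bool]) -> int:
    """Count the consecutive missed PING replies at the end of the history."""
    count = 0
    for replied in reversed(replies):
        if replied:
            break
        count += 1
    return count


def detect_failures(sample: HostSample) -> List[Dict[str, str]]:
    """Return one failure-data record per failure condition that is met."""
    failures = []
    if sample.cpu_percent >= CPU_LIMIT_PERCENT:
        failures.append({"host": sample.host, "item": "cpu"})
    if trailing_misses(sample.ping_replies) >= PING_MISS_LIMIT:
        failures.append({"host": sample.host, "item": "ping"})
    return failures
```

In this sketch, a non-empty result from `detect_failures` corresponds to the case in which the monitoring device 203 would transmit failure data 231.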

<Operation of Multisystem Monitoring System 119>
In the multi-system monitoring system 119, the event aggregation unit 110 receives the failure data 231 sent from the plurality of monitoring devices 203. The event aggregation unit 110 processes the received failure data 231 and stores the result as event information in the event database 113.
In the initial state, the event aggregation (operating system) 111 is operating. When a failure occurs in the operating system and processing cannot be continued, the event aggregation (standby system) 112 is activated and processing continues. The event aggregation unit 110 sends event information to the incident management unit 114.

The incident management unit 114 manages event information as incident information. Further, the incident management unit 114 displays the incident information on the fault display unit 118. The operator 121 responds to the fault displayed on the fault display unit 118. Specifically, the operator 121 contacts the customer who uses the failed server or network device. The operator 121 also obtains detailed information such as logs from the server or network device in which the failure occurred, and attempts failure recovery according to a predetermined procedure such as a restart or a setting change. When parts need to be replaced, the operator 121 notifies the customer accordingly. In this way, the operator 121 carries out various responses. When the response is completed, the fact that the fault has been resolved is recorded in the incident information, and the fault is removed from the display on the fault display unit 118.
In the initial state, the incident management (operating system) 115 is operating. When a failure occurs in the operating system and processing cannot be continued, the incident management (standby system) 116 is activated and processing continues.

<Operation of Switch Management Device 130>
FIG. 3 is a flowchart of switch management processing S130 by the switch management apparatus 130 according to the present embodiment. FIG. 4 is a diagram showing connection information 138 according to the present embodiment. FIG. 5 is a diagram showing threshold information 137 according to the present embodiment.
The switch management process S130 will be described with reference to FIGS. 3 to 5.

<< Disconnection process >>
In step S101, the connection disconnection unit 131 connects or disconnects communication between each monitoring device 203 and the multi-system monitoring system 119 according to the connection information 138. As shown in FIG. 4, in the initial state of the switching management device 130, all the monitoring devices described in the connection information 138 are in the connected state.

<< Failure acquisition processing >>
In step S102, the failure acquisition unit 132 monitors the states of the event aggregation (operating system) 111 and the incident management (operating system) 115. If the failure acquisition unit 132 determines that processing cannot be continued because of abnormal termination of a program or a loss of response, the failure acquisition unit 132 outputs the failure information 91 to the switching unit 133.
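
The check in step S102 can be illustrated with the following Python sketch. It is only one possible reading: the structure `ComponentStatus`, the function names, the record layout used for the failure information 91, and the timeout value are all assumptions, because the text names the two conditions (abnormal termination and loss of response) without specifying how they are detected.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

RESPONSE_TIMEOUT_SEC = 30.0  # hypothetical value; the text gives no timeout


@dataclass
class ComponentStatus:
    """Observed state of one operating-system component (hypothetical)."""
    name: str
    exit_code: Optional[int]  # None while the program is still running
    last_response_at: float   # time of the last response, in seconds


def failure_reason(status: ComponentStatus, now: float) -> Optional[str]:
    """Return why processing cannot continue, or None if it can."""
    if status.exit_code is not None and status.exit_code != 0:
        return "abnormal termination"
    if now - status.last_response_at > RESPONSE_TIMEOUT_SEC:
        return "no response"
    return None


def collect_failure_info(statuses: List[ComponentStatus],
                         now: float) -> List[Dict[str, str]]:
    """Build the records that would be passed on as failure information 91."""
    info = []
    for status in statuses:
        reason = failure_reason(status, now)
        if reason is not None:
            info.append({"component": status.name, "reason": reason})
    return info
```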

<< Switching process >>
In step S103, the switching unit 133 disconnects the operating system that cannot continue processing, activates the standby system, and switches the connection so that processing can continue in the standby system. One switching method is to give the event aggregation unit 110 a single virtual IP address and to switch between the operating system and the standby system by changing whether that virtual IP address is translated to the physical IP address of the event aggregation (operating system) 111 or to that of the event aggregation (standby system) 112. The switching unit 133 sends the switching notification 92 to the connection control unit 136 when the operating system stops responding and is disconnected. The switching unit 133 also sends the process enable notification 94 to the connection control unit 136 when the standby system has been activated and can process.
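
The virtual IP address approach and the order of the two notifications can be sketched as follows. This is a simplified model, not the mechanism in the specification: the class `VirtualAddress`, the function `switch_to_standby`, and the callbacks `start_standby` and `notify` are hypothetical, and a real system would change the address translation in network equipment rather than in a Python object.

```python
from typing import Callable, Optional


class VirtualAddress:
    """Maps one virtual IP address to the physical address now serving it."""

    def __init__(self, virtual_ip: str, operating_ip: str, standby_ip: str) -> None:
        self.virtual_ip = virtual_ip
        self.operating_ip = operating_ip
        self.standby_ip = standby_ip
        self.target: Optional[str] = operating_ip  # None while detached

    def resolve(self) -> Optional[str]:
        """Return the physical address that currently receives traffic."""
        return self.target


def switch_to_standby(address: VirtualAddress,
                      start_standby: Callable[[], None],
                      notify: Callable[[str], None]) -> None:
    """Detach the operating system, start the standby system, and redirect."""
    address.target = None                # disconnect the operating system
    notify("switching notification 92")
    start_standby()                      # activate the standby system
    address.target = address.standby_ip  # translate the virtual IP to standby
    notify("process enable notification 94")
```

Because clients only ever use the virtual address, they do not need to be reconfigured when the translation target changes.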

<< Load acquisition processing >>
In step S104, the load acquisition unit 134 periodically acquires the load of the event aggregation (operating system) 111 and the incident management (operating system) 115.
As shown in FIG. 5, in the threshold information 137, a target for acquiring a load, a load item, a threshold, and a condition are set. The load acquisition unit 134 acquires information for objects corresponding to the load items described in the threshold information 137. Specifically, it is load information such as CPU utilization, IOPS, number of processing waits, or response time. The load acquisition unit 134 sends the acquired load information to the load determination unit 135.

<< Load judgment processing >>
In step S105, the load determination unit 135 compares the acquired load with the threshold based on the threshold information 137. The load determination unit 135 determines, based on the comparison result, whether the acquired load matches the threshold and the condition. Specifically, in the threshold information 137 of FIG. 5, when incident management is targeted, the threshold of the CPU usage rate which is a load item is “80%”, and the condition is “or more”. Here, it is assumed that the CPU usage rate acquired from the incident management (operating system) 115 is 70% as load information. At this time, the load determination unit 135 determines that the threshold value is not exceeded because the load information does not become “80% or more”, which is a combination of the threshold value and the condition. The comparison determination result determined by the load determination unit 135 is sent to the connection control unit 136.
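
The comparison in step S105 can be sketched in Python as follows. The row layout mirrors the four fields named for the threshold information 137 (target, load item, threshold, condition), but the class `ThresholdRow`, the function `load_matches`, and every condition other than "or more" are assumptions made for illustration.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict

# Only "or more" appears in the text; the other conditions are assumptions.
CONDITIONS: Dict[str, Callable[[float, float], bool]] = {
    "or more": operator.ge,
    "or less": operator.le,
    "more than": operator.gt,
    "less than": operator.lt,
}


@dataclass(frozen=True)
class ThresholdRow:
    """One row of the threshold information 137 (hypothetical layout)."""
    target: str
    item: str
    threshold: float
    condition: str


def load_matches(row: ThresholdRow, load: float) -> bool:
    """Return True when the load satisfies the row's threshold and condition."""
    return CONDITIONS[row.condition](load, row.threshold)


# The example from the text: incident management, CPU usage rate, 80%, "or more".
ROW = ThresholdRow("incident management", "CPU usage rate", 80.0, "or more")
```

With this row, `load_matches(ROW, 70.0)` is false, which matches the case described above in which a CPU usage rate of 70% is judged not to exceed the threshold.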

<< Connection control processing >>
The connection control process S106 by the connection control unit 136 according to the present embodiment will be described with reference to FIG.
Step S601 is processing in the initial state. In the initial state, the connection control unit 136 permits the connection disconnection unit 131 to perform all communications.
In step S602, the connection control unit 136 receives a notification from the switching unit 133. As described above, the notifications from the switching unit 133 are the switching notification 92, sent when the operating system stops responding and is disconnected, and the process enable notification 94, sent when the standby system has been activated and can process.
In step S603, the connection control unit 136 determines whether the notification is the switching notification 92. If the notification is the switching notification 92, the process proceeds to step S604.
In step S604, the connection control unit 136 instructs the connection disconnection unit 131 to disconnect all communications. Specifically, the connection control unit 136 sets the disconnection state for all the monitoring devices described in the connection information 138. Then, the connection control unit 136 instructs the connection disconnection unit 131 to disconnect communication with all the monitoring devices.
In step S605, the connection control unit 136 waits until receiving, from the switching unit 133, a process enable notification 94 indicating that the standby system has become processable.
In step S605a, when the connection control unit 136 acquires the process enable notification 94 from the switching unit 133, the connection control unit 136 sets communication with at least one of the plurality of monitoring devices to the connected state. Specifically, on receiving from the switching unit 133 the process enable notification 94 indicating that the standby system has been activated, the connection control unit 136 changes the state of one of the monitoring devices described in the connection information 138 to the connected state. Then, the connection control unit 136 instructs the connection disconnection unit 131 to connect communication with that monitoring device.

In step S606, the connection control unit 136 acquires the comparison determination result from the load determination unit 135.
In step S607, the connection control unit 136 determines whether the load exceeds the threshold value based on the comparison determination result. If the load exceeds the threshold, the connection control unit 136 proceeds to step S608. If the load is equal to or less than the threshold, the connection control unit 136 proceeds to step S609.

In step S608, the connection control unit 136 disconnects communication for at least one of the plurality of monitoring devices. That is, the connection control unit 136 instructs the connection disconnection unit 131 to disconnect one of the connected monitoring devices. The connection control unit 136 realizes disconnection of one of the monitoring devices in the connection state by rewriting the connection information 138.
In step S609, the connection control unit 136 brings communication into the connected state for at least one of the plurality of monitoring devices. That is, the connection control unit 136 instructs the connection disconnection unit 131 to connect one of the disconnected monitoring devices. The connection control unit 136 realizes connection of one of the disconnected monitoring devices by rewriting the connection information 138.

In step S610, the connection control unit 136 waits for a predetermined time.
In step S611, the connection control unit 136 determines whether all the monitoring devices are in the connected state. Specifically, the connection control unit 136 determines whether all the monitoring devices have been connected by referring to the connection information 138. If all the monitoring devices are in the connected state, the connection control unit 136 proceeds to step S612. If there is an unconnected monitoring device, the connection control unit 136 returns to step S606.
In step S612, the switching to the standby system is complete. If necessary, the connection control unit 136 swaps the roles of the standby system and the operating system. As described above, the connection control unit 136 ends the control of communication when communication is connected for all of the plurality of monitoring devices.
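The flow of steps S604 to S612 can be sketched roughly as follows. This is only an illustrative sketch, not the implementation of the embodiment: the function names, the dictionary that stands in for the connection information 138, the rule of picking the first eligible monitoring device, and the max_rounds safety limit are all assumptions introduced here.

```python
from typing import Callable, Dict


def _set_one(connection_info: Dict[str, bool], connected: bool) -> None:
    """Flip the first device that is not yet in the requested state (assumed rule)."""
    for name, state in connection_info.items():
        if state != connected:
            connection_info[name] = connected
            return


def run_connection_control(
    connection_info: Dict[str, bool],
    load_exceeds_threshold: Callable[[], bool],
    wait: Callable[[], None] = lambda: None,
    max_rounds: int = 1000,  # hypothetical safety limit, not in the source
) -> int:
    """Reconnect monitoring devices one at a time after a system switch.

    connection_info maps a device name to True (connected) or False
    (disconnected). Returns the number of S606-S611 rounds executed.
    """
    # S604: disconnect every monitoring device.
    for name in connection_info:
        connection_info[name] = False

    # S605a: the standby system is processable, so connect one device.
    _set_one(connection_info, connected=True)

    rounds = 0
    while rounds < max_rounds:
        rounds += 1
        if load_exceeds_threshold():                    # S606, S607
            _set_one(connection_info, connected=False)  # S608
        else:
            _set_one(connection_info, connected=True)   # S609
        wait()                                          # S610
        if all(connection_info.values()):               # S611
            break                                       # S612
    return rounds
```

In a real system, load_exceeds_threshold would wrap the comparison determination result from the load determination unit 135, and wait would sleep for the predetermined time of step S610.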

Also, if the comparison determination result sent from the load determination unit 135 is less than or equal to the threshold for a certain period of time, the connection control unit 136 may change one of the monitoring devices in the disconnected state to the connected state. That is, the connection control unit 136 may refer to the connection information 138, change one of the monitoring devices in the disconnection state to the connection state, and instruct the connection disconnection unit 131 to connect communication with that monitoring device.
Also, if the comparison determination result sent from the load determination unit 135 exceeds the threshold for a certain period of time, the connection control unit 136 may change one of the monitoring devices in the connection state to the disconnection state. That is, the connection control unit 136 may refer to the connection information 138, change one of the monitoring devices in the connection state to the disconnection state, and instruct the connection disconnection unit 131 to disconnect communication with that monitoring device.
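One possible way to judge that a comparison determination result has held "for a certain period of time" is to require the same result over a fixed number of consecutive samples. The sketch below makes that assumption; the class name, the window parameter, and the sample-based interpretation of the period are hypothetical and are not specified in the embodiment.

```python
from collections import deque


class SustainedLoadCheck:
    """Track whether the load stayed above or below the threshold
    for `window` consecutive comparison results (assumed interpretation)."""

    def __init__(self, window: int) -> None:
        if window < 1:
            raise ValueError("window must be at least 1")
        self.window = window
        # Only the most recent `window` results are kept.
        self.results = deque(maxlen=window)

    def record(self, exceeds_threshold: bool) -> None:
        """Store one comparison determination result."""
        self.results.append(exceeds_threshold)

    def sustained_below(self) -> bool:
        """True when every result in a full window was at or below the threshold."""
        return len(self.results) == self.window and not any(self.results)

    def sustained_above(self) -> bool:
        """True when every result in a full window exceeded the threshold."""
        return len(self.results) == self.window and all(self.results)
```

With this kind of check, a single brief spike or dip in load would not by itself trigger a connection or disconnection.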

The connection disconnection unit 131 connects or disconnects the communication between the monitoring device 203 and the multisystem monitoring system 119 according to the connection or disconnection instruction sent from the connection control unit 136. As a method of connection or disconnection, there is a method of changing a firewall policy to change permission or non-permission of communication, or changing a setting of VLAN (Virtual LAN).

*** Other configuration ***
<Modification 1>
The above-mentioned collection of fault information and load information may use the form of monitoring data or load data used in tools and systems such as commercially available server monitoring or network monitoring. In that case, each operation is similar to that of tools and systems such as server monitoring or network monitoring. In addition, when switching from the active system to the standby system, the active system may be shut down and terminated after switching is completed, after which the former standby system is treated as the active system and the former active system is treated as the standby system. In this case, the cause of the unresponsive active system needs to be eliminated by the shutdown.
Alternatively, a commercially available clustering tool may be used to acquire fault data of the multisystem monitoring system and switch the multisystem monitoring system. In that case, the failure detection and switching method is realized by the method possessed by the clustering tool.
Further, in the configuration of the multiplex system monitoring system, although the event aggregation unit and the incident management unit are divided in the present embodiment, these may be realized by one functional block. In addition, there may exist respective functions used in a general multiplex system monitoring system such as a configuration management unit, a correspondence history storage unit, a correspondence history search unit, an event correlation analysis unit, and an automatic failure handling unit. Even in this case, when a multiplexed part becomes unable to respond, the operation of switching to the standby system is the same as described above.

<Modification 2>
In the present embodiment, the function of the switching management device 130 is realized by software, but as a modification, the function of the switching management device 130 may be realized by hardware.

FIG. 7 is a diagram showing a configuration of a switching management device 130 according to a modification of the present embodiment.
The switching management device 130 includes an electronic circuit 909, a memory 921, an auxiliary storage device 922, an input interface 930, an output interface 940, and a communication device 950.

The electronic circuit 909 is a dedicated electronic circuit that implements the functions of the connection disconnection unit 131, the fault acquisition unit 132, the switching unit 133, the load acquisition unit 134, the load determination unit 135, and the connection control unit 136.
Specifically, the electronic circuit 909 is a single circuit, a complex circuit, a programmed processor, a parallel programmed processor, a logic IC, a GA, an ASIC, or an FPGA. GA is an abbreviation of Gate Array. ASIC is an abbreviation of Application Specific Integrated Circuit. FPGA is an abbreviation of Field-Programmable Gate Array.
The functions of the components of the switching management device 130 may be realized by one electronic circuit or may be realized by being distributed to a plurality of electronic circuits.
As another modification, some of the functions of the components of the switching management device 130 may be realized by an electronic circuit, and the remaining functions may be realized by software.

Each of the processor and the electronic circuit is also referred to as processing circuitry. That is, in the switching management device 130, the functions of the connection disconnection unit 131, the fault acquisition unit 132, the switching unit 133, the load acquisition unit 134, the load determination unit 135, and the connection control unit 136 are realized by processing circuitry.

In the switching management device 130, the "unit" of each of the connection disconnection unit 131, the failure acquisition unit 132, the switching unit 133, the load acquisition unit 134, the load determination unit 135, and the connection control unit 136 may be read as "process". In addition, the "processing" of each of the connection disconnection processing, failure acquisition processing, switching processing, load acquisition processing, load determination processing, and connection control processing may be read as "program", "program product", or "computer-readable recording medium on which a program is recorded".

*** Explanation of the effect of the present embodiment ***
As described above, according to the monitoring control system according to the present embodiment, it is possible to prevent the standby system from being overloaded when the operation system which has become incapable of responding due to an overload is switched to the standby system.
In the monitoring control system according to the present embodiment, when the operating system of the redundant monitoring unit becomes unresponsive due to overload and is switched to the standby system, communication with all the monitoring devices is first cut off. Therefore, the load on the standby system does not increase. Further, in the monitoring control system according to the present embodiment, if the load is equal to or less than the threshold value, the monitoring devices are connected one by one based on the connection information. If the threshold is exceeded, the monitoring devices are disconnected one by one. Therefore, it is possible to prevent the standby system load from exceeding the threshold and the standby system from becoming incapable of responding due to overload.
Regarding the failure of the network backbone device or the failure of the carrier line, which is the root cause of the large amount of failure information, the operator responds on the basis of the failure data received while the active system was operating normally and the failure data received after switching to the standby system. The failure can be dealt with in various ways, such as restarting the device, changing the setting of the device, contacting the carrier, and dispatching repair personnel to the site where the device is installed. Therefore, the root cause will be eliminated eventually. Once the root cause is eliminated, a large amount of failure information is no longer sent from the monitoring device side, and the load on the multi-system monitoring system is also reduced, so the number of connected monitoring devices increases and eventually communication with all the monitoring devices is connected.

In addition, failures such as a failure of a network backbone device or a carrier failure rarely occur. However, when one occurs, a large amount of fault data is generated, and in the multisystem monitoring system both the active system and the standby system after switching become overloaded. Even in such a case, according to the monitoring control system according to the present embodiment, switching to the standby system can be realized without preparing a resource corresponding to the maximum load.

Second Embodiment
In this embodiment, points different from the first embodiment will be described. In addition, the same reference signs may be assigned to configurations similar to those of Embodiment 1, and their description may be omitted.
In the present embodiment, the importance of each of the plurality of monitoring devices is set in the connection information 138a. The connection control unit 136 also controls communication based on the comparison determination result of the load and the threshold and the degree of importance.

The connection information 138a according to the present embodiment also manages the importance of each monitoring device.
FIG. 8 is a diagram showing connection information 138a according to the present embodiment. The connection information 138a according to the present embodiment has information of importance in addition to the contents of the connection information 138 according to the first embodiment. The degree of importance indicates the degree of importance of the monitoring device, and is represented by a real value of 0 to 1, with 1 being the most important. Specifically, regarding the monitoring device that monitors the main network device, the importance is considered to be high, so the importance is set high. Also, the monitoring contract with the customer may change the degree of importance.

FIG. 9 is a diagram showing an example of a user interface for setting connection information 138a according to the present embodiment. FIG. 10 is a diagram showing the user interface of FIG. 9 with the drop-down list displayed. This user interface is displayed on the display device via the output interface 940. In addition, the administrator of the system performs setting with the input device via the input interface 930. In the example of FIG. 9, there is a column for setting the degree of importance for each monitoring device. The column of importance is a drop-down list, and the "▽" part is selected with an input device such as a mouse. Then, as shown in FIG. 10, a list of importance values is displayed. The importance is set by selecting a value from the list with the input device. The system administrator sets the importance and then selects the “OK” button, whereby the setting is stored in the connection information 138a in the memory 921. If the "cancel" button is selected, the setting is not stored.
Although the number of monitoring device name columns is three in FIGS. 9 and 10, the number of columns may be more than that, or the number of columns may be changed.

FIG. 11 is a diagram showing another example of the user interface for setting the connection information 138a according to the present embodiment. In FIG. 11, an input device is used to move a label or icon indicating each monitoring device on the user interface. Then, by placing a label or icon indicating each monitoring device at a position indicating the level of importance, the degree of importance of each monitoring device is set.

Next, the operation will be described. Only differences from Embodiment 1 will be described.
When the connection control unit 136 receives from the switching unit 133 the processable notification 94 indicating that the standby system has been activated, the connection control unit 136 refers to the connection information 138a and changes the one monitoring device with the highest degree of importance among the monitoring devices in the disconnected state to the connected state. Then, the connection control unit 136 instructs the connection disconnection unit 131 to connect communication with that monitoring device.
If the comparison determination result sent from the load determination unit 135 is less than or equal to the threshold for a certain period of time, the connection control unit 136 refers to the connection information 138a and changes the one monitoring device with the highest importance among the monitoring devices in the disconnected state to the connected state. Then, the connection control unit 136 instructs the connection disconnection unit 131 to connect communication with that monitoring device. If the comparison determination result sent from the load determination unit 135 is greater than or equal to the threshold value, the connection control unit 136 refers to the connection information 138a and changes the one monitoring device with the lowest importance among the monitoring devices in the connected state to the disconnected state. Then, the connection control unit 136 instructs the connection disconnection unit 131 to disconnect communication with that monitoring device.
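The importance-based selection described above can be illustrated with a short sketch. The DeviceEntry type and the two function names are hypothetical stand-ins for a row of the connection information 138a and for the selection logic; how ties between equal importance values are resolved is not stated in the embodiment, so here the first matching entry is simply taken.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DeviceEntry:
    """One row of the connection information (assumed layout)."""
    name: str
    connected: bool
    importance: float  # real value from 0 to 1, with 1 the most important


def next_to_connect(entries: List[DeviceEntry]) -> Optional[DeviceEntry]:
    """Pick the disconnected device with the highest importance, if any."""
    candidates = [e for e in entries if not e.connected]
    return max(candidates, key=lambda e: e.importance, default=None)


def next_to_disconnect(entries: List[DeviceEntry]) -> Optional[DeviceEntry]:
    """Pick the connected device with the lowest importance, if any."""
    candidates = [e for e in entries if e.connected]
    return min(candidates, key=lambda e: e.importance, default=None)
```

A caller would set the connected flag of the returned entry and then instruct the connection disconnection unit 131 accordingly.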

As described above, according to the switching management device according to the present embodiment, after switching to the standby system, connection can be made from a monitoring device with a high degree of importance. As a result, it is possible to connect by priority the monitoring of the backbone network device and the device considered to be contract important with the customer. By giving priority to reviving the monitoring of the system configuration such as the backbone network device, it is possible to more quickly investigate the root cause of the large-scale failure occurrence. Moreover, it becomes possible to give priority to revival of monitoring of a system important on contract.

Third Embodiment
In this embodiment, points different from the first embodiment will be described. In addition, the same reference signs may be assigned to configurations similar to those of Embodiment 1, and their description may be omitted.

FIG. 12 is a diagram showing a configuration of a monitoring control system 500b according to the present embodiment.
The monitoring control system 500b according to the present embodiment includes a transfer unit 140 that transfers fault data 231 transmitted from each of the plurality of monitoring devices 203 to the multiplex monitoring system 119. The switching unit 133 transmits the switching notification 92 to the transfer unit 140. The load determination unit 135 transmits the comparison determination result to the transfer unit 140. The transfer unit 140 controls the number of transfers of the failure data 231 based on the switching notification 92 and the comparison determination result.

Hereinafter, only differences from FIG. 1 will be described.
The transfer unit 140 transfers the fault data 231 transmitted from the monitoring device 203. The transfer unit 140 includes a transfer control unit 141 and a transfer buffer 142. The transfer control unit 141 controls transfer of the fault data 231 from the monitoring device 203. The transfer buffer 142 temporarily stores the fault data 231.

FIG. 13 shows an example of the transfer setting information 143 of the fault data 231 according to the present embodiment. The transfer unit 140 has transfer setting information 143. Transfer setting information 143 is stored in advance in the transfer unit 140. In the transfer setting information 143, for each of Pull type transfer and Push type transfer, the transfer interval, the initial value of the transfer number, the number by which to reduce the transfer number, and the number by which to increase the transfer number are set. The number by which to reduce the transfer number is applied when the load on the standby system exceeds the threshold after switching occurs. The number by which to increase the transfer number is applied when the load on the standby system falls below the threshold.
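As a rough illustration, the four items set for each transfer type could be held in a small record like the one below. The field names are assumptions, and the numeric values are placeholders chosen only for this sketch; they are not the values shown in FIG. 13.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferSetting:
    """Settings for one transfer type (field names are hypothetical)."""
    interval_seconds: float  # transfer interval
    initial_count: int       # initial value of the transfer number
    decrease_step: int       # number by which to reduce the transfer number
    increase_step: int       # number by which to increase the transfer number

    def __post_init__(self) -> None:
        if self.interval_seconds <= 0:
            raise ValueError("interval_seconds must be positive")
        if min(self.initial_count, self.decrease_step, self.increase_step) < 0:
            raise ValueError("counts and steps must not be negative")


# Placeholder values for illustration only, not taken from FIG. 13.
TRANSFER_SETTINGS = {
    "pull": TransferSetting(interval_seconds=60.0, initial_count=100,
                            decrease_step=20, increase_step=10),
    "push": TransferSetting(interval_seconds=1.0, initial_count=50,
                            decrease_step=10, increase_step=5),
}
```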

When fault data 231 is sent from the monitoring device 203, there are several methods of sending the information depending on the type of the monitoring device 203 and the monitoring method. These methods fall into two main types.
One is Pull type transfer. In Pull type transfer, the monitoring device 203 has a database inside. When a failure occurs in the server 205 or a network device to be monitored, the monitoring device 203 detects the failure and temporarily stores failure data 231 in the database. The information is transferred to the multisystem monitoring system 119 by access to an API (Application Programming Interface) provided by the monitoring device 203 or to the database. The transfer protocol may be a protocol using a Web API, an access protocol provided by a database, or a protocol unique to the monitoring device. In this case, the multiple system monitoring system 119 takes out the failure data 231 by API or database access, and it is possible to specify the number of pieces of failure data 231 to be acquired at one time.
The other is Push type transfer. In Push type transfer, the monitoring device 203 does not have a database. The monitoring device 203 immediately transfers, to the multiple system monitoring system 119, failure data 231 of a failure that has occurred in the server 205 or network device to be monitored. As a transfer protocol, a protocol such as SNMP Trap is used. In this case, fault data 231 is transferred each time the monitoring device 203 detects a fault. Therefore, when a backbone network device fails, many servers and devices may become disconnected, and a large amount of failure data 231 may be transferred at one time.

Next, the operation will be described. Only differences from Embodiment 1 will be described.
The switching unit 133 transmits the switching notification 92 to the transfer unit 140 when system switching is performed.
Further, the load determination unit 135 transmits, to the transfer unit 140, the comparison determination result obtained by comparing and determining the load and the threshold.

First, the case of using pull type transfer will be described.
In an initial state, the transfer unit 140 transfers all failure data 231 detected by the monitoring device 203 at regular intervals to the multiplex monitoring system 119 via the connection disconnection unit 131.

When the transfer unit 140 receives the switching notification 92, the transfer control unit 141 reduces the number of pieces of failure data 231 acquired from the monitoring device 203 at regular intervals based on the transfer setting information 143. If the load determination unit 135 determines that the load has been equal to or less than the threshold value for a predetermined time, the transfer control unit 141 increases the number of pieces of failure data 231 acquired from the monitoring device 203 at regular intervals based on the transfer setting information 143. When it is determined that the load is equal to or higher than the threshold, the transfer control unit 141 reduces the number of pieces of fault data 231 acquired from the monitoring device 203 at regular intervals.
Specifically, the transfer control unit 141 controls the number of pieces of failure data 231 acquired at regular intervals as follows.
When the load exceeds the threshold value, the number of pieces of failure data 231 acquired at regular intervals is reduced by the Pull type overload transfer number reduction.
New number = Current number - Pull type overload transfer number reduction
When the load falls below the threshold, the number of pieces of failure data 231 acquired at regular intervals is increased by the Pull type non-overload transfer number increase.
New number = Current number + Pull type non-overload transfer number increase
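The two update rules for Pull type transfer amount to a single step function, sketched below. The function name and parameters are hypothetical, and the clamping to a minimum and an optional maximum is an assumption added here; the embodiment does not state lower or upper bounds on the number.

```python
from typing import Optional


def next_pull_count(
    current: int,
    load_exceeds_threshold: bool,
    decrease_step: int,
    increase_step: int,
    minimum: int = 0,               # assumed lower bound, not in the source
    maximum: Optional[int] = None,  # assumed optional upper bound
) -> int:
    """Return the number of pieces of failure data to acquire next interval."""
    if load_exceeds_threshold:
        # New number = Current number - overload reduction
        new_count = current - decrease_step
    else:
        # New number = Current number + non-overload increase
        new_count = current + increase_step
    new_count = max(minimum, new_count)
    if maximum is not None:
        new_count = min(maximum, new_count)
    return new_count
```

For example, with a current number of 100, a reduction of 20, and an increase of 10, an overloaded interval yields 80 and a non-overloaded interval yields 110.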

Next, the case of using Push type transfer will be described.
In the initial state, the transfer unit 140 immediately transfers all fault data 231 detected and notified by the monitoring device 203 to the multiplex monitoring system 119 via the connection disconnection unit 131.
When the transfer unit 140 receives the switching notification 92, the transfer control unit 141 temporarily stores all the failure data 231 sent from the monitoring device 203 in the transfer buffer 142. The transfer buffer 142 does not perform any processing on the fault data 231 and only temporarily stores it. Therefore, even if a large amount of fault data 231 is received at one time from the monitoring device 203 side, all fault data 231 can be stored in the transfer buffer 142 because the processing is light. If fault data 231 is stored in the transfer buffer 142, the transfer control unit 141 sends the fault data 231 to the multiplex monitoring system 119 in order from the oldest. At that time, based on the transfer setting information 143, the transfer control unit 141 limits the number of transfers in a fixed time to a fixed number. If the load determination unit 135 determines that the load has been equal to or less than the threshold for a predetermined time, the transfer control unit 141 increases the number of transfers at regular intervals for transfer of the fault data 231 from the transfer buffer 142. When it is determined that the load is equal to or higher than the threshold, the transfer control unit 141 reduces the number of transfers at regular intervals for transfer of the fault data 231 from the transfer buffer 142. The transfer control unit 141 determines the number of transfers at regular intervals based on the transfer setting information 143.
Specifically, the transfer control unit 141 controls the number of transfers as follows.
When the load exceeds the threshold value, the number of transfers of failure data 231 from the transfer buffer 142 at regular intervals is reduced by the Push type overload transfer number reduction.
New number of transfers = Current number of transfers - Push type overload transfer number reduction
When the load falls below the threshold, the number of transfers of the fault data 231 from the transfer buffer 142 at regular intervals is increased by the Push type non-overload transfer number increase.
New number of transfers = Current number of transfers + Push type non-overload transfer number increase
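The Push type behavior, in which fault data is only stored and then released oldest-first at a limited rate, might look like the following sketch. The class and method names are hypothetical, and flooring the rate at zero is an assumption; the embodiment does not state a lower bound.

```python
from collections import deque
from typing import Any, List


class TransferBuffer:
    """FIFO buffer that releases a limited number of items per interval."""

    def __init__(self, transfers_per_interval: int) -> None:
        self.queue = deque()
        self.transfers_per_interval = transfers_per_interval

    def store(self, fault_data: Any) -> None:
        """Store fault data without any other processing."""
        self.queue.append(fault_data)

    def adjust(self, load_exceeds_threshold: bool,
               decrease_step: int, increase_step: int) -> None:
        """Apply the Push type update rule to the per-interval limit."""
        if load_exceeds_threshold:
            # Assumed floor of zero so the limit never goes negative.
            self.transfers_per_interval = max(
                0, self.transfers_per_interval - decrease_step)
        else:
            self.transfers_per_interval += increase_step

    def drain(self) -> List[Any]:
        """Return the items to transfer in this interval, oldest first."""
        count = min(self.transfers_per_interval, len(self.queue))
        return [self.queue.popleft() for _ in range(count)]
```

A caller would invoke drain once per transfer interval and forward the returned items to the multiplex monitoring system 119.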

As described above, according to the monitoring control system 500b according to the present embodiment, it is possible to temporarily limit the number of transfer of failure data at the time of switching occurrence in the multiplex monitoring system. Further, according to the monitoring control system 500b according to the present embodiment, if the load is equal to or less than the threshold value, the number of transfers is gradually increased, so that the standby system of the multiplex monitoring system is not overloaded. Further, with regard to pull type fault data transfer, since fault data is stored in the database possessed by the monitoring device, there is no loss of fault data even during switching and after switching. For Push type fault data transfer, since the transfer buffer 142 temporarily stores fault data, there is no loss of fault data even at switching time and after switching.

FIG. 14 is a diagram showing another example of the transfer setting information 143 according to the present embodiment.
In the transfer setting information 143 of FIG. 13, the number of transfers is increased or decreased based on the determination as to whether the threshold value is exceeded. However, as shown in FIG. 14, the increase and decrease of the number of transfers may be controlled in stages according to the load. In FIG. 14, the horizontal axis represents the load level and the vertical axis represents the increase and decrease of the transfer number, and the curve is set so that the increase and decrease of the transfer number changes continuously according to the load level. By using the transfer setting information 143 of FIG. 14, it is possible to set the number of transfers according to the type of load such as the CPU usage rate and the IOPS. Further, it is also possible to define an overall load index in which a plurality of loads are combined, and to set the number of transfers accordingly.
When the transfer number is continuously changed as shown in FIG. 14, the transfer control unit 141 acquires the increase or decrease of the transfer number corresponding to the value of the load from the transfer setting information 143 of FIG. 14. Then, as in the case of using the transfer setting information of FIG. 13, the transfer control unit 141 calculates a new transfer number using that increase or decrease.
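One way to obtain an increase or decrease from a load value is to store the curve as a few points and interpolate linearly between them. This is only a sketch under that assumption: the function name, the point-list representation, and the example curve values are hypothetical and do not reproduce the curve of FIG. 14.

```python
from bisect import bisect_right
from typing import List, Tuple


def transfer_delta(load: float, curve: List[Tuple[float, int]]) -> int:
    """Look up the change in transfer number for a load level.

    curve is a list of (load_level, delta) points sorted by load_level.
    Values between points are linearly interpolated and rounded;
    values outside the range use the nearest end point.
    """
    levels = [point[0] for point in curve]
    if load <= levels[0]:
        return curve[0][1]
    if load >= levels[-1]:
        return curve[-1][1]
    i = bisect_right(levels, load)
    (x0, y0), (x1, y1) = curve[i - 1], curve[i]
    return round(y0 + (y1 - y0) * (load - x0) / (x1 - x0))


# Placeholder curve for illustration: load in percent, delta in transfers.
EXAMPLE_CURVE = [(0, 10), (50, 5), (80, 0), (100, -20)]
```

The new transfer number would then be the current number plus the returned value, in the same way as with the fixed steps of FIG. 13.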

In the first to third embodiments, each part of the supervisory control system has been described as an independent functional block. However, the configuration of the monitoring control system may not be the configuration as in the above-described embodiment. The functional blocks of the supervisory control system may have any configuration as long as the functions described in the above-described embodiment can be realized.

A plurality of parts in Embodiments 1 to 3 may be combined and implemented. Alternatively, one portion of these embodiments may be implemented. In addition, these embodiments may be implemented in any combination in whole or in part.
The embodiments described above are essentially preferable examples, and are not intended to limit the scope of the present invention, the scope of its applications, or the scope of its uses. The embodiments described above can be variously modified as needed.

91 failure information, 92 switching notification, 93 load information, 94 processable notification, 106 network, 107 FW, 110 event aggregation unit, 111 event aggregation (active system), 112 event aggregation (standby system), 113 event database, 114 incident management unit, 115 incident management (operating system), 116 incident management (standby system), 117 incident database, 118 failure display unit, 119 multiple system monitoring system, 121 operator, 130 switching management device, 131 connection disconnection unit, 132 failure acquisition unit, 133 switching unit, 134 load acquisition unit, 135 load determination unit, 136 connection control unit, 137 threshold information, 138, 138a connection information, 139 storage unit, 140 transfer unit, 141 transfer control unit, 142 transfer buffer, 143 transfer setting information, 191 monitoring unit, 192 active system monitoring unit, 193 standby system monitoring unit, 200 monitored system, 202 FW, 203 monitoring device, 204 SW, 205 server, 231 failure data, 500, 500b monitoring control system, 909 electronic circuit, 910 processor, 921 memory, 922 auxiliary storage device, 930 input interface, 940 output interface, 950 communication device, S130 switching management processing.

Claims

A switching management device that manages system switching of a multiplex system monitoring system, the multiplex system monitoring system comprising an operating system monitoring unit that executes monitoring processing on a plurality of monitoring devices and a standby system monitoring unit that executes the monitoring processing instead of the operating system monitoring unit when a failure occurs in the operating system monitoring unit, the switching management device comprising:
A switching unit that switches execution of the monitoring process from the operating system monitoring unit to the standby system monitoring unit upon acquiring failure information indicating that a failure has occurred in the operating system monitoring unit;
A connection control unit that disconnects communication between the multi-system monitoring system and each of the plurality of monitoring devices when acquiring from the switching unit a switching notification indicating that switching has been made from the operating system monitoring unit to the standby system monitoring unit; and
A load determination unit that acquires load information representing a load of the standby system monitoring unit that executes the monitoring process, and compares the load with a threshold,
wherein the connection control unit controls the communication between each of the plurality of monitoring devices and the standby system monitoring unit based on the comparison determination result of the load and the threshold.
The connection control unit
The switch management device according to claim 1, wherein, when the load is equal to or less than the threshold value, the communication is brought into a connected state for at least one of the plurality of monitoring devices.
The switching unit is
Outputting a processable notification indicating that the standby monitoring unit has become processable;
The connection control unit
The switch management device according to claim 2, wherein when the process enable notification is acquired from the switching unit, the communication is made in a connected state with respect to at least one of the plurality of monitoring devices.
The connection control unit
The switch management device according to claim 3, wherein when the load exceeds the threshold, the communication is disconnected for at least one of the plurality of monitoring devices.
The connection control unit
5. The switch management device according to claim 4, wherein control of the communication is ended when the communication is in a connected state for all of the plurality of monitoring devices.
The switching management device according to any one of claims 1 to 5, further comprising a storage unit that stores connection information in which the state of the communication between each of the plurality of monitoring devices and the standby system monitoring unit is set, wherein the connection control unit controls the communication using the connection information.
The switching management device according to claim 6, wherein an importance of each of the plurality of monitoring devices is set in the connection information, and the connection control unit controls the communication based on the importance and on the result of comparing the load with the threshold.
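One way to read claims 6 and 7 together is as a table of connection information that is consulted each time the load is compared with the threshold. The sketch below assumes a particular policy that the claims themselves do not specify: the most important disconnected device is connected first, and the least important connected device is disconnected first. The names `ConnectionEntry` and `adjust`, and the convention that a larger number means higher importance, are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConnectionEntry:
    """One row of the (hypothetical) connection information table."""
    device_id: str
    importance: int          # assumed convention: larger is more important
    connected: bool = False  # state of communication with the standby unit


def adjust(entries: list[ConnectionEntry], load: float,
           threshold: float) -> Optional[str]:
    """Change the state of at most one device; return its id, or None."""
    if load <= threshold:
        # Load has headroom: connect the most important waiting device.
        candidates = [e for e in entries if not e.connected]
        if not candidates:
            return None
        target = max(candidates, key=lambda e: e.importance)
        target.connected = True
    else:
        # Load is too high: disconnect the least important connected device.
        candidates = [e for e in entries if e.connected]
        if not candidates:
            return None
        target = min(candidates, key=lambda e: e.importance)
        target.connected = False
    return target.device_id
```

Keeping the state and the importance in the same record means a single lookup decides both which devices are eligible and which one to act on.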
A monitoring control system comprising:
a plurality of monitoring devices;
a multi-system monitoring system including an operation system monitoring unit that executes a monitoring process on the plurality of monitoring devices, and a standby system monitoring unit that executes the monitoring process in place of the operation system monitoring unit when a failure occurs in the operation system monitoring unit; and
a switching management device that manages system switching of the multi-system monitoring system,
wherein the switching management device comprises:
a switching unit that, upon acquiring failure information indicating that a failure has occurred in the operation system monitoring unit, switches execution of the monitoring process from the operation system monitoring unit to the standby system monitoring unit;
a connection control unit that, upon acquiring from the switching unit a switching notification indicating that switching from the operation system monitoring unit to the standby system monitoring unit has been performed, disconnects communication between the multi-system monitoring system and each of the plurality of monitoring devices; and
a load determination unit that acquires load information representing a load on the standby system monitoring unit executing the monitoring process, and compares the load with a threshold,
and wherein the connection control unit controls the communication between each of the plurality of monitoring devices and the standby system monitoring unit based on a result of comparing the load with the threshold.
The monitoring control system according to claim 8, further comprising a transfer unit that transfers failure data transmitted from each of the plurality of monitoring devices to the multi-system monitoring system, wherein the switching unit transmits the switching notification to the transfer unit, the load determination unit transmits the result of the comparison to the transfer unit, and the transfer unit controls the number of transfers of the failure data based on the switching notification and the result of the comparison.
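The transfer unit of claim 9 can be thought of as a queue whose drain rate follows the two signals it receives. The following Python sketch assumes one possible policy: transfers are halted on the switching notification, then the number of items transferred per cycle is raised by one when the load is within the threshold and lowered by one when it is not. The class `TransferUnit`, its methods, and the `max_batch` parameter are hypothetical names chosen for illustration.

```python
from collections import deque


class TransferUnit:
    """Hypothetical transfer unit that throttles forwarding of failure data."""

    def __init__(self, max_batch: int = 8) -> None:
        self.queue: deque = deque()
        self.max_batch = max_batch
        self.batch = max_batch  # items forwarded per transfer cycle

    def receive(self, failure_data: str) -> None:
        # Failure data sent by a monitoring device is buffered first.
        self.queue.append(failure_data)

    def on_switching_notification(self) -> None:
        # Assumed policy: stop forwarding while the standby unit takes over.
        self.batch = 0

    def on_comparison_result(self, load_within_threshold: bool) -> None:
        # Assumed policy: step the transfer count up or down by one.
        if load_within_threshold:
            self.batch = min(self.max_batch, self.batch + 1)
        else:
            self.batch = max(0, self.batch - 1)

    def transfer(self) -> list:
        # Forward at most `batch` buffered items to the monitoring system.
        count = min(self.batch, len(self.queue))
        return [self.queue.popleft() for _ in range(count)]
```

Buffering in a queue means that failure data arriving during the switchover is delayed rather than lost, which is the property this sketch is meant to show.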
A switching management method of a switching management device that manages system switching of a multi-system monitoring system including an operation system monitoring unit that executes a monitoring process on a plurality of monitoring devices, and a standby system monitoring unit that executes the monitoring process in place of the operation system monitoring unit when a failure occurs in the operation system monitoring unit, the method comprising:
switching, by a switching unit, execution of the monitoring process from the operation system monitoring unit to the standby system monitoring unit upon acquiring failure information indicating that a failure has occurred in the operation system monitoring unit;
disconnecting, by a connection control unit, communication between the multi-system monitoring system and each of the plurality of monitoring devices upon acquiring from the switching unit a switching notification indicating that switching from the operation system monitoring unit to the standby system monitoring unit has been performed;
acquiring, by a load determination unit, load information representing a load on the standby system monitoring unit executing the monitoring process, and comparing the load with a threshold; and
controlling, by the connection control unit, the communication between each of the plurality of monitoring devices and the standby system monitoring unit based on a result of comparing the load with the threshold.
A switching management program of a switching management device that manages system switching of a multi-system monitoring system including an operation system monitoring unit that executes a monitoring process on a plurality of monitoring devices, and a standby system monitoring unit that executes the monitoring process in place of the operation system monitoring unit when a failure occurs in the operation system monitoring unit, the program causing the switching management device, which is a computer, to execute:
a switching process of switching execution of the monitoring process from the operation system monitoring unit to the standby system monitoring unit upon acquiring failure information indicating that a failure has occurred in the operation system monitoring unit;
a disconnection process of disconnecting communication between the multi-system monitoring system and each of the plurality of monitoring devices upon acquiring, from the switching process, a switching notification indicating that switching from the operation system monitoring unit to the standby system monitoring unit has been performed;
a load determination process of acquiring load information representing a load on the standby system monitoring unit executing the monitoring process, and comparing the load with a threshold; and
a communication control process of controlling the communication between each of the plurality of monitoring devices and the standby system monitoring unit based on a result of comparing the load with the threshold.