WO2017094168A1

WO2017094168A1 - Mutual supervisory system, failure notification method in mutual supervisory system, and non-transitory storage medium in which program for exerting failure notification function is recorded

Info

Publication number: WO2017094168A1
Application number: PCT/JP2015/084035
Authority: WO
Inventors: 里司藤恵; 信之雜賀
Original assignee: 株式会社日立製作所
Priority date: 2015-12-03
Filing date: 2015-12-03
Publication date: 2017-06-08

Abstract

[Problem] To provide a mutual supervisory system with a core edge structure capable of reinforcing a notification function of a failure caused in any one of a plurality of edges, a failure notification method in the mutual supervisory system, and a non-transitory storage medium in which a program for exerting a failure notification function is recorded. [Solution] A plurality of edges each has a failure information holding unit capable of holding own failure information based on a performance history of an own edge and another piece of failure information based on a performance history of another edge, notifies a core of the own failure information relating to the own edge in the failure information holding unit, and updates and reflects all pieces of failure information on the basis of the own failure information.

Description

Mutual monitoring system, fault notification method in the mutual monitoring system, and non-transitory storage medium in which a program for performing the fault notification function is recorded

The present invention relates to a mutual monitoring system, a failure notification method in the mutual monitoring system, and a non-transitory storage medium in which a program for performing a failure notification function is recorded, and is particularly suitable for application to a system having a so-called core edge configuration.

A conventional system having a core edge configuration has a plurality of NAS heads (edges) connected to a core in which a shared resource is stored, and the shared resource includes a management table used for exclusive control of access to the shared resource. In this management table, identifiers that identify the other NAS heads monitoring each NAS head are registered, together with a count value representing the number of times each NAS head has been monitored by the other NAS head corresponding to the identifier.

In such a conventional core edge configuration system, whether or not a failure has occurred in each NAS head is determined based on the count value indicating the number of times of monitoring, and when a failure is detected in a certain NAS head, monitoring is continued by changing the monitoring partner to another NAS head (see Patent Document 1).

Patent Document 1: JP 2005-141672 A

However, the conventional core edge configuration system does not take delays in failure notification into account. Therefore, if a failure notification is delayed for some reason, the failure cannot be reliably notified.

The present invention has been made in consideration of the above points, and proposes a mutual monitoring system capable of reliably sharing, without delay, information on a failure occurring in any one of a plurality of edges with the other edges, a failure notification method in the mutual monitoring system, and a non-transitory storage medium in which a program for performing a failure notification function is recorded.

In order to solve such a problem, the present invention provides a mutual monitoring system having a core edge configuration including a core in which a shared resource is stored and a plurality of edges that access the shared resource. The core includes a core side fault information management unit that manages, as the shared resource, all fault information at the plurality of edges. Each of the plurality of edges includes: a fault information holding unit capable of holding own fault information based on the performance history of its own edge together with other fault information based on the performance history of other edges; a self-fault information reporting unit that notifies the core of the own fault information related to its own edge in the fault information holding unit, so that the all fault information is updated to reflect the own fault information; and an other fault information acquisition unit that, when the other fault information related to other edges among the all fault information in the core side fault information management unit is newer than the other fault information already held, updates the other fault information held in the fault information holding unit using the other fault information acquired from the all fault information. The core further includes a contention control unit that, when a specific edge of the plurality of edges attempts update access to the all fault information while another edge of the plurality of edges is making update access to the all fault information, does not restrict the update access by the specific edge but instead temporarily stores the request content of that update access separately as unreflected data, and updates the all fault information based on the unreflected data when the update access by the other edge is completed.

Further, the present invention provides a failure notification method in a mutual monitoring system having a core edge configuration including a core in which a shared resource is stored and a plurality of edges that access the shared resource. The method includes: a self-failure information reporting step in which, at each of the plurality of edges, a self-failure information reporting unit notifies the core of the own failure information related to its own edge held in a failure information holding unit, which is capable of holding own failure information based on the performance history of its own edge together with other failure information based on the performance history of other edges, so that all failure information is updated to reflect the own failure information; an other failure information acquisition step in which an other failure information acquisition unit, when the other failure information related to other edges among all the failure information at the plurality of edges, managed as the shared resource by a core side failure information management unit of the core, is newer than the other failure information already held, updates the other failure information held in the failure information holding unit using the other failure information acquired from all the failure information; and a contention control step in which, at the core, a contention control unit, when a specific edge of the plurality of edges attempts update access to all the failure information while another edge of the plurality of edges is making update access to all the failure information, does not restrict the update access by the specific edge but instead temporarily stores the request content of that update access separately as unreflected data, and updates all the failure information based on the unreflected data when the update access by the other edge is completed.

Further, the present invention provides a non-transitory storage medium in which is recorded a program for performing a failure notification function in a mutual monitoring system having a core edge configuration including a core in which a shared resource is stored and a plurality of edges that access the shared resource. The program causes a computer at each of the plurality of edges to execute: a self-failure information reporting step of notifying the core of the own failure information related to its own edge held in a failure information holding unit, which is capable of holding own failure information based on the performance history of its own edge together with other failure information based on the performance history of other edges, so that all failure information is updated to reflect the own failure information; and an other failure information acquisition step of, when the other failure information related to other edges among all the failure information at the plurality of edges, managed as the shared resource by a core side failure information management unit of the core, is newer than the other failure information already held, updating the other failure information held in the failure information holding unit using the other failure information acquired from all the failure information. The program further causes a computer at the core to execute a contention control step of, when a specific edge of the plurality of edges attempts update access to all the failure information while another edge of the plurality of edges is making update access to all the failure information, not restricting the update access by the specific edge but instead temporarily storing the request content of that update access separately as unreflected data, and updating all the failure information based on the unreflected data when the update access by the other edge is completed.

According to the present invention, it is possible to reliably share information on a failure that has occurred at any one of a plurality of edges without delay even at other edges.

FIG. 1 is a block diagram showing an example of the hardware configuration of the mutual monitoring system according to the present embodiment.
FIG. 2 is a block diagram showing an example of the software configuration of the mutual monitoring system according to the present embodiment.
FIG. 3 is a diagram showing an example of the contents of a status file.
FIG. 4 is a diagram showing an example of the format and contents of the inode management table of a status file.
FIG. 5 is a diagram showing an example of the format and contents of the node management table containing the information of each node.
FIG. 6 is a diagram showing an example of the format and contents of a notification destination management table.
FIG. 7 is a flowchart showing an example of the failure notification method in the mutual monitoring system.
FIG. 8 is a flowchart showing an example of the failure notification method in the mutual monitoring system.
FIG. 9 is a flowchart showing an example of the failure notification method in the mutual monitoring system.
FIG. 10 is a diagram showing an example of the directory structure on the archive device on the core side.
FIG. 11 is a diagram showing an outline of the merge process when contention occurs.
FIG. 12 is a diagram showing an outline of the merge process when contention occurs.
FIG. 13 is a diagram showing an outline of the notification process of failure information by e-mail.

Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings.
(1) Concept of the present embodiment In the present embodiment, in a core edge configuration mutual monitoring system having a core in which a shared resource is stored and a plurality of edges that access the shared resource, the core has a core side fault information management unit that manages, as the shared resource, all fault information at the plurality of edges. On the other hand, each of the plurality of edges has a failure information holding unit capable of holding other failure information based on the performance history of other edges together with its own failure information based on the performance history of its own edge, and a self-failure information report unit that notifies the core of its own failure information regarding its own edge so that all fault information is updated to reflect it. In such a configuration, each of the plurality of edges further includes an other failure information obtaining unit: when the other failure information related to other edges in the all fault information held by the core side fault information management unit is newer than the other failure information already held, this unit updates the other failure information held in the failure information holding unit using the other failure information obtained from the all fault information. The core further has a contention control unit: when a specific edge of the plurality of edges attempts update access to all fault information while another edge of the plurality of edges is making update access to it, the contention control unit does not restrict the update access by the specific edge but instead temporarily stores the update access request separately as unreflected data, and updates all fault information on the basis of the unreflected data when the update access by the other edge is completed.
The concept will be specifically described below with reference to the drawings.
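The "update only if newer" behavior of the other failure information obtaining unit can be illustrated with a minimal Python sketch. The class names, the integer `updated_at` field, and the record layout are assumptions made for illustration; the publication does not specify how freshness is compared.

```python
from dataclasses import dataclass, field

@dataclass
class FailureRecord:
    edge: str         # edge that the record describes, e.g. "HDI-B"
    updated_at: int   # hypothetical timestamp; larger means newer
    detail: str

@dataclass
class FailureInfoHolder:
    own_edge: str
    records: dict = field(default_factory=dict)  # edge name -> FailureRecord

    def refresh_from_core(self, all_failure_info):
        """Adopt records about other edges only when the core's copy is newer."""
        updated = []
        for edge, record in all_failure_info.items():
            if edge == self.own_edge:
                continue  # own failure information is reported, not fetched
            held = self.records.get(edge)
            if held is None or record.updated_at > held.updated_at:
                self.records[edge] = record
                updated.append(edge)
        return updated

holder = FailureInfoHolder("HDI-A", {"HDI-B": FailureRecord("HDI-B", 1, "old")})
core_view = {
    "HDI-A": FailureRecord("HDI-A", 5, "own"),
    "HDI-B": FailureRecord("HDI-B", 2, "new"),
    "HDI-C": FailureRecord("HDI-C", 1, "first"),
}
refreshed = holder.refresh_from_core(core_view)
```

In this sketch the record for the edge's own node is skipped, since own failure information flows in the opposite direction (from the edge to the core).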

(1-1) Configuration of Mutual Monitoring System According to this Embodiment FIG. 1 shows an example of the hardware configuration of a mutual monitoring system 1 according to this embodiment, and FIG. 2 shows an example of the software configuration of the mutual monitoring system 1 according to this embodiment. This software configuration example represents one form of a program that causes a computer to exhibit the failure notification function described later and to function as the mutual monitoring system 1 of the present embodiment. This program may be stored in a non-transitory storage medium that can be read by a computer. Hereinafter, a configuration for exerting the failure notification function in the mutual monitoring system 1 will be described.

The mutual monitoring system 1 of FIG. 1 has a configuration in which a core 200 and a plurality of edges 100A, 100B, and 100C (corresponding to “edge A”, “edge B”, and “edge C” shown in the figure) are connected by a network 301. The plurality of edges 100A, 100B, and 100C have substantially the same configuration in that each has computers installed at the respective user's site.

The core 200 includes an archive device 210 (hereinafter also referred to as “HCP”) and a RAID device 220. The archive device 210 includes a CPU (Central Processing Unit) 211, a memory 212, a NIC (Network Interface Card) 213, an HBA (Host Bus Adapter) 214, and a built-in disk device 215.

The CPU 211 uses the memory 212, in which the inode management table 330 of FIG. 4C (described in detail later) is stored, as a work area, and runs a kernel/device driver 219, a file system program 218, and an archiving program 217 that performs an archive function for specified files. The NIC 213 is an interface card with the network 301, and the HBA 214 functions as an interface with the RAID device 220. The core 200 exchanges data with the plurality of edges 100A and 100B via the network 301.

The RAID device 220 has a function of storing data from the archive device 210 and providing stored data in response to requests. The RAID device 130 has the same configuration as the RAID device 220. In the RAID device 220, the CPU 223 uses the memory 222 as a work area and runs an RTOS (Real Time Operating System) 229 and a RAID control program 228. The RAID control program 228 increases the reliability of data reading and writing by performing RAID control using the disk devices 225 and 226.

The RAID device 220 receives data when the archive device 210 issues a write command and, for example, stores and manages the data in a distributed manner across both the disk device 225 and the disk device 226 under RAID control. When a failure occurs in either of the disk devices 225 and 226, the RAID device 220 continues processing using the data of the disk device in which no failure has occurred.

The RAID device 220 generally includes a channel adapter 221, a memory 222, a CPU 223, a controller 224, and disk devices 225 and 226. The plurality of disk devices 225 and 226 constitute a so-called RAID (Redundant Array of Independent Disks).

The channel adapter 221 handles command reception and other processing under the control of the various programs that the CPU 223 runs using the storage area of the memory 222. The controller 224 controls writing and reading of data using the disk devices 225 and 226.

On the other hand, the plurality of edges 100A and 100B have substantially the same configuration. For this reason, in this embodiment, the edge 100A will be described as a representative. The edge 100A includes a plurality of client devices 110, a file storage device 120, a RAID device 130, a mail server 140, and an analysis server 150. These have a common configuration in that they are computers. The RAID device 130 has the same function as the RAID device 220 described above with respect to storing data and providing stored data, and thus its description is omitted.

The file storage device 120 includes a CPU (Central Processing Unit) 121, a memory 122, a NIC (Network Interface Card) 123, a built-in disk device 124 and an HBA (Host Bus Adapter) 125.

The CPU 121 uses the memory 122, in which the inode management table 310 shown in FIG. 4A (described later) is stored, as a work area, and runs a file sharing system 126, a kernel/device driver 129, a file system program 127, and a data mover program 128 that moves specified files. The NIC 123 is an interface card with the network 301, and the HBA 125 functions as an interface with peripheral computers such as the client device 110. The edge 100A exchanges data with the core 200 and the edges 100B and 100C via the network 301.

In the present embodiment, when it is necessary to distinguish the file storage device 120 or the like for each of the plurality of edges A to C, a suffix is appended to the reference sign, as in file storage devices 120A, 120B, and 120C, so that the corresponding edge can be identified. This also applies to other computers such as the mail server 140. In addition, in order to distinguish the file storage devices 120 of the plurality of edges 100A, 100B, and 100C, identifiers such as “HDI-A”, “HDI-B”, and “HDI-C” are also used, for example, as shown in FIGS. 3, 5, and 6 described later.

The plurality of client devices 110, the mail server 140, and the analysis server 150 are alike in that they are basically computer-based devices, although their processing speeds differ, and they differ from one another in that each is specialized for a specific function.

Each of the plurality of client devices 110 includes a CPU 111, a memory 112, a NIC 113, and a built-in disk device 114, and issues a data write request to the file storage device 120 (120A) to write data, or a data read request to read stored data. The CPU 111 runs an OS (Operating System) 119 and an application/Web browser 118 using the memory 112 as a work area, and stores data in and reads stored data from the built-in disk device 114 as necessary. The NIC 113 functions as an interface with the file storage apparatus 120.

The mail server 140 includes a CPU 141, a memory 142, a NIC 143, and a built-in disk device 144. The CPU 141 uses the memory 142 as a work area, runs an OS 149 and a mail server program 148 that relays transmission and reception of e-mails, and stores e-mail data in and reads stored e-mail data from the built-in disk device 144 as necessary. The NIC 143 functions as an interface with the file storage device 120 and the analysis server 150.

The analysis server 150 includes a CPU 151, a memory 152, a NIC 153, and a built-in disk device 154. The CPU 151 uses the memory 152 as a work area and runs an OS 159 and an analysis program 158 that analyzes whether or not a failure has occurred at an edge such as a NAS head, and, as necessary, stores analysis result data in the built-in disk device 154 or reads stored analysis result data from it. The analysis result data here corresponds to monitoring information including a measured value for each part and a failure determination threshold ratio, as shown in FIG. 3 described later. The NIC 153 functions as an interface with the mail server 140.

(1-2) Status File Format and Its Contents FIG. 3 shows an example of the status file 300 format and its contents. The status file 300 manages monitoring information and error information for each date and time and each edge host name ("HDI-A" and "HDI-B" in the figure).

The monitoring information includes measured values for each part and a failure determination threshold ratio, and covers, for example, power supply voltage and fan temperature. On the other hand, the error information includes, for each message ID that identifies the failure content, the importance of the failure content and the notification status (corresponding to a notification status flag described later). The importance levels are ordered, for example, as “F (Fatal)” > “E (Error)” > “W (Warning)” > “I (Information)”. The notification status flag is set to “done” when notification by e-mail using the e-mail notification function described later has been completed, and to “not yet” when such notification has not been completed.
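As an illustration of the error information described above, the following Python sketch models the importance ordering and the notification status flag. The field names, the boolean encoding of “done” / “not yet”, and the sample message IDs are assumptions for illustration only and do not reproduce the actual layout of FIG. 3.

```python
from dataclasses import dataclass
from enum import IntEnum

class Severity(IntEnum):
    """Importance levels; a larger value is more important (F > E > W > I)."""
    I = 1  # Information
    W = 2  # Warning
    E = 3  # Error
    F = 4  # Fatal

@dataclass
class ErrorEntry:
    host: str            # edge host name, e.g. "HDI-A"
    message_id: str      # hypothetical message ID
    severity: Severity
    notified: bool = False  # False stands for "not yet", True for "done"

def unreported(entries):
    """Return entries still marked "not yet", most important first."""
    pending = [e for e in entries if not e.notified]
    return sorted(pending, key=lambda e: e.severity, reverse=True)

entries = [
    ErrorEntry("HDI-A", "MSG-001", Severity.W),
    ErrorEntry("HDI-B", "MSG-002", Severity.F),
    ErrorEntry("HDI-A", "MSG-003", Severity.E, notified=True),
]
pending_ids = [e.message_id for e in unreported(entries)]
```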

(1-3) Format and Contents of Inode Management Table FIGS. 4A to 4C are diagrams showing examples of the formats and contents of the inode management tables 310, 320, and 330 of the status file, respectively. The inode management table 330 shown in FIG. 4C is prepared on the core 200 side. In addition, the inode management table 310 shown in FIG. 4A is prepared in the file storage apparatus 120 (120A) at the edge 100A, and the inode management table 320 shown in FIG. 4B is prepared in the file storage apparatus 120 (120B) at the edge 100B.

Specifically, on the core 200 side, the Inode management table 330 is stored in the memory 212 of the archive device 210. On the other hand, on the edge 100A side, the Inode management table 310 is stored in the memory 122 of the file storage device 120, and on the edge 100B side, the Inode management table 320 is stored in a memory (not shown) of a file storage device (not shown; hereinafter “file storage device 120B”).

Each Inode management table includes, for each Inode number assigned to each node to identify each edge, a file path name, a version (corresponding to “Ver” in the figure), a last update date/time, a stub flag, a storage destination URL, and three block addresses. The file path name represents the storage location of the status file. A file is divided into a plurality of blocks, and each block address is the head address of one of those blocks. In FIG. 4A and FIG. 4B, since the entries are stubs, the block addresses are set to, for example, NULL.

The storage destination URL of the Inode management table 310 shown in FIG. 4A and the storage destination URL of the Inode management table 320 shown in FIG. 4B correspond to the file path name in the Inode management table 330 shown in FIG. 4C. Note that the status file described above is managed on the file system under the file name “status.txt” shown in the figure.
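The relationship between an edge-side stub entry and the core-side entry can be sketched as follows. This is only an illustration: the field names, the URL format (here the URL path is assumed to equal the core-side file path name), and every sample value are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple
from urllib.parse import urlparse

@dataclass
class InodeEntry:
    inode: int
    path: str                    # file path name of the status file
    version: int                 # "Ver" column
    last_updated: str            # last update date/time
    is_stub: bool                # stub flag
    storage_url: Optional[str]   # storage destination URL (assumed unused on the core)
    blocks: Tuple[Optional[int], Optional[int], Optional[int]]  # three block addresses

def find_core_entry(stub, core_table):
    """Return the core entry whose file path matches the stub's storage URL."""
    target = urlparse(stub.storage_url).path
    for entry in core_table:
        if entry.path == target:
            return entry
    return None

# Edge-side stub: block addresses are NULL (None) because the data lives on the core.
edge_stub = InodeEntry(100, "/edge/status.txt", 0, "2015-12-03 10:00", True,
                       "http://core.example/coreH/status.txt", (None, None, None))
core_table = [InodeEntry(1, "/coreH/status.txt", 0, "2015-12-03 10:00", False,
                         None, (0x1000, 0x2000, 0x3000))]
match = find_core_entry(edge_stub, core_table)
```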

(1-4) Information on Each Node FIG. 5 shows an example of the format and contents of the node management table 400 including information on each node. The node management table 400 has a transfer schedule including a startup time and a transfer interval for each host name at each edge, and a failure notification function. The failure notification function in each node indicates whether or not each node (edge) has a notification function by e-mail, as will be described in detail later.

The activation time represents the elapsed time since each node was activated, and the transfer interval represents the interval at which the status file is transferred from each node toward the core 200.

(1-5) Format and Contents of Notification Destination Management Table FIG. 6 shows an example of the format and contents of the notification destination management table 500. The notification destination management table 500 manages an e-mail address as an example of a contact and a notification timing to the contact for each host name at each edge.

(2) Specific Procedure of Failure Notification Method in Mutual Monitoring System 1 The mutual monitoring system 1 has the above-described configuration. Next, a failure notification method in the mutual monitoring system 1 will be described.

FIGS. 7 to 9 each show an example of the failure notification method in the mutual monitoring system 1. FIG. 7 shows an example of the failure monitoring process, FIG. 8 shows an example of the first half of the status file merge process shown in FIG. 7, and FIG. 9 shows an example of the second half of the status file merge process shown in FIG. 8.

(2-1) Failure Monitoring Process First, at each edge (here, the edge 100A is mainly used as a representative example), the analysis server 150 determines whether or not a failure has occurred in its own node (step S1). If no failure has occurred in its own node, the analysis server 150 determines, for example, whether a periodic or manual transfer has been executed (step S2). When the transfer has not been executed, step S1 is executed again, whereas when the transfer has been executed, a status file merge process described later is executed (step S100).

If a failure has occurred in its own node, the analysis server 150 determines whether or not the e-mail notification function is ON (step S3). If the e-mail notification function is ON, the mail server 140 executes e-mail notification to the notification destination registered in the notification destination management table 500 (step S4).

Next, the analysis server 150 updates the status file (step S5). When e-mail notification is executed, the notification status flag is updated from “not yet” to “done”. Next, the analysis server 150 executes a status file merge process (step S100).
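The control flow of steps S1 to S5 and S100 can be sketched in Python as follows. This is a minimal illustration, assuming a dictionary-based node record and caller-supplied `send_mail` and `merge` callables; none of these names appear in the publication.

```python
def monitor_once(node, failure, transfer_requested, notify_table, send_mail, merge):
    """One pass of the monitoring flow of FIG. 7; returns the steps taken."""
    steps = ["S1"]                      # S1: has a failure occurred in the own node?
    if failure is None:
        steps.append("S2")              # S2: was a periodic or manual transfer executed?
        if transfer_requested:
            merge()
            steps.append("S100")        # S100: status file merge process
        return steps
    steps.append("S3")                  # S3: is the e-mail notification function ON?
    flag = "not yet"
    if node["mail_enabled"]:
        send_mail(notify_table[node["host"]], failure)
        steps.append("S4")              # S4: notify the registered destination
        flag = "done"
    node["status_file"].append({"message": failure, "notified": flag})
    steps.append("S5")                  # S5: update the status file
    merge()
    steps.append("S100")
    return steps

sent, merges = [], []
node = {"host": "HDI-A", "mail_enabled": True, "status_file": []}
table = {"HDI-A": "admin@example.com"}   # hypothetical notification destination
steps = monitor_once(node, "fan temperature high", False, table,
                     lambda to, msg: sent.append((to, msg)),
                     lambda: merges.append(1))
```

When the e-mail notification function is OFF, the entry is written with the flag “not yet”, which leaves it to be picked up as unreported failure information during the merge process.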

(2-2) Status File Merge Process FIG. 8 shows an example of the status file merge process executed at each edge. In the present embodiment, this status file merging process is executed, for example, periodically or when manual transfer is executed, or when a failure occurs.

At the edge 100A, the file storage device 120A (HDI-A) acquires a status file and unreflected data from the core 200 (step S101). Next, the file storage device 120A (HDI-A) merges the unreflected data into the status file acquired from the core 200 (step S102). At the same time, the file storage apparatus 120A (HDI-A) additionally writes the contents of the version that is not reflected.

The file storage apparatus 120 merges the failure information (corresponding to the status file) of its own node into the status file acquired from the core 200 (step S103), and determines whether or not unreported failure information exists (step S104).

When unreported failure information exists, the file storage apparatus 120 acquires an e-mail address as an example of a notification destination from the notification destination management table 500 (step S105). Next, the file storage device 120 sends an e-mail notification regarding the unreported failure information, and changes the notification status flag of the status file to “done” (step S106).

Next, the file storage device 120 transmits the monitoring information of each edge to the analysis server 150 (step S107). The monitoring information includes, for example, the measured values for each part and the failure determination threshold ratio. Next, since the merging has been completed as described above, the file storage device 120 deletes the unreflected data acquired from the core 200 (step S108).
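Steps S102 to S106 can be sketched as follows. The entry layout, the merge key `(host, message_id)`, and the rule that a “done” flag takes precedence over “not yet” are assumptions made for this illustration; the publication does not state how duplicate entries are reconciled.

```python
def merge_entries(*sources):
    """Union of error entries keyed by (host, message_id).

    Assumption: when the same entry appears twice, a "done" flag wins over "not yet".
    """
    merged = {}
    for source in sources:
        for entry in source:
            key = (entry["host"], entry["message_id"])
            held = merged.get(key)
            if held is None or (held["notified"] == "not yet"
                                and entry["notified"] == "done"):
                merged[key] = dict(entry)
    return list(merged.values())

def notify_unreported(entries, addresses, send_mail):
    """S104-S106: mail every "not yet" entry and flip its flag to "done"."""
    sent = []
    for entry in entries:
        if entry["notified"] == "not yet":
            send_mail(addresses[entry["host"]], entry)
            entry["notified"] = "done"
            sent.append(entry["message_id"])
    return sent

from_core = [{"host": "HDI-B", "message_id": "M1", "notified": "not yet"}]
unreflected = [{"host": "HDI-B", "message_id": "M1", "notified": "done"}]
own = [{"host": "HDI-A", "message_id": "M2", "notified": "not yet"}]

merged = merge_entries(from_core, unreflected, own)   # S102, S103
mails = []
sent = notify_unreported(merged,
                         {"HDI-A": "a@example.com", "HDI-B": "b@example.com"},
                         lambda to, entry: mails.append(to))
```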

Next, the file storage apparatus 120 acquires the version information of the status file from the core 200 (step S109 in FIG. 9), and determines whether or not the status file SF of the core 200 (the Inode management table 330 for managing it) has been updated. Determination is made (step S110).

When the status file (all fault information) of the core 200 has not been updated, the file storage device 120 increments the version in the metadata of the status file that was obtained from the core 200 and merged with the fault information of its own node (step S111). Next, the file storage apparatus 120 transfers the updated status file to the status file storage directory in the core 200 (step S112). The file storage device 120 then deletes the status file originally held in its own node (step S113), returns to the caller (step S100 in FIG. 7), and continues the processing (in the case of the flowchart in FIG. 7, the processing then ends).

On the other hand, if the status file SF of the core 200 has been updated, the file storage apparatus 120 moves the status file originally held in its own node to the conflict directory (unreflected data storage directory) MD (step S114). Next, the file storage device 120 uploads the status file in the conflict directory to the unreflected data storage directory MD in the core 200 (step S115), returns to the caller (step S100 in FIG. 7), and continues the processing (in the case of the flowchart of FIG. 7, the processing then ends).
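The branch in steps S109 to S115 amounts to a version check before upload: the merged file is reflected only if the core still holds the version that the edge originally downloaded, and is otherwise set aside as unreflected data. A minimal in-memory sketch, assuming an integer version and hypothetical names such as `CoreStore` and `upload_merged`:

```python
from dataclasses import dataclass, field

@dataclass
class CoreStore:
    version: int = 0
    status_file: list = field(default_factory=list)   # status file storage directory
    unreflected: list = field(default_factory=list)   # unreflected data storage directory

def upload_merged(core, base_version, merged):
    """Reflect the merged status file, or set it aside if the core has moved on."""
    if core.version == base_version:         # S109-S110: core not updated meanwhile
        core.version = base_version + 1      # S111: increment the version
        core.status_file = list(merged)      # S112: transfer the updated status file
        return "reflected"
    core.unreflected.append(list(merged))    # S114-S115: store as unreflected data
    return "unreflected"

core = CoreStore()
first = upload_merged(core, 0, ["failure at HDI-A"])
second = upload_merged(core, 0, ["failure at HDI-B"])  # based on the stale version 0
```

The second upload is not rejected; it is kept so that a later merge can pick it up, which is the point of the contention control described in this embodiment.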

(3) Status File Storage Location (3-1) Core-side Directory Configuration FIG. 10 shows an example of a directory structure on the archive device 210 on the core 200 side. A subordinate directory of a certain superordinate directory is represented in a vertical relationship by a line extending from the bottom of the superordinate directory to the top of the subordinate directory.

As shown in FIG. 10, the archive device 210 (HCP) has, as lower directories of the root directory RD shown as “/”, a directory SD (hereinafter referred to as “status file storage directory”) that stores a status file under, for example, the directory name “coreH”, and a directory MD (hereinafter referred to as “unreflected data storage directory”) that stores the above-described unreflected data under, for example, the directory name “conflictH”.

In the illustrated example, a status file SF named “0” is stored in the status file storage directory SD.

(3-2) Merge Processing When Conflict Occurs FIGS. 11 and 12 each show an example of an outline of merge processing when a conflict occurs. FIG. 12 shows the continuation of the processing shown in FIG. 11. Note that the procedures (1) to (5) given in both figures represent an example of the processing procedure in each figure. As shown in FIG. 11, the core 200 stores a status file SF having the name “0” in the status file storage directory SD.

In the procedure (1) of FIG. 11, the status file SF with the name “0” is downloaded from the core 200 to the file storage device 120A (HDI-A). In the procedure (2) of FIG. 11, at the same time, the status file SF with the name “0” is also downloaded from the core 200 to the file storage device 120B.

On the other hand, the file storage device 120A uploads the status file SF with the name “A1” including the failure information regarding its own node to the status file storage directory SD of the core 200, as shown in the procedure (3) of FIG.

As shown in step (4) of FIG. 11, in the core 200, the status file SF with the name “0” is rewritten with the uploaded status file SF with the name “A1”.

Next, as shown in step (5) of FIG. 11, the file storage apparatus 120B (HDI-B) attempts to upload the status file SF with the name “B1”, which includes its own failure information, to the status file storage directory SD of the core 200.

(3-2-1) Conflict Confirmation Next, as shown in the procedure (5) of FIG. 11, the file storage apparatus 120B (HDI-B) checks whether the status file SF existing in the status file storage directory SD has been updated by another edge so that its version is other than “0”. This is because the file storage device 120B (HDI-B) assumes, as described above, that the status file SF of version “0” exists in the status file storage directory SD of the core 200. Therefore, when a status file SF whose version is other than “0” exists, the file storage device 120B determines that its update may conflict with an update by another edge.

When the file storage device 120B (HDI-B) detects a conflict based on the fact that the version of the status file SF in the status file storage directory SD of the core 200 is other than “0”, it uploads the status file SF (version “B1”) that it was about to upload to the unreflected data storage directory MD, as shown in step (6) of FIG. 11, instead of to the originally intended status file storage directory SD.

As a result, in the present embodiment, when there is an update access request for the status file SF, the request is not refused. Instead, by temporarily using the unreflected data storage directory MD, contention among update access requests to the status file SF from the edges 100A and 100B can be controlled.
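The conflict check and the diversion to the unreflected data storage directory MD can be pictured with a minimal Python sketch. The `Core` class, the `upload_status` function, and the failure messages are hypothetical stand-ins introduced only for illustration; the embodiment itself does not specify an implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    """Hypothetical in-memory stand-in for the core 200."""
    sd_version: str = "0"                           # version of the status file in directory SD
    sd_content: dict = field(default_factory=dict)  # contents of the status file in SD
    md: dict = field(default_factory=dict)          # unreflected data storage directory MD


def upload_status(core: Core, base_version: str, new_version: str, content: dict) -> str:
    """Upload a status file; divert it to MD if SD changed after base_version was downloaded."""
    if core.sd_version != base_version:
        # Another edge has already rewritten the file: park the upload instead of refusing it.
        core.md[new_version] = content
        return "MD"
    core.sd_version = new_version
    core.sd_content = content
    return "SD"


core = Core()
# HDI-A and HDI-B both downloaded version "0" before uploading.
where_a = upload_status(core, "0", "A1", {"A": "disk failure"})  # illustrative failure text
where_b = upload_status(core, "0", "B1", {"B": "fan failure"})   # illustrative failure text
# where_a == "SD", where_b == "MD"
```

In this sketch the second upload is kept rather than rejected, which corresponds to step (6) of FIG. 11.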

On the other hand, as shown in procedure (1) of FIG. 12, the file storage device 120C, which could not reflect its status file SF because of the above-described conflict, downloads from the core 200 the status file SF whose version has been updated to “A1” in order to resolve the conflict.

As shown in step (2) of FIG. 12, the file storage device 120C also downloads, from the unreflected data storage directory MD, the status file SF (version “B1”) that was stored there and put on hold earlier.

As shown in procedure (3) of FIG. 12, the file storage apparatus 120C merges the two downloaded status files SF (version “A1” and version “B1”) with the status file SF updated by the file storage apparatus 120C itself (for example, version “C1”) to generate, for example, a status file SF of version “ABC”, and uploads it to the status file storage directory SD of the core 200. The status file SF (version “A1”) in the status file storage directory SD is thereby updated.

As a result, on the core 200 side, once the conflict over the status file SF in the status file storage directory SD has been resolved, the status file SF (version “B1”) held in the unreflected data storage directory MD is written over and reflected in the status file SF in the status file storage directory SD.
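The merge of versions “A1”, “B1”, and “C1” into a single status file can be sketched as follows. The merge rule used here (keep the newest record per edge) is an assumption; the description only states that the files are merged. The function name `merge_status`, the record layout, and the sample values are likewise hypothetical.

```python
def merge_status(*status_files: dict) -> dict:
    """Merge per-edge failure records, keeping the newest record for each edge.

    Each status file maps an edge name to a (timestamp, message) tuple.
    The newest-wins rule is an assumption made for this sketch.
    """
    merged: dict = {}
    for status in status_files:
        for edge, record in status.items():
            if edge not in merged or record[0] > merged[edge][0]:
                merged[edge] = record
    return merged


a1 = {"A": (10, "disk failure")}                        # downloaded from SD
b1 = {"B": (11, "fan failure")}                         # downloaded from MD
c1 = {"C": (12, "no failure"), "A": (5, "no failure")}  # own file, with a stale record for A

abc = merge_status(a1, b1, c1)  # corresponds to the merged version "ABC"
```

Here the stale record that edge C held for edge A is replaced by the newer record from “A1”, so the merged file carries the latest information for all three edges.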

As described above, when update accesses by the edges do not compete, the plurality of edges 100A and 100B can share, in synchronization with each other through the core 200, all the fault information regarding the plurality of edges 100A and 100B. On the other hand, even when update accesses by the edges compete, as soon as the conflicting update access is completed, all the fault information updated by a specific edge is reliably and promptly made known to the plurality of edges 100A and 100B.

As described above, in the present embodiment, when one of the edges 100A, 100B, and 100C makes an update access with a new status file SF to the status file SF already stored in the status file storage directory SD (all fault information storage directory) while the stored status file SF is being update-accessed from another edge, the CPU 211 (contention control unit) of the archive device 210 in the core 200 temporarily stores the new status file SF as unreflected data in the unreflected data storage directory MD. Further, when the update access to the stored status file SF is completed, the CPU 211 of the archive device 210 updates the status file SF for which the update access has been completed, using the new status file SF temporarily stored in the unreflected data storage directory MD (data update unit).
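Seen from the core side, the contention control described above behaves like a small state machine: an update is deferred while another edge's update access is in progress and is reflected when that access ends. The following Python sketch illustrates this under stated assumptions; the class `ContentionController` and its method names are hypothetical and are not taken from the embodiment.

```python
class ContentionController:
    """Hypothetical core-side sketch: SD holds the status file, MD holds deferred updates."""

    def __init__(self, initial: dict):
        self.sd = dict(initial)  # status file storage directory SD
        self.md = []             # unreflected data storage directory MD
        self.busy_edge = None    # edge whose update access is in progress

    def begin_update(self, edge: str) -> None:
        """Mark that the given edge has started an update access."""
        self.busy_edge = edge

    def request_update(self, edge: str, new_status: dict) -> bool:
        """Apply the update, or defer it to MD while another edge is updating."""
        if self.busy_edge is not None and self.busy_edge != edge:
            self.md.append(new_status)  # save as unreflected data
            return False
        self.sd.update(new_status)
        return True

    def end_update(self, edge: str) -> None:
        """Finish the in-progress update, then reflect everything parked in MD."""
        if self.busy_edge == edge:
            self.busy_edge = None
            for pending in self.md:
                self.sd.update(pending)
            self.md.clear()


ctl = ContentionController({})
ctl.begin_update("A")
ctl.request_update("A", {"A": "disk failure"})            # applied directly
applied = ctl.request_update("B", {"B": "fan failure"})   # deferred to MD
ctl.end_update("A")                                       # deferred update is now reflected
```

The sketch keeps deferred updates in arrival order, which is one possible design choice; the description does not state how multiple pending updates are ordered.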

In this way, the plurality of edges 100A, 100B, and 100C share the contents of the status file SF with each other, so that a failure occurring at a specific edge among the plurality of edges 100A, 100B, and 100C can be recognized by the other edges. For this reason, it is not necessary to station an administrator at each of the plurality of edges 100A, 100B, 100C and at the core 200, and the failure notification function among the plurality of edges 100A, 100B, 100C and between them and the core 200 can be made more reliable while suppressing costs.

In addition, by providing the status file storage directory SD in the core 200, even if update access requests for all the fault information in the core 200 arrive from the plurality of edges 100A, 100B, 100C within a short time, it is possible to ensure that all the fault information after the update contains no inconsistency. As a result, even when a large number of edges are provided, the failure notification function operates reliably among all the edges 100A, 100B, and 100C, so that the latest failure information can be shared quickly by all of the plurality of edges 100A, 100B, and 100C.

(3-3) Overview of Failure Information E-mail Processing

FIG. 13 shows an example of an overview of failure information e-mail notification processing. As shown in the figure, the archive device 210 (also abbreviated as “HCP” in the present embodiment) has, as lower-layer directories of the root directory RD shown as “/” on the file system, a directory with the directory name, for example, “dir1” (hereinafter “status file storage directory”) SD and a directory with the directory name, for example, “dir2” for storing the above-mentioned unreflected data (the unreflected data storage directory MD). This e-mail notification process will be described mainly using the status file storage directory SD.

When the mail server 140A of the edge 100A (corresponding to “mail server A” in the figure) cannot perform failure notification by e-mail, as shown in procedure (1) of FIG. 13, the file storage device 120A of the edge 100A (corresponding to “HDI-A” in the figure) instead uploads mail unreported information, which indicates that failure notification by e-mail cannot be performed, to the status file storage directory SD of the archive device 210 of the core 200, as shown in step (2) of FIG. 13. Cases in which failure notification by e-mail cannot be performed are assumed to include, for example, a case where the edge 100A has no mail server 140A and a case where a failure has occurred in the e-mail transmission function of the mail server 140A itself.

On the other hand, at the edge 100B, the file storage device 120B (corresponding to “HDI-B” in the figure) downloads the mail unreported information from the status file storage directory SD, as shown in procedure (3) of FIG. 13. Then, in place of the file storage apparatus 120A (HDI-A), it uses the mail server 140B (corresponding to “mail server-B” in the figure) to send an e-mail, with contents based on the mail unreported information, to the specific manager (management terminal) of the file storage apparatus 120A, reporting that failure notification by e-mail is not possible at the edge 100A.

In this way, even if a failure has occurred at the edge 100A and extends to the mail server 140A (corresponding to mail server-A), the occurrence of the failure at the edge 100A is more reliably reported, via the other edge 100B, to the specific administrator who operates the management terminal (not shown). For this reason, even if no dedicated manager is present at the edge 100A, the specific manager described above can recognize that a failure has occurred at the edge 100A. It is therefore unnecessary to station a manager at the edge 100A, and the maintenance cost of the edge 100A can be suppressed.
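The proxy notification flow of FIG. 13 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names, the dictionary layout of the mail unreported information, and the placeholder addresses are all hypothetical, and messages are collected in a list instead of being sent through a real mail server. Clearing the flag after proxy notification is also an assumption added to avoid duplicate mails.

```python
from email.message import EmailMessage


def build_mail(edge: str, failure: str, sender: str) -> EmailMessage:
    """Build a failure notification mail (addresses are placeholders)."""
    msg = EmailMessage()
    msg["From"] = f"{sender}@example.invalid"
    msg["To"] = "admin@example.invalid"
    msg["Subject"] = f"Failure at edge {edge} (reported by edge {sender})"
    msg.set_content(failure)
    return msg


def report_failure(edge: str, failure: str, mail_ok: bool, sd: dict, outbox: list) -> None:
    """Mail the failure directly, or leave mail unreported information in SD."""
    if mail_ok:
        outbox.append(build_mail(edge, failure, sender=edge))
    else:
        sd[edge] = {"failure": failure, "mail_unreported": True}


def proxy_report(own_edge: str, sd: dict, outbox: list) -> None:
    """Send mail on behalf of every other edge that flagged itself as unable to mail."""
    for edge, info in sd.items():
        if edge != own_edge and info.get("mail_unreported"):
            outbox.append(build_mail(edge, info["failure"], sender=own_edge))
            info["mail_unreported"] = False  # assumed: mark as reported to avoid duplicates


sd: dict = {}         # stand-in for the status file storage directory SD
outbox_b: list = []   # stand-in for mail server B
report_failure("A", "disk failure", mail_ok=False, sd=sd, outbox=[])  # edge A cannot mail
proxy_report("B", sd, outbox_b)                                       # edge B mails instead
```

In a real deployment the message would be handed to the mail server of the reporting edge; the list `outbox_b` only makes the sketch self-contained.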

Further, in the present embodiment, each of the plurality of edges 100A, 100B, and 100C may determine that a communication failure has occurred between another edge (for example, the edge 100A) and the core 200, based on the fact that the status file version of that edge in the core 200 (the last update date/time column in FIG. 4) has not been updated for a predetermined period, that is, for example, is older than a predetermined threshold version.
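The staleness check described above can be illustrated with a short Python sketch. The function name `stale_edges`, the one-hour threshold, and the sample timestamps are assumptions chosen for illustration; the description only requires that an entry not updated for a predetermined period be treated as a sign of communication failure.

```python
from datetime import datetime, timedelta


def stale_edges(last_updated: dict, now: datetime, threshold: timedelta, own_edge: str) -> list:
    """Return the other edges whose last update in the core is older than the threshold."""
    return sorted(
        edge
        for edge, ts in last_updated.items()
        if edge != own_edge and now - ts > threshold
    )


now = datetime(2015, 12, 3, 12, 0)
last_updated = {                         # hypothetical last update date/time per edge
    "A": datetime(2015, 12, 3, 9, 0),    # three hours old
    "B": datetime(2015, 12, 3, 11, 55),
    "C": datetime(2015, 12, 3, 11, 50),
}
suspects = stale_edges(last_updated, now, timedelta(hours=1), own_edge="B")
```

An edge running this check would treat each entry in `suspects` as a candidate for a communication failure with the core and could report it to the manager.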

In this case, even when the file storage apparatus 120A of the edge 100A suffers a failure so severe that it cannot upload the above-mentioned mail unreported information to the core 200 as in the example of FIG. 13, the edge 100B, even if it is located far from the edge 100A, can recognize that such a severe failure has occurred at the edge 100A and can report it, for example, to the specific manager.

(4) Other Embodiments

The above embodiment is an example for explaining the present invention, and is not intended to limit the present invention to this embodiment alone. The present invention can be implemented in various forms without departing from its spirit. For example, in the above-described embodiment, the processing of the various programs is described sequentially, but the present invention is not limited to this. Therefore, as long as there is no contradiction in the processing result, the processing order may be changed or the processes may be executed in parallel.

The present invention can be widely applied to a mutual monitoring system having a core edge configuration, a failure notification method in the mutual monitoring system, and a non-temporary storage medium in which a program for performing a failure notification function is recorded.

100, 100A, 100B, 100C ... Edge, 120, 120A, 120B ... File storage device, 140A, 140B ... Mail server, 200 ... Core.

Claims

A mutual monitoring system of a core edge configuration having a core in which a shared resource is stored and a plurality of edges that access the shared resource, wherein
the core has
a core-side failure information management unit that manages total failure information of the plurality of edges as the shared resource,
each of the plurality of edges has:
a failure information holding unit capable of holding own failure information based on the performance history of the own edge together with other failure information based on the performance history of other edges;
an own failure information reporting unit that notifies the core of the own failure information relating to the own edge held in the failure information holding unit, and updates and reflects the total failure information based on the own failure information; and
an other failure information acquisition unit that, when other failure information relating to another edge in the total failure information in the core-side failure information management unit is newer than the other failure information already held, updates the other failure information held in the failure information holding unit using the other failure information acquired from the total failure information, and
the core further has
a contention control unit that, when update access to the total failure information is attempted by a specific edge of the plurality of edges while update access to the total failure information is being made by another edge of the plurality of edges, temporarily stores the request content of the update access by the specific edge separately as unreflected data instead of restricting the update access by the specific edge, and, upon completion of the update access by the other edge, updates the total failure information based on the unreflected data.
The mutual monitoring system according to claim 1, wherein
the core includes:
a total failure information storage directory in which the total failure information is stored; and
an unreflected data storage directory in which the unreflected data is temporarily stored, and
the contention control unit includes:
an unreflected data saving unit that, when the total failure information stored in the total failure information storage directory is update-accessed with new failure information from one of the plurality of edges while the stored total failure information is being update-accessed from another edge, temporarily stores the new failure information in the unreflected data storage directory; and
a data update unit that, when the update access to the stored total failure information has ended, updates the total failure information for which the update access has ended, using the new failure information temporarily stored in the unreflected data storage directory.
The mutual monitoring system according to claim 1, wherein each of the plurality of edges further has a failure proxy reporting unit that, when it is determined, based on the other failure information acquired from the core by the other failure information acquisition unit, that a failure has occurred in any one of the plurality of edges, notifies the management terminal that monitors the plurality of edges on behalf of that edge.
The mutual monitoring system according to claim 1, wherein each of the plurality of edges determines that a communication failure has occurred between another edge of the plurality of edges and the core, based on the fact that the other failure information relating to that other edge has not been updated in the core over a predetermined period.
A failure notification method in a mutual monitoring system of a core edge configuration having a core in which a shared resource is stored and a plurality of edges that access the shared resource, wherein
each of the plurality of edges executes:
an own failure information reporting step in which an own failure information reporting unit notifies the core of the own failure information relating to the own edge held in a failure information holding unit, the failure information holding unit being capable of holding own failure information based on the performance history of the own edge together with other failure information based on the performance history of other edges, and updates and reflects, based on the own failure information, total failure information of the plurality of edges managed by a core-side failure information management unit of the core; and
an other failure information acquisition step in which an other failure information acquisition unit, when other failure information relating to another edge in the total failure information serving as the shared resource is newer than the other failure information already held, updates the other failure information held in the failure information holding unit using the other failure information acquired from the total failure information, and
the core executes
a contention control step in which a contention control unit, when update access to the total failure information is attempted by a specific edge of the plurality of edges while update access to the total failure information is being made by another edge of the plurality of edges, temporarily stores the request content of the update access by the specific edge separately as unreflected data instead of restricting the update access by the specific edge, and, upon completion of the update access by the other edge, updates the total failure information based on the unreflected data.
The failure notification method in the mutual monitoring system according to claim 5, wherein
the core includes:
a total failure information storage directory in which the total failure information is stored; and
an unreflected data storage directory in which the unreflected data is temporarily stored, and
in the contention control step, the contention control unit executes:
an unreflected data saving step of, when the total failure information stored in the total failure information storage directory is update-accessed with new failure information from any one of the plurality of edges while the stored total failure information is being update-accessed from another edge, temporarily storing the new failure information in the unreflected data storage directory; and
a data update step of, when the update access to the stored total failure information has ended, updating the total failure information for which the update access has ended, using the new failure information temporarily stored in the unreflected data storage directory.
The failure notification method in the mutual monitoring system according to claim 5, wherein each of the plurality of edges executes a failure proxy reporting step of, when it is determined, based on the other failure information acquired from the core in the other failure information acquisition step, that a failure has occurred in any one of the plurality of edges, notifying the management terminal that monitors the plurality of edges on behalf of that edge.
The failure notification method in the mutual monitoring system according to claim 5, wherein each of the plurality of edges determines that a communication failure has occurred between another edge of the plurality of edges and the core, based on the fact that the other failure information relating to that other edge has not been updated in the core over a predetermined period.
A non-transitory recording medium on which is recorded a program that performs a failure notification function in a mutual monitoring system of a core edge configuration having a core in which a shared resource is stored and a plurality of edges that access the shared resource, the program
causing a computer in each of the plurality of edges to execute:
an own failure information reporting step of notifying the core of the own failure information relating to the own edge held in a failure information holding unit, the failure information holding unit being capable of holding own failure information based on the performance history of the own edge together with other failure information based on the performance history of other edges, and updating and reflecting, based on the own failure information, total failure information of the plurality of edges managed by a core-side failure information management unit of the core; and
an other failure information acquisition step of, when other failure information relating to another edge in the total failure information serving as the shared resource is newer than the other failure information already held, updating the other failure information held in the failure information holding unit using the other failure information acquired from the total failure information, and
causing a computer in the core to execute
a contention control step of, when update access to the total failure information is attempted by a specific edge of the plurality of edges while update access to the total failure information is being made by another edge of the plurality of edges, temporarily storing the request content of the update access by the specific edge separately as unreflected data instead of restricting the update access by the specific edge, and, upon completion of the update access by the other edge, updating the total failure information based on the unreflected data.