EP1697846A2

EP1697846A2 - Device and method for controlling and commanding monitoring detectors in a node of a cluster system

Info

Publication number: EP1697846A2
Application number: EP04802700A
Authority: EP
Inventors: Klaus Hartung
Original assignee: Fujitsu Technology Solutions GmbH
Current assignee: Fujitsu Technology Solutions Intellectual Property GmbH
Priority date: 2003-12-22
Filing date: 2004-11-10
Publication date: 2006-09-06
Also published as: US20070011315A1; JP4584268B2; WO2005062172A3; JP2007515727A; WO2005062172A2; DE10360535A1; DE10360535B4; US8051173B2

Abstract

The invention relates to a monitoring device (DFW) and a method for monitoring at least two resources (M1, M2) on a node (C) of a cluster system. A priority (P) is allocated to each of the at least two resources. In the monitoring device and method, one of the at least two resources to be monitored is selected according to the allocated priority (P), together with a monitoring detector (D1, D2) for said resource. The monitoring detector is configured and the resource is monitored once with said monitoring detector. The result of the monitoring carried out by the monitoring detector is reported. Selecting by allocated priority (P) and monitoring only once makes it possible to reduce computing time on the node (C).

Description

Device and method for the control and monitoring of monitoring detectors in a node of a cluster system

The invention relates to a device in a node of a cluster system for checking and controlling monitoring detectors. The invention further relates to a method for the control and monitoring of monitoring detectors for at least two resources to be monitored in a cluster system.

Cluster systems with multiple nodes, each formed from an individual computer, are often used for software that is required to be highly available. For this purpose, the cluster system has monitoring and control software, also called the Reliant Management Service (RMS), which monitors the highly available software running on the cluster. The highly available software itself runs on one node of the cluster or is distributed among different nodes. In addition, the control software RMS can also be distributed across various nodes, i.e. in a decentralized manner.

If the error-free execution of the highly available software or part of it on a node of the cluster is no longer guaranteed, the control software RMS ends the application or the corresponding part of it and restarts it on another node. So-called monitoring detectors, which are controlled by the RMS, monitor the highly available application or a part of it. Each detector monitors a specific part of the application, which is referred to as a resource, and reports the status of the resource back to the RMS control software.

An example of this can be seen in FIG. 6. This shows a node N1, which is part of a cluster system. The node N1 contains the Reliant Management Service RMS as control software. Furthermore, the highly available application APL is executed on the node N1, which in turn exchanges data with a file management system FS via the connection C1. To monitor the APL application, the RMS control software starts the individual monitoring detectors D1, D2 and D3. Each of these detectors is specially designed to monitor a specific resource of the highly available software APL. For example, the detector D3 monitors the communication link C1 between the application APL and the file management system FS. Another detector D2 checks, by means of continuous queries, whether the highly available application APL is still being executed and whether it responds. The third detector D1 checks, for example, the available temporary memory that is required for the highly available application APL.

Based on the status reports of the individual monitoring detectors, the RMS control software takes suitable measures if individual monitored resources fail or other problems occur. For example, it can terminate the highly available software and restart it on a second node (not shown).

The individual monitoring detectors are started independently of each other by the RMS control software. However, this leads to a high system load on the node, since each detector consumes memory space and computing capacity. In the worst case, due to an unfavorable configuration or a large number of monitored resources within a node, the monitoring detectors can consume most of the available computing capacity. There is then too little available for the actual application. In addition, the control software receives status messages from monitoring detectors whose execution, and thus the monitoring of the resource, is not currently necessary. The processing of all returned status messages also increases the computing time and places an unnecessary burden on the control software.

The object of the invention is therefore to provide a device in a node of a cluster system with which the system load caused by monitoring can be adapted to operation-dependent requirements while adequate monitoring of the resources is nevertheless ensured. It is also an object to provide a method for the control and monitoring of monitoring detectors which works efficiently with a low system load.

These objects are achieved with the subject matter of the independent claims.

A monitoring device is provided in a node of a cluster system for monitoring at least two resources to be monitored on the node of the cluster system. A priority can be assigned to each resource to be monitored, which represents a measure of the importance of that resource. The device comprises a means for selecting a resource from the at least two resources to be monitored on the basis of the priorities assigned to them. Furthermore, the device comprises at least one monitoring detector which is designed for the type of monitoring required by the resource to be monitored. Finally, the device contains a means for assigning the monitoring detector to the resource to be monitored, and a means for executing the monitoring detector. The latter is designed in such a way that, after the monitoring detector has monitored the resource once, the execution is ended or stopped by the means.

In this version, the device forms a higher-level instance to which the individual resources to be monitored and in particular the monitoring detectors required for the resources to be monitored are subordinate. In particular, the execution of the individual monitoring detectors required for the resources to be monitored is no longer independent of one another, but is controlled in a combined manner by the device. This makes it possible to use the device to monitor only those resources whose monitoring is necessary at the current time. The device on the node also saves additional computing time, since the monitoring detector required for the resource to be monitored is only executed after a selection.

The monitoring is carried out in such a way that the execution of the monitoring detector is stopped again after the monitoring has taken place. Monitoring therefore takes place only once per execution. The device is, of course, designed in such a way that, if necessary, it selects the same resource frequently and executes the monitoring detector required for it several times. However, the monitoring detector is not operated continuously, but is only executed until it has returned a status message regarding the resource to be monitored. The monitoring detector itself can nevertheless be designed for repeated monitoring. This is particularly advantageous if the measured values obtained by the monitoring detector are scattered. According to the invention, the monitoring detector then carries out several monitoring operations and returns a single overall status message which represents the individual measured values. The execution of the detector is ended after the status message has been transmitted.

The method for monitoring at least two resources on a node of a cluster system, the at least two resources being assigned a priority, comprises the steps: a) selection of one of the at least two resources to be monitored on the basis of the priority assigned to the resources to be monitored; b) selection of a monitoring detector required for the monitoring for the resource to be monitored; c) assigning resource parameters to the monitoring detector; d) starting or executing the monitoring detector and performing a monitoring of the resource once by the monitoring detector; e) reporting the result of the monitoring performed by the monitoring detector.
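The steps a) to e) can be illustrated with a minimal sketch. It assumes that a higher numerical value means a higher priority and that a detector is an ordinary function returning a status string; the names `Resource`, `DETECTORS`, `check_mount` and `monitor_once` are hypothetical and do not appear in the description.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Resource:
    name: str
    priority: int   # assumption: higher value = more important
    kind: str       # determines which type of detector is required
    params: dict    # resource parameters handed to the detector


def check_mount(params: dict) -> str:
    # Placeholder detector (hypothetical): a real detector would inspect
    # the integration point; here only the presence of a path is checked.
    return "online" if params.get("path") else "faulted"


# Hypothetical registry mapping a resource kind to its detector function.
DETECTORS: Dict[str, Callable[[dict], str]] = {"mount": check_mount}


def monitor_once(resources: List[Resource],
                 report: Callable[[str, str], None]) -> Resource:
    selected = max(resources, key=lambda r: r.priority)  # a) select by priority
    detector = DETECTORS[selected.kind]                  # b) select the detector
    status = detector(selected.params)                   # c) + d) pass parameters, run once
    report(selected.name, status)                        # e) report the result
    return selected
```

Each call performs exactly one check and produces exactly one report; repeated monitoring comes from calling `monitor_once` again, not from a detector that keeps running.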

In the method according to the invention, too, monitoring of a resource to be monitored is not carried out continuously, but rather only by executing the monitoring detector assigned to the resource to be monitored once. The monitoring detector itself can of course monitor the resource to be monitored in a variety of ways and in particular also several times at short time intervals before it returns a result. However, according to the invention, only one result or status message is returned once per execution of the monitoring detector.

The resource to be monitored at a time is selected based on the assigned priority. This saves computing time on the node of a cluster system, since the monitoring detector is only executed when this is necessary based on the assigned priority. In particular, the resources and the monitoring detectors are combined and viewed as a whole. Individual detectors are therefore no longer independent.

The resources to be monitored and the monitoring detectors required for them are diverse in nature. In one embodiment of the invention, a resource to be monitored is identified by an integration point (mount point) within a file system of the node of the cluster system. The monitoring detector is thus designed to check whether the integration point to be monitored is still valid. In an advantageous embodiment, the integration point is provided by a second file system on a mass storage device, which is integrated in the file system of the node of the cluster system. In every case, the monitoring detector required for monitoring is selected on the basis of the selected resource.

In another embodiment, the monitoring detector is designed to monitor an available hard disk or other mass storage.

In yet another embodiment of the invention, the resource to be monitored is an executed program and the monitoring detector required for this is a detector that checks whether the executed program is still active. Another resource to be monitored is a network connection with another node of the cluster system. The monitoring detector required for this is a detector that checks the status of the network connection. Another resource is a database to be monitored, the system load of the node, the processor load of a program being executed, or the free space available within the node of the cluster system. A monitoring detector is provided for each type of different resource, which performs a specific monitoring. There can be several different types of monitoring for a resource and therefore different monitoring detectors.

In a development of the device, the means for selection comprises a list in which the at least two resources to be monitored are stored in an order determined by their priority. This enables a particularly simple selection: the device uses the list to determine the resources to be monitored and executes the corresponding monitoring detectors. The list can be changed particularly easily by adding additional resources or removing resources from the list. The device is designed in such a way that, based on a resource selected from the list, it automatically provides the associated monitoring detector required for monitoring the resource.

It is particularly useful if the priorities of the resources to be monitored are formed by a numerical value. As a result, a high degree of flexibility is achieved overall and it is possible to react dynamically to changes simply by changing the priority of the resource to be monitored.

In a further advantageous embodiment of the invention, a fixed period of time is provided per time interval. The device is designed in such a way that the average time for executing a monitoring detector is less than the defined time period. The device is expediently designed for a selection of a resource and for a single execution of the assigned monitoring detector, until the total execution time of all the monitoring detectors executed once reaches the defined time period. This fixed time period therefore specifies a time window in each time interval in which the device can monitor resources. In other words, the maximum computing capacity or computing time required by the device can thus be determined within a time interval. This is possible because the monitoring detectors are started and controlled by the device and are consequently no longer independent of one another.
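The effect of the fixed time period can be sketched as a simple budget loop. This is a minimal illustration, assuming that each resource carries a known expected execution time and that a resource whose check would overrun the window is skipped in favor of later, shorter ones; the function name and the sample values are hypothetical.

```python
def run_within_budget(queue, budget_ms):
    """Pick resources in priority order until the time window is used up.

    queue: list of (name, priority, expected_ms) tuples (hypothetical layout).
    Returns the names selected for one execution each and the time consumed.
    """
    executed, used_ms = [], 0
    # Highest priority first; sorted() is stable, so equal priorities
    # keep their original order.
    for name, priority, expected_ms in sorted(queue, key=lambda r: -r[1]):
        if used_ms + expected_ms > budget_ms:
            continue  # would exceed the window in this interval
        executed.append(name)
        used_ms += expected_ms
    return executed, used_ms


# Hypothetical values, not taken from the tables of the description.
sample = [("A", 5, 30), ("B", 5, 200), ("C", 3, 300), ("D", 3, 100)]
selected, used = run_within_budget(sample, budget_ms=450)
```

Because the device starts every detector itself, the sum of the expected times gives an upper bound on the computing time spent on monitoring in each interval.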

A further development of the invention is characterized in that a second time period, which is required for monitoring the resource, is assigned to the resource to be monitored. This enables the device to make an accurate estimate of the time required for monitoring. It is expedient if the device is designed to determine the period of time required for monitoring. This is expediently done by measuring the execution time of the monitoring detector.
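Measuring the second time period can be as simple as wrapping the detector call in a timer. The following sketch assumes a detector is a plain function taking a parameter dictionary; `timed_run` is a hypothetical helper name.

```python
import time


def timed_run(detector, params):
    """Execute a detector once and return its status and duration in ms."""
    start = time.perf_counter()
    status = detector(params)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return status, elapsed_ms


# Usage with a trivial stand-in detector (hypothetical).
status, elapsed_ms = timed_run(lambda p: "online", {})
```

The measured value can then replace or refine the duration stored for the resource before the next selection.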

In another development of the invention, the device has a first interface, which is designed to emit status messages from the monitoring detector after the monitoring detector has been executed once. This allows important status messages to be reported, in particular to a higher-level monitoring and control device. In another development of the invention, the device comprises a second interface, which is designed to receive user commands. As a result, the user can also have a resource monitored by a monitoring detector at any time. This is particularly useful if a current status message is required from the resource to be monitored. In a further development of this device, the first or the second interface can be designed to receive resources to be monitored. In this way, new resources to be monitored can be communicated to the device, or resources monitored by the device can be removed from the monitoring again.

It is expedient to design the device as an independent process within the node of the cluster system. The device thus forms an independent program. The monitoring detectors form sub-processes of the device during their execution.

In another development of the invention, the monitoring detector is designed as an independently executable program. It is executed once by the device after selection of the resource to be monitored. In a particularly advantageous development of the invention, the device has at least one idle sub-process which is executed on the node of the cluster system but is independent of the resource to be monitored. The means for executing the monitoring detector is designed to link the monitoring detector of the selected resource with this idle sub-process. This development is particularly advantageous if the monitoring detector is designed as a function of a dynamic library or as a dynamic library.

As a result, the device links the function of the dynamic library or the dynamic library at the time of execution to the idle sub-process, starts it and thus monitors the resource to be monitored. After the execution, the link is released. Such a design is particularly fast and computationally efficient. By designing the monitoring detectors as functions in dynamic libraries or as dynamic libraries, improvements, extensions or error corrections are possible in a particularly simple and flexible manner. It also simplifies porting to other cluster operating systems.

In a further development of the method, a first time period is defined in a time interval for the monitoring of the resources to be monitored. Monitoring detectors and the associated sub-processes are only executed as long as the specified time period is not exceeded. The process can be repeated until the specified time is reached. The first time period in the time interval thus defines a maximum computing capacity that is required for monitoring. It is expedient to select one of the at least two resources to be monitored from a list in which the resources to be monitored are stored in the order of their priorities. In a further development, the list is worked through until the specified period of time is reached.

It is particularly expedient to increase the priority of a resource to be monitored if the monitoring detector has not monitored the resource in the first time period in the time interval. This prevents resources from never being monitored by an assigned monitoring detector due to a lack of monitoring time or low priority.

It is expedient to assign a resource to be monitored a second period of time, which specifies the duration for monitoring by the monitoring detector. Alternatively, the second time period can also be assigned to the monitoring detector.

In a development of the method, the second time period for monitoring is determined by the execution of the selected monitoring detector. This is particularly useful if the required period of time is not known from the outset or if parameters change during operation that affect the period of time for monitoring.

In one embodiment of the method, an idle sub-process is started that does not consume any computing time and is also referred to as a sleeping process. After the selection of a monitoring detector, the monitoring detector is linked to the idle sub-process and then executed. It is expedient to design the monitoring detector as a function of a dynamic library or as a dynamic library. Linking the monitoring detector to the idle sub-process is particularly quick and efficient as a result. After the result has been reported by the monitoring detector, the link is released again and the idle sub-process is put back to sleep. The idle sub-process does not require any computing time on the node. Alternatively, the idle sub-process can be linked in succession with various monitoring detectors. A design with an idle sub-process is particularly flexible.
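The idea of a sleeping sub-process that is linked to a detector only for the duration of one check can be approximated with a worker thread blocking on a queue. This is only an analogy under stated assumptions: a Python callable stands in for the function of a dynamic library, and `idle_worker`, `jobs` and `results` are hypothetical names.

```python
import queue
import threading


def idle_worker(jobs: queue.Queue, results: queue.Queue) -> None:
    while True:
        job = jobs.get()      # blocks (sleeps) until a detector is linked
        if job is None:       # sentinel: end the worker
            break
        name, detector, params = job
        # One execution, one status message; afterwards the "link" is gone
        # and the worker goes back to sleep in jobs.get().
        results.put((name, detector(params)))


jobs, results = queue.Queue(), queue.Queue()
worker = threading.Thread(target=idle_worker, args=(jobs, results), daemon=True)
worker.start()

# Link a stand-in detector for resource "M1" to the idle worker.
jobs.put(("M1", lambda p: "online", {"path": "/usr/opt"}))
status = results.get(timeout=5)

jobs.put(None)
worker.join(timeout=5)
```

While blocked in `jobs.get()`, the worker consumes no computing time, and the same worker can be reused for different detectors in succession.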

Further advantageous embodiments of the invention are the subject of the dependent claims.

In the following, the invention, the individual configurations and extensions of which can be combined as desired, is explained in detail with reference to the drawings.

The figures show:

FIG. 1 shows an exemplary embodiment of the device according to the invention,

FIG. 2 shows a diagram of the means for execution in the device,

FIG. 3 shows a schematic sequence,

FIG. 4 shows examples of resources within the cluster system,

FIG. 5 shows a chronological sequence of the resources to be monitored,

FIG. 6 shows a known device with monitoring detectors.

The environment in which the monitoring device according to the invention is used is first explained with reference to FIG. 4. FIG. 4 shows two nodes C and C2 in a cluster system. These are interconnected via a network connection N1. A highly available application APL, which contains several resources to be monitored, is executed on the node C. The Reliant Management Service RMS is also executed on node C. This is monitoring and control software that is intended to ensure the high availability of the APL application. If necessary, it takes additional measures to ensure high availability. To do this, it is necessary to monitor the individual resources of the highly available APL application.

Specifically, the resources are two integration points within the file system of node C. These point to two external mass storage devices M1 and M2, which are designed as simple hard disk memories in this exemplary embodiment. The hard disk memory M1 is mounted in the file system of the node C at the integration point "/usr/opt", the hard disk memory M2 at the integration point "/usr/share". It is necessary to check whether the mass storage devices M1 and M2 attached to these points in the file system are functional and whether data can be read from them or written to them.

Furthermore, the highly available application APL accesses the database DB, which is executed on node C2. To do this, it is necessary to check the connection between the APL application on node C and the database DB on node C2. Finally, window manager X on node C is also monitored for the graphical user interface of the highly available application APL.

According to the invention, a superordinate monitoring device DFW is provided for the monitoring of all these resources, which is connected to the Reliant Management Service RMS. The monitoring device DFW is also referred to as an instance or detector framework and is designed as an independent process on node C. The detectors D1, D2, D3 and D4 are part of this device. They are responsible for monitoring the resources and are controlled by the monitoring device DFW. The resources to be monitored are transferred to the DFW instance by the Reliant Management Service RMS or communicated as parameters.

FIG. 2 shows a more detailed block diagram of the monitoring device DFW according to the invention. The individual resources are monitored as in FIG. 4 by the individual detectors D1, D2, D3 and D4, which, however, are controlled by a control device KE. Like the detectors, this is part of the monitoring device and has further logic blocks which will be explained in detail later.

The higher-level device DFW is responsible for communication with the Reliant Management Service RMS via the interface S1. For this purpose, it contains a control device KE, which receives information about the resources to be monitored from the RMS. User data or user commands are also transferred to the control device KE via the interface S2. The control device KE controls and monitors the individual monitoring detectors D1, D2, D3 and D4.

The individual detectors are implemented by dynamic libraries Y.so, Z.so and X.so, which are started at runtime. The dynamic library Y.so contains all functions that are necessary for monitoring an integration point within the file system. It can be seen that the two monitoring detectors D1 and D2 are implemented by the same library Y.so. The monitoring detectors even represent the same function in the Y.so library. The control device KE executes the monitoring function in the dynamic library Y.so together with a set of parameters whenever the integration point of the mass storage device M1 or M2 is monitored. At runtime, the two monitoring detectors D1 and D2 thus comprise the same function but different parameters passed to it. The parameters for the detector D1 contain the information for monitoring the mass storage device M1; the parameters for the detector D2 contain the information necessary for checking the mass storage device M2.

In the exemplary embodiment, the set of parameters passed is the integration point in the file system for the memories M1 and M2 and, for example, the type of access right to be checked.
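That D1 and D2 are the same function with different parameter sets can be illustrated as follows. The sketch assumes that the check amounts to testing whether the integration point exists as a directory with the requested access right; the function name `check_integration_point` is hypothetical, and the actual contents of Y.so are not specified in the description.

```python
import os
from functools import partial


def check_integration_point(path: str, mode: int) -> bool:
    """Return True if the path is a directory accessible with the given mode."""
    return os.path.isdir(path) and os.access(path, mode)


# Two detectors: one function, two parameter sets (paths from the description,
# access modes chosen for illustration only).
d1 = partial(check_integration_point, "/usr/opt", os.R_OK | os.W_OK)
d2 = partial(check_integration_point, "/usr/share", os.R_OK)
```

Calling `d1()` or `d2()` performs one check of the respective integration point; adding a further integration point requires only a new parameter set, not new detector code.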

The dynamic library Z.so contains all the necessary functions for monitoring the connection to the database DB between the node C and the node C2 in FIG. 4. If a check is required, the control device KE starts the function from the dynamic library Z.so. The last dynamic library X.so contains the functions for the monitoring detector D4, which checks the status of the window manager for the graphical user interface.

The instance DFW also provides a set of functions that can be used in common by all individual detectors. One example is the interface to the Reliant Management Service RMS for the status messages, which is the same for all detectors. At the same time, the execution of the individual monitoring detectors D1 to D4 is controlled and monitored by the control device KE. The monitoring detectors are thus completely embedded in the detector framework DFW and are no longer independent of it.

FIG. 1 explains in detail the structure of the control device KE, which in turn contains various devices or means. The figure shows a first list with the resources M1, DB and X to be monitored, whose type of monitoring is known to the control device KE. The resources were communicated to the detector framework DFW by the Reliant Management Service together with the order for monitoring. The list contains all the information necessary for monitoring.

A selection device KE1 is now provided, which selects one resource from the list of resources to be monitored, in the exemplary embodiment the resource DB. The selection is based on a priority. In addition, other parameters, for example the computing time previously used or the time required for monitoring, can also be taken into account. The selection device KE1 transfers the resource to be monitored to an assignment unit KE2 which, on the basis of the resource, selects the detector suitable for the type of monitoring and transfers the necessary parameters to it. After the assignment, a resource RS1, RS2 or RS3 is ready for monitoring and is stored in a list as shown.

The instance DFW also contains a number of sub-processes TH1 to TH6, the so-called threads, which are idle. Accordingly, they are dormant sub-processes that do not require any computing time, but can easily be linked to functions from dynamic libraries in order to monitor a resource. The threads have the advantage that no additional computing time has to be spent on starting them; once started, they wait for their execution.

In order to check the resource R3, the device KE3 links the free sub-process TH2 with the functions of a dynamic library required for the monitoring and with the parameters that depend on the resource R3 and were assigned by KE2, and executes the sub-process TH2. This means that the assigned detector monitors the resource. In the embodiment, starting, executing, stopping and synchronization are carried out in accordance with the POSIX (Portable Operating System Interface for UNIX) standard for UNIX operating systems. After the execution of the monitoring function, the device KE3 releases the link again and puts the thread TH2 back to sleep. The thread can then be linked to another resource. A result message supplied by the monitoring detector is returned to the Reliant Management Service RMS by the instance DFW after execution as a status message. Access to shared data between the device KE and the sub-processes TH is serialized by means of semaphores.

By executing with individual sub-processes or threads, it is possible to carry out several monitoring processes at the same time. The selection of the resource, the linking of the threads with the monitoring detectors, the starting and stopping of each individual thread is controlled by the control device KE. The number of monitoring operations carried out in parallel changes over time. The number of sub-processes TH1 to TH6 also changes over time, since the DFW instance can start or end additional sub-processes if necessary.
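Parallel checks that write to shared data need serialized access. The following sketch uses a binary semaphore from Python's standard library as a stand-in for the POSIX primitives named in the embodiment; `status_table`, `table_guard` and `run_detector` are hypothetical names, and the detector is a trivial placeholder.

```python
import threading

status_table = {}                      # shared between controller and threads
table_guard = threading.Semaphore(1)   # binary semaphore serializing access


def run_detector(name, detector, params):
    result = detector(params)          # the check itself runs in parallel
    with table_guard:                  # only one thread updates the table at a time
        status_table[name] = result


threads = [
    threading.Thread(target=run_detector, args=(n, lambda p: "online", {}))
    for n in ("R2", "R3", "R4")
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Only the short update of the shared table is serialized, so the detectors themselves still run concurrently.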

In addition, it is possible to add or remove resources to be monitored from the instance DFW at any time. This is possible because the resources are not continuously monitored, but only during certain periods. The resources are only stored in a list that can be changed.

Furthermore, by designing the detectors as dynamic libraries that can be loaded and executed as required, a high degree of flexibility can be achieved. The dynamic libraries can be replaced by extended libraries at any time without having to stop or restart the Reliant Management Service RMS or the detector framework DFW. If a library is extended or changed, the control device KE loads the new variant. Extensions, error corrections and dynamic reconfiguration are possible at any time.

In order to provide sufficient computing capacity for the highly available application APL in node C of the cluster system, it is necessary to limit the computing time for monitoring the individual resources on node C. However, the resources must be monitored sufficiently often to ensure that the highly available application operates correctly. Figure 3 shows an embodiment of the instance DFW, which meets these two requirements.

In a first configuration file P1, a time period is determined in a time interval in which the instance DFW may monitor resources. The time interval and the duration can be specified by a user. This can be a percentage value, for example 15% of the total computing time, or an absolute value, for example 100 ms in 1 second. Additional requirements, for example regarding hardware or software applications, can also be taken into account via the configuration file.
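The two ways of specifying the time window lead to the same kind of value. A minimal sketch, assuming the window is expressed in milliseconds per interval; the function name and parameter names are hypothetical and do not reflect the actual format of the configuration file P1.

```python
def monitoring_budget_ms(interval_ms, percent=None, absolute_ms=None):
    """Return the monitoring time allowed per interval, in milliseconds."""
    if (percent is None) == (absolute_ms is None):
        raise ValueError("specify exactly one of percent or absolute_ms")
    if percent is not None:
        return interval_ms * percent / 100.0
    return float(absolute_ms)


# 15 % of a 3 s interval corresponds to the 450 ms used in the FIG. 5 example.
budget_percent = monitoring_budget_ms(3000, percent=15)
# Absolute specification: 100 ms in 1 second.
budget_absolute = monitoring_budget_ms(1000, absolute_ms=100)
```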

At the same time, a numerical priority value for each resource to be monitored is defined and assigned in a second configuration file P2. These are communicated to the control device of the instance DFW via the interfaces S1 and S2.

In the exemplary embodiment, the priority values are defined in the configuration file P2 by the Reliant Management Service RMS. For example, it is necessary to assign resource X of the highly available application APL a higher priority than, for example, the resource for the integration point of the mass storage device M1. These priorities are used by the DFW instance to determine an order of monitoring. A resource with a higher priority should be monitored more often than a resource with a lower priority. For this purpose, the individual resources to be monitored are stored in a list L1 according to their priority.

The following table shows the resources for the highly available application APL according to FIG. 4, their assigned priorities, the parameters to be transferred and the time after which a check must be carried out. This therefore determines a maximum value that must not be exceeded. The last column in the table shows the length of time that the monitoring detector started by the control device KE needs to check the associated resource.

Table 1: List of resources with further information

The control device KE now checks the remaining time according to the specification in the configuration file P1, the priorities of the resources, the time elapsed since the last check of each resource and the time period required for the check, and selects a resource to be checked on this basis.

According to FIG. 1, the detector assigned to the selected resource is linked to a sub-process or thread that is still free, the parameters are transferred and the sub-process is executed. After the monitoring has ended, the link is released again and the sub-process is available for a new link. The resource is returned to the list L1; however, the time elapsed since the last check and possibly the priority change. In addition, it is useful to determine the time that the monitoring detector required for the execution, since conditions could have changed so that monitoring now takes longer or shorter.

In the case of resources that could not be monitored or checked within the time window defined by the configuration file P1, the control device KE or the monitoring device DFW increases the priority. This prevents waiting resources from never being checked due to insufficient priority.
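The priority adjustment can be sketched as a small aging rule. It follows the behavior described for FIG. 5 further below (a checked resource returns to its original priority, a skipped one gains one point); the function name, the dictionary layout and the step size parameter are assumptions.

```python
def age_priorities(current, original, monitored, step=1):
    """Return the priorities for the next time interval.

    current:   dict name -> priority used in the interval just ended
    original:  dict name -> priority originally assigned by the RMS
    monitored: set of names that were checked in the interval just ended
    """
    updated = {}
    for name, priority in current.items():
        if name in monitored:
            updated[name] = original[name]   # checked: fall back to original value
        else:
            updated[name] = priority + step  # not checked: raise the priority
    return updated


# R6 could not be checked at priority 3 and is raised to 4, as in the FIG. 5 example.
next_priorities = age_priorities(
    current={"R5": 3, "R6": 3},
    original={"R5": 3, "R6": 3},
    monitored={"R5"},
)
```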

FIG. 5 shows such a time sequence for monitoring. The following table 2 contains the resources R1 to R7, their respective original priorities transferred from the Reliant Management Service RMS to the detector framework DFW, and the execution times taken from a configuration file. Table 2:

As a requirement for the instance DFW, it was specified that resources are checked only within 450 ms in a time interval of 3 s. Provision is also made not to check resources with a priority lower than the value 3 in the time interval. This means that additional processes that are executed on the node receive more computing capacity. After a while, the list L3 shown in FIG. 5 results. The detector for the resource R1, with its priority NP and its duration of 10 ms, was only started once and continues to run in the background. R1 is a resource for which a "non-polling" detector is provided. This detector is started and waits for a message from the linked resource. In contrast to "polling" detectors, it does not actively query the resource. This means that hardly any computing time is used. As soon as the detector for R1 receives a message from the resource, it can be ended again by the detector framework DFW.
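The difference between a non-polling detector and a polling one can be illustrated with a thread that blocks on an event instead of querying in a loop. This is a sketch under the assumption that the resource can signal the detector; `non_polling_detector` and the message texts are hypothetical.

```python
import threading


def non_polling_detector(message_arrived, report, timeout_s=None):
    """Sleep until the resource signals, then report once and end."""
    if message_arrived.wait(timeout_s):   # blocks without busy-waiting
        report("message received")
    else:
        report("timeout")


messages = []
signal = threading.Event()
detector = threading.Thread(
    target=non_polling_detector, args=(signal, messages.append, 5.0)
)
detector.start()

signal.set()      # the resource sends its message
detector.join()   # the detector reports once and ends
```

While waiting, the thread uses practically no computing time, which matches the statement that such a detector can keep running in the background across time intervals.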

The resource R2 has the highest priority 5 with a duration of 30 ms and is linked to the associated monitoring detector and executed. At the same time, the control device KE of the detector framework DFW links the resources R3 and R4, which also have priority 5, to existing subprocesses from its list, transfers the parameter sets of the resources to the dynamic library provided for monitoring, and executes the threads. The resource R5 with its priority 3 can also be monitored within the time interval. The resource R6 with the same priority has an execution time of 100 ms, so monitoring it as well would exceed the prescribed time period of 450 ms.

In contrast, the execution time for the monitoring detector of the resource R7 is only 50 ms. However, resource R7 is not monitored because of the requirement that only resources with a priority of at least 3 be checked. Thus, the resources R1 to R5 are actively monitored during the time interval of 3 seconds. The total time required for monitoring is the sum of the individual execution times, 400 ms in total. However, the point within the time interval at which the monitoring is carried out is not fixed. The operating system scheduler takes on this task.

The detector framework is only required not to exceed, on average, the 450 ms duration in a time interval of 3 s, that is, to use no more than 15% of the available computing time for monitoring.
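The selection for the first time interval can be reproduced with a short sketch. The priorities of R2 to R6 and the durations of R2, R6 and R7 come from the text. The durations of R3, R4 and R5 and the priority of R7 are assumptions, chosen only so that the totals agree with the figures given above. Skipping a resource that no longer fits and continuing with the next one is likewise an assumed policy. R1 is omitted because its non-polling detector is started only once.

```python
def select_for_interval(resources, budget_ms, min_priority):
    """Pick resources by descending priority while the time budget allows."""
    selected, used_ms = [], 0
    # sorted() is stable, so resources of equal priority keep their list order.
    for r in sorted(resources, key=lambda r: -r["priority"]):
        if r["priority"] < min_priority:
            continue                  # below the threshold for this interval
        if used_ms + r["duration_ms"] > budget_ms:
            continue                  # would exceed the budget, has to wait
        selected.append(r["name"])
        used_ms += r["duration_ms"]
    return selected, used_ms


interval_ms = 3000
budget_ms = interval_ms * 15 // 100   # 15 % of 3 s gives 450 ms

resources = [
    {"name": "R2", "priority": 5, "duration_ms": 30},    # from the text
    {"name": "R3", "priority": 5, "duration_ms": 40},    # duration assumed
    {"name": "R4", "priority": 5, "duration_ms": 200},   # duration assumed
    {"name": "R5", "priority": 3, "duration_ms": 120},   # duration assumed
    {"name": "R6", "priority": 3, "duration_ms": 100},   # from the text
    {"name": "R7", "priority": 2, "duration_ms": 50},    # priority assumed
]
selected, used_ms = select_for_interval(resources, budget_ms, min_priority=3)
```

With these inputs the sketch selects R2 to R5 and uses 390 ms. Adding the 10 ms for starting the detector of R1 gives the 400 ms stated above, while R6 has to wait because it would raise the total to 500 ms.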

The new time interval begins after 3 seconds, and the DFW instance again starts the monitoring detectors for the resources now due. The resource R1 continues to run. A subprocess with the monitoring detector for resource R2 is also started because of its high priority 5. Because resource R3 was checked in the previous time interval, the priority of resource R3 in list L4 is reduced again to its original value 3. Since sufficient time is available, the control device KE of the detector framework DFW links the resource to a free thread again and carries out the monitoring.

After monitoring of resource R4 in the previous time interval, resource R4 now receives the original priority value 1 again. The same applies to resource R5. Since a check of resource R6 was not possible due to the lack of time in the previous time interval, the detector framework DFW increases the priority of resource R6 by one point to the value 4. Monitoring is now also carried out here. The total time for monitoring is now 170 ms.
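The priority changes between two time intervals follow a simple rule, sketched below. The dictionary layout and the function name are hypothetical, and the step of one point per missed interval is taken from the example of R6. The priority values are those given in the text for R2, R3, R4 and R6.

```python
def next_priorities(resources, checked):
    """Compute the priorities for the next time interval.

    resources: dict mapping a name to {"original": int, "current": int}
    checked:   set of names that were monitored in the interval just ended
    """
    updated = {}
    for name, r in resources.items():
        if name in checked:
            current = r["original"]      # reset after a completed check
        else:
            current = r["current"] + 1   # raise the priority of waiting resources
        updated[name] = {"original": r["original"], "current": current}
    return updated


first_interval = {
    "R2": {"original": 5, "current": 5},
    "R3": {"original": 3, "current": 5},
    "R4": {"original": 1, "current": 5},
    "R6": {"original": 3, "current": 3},
}
second_interval = next_priorities(first_interval, checked={"R2", "R3", "R4"})
```

In this example R3 and R4 fall back to 3 and 1, R2 stays at 5, and R6 rises to 4, so that R6 is selected in the second interval.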

In this exemplary embodiment, a resource to be monitored is started only once per time interval. However, it is possible, for example, to check resource R2 several times within the time period of 450 ms. Furthermore, in this embodiment, the priority value is linked to the time of the last execution. The priority is increased each time the resource was not monitored.

The resources are often represented by data structures within the cluster's memory. These can be read by monitoring detectors which are formed by the dynamic libraries. This is particularly useful if the resources have different types of monitoring.

The second interface S2 to a user interface makes it possible to issue commands for the immediate checking of a resource of the node. Furthermore, the configuration file of the instance DFW can also be read in again in order to implement dynamic changes.

The device and the method according to the invention make it possible to no longer run a number of monitoring detectors independently of one another, but instead to execute them as a function of one another. In this case, a monitoring detector is executed once, and during its execution the monitoring detector itself can check the resource to be monitored several times. It is possible to check several different aspects of the resource and to return a final overall status message.
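A single detector run that checks several aspects and returns one overall status could look roughly like this. The check names, the status values "online" and "faulted" and the resource fields are hypothetical and serve only to illustrate the aggregation.

```python
def run_detector(resource, checks):
    """Run every check once and fold the results into one status message."""
    failed = [name for name, check in checks if not check(resource)]
    return {
        "resource": resource["name"],
        "status": "faulted" if failed else "online",
        "failed_checks": failed,
    }


# Hypothetical aspects of a mass-storage resource.
checks = [
    ("mounted", lambda r: r["mounted"]),
    ("writable", lambda r: r["writable"]),
    ("free_space", lambda r: r["free_mb"] > 100),
]

report = run_detector(
    {"name": "M1", "mounted": True, "writable": True, "free_mb": 50},
    checks,
)
```

Here the detector is executed once, evaluates three aspects of M1, and reports a single status that names the failed free-space check.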

Furthermore, changes can be introduced dynamically without having to switch off the high-availability software or a monitoring tool such as the RMS. The shared "Detector Framework DFW" enables particularly effective and time-saving programming through shared functions. The existing computing time of a node is used optimally, and the framework also reacts dynamically to changes in the available computing time.

LIST OF REFERENCE NUMBERS

RMS: Reliant Management Service

D1, D2, D3, D4: monitoring detectors

APL: highly available application

C, C2: nodes

N1: network

M1, M2: mass storage

S1, S2: interface

CLI: user interface

DFW: Detector Framework, monitoring device

KE: control device

KE1: Selection means

KE2: Means of assignment

KE3: means of execution

Y.so, Z.so, X.so: dynamic libraries

R1, ..., R11: detectors assigned to resources

M1, M2, X, DB: resources

TH1, ..., TH6: subprocesses, threads

T: time for monitoring

L1, L3, L4: lists

P1, P2: configuration files

P: priority

ZI: duration

I: time interval

Claims

1. Monitoring device (DFW) in a node (C) of a cluster system for monitoring at least two resources (M1, M2, DB, X) to be monitored on the node (C) of the cluster system, wherein a dynamic priority (P) can be assigned to the resources (M1, M2, X, DB), comprising:

- A selection means (KE1) for selecting a resource from the at least two resources to be monitored (M1, M2, X, DB) on the basis of the priorities (P) assigned to the resources to be monitored;

- At least one monitoring detector (D1, D2, D3, D4) which is designed for the type of monitoring of the resource to be monitored (M1, M2, X, DB);

- A means (KE2) for assigning the monitoring detector (D1, D2, D3, D4) to the resource to be monitored (M1, M2, X, DB);

- A means (KE3) for executing the monitoring detector (D1, D2, D3, D4), which is designed such that the execution of the monitoring detector (D1, D2, D3, D4) is ended after the monitoring detector (D1, D2, D3, D4) has monitored the resource to be monitored once.

2. Device according to claim 1, characterized in that the selection means (KE1) comprises a list (L1) in which the at least two resources to be monitored (M1, M2, X, DB) are stored in an order determined by their assigned priority.

3. Device according to one of claims 1 to 2, characterized in that the resource (M1, M2, X, DB) to be monitored by the monitoring detector (D1, D2, D3, D4) is assigned an average execution time (T) which the monitoring detector (D1, D2, D3, D4) requires for monitoring the resource (M1, M2, X, DB).

4. Device according to claim 3, characterized in that the monitoring device (DFW) is designed to determine the average execution time required for the monitoring (T).

5. Device according to one of claims 3 to 4, characterized in that a fixed time period (ZI) per time interval (I) is provided, the mean execution time (T) of the at least one monitoring detector (D1, D2, D3, D4) being less than the specified time period (ZI).

6. Device according to one of claims 1 to 5, characterized in that the at least one monitoring detector (D1, D2, D3, D4) is designed as an independently executable program.

7. Device according to one of claims 1 to 5, characterized in that the at least one monitoring detector (D1, D2, D3, D4) is designed as a function of a dynamic library or as a dynamic library (X.so, Z.so).

8. Device according to one of claims 1 to 7, characterized in that the monitoring device (DFW) has at least one sub-process (TH1) that is executed on the node (C) separately from the resource to be monitored, and the means (KE3) for execution is designed to link the monitoring detector (D1) required for the resource to be monitored with the sub-process (TH1).

9. Device according to one of claims 1 to 8, characterized in that the monitoring device (DFW) has a first interface (S1), which is coupled to the at least one monitoring detector (D1, D2, D3, D4) and which is designed to emit status messages from the monitoring detector (D1, D2, D3, D4).

10. Device according to one of claims 1 to 9, characterized in that the monitoring device (DFW) has a second interface (S2) which is designed for receiving user commands.

11. Device according to one of claims 1 to 10, characterized in that the resource to be monitored (M1, M2) is an integration node within a file system of the node (C) of the cluster system.

12. Device according to one of claims 1 to 10, characterized in that the resource to be monitored (X) is a program or a database (DB) or a network connection (N1).

13. Device according to one of claims 1 to 12, characterized in that the monitoring device (DFW) is designed to receive resources to be monitored via an interface (S1, S2).

14. Device according to one of claims 1 to 13, characterized in that the monitoring device (DFW) is an independent process.

15. A method for monitoring at least two resources (M1, M2) on a node (C) of a cluster system, wherein a dynamic priority (P) can be assigned to the at least two resources, in which:

a) one of the at least two resources (M1, M2) to be monitored is selected on the basis of the assigned priority (P);

b) a monitoring detector (D1, D2) required for the monitoring is selected for the resource (M1, M2) to be monitored;

c) the selected monitoring detector (D1, D2) is assigned to the resource to be monitored;

d) the monitoring detector is executed and terminated after a single monitoring of the resource to be monitored;

e) the result of the monitoring carried out by the monitoring detector is reported.

16. The method according to claim 15, characterized in that in step c) the assignment is made by a parameter transfer of the resource to be monitored to the monitoring detector.

17. The method according to any one of claims 15 to 16, characterized in that the priority (P) is formed by a numerical value.

18. The method according to any one of claims 15 to 17, characterized in that a first time period (ZI) is defined in a time interval (I) for the monitoring of the resources to be monitored, at least steps c) to e) only being carried out if the specified time period (ZI) is not exceeded on average.

19. The method according to claim 18, characterized in that the first time period (ZI) is determined by a percentage value of an available computing capacity.

20. The method according to any one of claims 15 to 19, characterized in that for the selection a list (L1) is generated in which the resources to be monitored (M1, M2) are stored in the order of their priorities (P).

21. The method according to any one of claims 18 to 20, characterized in that the priority (P) of a resource to be monitored is increased if there is no monitoring of the resource to be monitored in the first period (ZI).

22. The method according to any one of claims 15 to 21, characterized in that an execution time (T) for monitoring a resource to be monitored is assigned by the monitoring detector to the resource to be monitored.

23. The method according to claim 22, characterized in that the execution time (T) for monitoring a resource to be monitored is determined by the execution of the monitoring detector required for the monitoring.

24. The method according to any one of claims 15 to 23, characterized in that at least one idle sub-process (TH1) is started, which is linked to the monitoring detector in step c) and is released again by the monitoring detector after step d) has ended.

25. The method according to any one of claims 15 to 24, characterized in that an interface (S1) is provided through which a user carries out a monitoring of a resource by a monitoring detector.

26. The method according to any one of claims 15 to 25, characterized in that the monitoring detector is designed as a function of a dynamic library or as a dynamic library.