WO2007009454A2

WO2007009454A2 - Malfunction detection method

Info

Publication number: WO2007009454A2
Application number: PCT/DE2006/001287
Authority: WO
Inventors: Ruppert Koch
Original assignee: Ruppert Koch
Priority date: 2005-07-22
Filing date: 2006-07-24
Publication date: 2007-01-25
Also published as: WO2007009454A3

Abstract

The invention relates to a malfunction detection method for a system of communicating components. According to said method, signals are repeatedly transmitted by a first component in order to be received by at least one other component, a system status is determined, and malfunctions are assumed to occur when reception of the signals is at least excessively delayed regarding said system status. According to the invention, individual conditions that influence the system status are determined, signal delays which are tolerable in accordance with individual conditions are defined, and a malfunction is assumed to occur in response to an at least excessively delayed reception regarding the signal delays that are tolerable in accordance with the individual conditions.

Description

Title: Malfunction detection method

The present invention relates to the subject matter claimed in the preamble and is thus concerned with the detection of malfunctions in arrangements with multiple components communicating with one another.

In complex systems, such as those realized in particular by networked data processing systems, there are many applications in which proper functioning must be ensured. These include, for example, computer farms for Internet trading platforms, large database applications, mission-critical military or space installations, etc. Critical in this context are both the complete failure of one or more components and the temporary or permanent failure to deliver mandatory or guaranteed system services. It is common practice to take precautions for responding to a failure in such cases. For example, RAID systems have long provided several hard disks so that the data remains immediately available if a single disk fails; in more complex systems, reserve computers can be provided that are switched in when needed and take over the operations previously performed by the failed computers.

It is known to monitor for defects and failures by having a first component regularly send signals to the outside and checking at a second component whether these regular signals, which are also referred to as a "heartbeat", are received in good time. If the signals are absent for too long or do not reach the receiver unit at all, it is concluded that a malfunction must be present, and a response can be made as necessary.

However, in complex systems heartbeat signals will never arrive perfectly evenly. Rather, analysis of the arrival times of heartbeat signals that should arrive at regular intervals shows that considerable variations can occur without an actual malfunction being present in a device.

On the one hand, it is desirable to be able to react quickly to a failure. Consequently, even though incoming heartbeat signals are delayed with a certain statistical probability, it is not possible to wait arbitrarily long before a replacement system is activated. Short reaction times therefore require particularly tight limits on the time-out tolerances. On the other hand, activating a backup system is expensive, for example because certain data is not completely available there, processes running on the failed system must be restarted, etc. To keep these costs low, it is desirable to set the tolerable time-outs as large as possible. The requirements for systems that react as quickly as possible to malfunctions are thus diametrically opposed to those for systems that trigger no false alarms.

To bring about improvements, arrangements have already been proposed in which the permissible delay of expected heartbeat signals is adapted to external parameters such as the age of a transmitting or monitored system, its temperature as a measure of the current workload, etc. Reference is made in this regard in particular to US Pat. No. 6,782,496, US Pat. No. 5,699,511 and US Pat. No. 6,590,868. Reference is further made to the following documents: US Pat. Nos. 6,446,225; 5,682,470; 5,742,624; 6,360,333; 6,393,581; 5,978,939; 6,199,018; 6,037,868; 6,590,868; 6,782,496; 6,363,496; 6,199,069; 5,699,511; 6,820,221; 6,728,781; 6,687,847; 6,782,489; 6,370,656; THE RESEARCH OF FAULT DETECTION BASED ON HBM IN GRID COMPUTING ENVIRONMENT by Yu Hong and Shoubao Yang, Department of Computer Science and Technology, University of Science and Technology of China; CLUSTERING IN IPSO FAQ, Copyright 2002 Nokia; HARDWARE AND SOFTWARE ERROR DETECTION by Ravishankar K. Iyer, Zbigniew Kalbarczyk, Center for Reliable and High-Performance Computing, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign; PROCEEDINGS OF THE 4th ANNUAL LINUX SHOWCASE & CONFERENCE, Atlanta, Georgia, USA, October 10-14, 2000; DUPLEX: A REUSABLE FAULT TOLERANCE EXTENSION FRAMEWORK FOR NETWORK ACCESS DEVICES by Srikant Sharma, Jawu Chen, Wei Li, Kartik Gopalan, Tzi-cker Chiueh, Computer Science Department, Stony Brook University; SCALABLE SECURE GROUP COMMUNICATION OVER IP MULTICAST by Suman Banerjee, Bobby Bhattacharjee, Department of Computer Science, University of Maryland, USA; AN ARCHITECTURAL FRAMEWORK FOR PROVIDING RELIABILITY AND SECURITY SUPPORT by N. Nakka, Z. Kalbarczyk, R. K. Iyer, Coordinated Science Laboratory, University of Illinois at Urbana-Champaign; GOSSIP-STYLE FAILURE DETECTION SERVICE by Robert van Renesse, Yaron Minsky and Mark Hayden, Department of Computer Science, Cornell University, Ithaca, USA; ERROR DETECTION TECHNIQUES & ERROR CORRECTING CODES, Industrial Information Technology Laboratory, Helsinki University of Technology.

However, despite this already extensive prior art, there remains a need to respond rapidly to malfunctions while avoiding unnecessary false alarms.

The object of the present invention is to provide something new for commercial application.

The solution to this object is claimed in independent form. Preferred embodiments can be found in the subclaims.

The present invention thus proposes, in a first aspect, a malfunction detection method for an arrangement of communicating components in which signals are repeatedly sent from a first component for reception by at least one other component, an arrangement state is determined, and malfunctions are assumed when reception is at least excessively delayed with respect to that state, wherein individual conditions influencing the arrangement state are determined, signal delays that are tolerable in accordance with the individual conditions are defined, and a malfunction is assumed in response to a reception that is at least excessively delayed with respect to the signal delays tolerable in accordance with the individual conditions.

It is thus essential first of all to recognize that erroneous shutdowns can be significantly reduced, without unduly extending the waiting time, if it is determined in detail for individual conditions how large an effect on a monitoring signal delay is still tolerable for each of them. The invention is based not only on the recognition that different individual conditions affect the delay of the monitoring signal differently, but also on the recognition that careful examination of the individual conditions readily allows the awaited signal delay to be limited without reducing overall safety. In a computer network, for example, heartbeat signal delays can be tolerated far more readily, without having to assume a malfunction of the heartbeat-emitting component, when the load on the network components such as switches and routers is currently so high that individual data packets can be transmitted only with delay or not undisturbed at full bandwidth. By contrast, a delay that occurs while the network is still undisturbed, the workload is still low, and the heartbeat-transmitting component simultaneously shows an excessive operating temperature indicates a very critical state; delays are not readily acceptable here because they point to a system failure. By monitoring both the operating temperature of the heartbeat-transmitting component and the network load in this example, the two cases can be distinguished, and a rapid shutdown can be triggered on a delay that occurs when only excessive temperatures are present. In the prior art, by contrast, at the same temperature, which may here indicate a critical operating state, it would still be necessary to wait for the delays that are tolerable per se under high network load, since no distinction is made between the different individual conditions that can lead to signal delays.
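The distinction between the two cases can be illustrated with a minimal Python sketch. The class `Conditions`, the thresholds and the scaling factors are hypothetical values chosen purely for illustration and are not taken from the description.

```python
from dataclasses import dataclass

BASE_TIMEOUT_S = 1.0   # assumed baseline tolerance in seconds
HOT_C = 80.0           # assumed threshold for "excessive" temperature
LOW_LOAD = 0.3         # assumed threshold for "low" load on a 0..1 scale


@dataclass
class Conditions:
    network_load: float     # 0.0 (idle) .. 1.0 (saturated), hypothetical scale
    sender_temp_c: float    # operating temperature of the monitored sender
    sender_cpu_load: float  # 0.0 .. 1.0


def tolerable_delay(c: Conditions) -> float:
    """Return the heartbeat delay tolerated under the given conditions."""
    # A loaded network is a harmless explanation for late heartbeats,
    # so it widens the tolerance.
    timeout = BASE_TIMEOUT_S * (1.0 + 2.0 * c.network_load)
    # A hot sender with little work and a quiet network has no harmless
    # explanation for a delay, so the tolerance is tightened instead.
    if (c.sender_temp_c > HOT_C
            and c.sender_cpu_load < LOW_LOAD
            and c.network_load < LOW_LOAD):
        timeout = 0.5 * BASE_TIMEOUT_S
    return timeout


def is_malfunction(delay_s: float, c: Conditions) -> bool:
    """Assume a malfunction when the delay exceeds the tolerated value."""
    return delay_s > tolerable_delay(c)
```

With these assumed numbers, the same two-second delay is tolerated on a congested network but treated as a malfunction for an overheated, lightly loaded sender on a quiet network.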

It should be noted that the malfunction detection method of the present invention can in principle be used in all areas where different influences act on the transmission of a monitoring signal and the detection of a malfunction is critical. This may be the case, for example, with analog radio systems in the military sector or the like. Particularly preferred, however, is the application to digital and/or data processing devices, where entire networks can be monitored with the malfunction detection method of the present invention, as can parts of networks or individual data processing devices with redundant components in particular, such as hard disk arrangements intended to safeguard against failure. The application by no means has to be purely a computer application; overall arrangements can also be monitored in which, for example, only signal-emitting components are present besides a receiver. "Fly-by-wire" control systems in aircraft and the like may be mentioned as an example of a non-computer application.

The method is particularly preferred if the individual conditions which result in a delay of a monitoring signal between the desired transmission time and the reception at the signal receiver differ in terms of the delay they cause. The arrangement may in particular vary with regard to the connecting signal paths, the utilization of components in the signal paths, etc.

It should be noted that the invention itself is not limited to digital data processing systems, although transmitting the signals as digital data packets is particularly preferred. This makes it possible, in a particularly simple manner, to monitor a multiplicity of transmitting components, whether mission-critical, required to be fail-safe, or to be monitored for other reasons, by means of a single monitoring receiver. The use of digital data packets is also preferred because they can be used to record and transmit conditions that the receiver could not otherwise take into account, such as the workload of the transmitter, the network load it observes, its age, its operating voltage, or the like.

Further, the digital data packets may not only be encoded to indicate that they represent heartbeat signals as such, but they may also carry additional information, which is also highly preferred. Thus, they can be provided with a transmitter identification, which in particular allows to monitor a plurality of components by one and the same monitoring receiver, since this can now separate the corresponding heartbeat signals from different transmitters.

Furthermore, the digital data packets can be provided with a transmission identifier, in particular a packet number, that is, a number that is typically assigned sequentially and changes from heartbeat signal to heartbeat signal so that packet losses can be identified. In addition, the digital data packets can carry further information, for example an actual send time or a target send time. The actual send time can be specified as the time of the actual transfer of a data packet from a computer to a network cable, provided that the corresponding interface modules are designed to add this information to the data packets. Alternatively, the time at which a CPU in the computer processes the data packet can be recorded in it. Additionally or alternatively, the target send time can be packed into the data packet.

If only the target send time is packed into the data packet, the receiver can optionally analyze whether the sender is attempting to send data packets at the intervals that are specified or assumed at the receiver. If both the actual and the target send time are encoded in the data packets, delays in sending can be detected. Delayed heartbeat transmission can occur, for example, if the scheduler in the operating system of the heartbeat-transmitting computer first has other tasks executed, or if unavoidable tasks with very high priority must be processed. Optionally, information indicating the priority of heartbeat transmission relative to the other tasks the computer has to process can also be written into the heartbeat data packet. This avoids shutting down a computer system merely because very urgent mission-critical tasks have to be completed without delay.
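A heartbeat packet carrying the fields discussed above might look as follows. This is only a sketch: the wire layout, the magic value `HBT1` and all field names are assumptions made for illustration, since the description does not prescribe a packet format.

```python
import struct
from dataclasses import dataclass

# Hypothetical wire layout (network byte order, no padding):
# 4-byte magic, sender id, packet number, target send time,
# actual send time (both in seconds), scheduling priority.
_FORMAT = "!4sIIddB"
_MAGIC = b"HBT1"


@dataclass(frozen=True)
class Heartbeat:
    sender_id: int           # lets one receiver separate several senders
    seq: int                 # sequentially assigned packet number
    target_send_time: float  # when the heartbeat should have been sent
    actual_send_time: float  # when it was actually sent
    priority: int            # priority of heartbeat sending on the sender

    def encode(self) -> bytes:
        return struct.pack(_FORMAT, _MAGIC, self.sender_id, self.seq,
                           self.target_send_time, self.actual_send_time,
                           self.priority)

    @classmethod
    def decode(cls, data: bytes) -> "Heartbeat":
        magic, sender_id, seq, target, actual, prio = struct.unpack(_FORMAT, data)
        if magic != _MAGIC:
            raise ValueError("not a heartbeat packet")
        return cls(sender_id, seq, target, actual, prio)

    @property
    def sender_delay(self) -> float:
        """Delay introduced on the sender, e.g. by its scheduler."""
        return self.actual_send_time - self.target_send_time
```

Because both times travel in the packet, the receiver can attribute part of an observed delay to the sender rather than to the network.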

Providing a packet number is also preferred, especially if no actual send time, which can be encoded as relative local computer time or as absolute time on a common time base, is transmitted as well. The packet number nevertheless allows the receiving computer to check whether the heartbeat signals arrive completely, in sequence, and/or in a changed order. While such changes in packet order are not expected in analog systems, they can easily occur when computers are monitored over extended networks, since extensive networks usually contain routers and switches that may buffer data packets internally in order to transmit previously received data first. If not all data packets always follow the same path, for example because the load on data transmission paths changes during transmission owing to other subscribers, older packets may still be waiting for transmission in the buffer memories (stacks) of heavily loaded or overloaded routers and switches while newer packets are already being delivered to the addressed receiving computer via a less loaded path. In addition, in congested networks individual packets sent on time may be dropped along the way rather than delivered. This can also happen, for example, to packets recognized as corrupted.

It is noteworthy that the packet number can shorten the shutdown times without leading to false conclusions. Heartbeat signals are sent to show the receiver that the sending computer is still active. If, for example, the fifth packet is received after the seventh packet has already been received, the receiving computer learns only that the computer, which was later still able to send the seventh packet, was active when the fifth packet was sent. This is not information that should lead to the assumption that the sending computer was still active after sending the seventh packet, which was sent later and is the most recent one received. Therefore, a timer that is reset whenever a new heartbeat signal is received, and that outputs a warning signal for a presumed malfunction when heartbeat signals stay away too long, does not need to be reset on receipt of such obsolete heartbeat signals. Beyond completely ignoring obsolete heartbeat signals, however, a further reaction can be initiated. Receiving heartbeat signals that are obsolete in themselves is an indicator that network load is currently high. This information can be used to extend the time-outs after which a transmitting computer is assumed to have a malfunction. Alternatively and/or additionally, the waiting time for heartbeat signals can be extended if non-consecutive receipt of numbered heartbeat packets, the absence of heartbeat signals, or a faulty packet indicates that network load is currently high.
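The handling of obsolete heartbeat packets can be sketched as follows. The class name `HeartbeatWatchdog` and the `congestion_factor` are hypothetical; the sketch merely assumes that an obsolete or non-consecutive packet number widens the time-out by a fixed factor.

```python
class HeartbeatWatchdog:
    """Timer that ignores obsolete heartbeats but learns from them."""

    def __init__(self, base_timeout: float, congestion_factor: float = 1.5):
        self.base_timeout = base_timeout
        self.congestion_factor = congestion_factor  # assumed widening factor
        self.timeout = base_timeout
        self.highest_seq = -1
        self.last_fresh = None  # arrival time of the newest fresh heartbeat

    def on_heartbeat(self, seq: int, now: float) -> bool:
        """Process a packet; return True if it reset the timer."""
        if seq <= self.highest_seq:
            # Obsolete packet: proves nothing about the sender after the
            # newest packet was sent, so the timer is NOT reset. It does
            # hint at network congestion, so the time-out is widened.
            self.timeout = self.base_timeout * self.congestion_factor
            return False
        if self.highest_seq >= 0 and seq > self.highest_seq + 1:
            # Gap in the numbering: packets were lost or are still in transit.
            self.timeout = self.base_timeout * self.congestion_factor
        else:
            # Consecutive packet: fall back to the base time-out.
            self.timeout = self.base_timeout
        self.highest_seq = seq
        self.last_fresh = now
        return True

    def expired(self, now: float) -> bool:
        """True when no fresh heartbeat arrived within the time-out."""
        return self.last_fresh is not None and now - self.last_fresh > self.timeout
```

In this sketch the widened time-out is still measured from the last fresh heartbeat, so a late fifth packet extends the tolerance without pretending that the sender was alive after the seventh.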

Moreover, if the receiving monitoring computer monitors more than one remote component or computer over the network, an adjustment may also be made for all the other computers. If a plurality of heartbeat-transmitting units are arranged close to each other and the monitoring computer is positioned far away from them, it can be assumed that the network delay is approximately identical for all transmitting computers. In such a case, an adjustment can be made that is identical for all sending computers. If, on the other hand, the monitored heartbeat-transmitting computers are widely distributed, it is reasonable to assume that only part of the transmission network between the receiving computer and the other components is overloaded. The adaptation can thus, where appropriate, take the locations of the other sending computers into account.

It is preferred if the heartbeat signal repetition itself is adjusted to one or more individual conditions. These can be device-specific individual conditions or externally determined conditions. Devices, for example, have a high probability of failure shortly after commissioning, because certain components are already defective when produced. If, on the other hand, no faults or malfunctions occur during the first phase of operation, a computer system such as a server is highly likely to operate without problems for a long time, until wear-related failures occur, for example due to worn fan bearings, hard disk bearings, drying-out electrolytic capacitors, etc. A heartbeat repetition frequency adapted to such individual conditions can thus initially be very high, for example during the first weeks after commissioning, in order to detect a failure that is very likely there very quickly. If no failure occurs, the heartbeat repetition rate can be reduced, since malfunctions can now occur primarily as a result of operating system instabilities, power failures or the like, which is less likely overall than a rapid early failure. After a longer period of operation, the heartbeat repetition frequency can then be increased again towards the end of the mean expected lifetime.
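The age-dependent repetition rate described above can be written as a small function. All numbers below (burn-in period, expected lifetime, the 80 % wear-out point and the two intervals) are illustrative assumptions, not values from the description.

```python
def heartbeat_interval(age_days: float,
                       burn_in_days: float = 30.0,
                       expected_life_days: float = 1825.0,
                       wear_out_fraction: float = 0.8,
                       fast_s: float = 0.5,
                       slow_s: float = 5.0) -> float:
    """Return the target interval between heartbeats for a device of a given age.

    Early failures are likely right after commissioning, and wear-related
    failures are likely near the end of the expected lifetime, so heartbeats
    are sent more often (shorter interval) in both phases.
    """
    if age_days < burn_in_days:
        return fast_s   # early-failure phase: high repetition frequency
    if age_days > wear_out_fraction * expected_life_days:
        return fast_s   # wear-out phase: high repetition frequency again
    return slow_s       # stable phase: reduced repetition frequency
```

A step function is the simplest choice here; a smooth curve derived from observed failure statistics would follow the same idea.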

In addition, it is also possible to adjust the heartbeat signals to operating conditions such as the operating temperature of the transmitting computer, its load, accesses to external memory, the CPU load, the latency of the heartbeat signal scheduler, etc. This makes it possible to relieve the computer when it is so heavily loaded by other important tasks that heartbeat signals are known to be delayed anyway; reducing the heartbeat signal frequency is useful here provided the respective conditions do not typically lead to a failure of the system. Alternatively, such conditions may also be used to increase the heartbeat repetition rate, for example when certain tasks to be performed are expected to block the operating system completely and/or slow it down critically.

In principle, the target time between two heartbeat signals can be changed from emission to emission. However, it is preferable not to change these times too often. In particular, this makes it easier to build up reception-time or delay-time statistics at the receiving computer. In addition, it is preferred if the nominal repetition interval of the heartbeat signals is communicated. On the one hand, the transmitter can announce it, for example in view of its current load, temperature, the age of its components, etc.; on the other hand, the receiver can command a target repetition interval, which also allows the interval to be adapted to individual conditions not directly related to the transmitter, such as the need to monitor a large number of computer systems simultaneously with the receiver, varying network loads, and the like.

If the heartbeat-signal-receiving components are computer systems whose operating system has a prioritizing scheduler, that is, a unit that can assign different priorities to different tasks in multitasking, multithreading or hyperthreading operation, it is preferred to assign the processing of incoming messages in the receiver a higher priority than the processing of the timers. This is advantageous in order to avoid erroneous shutdowns when heartbeat signals have already been received but are still waiting in the receiver's buffer for data obtained from the network connection. By ensuring that, before a time-out routine is processed, all data is first read from the buffer for received network data and at least checked for new heartbeat signals, it is avoided that a malfunction is assumed merely because the receiver could not process received heartbeat signals in time.

Since the processing of the information stored in the buffer for received network data can be restricted exclusively to checking whether it contains heartbeat signals, this extra processing results in only a very small delay. This delay, which occurs when there is no heartbeat signal and the monitored system may actually be malfunctioning, is readily acceptable because the likelihood of having to perform an unnecessary shutdown is significantly reduced. The preferred guarantee that the receiver processes all received messages from the monitored transmitter before executing a time-out routine thus eliminates the receiver scheduler latency.

In a practical implementation, this can preferably be described as follows. A monitored transmitter sends a heartbeat signal at the absolute time $t_s$ and the receiver receives it at the absolute time $t_r$. The difference between the transmission and the reception time is then the heartbeat delay $t_d = t_r - t_s$. The delay $t_d$ consists of three components. The first is the scheduler delay in the transmitter, $t_{\text{sender delay}}$, the absolute difference between the target time at which the heartbeat signal is to be generated and the actual time at which the heartbeat signal is generated by the transmitter. The second is the network-related component $t_{\text{network delay}}$, that is, the time due to delays in switches and routers, the signal transmission along the various cables and other transmission paths, etc. The third is the receiver scheduler delay $t_{\text{receiver delay}}$, that is, the absolute time difference between the time at which the heartbeat signal is available for processing in the memory of the receiver and the time at which the receiver actually processes that heartbeat signal or evaluates its absence. It therefore holds that

$$t_d = t_r - t_s \quad\text{and}\quad t_d = t_{\text{sender delay}} + t_{\text{network delay}} + t_{\text{receiver delay}}$$

The situation may now arise that the heartbeat signal is available for processing at the receiver at a time $t_r$, but the time-out expires before the received data is actually processed, that is, the following condition applies:

$$t_r < t_{\text{timeout}} < t_r + t_{\text{receiver delay}}$$

If the processing of the time-out timer now has a higher priority than the processing of incoming messages, the receiver will regard the transmitter as faulty even though the heartbeat signal was received in time. In the prior art, this can at best be avoided by setting the time-out to

$$t_{\text{timeout}} > \max(t_{\text{sender delay}}) + \max(t_{\text{network delay}}) + \max(t_{\text{receiver delay}})$$

By processing received messages with a higher priority than the time-out timer, erroneous shutdowns that are merely due to already received heartbeat signals not having been processed no longer occur. In other words, $t_{\text{timeout}}$ only needs to be set to

$$t_{\text{timeout}} > \max(t_{\text{sender delay}}) + \max(t_{\text{network delay}})$$

It should be noted that, in the described procedure, the time-out timer is evaluated only once the previously received messages have been processed. This in itself leads to a slight delay compared with a case in which this order of execution is not observed. However, the delay caused by the change in the execution order will regularly be smaller than $\max(t_{\text{receiver delay}})$, so that the overall result is faster time-out signal generation together with a reduced probability of false shutdowns.
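The effect of the processing order can be sketched in a few lines of Python. The names `timeout_lower_bound` and `Receiver` are hypothetical, and the deque merely stands in for the operating system's network input buffer.

```python
from collections import deque


def timeout_lower_bound(max_sender: int, max_network: int,
                        max_receiver: int, drain_first: bool) -> int:
    """Lower bound for the time-out (any consistent time unit).

    When received messages are processed before the timer is evaluated,
    the receiver scheduler delay no longer has to be budgeted for.
    """
    bound = max_sender + max_network
    if not drain_first:
        bound += max_receiver
    return bound


class Receiver:
    def __init__(self, timeout: float):
        self.timeout = timeout
        self.inbox = deque()        # arrival times of unprocessed heartbeats
        self.last_heartbeat = 0.0   # newest processed heartbeat

    def deliver(self, arrival_time: float) -> None:
        """The network stack stores a heartbeat; it is not yet processed."""
        self.inbox.append(arrival_time)

    def sender_faulty(self, now: float, drain_first: bool = True) -> bool:
        """Evaluate the time-out, optionally draining the inbox first."""
        if drain_first:
            while self.inbox:
                self.last_heartbeat = max(self.last_heartbeat,
                                          self.inbox.popleft())
        return now - self.last_heartbeat > self.timeout
```

With assumed maxima of 20 ms, 50 ms and 30 ms for sender, network and receiver delay, the bound drops from 100 ms to 70 ms. A heartbeat that arrived at 0.9 s but was not yet processed no longer triggers a false alarm at 1.2 s when the inbox is drained first.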

It should be noted that the preferred processing of the network input memory prior to evaluation of a timeout timer may, if necessary, require changes, albeit minor, in the operating system. Preferably, an operating system kernel is prepared for these additional tasks.

It should be noted that, although the present description largely speaks of sent signals being detected by one receiver, this does not mean that only a single receiver can ever be addressed. Rather, a heartbeat signal can be transmitted to a plurality of receivers. This has advantages if, for example, the monitoring receiver is itself at risk of failure. A shutdown can then take place as soon as one of the several receivers no longer receives heartbeat signals; alternatively, a shutdown can take place only when none of the addressed receivers receives signals any longer. This can be determined by the receivers addressed with the heartbeat signal communicating among themselves. In special cases, a shutdown could also take place when a majority or only some of the receivers can no longer receive signals. This can be appropriate even though signal reception at a few of the addressees still indicates that the transmitter is active, namely if the absence of signal reception at other addressees is known to be an indicator of a partial network failure, so that the communication channels to the station, which is still active but no longer reaches all target addressees, must be classified as too unreliable.
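The three shutdown rules for several receivers can be expressed as a small decision function. The enum `Policy`, the function name and the receiver names are hypothetical; how the receivers exchange their observations is left open here.

```python
from enum import Enum
from typing import Mapping


class Policy(Enum):
    ANY = "any"        # shut down as soon as one receiver misses heartbeats
    ALL = "all"        # shut down only when no receiver gets heartbeats
    QUORUM = "quorum"  # shut down when enough receivers miss heartbeats


def should_shut_down(missing: Mapping[str, bool], policy: Policy,
                     quorum: int = 1) -> bool:
    """Decide on a shutdown from per-receiver reports.

    `missing` maps a receiver name to True when that receiver no longer
    receives heartbeats from the monitored sender.
    """
    n_missing = sum(1 for is_missing in missing.values() if is_missing)
    if policy is Policy.ANY:
        return n_missing >= 1
    if policy is Policy.ALL:
        return len(missing) > 0 and n_missing == len(missing)
    return n_missing >= quorum
```

The quorum variant covers the special case in which a sender that still reaches a few receivers is nevertheless treated as too unreliable.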

The individual conditions taken into account in determining a maximum permissible heartbeat signal delay can be determined, on the one hand, from current or long-term measurements, for example by encoding the heartbeat sending time in the signal and accurately determining the heartbeat reception time in order to obtain the network delay. Alternatively and/or additionally, statistical data expected from "historical" observations over a longer period can be used. For example, a corporate network can be expected to carry higher network loads during core working hours than at weekends. Here, recourse to general observations can extend the time-out for the core times in order to avoid erroneous shutdowns. It should also be noted that several influencing parameters can be recorded to determine a single delay-relevant quantity such as the delay over a transmission network. In the above example, further corrections can be made in addition to the weekday dependence, for example if remote network maintenance on a large number of computers is expected to result in particularly high network loads. The latter is also an example of a predicted individual condition, or of an additional delay predicted from an individual condition.
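A calendar-based network allowance along these lines could be sketched as follows. The core hours (09:00 to 17:00 on weekdays) and all allowance values are assumptions for illustration; in practice they would come from historical observations.

```python
from datetime import datetime


def network_allowance(when: datetime, maintenance_expected: bool = False) -> float:
    """Return the extra heartbeat delay (seconds) tolerated for network load.

    Hypothetical values: 0.5 s during weekday core hours, 0.1 s otherwise,
    plus 1.0 s when remote maintenance on many computers is predicted.
    """
    is_weekday = when.weekday() < 5        # Monday is 0, Sunday is 6
    in_core_hours = 9 <= when.hour < 17
    allowance = 0.5 if (is_weekday and in_core_hours) else 0.1
    if maintenance_expected:
        # Predicted individual condition: planned maintenance traffic.
        allowance += 1.0
    return allowance
```

The weekday rule represents the historical observation, while the maintenance flag represents a predicted condition that corrects the same delay-relevant quantity.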

It should be mentioned, especially with regard to multicomponent systems, that mutual monitoring can advantageously take place. If the advantages of the present invention are to be used to avoid false shutdowns with greater certainty, rather than to shorten the tolerated time-outs at the same false-shutdown probability, it is particularly preferable to wait and see whether a monitored component whose heartbeat signal was not detected at a first monitoring receiver has still delivered signals to at least a second receiver. In such a case, a shutdown may be prevented, or the startup of a backup system may be terminated or interrupted. A malfunction can reasonably be assumed if the receipt of a heartbeat signal fails completely within the given time and/or even later; but even if heartbeat signals are still received, a malfunction can be assumed if the signal reception indicates that assured or required service properties can no longer be met. For example, when a computer is used in time-critical data streaming, as for transmitting audio and/or video data, and delays arise for whatever reason that result in improper data transmission, it may become necessary to switch in a reserve computer in order to achieve load distribution and/or to restore proper transmission of the data. In such applications it is accordingly possible to place an upper limit on the maximum time-out with regard to assured or required service characteristics, although this is not mandatory; if no real-time-critical or virtually real-time-critical applications are running, critical delays may also be determined exclusively with regard to statistical evaluations and failure probabilities.

It is possible and preferred that the individual conditions evaluated during malfunction detection comprise at least one from the group of: total transmitter load; transmitter component load, in particular CPU utilization of one or more processor units in a transmitter; CPU clock rate, if this is variable; memory usage and use of swap memory; CPU usage by certain processes; use or activity of the file system, i.e. of the hard disk components; network activity; scheduling latency; scheduling latency statistics of the monitored computer and/or of other monitored computers with comparable or similar tasks, or of the monitoring computer or computers; state of selected kernel activities; length of the network queue; heartbeat loss detection for distinguishable heartbeat signals; signal identifiers that were sent consecutively but received non-consecutively; receive buffer status at the receiver; current signal path utilization; signal path utilization behavior; receiver load; receiver subcomponent load. It should be noted that, where the above list refers explicitly to load states of the transmitter, load states of a receiver can equally be evaluated, unless the resulting delays in evaluating heartbeat signal reception can be avoided by suitable prioritization or other measures.

The individual conditions can be determined by defining permissible values and/or in response to measured individual conditions. Defining a desired value may, for example, result from the need to comply with certain service characteristics. Conditions actually present can be measured or determined initially, for example during commissioning of more complex systems, and/or be adapted over time.

It is particularly preferred if a tolerable signal delay is changed repeatedly and adapted, in particular, to changing individual conditions. In this way, it is possible to react to a gradual increase in network load, to temporarily fluctuating loads at the transmitter and/or receiver, or to fluctuating and thus more dangerous operating temperatures of components in non-air-conditioned environments, for example due to increased temperatures in summer.
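One conceivable way to adapt a tolerable delay continuously is an exponentially weighted estimate of the observed delays and their deviation, in the style of the retransmission timer used by TCP. This technique is not taken from the description; it is shown only as an assumed example, and the class name and the constants `alpha`, `beta` and `k` are illustrative.

```python
class AdaptiveTimeout:
    """Tolerable delay that follows slowly changing conditions."""

    def __init__(self, initial_delay: float, alpha: float = 0.125,
                 beta: float = 0.25, k: float = 4.0):
        self.mean = initial_delay      # smoothed heartbeat delay
        self.dev = initial_delay / 2   # smoothed deviation from the mean
        self.alpha = alpha             # weight of a new sample in the mean
        self.beta = beta               # weight of a new sample in the deviation
        self.k = k                     # safety margin in deviations

    def observe(self, delay: float) -> None:
        """Update the estimates with one measured heartbeat delay."""
        self.dev = (1 - self.beta) * self.dev + self.beta * abs(delay - self.mean)
        self.mean = (1 - self.alpha) * self.mean + self.alpha * delay

    @property
    def timeout(self) -> float:
        """Currently tolerated delay: smoothed mean plus a safety margin."""
        return self.mean + self.k * self.dev
```

A gradual increase in network load then raises the tolerated delay step by step instead of producing false alarms.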

The tolerable signal delays associated with individual conditions may be combined in different ways to derive a total signal delay that responds to and takes account of the individual conditions. It is thus possible to determine, for each individual condition, a maximum time that is still tolerated before a failure is assumed. Such a combination of maximum times has been discussed above by way of example. Alternatively and/or additionally, linear or nonlinear combinations can also be made, for example if a high network load suggests that a monitored component, for example a heartbeat-emitting computer, is about to experience a particularly high load. If the monitored computer indicates at the same time that it is already heavily loaded, it follows immediately that simply adding the permissible maximum delays is not sufficient, since the additional load on the transmitting computer predicted from the network load can be expected to increase the signal delays caused there by the operating system scheduler. A particularly preferred system will determine a permissible maximum time by analyzing such interdependencies.
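A nonlinear combination of this kind can be sketched with a simple interaction term. The product form and the `coupling` constant are assumptions chosen for illustration; the description does not specify how the interdependency is modelled.

```python
def combined_timeout(sender_max: float, network_max: float,
                     network_load: float, sender_load: float,
                     coupling: float = 0.5) -> float:
    """Combine per-condition maximum delays into one tolerated delay.

    sender_max / network_max: tolerated maximum delays in seconds.
    network_load / sender_load: current loads on a 0.0 .. 1.0 scale.
    coupling: assumed strength of the interdependency.
    """
    # Simple linear combination: add the individual maxima.
    additive = sender_max + network_max
    # Nonlinear part: high network load predicts extra work for a sender
    # that is already busy, which lengthens its scheduler delay further.
    interaction = coupling * network_load * sender_load * sender_max
    return additive + interaction
```

When either load is zero the result reduces to the plain sum of the maxima; only the simultaneous occurrence of both loads adds tolerance.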

Upon detection of a malfunction, different responses can be triggered. A first reaction would be a warning signal that alerts an administrator or a maintenance service to the malfunction. The monitored computer can also be requested to generate a status report, if this is still possible. Furthermore, it is additionally and/or alternatively possible to switch over to a reserve system and/or to redistribute a task assigned to the system identified as possibly malfunctioning, should that system not be available. At the same time, it is preferable, if possible, to trigger a data backup and a shutdown of a computer that is recognized as possibly defective.

When it is necessary to react very quickly to non-standard behavior, a particularly preferred variant of the invention does not require the monitored computer to be switched off remotely. Rather, if the computer to be monitored recognizes that predetermined specifications are no longer being complied with, for example because the difference between the target time and the actual time, or the possible actual time, of heartbeat signal generation is too large, the computer to be monitored can itself arrange a switch to a backup system, the outsourcing of assigned activities, or the like. It should be noted that such an approach is not limited to mission-critical components but is feasible in any case. The self-shutdown or self-relief need not occur only when the computer to be monitored finds that the difference between the target time and the actual time of heartbeat signal generation is greater than the total duration available for the heartbeat signal repetition. Average observed network loads and the like can also be taken into account, so that a self-shutdown can be carried out when a strongly delayed reception of the heartbeat signals due to excessive network load or the like is highly probable.
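The transmitter-side decision described here can be sketched as follows. This is a minimal illustration under assumptions: the function name, the parameter names and the idea of simply adding an expected network delay to the local lateness are hypothetical and not taken from the description.

```python
def should_self_shutdown(t_target, t_actual, t_max, expected_net_delay=0.0):
    """Return True if the monitored computer should hand over on its own.

    t_target: scheduled heartbeat generation time (s)
    t_actual: actual (or earliest possible) generation time (s)
    t_max: maximum delay permitted by the monitoring component (s)
    expected_net_delay: average network delay currently observed (s)
    """
    lateness = t_actual - t_target
    # A late reception is considered highly probable once local lateness
    # plus the expected network delay exceeds the permitted maximum.
    return lateness + expected_net_delay > t_max
```

The optional `expected_net_delay` argument reflects the remark that observed network load can trigger the hand-over even before the local lateness alone exceeds the limit.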

The invention will now be described by way of example only with reference to the drawings, in which:

Fig. 1 shows an arrangement with which the malfunction detection method of the present invention can be carried out;

Fig. 1a, b, c show detailed views of Fig. 1;

Fig. 2 shows a detail of a network used in Fig. 1 by way of example for the transmission of heartbeat signals;

Fig. 3a shows probability densities for the arrival of heartbeat signals after a transmission time t for a lightly loaded network state;

Fig. 3b shows probability densities for different load states of the network shown in Fig. 2;

Figs. 4a, b show the derivation of a network load delay time curve from the distributions of Fig. 3b;

Fig. 4c shows an exemplary CPU load delay time curve; Fig. 4c' shows a swap file delay time curve.

Referring to Fig. 1, an arrangement, generally designated 1, of communicating components 2, 3 is provided for carrying out a malfunction detection method. In this arrangement, signals are repeatedly sent from the first component 2 via a connection 4 for reception by at least one further component 3, an arrangement state is determined, and malfunctions are assumed when reception is at least excessively delayed with regard to that state. For this purpose, individual conditions 5a, 5b, 5c influencing the arrangement state are determined, signal delays that are tolerable as a function of these individual conditions are defined by suitable means 3a, 3b, 3c, 3d, and a malfunction is assumed in response to a reception that is at least excessively delayed with regard to the signal delays tolerable for the individual conditions. The first component 2, which repeatedly sends signals for reception by at least one other component, is in the present case a mission-critical computer that has to process different tasks in succession, as indicated by the tasks Job 1, Heartbeat, Job 3, Job 4 registered in a scheduler 6. Mission-critical here means that a failure of the system can have such negative consequences for an application and/or a user that measures to ensure timely replacement in case of failure of the component appear reasonable. This is achieved by the special design of the transmitter, which may otherwise be a largely or completely conventional computer, server, PC or other component communicating with external components. In the illustrated embodiment, the component 2 comprises, in addition to the scheduler 6, a CPU 7, a disk memory 8, a timer 9, a network input/output interface 10 and a monitoring unit 11 for self-monitoring of the component. Further, a memory 12 is provided, designated RAM in the figure, which is at least partially usable for the temporary storage of data and/or program parts and/or for swapping out currently unneeded and/or already executed program parts and/or data.
It should be noted that the individual components can be implemented by dedicated software program parts that are executed regularly or constantly, but that dedicated hardware can nevertheless be present.

The scheduler 6 can be implemented in a conventional manner as part of an operating system and thus specifies, in response to signals from the timer 9, which task the CPU 7 has to process at which time.

The central processing unit CPU 7 need not be a single processor; it will be understood that the present invention is also applicable to multiprocessor systems and the like.

The data memory 8 can be realized in the present case as disk storage, such as a RAID array or the like.

The timer 9 is designed to ensure within the component 2 that certain program parts to be processed in the scheduler 6 do not occupy the CPU for excessively long times, that accesses do not take excessively long, etc. In the exemplary embodiment shown, it is synchronized to a global time, as its representation as a clock indicates; for this purpose, a radio signal receiver for central time data can be provided in the component 2 and/or regular synchronization with a clock connected to a network can take place. It should be mentioned that global synchronization is not mandatory; a local synchronization of the clocks provided at transmitters and/or receivers, for example, is also sufficient. Moreover, synchronization is not required for all optimization steps, as will become apparent.

The network input/output interface 10 is configured to communicate with a network via common protocols. It can be a LAN connection, an Internet connection, a WLAN connection or the like. Usability with future protocols or protocols not mentioned here is anticipated. Particularly relevant in the present case is the usability of the network input/output interface 10 for sending heartbeat signals from a heartbeat generation unit 13, which may in particular be the CPU itself and/or one of the different layers of a typical LAN connection, as well as for receiving a signal indicating the load of the network 4 to which the component 2 is connected, or for receiving a maximum time t_max permitted by the monitoring component 3. The monitor 11 within the component 2 receives, on the one hand, signals from the timer 9 and, on the other hand, the maximum permissible time between two heartbeat signals, which was received from the component 3 via the network connection 10, as well as load-indicative signals via the lines 5a, 5b, 5c, which indicate, for example, the percentage utilization of the CPU, the percentage filling of the swap space, the percentage utilization of the hard disk cache, and so on. The monitor 11 further receives signals, in particular from the CPU and/or the network connection 10, which indicate when a heartbeat signal is transmitted, and compares this transmission time with the previously received maximum permitted time t_max, in order to shut component 2 down on its own initiative if the actual time of the heartbeat signal transmission t_ist is greater than the maximum time approved by the monitoring component. In such a case, a secondary system is caused to start up and/or become active.

The heartbeat signal generation unit 13 is designed to compile heartbeat signal data packets and send them to the network connection unit 10 for transmission via the network. The heartbeat data packets include the address of the addressed receiver, ADR; a sequentially and cyclically assigned number, seq #; the time at which the generation of the heartbeat signal was scheduled in the scheduler, t_sched; the current load of the CPU, CPU%; the fill level of the random access memory, RAM%; the disk usage, Disk%; and the network usage, Net%. The data packets compiled in this way can be written into a transmission memory of the network connection 10, from where they are transmitted as quickly as possible and/or in sequence with other, already existing data.
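A heartbeat packet carrying these fields could be represented as in the following Python sketch. The wire layout (field widths, byte order, a 4-byte address) is an assumption made for illustration only; the description names the fields but does not define an encoding.

```python
import struct
from dataclasses import dataclass

# Assumed wire layout (not specified in the description): network byte order,
# 4-byte address, 32-bit sequence number, 64-bit float timestamp, and four
# unsigned bytes for the load percentages.
_LAYOUT = struct.Struct("!4sIdBBBB")


@dataclass
class Heartbeat:
    adr: bytes      # ADR: address of the addressed receiver
    seq: int        # seq #: cyclically assigned sequence number
    t_sched: float  # time at which generation was scheduled
    cpu: int        # CPU%
    ram: int        # RAM%
    disk: int       # Disk%
    net: int        # Net%

    def pack(self) -> bytes:
        """Serialize the packet into the assumed 20-byte layout."""
        return _LAYOUT.pack(self.adr, self.seq, self.t_sched,
                            self.cpu, self.ram, self.disk, self.net)

    @classmethod
    def unpack(cls, data: bytes) -> "Heartbeat":
        """Rebuild a packet from bytes produced by pack()."""
        return cls(*_LAYOUT.unpack(data))


hb = Heartbeat(bytes([192, 168, 0, 3]), 17, 1234.5, 40, 55, 10, 80)
wire = hb.pack()
```

Packing the load values into the heartbeat itself is what allows the receiver to choose its tolerance from the conditions under which that particular packet was generated.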

The network 4 shown diagrammatically in FIG. 1 can be an extensive network in which a plurality of computers, represented by circles, are connected via nodes, represented as square boxes. In the network illustrated by way of example only, two components II are provided which communicate with each other and exchange very large amounts of data. They communicate via a node 4c located in the shortest possible connection between heartbeat signal transmitter and heartbeat signal receiver and, as indicated by the very thick connection lines 4a, 4b to this node 4c, exchange extremely large amounts of data. In the example chosen for explanation, this is intended to bring the node to its limits, so that the direct connection between the heartbeat signal transmitter 2 and the heartbeat signal receiver 3 is no longer available while the components II communicate with each other. In this case, the components 2 and 3 can no longer be connected via the dash-dotted path, which identifies the network state I, but only via the dotted path, which avoids the direct connections via the node 4c. This path is significantly longer, and it should be noted that packets travelling on it take longer to get from the heartbeat signal transmitter to the heartbeat signal receiver. Also shown are a plurality of individual computers III, which under certain conditions can additionally load the connecting lines and nodes, in particular those of the dotted bypass path. For the purposes of the following explanations, it is assumed that the nodes in the dash-dotted path and the computers marked III are part of a computer network used only on weekdays, so that on such working days there is a significantly higher network load on the corresponding nodes of the dash-dotted path, without these nodes being regarded as overloaded per se.

For the purposes of the following explanations, three operating states can therefore be distinguished: the operating state I, in which only the heartbeat transmitter and the heartbeat receiver communicate with each other and no other communication takes place in the network; the operating state I + II, in which, in addition, the direct connection path is blocked by the very extensive communication between the components II, so that alternative paths must be found, without the dotted path being subject to a particularly high network load; and the operating state I + II + III, in which the communication of the computers III with each other places additional load on the nodes in the dotted path through the network. It is understood that the delay times of a heartbeat signal packet over the network depend on which operating state exists. This is shown in FIG. 2. The delay via the network, Δt_net, is significantly greater for the operating state (I + II + III) than for the operating state (I + II), in which not all computers communicate via the network but only the heartbeat signal transmitter and the heartbeat signal receiver (I) on the one hand and the computers II blocking the dash-dotted connection on the other hand, that is

Δt_net(I + II + III) >> Δt_net(I + II).

The delay time Δt_net(I + II) of the operating state in which the direct connection path is blocked is in turn significantly greater than the delay time when the direct connection is not blocked, that is, the delay time Δt_net(I) of the operating state I; overall, the following relationship therefore applies:

Δt_net(I + II) >> Δt_net(I).

It should be noted that the delay experienced by signals transmitted over a network, including heartbeat signals, depends, especially in extended networks, on a variety of statistical processes, such as the transmit and receive activity, at a given point in time, of the computers located along the way. It therefore does not make sense to speak of a fixed delay that a particular packet suffers under certain conditions; it makes sense instead to speak of the probability that a given packet will actually have arrived after a certain amount of time. This is illustrated in FIG. 2b on the basis of the probability density distribution of the signal reception over the delay time after transmission. Immediately after sending, the probability that the packet can already be received is identically equal to zero, because the packet requires a finite time in any case, even with undisturbed forwarding. There will then be a large number of packets that are transmitted very fast and without delay through the network 4, and there will be some packets that take longer, perhaps because other packets need to be sent via the nodes and the heartbeat signals therefore reach the recipient only with a delay that is subject to statistical conditions. A reasonable measure of the expected delay time is therefore the time that elapses under given conditions, for example under a given network load, until 99.9% of all packets have arrived at the receiver. Of course, a time assigned to any other percentage could be chosen instead. FIG. 2c shows what the probability densities of the packet reception look like for different network conditions. The curves are normalized such that the integral of the probability densities over time is 1 after an infinitely long time, that is, it is assumed that no packet is lost. Also shown in Fig. 2c is the time t_99.9. These times t_99.9 are shown for the different operating states.
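Given measured delays for one operating state, a time such as t_99.9 can be estimated empirically. The following Python sketch uses the nearest-rank percentile method; the function name and the choice of method are assumptions for illustration, since the description only defines the quantity itself.

```python
import math


def t_percentile(delays, fraction=0.999):
    """Smallest observed delay by which `fraction` of all packets had arrived.

    delays: iterable of measured transmission delays (any consistent unit)
    fraction: share of packets that must have arrived, e.g. 0.999 for 99.9%
    """
    ordered = sorted(delays)
    if not ordered:
        raise ValueError("no delay samples")
    # Nearest-rank method: 1-based rank of the sample covering the fraction.
    rank = max(math.ceil(fraction * len(ordered)), 1)
    return ordered[rank - 1]


# Illustrative samples: 1000 packets with delays of 1..1000 ms (assumed data).
samples_ms = list(range(1, 1001))
t_999 = t_percentile(samples_ms)
```

Evaluating this for measurements taken in each operating state gives one point per state, which is the raw material for a load-versus-delay curve.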

FIG. 4a again shows the curves of FIG. 2c with the times at which 99.9% of all data packets have arrived marked for the different operating states. In Fig. 4b, these 99.9% times are plotted against the network load observed in the respective operating states, and the resulting curve is drawn. In Fig. 4c, 4d, the delays that the heartbeat signals experience when the central processing unit is under particularly high load and/or the memory 12 (RAM) is particularly heavily utilized are drawn in an analogous manner.

The component 3 is adapted to receive the heartbeat signals over the network 4. Its network interface to the network 4 is also adapted to receive other data, as shown at 14; this reference numeral designates the input memory of the network interface and indicates that data can be received there without having to be processed further in the computer immediately, for example because other tasks are first to be processed by the scheduler 15 of the component 3. Component 3 further comprises a CPU 16, a timer 17 and, as required, further components of a standard computer system, which need not be discussed in detail below. It should also be pointed out here that the individual means and signal processing stages provided in component 3, whether already described or still to be described, can be implemented as usual by hardware and/or software means.

The component 3 further comprises a heartbeat signal unpacking stage 18, in which the sequence number Seq#, the generation time t_sched, the CPU load CPU% of the generating component 2, the memory utilization RAM% of the generating component, the file load File% of the generating component, and the network load Net% of the network 4 used for the transmission are separated from the heartbeat signal packet. The heartbeat signal unpacking stage 18 is connected, for the unpacked sequence number, to a sequence number evaluation stage 25, in which the sequence number of a received heartbeat signal packet is compared with the sequence numbers of previously received heartbeat signals; the load-indicative data, here indicated as CPU%, RAM%, File% and Net%, are fed to the inputs of the look-up tables (LUT) 3a, 3b, 3c, 3d. The scheduled time t_sched is fed to a consistency check stage 19. The look-up tables 3a to 3d are adapted to output allowable delays Δt_CPU, Δt_RAM, Δt_File and Δt_Net in response to the load conditions for the CPU, RAM, etc. entered at their inputs. It should be mentioned that other parameters can be used and/or that not all of the mentioned parameters need be used. The look-up tables 3a to 3d may implement curves such as those shown in FIGS. 4b to 4g. The outputs of the look-up tables are fed to a delay time linking stage 22, which determines from them a maximum allowable delay time. The maximum delay time t_max is made available to the CPU via a suitable interface 23, such that, when the heartbeat signal job is processed in the scheduler 15, the maximum permissible time can be compared with the actual time, i.e. with the time that has elapsed since receipt of the last valid time value.
The component 3 further has an alarm stage 24, which is designed to generate an alarm if a heartbeat signal is absent or inadmissibly delayed and, if necessary, to activate reserves for component 2 and, if so desired, to deactivate component 2.

With an arrangement like that described above, a malfunction detection method can be practiced as follows:

It is assumed that, when the components 2 and 3 are commissioned, the network 4 is initially in state I, that is, the two components 2 and 3 are the only ones communicating over the network. It is further assumed that, as is preferred, timers 9 and 17 are first synchronized. The component 3 then specifies a maximum time between heartbeat signals, chosen so that the component 2 can comply with it easily, and communicates it via the network 4 to the component 2. The component 2 also receives information about the current network load.

In component 2, the generation of a heartbeat signal is then regularly scheduled, taking into account the specified maximum time t_max between two heartbeat signals.

When the heartbeat signal generation scheduled in the scheduler 6 is executed, the current load of the CPU, the memory RAM and the hard disks, as well as the network load received via the network input/output interface 10, are packed into a heartbeat signal. This is fed to the network input/output interface 10 together with a consecutive sequence number and the address of the receiving computer 3. The monitor stage 11 also checks whether the heartbeat signal generation time t_ist is smaller than the permitted time by which the next heartbeat signal is to be generated. If this is the case, nothing needs to be done. If this is not the case, that is, if the condition t_ist ≤ t_max is not met and a corresponding query thus yields the logical value "0", the system 2 shuts itself down after first informing the monitoring component 3 of this. In the following, it is assumed that the target transmission times can always be maintained. The packet is then transmitted over the network 4 and enters the input memory 14 of the network input interface of component 3, which monitors the heartbeat signal transmission. There, the heartbeat signal is read when the corresponding instructions in the scheduler 15 are processed, it is unpacked, and the unpacked values are fed to the look-up tables, the consistency check stage 19 or the sequence number evaluation stage 25 as required. The look-up tables 3a to 3d then determine corresponding allowable delays in response to the received load values for the CPU, RAM and disk space of component 2 and for the network load. These are linked together in the delay time linking stage 22; for the purposes of the present disclosure, it can be assumed that this linking is done by addition. Other functions can, however, be implemented, for example in order to avoid that, when all components are at their load limit of 100%, the maximum permissible delay time becomes greater than the delay that is tolerable under general safety aspects regardless of load.
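The receiver-side determination of the maximum permissible delay (look-up tables followed by a linking stage) can be sketched in Python as follows. The table values, the linear interpolation between table points and the optional cap are illustrative assumptions; the description leaves the table contents and the exact linking function open.

```python
from bisect import bisect_right


class DelayLUT:
    """Maps a load percentage to a tolerable delay by linear interpolation."""

    def __init__(self, points):
        # points: (load_percent, delay_seconds) pairs, at least two entries
        self.points = sorted(points)
        self.loads = [p[0] for p in self.points]

    def __call__(self, load):
        # Clamp to the table range, then interpolate between neighbours.
        load = min(max(load, self.loads[0]), self.loads[-1])
        i = min(bisect_right(self.loads, load), len(self.loads) - 1)
        (x0, y0), (x1, y1) = self.points[i - 1], self.points[i]
        return y0 + (y1 - y0) * (load - x0) / (x1 - x0)


# Hypothetical tables standing in for the look-up tables 3a to 3d.
LUTS = {
    "cpu": DelayLUT([(0, 0.00), (50, 0.01), (100, 0.05)]),
    "ram": DelayLUT([(0, 0.00), (100, 0.02)]),
    "file": DelayLUT([(0, 0.00), (100, 0.04)]),
    "net": DelayLUT([(0, 0.01), (100, 0.11)]),
}


def max_allowed_delay(loads, luts=LUTS, cap=None):
    """Link the per-condition delays by addition, optionally capped."""
    total = sum(lut(loads[name]) for name, lut in luts.items())
    return total if cap is None else min(total, cap)


t_example = max_allowed_delay({"cpu": 25, "ram": 50, "file": 0, "net": 50})
```

The `cap` argument corresponds to the safety limit mentioned above: even if every table returns its largest value, the result never exceeds the delay that is tolerable regardless of load.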

The times output by the look-up tables 3a to 3d are simultaneously passed to the consistency check stage 19. This stage checks whether the time at which the generation of the heartbeat signal in the component 2 was scheduled by the scheduler is permissible, given the load-dependent permitted delay times and taking into account the reception of the previous heartbeat signal, or whether the look-up tables may need to be corrected, for instance because they provide times so short that they cannot be met even when generating the heartbeat signal. The maximum allowable time t_max determined by the delay time linking stage 22 is then provided to the CPU. It is compared with the current actual time or the time elapsed since the last received heartbeat signal. It is further checked whether the heartbeat signal that was received is a newer heartbeat signal or whether an outdated heartbeat signal has arrived very late; this evaluation is carried out in the sequence number evaluation stage 25. If a new heartbeat signal is involved and the maximum permissible time between two heartbeat signals, which was determined on the basis of the transmitted heartbeat signals and the load-indicative data encoded therein, is greater than the time elapsed since the last evaluation, the timer is reset to zero and nothing further is done except to schedule a new heartbeat check after a certain period of time.

This continues. If, for example, load changes occur in component 2 during the processing of other jobs, or in the network 4, this will normally result in delays in the heartbeat signal reception at component 3. However, this delay is not critical, because the evaluation of the respective load states, for example CPU%, RAM%, File% and Net%, in the look-up tables shows that larger delays related to the individual components are to be expected, which leads in the delay time linking stage to an extension of the maximum allowable waiting time for a new heartbeat signal. If now, after a heavy load on the network, for example after operation in the state (I + II + III), the computers II that communicate with each other and block the direct connection are switched off, it may happen that heartbeat signals transmitted very quickly are detected first and that heartbeat signals do not arrive in the order in which they were sent, because older packets are still travelling the longer way through the network and can be forwarded there only when the respective nodes allow it. This can lead to later-arriving heartbeat signals being outdated, for example when heartbeat number 15 is received after heartbeat number 17 has already been received. In such a case, no further action is taken in evaluating this heartbeat signal; in particular, the timer is not reset. Instead, a new heartbeat signal monitoring job is simply scheduled in the scheduler 15 without further response, provided that t_max, the maximum time delay derived from the most recently received heartbeat signal, has not yet been exceeded. In undisturbed operation, the heartbeat signal input is typically interrogated more frequently than heartbeat signals arrive. As a result, the outdated data packets 15 and 16 can be discarded after the packet with the sequence number 17, which was transmitted via the freed connection, has been received, and the timer is reset by the subsequent reception of, for example, the 18th heartbeat signal packet.
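The handling of outdated packets can be illustrated with a small Python sketch. The class and method names are hypothetical, and the sketch deliberately ignores the wrap-around of the cyclically assigned sequence numbers, which a real implementation would have to handle.

```python
class HeartbeatMonitor:
    """Receiver-side bookkeeping: only newer heartbeats reset the timer."""

    def __init__(self, t_max, now=0.0):
        self.t_max = t_max       # currently permitted waiting time (s)
        self.last_seq = None     # highest sequence number seen so far
        self.last_reset = now    # time of the last timer reset (s)

    def on_heartbeat(self, seq, t_max, now):
        """Return True if the packet was newer and the timer was reset."""
        if self.last_seq is not None and seq <= self.last_seq:
            return False         # outdated packet: discard, no timer reset
        self.last_seq = seq
        self.t_max = t_max       # t_max derived from the newest packet
        self.last_reset = now
        return True

    def timed_out(self, now):
        """True if more than t_max has elapsed since the last reset."""
        return now - self.last_reset > self.t_max


# Scenario from the text: packet 17 overtakes packets 15 and 16.
monitor = HeartbeatMonitor(t_max=1.0)
results = [
    monitor.on_heartbeat(17, 1.0, now=0.2),
    monitor.on_heartbeat(15, 1.0, now=0.3),
    monitor.on_heartbeat(16, 1.0, now=0.4),
    monitor.on_heartbeat(18, 1.0, now=0.5),
]
```

In this scenario, only packets 17 and 18 reset the timer, while the late packets 15 and 16 are discarded without any further reaction.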

If the heartbeat signal now fails to arrive, for example as a result of an error in component 2, the receiving component 3 first determines, when the heartbeat signal monitoring thread is processed in the scheduler 15, that no newer heartbeat signals are present. Ultimately, this must in particular trigger the start-up of a replacement system for component 2, which is assumed to be defective because of the absence of heartbeat signals. Before this happens, however, the input/output memory 14 of the network port is checked once more to see whether heartbeat signals have nevertheless arrived at the very last moment. If so, it is analyzed in the manner described above whether a newer heartbeat signal is involved, and if so, a shutdown of the component 2 is dispensed with. At the same time, the corresponding look-up tables can be corrected to take such a situation into account in future cases; this is possible as long as the overall delay is not already so great that the delayed signal transmission must be attributed to a disturbance or malfunction of the overall system. In this way, an unnecessary shutdown of the component 2 is avoided. Only if no newer heartbeat signal is detected during this check of the network connection stack or FIFO, which is to be processed with a higher priority than the shutdown of component 2, is an alarm triggered in the alarm stage 24 and a reserve system activated for component 2, which is then presumed to be defective.
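The order of operations described here, in which the input buffer is checked before any alarm is raised, can be sketched as follows. The function name, the representation of the buffer as a queue of sequence numbers and the returned action strings are assumptions made for illustration.

```python
from collections import deque


def check_or_alarm(inbox, last_seq, elapsed, t_max):
    """Decide what the monitoring job does when it runs.

    inbox: deque of sequence numbers waiting in the network input buffer
    last_seq: newest sequence number evaluated so far
    elapsed: time since the last timer reset (s)
    t_max: currently permitted waiting time (s)
    Returns (action, newest_seq) with action "ok" or "alarm".
    """
    # Message processing has priority over timeout processing:
    # drain the input buffer before considering an alarm.
    newest = last_seq
    while inbox:
        seq = inbox.popleft()
        if seq > newest:
            newest = seq
    if newest > last_seq:
        return "ok", newest      # a newer heartbeat arrived after all
    if elapsed <= t_max:
        return "ok", last_seq    # still within the permitted waiting time
    return "alarm", last_seq     # trigger alarm stage, activate reserve


late_but_present = check_or_alarm(deque([18]), 17, elapsed=2.0, t_max=1.0)
truly_missing = check_or_alarm(deque(), 17, elapsed=2.0, t_max=1.0)
```

Draining the buffer first is what prevents a shutdown when a newer heartbeat has in fact arrived but has not yet been evaluated.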

While reference has been made above to addition as a possible linkage of the delays due to individual conditions, addition is not the only meaningful option. Instead, any mathematical function, in particular an empirically determined and/or learned function, can be used, as can look-up tables and/or limit-value relationships.

Claims

1. A malfunction detection method for an arrangement of communicating components, wherein signals are repeatedly sent from a first component for reception by at least one other component, an arrangement state is determined, and malfunctions are assumed in response to a reception that is at least excessively delayed with regard to that state, characterized in that individual conditions influencing the arrangement state are determined, signal delays that are tolerable as a function of the individual conditions are defined, and a malfunction is assumed in response to a reception that is at least excessively delayed with respect to the signal delays tolerable as a function of the individual conditions.

2. Malfunction detection method according to the preceding claim, characterized in that the components are realized by data processing equipment and/or parts thereof, in particular computer systems in a system secured against failure and/or functionally critical or mission-critical component parts.

3. The malfunction detection method according to claim 1, wherein the arrangement varies with respect to the communicating components and / or with respect to the signal paths connecting them and / or with respect to the utilization of components and / or signal paths.

4. A malfunction detection method according to any one of the preceding claims, wherein the signals are realized by digital data packets.

5. Malfunction detection method according to the preceding claim, characterized in that the digital data packets are provided with a transmitter identifier and/or transmission identifier, in particular a packet number and/or an actual and/or target transmission time.

6. Malfunction detection method according to any one of the preceding claims, characterized in that the signals are repeated in a manner adapted to one or more individual conditions.

7. Malfunction detection method according to any one of the preceding claims, characterized in that the signals are repeated, at least for a certain period and/or a certain number of repetitions, at a constant target interval.

8. Malfunction detection method according to one of the preceding claims, characterized in that the target repetition interval of the signals is communicated.

9. Malfunction detection method according to any one of the preceding claims, characterized in that several stations are provided in the signal transmission path and signals are, if necessary, partially and/or completely repeated between them.

10. Malfunction detection method according to one of the preceding claims, characterized in that the components are realized by computers provided with an operating system that operates with a prioritizing scheduler in which, especially on the receiver side, message processing is assigned a higher priority than timeout or timer processing.

11. Malfunction detection method according to any one of the preceding claims, characterized in that the sent signals are addressed, in particular repeatedly, to one or more, in particular constant, addresses.

12. Malfunction detection method according to any one of the preceding claims, characterized in that at least one individual condition at the receiver and/or at the transmitter and/or on the signal path is determined with respect to a value expected from long-term historical statistics, a value expected from current measurements, and/or a predicted value.

13. Malfunction detection method according to any one of the preceding claims, characterized in that a multi-component system with at least three components is present and at least one of the components sends to a third component information relating to the regular reception of signals repeatedly received from a signal-sending component.

14. Malfunction detection method according to any one of the preceding claims, characterized in that a malfunction is assumed if a reception is completely absent and/or if a reception is delayed so long that assured or required service characteristics cannot be adhered to, especially if reception is excessively delayed or completely absent too many times or too often for compliance with assured service characteristics.

15. A malfunction detection method according to any one of the preceding claims, characterized in that the individual conditions comprise at least one of: total transmitter load, transmitter subcomponent load, consecutive signal identifiers not received and/or not consecutively received, receiver input buffer status, current signal path utilization, signal path utilization, receiver load, receiver subcomponent load.

16. Malfunction detection method according to any one of the preceding claims, characterized in that the individual condition determination comprises a defining specification of a target value and/or a measuring determination of an actually given condition.

17. Malfunction detection method according to any one of the preceding claims, characterized in that the tolerable signal delay is adapted.

18. Malfunction detection method according to one of the preceding claims, characterized in that the tolerable signal delay is determined by adding a plurality of individual condition-related maximum times and / or by linking consideration of individual condition-related delay error likelihood relationships.

19. Malfunction detection method according to one of the preceding claims, characterized in that, in response to an error, at least one of the following steps is undertaken: warning signal output, switching to a redundancy system, change of the state of the transmitting component, triggering of a data backup.

20. Malfunction detection method according to one of the preceding claims, characterized in that on the transmitter side, upon detection of the non-compliance of a timely desired signal repetition, a reaction step is triggered, in particular switching to a redundancy system and / or self-shutdown of a transmitter system.