DE4332881A1

DE4332881A1 - Fault-tolerant multicomputer system

Info

Publication number: DE4332881A1
Application number: DE19934332881
Authority: DE
Inventors: Uwe Dr Ing Held
Original assignee: Ksp Ingenieurtechnische Dienste Komponenten-Systeme-Projekte 09125 Chemnitz De GmbH; KSP INGENIEURTECHNISCHE DIENST
Current assignee: Airbus DS GmbH
Priority date: 1993-09-21
Filing date: 1993-09-21
Publication date: 1995-03-23
Anticipated expiration: 2013-09-22
Also published as: DE4332881C2

Abstract

The object of the invention is to provide a fault-tolerant multicomputer system that is implemented in a completely decentralised manner, that avoids the customary shared resources between the modules (assemblies), such as RAM, bus system and clock, which are particularly at risk of failure, that integrates computer peripherals in a fault-tolerant manner, and that provides automatic fault detection and fault elimination (recovery) within the network right up to the input and output points to the peripherals. The multicomputer system is characterised in that, in order to tolerate M faults in computer nodes, at least N = 2M + 1 computer nodes (k0...k8) are arranged and interconnected in such a way that every computer node (k0...k8) can reach every other node via a plurality of partially or completely different paths; in that a distributed communications system permits the reliable exchange of data; and in that at least (2Q + 1) input and output modules (p0...p5) are arranged for the toleration of Q faults in the input and output modules, each of the input and output modules (p0...p5) being connected to a different one of the computer nodes (k0...k8).

Description

The invention relates to a fault-tolerant multicomputer system in which several microcomputers can stand in for one another.

Many applications of computing technology demand a high degree of reliability from the hardware and software used. The main areas are the aerospace sector and the power plant and chemical industries. Besides protection against the dangers of unreliable computing technology in safety-critical areas, operators of complex industrial systems such as transfer lines and the like are, in the course of comprehensive automation, also increasingly interested in fault-tolerant computing technology because of the high costs incurred when individual components fail.

Error-correction methods based on special codes are suitable for eliminating random data errors, but they are ruled out for information-processing systems because no algorithms exist that remain valid across all processing functions (W. W. Peterson, "Prüfbare und korrigierbare Codes", R. Oldenbourg Verlag, München und Wien, 1968). These methods are likewise unsuitable when components or subsections fail completely.

A well-known basic principle of fault-tolerant computers is the multiple arrangement of essentially identical computing units that all work on the same task at the same time, the correct result being selected by majority decision.

Such an arrangement consists of at least three computing units that work with the same input information, and a "voter" that carries out the majority decision (R. Weiß, "Fehlertolerante Rechnersysteme", Regelungstechnische Praxis, 25. Jahrgang, 1983, Heft 10, pp. 408-415). The weak point here is always the voter, which exists only once and is therefore not itself fault-tolerant. A failure of this component always results in a total failure of the entire system.

Simple systems that are already described as fault-tolerant work with two processors connected quasi in parallel. If the results differ, the last computing step is repeated, whereby random errors are detected and eliminated. If the repetition also yields no agreement, functionality is checked by test routines and the faulty computer is thus identified. Such solutions are already implemented at integrated-circuit level (David Jones, "Fehlertoleranz und Zuverlässigkeit in Mikroprozessorsystemen", Elektronik, Heft 24/1990). Here too, however, comparison logic is required that exists only once and is therefore not fault-tolerant.

Systems with similar computing units that are loosely coupled via an interconnection network and work without global memory units meet the basic prerequisites for implementing fault-tolerant properties. It is known, for example (DE 29 39 487), to connect several computing units into parallel computer systems in order to increase performance, with several microcomputers being able to solve the same tasks.

In normal operation, some of the microcomputers process the tasks to be solved, while others are provided as replacement modules (standby). Both the input and the output interfaces of all microcomputers are each connected to the peripheral device controllers only via a single-stage switching network. Like a separate voter, however, this central resource (the single-stage switching network) is a weak point of the system, since a single instance provides no fault tolerance.

Fault-tolerant computer systems are also known (DE 36 39 055, DE 40 05 321) that have several computers which can stand in for one another in the event of a fault. To detect faulty behaviour, the computers share common resources; for example, they access a shared memory and compare computation results stored there. In the event of a fault, centrally arranged selection units eliminate the faulty component and, where available, switch over to a replacement module. Such systems have the disadvantage that shared resources exist, in particular bus systems and global memories, which can likewise fail and then lead to an immediate total failure.

No solutions are known that manage without a central resource for executing the voting function, regardless of whether this function is implemented in hardware or in software. Because particularly high reliability requirements must be placed on every central resource owing to its exposed position, such modules are complex and expensive.

The object of the invention is to create a fault-tolerant multicomputer system that is implemented in a completely decentralised manner, that avoids the usual shared resources between the modules, such as RAM, bus system and clock, which are particularly prone to failure, that integrates computer peripherals in a fault-tolerant manner, and that provides automatic fault detection and fault recovery within the network right up to the input and output points to the peripherals.

According to the invention, the object is achieved in that, to tolerate M faults in computer nodes, at least N = 2M + 1 computer nodes are arranged and connected to one another by a communication network in the form of several fixed point-to-point connections operating according to the handshake method. The computer nodes are connected in such a way that each can reach every other via several partially or completely different paths, so that further routes for data transport remain available if one or more communication paths fail. A distributed communication system is provided which, in addition to the data transport services, has functions for failure detection, fault elimination and reintegration. Each sending computer node uses a time-out condition to monitor the receipt of a transmission; if a connection path shows no activity, it reroutes the transmission and reports this state to all computer nodes using all active connection paths. A computer node is recognised as failed when the last connection path leading to it is also recognised as failed. To determine whether a connection path has failed or is still failed, that is, for cyclic monitoring of the system state, test messages are sent at equal time intervals both on fault-free connection paths and on those marked as faulty. When a change in the state of the computer system is detected locally, the system is thus able to notify every computer node in the system of the failure of components, and of their reintegration once renewed activity is detected, to determine the availability state in a decentralised manner, and to adapt the communication routes automatically.
Trouble-free continued operation is thus ensured for all fault-free computer nodes.
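The rule that a node counts as failed only once the last connection path leading to it has been marked as failed can be illustrated with a minimal sketch. The names `link`, `failed_nodes` and `RING`, and the four-node ring used as test data, are assumptions made for this illustration and do not come from the patent.

```python
from typing import Dict, FrozenSet, Iterable, Set

Link = FrozenSet[int]  # undirected point-to-point connection between two node indices


def link(a: int, b: int) -> Link:
    """Build an undirected link between nodes a and b."""
    return frozenset((a, b))


def failed_nodes(all_links: Iterable[Link], failed_links: Set[Link]) -> Set[int]:
    """Return the nodes for which every attached link is marked as failed."""
    attached: Dict[int, Set[Link]] = {}
    for connection in all_links:
        for node in connection:
            attached.setdefault(node, set()).add(connection)
    return {node for node, links in attached.items() if links <= failed_links}


# Hypothetical topology: four nodes connected in a ring 0-1-2-3-0.
RING = [link(0, 1), link(1, 2), link(2, 3), link(3, 0)]

one_link_down = failed_nodes(RING, {link(0, 1)})               # no node is cut off
both_links_down = failed_nodes(RING, {link(0, 1), link(3, 0)})  # node 0 is cut off
```

With a single failed link every node is still reachable, so no node is reported; only when both links of node 0 are marked as failed is node 0 itself treated as failed.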

By means of a system-wide addressing mechanism, the communication system is able to provide reliable data exchange between computer nodes and input/output modules located anywhere in the network.

To tolerate Q faults in the input/output modules, at least (2Q + 1) input/output modules are arranged. Each of the input/output modules is in turn connected to at least one different computer node, and the fault-free data are determined by a (Q + 1)-out-of-(2Q + 1) majority decision.
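The choice of 2Q + 1 modules can be checked with a short counting argument. It is given here as a sketch under the assumption that all fault-free modules deliver the same value; the symbols n_ok and n_bad are introduced only for this illustration.

```latex
% 2Q+1 modules, at most Q of them faulty:
n_{\mathrm{bad}} \le Q
\quad\Longrightarrow\quad
n_{\mathrm{ok}} = (2Q+1) - n_{\mathrm{bad}} \ge Q + 1 > Q \ge n_{\mathrm{bad}} .
% Example: Q = 1 gives 3 modules and a 2-out-of-3 decision.
```

The fault-free value therefore always collects at least Q + 1 votes, while any deviating value collects at most Q, so a (Q + 1)-out-of-(2Q + 1) decision selects the fault-free value.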

A decentralised control for task distribution is provided. It ensures that each subtask is processed in parallel by at least (2M + 1) computer nodes and that the respective following subtask is started after an (M + 1)-out-of-(2M + 1) majority decision.
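An (M + 1)-out-of-(2M + 1) majority decision can be sketched as follows. The function name `majority_vote` and the restriction to exactly 2M + 1 results are assumptions made for this illustration; the patent does not specify how the results are exchanged or compared.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(results: Sequence[Hashable], m: int) -> Optional[Hashable]:
    """Return the value reported by at least m + 1 of the 2m + 1 results, else None."""
    if len(results) != 2 * m + 1:
        raise ValueError("this sketch expects exactly 2m + 1 results")
    value, count = Counter(results).most_common(1)[0]
    return value if count >= m + 1 else None


# One faulty result out of three is outvoted (M = 1).
agreed = majority_vote([42, 42, 7], m=1)
# Three different results give no majority, so nothing is released.
undecided = majority_vote([1, 2, 3], m=1)
```

In a decentralised arrangement each node could apply such a function locally to the results it has received, so that no single voter component is needed; that placement is an interpretation for this sketch, not a statement of the patented control.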

It is advantageous if the computer system is constructed in the form of an N = A × B torus with A columns and B rows, where at least A = (2Q + 1) columns are to be arranged to tolerate Q faults and each column is connected on its input and output side to a different input/output module. It is also advantageous if the input/output modules are constructed as intelligent components such that each input/output module consists of a controller and an input/output unit, the controllers and the input/output units being fully meshed. This makes it possible to reconfigure the assignment of controllers to input/output devices during operation.
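The neighbour relations of an A × B torus can be generated mechanically, as the following sketch shows. Row-major node numbering, the interface order (0 = up, 1 = right, 2 = down, 3 = left) and wrap-around in both directions are assumptions made for this illustration; the attachment of the input/output modules to the columns is not modelled.

```python
from typing import List


def torus_config(a: int, b: int) -> List[List[int]]:
    """Return table[node][interface] -> neighbour index for a torus with a columns, b rows.

    Assumed interface order: 0 = up, 1 = right, 2 = down, 3 = left.
    """
    table = []
    for node in range(a * b):
        row, col = divmod(node, a)
        up = ((row - 1) % b) * a + col
        right = row * a + (col + 1) % a
        down = ((row + 1) % b) * a + col
        left = row * a + (col - 1) % a
        table.append([up, right, down, left])
    return table


config = torus_config(3, 3)
neighbours_of_k4 = config[4]  # the four neighbours of the centre node
```

For the 3 × 3 case the centre node obtains the neighbours 1, 5, 7 and 3 on interfaces 0 to 3, and every link is symmetric: the neighbour reached via interface i leads back via interface (i + 2) mod 4.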

The advantage of the invention is that the fault-tolerant multicomputer system is implemented in a completely decentralised manner. It avoids every kind of central resource (voter, RAM, bus system, clock, ...). No idle standby resources are required; the full computing power of all computer nodes can be used in fault-free operation. Detection, localisation and elimination of faults take place automatically, and the degree of fault tolerance can be adapted to the requirements of the individual case by scaling the network size. Using multiple standard industrial modules as computer nodes instead of costly special hardware yields a considerable cost advantage.

The fault-tolerant multicomputer system will be explained in more detail with reference to an exemplary embodiment.

The accompanying drawings show:

Fig. 1 a schematic circuit diagram of a fault-tolerant multicomputer system,

Fig. 2 an input network,

Fig. 3 an output network,

Fig. 4 a disturbed network according to Fig. 1,

Fig. 5 a configuration table for the example configuration of Fig. 1,

Fig. 6 a local routing table of computer node k4 for the example configuration of Fig. 1,

Fig. 7 a configuration table for the disturbed network according to Fig. 4, and

Fig. 8 a local routing table of computer node k4 of the disturbed network according to Fig. 4.

Fig. 1 shows a schematic representation of a fault-tolerant multicomputer system with the computer nodes kx, where x is the consecutive index of the computer nodes. In this exemplary embodiment, 3 × 3 computer nodes (k0 ... k8) arranged in a matrix are shown; to tolerate M faults in computer nodes, at least N = 2M + 1 computer nodes are necessary. The computer nodes (k0 ... k8) are microcomputers that are in principle identically constructed, equipped with the same functionality and able to operate fully autonomously. Each has a local memory as well as its own clock (and thus time base) and power supply. Each has n (n > 1) communication interfaces vxn, typically four as in the exemplary embodiment (v00, v01, v02, v03, v10, v11, v12, v13 ... v80, v81, v82, v83), in the form of parallel or serial interfaces. The computer nodes (k0 ... k8) are connected to one another by a communication network in the form of several fixed point-to-point connections operating according to the handshake method, such that each computer node can reach every other via several partially or completely different paths.

Thus, from each interface of a computer node a connection leads to exactly one other computer node (k0 ... k8) or to the input/output modules (p0 ... p5), from the remaining interfaces of that computer node to further ones, and so on. Through the resulting spatially compact point-to-point connection network, the computer nodes (k0 ... k8) are meshed with one another such that each computer node can reach every other by more than one path using the distributed routing system. The failure of a connection can be tolerated as long as at least one possible path still exists from each computer node (k0 ... k8) to every other. The network may be inhomogeneous in form, but the initial degree of fault tolerance then differs between the individual regions. Consequently, further routes for data transport are available if one or more communication paths fail.

Each computer node (k0 ... k8) is able to obtain from its local storage (RAM, local boot file system or the like) the information needed to boot its neighbours, such as configuration files or system files, and to reboot any of its neighbours with it if required. Owing to the autonomous design, the failure of any other components of the system has no direct effect on the operability of an individual computer node (k0 ... k8).

So that the high fault tolerance within the network is not nullified at central input/output points to the peripherals, the peripherals are likewise implemented redundantly and connected redundantly. To be Q-fold fault-tolerant with respect to the peripherals, (2Q + 1) input/output modules (p0 ... p5) are provided for input and for output respectively, each of the input/output modules (p0 ... p5) being connected to a different one of the computer nodes (k0 ... k5).

The peripherals can advantageously be implemented as an input network (1) and an output network (2), the input network (1) according to Fig. 2 consisting of (2Q + 1) controllers (P0; P1; P2) and the same number of input units (E0; E1; E2), and the output network (2) according to Fig. 3 consisting of (2Q + 1) controllers (P3; P4; P5) and the same number of output units (E3; E4; E5). Thus, to tolerate Q faults in the input or output, at least (2Q + 1) controllers each (P0; P1; P2 and P3; P4; P5 respectively) are provided, which are fully meshed with the same number of input units (E0; E1; E2) or output units (E3; E4; E5). This makes it possible to tolerate faults in a controller or in an input/output unit during operation by reassigning a controller to an input/output unit.

The correct input and output data are determined by a (Q + 1)-out-of-(2Q + 1) majority decision. This ensures correct continued operation both when Q input/output modules fail and when Q computer nodes connected to the peripherals fail.

To distribute tasks and to exchange administrative and user-process data in a multicomputer system, a communication system is in principle required as part of the system software. In the invention, this communication system is extended by precisely those components that serve the detection and elimination of defective components and reintegration after transient faults.

The communication system (also: routing system) works according to the following completely decentralised principle: each of the computer nodes (k0 ... k8) knows the entire connection list of the network, i.e. it knows which computer node is connected to which other node via which interfaces. The initial connection list, the configuration table (Fig. 5), is defined at configuration time and depends on the number of computer nodes and the number of communication interfaces per computer node. At system start it is passed to each computer node in the form of a table with

k (k = number of nodes) columns and
n (n = number of interfaces per node) rows.

An example of such a configuration table for a 3 × 3 system according to Fig. 1 is given in Fig. 5. For example, computer node k4 is connected via communication interface v40 to computer node k1, via v41 to k5, via v42 to k7 and via v43 to k3. During initialisation and after every detected change in the state of the network, each computer node (k0 ... k8) locally uses this configuration table, whether initial or modified by failure or reintegration, to calculate the path length of a message to every other computer node (k0 ... k8) in the network via each of its possible interfaces (v00 ... v83), and enters the path-length minima in a second table, the local routing table (Fig. 6). The path lengths (network hops) are the basis for choosing the optimal route for user messages.
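Deriving a local routing table from the configuration table can be sketched with a breadth-first search per outgoing interface. The wiring in `CONFIG` is an assumed 3 × 3 torus that agrees with the k4 example in the text but is otherwise hypothetical, since Fig. 5 is not reproduced here. The function names, the exclusion of paths that lead back through the source node, and the use of `None` for an unusable interface are likewise assumptions. "NE" is the routing-table entry the description uses for a destination that cannot be reached via an output.

```python
from collections import deque
from typing import Dict, List, Optional, Union

Config = List[List[Optional[int]]]  # config[node][interface] -> neighbour, None if unusable
Row = Dict[int, Union[int, str]]

# Assumed torus wiring; row i lists the neighbours of node i on interfaces 0..3.
CONFIG: Config = [
    [6, 1, 3, 2], [7, 2, 4, 0], [8, 0, 5, 1],
    [0, 4, 6, 5], [1, 5, 7, 3], [2, 3, 8, 4],
    [3, 7, 0, 8], [4, 8, 1, 6], [5, 6, 2, 7],
]


def routing_table(config: Config, source: int) -> List[Row]:
    """For each interface of source, map every other node to a hop count or "NE"."""
    table = []
    for first_hop in config[source]:
        dist: Dict[int, int] = {}
        if first_hop is not None:
            dist[first_hop] = 1
            queue = deque([first_hop])
            while queue:  # breadth-first search that never re-enters the source
                node = queue.popleft()
                for nxt in config[node]:
                    if nxt is None or nxt == source or nxt in dist:
                        continue
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        table.append({d: dist.get(d, "NE") for d in range(len(config)) if d != source})
    return table


def shortest(table: List[Row], dest: int) -> Union[int, str]:
    """Smallest hop count to dest over all interfaces, or "NE" if none reaches it."""
    hops = [row[dest] for row in table if row[dest] != "NE"]
    return min(hops) if hops else "NE"


table_k4 = routing_table(CONFIG, 4)
broken = [row[:] for row in CONFIG]
broken[4][1] = None  # k4 can no longer use its interface towards k5
broken[5][3] = None  # k5 can no longer use its interface towards k4
table_k4_broken = routing_table(broken, 4)
```

In the intact network k4 reaches k5 in one hop via interface 1. After the link between k4 and k5 is removed, every entry for interface 1 becomes "NE" and the shortest remaining route to k5 takes two hops.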

The entry "NE" in the routing table means that the computer node cannot reach the corresponding target processor via this output. To detect faults, the communication system exploits the structural advantages of the hardware described. The basic principle is that every computer node has its own local view of the network and monitors exactly the connections to its immediate neighbors. To detect the failure of a connection (missing activity, e.g. due to a defective component, a line break or the like), test messages are sent at equal time intervals over all communication interfaces (v00 ... v83) of the computer nodes, and their proper transmission is monitored by means of suitable data protocols and time-outs. Fault detection is therefore possible within the applied time grid (usually a few milliseconds) even without an active user program. When a user message is transmitted, it also serves indirectly as evidence that the channel is working.
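The time-out monitoring of neighbor links can be illustrated with a small sketch. The class name `LinkMonitor`, the millisecond timestamps supplied by the caller, and the 5 ms limit are assumptions made for illustration. The patent only states that test messages are sent at equal intervals and supervised by protocol checks and a time-out.

```python
class LinkMonitor:
    """Tracks the last successful transfer seen on each local interface."""

    def __init__(self, interfaces, timeout_ms):
        self.timeout_ms = timeout_ms
        self.last_seen = {iface: 0 for iface in interfaces}

    def on_traffic(self, iface, now_ms):
        # Test messages and user messages both prove that the channel works.
        self.last_seen[iface] = now_ms

    def failed_links(self, now_ms):
        """Interfaces whose last successful transfer is older than the time-out."""
        return sorted(
            iface
            for iface, seen in self.last_seen.items()
            if now_ms - seen > self.timeout_ms
        )


monitor = LinkMonitor(["v40", "v41", "v42", "v43"], timeout_ms=5)
for now in (2, 4, 6, 8):
    for iface in ("v40", "v42", "v43"):
        monitor.on_traffic(iface, now)
monitor.on_traffic("v41", 2)    # v41 falls silent after t = 2 ms
print(monitor.failed_links(8))  # interfaces silent for more than 5 ms
```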

Fig. 4 shows a disturbed network according to Fig. 1. If a connection is recognized as faulty (the time-out condition becomes true, or a protocol error occurs), e.g. the connection between two computer nodes (k4 and k5), it is removed from the set of possible paths of these two neighboring computer nodes (k4 and k5). If the fault is detected during transmission of a user message, the communication system sends the message to the receiver via another path, and the user notices nothing of the disturbance. As a second reaction to the fault, the computer node that first detects the failure informs all others of the failure of the connection. This is done via a broadcast mechanism that reliably distributes the corresponding information (message) to all active computer nodes in the network: the message is sent over all interfaces to each neighbor, and from these on to their further neighbors, except the one from which the message was received. The distribution terminates wherever the message is received for a repeated time. The broadcast is repeated until the message acknowledgement has returned to the initiator without error.
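The flooding step of this broadcast can be sketched as a simulation. The acknowledgement and the repetition of the broadcast are not modeled. The torus layout, the function names, and the representation of a failed link as a `frozenset` of its two end nodes are assumptions for illustration.

```python
from collections import deque


def torus_neighbors(node, rows=3, cols=3):
    """Neighbors of a node in a row-major numbered torus (assumed layout)."""
    r, c = divmod(node, cols)
    return [
        ((r - 1) % rows) * cols + c,
        r * cols + (c + 1) % cols,
        ((r + 1) % rows) * cols + c,
        r * cols + (c - 1) % cols,
    ]


def flood(origin, failed_links=frozenset(), rows=3, cols=3):
    """Return (nodes that received the message, number of transmissions)."""
    seen = {origin}
    sends = 0
    queue = deque([(origin, None)])
    while queue:
        node, came_from = queue.popleft()
        for nb in torus_neighbors(node, rows, cols):
            # Never send back to the sender, never use a failed link.
            if nb == came_from or frozenset((node, nb)) in failed_links:
                continue
            sends += 1
            if nb in seen:
                continue  # repeated receipt: distribution terminates here
            seen.add(nb)
            queue.append((nb, node))
    return seen, sends


received, sends = flood(4, failed_links={frozenset((4, 5))})
print(sorted(received))  # every node, including k5, despite the failed link
```

Because every node forwards on all remaining interfaces, the message still reaches k5 over an alternative path when the direct link k4/k5 is down.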

This ensures complete information transfer even if several communication interfaces (v00 ... v83) or computer nodes (k0 ... k8) fail simultaneously or in immediate succession. As a result of distributing the information, the modified configuration tables, in which inactive communication interfaces are marked "NE" (Fig. 7), and the routing tables (Fig. 8) arise locally on each computer node (k0 ... k8). In Fig. 7 the connection v41/v53 that failed in the example is represented by the "NE" entries in column k4/row vx1 and in column k5/row vx3. With the "NE" entries for the entire row v41, Fig. 8 shows that computer node k4 no longer reaches any other computer node via communication interface v41.
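The effect of the failed connection v41/v53 on the configuration table can be reproduced with a short sketch. The table layout (one list of four neighbors per node, interface order 0 = up, 1 = right, 2 = down, 3 = left) and the helper name `mark_link_failed` are assumptions. Only the "NE" marking itself comes from the description.

```python
NE = "NE"  # marker for an inactive communication interface


def build_config_table(rows=3, cols=3):
    """Return {node: [neighbor via vX0, vX1, vX2, vX3]} for a torus."""
    table = {}
    for r in range(rows):
        for c in range(cols):
            table[r * cols + c] = [
                ((r - 1) % rows) * cols + c,
                r * cols + (c + 1) % cols,
                ((r + 1) % rows) * cols + c,
                r * cols + (c - 1) % cols,
            ]
    return table


def mark_link_failed(config, a, b):
    """Mark the link between nodes a and b as "NE" at both of its ends."""
    for node, peer in ((a, b), (b, a)):
        config[node] = [NE if nb == peer else nb for nb in config[node]]


config = build_config_table()
mark_link_failed(config, 4, 5)
print(config[4])  # [1, 'NE', 7, 3]: column k4, row vx1 is "NE"
print(config[5])  # [2, 3, 8, 'NE']: column k5, row vx3 is "NE"
```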

To reintegrate transiently failed connections (e.g. caused by interfering radiation, without physical defects) into the network, connection paths marked as defective are supplied with synchronization messages at equal time intervals. If the neighboring node receives a synchronization message correctly and this state remains stable for a certain time, the connection is noted as valid again and a broadcast about the regained activity of this component is sent in the same form as for a failure. After receiving the message about the failure or resumption of a connection, each computer node recalculates the network status and thus the current path lengths to all other computer nodes. This procedure ensures fully decentralized knowledge of the current network state.
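The requirement that a recovered link stay stable "for a certain time" before it is trusted again can be sketched as a small state machine. The description does not quantify that time, so counting consecutive successful synchronization rounds, the threshold of three, and the class name `LinkRecovery` are all assumptions.

```python
class LinkRecovery:
    """Decides when a link marked as defective may be noted as valid again."""

    def __init__(self, required_ok=3):
        self.required_ok = required_ok  # assumed stability threshold
        self.ok_streak = 0
        self.valid = False

    def on_sync_result(self, ok):
        """Record one synchronization round.

        Returns True exactly once, at the moment the link becomes valid
        again; this is where the recovery broadcast would be triggered.
        """
        if self.valid:
            return False
        self.ok_streak = self.ok_streak + 1 if ok else 0
        if self.ok_streak >= self.required_ok:
            self.valid = True
            return True
        return False


link = LinkRecovery(required_ok=3)
rounds = [True, True, False, True, True, True]
print([link.on_sync_result(ok) for ok in rounds])
# [False, False, False, False, False, True]
```

A single failed round resets the streak, so a link that flickers is not reintegrated prematurely.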

Faults in the computer nodes (k0 ... k8) themselves can be so varied in nature that a neighbor, which can only monitor the communication, cannot identify the cause. Arbitrary failures in a computer node are therefore detected through faulty or missing activity of its communication interfaces. This means, for example, that the complete failure of a computer node (k0 ... k8) results in broadcast messages, sent to all computer nodes (k0 ... k8) in the network, reporting the failure of all of its connections to each of its neighbors. In the subsequent calculation of the network state, the failed computer node is marked as no longer reachable and hence no longer available. This attribute is relevant above all for the further course of operation with regard to task distribution in the network.
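The rule that a node counts as failed once all connections leading to it have been reported as failed can be written down directly. This is a sketch under assumptions: the torus layout, the function names, and the representation of a failed link as a `frozenset` of its two end nodes are illustrative choices.

```python
def torus_neighbors(node, rows=3, cols=3):
    """Neighbors of a node in a row-major numbered torus (assumed layout)."""
    r, c = divmod(node, cols)
    return [
        ((r - 1) % rows) * cols + c,
        r * cols + (c + 1) % cols,
        ((r + 1) % rows) * cols + c,
        r * cols + (c - 1) % cols,
    ]


def unavailable_nodes(failed_links, rows=3, cols=3):
    """Nodes whose every link has been reported as failed."""
    result = []
    for node in range(rows * cols):
        links = [frozenset((node, nb)) for nb in torus_neighbors(node, rows, cols)]
        if all(link in failed_links for link in links):
            result.append(node)
    return result


# All four neighbors of k4 have broadcast the loss of their link to k4.
reports = {frozenset((4, nb)) for nb in (1, 5, 7, 3)}
print(unavailable_nodes(reports))  # [4]
```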

Since, on the one hand, neighbors cannot identify the concrete cause of a malfunction and, on the other hand, a computer node that behaves abnormally cannot be assumed to continue working correctly in future, reintegration attempts (i.e. the sending of test messages) are suspended for a computer node marked as no longer available. They remain suspended until the strategy software instructs a complete restart via a reboot, carried out by a computer node selected by majority procedure. The strategy software is a user program that likewise runs on (2M+1) different computer nodes and decides on resource use and task distribution at user level.

Owing to its hardware structure, the multicomputer system described in the invention is able to guarantee the full functionality of an application even after the failure of any module. To guarantee fault tolerance for the application software as well, a decentralized control for task distribution is provided, likewise running on (2M+1) different computer nodes, which ensures that it itself, the strategy software and every subtask of the application are processed in parallel by at least (2M+1) computer nodes. Final and intermediate results of the subtasks are evaluated by (M+1)-out-of-(2M+1) majority decisions and, if necessary, corrected locally. The next subtask is started with the correct result.
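The (M+1)-out-of-(2M+1) majority decision can be illustrated with a short sketch. The function name `majority_vote`, the restriction to exactly 2M+1 hashable results, and the use of `None` to signal that no majority exists are assumptions for illustration.

```python
from collections import Counter


def majority_vote(results, m):
    """Return the value reported by at least m + 1 of the 2m + 1 replicas.

    Returns None if no value reaches the majority threshold.
    """
    if len(results) != 2 * m + 1:
        raise ValueError("expected exactly 2M + 1 results")
    value, count = Counter(results).most_common(1)[0]
    return value if count >= m + 1 else None


# M = 1: three replicas, one of them faulty.
print(majority_vote([42, 42, 7], m=1))  # 42
print(majority_vote([1, 2, 3], m=1))    # None
```

A replica whose own result differs from the returned value can adopt the majority value locally before the next subtask starts, which corresponds to the local correction mentioned in the text.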

The strategy software has the task of initiating recovery measures to restart defective components and of guaranteeing (2M+1)-fold parallel execution of the application on (2M+1) different computer nodes. If a fault requiring a reaction of the strategy software is detected locally, the validity of the fault state is first checked by majority decision, as for the application, or, if the local node was mistaken, the fault state is reset. If the majority has confirmed a fault, this state is valid for all strategy programs running in parallel. In this case each strategy program determines from its local context who is to trigger the action to remedy the fault (e.g. rebooting a failed computer node). Once agreement on assigning the task has been reached by majority decision, the computer node temporarily designated as executor starts the recovery measure. Success is again determined in a decentralized manner, for example by a rebooted computer node registering with its neighbors via the communication interfaces. These neighboring nodes report the resumption of activity on a connection to all other participants, as described for connection recovery. In the subsequent local network recalculation, the attribute "no longer available" can be reset for this reactivated computer node. The computer node is thereby included in the task distribution again.

Claims

1. Fault-tolerant multicomputer system in which several microcomputers can stand in for one another, characterized in
that, to tolerate M faults in computer nodes, at least N = 2M + 1 computer nodes (k0 ... k8) are arranged, which are interconnected by a communication network in the form of a plurality of predetermined point-to-point connections operating according to the handshake method in such a way that each computer node (k0 ... k8) can reach every other one via several partially or completely different paths and, if one or more communication paths fail, further paths are available for data transport,
that a distributed communication system is present which has failure-detection, fault-elimination and reintegration mechanisms, in that each sending computer node (k0 ... k8) monitors receipt of a transmission by means of a time-out condition and, if activity on a connection path is absent, redirects the transmission and reports this state to all computer nodes (k0 ... k8) using all active connection paths, a failed computer node (k0 ... k8) being recognized by the failure of all connection paths leading to it, and in that test messages for cyclic monitoring of the system state are sent at equal time intervals both over fault-free connection paths and over connection paths marked as faulty, whereby, when a change in the state of the computer system is detected locally, the system is able to communicate both the failure and, when recovered activity of components is detected, their reintegration, to determine the availability state in a decentralized manner, and to adapt the communication paths automatically, thereby ensuring undisturbed operation for all fault-free computer nodes (k0 ... k8),
that the communication system is able, via a system-wide addressing mechanism, to permit reliable data exchange between any two computer nodes (k0 ... k8) arranged anywhere in the network,
that, to tolerate Q faults in the input or output modules, at least (2Q + 1) input or output modules (p0 ... p5) are arranged, each of the input or output modules (p0 ... p5) being connected to a different one of the computer nodes (k0 ... k8), the fault-free data being determined by a (Q + 1)-out-of-(2Q + 1) majority decision, and
that a decentralized control for task distribution is present which ensures that each subtask is processed in parallel by at least (2M + 1) computer nodes (k0 ... k8) and that the subsequent subtask is started after an (M + 1)-out-of-(2M + 1) majority decision.

2. Fault-tolerant multicomputer system according to claim 1, characterized in that the computer system is constructed in the form of an N = A × B torus with A columns and B rows.

3. Fault-tolerant multicomputer system according to claim 2, characterized in that, to tolerate Q faults, at least A = (2Q + 1) columns are arranged and each input and output module (p0 ... p5) is connected to a respective column.

4. Fault-tolerant multicomputer system according to claim 1, 2 or 3, characterized in that, to tolerate Q faults in the input or output, at least (2Q + 1) controllers for the input (P0; P1; P2) or (2Q + 1) controllers for the output (P3; P4; P5) of the computer system are provided, which are connected in a fully networked manner to the same number of input units (E0; E1; E2) or output units (E3; E4; E5), the controllers (P0; P1; P2 or P3; P4; P5) each being connected to N = 2M + 1 different computer nodes (k0 ... k2 and k6 ... k8).