CA1143026A - Computer system - Google Patents

Computer system

Info

Publication number
CA1143026A
Authority
CA
Canada
Prior art keywords
computer
memory
individual
modules
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired
Application number
CA000311096A
Other languages
French (fr)
Inventor
Rudolf Kober
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Siemens AG
Original Assignee
Siemens AG
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Siemens AG

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2038 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant with a single idle spare processing component
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1415 Saving, restoring, recovering or retrying at system level
    • G06F11/142 Reconfiguring to eliminate the error
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2041 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant with more than one idle spare processing component
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/202 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
    • G06F11/2023 Failover techniques
    • G06F11/2025 Failover techniques using centralised failover control functionality

Abstract

ABSTRACT OF THE DISCLOSURE

A computer system has two or more computer modules, each including an individual computer, a coupling memory and a working memory. The modules can be coupled to a system bus comprising a control and address bus and a data bus. Access is obtained to the coupling memory either from the system bus or from the individual computer by switching techniques, and only the individual computer has access to its working memory. The system bus can be coupled to a control computer. A safeguarding memory, to which the control computer has access over the system bus, and a further memory, to which the control computer also has access, are provided. The control computer, the further memory and part of the existing modules are employed to process the user's program, and a monitoring phase is inserted at regular intervals in which all the individual computers are checked for functional capacity by test programs stored in the working memories of the modules. If no defective modules are found, the intermediate results are stored in the safeguarding memory and normal processing continues. If one or more defective modules are recognized, they are replaced by other modules not used for processing the user's program, and the individual function of each replaced module is loaded from the further memory, which stores the entire user program, into the replacement module before processing continues with the last safeguarded intermediate results stored in the safeguarding memory.

Description

BACKGROUND OF THE INVENTION
Field of the Invention
The present invention relates to a computer system in which two or more computer modules can be coupled to a system bus, each of the modules including an individual computer, a coupling memory and a working memory, and in which the system bus comprises a control and address bus and a data bus, and more particularly to such a system in which access can be gained to a coupling memory either from the system bus or from an individual computer by transfer techniques and in which only the individual computer has access to its working memory and the system bus can be coupled to a control computer.
Description of the Prior Art
A computer system of the type briefly described above is known in the art. This prior system operates in a three-phase operation. The first phase consists of a control phase during which only the control computer is operative, carries out its program and informs the individual computers of the function which they must carry out during the following phase. The second phase consists of an autonomous phase during which the individual computers carry out their assigned functions simultaneously and independently of one another without being connected to the control computer or to its memory, and then report the execution of their function by transmitting a "STOP" signal to the control computer. The third phase consists of a data exchange phase which starts when the control computer has received a "STOP" signal from all of the individual computers or from a selection of individual computers established by the circuit, and during which, under the control of the control computer, the data exchange is carried out between the memories of the individual computers, and possibly the control computer.
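The three-phase cycle described above can be pictured with a minimal Python sketch. All names here (`Module`, `run_cycle`, the dictionary used as a coupling memory) are illustrative assumptions and do not appear in the patent, and the autonomous phase is simulated sequentially rather than in parallel.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Function = Callable[[Dict[str, int]], int]


@dataclass
class Module:
    """Hypothetical stand-in for one computer module."""
    number: int
    function: Optional[Function] = None          # assigned in the control phase
    coupling_memory: Dict[str, int] = field(default_factory=dict)
    stopped: bool = False


def run_cycle(modules: List[Module], functions: List[Function]) -> None:
    # Control phase: only the control computer is active and assigns functions.
    for module, function in zip(modules, functions):
        module.function = function
        module.stopped = False

    # Autonomous phase: each individual computer works on its own data and
    # then reports completion with a "STOP" signal.
    results = {}
    for module in modules:
        results[f"r{module.number}"] = module.function(module.coupling_memory)
        module.stopped = True

    # Data exchange phase: starts once every module has reported "STOP";
    # the control computer copies the results into the coupling memories.
    assert all(m.stopped for m in modules)
    for module in modules:
        module.coupling_memory.update(results)
```

Calling `run_cycle` repeatedly with new function lists would model successive cycles, each starting from the results exchanged in the previous one.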
For specific fields of use of data processing systems, for example process control monitoring in nuclear power stations and navigation systems for flying bodies, computer systems having a high degree of reliability are required.
The reliability of data processing systems can be increased by redundancy in construction, for example by a multiple provision of critical components such as a central unit with a working memory, in which, in the case of differing results, the result emitted by the majority of components is used, or else by redundancy in the organization, for example by means of redundant, fault-correcting codes. A fundamental requirement of the organization is the ability to continue computation without a time loss, or with only a small time loss, when faults occur. It is not sufficient to isolate and replace faulty components and then to reinitiate the function being processed from the beginning. Even if this is at all possible, the resulting time loss would generally be incompatible with the requirements of real time problems.
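The majority principle mentioned above for redundant components can be illustrated with a short Python sketch. It is not part of the invention itself, and the function name `majority_result` and the choice to require a strict majority are assumptions made for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_result(results: Sequence[Hashable]) -> Hashable:
    """Return the value produced by a strict majority of redundant units."""
    if not results:
        raise ValueError("no results to vote on")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        # No value was produced by more than half of the units.
        raise ValueError("no majority among redundant results")
    return value
```

With three redundant units, a single faulty result is outvoted by the other two.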
SUMMARY OF THE INVENTION
The object of the present invention is to provide a computer system which facilitates real time operation in spite of breakdown of individual components.
This object is achieved by means of a computer system of the type briefly described above in that a safeguarding memory, to which the control computer has access over the system bus, and a further memory, to which the control computer also has access, are provided.
A high degree of reliability can be achieved with this computer system if it is operated in such a manner that the control computer, the further memory and a part of the existing modules are used to process the user program. A monitoring phase is interposed at regular intervals in which all of the individual computers are checked in respect of functioning capacity by means of test programs stored in the working memories of the modules and defective modules are determined and indicated. In the event that no defective modules are recognized, the intermediate results calculated at that time are stored in the safeguarding memory and further processing of the user program is continued in normal fashion. In the event that one or more defective modules are recognized, such modules are replaced by certain of the other modules which are not used for processing the user program, for which purpose the individual computer function of the module which is to be replaced is loaded from the further memory, which stores the entire user program, into each replacing module. Then, further processing is continued with the last-safeguarded intermediate results stored in the safeguarding memory.
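The two outcomes of a monitoring phase described above, safeguarding when no fault is found and replacement with rollback otherwise, can be sketched in Python as follows. This is a simplified model under stated assumptions: modules are represented by plain integers, the memories by dictionaries, and the name `monitoring_phase` is hypothetical.

```python
from typing import Dict, List, Set


def monitoring_phase(active: List[int], spares: List[int], defective: Set[int],
                     intermediate: Dict[str, int],
                     safeguard: Dict[str, int]) -> List[int]:
    """Return the modules that process the user program after the check."""
    faulty = [m for m in active if m in defective]
    if not faulty:
        # Fault-free: store the current intermediate results.
        safeguard.clear()
        safeguard.update(intermediate)
        return list(active)

    # Faults found: substitute an intact spare for each defective module.
    good_spares = [s for s in spares if s not in defective]
    if len(good_spares) < len(faulty):
        raise RuntimeError("not enough intact spare modules")
    replacement = dict(zip(faulty, good_spares))

    # Continue from the last safeguarded intermediate results.
    intermediate.clear()
    intermediate.update(safeguard)
    return [replacement.get(m, m) for m in active]
```

The sketch omits the reloading of the individual function from the further memory; it only shows which modules remain active and which results processing resumes from.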
Advantageously, the computer system processes the user program in a three-phase cycle.
Advantageously, the computer system is operated in such a manner that after as few as possible phase cycles a monitoring phase is additionally inserted between the autonomous phase and the next data exchange phase.
For triggering of the monitoring phases, the computer system is advantageously provided with a pulse generator which is coupled to the control computer and which triggers the monitoring phases with the period of its pulse train.
In order to exchange a defective module for an intact module, it is expedient for each module to be provided with a fixed module number and a module number which can be modified by the control computer for characterization purposes. The exchange process is then expediently carried out in that the modifiable module numbers of the defective modules are exchanged with those of intact modules, and their fixed module numbers are used for addressing purposes.
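The exchange of module numbers can be illustrated with a minimal Python sketch. The class `ModuleNumbers`, the field name `logical` for the modifiable number, and the function `exchange` are assumptions chosen for illustration; the patent does not specify how the numbers are stored.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ModuleNumbers:
    fixed: int      # fixed module number, used for addressing
    logical: int    # modifiable module number, set by the control computer


def exchange(modules: Dict[int, ModuleNumbers],
             defective_fixed: int, spare_fixed: int) -> None:
    """Swap the modifiable numbers of two modules addressed by fixed number."""
    bad, spare = modules[defective_fixed], modules[spare_fixed]
    bad.logical, spare.logical = spare.logical, bad.logical
```

After the swap, anything that refers to a module by its modifiable number reaches the intact module without further changes.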

Advantageously, a computer system constructed in accordance with the present invention is provided with a time monitoring device which is coupled to the computer system and which indicates an impermissibly long autonomous phase and immediately introduces an additional monitoring phase.
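A time monitoring device of this kind behaves like a watchdog timer. The following Python sketch shows the idea under stated assumptions: the class `TimeMonitor`, its method names and the injectable `clock` parameter are hypothetical, and the permitted duration is a parameter because the patent gives no value.

```python
import time
from typing import Callable, Optional


class TimeMonitor:
    """Hypothetical watchdog for the duration of the autonomous phase."""

    def __init__(self, limit_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit_s = limit_s
        self.clock = clock
        self.started_at: Optional[float] = None

    def start_autonomous_phase(self) -> None:
        self.started_at = self.clock()

    def overdue(self) -> bool:
        """True if the autonomous phase has run longer than permitted."""
        if self.started_at is None:
            return False
        return self.clock() - self.started_at > self.limit_s
```

A control loop would poll `overdue()` and start a monitoring phase as soon as it returns `True`.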
The computer system can advantageously be designed in such a manner that each module possesses a parity production and checking unit which constantly monitors the module and, on the recognition of a defect, reports this defect to the control computer by means of a parity fault message and thus immediately triggers a monitoring phase.
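Parity production and checking can be sketched in a few lines of Python. The patent does not state whether even or odd parity is used or how wide a word is; the sketch assumes even parity over a non-negative integer word, and the function names are illustrative.

```python
from typing import Tuple


def parity_bit(word: int) -> int:
    """Even-parity bit: 1 if the word has an odd number of set bits."""
    return bin(word).count("1") % 2


def store(word: int) -> Tuple[int, int]:
    """Parity production: keep the word together with its parity bit."""
    return word, parity_bit(word)


def check(word: int, stored_parity: int) -> bool:
    """Parity checking: True if the word still matches its parity bit."""
    return parity_bit(word) == stored_parity
```

A single flipped bit changes the parity and is detected; an even number of flipped bits is not, which is a known limit of a single parity bit.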
Thus, in accordance with one broad aspect of the invention, there is provided a computer system comprising: a control computer; a system bus system, including a control and address bus and a data bus, connected to said control computer; a plurality of computer modules connected to said system bus, each including an individual computer, a coupling memory and a working memory, access to said coupling memory being had from said system bus and from said individual computer; an information safeguarding memory connected to said system bus for storing intermediate computed results; a further memory connected to said control computer for storing an entire user program, said control computer operable to monitor the performance of said individual computers and to substitute an operable individual computer along with said safeguarding and further memories in response to faulty operation of an individual computer, means operable to periodically interpose a monitoring phase in the multi-phase operation of the system; means in the individual computers, including test program means in said working memories, to check the functioning capacity of the respective computers; means for determining and signaling an intact or a defective module; means for storing the intermediate computed results in response to fault-free detection; means for causing said further memory to load the program of a defective module into a substitute module in response to detection of a defective module; and means for causing the safeguarding memory to provide the intermediate results to the substitute module and continuing of the data processing originally undertaken.
In accordance with another broad aspect of the invention there is provided a method of operating a computer system which has a control computer, a bus system connected to the control computer and a plurality of modules connected to the bus system each including a working memory storing test programs, an individual computer and a coupling memory, a safeguarding memory and a further memory storing an entire user program, comprising the steps of:
operating the system through a control phase in which the control computer informs the individual computers of their functions, an autonomous phase in which the individual computers carry out their functions, and a data exchange phase in which data is exchanged between computers; operating the system at regular intervals through a monitoring phase in which the individual test programs are run and contemporaneously checking the operating capabilities of each module, storing intermediate computed results in the safeguarding memory; continuing normal processing when faults are not found; loading the individual function of a defective module from the further memory into replacement module in response to detection of a faulty module; and continuing processing with the replacement module and the intermediate results stored in the safeguarding memory.
BRIEF DESCRIPTION OF THE DRAWING
Other objects, features and advantages of the invention, its organization, construction and mode of operation will be best understood from the following detailed description, taken in conjunction with the accompanying drawing, on which there is a single figure which is a block diagram illustration of an exemplary embodiment of a computer system which is constructed and operates in accordance with the present invention.
DESCRIPTION OF THE PREFERRED EMBODIMENT
Referring to the drawing, the exemplary embodiment illustrated comprises a plurality of computer modules 11, 12, 13, 15, 16 and 18 which are coupled to a system data line. Each module comprises a coupling memory KS, an individual computer ER and a working memory AS. In each module, only the individual computer has access to its working memory, whereas access can be obtained to the coupling memory selectively from the individual computer or from the system bus. For purposes of fault recognition, each module is provided with a parity production and checking unit and possesses its own output a for the parity fault message. By way of characterization, each module possesses a fixed module number and a module number which can be modified from the control computer. Furthermore, a control computer STR is provided which can be coupled to the system bus 1 and has access to a further memory GS and, via the system bus, to a safeguarding memory SS. The further memory GS preferably consists of a high-speed large-capacity memory, for example a disc memory. All of the individual computers are preferably microprocessors. The safeguarding memory SS is preferably identical in construction to the coupling memory of a module. Also provided are a pulse generator T and a time monitoring device ZU which are both coupled to the control computer STR. The pulse train period of the pulse generator regularly triggers monitoring phases. All of the outputs a of the computer modules are likewise connected to the control computer STR.
In the following the cooperation of all the described components will be explained.
It has been assumed that the modules 11--15 are used to process the user program, whereas the modules 16--18 are redundant modules. The computer system which processes the user program comprises the modules 11--15, the control computer STR and the further memory GS and can simultaneously process as many sub-functions of the user program as computer modules 11--15 are provided.
The computer system operates in the above-described three-phase cycle.
The computer state is established following each three-phase cycle by the individual functions stored in the modules and by the exchanged results which are primarily intermediate results.
Whereas the individual functions are fixed and can be called up, for example from the further memory, the intermediate results must be safeguarded. Safeguarding is carried out, together with a check on the computer, in the additionally interposed monitoring phases.
The duration between two monitoring phases is determined by the period duration of the pulse generator T. The pulse generator transmits an interrupt request to the control computer which inserts a monitoring phase before the next data exchange phase.
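The effect of the pulse generator on the phase sequence can be illustrated with a small Python sketch. As a simplifying assumption, the pulse period is expressed here as a whole number of phase cycles rather than as a time, and the function name `phase_sequence` is hypothetical.

```python
from typing import List


def phase_sequence(cycles: int, period: int) -> List[str]:
    """List the phases run when a pulse arrives every `period` cycles."""
    if period < 1:
        raise ValueError("period must be at least 1")
    phases: List[str] = []
    for cycle in range(1, cycles + 1):
        phases += ["control", "autonomous"]
        if cycle % period == 0:
            # Pending interrupt request: monitor before the data exchange.
            phases.append("monitoring")
        phases.append("data exchange")
    return phases
```

With `period=2`, every second cycle contains four phases instead of three.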
The control computer starts test programs which are provided in all the modules and which carry out a function check of the modules. Here, it is necessary to use test programs which, in the case of fault-free modules, do not permanently alter the memory contents. The fault messages are stored in the coupling memory KS. The control computer now checks whether fault messages have been received from modules entrusted with the processing of a sub-function. If this is not the case, for the following data exchange phase the safeguarding memory is coupled to the system bus in order to receive the intermediate results simultaneously with the coupling memories of the modules entrusted with the sub-functions. The further processing of the user program is then continued without modification. If, however, faults occur, the defective modules are replaced by intact, previously unused modules.
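The requirement that a test program must not permanently alter the memory contents of a fault-free module can be met by saving and restoring each cell. The following Python sketch shows one such test under stated assumptions: the memory is modeled as a list of 8-bit values, and the function name `memory_test` and the test pattern are illustrative, not taken from the patent.

```python
from typing import MutableSequence


def memory_test(memory: MutableSequence[int], pattern: int = 0x55) -> bool:
    """Write/read-back test that restores every cell it touches."""
    ok = True
    for address in range(len(memory)):
        saved = memory[address]
        # Write a pattern and its bitwise complement, reading each back.
        for value in (pattern, pattern ^ 0xFF):
            memory[address] = value
            if memory[address] != value:
                ok = False
        memory[address] = saved  # restore the original contents
    return ok
```

On fault-free memory the function returns `True` and leaves the contents exactly as they were.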
Replacement is carried out in the following steps: the module numbers, modifiable by the control computer, of the free and defective modules are exchanged, the modules being addressed during this procedure by way of their fixed module numbers; then, the missing individual functions are reloaded from the further memory which stores the entire user program. For the duration of the following exchange phase, the safeguarding memory is coupled to the system bus. In contrast to a fault-free situation in which the intermediate results have been written into the safeguarding memory, it now forms the source of safeguarded results. These safeguarded results are read from the safeguarding memory and transferred into the coupling memories.
This fulfills the conditions for the restarting of the system.
The starting point is the control phase which follows the last phase cycle with a fault-free monitoring phase.

In addition to initiation by the pulse generator T, monitoring phases can also be triggered by the time monitoring device ZU which indicates an impermissibly long autonomous phase or by a parity fault message from one of the modules which appears at the output a. In these situations, the modules are checked immediately, and not only after the conclusion of the autonomous phase.
Although I have described my invention by reference to a particular illustrative embodiment thereof, many changes and modifications of the invention may become apparent to those skilled in the art without departing from the spirit and scope of the invention. I therefore intend to include within the patent warranted hereon all such changes and modifications as may reasonably and properly be included within the scope of my contribution to the art.

Claims (8)

THE EMBODIMENTS OF THE INVENTION IN WHICH AN EXCLUSIVE
PROPERTY OR PRIVILEGE IS CLAIMED ARE DEFINED AS FOLLOWS:
1. A computer system comprising: a control computer; a system bus system, including a control and address bus and a data bus, connected to said control computer; a plurality of computer modules connected to said system bus, each including an individual computer, a coupling memory and a working memory, access to said coupling memory being had from said system bus and from said individual computer; an information safeguarding memory connected to said system bus for storing intermediate computed results; a further memory connected to said control computer for storing an entire user program, said control computer operable to monitor the performance of said individual computers and to substitute an operable individual computer along with said safeguarding and further memories in response to faulty operation of an individual computer, means operable to periodically interpose a monitoring phase in the multi-phase operation of the system; means in the individual computers, including test program means in said working memories, to check the functioning capacity of the respective computers; means for determining and signaling an intact or a defective module; means for storing the intermediate computed results in response to fault-free detection; means for causing said further memory to load the program of a defective module into a substitute module in response to detection of a defective module; and means for causing the safeguarding memory to provide the intermediate results to the substitute module and continuing of the data processing originally undertaken.
2. The computer system of claim 1, wherein said control computer includes means operable to control said modules in three-phase operation including a control phase informing the individual computers of their processes, an autonomous phase in which the individual processes are carried out, and a data exchange phase in which data is exchanged between computers.
3. The computer system of claim 2, comprising means for periodically interposing a monitoring phase between said autonomous phase and said data exchange phase.
4. The computer of claim 1, wherein the first-mentioned means includes a pulse train generator connected to said control computer for triggering the monitoring phases with the period of the pulse train produced by said pulse generator.
5. The computer system of claim 1, wherein each of said modules comprises: fixed address number means; and modifiable address number means.
6. The computer system of claim 1, comprising time monitoring means connected to said control computer for monitoring the time of operation of said modules and causing said control computer to effect monitoring of the functioning of said modules in response to the time of operation being greater than a predetermined time interval.
7. The computer system of claim 1, wherein the means for determining and signaling an intact or a defective module includes a parity checking circuit in each module operable to transmit a fault message to said control computer in response to defective operation of a module.
8. A method of operating a computer system which has a control computer, a bus system connected to the control computer and a plurality of modules connected to the bus system each including a working memory storing test programs, an individual computer and a coupling memory, a safeguarding memory and a further memory storing an entire user program, comprising the steps of: operating the system through a control phase in which the control computer informs the individual computers of their functions, an autonomous phase in which the individual computers carry out their functions, and a data exchange phase in which data is exchanged between computers; operating the system at regular intervals through a monitoring phase in which the individual test programs are run and contemporaneously checking the operating capabilities of each module, storing intermediate computed results in the safeguarding memory; continuing normal processing when faults are not found; loading the individual function of a defective module from the further memory into replacement module in response to detection of a faulty module; and continuing processing with the replacement module and the intermediate results stored in the safeguarding memory.
CA000311096A 1977-09-14 1978-09-12 Computer system Expired CA1143026A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
DE19772741379 DE2741379A1 (en) 1977-09-14 1977-09-14 COMPUTER SYSTEM
DEP2741379.2 1977-09-14

Publications (1)

Publication Number Publication Date
CA1143026A true CA1143026A (en) 1983-03-15

Family

ID=6018946

Family Applications (1)

Application Number Title Priority Date Filing Date
CA000311096A Expired CA1143026A (en) 1977-09-14 1978-09-12 Computer system

Country Status (8)

Country Link
JP (1) JPS5451439A (en)
BE (1) BE870484A (en)
CA (1) CA1143026A (en)
DE (1) DE2741379A1 (en)
FR (1) FR2403598B1 (en)
GB (1) GB2004673B (en)
IT (1) IT1098538B (en)
NL (1) NL7809313A (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4412281A (en) * 1980-07-11 1983-10-25 Raytheon Company Distributed signal processing system
GB2217487B (en) * 1988-04-13 1992-09-23 Yokogawa Electric Corp Dual computer system
GB2369538B (en) * 2000-11-24 2004-06-30 Ibm Recovery following process or system failure

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB1243464A (en) * 1969-01-17 1971-08-18 Plessey Telecomm Res Ltd Stored-programme controlled data-processing systems
JPS5633915B1 (en) * 1970-11-06 1981-08-06
JPS5627905B1 (en) * 1970-11-06 1981-06-27
BE789828A (en) * 1972-10-09 1973-04-09 Bell Telephone Mfg DATA PROCESSING OPERATING SYSTEM.
CA1053352A (en) * 1974-11-12 1979-04-24 Scott A. Inrig Method for providing a substitute memory module in a data processing system
DE2546202A1 (en) * 1975-10-15 1977-04-28 Siemens Ag COMPUTER SYSTEM OF SEVERAL INTERCONNECTED AND INTERACTING INDIVIDUAL COMPUTERS AND PROCEDURES FOR OPERATING THE COMPUTER SYSTEM

Also Published As

Publication number Publication date
DE2741379A1 (en) 1979-03-15
JPS618988B2 (en) 1986-03-19
BE870484A (en) 1979-01-02
FR2403598B1 (en) 1985-08-30
JPS5451439A (en) 1979-04-23
IT1098538B (en) 1985-09-07
GB2004673B (en) 1982-02-03
NL7809313A (en) 1979-03-16
IT7827595A0 (en) 1978-09-13
FR2403598A1 (en) 1979-04-13
GB2004673A (en) 1979-04-04

Similar Documents

Publication Publication Date Title
EP0045836B1 (en) Data processing apparatus including a bsm validation facility
EP0031501A2 (en) Diagnostic and debugging arrangement for a data processing system
EP0496506A2 (en) A processing unit for a computer and a computer system incorporating such a processing unit
US3959638A (en) Highly available computer system
CA2032067A1 (en) Fault-tolerant computer system with online reintegration and shutdown/restart
JPH0950424A (en) Dump sampling device and dump sampling method
JPH0526214B2 (en)
JPS6375963A (en) System recovery system
CA1143026A (en) Computer system
US20070055480A1 (en) System and method for self-diagnosis in a controller
TW200307200A (en) Multiple fault location in a series of devices
JPH11120154A (en) Device and method for access control in computer system
JPH0754947B2 (en) Standby system monitoring method
JPH1027115A (en) Fault information sampling circuit for computer system
Blakeney et al. An application-oriented multiprocessing system, II: Design characteristics of the 9020 system
JPH07114521A (en) Multimicrocomputer system
JPH047645A (en) Fault tolerant computer
EP0342261B1 (en) Arrangement for error recovery in a self-guarding data processing system
SU849219A1 (en) Data processing system
JPH079636B2 (en) Bus diagnostic device
JPS6113627B2 (en)
JP3042034B2 (en) Failure handling method
Dieterich et al. A compatible airborne multiprocessor
JPH07334383A (en) Computer with monitoring and diagnostic function
JPH02122335A (en) Test method for ras circuit

Legal Events

Date Code Title Description
MKEX Expiry