JPH02301854A

JPH02301854A - Distributed processing system

Info

Publication number: JPH02301854A
Application number: JP1123795A
Authority: JP
Inventors: Makoto Fujii; 誠藤井
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1989-05-17
Filing date: 1989-05-17
Publication date: 1990-12-13

Abstract

PURPOSE: To improve system reliability without requiring special provisions in the software, by monitoring the quantity of data held in a data latching mechanism and outputting a fault detection signal when that quantity exceeds a fixed value. CONSTITUTION: When an abnormality occurs in a processor 11 and the processor 11 stops making data accesses, newly arriving data accumulate in a data latching mechanism 16. When the value of an addressing mechanism 17 exceeds a fixed value, the processor 11 is judged to be faulty, and a fault detection signal, that is, a switch changeover signal, is transmitted to a crossbar switch 15 to switch routes. A normal processor 11 then takes over from the faulty processor 11, securing the function of the whole system. No special extra function therefore needs to be added to the software of the processor 11, and the reliability of the system is improved.

Description

[Detailed Description of the Invention] [Object of the Invention] (Field of Industrial Application) The present invention relates to a distributed processing system applied to fields in which high computer reliability is required, such as nuclear power plants and space technology.

(Prior Art) In recent years, distributed processing systems composed of a plurality of processors have come to be used as a means of improving the reliability of computers.

In such a distributed processing system, each of the plurality of processors performs its assigned function, and the processors are connected by data transmission paths and operate cooperatively while exchanging data.

Distributed processing systems were originally intended to improve processing efficiency, but by exploiting the characteristics of distributed processing it has also become possible to improve reliability.

Several specific hardware configurations for distributed processing systems have been proposed. Among them, a distributed processing system using a crossbar switch has been proposed as a technique in which, even if some of the processors fail, other healthy processors can take over part of their functions, so that loss of system function is avoided.

FIG. 2 shows an example of a conventional distributed processing system using a crossbar switch. This system is configured by connecting a plurality of processors 21, a plurality of memories 22, and a plurality of input/output interfaces 23 via transmission lines 24 and a crossbar switch 25. When a processor 21 accesses a memory 22 or an input/output interface 23, the crossbar switch 25 switches the configured address, so that arbitrary access is possible.

In this type of distributed processing system, when all the processors 21 are healthy, the addresses of the crossbar switch 25 fix the memories 22 and input/output interfaces 23 to the processors 21 on a one-to-one basis. That is, the addresses are set so that there are multiple sets, each consisting of a processor 21, a memory 22, and an input/output interface 23, and each set can perform its own function.

On the other hand, when a failure occurs in one of the processors 21, the function performed by that processor 21 would normally go down. In this system, however, when a failure of a processor 21 is detected, the crossbar switch 25 switches the address of another processor 21. As a result, the switched processor 21 can access not only the memory 22 and input/output interface 23 it originally accessed, but also the memory 22 and input/output interface 23 that the failed processor 21 had been accessing.

In this way, the function of the failed processor 21 is taken over by another processor 21 without going down, and the overall function is maintained. Because the failure of some of the processors 21 no longer propagates to the functions of the entire system, the reliability of the system can be improved.

In this type of distributed processing system, the most important point is that, when a failure occurs in a processor 21, the failure can be reliably detected.

Conventionally, a watchdog timer has been used as such a failure detection technique.

In a distributed processing system using a watchdog timer, as shown in FIG. 2, a timer 26 is additionally provided for each processor 21 of the crossbar-switch-based distributed system, and each timer 26 counts up time continuously. Meanwhile, software on each processor clears the value of the timer 26 at a fixed period.

With this scheme, the value of the timer 26 never reaches or exceeds a certain value as long as the processor 21 is functioning normally. Therefore, if the value of the timer 26 exceeds that fixed value, it can be judged that an abnormality has occurred in the processor 21.

When the timer 26 exceeds the fixed value, it generates a failure occurrence signal and transmits it to the crossbar switch 25, which switches the addresses.
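The conventional watchdog scheme described above can be sketched roughly as follows. This is only an illustrative model, not the circuit of FIG. 2: the class name `WatchdogTimer`, the helper `run`, the tick granularity, and the limit of 5 are all assumptions introduced for the example, since the source says only that the timer must not exceed "a fixed value".

```python
class WatchdogTimer:
    """Hypothetical model of timer 26: counts up every tick until software clears it."""

    def __init__(self, limit):
        self.limit = limit  # assumed threshold; the source only says "a fixed value"
        self.count = 0

    def tick(self):
        """Advance the timer by one tick; return True when the fault signal fires."""
        self.count += 1
        return self.count > self.limit

    def clear(self):
        """Called periodically by the processor's software while it is healthy."""
        self.count = 0


def run(ticks, clear_every, fails_at=None, limit=5):
    """Return the tick at which a fault is signalled, or None if none is."""
    timer = WatchdogTimer(limit)
    for t in range(1, ticks + 1):
        healthy = fails_at is None or t < fails_at
        # Only a healthy processor executes the periodic clear in its software.
        if healthy and t % clear_every == 0:
            timer.clear()
        if timer.tick():
            return t
    return None
```

In this model, `run(100, 3)` never signals a fault, while `run(100, 3, fails_at=10)` signals one a few ticks after the processor stops clearing the timer. Note that the detection depends on the `clear()` call living in the processor's own software, which is the drawback the following paragraph raises.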

In a watchdog-timer-based distributed processing system configured as above, the software used on each processor 21 must include a function that periodically clears the timer 26. In computer systems that require high reliability, the software is generally required to be equally reliable, so its structure must be kept as simple as possible and free of extra functions. Because the watchdog timer scheme requires this special provision in the software, it poses a problem from the standpoint of software reliability.

For this reason, a distributed processing system using a common memory, as shown in FIG. 3, has been proposed by some.

As shown in FIG. 3, this system is configured by connecting a plurality of processors 31, a plurality of memories 32, and a plurality of input/output interfaces 33 via transmission lines 34 and a crossbar switch 35, and is further provided with a common memory 36 that every processor 31 can access. While each processor 31 performs its function, information indicating its state is stored in the common memory 36. If that information includes an indication of some abnormal state, the crossbar switch 35 is notified and the addresses are switched.

(Problems to Be Solved by the Invention) In the conventional distributed processing system using a common memory, the diagnostic information of all the processors 31 is concentrated in the common memory 36. Should the common memory 36 itself fail, the system would be seriously affected.

The present invention has been made in view of these points, and its object is to provide a distributed processing system that can detect a failure occurring in a processor without adding extra functions to the software and without providing any common part.

[Structure of the invention]

(Means for Solving the Problems) To achieve the above object, the present invention is characterized in that, in a distributed processing system comprising a plurality of processors, a plurality of memories, a plurality of input/output interfaces, and a crossbar switch mechanism that performs path switching so that a transmission path is secured between any processor, memory, and input/output interface, there are provided data latch means that periodically captures input data and sends out the captured data in response to data access from a processor, and failure detection means that monitors the amount of data captured in the data latch means and outputs a failure detection signal when that amount exceeds a fixed value.

(Operation) In the distributed processing system according to the present invention, the data latch means periodically accumulates process data obtained from the target processor and stores it in time series, while the processor accesses data from the data latch means on a FIFO (First In First Out) basis. The amount of data held in the data latch means, such as the number of stored data items, is constantly monitored by the failure detection means.

If, for example, the data capture period of the data latch means and the data reference period are made the same, the address value remains nearly constant and therefore does not normally grow beyond a certain value.

However, if one of the processors fails and stops accessing data, the address value gradually increases and eventually exceeds a certain fixed value. The failure detection means detects this, notifies the other healthy processors that a failure has occurred in that processor, and switches the crossbar switch so that a healthy processor takes over the function of the failed processor.

(Embodiment) An embodiment of the present invention is described below with reference to FIG. 1.

In FIG. 1, reference numeral 11 denotes a plurality of processors. These processors 11, a plurality of memories 12, and a plurality of input/output interfaces 13 are connected to a crossbar switch 15 via transmission lines 14.

As shown in FIG. 1, each input/output interface 13 is provided with a data latch mechanism 16 and an address mechanism 17. The data latch mechanism 16 captures and stores input data at a fixed period and, in response to data access from a processor 11, sends out data in order starting from the oldest stored item. The number of data items stored in the data latch mechanism 16 is held in the address mechanism 17.

Next, the operation of this embodiment is described.

Each input/output interface 13 captures data at a fixed period, stores it in the data latch mechanism 16, and increments the value of the address mechanism 17 by one.

Meanwhile, when a processor 11 needs data from an input/output interface 13, it performs a data access to that input/output interface 13. The input/output interface 13 then passes the oldest data stored in the data latch mechanism 16 to the processor 11 and decrements the value of the address mechanism 17 by one.
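The capture and access behaviour of an input/output interface 13 can be modelled in a few lines. The sketch below is a software analogy only: the class name `IOInterface`, the method names, and the choice to return `None` from an empty latch are assumptions made for illustration, since the source describes hardware mechanisms and does not say what happens on an access to an empty latch.

```python
from collections import deque


class IOInterface:
    """Hypothetical software model of input/output interface 13."""

    def __init__(self):
        self.latch = deque()  # stands in for data latch mechanism 16 (FIFO order)
        self.address = 0      # stands in for address mechanism 17: stored item count

    def capture(self, value):
        """Periodic capture: store the new input and count up by one."""
        self.latch.append(value)
        self.address += 1

    def access(self):
        """Processor access: hand over the oldest item and count down by one."""
        if not self.latch:
            return None  # assumption: an empty latch yields nothing
        self.address -= 1
        return self.latch.popleft()


io = IOInterface()
io.capture("sample-1")
io.capture("sample-2")
oldest = io.access()  # the earliest stored item is delivered first
```

After the two captures and one access above, `oldest` is `"sample-1"` and `io.address` is 1, matching the count-up and count-down rule in the text.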

If the data capture period of the input/output interface 13 and the data access period of the processor 11 are made the same, the process data delivered to the processor 11 is sufficiently fresh, and the value held in the address mechanism 17 is either 1 or 0 and never increases beyond that. Therefore, to confirm that the processor 11 is healthy, it suffices to monitor that the value of the address mechanism 17 is 1 or less.
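The claim that the address value stays at 0 or 1 under equal periods can be checked with a small simulation. The function name `address_trace` and the assumption that each cycle consists of exactly one capture followed by at most one processor access are illustrative choices, not details given in the source.

```python
def address_trace(cycles, processor_stops_at=None):
    """Return the address value seen right after the capture in each cycle.

    Each cycle models one capture by the interface followed by one access
    by the processor, i.e. equal capture and access periods (an assumption
    matching the embodiment). `processor_stops_at` is the first cycle in
    which the processor no longer accesses data.
    """
    address = 0
    trace = []
    for cycle in range(cycles):
        address += 1  # interface captures one item and counts up
        trace.append(address)
        running = processor_stops_at is None or cycle < processor_stops_at
        if running:
            address -= 1  # processor reads one item and the count goes down
    return trace


healthy = address_trace(5)
stalled = address_trace(6, processor_stops_at=3)
```

With a healthy processor the trace is `[1, 1, 1, 1, 1]` (dropping back to 0 after each access); once the processor stops at cycle 3 the trace becomes `[1, 1, 1, 1, 2, 3]`, so the count begins to climb.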

That is, when an abnormality occurs in a processor 11 and it stops accessing data, new data accumulates in the data latch mechanism 16 one item after another, and the value of the address mechanism 17 grows accordingly. When the value of the address mechanism reaches a certain fixed value, for example 3 or more, it is judged that the processor 11 has failed. A failure detection signal, that is, a switch changeover signal, is then transmitted to the crossbar switch 15 to switch the paths, so that a healthy processor 11 takes over the function of the failed processor 11 and the function of the whole system is maintained.
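The detection and changeover step can be sketched as follows, under stated assumptions. The threshold of 3 is the embodiment's own example; everything else (the names `detect_fault` and `switch_routes`, the dictionary used as a routing table, the identifiers `P0`, `P1`, `IO0`, `IO1`, and the fixed choice of `P0` as the processor that takes over) is hypothetical, because the source does not describe how the crossbar switch 15 selects the substitute processor.

```python
FAULT_THRESHOLD = 3  # the embodiment's example value ("3 or more")


def detect_fault(address_value, threshold=FAULT_THRESHOLD):
    """Fault detection signal: True once the stored-item count reaches the threshold."""
    return address_value >= threshold


def switch_routes(routes, failed, spare):
    """Return a new routing table in which `spare` also serves the interfaces of `failed`.

    `routes` maps a processor id to the list of I/O interface ids it may access.
    """
    updated = {p: list(ifs) for p, ifs in routes.items() if p != failed}
    updated[spare] = updated.get(spare, []) + list(routes.get(failed, []))
    return updated


routes = {"P0": ["IO0"], "P1": ["IO1"]}
address_values = {"IO0": 1, "IO1": 3}  # IO1 is backing up: P1 has stopped reading

# Iterate over a snapshot, because `routes` is rebound when a fault is found.
for proc, interfaces in list(routes.items()):
    if any(detect_fault(address_values[i]) for i in interfaces):
        routes = switch_routes(routes, failed=proc, spare="P0")
```

After the loop, `routes` is `{"P0": ["IO0", "IO1"]}`: the healthy processor now reaches both interfaces. The processors run no diagnostic code of their own here; the decision rests entirely on the count kept at each interface.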

Thus, according to this embodiment, failures can be detected without adding any special function to the software of the processors 11, and reliability can be improved through the simplicity of the software. In addition, because the fault diagnosis mechanism is distributed across the subsystems and has no common part, the reliability of the system as a whole can also be improved.

In the above embodiment, the address mechanism 17 stores the number of data items in the data latch mechanism 16 itself. However, the count itself need not be stored; for example, address registers holding the addresses of the oldest data and the newest data may be provided, and the difference between them monitored.
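One way to picture this variant is a ring buffer with two address registers whose difference gives the number of unread items. The sketch is an assumption-laden illustration: the class name `RingLatch`, the buffer size of 8, and the use of monotonically increasing registers reduced modulo the buffer size are choices made here, not details from the source. It also assumes the fault threshold is smaller than the buffer size, so a fault is flagged before old items are overwritten.

```python
class RingLatch:
    """Hypothetical ring-buffer latch with separate oldest/newest address registers."""

    def __init__(self, size=8):
        self.buf = [None] * size
        self.oldest = 0  # address register pointing at the oldest unread item
        self.newest = 0  # address register pointing at the next free slot

    def capture(self, value):
        """Store a new item and advance the newest-address register."""
        self.buf[self.newest % len(self.buf)] = value
        self.newest += 1

    def access(self):
        """Return the oldest unread item and advance the oldest-address register."""
        if self.oldest == self.newest:
            return None  # assumption: nothing is returned when no data is pending
        value = self.buf[self.oldest % len(self.buf)]
        self.oldest += 1
        return value

    def backlog(self):
        """Difference between the two registers, i.e. the number of unread items."""
        return self.newest - self.oldest


latch = RingLatch()
for sample in (10, 20, 30):
    latch.capture(sample)
first = latch.access()
```

Here `first` is 10 and `latch.backlog()` is 2, so monitoring `backlog()` against a threshold plays the same role as monitoring the stored count.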

Also, in the above embodiment the data capture period and the data access period of the processor 11 are the same, but the two periods need not be identical. For example, in the scheme using the address registers described above, the same function can be realized by suitably manipulating the register values.

[Effect of the invention]

As described above, the present invention requires no special extra function to be added to the software of the processors, and has no common part whose failure would seriously affect the entire system, so the reliability of the system can be greatly improved.

[Brief explanation of the drawing]

FIG. 1 is a block diagram showing a distributed processing system according to an embodiment of the present invention; FIG. 2 is a block diagram showing a conventional distributed processing system of the watchdog timer type; and FIG. 3 is a block diagram showing a conventional distributed processing system of the common memory type.
11: processor; 12: memory; 13: input/output interface; 14: transmission line; 15: crossbar switch; 16: data latch mechanism; 17: address mechanism.

Claims

[Claims]

A distributed processing system comprising a plurality of processors, a plurality of memories, a plurality of input/output interfaces, and a crossbar switch mechanism that performs path switching so that a transmission path is secured between any processor, memory, and input/output interface, the system being characterized by comprising: data latch means that periodically captures input/output data and sends out the captured data in response to data access from a processor; and failure detection means that monitors the amount of data captured in the data latch means and outputs a failure detection signal when that amount exceeds a fixed value.