JPH117430A

JPH117430A - Parallel processing system backup method

Info

Publication number: JPH117430A
Application number: JP9158307A
Authority: JP
Inventors: Yasuhiro Kawase; 康裕川瀬
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1997-06-16
Filing date: 1997-06-16
Publication date: 1999-01-12

Abstract

PROBLEM TO BE SOLVED: To inexpensively construct and operate an inter-CPN parallel processing system that is both efficient and reliable, by deciding the number of coupling mechanisms (CFs) according to the number of processing nodes (CPNs), sharing hardware resources, and assigning a replacement CF. SOLUTION: When a CF monitoring program detects a failure in CF1 (4), monitoring of CF1 (4) is suspended and all copy data managed by CF0 (3) is invalidated. The CF state monitoring CPN1 (18) reads the contents of memory M1 (5) through a link cable 19, writes them to memory M2 (6), and thereby moves all necessary data under the management of CF0 (3) onto coupling device 2. Processing devices 1 (7) to n (9) are then given the lock right to shared memory M3 (16), and processing in each CPN is resumed. At this time, the connection between the replacement CF1 (18) and CPN0 (10) to CPNn-1 (12) shares the link cable 21 that connects CF1 (4) with CPN0 (10) to CPNn-1 (12).

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a CF multiplexing method, a CF monitoring method, and a CF migration method for a system provided with a plurality of CFs and a plurality of CPNs, intended to keep the system operating without going down when one CF fails.

[0002]

2. Description of the Related Art

Coupling devices were introduced to realize the high-speed parallel processing that is indispensable for handling today's enormous volume of information. As the system grows, however, the load on the CF in the coupling device increases, and if the CF goes down the system suffers serious damage. Coupling devices have therefore been multiplexed to spread both the load and the risk at the time of a failure, but the conventional failure backup methods have the following two problems.

[0003] 1) In the fully duplicated method, the cost is doubled.

[0004] 2) In the method that distributes functions and rebuilds the parallel system when a failure occurs, system recovery takes time.

[0005]

PROBLEMS TO BE SOLVED BY THE INVENTION

The present invention multiplexes the coupling devices and constantly monitors the CF on each coupling device with a state monitoring CPN that shares hardware with that CF. It thereby provides, at low cost, a means not only of distributing the CF load but also of preventing the system from going down when a CF fails.

[0006]

MEANS FOR SOLVING THE PROBLEMS

To solve the problems described above, the present invention provides a plurality of coupling devices; CF state monitoring CPNs, each sharing, by logical partitioning, the hardware resources of a CF on a coupling device; replacement CFs that operate after the CF function has been migrated; and a CF monitoring program and a CF migration program that run on the CF state monitoring CPNs.

[0007] When a CF failure is found, the program copies the contents of the failed CF's local memory to the local memory of another CF and makes the CPN operate as a replacement CF. Operation can thus continue in the state that existed before the failure, without bringing the system down.

[0008]

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Embodiments of the present invention, from CF monitoring through CF migration at the time of a CF failure, are described in detail below with reference to the drawings.

[0009] FIG. 1 shows one embodiment of the present invention. In FIG. 1, (1) is a first coupling device and (2) is a second coupling device. The coupling devices are provided with coupling mechanisms CF0 (3) and CF1 (4) and local memories M1 (5) and M2 (6), respectively. (7) is a first processing device, and (8) and (9) are the second and n-th processing devices, respectively. Each processing device has a central processing node (hereinafter, CPN) and a local memory: CPN0 (10), CPN1 (11), and CPNn-1 (12), with local memories M4 (13), M5 (14), and M6 (15). These n CPNs are all CPNs on which the OS runs, and they are connected to the coupling mechanisms via link cable (20) for CF0 (3) and link cable (21) for CF1 (4). Processing devices 1 (7), 2 (8), and n (9) are also connected to the shared memory M3 (16).

The processing devices can access the shared memory M3 (16) independently of one another. In addition, each holds a copy of the data it needs in its local memory, so as long as the shared memory has not changed, access to local memory suffices. Because multiple copies of the data exist, as soon as any one of them is rewritten the others become invalid. Data in the shared memory can be accessed only after acquiring a lock for that data, and copy data can be rewritten only after the shared memory has been rewritten first. This prevents multiple copies from being rewritten at the same time.

The information on whether copy data is valid is handled as follows. For example, as shown in FIG. 4, when CPN0 (10) on processing device 1 (7) rewrites copy data in its local memory M4 (13), CPN0 (10) notifies CF0 (3) on coupling device 1 (1) that the data has been rewritten. CF0 (3) then writes into the valid flag tables (24) of all CPNs other than CPN0 (10), that is, CPN1 (11) and CPNn-1 (12), that the data is invalid. When CPN1 (11) or CPNn-1 (12) uses the data, it checks the valid flag table of its processing device; if the data is found to be invalid, it cannot be used until it has been copied again from the shared memory M3 (16) to local memory M5 (14) or M6 (15). In this way CF0 (3) and CF1 (4) control the validity of copy data together with the lock control of the shared memory, and the control data is stored in local memories M1 (5) and M2 (6), respectively.
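The validity control in paragraph [0009] can be illustrated with a minimal sketch. It is not taken from the patent: the class names `CouplingFacility` and `ProcessingNode`, the dictionary-based memories, and the choice to raise an error when a lock is unavailable are all assumptions made for illustration. The sketch shows a write that updates shared memory first and then notifies the CF, and a read that refetches from shared memory when the valid flag says the local copy is stale.

```python
from dataclasses import dataclass, field


@dataclass
class CouplingFacility:
    """Hypothetical model of a CF: lock state plus per-CPN valid flags."""
    cpn_ids: list
    lock_owner: dict = field(default_factory=dict)  # key -> CPN id holding the lock
    valid: dict = field(default_factory=dict)       # (cpn_id, key) -> bool

    def acquire(self, cpn_id, key):
        # Grant the lock only if no other CPN holds it.
        if self.lock_owner.get(key) not in (None, cpn_id):
            return False
        self.lock_owner[key] = cpn_id
        return True

    def release(self, cpn_id, key):
        if self.lock_owner.get(key) == cpn_id:
            del self.lock_owner[key]

    def notify_write(self, writer_id, key):
        # Mark every other CPN's copy invalid; the writer's copy stays valid.
        for cpn_id in self.cpn_ids:
            self.valid[(cpn_id, key)] = (cpn_id == writer_id)

    def is_valid(self, cpn_id, key):
        return self.valid.get((cpn_id, key), False)


class ProcessingNode:
    """Hypothetical CPN with a local-memory copy of shared data."""

    def __init__(self, cpn_id, cf, shared):
        self.cpn_id, self.cf, self.shared = cpn_id, cf, shared
        self.local = {}

    def write(self, key, value):
        if not self.cf.acquire(self.cpn_id, key):
            raise RuntimeError("lock held by another CPN")
        try:
            self.shared[key] = value   # shared memory is rewritten first
            self.local[key] = value    # then the local copy
            self.cf.notify_write(self.cpn_id, key)
        finally:
            self.cf.release(self.cpn_id, key)

    def read(self, key):
        if not self.cf.is_valid(self.cpn_id, key):
            # Copy is invalid or absent: refetch from shared memory under the lock.
            if not self.cf.acquire(self.cpn_id, key):
                raise RuntimeError("lock held by another CPN")
            try:
                self.local[key] = self.shared[key]
                self.cf.valid[(self.cpn_id, key)] = True
            finally:
                self.cf.release(self.cpn_id, key)
        return self.local[key]
```

In this sketch a read that finds a valid flag never touches shared memory, which corresponds to the statement that local memory access suffices while the shared data is unchanged.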

[0010] CF state monitoring CPN0 (17) is a dedicated CPN that shares hardware with CF0 (3) and on which a program monitoring the states of CF0 (3) and CF1 (4) resides and runs. CF state monitoring CPN1 (18) is a dedicated CPN that shares hardware with CF1 (4) and monitors the states of CF0 (3) and CF1 (4). CF state monitoring CPN0 (17) also functions as the replacement CF for CF1 (4) when the CF state monitoring program detects a failure, and it shares the local memory of CF0 (3). Likewise, CF state monitoring CPN1 (18) also functions as the replacement CF for CF0 (3) when the CF state monitoring program detects a failure, and it shares the local memory of CF1 (4). CF state monitoring CPN0 (17) and CF state monitoring CPN1 (18) are also each externally connected to both CF0 (3) and CF1 (4), and the CF monitoring program checks the state of the CFs periodically.
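The cross-backup arrangement in paragraph [0010] can be sketched as one round of periodic monitoring. This is an illustrative sketch only: `MonitorCPN`, `MONITORS`, `poll_once`, and the `is_alive` health probe are hypothetical names, and the rule that a monitor takes over only while its own coupling device is healthy is an assumption that the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitorCPN:
    name: str          # label of the CF state monitoring CPN
    colocated_cf: str  # CF whose hardware and local memory it shares
    backs_up: str      # CF it replaces when that CF fails


# Pairing taken from the description: each monitor backs up the other device's CF.
MONITORS = [
    MonitorCPN("CF state monitoring CPN0 (17)", colocated_cf="CF0", backs_up="CF1"),
    MonitorCPN("CF state monitoring CPN1 (18)", colocated_cf="CF1", backs_up="CF0"),
]


def poll_once(monitors, is_alive):
    """Run one monitoring round.

    is_alive is a hypothetical health probe: a callable mapping a CF name
    to True or False. Returns {failed CF: name of the monitor that takes over}.
    """
    takeovers = {}
    for monitor in monitors:
        # Assumption: a monitor can act as replacement only if the coupling
        # device it runs on is itself still healthy.
        if is_alive(monitor.colocated_cf) and not is_alive(monitor.backs_up):
            takeovers[monitor.backs_up] = monitor.name
    return takeovers
```

For example, `poll_once(MONITORS, lambda cf: cf != "CF0")` models a round in which CF0 has failed and selects CF state monitoring CPN1 (18) as its replacement.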

[0011] Suppose the CF monitoring program running on CF state monitoring CPN1 (18) detects a failure in CF1 (3). The CF monitoring program on CF state monitoring CPN1 (18) informs the CF monitoring program on CF state monitoring CPN0 (17) that monitoring of CF1 (3) is being suspended. The CF monitoring program on CF state monitoring CPN1 (18) then suspends its own monitoring of CF1 (3) and starts the CF migration program.

The CF migration program first invalidates all copy data managed by CF0 (3), as shown in FIG. 5. During this period processing devices 1 (7), 2 (8), and n (9) are not granted lock rights to the shared memory M3 (16), so they never continue processing with incorrect data. Next, CF state monitoring CPN1 (18) reads out the contents of memory M1 (5) over the link cable (19) that was used for CF monitoring and writes them to memory M2 (6), thereby moving all necessary data under the management of CF0 (3) onto coupling device (2). Even if the data cannot be read because memory M1 (5) itself has failed, the CF migration program compares the shared memory M3 (16) with the copy data on each CPN, creates data equivalent to the unreadable data in memory M1 (5), and stores it in memory M2 (6). In that case, however, the hot standby operation after the failure takes somewhat longer.

After that, the CF program is started on CF state monitoring CPN1 (18), which completes the migration from CF0 (3) to the replacement CF1 (18). Processing devices 1 (7), 2 (8), and n (9) are then given lock rights to the shared memory M3 (16), and processing on each CPN resumes. At this time, the connection between the replacement CF1 (18) and CPN0 (10), CPN1 (11), and CPNn-1 (12) shares the link cable (21) that connects CF1 (4) with CPN0 (10), CPN1 (11), and CPNn-1 (12). Parallel control among CPN0 (10), CPN1 (11), and CPNn-1 (12) is thus maintained, so operation can continue in the pre-failure state without the system going down.
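The migration sequence in paragraph [0011] can be sketched as a single function. Everything here is a hypothetical illustration: the function name `migrate_cf`, the dictionary layout of the memories, and in particular the fallback rule (marking a copy valid when it matches shared memory) are assumptions, since the patent does not specify the format of the control data or how the equivalent data is derived.

```python
def migrate_cf(failed_mem, target_mem, shared, local_copies, valid_flags, locks):
    """Sketch of the CF migration steps; all names are hypothetical.

    failed_mem   -- control data of the failed CF (memory M1), or None if unreadable
    target_mem   -- local memory of the surviving coupling device (memory M2)
    shared       -- shared memory M3 as {key: value}
    local_copies -- {cpn_id: {key: value}} copies held in each CPN's local memory
    valid_flags  -- {(cpn_id, key): bool} valid flags managed by the failed CF
    locks        -- dict whose "enabled" entry stands in for granting lock rights
    """
    # Step 1: withhold lock rights and invalidate every copy the failed CF managed,
    # so no CPN keeps working with possibly incorrect data.
    locks["enabled"] = False
    for entry in valid_flags:
        valid_flags[entry] = False

    # Step 2: move the control data to the surviving coupling device.
    if failed_mem is not None:
        # Normal path: read M1 over the monitoring link and write it to M2.
        target_mem.update(failed_mem)
    else:
        # Fallback path (assumed rule): rebuild equivalent data by comparing
        # each CPN's copy with shared memory. This is the slower case.
        for cpn_id, copies in local_copies.items():
            for key, value in copies.items():
                target_mem[(cpn_id, key)] = (shared.get(key) == value)

    # Step 3: the replacement CF is considered started; hand lock rights back
    # so that processing on each CPN can resume.
    locks["enabled"] = True
    return target_mem
```

The order matters in this sketch: lock rights are withheld before any data moves and are restored only after the target memory is populated, which mirrors the description that processing resumes only once the replacement CF is in place.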

[0012] FIG. 2 shows an embodiment different from that of FIG. 1. FIG. 2 differs from FIG. 1 in how the coupling devices are connected to each other. Normal operation is as described for FIG. 1. Connecting the coupling devices by TCMP coupling (20) means that, when a CF fails, the memory contents of the other side can be read directly, without the memory copy that is the most time-consuming part of the hot standby operation, so the CF function can be handed over smoothly.

[0013] FIG. 3 shows an embodiment different from those of FIGS. 1 and 2. Because all coupling devices, processing devices, and the shared memory are connected by a bus (23), the link cables that connected the CFs and CPNs are simply no longer needed; normal operation and the operation of the CF migration program are as described for FIG. 1.

[0014]

EFFECTS OF THE INVENTION

As described above, according to the present invention, in an inter-CPN parallel processing system to which many CPNs are connected, the number of CFs is decided according to the number of CPNs, and replacement CFs are assigned by sharing hardware resources. This makes it possible to construct and operate, at low cost, an inter-CPN parallel processing system that is not only efficient but also highly reliable.

[Brief description of the drawings]

FIG. 1 is a diagram showing an example of a system configuration to which the present invention is applied.

FIG. 2 is a diagram showing an example, other than FIG. 1, to which the present invention is applied.

FIG. 3 is a diagram showing an example, other than FIGS. 1 and 2, to which the present invention is applied.

FIG. 4 is a diagram illustrating the parallel control method of the CF.

FIG. 5 is a diagram illustrating the CF replacement method of the present invention.

[Explanation of symbols]

1 ... first coupling device
2 ... second coupling device
3 ... first coupling mechanism (CF0)
4 ... second coupling mechanism (CF1)
5 ... local memory of CF0 (M1)
6 ... local memory of CF1 (M2)
7 ... first processing device
8 ... second processing device
9 ... n-th processing device
10 ... first central processing node (CPN0)
11 ... second central processing node (CPN1)
12 ... n-th central processing node (CPNn-1)
13 ... local memory of CPN0 (M4)
14 ... local memory of CPN1 (M5)
15 ... local memory of CPNn-1 (M6)
16 ... shared memory (M3)
17 ... first CF state monitoring CPN0 (also serving as replacement CF0)
18 ... second CF state monitoring CPN1 (also serving as replacement CF1)
19 ... external link cable group connecting the CF state monitoring CPNs and the CFs
20 ... link cable group connecting CF0 with CPN0, CPN1, and CPNn
21 ... link cable group connecting CF1 with CPN0, CPN1, and CPNn
22 ... cable connecting the coupling devices as a tightly coupled multiprocessor
23 ... bus connecting the processing devices, coupling devices, and shared memory
24 ... valid flag table for copy data

Claims

[Claims]

1. A CF state monitoring type inter-CPN parallel processing system in which, as shown in the drawings, processing nodes (CPNs) on a plurality of processing devices are coupled to coupling devices, wherein the coupling devices are multiplexed to improve the parallel processing performance of the system, a CPN dedicated to monitoring the CF state (CF state monitoring CPN) is provided, the CPN constantly monitors the state of each CF, and when a failure is found in a CF, the CPN itself functions as a replacement CF, thereby preventing the entire system from going down and enabling a quick switching operation (hot standby).

2. The CF state monitoring type inter-CPN parallel processing system according to claim 1, and a CF state monitoring CPN, wherein the coupling devices, each having a CPN that shares hardware resources with a CF and monitors each CF, are connected to one another as a tightly coupled multiprocessor (TCMP) as shown in FIG. 2, so that when a CF failure is detected the contents of local memory other than the CPN's own can be read, enabling high-speed migration of the CF function.

3. A CF state monitoring type inter-CPN parallel processing system, and a CPN dedicated to CF state monitoring, in a system in which, as shown in FIG. 3, the CPN on each processing device and the CF on each coupling device are connected by a bus.

4. A CF state monitoring program that runs on the CF state monitoring CPN according to claim 1, monitors the CF state, and performs function migration processing when a CF failure is detected.

5. A CF transition program, wherein the CF state monitoring program according to claim 4 performs transition processing to a replacement CF when a CF failure is detected.