JP3077669B2 - Process Stop Method for Distributed Memory Multiprocessor System

Info

Publication number: JP3077669B2
Application number: JP10143780A
Authority: JP
Inventors: 敦久大谷
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1997-05-30
Filing date: 1998-05-26
Publication date: 2000-08-14
Anticipated expiration: 2018-05-26
Also published as: JPH1145229A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001] [Technical Field of the Invention] The present invention relates to a process stop method, and more particularly to a process stop method used in checkpoint processing executed in a distributed-memory multiprocessor system that includes a plurality of nodes connected to a network, each node holding a group of threads that perform parallel processing.

[0002] [Prior Art] Japanese Patent Application Laid-Open No. Hei 8-263317 discloses a checkpoint/restart processing system for controlling the order in which a plurality of processes involved in synchronization control (exclusive control) are stopped during checkpoint processing.

[0003] Japanese Patent Application Laid-Open No. Hei 2-287858 discloses a restart system for a distributed processing system. In this restart system, whenever the communication control unit in a processor exchanges data with other processors, the program that causes the communication control unit to perform that processing is saved as checkpoint data.

[0004] [Problems to Be Solved by the Invention] However, the technique described in Japanese Patent Application Laid-Open No. Hei 8-263317 applies to tightly coupled multiprocessor systems, not to distributed-memory multiprocessor systems. In a distributed-memory multiprocessor system, each processor holds its own (local) memory, which cannot be accessed by any process on the other processors. If the checkpoint/restart processing system of Japanese Patent Application Laid-Open No. Hei 8-263317 were applied to such a system, processes on different processors might be unable to carry out the synchronization needed for checkpoint processing. This is because, when a process on one processor (one party to the synchronization) is stopped, the processes on other processors, which do not share memory with it, cannot recognize that it is stopped. In that situation, the processes on the other processors keep waiting for a response from the stopped process, a response that will never be returned.

[0005] The technique described in Japanese Patent Application Laid-Open No. Hei 2-287858 has the problem that checkpoint data cannot be saved at an arbitrary time. In addition, the frequency with which checkpoint data is saved may degrade the performance of the parallel processing.

[0006] An object of the present invention is to provide a process stop method for a distributed-memory multiprocessor system that allows the stop processing needed to take a checkpoint to be performed efficiently on a parallel processing process that exchanges data between nodes, even when the system has a large number of nodes.

[0007] Another object of the present invention is to provide a process stop method for a distributed-memory multiprocessor system that realizes a checkpoint/restart function without degrading the performance of a parallel processing process that exchanges data between nodes during that process's normal processing.

[0008] Still another object of the present invention is to provide a process stop method for a distributed-memory multiprocessor system that allows a checkpoint to be taken at an arbitrary point in time for a parallel processing process that exchanges data between nodes.

[0009] [Means for Solving the Problems] A first process stop method for a distributed-memory multiprocessor system according to the present invention applies to a distributed-memory multiprocessor system in which a plurality of nodes are coupled by an interconnection network and each node has threads subject to parallel processing. Each node has a management process that manages the threads subject to parallel processing. When a checkpoint request command is input to any of the nodes, the management process of the node with the lowest node number issues a stop request to the threads in its own node and, after the threads in its own node have stopped, issues a stop request to the management process of the node having the next node number, and waits until the threads of all nodes subject to parallel processing have stopped. Each thread comprises: inter-node synchronization means for synchronizing data transmission and reception between nodes; checkpoint information that stores information indicating whether a stop request has been issued; transmission/reception means for transmitting data to and receiving data from other nodes; and stopping means for stopping the thread after notifying the management process of its own node that the thread has stopped. When the synchronization of a data transmission or reception between nodes times out, the inter-node synchronization means checks, based on the checkpoint information, whether a stop request has been issued; if one has been issued, it compares the node number of its own node with that of the partner node and, if its own node number is lower, stops its own thread by the stopping means.

[0010] A second process stop method for a distributed-memory multiprocessor system according to the present invention is the first process stop method, wherein the management process comprises: in-node stop request means for issuing a stop request to the threads in its own node on the basis of a received stop request; in-node stop waiting means for waiting until all threads in its own node have stopped; and inter-node stop request means for issuing a stop request to the management process of the node having the next node number when stop notifications have been received from all threads in its own node.

[0011] A third process stop method for a distributed-memory multiprocessor system according to the present invention is the second process stop method, wherein the management process comprises: inter-node stop waiting means for waiting until the threads of all nodes have stopped; and inter-node stop means for notifying each node, after the threads of all nodes have stopped, that the threads of all nodes have stopped.

[0012] A fourth process stop method for a distributed-memory multiprocessor system according to the present invention is the second process stop method, wherein the management process has a node number table listing the node numbers of the nodes on which threads of the parallel processing process exist, and the inter-node stop request means of the management process uses the node number table to issue the stop request to the management process of the node with the next lowest node number.

[0013] A fifth process stop method for a distributed-memory multiprocessor system according to the present invention is the third process stop method, wherein the management process has a node number table listing the node numbers of the nodes on which threads of the parallel processing process exist, and the inter-node synchronization means of the thread uses the node number table to compare the node number of its own node with that of the partner node.

[0014] A sixth process stop method for a distributed-memory multiprocessor system according to the present invention is the third process stop method, further comprising inter-node communication information in which transmission-complete information and reception-complete information are set during data transmission and reception between nodes. The inter-node synchronization means establishes synchronization when the transmission-complete and reception-complete information has been set in the inter-node communication information within a predetermined time. If the information is not set within the predetermined time, it checks, based on the checkpoint information, whether a stop request has been issued; if one has been issued, it compares the node number of its own node with that of the partner node and, if its own node number is lower, stops its own thread by the stopping means.

[0015]

[0016] [Embodiments of the Invention] Embodiments of the present invention will now be described in detail with reference to the drawings.

[0017] The hardware to which the present invention applies is a distributed-memory multiprocessor system in which a plurality of nodes are coupled by an interconnection network, where a node is either a pair consisting of one processor and one local memory, or a shared-memory multiprocessor in which a plurality of processors share one memory. It is also assumed that the system has an inter-node communication area that every node can reference, used for synchronization or exclusive control during data transmission and reception.

[0018] FIG. 1 shows the overall configuration of a distributed-memory multiprocessor system according to an embodiment of the present invention. In FIG. 1, the system comprises a master node 10, which is the node on which the master thread exists; a slave node 20, which is a node other than the master node and on which only slave threads exist; and an inter-node communication area 30.

[0019] The master node 10 includes a master thread 50a and a management process 60a that manages the threads in the master node.

[0020] The slave node 20 includes a slave thread 50b and a management process 60b that manages the threads in the slave node.

[0021] The master thread 50a of the master node 10 and the slave thread 50b of the slave node 20 each include: transmitting means 50-1 and receiving means 50-2 for transmitting and receiving data between nodes; inter-node synchronization means 50-3, used by the transmitting means 50-1 and the receiving means 50-2 during data transmission and reception to synchronize with the partner node; stopping means 50-4 for stopping the thread, after sending a stop notification to the management process 60a or 60b, when the thread is judged to be at a safe stopping position; and checkpoint information 50-5.

[0022] The management processes 60a and 60b in the master node 10 and the slave node 20 each include: in-node stop request means 60-1, which, when a stop request is issued to its own node, issues a stop request to the threads it manages by setting a checkpoint request flag for them; in-node stop waiting means 60-2, which waits until the threads that were sent the stop request have stopped; inter-node stop request means 60-3, which, after stop notifications have been received from the threads managed by the management process 60a or 60b, notifies the management process of the node with the next node number that a stop request has been issued; inter-node stop waiting means 60-4, which waits until the threads to be stopped in all nodes have stopped; inter-node stop means 60-5, which, after the threads to be stopped in all nodes have stopped, notifies the management process of each node that all nodes have stopped; and a node number table 60-6, which is created when the parallel processing process is started, based on the maximum number of nodes to be used as specified by the user, and which stores the node numbers of the nodes on which threads of the parallel processing process exist.

[0023] The inter-node communication area 30 contains inter-node communication information 30-1, which the inter-node synchronization means 50-3 uses to synchronize the two sides when data is transmitted and received between nodes by the transmitting means 50-1 and the receiving means 50-2.

[0024] A feature of this embodiment is that, when a stop request is issued while data is being exchanged between the master node 10 and the slave node 20, the inter-node synchronization means 50-3 used by the transmitting means 50-1 and the receiving means 50-2 can determine, within its own node, whether the partner node has already stopped in response to a checkpoint stop request.

[0025] Next, the operation of the embodiment configured as described above will be described with reference to FIGS. 1 and 2.

[0026] The configuration of FIG. 1 shows the case of a single slave node 20, but the case where a plurality of slave nodes actually exist is also assumed.

[0027] When the parallel processing process is started, the lowest-numbered node currently available is first made the master node 10, and the management process 60a is created on it. The management process 60a determines all nodes to be used as slave nodes, based on the maximum number of nodes to be used as specified by the user, and records the node numbers of those nodes and of the master node 10 in the node number table 60-6.
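The table construction described in this paragraph can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the patented implementation: the function name `build_node_table` and its parameters are hypothetical, and the rule for choosing slave nodes (the next-lowest available node numbers) is an assumption, since the text specifies only that the master is the lowest-numbered available node and that the node count is bounded by the user-specified maximum.

```python
def build_node_table(available_nodes, max_nodes):
    """Return (master_node, node_table) for a new parallel processing process.

    available_nodes: iterable of node numbers that can currently be used.
    max_nodes: maximum number of nodes to use, as specified by the user.
    """
    if max_nodes < 1:
        raise ValueError("max_nodes must be at least 1")
    ordered = sorted(set(available_nodes))
    if not ordered:
        raise ValueError("no node is available")
    # Assumption: slaves are simply the next-lowest available node numbers.
    node_table = ordered[:max_nodes]
    # The lowest-numbered available node becomes the master node.
    master_node = node_table[0]
    return master_node, node_table
```

Keeping the table sorted in ascending order means that "the node with the next node number" is always the next entry in the list.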

[0028] Next, the management process 60a in the master node 10 establishes a connection with the management process 60b in each slave node 20 listed in the node number table 60-6, and copies the node number table 60-6 to the node number table 60-6 of that management process 60b. The management processes 60a and 60b of the master node 10 and the slave node 20 then create the master thread 50a and the slave thread 50b of the parallel processing process, respectively, and the computation proceeds while data is exchanged between the nodes.

[0029] Thereafter, when a checkpoint is requested for this parallel processing process by a checkpoint request command entered by a user, checkpoint processing starts.

[0030] Referring to FIG. 4, when a user enters a checkpoint request command at node i, the command is passed to the inter-node stop request means 60-3 of node i in order to determine the master node 0 to which the command is to be sent. The inter-node stop request means 60-3 of node i refers to the node number table 60-6 and sends a signal indicating the checkpoint request command to the management process of master node 0. Node i is a special node that constantly monitors the processes in the parallel processing system. Node i may or may not be distinct from master node 0. If node i is the same as master node 0, the user enters the checkpoint request command at master node 0.

[0031] When a checkpoint stop request is delivered via the inter-node stop request means 60-3 of the management process 60a of the master node 10, the management process 60a issues a stop request to the master thread 50a by the in-node stop request means 60-1 and waits, by the in-node stop waiting means 60-2, until the master thread 50a stops.

[0032] On receiving the stop request from the in-node stop request means 60-1, the master thread 50a, if it is not in the middle of exchanging data with another node, sends a stop notification to the management process 60a by the stopping means 50-4 and stops at the point where it recognizes the stop request (for example, at the end of interrupt processing).
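The in-node part of the protocol (a management process sets a checkpoint request flag, and each thread notifies and stops when it notices the flag) can be sketched with Python threads. This is only an illustrative sketch under stated assumptions: the names `Worker` and `stop_local_threads` are hypothetical, `threading.Event` stands in for both the checkpoint request flag and the stop notification, and the worker is assumed not to be exchanging data with another node.

```python
import threading


class Worker(threading.Thread):
    """Stands in for a master or slave thread that is not exchanging data."""

    def __init__(self):
        super().__init__(daemon=True)
        # Plays the role of the checkpoint request flag (hypothetical name).
        self.checkpoint_requested = threading.Event()
        # Plays the role of the stop notification to the management process.
        self.stopped = threading.Event()

    def run(self):
        # Compute until the stop request is recognized at a safe point.
        while not self.checkpoint_requested.wait(0.001):
            pass  # one unit of ordinary computation would go here
        self.stopped.set()  # notify the manager, then stop by returning


def stop_local_threads(workers, timeout=1.0):
    """Request a stop from every local thread, then wait for all of them."""
    for worker in workers:
        worker.checkpoint_requested.set()
    # Returns True only if every thread reported that it stopped in time.
    return all(worker.stopped.wait(timeout) for worker in workers)
```

A manager would call `stop_local_threads` on its own threads first and forward the stop request to the next node only after it returns `True`.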

[0033] After this, the management process 60a of the master node 10, having received the stop request, uses the inter-node stop request means 60-3 to find from the node number table 60-6 that the node with the next lowest node number after the master node 10 is the slave node 20, and issues a stop request to the management process 60b of the slave node 20. The management process 60a of the master node 10 then waits, by the inter-node stop waiting means 60-4, until the threads of all nodes in the parallel processing process being checkpointed have stopped.

[0034] On receiving the stop request from the master node 10, the management process 60b of the slave node 20 uses the in-node stop request means 60-1 to issue a stop request to the slave thread 50b in the slave node 20 and waits, by the in-node stop waiting means 60-2, until the slave thread 50b stops.

[0035] At this time, if the slave thread 50b is not in the middle of exchanging data with another node, then, like the master thread 50a, it sends a stop notification to the management process 60b by the stopping means 50-4 and stops at the point where it recognizes the stop request.

[0036] Next, the case where the slave thread 50b is in the middle of transmitting or receiving data will be described.

[0037] FIG. 2 shows an example of operation in a case where the master thread 50a and the slave thread 50b are exchanging data: the master thread 50a has already received a stop request and stopped before starting reception by the receiving means 50-2, while the slave thread 50b started transmission by the transmitting means 50-1 before receiving a stop request but cannot complete the transmission because the master thread 50a has already stopped.

[0038] To synchronize the nodes, a thread that is transmitting or receiving data uses the inter-node synchronization means 50-3 to loop for a preset fixed time, repeatedly referring to a flag in the inter-node communication information 30-1 to check whether the thread on the partner node has started its data transmission or reception by the transmitting means 50-1 or the receiving means 50-2.
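The bounded polling loop described here can be sketched as follows. This is a minimal sketch, assuming the flag in the inter-node communication information can be read through a callable; the name `wait_for_partner` and all of its parameters are hypothetical.

```python
import time


def wait_for_partner(flag_is_set, timeout_s, poll_interval_s=0.001):
    """Poll a partner-side flag until it is set or a fixed time elapses.

    flag_is_set: zero-argument callable that reads the partner's flag.
    Returns True if synchronization was confirmed, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if flag_is_set():
            return True          # the partner has started its side
        if time.monotonic() >= deadline:
            return False         # timeout: the caller decides what to do next
        time.sleep(poll_interval_s)
```

Returning a plain boolean keeps the timeout decision (stop or keep waiting) outside the loop, so ordinary transfers pay only for the flag check.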

[0039] If synchronization between the nodes is confirmed during this loop, then once the data transmission or reception by the transmitting means 50-1 or the receiving means 50-2 has finished, the thread sends a stop notification to the management process 60b of the slave node 20 by the stopping means 50-4 and stops.

[0040] However, synchronization between the nodes may fail to be established during this fixed-time loop, resulting in a timeout. In that case, if, as in FIG. 2, the node number of the partner node (master node 10) is lower than that of the own node (slave node 20), the thread on the partner node may already have stopped in response to a stop request. The inter-node synchronization means 50-3 therefore checks whether a stop request has been issued and, if one has, the thread sends a stop notification to the management process 60b of the slave node 20 by the stopping means 50-4 and stops within the data transmission/reception processing.

[0041] If the own node's node number is lower than the partner node's, the thread on the partner node cannot have stopped, so the thread does not stop within the data transmission/reception processing and instead continues waiting until it is synchronized with the partner thread. Then, when the data transmission or reception by the transmitting means 50-1 or the receiving means 50-2 has finished, the thread sends a stop notification to the management process 60b of the slave node 20 by the stopping means 50-4 and stops.
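The timeout decision as worded in paragraphs [0040] and [0041] can be written as a small pure function. This is a sketch under stated assumptions: the names `Action` and `on_sync_timeout` are hypothetical, and the behavior when no stop request has been issued (keep waiting) is an assumption, since the text does not spell out that branch.

```python
from enum import Enum


class Action(Enum):
    STOP_NOW = "notify the management process and stop inside the transfer"
    KEEP_WAITING = "continue waiting for the partner thread"


def on_sync_timeout(own_node, partner_node, stop_requested):
    """Decide what a thread does when inter-node synchronization times out."""
    if not stop_requested:
        # Assumption: with no checkpoint pending, the partner is merely slow.
        return Action.KEEP_WAITING
    if partner_node < own_node:
        # Lower-numbered nodes are stopped first, so the partner may
        # already be stopped and will never answer.
        return Action.STOP_NOW
    # The partner has a higher number and cannot have stopped yet.
    return Action.KEEP_WAITING
```

For the situation of FIG. 2, `on_sync_timeout(20, 10, True)` yields `Action.STOP_NOW`, while the reverse pairing keeps waiting.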

[0042] The management process 60b, having received a stop notification from the slave thread 50b through any of the above processes, refers to the node number table 60-6. If there is a next slave thread, it issues a further stop request, by the inter-node stop request means 60-3, to the management process of the node with the next lowest node number, and waits, by the inter-node stop waiting means 60-4, until the threads of all nodes in the process being checkpointed have stopped.

[0043] If there is no next slave thread, the management process 60b of the slave node 20 notifies the management process 60a of the master node 10, by the inter-node stop notification means 60-5, that the threads of all nodes have stopped. The master node 10 then notifies all nodes in turn, by the inter-node stop notification means 60-5, that the threads of all nodes have stopped.
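The overall ordering of the stop chain can be illustrated with a small single-process simulation. This is a sketch only: `run_stop_chain` and the event names are hypothetical, real nodes would exchange messages rather than append to a list, and the sketch assumes no thread is blocked in a data transfer.

```python
def run_stop_chain(node_table, threads_by_node):
    """Return the ordered events of one checkpoint stop across all nodes."""
    order = sorted(node_table)           # ascending node number
    if not order:
        raise ValueError("node table is empty")
    master = order[0]                    # lowest number is the master node
    events = []
    for index, node in enumerate(order):
        threads = threads_by_node.get(node, [])
        for thread in threads:           # in-node stop requests
            events.append(("stop_request", node, thread))
        for thread in threads:           # stop notifications from threads
            events.append(("stop_notice", node, thread))
        if index + 1 < len(order):       # forward to the next node number
            events.append(("forward_request", node, order[index + 1]))
    last = order[-1]
    if last != master:                   # last node reports to the master
        events.append(("all_stopped_report", last, master))
    for node in order:                   # master notifies every node
        events.append(("all_stopped_notice", master, node))
    return events
```

Because each node forwards the request only after its own threads have stopped, a node that has received a stop request can infer that every lower-numbered node has already stopped.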

[0044] As a result, the management processes 60a and 60b of the master node 10 and the slave node 20 can proceed to creating the restart file.

[0045] As for restart processing, the processing described above guarantees that no loss of synchronization between nodes has arisen when stopping for the checkpoint. Restart can therefore be realized easily from a checkpoint/restart processing method that does not target distributed-memory multiprocessor systems, such as that of Japanese Patent Application Laid-Open No. Hei 8-263317. Specifically, the management process 60a of the master node 10 establishes a connection with the management process 60b of the slave node 20 and notifies that management process that a restart request has been issued. The management process in each node then applies the restart processing method described in Japanese Patent Application Laid-Open No. Hei 8-263317 to its own node, and the management process 60b of the slave node 20 notifies the management process 60a of the master node 10 when the restart of the node it manages has finished.

[0046] FIG. 3 is a flowchart explaining the processing of the inter-node synchronization means 50-3, which is used in data transmission and reception by the transmitting means 50-1 and the receiving means 50-2.

[0047] In data transmission and reception during normal processing of the parallel processing process, when no checkpoint request has been issued, the following processing is performed in accordance with FIG. 3.

[0048] On the transmitting node, the inter-node synchronization means 50-3, invoked from the transmitting means 50-1, sets a flag in the inter-node communication information 30-1 indicating that the transmission data is ready, at the same time refers to the inter-node communication information 30-1 (step 301), and checks whether the reception-complete flag has been set by the receiving node (step 302).

[0049] If the reception-complete flag has not been set, synchronization with the receiving node has not yet been established, so it is determined whether the fixed time has elapsed and a timeout has occurred (step 303). If no timeout has occurred, the inter-node communication information 30-1 is referred to again, and the node waits by repeating steps 301 to 303 until the nodes are synchronized.

[0050] On the receiving node, the inter-node synchronization means 50-3, invoked from the receiving means 50-2, refers to the inter-node communication information 30-1 (step 301) and checks whether the flag indicating that the transmission data is ready has been set (step 302). If the flag has not been set, then, as on the transmitting node, synchronization with the transmitting side has not yet been established, so it is determined whether the fixed time has elapsed and a timeout has occurred (step 303). If no timeout has occurred, the inter-node communication information 30-1 is referred to again, and the node waits by repeating steps 301 to 303 until synchronization is established.
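The two-flag handshake of steps 301 and 302 can be sketched as follows. This is a hedged illustration: the class `InterNodeCommInfo`, its field names, and the functions `sender_poll` and `receiver_poll` are hypothetical, and the point at which the receiver sets the reception-complete flag (as soon as it sees the data-ready flag) is an assumption, since the text does not state when that flag is set.

```python
from dataclasses import dataclass


@dataclass
class InterNodeCommInfo:
    """Stand-in for the flags kept in the inter-node communication area."""
    data_ready: bool = False      # set by the transmitting node
    receive_done: bool = False    # set by the receiving node


def sender_poll(info):
    """One pass of steps 301-302 on the transmitting node."""
    info.data_ready = True        # announce that the transmission data is ready
    return info.receive_done      # synchronized once the receiver has finished


def receiver_poll(info):
    """One pass of steps 301-302 on the receiving node."""
    if not info.data_ready:
        return False              # sender not ready yet; keep looping
    info.receive_done = True      # assumption: mark completion on seeing data
    return True
```

Each function returns `True` once its side sees synchronization; a caller would repeat the poll until it returns `True` or the fixed time of step 303 elapses.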

[0051] Next, if a checkpoint stop request is issued while either the transmitting or the receiving node is repeating steps 301 to 303 above while waiting for synchronization, the following processing is performed. For example, as shown in FIG. 2, if the master thread 50a on the receiving side is stopped by a checkpoint before it starts receive processing with the receiving means 50-2, the node of the slave thread 50b on the transmitting side times out while repeating steps 301 to 303.

[0052] After that, it is checked whether a checkpoint request flag has been set in the checkpoint information 50-5 by the intra-node stop request means 60-1 of the management process 60b of the slave node 20 (step 304). If the checkpoint request flag is set, the node number table 60-6 is consulted to compare the node number of the receiving node with that of the transmitting node (step 305). If the transmitting node's node number is the lower of the two, it can be concluded that a stop by the checkpoint has already taken place, so the stop means 50-4 notifies the management process 60b of the slave node 20 and the thread stops (step 306).

[0053] In the above embodiment, the node on which the master thread resides (master node 10) is given the lowest node number, and stop requests are issued to the nodes in ascending order of node number. A thread on the node to which a stop request is currently issued can therefore tell which nodes' threads have already stopped and which nodes have not. As a result, the thread can decide within its own node whether it is at an appropriate position to stop.
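Because stop requests travel in ascending node-number order, a thread that has seen its own stop request can classify every other node from the numbers alone. The following small Python sketch illustrates that inference; `classify_nodes` and its arguments are hypothetical, and it assumes the own node has already received its stop request.

```python
def classify_nodes(own_node, node_numbers):
    """Split the other nodes into (already_stopped, not_yet_stopped).

    Assumes stop requests are issued in ascending node-number order and
    that own_node has already received its request, so every node with a
    lower number must have finished stopping.
    """
    stopped = sorted(n for n in node_numbers if n < own_node)
    running = sorted(n for n in node_numbers if n > own_node)
    return stopped, running
```

For example, `classify_nodes(2, [0, 1, 2, 3])` returns `([0, 1], [3])`. No message to any other node is needed to reach this conclusion.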

[Example] Next, a specific example of the present invention will be described in detail.

[0054] In this example, it is assumed that node number 0 is assigned to the master node 10 and node number 1 to the slave node 20.

[0055] FIG. 2 shows an operation example in which the master thread 50a of the master node 10 and the slave thread 50b of the slave node 20 are about to exchange data. The master thread 50a received a stop request and stopped before starting receive processing with the receiving means 50-2. The slave thread 50b started transmit processing with the transmitting means 50-1 before receiving a stop request, but cannot finish it because the master thread 50a has already stopped.

[0056] In this case, to synchronize the nodes, the slave thread 50b uses the inter-node synchronization means 50-3 to loop for a preset fixed time, repeatedly referring to the flag in the inter-node communication information 30-1 to find out whether the master thread 50a has started receiving the data with the receiving means 50-2.

[0057] However, the master thread 50a has already been stopped by the checkpoint stop request and will never start receive processing. Consequently, as shown in FIG. 3, a timeout occurs while steps 301 to 303 are being repeated, and processing proceeds to step 304, where the checkpoint information 50-5 in the slave thread 50b is consulted to confirm that the checkpoint request flag is set.

[0058] Next, in step 305, the receiving node, master node 10 (hosting the master thread 50a), has node number 0 and the transmitting node, slave node 20 (hosting the slave thread 50b), has node number 1, so the receiving node has the lower node number. The slave thread 50b therefore concludes that the master node has already stopped and, in step 309, uses the stop means 50-4 to send a stop notification to the management process 60b of the slave thread 50b and then stops.
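The decision the slave thread makes after the timeout (steps 304 and 305) can be sketched in Python as follows. The names `Action` and `on_sync_timeout` are hypothetical. The rule that the thread stops when its partner node has the lower number is taken from this worked example, and resuming the wait when no stop request is pending is an assumption, since the text does not spell out that branch.

```python
from enum import Enum


class Action(Enum):
    KEEP_WAITING = "keep waiting for the partner"
    STOP = "notify the management process and stop"


def on_sync_timeout(checkpoint_requested, own_node, partner_node):
    """Decide what a thread does once the synchronization wait times out."""
    if not checkpoint_requested:       # step 304: no stop request pending (assumed: resume waiting)
        return Action.KEEP_WAITING
    if partner_node < own_node:        # step 305: partner was asked to stop earlier
        return Action.STOP             # partner is taken to have stopped already
    return Action.KEEP_WAITING         # partner has not been stopped yet and may still answer
```

With the slave node numbered 1 and the master node numbered 0, `on_sync_timeout(True, own_node=1, partner_node=0)` returns `Action.STOP`, which matches the outcome described for the slave thread 50b.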

[0059] The present invention has been described above with reference to a preferred embodiment and example, but it is not necessarily limited to their contents. The embodiment describes a parallel processing process with two nodes and a single slave thread, but the invention is also applicable when there are multiple slave nodes, and the number of nodes is not limited.

[0060] Likewise, the invention is applicable when both the master node and the slave nodes have multiple threads, and the number of threads in each node is not limited. However, one management process per node suffices even if the number of threads increases. In this case, after receiving stop notifications issued with the stop means 50-4 from all the threads in its node, the management process uses the inter-node stop request means 60-3 to notify the management process of the next node that the stop notifications have been issued.
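With several threads per node, the single management process has to collect a stop notification from every thread before passing the request on. The following is a minimal single-process Python simulation of that chain. `ManagementProcess`, `run_stop_chain` and the in-memory request log are hypothetical stand-ins for the means 60-1 to 60-3, not the patent's implementation.

```python
class ManagementProcess:
    """Hypothetical stand-in for one node's management process."""

    def __init__(self, node_number, thread_count):
        self.node_number = node_number
        self.pending = set(range(thread_count))  # threads that have not yet stopped

    def notify_stopped(self, thread_id):
        """Record a stop notification; return True once all threads have stopped."""
        self.pending.discard(thread_id)
        return not self.pending


def run_stop_chain(threads_per_node):
    """Stop nodes in ascending node-number order and log inter-node requests."""
    log = []
    nodes = sorted(threads_per_node)
    for i, node in enumerate(nodes):
        mp = ManagementProcess(node, threads_per_node[node])
        all_stopped = not mp.pending                    # true at once if there are no threads
        for thread_id in range(threads_per_node[node]):
            all_stopped = mp.notify_stopped(thread_id)  # each thread reports in
        if all_stopped and i + 1 < len(nodes):
            log.append((node, nodes[i + 1]))            # forward to the next node
    return log
```

For example, `run_stop_chain({0: 2, 1: 3, 2: 1})` returns `[(0, 1), (1, 2)]`: three nodes need two inter-node requests, however many threads each node runs.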

[0061] Furthermore, the invention is applicable to data transmission and reception in progress between any pair of nodes.

【００６２】[0062]

[Effects of the Invention] As described above, the process stop method in a distributed memory type multiprocessor system according to the present invention provides the following effects.

[0063] First, because stop processing is performed in order starting from the lowest-numbered node, the inter-node synchronization means in the data transmission/reception processing can tell whether the partner node has already stopped without querying the partner node's state by communication or other means, so each node can decide within itself whether it can stop. Therefore, even when a stop request for a checkpoint is issued while data is being transmitted and received within a parallel processing process that has many threads on a distributed memory type multiprocessor system with many nodes, the process can be stopped with an effort at most proportional to the number of nodes, so stop processing is efficient.

[0064] Second, checkpoint processing, including stop processing, is performed only when a checkpoint request is issued. The stop processing for checkpoints therefore does not affect the normal processing of a parallel processing process running on the distributed memory type multiprocessor system, and performance is not degraded.

[0065] Third, because the checkpoint request is issued by an external command, a checkpoint can be taken at any time for a parallel processing process running on the distributed memory type multiprocessor system.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the overall configuration of a distributed memory type multiprocessor system according to an embodiment of the present invention.

FIG. 2 is a diagram illustrating a specific operation during data transmission and reception between nodes according to the embodiment of the present invention.

FIG. 3 is a flowchart illustrating the specific processing performed by the inter-node synchronization means according to the embodiment of the present invention.

FIG. 4 is a block diagram illustrating the relationship between a special node and the master node according to the embodiment of the present invention.

[Explanation of symbols]

10 Master node
20 Slave node
30 Inter-node communication area
30-1 Inter-node communication information
50a Master thread
50b Slave thread
50-1 Transmitting means
50-2 Receiving means
50-3 Inter-node synchronization means
50-4 Stop means
50-5 Checkpoint information
60a, 60b Management process
60-1 Intra-node stop request means
60-2 Intra-node stop waiting means
60-3 Inter-node stop request means
60-4 Inter-node stop waiting means
60-5 Inter-node stop notification means
60-6 Node number table

Continued from the front page: (58) Fields searched (Int. Cl.⁷, DB name): G06F 15/177 681; G06F 9/46 360

Claims

(57) [Claims]

1. A process stop method in a distributed memory type multiprocessor system in which a plurality of nodes are connected by an interconnection network and each node has threads that perform parallel processing, wherein: each node has a management process that manages the threads performing parallel processing; when a checkpoint request command is input to any of the nodes, the management process of the node with the smallest node number issues a stop request to the threads in its own node and, after those threads have stopped, issues a stop request to the management process of the node having the next node number, and waits until the threads of all nodes performing parallel processing have stopped; the thread comprises inter-node synchronization means for synchronizing the transmission and reception of data between nodes, checkpoint information storing information indicating whether a stop request has been issued, transmission/reception means for transmitting and receiving data to and from another node, and stop means for stopping the own thread after notifying the management process of the own node that the own thread will stop; and the inter-node synchronization means, when the synchronization processing for data transmission and reception between nodes times out, checks whether a stop request has been issued based on the checkpoint information and, if a stop request has been issued, compares the node number of the own node with the node number of the partner node, and if the node number is smaller, the thread is stopped.

2. The process stop method in a distributed memory type multiprocessor system according to claim 1, wherein the management process comprises: intra-node stop request means for issuing a stop request to the threads in the own node in response to a stop request; intra-node stop waiting means for waiting until all the threads in the own node have stopped; and inter-node stop request means for issuing, when stop notifications have been received from all the threads in the own node, a stop request to the management process of the node having the next node number.

3. The process stop method in a distributed memory type multiprocessor system according to claim 2, wherein the management process further comprises: inter-node stop waiting means for waiting until the threads of all nodes have stopped; and inter-node stop notification means for notifying each node, after the threads of all nodes have stopped, that the threads of all nodes have stopped.

4. The process stop method in a distributed memory type multiprocessor system according to claim 2, wherein the management process has a node number table listing the node numbers of the nodes on which threads of the parallel processing process exist, and the inter-node stop request means of the management process uses the node number table to issue a stop request to the management process of the node having the next lowest node number.

5. The process stop method in a distributed memory type multiprocessor system according to claim 3, wherein the management process has a node number table listing the node numbers of the nodes on which threads of the parallel processing process exist, and the inter-node synchronization means of the thread uses the node number table to compare the node number of the own node with the node number of the partner node.

6. The process stop method in a distributed memory type multiprocessor system according to claim 3, comprising inter-node communication information in which transmission end information and reception end information are set during the transmission and reception of data between nodes, wherein the inter-node synchronization means establishes synchronization when the transmission end and reception end information are set in the inter-node communication information within a predetermined time and, when they are not set within the predetermined time, checks whether a stop request has been issued based on the checkpoint information, compares the node number of the own node with the node number of the partner node, and, if the node number of the own node is smaller, the own thread is stopped.