JPH04147347A

JPH04147347A - Processor failure recovery control method

Info

Publication number: JPH04147347A
Application number: JP2271807A
Authority: JP
Inventors: Takahiro Hamada; 浜田　隆宏
Original assignee: NEC Communication Systems Ltd
Current assignee: NEC Communication Systems Ltd
Priority date: 1990-10-09
Filing date: 1990-10-09
Publication date: 1992-05-20

Abstract

PURPOSE: To execute failure recovery processing efficiently and appropriately when a processor failure occurs in a slave processor, by controlling the recovery processing using the value of a parallel execution counter. CONSTITUTION: When a failure occurs in a slave processor 2, the master processor 1 compares a parallel execution counter 11 with a parallel execution allowable number 16, which is obtained by looking up a parallel execution control table 13 using CPU activity ratio information 12 calculated by a CPU activity ratio function 31. If the value of the parallel execution counter 11 is within the parallel execution allowable number 16, the master processor 1 starts failure recovery processing 7 for the slave processor 2 through a processor bus 5. If the value of the parallel execution counter 11 exceeds the parallel execution allowable number 16, the master processor 1 defers the start of the failure recovery processing 7. After the recovery processing starts, the master processor 1 sends each instruction signal to the slave processor 2 after waiting for the inter-instruction-signal sending timing 21.

Description

[Detailed Description of the Invention]

[Field of Industrial Application]

The present invention relates to a processor failure recovery control method for a multiprocessor system, such as an electronic exchange or a data communication system, in which a master processor and a plurality of slave processors are connected by a processor bus, and in which service must not be interrupted once the system has entered operation.

[Conventional technology]

Conventionally, processor failure recovery control methods in this type of multiprocessor system have unconditionally executed a fixed number of processor failure recovery processes in parallel for multiple failed slave processors, and have always inserted a fixed timing delay between the successive instruction signals sent from the master processor to a slave processor during processor failure recovery processing.

[Problems to be Solved by the Invention]

As described above, the conventional processor failure recovery control method in a multiprocessor system unconditionally executes a fixed number of processor failure recovery processes in parallel, and always inserts a fixed timing delay between the successive instruction signals sent from the master processor to a slave processor during processor failure recovery processing. Consequently, if a processor failure occurs in a slave processor while the system is already under high load, starting the processor failure recovery processing aggravates the high-load state further. Conversely, when the system is under low load, the processor failure recovery time cannot be shortened below a fixed duration even though spare processing capacity is available. The conventional method therefore cannot perform processor failure recovery processing efficiently.

An object of the present invention is to provide a processor failure recovery control method that can control processor failure recovery processing efficiently and appropriately, without adversely affecting system operation, regardless of whether the system load is high or low when a processor failure occurs in a slave processor.

[Means to solve the problem]

The processor failure recovery control method of the present invention applies to a multiprocessor system in which a master processor equipped with a main storage device and a plurality of slave processors are connected by a processor bus. The main storage device holds: a parallel execution counter that indicates the number of processor failure recovery processes currently executing in parallel; a CPU usage rate function that periodically calculates the CPU usage rate of the master processor; CPU usage rate information that stores the CPU usage rate calculated by the CPU usage rate function; a parallel execution control table that stores the permissible number of parallel executions of processor failure recovery processing, indexed by the CPU usage rate information; a time measurement function that measures the time from the transmission of an instruction signal from the master processor to a slave processor until the reception of the response signal from that slave processor at the master processor; response time information that stores the time measured by the time measurement function; a transmission timing control table that stores the inter-instruction-signal transmission timing from the master processor to the slave processor, indexed by the response time information; and a timing function that takes an arbitrary timing delay.

When a processor failure occurs in a slave processor, processor failure recovery processing for that slave processor is started if the parallel execution counter is within the permissible number of parallel executions found in the parallel execution control table by indexing it with the CPU usage rate information. If the permissible number of parallel executions is exceeded, the start of the processor failure recovery processing is deferred. After the processor failure recovery processing has started, each instruction signal from the master processor to the slave processor is sent only after the timing function has taken a timing delay equal to the inter-instruction-signal transmission timing found in the transmission timing control table by indexing it with the response time information.

[Example]

Next, the present invention will be described with reference to the drawings.

FIG. 1 is a diagram showing an example of the configuration of a multiprocessor system according to the present invention. A master processor 1 and slave processors 2, 3, and 4 are connected by a processor bus 5, and the master processor 1 is equipped with a main memory 6. When a processor failure occurs in slave processor 2, 3, or 4, the master processor 1 controls the failure recovery of slave processors 2, 3, and 4 through the processor bus 5 by means of processor failure recovery processing 7, 8, and 9.

FIG. 2 is a diagram showing an example of the control information and processing functions held in the main memory shown in FIG. 1. In FIG. 2, the main memory 6 holds: a parallel execution counter 11 that indicates the number of processor failure recovery processes currently executing in parallel; a CPU usage rate function 31 that periodically calculates the CPU usage rate of the master processor 1; CPU usage rate information 12 that stores the calculated CPU usage rate; a parallel execution control table 13 that stores the permissible numbers of parallel executions 14, 15, and 16 of processor failure recovery processing, indexed by the CPU usage rate information 12; a time measurement function 32 that measures the time from the transmission of an instruction signal from the master processor 1 to slave processor 2, 3, or 4 until the reception of the response signal from that slave processor at the master processor 1; response time information 17 that stores the measured time; a transmission timing control table 18 that stores the inter-instruction-signal transmission timings 19, 20, and 21 from the master processor 1 to slave processors 2, 3, and 4, indexed by the response time information 17; and a timing function 33 that takes an arbitrary timing delay.

Next, as a concrete example, the operation when a failure occurs in slave processor 2 is described. When a processor failure occurs in slave processor 2, the master processor 1 compares the parallel execution counter 11 with the permissible number of parallel executions 16, which it obtains by indexing the parallel execution control table 13 with the CPU usage rate information 12 calculated and stored by the CPU usage rate function 31. If the parallel execution counter 11 is within the permissible number of parallel executions 16, the master processor 1 starts processor failure recovery processing 7 for slave processor 2 through the processor bus 5. If the parallel execution counter 11 exceeds the permissible number of parallel executions 16, the master processor 1 defers the start of processor failure recovery processing 7.
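The admission decision described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the table boundaries and permitted counts, the class and function names, the waiting queue, and the reading of "within the permissible number" as "starting one more recovery must not exceed the limit" are all assumptions, since the source gives no concrete values for the parallel execution control table 13.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical parallel execution control table (13):
# (upper bound of CPU usage in percent, permissible parallel recoveries).
# The source states no concrete values; these are illustrative only.
PARALLEL_CONTROL_TABLE = [(30, 3), (70, 2), (100, 1)]


def allowed_parallel(cpu_usage_percent: float) -> int:
    """Index the control table by the stored CPU usage rate."""
    for upper_bound, allowed in PARALLEL_CONTROL_TABLE:
        if cpu_usage_percent <= upper_bound:
            return allowed
    return PARALLEL_CONTROL_TABLE[-1][1]


@dataclass
class RecoveryScheduler:
    cpu_usage_percent: float = 0.0   # CPU usage rate information (12)
    parallel_counter: int = 0        # parallel execution counter (11)
    waiting: deque = field(default_factory=deque)  # deferred slaves (assumed)

    def on_slave_failure(self, slave_id: int) -> bool:
        """Start recovery now (True) or defer it (False)."""
        if self.parallel_counter < allowed_parallel(self.cpu_usage_percent):
            self.parallel_counter += 1
            return True
        self.waiting.append(slave_id)
        return False

    def on_recovery_done(self) -> Optional[int]:
        """Release one slot; return a deferred slave to start, if any."""
        self.parallel_counter -= 1
        if self.waiting and self.parallel_counter < allowed_parallel(
            self.cpu_usage_percent
        ):
            self.parallel_counter += 1
            return self.waiting.popleft()
        return None
```

With the illustrative table, a master at 80 percent CPU usage admits one recovery and defers the next until the first completes, while a master at 10 percent admits up to three at once.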

After processor failure recovery processing 7 has started, each time the master processor 1 is to send an instruction signal to slave processor 2 during that processing, it indexes the transmission timing control table 18 with the response time information 17 measured and stored by the time measurement function 32 to obtain the inter-instruction-signal transmission timing 21, and sends the signal only after the timing function 33 has taken that timing delay. This procedure is repeated for the whole series of instruction signal transmissions and response signal receptions in processor failure recovery processing 7.
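The repeated send, measure, and delay cycle might look like the sketch below. The function name `run_recovery`, its parameters, and the choice to treat the response time as 0 ms before the first measurement are assumptions made for illustration; the source does not say how the first signal is timed. The delay lookup, sleep, and clock are passed in as callables so the loop can be exercised without real hardware.

```python
import time
from typing import Callable, Iterable, List


def run_recovery(
    signals: Iterable[str],
    send_and_wait: Callable[[str], None],
    lookup_delay_ms: Callable[[float], float],
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[float]:
    """Send each instruction signal after a delay chosen from the last
    measured response time. Returns the delay (ms) applied before each signal.
    """
    applied = []
    response_ms = 0.0  # assumed: no measurement exists before the first signal
    for signal in signals:
        # Index the timing table (18) by the stored response time (17).
        delay_ms = lookup_delay_ms(response_ms)
        if delay_ms > 0:
            sleep(delay_ms / 1000.0)  # timing function (33)
        # Time measurement function (32): send until response received.
        start = clock()
        send_and_wait(signal)
        response_ms = (clock() - start) * 1000.0
        applied.append(delay_ms)
    return applied
```

Because the delay is re-evaluated before every signal, a slave that responds slowly is paced more gently on the next signal, and one that responds quickly is driven without added delay.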

Next, an example of how the values of the inter-instruction-signal transmission timing 21 are determined is described with reference to FIG. 3. FIG. 3 shows an example of the transmission timing control table 18 and concrete values of the inter-instruction-signal transmission timing 21 stored in it, for the case where the response time information 17 is classified in 50 ms bands. If the response time is 50 ms or less, the next instruction signal is sent with no timing delay. If it is from 51 ms to 100 ms, the next instruction signal is sent after a timing delay of 25 ms. If it is from 101 ms to 150 ms, the next instruction signal is sent after a timing delay of 50 ms. Likewise for other values, the next instruction signal is sent after the inter-instruction-signal transmission timing that corresponds to the value of the response time information.
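The banded lookup of FIG. 3 can be expressed as a small table search. Only the three bands below are stated in the text; how the table continues beyond 150 ms is not given, so this sketch simply reuses the last stated delay for longer response times, which is an assumption. The names `BAND_UPPER_MS`, `DELAY_MS`, and `inter_signal_delay_ms` are likewise illustrative.

```python
import bisect

# Upper bounds of the 50 ms response-time bands stated in the text,
# and the delay applied before the next instruction signal for each band.
BAND_UPPER_MS = [50, 100, 150]
DELAY_MS = [0, 25, 50]


def inter_signal_delay_ms(response_time_ms: float) -> int:
    """Return the delay for the band containing response_time_ms.

    Response times above the last stated band fall back to the last
    stated delay (an assumption; the source does not list those rows).
    """
    band = bisect.bisect_left(BAND_UPPER_MS, response_time_ms)
    return DELAY_MS[min(band, len(DELAY_MS) - 1)]
```

Using `bisect_left` makes each band inclusive of its upper bound, matching the "50 ms or less" and "51 ms to 100 ms" wording.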

[Effect of the invention]

As described above, when a processor failure occurs in a slave processor, the present invention starts processor failure recovery processing for that slave processor if the parallel execution counter is within the permissible number of parallel executions found in the parallel execution control table by indexing it with the CPU usage rate information, and defers the start of the processor failure recovery processing if the permissible number of parallel executions is exceeded. After the processor failure recovery processing has started, each instruction signal from the master processor to the slave processor is sent only after the timing function has taken a timing delay equal to the inter-instruction-signal transmission timing found in the transmission timing control table by indexing it with the response time information. As a result, the processor failure recovery processing can be controlled efficiently and appropriately, without adversely affecting system operation, regardless of whether the system load is high or low when a processor failure occurs in a slave processor.

[Brief explanation of drawings]

FIG. 1 is a diagram showing an example of the configuration of a multiprocessor system according to the present invention. FIG. 2 is a diagram showing an example of the control information and processing functions held in the main memory shown in FIG. 1. FIG. 3 is a diagram showing an example of the transmission timing control table and the inter-instruction-signal transmission timings stored in it.

1: master processor; 2, 3, 4: slave processors; 5: processor bus; 6: main memory; 11: parallel execution counter; 12: CPU usage rate information; 13: parallel execution control table; 14, 15, 16: permissible numbers of parallel executions; 17: response time information; 18: transmission timing control table; 19, 20, 21: inter-instruction-signal transmission timings; 31: CPU usage rate function; 32: time measurement function; 33: timing function.

Claims

[Scope of Claims]

A processor failure recovery control method in a multiprocessor system in which a master processor equipped with a main storage device and a plurality of slave processors are connected by a processor bus, characterized in that the main storage device holds: a parallel execution counter that indicates the number of processor failure recovery processes currently executing in parallel; a CPU usage rate function that periodically calculates the CPU usage rate of the master processor; CPU usage rate information that stores the CPU usage rate calculated by the CPU usage rate function; a parallel execution control table that stores the permissible number of parallel executions of processor failure recovery processing, indexed by the CPU usage rate information; a time measurement function that measures the time from the transmission of an instruction signal from the master processor to the slave processor until the reception of a response signal from the slave processor at the master processor; response time information that stores the time measured by the time measurement function; a transmission timing control table that stores the inter-instruction-signal transmission timing from the master processor to the slave processor, indexed by the response time information; and a timing function that takes an arbitrary timing delay; and in that, when a processor failure occurs in the slave processor, processor failure recovery processing for the slave processor is started if the parallel execution counter is within the permissible number of parallel executions in the parallel execution control table indexed by the CPU usage rate information, the start of the processor failure recovery processing is deferred if the permissible number of parallel executions is exceeded, and, after the processor failure recovery processing has started, each instruction signal from the master processor to the slave processor in the processor failure recovery processing is sent after the timing function has taken a timing delay equal to the inter-instruction-signal transmission timing in the transmission timing control table indexed by the response time information.