JP2004310252A

JP2004310252A - Failure restoration method for multiprocessor system

Info

Publication number: JP2004310252A
Application number: JP2003099990A
Authority: JP
Inventors: Takamasa Mitsunobu; 隆正光信
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 2003-04-03
Filing date: 2003-04-03
Publication date: 2004-11-04

Abstract

PROBLEM TO BE SOLVED: To provide a failure recovery method for a multiprocessor system in which another processor appropriately detects the failure of an arbitrary processor and appropriately performs arbitrary processing on behalf of the failed processor.

SOLUTION: Each program storage memory stores a complementary program that lets each processor perform the processing of another processor on its behalf. Each processor writes its operation status to an operation status information storage memory and monitors the operation statuses stored there. When the operation status of the n-th processor includes failure content, the normally operating m-th processor executes the complementary program according to the operation status information and performs the processing of the n-th processor on its behalf, while an initializing means initializes the n-th processor. After the initialization of the n-th processor is completed, the m-th processor returns the processing to the n-th processor according to the operation status information.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明はマルチプロセッサシステムの故障復帰方法に関する。
【０００２】
【従来の技術】
従来、マルチプロセッサシステムの故障復帰方法としては、例えば特許文献１に開示されているように、マイクロプログラム制御の主プロセッサとマイクロプログラム制御の副プロセッサとを有し、主プロセッサは主プロセッサ用のマイクロプログラムと副プロセッサを起動するマイクロプログラムとを格納する主制御記憶を有し、副プロセッサは主プロセッサによって起動されて副制御記憶に格納されるマイクロプログラムによる制御の下に処理を行いその結果を主プロセッサに報告するマルチプロセッサシステムであって、主プロセッサは主制御記憶中に副プロセッサの処理を代替するマイクロプログラムを備え、主プロセッサが副プロセッサから障害の報告があったとき副プロセッサを起動するマイクロプログラムが格納されている主制御記憶のアドレスを切り替えて代替マイクロプログラムを実行するようにしたことを特徴としたマルチプロセッサシステムがあった。上記のようにシステム内に処理を代替する手段を備えておいて、システムに故障が発生して或る処理が実行できなくなった場合でも、処理を代替することでシステムの機能を維持することができる。
【０００３】
【特許文献１】
特開平１−１０２６７４号公報
【０００４】
【発明が解決しようとする課題】
しかし従来の技術では解決されていない課題がある。先ず故障の検出方法だが、マルチプロセッサシステムでは、どのプロセッサが故障するかは予想できないので、特定のプロセッサの故障だけを想定した故障検出は不充分である。プロセッサ能力とプロセッサ上で実行されているプログラム規模によって故障の確率を推測することはできるが、プロセッサ能力が高くプログラム規模が大きい方が故障の確率は一般的に高い。しかし従来の技術では主プロセッサの故障を想定していない。また障害に関係している可能性のあるプロセッサは誤った処理をする可能性があるので、障害に関係しているプロセッサがその障害について報告するのは論理的に矛盾を含んでいる。
【０００５】
更に障害の報告タイミングにも考慮が必要である。すなわち障害発生と障害報告の時差が大きい場合には、故障復帰処理が適切に行われない可能性があるし、障害報告の内容が不充分、つまり複数の障害が発生した場合にどの障害を故障復帰処理の判断対象とするのかを考慮するための情報が欠如する可能性がある。
【０００６】
次に故障したプロセッサの処理の代替方法だが、上記故障の検出方法の課題と同様に、代替可能な処理が限定されていると、代替できない処理が存在して、故障の状況によっては故障からの復帰までシステムの機能が失われる。代替処理に処理速度制約があって、処理が遅くなって顧客に無駄な待ち時間を強いることも課題だった。
【０００７】
最後に故障からの復帰方法だが、従来の技術には明確な開示がない。故障からの復帰方法としてプロセッサの初期化があるが、初期化が終了するまでの期間、システムは機能を失うことになっていた。
【０００８】
【課題を解決するための手段】
この課題を解決するために本発明は、複数のプロセッサと、各プロセッサと固有バスで接続されたプログラム格納メモリと、各プロセッサと共通バスで接続された動作状況情報格納メモリと、各プロセッサと共通バスで接続された初期化手段と、ハードウエア資源と、各プロセッサとハードウエア資源を選択的に接続するバス切替手段とからなり、各プログラム格納メモリには各プロセッサが他のプロセッサの処理を代替する相補プログラムが格納され、各プロセッサは動作状況を動作状況情報格納メモリに書き込むと共に動作状況情報格納メモリに格納された動作状況を監視し、第ｎプロセッサの動作状況が障害内容を含む場合には正常動作している第ｍプロセッサが前記動作状況情報に従って前記相補プログラムを実行して第ｎプロセッサの処理を代替すると共に初期化手段が第ｎプロセッサを初期化し第ｎプロセッサの初期化終了後に第ｍプロセッサが前記動作状況情報に応じて処理を第ｎプロセッサに戻すことを特徴とするマルチプロセッサシステムの故障復帰方法を提供する。
【０００９】
【発明の実施の形態】
以下、本発明の実施の形態について、図１から図４を用いて説明する。
【００１０】
（実施の形態１）
図１は本発明の実施の形態１におけるマルチプロセッサシステムのブロック図を示し、図１において１はマルチプロセッサシステム、２は初期化手段、３は動作状況情報格納メモリ、４は共通バス、５はバス切替手段、６はハードウエア資源、７は第１プロセッサ、８は第１バス、９は第１メモリ、１０は第２プロセッサ、１１は第２バス、１２は第２メモリ、１３は第ｎプロセッサ、１４は第ｎバス、１５は第ｎメモリである。
【００１１】
以上のように構成されたマルチプロセッサシステムについて、以下、その動作を述べる。マルチプロセッサシステム１において、第１プロセッサ７は第１バス８を介して第１メモリ９に格納されているプログラムに従って動作する。同様に第２プロセッサ１０は第２バス１１を介して第２メモリ１２に格納されているプログラムに従って動作する。Ｎ個のプロセッサが存在する場合、第ｎプロセッサ１３は第ｎバス１４を介して第ｎメモリ１５に格納されているプログラムに従って動作する。図１には第１プロセッサ７、第２プロセッサ１０、第ｎプロセッサ１３が記述されているが、第ｎプロセッサ１３はＮ個のうち任意のプロセッサを示している。マルチプロセッサシステム１の通常動作時には、第ｎプロセッサ１３は定期的に共通バス４を介して動作状況情報格納メモリ３に動作状況を格納する。動作状況とは、ハードウエア資源６の制御レジスタ設定値、状態レジスタ値、割り込み要求状態、初期化履歴、故障履歴、初期化からの動作時間、認識されている動作時刻、メモリ使用状況、第ｎメモリ１５に格納されたアプリケーションプログラム２１の動作状態、ドライバプログラム２２の動作状態、オペレーティングシステム２３の動作状態で、通常動作時にはＮ個のプロセッサ毎に動作状況情報格納メモリ３上で動作状況が逐次更新される。また第ｎプロセッサ１３は定期的に動作状況情報格納メモリ３上の動作状況を監視して、動作状況の中に故障と見なされる状況を発見した場合には、故障の起きているプロセッサを限定し、バス切替手段５を制御して故障したプロセッサのバスをハードウエア資源６から切り離す。これによって故障したプロセッサはハードウエア資源６を制御できなくなると共にハードウエア資源６からの情報、例えば割り込み要求等を受けられなくなる。次に第ｎプロセッサ１３は初期化手段２を制御して故障したプロセッサを初期化する。図２に初期化手段の一例を示されている。第１プロセッサ７、第２プロセッサ１０、第ｎプロセッサ１３に対応して初期化手段２は第１初期化Ｆ／Ｆ１６、第２初期化Ｆ／Ｆ１７、第ｎ初期化Ｆ／Ｆ１８を備え、適当なプロセッサから故障したプロセッサに該当する初期化Ｆ／Ｆをセットすることで初期化信号がイネーブルされて該当するプロセッサが初期化される。故障とは、該当するプロセッサに関連するハードウエア及び該当するプロセッサのメモリに格納されているプログラムの異常を全て含む。またハードウエア資源６に故障が認められた場合も、該故障を認識したプロセッサの故障と見なす。第ｎプロセッサ１３は故障したプロセッサの初期化が行われている期間、該当するプロセッサが行うべきであった処理を第ｎメモリ１５に格納されている相補プログラムを実行して代替する。相補プログラムの一例は図４に示したように、アプリケーション２１、ドライバ２２、オペレーティングシステム２３から構成される。相補プログラムには故障時の動作状況の分析、次処理の予測、次処理の実行、故障したプロセッサの初期化の状況観察が含まれる。例えば、故障したプロセッサが処理Ａを実行中に故障した場合、第ｎプロセッサ１３は処理Ａが正常終了したかどうかを検証し、正常終了していないときには代替として処理Ａを再実行する。また処理Ａを起動してから故障までに時間の経過があり、処理Ａに続く処理Ａ＋１、Ａ＋２、Ａ＋ｎが必要であるにも関わらずそれらの処理が実行されていない場合には、誤動作を継続させないために処理Ａに関連する動作を停止させる場合もある。第ｎメモリ１５に格納された相補プログラムが故障したプロセッサの処理を網羅しているとは限らない。第ｎプロセッサの処理能力が故障したプロセッサの処理能力より低い場合には、処理を制限する。すなわち同じ処理速度で機能を落とすか、或いは同じ機能で処理速度を落とすかで制限する。例えば、故障時に実行していたアプリケーションが速度重視のアプリケーションであれば、相補プログラムは機能を劣化させても処理速度を維持し、見栄えや機能の多様性を重視したアプリケーションの場合には、速度を落としても全ての機能と表示を維持する。機能の劣化とは、例えば、グラフィックス描画機能に関して、複雑な図形を描く際に曲線を適当な数の直線で近似したり、任意の楕円を円で代替したり、点線や破線を直線で代替したり、カラー描画を白黒描画で代替したり、塗りつぶしを輪郭線だけで代替したり、任意の図形回転を回転角度限定で代替したり、優先順位の低い処理を無視することを意味する。他にも外部機器との通信速度を低速に設定したり、縮小拡大や符号化復号化等のデータの加工精度を低く設定したりすることも容易に類推される。動作状況情報格納メモリ３上の動作状況から現在の処理負荷を分析し、負荷に応じて相補プログラム
の機能に制限を設けるなどして処理速度を維持することも容易に考えられる。第ｎプロセッサ１３が動作状況情報格納メモリ３の動作状況を監視して、故障したプロセッサの初期化が終了し正常動作を再開したことを知ると、バス切替手段５を使って、初期化終了したプロセッサのバスをハードウエア資源６に再接続する。初期化の後複数回の故障があった場合には、第ｎプロセッサが故障したプロセッサの処理を代替し続ける。
【００１２】
【発明の効果】
以上のように本発明によれば、マルチプロセッサシステムで各プロセッサが動作状況を定期的に共通バス上のメモリに格納し、各プロセッサが任意のプロセッサの動作状況を定期的に読み出して監視するので、任意をプロセッサの故障を適時に且つ適切に他のプロセッサが認識し、故障したプロセッサを初期化することができる。
【００１３】
また故障したプロセッサの処理を代替する場合、故障時の動作状況を把握したうえで処理を代替するので、未完了の処理が放置されたり、誤動作に帰直する可能性のある処理が続行されたり、処理手順に矛盾が生じることがなく、適切に代替することができる。
【００１４】
また故障したプロセッサは初期化が完了するまでバス切替手段でハードウエア資源から隔離されるので、故障したプロセッサがハード資源に影響を与えることを避け、システムの信頼性と安定性を確保できる。各プロセッサの相補プログラムはプロセッサの処理能力に応じて機能や規模を最適にするので、過分な処理能力のプロセッサを採用する必要がなく経済的である。
【００１５】
更にシステムにかかる負荷に応じて相補プログラムに機能制限ができるので、合理的である。
【図面の簡単な説明】
【図１】本発明の実施の形態１におけるマルチプロセッサシステムのブロック図
【図２】本発明の実施の形態１におけるマルチプロセッサシステムの初期化手段を示すブロック図
【図３】本発明の実施の形態１におけるマルチプロセッサシステムの操作状況情報を示すブロック図
【図４】本発明の実施の形態１におけるマルチプロセッサシステムの相補プログラムの構成例を示すブロック図
【符号の説明】
１マルチプロセッサシステム
２初期化手段
３動作状況情報格納メモリ
４共通バス
５バス切替手段
６ハードウエア資源
７第１プロセッサ
８第１バス
９第１メモリ
１０第２プロセッサ
１１第２バス
１２第２メモリ
１３第ｎプロセッサ
１４第ｎバス
１５第ｎメモリ
１６第１初期化Ｆ／Ｆ
１７第２初期化Ｆ／Ｆ
１８第ｎ初期化Ｆ／Ｆ
１９第１初期化信号
２０第２初期化信号
２１第ｎ初期化信号
２２アプリケーション
２３ドライバ
２４オペレーティングシステム
[0001]
[Technical field of the invention]
The present invention relates to a fault recovery method for a multiprocessor system.
[0002]
[Prior art]
Conventionally, as a failure recovery method for a multiprocessor system, there has been a system such as the one disclosed in Patent Document 1, which has a microprogram-controlled main processor and a microprogram-controlled sub-processor. The main processor has a main control memory that stores the microprogram for the main processor and a microprogram for activating the sub-processor. The sub-processor is activated by the main processor, performs processing under the control of a microprogram stored in a sub control memory, and reports the result to the main processor. The main processor also holds, in the main control memory, a microprogram that substitutes for the processing of the sub-processor. When the sub-processor reports a failure, the main processor switches the address of the main control memory at which the microprogram for activating the sub-processor is stored and executes the substitute microprogram instead. By providing means for substituting processing within the system in this way, the functions of the system can be maintained by substituting the processing even when a failure occurs and a certain process can no longer be executed.
[0003]
[Patent Document 1]
JP-A-1-102674
[0004]
[Problems to be solved by the invention]
However, problems remain that the prior art does not solve. The first concerns failure detection. In a multiprocessor system it cannot be predicted which processor will fail, so failure detection that assumes a failure of only one specific processor is insufficient. The probability of failure can be estimated from the processor capability and the scale of the program executed on the processor, and it is generally higher for a more capable processor running a larger program. Nevertheless, the prior art does not assume a failure of the main processor. In addition, a processor that may be involved in a fault may itself perform incorrect processing, so having the processor involved in a fault report on that fault is logically inconsistent.
[0005]
The timing of failure reporting must also be considered. If the time difference between the occurrence of a failure and its report is large, the failure recovery processing may not be performed properly. The content of the failure report may also be insufficient: when multiple failures occur, the information needed to decide which failure the recovery processing should act on may be missing.
[0006]
The next problem concerns how the processing of the failed processor is substituted. As with the failure detection problem above, if the processing that can be substituted is limited, some processing cannot be substituted, and depending on the failure the functions of the system are lost until recovery from the failure. A further problem was that the substitute processing was subject to processing speed constraints, so processing slowed down and forced the customer to wait needlessly.
[0007]
Finally, the prior art gives no clear disclosure of how to recover from a failure. One way to recover from a failure is to initialize the processor, but the system then loses its functions until the initialization is completed.
[0008]
[Means for Solving the Problems]
In order to solve these problems, the present invention provides a failure recovery method for a multiprocessor system comprising a plurality of processors, a program storage memory connected to each processor via a unique bus, an operation status information storage memory connected to each processor via a common bus, initialization means connected to each processor via the common bus, a hardware resource, and bus switching means for selectively connecting each processor to the hardware resource. Each program storage memory stores a complementary program by which the processor substitutes for the processing of another processor. Each processor writes its operation status to the operation status information storage memory and monitors the operation statuses stored in that memory. When the operation status of the n-th processor includes failure content, the m-th processor, which is operating normally, executes the complementary program in accordance with the operation status information to substitute for the processing of the n-th processor, and the initialization means initializes the n-th processor. After the initialization of the n-th processor is completed, the m-th processor returns the processing to the n-th processor in accordance with the operation status information.
[0009]
[Embodiments of the invention]
Hereinafter, an embodiment of the present invention will be described with reference to FIGS. 1 to 4.
[0010]
(Embodiment 1)
FIG. 1 is a block diagram of the multiprocessor system according to Embodiment 1 of the present invention. In FIG. 1, 1 is a multiprocessor system, 2 is initialization means, 3 is an operation status information storage memory, 4 is a common bus, 5 is bus switching means, 6 is a hardware resource, 7 is a first processor, 8 is a first bus, 9 is a first memory, 10 is a second processor, 11 is a second bus, 12 is a second memory, 13 is an n-th processor, 14 is an n-th bus, and 15 is an n-th memory.
[0011]
The operation of the multiprocessor system configured as described above is described below. In the multiprocessor system 1, the first processor 7 operates according to a program stored in the first memory 9 via the first bus 8. Similarly, the second processor 10 operates according to a program stored in the second memory 12 via the second bus 11. When there are N processors, the n-th processor 13 operates according to a program stored in the n-th memory 15 via the n-th bus 14. Although FIG. 1 shows the first processor 7, the second processor 10, and the n-th processor 13, the n-th processor 13 stands for an arbitrary one of the N processors. During normal operation of the multiprocessor system 1, the n-th processor 13 periodically stores its operation status in the operation status information storage memory 3 via the common bus 4. The operation status comprises the control register settings of the hardware resource 6, the status register values, the interrupt request state, the initialization history, the failure history, the operating time since initialization, the recognized operation time, the memory usage, and the operating states of the application program 21, the driver program 22, and the operating system 23 stored in the n-th memory 15. During normal operation, the operation status of each of the N processors is updated successively in the operation status information storage memory 3. The n-th processor 13 also periodically monitors the operation statuses in the operation status information storage memory 3. When it finds a condition regarded as a failure, it identifies the processor in which the failure has occurred and controls the bus switching means 5 to disconnect the bus of the failed processor from the hardware resource 6.
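The periodic write, monitor, and isolate cycle described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the record fields, the `find_failed` helper, the `BusSwitch` class, and the staleness threshold are hypothetical names and choices, and the text does not specify which condition is "regarded as a failure".

```python
from dataclasses import dataclass, field


@dataclass
class OperationStatus:
    """Assumed subset of the operation status fields named in the text."""
    last_update: int                      # tick of the most recent periodic write
    app_state: str = "running"            # "running" or "fault" (hypothetical values)
    failure_history: list = field(default_factory=list)


class BusSwitch:
    """Stand-in for bus switching means 5."""

    def __init__(self, processor_ids):
        # Every processor bus starts connected to the hardware resource.
        self.connected = set(processor_ids)

    def disconnect(self, pid):
        self.connected.discard(pid)


def find_failed(status_memory, self_id, now, stale_after=3):
    """Return ids of other processors whose status looks like a failure.

    Assumption: a processor counts as failed if it reports "fault" or has
    not refreshed its status for more than `stale_after` ticks.
    """
    return [
        pid
        for pid, st in status_memory.items()
        if pid != self_id
        and (st.app_state == "fault" or now - st.last_update > stale_after)
    ]


# Processor 1 runs one monitoring pass at tick 10.
status_memory = {
    1: OperationStatus(last_update=10),
    2: OperationStatus(last_update=9),
    3: OperationStatus(last_update=4),    # stale: 10 - 4 > 3
}
bus = BusSwitch(status_memory)
for pid in find_failed(status_memory, self_id=1, now=10):
    bus.disconnect(pid)
```

In this sketch the monitoring processor skips its own record, which reflects the point made earlier that a processor involved in a fault should not be the one to report it.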
As a result, the failed processor can no longer control the hardware resource 6 and can no longer receive information from the hardware resource 6, such as interrupt requests. Next, the n-th processor 13 controls the initialization means 2 to initialize the failed processor. FIG. 2 shows an example of the initialization means. The initialization means 2 includes a first initialization F/F 16, a second initialization F/F 17, and an n-th initialization F/F 18 corresponding to the first processor 7, the second processor 10, and the n-th processor 13. When an appropriate processor sets the initialization F/F corresponding to the failed processor, the initialization signal is enabled and the corresponding processor is initialized. A failure here covers every abnormality of the hardware related to the processor and of the programs stored in the memory of the processor. A failure found in the hardware resource 6 is also regarded as a failure of the processor that recognized it. While the failed processor is being initialized, the n-th processor 13 executes the complementary program stored in the n-th memory 15 to substitute for the processing that the failed processor should have performed. As shown in FIG. 4, an example of the complementary program consists of an application 21, a driver 22, and an operating system 23. The complementary program includes analysis of the operation status at the time of failure, prediction of the next process, execution of the next process, and observation of the initialization of the failed processor. For example, if the failed processor failed while executing a process A, the n-th processor 13 verifies whether the process A completed normally and, if it did not, re-executes the process A as a substitute. If time has passed between the start of the process A and the failure, and the processes A+1, A+2, and A+n that follow the process A are required but have not been executed, the operations related to the process A may instead be stopped so that the malfunction does not continue. The complementary program stored in the n-th memory 15 does not necessarily cover all the processing of the failed processor. If the processing capability of the n-th processor is lower than that of the failed processor, the processing is restricted, either by reducing functions at the same processing speed or by reducing processing speed with the same functions. For example, if the application being executed at the time of the failure is speed-oriented, the complementary program maintains the processing speed even if functions are degraded; if the application emphasizes appearance and diversity of functions, all functions and display are maintained even if the speed drops. Degradation of functions means, for example in a graphics drawing function, approximating a curve with a suitable number of straight lines when drawing a complicated figure, replacing an arbitrary ellipse with a circle, replacing a dotted or broken line with a straight line, replacing color drawing with black-and-white drawing, replacing a fill with its outline only, replacing arbitrary figure rotation with rotation at a limited set of angles, or ignoring low-priority processing. Other measures that readily follow by analogy include lowering the communication speed with external devices and lowering the precision of data processing such as reduction, enlargement, encoding, and decoding. It is also readily conceivable to maintain the processing speed by analyzing the current processing load from the operation status in the operation status information storage memory 3 and restricting the functions of the complementary program according to the load.
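The two decisions the complementary program makes in this passage, what to do about the interrupted process A and how to restrict processing on a less capable substitute, could be expressed as simple decision functions. The sketch below is one possible reading: the function names, the return labels, the scalar capability values, and the order in which the conditions are checked are assumptions made for illustration and are not taken from the text.

```python
def takeover_action(completed_normally, followups_overdue):
    """Decide how the substitute handles the interrupted process A.

    Assumed decision order: a normally completed process needs no repair;
    if follow-up processes (A+1, A+2, ...) were required but never ran,
    related operations are stopped; otherwise process A is re-executed.
    """
    if completed_normally:
        return "continue"
    if followups_overdue:
        return "stop_related"
    return "re_execute"


def choose_restriction(substitute_capability, failed_capability, speed_oriented):
    """Pick how to restrict processing on the substitute processor.

    Capabilities are hypothetical scalar scores; larger means more capable.
    """
    if substitute_capability >= failed_capability:
        return "none"                     # no restriction needed
    if speed_oriented:
        return "reduce_function"          # keep speed, degrade functions
    return "reduce_speed"                 # keep functions, accept lower speed
```

For example, `choose_restriction(50, 100, speed_oriented=True)` yields `"reduce_function"`, which corresponds to measures such as approximating curves with straight lines or drawing outlines instead of fills.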
When the n-th processor 13, by monitoring the operation status in the operation status information storage memory 3, learns that the failed processor has finished initialization and resumed normal operation, it uses the bus switching means 5 to reconnect the bus of the initialized processor to the hardware resource 6. If failures occur repeatedly after the initialization, the n-th processor continues to substitute for the processing of the failed processor.
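Taken together, the isolate, initialize, substitute, and hand-back sequence of this embodiment can be pictured as one monitoring step that the normally operating processor repeats. The following is a minimal simulation under assumptions: the state labels `"normal"` and `"fault"`, the `monitor_step` function, and the clearing of the flip-flop on recovery are hypothetical, since the text does not say when the initialization F/F is reset.

```python
class InitializationMeans:
    """Stand-in for initialization means 2: one flip-flop per processor."""

    def __init__(self, processor_ids):
        self.ff = {pid: False for pid in processor_ids}

    def set(self, pid):
        self.ff[pid] = True               # enables the initialization signal

    def clear(self, pid):
        self.ff[pid] = False              # assumed reset once recovery is seen


def monitor_step(self_id, status, connected, init, substituting):
    """One monitoring pass run by the normally operating processor."""
    for pid, state in status.items():
        if pid == self_id:
            continue
        if state == "fault" and pid not in substituting:
            connected.discard(pid)        # isolate from the hardware resource
            init.set(pid)                 # start initialization
            substituting.add(pid)         # begin running the complementary program
        elif state == "normal" and pid in substituting:
            init.clear(pid)
            connected.add(pid)            # reconnect the bus
            substituting.discard(pid)     # return processing to the processor


status = {1: "normal", 2: "fault"}
connected = {1, 2}
init = InitializationMeans(status)
substituting = set()

monitor_step(1, status, connected, init, substituting)
after_failure = (set(connected), init.ff[2], set(substituting))

status[2] = "normal"                      # processor 2 reports normal operation again
monitor_step(1, status, connected, init, substituting)
after_recovery = (set(connected), init.ff[2], set(substituting))
```

The sketch does not model the case of repeated failures after initialization, in which the substitute would simply stay in the substituting state.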
[0012]
[Effects of the invention]
As described above, according to the present invention, each processor in a multiprocessor system periodically stores its operation status in a memory on the common bus, and each processor periodically reads and monitors the operation status of any other processor. A failure of any processor can therefore be recognized by another processor in a timely and appropriate manner, and the failed processor can be initialized.
[0013]
Also, when the processing of a failed processor is substituted, it is substituted after the operation status at the time of the failure has been grasped. Incomplete processing is therefore not left unattended, processing that could result in a malfunction is not continued, and no inconsistency arises in the processing procedure, so the processing can be substituted appropriately.
[0014]
Further, the failed processor is isolated from the hardware resources by the bus switching means until the initialization is completed, so that the failed processor does not affect the hardware resources, and the reliability and stability of the system can be secured. Since the complementary program of each processor optimizes the function and scale according to the processing capacity of the processor, there is no need to employ a processor with an excessive processing capacity, and it is economical.
[0015]
Furthermore, functional limits can be placed on the complementary program according to the load on the system, which is rational.
[Brief description of the drawings]
FIG. 1 is a block diagram of the multiprocessor system according to Embodiment 1 of the present invention.
FIG. 2 is a block diagram showing the initialization means of the multiprocessor system according to Embodiment 1 of the present invention.
FIG. 3 is a block diagram showing the operation status information of the multiprocessor system according to Embodiment 1 of the present invention.
FIG. 4 is a block diagram showing a configuration example of the complementary program of the multiprocessor system according to Embodiment 1 of the present invention.
[Description of symbols]
1 Multiprocessor system
2 Initialization means
3 Operation status information storage memory
4 Common bus
5 Bus switching means
6 Hardware resource
7 First processor
8 First bus
9 First memory
10 Second processor
11 Second bus
12 Second memory
13 n-th processor
14 n-th bus
15 n-th memory
16 First initialization F/F
17 Second initialization F/F
18 n-th initialization F/F
19 First initialization signal
20 Second initialization signal
21 n-th initialization signal
22 Application
23 Driver
24 Operating system

Claims

1. A failure recovery method for a multiprocessor system comprising a plurality of processors, a program storage memory connected to each processor via a unique bus, an operation status information storage memory connected to each processor via a common bus, initialization means connected to each processor via the common bus, a hardware resource, and bus switching means for selectively connecting each processor to the hardware resource, wherein each program storage memory stores a complementary program by which the processor substitutes for the processing of another processor; each processor writes its operation status to the operation status information storage memory and monitors the operation statuses stored in the operation status information storage memory; and, when the operation status of the n-th processor includes failure content, the m-th processor, which is operating normally, executes the complementary program in accordance with the operation status information to substitute for the processing of the n-th processor, the initialization means initializes the n-th processor, and after the initialization of the n-th processor is completed the m-th processor returns the processing to the n-th processor in accordance with the operation status information.

2. The multiprocessor system according to claim 1, wherein the complementary programs are different depending on the processing capability of each processor.

3. The multiprocessor system according to claim 1, wherein each processor sets a limit on a complementary program in accordance with operation status information.

4. The multiprocessor system according to claim 1, wherein the complementary program comprises an application program, a hardware control program, and an operating system.