JP3688217B2 - Multiprocessor initialization / concurrent diagnosis method

Info

Publication number: JP3688217B2
Application number: JP2001113419A
Authority: JP
Inventors: 真一落合; 和宏村山
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2001-04-12
Filing date: 2001-04-12
Publication date: 2005-08-24
Anticipated expiration: 2021-04-12
Also published as: JP2002312333A

Description

【０００１】
【発明の属する技術分野】
この発明は、マルチプロセッサにより構成される情報処理システムの初期化処理に関するものであり、特にシステムを構成するプロセッサのいくつかに異常がある場合でも、システムの起動時間を短縮すると同時に、その異常に対処してシステムの信頼性を向上しようとするものである。
【０００２】
【従来の技術】
従来のマルチプロセッサシステムの初期化時に障害があった場合の復旧処理方式としては、例えば、特開平１１−１６１６１６号公報に示されるようなものがあった。図２３はこの公報に示されたマルチプロセッサシステムの初期化処理方式の動作の一部を示す図である。
【０００３】
この方式の動作を説明する。
マルチプロセッサシステムを構成する複数のプロセッサが初期化処理において異常となった場合に、障害復旧の初期化（ＩＰＬ）処理プログラム種別ＩＤをブロードキャスト通信により送信し、障害を起こした複数のプロセッサを同時に復旧させることにより、障害発生時でも起動処理時間を短縮しようとするものである。障害を起こしたと思われるプロセッサに対し、ブロードキャスト通信により起動指示を再試行すると共に、再試行の起動指示に対し応答しないプロセッサを切離すという手順により、システムの初期化処理を完了させる。
このように、障害を起こした複数のプロセッサに同時に起動処理の再試行を行わせることができるので、障害を起こしたプロセッサごとに個別に起動処理の再試行を行う場合よりも、高速にシステムを起動することができる。しかし、初期化処理の再試行を行うために、やはりシステムの起動時間は正常時と比べて遅延する。
【０００４】
【発明が解決しようとする課題】
従来のマルチプロセッサシステムの初期化処理方式は以上のように構成されており、応答が返らないプロセッサの判定処理は、最も起動処理が遅れた場合をタイムアウト時間として設定するので、長い再初期化時間が必要となる。更に信頼性に劣る通信路を用いたブロードキャスト通信に頼るので、このような初期化処理動作が複数回再試行されることになる。こうして再試行初期化処理における正常／異常の判定の間、システムの運用開始が大きく遅延する可能性が発生する。通常、冗長構成のマルチプロセッサシステムにおいては、いくつかのプロセッサが異常でもシステムは運用可能であるにも関わらず、結果的に異常プロセッサの状態確認が優先して起動時間が遅くなるという課題があった。
【０００５】
この発明は上記のような課題を解消するためになされたもので、システム運用開始においては必要となるプロセッサを絞り込んで立ち上げ、システム運用が立ち上がって後のフェーズで順次、必要なプロセッサを参入するようにして信頼性を向上し、プロセッサの初期化、異常確認と、システムの運用を並行して行うようにする。
【０００６】
【課題を解決するための手段】
この発明に係るマルチプロセッサ初期化／並行診断方法は、マルチプロセッサにより構成される情報処理システムの初期化処理とそれに続く処理の方法において、
上記システムの初期化時に、上記マルチプロセッサを構成する各プロセッサの中で設定されたシステムマスタプロセッサが、上記各プロセッサに対してシステム初期化指令を行って、上記各プロセッサから所定時間内の応答を確認して、該応答があった各プロセッサを正常プロセッサと判定して正常プロセッサプールに登録するステップと、
上記正常プロセッサと判定された各プロセッサにより上記システムの運用を開始するステップと、
上記システムマスタプロセッサが、上記所定時間内に応答がない各プロセッサを留保プロセッサと判定して留保プロセッサプールに登録するステップと、
上記システムマスタプロセッサが、上記正常プロセッサと判定された各プロセッサの中から留保プロセッサプール・マスタを選定するステップと、
上記システムの運用と並行して、選定された上記留保プロセッサプール・マスタが、留保プロセッサプールに登録された上記留保プロセッサに対して動作の確認処理をするステップ、とを備えたことを特徴とする。
【０００７】
システム運用時において、上記正常と判定された各プロセッサは、要求プロセッサと実行プロセッサとなり、
上記要求プロセッサが、上記実行プロセッサに対して処理要求を行うステップと、
上記実行プロセッサに出した上記処理要求に対して該実行プロセッサから所定の時間内に実行結果の受信が確認できないと、上記要求プロセッサが、上記システムマスタプロセッサに非応答を通知するステップと、
上記システムマスタプロセッサが、上記非応答の通知を受けると、正常プロセッサプールに登録されている未使用のプロセッサを代替プロセッサとして選択して実行プロセッサに変更するステップと、
上記システムマスタプロセッサが、上記非応答の実行プロセッサを留保プロセッサプールに移すステップ、とを備えたことを特徴とする。
【０００８】
システム運用と並行して、選定された上記留保プロセッサプール・マスタが、留保プロセッサと判定されたプロセッサに対して動作の再確認処理を行うステップと、
上記システムマスタプロセッサが、上記再確認処理により正常応答を得たプロセッサを正常プロセッサと判定して正常プロセッサプールに登録するステップ、を備えたことを特徴とする。
【０００９】
システム運用と並行して、選定された上記留保プロセッサプール・マスタが、留保プロセッサと判定されたプロセッサに対して動作の再確認処理を行うステップと、
上記システムマスタプロセッサが、上記再確認処理により正常応答が得られなかったプロセッサを異常プロセッサと判定して異常プロセッサプールに登録するステップ、を備えたことを特徴とする。
【００１０】
上記システムマスタプロセッサが、正常プロセッサプールに登録された正常プロセッサの数をカウントする正常プロセッサプール・プロセッサ数確認処理ステップと、
上記システムマスタプロセッサが、カウントした上記正常プロセッサ数とシステム運用に必要な最低プロセッサ数とを比較するステップと、
上記比較で正常プロセッサ数がシステム運用に必要な最低プロセッサ数より少なくなると、上記システムマスタプロセッサが、システム異常処理を行うステップ、とを備えたことを特徴とする。
【００１１】
上記システムマスタプロセッサが、正常プロセッサプールに登録された正常プロセッサの中から新たな留保プロセッサプールマスタを選択するステップと、
上記システムマスタプロセッサが、上記選択する以前に留保プロセッサプールマスタとなっていたプロセッサを正常プロセッサとして登録するステップ、とを備えたことを特徴とする。
【００１４】
【発明の実施の形態】
実施の形態１．
マルチプロセッサシステムの初期化処理において、構成する各プロセッサを、正常と判定できた「正常」プロセッサプール、正常・異常の判定ができなかった「留保」プロセッサプール、異常と判定した「異常」プロセッサプールに分けることにより、個々のプールのその後の処理を変更する方法を説明する。
図１は、この発明の実施の形態１におけるマルチプロセッサシステムの初期化処理方式の概念を示す構成図である。図において、１０１はシステムを構成するプロセッサを示す。本実施の形態では、多数のプロセッサで構成され、プロセッサを冗長構成したシステムを想定する。各プロセッサはプロセッサ間のデータ交換のためにネットワーク１０３で接続される。このネットワークはバス、ＬＡＮ、ファイバーチャネルなど、種々のものであってよい。また、大規模構成を実現するためにネットワーク１０３は、１０２で示すネットワークスイッチにより、階層構造を持つ場合もある。
１００はプロセッサの中で、システム管理を行うように選定されたシステムマスタプロセッサである。システムマスタプロセッサ１００は、ネットワーク１０３、ネットワークスイッチ１０２を介して、他の全てのプロセッサ１０１と通信できる。
【００１５】
このような構成のマルチプロセッサシステムにおいて、システム起動時には、従来例で示された技術を利用して、システムマスタプロセッサ１００が、初期化指示、もしくは初期プログラムのダウンロード１０４を、全てのプロセッサに対して同時に行って、システムを起動する。各プロセッサ１０１はその指示に従い、プロセッサの初期化処理を行う。
図２は初期化処理の完了状態を示す図である。個々のプロセッサ１０１は初期化処理が完了すると、システムの同期のために、それぞれシステムマスタプロセッサ１００に対して完了通知１０５を返す。この完了通知は、システムマスタプロセッサ１００の特定のレジスタへの書込みによる方法や、メッセージ送信による方法など、種々の方法が利用できる。システムマスタプロセッサ１００は、この完了通知の受信により、各プロセッサの初期化処理が正常に完了したことを確認する。
【００１６】
この発明の実施の形態１におけるマルチプロセッサシステムの初期化処理方法では、図１、図２の正常な初期化動作に加えて、システムを構成するプロセッサを図３に示すような３種類のプロセッサプールと呼ぶ論理的なグループに分類する。図３中の２００は、正常動作が確認できたプロセッサが属する「正常」プロセッサプールである。２０１は正常／異常の判定がまだ完了していないプロセッサが属する「留保」プロセッサプールである。２０２は異常であり、システム運用に利用できないと判断したプロセッサの属する「異常」プロセッサプールである。これらのプロセッサプールは、システム管理における論理的な分類であり、プロセッサに対して物理的に何らかの変更が行われるわけではない。
【００１７】
図３のグループ分けを実現するために、図４に示す構成要素を用意する。３００は「正常」プロセッサプールに属するプロセッサを管理する「正常」プロセッサプール管理表であり、「正常」プロセッサプールに属するプロセッサの識別子が登録される。３０１は「留保」プロセッサプールに属するプロセッサを管理する「留保」プロセッサプール管理表であり、「留保」プロセッサプールに属するプロセッサの識別子が登録される。３０２は「異常」プロセッサプールに属するプロセッサを管理する「異常」プロセッサプール管理表であり、「異常」プロセッサプールに属するプロセッサの識別子が登録される。本実施の形態では、３００〜３０２のプロセッサプール管理表はシステムマスタプロセッサ１００が管理している。そして、システムは十分に冗長度があるプロセッサ１０１が用意されているので、「正常」プロセッサによりシステムの立ち上げを完了する。
【００１８】
従来は、図２における完了通知の受信のための待ち時間は、どのような状態においても完了通知が受信できるように、最悪の条件を予想した長い時間を設定する必要があった。また、一過性の異常を回避するために、応答がない場合は初期化指示の再試行を行う必要があった。しかし、この実施の形態１によるマルチプロセッサシステムの初期化処理方法では、システムを構成するプロセッサを図３に示したプロセッサプールに分類し、また、以降のシステムの動作をプロセッサプールごとに変更する。
こうして、完了通知受信のための待ち時間を短く設定し、正常にも関わらずその時間内に完了通知を返せなかったプロセッサが存在しても、改めてその返せなかったプロセッサに絞り込んで正常／異常確認の再試行を行うことが可能になる。即ち、システムの初期化処理の効率化が得られ、起動時間が短縮できる。
【００１９】
以下、この「正常」プロセッサの具体的な識別と登録方法を述べる。
いったん「正常」プロセッサが他の「留保」プロセッサ等から分離されると、これら「正常」プロセッサを使用した高速初期化が可能となる。
図４は高速初期化処理４１０の概念を示す図である。この高速初期化処理４１０は、図１、図２に示した処理動作を行うが、当然、その初期化処理タイムアウト時間３０４を従来よりも短く設定して、その設定時間内に初期化を終了する。
【００２０】
この高速初期化処理４１０の具体的な動作フローを図５に示す。
まずステップ（以後、ステップの記述を省略する）Ｓ１０１で、システムマスタプロセッサ１００は、システムを構成する全プロセッサ１０１に対して起動指示を行う。起動指示の方法としては、システムリセット、ブロードキャスト通信による起動指示、ブロードキャスト通信による初期プログラムのダウンロードなどの種々の方法がある。この後Ｓ１０２で、システムマスタプロセッサは各プロセッサ１０１からの初期化処理完了通知を待つ。Ｓ１０２でいずれかのプロセッサから初期化処理完了通知を受信すると、Ｓ１０３でその応答を返したプロセッサを「正常」プロセッサプール２００に入れて、そのプロセッサのプロセッサ識別子を「正常」として、「正常」プロセッサプール管理表３００へ登録する。
【００２１】
このＳ１０２、Ｓ１０３の処理を繰り返しているうちに、初期化処理タイムアウト時間３０４になると、Ｓ１０４の条件判断を経てＳ１０５に移る。従来は、ここまでの処理でプロセッサの正常／異常判定を行わなければならなかったために、初期化処理タイムアウト時間３０４を十分に長くする必要があり、さらには応答が得られない場合は、Ｓ１０１の起動指示を再試行してプロセッサの異常を確認する必要があった。しかし、本実施の形態では、Ｓ１０５で完了通知を得られなかったプロセッサを「留保」プロセッサプール２０１に入れてしまう。その方法は、完了通知を得られなかったプロセッサの識別子を「留保」プロセッサプール管理表３０１に登録することにより行う。「留保」プロセッサプール２０１に属するプロセッサは、正常／異常の判定ができていないプロセッサであり、改めて正常／異常判定を行うことになる。一方、「正常」プロセッサプール２００に属するプロセッサは、既に正常判定を行えたプロセッサであり、システム運用に初期の段階から利用できるプロセッサとなる。
こうして、指定時間で正常と判断できなかったプロセッサを別のプロセッサプールに分け、後で別の処理を行えるようにしたので、従来よりも短い応答待ち時間を設定でき、また初期化処理の再試行をやめることが可能で、システムの初期化時間を短縮できることが明らかとなった。
【００２２】
実施の形態２．
本実施の形態では、「正常」プロセッサ群によるシステムの立ち上げ後に、システムの運用と並行した選定処理を説明する。
以下、この発明の実施の形態２におけるマルチプロセッサシステムの初期化処理方法を図を用いて説明する。先ず、図６に示すように、「正常」プロセッサプールに属しているプロセッサの一つを選定し、それを「留保」プロセッサプールを管理する「留保」プロセッサプール・マスタ４００とする。つまり、「正常」プロセッサプールに属するプロセッサは、既に正常動作が確認できたプロセッサであるので、プロセッサプールを管理する動作を行わせることができると考えられるからである。この「留保」プロセッサプール・マスタ４００を決定するために、本実施の形態では、新たに図に示したシステム初期化処理内に、「留保」プロセッサプール・マスタ選定処理５１０を設ける。また、図７に示すように、システムマスタプロセッサ１００内に、「留保」プロセッサプール・マスタ４００の識別子を持つ「留保」プロセッサプール・マスタエントリ５００を設ける。
【００２３】
「留保」プロセッサプール・マスタ選定処理５１０の動作を図８を用いて説明する。図８の例では、この選定処理は図５の動作に引き続き行う動作として記述している。
Ｓ２０１でシステムマスタプロセッサ１００は、「正常」プロセッサプール管理表３００を参照し、何らかのアルゴリズムによりプロセッサを一つ選定する。このプロセッサの識別子を「正常」プロセッサプール管理表３００のエントリから外し、この選定したプロセッサを選定プロセッサとする。この選定の方法は、システムの要求仕様に合わせて、管理表内のシステムマスタプロセッサではない最初のプロセッサを選ぶ方法や、特定のプロセッサ識別子を持つプロセッサを選択する方法など種々の方法がありえる。次にＳ２０２で、選定したプロセッサの識別子を「留保」プロセッサプール・マスタエントリ５００に登録する。システムは「留保」プロセッサプール・マスタエントリ５００に登録されたプロセッサを「留保」プロセッサプールを管理するプロセッサとして認識する。
【００２４】
こうして、既に正常と判定できたプロセッサプールの中から一つのプロセッサを「留保」プロセッサプールのマスタプロセッサとし、「留保」プロセッサプールの管理を行わせるようにしたので、「留保」プロセッサプールのプロセッサをシステムマスタプロセッサから切離し、「正常」プロセッサプールのプロセッサと独立して動作させる準備ができた。
【００２５】
先ず、並行処理の起動について述べる。
この並行処理の起動のために、新たにシステム初期化処理内に、図４に示される並行起動処理６１０を持つ。並行起動処理６１０は、「正常」プロセッサプール２００に属するプロセッサによる正常系のシステム運用と、「留保」プロセッサプール・マスタプロセッサ４００による「留保」プロセッサプール２０１に属するプロセッサの正常／異常判定処理を同時に動作させる。
【００２６】
並行起動処理６１０の動作を図９を用いて説明する。図９の例では、この並行起動処理６１０は図８の動作に引き続き行う動作として記述している。
Ｓ３０１でシステムマスタプロセッサ１００は、プロセッサ間通信により、「留保」プロセッサプール・マスタプロセッサ４００に「留保」プロセッサプールに属するプロセッサの正常／異常判定を行う再確認処理の起動を指示する。これにより、「留保」プロセッサプール・マスタプロセッサは「正常」プロセッサプール内のプロセッサと独立して動作する。次にＳ３０２で、システムマスタプロセッサは「正常」プロセッサプール管理表３００を参照し、そこに登録されている「正常」プロセッサプール内のプロセッサのみで、システムの運用を開始する。この時点では「正常」プロセッサプールに登録されているプロセッサは、システムを構成するプロセッサの一部であるが、システム運用の初期段階では演算負荷が小さいことが多いことや、また全てのプロセッサが必要となるシステム状態は少ないことから、このような起動が可能となる。
【００２７】
こうして、「留保」プロセッサプール内のプロセッサに対して行う異常判定などの処理と並行して、「正常」プロセッサプール内のプロセッサを使ってシステム運用を開始するようにしたので、システムの初期化時間を短縮できる。
【００２８】
次にシステムの立ち上げと並行した、「留保」プロセッサに対する正常／異常の再認識処理について述べる。
この再認識処理のために、図１０に示すように、選択された「留保」プロセッサプール・マスタ４００が再確認処理４２０を持つ。更にこの再確認処理において、システムマスタプロセッサ１００の持つプロセッサプール管理表の操作を行う。再確認処理４２０におけるプロセッサプール管理表に対する操作を図１１に示す。
選定された「留保」プロセッサプール・マスタ４００は、「留保」プロセッサプール管理表３０１を参照して、登録しているプロセッサに対し、従来例と同様なマルチキャスト通信により初期化リトライ要求５００を発行し、正常／異常が確認できていないプロセッサに対し、初期化処理を再試行する。再試行により最終的に正常と確認できたプロセッサは、そのプロセッサ識別子を「留保」プロセッサプール管理表３０１から外し、「正常」プロセッサプール管理表３００に登録することにより、そのプロセッサを「正常」プロセッサプール２００に移行させる。「正常」プロセッサプールに移行されたプロセッサは、システムマスタプロセッサが「正常」プロセッサプール管理表３００を参照し、システム運用に参入させる。
【００２９】
再試行によっても正常と確認できなかったプロセッサは、そのプロセッサ識別子を「留保」プロセッサプール管理表３０１から外し、「異常」プロセッサ管理表３０２に登録して、そのプロセッサを「異常」プロセッサプール２０２に入れる。「異常」プロセッサプール２０２のプロセッサは、Ｈ／Ｗ異常などの恒久的障害と判断し、システム運用から切離す。この再確認処理４２０が終わると、「留保」プロセッサプールのプロセッサは、「正常」プロセッサプール、もしくは「異常」プロセッサプールに移行し、「留保」プロセッサプールは空となる。
【００３０】
上記で概略を示した再確認処理４２０の詳細動作を図１２を用いて説明する。図１２の例では、再確認処理４２０は図９Ｓ３０１の指示により起動される処理として記述している。
再確認処理４２０は、従来と同様な方法により、確実なプロセッサの正常／異常判定を行う。まず、Ｓ４０１で、システムで規定した回数の初期化要求の再試行を実行したかどうかを確認する。指定回数の再試行を行っていない場合には、Ｓ４０２で「留保」プロセッサプール管理表３０１を参照し、登録してあるプロセッサに対し、単体リセットやマルチキャスト通信などの手段により、初期化処理の再試行を要求する。次にＳ４０３で、その初期化処理完了の応答を待つ。応答があった場合は、Ｓ４０４で完了通知を返したプロセッサのプロセッサ識別子を「正常」プロセッサプール管理表３００へ登録する。
Ｓ４０３、Ｓ４０４の動作をＳ４０５で初期化処理タイムアウト時間に達するまで繰り返す。Ｓ４０５における初期化処理タイムアウト時間は、実施の形態２の高速初期化処理４１０における初期化処理タイムアウト時間３０４とは異なり、もっとも初期化処理が遅れた場合の初期化処理時間に設定された従来の長いタイムアウト時間である。
【００３１】
次にＳ４０６で初期化処理完了通知を返さない未応答のプロセッサが存在するかを確認する。未応答のプロセッサは、「留保」プロセッサプール管理表にプロセッサ識別子が残っていることにより判別できる。未応答のプロセッサが存在する場合はＳ４０１に戻り、Ｓ４０２からの初期化処理再試行を行う。Ｓ４０６で未応答のプロセッサが存在しない場合は、全てのプロセッサを正常と判断し、「正常」プロセッサグループへ移行させたので、再確認処理４２０は完了である。
Ｓ４０１でシステムで規定した回数の初期化処理の試行を繰り返した場合は、Ｓ４０７に移る。Ｓ４０７では、最終的に初期化処理完了通知を返さなかったプロセッサ、つまりまだ「留保」プロセッサプール管理表３０１に残っている全てのプロセッサ識別子のプロセッサを「異常」プロセッサプール管理表３０２に移す。「異常」プロセッサプールのプロセッサは恒久的な異常と判断される。
【００３２】
こうして、システム運用と並行して動作する「留保」プロセッサプールを使い、従来と同様な初期化処理の方法により再度正常／異常の判定を行う。これにより、システム運用を妨げずにプロセッサの正常／異常判定を行い、システム起動の遅延の発生を回避する。また、正常と判定されたプロセッサは正常プロセッサプールに加えることによりシステムの運用に参入し、正常と判定できなかったプロセッサは「異常」プロセッサプールに加えることにより恒久的な異常としてシステム運用から切離す。
【００３３】
実施の形態３．
本実施の形態では、実施の形態１のシステムを運用中に実行プロセッサに異常が発生した場合の異常対処処理を説明する。即ち、システムの運用が妨げられないようにする。また、異常プロセッサの効果的な初期化再試行が行える方式を説明する。
図１３は、本実施の形態における処理概念を示す図である。マルチプロセッサシステムは、プロセッサ間通信により、要求プロセッサが実行プロセッサに処理要求を発行することで、システムの種々の処理が並行して進められるという形式にモデル化できる。図１３では、６００が処理要求を発行する要求プロセッサ、６０１が要求に基づき処理を実行する実行プロセッサである。従来の技術では、発行した要求に対して実行プロセッサ６０１から応答が返らない場合、タイムアウト時間として処理の最悪実行時間を設定して待ち、さらに通信やプロセッサの異常を確認するために、要求プロセッサ６００は要求発行の再試行を実行する必要があった。この判定処理には長い時間が必要で、従って、実時間性が必要なシステムでは、不具合が生じる危険性があった。
本実施の形態では、実行プロセッサ６０１から一定時間内に応答が返らない場合は、要求の再試行は行わず、実行プロセッサ６０１を「留保」プロセッサプール２０１に入れる。そして要求プロセッサは実行プロセッサの代替となる代替実行プロセッサ６０２を選択し、それに対して改めて要求を発行し、処理を継続する。こうして何度も再試行せず、次のプロセッサを代替として処理を行う。
一定時間内に応答を返さなかった実行プロセッサ６０１の正常／異常の確認は、実施の形態２で述べた仕組みを利用して、システムの運用と並行した「留保」プロセッサプール内の処理として実行する。
【００３４】
上記の機能を持たせるため、要求プロセッサ６００は、図１４に示すプロセッサ間通信処理７１０を持つ。プロセッサ間通信処理７１０の動作を図１５を用いて説明する。
まず、Ｓ５０１で要求プロセッサ６００は実行プロセッサ６０１に必要な処理要求を送信する。この後、従来と同様にＳ５０２で指定されたタイムアウト時間まで、実行プロセッサからの受信確認を待つ。Ｓ５０３でタイムアウト時間までに実行プロセッサから受信確認を受信できた場合は正常であり、従来と同様に運用を継続する。Ｓ５０３で受信確認を受信できなかった場合は、本実施の形態のプロセッサ間通信処理では新たに、Ｓ５０４で処理要求を発行した相手である実行プロセッサ６０１の識別子を、システムマスタプロセッサ１００に非応答と通知する。これにより、システムマスタプロセッサ１００により非応答の実行プロセッサ６０１に対する対応処理が行われる。要求プロセッサ６００は、Ｓ５０４でシステムマスタプロセッサに対し非応答プロセッサの通知を行った後、Ｓ５０５で直ちに「正常」プロセッサプール内にある他のプロセッサを代替実行プロセッサ６０２として選択し、これに対して再度同じ処理要求を送信し、処理を継続する。
【００３５】
こうして、プロセッサ間通信で実行プロセッサが応答を返さなかった場合の処理時間を短縮できる。本実施の形態では、実行プロセッサが応答を返さない場合、代替実行プロセッサに対して同じ処理要求を発行する時間が必要となるが、従来のようにプロセッサ間通信処理の中で実行プロセッサの正常／異常を判定する複数回の再試行や、実行プロセッサの正常／異常判断は行わないため、プロセッサに異常がある場合でも、予測可能な時間内で処理が継続される。
【００３６】
システム運用中に非応答となり「留保」プロセッサ対象となったものの対応処理を説明する。
システムマスタプロセッサ１００は、図１４に示す「留保」プロセッサプール追加処理８１０を持つ。この「留保」プロセッサプール追加処理８１０は、図１６に概念を示すように、処理要求に対して一定時間以内に応答を返さない実行プロセッサの識別子を「正常」プロセッサプール管理表３００から「留保」プロセッサプール管理表３０１へ移すことにより、プロセッサをシステムの運用から一時的に外すと共に、プロセッサの正常／異常の確認処理を行えるようにする。
【００３７】
「留保」プロセッサプール追加処理８１０の動作を図１７を用いて説明する。図１７の例では、動作は図１５のＳ５０４により起動される処理として記述している。
まず、Ｓ６０１で、処理要求に対して一定時間以内に応答を返さなかった実行プロセッサと判明した実行プロセッサ６０１のプロセッサ識別子を「正常」プロセッサプール管理表３００から検索し外す。次にＳ６０２で実行プロセッサ６０１のプロセッサ識別子を「留保」プロセッサプール管理表３０１に追加する。Ｓ６０１とＳ６０２により実行プロセッサ６０１は「正常」プロセッサプールから「留保」プロセッサプールに移行したことになる。システムマスタプロセッサ１００は「正常」プロセッサプール管理表３００を参照し、そこに登録されているプロセッサのみをシステムの運用に使うので、これにより応答を返さなかった実行プロセッサ６０１はシステムの運用から外れることになる。次にＳ６０３で「留保」プロセッサプール・マスタエントリ５００を参照し、登録されている「留保」プロセッサプール・マスタプロセッサ４００に対し、「留保」プロセッサプールにプロセッサが追加されたことを通知する。
【００３８】
こうして、異常の可能性のあるプロセッサを「正常」プロセッサプールから「留保」プロセッサプールに移行させ、システム運用から切り離す。従って、これら、異常の可能性のあるプロセッサに対して、正常系のシステム運用とは別の処理を行うことができる。
【００３９】
次に、切り離した後の「留保」プロセッサの再認識動作を説明する。
「留保」プロセッサプール・マスタプロセッサ４００は、図１４に示すように、運用時再確認処理９１０を持つ。運用時再確認処理９１０は、システム初期化時の「留保」プロセッサに対する再確認処理４２０に対応した、運用中の再確認処理である。運用時再確認処理９１０では、「留保」プロセッサプール内のプロセッサの復旧処理と正常／異常判定処理を行う。
【００４０】
運用時再確認処理９１０の動作を図１８を用いて説明する。図１８の例では、本動作は図１７のＳ６０３により起動される処理として記述している。
この処理ではまず、「留保」プロセッサプール・マスタプロセッサが「留保」プロセッサプール内のプロセッサに対してダミーの処理要求の発行を再試行して、そのプロセッサの正常／異常の判定を行う。まず、Ｓ７０１でダミー処理要求発行の再試行を指定回数行ったかを確認する。再試行が指定回数に達していない場合は、Ｓ７０２で「留保」プロセッサプール管理表３０１を参照し、登録されているプロセッサに対しダミー処理要求送信を行う。要求した処理が実行できるか確認することにより、プロセッサの正常／異常の判定を行うためである。Ｓ７０２ではシステムで指定された時間をタイムアウト時間として、要求受信確認の応答を待つ。Ｓ７０３でプロセッサからの要求受信確認を受信した場合は、該プロセッサは正常であると判断し、Ｓ７０４で再びそのプロセッサのプロセッサ識別子を「留保」プロセッサプール管理表３０１から外し、「正常」プロセッサプール管理表３００に登録して、システム運用に参入させる。
Ｓ７０３で応答が返らない場合は、Ｓ７０１に戻り処理要求発行の再試行を行う。Ｓ７０１で指定回数再試行を繰り返した場合は、そのプロセッサに何らかの異常があると判定し、Ｓ７０５でプロセッサを単体リセットし、プロセッサを初期状態に戻す。この後Ｓ７０６で図１２の処理を実行して、初期化処理の再試行を行って、プロセッサが恒久的異常であるか判定する。初期化処理再試行により、正常が確認された場合は「正常」プロセッサプールに再参入させる。最終的に正常が確認できなかった場合は、「異常」プロセッサプールに入れ、システム運用から切離す。
【００４１】
こうして、システム運用中に検出した異常の可能性のあるプロセッサに対して、システムから切離して処理要求の再試行や単体リセットを行って、プロセッサが恒久的異常であるかの判定が行える。従って、システムのリアルタイム性遅延の発生を回避できる。
【００４２】
実施の形態４．
本実施の形態では、「正常」プロセッサ数が十分ない場合の対処について説明する。
以下、本実施の形態のマルチプロセッサシステムの初期化処理方法では、図１９に示すように、システムマスタプロセッサ１００に、「正常」プロセッサプール最低プロセッサ数７００と、「正常」プロセッサプール・プロセッサ数確認処理１０１０を持たせる。即ち、「正常」プロセッサプールにシステムの運用に必要なプロセッサ数が足りなくなったという状態を検出できるようにする。
【００４３】
次に、この「正常」プロセッサプール・プロセッサ数確認処理１０１０の動作を図２０を用いて説明する。「正常」プロセッサプール・プロセッサ数確認処理１０１０は、図９の処理、または図１７の処理の後に起動すべき処理である。
まず、Ｓ８０１で「正常」プロセッサプール管理表３００に登録されているプロセッサ数をカウントする。この数をＳ８０２で「正常」プロセッサプール最低プロセッサ数７００と比較する。「正常」プロセッサプール最低プロセッサ数７００以上のプロセッサが「正常」プロセッサプール管理表３００に登録されている場合は、Ｓ８０３でシステム運用を継続し、「正常」プロセッサプール・プロセッサ数確認処理１０１０を完了する。
「正常」プロセッサプール最低プロセッサ数７００よりも「正常」プロセッサプール管理表３００に登録されているプロセッサが少ない場合は、Ｓ８０４でシステムが異常状態と判断し、定義されたシステム異常処理を実行する。システム異常処理としては、メッセージの出力、システム停止などが想定される。
【００４４】
こうして、「正常」プロセッサプールに登録されたプロセッサをカウントし、システム運用に最低限必要なプロセッサ数が不足する場合は、縮退運用などを指示して、異常対処を行うことができる。
【００４５】
次にシステム運用中の縮退運用を回避する対応方法を説明する。
このために図２１に示すように、システムマスタプロセッサ１００には、「留保」プロセッサプール・マスタプロセッサ交換処理１１１０を持つ。
【００４６】
「留保」プロセッサプール・マスタプロセッサ交換処理１１１０の動作を図２２を利用して説明する。
「留保」プロセッサプール・マスタプロセッサ交換処理１１１０は、システムマスタプロセッサ１００がシステム運用中に一定の時間周期で起動する処理である。まず、Ｓ９０１で「正常」プロセッサプール管理表３００から何らかのアルゴリズムによりプロセッサを一つ選定する。この選定の方法は、システムの要求仕様に合わせて、管理表内のシステムマスタプロセッサ、「留保」プロセッサプール・マスタプロセッサのどちらでもない最初のプロセッサを選ぶ方法や、特定のプロセッサ識別子を持つプロセッサを選択する方法など種々の方法がありえる。次にＳ９０２で「留保」プロセッサプール・マスタエントリ５００に登録されているプロセッサを「正常」プロセッサプール管理表３００に登録し、そのプロセッサを「留保」プロセッサプール・マスタプロセッサの役割から外し、システム運用の通常処理に参入させる。次にＳ９０３でＳ９０１で選定したプロセッサを「留保」プロセッサプール・マスタエントリ５００に登録し、新たな「留保」プロセッサプール・マスタプロセッサとする。こうして、今後の「留保」プロセッサプールの管理処理は、ここで新たに登録したプロセッサが行うことになる。この「留保」プロセッサプール・マスタプロセッサ交換処理１１１０を定期的に実行することにより、「留保」プロセッサプール・マスタプロセッサ４００に異常があった場合でも、定期的なマスタプロセッサの交換により、次の周期で復旧し、本発明で述べた処理を行うことができるようになる。従って、システムを恒久的な異常状態に陥らせない。また、「留保」プロセッサプール・マスタプロセッサ４００を、異常に気づかずにＳ９０２で「正常」プロセッサプールに参入させた場合でも、実施の形態３で述べた動作により、異常の可能性があるプロセッサとして「留保」プロセッサプールに移され、復旧処理を行わせることができる。
【００４７】
【発明の効果】
以上のようにこの発明によれば、システム初期化指令を行って所定時間後に応答を確認して正常プロセッサと判定するステップと、応答があった正常プロセッサによりシステム運用を開始するステップと、所定時間内に応答がないプロセッサを留保プロセッサとして、この留保プロセッサをシステム運用と並行して確認処理をするステップ、とを備えたので、システム立ち上げの効率化が図れる効果がある。
【００４８】
また更に、留保プロセッサの確認処理を、正常プロセッサの１つが行うようにしたので、早期システムの立ち上げとシステム運用には支障無くすることができる効果がある。
【００４９】
また更に、留保プロセッサの確認処理として、再初期化指令を行うようにしたので、異常プロセッサを確定して交換対象を容易に特定できる効果がある。
【図面の簡単な説明】
【図１】この発明の実施の形態１におけるマルチプロセッサシステムにおける初期化方法を説明するためのシステム構成を示す図である。
【図２】実施の形態１における正常完了動作を説明するシステム図である。
【図３】実施の形態１における動作を説明するためのグループ図である。
【図４】実施の形態１におけるマスタプロセッサが持つ構成機能を示す図である。
【図５】実施の形態１における初期化動作を示すフロー図である。
【図６】この発明の実施の形態２におけるプロセッサグループの区分を説明する図である。
【図７】実施の形態２におけるマスタプロセッサが持つ構成機能を示す図である。
【図８】実施の形態２におけるプロセッサプール・マスタ選定処理の動作を示すフロー図である。
【図９】実施の形態２における並行起動処理の動作を示すフロー図である。
【図１０】実施の形態２におけるマスタプロセッサが持つ他の構成機能を示す図である。
【図１１】実施の形態２におけるマスタプロセッサが行う動作を示す概念図である。
【図１２】実施の形態２における再確認処理の動作を示すフロー図である。
【図１３】この発明の実施の形態３におけるシステム運用時に起こる異常に対する処理概念を示す図である。
【図１４】この発明における他の各種処理機能を示す図である。
【図１５】実施の形態３におけるプロセッサ間処理の動作を示すフロー図である。
【図１６】実施の形態３におけるマスタプロセッサが持つ他の構成機能を示す概念図である。
【図１７】実施の形態３におけるプロセッサプール追加処理の動作を示すフロー図である。
【図１８】この発明の実施の形態３におけるシステム運用時に起こる異常に対する処理概念を示す図である。
【図１９】この発明の実施の形態４におけるシステム運用時のプロセッサ追加のためにシステムマスタプロセッサが持つ構成機能を示す図である。
【図２０】実施の形態４におけるプロセッサ数確認処理の動作フロー図である。
【図２１】実施の形態４における留保交換または補充の概念を説明する図である。
【図２２】実施の形態４におけるプロセッサ交換処理の動作フロー図である。
【図２３】従来のマルチプロセッサシステムの初期化方法を説明する図である。
【符号の説明】
Ｓ１０１起動指示ステップ、Ｓ１０２初期化完了通知確認ステップ、Ｓ１０４タイムアウト確認ステップ、Ｓ１０５留保プロセッサ登録ステップ、Ｓ２０１プロセッサ選定ステップ、Ｓ３０１再確認処理指示ステップ、Ｓ３０２システム運用開始ステップ、Ｓ４０１再確認処理開始ステップ、Ｓ４０２再初期化指示ステップ、Ｓ４０３再初期化完了通知確認ステップ、Ｓ４０４正常プロセッサへ戻すステップ、Ｓ４０５タイムアウト確認ステップ、Ｓ４０７異常プロセッサ登録ステップ、Ｓ５０１処理要求送信ステップ、Ｓ５０２タイムアウトまでの確認ステップ、Ｓ５０３受信確認ステップ、Ｓ５０４代替プロセッサ選択ステップ、Ｓ６０２留保プロセッサ登録ステップ、Ｓ７０１再認識処理開始ステップ、Ｓ７０２ダミー要求通信タイムアウトまでの確認ステップ、Ｓ７０３受信確認ステップ、Ｓ７０４正常プロセッサへ戻すステップ、Ｓ８０１正常プロセッサ数カウントステップ、Ｓ８０２所要数との比較ステップ、Ｓ８０３異常通知ステップ、Ｓ９０２留保プロセッサからの選択ステップ。
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to the initialization processing of an information processing system composed of multiple processors. In particular, it aims to shorten system startup time even when some of the processors constituting the system are abnormal, while at the same time handling those abnormalities to improve the reliability of the system.
[0002]
[Prior art]
As a restoration processing method when there is a failure at the time of initialization of a conventional multiprocessor system, for example, there is a method as disclosed in JP-A-11-161616. FIG. 23 is a diagram showing a part of the operation of the initialization processing method of the multiprocessor system disclosed in this publication.
[0003]
The operation of this method will be described.
When multiple processors constituting a multiprocessor system become abnormal during initialization, an initialization (IPL) program type ID for failure recovery is sent by broadcast communication so that the failed processors are recovered simultaneously, with the aim of shortening startup time even when failures occur. The start instruction is retried by broadcast communication for the processors thought to have failed, and processors that do not respond to the retried start instruction are disconnected; by this procedure the system initialization is completed.
Because multiple failed processors can retry startup at the same time, the system can be started faster than when startup is retried individually for each failed processor. However, because initialization is retried, system startup is still delayed compared with the normal case.
[0004]
[Problems to be solved by the invention]
The conventional initialization method for multiprocessor systems is configured as described above. To identify processors that return no response, the timeout is set to the case in which startup is delayed the most, so a long reinitialization time is required. Furthermore, because the method relies on broadcast communication over a communication path of inferior reliability, this initialization operation ends up being retried several times. The start of system operation may therefore be greatly delayed while normal/abnormal determination is carried out in the retried initialization. In a redundantly configured multiprocessor system, the system can usually operate even if some processors are abnormal; nevertheless, checking the state of the abnormal processors takes priority, and startup is slowed as a result.
[0005]
The present invention has been made to solve the above problems. At the start of system operation, only the processors that are needed are brought up; once system operation has started, the remaining required processors are brought in sequentially in later phases to improve reliability. Processor initialization and abnormality checking are thus performed in parallel with system operation.
[0006]
[Means for Solving the Problems]
 A multiprocessor initialization / concurrent diagnosis method according to the present invention is a method for the initialization processing of an information processing system constituted by a multiprocessor and for the processing that follows it, comprising:
 a step in which, at initialization of the system, a system master processor designated from among the processors constituting the multiprocessor issues a system initialization command to each of the processors, checks for a response from each processor within a predetermined time, determines each processor that responded to be a normal processor, and registers it in a normal processor pool;
 a step of starting operation of the system with the processors determined to be normal processors;
 a step in which the system master processor determines each processor that did not respond within the predetermined time to be a reserved processor and registers it in a reserved processor pool;
 a step in which the system master processor selects a reserved processor pool master from among the processors determined to be normal processors; and
 a step in which, in parallel with the operation of the system, the selected reserved processor pool master performs an operation check on the reserved processors registered in the reserved processor pool.
[0007]
 During system operation, each processor determined to be normal acts as a request processor or an execution processor, and the method comprises:
 a step in which the request processor issues a processing request to the execution processor;
 a step in which, when receipt of the execution result from the execution processor cannot be confirmed within a predetermined time for the processing request issued to that execution processor, the request processor notifies the system master processor of the non-response;
 a step in which the system master processor, on receiving the non-response notification, selects an unused processor registered in the normal processor pool as an alternative processor and makes it the execution processor; and
 a step in which the system master processor moves the non-responsive execution processor to the reserved processor pool.
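The substitution steps of paragraph [0007] can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the `send` callback (which returns `None` to model a timeout), and the `in_use` set that tracks busy processors are all hypothetical.

```python
def handle_non_response(normal, reserved, in_use, failed_id):
    """System master side: demote the silent executor, pick an unused substitute.

    normal / reserved are lists of processor ids standing in for the pools;
    in_use is an assumed set of ids currently acting as request/execution processors.
    """
    normal.remove(failed_id)       # take the non-responder out of the normal pool
    reserved.append(failed_id)     # and place it in the reserved pool
    in_use.discard(failed_id)
    for candidate in normal:
        if candidate not in in_use:
            in_use.add(candidate)  # the substitute becomes the execution processor
            return candidate
    return None                    # no unused normal processor is left


def request_with_failover(send, executor, normal, reserved, in_use):
    """Request processor side: one attempt, then hand over to a substitute."""
    result = send(executor)        # None models "no result within the timeout"
    if result is not None:
        return executor, result
    alternate = handle_non_response(normal, reserved, in_use, executor)
    if alternate is None:
        raise RuntimeError("no unused normal processor available")
    return alternate, send(alternate)
```

Note that the request is not retried against the silent processor; its diagnosis is left to the reserved pool, which keeps the worst-case delay of the request path bounded.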
[0008]
 The method further comprises: a step in which, in parallel with system operation, the selected reserved processor pool master performs an operation reconfirmation process on the processors determined to be reserved processors; and
 a step in which the system master processor determines a processor that returned a normal response in the reconfirmation process to be a normal processor and registers it in the normal processor pool.
[0009]
 The method further comprises: a step in which, in parallel with system operation, the selected reserved processor pool master performs an operation reconfirmation process on the processors determined to be reserved processors; and
 a step in which the system master processor determines a processor from which no normal response was obtained in the reconfirmation process to be an abnormal processor and registers it in an abnormal processor pool.
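Paragraphs [0008] and [0009] together describe a reconfirmation pass that ends with every reserved processor classified as either normal or abnormal. A minimal sketch follows, assuming a bounded number of retries (as in the S401 to S407 flow of the original description) and a hypothetical `probe` callback that returns `True` when a processor answers the retried initialization.

```python
def reconfirm(reserved, normal, abnormal, probe, max_retries):
    """Reserved-pool master: retry each reserved processor up to max_retries times.

    probe(proc_id) -> bool is an assumed callback; True means the processor
    returned a normal response to this retry.
    """
    for _ in range(max_retries):
        if not reserved:               # everyone has been confirmed normal
            break
        for proc in list(reserved):    # iterate over a copy while mutating
            if probe(proc):
                reserved.remove(proc)
                normal.append(proc)    # rejoin system operation
    # Whatever is still reserved after the last retry is treated as a permanent fault.
    abnormal.extend(reserved)
    reserved.clear()
```

Because this loop runs on the reserved processor pool master, the long timeouts and retries it needs do not delay the processors that are already operating.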
[0010]
 The method further comprises: a normal processor pool processor count check step in which the system master processor counts the number of normal processors registered in the normal processor pool;
 a step in which the system master processor compares the counted number of normal processors with the minimum number of processors necessary for system operation; and
 a step in which the system master processor performs system abnormality processing when the comparison shows that the number of normal processors has fallen below the minimum number of processors necessary for system operation.
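The count check of paragraph [0010] reduces to a single comparison. The sketch below is illustrative only; the `on_system_abnormal` callback is a hypothetical hook standing in for whatever system abnormality processing is defined (for example, a message or a system stop).

```python
def check_normal_pool(normal, minimum, on_system_abnormal):
    """Return True if operation may continue, False after invoking the abnormal handler."""
    count = len(normal)                     # count registered normal processors
    if count < minimum:                     # fewer than the required minimum
        on_system_abnormal(count, minimum)  # defined system abnormality processing
        return False
    return True                             # enough processors: continue operation
```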
[0011]
 The method further comprises: a step in which the system master processor selects a new reserved processor pool master from among the normal processors registered in the normal processor pool; and
 a step in which the system master processor registers the processor that was the reserved processor pool master before the selection as a normal processor.
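The master exchange of paragraph [0011] can be sketched as below. The dictionary `master_entry` is a hypothetical stand-in for the reserved processor pool master entry, and the "first processor that is not the system master" rule is only one of the selection policies the description allows.

```python
def swap_reserved_pool_master(normal, master_entry, system_master_id):
    """Rotate the reserved-pool master role to another normal processor.

    Assumes the current master is not listed in `normal`, since it was removed
    from the normal table when it was selected.
    """
    current = master_entry["id"]
    candidate = next((p for p in normal if p != system_master_id), None)
    if candidate is None:
        return current               # nobody to swap with; keep the current master
    normal.remove(candidate)
    normal.append(current)           # the old master returns to normal operation
    master_entry["id"] = candidate   # the new master takes over the reserved pool
    return candidate
```

If the old master was in fact faulty, it simply re-enters the normal pool here and is later demoted again by the non-response handling, so the rotation does not need to diagnose it.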
[0014]
DETAILED DESCRIPTION OF THE INVENTION
Embodiment 1.
In the initialization processing of a multiprocessor system, the constituent processors are divided into a “normal” processor pool of processors determined to be normal, a “reserved” processor pool of processors whose normal/abnormal status could not be determined, and an “abnormal” processor pool of processors determined to be abnormal. This embodiment describes a method in which the subsequent processing differs for each pool.
FIG. 1 is a block diagram showing the concept of the initialization processing method of the multiprocessor system in the first embodiment of the present invention. In the figure, reference numeral 101 denotes a processor constituting the system. The present embodiment assumes a system composed of a large number of processors in a redundant configuration. The processors are connected by a network 103 for exchanging data with one another. The network may be of various kinds, such as a bus, a LAN, or Fibre Channel. To realize a large-scale configuration, the network 103 may also have a hierarchical structure built with network switches, indicated by 102.
Reference numeral 100 denotes a system master processor selected to perform system management among the processors. The system master processor 100 can communicate with all other processors 101 via the network 103 and the network switch 102.
[0015]
In a multiprocessor system configured in this way, at system startup the system master processor 100 uses the technique shown in the conventional example to issue an initialization instruction, or an initial program download 104, to all processors simultaneously, thereby starting the system. Each processor 101 performs its initialization in accordance with that instruction.
FIG. 2 is a diagram illustrating a completion state of the initialization process. When the initialization processing is completed, each processor 101 returns a completion notification 105 to the system master processor 100 for system synchronization. For this completion notification, various methods such as a method by writing to a specific register of the system master processor 100 and a method by message transmission can be used. The system master processor 100 confirms that the initialization processing of each processor has been normally completed by receiving this completion notification.
[0016]
In the initialization processing method of the multiprocessor system according to the first embodiment of the present invention, in addition to the normal initialization operation of FIGS. 1 and 2, the processors constituting the system are classified into three logical groups, called processor pools, as shown in FIG. 3. Reference numeral 200 in FIG. 3 denotes the “normal” processor pool, to which processors whose normal operation has been confirmed belong. Reference numeral 201 denotes the “reserved” processor pool, to which processors whose normal/abnormal determination has not yet been completed belong. Reference numeral 202 denotes the “abnormal” processor pool, to which processors judged to be abnormal and unusable for system operation belong. These processor pools are logical classifications for system management; no physical change is made to the processors.
[0017]
In order to realize the grouping of FIG. 3, the components shown in FIG. 4 are prepared. Reference numeral 300 denotes a “normal” processor pool management table for managing processors belonging to the “normal” processor pool, in which identifiers of processors belonging to the “normal” processor pool are registered. Reference numeral 301 denotes a “reserved” processor pool management table for managing processors belonging to the “reserved” processor pool, in which identifiers of processors belonging to the “reserved” processor pool are registered. Reference numeral 302 denotes an “abnormal” processor pool management table for managing processors belonging to the “abnormal” processor pool, in which identifiers of processors belonging to the “abnormal” processor pool are registered. In the present embodiment, the system master processor 100 manages the processor pool management tables 300 to 302. Because the system is provided with processors 101 with sufficient redundancy, system startup is completed using the “normal” processors alone.
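As an illustration of the three management tables, the sketch below models them as sets of processor identifiers held by the system master. The class name `PoolTables` and its methods are hypothetical; the description only requires that each table store the identifiers of the processors in its pool.

```python
from dataclasses import dataclass, field


@dataclass
class PoolTables:
    """Assumed in-memory stand-in for management tables 300, 301 and 302."""
    normal: set = field(default_factory=set)     # table 300
    reserved: set = field(default_factory=set)   # table 301
    abnormal: set = field(default_factory=set)   # table 302

    def move(self, proc_id, src, dst):
        """Move an identifier between pools; purely logical, nothing physical changes."""
        getattr(self, src).remove(proc_id)
        getattr(self, dst).add(proc_id)

    def pool_of(self, proc_id):
        """Return the name of the pool holding proc_id, or None if unregistered."""
        for name in ("normal", "reserved", "abnormal"):
            if proc_id in getattr(self, name):
                return name
        return None
```

Because membership is just an identifier in a table, reclassifying a processor is a table update on the system master and does not require touching the processor itself.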
[0018]
Conventionally, the waiting time for receiving the completion notifications in FIG. 2 had to be set to a long value based on the worst anticipated conditions, so that the notifications could be received in any situation. In addition, to rule out transient abnormalities, the initialization instruction had to be retried when there was no response. In the initialization processing method of the first embodiment, by contrast, the processors constituting the system are classified into the processor pools shown in FIG. 3, and the subsequent system operation differs for each processor pool.
The waiting time for completion notifications can therefore be set short: even if a processor is healthy but could not return its completion notification within that time, the normal/abnormal check can be retried later, targeted only at the processors that failed to respond. Initialization thus becomes more efficient and startup time is shortened.
[0019]
Hereinafter, a specific identification and registration method of the “normal” processor will be described.
Once a “normal” processor is separated from other “reserved” processors or the like, fast initialization using these “normal” processors is possible.
FIG. 4 is a diagram showing the concept of the high-speed initialization process 410. The high-speed initialization process 410 performs the operations shown in FIGS. 1 and 2, but with the initialization timeout 304 set shorter than in the conventional method, and it finishes initialization within that set time.
[0020]
A specific operation flow of the high-speed initialization process 410 is shown in FIG.
First, in step S101 (the word “step” is omitted hereinafter), the system master processor 100 issues a start instruction to all the processors 101 constituting the system. Various methods can be used for the start instruction, such as a system reset, a start instruction by broadcast communication, or download of an initial program by broadcast communication. Then, in S102, the system master processor waits for initialization completion notifications from the processors 101. When an initialization completion notification is received from any processor in S102, the processor that returned the response is placed in the “normal” processor pool 200 in S103: it is treated as “normal” and its processor identifier is registered in the “normal” processor pool management table 300.
[0021]
While the processes of S102 and S103 are repeated, when the initialization process time-out time 304 is reached, the process proceeds to S105 through the condition determination of S104. Conventionally, it has been necessary to determine whether the processor is normal or abnormal in the processing up to this point. Therefore, it is necessary to sufficiently lengthen the initialization processing timeout time 304, and if no response is obtained, the process of S101 It was necessary to check the processor abnormality by retrying the start instruction. However, in the present embodiment, the processor that has not received the completion notification in S105 is put in the “reserved” processor pool 201. The method is performed by registering the identifier of the processor for which the completion notification has not been obtained in the “reserved” processor pool management table 301. Processors belonging to the “reserved” processor pool 201 are processors that have not been determined to be normal / abnormal, and will perform normal / abnormal determination again. On the other hand, the processors belonging to the “normal” processor pool 200 are processors that have already been determined to be normal, and are processors that can be used from the initial stage for system operation.
In this way, processors that could not be determined to be normal within the specified time are placed in a separate processor pool so that different processing can be applied to them later. A shorter response wait time than before can therefore be set, the initialization process no longer has to be retried before operation starts, and the system initialization time is shortened.
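The flow of S101 to S105 can be illustrated with a minimal Python sketch. It is not taken from the specification: the function name `fast_initialize`, the use of a `queue.Queue` to carry completion notifications, and the representation of the management tables 300 and 301 as lists are all assumptions made for illustration.

```python
import queue
import time


def fast_initialize(processor_ids, notifications, timeout_s):
    """Split processors into "normal" and "reserved" pools (S102 to S105).

    `notifications` is a hypothetical queue on which each processor's
    identifier arrives when it reports initialization complete.
    The start instruction of S101 is assumed to have been issued already.
    """
    normal = []                      # stands in for management table 300
    pending = set(processor_ids)     # processors that have not answered yet
    deadline = time.monotonic() + timeout_s   # stands in for timeout time 304

    while pending:
        remaining = deadline - time.monotonic()
        if remaining <= 0:           # S104: timeout reached
            break
        try:
            pid = notifications.get(timeout=remaining)   # S102: wait
        except queue.Empty:          # nothing more arrived before the timeout
            break
        if pid in pending:           # S103: register as "normal"
            pending.remove(pid)
            normal.append(pid)

    reserved = sorted(pending)       # S105: stands in for management table 301
    return normal, reserved
```

Note that no retry happens inside this function: a processor that misses the short deadline is simply handed to the reserved pool, which is what allows the timeout to be short.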
[0022]
Embodiment 2.
In the present embodiment, after the system is started by the “normal” processor group, a selection process in parallel with the system operation will be described.
Hereinafter, an initialization processing method for a multiprocessor system according to Embodiment 2 of the present invention will be described with reference to the drawings. First, as shown in FIG. 6, one of the processors belonging to the “normal” processor pool is selected and set as a “reserved” processor pool master 400 that manages the “reserved” processor pool. In other words, the processors belonging to the “normal” processor pool are the processors whose normal operation has already been confirmed, and it is considered that the operation for managing the processor pool can be performed. In this embodiment, in order to determine the “reserved” processor pool master 400, a “reserved” processor pool / master selection process 510 is provided in the system initialization process shown in the figure. Further, as shown in FIG. 7, a “reserved” processor pool master entry 500 having an identifier of “reserved” processor pool master 400 is provided in the system master processor 100.
[0023]
The operation of the “reserved” processor pool / master selection processing 510 will be described with reference to FIG. In the example of FIG. 8, this selection process is described as an operation performed following the operation of FIG.
In S201, the system master processor 100 refers to the “normal” processor pool management table 300 and selects one processor by some algorithm. The identifier of that processor is removed from the “normal” processor pool management table 300, and the processor is treated as the selected processor. Depending on the required specifications of the system, various selection methods are possible, such as selecting the first processor in the management table that is not the system master processor, or selecting a processor having a specific processor identifier. In S202, the identifier of the selected processor is registered in the “reserved” processor pool master entry 500. The system recognizes the processor registered in the “reserved” processor pool master entry 500 as the processor that manages the “reserved” processor pool.
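As a minimal sketch of S201 and S202, the following Python function uses the first of the selection methods mentioned above (the first table entry that is not the system master processor). The function name and the list representation of the management table 300 are assumptions for illustration; the returned value plays the role of the “reserved” processor pool master entry 500.

```python
def select_reserved_pool_master(normal_table, system_master_id):
    """Pick a pool master from the "normal" table (S201) and return it (S202).

    The selected identifier is removed from `normal_table` in place.
    Returns None if no processor other than the system master is available.
    """
    for pid in normal_table:
        if pid != system_master_id:
            normal_table.remove(pid)   # S201: remove from table 300
            return pid                 # S202: value for master entry 500
    return None
```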
[0024]
In this way, one processor from the processor pool that has already been determined to be normal is set as the master processor for the “reserved” processor pool. The “reserved” processor pool is thereby managed separately from the system master processor and can be operated independently of the processors in the “normal” processor pool.
[0025]
First, activation of parallel processing will be described.
To start this parallel processing, a parallel activation process 610, shown in FIG. 4, is newly included in the system initialization processing. The parallel activation process 610 causes normal system operation by the processors belonging to the “normal” processor pool 200 and the normal / abnormal determination of the processors belonging to the “reserved” processor pool 201, performed by the “reserved” processor pool master processor 400, to run simultaneously.
[0026]
The operation of the parallel activation process 610 will be described with reference to FIG. In the example of FIG. 9, this parallel activation process 610 is described as an operation performed following the operation of FIG.
In S301, the system master processor 100 uses inter-processor communication to instruct the “reserved” processor pool master processor 400 to start the reconfirmation process that determines whether each processor belonging to the “reserved” processor pool is normal or abnormal. The “reserved” processor pool master processor thereafter operates independently of the processors in the “normal” processor pool. In S302, the system master processor refers to the “normal” processor pool management table 300 and starts operating the system with only the processors registered there. At this point, the processors registered in the “normal” processor pool are only a part of the processors that constitute the system. However, the computation load is often small in the initial stage of system operation, and there are few system states in which all processors are required, so starting operation in this way is possible.
[0027]
In this way, system operation is started using the processors in the “normal” processor pool in parallel with processing such as the abnormality determination performed for the processors in the “reserved” processor pool, so the system startup time can be shortened.
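The parallel activation of S301 and S302 can be pictured with the following Python sketch, in which a thread stands in for the “reserved” processor pool master processor. The function name and the two callables `reconfirm` and `operate` are hypothetical; in the described system the two activities run on different processors, not on threads of one process.

```python
import threading


def parallel_activation(reconfirm, operate, normal_table, reserved_table):
    """Run reconfirmation and system operation side by side (S301, S302)."""
    # S301: instruct the pool master to start reconfirming reserved processors.
    worker = threading.Thread(target=reconfirm, args=(reserved_table,))
    worker.start()

    # S302: start operation with only the processors currently in table 300.
    result = operate(list(normal_table))

    # Joining here only lets the sketch finish cleanly; system operation
    # itself never waits on the reconfirmation.
    worker.join()
    return result
```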
[0028]
Next, the normal / abnormal reconfirmation processing for the “reserved” processors, performed in parallel with system startup, will be described.
For this reconfirmation, the selected “reserved” processor pool master 400 has a reconfirmation process 420 as shown in FIG. Further, this reconfirmation process operates on the processor pool management tables held by the system master processor 100. FIG. 11 shows the operations on the processor pool management tables in the reconfirmation process 420.
The selected “reserved” processor pool master 400 refers to the “reserved” processor pool management table 301 and issues an initialization retry request 500 to the registered processors by multicast communication, as in the conventional example, so that the initialization process is retried for the processors whose normality / abnormality has not been confirmed. For a processor that is finally confirmed to be normal by the retry, the processor identifier is removed from the “reserved” processor pool management table 301 and registered in the “normal” processor pool management table 300, which moves the processor to the “normal” processor pool 200. A processor moved to the “normal” processor pool enters system operation when the system master processor refers to the “normal” processor pool management table 300.
[0029]
For a processor that has not been confirmed to be normal by the retry, the processor identifier is removed from the “reserved” processor pool management table 301 and registered in the “abnormal” processor management table 302, which puts the processor in the “abnormal” processor pool 202. A processor in the “abnormal” processor pool 202 is judged to have a permanent failure, such as a hardware abnormality, and is disconnected from system operation. When the reconfirmation process 420 is completed, every processor of the “reserved” processor pool has moved to either the “normal” processor pool or the “abnormal” processor pool, and the “reserved” processor pool is empty.
[0030]
The detailed operation of the reconfirmation process 420 outlined above will be described with reference to FIG. In the example of FIG. 12, the reconfirmation process 420 is described as a process that is activated by the instruction of S301 in FIG.
The reconfirmation process 420 makes a reliable normality / abnormality determination by a method similar to the conventional one. First, in S401, it is checked whether the initialization request has been retried the number of times specified by the system. If the specified number of retries has not been reached, the “reserved” processor pool management table 301 is referred to in S402, and the registered processors are requested to retry the initialization process by means such as an individual reset or multicast communication. In S403, the process waits for an initialization process completion response. If there is a response, the processor identifier of the processor that returned the completion notification is registered in the “normal” processor pool management table 300 in S404.
The operations of S403 and S404 are repeated until the initialization processing timeout time is reached in S405. The timeout time in S405 differs from the initialization process timeout time 304 in the high-speed initialization process 410 of the second embodiment: it is the conventional long timeout, set to cover the case in which the initialization process is delayed.
[0031]
In S406, it is checked whether there is any unresponsive processor that has not returned an initialization process completion notification. An unresponsive processor can be identified by the fact that its processor identifier remains in the “reserved” processor pool management table. If there is an unresponsive processor, the process returns to S401, and the initialization process is retried from S402. If there is no unresponsive processor in S406, all the processors have been determined to be normal and moved to the “normal” processor group, so the reconfirmation process 420 is completed.
If the initialization process has been repeated the number of times specified by the system in S401, the process proceeds to S407. In S407, the processors that never returned the initialization process completion notification, that is, all the processors whose identifiers still remain in the “reserved” processor pool management table 301, are moved to the “abnormal” processor pool management table 302. A processor in the “abnormal” processor pool is judged to have a permanent abnormality.
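The retry loop of S401 to S407 can be sketched in Python as follows. The function name, the list representation of the three management tables, and the callable `retry_init` are assumptions for illustration; `retry_init` is taken to send the retry request (S402) and to return the set of identifiers that reported completion before the long timeout expired (S403 to S405).

```python
def reconfirm_reserved(reserved, normal, abnormal, retry_init, max_retries):
    """Move every reserved processor to "normal" or "abnormal" (S401 to S407).

    All three tables are modified in place.
    """
    attempts = 0
    # S401 checks the retry count; S406 checks whether anyone is still unresponsive.
    while reserved and attempts < max_retries:
        attempts += 1
        responded = retry_init(list(reserved))      # S402 to S405
        for pid in list(reserved):
            if pid in responded:                    # S404: confirmed normal
                reserved.remove(pid)
                normal.append(pid)

    # S407: whatever is still reserved is treated as permanently abnormal.
    abnormal.extend(reserved)
    reserved.clear()
```

After the call the reserved table is always empty, matching the statement that every reserved processor ends up in either the normal or the abnormal pool.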
[0032]
In this way, the “reserved” processor pool, which is handled in parallel with system operation, is used to determine normality / abnormality again by an initialization method similar to the conventional one. This makes it possible to determine whether a processor is normal or abnormal without hindering system operation and avoids a delay in system startup. In addition, a processor that is determined to be normal enters system operation by being added to the “normal” processor pool, and a processor that cannot be determined to be normal is disconnected from system operation as a permanent error by being added to the “abnormal” processor pool.
[0033]
Embodiment 3.
This embodiment describes how to handle an abnormality that occurs in an execution processor while the system of the first embodiment is operating, namely a method that does not hinder system operation and that can effectively retry the processor suspected of being abnormal.
FIG. 13 is a diagram showing the processing concept of the present embodiment. A multiprocessor system can be modeled as a system in which various processes proceed in parallel as a request processor issues processing requests to an execution processor through inter-processor communication. In FIG. 13, 600 is a request processor that issues a processing request, and 601 is an execution processor that executes processing based on the request. In the conventional technique, when no response is returned from the execution processor 601 to an issued request, the request processor 600 had to wait with the worst-case execution time of the process as the timeout time and then reissue the request in order to confirm a communication or processor abnormality. This determination takes a long time, so a system that requires real-time performance is at risk of malfunctioning.
In this embodiment, if a response is not returned from the execution processor 601 within a certain time, the request is not retried and the execution processor 601 is put in the “reserved” processor pool 201. Then, the request processor selects an alternative execution processor 602 to be an alternative to the execution processor, issues a request to it again, and continues processing. Thus, without retrying again and again, the next processor is used as an alternative.
Whether the execution processor 601 that did not return a response within the fixed time is normal or abnormal is determined as processing on the “reserved” processor pool, in parallel with system operation, using the mechanism described in the second embodiment.
[0034]
In order to provide the above functions, the request processor 600 has an inter-processor communication process 710 shown in FIG. The operation of the inter-processor communication process 710 will be described with reference to FIG.
First, in S501, the request processor 600 transmits the necessary processing request to the execution processor 601. Then, in S502, it waits for a reception confirmation from the execution processor until the specified timeout time, as in the conventional case. If the reception confirmation is received from the execution processor before the timeout in S503, the execution processor is normal and operation continues as in the conventional case. If the reception confirmation is not received in S503, then in S504, which is newly provided in the inter-processor communication process of this embodiment, the identifier of the execution processor 601 to which the processing request was issued is reported to the system master processor 100 as non-responsive. The system master processor 100 then performs the corresponding processing for the non-responsive execution processor 601. After notifying the system master processor in S504, the request processor 600 immediately selects, in S505, another processor in the “normal” processor pool as the alternative execution processor 602, sends it the same processing request, and continues processing.
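A minimal Python sketch of S501 to S505 follows. The function name and both callables are hypothetical: `send` is assumed to transmit the request and return True only if the reception confirmation arrives before the timeout, and `notify_no_response` stands in for the notification to the system master processor. Applying the same rule again when the alternative processor also fails to answer is an assumption of this sketch; the text describes only the first substitution.

```python
def request_with_failover(request, candidates, send, notify_no_response):
    """Send `request` to the first candidate that confirms reception.

    `candidates` lists the intended execution processor first, followed by
    alternatives drawn from the "normal" pool. No candidate is retried.
    Returns the identifier that accepted the request, or None.
    """
    for pid in candidates:
        if send(pid, request):        # S501 to S503: confirmed in time
            return pid
        notify_no_response(pid)       # S504: report to the system master
        # S505: fall through to the next processor in the normal pool
    return None
```

Because each candidate is tried exactly once, the worst-case delay is bounded by the number of candidates times the timeout, which is the predictable-time property the embodiment aims for.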
[0035]
In this way, the processing time when an execution processor does not return a response in inter-processor communication can be shortened. In the present embodiment, when the execution processor does not return a response, extra time is needed to issue the same processing request to the alternative execution processor. However, because the multiple retries used to detect an abnormality and the normal / abnormal determination of the execution processor are not performed at that point, processing continues within a predictable time even if a processor is abnormal.
[0036]
Next, the processing that turns a non-responsive processor into a “reserved” processor during system operation will be described.
The system master processor 100 has a “reserved” processor pool addition process 810 shown in FIG. In the “reserved” processor pool addition process 810, as conceptually shown in FIG. 16, the identifier of an execution processor that did not return a response to a processing request within a fixed time is moved from the “normal” processor pool management table 300 to the “reserved” processor pool management table 301. The processor is thereby temporarily removed from system operation so that the processor normal / abnormal confirmation process can be performed on it.
[0037]
The operation of the “reserved” processor pool addition processing 810 will be described with reference to FIG. In the example of FIG. 17, the operation is described as a process activated by S504 in FIG.
First, in S601, the processor identifier of the execution processor 601, identified as not having returned a response to the processing request within the predetermined time, is looked up in the “normal” processor pool management table 300 and removed. In S602, the processor identifier of the execution processor 601 is added to the “reserved” processor pool management table 301. Through S601 and S602, the execution processor 601 moves from the “normal” processor pool to the “reserved” processor pool. Since the system master processor 100 refers to the “normal” processor pool management table 300 and uses only the processors registered there for system operation, the execution processor 601 that did not return a response is excluded from system operation. In S603, the “reserved” processor pool master entry 500 is referred to, and the registered “reserved” processor pool master processor 400 is notified that a processor has been added to the “reserved” processor pool.
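S601 to S603 amount to moving one identifier between two tables and sending one notification, as the following Python sketch shows. The function name, the list representation of the tables, and the callable `notify_pool_master` are assumptions for illustration.

```python
def add_to_reserved_pool(pid, normal_table, reserved_table, notify_pool_master):
    """Move a non-responsive processor out of system operation (S601 to S603)."""
    if pid in normal_table:
        normal_table.remove(pid)       # S601: remove from table 300
    if pid not in reserved_table:
        reserved_table.append(pid)     # S602: add to table 301
    notify_pool_master(pid)            # S603: tell the pool master (entry 500)
```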
[0038]
In this way, the processor having the possibility of abnormality is transferred from the “normal” processor pool to the “reserved” processor pool and separated from the system operation. Therefore, processing different from normal system operation can be performed on these processors that may be abnormal.
[0039]
Next, the reconfirmation of a “reserved” processor after its separation will be described.
The “reserved” processor pool master processor 400 has an operation reconfirmation process 910 as shown in FIG. The in-operation reconfirmation process 910 corresponds, during system operation, to the reconfirmation process 420 performed for the “reserved” processors at system initialization. The in-operation reconfirmation process 910 performs recovery processing and normality / abnormality determination for the processors in the “reserved” processor pool.
[0040]
The operation of the operation reconfirmation process 910 will be described with reference to FIG. In the example of FIG. 18, this operation is described as a process activated by S603 of FIG.
In this process, the “reserved” processor pool master processor repeatedly issues a dummy processing request to the processors in the “reserved” processor pool and determines from the result whether each processor is normal or abnormal. First, in S701, it is checked whether issuing the dummy processing request has been retried the specified number of times. If the specified number of retries has not been reached, the “reserved” processor pool management table 301 is referred to in S702, and a dummy processing request is transmitted to the registered processor. The purpose is to determine whether the processor is normal or abnormal by checking whether the requested processing can be executed. Also in S702, a request reception confirmation response is awaited, with a time designated by the system as the timeout time. If a request reception confirmation is received from the processor in S703, the processor is determined to be normal. In S704, the processor identifier of the processor is removed from the “reserved” processor pool management table 301 and registered again in the “normal” processor pool management table 300, so that the processor re-enters system operation.
If no response is returned in S703, the process returns to S701 to retry issuing the processing request. If the retry has been repeated the specified number of times in S701, it is determined that the processor has some abnormality; the processor is reset individually in S705 and thereby returned to its initial state. Thereafter, the process of FIG. 12 is executed in S706, and the initialization process is retried to determine whether the processor is permanently abnormal. If normality is confirmed by retrying the initialization process, the processor re-enters the “normal” processor pool. If normality is ultimately not confirmed, the processor is placed in the “abnormal” processor pool and disconnected from system operation.
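The two-stage check of S701 to S706 (dummy requests first, then reset and re-initialization) can be sketched in Python as follows. The function name, the dictionary of pools, and the three callables are hypothetical: `send_dummy` is assumed to return True when the reception confirmation arrives in time, and `reinitialize` stands in for the whole FIG. 12 process, returning True if the processor is finally confirmed normal.

```python
def reconfirm_in_operation(pid, pools, send_dummy, reset, reinitialize, max_retries):
    """Decide whether a reserved processor is normal or abnormal (S701 to S706).

    `pools` maps "normal", "reserved" and "abnormal" to lists of identifiers.
    Returns True if the processor re-enters the normal pool.
    """
    # S701 to S703: retry the dummy request up to max_retries times,
    # stopping at the first confirmed reception.
    recovered = any(send_dummy(pid) for _ in range(max_retries))

    if not recovered:
        reset(pid)                     # S705: individual reset
        recovered = reinitialize(pid)  # S706: retry initialization (FIG. 12)

    pools["reserved"].remove(pid)
    # S704 on success; otherwise the processor is disconnected as abnormal.
    pools["normal" if recovered else "abnormal"].append(pid)
    return recovered
```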
[0041]
In this way, a processor that may be abnormal during system operation is disconnected from the system, and the processing request is retried or the processor is individually reset, so that it can be determined whether the processor is permanently abnormal. A delay that would impair the real-time performance of the system can therefore be avoided.
[0042]
Embodiment 4.
This embodiment describes a countermeasure for the case in which the number of “normal” processors is insufficient.
In the initialization processing method of the multiprocessor system of the present embodiment, as shown in FIG. 19, the system master processor 100 is provided with a “normal” processor pool minimum processor number 700 and a “normal” processor pool / processor number confirmation process 1010. This makes it possible to detect a state in which the “normal” processor pool holds fewer processors than are necessary for system operation.
[0043]
Next, the operation of the “normal” processor pool / processor count confirmation processing 1010 will be described with reference to FIG. The “normal” processor pool / processor number confirmation processing 1010 is processing to be started after the processing in FIG. 9 or the processing in FIG. 17.
First, in S801, the number of processors registered in the “normal” processor pool management table 300 is counted. This number is compared with the “normal” processor pool minimum processor number 700 in S802. If at least as many processors as the “normal” processor pool minimum processor number 700 are registered in the “normal” processor pool management table 300, system operation is continued in S803, and the “normal” processor pool / processor number confirmation process 1010 is completed.
If fewer processors are registered in the “normal” processor pool management table 300 than the “normal” processor pool minimum processor number 700, it is determined in S804 that the system is in an abnormal state, and the defined system abnormality processing is executed. Message output, system stop, and the like are assumed as the system abnormality processing.
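The comparison of S801 to S804 is small enough to state directly. In the following Python sketch the function name and the callable `on_system_abnormal` are assumptions; the callable stands in for whatever system abnormality processing (message output, system stop, and so on) the system defines.

```python
def check_normal_pool(normal_table, minimum, on_system_abnormal):
    """Check that enough "normal" processors remain (S801 to S804).

    Returns True if operation can continue.
    """
    count = len(normal_table)              # S801: count entries in table 300
    if count >= minimum:                   # S802: compare with minimum number 700
        return True                        # S803: continue operation
    on_system_abnormal(count, minimum)     # S804: defined abnormality processing
    return False
```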
[0044]
In this way, the number of processors registered in the “normal” processor pool is counted, and when it falls short of the minimum number of processors necessary for system operation, the abnormality can be dealt with, for example by instructing degraded operation.
[0045]
Next, a method for avoiding degraded operation during system operation will be described.
For this purpose, as shown in FIG. 21, the system master processor 100 has a “reserved” processor pool / master processor exchange process 1110.
[0046]
The operation of the “reserved” processor pool / master processor exchange process 1110 will be described with reference to FIG.
The “reserved” processor pool / master processor exchange process 1110 is a process that the system master processor 100 starts at a fixed period during system operation. First, in S901, one processor is selected from the “normal” processor pool management table 300 by some algorithm. Depending on the system requirements, various selection methods are possible, such as selecting the first processor in the management table that is neither the system master processor nor the “reserved” processor pool master processor, or selecting a processor with a specific processor identifier. Next, in S902, the processor registered in the “reserved” processor pool master entry 500 is registered in the “normal” processor pool management table 300; the processor is thereby relieved of the role of “reserved” processor pool master processor and enters normal processing. In S903, the processor selected in S901 is registered in the “reserved” processor pool master entry 500 and becomes the new “reserved” processor pool master processor. From then on, management of the “reserved” processor pool is performed by the newly registered processor. Because the “reserved” processor pool / master processor exchange process 1110 is executed periodically, even if the “reserved” processor pool master processor 400 has an abnormality, the master processor is replaced at the next period and the processing described in the present invention can again be performed. The system therefore does not fall into a permanently abnormal state. Further, even if the “reserved” processor pool master processor 400 is entered into the “normal” processor pool in S902 without its abnormality being noticed, the operation described in the third embodiment treats it as a processor that may be abnormal and moves it to the “reserved” processor pool, where it can be recovered.
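One exchange cycle (S901 to S903) can be sketched in Python as follows. The function name and the list representation of the management table 300 are assumptions. The sketch also assumes that the newly selected processor is removed from the “normal” table, by analogy with S201; the text of S901 does not state this explicitly.

```python
def exchange_pool_master(normal_table, current_master, system_master_id):
    """Rotate the reserved pool master once (S901 to S903).

    Returns the identifier to store in master entry 500. If no other
    processor is available, the current master is kept.
    """
    # S901: first processor in table 300 that is not the system master.
    candidate = next((p for p in normal_table if p != system_master_id), None)
    if candidate is None:
        return current_master

    normal_table.remove(candidate)        # assumed, by analogy with S201
    normal_table.append(current_master)   # S902: old master rejoins operation
    return candidate                      # S903: new value for entry 500
```

A caller would invoke this at a fixed period, for example from a timer, and write the return value back to the master entry each time.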
[0047]
[Effects of the invention]
As described above, the present invention includes a step of issuing a system initialization command and confirming responses within a predetermined time to determine the normal processors, and a step of starting system operation with the normal processors that responded; processors that do not respond within the predetermined time are treated as reserved processors and are subjected to a confirmation process in parallel with system operation. The system can therefore be started efficiently.
[0048]
Furthermore, since one of the normal processors performs the reserved processor confirmation process, early system startup and system operation are not hindered.
[0049]
Furthermore, since a reinitialization instruction is issued as the reserved processor confirmation process, abnormal processors are identified, which makes it easy to determine which processors must be replaced.
[Brief description of the drawings]
FIG. 1 is a diagram showing a system configuration for explaining an initialization method in a multiprocessor system according to Embodiment 1 of the present invention.
FIG. 2 is a system diagram illustrating a normal completion operation in the first embodiment.
FIG. 3 is a group diagram for explaining an operation in the first embodiment.
FIG. 4 is a diagram illustrating constituent functions of the master processor according to the first embodiment.
FIG. 5 is a flowchart showing an initialization operation in the first embodiment.
FIG. 6 is a diagram for explaining the division of processor groups in Embodiment 2 of the present invention.
FIG. 7 is a diagram illustrating constituent functions of a master processor according to the second embodiment.
FIG. 8 is a flowchart showing the operation of processor pool / master selection processing in the second embodiment.
FIG. 9 is a flowchart showing an operation of parallel activation processing in the second embodiment.
FIG. 10 is a diagram illustrating another constituent function of the master processor according to the second embodiment.
FIG. 11 is a conceptual diagram illustrating an operation performed by a master processor in the second embodiment.
FIG. 12 is a flowchart showing an operation of reconfirmation processing in the second embodiment.
FIG. 13 is a diagram showing a processing concept for an abnormality that occurs during system operation in Embodiment 3 of the present invention.
FIG. 14 is a diagram showing other various processing functions in the present invention.
FIG. 15 is a flowchart showing an operation of inter-processor processing in the third embodiment.
FIG. 16 is a conceptual diagram showing other constituent functions of the master processor in the third embodiment.
FIG. 17 is a flowchart showing the operation of processor pool addition processing in the third embodiment.
FIG. 18 is a diagram illustrating a processing concept for an abnormality that occurs during system operation according to Embodiment 3 of the present invention.
FIG. 19 is a diagram showing configuration functions of a system master processor for adding a processor during system operation according to Embodiment 4 of the present invention.
FIG. 20 is an operation flowchart of the processor number confirmation process in the fourth embodiment.
FIG. 21 is a diagram for explaining the concept of reservation exchange or replenishment in the fourth embodiment.
FIG. 22 is an operation flowchart of processor replacement processing in the fourth embodiment.
FIG. 23 is a diagram illustrating an initialization method of a conventional multiprocessor system.
[Explanation of symbols]
S101 start instruction step, S102 initialization completion notification confirmation step, S104 timeout confirmation step, S105 reserved processor registration step, S201 processor selection step, S301 reconfirmation processing instruction step, S302 system operation start step, S401 reconfirmation processing start step, S402 Reinitialization instruction step, S403 Reinitialization completion notification confirmation step, S404 step to return to normal processor, S405 timeout confirmation step, S407 abnormal processor registration step, S501 processing request transmission step, S502 confirmation step until timeout, S503 reception confirmation step , S504 alternative processor selection step, S602 reserved processor registration step, S701 re-recognition processing start step, S702 Confirmation step until dummy request communication timeout, S703 reception confirmation step, S704 step to return to normal processor, S801 normal processor number counting step, S802 comparison step with required number, S803 abnormality notification step, S902 selection step from reserved processor.

Claims

A multiprocessor initialization / parallel diagnosis method for the initialization process, and the processing that follows it, of an information processing system configured by a multiprocessor, the method comprising:
a step in which, at the time of initialization of the system, a system master processor set among the processors constituting the multiprocessor issues a system initialization command to each of the processors, confirms a response from each of the processors within a predetermined time, determines each processor that responded to be a normal processor, and registers it in a normal processor pool;
a step of starting operation of the system with the processors determined to be normal processors;
a step in which the system master processor determines each processor that does not respond within the predetermined time to be a reserved processor and registers it in a reserved processor pool;
a step in which the system master processor selects a reserved processor pool master from among the processors determined to be normal processors; and
a step in which, in parallel with the operation of the system, the selected reserved processor pool master confirms the operation of the reserved processors registered in the reserved processor pool.

The multiprocessor initialization / parallel diagnosis method according to claim 1, wherein, during system operation, the processors determined to be normal serve as request processors and execution processors, the method further comprising:
a step in which the request processor makes a processing request to the execution processor;
a step of notifying the system master processor of a non-response when confirmation is not obtained from the execution processor within a predetermined time in response to the processing request issued to the execution processor;
a step in which the system master processor, on receiving the notification of the non-response, selects an unused processor registered in the normal processor pool as an alternative processor and makes it the execution processor; and
a step in which the system master processor moves the non-responsive execution processor to the reserved processor pool.

The multiprocessor initialization / parallel diagnosis method according to claim 1, further comprising:
a step in which, in parallel with system operation, the selected reserved processor pool master performs an operation reconfirmation process on the processors determined to be reserved processors; and
a step in which the system master processor determines a processor from which a normal response was obtained through the reconfirmation process to be a normal processor and registers it in the normal processor pool.

The multiprocessor initialization / parallel diagnosis method according to claim 1 or claim 2, further comprising:
a step in which, in parallel with system operation, the selected reserved processor pool master performs an operation reconfirmation process on the processors determined to be reserved processors; and
a step in which the system master processor determines a processor from which no normal response was obtained through the reconfirmation process to be an abnormal processor and registers it in an abnormal processor pool.

The multiprocessor initialization / parallel diagnosis method according to claim 1, further comprising:
a normal processor pool / processor number confirmation step in which the system master processor counts the number of normal processors registered in the normal processor pool;
a step in which the system master processor compares the counted number of normal processors with the minimum number of processors necessary for system operation; and
a step in which the system master processor performs system abnormality processing when the comparison shows that the number of normal processors is smaller than the minimum number of processors necessary for system operation.

The multiprocessor initialization / parallel diagnosis method according to claim 1, further comprising:
a step in which the system master processor selects a new reserved processor pool master from among the normal processors registered in the normal processor pool; and
a step in which the system master processor registers the processor that was the reserved processor pool master before the selection as a normal processor.