JP2004062470A

JP2004062470A - Switching system of multiprocessor

Info

Publication number: JP2004062470A
Application number: JP2002219037A
Authority: JP
Inventors: Takeshi Koike; 小池　毅
Original assignee: NEC Engineering Ltd
Current assignee: NEC Engineering Ltd
Priority date: 2002-07-29
Filing date: 2002-07-29
Publication date: 2004-02-26
Anticipated expiration: 2022-07-29
Also published as: JP4072392B2

Abstract

PROBLEM TO BE SOLVED: To provide a spare-processor switching system for a multiprocessor information processing apparatus whose processors can continue operating after a faulty part is cut off by a partial degradation function, such that exhaustion of spare processors is minimized even when faults occur in a plurality of processors and system performance is maintained as far as possible.
SOLUTION: The system is provided with a standby processor 5 to be switched in when a fault occurs in operating processors 1-4, and with a means for quantifying the performance loss caused by the faulty parts of the operating processors and the standby processor. A diagnostic processor 6 compares the performance of the operating processor after degradation of its faulty part with the performance of the standby processor after degradation of its faulty parts, and decides whether to incorporate the standby processor into the system.
COPYRIGHT: (C)2004,JPO

Description

【０００１】
【発明の属する技術分野】
本発明は、情報処理装置の障害処理方式に関し、特にマルチプロセッサシステムにおける予備プロセッサの切り替え方式の改良に関する。
【０００２】
【従来の技術】
従来の予備プロセッサシステムの一例が、ＩＢＭ（インターナショナル・ビジネス・マシーン）社のＳ／３９０シリーズで採用されており、１９９９年５月２７日同社発行の技報「ＩＢＭ　Ｊｏｕｒｎａｌ　ｏｆ　Ｒｅｓｅａｒｃｈ　ａｎｄ　Ｄｅｖｅｌｏｐｍｅｎｔ」のＶｏｌ．４３，Ｎｏｓ．５／６，１９９９“ＲＡＳ　ｓｔｒａｔｅｇｙ　ｆｏｒ　ＩＢＭ　Ｓ／３９０　Ｇ５　ａｎｄ　Ｇ６”（８８０ページ〜８８３ページまで）に記載されている。
【０００３】
ＩＢＭ　Ｓ／３９０システムの場合、中央処理装置と入出力処理を行う支援処理機構および内部結合機構が、予備切り替え可能なプロセッサとして冗長な構成となっており、これらの装置に障害が発生した場合、動的（「システムを停止することなく」の意）またはパワーオンリセットによって予備プロセッサへの切り替えを行い、システム性能を低下させることなく、引き続き運用を継続させることを特徴としている。
【０００４】
一般的に、汎用コンピュータのような基幹業務サーバにおいては、販売の形態としてレンタル契約するケースが殆どで、中央処理装置および入出力装置の性能と使用時間の積で価格が決定される。従って、課金方法の性質からもシステムとしての性能維持が装置に期待される要件となる。
【０００５】
また、汎用コンピュータ上で稼働するシステムも、銀行の勘定系に代表されるように、極めて高い信頼性、高い可用性を要求する業務が中心であるため、オンライン中の業務を中断させることなく、速やかに故障箇所を回復させるための冗長機能を装備する必要がある。
【０００６】
予備プロセッサ方式は、上述の販売形態および運用形態に則した汎用コンピュータ独特の機構であり、一般のＰＣサーバのようにプロセッサの最大処理性能（通常は動作周波数）で価格が決定されるような装置、あるいは業務を複数のサーバに分散させることで耐故障性を維持するような装置には存在しない機構である。
【０００７】
ＩＢＭ　Ｓ／３９０システムの予備プロセッサ方式は、プロセッサを最小単位とする予備切り替え方式である。汎用コンピュータの一構成例を示した図１を参照すると、この従来システムにおいて演算プロセッサ１〜５は、システムバスを介してマルチプロセッサ方式で接続されている。演算プロセッサ１〜５のうち何台かは、運用中の演算プロセッサに障害が発生した際に切り替えられるべき予備プロセッサとして、システム内に待機している。例えば、この構成例において、演算プロセッサ５が予備プロセッサとしてシステムから切り離されて待機しているものとすると、演算プロセッサ１〜４は運用中のプロセッサとなり、システムに障害が発生していない状態では、この演算プロセッサ１〜４の４台でシステム運用が継続されている。
【０００８】
いま、演算プロセッサ１で障害が発生したケースを例にあげると、従来システムでは演算プロセッサ１をシステムから切り離し、替わりに予備として待機している演算プロセッサ５をシステムに組み込むように処理する。
【０００９】
これに対しＩＢＭ　Ｓ／３９０システムでは、故障した演算プロセッサ１を停止させた後、演算プロセッサ１から制御レジスタやキャッシュの内容を抜き取り、予備プロセッサ５に移し替えた後、予備プロセッサ５を動作させることによって、予備プロセッサ切り替えを実現している。
【００１０】
上述の文献において、制御レジスタやキャッシュの内容を移行させる機構として、サポート・エレメント（ＳＥ）という用語を使用しているが、本構成例ではサポート・エレメントを具体化するために、予備プロセッサ切り替えの動作を指示する機構として、診断プロセッサ６を定義する。
【００１１】
【発明が解決しようとする課題】
しかしながら、上述したプロセッサ単位での予備切り替えを行う従来システムの場合には、次のような問題がある。
【００１２】
まず、部分的に縮退可能な機能を有するプロセッサで構成された予備プロセッサシステムでは、プロセッサの特定の部位が故障し性能低下が発生した場合に、故障プロセッサの障害部位が、プロセッサの継続運用不可能な状態まで拡大してから予備プロセッサに切り替わる方式と、ＩＢＭ　Ｓ／３９０システムのように性能低下が発生した時点で即座に予備プロセッサに切り替わる方式の２種類が考えられる。
【００１３】
前者の場合、重度の障害となるまで故障したプロセッサはシステムに組み込まれた状態でいるため、システムとしての性能低下状態が継続し、システム性能を可能な限り保持するという汎用コンピュータの要件を満たすことが困難となる。
【００１４】
また、後者の場合、性能低下が発生した時点で即座に予備プロセッサに切り替わるため性能低下状態は速やかに回復するが、軽度の障害が発生しただけで予備プロセッサを使用してしまうため、予備プロセッサの台数が少ないシステムで多重障害が発生した場合には、切り替えるべき予備プロセッサを確保できない状態に陥る。
【００１５】
本発明の目的は、部分的な縮退機能により故障部位を切り離して継続運用可能なプロセッサを有するマルチプロセッサ方式の情報処理装置における予備プロセッサの切り替え方式に関し、複数のプロセッサで障害が発生した場合でも予備プロセッサの枯渇を最小限に抑え、システムとしての性能を可能な限り維持する方式を提供することにある。
【００１６】
【課題を解決するための手段】
以上の課題を鑑みて、本発明のマルチプロセッサ切り替え方式は、マルチプロセッサ方式の情報処理装置において、通常時に運用される少なくとも１台の運用プロセッサと、前記運用プロセッサの障害発生時に切り替えて用いられる予備プロセッサと、前記障害が発生した運用プロセッサの故障部位を縮退させる手段と、前記運用プロセッサの故障部位縮退による性能低下量を数値化する手段と、前記故障部位縮退による性能低下量を考慮した前記運用プロセッサの単体性能と前記予備プロセッサの単体性能とを比較する手段を有することを特徴としている。
【００１７】
また、本発明のマルチプロセッサ切り替え方式の別の構成例では、前記運用プロセッサで障害が発生した際に、前記運用プロセッサの故障部位縮退後の単体性能と前記予備プロセッサの単体性能とを比較した結果、前記予備プロセッサの単体性能の方が前記運用プロセッサの故障部位縮退後の単体性能より小さいか或いは等しい場合は、前記運用プロセッサを継続して運用し、前記予備プロセッサの単体性能が前記運用プロセッサの故障部位縮退後の単体性能より大きい場合は、前記予備プロセッサをシステムに組み込んだ後、故障した前記運用プロセッサをシステムから切り離して新たな予備プロセッサとして待機させることを特徴とする。
【００１８】
さらに別の例では、複数の運用プロセッサ及び複数の予備プロセッサを有するマルチプロセッサ方式の情報処理装置において、システム内の全運用プロセッサ及び予備プロセッサの故障部位による性能低下量を算出する手段を有し、前記複数の運用プロセッサのいずれかにおいて障害が発生した際に、本来システムに障害がない状態の全運用プロセッサ単体性能の合計より等しいか或いは大きくなるまで、前記複数の予備プロセッサから単体性能の大きい順に順次システムに組み込むことを特徴としている。
【００１９】
さらに最後の構成例としては、システムに組み込まれている前記複数の運用プロセッサの中から、最も値が小さいものを除いた単体性能の合計が、本来システムに障害がない状態における全運用プロセッサの単体性能の合計より大きいか或いは等しい場合は、前記運用プロセッサの中で最も単体性能が小さいプロセッサを順次システムから切り離して、予備プロセッサとして待機させることを特徴としている。
【００２０】
【発明の実施の形態】
以下、本発明の実施の形態について、図面を参照して詳細に説明する。
【００２１】
図１に示すとおり、複数台の演算プロセッサ１〜５と、診断プロセッサ６で構成される。演算プロセッサ１〜５は、通常の業務処理を行うプロセッサと予備プロセッサに大別され、予備プロセッサは通常はシステムから切り離されて待機状態でいる。
【００２２】
図２は本発明における演算プロセッサ１〜５の機能を表した詳細ブロック図である。構成制御ユニット１１は、診断プロセッサ６と診断バスを介して接続されており、診断プロセッサ６からの指示で縮退可能な内部ユニットの組み込み／切り離しを制御する。演算ユニット１２は演算処理を実行する。命令キャッシュユニット（＃０）１３および命令キャッシュユニット（＃１）１４、データキャッシュユニット（＃０）１５およびデータキャッシュユニット（＃１）１６、変換ルックアサイドバッファ１７は、演算プロセッサ１の命令列／データ列のキャッシングおよびアドレス変換を行うユニットであり、構成制御ユニット１１の制御下において、このユニット単位で縮退を行う。これにより、演算プロセッサ１〜５は本来の性能値から一定の性能低下を起こした状態で継続運転することが可能である。バスインタフェースユニット１８は、システムバスとインタフェースを持ち、各演算プロセッサ１〜５の間でデータ交換を行う。演算ユニット１２とバスインタフェースユニット１８は、演算プロセッサ１〜５の核となる部分で、上述の演算ユニット１２およびバスインタフェースユニット１８が故障した場合には、プロセッサの継続運転は不可能となり、演算プロセッサの切り離しが実行される。
【００２３】
続いて、本発明の動作につき図面を参照して詳細に説明する。
【００２４】
本発明の第一形態では、部分的な縮退機能により故障部位を切り離して、継続運用可能なプロセッサを有するマルチプロセッサ方式の情報処理装置において、運用プロセッサの障害発生時に切り替えるべき予備プロセッサと、運用プロセッサおよび予備プロセッサの故障部位による性能低下量を数値化する手段とを具備し、運用プロセッサで障害が発生した際に、その故障部位が縮退した後の単体性能と、予備プロセッサの故障部位縮退後の単体性能とを比較し、予備プロセッサの単体性能の方が運用プロセッサのそれより大きい場合に予備プロセッサをシステムに組み込む。その後、故障した運用プロセッサをシステムから切り離して新たな予備として待機させる。これに対し、予備プロセッサの単体性能の方が運用プロセッサの単体性能より小さいあるいは等しい場合には、予備プロセッサをシステムに組み込まず、運用プロセッサの故障部位を縮退して運用を継続する。
【００２５】
図３はこの一連の動作を示すフローチャートである。演算プロセッサ１〜４を通常の運用プロセッサ、演算プロセッサ５を予備プロセッサとし、演算プロセッサ１で障害が発生した場合について説明する。
【００２６】
この演算プロセッサ１で発生した故障は（Ｓ３−１）、診断プロセッサ６に通知され、故障の部位から縮退可能な範囲と縮退による性能低下量が計算される。また、診断プロセッサ６は演算プロセッサ１が縮退運転可能と判断された場合に、演算プロセッサ１内の故障部位を縮退する（Ｓ３−２）。
【００２７】
次に診断プロセッサ６は、予備として待機している演算プロセッサ５の性能低下量を計算するが、この際演算プロセッサ５には故障箇所がないので、図３の条件判定Ａにより、「障害が発生したプロセッサ１の単体性能＜予備の演算プロセッサ５の単体性能」となり（Ｓ３−３）、予備の演算プロセッサ５をシステムに組み込んだ後（Ｓ３−４）、障害が発生した演算プロセッサ１をシステムから切り離して、新たな予備プロセッサとして待機させる（Ｓ３−５）。
【００２８】
続いて、演算プロセッサ２で演算プロセッサ１よりも軽度の障害が発生した場合について説明する。診断プロセッサ６は、障害が発生した演算プロセッサ２と、予備として待機している演算プロセッサ１の性能低下量を計算し、両者の単体性能を比較する。この場合は、図３の条件判定Ａにより「障害が発生したプロセッサ２の単体性能＞＝予備の演算プロセッサ１の単体性能」となるため、予備として待機している演算プロセッサ１をシステムに組み込むことはせず、障害が発生した演算プロセッサ２の故障部位を縮退させ、故障した演算プロセッサ２の運用を継続する。
【００２９】
本発明の第二の形態では、第一の形態に加えてさらに運用プロセッサで障害が発生した際に、システム内の全ての運用プロセッサおよび予備プロセッサの故障部位による性能低下量を数値化する手段を具備し、障害が発生した運用プロセッサの故障部位を縮退させた後、予備プロセッサをシステムに組み込んだ結果、システム性能が本来システムに障害がない状態の時の全ての運用プロセッサの単体性能の総和より大きくなるか等しくなるまで、複数台ある予備プロセッサを「単体性能の大きい順」にシステムに組み込んだ後、システムに組み込まれている運用プロセッサの中で最も単体性能の小さいプロセッサを除く運用プロセッサの総和が、本来システムに障害がない状態における全ての運用プロセッサの単体性能総和より大きいあるいは等しい期間、運用プロセッサの中で最も単体性能の小さいプロセッサを順次システムから切り離して、予備として待機させる。
【００３０】
図４は本発明の第二形態の動作を示すフローチャートである。図１において演算プロセッサ１〜３を通常の運用プロセッサ、演算プロセッサ４および５を予備プロセッサとし、演算プロセッサ１で障害が発生した場合を例に挙げると、演算プロセッサ１の故障（Ｓ４−１）はただちに診断プロセッサ６に通知され、これを受け診断プロセッサ６は演算プロセッサ１の故障部位を縮退させる（Ｓ４−２）。
【００３１】
また診断プロセッサ６は、システム内の全ての演算プロセッサ１〜５の単体性能を計算し（Ｓ４−３）、このうち障害が発生した演算プロセッサ１を含めた全ての運用中プロセッサ１〜３の単体性能の総和と、本来システムに障害がない状態における全ての運用プロセッサの単体性能の総和とを比較する（Ｓ４−４）。この結果、図４の条件判定Ａにより「本来システムに障害がない状態における全ての運用プロセッサの単体性能の総和＞障害発生後の運用プロセッサの単体性能の総和」となる。診断プロセッサ６は、予備プロセッサ４または予備プロセッサ５の何れかを（予備プロセッサ４および５は故障のない健全なプロセッサであり、単体性能が等しいため）システムに組み込む（Ｓ４−５）。
【００３２】
ここで仮に予備プロセッサ４がシステムに組み込まれたとすると、システム上で稼働している運用プロセッサは１〜４の４台となる。
【００３３】
次に、図４の条件判定Ｂにより、「システム本来の総合性能の総和と、最も単体性能の小さいプロセッサを除いた運用中プロセッサの単体性能の総和」とを比較する（Ｓ４−６）。この場合、障害が発生した演算プロセッサ１をシステムから取り除いても、システムの総合性能は本来障害がない状態における全ての運用プロセッサの単体性能の総和と等しくなるため、診断プロセッサ６は障害が発生した演算プロセッサ１をシステムから切り離し、新たな予備プロセッサとして待機させる（Ｓ４−７）。
【００３４】
続いて、複数台の予備プロセッサをシステムに組み込む場合の動作につき説明する。この動作は、何れか２台以上の予備プロセッサの単体性能の合計が、運用中に故障したプロセッサの性能低下量を補えない場合に発生する。
【００３５】
演算プロセッサ１および２が順次故障し、予備の演算プロセッサ４および５が既にシステムに組み込まれているものとする。この際、演算プロセッサ１および２は新たな予備として待機状態となる。複数台の予備の演算プロセッサがシステムに組み込まれる条件は、この後に運用中の演算プロセッサ３、４、５のうちの何れかが故障し、なおかつ故障した演算プロセッサの単体性能の低下量が、演算プロセッサ１および２の単体性能の合計よりも大きい場合である。すなわち、ここで演算プロセッサ３が故障したとすると、図４に示すフローチャートに従い、運用中の演算プロセッサ３〜５の単体性能の合計と、システム全体で障害なく運用されている場合の演算プロセッサの単体性能合計とが比較される。その不足分は予備として待機している演算プロセッサ１の組み込みにより補われる。またそれでもシステムの総合性能が本来システムに障害がない状態における全ての運用プロセッサの単体性能の総和に満たない場合は、不足分の性能を演算プロセッサ２の組み込みによってさらに補完する。この結果、最終的にはシステムの構成は演算プロセッサ１〜５の合計５台がシステムに組み込まれて稼動している状態となる。
【００３６】
さらに図４の条件判定Ｂについて説明する。もっとも単体性能が小さいプロセッサを切り離し予備として待機させる動作において、予備プロセッサが組み込まれることにより、組み込み後のシステムの総合性能が本来システムに障害のない状態での全ての運用プロセッサの単体性能の総和を超過しており、さらに超過分の性能に比べて単体性能の小さい運用中プロセッサがシステム内に存在しているようなケースにおいて発生する。すなわち、前記の複数台の予備プロセッサをシステムに組み込む場合の動作例を参照すると、演算プロセッサ３の故障で演算プロセッサ１〜５の５台がシステムに組み込まれたときに、演算プロセッサ３の故障部位縮退後の単体性能が、他の演算プロセッサ１、２、４、５の単体性能の合計から、本来システムに障害がない状態での全ての運用プロセッサの単体性能の総和を引いた性能より低くなる場合、演算プロセッサ３の切り離しが行われる。この際、切り離された演算プロセッサ３は、新たな予備プロセッサとして待機状態となる。
【００３７】
【実施例】
次に、本発明の一実施例について図面を参照して説明する。図１は本発明の一実施例を示すシステム構成ブロック図である。図２は演算プロセッサの詳細ブロック図であり、図５は図２に示した演算プロセッサにつき、縮退可能な機能である命令キャッシュユニット（０）１３および命令キャッシュユニット（１）１４、データキャッシュユニット（０）１５およびデータキャッシュユニット（１）１６、変換ルックアサイドバッファ１７の各々の縮退時の性能低下量を表した対応表である。図５において、演算プロセッサ１の命令キャッシュユニット１３および１４は「縮退時に２０％の性能低下」を、データキャッシュユニット１５および１６は「縮退時に４０％の性能低下」を、変換ルックアサイドバッファ１７は「縮退時に８０％の性能低下」をそれぞれ発生するものとする。
【００３８】
なお、説明を簡易にするため、これらの機能部は縮退を発生しても互いに他の機能部の性能に影響を及ぼさないものとする。ただし実際のシステムでは複数段のキャッシュやアドレス変換機能が縮退した場合に、他の機能部の性能に影響を及ぼすことは大いにあり得るので、縮退による性能低下量はこれらを加味した計算式で導かれた値でなければならない。また、障害発生により切り離される演算プロセッサの台数に比べて、組み込まれる予備プロセッサの台数の方が多い場合もあり得るため、このような場合には運用中のプロセッサの増減によるマルチプロセッサ係数（ＭＰ係数と称す）も性能低下量を試算する場合のパラメータとして加味されなければならない。
【００３９】
次に、図３を用いて本発明の動作を示すフローチャートの処理を、そして図６を用いて本発明の動作遷移状態を詳細に説明する。
【００４０】
演算プロセッサ１〜４を通常の運用プロセッサ、演算プロセッサ５を予備プロセッサとし、障害がない状態の各演算プロセッサの単体性能を１００とすると、図６の手順１において運用中のプロセッサの単体性能総和（総合性能）は４００となる。
【００４１】
続いて手順２において、演算プロセッサ１のデータキャッシュ（０）１５で障害が発生したとする。この障害については診断バスを介して診断プロセッサ６にただちに通知される。診断プロセッサ６は、データキャッシュ（０）１５が縮退可能な部位であることから、演算プロセッサ１内の構成制御ユニット１１に指示を行い、データキャッシュ（０）１５を縮退させる。また、診断プロセッサ６は図５に示す性能低下量より、データキャッシュ（０）１５の性能低下量が４０であることを判断し、演算プロセッサ１の単体性能が６０に低下したことを認識する。この状態ではシステムの総合性能は３６０となる。
【００４２】
手順３において、診断プロセッサ６は予備として待機している演算プロセッサ５の性能低下量を計算するが、このとき演算プロセッサ５には故障箇所はないので単体性能は１００である。次に、図３の動作処理フローにおける条件判定Ａより「障害が発生した演算プロセッサ１の単体性能＜予備の演算プロセッサ５の単体性能」となるため、予備の演算プロセッサ５をシステムに組み込む。
【００４３】
手順４において、データキャッシュ（０）１５が故障した演算プロセッサ１は、システムから切り離され新たな予備プロセッサとして待機状態になる。この状態におけるシステムの総合性能は障害が発生する前と同じ４００に回復する。
【００４４】
手順５では、予備として待機している演算プロセッサ１よりも軽度の障害が演算プロセッサ２に発生した場合を例に説明する。ここでは演算プロセッサ２の命令キャッシュ（０）１３が故障したものとする。演算プロセッサ２の命令キャッシュ（０）１３の故障は、前述の演算プロセッサ１の故障時と同様、診断プロセッサ６に通知される。診断プロセッサ６は図５の性能低下量より演算プロセッサ２の命令キャッシュ（０）１３の性能低下量が２０％であることから、演算プロセッサ２の単体性能が８０に低下したことを認識する。続いて、図３の条件判定Ａにより、障害を発生した演算プロセッサ２と、予備として待機している演算プロセッサ１の単体性能を比較する。だが今度は「障害を発生したプロセッサ２の単体性能＞＝予備の演算プロセッサ１の単体性能」となるため、予備として待機している演算プロセッサ１をシステムに組み込むことはせず、障害を発生した演算プロセッサ２の故障部位を、演算プロセッサ２の構成制御ユニット１１に通知して縮退させ、演算プロセッサ２をシステムに組み込んだ状態で運用を継続させる。この状態におけるシステムの総合性能は３８０となる。
【００４５】
次に、図４を用いて本発明の第二実施例の動作を示すフローチャートの処理を、そして図７を用いて本発明の第二実施例の動作遷移状態を詳細に説明する。
【００４６】
演算プロセッサ１〜３を通常の運用プロセッサ、演算プロセッサ４および５を予備プロセッサとし、障害がない状態の各演算プロセッサの単体性能を１００とすると、図７の手順１において第二実施例のシステムでは運用中のプロセッサの単体性能の総和（総合性能）は３００となる。
【００４７】
続いて手順２において、演算プロセッサ１の命令キャッシュ（０）１３で障害が発生したとすると、演算プロセッサ１の命令キャッシュ（０）１３の故障は診断バスを介して診断プロセッサ６に通知される。診断プロセッサ６は、命令キャッシュ（０）１３が縮退可能な部位であることから、演算プロセッサ１内の構成制御ユニット１１に指示し、演算プロセッサ１の命令キャッシュ（０）１３を縮退させる。また、診断プロセッサ６は図５に示す性能低下量より、予備プロセッサも含めてシステム内に存在する全演算プロセッサ１〜５の単体性能を計算する。命令キャッシュ（０）１３の性能低下量が２０％であることから、演算プロセッサ１は単体性能が８０に低下する。また演算プロセッサ１以外の演算プロセッサは、故障がないため単体性能は１００のままであり、システムの総合性能は２８０となる。
【００４８】
手順３において、診断プロセッサ６は図４の動作処理フローチャートにおける条件判定Ａにより、障害発生後のシステム内の全運用プロセッサ１〜３の単体性能の総和と、本来システムに障害のない状態における全運用プロセッサの単体性能の総和とを比較する。その結果「本来システムに障害がない状態の全運用プロセッサの単体性能の総和＞障害発生後の運用中プロセッサの単体性能の総和」となり、診断プロセッサ６は予備として待機している演算プロセッサ４あるいは５のうち、単体性能の大きい方をシステムに組み込む。この第二実施例では、演算プロセッサ４と５はいずれも故障のない健全なプロセッサであり、単体性能は共に１００であるため、ここでは演算プロセッサ４を予備としてシステムに組み込むものとする。これによりシステムに組み込まれている運用中のプロセッサは１〜４の４台となり、システムの総合性能は３８０となる。ここで、予備プロセッサ組み込み後の図４の動作処理フローチャートにおける条件判定Ａより「本来システムに障害がない状態での全運用プロセッサの単体性能の総和＞障害発生後の運用プロセッサの単体性能の総和」という条件を満たさなくなるため、次の処理に移る。
【００４９】
手順４では、図４の動作処理フローチャートにおける条件判定Ｂにより、システム内でもっとも単体性能の小さい演算プロセッサの切り離し条件が判定される。この際、運用中の演算プロセッサ１〜４の中で最も単体性能の小さい演算プロセッサは、命令キャッシュ（０）１３が故障した演算プロセッサ１である。もしここで演算プロセッサ１をシステムから取り除いても、システムの総合性能は本来システムに障害がない状態における全ての運用プロセッサの単体性能の総和３００と等しくなるから、条件判定Ｂより「本来システムに障害がない状態の全運用プロセッサの単体性能の総和＜＝最も単体性能の小さい演算プロセッサを除いた運用プロセッサの単体性能の総和」という条件が満たされ、最も単体性能が小さい演算プロセッサ１がシステムから切り離され、新たな予備プロセッサとして待機する。その際、システムの総合性能は本来システムに障害がない状態のときと等しく３００となる。演算プロセッサ１の切り離し後に再度条件判定Ｂに照らし合わせると、「本来システムに障害がない状態の全運用プロセッサの単体性能の総和＜＝最も単体性能の小さいプロセッサを除いた運用プロセッサの単体性能の総和」という条件に該当する演算プロセッサは存在しなくなっているので、一連の予備プロセッサ組み込みに関する処理は終了する。
【００５０】
続いて、図４の処理フローチャートの別の動作を説明するため、システム内に元々実装されていた健全な予備プロセッサを全て使い切るまで、システムの障害状態を進行させる。健全な予備プロセッサである演算プロセッサ５をシステムに組み込むために、ここでは運用中の演算プロセッサ２でデータキャッシュ（０）１５が故障したものとする。手順５〜７では手順２〜４と同様の予備プロセッサ組み込み処理が行われる。
【００５１】
手順５において、演算プロセッサ２のデータキャッシュ（０）１５で障害が発生し、診断プロセッサ６はデータキャッシュ（０）１５が縮退可能な部位であることから、演算プロセッサ２内の構成制御ユニット１１に指示して、演算プロセッサ２のデータキャッシュ（０）１５を縮退させる。また、診断プロセッサ６は図５に示す性能低下量より、予備プロセッサも含めてシステム内に存在する全ての演算プロセッサ１〜５の単体性能を計算する。データキャッシュ（０）１５の性能低下量が４０％であることから演算プロセッサ２は単体性能が６０に低下する。この状態でシステムの総合性能は２６０となる。
【００５２】
手順６において、診断プロセッサ６は図４の動作処理フローチャートにおける条件判定Ａにより、システム内の全運用プロセッサ２〜４の単体性能の総和と、本来システムに障害がない状態の全運用プロセッサの単体性能の総和とを比較する。この結果、「本来システムに障害がない状態の全運用プロセッサの単体性能の総和＞障害発生後の運用プロセッサの単体性能の総和」となり、診断プロセッサ６は予備として待機している演算プロセッサ１と５のうち、単体性能の大きい演算プロセッサ５をシステムに組み込む。これによりシステムに組み込まれている運用中の演算プロセッサは２〜５の４台となり、システムの総合性能は３６０となる。
【００５３】
手順７において、図４の動作処理フローチャートの条件判定Ｂで、運用中の演算プロセッサ２〜５の中で最も単体性能が小さい演算プロセッサはデータキャッシュ（０）１５が故障した演算プロセッサ２である。ここで障害が発生した演算プロセッサ２をシステムから取り除いても、システムの総合性能は本来システムに障害がない状態における全運用プロセッサの単体性能の総和３００と等しくなるので、演算プロセッサ２をシステムから切り離し、新たな予備プロセッサとして待機させる。
【００５４】
手順８〜９では、図４の動作処理フローチャートの条件判定Ａにおいて、「本来システムに障害のない状態の全運用プロセッサの単体性能の総和＞障害発生後の運用中プロセッサの単体性能の総和」による予備プロセッサの組み込み処理は行われるが、図４の動作処理フローチャートにおける条件判定Ｂ「本来システムに障害がない状態での全運用プロセッサの単体性能の総和＜＝最も単体性能が小さいプロセッサを除く運用プロセッサの単体性能の総和」に基づく演算プロセッサの切り離し処理が行われない際の動作について説明する。
【００５５】
手順８において、運用中の演算プロセッサ３の変換ルックアサイドバッファ１７が故障したものとする。演算プロセッサ３は図５より性能低下量が８０％であることから単体性能が２０に低下する。この状態でシステムの総合性能は２２０となる。
【００５６】
手順９において、診断プロセッサ６は図４の動作処理フローチャートにおける条件判定Ａにより、システム内の全ての運用プロセッサ３〜５の単体性能の総和と、本来システムに障害がない状態の際の全運用プロセッサの単体性能の総和を比較する。その結果、「本来システムに障害がない状態の全運用プロセッサの単体性能の総和＞障害発生後の運用中プロセッサの単体性能の総和」となるため、診断プロセッサ６は予備として待機している演算プロセッサ１と２のうち、単体性能の大きい演算プロセッサ１の方をシステムに組み込む。これによってシステムに組み込まれている運用プロセッサは１、３、４、５の４台となり、システムの総合性能は３００となる。次に、図４の動作処理フローチャートにおける条件判定Ｂでは、予備プロセッサ１の組み込み後のシステムの総合性能が３００であることから、「本来システムに障害がない状態での全運用プロセッサの単体性能の総和＜＝最も単体性能が小さいプロセッサを除く運用プロセッサの単体性能の総和」を満たしていないので、最も単体性能が小さいプロセッサの切り離し処理は行われず、最終的に演算プロセッサ１、３、４、５の４台でシステムの運用が継続される。
【００５７】
次に、手順１０〜１２および１３では、図４の動作処理フローチャートにおける条件判定Ａ「本来システムに障害がない状態での全運用プロセッサの単体性能の総和＞障害発生後の運用プロセッサの単体性能の総和」による予備プロセッサの組み込み処理も、図４の動作処理フローチャートにおける条件判定Ｂ「本来システムに障害がない状態の全運用プロセッサの単体性能の総和＜＝最も単体性能の小さいプロセッサを除く運用プロセッサの単体性能の総和」による運用中の演算プロセッサの切り離し処理も行わない場合の動作について説明する。なお、本手順を説明するためには、障害発生直前のシステムの総合性能が、本来システムに障害がない状態における全運用プロセッサの単体性能の総和を上回っている必要がある。
【００５８】
手順１０〜１２において、システムの障害状態をさらに進行させる。すなわち、手順１０において、すでに命令キャッシュ（０）１３が縮退状態にある演算プロセッサ１で、命令キャッシュ（１）１４にも障害が発生したとすると、演算プロセッサ１の単体性能は６０に低下し、またシステム内で運用されている演算プロセッサ１、３、４、５の単体性能の総和は３４０となり、手順１２において運用中の演算プロセッサの中で最も単体性能が小さい演算プロセッサ３がシステムから切り離され、予備プロセッサとして待機することにより、システムの総合性能は３２０となる。
【００５９】
手順１３では、運用中の演算プロセッサで障害が発生しても、何れの予備プロセッサもシステムに組み込まれることはなく、また運用中の演算プロセッサも切り離されることはない。手順１３において、運用中の演算プロセッサ４で命令キャッシュ（０）１３が故障した場合を例に挙げると、演算プロセッサ４の単体性能が８０に低下する。この際、障害発生直前のシステムの総合性能は３２０であったのに対し、障害発生後は総合性能が３００に低下する。これはシステムに障害がない状態の全運用プロセッサの単体性能の総和３００と等しく、図４の動作処理フローチャートにおける条件判定Ａ／条件判定Ｂの何れの条件も満たさないので、システムはこの状態で安定する。
【００６０】
次に、複数台の予備プロセッサをシステムに組み込む際の動作について説明する。この動作は図７に示す手順７が完了している状態において、運用中の演算プロセッサ３〜５のうちの何れかが故障し、故障した演算プロセッサの単体性能の低下量が、予備として待機している演算プロセッサ１と２の単体性能の合計よりも大きい場合に発生する。すなわち、手順１４において、演算プロセッサ３で演算ユニット１２の故障のような致命的障害が発生し使用不可能になったとすると、演算プロセッサ３の単体性能は０となる。この際、運用中の演算プロセッサ３〜５の単体性能の合計は２００となる。
【００６１】
手順１５において、図４の動作処理フローチャート条件判定Ａ、不足分の性能を予備として待機している演算プロセッサ１と２のうち、単体性能が大きい方の演算プロセッサ１をシステムに組み込むことにより、システムの総合性能は２８０まで回復するが、演算プロセッサ１の組み込み後に再度判定される条件判定Ａにより、システムに障害がない状態での運用プロセッサの単体性能の総和３００までにはあと２０だけ性能が不足しており、さらに予備プロセッサの組み込みが必要であると判定される。
【００６２】
手順１６において、予備として待機している演算プロセッサ２がシステムに組み込まれ、図４の条件判定Ａを抜ける。この際演算プロセッサ１〜５の５台の合計によるシステムの総合性能は３４０となる。
【００６３】
手順１７では、致命的障害が発生した演算プロセッサ３の切り離しが実行される。演算プロセッサの致命的障害は性能低下量が１００％と換算し、単体性能０に相当とする。よって図４の処理動作フローチャートにおける条件判定Ｂに基づき、最も単体性能が小さい運用中プロセッサの切り離し条件に従って、演算プロセッサ３がシステムより切り離される。演算プロセッサ３は単体性能が０であるから、予備プロセッサとして待機することはせず、運用中のプロセッサの次障害においてシステムに再度組み込まれることはない。
【００６４】
【発明の効果】
本発明によれば、プロセッサの継続運用が可能な軽度の障害が発生し、システム性能が低下しても即予備プロセッサに切り替えるため、システムの性能低下状態が長時間継続するような事態が回避される。また、全ての予備プロセッサを使用した後に別のプロセッサでさらに重度の障害が発生したような場合、あるいは運用不可能な障害が発生した場合でも、予備に切り替えた故障プロセッサと性能低下量を比較し、より性能低下を抑えるような予備プロセッサの組み込みを選択するので、システムとしての性能低下を最小限に留める効果を奏する。
【図面の簡単な説明】
【図１】本発明の実施形態を表すシステム構成図である。
【図２】本発明の実施形態にかかる演算プロセッサの詳細構成を示すブロック図である。
【図３】第一実施例の動作を示す処理動作フローチャートである。
【図４】第二実施例の動作を示す処理動作フローチャートである。
【図５】演算プロセッサの縮退部位による性能低下量を表した図である。
【図６】第一実施例の状態遷移を示す図である。
【図７】第二実施例の状態遷移を示す図である。
【符号の説明】
１　演算プロセッサ＃１
２　演算プロセッサ＃２
３　演算プロセッサ＃３
４　演算プロセッサ＃４
５　演算プロセッサ＃５
６　診断プロセッサ
１１　構成制御ユニット
１２　演算ユニット
１３　命令キャッシュユニット（０）
１４　命令キャッシュユニット（１）
１５　データキャッシュユニット（０）
１６　データキャッシュユニット（１）
１７　変換ルックアサイドバッファ
１８　バスインタフェースユニット
[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a failure processing method for an information processing apparatus, and more particularly to an improvement in a method for switching a spare processor in a multiprocessor system.
[0002]
[Prior art]
An example of a conventional spare processor system is employed in the S/390 series of IBM (International Business Machines) and is described in "RAS strategy for IBM S/390 G5 and G6", IBM Journal of Research and Development, Vol. 43, Nos. 5/6, 1999 (pages 880 to 883), a technical journal published by IBM on May 27, 1999.
[0003]
In the IBM S/390 system, the central processing unit, the assist processing mechanism that performs input/output processing, and the internal coupling mechanism are configured redundantly as processors that can be switched over to spares. When a failure occurs in one of these devices, the system switches to a spare processor either dynamically (meaning "without stopping the system") or through a power-on reset, so that operation continues without any loss of system performance.
[0004]
In general, core business servers such as general-purpose computers are in most cases supplied under rental contracts, and the price is determined by the product of the performance of the central processing unit and input/output devices and the usage time. Given this billing method, maintaining system performance is a requirement expected of the equipment.
[0005]
In addition, the systems that run on general-purpose computers, as typified by bank accounting systems, are mainly applications that demand extremely high reliability and availability. The equipment therefore needs redundancy functions that quickly recover from a failed part without interrupting online operations.
[0006]
The spare processor scheme is a mechanism peculiar to general-purpose computers that fits the sales and operation models described above. It does not exist in equipment such as ordinary PC servers, whose price is determined by the maximum processing performance of the processor (usually its operating frequency), or in equipment that maintains fault tolerance by distributing work across a plurality of servers.
[0007]
The spare processor scheme of the IBM S/390 system switches to a spare with the processor as the minimum unit. Referring to FIG. 1, which shows an example configuration of a general-purpose computer, arithmetic processors 1 to 5 in this conventional system are connected in a multiprocessor arrangement via a system bus. Some of the arithmetic processors 1 to 5 wait in the system as spare processors to be switched in when a failure occurs in an operating arithmetic processor. For example, if arithmetic processor 5 is disconnected from the system and waits as the spare processor, arithmetic processors 1 to 4 are the operating processors, and while no failure has occurred the system runs on these four arithmetic processors.
[0008]
Now, taking a case where a failure occurs in the arithmetic processor 1 as an example, in the conventional system, the arithmetic processor 1 is separated from the system, and processing is performed so that the standby arithmetic processor 5 is incorporated in the system instead.
[0009]
On the other hand, in the IBM S/390 system, after the failed arithmetic processor 1 is stopped, the contents of its control registers and caches are extracted and transferred to the spare processor 5, and the spare processor 5 is then started. This accomplishes the switch to the spare processor.
[0010]
In the above-mentioned document, the term "support element (SE)" is used for the mechanism that transfers the contents of the control registers and caches. In this configuration example, to make the support element concrete, the diagnostic processor 6 is defined as the mechanism that directs the spare processor switching operation.
[0011]
[Problems to be solved by the invention]
However, the conventional system described above, which switches to a spare on a per-processor basis, has the following problems.
[0012]
First, in a spare processor system built from processors that have partially degradable functions, two approaches are conceivable when a specific part of a processor fails and its performance drops. One switches to the spare processor only after the fault in the failed processor has spread to the point where the processor can no longer continue operating. The other switches to the spare processor immediately when the performance drop occurs, as in the IBM S/390 system.
[0013]
In the former case, the failed processor remains incorporated in the system until the fault becomes severe, so the system continues to run with degraded performance, and it is difficult to meet the general-purpose computer requirement of preserving system performance as far as possible.
[0014]
In the latter case, the degraded state is recovered quickly because the system switches to the spare processor as soon as performance drops. However, a spare processor is consumed even by a minor fault, so when multiple faults occur in a system with few spare processors, the system falls into a state in which no spare processor is available to switch to.
[0015]
An object of the present invention is to provide a spare-processor switching scheme for a multiprocessor information processing apparatus whose processors can continue operating after a faulty part is cut off by a partial degradation function, which keeps the exhaustion of spare processors to a minimum even when faults occur in a plurality of processors and maintains system performance as far as possible.
[0016]
[Means for Solving the Problems]
In view of the above problems, the multiprocessor switching scheme of the present invention is characterized in that a multiprocessor information processing apparatus comprises: at least one operating processor that is used in normal operation; a spare processor that is switched in when a failure occurs in the operating processor; means for degrading the faulty part of the failed operating processor; means for quantifying the performance loss caused by degrading that faulty part; and means for comparing the standalone performance of the operating processor, taking that performance loss into account, with the standalone performance of the spare processor.
[0017]
In another configuration example of the multiprocessor switching scheme of the present invention, when a failure occurs in the operating processor, the standalone performance of the operating processor after degradation of its faulty part is compared with the standalone performance of the spare processor. If the standalone performance of the spare processor is less than or equal to that of the operating processor after degradation, the operating processor continues to be used. If the standalone performance of the spare processor is greater, the spare processor is incorporated into the system, and the failed operating processor is then disconnected from the system and put on standby as a new spare processor.
[0018]
In yet another example, a multiprocessor information processing apparatus having a plurality of operating processors and a plurality of spare processors comprises means for calculating the performance loss caused by the faulty parts of every operating processor and spare processor in the system. When a failure occurs in any of the operating processors, spare processors are incorporated into the system one after another, in descending order of standalone performance, until the total standalone performance of the operating processors is equal to or greater than the total standalone performance of all operating processors in the original fault-free state of the system.
[0019]
In a final configuration example, if the total standalone performance of the operating processors incorporated in the system, excluding the one with the smallest value, is greater than or equal to the total standalone performance of all operating processors in the original fault-free state of the system, the operating processor with the smallest standalone performance is disconnected from the system and put on standby as a spare processor, one processor at a time.
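The two conditions in paragraphs [0018] and [0019] can be written compactly as follows. This is only a restatement under notation introduced here for illustration; the symbols O, S, P_i and P_0 do not appear in the original text.

```latex
% Requires amsmath. Notation (assumed here, not from the original):
%   O   = set of processors currently operating
%   S   = set of processors waiting as spares
%   P_i = standalone performance of processor i after degradation
%   P_0 = total standalone performance of the operating processors
%         in the original fault-free state of the system
\begin{align*}
\text{Condition A (incorporate a spare):}\quad
  & \sum_{i \in O} P_i < P_0
  \;\Rightarrow\;
  O \leftarrow O \cup \{j^{*}\},\;
  S \leftarrow S \setminus \{j^{*}\},\quad
  j^{*} = \arg\max_{j \in S} P_j \\
\text{Condition B (disconnect the weakest):}\quad
  & \sum_{i \in O} P_i - \min_{i \in O} P_i \ge P_0
  \;\Rightarrow\;
  O \leftarrow O \setminus \{k^{*}\},\;
  S \leftarrow S \cup \{k^{*}\},\quad
  k^{*} = \arg\min_{i \in O} P_i
\end{align*}
```

Condition A is repeated until it no longer holds or no spare remains; condition B is then repeated until it no longer holds.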
[0020]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
[0021]
As shown in FIG. 1, the system comprises a plurality of arithmetic processors 1 to 5 and a diagnostic processor 6. The arithmetic processors 1 to 5 are broadly divided into processors that perform normal business processing and spare processors; a spare processor is normally disconnected from the system and in a standby state.
[0022]
FIG. 2 is a detailed block diagram showing the functions of the arithmetic processors 1 to 5 of the present invention. The configuration control unit 11 is connected to the diagnostic processor 6 via a diagnostic bus and, on instruction from the diagnostic processor 6, controls the incorporation and disconnection of the degradable internal units. The arithmetic unit 12 executes arithmetic processing. The instruction cache unit (#0) 13, instruction cache unit (#1) 14, data cache unit (#0) 15, data cache unit (#1) 16, and translation lookaside buffer 17 cache the instruction and data streams of arithmetic processor 1 and perform address translation; under the control of the configuration control unit 11, degradation is performed in units of these blocks. As a result, the arithmetic processors 1 to 5 can continue operating with a fixed performance loss relative to their original performance. The bus interface unit 18 interfaces with the system bus and exchanges data among the arithmetic processors 1 to 5. The arithmetic unit 12 and the bus interface unit 18 form the core of each arithmetic processor; if either of them fails, the processor cannot continue operating and the arithmetic processor is disconnected.
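As a rough illustration of how the performance loss of a processor might be quantified from its degraded units, the sketch below uses the loss figures that paragraph 【００３７】 attributes to FIG. 5 (20% per instruction cache unit, 40% per data cache unit, 80% for the translation lookaside buffer) and the simplification of paragraph 【００３８】 that degraded units do not affect one another. The unit names, the additive model, and the clamp at zero are assumptions made here, not part of the specification.

```python
# Hypothetical unit names; loss percentages follow the FIG. 5 values
# quoted in the description.
DEGRADATION_LOSS = {
    "icache0": 20, "icache1": 20,
    "dcache0": 40, "dcache1": 40,
    "tlb": 80,
}
# Core units: a fault here makes continued operation impossible.
FATAL_UNITS = {"arithmetic_unit", "bus_interface"}

BASE_PERFORMANCE = 100  # standalone performance of a fault-free processor


def standalone_performance(failed_units):
    """Return the standalone performance after degrading the failed units."""
    failed = set(failed_units)
    if failed & FATAL_UNITS:
        return 0  # treated as a 100% loss
    unknown = failed - set(DEGRADATION_LOSS)
    if unknown:
        raise ValueError(f"unknown units: {sorted(unknown)}")
    # Assumption: losses add up independently and never go below zero.
    loss = sum(DEGRADATION_LOSS[unit] for unit in failed)
    return max(0, BASE_PERFORMANCE - loss)


print(standalone_performance(["dcache0"]))             # 60
print(standalone_performance(["icache0", "icache1"]))  # 60
print(standalone_performance(["arithmetic_unit"]))     # 0
```

A real system would, as the description itself notes, need a formula that accounts for interactions between degraded units rather than a simple sum.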
[0023]
Next, the operation of the present invention will be described in detail with reference to the drawings.
[0024]
In a first embodiment of the present invention, a multiprocessor information processing apparatus whose processors can continue operating after a faulty part is cut off by a partial degradation function is provided with a spare processor to be switched in when an operating processor fails, and with means for quantifying the performance loss caused by the faulty parts of the operating processor and the spare processor. When a failure occurs in an operating processor, its standalone performance after degradation of the faulty part is compared with the standalone performance of the spare processor after degradation of its own faulty parts. If the standalone performance of the spare processor is greater than that of the operating processor, the spare processor is incorporated into the system, and the failed operating processor is then disconnected from the system and put on standby as a new spare. If, on the other hand, the standalone performance of the spare processor is less than or equal to that of the operating processor, the spare processor is not incorporated, and the operating processor continues operating with its faulty part degraded.
[0025]
FIG. 3 is a flowchart showing this series of operations. A case where a failure occurs in the arithmetic processor 1 will be described below, where the arithmetic processors 1 to 4 are normal operation processors and the arithmetic processor 5 is a spare processor.
[0026]
The failure that occurs in arithmetic processor 1 (S3-1) is reported to the diagnostic processor 6, which calculates, from the failed part, the range that can be degraded and the performance loss that the degradation causes. If the diagnostic processor 6 determines that arithmetic processor 1 can run in a degraded state, it degrades the faulty part in arithmetic processor 1 (S3-2).
[0027]
Next, the diagnostic processor 6 calculates the performance loss of arithmetic processor 5, which is waiting as the spare. Because arithmetic processor 5 has no faulty part at this point, condition determination A in FIG. 3 yields "standalone performance of the failed processor 1 < standalone performance of the spare arithmetic processor 5" (S3-3). The spare arithmetic processor 5 is therefore incorporated into the system (S3-4), after which the failed arithmetic processor 1 is disconnected from the system and put on standby as a new spare processor (S3-5).
[0028]
Next, consider a case in which a fault less severe than that of arithmetic processor 1 occurs in arithmetic processor 2. The diagnostic processor 6 calculates the performance loss of the failed arithmetic processor 2 and of arithmetic processor 1, which is waiting as the spare, and compares their standalone performance. In this case, condition determination A in FIG. 3 yields "standalone performance of the failed processor 2 >= standalone performance of the spare arithmetic processor 1", so arithmetic processor 1 is not incorporated into the system. Instead, the faulty part of arithmetic processor 2 is degraded and arithmetic processor 2 continues operating.
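The decision of the first embodiment can be sketched as below. The numbers follow the worked example in paragraphs 【００４０】 to 【００４４】 (each fault-free processor rated 100, a data cache fault leaving 60, an instruction cache fault leaving 80). The function name, the dictionary representation, and the choice of the best spare when several are waiting are assumptions made for this illustration; the first embodiment itself describes a single spare.

```python
def handle_fault(operating, spares, pid, degraded_perf):
    """Apply condition A of FIG. 3 after processor `pid` has been degraded.

    operating, spares: dicts mapping processor id -> standalone performance.
    Returns "swap" if a spare replaced the failed processor, else "keep".
    """
    operating[pid] = degraded_perf
    if not spares:
        return "keep"
    best = max(spares, key=spares.get)  # assumption: pick the best spare
    if spares[best] > degraded_perf:
        # Incorporate the spare first, then disconnect the failed
        # processor and keep it waiting as the new spare.
        operating[best] = spares.pop(best)
        spares[pid] = operating.pop(pid)
        return "swap"
    # Spare is no better: keep running on the degraded processor.
    return "keep"


operating = {1: 100, 2: 100, 3: 100, 4: 100}
spares = {5: 100}

print(handle_fault(operating, spares, 1, 60), sum(operating.values()))  # swap 400
print(handle_fault(operating, spares, 2, 80), sum(operating.values()))  # keep 380
```

The second call keeps processor 2 in service because the only spare, processor 1, now offers 60 against the 80 that processor 2 still delivers.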
[0029]
A second embodiment of the present invention adds to the first embodiment means for quantifying, when a failure occurs in an operating processor, the performance loss caused by the faulty parts of every operating processor and spare processor in the system. After the faulty part of the failed operating processor is degraded, spare processors are incorporated into the system in descending order of standalone performance until the system performance becomes greater than or equal to the total standalone performance of all operating processors in the original fault-free state of the system. Then, for as long as the total standalone performance of the operating processors excluding the one with the smallest standalone performance remains greater than or equal to that fault-free total, the operating processor with the smallest standalone performance is disconnected from the system and put on standby as a spare, one processor at a time.
[0030]
FIG. 4 is a flowchart showing the operation of the second embodiment of the present invention. As an example, assume that in FIG. 1 the arithmetic processors 1 to 3 are normal operating processors, the arithmetic processors 4 and 5 are spare processors, and a failure occurs in the arithmetic processor 1. The failure of the arithmetic processor 1 (S4-1) is immediately notified to the diagnostic processor 6, which, upon receiving the notification, degenerates the faulty part of the arithmetic processor 1 (S4-2).
[0031]
Further, the diagnostic processor 6 calculates the single performance of all the processors 1 to 5 in the system (S4-3), and compares the sum of the single performances of all the operating processors 1 to 3, including the failed processor 1, with the sum of the single performances of all the operating processors in the state where the system originally has no failure (S4-4). As a result, since the condition determination A of FIG. 4 satisfies “the sum of the single performances of all the operating processors in the state where the system originally has no failure > the sum of the single performances of the operating processors after the occurrence of the failure”, the diagnostic processor 6 incorporates either the spare processor 4 or the spare processor 5 into the system (S4-5). Either may be chosen because the spare processors 4 and 5 are both sound processors without failure and have the same single performance.
[0032]
Here, assuming that the spare processor 4 is incorporated in the system, there are four operating processors 1-4 operating on the system.
[0033]
Next, the condition determination B of FIG. 4 compares the sum of the single performances of all the operating processors in the state where the system originally has no failure with the sum of the single performances of the operating processors excluding the processor with the lowest single performance (S4-6). In this case, even if the failed processor 1 is removed from the system, the total performance of the system is equal to the sum of the single performances of all the operating processors in the state where there is no failure. The diagnostic processor 6 therefore separates the failed arithmetic processor 1 from the system and puts it on standby as a new spare processor (S4-7).
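The sequence S4-3 to S4-7 can be sketched in Python as follows. This is a minimal illustration, not the implementation of the diagnostic processor 6: the function name `rebalance`, the dictionaries mapping processor numbers to single performance, the tie-break by lowest processor number, and the value 80 for the degraded processor 1 are all assumptions made for the example.

```python
from typing import Dict


def rebalance(active: Dict[int, int], spares: Dict[int, int], baseline: int) -> None:
    """Apply condition determinations A and B in place.

    active and spares map a processor number to its single performance;
    baseline is the fault-free total of the operating processors.
    """
    # Condition A: while the baseline exceeds the current total, bring in
    # the spare with the highest single performance (lowest number on a tie).
    while spares and baseline > sum(active.values()):
        best = max(spares, key=lambda p: (spares[p], -p))
        active[best] = spares.pop(best)

    # Condition B: while the baseline is still met without the weakest
    # operating processor, separate it and keep it as a new spare.
    while len(active) > 1:
        weakest = min(active, key=lambda p: (active[p], p))
        if baseline > sum(active.values()) - active[weakest]:
            break
        spares[weakest] = active.pop(weakest)


# State right after the faulty part of processor 1 has been degenerated
# (the single performance of 80 is an illustrative value).
active = {1: 80, 2: 100, 3: 100}
spares = {4: 100, 5: 100}
rebalance(active, spares, baseline=300)
```

With these inputs, processor 4 is incorporated, processor 1 is then separated, and the system ends with processors 2, 3, and 4 operating and processors 5 and 1 on standby.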
[0034]
Subsequently, an operation when a plurality of spare processors are incorporated in the system will be described. This operation occurs when the sum of the single performances of any two or more spare processors cannot compensate for the performance decrease of the failed processor during operation.
[0035]
It is assumed that the arithmetic processors 1 and 2 have failed one after the other and that the spare arithmetic processors 4 and 5 have already been incorporated into the system. At this time, the arithmetic processors 1 and 2 are on standby as new spares. A plurality of spare arithmetic processors is incorporated into the system when one of the operating arithmetic processors 3, 4, and 5 subsequently fails and the decrease in the single performance of the failed arithmetic processor is larger than the sum of the single performances of the arithmetic processors 1 and 2. That is, assuming that the arithmetic processor 3 has failed, the sum of the single performances of the operating arithmetic processors 3 to 5 is calculated according to the flowchart shown in FIG. 4 and compared with the total performance of the system in the fault-free state. The shortfall is first compensated for by incorporating the standby arithmetic processor 1. If the total performance of the system is still less than the sum of the single performances of all the operating processors in the state where the system originally has no failure, the remaining shortfall is made up by also incorporating the arithmetic processor 2. As a result, the system finally operates with all five arithmetic processors 1 to 5 incorporated.
[0036]
Further, the condition determination B in FIG. 4 will be described. The operation of separating the processor with the lowest single performance and putting it on standby as a spare occurs when, as a result of incorporating a spare processor, the total performance of the system exceeds the sum of the single performances of all the operating processors in the state where there is no failure in the system, and an operating processor whose single performance does not exceed that excess exists in the system. That is, referring to the above example in which a plurality of spare processors is incorporated, suppose that the five arithmetic processors 1 to 5 are incorporated in the system because of the failure of the arithmetic processor 3. If the single performance of the arithmetic processor 3 after degeneration of its failure site does not exceed the sum of the single performances of the other arithmetic processors 1, 2, 4, and 5 minus the sum of the single performances of all the operating processors in the state where there is no failure in the system, the arithmetic processor 3 is separated. The separated arithmetic processor 3 then stands by as a new spare processor.
[0037]
【Example】
Next, an embodiment of the present invention will be described with reference to the drawings. FIG. 1 is a system configuration block diagram showing one embodiment of the present invention. FIG. 2 is a detailed block diagram of the arithmetic processor. FIG. 5 is a correspondence table showing the amount of performance reduction when each of the instruction cache unit (0) 13, the instruction cache unit (1) 14, the data cache unit (0) 15, the data cache unit (1) 16, and the conversion lookaside buffer 17 is degenerated. In FIG. 5, it is assumed that, for the arithmetic processor 1, degenerating either of the instruction cache units 13 and 14 causes a 20% performance decrease, degenerating either of the data cache units 15 and 16 causes a 40% performance decrease, and degenerating the conversion lookaside buffer 17 causes an 80% performance decrease.
[0038]
For the sake of simplicity, it is assumed that degenerating one of these functional units does not affect the performance of the other functional units. In an actual system, however, if multi-stage caches and address translation functions are degenerated, the performance of other functional units is very likely to be affected, so values that take this into account must be specified. In addition, since the number of spare processors incorporated may be larger than the number of processors separated due to failures, the multiprocessor coefficient (MP coefficient) must in such a case be taken into account as a parameter when calculating the performance reduction amount.
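Under the simplifying assumption that the reductions are independent, the single performance of a processor follows directly from the FIG. 5 values. The Python sketch below illustrates that calculation for a sound single performance of 100. The dictionary keys, the function name, and the clamping at 0 are assumptions introduced for illustration, and the MP coefficient is deliberately not modeled.

```python
# Performance reduction per degenerated unit, using the values assumed for FIG. 5.
# The keys are hypothetical labels for units 13 to 17.
REDUCTION = {
    "icache0": 20,  # instruction cache unit (0) 13
    "icache1": 20,  # instruction cache unit (1) 14
    "dcache0": 40,  # data cache unit (0) 15
    "dcache1": 40,  # data cache unit (1) 16
    "tlb": 80,      # conversion lookaside buffer 17
}
SOUND_PERFORMANCE = 100


def single_performance(degenerated_units):
    """Single performance after the listed units have been degenerated."""
    loss = sum(REDUCTION[unit] for unit in degenerated_units)
    # Clamping at 0 is an assumption; the text only states that the
    # reductions are treated as independent of one another.
    return max(0, SOUND_PERFORMANCE - loss)
```

For example, `single_performance(["dcache0"])` gives 60 and `single_performance(["icache0", "icache1"])` also gives 60.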
[0039]
Next, the processing of the flowchart showing the operation of the present invention will be described in detail with reference to FIG. 3, and the operation transition state of the present invention will be described in detail with reference to FIG. 6.
[0040]
Assuming that the arithmetic processors 1 to 4 are normal operating processors, the arithmetic processor 5 is a spare processor, and the single performance of each arithmetic processor in a fault-free state is 100, the sum of the single performances of the operating processors (total performance) in procedure 1 of FIG. 6 is 400.
[0041]
Subsequently, in procedure 2, it is assumed that a failure has occurred in the data cache (0) 15 of the arithmetic processor 1. This fault is immediately notified to the diagnostic processor 6 via the diagnostic bus. Since the data cache (0) 15 is a degradable part, the diagnostic processor 6 instructs the configuration control unit 11 in the arithmetic processor 1 to degenerate the data cache (0) 15. Further, the diagnostic processor 6 determines from the performance decrease amount shown in FIG. 5 that the performance decrease amount of the data cache (0) 15 is 40, and recognizes that the single performance of the arithmetic processor 1 has decreased to 60. In this state, the total performance of the system is 360.
[0042]
In procedure 3, the diagnostic processor 6 calculates the performance reduction amount of the standby arithmetic processor 5, but at this time, the single performance is 100 because there is no failure point in the arithmetic processor 5. Next, since the condition determination A in the operation processing flow of FIG. 3 indicates that “the single performance of the failed arithmetic processor 1 <the single performance of the spare arithmetic processor 5”, the spare arithmetic processor 5 is incorporated in the system.
[0043]
In procedure 4, the arithmetic processor 1 in which the data cache (0) 15 has failed is disconnected from the system and enters a standby state as a new spare processor. The overall performance of the system in this state returns to the same 400 as before the failure occurred.
[0044]
In procedure 5, a case will be described as an example where a failure less severe than that of the arithmetic processor 1, which is waiting as a spare, occurs in the arithmetic processor 2. Here, it is assumed that the instruction cache (0) 13 of the arithmetic processor 2 has failed. This failure is notified to the diagnostic processor 6 in the same way as the failure of the arithmetic processor 1 described above. From the performance reduction amounts in FIG. 5, the performance reduction for the instruction cache (0) 13 is 20%, so the diagnostic processor 6 recognizes that the single performance of the arithmetic processor 2 has decreased to 80. Subsequently, based on the condition determination A in FIG. 3, the single performances of the failed processor 2 and the standby processor 1 are compared. This time, since “the single performance of the failed processor 2 >= the single performance of the standby arithmetic processor 1”, the standby arithmetic processor 1 is not incorporated into the system. The diagnostic processor 6 instructs the configuration control unit 11 of the arithmetic processor 2 to degenerate the faulty part, and operation continues with the arithmetic processor 2 still incorporated in the system. The total performance of the system in this state is 380.
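Procedures 1 to 5 of the first embodiment can be replayed with a short Python sketch. The function name `handle_failure` and the data layout (a dictionary of operating processors and a single spare held as a pair) are assumptions chosen for illustration; only the comparison rule and the performance figures come from the description.

```python
def handle_failure(active, spare, failed_id, degraded_perf):
    """Apply condition determination A of FIG. 3 once.

    active maps processor number -> single performance; spare is a
    (number, single performance) pair. Returns the spare after the decision.
    """
    active[failed_id] = degraded_perf  # faulty part has been degenerated
    spare_id, spare_perf = spare
    if degraded_perf >= spare_perf:
        return spare  # keep running degraded; the spare stays on standby
    # Swap: the spare joins the system and the failed processor becomes the spare.
    del active[failed_id]
    active[spare_id] = spare_perf
    return (failed_id, degraded_perf)


active = {1: 100, 2: 100, 3: 100, 4: 100}
spare = (5, 100)
totals = [sum(active.values())]               # procedure 1
spare = handle_failure(active, spare, 1, 60)  # procedures 2 to 4
totals.append(sum(active.values()))
spare = handle_failure(active, spare, 2, 80)  # procedure 5
totals.append(sum(active.values()))
```

After the run, `totals` holds 400, 400, and 380, and the degraded processor 1 remains the spare, matching the state transitions described above.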
[0045]
Next, the processing of the flowchart showing the operation of the second embodiment of the present invention will be described in detail with reference to FIG. 4, and the operation transition state of the second embodiment of the present invention will be described in detail with reference to FIG. 7.
[0046]
Assuming that the arithmetic processors 1 to 3 are normal operating processors, the arithmetic processors 4 and 5 are spare processors, and the single performance of each arithmetic processor in a fault-free state is 100, the sum of the single performances of the operating processors (total performance) of the system of the second embodiment in procedure 1 of FIG. 7 is 300.
[0047]
Subsequently, in procedure 2, if a failure occurs in the instruction cache (0) 13 of the arithmetic processor 1, the failure is notified to the diagnostic processor 6 via the diagnostic bus. Since the instruction cache (0) 13 is a degeneratable part, the diagnostic processor 6 instructs the configuration control unit 11 in the arithmetic processor 1 to degenerate the instruction cache (0) 13 of the arithmetic processor 1. Further, the diagnostic processor 6 calculates the single performance of all the arithmetic processors 1 to 5, including the spare processors, from the performance reduction amounts shown in FIG. 5. Since the performance reduction amount of the instruction cache (0) 13 is 20%, the single performance of the arithmetic processor 1 is reduced to 80. Since the arithmetic processors other than the arithmetic processor 1 have no failure, their single performance remains at 100, and the total performance of the system becomes 280.
[0048]
In procedure 3, by the condition determination A in the operation processing flowchart of FIG. 4, the diagnostic processor 6 compares the sum of the single performances of all the operating processors 1 to 3 in the system after the occurrence of the failure with the sum of the single performances of all the operating processors in the state where there is no failure in the system. As a result, “the sum of the single performances of all the operating processors in the state where the system originally has no failure > the sum of the single performances of the operating processors after the failure occurs” holds, and the diagnostic processor 6 incorporates into the system whichever of the standby arithmetic processors 4 and 5 has the higher single performance. In the second embodiment, the arithmetic processors 4 and 5 are both sound processors without any failure and each have a single performance of 100, so the arithmetic processor 4 is incorporated here. As a result, the operating processors incorporated in the system become the four processors 1 to 4, and the total performance of the system becomes 380. When the condition determination A in the operation processing flowchart of FIG. 4 is evaluated again after the spare processor is incorporated, “the sum of the single performances of all the operating processors in the state where the system originally has no failure > the sum of the single performances of the operating processors after the failure occurs” is no longer satisfied, so the process proceeds to the next step.
[0049]
In procedure 4, the condition determination B in the operation processing flowchart of FIG. 4 decides whether to separate the arithmetic processor with the lowest single performance in the system. At this time, among the operating arithmetic processors 1 to 4, the one with the lowest single performance is the arithmetic processor 1, whose instruction cache (0) 13 has failed. If the arithmetic processor 1 is removed from the system here, the total performance of the system becomes equal to the sum (300) of the single performances of all the operating processors in the state where the system originally has no failure. The condition “the sum of the single performances of all the operating processors in the state where the system originally has no failure <= the sum of the single performances of the operating processors excluding the processor with the lowest single performance” is therefore satisfied, and the arithmetic processor 1, which has the lowest single performance, is separated from the system and put on standby as a new spare processor. The total performance of the system is then 300, the same as when there is no failure in the system. When the condition determination B is performed again after the separation of the arithmetic processor 1, no arithmetic processor satisfies “the sum of the single performances of all the operating processors in the state where the system originally has no failure <= the sum of the single performances of the operating processors excluding the processor with the lowest single performance”, and the series of processes relating to the incorporation of the spare processor ends.
[0050]
Subsequently, in order to explain another operation of the processing flow chart of FIG. 4, the failure state of the system is advanced until all the healthy spare processors originally mounted in the system are used up. Here, it is assumed that the data cache (0) 15 has failed in the operating processor 2 in order to incorporate the arithmetic processor 5, which is a sound standby processor, into the system. In Steps 5 to 7, the same processing as in the steps 2 to 4 is performed.
[0051]
In procedure 5, a failure occurs in the data cache (0) 15 of the arithmetic processor 2. Since the data cache (0) 15 is a degeneratable part, the diagnostic processor 6 instructs the configuration control unit 11 in the arithmetic processor 2 to degenerate the data cache (0) 15 of the arithmetic processor 2. Further, the diagnostic processor 6 calculates the single performance of all the arithmetic processors 1 to 5, including the spare processors, from the performance reduction amounts shown in FIG. 5. Since the performance reduction amount of the data cache (0) 15 is 40%, the single performance of the arithmetic processor 2 is reduced to 60. In this state, the total performance of the system is 260.
[0052]
In step 6, by the condition determination A in the operation processing flowchart of FIG. 4, the diagnostic processor 6 compares the sum of the single performances of all the operating processors 2 to 4 in the system with the sum of the single performances of all the operating processors in the state where there is no failure in the system. As a result, “the sum of the single performances of all the operating processors in the state where there is no failure in the system > the sum of the single performances of the operating processors after the occurrence of the failure” holds, and the diagnostic processor 6 incorporates into the system the arithmetic processor 5, which has the highest single performance among the processors on standby as spares. As a result, the operating processors incorporated in the system become the four processors 2 to 5, and the total performance of the system becomes 360.
[0053]
In procedure 7, in the condition determination B of the operation processing flowchart of FIG. 4, the arithmetic processor with the lowest single performance among the operating arithmetic processors 2 to 5 is the arithmetic processor 2, whose data cache (0) 15 has failed. Here, even if the failed processor 2 is removed from the system, the total performance of the system is equal to the sum (300) of the single performances of all the operating processors in the state where the system is originally free from failure, so the processor 2 is separated from the system and put on standby as a new spare processor.
[0054]
Steps 8 and 9 describe the operation in the case where the processing for incorporating a spare processor is performed under the condition determination A in the operation processing flowchart of FIG. 4, “the sum of the single performances of all the operating processors in the state where the system originally has no failure > the sum of the single performances of the operating processors after the occurrence of the failure”, but the processing for separating an arithmetic processor is not performed under the condition determination B in the operation processing flowchart of FIG. 4, “the sum of the single performances of all the operating processors in the state where the system originally has no failure <= the sum of the single performances of the operating processors excluding the processor with the lowest single performance”.
[0055]
In procedure 8, it is assumed that the conversion lookaside buffer 17 of the operating arithmetic processor 3 has failed. The performance reduction amount of the arithmetic processor 3 is 80% as shown in FIG. 5, so its single performance falls to 20. In this state, the total performance of the system is 220.
[0056]
In step 9, by the condition determination A in the operation processing flowchart of FIG. 4, the diagnostic processor 6 compares the sum of the single performances of all the operating processors 3 to 5 in the system with the sum of the single performances of all the operating processors in the state where there is no failure in the system. As a result, “the sum of the single performances of all the operating processors in the state where the system originally has no failure > the sum of the single performances of the operating processors after the occurrence of the failure” holds, so the diagnostic processor 6 incorporates into the system the arithmetic processor 1, which has the higher single performance of the standby arithmetic processors 1 and 2. As a result, the operating processors incorporated in the system become the four processors 1, 3, 4, and 5, and the total performance of the system becomes 300. Next, in the condition determination B in the operation processing flowchart of FIG. 4, since the total performance of the system after the incorporation of the spare processor 1 is 300, “the sum of the single performances of all the operating processors in the state where the system originally has no failure <= the sum of the single performances of the operating processors excluding the processor with the lowest single performance” is not satisfied. The processor with the lowest single performance is therefore not separated, and the system continues operating with the four arithmetic processors 1, 3, 4, and 5.
[0057]
Next, steps 10 to 12 and step 13 describe the operation in the case where neither the processing for incorporating a spare processor under the condition determination A in the operation processing flowchart of FIG. 4, “the sum of the single performances of all the operating processors in the state where the system originally has no failure > the sum of the single performances of the operating processors after the occurrence of the failure”, nor the processing for separating an operating processor under the condition determination B in the operation processing flowchart of FIG. 4, “the sum of the single performances of all the operating processors in the state where the system originally has no failure <= the sum of the single performances of the operating processors excluding the processor with the lowest single performance”, is performed. To explain this case, the total performance of the system immediately before the occurrence of the failure must exceed the sum of the single performances of all the operating processors in the state where the system originally has no failure.
[0058]
In steps 10 to 12, the failure state of the system is advanced further. That is, in procedure 10, the instruction cache (1) 14 of the arithmetic processor 1, whose instruction cache (0) 13 is already degenerated, also fails, and the single performance of the arithmetic processor 1 is reduced to 60. With the standby arithmetic processor 2 incorporated in procedure 11, the sum of the single performances of the arithmetic processors 1 to 5 operating in the system is 340, and in procedure 12 the arithmetic processor 3, which has the lowest single performance among the operating arithmetic processors, is separated from the system and put on standby as a spare processor, so that the total performance of the system becomes 320.
[0059]
In step 13, even if a failure occurs in an operating processor, no spare processor is incorporated into the system and no operating processor is separated. As an example, if the instruction cache (0) 13 of the operating arithmetic processor 4 fails in procedure 13, the single performance of the arithmetic processor 4 is reduced to 80. The total performance of the system, which was 320 immediately before the failure, decreases to 300 after the failure. This is equal to the sum (300) of the single performances of all the operating processors in the state where there is no failure in the system, and neither the condition determination A nor the condition determination B in the operation processing flowchart of FIG. 4 is satisfied, so the system remains stable in this state.
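The three possible outcomes of the two condition determinations (incorporate a spare, separate the weakest processor, or do nothing) can be summarized in a small Python sketch. The function name `next_action` and its return strings are hypothetical; the inequalities are those quoted for the condition determinations A and B, evaluated once against the single performances of the operating processors.

```python
def next_action(active_perfs, baseline):
    """Classify the system state against condition determinations A and B.

    active_perfs lists the single performances of the operating processors;
    baseline is the fault-free total of the operating processors.
    """
    total = sum(active_perfs)
    if baseline > total:
        return "incorporate spare"   # condition determination A holds
    if baseline <= total - min(active_perfs):
        return "separate weakest"    # condition determination B holds
    return "stable"                  # neither condition holds


baseline = 300
after_procedure_8 = next_action([20, 100, 100], baseline)       # total 220
after_procedure_9 = next_action([80, 20, 100, 100], baseline)   # total 300
after_procedure_13 = next_action([60, 60, 80, 100], baseline)   # total 300
after_procedure_3 = next_action([80, 100, 100, 100], baseline)  # total 380
```

The first state calls for a spare, the second and third are stable because removing the weakest processor would drop the total below 300, and the last permits separating the processor with single performance 80.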
[0060]
Next, an operation of incorporating a plurality of spare processors into the system will be described. This operation occurs when, in the state where procedure 7 shown in FIG. 7 has been completed, one of the operating arithmetic processors 3 to 5 fails and the decrease in the single performance of the failed arithmetic processor is larger than the sum of the single performances of the arithmetic processors 1 and 2 standing by as spares. That is, in step 14, if a fatal failure such as a failure of the arithmetic unit 12 occurs in the arithmetic processor 3 and the arithmetic processor 3 becomes unusable, the single performance of the arithmetic processor 3 becomes zero. At this time, the sum of the single performances of the operating arithmetic processors 3 to 5 is 200.
[0061]
In step 15, by the condition determination A in the operation processing flowchart shown in FIG. 4, the arithmetic processor 1, which has the higher single performance of the arithmetic processors 1 and 2 standing by as spares, is incorporated into the system to make up the shortfall. The total performance of the system recovers to 280, but when the condition is determined again after the incorporation of the arithmetic processor 1, the total is still 20 short of the sum of the single performances of the operating processors in the state where there is no failure in the system. It is therefore determined that a further spare processor must be incorporated.
[0062]
In step 16, the standby arithmetic processor 2 is incorporated into the system, and the process exits the condition determination A of FIG. 4. At this time, the total performance of the system, as the sum over the five arithmetic processors 1 to 5, is 340.
[0063]
In step 17, the arithmetic processor 3, in which the fatal failure has occurred, is separated. A fatal failure of an arithmetic processor corresponds to a performance reduction amount of 100%, that is, a single performance of 0. Therefore, based on the condition determination B in the processing operation flowchart of FIG. 4, the arithmetic processor 3 is separated from the system under the separation condition for the operating processor with the lowest single performance. Since the single performance of the arithmetic processor 3 is 0, it does not stand by as a spare processor and is not re-incorporated into the system at the next failure of an operating processor.
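Steps 14 to 17 can be traced with the following Python sketch, which incorporates spares repeatedly and drops a separated processor whose single performance is 0 instead of keeping it as a spare. The function name `recover`, the dictionary layout, and the returned list of totals are assumptions made for illustration.

```python
from typing import Dict, List


def recover(active: Dict[int, int], spares: Dict[int, int], baseline: int) -> List[int]:
    """Run condition determinations A and B; return the total after each change."""
    totals = [sum(active.values())]
    # Condition A: keep adding the best remaining spare while below the baseline.
    while spares and baseline > sum(active.values()):
        best = max(spares, key=lambda p: (spares[p], -p))
        active[best] = spares.pop(best)
        totals.append(sum(active.values()))
    # Condition B: separate the weakest processor while the baseline is still met.
    while len(active) > 1:
        weakest = min(active, key=lambda p: (active[p], p))
        if baseline > sum(active.values()) - active[weakest]:
            break
        perf = active.pop(weakest)
        if perf > 0:  # a processor with single performance 0 is not kept as a spare
            spares[weakest] = perf
        totals.append(sum(active.values()))
    return totals


# State in step 14: processor 3 has suffered a fatal failure.
active = {3: 0, 4: 100, 5: 100}
spares = {1: 80, 2: 60}
totals = recover(active, spares, baseline=300)
```

The recorded totals are 200, 280, 340, and 340: two spares are incorporated in turn, and the final separation of processor 3 leaves the total unchanged because its single performance is 0.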
[0064]
【Effect of the Invention】
According to the present invention, even when a minor failure that still allows continuous operation of a processor occurs and the system performance is reduced, the system is immediately switched to a spare processor. If, after all the spare processors have been used, a more serious failure or a failure that makes operation impossible occurs in another processor, the performance reduction amount is compared with that of the failed processors that were previously replaced by spares, and the spare processor that better suppresses the performance degradation is selected. This has the effect of minimizing the performance degradation of the system.
[Brief description of the drawings]
FIG. 1 is a system configuration diagram showing an embodiment of the present invention.
FIG. 2 is a block diagram illustrating a detailed configuration of an arithmetic processor according to an embodiment of the present invention.
FIG. 3 is a processing operation flowchart showing the operation of the first embodiment.
FIG. 4 is a processing operation flowchart showing an operation of the second embodiment.
FIG. 5 is a diagram illustrating a performance reduction amount due to a degenerate part of an arithmetic processor.
FIG. 6 is a diagram showing a state transition of the first embodiment.
FIG. 7 is a diagram showing a state transition of the second embodiment.
[Explanation of symbols]
1 Arithmetic processor #1
2 Arithmetic processor #2
3 Arithmetic processor #3
4 Arithmetic processor #4
5 Arithmetic processor #5
6 Diagnostic processor
11 Configuration control unit
12 Arithmetic unit
13 Instruction cache unit (0)
14 Instruction cache unit (1)
15 Data cache unit (0)
16 Data cache unit (1)
17 Conversion lookaside buffer
18 Bus interface unit

Claims

In an information processing apparatus of a multiprocessor system, a multiprocessor switching method comprising: at least one operating processor that is operated in a normal state; a spare processor that is switched into use when a failure occurs in the operating processor; means for degenerating the failure site of the failed operating processor; means for quantifying the amount of performance reduction due to the degeneration of the failure site of the operating processor; and means for comparing the single performance of the operating processor, taking into account the amount of performance reduction due to the degeneration of the failure site, with the single performance of the spare processor.

The multiprocessor switching method according to claim 1, wherein, when a failure occurs in the operating processor, the single performance of the operating processor after its faulty part is degenerated is compared with the single performance of the spare processor; if the single performance of the spare processor is less than or equal to the single performance of the operating processor after degeneration of the faulty part, operation of the operating processor is continued; and if the single performance of the spare processor is greater, the spare processor is incorporated into the system, after which the failed operating processor is separated from the system and put on standby as a new spare processor.

In a multiprocessor information processing apparatus having a plurality of operating processors and a plurality of spare processors, a multiprocessor switching method characterized by having means for calculating the amount of performance reduction due to failure sites of all the operating processors and spare processors in the system, wherein, when a failure occurs in any of the operating processors, the plurality of spare processors are sequentially incorporated into the system in descending order of single performance until the total becomes equal to or greater than the sum of the single performances of all the operating processors in the state where the system has no failure.

The multiprocessor switching method according to claim 3, wherein, as long as the sum of the single performances of the plurality of operating processors incorporated in the system, excluding the smallest value, is greater than or equal to the sum of the single performances of all the operating processors in the state where there is no failure in the system, the processor with the lowest single performance among the operating processors is sequentially separated from the system and put on standby as a spare processor.