JP3628595B2 - Interconnected processing nodes configurable as at least one NUMA (NON-UNIFORM MEMORY ACCESS) data processing system - Google Patents

Interconnected processing nodes configurable as at least one NUMA (NON-UNIFORM MEMORY ACCESS) data processing system Download PDF

Info

Publication number
JP3628595B2
Authority
JP
Japan
Prior art keywords
plurality
processing
system
data processing
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
JP2000180619A
Other languages
Japanese (ja)
Other versions
JP2001051959A (en)
Inventor
James Lyle Peterson
David Brian Glasco
Bishop Chapman Brock
Ramakrishnan Rajamony
Ronald Lynn Rockhold
Original Assignee
International Business Machines Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority to US09/335,301 priority Critical patent/US6421775B1/en
Priority to US09/335301 priority
Application filed by International Business Machines Corporation filed Critical International Business Machines Corporation
Publication of JP2001051959A publication Critical patent/JP2001051959A/en
Application granted granted Critical
Publication of JP3628595B2 publication Critical patent/JP3628595B2/en
Application status is Expired - Fee Related legal-status Critical
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/177Initialisation or configuration control

Description

[0001]
BACKGROUND OF THE INVENTION
The present invention generally relates to data processing, and more particularly, to a NUMA (non-uniform memory access) data processing system. More specifically, the present invention relates to a collection of interconnected processing nodes that can be configured as one or more data processing systems including at least one NUMA data processing system.
[0002]
[Prior art]
In computer technology, it is well known that greater computer system performance can be achieved by harnessing the processing power of multiple individual processors in tandem. Multiprocessor (MP) computer systems can be designed with a number of different topologies, of which various ones may be better suited for particular applications depending upon the performance requirements and software environment of each application. One common MP computer topology is the symmetric multiprocessor (SMP) configuration, in which multiple processors share common resources, such as a system memory and an input/output (I/O) subsystem, which are typically coupled to a shared system interconnect. Such computer systems are said to be symmetric because all processors in an SMP computer system ideally have the same access latency with respect to data stored in the shared system memory.
[0003]
Although SMP computer systems permit the use of relatively simple inter-processor communication and data sharing methodologies, SMP computer systems have limited scalability. In other words, while the performance of a typical SMP computer system can generally be expected to improve with scale (i.e., with the addition of more processors), inherent bus, memory, and I/O bandwidth limitations prevent significant advantage from being obtained by scaling an SMP beyond an implementation-dependent size at which the utilization of these shared resources is optimized. Thus, the SMP topology itself suffers to a certain extent from bandwidth limitations, especially at the system memory, as the scale of the system increases. SMP computer systems also do not scale well from the standpoint of manufacturing efficiency. For example, although some components can be optimized for use in both uniprocessor and small-scale SMP computer systems, such components are often inefficient for use in large-scale SMPs. Conversely, components designed for use in a large-scale SMP may be impractical for use in smaller systems from a cost standpoint.
[0004]
As a result, an MP computer system topology known as NUMA (non-uniform memory access) has recently attracted increased interest as a design that addresses many of the limitations of SMP computer systems at the expense of some additional complexity. A typical NUMA computer system includes a number of interconnected nodes that each contain one or more processors and a local "system" memory. Such computer systems are said to have non-uniform memory access because each processor has lower access latency with respect to data stored in the system memory of its local node than with respect to data stored in the system memory of a remote node. NUMA systems can be further classified as either non-coherent or cache coherent, depending upon whether data coherency is maintained between the caches of different nodes. The complexity of cache coherent NUMA (CC-NUMA) systems is attributable in large measure to the additional communication required for hardware to maintain data coherency not only between the various levels of cache memory and system memory within each node, but also between cache and system memories in different nodes. NUMA computer systems do, however, address the scalability limitations of conventional SMP computer systems, because each node within a NUMA computer system can be implemented as a smaller uniprocessor or SMP system. Thus, the shared components within each node can be optimized for use by only one or a few processors, while the overall system benefits from the availability of larger-scale parallelism while maintaining relatively low latency.
[0005]
[Problems to be solved by the invention]
The present invention recognizes that the expense of a large-scale NUMA data processing system is difficult to justify in certain computing environments, for example, those with variable workloads. That is, some computing environments only occasionally require the processing resources of a large-scale NUMA data processing system to run a single application, and more frequently require a number of smaller data processing systems to run multiple operating systems or multiple different applications. Prior to the present invention, the variable workload of such computing environments could be addressed only by physically configuring multiple computer systems of differing scale, or by connecting and disconnecting nodes as needed.
[0006]
[Means for Solving the Problems]
To address the aforementioned shortcomings in the art, the present invention provides a data processing system that includes a plurality of processing nodes, each containing at least one processor and a data storage device. The plurality of processing nodes are coupled together by a system interconnect. The data processing system further includes a configuration utility residing in the data storage of at least one of the plurality of processing nodes. The configuration utility selectively configures the plurality of processing nodes into either a single NUMA (non-uniform memory access) system or a plurality of independent data processing systems through communication over the system interconnect.
[0007]
DETAILED DESCRIPTION OF THE INVENTION
System overview
Referring now to the drawings, and in particular to FIG. 1, there is depicted an embodiment of a data processing system in accordance with the present invention. The depicted embodiment can be realized, for example, as a workstation, server, or mainframe computer. As illustrated, data processing system 6 includes a number (in this case four) of processing nodes 8a-8d interconnected by a node interconnect 22. As discussed further below, inter-node data coherence is maintained by an interconnect coherence unit (ICU) 36.
[0008]
Referring now to FIG. 2, each of processing nodes 8a-8d includes one or more processors 10a-10m, a local interconnect 16, and a system memory 18 that is accessed via a memory controller 17. Processors 10a-10m are preferably (but not necessarily) identical. In addition to the registers, instruction sequencing logic, and execution units utilized to execute program instructions, which are generally designated as processor core 12, each of processors 10a-10m also includes an on-chip cache hierarchy 14 that is used to stage data from system memory 18 to the associated processor core 12. Each cache hierarchy 14 may include, for example, a level one (L1) cache with a storage capacity of 8 to 32 kilobytes (kB) and a level two (L2) cache with a storage capacity of 1 to 16 megabytes (MB).
[0009]
Each of processing nodes 8a-8d further includes a respective node controller 20 coupled between local interconnect 16 and node interconnect 22. Each node controller 20 serves as a local agent for remote processing nodes 8 by performing at least two functions. First, each node controller 20 snoops the associated local interconnect 16 and facilitates the transmission of local communication transactions to remote processing nodes 8. Second, each node controller 20 snoops communication transactions on node interconnect 22 and masters relevant communication transactions (e.g., read requests) on the associated local interconnect 16. Communication on each local interconnect 16 is controlled by an arbiter 24. The arbiter 24 regulates access to local interconnect 16 based on bus request signals generated by processors 10 and compiles coherency responses for snooped communication transactions on local interconnect 16.
[0010]
Local interconnect 16 is coupled, via a mezzanine bus bridge 26, to a mezzanine bus 30, which can be implemented, for example, as a peripheral component interconnect (PCI) local bus. Mezzanine bus bridge 26 provides both a low latency path through which processors 10 may directly access those of I/O devices 32 and storage devices 34 that are mapped to bus memory and/or I/O address spaces, and a high bandwidth path through which I/O devices 32 and storage devices 34 may access system memory 18. I/O devices 32 may include, for example, a display device, a keyboard, a graphical pointer, and serial and parallel ports for connection to external networks or attached devices. Storage devices 34, on the other hand, may include optical or magnetic disks that provide non-volatile storage for the operating system and application software.
[0011]
Local interconnect 16 is further coupled, via a host bridge 38, to a memory bus 40 and a service processor bus 44. Memory bus 40 is coupled to a non-volatile random access memory (NVRAM) 42 that stores configuration data and other critical data for processing node 8. Service processor bus 44 supports a service processor 50, which serves as the boot processor for processing node 8. The boot code for processing node 8, which typically comprises power-on self-test (POST), basic input/output system (BIOS), and operating system loader code, is stored in flash memory 48. Following boot, service processor 50 serves as a system monitor for the software and hardware of processing node 8 by executing system monitoring software from service processor dynamic random access memory (SP DRAM) 46.
[0012]
System configurability
In a preferred embodiment of the present invention, the BIOS boot code stored in flash memory 48 includes configuration software that permits data processing system 6 to be selectively partitioned into one or more independently operable subsystems. As described in detail below, data processing system 6 may advantageously be configured by this configuration software as a single NUMA data processing system, as multiple NUMA data processing subsystems, or as some other combination of single-node and multi-node (i.e., NUMA) data processing subsystems, in response to the anticipated characteristics of the processing workload. For example, if a large amount of processing power is required to run a single application, it is desirable to configure data processing system 6 as a single NUMA computer system and thereby maximize the processing power available to run that application. Alternatively, if multiple separate applications or multiple separate operating systems must be run, it may be desirable to configure data processing system 6 as multiple NUMA data processing subsystems or as multiple single-node subsystems.
[0013]
When data processing system 6 is configured as a plurality of data processing subsystems, those data processing subsystems comprise disjoint, possibly differently sized, sets of processing nodes 8. Each of the multiple data processing subsystems can be independently configured, run, shut down, rebooted, and repartitioned without interfering with the operation of the other data processing subsystems. Importantly, reconfiguration of data processing system 6 does not require that any processing node 8 be connected to or disconnected from node interconnect 22.
[0014]
Memory coherency
Because data stored in a system memory 18 can be requested, accessed, and modified by any processor 10 within a given data processing subsystem, a cache coherence protocol is implemented to maintain coherence both between caches in the same processing node and between caches in different processing nodes within the same data processing subsystem. The cache coherence protocol that is implemented is implementation-dependent; however, in a preferred embodiment, cache hierarchies 14 and arbiters 24 implement the conventional Modified, Exclusive, Shared, Invalid (MESI) protocol or a variant thereof. Inter-node cache coherency is preferably maintained through a directory-based mechanism centralized in the ICU 36 connected to node interconnect 22, but may alternatively be distributed among directories maintained by node controllers 20. This directory-based coherence mechanism preferably recognizes the M, S, and I states and, for correctness, regards the E state as merged into the M state. That is, data held exclusively by a remote cache is assumed to be modified, whether or not the data has actually been modified.
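By way of illustration only, the following C fragment sketches a directory entry that merges the E state into the M state as described above. The directory layout, type names, and handling shown here are assumptions introduced for this example, not details of the embodiment.

#include <stdint.h>
#include <stdbool.h>

/* Directory states tracked by the inter-node coherence mechanism.
 * The E state is deliberately absent: data held exclusively by a
 * remote cache is recorded as MODIFIED whether or not it has actually
 * been modified, which is conservative but always correct. */
typedef enum { DIR_INVALID, DIR_SHARED, DIR_MODIFIED } dir_state_t;

typedef struct {
    dir_state_t state;
    uint8_t     sharers;   /* one bit per node holding a copy            */
    uint8_t     owner;     /* meaningful only when state == DIR_MODIFIED */
} dir_entry_t;

/* Record that a remote node has been granted an exclusive copy.
 * The node's local caches may still use MESI's E state internally,
 * but the directory treats the grant as M. */
static void dir_grant_exclusive(dir_entry_t *e, uint8_t node)
{
    e->state   = DIR_MODIFIED;                 /* E merged into M */
    e->owner   = node;
    e->sharers = (uint8_t)(1u << node);
}

/* A read by another node must first fetch the line from the owner,
 * because the directory must assume the owner's copy is dirty. */
static bool dir_read_requires_intervention(const dir_entry_t *e, uint8_t requester)
{
    return e->state == DIR_MODIFIED && e->owner != requester;
}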
[0015]
Interconnect architecture
Local interconnect 16 and node interconnect 22 can each be implemented with any of a variety of interconnect architectures. However, in a preferred embodiment, at least node interconnect 22 is implemented as a switch-based interconnect governed by the 6xx communication protocol developed by IBM Corporation of Armonk, New York. This point-to-point communication methodology permits node interconnect 22 to route address and data packets from a source processing node 8 only to processing nodes 8 within the same data processing subsystem.
[0016]
Local interconnect 16 and node interconnect 22 permit split transactions, meaning that no fixed timing relationship exists between the address tenure and the data tenure that make up a communication transaction, and that data packets may be ordered differently than the associated address packets. The utilization of local interconnect 16 and node interconnect 22 is also preferably enhanced by pipelining communication transactions, which permits a subsequent communication transaction to be sourced before the master of a previous communication transaction has received coherency responses from each recipient.
[0017]
Configuration utility
Referring now to FIG. 3, there is depicted a high-level logical flowchart of a process for partitioning and configuring a multi-node data processing system, such as data processing system 6, into one or more data processing subsystems in accordance with the present invention. As shown, the process begins at block 80 in response to all of processing nodes 8a-8d being powered on, and then proceeds to block 82, where the service processor 50 of each processing node 8 executes POST code from flash memory 48 to initialize the local hardware to a known, stable state. Following POST, each service processor 50 executes conventional BIOS routines to interface with key peripherals (e.g., the keyboard and display) and to initialize interrupt handling. Thereafter, as shown at block 84 and following blocks, a processor of each processing node 8 (i.e., service processor 50 and/or a processor 10) initiates execution of the BIOS configuration utility mentioned above by obtaining inputs specifying the number of independent data processing subsystems into which data processing system 6 is to be partitioned and the particular processing nodes 8 belonging to each data processing subsystem. The inputs shown at block 84 can be obtained from any of a number of sources, for example, a file residing on a data storage medium or operator input at one or more of processing nodes 8.
[0018]
In a preferred embodiment of the present invention, the inputs shown at block 84 are obtained from an operator at one or more processing nodes 8 in response to a series of menu screens displayed at those processing nodes 8. The inputs are then used at each processing node 8 to create a partition mask indicating which other processing nodes 8 are grouped with that processing node 8 to form a data processing subsystem. For example, if each of the four processing nodes 8 in data processing system 6 is assigned one bit of a 4-bit mask, a NUMA configuration including all of the processing nodes can be represented by 1111, two two-node NUMA subsystems can be represented by 0011 and 1100 (or 1010 and 0101), and a two-node NUMA subsystem together with two single-node subsystems can be represented by 0011, 1000, and 0100 (or other similar node combinations). If the inputs indicating the desired partitioning of data processing system 6 are supplied at fewer than all of processing nodes 8, the appropriate masks are transmitted to the other processing nodes 8 via node interconnect 22. In this manner, each processing node 8 that is grouped with other processing nodes 8 has a record of those processing nodes.
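The bit-per-node masks in the preceding example admit a very compact representation. The following C fragment is a sketch of that encoding only; the function names and the fixed four-node limit are assumptions made for illustration.

#include <stdbool.h>
#include <stdint.h>

#define NUM_NODES 4u          /* data processing system 6 has four nodes here */

/* One bit per processing node 8: bit i set means node i belongs to
 * this data processing subsystem. */
typedef uint8_t partition_mask_t;

/* Build a mask from a list of node numbers supplied, for example,
 * through the menu screens mentioned above. */
static partition_mask_t make_partition_mask(const unsigned *nodes, unsigned count)
{
    partition_mask_t mask = 0;
    for (unsigned i = 0; i < count; i++)
        mask |= (partition_mask_t)(1u << nodes[i]);
    return mask;
}

/* True if the given node is grouped into the partition described by mask. */
static bool node_in_partition(partition_mask_t mask, unsigned node)
{
    return ((mask >> node) & 1u) != 0;
}

/* Encodings from the text: 1111 (0x0F) is a single four-node NUMA system;
 * 0011 and 1100 (0x03, 0x0C) are two two-node NUMA subsystems; 0011, 1000,
 * and 0100 (0x03, 0x08, 0x04) are one two-node subsystem plus two
 * single-node subsystems. */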
[0019]
Following block 84, the process proceeds to block 86, where each data processing subsystem of data processing system 6 independently completes its own configuration, as described in detail below with respect to FIGS. 4 and 5. Thereafter, processing continues at block 88.
[0020]
Referring now to FIGS. 4 and 5, there are illustrated high-level logical flowcharts that together depict the process shown at block 86 of FIG. 3. The illustrated processes, which are preferably implemented as part of the BIOS configuration utility described above, are described together in order to show details of the communication between them.
[0021]
The process shown in FIG. 4 represents the operation of a processing node 8 that is a master, and the process shown in FIG. 5 represents the operation of a processing node 8 (if any) that is a client. Following block 84, blocks 100 and 140 are entered in parallel. As shown at blocks 102 and 142, respectively, each processing node 8 in a data processing subsystem determines whether it is the master processing node 8 responsible for completing the configuration of that data processing subsystem. Although the master processing node 8 of a data processing subsystem can be determined by any of a number of well-known mechanisms, including voting and competition, in a preferred embodiment the master processing node 8 defaults to the processing node 8 in the data processing subsystem corresponding to the least significant bit that is set in the partition mask. The master processor of the processing node 8 determined to be the master (i.e., either service processor 50 or a designated processor 10) then manages the configuration of its data processing subsystem, as described in detail at blocks 104-130 of FIG. 4.
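The default rule above (the master corresponds to the least significant set bit of the partition mask) reduces to a short scan. This sketch reuses the partition_mask_t type from the previous fragment and assumes, as in the 4-bit examples, that node i corresponds to bit i.

/* Return the node number acting as master for a non-empty partition:
 * the node whose bit is the least significant bit set in the mask. */
static unsigned partition_master(partition_mask_t mask)
{
    unsigned node = 0;
    while (((mask >> node) & 1u) == 0)
        node++;                       /* mask is assumed non-zero */
    return node;
}

/* Each node decides its own role (blocks 102 and 142). */
static bool i_am_master(partition_mask_t mask, unsigned my_node)
{
    return partition_master(mask) == my_node;
}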
[0022]
Referring to block 104, if there are any client processing nodes 8 belonging to the data processing subsystem, the master processor issues, on its local interconnect 16, a message targeting one such processing node 8. This message, represented by arrow A, asserts that its processing node 8 is the master. The message is snooped by the local node controller 20 and forwarded via node interconnect 22 to the indicated client processing node 8. As shown at blocks 144 and 146, respectively, the client processing node 8 waits until this message is received from the master and, in response to receipt of the message, transmits an acknowledgment message, indicated by arrow B, to the master processing node 8. As shown at blocks 106 and 108 of FIG. 4, the master waits until the acknowledgment message is received from the client processing node 8 and, after receiving the acknowledgment, returns to block 104 if the partition mask indicates that additional client processing nodes 8 have not yet been contacted. This master assertion-acknowledgment protocol (which may alternatively be executed in parallel for multiple client processing nodes 8) not only ensures that all processing nodes 8 in the data processing subsystem agree on which node is the master, but also advantageously serves to synchronize the various processing nodes 8 in the subsystem, which may be powered on at different times and boot at different speeds.
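One possible shape of the master assertion-acknowledgment exchange of blocks 104-108 and 144-146 is sketched below, continuing the earlier C fragments. The message names and the send/wait primitives are hypothetical; the embodiment leaves the transport to the node controllers 20 and node interconnect 22, which this sketch merely stubs out.

#include <stdio.h>

enum cfg_msg { MSG_MASTER_ASSERT, MSG_MASTER_ACK };

/* Stub transport for illustration only: these functions just print.
 * In the embodiment, such messages would be snooped by the local node
 * controller 20 and forwarded over node interconnect 22. */
static void send_msg(unsigned dest_node, enum cfg_msg m)
{
    printf("send message %d to node %u\n", (int)m, dest_node);
}
static void wait_msg(unsigned src_node, enum cfg_msg m)
{
    printf("wait for message %d from node %u\n", (int)m, src_node);
}

/* Master side: assert mastership to each client in the partition and
 * wait for its acknowledgment (arrows A and B).  Because every client
 * blocks until the assertion arrives, the exchange also synchronizes
 * nodes that powered on or booted at different speeds. */
static void master_assert_to_clients(partition_mask_t mask, unsigned my_node)
{
    for (unsigned node = 0; node < NUM_NODES; node++) {
        if (node == my_node || !node_in_partition(mask, node))
            continue;
        send_msg(node, MSG_MASTER_ASSERT);   /* block 104, arrow A      */
        wait_msg(node, MSG_MASTER_ACK);      /* blocks 106-108, arrow B */
    }
}

/* Client side (blocks 144-146). */
static void client_await_master(unsigned master_node)
{
    wait_msg(master_node, MSG_MASTER_ASSERT);
    send_msg(master_node, MSG_MASTER_ACK);
}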
[0023]
After the master processing node 8 has received acknowledgment of its mastership from all client processing nodes 8 (if any) in the data processing subsystem, as indicated by the process passing from block 108 to block 110 of FIG. 4, the master processing node 8 requests configuration information (e.g., a resource list) from the client processing nodes 8 (if any). This request for configuration information, which may comprise one or more messages to a client, is represented by arrow C. As indicated at blocks 148 and 150 of FIG. 5, a client processing node 8 waits for this resource list request and, in response to receiving it, responds by transmitting to the master processing node 8 one or more messages specifying its I/O resources, the amount of system memory 18 present, the number of processors 10 contained therein, and other configuration information. This configuration information response is represented by arrow D. Blocks 112 and 114 of FIG. 4 depict the master processing node 8 waiting for the response from the client processing node 8 and, after receiving the response, adding the specified resources to a subsystem resource list. As indicated at block 116, the master processing node 8 performs blocks 110-114 for each of the client processing nodes 8 specified in the partition mask.
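The subsystem resource list assembled at blocks 110-116 can be pictured as a simple aggregation structure. Continuing the sketch above, the fields below mirror the kinds of configuration information named in the text (processor count, amount of system memory 18, I/O resources); the exact contents and message format are not specified by the embodiment.

#include <stdint.h>

/* Per-node configuration information returned in response to the
 * master's resource-list request (arrows C and D).  Fields are
 * illustrative only. */
typedef struct {
    unsigned node_id;
    unsigned num_processors;
    uint64_t system_memory_bytes;
    unsigned num_io_devices;
} node_resources_t;

/* Aggregate resource list for one data processing subsystem. */
typedef struct {
    node_resources_t nodes[NUM_NODES];
    unsigned         count;
} subsystem_resources_t;

/* Master side of blocks 112-114: add one client's reported resources
 * to the subsystem resource list (no bounds check in this sketch). */
static void add_client_resources(subsystem_resources_t *subsys,
                                 const node_resources_t *client)
{
    subsys->nodes[subsys->count++] = *client;
}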
[0024]
After the master has obtained a resource list from each client (if any), as indicated by the process passing from block 116 to block 118 of FIG. 4, the master processor of the master processing node 8 determines the overall subsystem configuration and computes how the resources of each client processing node 8 are to be remapped. Next, at block 120, the master processor of the master processing node 8 transmits to a client processing node 8 (if any) one or more messages (represented by arrow E) specifying the manner in which that client processing node 8 is to remap its resources. For example, the master processor may specify to the memory controller 17 of the client processing node 8 the range of physical addresses to be associated with the storage locations of the system memory 18 being added. Furthermore, the master processor may specify memory-mapped addresses for the I/O devices 32 in the client processing node 8. In some embodiments, the master processor may also specify the processor ID of each processor 10 in the client processing node 8.
[0025]
In a preferred embodiment, all of the processors 10 in each data processing subsystem share a single physical memory space, meaning that each physical address is associated with one, and only one, storage location in the system memories 18. Thus, the overall contents of the data processing subsystem's system memory, which can generally be accessed by any processor 10 in that data processing subsystem, can be viewed as partitioned among the system memories 18 of the processing nodes 8 that make up the data processing subsystem. For example, in an illustrative embodiment in which each processing node 8 includes 1 GB of system memory 18 and data processing system 6 is configured as two NUMA data processing subsystems, each NUMA data processing subsystem has a 2 gigabyte (GB) physical address space.
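The 2 GB example corresponds to stacking each node's system memory 18 into one contiguous physical address space. Continuing the sketch above, the arithmetic is shown below; real firmware would also have to account for memory-mapped I/O ranges and other reservations, which this illustration ignores.

/* Assign the physical base address of each node's system memory when
 * the nodes of one subsystem share a single flat physical address
 * space: node k begins where the memory of nodes 0..k-1 ends. */
static void assign_memory_bases(const subsystem_resources_t *subsys,
                                uint64_t bases[])
{
    uint64_t next_base = 0;
    for (unsigned i = 0; i < subsys->count; i++) {
        bases[i]   = next_base;
        next_base += subsys->nodes[i].system_memory_bytes;
    }
}

/* With two nodes of 1 GB each, the bases come out to 0x00000000 and
 * 0x40000000, giving the subsystem the 2 GB physical address space
 * of the example above. */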
[0026]
As shown at blocks 152 and 154 of FIG. 5, a client processing node 8 waits for the remapping request from the master processing node 8 and, in response to receiving the remapping request, responds with an acknowledgment of the remapping request, represented by arrow F. As shown at blocks 122 and 124, the master processing node 8 waits for this remapping acknowledgment and, in response to receiving it, repeats blocks 120 and 122 for each of the other client processing nodes 8 indicated by the partition mask.
[0027]
Following block 124 of FIG. 4 and block 154 of FIG. 5, the master processing node 8 and each client processing node 8 remap their respective local resources in accordance with the remapping determined by the master processing node 8, as shown at blocks 126 and 156, respectively. As shown at block 158 of FIG. 5, each client processing node 8 then halts its processors 10 until the operating system (OS) of the data processing subsystem schedules work on them. By contrast, as shown at block 128 of FIG. 4, the master processing node 8 boots an operating system for its data processing subsystem, for example, from one of the storage devices 34. As noted above, when multiple data processing subsystems are formed from the processing nodes 8 of data processing system 6, the multiple data processing subsystems can run different operating systems, such as Windows NT and SCO (Santa Cruz Operation) UNIX. Thereafter, processing by the master processing node 8 continues at block 130.
[0028]
As explained above, the present invention provides a method for configuring a collection of interconnected processing nodes as either a single NUMA data processing system or a selected number of independently operable data processing subsystems. In accordance with the present invention, the partitioning of the processing nodes into multiple data processing subsystems is accomplished without connecting or disconnecting any of the processing nodes.
[0029]
While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention. For example, although aspects of the present invention have been described with respect to a computer system executing software that directs the method of the present invention, it should be understood that the present invention may alternatively be implemented as a computer program product for use with a computer system. Programs defining the functions of the present invention can be delivered to a computer system via a variety of signal-bearing media, which include, without limitation, non-rewritable storage media (e.g., CD-ROM), rewritable storage media (e.g., a floppy diskette or hard disk drive), and communication media such as computer networks and telephone networks. It should therefore be understood that such signal-bearing media, when carrying or encoding computer-readable instructions that direct the method functions of the present invention, represent alternative embodiments of the present invention.
[0030]
In summary, the following matters are disclosed regarding the configuration of the present invention.
[0031]
(1) A data processing system comprising:
a system interconnect;
a plurality of processing nodes coupled to the system interconnect, each of the plurality of processing nodes including at least one processor and a data storage device; and
a configuration utility residing in at least one system memory of the plurality of processing nodes, wherein the configuration utility selectively configures, through communication over the system interconnect, the plurality of processing nodes into one of a single NUMA (non-uniform memory access) system and a plurality of independent data processing systems.
(2) The data processing system according to (1) above, wherein at least one of the plurality of independent data processing systems is a NUMA (non-uniform memory access) system including at least two of the plurality of processing nodes.
(3) The data processing system according to (1) above, wherein the plurality of independent data processing systems comprise disjoint subsets of the plurality of processing nodes.
(4) The data processing system according to (1) above, wherein the data processing system includes boot code stored in at least one data storage device of the plurality of processing nodes, and the configuration utility forms a part of the boot code.
(5) The data processing system according to (1) above, wherein the communication includes a request for configuration information transmitted from a master processing node of the plurality of processing nodes to at least one other processing node of the plurality of processing nodes.
(6) The data processing system according to (5) above, wherein the communication further includes a response message transmitted from the at least one other processing node of the plurality of processing nodes to the master processing node, the response message including the requested configuration information.
(7) A method of configuring a plurality of interconnected processing nodes into one or more data processing systems, comprising the steps of:
coupling each of a plurality of processing nodes to a system interconnect, each of the plurality of processing nodes including at least one processor and a data storage device;
transmitting at least one configuration message over the system interconnect; and
utilizing the at least one configuration message to configure the plurality of processing nodes coupled to the system interconnect into one of a single NUMA (non-uniform memory access) system and a plurality of independent data processing systems.
(8) The method according to (7) above, wherein the step of configuring the plurality of processing nodes into a plurality of independent data processing systems includes the step of configuring at least some of the plurality of processing nodes into at least one NUMA (non-uniform memory access) subsystem including at least two of the plurality of processing nodes.
(9) The method according to (7) above, wherein the step of configuring the plurality of processing nodes into a plurality of independent data processing systems includes the step of configuring the plurality of processing nodes into a plurality of independent data processing systems comprising disjoint subsets of the plurality of processing nodes.
(10) The method according to (7) above, further comprising the steps of:
storing a configuration utility that forms a part of boot code in at least one data storage device of the plurality of processing nodes; and
executing the configuration utility to configure the plurality of processing nodes.
(11) The method according to (7) above, wherein transmitting at least one configuration message includes transmitting a request for configuration information from a master processing node of the plurality of processing nodes to at least one other processing node of the plurality of processing nodes.
(12) The method according to (11) above, wherein transmitting at least one configuration message further includes transmitting a response message from the at least one other processing node of the plurality of processing nodes to the master processing node, the response message including the requested configuration information.
(13) A program product for configuring a data processing system including a system interconnect coupled to a plurality of processing nodes, each of the plurality of processing nodes including at least one processor and a data storage device, the program product comprising:
a data processing system usable medium; and
a configuration utility encoded within the data processing system usable medium, wherein the configuration utility selectively configures, through communication over the system interconnect, the plurality of processing nodes into one of a single NUMA (non-uniform memory access) system and a plurality of independent data processing systems.
(14) The program product according to (13) above, wherein at least one of the plurality of independent data processing systems is a NUMA (non-uniform memory access) system including at least two of the plurality of processing nodes.
(15) The program product according to (13) above, wherein the plurality of independent data processing systems comprise disjoint subsets of the plurality of processing nodes.
(16) The program product according to (13) above, wherein the configuration utility forms a part of boot code.
(17) The program product according to (13) above, wherein the communication includes a request for configuration information transmitted from a master processing node of the plurality of processing nodes to at least one other processing node of the plurality of processing nodes.
(18) The program product according to (17) above, wherein the communication further includes a response message transmitted from the at least one other processing node of the plurality of processing nodes to the master processing node, the response message including the requested configuration information.
[Brief description of the drawings]
FIG. 1 illustrates an embodiment of a multiple node data processing system in which the present invention can be advantageously used.
FIG. 2 is a more detailed block diagram of a processing node in the data processing system shown in FIG.
FIG. 3 is a high level logic flow diagram illustrating a method for selectively partitioning the data processing system of FIG. 1 into one or more data processing subsystems.
FIG. 4 is a high level logic flow diagram of a method by which a master processing node configures a data processing subsystem according to an embodiment of the present invention.
FIG. 5 is a high level logic flow diagram of a method for configuring a client processing node according to an embodiment of the present invention.
[Explanation of symbols]
6 Data processing system
8 processing nodes
8a processing node
8b Processing node
8c processing node
8d processing node
10a processor
10b processor
10c processor
10d processor
10e processor
10f processor
10g processor
10h processor
10i processor
10j processor
10k processor
10l processor
10m processor
12 processor cores
14 Cache hierarchy
16 Local interconnect
17 Memory controller
18 System memory
20 node controller
22 node interconnection
24 Arbiter
26 Mezzanine Bus Bridge
30 Mezzanine bus
32 I / O devices
34 Storage device
36 Interconnect Coherence Unit (ICU)
38 Host Bridge
40 Memory bus
42 Nonvolatile Random Access Memory (NVRAM)
44 Service Processor Bus
46 Service Processor Dynamic Random Access Memory (SP DRAM)
48 Flash memory
50 Service processor

Claims (12)

  1. A data processing system comprising:
    a system interconnect;
    a plurality of processing nodes coupled to the system interconnect, each of the plurality of processing nodes including at least one processor and a data storage device; and
    a configuration utility residing in at least one system memory of the plurality of processing nodes, the configuration utility selectively configuring, through communication over the system interconnect and based on information regarding a partitioning of the data processing system, the plurality of processing nodes into one of a single NUMA (non-uniform memory access) system and a plurality of independent data processing systems,
    wherein the communication includes a request, transmitted from a master processing node of the plurality of processing nodes to other processing nodes belonging to the same partition as the master processing node, for a list of the resources possessed by the other processing nodes, and wherein the master processing node determines an overall configuration of the partition based on the obtained resource lists and computes a manner in which the listed resources are to be remapped.
  2. The data processing system according to claim 1, wherein at least one of the plurality of independent data processing systems is a non-uniform memory access (NUMA) system including at least two of the plurality of processing nodes.
  3. The data processing system of claim 1, wherein the plurality of independent data processing systems includes a disjoint subset of the plurality of processing nodes.
  4. The data processing system according to claim 1, wherein the data processing system includes boot code stored in at least one data storage device of the plurality of processing nodes, and the configuration utility forms a part of the boot code.
  5. A method of configuring a plurality of processing nodes coupled to a system interconnect into one or more data processing systems, each of the plurality of processing nodes including at least one processor and a data storage device, the method comprising the steps of:
    transmitting at least one configuration message over the system interconnect;
    utilizing the at least one configuration message, based on information regarding a partitioning of the data processing system, to configure the plurality of processing nodes into one of a single NUMA (non-uniform memory access) system and a plurality of independent data processing systems, wherein transmitting the at least one configuration message includes transmitting, from a master processing node of the plurality of processing nodes to other processing nodes belonging to the same partition as the master processing node, a request for a list of the resources possessed by the other processing nodes; and
    the master processing node determining an overall configuration of the partition based on the obtained resource lists and computing a manner in which the listed resources are to be remapped.
  6. The method according to claim 5, wherein the step of configuring includes the step of configuring at least some of the plurality of processing nodes into at least one NUMA (non-uniform memory access) subsystem comprising at least two of the plurality of processing nodes.
  7. The method according to claim 5, wherein configuring the plurality of processing nodes into a plurality of independent data processing systems includes configuring the plurality of processing nodes into a plurality of independent data processing systems comprising disjoint subsets of the plurality of processing nodes.
  8. The method according to claim 5, further comprising the steps of:
    storing a configuration utility that forms a part of boot code in at least one data storage device of the plurality of processing nodes; and
    executing the configuration utility to configure the plurality of processing nodes.
  9. A computer-readable recording medium recording a program for configuring a data processing system that includes a system interconnect coupled to a plurality of processing nodes, each of the plurality of processing nodes including at least one processor and a data storage device, the program causing a computer to perform the steps of:
    transmitting over the system interconnect, based on information regarding a partitioning of the data processing system, at least one configuration message constituting a request, from a master processing node of the plurality of processing nodes to other processing nodes belonging to the same partition as the master processing node, for a list of the resources of the other processing nodes; and
    configuring the plurality of processing nodes into one of a single NUMA (non-uniform memory access) system and a plurality of independent data processing systems, wherein the master processing node determines an overall configuration of the partition based on the obtained resource lists and computes a manner in which the listed resources are to be remapped.
  10. The computer-readable recording medium according to claim 9, wherein at least one of the plurality of independent data processing systems is a NUMA (non-uniform memory access) system including at least two of the plurality of processing nodes.
  11. The computer-readable recording medium according to claim 9, wherein the plurality of independent data processing systems comprise disjoint subsets of the plurality of processing nodes.
  12. The computer-readable recording medium according to claim 9, wherein the configuration utility forms a part of boot code.
JP2000180619A 1999-06-17 2000-06-15 Interconnected processing nodes configurable as at least one NUMA (NON-UNIFORMMOMERYACCESS) data processing system Expired - Fee Related JP3628595B2 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US09/335,301 US6421775B1 (en) 1999-06-17 1999-06-17 Interconnected processing nodes configurable as at least one non-uniform memory access (NUMA) data processing system
US09/335301 1999-06-17

Publications (2)

Publication Number Publication Date
JP2001051959A JP2001051959A (en) 2001-02-23
JP3628595B2 true JP3628595B2 (en) 2005-03-16

Family

ID=23311187

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2000180619A Expired - Fee Related JP3628595B2 (en) 1999-06-17 2000-06-15 Interconnected processing nodes configurable as at least one NUMA (NON-UNIFORMMOMERYACCESS) data processing system

Country Status (4)

Country Link
US (1) US6421775B1 (en)
JP (1) JP3628595B2 (en)
SG (1) SG91873A1 (en)
TW (1) TW457437B (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6714994B1 (en) * 1998-12-23 2004-03-30 Advanced Micro Devices, Inc. Host bridge translating non-coherent packets from non-coherent link to coherent packets on conherent link and vice versa
US6519649B1 (en) * 1999-11-09 2003-02-11 International Business Machines Corporation Multi-node data processing system and communication protocol having a partial combined response
US6591307B1 (en) 1999-11-09 2003-07-08 International Business Machines Corporation Multi-node data processing system and method of queue management in which a queued operation is speculatively cancelled in response to a partial combined response
US6519665B1 (en) 1999-11-09 2003-02-11 International Business Machines Corporation Multi-node data processing system and communication protocol in which a stomp signal is propagated to cancel a prior request
US6848003B1 (en) 1999-11-09 2005-01-25 International Business Machines Corporation Multi-node data processing system and communication protocol that route write data utilizing a destination ID obtained from a combined response
US6671712B1 (en) 1999-11-09 2003-12-30 International Business Machines Corporation Multi-node data processing system having a non-hierarchical interconnect architecture
US6865695B2 (en) * 2001-07-26 2005-03-08 International Business Machines Corpoation Robust system bus recovery
JP2003173325A (en) * 2001-12-06 2003-06-20 Hitachi Ltd Initialization method and power supply cutting method for computer system
US6973544B2 (en) * 2002-01-09 2005-12-06 International Business Machines Corporation Method and apparatus of using global snooping to provide cache coherence to distributed computer nodes in a single coherent system
US6807586B2 (en) * 2002-01-09 2004-10-19 International Business Machines Corporation Increased computer peripheral throughput by using data available withholding
US7171568B2 (en) * 2003-06-13 2007-01-30 International Business Machines Corporation Remote power control in a multi-node, partitioned data processing system
US7194660B2 (en) * 2003-06-23 2007-03-20 Newisys, Inc. Multi-processing in a BIOS environment
US7007128B2 (en) * 2004-01-07 2006-02-28 International Business Machines Corporation Multiprocessor data processing system having a data routing mechanism regulated through control communication
US7308558B2 (en) * 2004-01-07 2007-12-11 International Business Machines Corporation Multiprocessor data processing system having scalable data interconnect and data routing mechanism
US7484122B2 (en) * 2004-06-17 2009-01-27 International Business Machines Corporation Controlling timing of execution of test instruction by target computing device
JP4945949B2 (en) * 2005-08-03 2012-06-06 日本電気株式会社 Information processing device, CPU, information processing device activation method, and program
US7640426B2 (en) * 2006-03-31 2009-12-29 Intel Corporation Methods and apparatus to manage hardware resources for a partitioned platform
US7702893B1 (en) * 2006-09-22 2010-04-20 Altera Corporation Integrated circuits with configurable initialization data memory addresses
US7818508B2 (en) * 2007-04-27 2010-10-19 Hewlett-Packard Development Company, L.P. System and method for achieving enhanced memory access capabilities
US20080270708A1 (en) * 2007-04-30 2008-10-30 Craig Warner System and Method for Achieving Cache Coherency Within Multiprocessor Computer System
US7904676B2 (en) * 2007-04-30 2011-03-08 Hewlett-Packard Development Company, L.P. Method and system for achieving varying manners of memory access
KR101249831B1 (en) * 2007-08-06 2013-04-05 삼성전자주식회사 Computer system and method for booting the same
ITMI20071829A1 * 2007-09-21 2009-03-22 Screenlogix S R L Machine architecture composed of a software level and a hardware level that interact independently of the starting configuration of said machine, and process for realizing said machine architecture
US8782779B2 (en) * 2007-09-26 2014-07-15 Hewlett-Packard Development Company, L.P. System and method for achieving protected region within computer system
US8612973B2 (en) * 2007-09-26 2013-12-17 Hewlett-Packard Development Company, L.P. Method and system for handling interrupts within computer system during hardware resource migration
US9207990B2 (en) * 2007-09-28 2015-12-08 Hewlett-Packard Development Company, L.P. Method and system for migrating critical resources within computer systems

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4925311A (en) * 1986-02-10 1990-05-15 Teradata Corporation Dynamically partitionable parallel processors
US5561768A (en) * 1992-03-17 1996-10-01 Thinking Machines Corporation System and method for partitioning a massively parallel computer system
US5642506A (en) * 1994-12-14 1997-06-24 International Business Machines Corporation Method and apparatus for initializing a multiprocessor system
US5887146A (en) 1995-08-14 1999-03-23 Data General Corporation Symmetric multiprocessing computer with non-uniform memory access architecture
US5710907A (en) 1995-12-22 1998-01-20 Sun Microsystems, Inc. Hybrid NUMA COMA caching system and methods for selecting between the caching modes
US5887138A (en) 1996-07-01 1999-03-23 Sun Microsystems, Inc. Multiprocessing computer system employing local and global address spaces and COMA and NUMA access modes
US5938765A (en) * 1997-08-29 1999-08-17 Sequent Computer Systems, Inc. System and method for initializing a multinode multiprocessor computer system
EP0908825B1 (en) * 1997-10-10 2002-09-04 Bull S.A. A data-processing system with cc-NUMA (cache coherent, non-uniform memory access) architecture and remote access cache incorporated in local memory
JP3614650B2 (en) * 1998-03-20 2005-01-26 富士通株式会社 Multiprocessor control system and boot device and boot control device used therefor
US6247109B1 (en) * 1998-06-10 2001-06-12 Compaq Computer Corp. Dynamically assigning CPUs to different partitions each having an operation system instance in a shared memory space
US6275907B1 (en) * 1998-11-02 2001-08-14 International Business Machines Corporation Reservation management in a non-uniform memory access (NUMA) data processing system
US6108764A (en) * 1998-12-17 2000-08-22 International Business Machines Corporation Non-uniform memory access (NUMA) data processing system with multiple caches concurrently holding data in a recent state from which data can be sourced by shared intervention
US6148361A (en) * 1998-12-17 2000-11-14 International Business Machines Corporation Interrupt architecture for a non-uniform memory access (NUMA) data processing system

Also Published As

Publication number Publication date
TW457437B (en) 2001-10-01
SG91873A1 (en) 2002-10-15
US6421775B1 (en) 2002-07-16
JP2001051959A (en) 2001-02-23


Legal Events

Date Code Title Description
A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20040127

A601 Written request for extension of time

Free format text: JAPANESE INTERMEDIATE CODE: A601

Effective date: 20040413

A602 Written permission of extension of time

Free format text: JAPANESE INTERMEDIATE CODE: A602

Effective date: 20040416

A521 Written amendment

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20040709

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20041130

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20041208

R150 Certificate of patent or registration of utility model

Free format text: JAPANESE INTERMEDIATE CODE: R150

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20071217

Year of fee payment: 3

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20081217

Year of fee payment: 4

LAPS Cancellation because of no payment of annual fees