JP2000215189A

JP2000215189A - Multiprocessor computer architecture accompanied by many operating system instances and software control system resource allocation

Info

Publication number: JP2000215189A
Application number: JP10350660A
Authority: JP
Inventors: Stephen H Zalewski; Andrew H Mason; Gregory H Jordan; Karen L Noel; James R Kauffman; Paul K Harter Jr; Frederick G Kleinsorge; Stephen F Shirron
Original assignee: Digital Equipment Corp
Current assignee: Digital Equipment Corp
Priority date: 1998-11-04
Filing date: 1998-11-04
Publication date: 2000-08-04

Abstract

PROBLEM TO BE SOLVED: To obtain a computer system that provides flexibility, resource availability, and scalability by running operating system instances in separate partitions. SOLUTION: A multiprocessing computer system 200 logically divides its hardware into partitions 202, 204, and 206. Hardware elements are assigned so that multiple operating system instances 208, 210, and 212 can execute concurrently. This allocation is performed by console programs 213, 215, and 217. The operating system instances 208, 210, and 212 are responsible for using system resources appropriately and for coordinating resource allocation and sharing. Serial lines 220, 222, and 224 are connected to a single multiplexer 226 attached to a workstation 228 to display console information.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

TECHNICAL FIELD OF THE INVENTION

The present invention relates to multiprocessor computer architectures in which processors and other computer hardware resources are grouped in partitions, each of which has an operating system instance, and more particularly to methods and apparatus for assigning computer hardware resources to partitions.

[0002]

BACKGROUND OF THE INVENTION

The efficient operation of many applications in present computing environments depends on fast, powerful and flexible computing systems. The configuration and design of such systems becomes very complicated when they are to be used in an "enterprise" commercial environment with many separate departments, many different problem types and continually changing computing needs. Users in such environments generally want to be able to change the capacity of the system, its speed and its configuration quickly and easily. They also want to expand the working capacity of the system and change its configuration to achieve better use of resources without stopping the execution of application programs on the system. In addition, they want to configure the system to maximize resource availability so that each application has an optimal computing configuration.

[0003]

PROBLEMS TO BE SOLVED BY THE INVENTION

Traditionally, computing speed has been addressed by using a "shared nothing" computing architecture, in which data, business logic and graphic user interfaces are separate tiers, each with specific computing resources dedicated to it. Initially, a single central processing unit was used, and the power and speed of such a computing system was increased by raising the clock rate of that unit. More recently, computing systems have been developed that use many processors working as a team rather than one large processor working alone. In this way, a complex application can be distributed among many processors instead of waiting to be executed by a single processor. Such systems generally consist of a number of central processing units (CPUs) controlled by a single operating system. In a form of multiprocessor system called "symmetric multiprocessing" or SMP, applications are distributed equally across all processors. The processors also share memory. In another form called "asymmetric multiprocessing" or AMP, one processor acts as a "master" and all of the other processors act as "slaves". Therefore, all operations, including the operating system, must pass through the master before being passed on to the slave processors. These multiprocessing architectures have the advantage that performance can be increased by adding processors, but they also have the disadvantages that the software running on such systems must be carefully written to take advantage of the multiple processors and that it is difficult to scale the software as the number of processors increases. Current commercial workloads cannot scale beyond 8 to 24 CPUs as a single SMP system, the exact number depending on the platform, operating system and application mix.

[0004] For increased performance, another typical answer has been to dedicate computer resources (machines) to an application in order to tune the machine resources optimally to that application. However, most users have not adopted this approach, because most sites have many applications and separate databases developed by different vendors. It is therefore difficult and expensive to dedicate resources among all of the applications, especially in environments where the application mix is constantly changing. Alternatively, a computer system can be partitioned in hardware so that a subset of the resources in the computer is available to a specific application. This approach avoids dedicating resources permanently, since the partitions can be changed, but it still leaves open the problems of improving performance through load balancing of resources among partitions and of resource availability.

[0005] The availability and maintainability issues have been addressed by a "shared everything" model, in which a large, centralized, robust server containing most of the resources is networked with, and services, many small, uncomplicated client network computers. Alternatively, "clusters" are used, in which each system or "node" has its own memory and is controlled by its own operating system. The systems interact by sharing disks and by passing messages among themselves over some type of communication network. A cluster system has the advantage that additional systems can easily be added to the cluster. However, networks and clusters suffer from a lack of shared memory and from limited interconnect bandwidth, both of which limit performance.

[0006] In many enterprise computing environments, it is clear that the two separate computing models must be accommodated simultaneously and each model optimized. Several known approaches have been used to attempt this accommodation. For example, a design called a "virtual machine" or VM, developed and marketed by International Business Machines Corporation of Armonk, New York, uses a single physical machine with one or more physical processors in combination with software that simulates multiple virtual machines. Each of these virtual machines has, in principle, access to all the physical resources of the underlying real computer. The assignment of resources to each virtual machine is controlled by a program called a "hypervisor". There is only one hypervisor in the system, and it is responsible for all the physical resources. Consequently, the hypervisor, not the other operating systems, deals with the allocation of physical hardware. The hypervisor intercepts requests for resources from the other operating systems and handles those requests in a globally correct way.

[0007] The VM architecture supports the concept of a "logical partition" or LPAR. Each LPAR contains some of the available physical CPUs and resources, which are logically assigned to the partition. The same resources can be assigned to more than one partition. LPARs are set up statically by an administrator, but can respond to changes in load dynamically, and without rebooting, in several ways. For example, if two logical partitions, each containing ten CPUs, share a physical system containing ten physical CPUs, and the two logical ten-CPU partitions have complementary peak loads, each partition can take over the entire physical ten-CPU system as the workload shifts, without a reboot or operator intervention.

[0008] In addition, the CPUs logically assigned to each partition can be turned "on" and "off" dynamically, without rebooting, by normal operating system operator commands. The only limitation is that the number of CPUs active at system initialization is the maximum number of CPUs that can be turned "on" in any partition. Furthermore, when the aggregate workload demand of all partitions exceeds what the physical system can deliver, LPAR weights can be used to define how much of the total CPU resource is given to each partition. These weights can be changed by operators on the fly without any disruption. Another known system is called "Parallel Sysplex" and was also developed and marketed by International Business Machines Corporation. This architecture consists of a set of computers that are clustered through a hardware entity called a "coupling facility" attached to each CPU. The coupling facilities on each node are connected by a fiber-optic link, and each node operates as a conventional SMP machine with a maximum of ten CPUs. Certain CPU instructions invoke the coupling facility directly. For example, a node registers a data structure with the coupling facility, and the coupling facility then takes care of keeping the data structure coherent within the local memory of each node.
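The LPAR weighting mentioned above can be pictured with a short Python sketch. The text only says that weights define how much of the total CPU resource each partition receives; the strictly proportional rule used here, the function name `cpu_shares` and the partition names are assumptions made for illustration, not details of the IBM system.

```python
from fractions import Fraction


def cpu_shares(weights, physical_cpus):
    """Split physical CPU capacity among partitions in proportion to weight.

    Assumption: a strictly proportional split. The source does not state
    the exact rule that LPAR weights follow.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one partition needs a positive weight")
    return {name: Fraction(w, total) * physical_cpus
            for name, w in weights.items()}


# Two hypothetical partitions competing for ten physical CPUs.
shares = cpu_shares({"A": 3, "B": 1}, physical_cpus=10)
```

Changing a weight and calling `cpu_shares` again mirrors the on-the-fly adjustment described in the paragraph: nothing is rebooted, only the division of capacity changes.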

[0009] The Enterprise 10000 Unix server, developed and marketed by Sun Microsystems of Mountain View, California, uses a partitioning arrangement called "Dynamic System Domains" to logically divide the resources of a single physical server into multiple partitions, or domains, each of which operates as a stand-alone server. Each partition has CPUs, memory and I/O hardware. Dynamic reconfiguration allows a system administrator to create, resize or delete domains on the fly, without rebooting. Each domain remains logically isolated from every other domain in the system, which isolates it completely from any software error, or any CPU, memory or I/O error, generated by another domain. No resources are shared between any of the domains.

[0010] The Hive Project conducted at Stanford University uses an architecture structured as a set of cells. When the system boots, each cell is assigned a range of nodes that it owns throughout execution. Each cell manages the processors, memory and I/O devices on those nodes as if it were an independent operating system. The cells cooperate to present the illusion of a single system to user-level processes. Hive cells are not responsible for deciding how to divide their resources between local and remote requests. Each cell is responsible only for maintaining its internal resources and for optimizing performance within the resources it has been allocated. Global resource allocation is carried out by a user-level process called "wax". The Hive system attempts to prevent data corruption by using fault containment boundaries between the cells. To implement the tight sharing expected of a multiprocessor system despite the fault containment boundaries between cells, resource sharing is implemented through the cooperation of the various cell kernels, but the policy is implemented outside the kernels, in the wax process. Both memory and processors can be shared.

[0011] A system called "Cellular IRIX", developed and marketed by Silicon Graphics, Inc. of Mountain View, California, supports modular computing by extending traditional symmetric multiprocessing systems. The Cellular IRIX architecture distributes global kernel text and data into optimal SMP-sized chunks, or "cells". A cell represents a control domain consisting of one or more machine modules, where each module consists of processors, memory and I/O. Applications running on these cells rely heavily on a full set of local operating system services, including local copies of operating system text and kernel data structures. Only one instance of the operating system exists on the entire system. Inter-cell coordination allows application images to use processing, memory and I/O resources from other cells directly and transparently, without incurring the overhead of data copies or extra context switches.

[0012] Another existing architecture, called NUMA-Q, developed and marketed by Sequent Computer Systems, Inc. of Beaverton, Oregon, uses "quads", that is, groups of four processors per portion of memory, as the basic building block for NUMA-Q SMP nodes. Adding I/O to each quad further improves performance. The NUMA-Q architecture therefore not only distributes physical memory but also places a predetermined number of processors and PCI slots next to each portion. The memory in each quad is not local memory in the traditional sense. Rather, it is one third of the physical memory address space and has a specific address range. The address map is divided evenly over memory, with each quad containing a contiguous portion of the address space. Only one copy of the operating system is running and, as in any SMP system, it resides in memory and runs processes simultaneously, and without distinction, on one or more processors.

[0013] Accordingly, although many attempts have been made to provide a flexible computer system with maximum resource availability and scalability, every existing system has significant shortcomings. It is therefore desirable to have a new computer system design that provides improved flexibility, resource availability and scalability.

[0014]

MEANS FOR SOLVING THE PROBLEMS

In accordance with the principles of the present invention, multiple instances of operating systems execute cooperatively in a single multiprocessor computer in which all processors and resources are electrically connected together. The single physical machine, with multiple physical processors and resources, is adaptively subdivided by software into multiple partitions, each with the ability to run a distinct copy, or instance, of an operating system. Each partition has access to its own physical resources plus resources designated as shared. In accordance with one embodiment, the partitioning of resources is performed by assigning resources within a configuration. More particularly, software logically and adaptively partitions CPUs, memory and I/O ports by assigning them together. An instance of an operating system is then loaded on a partition. At different times, different operating system instances may be loaded on a given partition. This partitioning, which a system manager directs, is a software function; no hardware boundaries are required. Each individual instance has the resources it needs to execute independently. Resources such as CPUs and memory can be dynamically assigned to different partitions, and used by the operating system instances running within the machine, by modifying the configuration. The partitions themselves can also be changed, without rebooting the system, by modifying the configuration tree. The resulting adaptively-partitioned multiprocessing (APMP) system exhibits both scalability and high performance.
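The idea that partitioning is purely a matter of assigning resources within a configuration can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the flat owner map, the class name `Configuration` and the resource names are hypothetical, and the actual configuration tree is not described in this portion of the text.

```python
from dataclasses import dataclass, field


@dataclass
class Configuration:
    # Hypothetical flat stand-in for the configuration: maps a resource
    # name (e.g. "cpu0") to the number of the partition that owns it.
    owner: dict = field(default_factory=dict)

    def assign(self, resource, partition):
        """Assign (or reassign) a resource by editing the configuration."""
        self.owner[resource] = partition

    def resources_of(self, partition):
        """Return the sorted names of the resources a partition owns."""
        return sorted(r for r, p in self.owner.items() if p == partition)


config = Configuration()
for res, part in [("cpu0", 0), ("cpu1", 0), ("mem0", 0), ("io0", 0),
                  ("cpu2", 1), ("mem1", 1), ("io1", 1)]:
    config.assign(res, part)

# Repartitioning is a change to the configuration, not to the hardware:
# moving cpu1 from partition 0 to partition 1 is a single update.
config.assign("cpu1", 1)
```

Nothing in the sketch touches a hardware boundary, which is the point of the paragraph: the same physical CPUs, memory and I/O ports are regrouped only by changing which partition the configuration names as their owner.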

[0015]

DESCRIPTION OF THE PREFERRED EMBODIMENTS

The above and further advantages of the invention will be better understood from the following detailed description, taken in conjunction with the accompanying drawings. A computer platform constructed in accordance with the principles of the present invention is a multiprocessor system that can be partitioned so that multiple instances of operating system software execute concurrently. The system does not require hardware support for partitioning its memory, CPUs and I/O subsystems, but some hardware may be used to provide additional assistance in isolating faults and minimizing the cost of software engineering. The interfaces and data structures required to support the software architecture of the invention are described below. The interfaces and data structures described do not imply that a specific operating system must be used, or that only a single type of operating system will execute concurrently. Any operating system that implements the software requirements described below can participate in the operation of the inventive system.

System Building Blocks

[0016] The software architecture of the invention operates on a hardware platform that incorporates multiple CPUs, memory and I/O hardware. A modular architecture such as that shown in FIG. 1 is preferably used, but it will be apparent to those skilled in the art that other architectures can also be used, and these architectures need not be modular. FIG. 1 shows a computer system constructed of four basic system building blocks (SBBs) 100-106. In the illustrated embodiment, each building block, such as block 100, is identical and comprises several CPUs 108-114, several memory slots (shown collectively as memory 120), an I/O processor 118, and a port 116 containing a switch (not shown) that can connect the system to another system. In other embodiments, however, the building blocks need not be identical. Large multiprocessor systems can be constructed by connecting the desired number of system building blocks through their ports. Switch technology, rather than bus technology, is used to connect the building block components, both to obtain improved bandwidth and to allow for a non-uniform memory architecture (NUMA).

[0017] In accordance with the principles of the invention, the hardware switches are arranged so that each CPU can address all available memory and I/O ports regardless of the number of building blocks configured, as shown schematically by line 122. In addition, every CPU can communicate with any or all of the other CPUs in all SBBs by conventional mechanisms such as inter-processor interrupts. Consequently, the CPUs and other hardware resources can be associated solely by software. Such a platform architecture is inherently scalable, so that large amounts of processing power, memory and I/O are available in a single computer. An APMP computer system 200 constructed in accordance with the principles of the present invention is shown, from a software point of view, in FIG. 2. In this system, the hardware components have been allocated so that multiple operating system instances 208, 210 and 212 can execute concurrently.

[0018] In a preferred embodiment, this allocation is performed by a software program called a "console" program, which is loaded into memory at power-up, as described in detail below. Console programs are shown schematically in FIG. 2 as programs 213, 215 and 217. The console program may be a modification of an existing administrative program or a separate program, and it interacts with an operating system to control the operation of the preferred embodiment. The console program does not virtualize the system resources; that is, it does not create a software layer between the running operating systems 208, 210 and 212 and the physical hardware, such as memory and I/O units (not shown in FIG. 2). Nor is the state of the running operating systems 208, 210 and 212 swapped in order to give them access to the same hardware. Instead, the inventive system logically divides the hardware into partitions. It is the responsibility of the operating system instances 208, 210 and 212 to use the resources appropriately and to coordinate resource allocation and sharing. The hardware platform may optionally provide hardware assistance for dividing resources, and may provide fault barriers to minimize the possibility that an operating system corrupts memory or adversely affects devices controlled by another operating system copy.

[0019] The execution environment for a single copy of an operating system, such as copy 208, is called a "partition" 202, and the executing operating system 208 in partition 202 is called "instance" 208. Each operating system instance can boot and run independently of all other operating system instances in the computer system, and can cooperatively take part in sharing resources between operating system instances, as described below. To run an operating system instance, a partition must include a hardware restart parameter block (HWRPB), a copy of a console program, some amount of memory, one or more CPUs, and at least one I/O bus, which must have a dedicated physical port for the console. The HWRPB is a configuration block that is passed between the console program and the operating system.
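The list of things a partition must contain before it can run an operating system instance lends itself to a simple check. The following Python sketch encodes that list directly; the class names, field names and the megabyte unit are hypothetical, and the real HWRPB layout is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class IOBus:
    # True if this bus carries a dedicated physical port for the console.
    has_console_port: bool = False


@dataclass
class Partition:
    has_hwrpb: bool = False          # hardware restart parameter block present
    has_console_copy: bool = False   # copy of the console program present
    memory_mb: int = 0               # "some amount of memory" (unit assumed)
    cpus: list = field(default_factory=list)
    io_buses: list = field(default_factory=list)


def can_run_instance(p):
    """Check the requirements the text lists for running an OS instance."""
    return bool(
        p.has_hwrpb
        and p.has_console_copy
        and p.memory_mb > 0
        and len(p.cpus) >= 1
        and any(bus.has_console_port for bus in p.io_buses)
    )


ready = Partition(has_hwrpb=True, has_console_copy=True, memory_mb=512,
                  cpus=["cpu0"], io_buses=[IOBus(has_console_port=True)])
no_console = Partition(has_hwrpb=True, has_console_copy=True, memory_mb=512,
                       cpus=["cpu0"], io_buses=[IOBus()])
```

Here `ready` satisfies every requirement, while `no_console` fails only because none of its I/O buses has a console port.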

[0020] Each of the console programs 213, 215 and 217 is connected to a console port, shown as ports 214, 216 and 218, respectively. Console ports, such as ports 214, 216 and 218, generally take the form of a serial line port or of attached graphics, keyboard and mouse options. For the purposes of the inventive computer system, the ability to support a dedicated graphics port and associated input devices is not required, although a specific operating system may require it. The basic assumption is that a serial port is sufficient for each partition. Although a separate terminal or an independent graphics console could be used to display the information generated by each console, the serial lines 220, 222 and 224 are preferably all connected to a single multiplexer 226 attached to a workstation, PC or LAT 228 for display of console information.

[0021] It is important to note that partitions are not synonymous with system building blocks. For example, partition 202 may comprise the hardware in building blocks 100 and 106 of FIG. 1, while partitions 204 and 206 may comprise the hardware in building blocks 102 and 104, respectively. A partition may also include part of the hardware in a building block. Partitions can be "initialized" or "uninitialized". An initialized partition has sufficient resources to execute an operating system instance, has a console program image loaded, and has a primary CPU available and executing. An initialized partition may be under the control of a console program or may be executing an operating system instance. In the initialized state, a partition has full ownership and control of the hardware components assigned to it, and only the partition itself can release those components.

【００２２】本発明の原理によれば、リソースは、１つ
の初期化された区画から別の区画へ再指定することがで
きる。リソースの再指定は、リソースが現在指定されて
いる初期化された区画でしか実行できない。ある区画が
非初期化状態にあるときには、他の区画は、そのハード
ウェア要素を再指定したり、それを削除したりすること
ができる。非初期化の区画とは、コンソールプログラム
又はオペレーティングシステムの制御下で実行する一次
ＣＰＵをもたない区画である。例えば、ある区画は、パ
ワーアップ時に一次ＣＰＵを実行するに充分なリソース
がないために非初期化とされるか、又はシステムアドミ
ニストレータがコンピュータシステムを再構成するとき
に非初期化とされる。非初期化状態にあるときは、ある
区画は、そのハードウェア要素を再指定することもでき
るし、その区画を別の区画によって削除することもでき
る。非指定のリソースは、いずれの区画によって指定す
ることもできる。In accordance with the principles of the present invention, resources can be reassigned from one initialized partition to another. Reassignment of a resource can only be performed by the initialized partition to which the resource is currently assigned. When a partition is in the uninitialized state, other partitions can reassign its hardware elements and can delete it. An uninitialized partition is a partition that has no primary CPU executing under the control of either a console program or an operating system. For example, a partition may be uninitialized at power-up because it lacks sufficient resources to run a primary CPU, or it may be uninitialized while a system administrator is reconfiguring the computer system. When in the uninitialized state, a partition may reassign its hardware elements and may be deleted by another partition. Unassigned resources can be assigned by any partition.
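The reassignment rules above can be sketched as a small predicate. This is only an illustration: the type names `partition` and `resource` and the function `may_reassign` are hypothetical and do not appear in the patent.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical types for illustration; not taken from the patent. */
typedef enum { PART_UNINITIALIZED, PART_INITIALIZED } part_state;

typedef struct {
    int        id;
    part_state state;
} partition;

typedef struct {
    const partition *owner;   /* NULL means the resource is unassigned */
} resource;

/* Returns true if `requester` may reassign `res` under the rules above. */
bool may_reassign(const resource *res, const partition *requester)
{
    if (res->owner == NULL)
        return true;                      /* unassigned: any partition may assign it */
    if (res->owner->state == PART_INITIALIZED)
        return res->owner == requester;   /* only the owning initialized partition */
    return true;                          /* owner is uninitialized: others may reassign */
}
```

The sketch deliberately ignores communities; whether two partitions may share or exchange a resource at all is a separate check.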

【００２３】区画は、協働リソース共用を許すために個
別の実行コンテクストをグループ分けするための基礎と
なる「コミュニティ」に編成することができる。同じコ
ミュニティ内の区画は、リソースを共用することができ
る。同じコミュニティ内にない区画は、リソースを共用
できない。リソースは、同じコミュニティにない区画間
では、システムアドミニストレータによりリソースを指
定解除し（及び使用を停止し）そしてリソースを手動で
再構成することによって手動で移動するしかない。コミ
ュニティは、独立したオペレーティングシステムドメイ
ンを形成したり、又はハードウェア使用のためのユーザ
ポリシーを実施したりするのに使用できる。図２におい
て、区画２０２及び２０４は、コミュニティ２３０へと
編成されている。区画２０６は、それ自身のコミュニテ
ィ２０５にある。これらのコミュニティは、以下に述べ
る構成ツリーを用いて形成することができ、そしてハー
ドウェアにより実施することができる。Partitions can be organized into "communities", which provide the basis for grouping separate execution contexts so as to allow cooperative resource sharing. Partitions within the same community can share resources. Partitions that are not in the same community cannot share resources. Between partitions that are not in the same community, a resource can only be moved manually, by the system administrator de-assigning the resource (and stopping its use) and then reconfiguring it by hand. Communities can be used to form independent operating system domains or to enforce user policies for hardware usage. In FIG. 2, partitions 202 and 204 are organized into community 230. Partition 206 is in its own community 205. These communities can be constructed using the configuration tree described below and can be enforced by hardware.

【００２４】コンソールプログラム本発明の原理により構成されたコンピュータシステムが
プラットホームにおいてイネーブルされたときには、多
数のＨＷＲＰＢを形成しなければならず、多数のコンソ
ールプログラムコピーをロードしなければならず、そし
て各ＨＷＲＰＢがシステムの特定の要素に関連するよう
にシステムリソースを指定しなければならない。これを
行うために、実行すべき第１のコンソールプログラム
は、システムないの全てのハードウェアを表わす構成ツ
リー構造体をメモリに形成する。このツリーは、又、ソ
フトウェア区画化情報、及び区画に対するハードウェア
指定も含み、このツリーについては、以下で詳細に述べ
る。より詳細には、ＡＰＭＰシステムがパワーアップさ
れるときには、システムが動作しているプラットホーム
に特有のハードウェアにより、あるＣＰＵが従来のやり
方で一次ＣＰＵとして選択される。一次ＣＰＵは、次い
で、コンソールプログラムのコピーをメモリにロードす
る。このコンソールコピーは、「マスターコンソール」
プログラムと称される。一次ＣＰＵは、最初に、マスタ
ーコンソールプログラムの制御の下で動作し、マシン全
体を所有している単一のシステムが存在するという仮定
でテスト及びチェックを実行する。その後、システム区
画を定義する１組の環境変数がロードされる。最終的
に、マスターコンソールは、環境変数に基づいて区画を
形成しそして初期化する。この後者のプロセスにおい
て、マスターコンソールは、構成ツリーを形成し、付加
的なＨＷＲＰＢデータブロックを形成し、付加的なコン
ソールプログラムコピーをロードし、そして別のＨＷＲ
ＰＢにおけるＣＰＵを始動するように動作する。各区画
は、次いで、そこで実行されるオペレーティングシステ
ムインスタンスを有し、このインスタンスは、これも又
その区画で実行されるコンソールプログラムコピーと協
働する。非構成のＡＰＭＰシステムでは、マスターコン
ソールプログラムは、一次ＣＰＵと、最低量のメモリ
と、プラットホーム特定のやり方で選択された物理的な
システムアドミニストレータのコンソールとを含む単一
の区画を最初に形成する。コンソールプログラムコマン
ドは、次いで、システムアドミニストレータが付加的な
区画を形成すると共に、各区画に対するＩ／Ｏバス、メ
モリ及びＣＰＵを構成できるようにする。
Console Program
When a computer system constructed in accordance with the principles of the present invention is enabled on a platform, multiple HWRPBs must be created, multiple copies of the console program must be loaded, and system resources must be assigned so that each HWRPB is associated with specific elements of the system. To do this, the first console program to run builds in memory a configuration tree structure that represents all of the hardware in the system. The tree also contains the software partitioning information and the assignment of hardware to partitions, and is described in detail below. More specifically, when the APMP system is powered up, one CPU is selected as the primary CPU in a conventional manner by hardware specific to the platform on which the system is running. The primary CPU then loads a copy of the console program into memory. This console copy is called the "master console" program. The primary CPU initially operates under the control of the master console program and performs tests and checks on the assumption that there is a single system that owns the entire machine. Subsequently, a set of environment variables that define the system partitions is loaded. Finally, the master console creates and initializes the partitions based on the environment variables. In this latter process, the master console builds the configuration tree, creates additional HWRPB data blocks, loads the additional console program copies, and starts the CPUs on the other HWRPBs. Each partition then has an operating system instance running on it, which cooperates with a console program copy also running in that partition. In an unconfigured APMP system, the master console program initially creates a single partition containing the primary CPU, a minimum amount of memory, and a physical system administrator's console selected in a platform-specific way. Console program commands then allow the system administrator to create additional partitions and to configure I/O buses, memory and CPUs for each partition.

【００２５】区画に対するリソースの関連付けがコンソ
ールプログラムによって行われた後に、その関連性が不
揮発性ＲＡＭに記憶され、その後のブート中にシステム
を自動的に構成できるようにされる。その後のブート中
に、マスターコンソールプログラムは、新たな要素の追
加及び除去を取り扱うために現在の構成を記憶された構
成で確認しなければならない。新たに追加される要素
は、それらがシステムアドミニストレータによって指定
されるまで非指定状態に入れられる。ハードウェア要素
を除去したときに、区画のもつリソースがオペレーティ
ングシステムを実行するのに不充分なものとなる場合に
は、リソースがその区画に指定され続けるが、付加的な
新たなりソースがそこに指定されるまでオペレーティン
グシステムインスタンスを実行することはできない。After the console program has associated resources with partitions, the associations are stored in non-volatile RAM so that the system can be configured automatically during subsequent boots. During subsequent boots, the master console program must validate the current configuration against the stored configuration in order to handle the addition and removal of elements. Newly added elements are placed in an unassigned state until they are assigned by the system administrator. If the removal of a hardware element leaves a partition with insufficient resources to run an operating system, the remaining resources continue to be assigned to that partition, but the partition cannot run an operating system instance until additional new resources are assigned to it.

【００２６】既に述べたように、コンソールプログラム
は、オペレーティングシステムのブートアップ中にオペ
レーティングシステムへ通されたＨＷＲＰＢによりオペ
レーティングシステムインスタンスと通信する。コンソ
ールプログラムに対する基本的な要件は、ＨＷＲＰＢそ
れ自体及びその多数のコピーを形成できねばならないこ
とである。コンソールプログラムにより形成された各Ｈ
ＷＲＰＢコピーは、独立したオペレーティングシステム
インスタンスをメモリのプライベート区分へとブートす
ることができ、そしてこのようにブートされる各オペレ
ーティングシステムインスタンスは、ＨＷＲＰＢに入れ
られる独特の値によって識別することができる。この値
は区画を指示し、そしてオペレーティングシステムイン
スタンスＩＤとしても使用される。As already mentioned, the console program communicates with an operating system instance by means of the HWRPB, which is passed to the operating system during operating system boot-up. A basic requirement for the console program is that it must be able to create the HWRPB itself and multiple copies of it. Each HWRPB copy created by the console program can boot an independent operating system instance into a private section of memory, and each operating system instance booted in this way can be identified by a unique value placed in the HWRPB. This value indicates the partition and is also used as the operating system instance ID.

【００２７】更に、コンソールプログラムは、区画内で
実行されているオペレーティングシステムによる要求に
応答してその区画内で使用できるＣＰＵからあるＣＰＵ
を除去するためのメカニズムを形成するように構成され
る。各オペレーティングシステムインスタンスは、コン
ソールプログラムに制御権が通されるように、遮断、停
止又はさもなくばクラッシュできねばならない。逆に、
各オペレーティングシステムインスタンスは、他のオペ
レーティングシステムインスタンスとは独立して、ある
オペレーションモードへと再ブートできねばならない。
コンソールプログラムにより形成された各ＨＷＲＰＢ
は、システム内にあるか又はシステム全体をパワーダウ
ンせずにシステムに追加できる各ＣＰＵに対してＣＰＵ
スロット特有のデータベースを含む。物理的に存在する
各ＣＰＵは、「存在」とマークされるが、特定の区画に
おいて最初に実行するＣＰＵだけは、区画のＨＷＲＰＢ
において「使用可能」とマークされる。ある区画におい
て実行されるオペレーティングシステムインスタンス
は、ＨＷＲＰＢのＣＰＵごとの状態フラグフィールドに
おける「存在」（ＰＰ）ビットにより将来のある時間に
ＣＰＵを使用できると確認することができ、そしてこれ
を表わすデータ構造体を形成することができる。ＣＰＵ
ごとの状態フラグフィールドにおける「使用可能」（Ｐ
Ａ）ビットは、これがセットされると、その関連ＣＰＵ
が区画に現在関連されていて、ＳＭＰオペレーションに
加えるように案内できることを指示する。In addition, the console program is configured to provide a mechanism for removing a CPU from the CPUs available within a partition in response to a request by the operating system running in that partition. Each operating system instance must be able to shut down, halt, or otherwise crash in such a way that control is passed to the console program. Conversely, each operating system instance must be able to reboot into an operational mode independently of any other operating system instance. Each HWRPB created by the console program contains a CPU slot-specific database for each CPU that is in the system or that can be added to the system without powering down the entire system. Each CPU that is physically present is marked "present", but only the CPU that will initially execute in a particular partition is marked "available" in the HWRPB of that partition. The operating system instance running in a partition can recognize, from the "present" (PP) bit in the per-CPU state flags field of the HWRPB, that a CPU may become available at some future time, and can build data structures to reflect this. When set, the "available" (PA) bit in the per-CPU state flags field indicates that the associated CPU is currently associated with the partition and can be invited to join SMP operation.
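As a rough sketch of how an operating system instance might interpret the PP and PA bits described above, the following classifies each per-CPU slot. The bit positions, the `cpu_status` enumeration and the function names are assumptions made for illustration; the text does not give the actual layout of the state flags field.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed bit positions, for illustration only. */
#define CPU_FLAG_PP (UINT64_C(1) << 0)   /* "present": CPU physically exists   */
#define CPU_FLAG_PA (UINT64_C(1) << 1)   /* "available": usable by this partition */

typedef enum { CPU_ABSENT, CPU_PRESENT_ONLY, CPU_AVAILABLE } cpu_status;

/* Classify one per-CPU state flags value from a partition's HWRPB. */
cpu_status classify_cpu(uint64_t state_flags)
{
    if (!(state_flags & CPU_FLAG_PP))
        return CPU_ABSENT;
    if (state_flags & CPU_FLAG_PA)
        return CPU_AVAILABLE;      /* may be invited to join SMP operation */
    return CPU_PRESENT_ONLY;       /* may become available at some later time */
}

/* Count the CPUs that this partition could use right now. */
size_t count_available(const uint64_t *slots, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (classify_cpu(slots[i]) == CPU_AVAILABLE)
            count++;
    return count;
}
```

An instance could use the `CPU_PRESENT_ONLY` result to pre-allocate per-CPU data structures for processors that another partition currently owns.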

【００２８】構成ツリー既に述べたように、マスターコンソールプログラムは、
ハードウェアの構成と、各区画に対するシステムの各要
素の指定とを表わす構成ツリーを形成する。次いで、各
コンソールプログラムは、ＨＷＲＰＢにツリーのポイン
タを入れることにより構成ツリーをそれに関連したオペ
レーティングシステムインスタンスに識別する。図３に
戻ると、構成ツリー３００は、システム内のハードウェ
ア要素と、プラットホームの制約及び最小値と、ソフト
ウェア構成とを表わす。マスターコンソールプログラム
は、以前の初期化中に発生された構成情報を含む不揮発
性ＲＡＭに記憶された情報と、ハードウェアの検知とに
より発見された情報を用いてツリーを形成する。
Configuration Tree
As already mentioned, the master console program builds a configuration tree that represents the hardware configuration and the assignment of each element of the system to each partition. Each console program then identifies the configuration tree to its associated operating system instance by placing a pointer to the tree in the HWRPB. Returning to FIG. 3, the configuration tree 300 represents the hardware elements in the system, the platform constraints and minimums, and the software configuration. The master console program builds the tree using information discovered by probing the hardware and information stored in non-volatile RAM, which contains configuration information generated during previous initializations.

【００２９】マスターコンソールは、全てのオペレーテ
ィングシステムインスタンスによってコピーが共用され
るところのツリーの単一コピーを発生することもできる
し、又は各インスタンスごとにツリーを複写することも
できる。ツリーの単一コピーは、独立したメモリを伴う
システムに単一の欠陥点を形成し得るという欠点があ
る。しかしながら、多数のツリーコピーを発生するプラ
ットホームは、コンソールプログラムがツリーに対する
変化を同期状態に保持できることを必要とする。構成ツ
リーは、根ノード、子ノード及び兄弟ノードを含む多数
のノードより成る。各ノードは、固定のヘッダと、オー
バーレイデータ構造体に対する可変長さ延長部とで形成
される。ツリーは、全システムボックスを表わすツリー
根ノード３０２で出発し、その後に、ハードウェア構成
（ハードウェア根ノード３０４）、ソフトウェア構成
（ソフトウェア根ノード３０６）及び最小区画要件（テ
ンプレート根ノード３０８）を示すブランチが続く。図
３において、矢印は、子供及び兄弟関係を表わす。ある
ノードの子供は、ハードウェア及びソフトウェア構成の
構成要素を表わす。兄弟は、同じ親をもつ意外関係のな
い要素の同等のものを表わす。ツリー３００のノード
は、ソフトウェアコミュニティ及びオペレーティングシ
ステムインスタンス、ハードウェア構成、構成制約、性
能境界及びホットスワップ能力に関する情報を含む。
又、これらノードは、ハードウェア対ソフトウェア所有
権の関係又はハードウェア要素の共用も与える。The master console can generate a single copy of the tree, which is then shared by all operating system instances, or it can replicate the tree for each instance. A single copy of the tree has the disadvantage that it can create a single point of failure in systems with independent memories. However, a platform that generates multiple tree copies requires the console programs to be able to keep changes to the tree synchronized. The configuration tree consists of a number of nodes, including a root node, child nodes and sibling nodes. Each node is formed of a fixed header and a variable-length extension for overlaid data structures. The tree starts with a tree root node 302 representing the entire system box, followed by branches that describe the hardware configuration (hardware root node 304), the software configuration (software root node 306), and the minimum partition requirements (template root node 308). In FIG. 3, the arrows represent child and sibling relationships. The children of a node represent the component parts of the hardware or software configuration. Siblings represent peers of an element that are unrelated except for having the same parent. The nodes of the tree 300 contain information about software communities and operating system instances, hardware configuration, configuration constraints, performance boundaries and hot-swap capabilities. The nodes also provide the relationship of hardware to software ownership, or the sharing of a hardware element.

【００３０】これらのノードは、メモリ内に隣接して記
憶され、そしてツリー３００のツリー根ノード３０２か
ら特定ノードへのアドレスオフセットが「ハンドル」を
形成し、これは、オペレーティングシステムインスタン
スにおける同じ要素を明確に識別するためにオペレーテ
ィングシステムインスタンスにより使用することができ
る。更に、本発明のコンピュータシステムの各要素は、
個別のIDを有する。これは、説明上、６４ビットの無符
号値である。このＩＤは、要素の形式及びサブ形式値と
組み合わされたときに独特の要素を特定しなければなら
ない。即ち、所与の形式の要素に対し、ＩＤは、特定の
要素を識別しなければならない。ＩＤは、単純な数字、
例えば、ＣＰＵＩＤであってもよいし、他の何らかの
独特のエンコード又は物理的なアドレスであってもよ
い。要素ID及びハンドルは、任意の数のコンピュータシ
ステムがハードウェア又はソフトウェアの特定の部片を
識別できるようにする。即ち、いずれの特定方法を使用
するいかなる区画も、同じ仕様を用いて同じ結果を得る
ことができねばならない。These nodes are stored contiguously in memory, and the address offset from the tree root node 302 of the tree 300 to a specific node forms a "handle", which can be used from any operating system instance to identify the same element unambiguously. In addition, each element of the computer system of the present invention has a separate ID. For purposes of illustration, this is a 64-bit unsigned value. The ID must specify a unique element when combined with the type and subtype values of the element. That is, for a given type of element, the ID must identify a specific element. The ID may be a simple number, for example the CPU ID, or it may be some other unique encoding or a physical address. The element ID and the handle allow any member of the computer system to identify a specific piece of hardware or software. That is, any partition using either method of specification must be able to use the same specification and obtain the same result.

【００３１】上記のように、本発明のコンピュータシス
テムは、１つ以上のコミュニティより成り、これらコミ
ュニティは、次いで、１つ以上の区画より成る。独立し
たコミュニティにわたって区画を分割することにより、
本発明のコンピュータシステムは、デバイス及びメモリ
の共用を制限し得る構成にすることができる。コミュニ
ティ及び区画は、高密度でパックされるＩＤを有する。
ハードウェアプラットホームは、システムに存在するハ
ードウェアに基づいて区画の最大数を決定し、そしてプ
ラットホーム最大限界を有する。区画及びコミュニティ
ＩＤは、ランタイム中にこの値を決して越えることがな
い。ＩＤは、削除された区画及びコミュニティに対して
再使用される。コミュニティの最大数は、区画の最大数
と同じである。更に、各オペレーティングシステムイン
スタンスは、独特のインスタンス識別子、例えば、区画
ＩＤと具体的な数字との組み合わせによって識別され
る。As mentioned above, the computer system of the present invention consists of one or more communities, which in turn consist of one or more partitions. By dividing the partitions across independent communities, the computer system of the present invention can be placed in a configuration in which the sharing of devices and memory is limited. Communities and partitions have IDs that are densely packed. The hardware platform determines the maximum number of partitions based on the hardware present in the system, and also has a platform maximum limit. Partition and community IDs never exceed this value during runtime. The IDs of deleted partitions and communities are reused. The maximum number of communities is the same as the maximum number of partitions. In addition, each operating system instance is identified by a unique instance identifier, for example a combination of the partition ID and a specific number.

【００３２】コミュニティ及び区画は、ソフトウェア根
ノード３０６により表わされ、これは、コミュニティノ
ード子供（そのコミュニティノード３１０が示されてい
る）と、区画ノード孫（その２つのノード３１２及び３
１４が示されている）とを有する。ハードウェア要素
は、ハードウェア根ノード３０４により表わされ、これ
は、コンピュータシステムに現在存在する全てのハード
ウェアのハイアラーキー表示を示す子供を含む。ハード
ウェア要素の「所有権」は、適当なソフトウェアノード
（３１０、３１２又は３１４）を指す関連ハードウェア
ノードにおけるハンドルにより表わされる。これらのハ
ンドルは図４に示されており、これについて、以下に説
明する。特定の区画が所有する要素は、その区画を表わ
すノードを指すハンドルを有する。多数の区画が共用す
るハードウェア（例えば、メモリ）は、その共用が拘束
されるコミュニティを指すハンドルを有する。未所有の
ハードウェアは、ゼロのハンドル（根ノード３０２を表
わす）を有する。Communities and partitions are represented by the software root node 306, which has community node children (of which community node 310 is shown) and partition node grandchildren (of which two nodes, 312 and 314, are shown). Hardware elements are represented by the hardware root node 304, whose children give a hierarchical representation of all of the hardware currently present in the computer system. "Ownership" of a hardware element is represented by a handle in the associated hardware node that points to the appropriate software node (310, 312 or 314). These handles are shown in FIG. 4 and are described below. Elements owned by a specific partition have handles that point to the node representing that partition. Hardware that is shared by multiple partitions (for example, memory) has handles that point to the community to which the sharing is confined. Unowned hardware has a handle of zero (representing the tree root node 302).
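The ownership convention just described (partition handle, community handle, or zero) can be sketched as follows. This is a simplified model rather than the layout used by the patent: the `node` structure, the `node_type` values and `classify_owner` are hypothetical, and every node is assumed to have the same size so that a handle is a byte offset into an array of nodes.

```c
#include <stddef.h>
#include <stdint.h>

typedef int32_t gct_handle;   /* byte offset from the tree base; 0 is the tree root */

/* Assumed node kinds for this sketch. */
enum node_type { NODE_ROOT, NODE_COMMUNITY, NODE_PARTITION, NODE_HW };

typedef struct {
    enum node_type type;
    gct_handle     owner;     /* meaningful for hardware nodes only */
} node;

typedef enum { OWN_NONE, OWN_PARTITION, OWN_COMMUNITY, OWN_INVALID } ownership;

/* Resolve a handle to a node by adding it to the base address of the tree. */
static const node *node_at(const node *base, gct_handle h)
{
    return (const node *)((const char *)base + h);
}

/* Report how the hardware node `hw` is owned, following its owner handle. */
ownership classify_owner(const node *base, size_t tree_bytes, const node *hw)
{
    gct_handle h = hw->owner;

    if (h == 0)
        return OWN_NONE;                       /* zero handle: unowned */
    if (h < 0 || (size_t)h % sizeof(node) != 0 ||
        (size_t)h + sizeof(node) > tree_bytes)
        return OWN_INVALID;                    /* outside this simplified tree */

    switch (node_at(base, h)->type) {
    case NODE_PARTITION: return OWN_PARTITION; /* owned by a single partition */
    case NODE_COMMUNITY: return OWN_COMMUNITY; /* shared within a community   */
    default:             return OWN_INVALID;
    }
}
```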

【００３３】ハードウェア要素は、所有権をいかに分割
するかについて構成上の制約を課する。各要素に関連し
た構成ツリーノードにおける「ｃｏｎｆｉｇ」ハンドル
は、ハードウェア根ノード３０４を指すことにより要素
をコンピュータシステムのどこにでも自由に関連させる
べきかどうか決定する。しかしながら、あるハードウェ
ア要素は、祖先ノードに結合することができ、そしてこ
のノードの一部分として構成されねばならない。この例
は、どこで実行するかの制約をもたないが、ＳＢＢ３２
２又は３２４のようなシステムビルディングブロック
（ＳＢＢ）の構成要素であるＣＰＵである。この場合
に、たとえＣＰＵがＳＢＢの子供であっても、そのｃｏ
ｎｆｉｇハンドルは、ハードウェア根ノード３０４を指
す。しかしながら、Ｉ／Ｏバスは、そのＩ／Ｏプロセッ
サを所有する区画以外の区画が所有することはできな
い。この場合に、Ｉ／Ｏバスを表わす構成ツリーノード
は、Ｉ／Ｏプロセッサを指すｃｏｎｆｉｇハンドルを有
する。ハードウェア構成を支配するルールは、プラット
ホーム特有のものであるから、この情報は、ｃｏｎｆｉ
ｇハンドルによりオペレーティングシステムインスタン
スに与えられる。Hardware elements impose configuration constraints on how ownership can be divided. The "config" handle in the configuration tree node associated with each element indicates, by pointing to the hardware root node 304, that the element is free to be associated anywhere in the computer system. However, some hardware elements may be bound to an ancestor node and must be configured as part of that node. An example of an unconstrained element is a CPU, which may have no constraints on where it executes even though it is a component of a system building block (SBB) such as SBB 322 or 324. In this case, even though the CPU is a child of the SBB, its config handle points to the hardware root node 304. An I/O bus, however, cannot be owned by a partition other than the partition that owns its I/O processor. In this case, the configuration tree node representing the I/O bus has a config handle that points to the I/O processor. Because the rules governing hardware configuration are platform specific, this information is provided to the operating system instances by the config handle.

【００３４】各ハードウェア要素は、「親和力(affinit
y)」ハンドルも有する。この親和力ハンドルは、ｃｏｎ
ｆｉｇハンドルと同じであるが、要素の最良の性能を得
る構成を表わす。例えば、ＣＰＵ又はメモリは、コンピ
ュータシステムのどこででも構成できるようにするｃｏ
ｎｆｉｇハンドル（ハードウェア根ノード３０４を指
す）を有するが、最適な性能のためには、ＣＰＵ又はメ
モリは、それらが一部分であるところのシステムビルデ
ィングブロックを使用するように構成されねばならな
い。その結果、ｃｏｎｆｉｇポインタは、ハードウェア
根ノード３０４を指すが、親和力ポインタは、ノード３
２２又は３２４のようなＳＢＢノードを指す。いかなる
要素の親和力もプラットホーム特有のもので、ファーム
ウェアにより決定される。ファームウェアは、「最適」
な自動構成を作るように求めるときに子の情報を使用す
ることができる。Each hardware element also has an "affinity" handle. The affinity handle is identical to the config handle, except that it represents the configuration that obtains the best performance from the element. For example, a CPU or memory may have a config handle that allows it to be configured anywhere in the computer system (it points to the hardware root node 304), but for optimal performance the CPU or memory should be configured to use the system building block of which it is a part. As a result, the config pointer points to the hardware root node 304, but the affinity pointer points to an SBB node such as node 322 or node 324. The affinity of any element is platform specific and is determined by the firmware. The firmware may use this information when asked to form an "optimal" automatic configuration.
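The distinction between the config handle (where an element may be configured) and the affinity handle (where it performs best) can be illustrated with a small selection routine. The `hw_node` structure and the functions `freely_assignable` and `pick_cpu` are hypothetical names for this sketch; the patent does not describe a specific selection algorithm.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef int32_t gct_handle;

/* Reduced view of a hardware node: only the two handles discussed above. */
typedef struct {
    gct_handle config;     /* node this element must be configured with */
    gct_handle affinity;   /* node giving the best performance          */
} hw_node;

/* An element is unconstrained when its config handle is the hardware root. */
bool freely_assignable(const hw_node *n, gct_handle hw_root)
{
    return n->config == hw_root;
}

/*
 * Pick a CPU for a partition that mostly uses the building block `preferred_sbb`.
 * Prefers an unconstrained CPU whose affinity matches; otherwise falls back to
 * the first unconstrained CPU. Returns the index, or -1 if none qualifies.
 */
int pick_cpu(const hw_node *cpus, size_t n, gct_handle hw_root,
             gct_handle preferred_sbb)
{
    int fallback = -1;
    for (size_t i = 0; i < n; i++) {
        if (!freely_assignable(&cpus[i], hw_root))
            continue;
        if (cpus[i].affinity == preferred_sbb)
            return (int)i;
        if (fallback < 0)
            fallback = (int)i;
    }
    return fallback;
}
```

Treating affinity as a preference and config as a hard constraint mirrors the text: a mismatch in affinity costs performance, while a mismatch in config is not a valid configuration.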

【００３５】又、各ノードは、ノードの形式及び状態を
指示する多数のフラグも含む。これらのフラグは、表わ
される要素が「ホットスワップ可能」な要素でありそし
てその親及び兄弟とは独立してパワーダウンできること
を指示するｎｏｄｅｈｏｔｓｗａｐフラグを含む。し
かしながら、このノードの全ての子供は、この要素がパ
ワーダウンする場合にはパワーダウンしなければならな
い。子供がこの要素と独立してパワーダウンできる場合
には、それに対応するノードにおいてこのビットをセッ
トしなければならない。別のフラグは、ｎｏｄｅｕｎ
ａｖａｉｌａｂｌｅフラグであり、これは、セットされ
ると、ノードにより表わされる要素が使用のために現在
入手できないことを指示する。２つのフラグｎｏｄｅ
ｈａｒｄｗａｒｅ及びｎｏｄｅｔｅｍｐｌｅｔｅは、
ノードの形式を指示する。又、ノードが初期化された区
画を表わすか又は現在の一次ＣＰＵであるＣＰＵを表わ
すかを指示するために、ｎｏｄｅｉｎｉｔｉａｌｉｚ
ｅｄ及びｎｏｄｅｃｐｕｐｒｉｍａｒｙのような更
に別のフラグを設けることもできる。Each node also contains a number of flags that indicate the type and state of the node. These flags include a node_hotswap flag, which indicates that the element represented is a "hot swappable" element and can be powered down independently of its parent and siblings. However, all children of this node must power down if this element powers down. If a child can power down independently of this element, this bit must also be set in the child's corresponding node. Another flag is the node_unavailable flag which, when set, indicates that the element represented by the node is not currently available for use. Two flags, node_hardware and node_template, indicate the type of the node. Further flags, such as node_initialized and node_cpu_primary, may also be provided to indicate whether the node represents an initialized partition or a CPU that is currently a primary CPU.

【００３６】構成ツリー３００は、オペレーティングシ
ステムがバスを検知せずにバス及びデバイス構成テーブ
ルを形成できるようにするデバイスコントローラのレベ
ルまで拡張できる。しかしながら、ツリーは任意のレベ
ルで終了してもよい。但し、それより下の全ての要素を
独立して構成できない場合である。システムソフトウェ
アは、ツリーにより与えられないバス及びデバイス情報
を検知することが依然として要求される。コンソールプ
ログラムは、システムの各要素に構成の制約がもしあれ
ばそれを実行及び実施する。一般に、要素は、制約なし
に指定可能である（例えば、ＣＰＵは、制約をもたな
い）、又は別の要素の一部分としてのみ構成可能である
（例えば、デバイスアダプタは、そのバスの一部分とし
てのみ構成可能である）。上記のように、ＣＰＵ、メモ
リ及びＩ／Ｏデバイスを独特のソフトウェアエンティテ
ィへとグループ編成したものである区画は、最小要件も
有する。例えば、区画のための最小ハードウェア要件
は、少なくとも１つのＣＰＵと、あるプライベートメモ
リ（プラットホームに従属する最小のもので、コンソー
ルメモリを含む）と、物理的な非共用コンソールポート
を含むＩ／Ｏバスとである。The configuration tree 300 may extend down to the level of device controllers, which allows the operating system to build bus and device configuration tables without probing the buses. However, the tree may terminate at any level, provided that none of the elements below that level can be configured independently. System software is still required to probe for bus and device information that is not provided by the tree. The console program implements and enforces the configuration constraints, if any, on each element of the system. In general, an element is either assignable without constraint (for example, a CPU may have no constraints) or configurable only as part of another element (for example, a device adapter may be configurable only as part of its bus). As noted above, a partition, which is a grouping of CPUs, memory and I/O devices into a unique software entity, also has minimum requirements. For example, the minimum hardware requirements for a partition are at least one CPU, some private memory (a platform-dependent minimum, including console memory), and an I/O bus that includes a physical, non-shared console port.

【００３７】区画のための最小要素要件は、テンプレー
ト根ノード３０８に含まれた情報によって与えられる。
テンプレート根ノード３０８は、ノード３１６、３１８
及び３２０を含み、これは、コンソールプログラム及び
オペレーティングシステムインスタンスを実行すること
のできる区画を形成するために設けなければならないハ
ードウェア要素を表わす。構成エディタは、新たな区画
を形成するためにどんな形式及びどれほど多くのリソー
スを使用できねばならないかを決定するための基礎とし
てこの情報を使用することができる。新たな区画の形成
中に、テンプレートサブツリーは、「ウオーキング」さ
れ、そしてテンプレートサブツリーの各ノードごとに、
新たな区画により所有される同じ形式及びサブ形式のノ
ードがあって、コンソールプログラムをロードしそして
オペレーティングシステムインスタンスをブートするこ
とができねばならない。テンプレートツリーに同じ形式
及びサブ形式のノードが２つ以上ある場合には、新たな
区画にも多数のノードがなければならない。コンソール
プログラムは、コンソールプログラムをロードしそして
初期化オペレーションを試みる前に、テンプレートを使
用して、新たな区画が最小要件を有することを確認す
る。The minimum element requirements for a partition are given by the information contained in the template root node 308. The template root node 308 contains nodes 316, 318 and 320, which represent the hardware elements that must be provided to form a partition capable of executing a console program and an operating system instance. A configuration editor can use this information as the basis for determining what types of resources, and how many, must be available to form a new partition. During the creation of a new partition, the template subtree is "walked" and, for each node of the template subtree, there must be a node of the same type and subtype owned by the new partition so that it can load a console program and boot an operating system instance. If there is more than one node of the same type and subtype in the template tree, there must also be multiple such nodes in the new partition. The console program uses the template to verify that a new partition has the minimum requirements before it loads a console program and attempts an initialization operation.
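The template walk described above amounts to a multiset comparison on (type, subtype) pairs. The following is a minimal sketch under that reading; the `kind` structure and the function `meets_template` are hypothetical, and the real console program would walk tree nodes rather than flat arrays.

```c
#include <stdbool.h>
#include <stddef.h>

/* The (type, subtype) pair carried in every node header. */
typedef struct {
    unsigned char type;
    unsigned char subtype;
} kind;

/* Count how many entries of `v` have the same type and subtype as `k`. */
static size_t count_kind(const kind *v, size_t n, kind k)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (v[i].type == k.type && v[i].subtype == k.subtype)
            count++;
    return count;
}

/*
 * Returns true if the nodes owned by a new partition satisfy the template:
 * for every template node there are at least as many owned nodes of the same
 * type and subtype as the template contains.
 */
bool meets_template(const kind *tmpl, size_t nt, const kind *owned, size_t no)
{
    for (size_t i = 0; i < nt; i++)
        if (count_kind(owned, no, tmpl[i]) < count_kind(tmpl, nt, tmpl[i]))
            return false;
    return true;
}
```

The quadratic counting is acceptable here because a template holds only a handful of nodes.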

【００３８】構成ツリーノードの特定の実施に関する詳
細な例を以下に示す。これは、単に説明上のものに過ぎ
ず、これに限定されるものではない。各ＨＷＲＰＢは、
現在の構成と、区画に対する要素の指定とを与える構成
ツリーを指さねばならない。ＨＷＲＰＢの構成ポインタ
（ＣＯＮＦＩＧフィールドにおける）は、構成ツリーを
指すのに使用される。ＣＯＮＦＩＧフィールドは、ツリ
ーに対するメモリプールのサイズと、メモリの初期チェ
ック和とを含む６４バイトヘッダを指す。ヘッダの直後
に、ツリーの根ノードがある。ツリーのヘッダ及び根ノ
ードは、ページ整列される。構成ツリーに割り当てられ
るメモリの全サイズ（バイト）は、ヘッダの第１のクオ
ドワードに位置される。このサイズは、ハードウェアペ
ージサイズの倍数となるように保証される。ヘッダの第
２のクオドワードは、チェック和に指定される。構成ツ
リーを検査するために、オペレーティングシステムイン
スタンスは、ツリーをそのローカルアドレス空間にマッ
プする。オペレーティングシステムインスタンスは、全
てのアプリケーションに許された読み取りアクセスでこ
のメモリをマップするので、特権のないアプリケーショ
ンが、それがアクセスしてはならないコンソールデータ
へのアクセスを得るのを防止するための何らかの構成を
設けねばならない。メモリを適当に割り当てることによ
りアクセスが制限される。例えば、メモリはページ整列
されそして全ページに割り当てられてもよい。通常は、
オペレーティングシステムインスタンスは、構成ツリー
の第１ページをマップし、ツリーサイズを得、そして構
成ツリーの使用のために割り当てられたメモリを再マッ
プする。全サイズは、ツリーへの動的な変化に対してコ
ンソールにより使用される付加的なメモリを含むことが
できる。好ましくは、構成ツリーノードは固定のヘッ
ダで形成され、そしてその固定のヘッダに続いて形式特
有の情報を任意に含む。サイズフィールドは、ノードの
全長を含み、ノードは、この例では６４バイトの倍数で
割り当てられ、そして必要に応じてパッドが付けられ
る。ノードの固定ヘッダにおけるフィールドを以下に一
例として説明する。A detailed example of a specific implementation of the configuration tree nodes is given below. It is intended to be illustrative only and not limiting. Each HWRPB must point to a configuration tree that provides the current configuration and the assignment of elements to partitions. The configuration pointer of the HWRPB (in the CONFIG field) is used to point to the configuration tree. The CONFIG field points to a 64-byte header that contains the size of the memory pool for the tree and the initial checksum of the memory. Immediately following the header is the root node of the tree. The header and root node of the tree are page aligned. The total size, in bytes, of the memory allocated for the configuration tree is located in the first quadword of the header. This size is guaranteed to be a multiple of the hardware page size. The second quadword of the header is reserved for the checksum. To examine the configuration tree, an operating system instance maps the tree into its local address space. Because an operating system instance may map this memory with read access allowed for all applications, some provision must be made to prevent an unprivileged application from gaining access to console data that it should not access. Access can be restricted by allocating the memory appropriately. For example, the memory may be page aligned and allocated in whole pages. Normally, an operating system instance maps the first page of the configuration tree, obtains the tree size, and then remaps the memory allocated for use by the configuration tree. The total size may include additional memory used by the console for dynamic changes to the tree. Preferably, a configuration tree node is formed with a fixed header and optionally contains type-specific information following the fixed header. The size field contains the full length of the node; in this example nodes are allocated in multiples of 64 bytes and padded as needed. The fields in the fixed header of a node are described below by way of example.
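The first step of the mapping procedure above (read the header from the first page, obtain the size, then remap) might look like the following sketch. Only the facts stated in the text are used: a 64-byte header, the total size in the first quadword, the checksum in the second, and the root node immediately after the header. The names `gct_header_info`, `read_gct_header` and `gct_root` are hypothetical, the page size is passed in by the caller, host byte order is assumed, and the checksum is returned unverified because the text does not define the algorithm.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { GCT_HEADER_BYTES = 64 };   /* header size given in the text */

typedef struct {
    uint64_t pool_bytes;   /* first quadword: total bytes allocated to the tree */
    uint64_t checksum;     /* second quadword: checksum (not verified here)     */
} gct_header_info;

/*
 * Read the two documented quadwords from the first mapped page.
 * Returns 0 on success, or -1 if the size is not a non-zero multiple of
 * `page_size`, which the text says is guaranteed for a valid tree.
 */
int read_gct_header(const unsigned char *first_page, uint64_t page_size,
                    gct_header_info *out)
{
    uint64_t q[2];

    memcpy(q, first_page, sizeof q);   /* host byte order assumed */
    if (page_size == 0 || q[0] == 0 || q[0] % page_size != 0)
        return -1;
    out->pool_bytes = q[0];
    out->checksum = q[1];
    return 0;
}

/* The tree root node immediately follows the 64-byte header. */
const unsigned char *gct_root(const unsigned char *first_page)
{
    return first_page + GCT_HEADER_BYTES;
}
```

After a successful call, an operating system instance would remap `pool_bytes` bytes starting at the same address to see the whole tree.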

【００３９】
typedef struct gct_node {
    unsigned char type;
    unsigned char subtype;
    unit16 size;
    GCT_HANDLE owner;
    GCT_HANDLE current_owner;
    GCT_ID id;
    union {
        unit64 node_flags;
        struct {
            unsigned node_hardware : 1;
            unsigned node_hotswap : 1;
            unsigned node_unavailable : 1;
            unsigned node_hw_template : 1;
            unsigned node_initialized : 1;
            unsigned node_cpu_primary : 1;
#define NODE_HARDWARE    0x001
#define NODE_HOTSWAP     0x002
#define NODE_UNAVAILABLE 0x004
#define NODE_HW_TEMPLATE 0x008
#define NODE_INITIALIZED 0x010
#define NODE_PRIMARY     0x020
        } flag_bit;
    } flag_union;
    GCT_HANDLE config;
    GCT_HANDLE affinity;
    GCT_HANDLE parent;
    GCT_HANDLE next_sib;
    GCT_HANDLE prev_sib;
    GCT_HANDLE child;
    GCT_HANDLE reserved;
    unit32 magic;
} GCT_NODE;

【００４０】上記定義において、形式定義「ｕｎｉｔ」
は、適当なビット長さをもつ無符号の整数である。上述
したように、ノードは、ハンドルにより位置決めされ、
識別される（上記定義では、ｔｙｐｅｄｅｆＧＣＴ
ＨＡＮＤＬＥにより識別される）。ここに例示するハン
ドルは、構成ツリーのベースからノードまでの符号付き
３２ビットオフセットである。値は、コンピュータシス
テムの全ての区画にわたって独特である。即ち、ある区
画において得られるハンドルは、全ての区画において、
ノードをルックアップするために、又はコンソールコー
ルバックへの入力として有効でなければならない。ｍａ
ｇｉｃフィールドは、ノードが実際に有効なノードであ
ることを指示する所定のビットパターンを含む。ツリー
根ノードは、システム全体をあらわす。そのハンドルは
常にゼロである。即ち、それは、常に、ｃｏｎｆｉｇヘ
ッダに続く構成ツリーに割り当てられたメモリの第１の
物理的な位置に配置される。これは、次の定義を有す
る。In the above definition, the type definitions "unit" are unsigned integers of the appropriate bit length. As described above, nodes are located and identified by handles (identified by the typedef GCT_HANDLE in the above definition). The handle illustrated here is a signed 32-bit offset from the base of the configuration tree to the node. The value is unique across all partitions of the computer system. That is, a handle obtained on one partition must be valid on all partitions, whether it is used to look up a node or as an input to a console callback. The magic field contains a predetermined bit pattern that indicates that the node is actually a valid node. The tree root node represents the entire system. Its handle is always zero. That is, it is always located at the first physical location in the memory allocated for the configuration tree, following the config header. It has the following definition:
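Because a handle is an offset from the tree base and every valid node carries the magic pattern, a lookup can be validated before the node is used. The sketch below assumes a reduced node header (`mini_node`), an invented magic value, and handles that fall on 64-byte boundaries (which follows if the root is at offset zero and all node sizes are multiples of 64 bytes, as in the example). None of these names or values come from the patent.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef int32_t gct_handle;                   /* signed 32-bit offset from tree base */

#define GCT_NODE_MAGIC UINT32_C(0x47435421)   /* hypothetical bit pattern */
enum { GCT_NODE_ALIGN = 64 };                 /* nodes allocated in 64-byte multiples */

/* Reduced node header for this sketch; the full header has more fields. */
typedef struct {
    unsigned char type;
    unsigned char subtype;
    uint16_t      size;
    uint32_t      magic;
} mini_node;

/*
 * Copy the node header found at handle `h` into `out`.
 * Returns 0 on success, or -1 if the handle is negative, misaligned,
 * out of bounds, or does not point at a header with the magic pattern.
 */
int gct_lookup(const unsigned char *tree, size_t tree_bytes, gct_handle h,
               mini_node *out)
{
    mini_node n;

    if (h < 0 || (size_t)h % GCT_NODE_ALIGN != 0)
        return -1;
    if ((size_t)h > tree_bytes || tree_bytes - (size_t)h < sizeof n)
        return -1;
    memcpy(&n, tree + h, sizeof n);   /* avoids alignment and aliasing problems */
    if (n.magic != GCT_NODE_MAGIC)
        return -1;
    *out = n;
    return 0;
}
```

Since the handle is relative to the tree base, the same value resolves to the same node in every partition, regardless of where each instance maps the tree.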

【００４１】
typedef struct gct_root_node {
    GCT_NODE hd;
    unit64 lock;
    unit64 transient_level;
    unit64 current_level;
    unit64 console_req;
    unit64 min_alloc;
    unit64 min_align;
    unit64 base_alloc;
    unit64 base_align;
    unit64 max_phys_address;
    unit64 mem_size;
    unit64 platform_type;
    int32 platform_name;
    GCT_HANDLE primary_instance;
    GCT_HANDLE first_free;
    GCT_HANDLE high_limit;
    GCT_HANDLE lookaside;
    GCT_HANDLE available;
    unit32 max_partition;
    int32 partitions;
    int32 communities;
    unit32 max_platform_partition;
    unit32 max_fragments;
    unit32 max_desc;
    char APMX_id[16];
    char APMX_id_pad[4];
    int32 bindings;
} GCT_ROOT_NODE;

【００４２】The fields of the root node are defined as follows.
lock: This field is used as a simple lock by software that wishes to inhibit changes to the structure of the tree and to the software configuration. When the value is -1 (all bits on), the tree is unlocked; when the value is 0 or greater, the tree is locked. The field is modified using atomic operations. The caller of the lock routine passes a partition ID, which is written to the lock field. This can be used to assist in fault tracing and in recovery during a crash.
transient_level: This field is incremented at the start of a tree update.
current_level: This field is updated at the completion of a tree update.
console_req: This field specifies the memory, in bytes, required for the console in the base memory segment of a partition.
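The locking rule described above can be sketched in C11 as follows. This is only an illustration under stated assumptions, not the patent's implementation: the type `gct_lock_demo` and the functions `gct_try_lock`, `gct_unlock`, and `gct_update` are hypothetical names, the lock is modeled as a signed 64-bit value so that -1 reads naturally, and setting `current_level` equal to `transient_level` on completion is an assumed interpretation of "updated at the completion of a tree update".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define GCT_UNLOCKED (-1)   /* all bits on: the tree is unlocked */

/* Hypothetical stand-in for the lock and level fields of the root node. */
typedef struct {
    _Atomic int64_t lock;        /* -1 when free, otherwise the owner's partition ID */
    uint64_t transient_level;    /* incremented when an update starts */
    uint64_t current_level;      /* brought up to date when the update completes */
} gct_lock_demo;

/* Try to take the lock; on success the caller's partition ID is recorded in it. */
static bool gct_try_lock(gct_lock_demo *root, int64_t partition_id)
{
    assert(partition_id >= 0);   /* a valid ID must not collide with -1 */
    int64_t expected = GCT_UNLOCKED;
    return atomic_compare_exchange_strong(&root->lock, &expected, partition_id);
}

/* Release the lock, but only if this partition is the one holding it. */
static bool gct_unlock(gct_lock_demo *root, int64_t partition_id)
{
    int64_t expected = partition_id;
    return atomic_compare_exchange_strong(&root->lock, &expected, GCT_UNLOCKED);
}

/* Perform one tree update under the lock, maintaining both level counters. */
static bool gct_update(gct_lock_demo *root, int64_t partition_id)
{
    if (!gct_try_lock(root, partition_id))
        return false;
    root->transient_level++;                       /* update begins */
    /* ... the tree would be modified here ... */
    root->current_level = root->transient_level;   /* assumed: levels match when done */
    return gct_unlock(root, partition_id);
}
```

Because the owner's partition ID is stored in the lock word, a reader that finds the tree locked after a crash can tell which partition held it, which is the fault-tracing use the text mentions.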

【００４３】
min_alloc: This field holds the minimum size of a memory fragment and the allocation unit (fragment sizes must be a multiple of the allocation unit). It must be a power of two.
min_align: This field holds the alignment requirement for a memory fragment. It must be a power of two.
base_alloc: This field specifies the minimum memory, in bytes (including console_req), needed for the base memory segment of a partition. This is where the console, the console structures, and the operating system are loaded for the partition. It must be greater than or equal to min_alloc and a multiple of min_alloc.
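The constraints on these three fields can be expressed as a small validity check. The sketch below is illustrative only; the function names `is_pow2`, `gct_alloc_params_ok`, and `gct_fragment_size_ok` are hypothetical and do not appear in the patent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if v is a nonzero power of two. */
static bool is_pow2(uint64_t v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

/* Hypothetical check of the root-node allocation parameters described above. */
static bool gct_alloc_params_ok(uint64_t min_alloc, uint64_t min_align,
                                uint64_t base_alloc)
{
    if (!is_pow2(min_alloc) || !is_pow2(min_align))
        return false;
    /* base_alloc must be at least min_alloc and a multiple of it. */
    return base_alloc >= min_alloc && base_alloc % min_alloc == 0;
}

/* True if a fragment of the given size is a whole number of allocation units. */
static bool gct_fragment_size_ok(uint64_t size, uint64_t min_alloc)
{
    assert(is_pow2(min_alloc));
    /* With a power-of-two unit, the low bits of a valid size are all zero. */
    return size >= min_alloc && (size & (min_alloc - 1)) == 0;
}
```

Requiring powers of two is what makes the mask test in `gct_fragment_size_ok` equivalent to a modulo check.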

【００４４】
base_align: This field holds the alignment requirement for the base memory segment of a partition. It must be a power of two and must have an alignment of at least min_align.
max_phys_address: This field holds the calculated largest physical address that could exist on the system, including memory subsystems that are not currently powered on and available.
mem_size: This field holds the total memory currently in the system.
platform_type: This field stores the type of platform, taken from a field in the HWRPB.
platform_name: This field holds an integer offset from the base of the tree root node to a string representing the name of the platform.
primary_instance: This field holds the partition ID of the first operating system instance.

【００４５】
first_free: This field holds the offset from the base of the tree root node to the first free byte of the memory pool used for new nodes.
high_limit: This field holds the highest address at which a valid node can be located within the configuration tree. It is used by callbacks to validate that a handle is legitimate.
lookaside: This field is the handle of a linked list of nodes that have been deleted and may be reclaimed. When a community or partition is deleted, its node is linked into this list, and when a new partition or community is created, this list is searched before anything is allocated from the free pool.

【００４６】
available: This field holds the number of bytes remaining in the free pool pointed to by the first_free field.
max_partitions: This field holds the maximum number of partitions, computed by the platform based on the amount of hardware resources currently available.
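The interplay of the lookaside list, first_free, and available can be sketched as a small allocator. Everything here is an assumption made for illustration: `GCT_HANDLE` is modeled as a 32-bit byte offset from the tree base, a handle of 0 stands for "none", a deleted node stores the next list handle in its first bytes, and all reclaimed nodes are taken to be the same size, so the head of the list always fits (the text says the list is searched, which a real implementation with mixed node sizes would need to do).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef int32_t GCT_HANDLE;      /* assumed: byte offset from the tree base */
#define GCT_NULL_HANDLE 0        /* assumed: offset 0 is the root, never handed out */

/* Hypothetical stand-in for the pool-related fields of the root node. */
typedef struct {
    GCT_HANDLE first_free;   /* offset of the first free byte in the pool */
    GCT_HANDLE lookaside;    /* head of the list of deleted nodes */
    GCT_HANDLE available;    /* bytes remaining in the free pool */
} gct_pool_demo;

/* Allocate a node: reuse a deleted node if one exists, else carve from the pool. */
static GCT_HANDLE gct_alloc_node(uint8_t *tree, gct_pool_demo *root, int32_t size)
{
    assert(size >= (int32_t)sizeof(GCT_HANDLE));
    if (root->lookaside != GCT_NULL_HANDLE) {
        GCT_HANDLE h = root->lookaside;
        /* Pop the head: the next handle is stored at the start of the node. */
        memcpy(&root->lookaside, tree + h, sizeof(GCT_HANDLE));
        return h;
    }
    if (root->available < size)
        return GCT_NULL_HANDLE;          /* free pool exhausted */
    GCT_HANDLE h = root->first_free;
    root->first_free += size;
    root->available -= size;
    return h;
}

/* Delete a node by pushing it onto the lookaside list for later reuse. */
static void gct_free_node(uint8_t *tree, gct_pool_demo *root, GCT_HANDLE h)
{
    memcpy(tree + h, &root->lookaside, sizeof(GCT_HANDLE));
    root->lookaside = h;
}
```

Reusing deleted nodes first means that repeatedly creating and deleting partitions does not consume the free pool, whose remaining size only ever shrinks in this sketch.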

【００４７】
partitions: This field holds an offset from the base of the root node to an array of handles. Each partition ID is used as an index into this array, and the partition node handle is stored at the indexed location. When a new partition is created, this array is examined to find the first partition ID that does not have a corresponding partition node handle, and that partition ID is used as the ID for the new partition.
communities: This field also holds an offset from the base of the root node to an array of handles. Each community ID is used as an index into this array, and the community node handle is stored in the array. When a new community is created, this array is examined to find the first community ID that does not have a corresponding community node handle, and that community ID is used as the ID for the new community. There cannot be more communities than partitions, so the array is sized based on the maximum number of partitions.
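The ID assignment rule for partitions and communities amounts to a first-empty-slot search over the handle array. The following sketch assumes, for illustration only, that `GCT_HANDLE` is a 32-bit value and that an empty slot holds 0; the names `gct_first_free_id`, `gct_assign_id`, and `GCT_NO_ID` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int32_t GCT_HANDLE;      /* assumed representation of a node handle */
#define GCT_NULL_HANDLE 0        /* assumed: an empty slot holds 0 */
#define GCT_NO_ID (-1)           /* hypothetical "no ID available" result */

/* Return the first ID whose slot holds no node handle, or GCT_NO_ID if full. */
static int gct_first_free_id(const GCT_HANDLE *handles, size_t max_ids)
{
    assert(handles != NULL);
    for (size_t id = 0; id < max_ids; id++) {
        if (handles[id] == GCT_NULL_HANDLE)
            return (int)id;
    }
    return GCT_NO_ID;
}

/* Claim the first free ID for a newly created node and record its handle. */
static int gct_assign_id(GCT_HANDLE *handles, size_t max_ids, GCT_HANDLE node)
{
    int id = gct_first_free_id(handles, max_ids);
    if (id != GCT_NO_ID)
        handles[id] = node;
    return id;
}
```

The same routine serves both arrays, since the text sizes the community array by the maximum number of partitions as well.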

【００４８】
max_platform_partition: This field holds the maximum number of partitions that can exist simultaneously on the platform, even if additional hardware is added (potentially in-swapped).
max_fragment: This field holds the platform-defined maximum number of fragments into which a memory descriptor can be divided. It is used to size the array of fragments in the memory descriptor node.
max_desc: This field holds the maximum number of memory descriptors for the platform.

【００４９】
APMP_id: This field holds a system ID that is set by system software and saved in non-volatile RAM.
APMP_id_pad: This field holds padding bytes for the APMP ID.
bindings: This field holds an offset to an array of "bindings". Each binding entry describes a type of hardware node, the type of node its parent must be, the configuration binding, and the affinity binding for the node type. Bindings are used by software to determine how node types are related and to determine configuration and affinity rules.

【００５０】A community provides the basis for sharing resources between partitions. A hardware element can be assigned to any partition in the community, but actual sharing of a device, such as memory, occurs only within a community. The community node 310 contains a pointer to a control section, called the APMP database, which allows operating system instances to control access and membership in the community for the purpose of sharing memory and communication between instances. The APMP database and the creation of communities are described in detail below. The configuration ID for a community is a signed 16-bit integer value assigned by the console program. The ID value is never greater than the maximum number of partitions that can be created on the platform.

【００５１】A partition node, such as node 312 or 314, represents a collection of hardware that is capable of running an independent copy of the console program and an independent copy of an operating system. The configuration ID for this node is a signed 16-bit integer value assigned by the console. The ID is never greater than the maximum number of partitions that can be created on the platform. The node has the following definition:
typedef struct gct_partition_node {
    GCT_NODE hd;
    uint64 hwrpb;
    uint64 incarnation;
    uint64 priority;
    int32 os_type;
    uint32 partition_reserved_1;
    uint64 instance_name_format;
    char instance_name[128];
} GCT_PARTITION_NODE;
The defined fields have the following definitions.

【００５２】
hwrpb: This field holds the physical address of the hardware restart parameter block (HWRPB) for this partition. To minimize changes to the HWRPB, the HWRPB does not contain a pointer to the partition or the partition ID. Instead, the partition contains a pointer to the HWRPB. System software can therefore determine the partition ID of the partition in which it is running by searching the partition nodes for the one that contains the physical address of its HWRPB.
incarnation: This field holds a value that is incremented each time the primary CPU of the partition executes a boot or restart operation on the partition.
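The reverse lookup described for the hwrpb field can be sketched as a linear search over the partition nodes. The structure `gct_partition_demo` is a hypothetical, trimmed view that keeps only the fields needed here, and presenting the nodes as a flat array is a simplification of walking the configuration tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GCT_NO_ID (-1)   /* hypothetical "not found" result */

/* Hypothetical, trimmed view of a partition node: only the fields used here. */
typedef struct {
    int16_t  id;           /* configuration ID assigned by the console */
    uint64_t hwrpb;        /* physical address of this partition's HWRPB */
    uint64_t incarnation;  /* incremented on each boot or restart */
} gct_partition_demo;

/* Find the ID of the partition whose node records the caller's HWRPB address. */
static int gct_find_my_partition(const gct_partition_demo *nodes, size_t count,
                                 uint64_t my_hwrpb_pa)
{
    assert(nodes != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (nodes[i].hwrpb == my_hwrpb_pa)
            return nodes[i].id;
    }
    return GCT_NO_ID;
}
```

Keeping the pointer in the partition node rather than in the HWRPB leaves the HWRPB layout unchanged, at the cost of this search when an instance needs to learn its own partition ID.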

【００５３】
priority: This field holds the priority of the partition.
os_type: This field holds a value that indicates the type of operating system that will be loaded in the partition.

Continued from the front page
(72) Inventor: Andrew H. Mason, 61 Baxter Road, Hollis, New Hampshire 03049, USA
(72) Inventor: Gregory H. Jordan, 22 Jambird Road, Hollis, New Hampshire 03049, USA
(72) Inventor: Karen L. Noel, 238 Academy Road, Pembroke, New Hampshire 03275, USA
(72) Inventor: James R. Kauffman, 31 Norma Drive, Nashua, New Hampshire 03062, USA
(72) Inventor: Paul K. Harter Jr., 61 Wheely Road, Groton, Massachusetts 01540, USA
(72) Inventor: Frederick G. Kleinsorge, 4 Farmington Road, Amherst, New Hampshire 03031, USA
(72) Inventor: Stephen F. Shirron, 123 Parker Street, Acton, Massachusetts 01720, USA
F-term (reference): 5B045 EE25 GG01 JJ46

Claims

[Claims]

1. A computer system having a plurality of system resources including processors, memory, and I/O circuitry, the computer system comprising: an interconnection mechanism for electrically interconnecting the processors, memory and I/O circuitry so that each processor has electrical access to all of the memory and at least some of the I/O circuitry; a software mechanism for dividing the system resources into a plurality of partitions; and at least one operating system instance running in the plurality of partitions.

2. The computer system of claim 1, wherein the at least one operating system instance includes at least two operating system instances of different operating systems.

3. The computer system of claim 1, wherein at least some memory is exclusively assigned to each partition.

4. The computer system of claim 1, wherein a plurality of processors are physically divided among the partitions, and each partition includes a console program for controlling a processor of the partition.

5. The computer system according to claim 1, further comprising means for maintaining configuration information indicating which of a plurality of system resources is designated for each partition.

6. The computer system according to claim 5, wherein one of the processors executes a master console program that generates the configuration information, each partition includes a console program that controls the processors of the partition, and the console program of each partition communicates with the master console program to exchange configuration information.

7. The computer system of claim 1, wherein said interconnection mechanism includes a switch.

8. The computer system, comprising: a configuration database stored in memory indicating which partitions are part of the computer system; and a master console including means for forming the configuration database during a power-up sequence of the computer system, wherein the configuration database includes information indicating whether each operating system instance is active.

9. The computer system according to claim 8, wherein the operating system instances comprise means for continuously monitoring one another's activity in order to detect malfunctions in the operating system instances, and each operating system instance comprises means for monitoring the other operating system instances by a heartbeat mechanism.

10. A method for configuring a computer system having a plurality of system resources including processors, memory, and I/O circuitry, the method comprising the steps of: (a) electrically interconnecting the processors, memory and I/O circuitry so that each processor has electrical access to all of the memory and at least some of the I/O circuitry; (b) dividing the system resources into a plurality of partitions; and (c) running at least one operating system instance in the plurality of partitions.

11. The method of claim 10, wherein step (c) comprises: (c1) executing at least two different operating system instances in the plurality of partitions.

12. The method of claim 10, wherein step (b) includes (b1) assigning at least some of the memory to each partition.

13. The method of claim 10, wherein step (b) comprises: (b2) physically dividing the processors among the partitions; and (b3) executing, on the processors of each partition, a console program that controls those processors.

14. The method of claim 13, wherein step (b) includes (b4) designating a primary processor in each partition, and step (c) includes: (c1) running each operating system instance on the primary processor of a partition; and (c2) controlling each operating system instance to communicate with the console program for that partition.

15. The method of claim 10, further comprising the step of: (d) maintaining configuration information indicating which of the plurality of system resources is designated for each partition.

16. The method of claim 15, wherein step (d) includes: (d2) executing a master console program on one of the processors, the master console program generating the configuration information; and (d3) executing in each partition a console program that controls the processors of the partition, wherein step (d3) further includes (d3a) using the console program of each partition to communicate with the master console program so that configuration information is exchanged and transmitted to each of the console programs.

17. The method of claim 10, wherein step (a) further comprises (a1) using a switch to interconnect the processors, memory and I/O circuitry.

18. The method of claim 10, further comprising the step of (e) forming a configuration database containing information indicating which partitions are part of the computer system, step (e) including (e1) forming a configuration database that includes information indicating whether each operating system instance is active.

19. The method of claim 18, wherein step (c) further comprises (c3) using the operating system instances to continuously monitor one another's activity by a heartbeat mechanism in order to detect malfunctions of the operating system instances.