JP2008243203A

JP2008243203A - Workload control in virtualized data processing environment

Info

Publication number: JP2008243203A
Application number: JP2008071099A
Authority: JP
Inventors: Diane G Flemming; ダイアン・ガルザ・フレミング; Mysore Sathyanarayana Srinivas; マイソール・サシャナラヤーナ・シュリニバス; William A Maron; ウィリアム・エー・マローン; Octavian Florin Herescu; オクタビアン・フローリン・ヘレスキュ
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 2007-03-28
Filing date: 2008-03-19
Publication date: 2008-10-09
Anticipated expiration: 2028-03-19
Also published as: JP5243822B2

Abstract

PROBLEM TO BE SOLVED: To provide a system, a method, and a computer-readable medium for balancing access to the physical system resources of a computer system that uses system virtualization among a large number of logical partitions.

SOLUTION: Each logical partition is classified at startup according to its level of use of its allocated dispatch window. Performance metrics for one or more of the physical system resources are determined in association with one or more of the logical partitions. The performance metrics are determined at the hardware level, independently of programming interrupts. During a dispatch window in which a given set of the physical system resources is configured for allocation to one of the logical partitions, the given set is reallocated to a replacement logical partition according to the performance metrics determined in association with the replacement logical partition and according to that partition's dispatch window usage classification.

COPYRIGHT: (C)2009,JPO&INPIT

Description

The present invention relates generally to managing workloads in data processing systems. More specifically, the present invention relates to managing workloads in partitioned systems, such as logically partitioned systems.

Logical partitioning of computer resources allows multiple system images to be established within a single physical machine or processor complex. Virtualization is the term for this kind of system imaging, in which each system image, also known as a virtual machine (VM), operates logically independently of the other VMs that use the shared resources of the physical computer system. In this way, each logical partition corresponding to a VM can be reset independently, loaded with an operating system that may differ from partition to partition, and run different software programs using different input/output (I/O) devices. A commercially available form of logically partitioned system is, for example, IBM's POWER5 multiprocessor architecture.

One important aspect of logical partitioning is managing the workload of each partition. In POWER5, for example, a workload manager called a hypervisor manages the workload across partitions. In this type of shared-resource environment, the hypervisor allocates physical system resources such as memory, central processing units (CPUs), and I/O to logical partitions using alternating time-slot scheduling techniques that are broadly similar to general multitasking computer scheduling. The hypervisor attempts to balance partition workloads by dispatching each partition's work, as logical processors, to the physical system resources on demand and/or in a pre-allocated manner.

One aspect of partition scheduling relates specifically to the use and sharing of processor resources. Partitions that use the processor capacity of a shared processor pool are defined, for scheduling purposes, as either capped or uncapped. A capped partition cannot exceed its configured processor entitlement. Uncapped support for logical partitions allows an uncapped partition to exceed its configured capacity when there is unused capacity in the shared processor pool. Such unused capacity arises when other partitions do not fully use their configured capacity or when the capacity of the shared pool is not fully allocated.
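The capped and uncapped rules above can be illustrated with a minimal Python sketch. The `Partition` fields, the `grant_capacity` function, and the first-come order in which spare capacity is handed out are all assumptions made for illustration; the text does not specify how a real hypervisor distributes the surplus at this point.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Partition:
    name: str
    entitlement: float  # configured processor capacity (hypothetical units)
    capped: bool        # True: may never exceed its entitlement
    demand: float       # capacity the partition would like to use right now


def grant_capacity(partitions: List[Partition], pool_capacity: float) -> Dict[str, float]:
    """Grant each partition up to its entitlement, then hand spare pool
    capacity to uncapped partitions (simple first-come order, an assumption)."""
    grants = {p.name: min(p.demand, p.entitlement) for p in partitions}
    unused = pool_capacity - sum(grants.values())
    for p in partitions:
        if p.capped or unused <= 0:
            continue  # capped partitions never exceed their entitlement
        extra = min(p.demand - grants[p.name], unused)
        if extra > 0:
            grants[p.name] += extra
            unused -= extra
    return grants
```

With a pool of 4.0 units and three partitions each entitled to 1.0, a capped partition that demands 2.0 still receives only 1.0, while an uncapped partition that demands 2.5 can absorb capacity left idle by the others.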

When dispatched, a logical partition incorporates its allocated physical processor resources as logical processors. Scheduling a logical processor (sometimes called a virtual processor) requires allocating a pre-specified period, or time slice: the period, within a given dispatch window, during which processing cycles, memory, and other physical system resources are assigned for use by the partition. For example, the AIX operating system running on POWER5 has a default dispatch window of 10 milliseconds. Any unused portion of an allocated dispatch window can be assigned to one or more of the system's uncapped partitions. A lottery mechanism based on the priority levels of the uncapped partitions is often used to determine which uncapped partition replaces the originally scheduled partition for the unused portion of the dispatch window.
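A priority-based lottery of the kind described above might look like the following Python sketch. The function name `pick_substitute` and the convention that a larger number means a higher priority are assumptions; the text says only that the draw is based on the priority levels of the uncapped partitions.

```python
import random
from typing import Dict, Optional


def pick_substitute(uncapped_priorities: Dict[str, float], rng=random) -> Optional[str]:
    """Draw one uncapped partition with probability proportional to its
    priority weight. Returns None when no partition is eligible."""
    names = list(uncapped_priorities)
    weights = [uncapped_priorities[n] for n in names]
    if not names or sum(weights) <= 0:
        return None
    # random.choices performs a weighted draw with replacement; k=1 yields one winner.
    return rng.choices(names, weights=weights, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the draw reproducible, which is convenient when experimenting with different priority assignments.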

The substitution dispatch technique described above is relatively simple and computationally inexpensive, but it does not adequately address potential inefficiencies related to the logical structure and functional characteristics of partitions. A significant source of scheduling inefficiency arises when so-called interactive partitions are replaced during their respective dispatch windows. Partitions are characterized as "interactive" or "batch" based on their dependence on external processing events and their corresponding interruptibility during a given dispatch window. Batch partitions are largely independent of responses from external events and therefore typically use their entire dispatch window. Interactive partitions, by contrast, commonly suspend activity during a dispatch window to wait for external event responses.
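The batch versus interactive distinction can be sketched as a classification by dispatch window usage. The 0.9 threshold and the function name `classify_partition` below are hypothetical; the text states only that batch partitions typically use the whole window while interactive partitions suspend partway through.

```python
def classify_partition(used_ms: float, window_ms: float = 10.0,
                       batch_threshold: float = 0.9) -> str:
    """Label a partition by the fraction of its dispatch window it used.

    batch_threshold is an illustrative assumption, not a value from the text.
    """
    if window_ms <= 0:
        raise ValueError("window_ms must be positive")
    usage = used_ms / window_ms
    return "batch" if usage >= batch_threshold else "interactive"
```

The 10 millisecond default mirrors the AIX on POWER5 dispatch window mentioned earlier in the description.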

To make use of the otherwise unused cycles in a dispatch window in which an interactive partition has suspended work, the hypervisor may attempt to replace the suspended partition using the prioritized lottery mechanism described above. In many cases, however, the suspended partition is awaiting an urgent external event response. Regardless of its current suspended state, it is therefore likely to need additional cycles to finish a task that would complete within the current dispatch window if the partition were not replaced.

Dispatch window cycles are wasted if a suspended partition is not replaced during its period of inactivity. Conventional partition substitution techniques, on the other hand, put to use dispatch window cycles that would otherwise be wasted, but they fail to address the computational cost of interrupting the interactive processing of the replaced interactive partition. Such an interruption requires re-queuing the replaced interactive partition and cycling back through the queue to dispatch it again. Unlike dedicated systems, virtualized systems must re-establish the memory footprint on each dispatch. Consequently, in addition to requiring re-queuing, a replaced interactive partition must spend additional cycles restoring its memory footprint, which is a significant source of workload management inefficiency in virtualized systems.

Conventional logical partition management fails to address the problems described above, and many others, relating to partition scheduling and runtime workload balancing. It can therefore be appreciated that a need exists for methods, systems, and computer programs for managing scheduling and workload balancing among logical partitions. The present invention addresses these and other needs left unresolved by the prior art.

Disclosed herein are a system, method, and computer-readable medium for balancing access to the physical system resources of a computer system that uses system virtualization among multiple logical partitions. Each logical partition is classified at startup according to its level of use of its allocated dispatch window. Performance metrics for one or more of the physical system resources are determined in association with one or more of the logical partitions. The performance metrics are determined during partition dispatch using hardware detection and tracking logic that is independent of programming interrupts. During a dispatch window in which a given set of physical system resources is configured for allocation to one of the logical partitions, the given set is reallocated to a replacement logical partition according to the performance metrics determined for the replacement logical partition and according to its dispatch window usage classification.

In another aspect, a method, system, and computer program for balancing workload among multiple logical partitions that share physical system resources uses memory footprint statistics to determine eligibility and priority for partition replacement. The method includes determining performance metrics for one or more of the physical system resources in association with the logical partitions, and using the performance metrics to determine memory footprint values. During a dispatch window in which a given set of physical system resources is allocated to one of the logical partitions, the given set is reallocated to another logical partition according to the determined memory footprint values.

In another aspect, a method, system, and computer program are disclosed for dynamically tuning a scheduler that schedules multiple logical partitions sharing physical system resources during a given dispatch window. The method includes dispatching the logical partitions at system startup using a preset dispatch window period. While the logical partitions are dispatched, performance metrics for one or more of the physical system resources are determined in association with the logical partitions. The performance metrics associated with a partition are used to determine a memory footprint value for that logical partition, which is used, among other scheduling heuristics, to dynamically determine the scheduling of the partition during the dispatch window period.

The above as well as additional objects, features, and advantages of the present invention will become apparent in the following detailed description.

The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as its preferred mode of use, further objects, and advantages, is best understood by reference to the following detailed description of exemplary embodiments when read in conjunction with the accompanying drawings.

The systems and methods of the present invention enable more efficient resource allocation and workload balancing within a virtualized computer environment. An exemplary virtualized computer environment includes multiple logical partitions, with workload managed among two or more of the partitions. As used herein, the term partition generally refers to a subset of data processing hardware resources allocated to an operating system. A partition may also refer to a thread or any other unit of computation. Preferred embodiments for implementing the systems and methods of the invention are illustrated and described below with reference to the drawings, in which like reference numerals refer to like corresponding parts throughout.

The present invention enables dynamic redistribution of shareable resources among logical partitions under the direction of a workload manager. In one aspect, the invention achieves improved workload management and system efficiency by using a dynamically adjustable partition scheduling mechanism. The partition scheduling mechanism features a hardware tracking mechanism that determines partition performance metrics which, in one embodiment, relate to establishing memory footprint as a partition scheduling metric. In another aspect, the invention uses the hardware-tracked memory footprint metrics to dynamically adjust the scheduling of each logical partition within a dispatch window.

The present invention enables dynamic redistribution of shareable resources among logical partitions under the direction of a workload manager, which in one embodiment may be a hypervisor. These resources may include, for example, CPU resources, logical processor resources, I/O resources, coprocessors, and channel resources. In one embodiment, dynamic adjustment of resource allocation is achieved by integrating hypervisor functions with hardware and firmware partition monitoring mechanisms in a performance-tuning feedback loop that achieves workload balancing and increased overall system efficiency.

In one aspect, the present invention addresses the limits on system throughput imposed by memory access latency. The invention mitigates the effects of memory latency by determining memory access statistics and using them for partition scheduling. Such memory-sensitive partition scheduling improves partition scheduling decisions and allows greater flexibility in dispatch window scheduling. In one embodiment, the invention accounts for the cost of establishing a memory footprint when selecting a partition to preempt or otherwise replace the originally dispatched partition. The invention further accounts for the footprint establishment cost incurred when the replaced partition is subsequently redispatched to reclaim part of its original dispatch window.
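The two footprint costs named above can be weighed against the cycles that a replacement would recover. The following Python sketch uses a purely hypothetical linear cost model; the function names, the cycle units, and the "gain must be positive" rule are assumptions for illustration and are not taken from the description.

```python
def substitution_gain(unused_cycles: int,
                      substitute_footprint_cost: int,
                      replaced_restore_cost: int) -> int:
    """Net cycles recovered by a replacement under a hypothetical linear model:
    cycles that would otherwise idle, minus the cycles the replacement spends
    establishing its memory footprint, minus the cycles the replaced partition
    later spends restoring its own footprint."""
    return unused_cycles - substitute_footprint_cost - replaced_restore_cost


def should_substitute(unused_cycles: int,
                      substitute_footprint_cost: int,
                      replaced_restore_cost: int) -> bool:
    """Replace only when the modeled net gain is positive (an assumed policy)."""
    return substitution_gain(unused_cycles,
                             substitute_footprint_cost,
                             replaced_restore_cost) > 0
```

Under this model, a replacement that frees 5,000 idle cycles but costs 3,000 cycles to warm up and another 3,000 cycles for the original partition to recover would be rejected.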

Referring now to the drawings, in which like reference numerals refer to like corresponding parts throughout, FIG. 1 shows one embodiment of a virtualized computer environment that implements the workload management features of the present invention. A virtual computer system 100 is illustrated that includes many of the features found in a POWER5 server offered by International Business Machines Corporation of Armonk, New York. Virtual computer system 100 generally includes firmware layer resources 120, which include a hypervisor 115, and hardware layer resources 122, which include a shared processor pool 117 and memory devices 121 and 125. Shared processor pool 117 preferably comprises a multiprocessor complex that includes multiple processors 108, 110, 112, and 114, labeled CPU1-CPU4 respectively, each with an associated cache memory M1-M4 121. Virtual computer system 100 further includes multiple logical partitions 105A-105D, labeled LP1-LP4 respectively. Hypervisor 115 manages and coordinates the allocation of hardware layer resources 122 among logical partitions 105A-105D.

CPU1-CPU4 and their associated cache memories M1-M4 represent a portion of the physical system resources that hypervisor 115 allocates to logical partitions LP1-LP4 to provide resource virtualization. Physical system resources are, in general, tangible system devices, components, and associated physical phenomena, such as memory devices, processors, drivers, buses, and processor/bus cycles, as distinguished from non-physical, abstract system resources such as program layering organization and program protocols such as those associated with an operating system. Physical system resources are also distinguishable from logically or virtually definable entities such as virtual machines. Each of logical partitions LP1-LP4 includes one or more logical processors (not explicitly shown), each of which represents all or part of one of the physical processors CPU1-CPU4 allocated to the partition. The logical processors of a given one of partitions 105A-105D may be dedicated to that partition, so that the underlying processor resources are reserved for it, or shared, so that the underlying processor resources are also available to other partitions.

In the illustrated embodiment, each of logical partitions LP1-LP4 functions as a separate system having a resident operating system 104, which may differ from partition to partition, and one or more applications 102. In one embodiment, one or more of operating systems 104A-104D may be the Linux operating system or the i5/OS™ operating system offered by IBM. In addition, operating systems 104A-104D (or a subset thereof) include respective OS workload managers 106A-106D for managing the application workload within each respective partition.

In one embodiment, hypervisor 115 operates as a hidden partition with no entitled capacity. The allocation of system resources to logical partitions LP1-LP4 is managed by hypervisor 115, which may be implemented by microcode running on processors CPU1-CPU4. Hypervisor calls provide a means for any of operating systems 104A-104D to communicate with hypervisor 115, enabling more efficient use of physical processor capacity by supporting scheduling heuristics that minimize partition idle time using the techniques described in more detail below. Logical partitions LP1-LP4 and hypervisor 115 typically comprise one or more tangible program modules residing in respective portions of the central memory associated with processors CPU1-CPU4.

FIG. 2 is a high-level conceptual diagram illustrating an exemplary architecture 200 adapted to facilitate partition scheduling in accordance with one embodiment of the present invention. Partition scheduling architecture 200 integrates a partition management unit (PMU) 204 with other system components such as shared processor pool 117, hypervisor 115, and cache memory 206. Although PMU 204 is shown in FIG. 2 as a separate module, it should be noted that some or all of the hardware, firmware, and software components of PMU 204 may be integrated within hypervisor 115. It should further be noted that cache block 206 represents some or all of the collection of cache memory resources M1-M4 used by one or more of CPUs 108, 110, 112, and 114 in shared processor pool 117.

PMU 204 includes logic, program modules, and other hardware, firmware, and/or software modules that monitor physical system resource performance metrics, such as metrics relating to memory usage, for the resources allocated to partitions LP1-LP4. These performance metrics preferably include cache usage and, specifically, metrics relating to the cache footprint of each of partitions LP1-LP4. The high-level conceptual diagram of FIG. 2 shows the interfaces for integration and interaction between PMU 204 and the other system components that enable such monitoring of the physical system resources associated with the logical partitions.

When a currently dispatched logical partition executes its instruction stream on a CPU of shared processor pool 117 and accesses the contents of a memory location via a load or store operation, the CPU issues these requests to its associated cache 206 through CPU-cache interface 212. The task of cache 206 is then to determine whether the memory contents are present in the cache's storage and (a) if so, return the cached data to the CPU, or (b) if not, fetch the memory contents from main memory, such as shared memory 125, before performing the load or store. If the requested memory contents are already in cache 206, the data is returned to the CPU without accessing shared memory 125 via cache-memory interface 210 or the like. No interaction with PMU 204 is needed at this point. If, however, the requested data is not available in cache 206, the data must be fetched from main shared memory 125 through cache-memory interface 210.
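The hit and miss paths described above can be modeled in a few lines of Python. This is a simplified software model, not the hardware: the `TrackedCache` class is hypothetical, the cache never evicts, a plain dictionary stands in for shared memory 125, and the miss counter is an assumption about where a cache miss count would be incremented.

```python
class TrackedCache:
    """Toy model of cache 206 sitting in front of main memory."""

    def __init__(self, main_memory: dict):
        self.main_memory = main_memory  # address -> value, stands in for shared memory 125
        self.lines = {}                 # cached contents (no eviction in this sketch)
        self.miss_count = 0             # assumed location of a cache miss counter

    def load(self, address):
        # (a) Hit: return the cached data without touching main memory.
        if address in self.lines:
            return self.lines[address]
        # (b) Miss: fetch from main memory first, then satisfy the load.
        self.miss_count += 1
        value = self.main_memory[address]
        self.lines[address] = value
        return value
```

Loading the same address twice touches main memory only once, so the miss counter advances on the first access and stays unchanged on the second.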

Referring to FIG. 3, there is shown a high-level conceptual diagram of the internal architecture of partition monitoring unit 204, hypervisor 115, and a partition history table 305, which may be implemented within the architecture shown in FIG. 2. The input side includes PMU 204, which is shown as including a tracking logic module 302 that processes inputs from CPU interface 208 and cache interface 214 to generate partition vectors 308, 310, and 312. On the output side, the architecture includes hypervisor 115 and partition history table 305. In the illustrated embodiment, hypervisor 115 includes a priority value calculation module 304 that processes partition vectors 308, 310, and 312 to generate and update the contents of partition history table 305. Partition history table 305 contains entries for all N logical partitions in virtual computer system 100. In the illustrated embodiment, partition history table 305 is shown as containing row records, each corresponding to one of the N logical partitions in the system, and each partition's record contains multiple column data fields. For each logical partition, the column fields include a logical partition (LP) identifier field, a field for a cycles-per-instruction (CPI) value, a field for a cache line count (CLC) value, and a field for a cache miss count (CMC) value, all of which are described in more detail below. In addition to the hardware-detected CPI, CLC, and CMC values, the column fields of each partition history table entry include a field for a memory footprint period value T_FP, a field for a footprint value variation VAR, and a field for a cache affinity (CA) value that can be derived from one or more of the CPI, CLC, and CMC values.
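One way to picture a row of partition history table 305 is as a record with the columns listed above. In the Python sketch below, the class and function names are hypothetical, and the derived fields (T_FP, VAR, CA) are left unset because this passage does not give the formulas used to compute them.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class PartitionHistoryEntry:
    lp_id: str                    # logical partition (LP) identifier
    cpi: float = 0.0              # cycles per instruction, hardware detected
    clc: int = 0                  # cache line count, hardware detected
    cmc: int = 0                  # cache miss count, hardware detected
    t_fp: Optional[float] = None  # memory footprint period value T_FP
    var: Optional[float] = None   # footprint value variation VAR
    ca: Optional[float] = None    # cache affinity, derivable from CPI/CLC/CMC


def update_entry(table: Dict[str, PartitionHistoryEntry], lp_id: str,
                 cpi: float, clc: int, cmc: int) -> PartitionHistoryEntry:
    """Create or refresh the row for one partition from freshly tracked values."""
    entry = table.setdefault(lp_id, PartitionHistoryEntry(lp_id))
    entry.cpi, entry.clc, entry.cmc = cpi, clc, cmc
    return entry
```

Keying the table by partition identifier gives one row per logical partition, matching the N-row layout described for the table.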

Within PMU 204, tracking logic module 302 includes logic and data storage hardware devices for detecting, processing, and temporarily storing performance metrics for physical system resources, such as the processing and memory resources shown in FIGS. 1 and 2. Performance metrics are detected in association with the logical partition to which the physical system resources, such as memory and CPUs, are allocated at the time of detection. The detected and processed performance metrics are stored in association with the identity of the logical partition to which the physical resources are currently allocated. The association between a given set of performance metrics and a logical partition may be provided by a CPU interface register 314 that holds the identities of one or more currently dispatched partitions. The partition ID values in register 314 are preferably set when the dispatch decision is made.

Exemplary performance metrics collected by the tracking logic module 302 may include CPI counts, cache line counts, cache miss counts, and other metrics related to memory access or processing efficiency that can be determined, directly or by computation, from signals detected on the CPU interface 208 and/or the cache interface 214. A processor usage resource register 322 can be used to provide a cycle count that measures activity during the time slice in which a partition is dispatched on a physical processor. In the illustrated embodiment, the CPI count detected by the tracking logic 302 for each of the currently dispatched partitions a-m is stored in a dispatched partition vector 308. Similarly, the cache line counts and cache miss counts detected by the tracking logic 302 for the dispatched logical partitions a-m are stored in dispatched partition vectors 310 and 312, respectively.

In a preferred embodiment, the detection logic and data storage devices within the tracking logic module 302 include hardware and firmware devices that collect and process signals on the CPU interface 208 and the cache interface 214 independently of program interrupt mechanisms such as operating system interrupts. Such tracking and storage devices may include hardware such as logic gates and registers, as well as firmware encoding such as that used by system bus snoopers. The tracking logic module 302 performs its detection, processing, and storage functions at the hardware and/or firmware level, independently of software program interrupts. These functions are therefore carried out independently of the management constraints of the operating system kernel. As a result, the sampling rate at which the tracking logic module 302 collects performance metrics can have a sufficiently fine granularity, for example 0.1 millisecond, to obtain the reference data used for partition scheduling and replacement, as described in more detail below.

The system performance metrics in the dispatched partition vectors 308, 310, and 312 comprise, respectively, the CPI count, cache line count, and cache miss count for each of the currently or previously dispatched logical partitions a-m. In FIG. 3, the association between each recorded partition vector value and its corresponding logical partition is represented visually by the subscripts a-m, each of which denotes a particular logical partition. The storage devices within the tracking logic 302 that hold the partition vectors 308, 310, and 312 are preferably dedicated registers.

The performance metrics detected by the tracking logic 302 and collected in the partition vectors 308, 310, and 312 are processed by the hypervisor 115 to update the partition history table 305. An interrupt signal generated or received by the hypervisor 115 determines when the partition history table 305 should be updated with the system data collected in the dispatched partition vectors 308, 310, and 312. The interrupt indicates, or otherwise coincides with, the end of the dispatch window for at least one of the dispatched partitions a-m. On receiving the interrupt, the priority value calculation module 304 retrieves and processes the partition vectors 308, 310, and 312 in order to add or update entries in the partition history table 305. In the illustrated embodiment, in response to the update interrupt signal, the CPI, CLC, and CMC metrics in the dispatched vectors 308, 310, and 312 are compared with, or otherwise processed against, the metric entries in the respective columns of the corresponding logical partition's record. Assuming that FIG. 3 depicts initial system startup, when the logical partitions a-m are dispatched for the first time, the priority value calculation module 304 adds a record entry to the partition history table 305 for each of the partitions a-m and enters the partition vector data from the vectors 308, 310, and 312 into the corresponding record entries. During system initialization, the record creation process continues until records for all N logical partitions are stored in the partition history table 305.
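The update performed when a dispatch window ends could be sketched as below. The sketch assumes, purely for illustration, that each dispatched partition vector is a mapping from partition ID to a metric value and that "otherwise processed" means a simple overwrite; the function and key names are hypothetical.

```python
def update_history(history, cpi_vec, clc_vec, cmc_vec):
    """Fold the dispatched-partition vectors into the history table in place.

    history: dict mapping partition ID -> record dict
    cpi_vec, clc_vec, cmc_vec: dicts mapping partition ID -> metric value,
    standing in for vectors 308, 310, and 312.
    """
    for lp_id, cpi in cpi_vec.items():
        # Create the record on the partition's first dispatch, else reuse it.
        record = history.setdefault(lp_id, {"lp_id": lp_id})
        record["cpi"] = cpi
        record["clc"] = clc_vec[lp_id]
        record["cmc"] = cmc_vec[lp_id]
    return history


# First dispatch of partitions a and b creates their records.
table = update_history({}, {"a": 2.1, "b": 1.4}, {"a": 900, "b": 300}, {"a": 90, "b": 12})
# A later update interrupt refreshes partition a only.
update_history(table, {"a": 1.7}, {"a": 950}, {"a": 40})
```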

Following the initial dispatch and the subsequent creation of a partition record for each logical partition, the priority value calculation module 304 continues, at each signaled update interval, to process the dispatched vectors 308, 310, and 312 in order to add, replace, or otherwise modify the performance metric and/or replacement priority value entries in the partition history table 305, as described in more detail below.

FIG. 4 is a high-level block diagram of a sequence for determining scheduling priorities for logical partitions in accordance with the present invention. Specifically, FIG. 4 illustrates exemplary time-based and event-based priority factors that the hypervisor 115 can use when making dispatch decisions, such as replacing or preempting a currently dispatched partition.

To incorporate priority into overall partition dispatch scheduling as a dynamically adjustable factor, the hypervisor 115 determines the relative or absolute priority of each logical partition as a function of metrics related to the cache footprint, and must therefore obtain the hardware-detected performance metrics from the PMU 204.

As described above, the physical system performance metrics obtained directly from the memory interface and the CPU bus interface are detected, and initially registered and processed, using hardware-level logic and registers within the tracking logic module 302. The cache footprint metrics detected or generated by the tracking logic module 302 preferably include the CPI count, cache line count, and cache miss count for each dispatched partition, and are collected as the partition vectors 308, 310, and 312, which can be combined with other scheduling priority factors to derive an overall dispatch priority.

FIGS. 5-9, in conjunction with FIG. 4, illustrate exemplary embodiments of a method for incorporating partition monitoring functions into logical partition scheduling, specifically in connection with replacing or preempting a dispatched (that is, scheduled) partition after the minimum partition entitlements have been allocated. First, the partition prioritization sequence shown in FIG. 4 illustrates the accumulation and use of a number of priority factors, including hardware-detected physical system performance metrics, that are used to set and adjust priority values for a partition. FIG. 6 illustrates in more detail a computer-implemented process by which hardware-detected physical system performance metrics, in particular metrics related to memory access performance, are determined and then used to characterize one or more partition scheduling priority factors for each partition. FIGS. 7-9 illustrate computer-implemented processes for determining priorities and tuning dispatch windows using the memory footprint metrics associated with the partitions. The prioritization characterization and/or scheduler tuning shown in FIGS. 5 and 6 can be used with a hypervisor dispatcher, such as a dispatcher incorporated within the hypervisor 115. It should be noted, however, that the features and techniques of the invention described herein are not necessarily limited to any one or more of the illustrated embodiments. Those skilled in the art will readily recognize and appreciate that various aspects of the process of determining and using partition priorities can be modified without departing from the spirit and scope of the present invention, and that one or more basic aspects of the mechanisms and processes described herein can be used with other scheduling algorithms.

Continuing with FIG. 4, a partition P_i is created with a base priority value bp(P_i) 402, which may initially be zero or otherwise an intermediate value so that it does not influence partition scheduling or replacement decisions. The base priority value bp(P_i) may be a numeric or numerical indicator, or another quantitative value or indicator, that the hypervisor 115 can use to make partition preemption and/or replacement decisions. The overall priority value 425 of partition P_i preferably includes a temporal fairness component that dynamically adjusts the base priority value bp(P_i) 402. As shown in FIG. 4, the current priority cp(P_i, t) 406 at time t is calculated at each time quantum by increasing the partition's base priority value bp(P_i) by some time-dependent increment d, represented cumulatively as the priority sum dΣt_j 404. The current priority cp(P_i, t) 406 therefore consists of the base priority plus an incremental fairness component that falls somewhere within the priority sum dΣt_j 404, depending on the current time interval j. For partition dispatch that is not based on performance metrics, the priority of the logical partition is equal to the current priority cp(P_i, t) 406.
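One way to write the temporal fairness update compactly is shown below. It reads the priority sum dΣt_j as the increment d applied over the time quanta t_j elapsed so far; that reading, and the upper index J(t) (the number of quanta elapsed up to time t), are assumptions made for illustration rather than notation taken from the figures.

```latex
\[
  cp(P_i, t) \;=\; bp(P_i) \;+\; d \sum_{j=1}^{J(t)} t_j
\]
```

With bp(P_i) = 0 and equal quanta, the current priority simply grows linearly with the number of quanta elapsed, which is what gives a long-waiting partition a progressively higher priority.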

The present invention provides a partition scheduling mechanism that additionally supplies and incorporates physical system performance metrics associated with a partition in order to determine the scheduling priority 425 of partition P_i. This means that, for multiple logical partitions running in a virtualized system on a shared set of physical processing resources (processors, memory, and so on), calculating the priorities of different partitions at the same time may yield different values. Specifically, the performance-independent partition priority cp(P_i, t) 406 is adjusted 420 at each scheduling interval by a performance-based factor Δp 418. The performance-based factor Δp 418 represents one of a range of possible values, so that the priority level of a partition can be raised or lowered using the system performance metrics associated with that partition. In one embodiment, the factor Δp 418 is calculated by a reliability factor calculation module ζ 416 that processes the partition's cache footprint value CFP(P_i). CFP(P_i) is itself calculated, via one or more logic functions represented by the cache footprint calculation module Θ 412, as the cache footprint value of partition P_i quantified by the mechanism of the present invention. That is, a set of physical resource metrics PRM(P_i) associated with one or more partitions is processed by the cache footprint calculation module Θ 412 to obtain a specific memory footprint value, such as CFP(P_i), that is useful for determining partition scheduling priority. Together, the footprint calculation module Θ 412 and the reliability factor calculation module ζ 416 define the performance-based factor Δp 418. Thus, as shown in FIG. 4, the overall priority value of partition P_i at time t is a composite function of the base partition priority, the temporal fairness adjustment, and the physical system metrics measured in association with the partition.
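The composition of the three terms can be sketched as follows. The specification does not give concrete definitions for Θ or ζ, nor does it say whether the adjustment 420 is additive, so everything inside theta and zeta below, the additive combination, and all names and the scale parameter are placeholders chosen for illustration.

```python
def theta(prm):
    """Stand-in for module Θ: map physical resource metrics to a footprint value.

    Assumed definition: the fraction of counted cache lines that did not miss.
    """
    clc, cmc = prm["clc"], prm["cmc"]
    return 0.0 if clc == 0 else max(0.0, 1.0 - cmc / clc)


def zeta(cfp, scale=10.0):
    """Stand-in for module ζ: map a footprint value in [0, 1] to Δp.

    Assumed definition: a signed adjustment centred on a footprint of 0.5.
    """
    return scale * (cfp - 0.5)


def overall_priority(bp, d, quanta, prm):
    """Base priority + temporal fairness term + performance-based factor Δp."""
    cp = bp + d * sum(quanta)        # performance-independent priority cp(P_i, t)
    delta_p = zeta(theta(prm))       # Δp derived from CFP(P_i)
    return cp + delta_p              # assumed additive adjustment


neutral = overall_priority(0.0, 1.0, [1, 2], {"clc": 100, "cmc": 50})
warm = overall_priority(0.0, 1.0, [1, 2], {"clc": 100, "cmc": 0})
```

In this sketch a partition whose metrics suggest a well-established footprint receives a positive Δp, while one with many misses is pushed down; a real implementation would substitute whatever functions Θ and ζ the system designer chooses.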

FIG. 5 is a high-level block diagram of a partition dispatcher state, which can be implemented within the hypervisor 115, for replacing a partition that has been dispatched but has suspended processing, in accordance with the present invention. A replacement partition vector, or replacement queue 505, is created and dynamically adjusted by the hypervisor 115 according to the methods disclosed herein. The replacement queue 505 is accessed to determine absolute or relative replacement priority when, for example, a currently running partition suspends processing during the current dispatch window before its dispatch window time has expired. The replacement queue 505 is organized as a queue of prioritized replacement objects LP_a-LP_n. Each object contains, or is otherwise linked to, a corresponding partition control block (PCB) 502a-502n, and each PCB holds partition state information, including, among other partition-specific data, the partition's configured and adjusted dispatch scheduling state and replacement priority state. As shown in FIG. 5, the replacement objects LP_a-LP_n can be prioritized from "maximum" (that is, the highest priority among the available replacement partitions) to "minimum" (that is, the lowest priority among the available replacement partitions). The components of the replacement queue 505 can advantageously be used to determine scheduling priorities, as will now be described with reference to FIG. 6.
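A replacement queue ordered from highest to lowest priority could be sketched with a binary heap, as below. The class and method names are hypothetical, and the per-partition control block is reduced to a plain dictionary for brevity.

```python
import heapq


class ReplacementQueue:
    """Priority-ordered queue of replacement candidates (illustrative only)."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker so equal priorities pop in insertion order

    def push(self, pcb):
        # heapq is a min-heap, so the priority is negated to pop the maximum first.
        heapq.heappush(self._heap, (-pcb["priority"], self._counter, pcb))
        self._counter += 1

    def pop_highest(self):
        """Return the control block with the highest priority, or None if empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


queue = ReplacementQueue()
queue.push({"lp_id": "LP_a", "priority": 1.0})
queue.push({"lp_id": "LP_b", "priority": 5.0})
queue.push({"lp_id": "LP_c", "priority": 3.0})
order = [queue.pop_highest()["lp_id"] for _ in range(3)]
```

Because priorities are adjusted dynamically, a practical version would also need a way to re-key an entry already in the queue; the sketch leaves that out.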

Referring to FIG. 6, there is shown a high-level flow diagram of the steps performed by the partition monitoring unit 204 and the hypervisor 115 to determine partition scheduling priority values in accordance with the present invention. As shown in steps 602 and 604, the process begins by initializing and identifying the logical partitions, for example by creating for each logical partition a corresponding partition control block such as those shown in FIG. 5. As shown in steps 606 and 608, a base replacement priority is set for each partition and then incremented. The hypervisor 115 preferably sets and increments the replacement priorities in the manner shown in FIG. 4, using a temporal fairness scheduling function to adjust the partitions' priority values dynamically in some normalized fashion.

Proceeding as shown in step 610, the logical partitions are dispatched, in one example, according to the configured dispatch window allocation specified by the hypervisor 115. During a partition dispatch window, hardware-based tracking devices and modules, such as those within the tracking logic module 302, are used to track performance metrics for one or more of the physical system resources, in association with the respective logical partition to which those resources are allocated during the dispatch window (step 615). The tracking of performance metrics is carried out at the hardware level, independently of programming interrupts such as operating system interrupts, and preferably includes tracking CPI and other physical resource processing metrics.

As shown in FIG. 6, tracking the physical resource metrics includes a substep of tracking the metrics associated with the dispatched partition and a substep of using the hardware-detected metrics to calculate, or otherwise determine, a scheduling priority value that the hypervisor 115 can use for partition replacement or other dispatch decisions. As shown in step 612, the physical resource metrics associated with a partition are collected and stored using hardware-level logic and registers, at a recording rate whose recording time increment Δt_rec is shorter than the partition dispatch period T_DP divided by an increment factor greater than one. Following and/or in conjunction with the hardware-detected collection of physical resource metrics, a priority value associated with the partition is calculated (step 614). The priority value associated with the partition may be calculated from the hardware-detected and stored physical system metrics, or may be the detected and stored metrics themselves. For example, with reference to the embodiment shown in FIG. 3, the physical resource metrics collected in step 612 include the CPI values collected in the dispatched partition vector 308, and the priority value calculated in step 614 includes a memory footprint value calculated from the CPI values in the process shown and described in more detail below with reference to FIGS. 8A and 8B. Concurrently with the determination of performance metrics in step 615, a reliability factor, such as the VAR value shown in the partition history table 305, is preferably calculated and updated at each dispatch window for the partition, as shown in step 616.

Referring to FIG. 8A, a more detailed representation of the processing steps incorporated in step 615 is provided. Specifically, FIG. 8A shows a high-level flow diagram of the steps performed by, for example, the PMU 204 and the hypervisor 115 to determine the memory footprint performance value used in dispatch decisions. The process begins at the partition dispatch step 610 and proceeds to step 806, in which CPI data is determined and recorded at a time interval point t_n within the dispatch period of the dispatched partition. The CPI value collected at t_n is compared with a previously recorded CPI value for the same partition. The previously recorded CPI value represents the CPI value determined at an earlier point within a dispatch period of the same partition. An exemplary graph of CPI data collected at various points in the dispatch period cycle is shown in FIG. 8B. In FIG. 8B, over the dispatch period denoted T_DP, the time at which the dispatch period begins is denoted t_0 and the time at which it ends is denoted t_DP. Over one or more such dispatches of the same partition, the times at which CPI data is determined and recorded are denoted t_rec0, t_rec1, t_rec2, t_rec3, and so on. Returning to step 808 of FIG. 8A, the comparison between CPI data values may include, for example, comparing the CPI value recorded at time t_rec2 with the CPI value recorded at t_rec1. In a preferred embodiment, the CPI data at the different time increments is determined across different dispatch periods for a given logical partition.

The present invention accounts for the substantial cost of a dispatch that arises from the need to re-establish the memory footprint on each partition dispatch. The purpose of the comparison performed in step 808 is to determine the period a partition needs to establish its memory footprint. Determining when the memory footprint has been established includes finding a corner point, such as that shown at t_rec3 in FIG. 8B, at which the CPI value changes by less than a specified threshold from the CPI value recorded at the preceding time increment of the preceding dispatch period. In FIG. 8B, time t_rec3 represents such a corner point, where the difference between the CPI value recorded at t_rec3 and the CPI value recorded at t_rec2 is less than the threshold ΔCPI_THRSHLD. The footprint period T_fp is therefore the period from t_0 to t_rec3. The determination of whether the difference between the CPI values is less than the specified threshold is shown at step 810.

In response to the difference between the CPI values being greater than or equal to the threshold, the CPI value recorded at step 806 (that is, the most recent CPI value) is stored (step 812), and the recording interval value is increased for the next dispatch period (step 814) before the process returns to the next dispatch of the same partition at step 610. When a given comparison, such as the comparison between the CPI value collected at t_rec3 and the CPI value previously collected at t_rec2, satisfies the threshold criterion, the footprint establishment period for that partition is recorded in association with the partition's identity, for example in the partition history table 305 (step 816).
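The corner-point search of steps 806 through 816 can be sketched as a scan over successive CPI samples. The sketch flattens the per-dispatch loop into a single list of (recording time, CPI) pairs, which is a simplification, and the function name and sample values are invented for illustration.

```python
def footprint_period(samples, cpi_threshold):
    """Return the recording time at which CPI stops changing appreciably.

    samples: list of (t_rec, cpi) pairs in increasing time order, one per
             recording point, with times measured from the dispatch start t_0.
    cpi_threshold: stand-in for the threshold on the CPI difference.
    Returns the corner-point time (the footprint period), or None if the
    CPI has not yet settled within the samples seen so far.
    """
    for (_, prev_cpi), (t_cur, cur_cpi) in zip(samples, samples[1:]):
        if abs(cur_cpi - prev_cpi) < cpi_threshold:
            return t_cur  # corner point found (corresponds to step 816)
    return None           # keep sampling at a later point in the next dispatch


# CPI falls quickly while the cache warms up, then levels off.
cpi_samples = [(1.0, 8.0), (2.0, 5.0), (3.0, 3.0), (4.0, 2.9)]
t_fp = footprint_period(cpi_samples, cpi_threshold=0.5)
```

Here the first two differences (3.0 and 2.0) exceed the threshold, and the third (about 0.1) does not, so the fourth recording point is taken as the corner point.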

Returning to FIG. 6 and proceeding as shown in step 618, the hypervisor 115 and/or the PMU 204 can determine how much of its dispatch period each logical partition uses. Using the determination in step 618, each logical partition is classified either as a batch partition, which uses substantially all of its configured dispatch period allocation, or as an interactive partition, which is subject to interrupted processing in which the partition must wait for external events during the dispatch period. A dispatched partition classified as a batch partition uses substantially the entire dispatch period, whereas an interactive partition undergoes interruptions, such as when its processing is suspended, and must wait for a response from another process. If the dispatch period usage for a given partition exceeds a specified threshold, which in one embodiment is 95%, the partition is classified as a batch partition (steps 618 and 622). Otherwise, the partition is classified as interactive, as shown in steps 618 and 620.
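The classification rule can be expressed in a few lines. The 95% default is the value given for one embodiment; the function name, argument names, and string labels are illustrative assumptions.

```python
def classify_partition(time_used, dispatch_period, threshold=0.95):
    """Classify a partition by the share of its dispatch period it actually used."""
    if dispatch_period <= 0:
        raise ValueError("dispatch_period must be positive")
    usage = time_used / dispatch_period
    # Strictly above the threshold means the partition ran essentially uninterrupted.
    return "batch" if usage > threshold else "interactive"


batch_like = classify_partition(time_used=9.8, dispatch_period=10.0)
interactive_like = classify_partition(time_used=4.0, dispatch_period=10.0)
```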

The dispatch period usage classification is included as a field in each partition's partition control block, as shown earlier in FIG. 5, and can be used by the hypervisor 115 as a scheduling priority criterion. For example, the hypervisor 115 can exclude partitions identified as interactive from eligibility as replacement partitions in the replacement queue 505 (see FIG. 5). In addition, as described below with reference to FIG. 7, the hypervisor 115 can use the dispatch period usage classification to decide whether to reallocate the remainder of a dispatch window to the original partition that was preempted or replaced.

The process of determining partition scheduling priority values concludes, as shown in steps 624 and 626, by entering or updating priority values such as those stored in the partition history table 305 and the PCBs 502a-502n. As shown and described above with reference to FIGS. 3-5, the hardware-detected performance metrics are associated with logical partitions, for example in the partition history table 305, and are used by the hypervisor 115 to determine priorities such as those established by the replacement queue 505. In a preferred embodiment, the dispatch period usage classification and the determination of performance metrics can be carried out during a system startup period, which can be measured in time (for example, five minutes) or by a specified number of rotations through the partition dispatch queue. After at least an initial set of priority data has been collected, the hypervisor 115 uses that data for replacing and, in some cases, preempting dispatched partitions, as well as for other scheduling decisions.

FIG. 7 is a high-level flow diagram of the steps performed by a dispatcher, such as a dispatcher within the hypervisor 115, to balance the workload among logical partitions in accordance with the present invention. As shown in steps 702 and 710, the process begins with the hypervisor 115 dispatching the next set of one or more partitions according to the partition scheduling configured by the system. During the current dispatch cycle, the physical system resource metrics of the dispatched partitions, including CPI and CLC, are recorded and updated (step 712), and replacement priority values, such as those calculated in step 614 of FIG. 6, are updated (step 713).

As shown at step 714, hypervisor 115 may determine, according to the partition replacement priority data, whether a given dispatched partition may be preempted. As used herein, partition preemption is similar to, and includes many of, the mechanisms used to replace a dispatched partition that has ceded during its dispatch window. The difference is that preemption does not require the partition being replaced to have suspended processing as a condition for initiating the replacement steps described herein. As shown at steps 714 and 720, in response to determining that pre-specified preemption criteria are met for a given dispatched partition in view of priority values, such as those contained in partition history table 305, hypervisor 115 suspends the dispatched partition and allocates the system resources to a selected partition for the remainder (possibly the entirety) of the dispatch window. The alternate partition is selected according to the replacement priority data in the same or a similar manner as the scheduling prioritization steps described below with reference to step 718 and FIG. 9. Hypervisor 115 uses the priority data in alternate queue 505 and partition control blocks 502a-502n to determine whether the next partition to be dispatched, or otherwise the currently dispatched partition, is to be preempted by the selected alternate partition.

If the preemption criteria are not met (step 714) and the originally scheduled partition uses its entire dispatch allocation period (i.e., the partition does not suspend processing), the hypervisor dispatch and load-balancing process continues with the next dispatch at step 710.

If a partition that has been dispatched and not preempted suspends processing during the dispatch window period (step 716), hypervisor 115 processes the replacement priority data provided for the available partitions by alternate queue 505 and/or partition control blocks 502a-502n to determine whether pre-specified partition replacement criteria are met (step 718).

The replacement determination shown at step 718 preferably includes evaluating the replacement priority values of the currently idle logical partitions to determine which of them are eligible to replace the currently suspended partition. The eligibility determination evaluates one or more of the partition replacement priority values, such as the memory footprint values contained in partition control blocks 502a-502n, in view of imposed limitations such as the limited remaining portion of the dispatch window. For example, the originally dispatched partition has a dispatch window period T_DW, which is 10 milliseconds for the IBM POWER5 architecture. The dispatch window period T_DW is effectively divided into dispatch increments T_DPSTCH, each of which is the minimum configured run-time increment for a partition and, consequently, the minimum increment at which hypervisor 115 can replace or preempt a dispatched partition. Under these circumstances, the replacement criteria determination at step 718 includes determining which of the available alternate partitions are eligible in view of the limitations imposed by T_DPSTCH and by the remaining portion of T_DW for the suspended partition.
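The relationship between T_DW, T_DPSTCH, and the remaining portion of the window can be illustrated with a minimal Python sketch. The 10 ms window is the POWER5 figure quoted above; the 1 ms value for T_DPSTCH, the function name `usable_remaining_us`, and the rule that an alternate partition can start only on the next increment boundary are assumptions made for illustration and are not taken from the specification.

```python
def usable_remaining_us(t_dw_us: int, t_dpstch_us: int, elapsed_us: int) -> int:
    """Return the usable remainder of a dispatch window, in microseconds.

    Assumption (not from the source): an alternate partition can only be
    started on the next T_DPSTCH boundary after the suspension point.
    """
    if t_dpstch_us <= 0 or t_dw_us < t_dpstch_us:
        raise ValueError("need 0 < T_DPSTCH <= T_DW")
    if not 0 <= elapsed_us <= t_dw_us:
        raise ValueError("elapsed time must lie inside the dispatch window")
    # Round the elapsed time up to the next dispatch-increment boundary.
    next_boundary_us = -(-elapsed_us // t_dpstch_us) * t_dpstch_us
    return max(t_dw_us - next_boundary_us, 0)


T_DW_US = 10_000     # 10 ms dispatch window (POWER5 value quoted in the text)
T_DPSTCH_US = 1_000  # hypothetical 1 ms dispatch increment

# A partition that suspends 3.4 ms into the window leaves 6 ms usable.
print(usable_remaining_us(T_DW_US, T_DPSTCH_US, 3_400))  # 6000
```

Integer microseconds are used so that the boundary arithmetic is exact.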

The present invention includes a dispatch selection feature that ensures a specified level of scheduling efficiency. Specifically, FIG. 9 is a high-level flow diagram illustrating steps performed by a partition scheduler in determining whether the replacement criteria are met (step 718) and in selecting an alternate partition so as to ensure a specified level of processing efficiency. The process begins, as shown at steps 902 and 904, when a replacement event, such as the preemption criteria being met or a dispatched partition suspending, prompts the replacement criteria determination. The replacement criteria determination shown at step 718 includes substeps 906 and 908 for selecting an alternate partition using both the memory footprint priority values of the available partitions and the remaining available portion of the dispatch window.

Step 906 depicts a determination of whether the footprint establishment period T_fp for an available alternate partition, which may be recorded in the partition control blocks and/or partition history table 305, is less than the remaining portion T_remaining of the dispatch window. In addition, the determination shown at step 906 uses a tunable factor x, which may be set to a value greater than 1, to determine whether the T_fp value for a partition is small enough for the alternate partition to achieve a specified level of processing efficiency within the available dispatch window period T_remaining. For example, in one embodiment, hypervisor 115 makes the replacement decision at step 718 according to whether the alternate logical partition has a determined memory footprint value T_fp that satisfies the relation xT_fp ≤ T_remaining, where x is greater than 1 and preferably at least 10.
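The eligibility relation xT_fp ≤ T_remaining can be written as a one-line predicate. The following Python sketch is illustrative only: the function name `is_eligible` and the sample values are hypothetical, while the default x = 10 follows the "preferably at least 10" guidance above.

```python
def is_eligible(t_fp_us: float, t_remaining_us: float, x: float = 10) -> bool:
    """Check the step 906 relation x * T_fp <= T_remaining.

    t_fp_us: time the candidate partition needs to establish its memory footprint.
    t_remaining_us: unused portion of the current dispatch window.
    x: tunable efficiency factor, required to be greater than 1.
    """
    if x <= 1:
        raise ValueError("x must be greater than 1")
    return x * t_fp_us <= t_remaining_us


# With 6 ms left in the window (hypothetical values):
print(is_eligible(500, 6_000))  # True:  10 * 500 = 5000 <= 6000
print(is_eligible(700, 6_000))  # False: 10 * 700 = 7000 >  6000
```

A larger x demands that the footprint establishment time be a smaller fraction of the remaining window, so more of the dispatch is spent on useful work.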

As shown at steps 906 and 908, a partition that does not satisfy the replacement priority evaluation criterion is excluded from the candidates for replacement. Step 906 is performed for one or more of the partitions until the replacement criteria are met, and the process ends as shown at step 910.

Returning to FIG. 7, in response to selection of an alternate partition, hypervisor 115 deallocates the dispatch window resources from the replaced partition and allocates them to the selected alternate partition (step 720). If the replaced partition is classified as non-interactive (i.e., batch), as shown and described with reference to FIGS. 5 and 6, the alternate partition consumes the remaining dispatch window period and the process returns to the next dispatch cycle (steps 722 and 710). If, however, the replaced partition is an interactive partition (step 722), the alternate partition is dispatched for only a subset of the remaining dispatch window period (step 723). Following the end of the alternate dispatch period, the dispatch window resources are reallocated to the original replaced partition (step 724), and the process either continues (step 726) or ends (step 728).
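The batch versus interactive handling of steps 720 to 724 can be sketched as follows. This is a minimal illustration rather than the hypervisor's logic: the specification does not say how large the "subset" of the remaining period is, so the `subset_fraction` parameter (default 0.5), the `ReplacementPlan` type, and the function name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplacementPlan:
    alternate_run_us: int  # time given to the alternate partition
    returned_us: int       # time handed back to the replaced partition


def plan_replacement(replaced_is_interactive: bool,
                     t_remaining_us: int,
                     subset_fraction: float = 0.5) -> ReplacementPlan:
    """Split the rest of the dispatch window between the two partitions."""
    if t_remaining_us < 0:
        raise ValueError("remaining time cannot be negative")
    if not 0 < subset_fraction < 1:
        raise ValueError("subset_fraction must lie strictly between 0 and 1")
    if not replaced_is_interactive:
        # Batch partition replaced: the alternate consumes everything left.
        return ReplacementPlan(t_remaining_us, 0)
    # Interactive partition replaced: the alternate runs for a subset only,
    # then the resources are reallocated to the replaced partition.
    alternate_us = int(t_remaining_us * subset_fraction)
    return ReplacementPlan(alternate_us, t_remaining_us - alternate_us)


print(plan_replacement(False, 6_000))  # alternate_run_us=6000, returned_us=0
print(plan_replacement(True, 6_000))   # alternate_run_us=3000, returned_us=3000
```

Returning part of the window to an interactive partition reflects the text's intent that such a partition can resume within the same dispatch window.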

FIG. 10 illustrates partition scheduling within a dispatch window in accordance with the present invention. The dispatch window shown in FIG. 10 is defined in part as having a configured dispatch window period T_DW that begins at t_start and ends at t_finish. The period is implemented by a counter feature (not expressly depicted) within hypervisor 115 that provides hypervisor 115 with timed interrupts of partition activity across specified dispatch window intervals. In the depicted embodiment, the partitions, represented as P_1, P_2, and P_3, are dispatched by hypervisor 115 in interleaved time intervals beginning at times t_start, t_1, and t_2, respectively, within the dispatch window. In accordance with logical partition scheduling rules, each of partitions P_1, P_2, and P_3 has a preconfigured dispatch period entitlement that hypervisor 115 uses for scheduling to ensure that each partition receives its respective minimum entitlement.

In the depicted embodiment, the minimum entitlements for partitions P_1, P_2, and P_3 are shown in FIG. 10 as the periods t_start to t_1, t_1 to t_2, and t_2 to t_3, respectively. After the entitlement periods there is a remaining period, t_3 to t_finish, during which partitions P_1, P_2, and P_3 may be dispatched by hypervisor 115 according to fairness and scheduling priority factors. Furthermore, as explained above, a partition may suspend processing within its entitled dispatch period, as shown for partition P_2.
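The layout of entitlement periods within one dispatch window can be made concrete with a short Python sketch. The entitlement lengths (3 ms, 3 ms, 2 ms) and the function name `layout_window` are hypothetical; FIG. 10 gives no numeric values, and only the 10 ms window comes from the description.

```python
def layout_window(t_dw_us, entitlements_us):
    """Place minimum entitlements back to back from t_start = 0.

    Returns (slots, leftover_us), where each slot is (name, start_us, end_us)
    and leftover_us is the post-entitlement period t_3 to t_finish.
    """
    slots = []
    t = 0
    for name, length_us in entitlements_us:
        slots.append((name, t, t + length_us))
        t += length_us
    if t > t_dw_us:
        raise ValueError("entitlements exceed the dispatch window period")
    return slots, t_dw_us - t


slots, leftover_us = layout_window(
    10_000,  # T_DW = 10 ms
    [("P_1", 3_000), ("P_2", 3_000), ("P_3", 2_000)],  # hypothetical entitlements
)
for name, start_us, end_us in slots:
    print(f"{name}: {start_us} to {end_us} us")
print(f"post-entitlement period: {leftover_us} us")  # 2000 us
```

Here t_1 = 3000, t_2 = 6000, and t_3 = 8000 microseconds, leaving 2 ms for fairness-based and priority-based dispatching.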

FIG. 11 is a high-level flow diagram illustrating the tuning of partition scheduling during a dispatch window, such as that shown in FIG. 10, in accordance with the present invention. The process begins, as shown at step 1102, with receipt of a hard interrupt signal (i.e., one independent of partition activity) marking the start of the next dispatch window. Proceeding as shown at step 1104, a logical partition, such as one of partitions P_1, P_2, and P_3, is dispatched for a period corresponding to the partition's configured cycle entitlement. Referring to FIG. 10, the first dispatch period is that of partition P_1, which is dispatched at t_start for a specified number of cycles.

If the logical processors belonging to the dispatched partition do not suspend processing, as in the case of partition P_1 dispatched over t_start to t_1 (step 1106), and further entitlement allocations are required (step 1108), the next scheduled partition (e.g., partition P_2) is dispatched (step 1104). If the entire dispatch window period has been consumed once the minimum entitlements have been allocated (step 1110), the process returns to step 1102 for the next dispatch window period.

In addition to dispatching partitions to satisfy their entitlements, hypervisor 115 preferably schedules partitions dynamically using priority factors among its scheduling heuristics, such as memory footprint values derived from the performance metrics associated with the partitions. One circumstance in which hypervisor 115 uses metric-derived priorities for scheduling is depicted in FIG. 11 as the suspended logical processor state (step 1106), which corresponds in FIG. 10 to time t_sus, the point at which partition P_2 suspends processing. The other such circumstance depicted in FIG. 11 is post-entitlement scheduling, which begins at t_3 when additional cycles remain in the dispatch window (steps 1108 and 1110).

In response to a partition suspending processing (step 1106) or to the availability of additional cycles in the dispatch window, hypervisor 115 begins dispatching at step 720, which includes the following substeps. The dispatch eligibility of the partitions is determined by comparing, for each partition, the time needed to establish its memory footprint with the time remaining in the dispatch window (step 1112). The partitions found eligible at step 1112 are then prioritized according to their respective footprint establishment costs, as shown at step 1114. For example, stored statistics indicate the time each partition requires to establish its memory footprint. As explained above with reference to FIG. 9, the memory footprint establishment period can be used to determine the relative efficiency of dispatching a particular partition when only a limited dispatch window period remains. Accordingly, the prioritization performed at step 1114 may include the steps shown in FIGS. 4 and 9. The partition selected according to the prioritization at step 1114 is dispatched as shown at step 1116.
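Steps 1112 to 1116 amount to a filter followed by a ranking. The Python sketch below is one possible reading, under stated assumptions: eligibility reuses the FIG. 9 relation xT_fp ≤ T_remaining with x = 10, and "prioritized according to footprint establishment cost" is taken to mean that the lowest cost wins. The function name `choose_partition` and the sample statistics are hypothetical.

```python
from typing import Dict, Optional


def choose_partition(footprint_us: Dict[str, int],
                     t_remaining_us: int,
                     x: float = 10) -> Optional[str]:
    """Pick the partition to dispatch into the remaining window, or None.

    footprint_us maps a partition name to its stored footprint
    establishment time (the statistics mentioned for step 1114).
    """
    # Step 1112: keep only partitions whose footprint time fits the window.
    eligible = {name: t_fp for name, t_fp in footprint_us.items()
                if x * t_fp <= t_remaining_us}
    if not eligible:
        return None
    # Step 1114: rank by footprint cost (assumed: lowest first, name as tie-break).
    # Step 1116: the top-ranked partition is the one dispatched.
    return min(eligible, key=lambda name: (eligible[name], name))


stats = {"P_1": 900, "P_2": 400, "P_3": 550}  # hypothetical T_fp values in us
print(choose_partition(stats, 6_000))          # P_2
print(choose_partition({"P_1": 900}, 6_000))   # None
```

With 6 ms remaining, P_1 is filtered out (10 x 900 > 6000), and P_2 is chosen over P_3 because it establishes its footprint sooner.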

The disclosed method may be readily implemented in software using objects, or using an object-oriented software development environment that provides portable source code usable on a variety of computer or workstation hardware platforms. In this case, the methods and systems of the invention may be implemented as a routine embedded in a personal computer, such as a Java (registered trademark) or CGI script, as a resource residing on a server or graphics workstation, as a routine embedded in a dedicated source code editor management system, or the like.

While the invention has been particularly shown and described with reference to a preferred embodiment, it should be understood by those skilled in the art that various changes in form and detail may be made without departing from the spirit and scope of the invention. All such alternative embodiments fall within the scope of the invention.

FIG. 1 is a diagram of a virtualized computer system adapted to implement workload balancing and dispatch window tuning in accordance with the present invention.

FIG. 2 is a high-level conceptual diagram illustrating an exemplary architecture adapted to facilitate partition scheduling in accordance with one embodiment of the present invention.

FIG. 3 is a high-level conceptual diagram illustrating the internal architecture of a partition monitoring unit, hypervisor, and partition history table that may be implemented within the architecture shown in FIG. 2.

FIG. 4 is a high-level block diagram of a sequence for determining replacement priorities of logical partitions in accordance with the present invention.

FIG. 5 is a high-level block diagram of partition dispatcher states implemented in accordance with the present invention.

FIG. 6 is a high-level flow diagram illustrating steps performed by a partition monitoring unit and a dispatcher to determine replacement priorities in accordance with the present invention.

FIG. 7 is a high-level flow diagram illustrating steps performed during a partition dispatch process that uses replacement priorities to balance workload among logical partitions in accordance with the present invention.

FIG. 8(A) is a high-level flow diagram illustrating steps performed by a partition monitoring unit to determine a memory footprint performance metric used in dispatch replacement decisions in accordance with one embodiment of the present invention. FIG. 8(B) is a graph of cycles-per-instruction data collected over dispatch window cycles in accordance with the present invention.

FIG. 9 is a high-level flow diagram illustrating steps performed by a partition scheduler in selecting an alternate partition.

FIG. 10 illustrates partition scheduling within a dispatch window in accordance with the present invention.

FIG. 11 is a high-level flow diagram illustrating steps performed by a partition monitoring unit to dynamically tune a partition scheduler during a dispatch window in accordance with the present invention.

Claims

A method for balancing access to physical system resources among a plurality of logical partitions in a computer system that uses system virtualization partitioning, wherein each of the logical partitions operates logically independently of the other partitions that use the shared physical resources of the computer system, the method comprising:
using hardware detection logic to collect performance metrics for one or more of the physical system resources in association with one or more of the logical partitions;
classifying each of the logical partitions according to its level of usage of an assigned dispatch window; and
during a dispatch window in which a given set of the physical system resources is preconfigured to be assigned to one of the logical partitions, assigning the given set of physical system resources to an alternative logical partition according to the collected performance metrics associated with the alternative logical partition and the dispatch window usage classification of the alternative logical partition.

The method of claim 1, further comprising storing the performance metrics in association with each of the respective associated logical partitions,
wherein assigning the given set of physical system resources to the alternative logical partition further comprises selecting the alternative logical partition according to the stored performance metrics.

The method of claim 1, further comprising:
dispatching a logical processor of one of the logical partitions during a dispatch window, the dispatching including allocating a pre-specified period during which the physical system resources are to be used by the one of the logical partitions; and
in response to the dispatched logical processor suspending processing during the pre-specified period, selecting another of the logical partitions to which the physical system resources are reallocated for the remaining portion of the pre-specified period, the selecting being performed according to the collected performance metrics associated with the other logical partition.

A system for balancing access to physical system resources among a plurality of logical partitions in a computer system that uses system virtualization partitioning, wherein each of the logical partitions operates logically independently of the other partitions that use the shared physical resources of the computer system, the system comprising:
hardware detection logic for collecting performance metrics for one or more of the physical system resources in association with one or more of the logical partitions;
partition scheduling logic for classifying each of the logical partitions according to its level of usage of an assigned dispatch window; and
partition scheduling logic for assigning, during a dispatch window in which a given set of the physical system resources is preconfigured to be assigned to one of the logical partitions, the given set of physical system resources to an alternative logical partition according to the collected performance metrics associated with the alternative logical partition and the dispatch window usage classification of the alternative logical partition.

The system of claim 4, further comprising means for storing the performance metrics in association with each of the respective associated logical partitions,
wherein the partition scheduling logic for assigning the given set of physical system resources to the alternative logical partition further comprises partition scheduling logic for selecting the alternative logical partition according to the stored performance metrics.

The system of claim 4, further comprising:
partition scheduling logic for dispatching a logical processor of one of the logical partitions during a dispatch window, the dispatching including allocating a pre-specified period during which the physical system resources are to be used by the one of the logical partitions; and
partition scheduling logic for selecting, in response to the dispatched logical processor suspending processing during the pre-specified period, another of the logical partitions to which the physical system resources are reallocated for the remaining portion of the pre-specified period, the selecting being performed according to the collected performance metrics associated with the other logical partition.

A program for causing a computer system that uses system virtualization partitioning, wherein each of a plurality of logical partitions operates logically independently of the other partitions that use the shared physical resources of the computer system, to execute processing for balancing access to physical system resources among the plurality of logical partitions, the program being adapted to perform a method comprising:
using hardware detection logic to collect performance metrics for one or more of the physical system resources in association with one or more of the logical partitions;
classifying each of the logical partitions according to its level of usage of an assigned dispatch window; and
during a dispatch window in which a given set of the physical system resources is preconfigured to be assigned to one of the logical partitions, assigning the given set of physical system resources to an alternative logical partition according to the collected performance metrics associated with the alternative logical partition and the dispatch window usage classification of the alternative logical partition.

The program according to claim 7, wherein the method further comprises storing the performance metrics in association with each of the respective associated logical partitions, and
wherein assigning the given set of physical system resources to the alternative logical partition further comprises selecting the alternative logical partition according to the stored performance metrics.

The program according to claim 7, wherein the method further comprises:
dispatching a logical processor of one of the logical partitions during a dispatch window, the dispatching including allocating a pre-specified period during which the physical system resources are to be used by the one of the logical partitions; and
in response to the dispatched logical processor suspending processing during the pre-specified period, selecting another of the logical partitions to which the physical system resources are reallocated for the remaining portion of the pre-specified period, the selecting being performed according to the collected performance metrics associated with the other logical partition.

A method for balancing access to physical system resources among a plurality of logical partitions in a computer system that uses system virtualization partitioning, wherein each of the logical partitions operates logically independently of the other partitions that use the shared physical resources of the computer system, the method comprising:
using hardware detection logic to collect performance metrics for one or more of the physical system resources in association with one or more of the logical partitions;
processing the performance metrics to determine a memory footprint value for a logical partition; and
during a dispatch window in which a given set of the physical system resources is assigned to one of the logical partitions, assigning the given set of physical system resources to another of the logical partitions according to the determined memory footprint value.

The method of claim 10, wherein the performance metrics include cycles-per-instruction data, and the memory footprint value includes a period for a logical partition to establish a memory footprint.

The method of claim 11, wherein the period for establishing the memory footprint is determined by detecting an endpoint at which the cycles-per-instruction value at a given point in a dispatch period for a partition varies by less than a specified threshold from the cycles-per-instruction value at an earlier point in the dispatch period for the same partition.

A system for balancing access to physical system resources among a plurality of logical partitions in a computer system that uses system virtualization partitioning, wherein each of the logical partitions operates logically independently of the other partitions that use the shared physical resources of the computer system, the system comprising:
hardware detection logic for collecting performance metrics for one or more of the physical system resources in association with one or more of the logical partitions;
partition monitoring logic for processing the performance metrics to determine a memory footprint value for a logical partition; and
partition scheduling logic for assigning, during a dispatch window in which a given set of the physical system resources is assigned to one of the logical partitions, the given set of physical system resources to another of the logical partitions according to the determined memory footprint value.

The system of claim 13, wherein the performance metrics include cycles-per-instruction data, and the memory footprint value includes a period for a logical partition to establish a memory footprint.

The system of claim 14, wherein the hardware detection logic includes hardware circuit means for collecting cycles-per-instruction data over successive dispatches for a given one of the logical partitions.

The system of claim 15, wherein collecting cycles-per-instruction data over successive dispatches for a given one of the logical partitions further comprises:
collecting cycles-per-instruction data at an nth time increment within an mth dispatch window for the given logical partition; and
collecting cycles-per-instruction data at an (n + p)th time increment within an (m + q)th dispatch window for the given logical partition, where m, n, p, and q are integers greater than or equal to 1.

The system of claim 13, wherein the partition scheduling logic for assigning the given set of physical system resources to another of the logical partitions further comprises partition scheduling logic for assigning the given set of physical system resources to the other logical partition according to whether the other logical partition has a determined memory footprint value T_fp that satisfies the relation xT_fp ≤ T_unitized, where T_unitized represents the unused portion of the dispatch window period and x is an efficiency multiplier greater than 1.

A program for causing a computer system that uses system virtualization partitioning, wherein each of a plurality of logical partitions operates logically independently of the other partitions that use the shared physical resources of the computer system, to execute processing for balancing access to physical system resources among the plurality of logical partitions, the program being adapted to perform a method comprising:
using hardware detection logic to collect performance metrics for one or more of the physical system resources in association with one or more of the logical partitions;
processing the performance metrics to determine a memory footprint value for a logical partition; and
during a dispatch window in which a given set of the physical system resources is assigned to one of the logical partitions, assigning the given set of physical system resources to another of the logical partitions according to the determined memory footprint value.

The program according to claim 18, wherein the performance metrics include cycles-per-instruction data, and the memory footprint value includes a period for a logical partition to establish a memory footprint.

The program according to claim 19, wherein the period for establishing the memory footprint is determined by detecting an endpoint at which the cycles-per-instruction value at a given point in a dispatch period for a partition varies by less than a specified threshold from the cycles-per-instruction value at an earlier point in the dispatch period for the same partition.