JP3925854B2

JP3925854B2 - Programs in multiprocessor systems

Info

Publication number: JP3925854B2
Application number: JP2002298685A
Authority: JP
Inventors: 眞上田
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 2002-10-11
Filing date: 2002-10-11
Publication date: 2007-06-06
Anticipated expiration: 2022-10-11
Also published as: JP2004133753A

Description

【０００１】
【発明の属する技術分野】
本発明は、マルチ・プロセッサ・システムで複数のプロセス（またはプログラムと言う）のスレッドをディスパッチするためのプログラムに関する。
【０００２】
【従来の技術】
従来、マルチ・スレッド・プログラムの実行環境として、マルチ・プロセッサ・システムはＷＥＢサーバの応用で特に成功を収めている。この場合、システムで実行されるプロセスは、プロキシやファイアーウォールなどのサーバ・プログラムである。分散処理をおこなう場合、１つのシステムで１つのサーバ・プログラムが実行されることもしばしばである。
【０００３】
ＷＥＢサーバのマルチ・スレッド・プログラムは、非同期な複数のクライアントからのリクエストに応じて、ジョブの生成・消滅を繰り返す。これに対してＣＰＵ（central processing unit）サーバあるいはＣＰＵＦａｒｍと呼ばれるサーバ用途では、性質の異なる複数のプログラム、例えば計算やＣＡＤ（computer aided design）などが実行される。
【０００４】
マルチ・スレッド・プログラムを含むプロセスを１つ実行する場合、割り当て可能なプロセッサにプロセスのスレッドを全て割り当てることが最もプロセスの実行速度を高める。しかし、複数のプロセスが存在する場合、システムのスループットを最適化する方法でスレッドの割り当てをおこなう必要がある。
【０００５】
例えば、８個のプロセスを同時に処理できるマルチ・プロセッサ・システムで、独立な８個のプロセスを実行する場合、１個のプロセスを処理するシングル・プロセッサ８台で実行したときと比較してスループットが著しく劣ることがある。これは、ユーザ空間の共有の結果生じた競合が効率を落とす原因と考えられている。
【０００６】
マルチ・スレッド・プログラムは、スレッドの切り替えがプログラムの切り替えと比較して高速である。さらに異なるスレッド間で同じユーザ空間を共有しているため、スレッド間でのデータの受け渡しがポインタ渡しにできる。したがって、スレッド間のデータの受け渡しはコピー操作の必要なプログラム間のデータの受け渡しに比べて高速である。マルチ・スレッド・プログラムをマルチ・プロセッサ・システムで実行する場合、システムはキャッシュ・コヒーレントな共有メモリを有することが望ましい。なぜなら、異なるプロセッサで実行中のスレッド同士でデータの受け渡しをポインタ渡しでおこなうとき、ハードウェア・レベルでコピー操作が発生しないからである。
【０００７】
マルチ・プロセッサ・システムは、マルチ・スレッド・プログラムの実行に最適化された結果、ハードウェア・レベルでのキャッシュ・コヒーレンシなサポートを加えられている。キャッシュ・コヒーレンシをハードウェアでサポートすることは、頻繁に起きるスレッド間のデータの受け渡し、すなわちスレッド間の通信に適している。
【０００８】
しかし、スレッド間通信の頻度の低い独立したプログラム間でのデータの受け渡しに備えて、常にキャッシュ・コヒーレンシ・プロトコルを用いたデータ・アクセスを用いると、速度面で不要な処理が多くなる。ＣＣ−ＮＵＭＡ（cache coherence-non uniform memory access）型マルチ・プロセッサ・システムの例では、Ｌ２（レベル２）キャッシュ・ミスをした場合、ソフトウェアの介在なしに、データの所属するメモリを管理するホーム・ノードに属するタグ・メモリがアクセスされ、最新のデータを保持しているオーナ・ノードが自動的に照会される。また、スヌープ・キャッシュを用いた場合、常にアドレスを全キャッシュにブロードキャストする必要があるため、アドレスバスが１セットしかないと、スヌープを必要とするメモリ・アクセスを同時に複数発生させることができない。なお、タグ・メモリは、データをキャッシュしている状態を管理するメモリである。スヌープ・キャッシュは、コヒーレンシをハードウェアで自動管理するために使用するメモリである。Ｌ２キャッシュ・ミスは、キャッシュ・メモリが２層構造であり、１層に次いで２層でもキャッシュ・ミスをすることを言う。
【０００９】
したがって、キャッシュ・コヒーレンシのハードウェアでのサポートは、マルチ・スレッド・プログラムを効率よくサポートするために適しているが、独立した複数のプログラムを実行する場合には却って不要なサポートである。
【００１０】
【発明が解決しようとする課題】
本発明の目的は、マルチ・プロセッサ・システムで複数のプログラムを実行する場合、システムのスループットを最大化するように制御するプログラムを提供することにある。
【００１１】
【課題を解決するための手段】
本発明のプログラムの要旨は、複数のプロセッサを含むマルチ・プロセッサ・システムで実行されるプロセスのスレッドをディスパッチするプログラムであって、前記マルチ・プロセッサ・システムを、前記複数のプロセッサにそれぞれプロセスのスレッドを割り当てるときの該プロセスの処理効率を判定するために、前記プロセスの実行時間の内のプロセッサでの処理時間とＩ／Ｏ待ちの時間との長さを比較する手段と、前記プロセッサでの処理時間またはＩ／Ｏ待ちの時間のいずれが長いかによって、前記プロセスを１つのプロセッサまたは複数のプロセッサで実行させることを選択する手段、として機能させるためのプログラムである。
【００１２】
複数のプロセッサを含むマルチ・プロセッサ・システムで実行されるプロセスのスレッドをディスパッチする方法は、前記複数のプロセッサにそれぞれプロセスのスレッドを割り当てたときの該プロセスの処理効率を判定するステップと、前記プロセスを１つのプロセッサまたは複数のプロセッサで実行させるかを選択するステップと、を含む。
【００１３】
【発明の実施の形態】
本発明のプログラムの実施の形態について図面を用いて説明する。なお、本明細書、図面１、２および４において、符号３８は、符号３８ａ，３８ｂ，Ｐ１，Ｐ２，Ｐ３を総括的に示す。
【００１４】
本発明のプログラムは、図１に示すスレッドのディスパッチをおこなうプログラムである。言い換えると、本発明は、マルチ・プロセッサ・システムにおけるオペレーティング・システムのスケジューラの発明である。先ず、システムのスループットを最大化するための方法を説明する。
【００１５】
図２は、オペレーティング・システム３２とプロセス３８を表したものである。Ｔ０からＴ４はプロセス３８のスレッドを示し、これらのスレッドＴ０，Ｔ１，Ｔ２，Ｔ３，Ｔ４の位置がマルチ・プロセッサ・システム４０における実行ポイントを示す。図２の例では、スレッドＴ０はスレッドＴ１，Ｔ２，Ｔ３，Ｔ４と同時実行可能なユーザ空間３６のスレッドである。スレッドＴ１はシステム・コールによってスレッドＴ２，Ｔ３へとサービスを受けた後、スレッドＴ４にリターンする。
【００１６】
なお、本明細書において、例えば図２のスレッドＴ０，Ｔ２，Ｔ３を異なるプロセッサに割り当てれば、最大３の並行度が得られるプロセス３８であっても、１つのプロセッサ（シングル・プロセッサ）に割り当てることを、シングル・プロセッサ実行状態にすると言う。
【００１７】
従来、例えばスレッドＴ０，Ｔ２，Ｔ３を異なるプロセッサにて並列実行する場合、データが共有されるため、キャッシュ・コヒーレンシ・プロトコルが必要である。スレッドＴ０からＴ４をシングル・プロセッサ実行する場合は、キャッシュ・コヒーレンシ・プロトコルは不要である。そこで、ユーザ空間３６のデータは、対応するページテーブルにてコヒーレンシ不要とマークする。なお、ページテーブルは、メモリの記憶域をページ単位で管理するための表であり、仮想アドレスと物理アドレスの対応関係が格納される。この表を用いてキャッシュ制御、キャッシュ・コヒーレンシ制御を管理することができる。また、ページは、プログラムを等しい長さに分割した場合の分割された各部分である。
【００１８】
キャッシュ・コヒーレンシ不要とマークする場合、例えば、図３のページテーブル４２では、ＷＩＭビット・フィールドにて指定する。マルチ・プロセッサ・システム４０に含まれるＭＭＵ（memory management unit）の機能が働いて、このユーザ空間３６へのデータ・アクセスに関してハードウェア３０はキャッシュ・コヒーレンシ・プロトコルを発生しない。
【００１９】
ただし、このままでは他のプロセッサで実行されたプログラムとのデータ受け渡しの時に、キャッシュ・コヒーレンシが保たれない可能性がある。例えば、図２のスレッドＴ５が、図４のプロセッサｕＰ２にて実行されている場合に、スレッドＴ５がキャッシュ・コヒーレンシ不要としたプロセスを有するユーザ空間３６ａを参照した場合、キャッシュ・コヒーレンシが保たれなくなる。
【００２０】
上記の場合に、プロセッサへのディスパッチを変更して、スレッドＴ５も加えてシングル・プロセッサ実行状態にして、キャッシュ・コヒーレンシを保つ。ディスパッチを変更するイベントを発生するために、ＭＭＵのアシストを利用する。ページテーブル４２において、このユーザ空間３６を実行中の唯一のプロセッサからのみ読み書き可能とし、他のプロセッサからは読み書き不可能とする。スレッドＴ５に対しキャッシュ・コヒーレンシを破るデータの参照をおこなう前に割り込が発生し、スレッドＴ０，Ｔ１，Ｔ２，Ｔ３，Ｔ４と同じプロセッサへディスパッチさせる。
【００２１】
ページに対する読み書きの属性は、図３のページテーブル４２の例では、ＰＰビット・フィールドで指定することによって、キャッシュ・コヒーレンシ・プロトコルを省略することが可能となる。
【００２２】
ページテーブル４２に記録された読み書きの属性の内容が、プロセッサによって異なって見えるが、これは、ＤＳＭ（distributed shared memory）にてソフトウェア・キャッシュを実装する場合にも用いられる技術である。ＤＳＭのソフトウェア・キャッシュでは、ページ・フォルトを起こしたページをノード間で移動させることによってキャッシュ効果を得るのに対して、本発明では、ページ・フォルトを起こしたスレッドをノード間で移動させる点が異なる。なお、ページ・フォルトは、メモリに存在しないページに対するアクセスが起こったときに発生する割り込みである。
【００２３】
全てのプロセス３８をシングル・プロセッサ実行状態にしてしまうと、マルチ・プロセッサ・システム４０の意味が失われる。そこで、本発明のプログラムは、下記のような構成にし、シングル・プロセッサ実行状態にすべきプロセス３８を選択する。
【００２４】
図１に示す本発明のプログラム１０は、マルチ・プロセッサ・システム４０を、複数のプロセッサ（マルチ・プロセッサ）２８にそれぞれプロセス３８ａ，３８ｂのスレッドを割り当てたときのプロセス３８ａ，３８ｂの処理効率を判定する判定手段１２、判定手段１２の結果に基づいて、プロセス３８ａ，３８ｂをシングル・プロセッサ２６またはマルチ・プロセッサ２８で実行させるかを選択する選択手段１４、として機能させるためのプログラムである。
【００２５】
判定手段１２は、プロセス３８ａ，３８ｂの実行状態を観測する観測部１６と、プロセス３８ａ，３８ｂの処理効率を判定する判定部１８とを含む。観測部１６は、プロセス３８ａ，３８ｂの実行時間をプロセッサ２６，２８での処理の時間およびＩ／Ｏ（input/output）待ちの時間とに分割する分割手段（図示せず）と、処理の時間とＩ／Ｏ待ちの時間とを比較する比較手段（図示せず）と、を含む。比較手段の比較結果を用いて、判定部１８がプロセス３８ａ，３８ｂの処理効率を判定する。
【００２６】
選択手段１４は、メモリ管理部２０とタスク・ディスパッチャ２２とを含む。メモリ管理部２０は、メモリ２４にあるプロセス３８ａ，３８ｂに対して、キャッシュ・コヒーレンシをオンにするかオフにするかをおこなう。タスク・ディスパッチャ２２は、マルチ・プロセッサ実行かシングル・プロセッサ実行を選択する。
【００２７】
本発明は、判定手段１２と選択手段１４によって、シングル・プロセッサ２６で実行されているプロセス３８ａが処理するデータに他のプロセッサからのアクセスするとき、そのアクセスを中止し、そのアクセスを上記シングル・プロセッサ２６に変更する。すなわち、他のプロセッサからのアクセスに対して割り込みを発生させ、シングル・プロセッサ２６にそのアクセスを変更する。
【００２８】
本発明は、判定手段１２と選択手段１４によって、シングル・プロセッサ２６で実行されているプロセス３８ａが処理するデータに対して、シングル・プロセッサ２６がデータの読み書きをおこない、他のプロセッサに対してデータの読み書きを禁止する。
【００２９】
次に、上記のプログラム１０によるディスパッチ方法について説明する。プログラム１０は、マルチ・プロセッサ２８にそれぞれプロセス３８ａ，３８ｂのスレッドを割り当てたときの処理効率を判定し、シングル・プロセッサ実行またはマルチ・プロセッサ実行を選択する。
【００３０】
判定をおこなうとき、▲１▼Ｉ／Ｏバウンドが発生する処理の場合、ＣＰＵで消費されるわずかな時間でプロセスを並列処理したとしても、その効果は限られている。そこで、プロセス３８ａ，３８ｂの実行時間をプロセッサ２８での処理時間とＩ／Ｏ待ちの時間とに分割し、処理時間とＩ／Ｏ待ちの時間とを比較する。比較の結果、Ｉ／Ｏ待ちの時間の方が長い場合、シングル・プロセッサ実行する。図１ではプロセス３８ａがシングル・プロセッサ実行となっている。Ｉ／Ｏバウンドは、データの処理おいてハードディスク上の仮想記憶領域を使用すること、すなわちスワップファイルを生成することをいう。
【００３１】
▲２▼プログラム１０を含むオペレーティング・システム４２上で動作するアプリケーション・プログラムがシングル・スレッド・プログラムである場合もシングル・プロセッサ実行の候補となる。この場合は、オペレーティング・システム３２がマルチ・スレッド・プログラムであれば、アプリケーション・プログラムはマルチ・スレッド・プログラムのように振る舞う。しかし、Ｉ／Ｏバウンドの発生が予想されるため、オペレーティング・システム３２は、アプリケーション・プログラムがプロセス処理の並行性を持たない場合は、シングル・プロセッサ実行の候補とする。すなわち、本発明のプログラムは、上記の場合に、選択手段１４がシングル・プロセッサ実行を選択する。
【００３２】
▲３▼単独で実行されるときにマルチ・プロセッサ実行に適しているプロセス３８でも、複数のプロセス３８が実行される場合には処理効率が悪いという場合がある。この様な処理もシングル・プロセッサ実行の候補とする。すなわち、本発明のプログラム１０は、判定手段１２が個々のプロセス３８ａ，３８ｂにおけるマルチ・プロセッサ実行の処理効率を判定し、判定の結果が上記のような場合に、選択手段１４がシングル・プロセッサ実行を選択する。例えば、図１でプロセス３８ａがマルチ・プロセッサ実行に適していても、選択手段１４によってシングル・プロセッサ実行となる。
【００３３】
本発明のプログラム１０を有するマルチ・プロセッサ・システム４０において、マルチ・プロセッサ実行に適したマルチ・スレッド・プログラムを持つプロセス３８は、ハードウェア・レベルでのキャッシュ・コヒーレンシ・サポートを与えられたマルチ・プロセッサ２８にて並行処理する。
【００３４】
図４において、プロセスＰ３は２つのプロセッサｕＰ３，ｕＰ４で処理される。プロセッサｕＰ３，ｕＰ４で発生するデータ・アクセスは、キャッシュ・コヒーレンシ・プロトコルを発生する。
【００３５】
マルチ・プロセッサ実行に適さないプロセス３８は、シングル・プロセッサ実行され、他のプロセッサとのキャッシュ・コヒーレンシは、ＭＭＵのアシストなどを使ったソフトウェア制御方法のみ保たれる。図４のプロセスＰ１，Ｐ２をそれぞれ実行するプロセッサｕＰ１，ｕＰ２の発生するユーザ空間３６ａ，３６ｂに対するデータ・アクセスはキャッシュ・コヒーレンシ・プロトコルを発生しない。この結果、キャッシュ・コヒーレンシ・プロトコルのバンド幅は緩和され、システム４０のスループットは向上する。
【００３６】
キャッシュ・コヒーレンシ・プロトコルを発生しないため、プロセスＰ１に対して、プロセッサｕＰ２からのアクセスをおこなうとき、上記の割り込み手段がそのアクセスに対して割り込みを発生させる。そして、割り込みの発生後、プロセッサｕＰ２からプロセッサｕＰ１にスレッドを割り当てて、アクセスをおこなう。
【００３７】
また、ＭＭＵによって、あるプロセッサで実行されているプロセス３８が処理するデータに対して、そのプロセッサがデータの読み書きをおこない、他のプロセッサに対してデータの読み書きを禁止するようにシステム４０を制御する。
【００３８】
以上、本発明の実施の形態を説明したが、本発明は上記の形態に限定されることはない。上記の説明はキャッシュをＬ１（レベル１）キャッシュを前提に説明をおこなっており、シングル・プロセッサ実行状態とは、シングル・プロセッサでスレッドを実行することであった。同じアイデアをＬ２キャッシュに応用することができる。この場合、シングル・プロセッサ・ノード実行状態にする。シングル・プロセッサ・ノード実行状態とは、同じノードに属する複数のプロセッサが１つのプロセスに属する複数のタスクを処理する状態を言う。
【００３９】
キャッシュ・コヒーレンシ・プロトコルを省略する空間を、実装が容易で、効果の高いユーザ空間としたが、システム空間の一部に応用することは可能である。すなわち、システム空間の用途を分類し、分類された中でキャッシュ・コヒーレンシ・プロトコルを省略できる部分に適用する。
【００４０】
その他、本発明は、主旨を逸脱しない範囲で当業者の知識に基づき種々の改良、修正、変更を加えた態様で実施できるものである。
【００４１】
【発明の効果】
本発明によると、マルチ・プロセッサ・システムにおいて、プロセスのディスパッチを最適におこなうため、システムのスループットを下げることなく、全てのプロセスを処理することができる。
【図面の簡単な説明】
【図１】本発明のプログラムの構成およびディスパッチの様子を示す図である。
【図２】プロセスのスレッドの実行状態を示す図である。
【図３】キャッシュ・コヒーレンシを保つためのページテーブルの例を示す図である。
【図４】マルチ・プロセッサ・システムの例を示す図である。
【符号の説明】
１０：プログラム
１２：判定手段
１４：選択手段
１６：観測部
１８：判定部
２０：メモリ管理部
２２：タスク・ディスパッチャ
２４：メモリ
２６：シングル・プロセッサ
２８：マルチ・プロセッサ
３０：ハードウェア
３２：オペレーティング・システム
３４，３４ａ，３４ｂ，３４ｃ：システム空間
３６，３６ａ，３６ｂ，３６ｃ：ユーザ空間
３８：プロセス
４０：マルチ・プロセッサ・システム
４２：ページテーブル
[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a program for dispatching threads of a plurality of processes (also referred to as programs) in a multi-processor system.
[0002]
[Prior art]
Conventionally, as an execution environment for a multi-thread program, a multi-processor system has been particularly successful in the application of a WEB server. In this case, the process executed in the system is a server program such as a proxy or a firewall. When performing distributed processing, one server program is often executed in one system.
[0003]
The multi-thread program of the WEB server repeatedly generates and deletes jobs in response to requests from a plurality of asynchronous clients. On the other hand, in a server application called a CPU (central processing unit) server or CPU Farm, a plurality of programs having different properties, for example, calculation and CAD (computer aided design) are executed.
[0004]
When a single process containing a multi-thread program is executed, assigning all of the threads of the process to the assignable processors maximizes the execution speed of the process. However, when there are a plurality of processes, threads must be assigned in a way that optimizes the throughput of the system.
[0005]
For example, when eight independent processes are executed in a multi-processor system that can process eight processes simultaneously, the throughput may be significantly inferior to that obtained when they are executed on eight single-processor machines that each process one process. This is thought to be because contention resulting from sharing the user space reduces efficiency.
[0006]
In a multi-thread program, thread switching is faster than program switching. Furthermore, since the same user space is shared between different threads, data can be passed between threads using pointers. Therefore, data transfer between threads is faster than data transfer between programs that require a copy operation. When executing a multi-thread program on a multi-processor system, the system preferably has a cache coherent shared memory. This is because, when data is transferred between threads running on different processors by pointer passing, no copy operation occurs at the hardware level.
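The pointer-passing point above can be illustrated with a minimal Python sketch. The names `pass_by_reference`, `producer`, `consumer`, and `buffer` are illustrative assumptions and do not appear in the specification; a Python object reference stands in for a pointer.

```python
import queue
import threading

def pass_by_reference():
    """Hand a buffer between two threads and report whether it was copied."""
    channel = queue.Queue()
    buffer = bytearray(b"payload")   # lives in the address space shared by both threads
    received = []

    def producer():
        channel.put(buffer)          # only a reference is enqueued, no copy is made

    def consumer():
        received.append(channel.get())

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # True means the consumer holds the very same object the producer sent.
    return received[0] is buffer
```

Passing the same buffer between separate processes would normally require serializing and copying it, which is the cost the paragraph contrasts against.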
[0007]
Multi-processor systems have been optimized for executing multi-thread programs and, as a result, have been given hardware-level support for cache coherency. Supporting cache coherency in hardware is well suited to the frequent passing of data between threads, that is, to communication between threads.
[0008]
However, if data accesses always use a cache coherency protocol in preparation for data transfer between independent programs, where inter-thread communication is infrequent, a great deal of processing that is unnecessary from a speed standpoint is incurred. In the example of a CC-NUMA (cache coherence-non uniform memory access) type multi-processor system, when an L2 (level 2) cache miss occurs, the tag memory belonging to the home node that manages the memory to which the data belongs is accessed without software intervention, and the owner node holding the latest data is queried automatically. When a snoop cache is used, addresses must always be broadcast to all caches, so with only one set of address buses, multiple memory accesses that require snooping cannot be issued simultaneously. Here, the tag memory is a memory that manages the state in which data is cached. The snoop cache is a memory used to manage coherency automatically in hardware. An L2 cache miss means that, in a cache memory with a two-level structure, a miss occurs at the second level following a miss at the first level.
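The CC-NUMA lookup described above can be sketched roughly as follows. This is only an illustration: the round-robin page interleaving, the `PAGE_SIZE` value, and the names `Node`, `home_node_of`, and `resolve_l2_miss` are assumptions, and the tag memory is modeled as a plain dictionary.

```python
PAGE_SIZE = 4096  # assumed page size for the illustration

class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        # Tag memory: page number -> id of the node holding the latest copy.
        self.tag_memory = {}

def home_node_of(address, nodes):
    """Assumed mapping: pages are interleaved across nodes round-robin."""
    return nodes[(address // PAGE_SIZE) % len(nodes)]

def resolve_l2_miss(address, nodes):
    """Return (home node id, owner node id) consulted on an L2 miss."""
    home = home_node_of(address, nodes)
    page = address // PAGE_SIZE
    # If no other node holds a newer copy, the home node itself is the owner.
    owner = home.tag_memory.get(page, home.node_id)
    return home.node_id, owner
```

Every miss pays for the home-node lookup even when the data is private to one program, which is the overhead the paragraph points to.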
[0009]
Therefore, hardware support for cache coherency is well suited to supporting a multi-thread program efficiently, but is actually unnecessary when a plurality of independent programs are executed.
[0010]
[Problems to be solved by the invention]
An object of the present invention is to provide a program that, when a plurality of programs are executed in a multi-processor system, controls the system so as to maximize its throughput.
[0011]
[Means for Solving the Problems]
The gist of the program of the present invention is a program for dispatching threads of a process executed in a multi-processor system including a plurality of processors, the program causing the multi-processor system to function as: means for comparing, within the execution time of the process, the length of the processing time in the processor with the length of the I/O waiting time, in order to determine the processing efficiency of the process when threads of the process are assigned to each of the plurality of processors; and means for selecting whether the process is executed by one processor or by a plurality of processors according to which of the processing time in the processor and the I/O waiting time is longer.
[0012]
A method of dispatching threads of a process executed in a multi-processor system including a plurality of processors includes a step of determining the processing efficiency of the process when threads of the process are assigned to each of the plurality of processors, and a step of selecting whether the process is executed by one processor or by a plurality of processors.
[0013]
DETAILED DESCRIPTION OF THE INVENTION
An embodiment of the program of the present invention will be described with reference to the drawings. In this specification and FIGS. 1, 2 and 4, reference numeral 38 collectively denotes reference numerals 38a, 38b, P1, P2, and P3.
[0014]
The program of the present invention is a program for dispatching threads shown in FIG. In other words, the present invention is an invention of an operating system scheduler in a multi-processor system. First, a method for maximizing system throughput will be described.
[0015]
FIG. 2 represents the operating system 32 and the process 38. T0 to T4 indicate the threads of the process 38, and the positions of these threads T0, T1, T2, T3, and T4 indicate execution points in the multi-processor system 40. In the example of FIG. 2, the thread T0 is a thread in the user space 36 that can be executed simultaneously with the threads T1, T2, T3, and T4. The thread T1 is serviced through a system call by way of the threads T2 and T3, and then returns to the thread T4.
[0016]
In this specification, assigning a process 38 to one processor (single processor), even when the process could obtain a parallelism of up to 3 if, for example, the threads T0, T2, and T3 in FIG. 2 were assigned to different processors, is referred to as placing the process in the single processor execution state.
[0017]
Conventionally, for example, when threads T0, T2, and T3 are executed in parallel by different processors, data is shared, and thus a cache coherency protocol is required. When the threads T0 to T4 are executed by a single processor, no cache coherency protocol is required. Therefore, the data in the user space 36 is marked as not requiring coherency in the corresponding page table. The page table is a table for managing the memory storage area in units of pages, and stores the correspondence between virtual addresses and physical addresses. This table can be used to manage cache control and cache coherency control. A page is each divided part when the program is divided into equal lengths.
[0018]
When marking that cache coherency is unnecessary, for example, in the page table 42 of FIG. 3, it is specified in the WIM bit field. With the function of an MMU (memory management unit) included in the multiprocessor system 40, the hardware 30 does not generate a cache coherency protocol for data access to this user space 36.
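A minimal sketch of such marking is given below. The bit positions are hypothetical and chosen only for illustration; the sketch assumes that the M bit of the WIM field is the one that requests hardware coherency, as in PowerPC-style storage attributes, which the specification does not spell out. `PageTableEntry`, `mark_coherency_unneeded`, and `needs_coherency_protocol` are illustrative names.

```python
from dataclasses import dataclass

# Hypothetical bit assignments, chosen only for this illustration.
W_WRITE_THROUGH = 0b100
I_CACHE_INHIBIT = 0b010
M_COHERENCY_REQUIRED = 0b001

@dataclass
class PageTableEntry:
    virtual_page: int
    physical_page: int
    wim: int = M_COHERENCY_REQUIRED  # default: hardware keeps the page coherent

def mark_coherency_unneeded(entries):
    """Clear the M bit on every page of a user space in single processor state."""
    for entry in entries:
        entry.wim &= ~M_COHERENCY_REQUIRED
    return entries

def needs_coherency_protocol(entry):
    """True if an access to this page would trigger the coherency protocol."""
    return bool(entry.wim & M_COHERENCY_REQUIRED)
```

Only the M bit is cleared, so other attributes such as write-through remain as they were.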
[0019]
However, if left as is, cache coherency may not be maintained when data is exchanged with a program executed on another processor. For example, when the thread T5 of FIG. 2 is being executed by the processor uP2 of FIG. 4 and refers to the user space 36a containing a process marked as not requiring cache coherency, cache coherency can no longer be maintained.
[0020]
In the above case, the dispatch to processors is changed so that the thread T5 also joins the single processor execution state, thereby maintaining cache coherency. The assistance of the MMU is used to generate the event that changes the dispatch. In the page table 42, the user space 36 is made readable and writable only from the sole processor that is executing it, and neither readable nor writable from other processors. An interrupt occurs before the thread T5 refers to data in a way that would break cache coherency, and the thread T5 is dispatched to the same processor as the threads T0, T1, T2, T3, and T4.
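The interrupt-and-redispatch behavior above might be modeled as in the following sketch. It is a simplification under stated assumptions: an exception stands in for the MMU interrupt, processors are plain integers, and `ProtectionFault`, `UserSpace`, `Thread`, and `access_with_redispatch` are hypothetical names.

```python
class ProtectionFault(Exception):
    """Stands in for the MMU interrupt raised on a forbidden access."""

class UserSpace:
    def __init__(self, owner_cpu):
        self.owner_cpu = owner_cpu   # the only processor allowed to read or write
        self.data = {}

    def read(self, cpu, key):
        if cpu != self.owner_cpu:
            raise ProtectionFault(cpu)
        return self.data.get(key)

class Thread:
    def __init__(self, name, cpu):
        self.name = name
        self.cpu = cpu

def access_with_redispatch(thread, space, key):
    """Try the access; on a fault, move the thread to the owning processor and retry."""
    try:
        return space.read(thread.cpu, key)
    except ProtectionFault:
        # Dispatch change: the thread joins the single processor execution state.
        thread.cpu = space.owner_cpu
        return space.read(thread.cpu, key)
```

The data never leaves the owning processor's cache; the thread moves instead, so no coherency traffic is needed for the access.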
[0021]
In the example of the page table 42 of FIG. 3, the read/write attribute of a page is specified in the PP bit field, which makes it possible to omit the cache coherency protocol.
[0022]
The contents of the read/write attribute recorded in the page table 42 appear different depending on the processor. This is a technique that is also used when a software cache is implemented in DSM (distributed shared memory). A DSM software cache obtains a caching effect by moving the page that caused a page fault between nodes, whereas the present invention differs in that it moves the thread that caused the page fault between nodes. A page fault is an interrupt that occurs when a page that does not exist in memory is accessed.
[0023]
If all processes 38 were placed in the single processor execution state, the multi-processor system 40 would lose its purpose. The program of the present invention is therefore configured as follows and selects which processes 38 should be placed in the single processor execution state.
[0024]
The program 10 of the present invention shown in FIG. 1 is a program for causing the multi-processor system 40 to function as determination means 12, which determines the processing efficiency of the processes 38a and 38b when their threads are assigned to each of a plurality of processors (multi-processor) 28, and as selection means 14, which selects, based on the result of the determination means 12, whether the processes 38a and 38b are executed by the single processor 26 or by the multi-processor 28.
[0025]
The determination means 12 includes an observation unit 16 that observes the execution states of the processes 38a and 38b, and a determination unit 18 that determines the processing efficiency of the processes 38a and 38b. The observation unit 16 includes dividing means (not shown) that divides the execution time of the processes 38a and 38b into processing time in the processors 26 and 28 and I/O (input/output) waiting time, and comparison means (not shown) that compares the processing time with the I/O waiting time. The determination unit 18 determines the processing efficiency of the processes 38a and 38b using the comparison result of the comparison means.
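The dividing and comparison steps of the observation unit 16 could be sketched as follows. The trace format is an assumption made for this illustration: each observed interval is labeled either "cpu" or "io_wait", and `Interval`, `split_execution_time`, and `io_wait_dominates` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Interval:
    kind: str        # "cpu" or "io_wait" (labels assumed for this sketch)
    duration: float  # in seconds

def split_execution_time(intervals):
    """Dividing means: total the processing time and the I/O waiting time."""
    cpu = sum(i.duration for i in intervals if i.kind == "cpu")
    io_wait = sum(i.duration for i in intervals if i.kind == "io_wait")
    return cpu, io_wait

def io_wait_dominates(intervals):
    """Comparison means: True when I/O waiting time exceeds processing time."""
    cpu, io_wait = split_execution_time(intervals)
    return io_wait > cpu
```

The boolean result is what the determination unit 18 would consume when judging processing efficiency.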
[0026]
The selection means 14 includes a memory management unit 20 and a task dispatcher 22. The memory management unit 20 turns cache coherency on or off for the processes 38a and 38b in the memory 24. The task dispatcher 22 selects multiprocessor execution or single processor execution.
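One way to picture how the two parts of the selection means cooperate is the sketch below. It is not taken from the specification: `MemoryManager`, `TaskDispatcher`, and `select` are hypothetical names, and always placing a single-processor process on the first CPU is a simplification.

```python
class MemoryManager:
    """Stands in for memory management unit 20 (not the hardware MMU)."""
    def __init__(self):
        self.coherency_on = {}   # process name -> bool

    def set_coherency(self, process, enabled):
        self.coherency_on[process] = enabled

class TaskDispatcher:
    """Stands in for task dispatcher 22."""
    def __init__(self, cpus):
        self.cpus = list(cpus)
        self.assignment = {}     # process name -> list of cpu ids

    def dispatch(self, process, single):
        # Simplification: a single-processor process always gets the first CPU.
        self.assignment[process] = self.cpus[:1] if single else list(self.cpus)

def select(process, single, memory_manager, dispatcher):
    """Keep coherency on only when the process runs on multiple processors."""
    memory_manager.set_coherency(process, not single)
    dispatcher.dispatch(process, single)
```

Coupling the two decisions in one function keeps the coherency setting and the processor assignment from drifting apart.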
[0027]
In the present invention, when data processed by the process 38a executed on the single processor 26 is accessed from another processor, the determination means 12 and the selection means 14 stop that access and redirect it to the single processor 26. That is, an interrupt is generated for the access from the other processor, and the access is moved to the single processor 26.
[0028]
In the present invention, by means of the determination means 12 and the selection means 14, the single processor 26 reads and writes the data processed by the process 38a executed on the single processor 26, and other processors are prohibited from reading and writing that data.
[0029]
Next, a dispatch method by the program 10 will be described. The program 10 determines the processing efficiency when the threads of the processes 38a and 38b are assigned to the multiprocessor 28, respectively, and selects single processor execution or multiprocessor execution.
[0030]
When making the determination, (1) in the case of processing in which I/O bound occurs, parallelizing the small amount of time consumed by the CPU has only a limited effect. Therefore, the execution time of the processes 38a and 38b is divided into processing time in the processor 28 and I/O waiting time, and the two are compared. If the comparison shows that the I/O waiting time is longer, the process is executed by a single processor. In FIG. 1, the process 38a is in single processor execution. In this specification, I/O bound refers to using a virtual storage area on a hard disk in processing data, that is, generating a swap file.
[0031]
(2) A case where the application program running on the operating system 32 that includes the program 10 is a single-thread program is also a candidate for single processor execution. In this case, if the operating system 32 is a multi-thread program, the application program behaves like a multi-thread program. However, since I/O bound is expected to occur, the operating system 32 treats the application program as a candidate for single processor execution when it has no parallelism in its processing. That is, in the above case, the selection means 14 of the program of the present invention selects single processor execution.
[0032]
(3) Even a process 38 that is suited to multiprocessor execution when executed alone may have poor processing efficiency when a plurality of processes 38 are executed. Such processing is also a candidate for single processor execution. That is, in the program 10 of the present invention, the determination means 12 determines the processing efficiency of multiprocessor execution for the individual processes 38a and 38b, and when the result is as described above, the selection means 14 selects single processor execution. For example, even if the process 38a in FIG. 1 is suited to multiprocessor execution, it is placed in single processor execution by the selection means 14.
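Criteria (1) to (3) above can be combined into a single decision function, sketched below. The field names and the `poor_when_shared` flag are assumptions made for illustration; the specification does not say how inefficiency under multiprogramming is measured.

```python
from dataclasses import dataclass

@dataclass
class ProcessStats:
    cpu_time: float          # processing time on the processors
    io_wait_time: float      # time spent waiting for I/O
    thread_count: int        # 1 means a single-thread program
    poor_when_shared: bool   # assumed flag: inefficient alongside other processes

def choose_execution_mode(stats):
    """Return "single" or "multi" following criteria (1) to (3)."""
    if stats.io_wait_time > stats.cpu_time:   # (1) I/O wait dominates
        return "single"
    if stats.thread_count == 1:               # (2) no parallelism to exploit
        return "single"
    if stats.poor_when_shared:                # (3) inefficient under multiprogramming
        return "single"
    return "multi"
```

Only a process that passes all three checks keeps multiprocessor execution with hardware coherency.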
[0033]
In the multi-processor system 40 having the program 10 of the present invention, a process 38 having a multi-thread program suited to multiprocessor execution is processed in parallel by the multi-processor 28, which is provided with hardware-level cache coherency support.
[0034]
In FIG. 4, the process P3 is processed by two processors uP3 and uP4. Data accesses occurring in the processors uP3 and uP4 generate a cache coherency protocol.
[0035]
A process 38 that is not suited to multiprocessor execution is executed by a single processor, and cache coherency with other processors is maintained only by a software control method that uses, for example, the assistance of the MMU. Data accesses to the user spaces 36a and 36b issued by the processors uP1 and uP2, which execute the processes P1 and P2 of FIG. 4 respectively, do not generate a cache coherency protocol. As a result, the bandwidth demand of the cache coherency protocol is relieved and the throughput of the system 40 improves.
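The reduction in protocol traffic can be made concrete with a small counting sketch. The access list below is invented purely for illustration and is not measured data; `coherency_messages` is a hypothetical helper that counts accesses to pages still marked as requiring coherency.

```python
def coherency_messages(accesses, coherency_required):
    """Count accesses that would trigger the cache coherency protocol.

    accesses: iterable of (processor, process) pairs, one per data access.
    coherency_required: process -> bool, as marked in the page table.
    """
    return sum(1 for _cpu, process in accesses if coherency_required[process])

# Invented example following FIG. 4: P1 on uP1, P2 on uP2, P3 on uP3 and uP4.
example_accesses = (
    [("uP1", "P1")] * 3 + [("uP2", "P2")] * 2 + [("uP3", "P3"), ("uP4", "P3")]
)
all_coherent = {"P1": True, "P2": True, "P3": True}
selective = {"P1": False, "P2": False, "P3": True}
```

With these made-up accesses, only the two accesses belonging to P3 still involve the protocol once P1 and P2 are marked as not requiring coherency.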
[0036]
Since no cache coherency protocol is generated, when the process P1 is accessed from the processor uP2, the above-mentioned interrupt means generates an interrupt for that access. After the interrupt occurs, the thread is reassigned from the processor uP2 to the processor uP1, and the access is performed there.
[0037]
Further, the MMU controls the system 40 so that, for data processed by a process 38 executed on a given processor, that processor reads and writes the data and other processors are prohibited from reading and writing it.
[0038]
An embodiment of the present invention has been described above, but the present invention is not limited to this embodiment. The above description assumes that the cache is an L1 (level 1) cache, and the single processor execution state means that threads are executed on a single processor. The same idea can be applied to an L2 cache. In this case, a single processor node execution state is used. The single processor node execution state refers to a state in which a plurality of processors belonging to the same node process a plurality of tasks belonging to one process.
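The node-level variant might look like the following sketch. The topology dictionary and the function names are assumptions; the specification gives no implementation details for the L2 case, and the sketch assumes that coherency among processors inside one node is still handled by hardware.

```python
def processors_for_process(node_of_cpu, chosen_node):
    """Single processor node execution: use every processor of one node only."""
    return sorted(cpu for cpu, node in node_of_cpu.items() if node == chosen_node)

def needs_internode_coherency(assigned_cpus, node_of_cpu):
    """Inter-node coherency is needed only if the threads span several nodes."""
    return len({node_of_cpu[cpu] for cpu in assigned_cpus}) > 1
```

Confining a process to one node keeps its parallelism within the node while avoiding coherency traffic between nodes.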
[0039]
The user space, which is easy to implement and highly effective, was chosen as the space for which the cache coherency protocol is omitted, but the idea can also be applied to part of the system space. That is, the uses of the system space are classified, and the idea is applied to those classified portions for which the cache coherency protocol can be omitted.
[0040]
In addition, the present invention can be implemented in a mode in which various improvements, modifications, and changes are made based on the knowledge of those skilled in the art without departing from the spirit of the present invention.
[0041]
[Effect of the invention]
According to the present invention, in a multiprocessor system, process dispatching is optimally performed, so that all processes can be processed without reducing the system throughput.
[Brief description of the drawings]
FIG. 1 is a diagram showing a configuration of a program according to the present invention and a state of dispatch.
FIG. 2 is a diagram showing an execution state of a thread of a process.
FIG. 3 is a diagram illustrating an example of a page table for maintaining cache coherency.
FIG. 4 illustrates an example of a multi-processor system.
[Explanation of symbols]
10: Program
12: Determination means
14: Selection means
16: Observation unit
18: Determination unit
20: Memory management unit
22: Task dispatcher
24: Memory
26: Single processor
28: Multi-processor
30: Hardware
32: Operating system
34, 34a, 34b, 34c: System space
36, 36a, 36b, 36c: User space
38: Process
40: Multi-processor system
42: Page table

Claims

A program for dispatching threads of a process executed on a multi-processor system including a plurality of processors, the program causing the multi-processor system to function as:
means for comparing, within the execution time of the process, the length of the processing time in the processor with the length of the I/O waiting time, in order to determine the processing efficiency of the process when threads of the process are assigned to each of the plurality of processors;
means for selecting whether the process is executed by one processor or by a plurality of processors according to which of the processing time in the processor and the I/O waiting time is longer;
a page table for marking, as not requiring cache coherency, data processed by a process executed by the one processor;
means for generating an interrupt for an access when data that is marked in the page table as not requiring cache coherency and is being processed by the process executed on the one processor is accessed from another processor;
means for assigning, after the interrupt, the thread of the process from the other processor to the one processor executing the process; and
an MMU (memory management unit) that, for data processed by the process executed by the one processor, allows the one processor to read and write the data, prohibits the other processor from reading and writing the data, and prevents the cache coherency protocol from being generated in accordance with the mark indicating that cache coherency is not required.

The program according to claim 1, wherein the means for selecting selects execution of the process on one processor when the I/O waiting time is longer.

The program according to claim 1 or 2, wherein the means for selecting selects execution of the process on one processor when the process is a single-thread process.