JP4063256B2

JP4063256B2 - Computer cluster system, management method therefor, and program

Info

Publication number: JP4063256B2
Application number: JP2004189384A
Authority: JP
Inventors: 隆仁河内
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2004-06-28
Filing date: 2004-06-28
Publication date: 2008-03-19
Anticipated expiration: 2024-06-28
Also published as: JP2006011913A

Description

The present invention relates to a computer cluster system, a management method for it, and a program, and in particular to a cluster system of NUMA computers, a management method for it, and a program.

Most conventional cluster systems are built by connecting a large number of SMP (symmetric multiprocessor) computers, each equipped with a small number of processors (typically about 2 to 8). Cluster management methods and management software assume this configuration.

As a result, most management items have per-machine granularity.

Specifically, cluster management software manages the resources (CPUs, memory, and so on) of a cluster that bundles multiple computers, and controls the allocation of those resources to submitted jobs. To allocate resources exclusively to separate jobs within one cluster, it provides a function for dividing the whole cluster into multiple partitions.

With conventional cluster management software, this partitioning can be done only at the granularity of an individual computer.

Meanwhile, shared-memory systems with many processors are increasingly of the NUMA (non-uniform memory access) type.

A computer designed on the NUMA architecture internally connects multiple SMP computers through a high-speed interconnect. A single NUMA computer therefore usually carries more processors than a computer with an SMP configuration.

As an operating system for NUMA multiprocessor systems, the following operating system (OS), which distributes the workload, has been disclosed.

This OS maintains a hierarchical tree with a plurality of leaf nodes, each corresponding to a job processor; a root node representing at least one system resource shared by all processors; and a plurality of intermediate nodes representing resources shared by different combinations of job processors.

The OS includes a medium-term scheduler that monitors the progress of active thread groups distributed across the system and assists declining thread groups, and a plurality of dispatchers, each associated with one job processor, that monitor the state of that processor and obtain thread groups for it to execute. It also allocates virtual and physical memory using a plurality of memory pools and frame treasuries (see Patent Document 1).

Patent Document 1: JP-A-9-237215 (page 1)

When a cluster built by connecting several such NUMA computers is managed with conventional cluster management software, the partitioning granularity is too coarse for practical partitioning.

Moreover, the NUMA multiprocessor OS described above runs on a single NUMA computer; it does not target a cluster of multiple connected NUMA computers.

Its dispatcher is part of the task scheduler inside the OS kernel, which conflicts with the aim of keeping OS modifications to the minimum necessary.

It also treats all resources as a single pool and is concerned with how to share that pool among many jobs so that the resources are used effectively.

An object of the present invention is to first divide the overall resources into multiple parts, thereby securing resources that do not affect one another, and to schedule jobs independently within each part.

A further object is to make the minimum partition size a single subnode of a computer in the cluster.

A first computer cluster system of the present invention includes a plurality of network-connected computers, each consisting of a plurality of interconnected subnodes. Each subnode includes one or more processors and a local memory; the local memory is accessed locally by those processors and includes storage that is distributed and shared among the subnodes within the computer. The system comprises means for acquiring available-resource information or resource usage status for each subnode and consolidating it into available-resource information for each partition, a partition being a computer section, and means for using that resource information to determine the partition in which an input job will run and the time at which it will start. A partition may consist of a single subnode.

A second computer cluster system of the present invention is the first computer cluster system further comprising means for, based on that determination, allocating processors and local-memory storage of the subnodes constituting the destination partition to the processes of the job to be started, and starting those processes.

A third computer cluster system of the present invention is the first computer cluster system further including a control computer communicatively connected to the plurality of computers. The control computer provides the means for consolidating available-resource information for each partition and the means for determining the destination partition and start time of an input job.

A fourth computer cluster system of the present invention includes a plurality of network-connected computers, each consisting of a plurality of interconnected subnodes. Each subnode includes one or more processors and a local memory; the local memory is accessed locally by those processors and includes storage that is distributed and shared among the subnodes within the computer. The system comprises a job scheduler and, on each of the computers, an OS and a dispatcher that operates in cooperation with the job scheduler. The job scheduler has: means for acquiring from the OS the available-resource information or resource usage status of each subnode and consolidating it into available-resource information for each partition, a partition being a computer section; means for accepting a user's job, calculating the resources it requires, and selecting the job to start next in each partition according to the calculated information, the job priority, and the available-resource information for each partition; and means for instructing the dispatcher on a computer to start the selected job, specifying information on the subnodes that constitute the partition. The dispatcher has means for allocating, through the OS, processors and local-memory storage on the specified subnodes to the processes constituting the job, and means for starting those processes. A partition may consist of a single subnode.

A fifth computer cluster system of the present invention includes a plurality of network-connected computers, each consisting of a plurality of interconnected subnodes. Each subnode includes one or more processors and a local memory; the local memory is accessed locally by those processors and includes storage that is distributed and shared among the subnodes within the computer. The system comprises a job scheduler and, on each of the computers, an OS and a dispatcher that operates in cooperation with the job scheduler. The OS has means for presenting the available-resource information or resource usage status of each subnode of its own computer, and means for assigning a processor or processor set of a specified subnode to a specified process and allocating storage from the local memory to which that processor or processor set belongs. The job scheduler has: means for acquiring from the OS the available-resource information or resource usage status of each subnode and consolidating it into available-resource information for each partition, a partition being a computer section; means for accepting a user's job, calculating the resources it requires, and selecting the job to start next in each partition according to the calculated information, the job priority, and the available-resource information for each partition; and means for instructing the dispatcher on a computer to start the selected job, specifying information on the subnodes that constitute the partition. The dispatcher has means for instructing the OS to allocate processors and memory on the specified subnodes to the processes constituting the job, and means for starting those processes. A partition may consist of a single subnode.

A sixth computer cluster system of the present invention is the fourth or fifth computer cluster system further including a control computer communicatively connected to the plurality of computers, with the job scheduler provided on the control computer.

A first program of the present invention causes a computer to execute: a procedure for acquiring available-resource information or resource usage status for each of the subnodes constituting each network-connected computer and consolidating it into available-resource information for each partition, a partition being a computer section; and a procedure for using that resource information to determine the destination partition and start time of an accepted job.

A second program of the present invention causes a computer to execute: a procedure for acquiring, from the OS of each network-connected computer, available-resource information or resource usage status for each of the subnodes constituting that computer and consolidating it into available-resource information for each partition, a partition being a computer section; a procedure for accepting a user's job, calculating the resources it requires, and selecting the job to start next in each partition according to the calculated resource information, the job priority, and the available-resource information for each partition; and a procedure for instructing a program on the computer to start the selected job, specifying information on the subnodes that constitute the partition.

A third program of the present invention causes a computer to execute: a procedure for receiving a job and information on its destination partition and allocating, to the processes constituting the job, processors and local-memory storage of the subnodes constituting the destination partition; and a procedure for starting those processes.

A fourth program of the present invention causes a computer to execute: a procedure for managing available-resource information or resource usage status for each of the subnodes constituting the computer and presenting it on request; and a procedure for assigning a processor or processor set of a specified subnode to a specified process and allocating storage in the local memory to which that processor or processor set belongs.

A first management method of the present invention is a management method in a computer cluster system that includes a plurality of network-connected computers, each consisting of a plurality of interconnected subnodes, each subnode including one or more processors and a local memory. The method comprises a procedure for acquiring available-resource information or resource usage status for each subnode and consolidating it into available-resource information for each partition, a partition being a computer section, and a procedure for using that resource information to determine the destination partition and start time of an input job. A partition may consist of a single subnode.

A second management method of the present invention is the first management method further comprising a procedure for, based on that determination, allocating processors and local-memory storage of the subnodes constituting the destination partition to the processes of the job to be started, and starting those processes.

A third management method of the present invention is a management method in a computer cluster system that includes a plurality of network-connected computers, each consisting of a plurality of interconnected subnodes, each subnode including one or more processors and a local memory. The method comprises: a procedure for acquiring available-resource information or resource usage status for each subnode and consolidating it into available-resource information for each partition, a partition being a computer section; a procedure for accepting a user's job, calculating the resources it requires, and selecting the job to start next in each partition according to the calculated resource information, the job priority, and the available-resource information for each partition; a procedure for instructing the start of the selected job, specifying information on the subnodes that constitute the partition; a procedure for allocating processors and local-memory storage on the specified subnodes to the processes constituting the job; and a procedure for starting those processes. A partition may consist of a single subnode.

The first effect is that resources can be divided at a fine granularity in a cluster composed of NUMA computers.

This is because the cluster management software uses the information and control mechanisms that the OS provides.

The second effect is that the processes constituting a job can run more efficiently on a NUMA computer.

This is because the memory a process uses is more likely to be obtained from the local memory of its NUMA node.

Next, the best mode for carrying out the present invention is described in detail with reference to the drawings.

In the present invention, finer partitioning granularity is achieved through cooperation between OS functions for NUMA computers and the cluster management software.

FIG. 1 is a block diagram of a configuration that includes the computer cluster system of the present invention. Referring to FIG. 1, the computer cluster system is a cluster in which two or more NUMA computers 1, each consisting of a plurality of NUMA subnodes 11 to 14 and a connection means 15, are joined by a high-speed network 3 or the like.

The connection means 15 is a high-speed connection means such as a crossbar switch.

A control computer 4 is connected to the NUMA computers 1 and 2 by a high-speed LAN 100 or the like, and accepts jobs from user terminals 6-1 to 6-n through a network 200 and a communication terminal 5.

As shown in FIG. 2, in each NUMA subnode a plurality of CPUs 111 to 113 are connected through a bus 115 to a system control unit 116, and a local memory 117 that the CPUs 111 to 113 can access at high speed is connected to the system control unit 116.

The CPUs 111 to 113 have cache memories 1111, ..., 1131, respectively.

When the upper part of the address in a memory access request from the CPUs 111 to 113 points to the local memory 117, the system control unit 116 accesses the local memory 117.

When the upper part points to the local memory of one of the NUMA subnodes 12 to 14, the system control unit 116 sends the memory access request through the connection means 15 to the corresponding system control unit.
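
The routing decision made by the system control unit can be illustrated with a minimal sketch. The address layout here is an assumption made only for illustration: the description says no more than that the "upper part" of the address identifies the owning local memory, so the 32-bit offset field and the function names `owner_subnode` and `route` are hypothetical.

```python
# Hypothetical layout: the bits above OFFSET_BITS select the owning subnode.
OFFSET_BITS = 32                 # assumed size of each subnode's memory window
SUBNODE_IDS = [11, 12, 13, 14]   # reference numerals of the subnodes in computer 1


def owner_subnode(addr: int) -> int:
    """Return the subnode whose local memory holds `addr`."""
    index = addr >> OFFSET_BITS  # the "upper part" of the address
    if not 0 <= index < len(SUBNODE_IDS):
        raise ValueError(f"address {addr:#x} is outside every subnode window")
    return SUBNODE_IDS[index]


def route(addr: int, local_subnode: int) -> str:
    """Decide whether a request is served locally or forwarded."""
    owner = owner_subnode(addr)
    if owner == local_subnode:
        return "local memory"
    return f"forward via connection means 15 to subnode {owner}"
```

With this assumed layout, a request from subnode 11 for an address in the second window is forwarded to subnode 12, while an address in the first window is served from the local memory 117.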

In the NUMA computers 1 and 2, the NUMA subnode 11 and the NUMA subnode 21, respectively, handle communication with the control computer 4, and a LAN communication unit 118 is also connected to their system control units.

Similarly, a high-speed communication unit is connected to the system control unit of the NUMA subnode that handles high-speed communication between the NUMA computers 1 and 2.

The OS (operating system) running on each NUMA computer provides the following functions.
(1) A function for binding any process to any processor or any set of processors.
(2) A function for allocating memory to a process from the memory of the NUMA subnode, or set of NUMA subnodes, to which the process is assigned or on which it is running.
(3) A function that lets a user program or daemon obtain the total resource amount and the current resource usage (local memory, processors) of each NUMA subnode.
(4) A mechanism for managing memory resources per NUMA subnode.
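
As one concrete point of comparison for function (1), Linux exposes process-to-CPU binding through the scheduler affinity interface, which Python wraps as `os.sched_setaffinity`. The sketch below is only an analogy: the description does not name a particular OS or API, and the helper name `bind_to_cpus` is hypothetical. Memory placement (function (2)) has no counterpart in the Python standard library and is not shown.

```python
import os


def bind_to_cpus(pid: int, cpus: set[int]) -> set[int]:
    """Restrict process `pid` to `cpus` and return the resulting CPU set.

    A pid of 0 means the calling process. Only available where the
    platform exposes scheduler affinity (for example Linux).
    """
    if not hasattr(os, "sched_setaffinity"):
        raise NotImplementedError("CPU affinity control is not exposed on this platform")
    os.sched_setaffinity(pid, cpus)
    return set(os.sched_getaffinity(pid))
```

A dispatcher following this approach would pass the CPU numbers that belong to the NUMA subnodes of the destination partition.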

To implement partitioning, the cluster management software has a job scheduler and a job dispatcher.

The job scheduler has an instance for each partition of the cluster and, according to the constraints on job execution in the cluster (required resources, job priority, free resources on the NUMA computers, and so on), selects the job to start next on each partition.

Using function (3) of the OS for NUMA computers, the job scheduler obtains the amount of available resources, or the resource usage status, of each NUMA subnode, sums these according to the configuration information of each partition, and thereby keeps track of the amount of resources available in each partition.
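
The per-partition summation can be sketched as follows. This is an illustration under assumptions, not the patented implementation: the record type `SubnodeResources`, its two fields, and the function `summarize_partitions` are hypothetical names, and the description does not say which resource quantities are tracked beyond processors and local memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubnodeResources:
    free_cpus: int
    free_memory_mb: int


def summarize_partitions(
    subnode_info: dict[int, SubnodeResources],
    partitions: dict[str, list[int]],
) -> dict[str, SubnodeResources]:
    """Sum the free resources of the subnodes that make up each partition."""
    summary = {}
    for name, subnodes in partitions.items():
        summary[name] = SubnodeResources(
            free_cpus=sum(subnode_info[s].free_cpus for s in subnodes),
            free_memory_mb=sum(subnode_info[s].free_memory_mb for s in subnodes),
        )
    return summary
```

Here `subnode_info` stands for what the OS reports per subnode, keyed by subnode number, and `partitions` stands for the partition configuration information.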

A partition is a computer section: the NUMA computers 1 and 2 are divided into one or more sections, and each section is handled as a single computer.

One partition is defined as one to eight NUMA subnodes.

For example, the NUMA subnode 11 is partition #1, the NUMA subnode 12 is partition #2, and the NUMA subnodes 13 and 14 are partition #3, with the NUMA computer 2 partitioned in the same way.

Alternatively, the NUMA subnode 11 may be partition #1, the NUMA subnode 12 partition #2, and the NUMA subnodes 13 and 14 together with the NUMA subnodes 21 to 24 partition #3.
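
A partition layout of this kind can be checked mechanically. The sketch below assumes, beyond what the text states, that a subnode may belong to at most one partition (an inference from the goal of exclusive resource allocation); the function name `validate_partitions` is hypothetical.

```python
MIN_SUBNODES, MAX_SUBNODES = 1, 8  # "one to eight NUMA subnodes" per partition


def validate_partitions(partitions: dict[str, list[int]]) -> None:
    """Raise ValueError if a partition layout breaks the rules above."""
    owner_of: dict[int, str] = {}
    for name, subnodes in partitions.items():
        if not MIN_SUBNODES <= len(subnodes) <= MAX_SUBNODES:
            raise ValueError(f"partition {name} has {len(subnodes)} subnodes")
        for subnode in subnodes:
            if subnode in owner_of:
                raise ValueError(
                    f"subnode {subnode} is in both {owner_of[subnode]} and {name}"
                )
            owner_of[subnode] = name


# The second example from the text: partition #3 spans both computers.
validate_partitions({"#1": [11], "#2": [12], "#3": [13, 14, 21, 22, 23, 24]})
```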

The job dispatcher runs on each NUMA computer. It takes the job selected by the job scheduler, performs the initialization required on that NUMA computer, and then starts the job.

In doing so, the job dispatcher follows the job scheduler's instructions and uses OS function (1) to bind each running process to a processor on a specific NUMA subnode.

It also uses OS functions (2) and (4) to configure the process so that its memory is obtained from the NUMA subnode on which it is scheduled to run or is running, so that memory outside the partition is not affected.

In addition, when binding the processes of a job to processors, if the user has supplied hints about the communication pattern or topology among processes, or if that information can be inferred automatically from process behavior, sets of processes that communicate heavily can be placed together on one NUMA subnode to optimize communication speed.
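
The description does not specify how heavily communicating processes are grouped. One simple possibility, shown purely as an assumption, is a greedy merge: visit process pairs in order of decreasing traffic and merge their groups as long as the merged group still fits on one subnode. The function name `group_by_traffic` and the traffic-hint format are hypothetical.

```python
def group_by_traffic(
    num_procs: int,
    traffic: dict[tuple[int, int], int],
    cpus_per_subnode: int,
) -> list[list[int]]:
    """Greedily group processes so that heavy communicators share a subnode.

    `traffic` maps a pair of process indices to a communication volume hint.
    Each returned group holds at most `cpus_per_subnode` processes.
    """
    group_of = {p: {p} for p in range(num_procs)}
    pairs = sorted(traffic.items(), key=lambda item: item[1], reverse=True)
    for (a, b), _volume in pairs:
        ga, gb = group_of[a], group_of[b]
        if ga is not gb and len(ga) + len(gb) <= cpus_per_subnode:
            merged = ga | gb
            for p in merged:
                group_of[p] = merged
    groups: list[set[int]] = []
    for p in range(num_procs):
        if group_of[p] not in groups:
            groups.append(group_of[p])
    return [sorted(g) for g in groups]
```

Each returned group would then be bound to the processors of a single NUMA subnode within the partition.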

FIG. 3 shows the cooperation between the OSs 17 and 27 for NUMA computers described above and, within the cluster management software, the job scheduler 41 and the job dispatchers 16 and 26.

The OS 27 and the job dispatcher 26 are not shown in the figure, but they are installed on the NUMA computer 2.

With this cooperation, cluster partitioning, which conventionally had the granularity of a whole NUMA computer, can be performed at the granularity of the NUMA subnodes inside a computer.

The NUMA computer 1 has a plurality of NUMA subnodes 11 to 14, and the OS 17 running on the NUMA computer 1 exposes the resource information 171 of each NUMA subnode so that it is visible to users.

The resource information 171 is the resource information of each NUMA subnode: either information on the free resources available, or the resource usage status (including the amount in use and the identity of the processes using it).

The same applies to the NUMA computer 2, although its number of NUMA subnodes need not equal that of the NUMA computer 1.

In this embodiment, the OS 17 and the job dispatcher 16 run on part of the NUMA subnode 11.

The cluster is divided into multiple partitions by an administrator, and the user is assumed to be permitted to run jobs in a partition made up of one NUMA subnode from each of the NUMA computers 1 and 2.

On the control computer 4, which manages the cluster, the job scheduler 41 accepts job submissions from users and schedules the jobs.

On the NUMA computers 1 and 2 that make up the cluster, the job dispatchers 16 and 26 run as processes and start execution of the programs submitted by the job scheduler 41.

Through the OSs 17 and 27, the job scheduler 41 collects and consolidates the resource information 171 and 271 of all the NUMA subnodes constituting a partition, and determines which job to start next.

Next, the job scheduler 41 instructs the job dispatchers 16 and 26 on the NUMA computers 1 and 2 to start the job, specifying information that identifies the NUMA subnodes constituting the partition.

The job dispatcher 16 instructs the OS 17 to bind the processes constituting the job to CPUs on the NUMA subnodes that belong to the partition.

The dispatcher also instructs the OS 17 to allocate memory to each process, and the OS 17 allocates memory on a NUMA subnode constituting the partition (the NUMA subnode to which the bound CPU belongs).

The job dispatcher 16 then starts the process.

This yields the effect of partitioning at the granularity of NUMA nodes.

Next, the operation of the best mode for carrying out the present invention is described with reference to the drawings. FIG. 4 is a flowchart showing the operation of the embodiment of the present invention.

When the job scheduler 41 accepts a job submitted from the user terminal 6-1 through the communication terminal 5 or the like (step A1), it calculates the resources the job requires (step A2) and schedules the job according to those resources and its priority (step A3).

Once the free resources within the partition, as reported by the OS 17, are sufficient to run the job and the job becomes runnable (step A4), the job scheduler 41 passes the processes constituting the job to the job dispatcher 16, notifies it of the allocated resources, and instructs it to start the job (step A5).
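
Steps A2 to A4 can be sketched as a selection over the job queue. The policy shown, picking the highest-priority job that currently fits the partition's free resources, is one possible reading and an assumption; the description states only that required resources, priority, and free resources are taken into account. The `Job` fields and the function `select_next_job` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int    # assumed convention: a larger value is more urgent
    cpus: int        # required processors (result of step A2)
    memory_mb: int   # required local memory (result of step A2)


def select_next_job(queue: list[Job], free_cpus: int, free_memory_mb: int):
    """Return the highest-priority queued job that fits, or None (step A4)."""
    runnable = [
        job for job in queue
        if job.cpus <= free_cpus and job.memory_mb <= free_memory_mb
    ]
    if not runnable:
        return None  # keep waiting until resources are released
    # max() returns the first maximum, so ties keep submission order.
    return max(runnable, key=lambda job: job.priority)
```

Under a stricter policy, the scheduler would instead wait for the highest-priority job to fit rather than start a smaller one first.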

The job dispatcher 16 receives the start instruction (step B2), passes the accompanying resource specification to the OS 17, and has the OS reserve the resources (step B3).

After the resources are reserved, the job dispatcher 16 starts the job (step B4).

When the job finishes (step B5), the dispatcher collects the job's output and sends it to the job scheduler 41 (step B6), releases the reserved resources (step B7), and waits for the next start instruction.
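
The reserve, run, report, release sequence of steps B3 to B7 maps naturally onto a context manager. The sketch below is a simplified, single-process stand-in under assumptions: `SubnodeLedger` is a hypothetical substitute for the OS's per-subnode bookkeeping, the job is modeled as a plain callable, and only CPUs are tracked.

```python
from contextlib import contextmanager


class SubnodeLedger:
    """Hypothetical stand-in for the OS's per-subnode CPU accounting."""

    def __init__(self, free_cpus: dict[int, int]):
        self.free_cpus = dict(free_cpus)

    @contextmanager
    def reserve(self, subnode: int, cpus: int):
        if self.free_cpus[subnode] < cpus:
            raise RuntimeError(f"subnode {subnode} lacks {cpus} free CPUs")
        self.free_cpus[subnode] -= cpus       # step B3: reserve
        try:
            yield
        finally:
            self.free_cpus[subnode] += cpus   # step B7: release


def dispatch(ledger, subnode, cpus, job, send_result):
    """Run `job` on reserved resources and report its output."""
    with ledger.reserve(subnode, cpus):
        output = job()                        # steps B4 and B5: run to completion
        send_result(output)                   # step B6: return output to the scheduler
```

Because the release sits in a `finally` clause, the resources are returned even if the job raises an exception, which the flowchart does not address.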

The job scheduler 41 outputs the job result it receives, for example by returning it to the user terminal 6-1 (step A7).

本実施例では、計算機のクラスタシステムに制御用計算機を含み、ジョブスケジューラ４１をこれに搭載しているが、計算機のクラスタシステムのＮＵＭＡサブノード１１の一つのプロセッサにジョブスケジューラ４１とジョブのディスパッチャ１６を搭載する例もある。 In this embodiment, the computer cluster system includes a control computer and is equipped with a job scheduler 41. However, the job scheduler 41 and the job dispatcher 16 are provided in one processor of the NUMA subnode 11 of the computer cluster system. There is also an example to install.

The example of FIG. 1 uses two NUMA computers, but the invention can be implemented in the same way with any number of NUMA computers of two or more.

Conversely, the present invention is also applicable when the inside of a single NUMA computer is to be partitioned and operated.

The present invention is applicable to NUMA computer cluster systems. In addition, by implementing functions equivalent to those of an OS for NUMA computers in an OS for SMP computers, it can also be used to achieve finer-grained resource control on SMP computers.

For example, partitioning in units of a single CPU becomes possible.
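As a rough illustration of CPU-granularity partitioning, a partition can be modeled as a set of (computer, CPU) pairs that may span several computers. The function name `free_cpus_per_partition` and the data layout are assumptions made for this sketch. It checks that partitions are exclusive and aggregates the free CPUs of each partition.

```python
def free_cpus_per_partition(partitions, busy):
    """Count free CPUs in each partition.

    partitions: dict mapping a partition name to a set of (computer, cpu) pairs.
    busy: set of (computer, cpu) pairs currently allocated to jobs.
    """
    seen = set()
    for name, cpus in partitions.items():
        overlap = seen & cpus
        if overlap:
            # Exclusive allocation requires that no CPU is in two partitions.
            raise ValueError(f"partition {name} shares CPUs {sorted(overlap)}")
        seen |= cpus
    return {name: len(cpus - busy) for name, cpus in partitions.items()}
```

For instance, a partition holding one CPU of computer 1 and another holding two CPUs of computer 1 plus one CPU of computer 2 can coexist, as long as no pair appears in both.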

FIG. 1 is an overall block diagram including the computer cluster system of the present invention.
FIG. 2 is a block diagram showing the configuration of the NUMA subnode 11 of FIG. 1.
FIG. 3 is a diagram showing the cooperation between the OS for NUMA computers and the cluster management software (job scheduler and job dispatcher).
FIG. 4 is a flowchart showing the operation of the embodiment of the present invention.

Explanation of symbols

1, 2 NUMA computers
11-14, 21-24 NUMA subnodes
111-113 CPUs
1111, ..., 1131 Cache memories
115 Bus
116 System control unit
117 Local memory
118 LAN communication unit / high-speed communication unit
15 Connection means
16, 26 Job dispatchers
17, 27 OS
171 Resource information of each NUMA subnode
3 High-speed network
4 Control computer
41 Job scheduler
5 Communication terminal
6-1 to 6-n User terminals
100 High-speed LAN
200 Network

Claims

1. A computer cluster system comprising a plurality of network-connected computers, each composed of a plurality of interconnected subnodes, wherein
each subnode includes one or more processors and a local memory, the local memory being accessed locally by the processors and including a storage area that is distributed and shared among the subnodes in the computer,
the computer cluster system comprises cluster management means that divides the system into computer compartments, i.e., partitions, and handles them, each partition being definable as anything from a configuration of a single subnode up to a configuration including subnodes of a plurality of the computers,
the cluster management means comprises means for obtaining available resource information or resource usage status of each subnode and aggregating it into available resource information for each partition, and
means for determining the execution destination partition and start timing of a submitted job using the resource information.

2. The computer cluster system according to claim 1, further comprising means for starting a process of a job to be started, based on the determination, by allocating to the process a processor and a local memory storage area of a subnode constituting the execution destination partition.

3. The computer cluster system according to claim 1, further comprising a control computer communicatively connected to the plurality of computers, the control computer comprising the means for aggregating the available resource information for each partition and the means for determining the execution destination partition and start timing of the submitted job.

4. A computer cluster system comprising a plurality of network-connected computers, each composed of a plurality of interconnected subnodes, wherein
each subnode includes one or more processors and a local memory, the local memory being accessed locally by the processors and including a storage area that is distributed and shared among the subnodes in the computer,
the computer cluster system comprises cluster management software that divides the system into computer compartments, i.e., partitions, and handles them, and an OS on each of the computers,
the cluster management software has a job scheduler and, on each of the computers, a dispatcher that operates in cooperation with the job scheduler,
the job scheduler comprises means for obtaining resource information or resource usage status of each subnode from the OS and aggregating it into available resource information for each partition, means for accepting a user job, calculating the resources it requires, and selecting the job to start next in each partition according to the calculated resource information, the job priority, and the available resource information of each partition, and means for instructing the dispatcher on the computer to start the selected job after designating information on the subnodes constituting the partition,
the dispatcher comprises means for allocating, through the OS, a processor and a local memory storage area on the designated subnode to the processes constituting the job, and means for starting the processes, and
the partition configuration can range from a configuration of a single subnode up to a configuration including subnodes of a plurality of the computers.

5. The computer cluster system according to claim 4, wherein the OS comprises means for providing the resource information or resource usage status of each subnode of its own computer, and means for assigning a processor or processor set of a designated subnode to a designated process and allocating a storage area from the local memory associated with that processor or processor set.

6. The computer cluster system according to claim 4, further comprising a control computer that is connected to the plurality of computers and includes the job scheduler.

7. A program for causing a computer to execute, in a computer cluster system that comprises a plurality of computers each composed of a plurality of interconnected subnodes and that is divided into computer compartments, i.e., partitions, each definable as anything from a configuration of a single subnode up to a configuration including subnodes of a plurality of the computers:
a procedure for obtaining, for each of the plurality of subnodes, available resource information or resource usage status and aggregating it into available resource information for each partition; and a procedure for accepting a job and determining the execution destination partition and start timing of the accepted job using the resource information.

8. A program for causing a computer to execute, in a computer cluster system that comprises a plurality of computers each composed of a plurality of interconnected subnodes and that is divided into computer compartments, i.e., partitions, each definable as anything from a configuration of a single subnode up to a configuration including subnodes of a plurality of the computers: a procedure for obtaining, for each of the plurality of subnodes, available resource information or resource usage status from the OS of the computer and aggregating it into available resource information for each partition; a procedure for accepting a user job, calculating the resources it requires, and selecting the job to start next in each partition according to the calculated resource information, the job priority, and the available resource information of each partition; and
a procedure for instructing the start of the selected job after designating information on the subnodes constituting the partition.

9. A management method in a computer cluster system comprising a plurality of network-connected computers, each composed of a plurality of interconnected subnodes, each subnode including one or more processors and a local memory, wherein
the computer cluster system has a cluster management procedure that divides the system into computer compartments, i.e., partitions, and handles them, each partition being definable as anything from a configuration of a single subnode up to a configuration including subnodes of a plurality of the computers,
the cluster management procedure comprises a procedure for obtaining available resource information or resource usage status of each subnode and aggregating it into available resource information for each partition, and
a procedure for determining the execution destination partition and start timing of a submitted job using the resource information.

10. The management method in a computer cluster system according to claim 9, further comprising a procedure for allocating, based on the determination, a processor and a local memory storage area of a subnode constituting the execution destination partition to a process of the job to be started, and starting the process.

11. A management method in a computer cluster system comprising a plurality of network-connected computers, each composed of a plurality of interconnected subnodes, each subnode including one or more processors and a local memory, wherein
the computer cluster system has a cluster management procedure that divides the system into computer compartments, i.e., partitions, and handles them, each partition being definable as anything from a configuration of a single subnode up to a configuration including subnodes of a plurality of the computers, and an OS on each of the computers,
the cluster management procedure comprises a procedure for obtaining available resource information or resource usage status of each subnode from the OS and aggregating it into available resource information for each partition,
a procedure for accepting a user job, calculating the resources it requires, and selecting the job to start next in each partition according to the calculated resource information, the job priority, and the available resource information of each partition,
a procedure for instructing the start of the selected job after designating information on the subnodes constituting the partition,
a procedure for allocating, through the OS, a processor and a local memory storage area on the designated subnode to the processes constituting the job, and
a procedure for starting the processes.