JP7385156B2 - Scheduling method, scheduler, GPU cluster system and program

Info

Publication number: JP7385156B2
Application number: JP2022514945A
Authority: JP
Inventors: 兼三奥田; 仁士益谷; 武志弘田; 健桑原
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2020-04-16
Filing date: 2020-04-16
Publication date: 2023-11-22
Anticipated expiration: 2040-04-16
Also published as: WO2021210123A1; JPWO2021210123A1

Description

The present invention relates to a scheduling method, a scheduler, a GPU cluster system, and a program.

A GPU (Graphics Processing Unit) is hardware that performs the computation required for tasks such as rendering high-definition images and video. In recent years, GPUs have also been used as computing units for machine learning and similar workloads. GPU clusters, in which multiple GPUs are clustered, are also being developed. Kubernetes is open-source software for managing container-based GPU clusters (Non-Patent Document 1).

Kubernetes, [online], Internet <URL: https://github.com/kubernetes/kubernetes>
Kubernetes, [online], Internet <URL: https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/>

In a conventional GPU cluster, machine learning and similar processing is performed while reading data, such as learning targets, that has been uploaded to storage. GPUs process data quickly, whereas storage is comparatively slow. As a result, a GPU reserved by a job sits idle while it waits for data to be read.

The present invention has been made in view of the above circumstances, and its object is to provide a scheduling method, a scheduler, a GPU cluster system, and a program that reduce GPU idle time and improve GPU utilization.

To achieve the above object, one aspect of the present invention is a scheduling method performed by a GPU cluster system, in which a scheduler performs the steps of: storing a submitted job in a first stage queue that holds jobs waiting to start fetching; taking a job out of the first stage queue, registering it in a fetching job list, and causing a cache cluster to start fetching the data of the job; taking a job whose fetched data amount exceeds a predetermined threshold out of the fetching job list and storing it in a second stage queue that holds jobs waiting to be deployed; and taking a job out of the second stage queue and instructing deployment of the job. The cache cluster performs a step of fetching the data of a job registered in the fetching job list from the storage in which the data is stored and storing it in the cache cluster, and the GPU cluster performs a step of accessing the data in the cache cluster and executing the job.
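The four scheduler steps above can be sketched as a small in-memory model. This is a minimal illustration under stated assumptions, not the implementation described in this publication: the class and field names (`Job`, `TwoStageScheduler`, `threshold_ratio`) are hypothetical, the threshold is assumed to be a fraction of the job's total data size, and the first stage queue is simplified to a single FIFO.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    data_size: int      # total size of the job's data (hypothetical field)
    fetched: int = 0    # amount already fetched into the cache cluster


class TwoStageScheduler:
    """Toy model of the four scheduler steps; all names are assumptions."""

    def __init__(self, threshold_ratio: float):
        # Assumed form of the "predetermined threshold": a fraction of data_size.
        self.threshold_ratio = threshold_ratio
        self.first_stage = deque()     # jobs waiting to start fetching
        self.fetching_job_list = []    # jobs whose data is being fetched
        self.second_stage = deque()    # jobs waiting to be deployed
        self.deployed = []             # stand-in for the deploy instruction

    def submit(self, job: Job) -> None:
        """Step 1: store a submitted job in the first stage queue."""
        self.first_stage.append(job)

    def start_fetch(self) -> None:
        """Step 2: move one job to the fetching job list (fetch would start here)."""
        if self.first_stage:
            self.fetching_job_list.append(self.first_stage.popleft())

    def promote_fetched(self) -> None:
        """Step 3: move jobs whose fetched amount exceeds the threshold."""
        ready = [j for j in self.fetching_job_list
                 if j.fetched > self.threshold_ratio * j.data_size]
        for job in ready:
            self.fetching_job_list.remove(job)
            self.second_stage.append(job)

    def deploy_next(self) -> None:
        """Step 4: take a job from the second stage queue and 'deploy' it."""
        if self.second_stage:
            self.deployed.append(self.second_stage.popleft())


sched = TwoStageScheduler(threshold_ratio=0.5)
job = Job("train-a", data_size=100)
sched.submit(job)
sched.start_fetch()
job.fetched = 40           # below the threshold: the job stays in the list
sched.promote_fetched()
still_fetching = list(sched.fetching_job_list)
job.fetched = 60           # above the threshold: the job moves to stage two
sched.promote_fetched()
sched.deploy_next()
```

In this sketch the job is deployed only after more than half of its data is cached, which is the point of separating "waiting to fetch" from "waiting to deploy".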

One aspect of the present invention is a scheduler in a GPU cluster system, comprising: a first queue selector that stores a submitted job in a first stage queue that holds jobs waiting to start fetching; a first job selector that takes a job out of the first stage queue, registers it in a fetching job list, and causes a cache cluster to start fetching the data of the job stored in storage; a second queue selector that takes a job whose fetched data amount exceeds a predetermined threshold out of the fetching job list and stores it in a second stage queue that holds jobs waiting to be deployed; and a second job selector that takes a job out of the second stage queue and instructs deployment of the job. The deployment instruction for the job specifies the cache cluster as the storage location of the job's data, and the GPU cluster accesses the cache cluster to execute the job.

One aspect of the present invention is a GPU cluster system comprising a scheduler, a cache cluster, and a GPU cluster, wherein the cache cluster fetches the data of a job registered in a fetching job list from the storage in which the data is stored and stores it in the cache cluster, and the GPU cluster accesses the data in the cache cluster to execute the job.

One aspect of the present invention is a program that causes a computer to function as the above scheduler.

According to the present invention, it is possible to provide a scheduling method, a scheduler, a GPU cluster system, and a program that reduce GPU idle time and improve GPU utilization.

The drawings, in order, are as follows.
A configuration diagram of a basic GPU cluster system.
A configuration diagram of the GPU cluster system of FIG. 1 accessing user storage.
A configuration diagram of the GPU cluster system of this embodiment.
A configuration diagram of the cache cluster.
A configuration diagram of the scheduler.
A flowchart showing the processing of the first queue selector.
A flowchart showing the processing of the first job selector.
A flowchart showing the processing of the second queue selector.
A flowchart showing the processing of the second job selector.
A configuration diagram of the GPU cluster of Example 1.
A configuration diagram of the GPU cluster of Example 2.
A configuration diagram of the GPU cluster of Example 3.
Schematic diagrams showing the closed connection of Method 1, Method 2, Method 3, Method 4, Method 5, Method 6, and Method 7 (seven drawings, one per method).
A sequence diagram showing the operation of the basic GPU cluster system.
Sequence diagrams showing the operation of the GPU cluster of this embodiment (six drawings).
A sequence diagram showing the "closed connection establishment processing" of Method 2.
A sequence diagram showing the "closed connection release processing" of Method 2.
A sequence diagram showing the "closed connection establishment processing" of Method 7.
A sequence diagram showing the "closed connection release processing" of Method 7.
Sequence diagrams showing the "cluster storage processing of learning target data" (two drawings).
A sequence diagram showing the "data access processing to the cache cluster in learning processing".
Sequence diagrams showing the "job checkpoint processing" (three drawings).
Sequence diagrams showing the "job restore processing" (three drawings).
A hardware configuration diagram.

Embodiments of the present invention will be described below with reference to the drawings.

(Basic configuration of GPU cluster system)
FIG. 1 is a configuration diagram showing a schematic configuration of a basic GPU cluster system. The illustrated GPU cluster system is a GPU learning cluster system for executing learning processing using GPUs.

A cluster provider (hereinafter, the "provider") offers users equipment that performs learning processing on their behalf using a GPU cluster. Users do not own expensive GPUs; instead, they pay the provider a usage-based fee that depends on, for example, how long they use the GPU cluster. Because learning processing such as machine learning only needs to be executed once, paying a usage-based fee costs a user less than purchasing expensive GPUs.

For the provider, on the other hand, raising GPU utilization is the key to maximizing profit. A GPU cluster system is therefore required to be able to run a wide variety of jobs (that is, job virtualization) and to deploy jobs quickly.

An overview of the operation of the basic GPU cluster is described with reference to FIG. 1. Here, a container-based cluster that allocates GPU resources for each job execution is used. In response to a user's instruction, the user terminal 5 stores the learning target data and the like in the cluster shared storage 4A designated by the provider of the GPU cluster (S1A). In response to a user's instruction, the user terminal 5 registers the learning job that the user wants to run with the scheduler 1A (S2A). The scheduler 1A schedules the jobs received from multiple user terminals 5 based on priority, estimated processing time, and the like, and instructs the master 2A to execute a job as soon as GPU resources can be secured (S3A).

The master 2A deploys the job to a node, attaches a GPU, and causes the GPU to execute the learning processing (S4A). That is, for each job, the master 2A creates a virtual environment for running the learning or inference program and attaches a GPU to it. The master 2A releases the GPU when the job completes. The GPU performs the learning processing while reading the learning target data uploaded in advance to the cluster shared storage 4A, and stores the results of the learning processing in the cluster shared storage 4A (S5A). Once the job has finished, the user can obtain the results of the learning processing by accessing the cluster shared storage 4A.

The basic GPU cluster system shown in FIG. 1 has difficulty handling the following situations and constraints.

(1) If the storage speed (data transfer rate) is lower than the processing speed of the learning program, the GPU reserved by the job sits idle during S5A because the storage cannot keep up. Big data is stored raw, or nearly raw, in the cluster shared storage 4, which is large-capacity distributed storage built with Ceph or the like. Thanks to distributed parallelization, distributed storage does not slow down as its capacity grows, but it does not become dramatically faster either; its performance is a few hundred MB/s at most.

High-speed storage that can keep pace with GPUs is extremely expensive, so it is not feasible to provision enough of it to hold all of the big data. On the other hand, the entire body of big data is never needed at the same time.

(2) In some cases, the learning target data cannot be uploaded to the cluster shared storage 4 as a single consolidated set, or the data is so large that uploading all of it is impractical.

In such cases, as shown in FIG. 2, the user terminal 5 stores the learning target data in the user storage 6A at the user's site (S1A'), and the job on the node 3A establishes a closed connection to the user storage 6A and accesses it directly (S5A'). However, because of the communication path between the user storage 6A and the node 3A, and because the user storage 6A is slow, the data transfer rate falls below the processing speed of the learning program and the GPU sits idle.

(3) To keep the GPU running efficiently, it may be desirable to execute data accumulation and learning processing in parallel.

The GPU cluster system of this embodiment, which can handle these situations, is described below.

(GPU cluster system of this embodiment)
FIG. 3 is a configuration diagram showing a schematic configuration of the GPU cluster system of this embodiment. The GPU cluster system of this embodiment is a GPU learning cluster system for executing learning processing using GPUs. Learning processing reads learning target data and performs processing such as machine learning. The learning target data is not read all at once; instead, it is divided into units such as blocks or files, and the learning processing proceeds while these units are read one after another.

The illustrated GPU cluster system includes a scheduler 1, a master 2, a node 3, a cluster shared storage 4, and a cache cluster 7. Here, a container-based GPU cluster that allocates GPU resources for each job execution is used. The user's site may include a user storage 6 in which the user stores learning target data.

The GPU cluster system of this embodiment includes a cache cluster 7 (cache) that is costly but fast, and the scheduler 1 schedules the cache cluster 7 and the GPUs together. Large volumes of data are normally kept in low-cost, low-speed storage (the cluster shared storage 4 or the user storage 6), and the data is placed in the cache cluster 7 when a job is executed. In this embodiment, the GPU therefore reads data from the fast cache cluster 7, which avoids leaving the GPU idle while it waits for data to be read.
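The effect of read throughput on GPU idle time can be illustrated with rough arithmetic. The sketch below assumes a simple steady-state model that is not taken from this publication: the GPU can consume data at a fixed rate, the data source supplies it at a fixed rate, and the GPU is busy only for the fraction of time the source can feed it. The function name and all throughput figures are hypothetical.

```python
def gpu_idle_fraction(source_mb_per_s: float, gpu_consume_mb_per_s: float) -> float:
    """Fraction of time the GPU waits for data under a steady-state model.

    Assumption: the GPU is busy for source/consume of the time when the
    source is slower than the GPU, and never waits otherwise.
    """
    if source_mb_per_s <= 0 or gpu_consume_mb_per_s <= 0:
        raise ValueError("rates must be positive")
    return max(0.0, 1.0 - source_mb_per_s / gpu_consume_mb_per_s)


# Hypothetical figures: a learning program that can consume 2000 MB/s.
idle_from_slow_storage = gpu_idle_fraction(300.0, 2000.0)   # distributed storage
idle_from_fast_cache = gpu_idle_fraction(3000.0, 2000.0)    # fast cache
```

Under these assumed numbers the GPU would wait roughly 85% of the time when reading directly from the slow storage and would not wait at all when reading from the cache.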

The scheduler 1 (Scheduler) accepts jobs (Job) submitted from the user terminal 5. The scheduler 1 monitors the availability of GPU resources in the GPU cluster and, when resources are available, instructs the master 2 to deploy a job (that is, to set it up in an execution environment). In other words, the scheduler 1 instructs the master 2 to execute the job.

The master 2 (Master) manages the node 3 (Node) and deploys jobs. When the scheduler 1 instructs it to deploy a job, the master 2 builds the virtual environment defined in the job, such as a container, on the node 3 and runs the program defined in the job in that virtual environment. When the program defined in the job completes, the master 2 deletes the virtual environment.

Multiple GPUs are pooled in the node 3 (Node). A GPU executes a job once the master 2 attaches it. A job defines the program that the user wants to run (for example, a learning or inference program) and the execution environment for that program. Specifically, a job contains one or more programs to be executed and the order in which to execute them. A job also contains the environment for running the programs (virtual environment, runtime, OS, distribution, libraries, and so on). For example, a job specifies, as its environment, the image file name of a container or of a VM (Virtual Machine). If necessary, a job may include a procedure for building this environment automatically, so that the job itself generates the image of the execution environment. A job in this embodiment includes a main container (Main Container) and may include other containers as well. The main container is the container that provides the virtual environment in which the learning program of this embodiment runs. Although this embodiment uses containers to implement the virtual environment, VMs may be used instead.
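The job definition described above can be pictured as a small data structure. This is a sketch only: the type and field names (`Environment`, `JobSpec`, `build_steps`, `other_envs`) and the file names are hypothetical and do not come from this publication or from any particular job format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Environment:
    kind: str                     # "container" or "vm"
    image: str                    # image file name of the container or VM
    build_steps: List[str] = field(default_factory=list)  # optional auto-build procedure


@dataclass
class JobSpec:
    programs: List[str]           # programs to run, in execution order
    main: Environment             # main container running the learning program
    other_envs: List[Environment] = field(default_factory=list)

    def __post_init__(self) -> None:
        # A job contains one or more programs to execute.
        if not self.programs:
            raise ValueError("a job needs at least one program")


spec = JobSpec(
    programs=["preprocess.py", "train.py"],          # hypothetical program names
    main=Environment(kind="container", image="trainer.img"),
)
```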

The cluster shared storage 4 (Cluster Shared Storage) is a storage system that stores data, for example the learning target data and the execution results. It can be accessed from a job's virtual environment. A user can store the learning target data that a job will read in the cluster shared storage 4, directly or indirectly, by some means. Because the cluster shared storage 4 holds large amounts of learning target data, storage technologies such as Ceph, GlusterFS, Swift, and RAID are expected to be used. Ceph (https://ceph.io/) and GlusterFS (https://www.gluster.org/) are open-source distributed storage software.

The cache cluster 7 is described later.

Next, an overview of the operation of the GPU cluster system of this embodiment is described with reference to FIG. 3. Here, a container-based cluster that allocates GPU resources for each job execution is used.

In this embodiment, the cache cluster 7 fetches data from the cluster shared storage 4 or the user storage 6. In the following, "fetching" data is also referred to as "caching".

In response to a user's instruction, the user terminal 5 stores the learning target data and the like in the cluster shared storage 4 or the user storage 6, as designated by the provider of the GPU cluster system (S1). In response to a user's instruction, the user terminal 5 registers the learning job that the user wants to run with the scheduler 1 (S2). The scheduler 1 instructs the cache cluster 7 to cache the data (S3). The cache cluster 7 starts fetching the learning target data from the cluster shared storage 4 or the user storage 6 (S4). The scheduler 1 schedules the jobs received from multiple user terminals 5 (users) based on registration order, priority, required resources (number of GPUs, number of CPUs, and so on), estimated processing time, and the like, and instructs the master 2 to execute a job as soon as GPU resources can be secured (S5). The required resources may be included in the job's metadata by the user and reported to the scheduler 1 in advance, or the scheduler 1 may estimate them from the contents of the job.

The master 2 deploys the job to a node, attaches a GPU, mounts the cache area of the cache cluster 7, and causes the GPU to execute the learning processing of the job (S6). That is, for each job, the master 2 creates a virtual environment for running the learning or inference program and attaches a GPU to it. The master 2 releases the GPU when the job completes. The GPU performs the learning processing while reading the learning target data cached in the cache area, and stores the results of the learning processing in the cache cluster 7 or the cluster shared storage 4 (S7). Once the job has finished, the user can obtain the results of the learning processing by accessing the cache cluster 7 or the cluster shared storage 4. The scheduler 1 deletes the data in the cache area after the job ends (S8).

FIG. 4 is a configuration diagram of the cache cluster 7. The illustrated cache cluster 7 includes a VPN connection unit 71 (VPN Function), a cache management unit 72 (Cache Manager), and one or more storages 73 (Storage).

The VPN connection unit 71 initiates or listens for a closed connection and establishes it.

The cache management unit 72 combines the one or more storages 73 into a cluster. It accesses the origin storage (the cluster shared storage 4 or the user storage 6) using a file sharing protocol or the like, and provides a transparent cache function that caches the data held by the origin while serving that data to the requester. When a requester requests data from the cache cluster 7, the cache management unit 72 determines whether the requested data has already been cached. If it has, the cached data is returned to the requester. If it has not, the cache management unit 72 requests the data from the origin storage and returns the data it receives to the requester. The cache management unit 72 also has a function for operating the cluster shared storage 4 and the user storage 6.
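The hit-or-miss behavior described above corresponds to what is commonly called a read-through cache. The following is a minimal in-memory sketch of that pattern, not the cache management unit itself: the class name, the use of a plain dictionary as the cache store, and the callable standing in for the origin storage are all assumptions made for illustration.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """Serve reads from the cache; on a miss, read the origin and keep a copy."""

    def __init__(self, read_origin: Callable[[str], bytes]):
        self._read_origin = read_origin     # stand-in for the origin storage
        self._store: Dict[str, bytes] = {}  # stand-in for the fast storages
        self.hits = 0
        self.misses = 0

    def read(self, key: str) -> bytes:
        if key in self._store:              # already cached: return it directly
            self.hits += 1
            return self._store[key]
        self.misses += 1                    # not cached: ask the origin
        data = self._read_origin(key)
        self._store[key] = data             # keep it for later requests
        return data


origin = {"block-0": b"abc", "block-1": b"def"}   # hypothetical origin contents
cache = ReadThroughCache(origin.__getitem__)
first = cache.read("block-0")    # miss: fetched from the origin
second = cache.read("block-0")   # hit: served from the cache
```

The requester calls `read` the same way in both cases, which is what makes the cache transparent to it.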

The storage 73 stores the data cached from the origin storage. High-speed storage such as NVMe or NVDIMM is used for the storage 73. The VPN connection unit 71 need not be contained in the cache cluster 7; it may exist in the GPU cluster system independently of the cache cluster 7. The cache cluster 7 may also contain the cluster shared storage 4.

FIG. 5 is a configuration diagram of the scheduler 1. The scheduler 1 includes a first stage queue 10, a second stage queue 20, a fetching job list 30 (Fetching Job List; hereinafter "FJL"), an account DB 31 (Accounting DB), and a GPU usage monitoring unit 32 (GPU Utilization Monitor). The account DB 31 manages each user's GPU usage. The account DB 31 may be located outside the scheduler 1 rather than inside it, and an existing user database of the provider may be reused as the account DB 31. The GPU usage monitoring unit 32 obtains GPU usage from the master 2 or the node 3 and monitors it.
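As a rough picture of per-user usage accounting, the sketch below accumulates usage reported by a monitor and lets a scheduler query it. It is an illustration under assumptions: the class and method names are hypothetical, usage is measured in GPU-seconds (this publication does not fix the unit here), and storage is a plain in-memory dictionary rather than a database.

```python
from collections import defaultdict
from typing import DefaultDict


class AccountDB:
    """Toy per-user GPU usage ledger (assumed unit: GPU-seconds)."""

    def __init__(self) -> None:
        self._usage: DefaultDict[str, float] = defaultdict(float)

    def record(self, user: str, gpu_count: int, seconds: float) -> None:
        """Add usage reported by the monitor for one job interval."""
        self._usage[user] += gpu_count * seconds

    def usage(self, user: str) -> float:
        """Current usage of a user; 0.0 if nothing has been recorded."""
        return self._usage.get(user, 0.0)

    def reset_period(self) -> None:
        """Start a new accounting period, for example a new month."""
        self._usage.clear()


db = AccountDB()
db.record("alice", gpu_count=2, seconds=3600)   # hypothetical user and figures
alice_usage = db.usage("alice")
bob_usage = db.usage("bob")
```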

The first stage queue 10 holds jobs waiting to start fetching. It includes a first queue selector 11 (Queue Selector 1), a plurality of job queues 13 to 15, and a first job selector 12 (Job Selector 1). The first queue selector 11 stores a job submitted from the user terminal 5 in one of the job queues 13 to 15 of the first stage queue 10. The processing of the first queue selector 11 is described later.

The first job selector 12 takes a job out of the first stage queue, registers it in the fetching job list, and causes the cache cluster to start fetching the data of the job stored in the storage. In this embodiment, the first job selector 12 takes out the jobs stored in the job queues 13 to 15 according to priority and the like and registers them in the fetching job list 30. The first job selector 12 also queries the account DB 31 for the user's current GPU usage and, depending on that usage, relocates jobs that exceed the fairness quota or the user quota to the corresponding job queue. The processing of the first job selector 12 is described later.

The job queues of the first stage are the job queue 13 (Job Queue; hereinafter "JQ"), the over-fairness-quota job queue 14 (Over Fairness-quota Job Queue; hereinafter "OFJQ"), and the over-user-quota job queue 15 (Over User-quota Job Queue; hereinafter "OUJQ").

The JQ 13 holds jobs that exceed neither the fairness quota nor the user quota. A JQ 13 is provided for each job class (priority) k, where 1 ≤ k ≤ n; k = 1 is the highest-priority class and k = n is the lowest. The JQ 13 of class k may be written as "JQ k".

The OFJQ 14 holds jobs that exceed the GPU fairness quota assigned to each user for the sake of fairness. The fairness quota is an upper limit on each user's GPU usage, set by the provider to prevent any one user from monopolizing the GPUs and shutting out other users, so that GPUs are allocated fairly among users. The quota applies to a predetermined period, for example one month. Like the JQ 13, an OFJQ 14 is provided for each job class k, where 1 ≤ k ≤ n. The OFJQ 14 of class k may be written as "OFJQ k".

OUJQ 15 stores jobs that exceed the GPU user quota set by the user. The user quota is an upper limit that users place on their own GPU usage to keep GPU charges within budget. The quota applies to a predetermined period, for example one month. Jobs stored in OUJQ 15 are neither deployed nor fetched. When the user quota is changed or the current usage is updated, the first job selector 12 takes jobs out of OUJQ 15 from the head, and the first queue selector 11 distributes each job to the job queue 13 of the corresponding class. The current usage is updated, for example, when a monthly usage limit is in place and the usage is reset to 0 at the start of the next month.

第１ジョブセレクタ１２は、JQ１３のジョブをOFJQ１４のジョブより優先的にFJL３０に登録する。 The first job selector 12 registers the job of JQ13 in the FJL30 with priority over the job of OFJQ14.

FJL 30 is a list in which jobs whose data fetch is to be started are registered. The cache cluster 7 fetches (prefetches) the data of the jobs registered in FJL 30. After registering a job in FJL 30, the first job selector 12 may instruct the cache cluster 7 to start fetching for the newly added job. Alternatively, the cache cluster 7 may check FJL 30 periodically and start fetching when a new job is registered. A job whose fetched data amount exceeds a predetermined threshold is moved to the second stage queue 20. The threshold is described later. Suspended jobs may also be registered in FJL 30.

第２ステージキュー２０には、デプロイ待ち状態のジョブが格納される。第２ステージキュー２０は、第２キューセレクタ２１(Queue Selector 2)と、複数のジョブキュー２３－２５と、第２ジョブセレクタ２２(Job Selector 2)とを備える。 The second stage queue 20 stores jobs waiting to be deployed. The second stage queue 20 includes a second queue selector 21 (Queue Selector 2), a plurality of job queues 23-25, and a second job selector 22 (Job Selector 2).

第２キューセレクタ２１は、フェッチしたデータ量が所定の閾値を超えたジョブをFJL３０から取り出し、第２ステージキュー２０のいずれかのキュー２３－２５に格納する。第２キューセレクタ２１の処理は後述する。第２ジョブセレクタ２２は、第２ステージキュー２０のいずれかのキュー２３－２５からジョブを取り出し、当該ジョブのデプロイを指示する。 The second queue selector 21 extracts a job whose fetched data amount exceeds a predetermined threshold from the FJL 30 and stores it in one of the queues 23-25 of the second stage queue 20. The processing of the second queue selector 21 will be described later. The second job selector 22 takes out a job from one of the queues 23-25 of the second stage queue 20 and instructs the deployment of the job.

The second-stage job queues include a restore queue 23 (Restore Queue, hereinafter "RQ"), a deploy queue 24 (Deploy Queue, hereinafter "DQ"), and an over-fairness-quota deploy queue 25 (Over Fairness-quota Deploy Queue, hereinafter "OFDQ").

RQ 23 stores jobs awaiting restore whose fetched data amount has exceeded the threshold. DQ 24 stores jobs awaiting deployment whose fetched data amount has exceeded the threshold. OFDQ 25 stores those jobs whose fetched data amount has exceeded the threshold and whose user (the job owner) currently has GPU usage above the fairness quota. A job over the fairness quota becomes a deployment candidate when a GPU is free and there are no other jobs (jobs in RQ 23 and DQ 24). If there are other jobs, they take priority. When the over-quota state is resolved, for example because the user's usage is reset to 0 at the start of the next month, jobs are taken out of OFDQ 25 starting from the head and stored in RQ 23 or DQ 24.

The second job selector 22 issues deploy instructions for jobs in RQ 23 in preference to jobs in DQ 24, and for jobs in DQ 24 in preference to jobs in OFDQ 25. When RQ 23 and DQ 24 are all empty, the second job selector 22 may activate the second queue selector 21 and have it store the job at the head of FJL 30, or the job in FJL 30 with the largest amount of fetched data, in one of RQ 23, DQ 24, and OFDQ 25. When taking a job out of RQ 23, a job that has just been suspended may be excluded from deploy instructions until a certain time has passed or a certain amount of data has been fetched, so that restore and suspend do not repeat within a short period.

FJL３０からジョブを取り出す際のフェッチ済みデータ量の閾値は、例えば以下の方法で算出してもよい。 The threshold value of the fetched data amount when retrieving a job from the FJL 30 may be calculated, for example, by the following method.

第１の方法は、事業者またはユーザが定義した値を閾値とする。例えば、データ量の10%などとする。 In the first method, a value defined by a business operator or a user is used as a threshold value. For example, let it be 10% of the data amount.

第２の方法は、ジョブ定義から閾値を算出する。具体的には、ジョブ定義に含まれるプログラムのループ処理の深さと、命令数とから計算量オーダを算出し、計算量オーダの大きさにより段階に分け、段階毎に閾値を決定する。また、計算量オーダが大きいほど時間あたりのデータ処理量(データ処理速度)が低下するため、計算量オーダが大きいほど閾値は小さくする。 The second method calculates the threshold value from the job definition. Specifically, the calculation amount order is calculated from the depth of the loop processing of the program included in the job definition and the number of instructions, and the calculation amount order is divided into stages depending on the size of the calculation amount order, and a threshold value is determined for each stage. Furthermore, the larger the order of calculation amount is, the lower the amount of data processing per time (data processing speed) is, so the larger the order of calculation amount is, the smaller the threshold value is.

第３の方法は、後述するチェックポイントされるまでのジョブの実行状況から閾値を算出する。具体的には、これまでの実行状況からデータ処理速度Ｖｐとフェッチ速度Ｖｆを算出する。 In the third method, a threshold value is calculated from the job execution status up to a checkpoint, which will be described later. Specifically, the data processing speed Vp and the fetch speed Vf are calculated from the previous execution status.

Ｖｆ≧Ｖｐの場合は、閾値＝Ｖｆ×Ｍとする。Ｍは任意の値である。 In the case of Vf≧Vp, the threshold value is set as Vf×M. M is an arbitrary value.

Ｖｆ＜Ｖｐの場合は、閾値＝（１－Ｖｆ／Ｖｐ）×Ｓ＋Ｖｐ×Ｍとする。Ｓは処理されていない残りのデータ量、Ｍは任意の値である。 When Vf<Vp, threshold value=(1−Vf/Vp)×S+Vp×M. S is the amount of remaining data that has not been processed, and M is an arbitrary value.
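The two cases of the third method can be written as a single function. The following is a minimal sketch, not the patented implementation: the function name `fetch_threshold` and its parameter names are hypothetical, and the text leaves the units of M open (it is described only as an arbitrary value). One way to read the second case: if the remaining data S is processed at Vp while fetching continues at Vf, the fetch delivers only Vf/Vp of S in that time, so (1 − Vf/Vp) × S is the shortfall that has to be in the cache before the job starts.

```python
def fetch_threshold(vf: float, vp: float, remaining: float, m: float) -> float:
    """Threshold on fetched data for the third method.

    vf        -- fetch speed Vf measured from the execution so far
    vp        -- data processing speed Vp measured from the execution so far
    remaining -- S, the amount of data not yet processed
    m         -- M, an arbitrary value chosen by the operator (assumed tunable)
    """
    if vp <= 0:
        raise ValueError("vp must be positive")
    if vf >= vp:
        # Fetching keeps up with processing: threshold = Vf x M.
        return vf * m
    # Fetching is slower than processing:
    # threshold = (1 - Vf/Vp) x S + Vp x M.
    return (1 - vf / vp) * remaining + vp * m


if __name__ == "__main__":
    print(fetch_threshold(vf=200, vp=100, remaining=1000, m=2))
    print(fetch_threshold(vf=50, vp=100, remaining=1000, m=2))
```

With these illustrative numbers, a job whose fetch is twice as fast as its processing only needs the margin term, while a job whose fetch is half as fast needs half of the remaining data cached plus the margin.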

FIG. 6 is a flowchart showing the processing of the first queue selector 11. On receiving a job (S11), the first queue selector 11 obtains the priority class of the user who owns the job from the account DB 31 (S12). Here, the priority class is k (S13). The first queue selector 11 compares the GPU fairness quota with the user's current usage (S14) and, if the current usage does not exceed the fairness quota (S15: true), compares the GPU user quota with the user's current usage (S16).

現在使用量がユーザ割当量を超えていない場合（Ｓ１７：true）、第１キューセレクタ１１は、Ｓ１１で受信したジョブを優先クラスkのJQ k１３の末尾に格納する（Ｓ１８）。現在使用量がユーザ割当量を超えている場合（Ｓ１７：false）、第１キューセレクタ１１は、Ｓ１１で受信したジョブをOUJQ１５の末尾に格納する（Ｓ１９）。現在使用量が公平性割当量を超えている場合（Ｓ１５：false）、第１キューセレクタ１１は、Ｓ１１で受信したジョブを優先クラスkのOFJQ k１４の末尾に格納する（Ｓ２０）。 If the current usage does not exceed the user quota (S17: true), the first queue selector 11 stores the job received in S11 at the end of JQ k13 of priority class k (S18). If the current usage exceeds the user quota (S17: false), the first queue selector 11 stores the job received in S11 at the end of OUJQ15 (S19). If the current usage exceeds the fairness quota (S15: false), the first queue selector 11 stores the job received in S11 at the end of OFJQ k14 of priority class k (S20).
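The routing of FIG. 6 can be sketched in a few lines. This is an illustrative sketch only: the `Account` record, the dictionary layout of the queues, and the function names are assumptions made for the example, and "exceeds" is taken to mean strictly greater than the quota.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Account:
    """Hypothetical record returned by the account DB 31."""
    priority_class: int    # k, where 1 is the highest priority
    fairness_quota: float  # set by the operator
    user_quota: float      # set by the user
    usage: float           # current GPU usage in the period


def make_stage1_queues(n: int) -> dict:
    """JQ k and OFJQ k for each class k in 1..n, plus a single OUJQ."""
    return {
        "JQ": {k: deque() for k in range(1, n + 1)},
        "OFJQ": {k: deque() for k in range(1, n + 1)},
        "OUJQ": deque(),
    }


def enqueue_stage1(job, account: Account, queues: dict) -> str:
    """Place a newly received job (S11) at the tail of one first-stage queue."""
    k = account.priority_class                 # S12, S13
    if account.usage > account.fairness_quota:  # S14, S15: false
        queues["OFJQ"][k].append(job)           # S20
        return f"OFJQ {k}"
    if account.usage > account.user_quota:      # S16, S17: false
        queues["OUJQ"].append(job)              # S19
        return "OUJQ"
    queues["JQ"][k].append(job)                 # S18
    return f"JQ {k}"
```

The fairness check comes first, as in the flowchart, so a job that is over both quotas lands in OFJQ k rather than OUJQ.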

図７は、第１ジョブセレクタ１２の処理を示すフローチャートである。図７の処理は、第１キューセレクタ１１にジョブが投入されることをトリガとして開始される。また、図７の処理は、第２キューセレクタ２１がFJL３０に空きができたことを検知することをトリガとして開始される。 FIG. 7 is a flowchart showing the processing of the first job selector 12. The process in FIG. 7 is started when a job is submitted to the first queue selector 11 as a trigger. Further, the process in FIG. 7 is started when the second queue selector 21 detects that a space becomes available in the FJL 30.

第１ジョブセレクタ１２は、FJL３０に空きがある場合（Ｓ３１：true）、k（優先クラス）に1を設定する（Ｓ３２）。k＝1のJQ k１３にジョブがある場合（Ｓ３３：true）、第１ジョブセレクタ１２は、JQ k１３からジョブを取り出し（Ｓ３４）、ジョブ所有者の公平性割当量と、現在使用量とを比較する（Ｓ３５）。現在使用量が公平性割当量を超えていない場合（Ｓ３６：true）、第１ジョブセレクタ１２は、ジョブ所有者のユーザ割当量と現在使用量とを比較する（Ｓ３７）。 If there is a vacancy in the FJL 30 (S31: true), the first job selector 12 sets k (priority class) to 1 (S32). If there is a job in JQ k13 with k=1 (S33: true), the first job selector 12 extracts the job from JQ k13 (S34) and compares the fairness quota of the job owner with the current usage amount. (S35). If the current usage does not exceed the fairness quota (S36: true), the first job selector 12 compares the job owner's user quota with the current usage (S37).

現在使用量がユーザ割当量を超えていない場合（Ｓ３８：true）、第１ジョブセレクタ１２は、Ｓ３４で取り出したジョブをFJL３０の末尾に格納する（Ｓ３９）。現在使用量がユーザ割当量を超えている場合（Ｓ３８：false）、第１ジョブセレクタ１２は、第１キューセレクタ１１を介して、Ｓ３４で取り出したジョブをOUJQ１５の末尾に格納する（Ｓ４０）。現在使用量が公平性割当量を超えている場合（Ｓ３６：false）、第１ジョブセレクタ１２は、第１キューセレクタ１１を介してＳ３４で取り出したジョブを優先クラスkのOFJQ k１４の末尾に格納する（Ｓ４１）。 If the current usage does not exceed the user quota (S38: true), the first job selector 12 stores the job extracted in S34 at the end of the FJL 30 (S39). If the current usage exceeds the user quota (S38: false), the first job selector 12 stores the job retrieved in S34 at the end of the OUJQ 15 via the first queue selector 11 (S40). If the current usage exceeds the fairness quota (S36: false), the first job selector 12 stores the job retrieved in S34 via the first queue selector 11 at the end of OFJQ k14 of priority class k. (S41).

If JQ k 13 of the current class k has no job (S33: false), the first job selector 12 adds 1 to k (S42) and, if k≦n (S43: true), returns to S33 to repeat the subsequent processing. If k>n (S43: false), the first job selector 12 sets k to 1 (S44). If OFJQ k 14 has a job (S45: true), the first job selector 12 takes the job out of OFJQ k 14 (S48), compares the user quota of the job's owner with the current usage (S37), and proceeds to S38. The processing from S38 onward is as described above and is not repeated here.

If OFJQ k 14 of the current class k has no job (S45: false), the first job selector 12 adds 1 to k (S46) and, if k≦n (S47: true), returns to S45 to repeat the subsequent processing. If k>n (S47: false), the first job selector 12 ends the processing.
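The scan order of FIG. 7 (all JQ k from the highest class down, then all OFJQ k) can be sketched as follows. This is a sketch under stated assumptions rather than the actual scheduler: jobs and accounts are plain dictionaries, `fjl_capacity` is a hypothetical limit on the size of FJL 30, and the function handles one job per invocation because the text does not say whether the flow loops after S39 to S41.

```python
from collections import deque


def _place_by_user_quota(job, acct, fjl, oujq):
    """S37 to S40: a job over the user quota goes to OUJQ, otherwise to the FJL."""
    if acct["usage"] > acct["user_quota"]:  # S38: false
        oujq.append(job)                    # S40
        return "OUJQ"
    fjl.append(job)                         # S39
    return "FJL"


def select_for_fetch(jq, ofjq, oujq, fjl, fjl_capacity, accounts):
    """One pass of FIG. 7. Returns (job, destination), or None if no job moved."""
    if len(fjl) >= fjl_capacity:            # S31: no room in the FJL
        return None
    n = len(jq)
    for k in range(1, n + 1):               # S32, S42, S43
        if jq[k]:                           # S33
            job = jq[k].popleft()           # S34
            acct = accounts[job["owner"]]
            if acct["usage"] > acct["fairness_quota"]:  # S35, S36: false
                ofjq[k].append(job)         # S41
                return job, f"OFJQ {k}"
            return job, _place_by_user_quota(job, acct, fjl, oujq)
    for k in range(1, n + 1):               # S44, S46, S47
        if ofjq[k]:                         # S45
            job = ofjq[k].popleft()         # S48
            acct = accounts[job["owner"]]
            return job, _place_by_user_quota(job, acct, fjl, oujq)
    return None
```

Because every JQ k is scanned before any OFJQ k, a low-priority job within quota is fetched ahead of a high-priority job that is over the fairness quota, which matches the stated preference of JQ 13 over OFJQ 14.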

FIG. 8 is a flowchart showing the processing of the second queue selector 21. The processing of FIG. 8 is executed periodically. The second queue selector 21 sets a variable i to 1 (S51) and, if an i-th job exists in FJL 30 (S52: true), determines whether the amount of learning target data already fetched for the i-th job exceeds the threshold (S53). To do so, the second queue selector 21 queries the cache cluster 7 (cache management unit 72) for the amount of fetched data. If the fetched amount does not exceed the threshold (S53: false), the second queue selector 21 adds 1 to i (S54) and returns to S52 to repeat the subsequent processing.

フェッチ済みデータ量が閾値を超過している場合（Ｓ５３：true）、第２キューセレクタ２１は、FJL３０のi番目のジョブを取り出し、デキューする（Ｓ５５）。第２キューセレクタ２１は、取り出したジョブのメタデータを確認して（Ｓ５６）、当該ジョブがサスペンド状態（一時停止状態）の場合（Ｓ５７：true）、RQ２３に当該ジョブを格納する（Ｓ６３）。第２キューセレクタ２１は、FJL３０に空きができたため、第１ジョブセレクタ１２を起動する（Ｓ６１）。 If the fetched data amount exceeds the threshold (S53: true), the second queue selector 21 takes out the i-th job of the FJL 30 and dequeues it (S55). The second queue selector 21 checks the metadata of the retrieved job (S56), and if the job is in a suspended state (temporarily stopped state) (S57: true), stores the job in the RQ 23 (S63). The second queue selector 21 activates the first job selector 12 because there is a free space in the FJL 30 (S61).

If the retrieved job is not suspended (S57: false), the second queue selector 21 checks the job owner's fairness quota and current usage to decide whether to proceed with job control (S58). If the current usage does not exceed the fairness quota (S59: true), the second queue selector 21 stores the job in DQ 24 (S63) and activates the first job selector 12 (S61). If the current usage exceeds the fairness quota (S59: false), the second queue selector 21 stores the job in OFDQ 25 (S62) and activates the first job selector 12 (S61).

また、第２キューセレクタ２１は、GPU使用量監視部３２にGPUの使用量を問い合わせる。GPU使用量監視部３２は、マスタ２またはノード３からGPUの使用量を取得し、第２キューセレクタ２１に回答する。GPUに空きがあり（Ｓ６４：true）、RQ２３が空で（Ｓ６５：true）、DQ２４が空で（Ｓ６６：true）、FJL３０に1番目のジョブが存在する場合（Ｓ６７：true）、第２キューセレクタ２１は、FJL３０の1番目のジョブを取り出し（Ｓ６８）、Ｓ５６に進む。GPUの使用率を最大に高めるために、本実施形態では、実行すべきRQ２３およびDQ２４のジョブがなくなった場合、フェッチが不十分なFJL３０のジョブであってもデプロイさせる。すなわち、第２キューセレクタ２１は、RQ２３およびDQ２４が共に空の場合、FJL３０の1番目のジョブをフェッチが不十分であっても取り出していずれかのキュー２３－２５にエンキューする。Ｓ６４からＳ６７の少なくとも１つがfalseの場合、第２キューセレクタ２１は、第１ジョブセレクタ１２を起動する（Ｓ６１）。 The second queue selector 21 also inquires of the GPU usage monitoring unit 32 about the GPU usage. The GPU usage monitoring unit 32 acquires the GPU usage from the master 2 or node 3 and sends it to the second queue selector 21 . If there is space in the GPU (S64: true), RQ23 is empty (S65: true), DQ24 is empty (S66: true), and the first job exists in FJL30 (S67: true), the second queue The selector 21 takes out the first job of the FJL 30 (S68) and proceeds to S56. In order to maximize the usage rate of the GPU, in this embodiment, when there are no RQ23 and DQ24 jobs to be executed, even the FJL30 job with insufficient fetching is deployed. That is, when both RQ23 and DQ24 are empty, the second queue selector 21 takes out the first job of FJL30 even if the fetch is insufficient and enqueues it into one of the queues 23-25. If at least one of S64 to S67 is false, the second queue selector 21 activates the first job selector 12 (S61).

Depending on the I/O speed, communication speed, and so on of the storage holding the learning target data, the first (head) job in FJL 30 does not necessarily have the largest amount of fetched data among the jobs stored there. To allow for this, in S68 the second queue selector 21 may instead take out the job in FJL 30 with the largest amount of fetched data, proceed to S56, and store that job in one of RQ 23, DQ 24, and OFDQ 25. In other words, the second queue selector 21 may take out the job in FJL 30 whose fetch has progressed furthest.
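The promotion logic of FIG. 8, including the fallback that deploys an under-fetched job when the GPUs would otherwise sit idle, can be sketched as follows. The data layout (jobs as dictionaries, `fetched` and `threshold` as per-job lookups) and the `prefer_most_fetched` switch are assumptions made for illustration; the activation of the first job selector 12 (S61) is omitted.

```python
def route(job, rq, dq, ofdq, accounts):
    """S56 to S63: choose the second-stage queue for a job taken from the FJL."""
    if job["suspended"]:                        # S57: true
        rq.append(job)
        return "RQ"
    acct = accounts[job["owner"]]               # S58
    if acct["usage"] > acct["fairness_quota"]:  # S59: false
        ofdq.append(job)
        return "OFDQ"
    dq.append(job)
    return "DQ"


def promote_from_fjl(fjl, rq, dq, ofdq, accounts, fetched, threshold,
                     gpu_free, prefer_most_fetched=False):
    """One pass of FIG. 8. Returns (job_id, queue_name), or None if nothing moved."""
    for i, job in enumerate(fjl):               # S51, S52, S54
        if fetched[job["id"]] > threshold[job["id"]]:  # S53
            del fjl[i]                          # S55
            return job["id"], route(job, rq, dq, ofdq, accounts)
    # S64 to S67: a GPU is free and nothing is waiting in RQ or DQ.
    if gpu_free and not rq and not dq and fjl:
        idx = 0                                 # head of the FJL (S68)
        if prefer_most_fetched:
            # Variant: take the job whose fetch has progressed furthest.
            idx = max(range(len(fjl)), key=lambda j: fetched[fjl[j]["id"]])
        job = fjl.pop(idx)
        return job["id"], route(job, rq, dq, ofdq, accounts)
    return None
```

The fallback reuses the same `route` step as the normal path, mirroring the flowchart's jump from S68 back to S56.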

図９は、第２ジョブセレクタ２２の処理を示すフローチャートである。GPU使用量監視部３２は、GPUに空きがあると第２ジョブセレクタ２２を起動し、図９の処理が行われる。第２ジョブセレクタ２２は、RQ２３が空でない場合（Ｓ７１：false）、RQ２３から１つのジョブを取り出しJに格納する（Ｓ７２）。RQ２３が空の場合（Ｓ７１：true）で、DQ２４が空でない場合（Ｓ７５：false）、第２ジョブセレクタ２２は、DQ２４から１つのジョブを取り出しJに格納する（Ｓ７６）。DQ２４が空の場合（Ｓ７５：true）で、FJL３０が空で（Ｓ７７：true）、OFDQ２５が空でない場合（Ｓ７８：false）、第２ジョブセレクタ２２は、OFDQ２５から１つのジョブを取り出しJに格納する（Ｓ７９）。 FIG. 9 is a flowchart showing the processing of the second job selector 22. The GPU usage monitoring unit 32 activates the second job selector 22 when there is a vacant GPU, and the process shown in FIG. 9 is performed. If RQ23 is not empty (S71: false), the second job selector 22 extracts one job from RQ23 and stores it in J (S72). If RQ23 is empty (S71: true) and DQ24 is not empty (S75: false), the second job selector 22 extracts one job from DQ24 and stores it in J (S76). If DQ24 is empty (S75: true), FJL30 is empty (S77: true), and OFDQ25 is not empty (S78: false), the second job selector 22 takes out one job from OFDQ25 and stores it in J. (S79).

Ｓ７２、Ｓ７６およびＳ７９の後、第２ジョブセレクタ２２は、マスタ２にＪのデプロイを指示し（Ｓ７３）、第２キューセレクタ２１を起動する（Ｓ７４）。OFDQ２５が空の場合（Ｓ７８：true）、第２ジョブセレクタ２２は、第２キューセレクタ２１を起動する（Ｓ７４）。FJL３０が空でない場合（Ｓ７７：false）、第２ジョブセレクタ２２は、第２キューセレクタ２１を起動し、第２キューセレクタ２１の動作完了を待機して（Ｓ８０）、Ｓ７１へ進む。このように、第２ジョブセレクタ２２は、RQ２３、DQ２４およびOFDQ２５の全てが空の場合、第２キューセレクタ２１を起動し、FJL３０の先頭のジョブを、RQ２３、DQ２４およびOFDQ２５のいずれかに格納させる。 After S72, S76, and S79, the second job selector 22 instructs the master 2 to deploy J (S73), and activates the second queue selector 21 (S74). If the OFDQ 25 is empty (S78: true), the second job selector 22 activates the second queue selector 21 (S74). If the FJL 30 is not empty (S77: false), the second job selector 22 activates the second queue selector 21, waits for the second queue selector 21 to complete its operation (S80), and proceeds to S71. In this way, when all of RQ23, DQ24 and OFDQ25 are empty, the second job selector 22 activates the second queue selector 21 and stores the first job of FJL30 in any of RQ23, DQ24 and OFDQ25. .
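The selection order of FIG. 9 (RQ, then DQ, then a retry through the second queue selector while FJL still holds jobs, and OFDQ last) can be sketched as below. The function name and the `run_queue_selector` callback are hypothetical, and the guard that stops retrying when the callback moves nothing is an addition for the sketch, not something stated in the text.

```python
from collections import deque


def pick_job_to_deploy(rq, dq, ofdq, fjl, run_queue_selector):
    """Return the next job to hand to the master for deployment, or None."""
    while True:
        if rq:                       # S71: false
            return rq.popleft()      # S72
        if dq:                       # S75: false
            return dq.popleft()      # S76
        if fjl:                      # S77: false
            before = len(fjl)
            run_queue_selector()     # S80: run FIG. 8 and wait for it to finish
            if len(fjl) == before:
                # Assumed guard: nothing was promoted, so stop instead of spinning.
                return None
            continue                 # back to S71
        if ofdq:                     # S78: false
            return ofdq.popleft()    # S79
        return None                  # S78: true, nothing to deploy
```

A caller would pass a callback that wraps one pass of the second queue selector, so that a job promoted from FJL into RQ or DQ is picked up on the next loop iteration, and a job promoted into OFDQ is picked up once FJL is empty.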

(Example 1)
FIG. 10 is a configuration diagram of the GPU cluster of Example 1. In this example, the learning target data is stored in advance in the low-speed cluster shared storage 4 (distributed storage). As the execution of a job approaches, the cache cluster 7 (cache management unit 72) prefetches the learning target data from the cluster shared storage 4 into the cache cluster 7. When a GPU becomes free, the master 2 mounts the cache cluster 7 area on the node 3. The cache area is mounted using RDMA-fs (a mechanism that exposes data on an RDMA device as a file system), NFS over RDMA, GlusterFS, or the like. Bandwidth on the RDMA transfer path is guaranteed with TSN (Time Sensitive Networking) or the like. In this example, a high-speed, bandwidth-guaranteed network such as a Lossless DC fabric is built, and data is transferred using various switches (SW) such as spine switches (Spine SW).

本実施例では、(1)スケジューラ１は、ジョブ待機中に当該ジョブのデータのプリフェッチを、キャッシュクラスタ７に指示する。これにより、キャッシュクラスタ７は、前記指示によりクラスタ共有ストレージ４からデータをプリフェッチする。(2)スケジューラ１は、ジョブのデプロイをマスタ２に指示し、マスタ２はジョブをGPUにアサインする。(3)マスタ２は、キャッシュクラスタ７のキャッシュ領域を、RDMA-fs等を用いてマウントする。(4)GPUは、ジョブを実行する。(5)スケジューラ１は、ジョブの実行後に、キャッシュクラスタ７のキャッシュデータを削除する。 In this embodiment, (1) the scheduler 1 instructs the cache cluster 7 to prefetch the data of the job while the job is on standby. Thereby, the cache cluster 7 prefetches data from the cluster shared storage 4 according to the instruction. (2) Scheduler 1 instructs master 2 to deploy the job, and master 2 assigns the job to the GPU. (3) The master 2 mounts the cache area of the cache cluster 7 using RDMA-fs or the like. (4) GPU executes the job. (5) The scheduler 1 deletes the cache data in the cache cluster 7 after executing the job.

(Example 2)
FIG. 11 is a configuration diagram of the GPU cluster of Example 2. This example connects online to the user storage 6 at the user site. That is, the learning target data stored in the low-speed user storage 6 is accessed over an online connection.

In this example, a high-speed, bandwidth-guaranteed network such as a Lossless DC fabric is built inside the GPU cluster system, as in Example 1, and data is transferred using various switches (SW) such as spine switches (Spine SW). Between the GPU cluster system and the user site, the Access/Metro network is connected through switches such as a Border Leaf to build a data transfer path (VPN, leased line, etc.). This example operates as follows.

(1)キャッシュクラスタ７（キャッシュ管理部７２）は、ユーザ拠点のユーザストレージ６のデータを転送し、キャッシュクラスタ７のメモリ（NV-DIMM）にプリフェッチする。なお、キャッシュメモリにデータの一部を置くだけなのでダウンロードに相当しない。 (1) The cache cluster 7 (cache management unit 72) transfers data in the user storage 6 at the user base and prefetches it to the memory (NV-DIMM) of the cache cluster 7. Note that this does not correspond to downloading because it only places part of the data in the cache memory.

(2)キャッシュクラスタ７のメモリに一定量のキャッシュデータが溜まったら、GPUはジョブを実行する。 (2) When a certain amount of cache data is accumulated in the memory of the cache cluster 7, the GPU executes the job.

(3)GPUがキャッシュデータを使い切ると、GPUは、ジョブを一時中断し、リソースを開放する。リソースの開放にはCRIU (Checkpoint/Restore In Userspace)のような技術を用いることで，ジョブのプログラムに一時中断のための機能を実装する必要がなくなる。CRIUは、プロセスを終了せずに、一時停止、保存、再開する技術である。 (3) When the GPU runs out of cache data, the GPU temporarily suspends the job and releases resources. By using technology such as CRIU (Checkpoint/Restore In Userspace) to release resources, there is no need to implement a temporary suspension function in the job program. CRIU is a technology that pauses, saves, and resumes processes without terminating them.

(4)キャッシュクラスタ７は、処理中のプロセスデータをキャッシュクラスタ７に書き込む。 (4) The cache cluster 7 writes the process data being processed into the cache cluster 7.

(5)キャッシュクラスタ７のメモリに一定量のキャッシュデータが溜まったら、GPUを確保する。 (5) Once a certain amount of cache data has accumulated in the memory of the cache cluster 7, secure a GPU.

(6)プロセスデータを書き戻し、ジョブの処理を再開（リストア）する。 (6) Write back the process data and restart (restore) job processing.

(7)ジョブの処理が完了したら終了する。完了していない場合は、(3)に戻り以降の処理を繰り返す。 (7) Terminate when job processing is completed. If it is not completed, return to (3) and repeat the process.
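The run, suspend, and restore cycle of steps (1) to (7) can be illustrated with a small discrete-time simulation. This is a toy model under stated assumptions, not the system's behavior: data amounts are integers, one loop iteration is one time step, `start_amount` stands in for the "certain amount" of cached data in steps (2) and (5), and the checkpoint and restore themselves (CRIU in the text) are reduced to a flag.

```python
def simulate(total: int, fetch_rate: int, process_rate: int, start_amount: int):
    """Return (time steps taken, number of suspensions) for one job."""
    if fetch_rate <= 0 or process_rate <= 0:
        raise ValueError("rates must be positive")
    fetched = processed = cached = 0
    running = False
    suspends = 0
    ticks = 0
    while processed < total:                       # (7) stop when the job is done
        ticks += 1
        step = min(fetch_rate, total - fetched)    # (1) prefetch into the cache
        fetched += step
        cached += step
        if running:
            used = min(process_rate, cached, total - processed)
            processed += used
            cached -= used
            if cached == 0 and processed < total:
                running = False                    # (3), (4) checkpoint, free the GPU
                suspends += 1
        elif cached >= start_amount or fetched == total:
            running = True                         # (2) first start, or (5), (6) restore
    return ticks, suspends


if __name__ == "__main__":
    print(simulate(total=100, fetch_rate=10, process_rate=20, start_amount=40))
    print(simulate(total=100, fetch_rate=20, process_rate=10, start_amount=10))
```

In this model, a job that processes data faster than it can be fetched drains the cache and is suspended, releasing the GPU for other jobs, whereas a job whose fetch keeps ahead of processing runs through without suspension.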

(Example 3)
FIG. 12 is a configuration diagram of the GPU cluster of Example 3. In this example, a plurality of data centers 40 exist in a distributed manner. Each data center 40 includes a GPU cluster containing a plurality of masters 2 and nodes 3, a cache cluster 7, and a cluster shared storage 4. A data center 40 need not include the cluster shared storage 4.

スケジューラ１は、ジョブの配置先をユーザ拠点から近いGPUクラスタに配置する。ユーザ自らデータをクラスタ共有ストレージ４にアップロードする場合は、スケジューラ１は、ユーザがデータをアップロードしたクラスタ共有ストレージ４になるべく近いGPUクラスタを選択する。 The scheduler 1 places jobs in GPU clusters that are close to the user base. When the user himself/herself uploads data to the cluster shared storage 4, the scheduler 1 selects a GPU cluster as close as possible to the cluster shared storage 4 to which the user has uploaded the data.

(Closed connection methods)
The following describes methods of closed connection between the user site and the cache cluster 7 for the case in Example 2 where the cache cluster 7 fetches the learning target data stored in the user storage 6 at the user site.

図１３は、方式１の閉域接続を示す模式図である。本方式では、ユーザストレージ６が閉域接続の機能の有し、キャッシュクラスタ７からの閉域接続を待ち受けている。学習対象データのプリフェッチの際に、キャッシュクラスタ７がユーザストレージ６に対し閉域接続を開始する。学習対象データの取得が完了すると、キャッシュクラスタ７は、閉域接続を解除する。これにより、ユーザストレージ６は、閉域接続の待受状態に戻る。ユーザストレージ６は、常時、閉域接続の待ち受ける状態である。ユーザ拠点には加入者側回線終端装置（以下「CPE」)が配置される。ユーザは、閉域接続のための設定をGPUクラスタシステムの事業者と事前に折衝し、決めておく必要がある。ユーザは、自身のユーザストレージ６にキャッシュクラスタ７との閉域接続のための設定をする必要がある。 FIG. 13 is a schematic diagram showing closed area connection of method 1. In this method, the user storage 6 has a closed connection function and waits for a closed connection from the cache cluster 7. When prefetching learning target data, the cache cluster 7 starts a closed connection to the user storage 6. When acquisition of the learning target data is completed, the cache cluster 7 releases the closed connection. As a result, the user storage 6 returns to the closed connection standby state. The user storage 6 is always in a state of waiting for a closed connection. A subscriber line termination equipment (hereinafter referred to as "CPE") is installed at the user location. Users need to negotiate and decide on the settings for closed connections with the GPU cluster system provider in advance. The user needs to configure his/her own user storage 6 for closed connection with the cache cluster 7 .

図１４は、方式２の閉域接続を示す模式図である。本方式では、ユーザ拠点のCPE８は、VPN接続部と、スケジューラ１からの制御に対応するためのAPI（制御部）とを備える。本方式は、オンデマンドで閉域接続を構成する。ユーザは、スケジューラ１にジョブを登録する際に、当該ジョブにCPE８のAPIへの接続情報を含める。スケジューラ１は、キャッシュクラスタ７にCPE８からの閉域接続を待ち受けるよう指示する。CPE８は、スケジューラ１からの指示を受けて、指示された接続先(キャッシュクラスタ７)に対し閉域接続を要求する。閉域接続が確立すると、キャッシュクラスタ７は、ユーザストレージ６上の学習対象データをGPUクラスタシステムに複製する。複製先は、キャッシュクラスタ７またはクラスタ共有ストレージ４のいずれかである。ジョブが完了すると、スケジューラ１は、CPE８に閉域接続の設定削除を指示する。 FIG. 14 is a schematic diagram showing closed area connection of method 2. In this system, the CPE 8 at the user site includes a VPN connection section and an API (control section) for responding to control from the scheduler 1. This method configures closed connections on demand. When the user registers a job in the scheduler 1, the user includes connection information to the API of the CPE 8 in the job. The scheduler 1 instructs the cache cluster 7 to wait for a closed connection from the CPE 8. Upon receiving an instruction from the scheduler 1, the CPE 8 requests a closed connection to the specified connection destination (cache cluster 7). When the closed connection is established, the cache cluster 7 copies the learning target data on the user storage 6 to the GPU cluster system. The replication destination is either the cache cluster 7 or the cluster shared storage 4. When the job is completed, the scheduler 1 instructs the CPE 8 to delete the closed connection settings.

図１５は、方式３の閉域接続を示す模式図である。本方式では、方式２と同様に、CPE８は、VPN接続部と、API（制御部）とを備える。本方式では、スケジューラ１は、CPE８には閉域接続を待受させ、キャッシュクラスタ７に閉域接続開始を指示する。本方式は、オンデマンドで閉域接続を構成する。ユーザは、スケジューラ１にジョブを登録する際に、当該ジョブにCPE８のAPIへの接続情報を含める。スケジューラ１は、CPE８にキャッシュクラスタ７からの閉域接続を待ち受けるよう指示する。キャッシュクラスタ７は、スケジューラ１からの指示を受けて、指示された接続先(CPE８)に対し閉域接続を要求する。閉域接続が確立すると、キャッシュクラスタ７は、ユーザストレージ６上の学習対象データをGPUクラスタシステムに複製する。複製先は、キャッシュクラスタ７またはクラスタ共有ストレージ４のいずれかである。ジョブが完了すると、スケジューラ１は、CPEに閉域接続の設定削除を指示する。 FIG. 15 is a schematic diagram showing closed area connection of method 3. In this method, as in method 2, the CPE 8 includes a VPN connection section and an API (control section). In this method, the scheduler 1 causes the CPE 8 to wait for a closed connection, and instructs the cache cluster 7 to start a closed connection. This method configures closed connections on demand. When the user registers a job in the scheduler 1, the user includes connection information to the API of the CPE 8 in the job. The scheduler 1 instructs the CPE 8 to wait for a closed connection from the cache cluster 7. Upon receiving an instruction from the scheduler 1, the cache cluster 7 requests a closed connection to the specified connection destination (CPE 8). When the closed connection is established, the cache cluster 7 copies the learning target data on the user storage 6 to the GPU cluster system. The replication destination is either the cache cluster 7 or the cluster shared storage 4. When the job is completed, the scheduler 1 instructs the CPE to delete the closed connection settings.

FIG. 16 is a schematic diagram showing the closed connection of method 4. In this method, a virtualized subscriber-side line termination device (hereinafter "vCPE") 92 is installed in the carrier network. The vCPE 92 has a VPN connection unit and an API (control unit) for handling control from the scheduler 1.

本方式は、オンデマンドで閉域接続を構成する。ユーザは、スケジューラ１にジョブを登録する際に、当該ジョブにユーザストレージ６が接続されている回線を識別するための回線識別情報を含める。スケジューラ１は、キャッシュクラスタ７にvCPE９２からの閉域接続を待ち受けるよう指示する。vCPE９２は、スケジューラ１からの指示を受けて、指示された接続先(キャッシュクラスタ７)に対し閉域接続を要求する。閉域接続が確立すると、キャッシュクラスタ７は、ユーザストレージ６上の学習対象データをGPUクラスタシステムに複製する。複製先は、キャッシュクラスタ７またはクラスタ共有ストレージ４のいずれかである。ジョブが完了すると、スケジューラ１は、vCPE９２に閉域接続の解除を指示する。ユーザ拠点には、光回線終端装置(以下、「ONU」)９１またはモデムなどが設置され、vCPE９２と接続される。ONU９１等は、vCPE９２とのレイヤ2接続(Ethernet等)を提供する。 This method configures closed connections on demand. When a user registers a job in the scheduler 1, the user includes line identification information for identifying the line to which the user storage 6 is connected to the job. The scheduler 1 instructs the cache cluster 7 to wait for a closed connection from the vCPE 92. In response to an instruction from the scheduler 1, the vCPE 92 requests a closed connection to the specified connection destination (cache cluster 7). When the closed connection is established, the cache cluster 7 copies the learning target data on the user storage 6 to the GPU cluster system. The replication destination is either the cache cluster 7 or the cluster shared storage 4. When the job is completed, the scheduler 1 instructs the vCPE 92 to release the closed connection. At the user base, an optical line terminal unit (hereinafter referred to as "ONU") 91 or a modem is installed and connected to the vCPE 92. ONU91 etc. provide layer 2 connection (Ethernet etc.) with vCPE92.

図１７は、方式５の閉域接続を示す模式図である。本方式では、キャリア網内に方式４と同様にvCPE９２を備える。ユーザ拠点には、ONU９１等が設置される。ONU９１等は、vCPE９２と接続され、vCPE９２とのレイヤ2接続を提供する。本方式では、スケジューラ１は、vCPE９２には閉域接続を待受させ、キャッシュクラスタ７に閉域接続開始を指示する。 FIG. 17 is a schematic diagram showing closed area connection of method 5. In this method, a vCPE 92 is provided in the carrier network as in method 4. ONU91 etc. are installed at the user base. The ONU 91 and the like are connected to the vCPE 92 and provide layer 2 connection with the vCPE 92. In this method, the scheduler 1 causes the vCPE 92 to wait for a closed connection, and instructs the cache cluster 7 to start a closed connection.

本方式は、オンデマンドで閉域接続を構成する。ユーザは、スケジューラ１にジョブを登録する際に、当該ジョブにユーザストレージ６が接続されている回線を識別するための回線識別情報を含める。スケジューラ１は、vCPE９２にキャッシュクラスタ７からの閉域接続を待ち受けるよう指示する。キャッシュクラスタ７は、スケジューラ１からの指示を受けて、指示された接続先(vCPE９２)に対し閉域接続を要求する。閉域接続が確立すると、キャッシュクラスタ７は、ユーザストレージ６上の学習対象データをGPUクラスタシステムに複製する。複製先は、キャッシュクラスタ７またはクラスタ共有ストレージ４のいずれかである。ジョブが完了すると、スケジューラ１は、vCPE９２に閉域接続の解除を指示する。 This method configures closed connections on demand. When the user registers a job in the scheduler 1, the user includes line identification information for identifying the line to which the user storage 6 is connected to the job. The scheduler 1 instructs the vCPE 92 to wait for a closed connection from the cache cluster 7. Upon receiving the instruction from the scheduler 1, the cache cluster 7 requests a closed connection to the specified connection destination (vCPE 92). When the closed connection is established, the cache cluster 7 copies the learning target data on the user storage 6 to the GPU cluster system. The replication destination is either the cache cluster 7 or the cluster shared storage 4. When the job is completed, the scheduler 1 instructs the vCPE 92 to release the closed connection.

図１８は、方式６の閉域接続を示す模式図である。本方式では、キャリア網内に方式４と同様にvCPE９２を備える。ユーザ拠点には、方式１と同様のCPE８が設置され、vCPE９２と接続される。 FIG. 18 is a schematic diagram showing closed area connection of method 6. In this method, a vCPE 92 is provided in the carrier network as in method 4. A CPE 8 similar to method 1 is installed at the user base and connected to the vCPE 92.

本方式は、オンデマンドで閉域接続を構成する。スケジューラ１は、vCPE９２にキャッシュクラスタ７とCPE８からの閉域接続要求に対する待受開始を指示する。スケジューラ１は、キャッシュクラスタ７に対し、vCPE９２への閉域接続を指示する。スケジューラ１は、CPE８にvCPE９２への閉域接続を指示する。閉域接続が確立すると、キャッシュクラスタ７は、ユーザストレージ６上の学習対象データをGPUクラスタシステムに複製する。複製先は、キャッシュクラスタ７またはクラスタ共有ストレージ４のいずれかである。ジョブが完了すると、スケジューラ１は、vCPE９２およびCPE８に閉域接続の解除を指示する。 This method configures closed connections on demand. The scheduler 1 instructs the vCPE 92 to start waiting for closed connection requests from the cache cluster 7 and CPE 8. The scheduler 1 instructs the cache cluster 7 to establish a closed connection to the vCPE 92. The scheduler 1 instructs the CPE 8 to make a closed connection to the vCPE 92. When the closed connection is established, the cache cluster 7 copies the learning target data on the user storage 6 to the GPU cluster system. The replication destination is either the cache cluster 7 or the cluster shared storage 4. When the job is completed, the scheduler 1 instructs the vCPE 92 and CPE 8 to release the closed connection.

vCPE９２のインスタンスのパターンとしては、事前にデプロイしたものをプールしておき、ジョブの学習対象データのプリフェッチ開始時にユーザ拠点の最寄りのvCPE９２をアサインしてもよい。また、ジョブの学習対象データのプリフェッチ開始時に、vCPEのインスタンスをデプロイしてもよい。 As a pattern of vCPE92 instances, those deployed in advance may be pooled, and the vCPE92 closest to the user's base may be assigned when prefetching of the learning target data of a job is started. Furthermore, an instance of vCPE may be deployed when prefetching of learning target data for a job is started.
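
The two instance patterns above can be sketched as follows. This is a minimal illustration, not the patented implementation: `VcpeInstance`, `assign_vcpe`, the `distance` callback and the `deploy` callback are hypothetical names, and the text does not say how "nearest" is measured. The sketch also combines the two patterns as a fallback (pool first, then on-demand deployment), which is an assumption; the text presents them as alternatives.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class VcpeInstance:
    """Hypothetical record for one vCPE instance in the carrier network."""
    name: str
    location: str
    in_use: bool = False


def assign_vcpe(
    pool: List[VcpeInstance],
    user_site: str,
    distance: Callable[[str, str], float],
    deploy: Callable[[str], VcpeInstance],
) -> VcpeInstance:
    """Assign a vCPE when prefetching of the learning target data starts."""
    free = [v for v in pool if not v.in_use]
    if free:
        # Pattern 1: pick the pre-deployed instance nearest to the user site.
        chosen = min(free, key=lambda v: distance(v.location, user_site))
    else:
        # Pattern 2: deploy a new instance on demand (assumed fallback).
        chosen = deploy(user_site)
        pool.append(chosen)
    chosen.in_use = True
    return chosen
```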

図１９は、方式７の閉域接続を示す模式図である。本方式では、キャリア網内にあるPPPoE等をISPに中継するゲートウェイ装置（以下「GW」）９３を用いて、閉域接続を行う。本方式のGW９３には、キャッシュクラスタ７との閉域接続を行う接続部と、スケジューラ１からの制御に対応するAPI（制御部）とが追加される。通常、インターネットアクセスでは、PPPoEやDS-lite等のトンネリングプロトコルを使用してキャリア網内の中継装置を介してISPに接続される。ユーザ拠点に設置されるCPE８は、加入者側でこれらのプロトコルを終端する装置であり、殆どの場合は常時GW９３に対して閉域接続を行っている。スケジューラ１は、GW９３とキャッシュクラスタ７の間に閉域接続を確立し、GW９３にユーザストレージ６とキャッシュクラスタ７との通信を中継させる。キャッシュクラスタ７以外の装置とCPE８との通信は、通常通りISPへのトンネルに転送し、インターネットアクセス９４とする。 FIG. 19 is a schematic diagram showing the closed connection of method 7. In this method, a closed connection is made using a gateway device (hereinafter referred to as "GW") 93 in the carrier network that relays PPPoE and similar sessions to the ISP. In this method, a connection unit that makes a closed connection with the cache cluster 7 and an API (control unit) that accepts control from the scheduler 1 are added to the GW 93. Normally, Internet access uses a tunneling protocol such as PPPoE or DS-lite to connect to the ISP via a relay device within the carrier network. The CPE 8 installed at the user base is a device that terminates these protocols on the subscriber side, and in most cases it keeps a closed connection to the GW 93 at all times. The scheduler 1 establishes a closed connection between the GW 93 and the cache cluster 7, and causes the GW 93 to relay communication between the user storage 6 and the cache cluster 7. Communication between the CPE 8 and devices other than the cache cluster 7 is forwarded to the tunnel to the ISP as usual and treated as Internet access 94.

本方式は、オンデマンドで閉域接続を構成する。スケジューラ１は、閉域接続の設定時に、GW９３に対し、キャッシュクラスタ７からの閉域接続要求に対する待受開始を指示する。指示対象のGW９３は，回線識別情報等から特定する。スケジューラ１は、キャッシュクラスタ７に対しGW９３への閉域接続を要求する。閉域接続が確立すると、ユーザストレージ６とキャッシュクラスタ７の通信をGW９３が中継し通信経路が確立する。 This method configures closed connections on demand. When setting up a closed connection, the scheduler 1 instructs the GW 93 to start waiting for a closed connection request from the cache cluster 7. The GW93 to be instructed is specified from line identification information, etc. The scheduler 1 requests the cache cluster 7 to establish a closed connection to the GW 93. When the closed connection is established, the GW 93 relays communication between the user storage 6 and the cache cluster 7, and a communication path is established.

（GPUクラスタシステムの動作） (GPU cluster system operation)
以下にGPUクラスタシステムの動作について説明する。 The operation of the GPU cluster system is described below.

図２０は、図１に示す基本的なGPUクラスタシステムの動作を示すシーケンス図である。ユーザは、学習対象データをクラスタ共有ストレージ４にアップロードし（Ｓ１０１）、ジョブをスケジューラ１に登録する（Ｓ１０２）。ジョブの登録データには、ジョブの定義、学習対象データの格納場所、ユーザＩＤなどの認証情報などが含まれる。スケジューラ１は、認証情報を用いてユーザを認証するが、ここでは認証処理については省略する。 FIG. 20 is a sequence diagram showing the operation of the basic GPU cluster system shown in FIG. 1. The user uploads the learning target data to the cluster shared storage 4 (S101) and registers the job in the scheduler 1 (S102). The job registration data includes a job definition, a storage location of learning target data, authentication information such as a user ID, and the like. The scheduler 1 authenticates the user using the authentication information, but the authentication process will be omitted here.

スケジューラ１は、ジョブが登録されると、GPUの空き状況など（GPUの稼働状況）をマスタ２に確認し（Ｓ１０３）、マスタ２からGPUの空き状況などを取得する（Ｓ１０４）。スケジューラ１は、GPUの空き情報等を用いて、ジョブをスケジューリングし（Ｓ１０５）、マスタ２にジョブのデプロイを指示する（Ｓ１０６）。このデプロイ指示には、ジョブの定義、学習対象データの格納場所、認証情報などが含まれる。マスタ２は、ノード３にジョブのデプロイを指示する（Ｓ１０７）。このデプロイ指示には、ジョブの定義、学習対象データの格納場所などが含まれる。 When the job is registered, the scheduler 1 checks the GPU availability status (GPU operating status) with the master 2 (S103), and acquires the GPU availability status etc. from the master 2 (S104). The scheduler 1 schedules a job using GPU availability information (S105), and instructs the master 2 to deploy the job (S106). This deployment instruction includes job definitions, storage locations for learning target data, authentication information, etc. Master 2 instructs node 3 to deploy the job (S107). This deployment instruction includes job definitions, storage locations for learning target data, etc.

ノード３は、ジョブの実行を開始し、ジョブの仮想環境を作成する（Ｓ１０８）。具体的には、ノード３は、Network namespace等の名前空間やコンテナなどの仮想環境を生成する。また、ノード３は、学習対象データにジョブがアクセスできるように設定する。これにより、学習対象データの格納先（クラスタ共有ストレージ４）がジョブからアクセス可能になる。 Node 3 starts executing the job and creates a virtual environment for the job (S108). Specifically, the node 3 generates a name space such as a network namespace and a virtual environment such as a container. Further, the node 3 is set so that the job can access the learning target data. As a result, the storage location (cluster shared storage 4) of the learning target data becomes accessible from the job.

ジョブは、学習処理を開始し（Ｓ１０９）、学習対象データにアクセスしながら学習処理を実行する。ジョブは、学習結果をクラスタ共有ストレージ４に書き出す（Ｓ１１０）。学習結果は、逐次書き出す場合と、最後にまとめて書き出す場合とがある。ジョブは、学習処理が終了すると（Ｓ１１１）、実行完了をノード３に報告する（Ｓ１１２）。ノード３は、ジョブの仮想環境等を削除する（Ｓ１１３）。また、ノード３は、ジョブのための仮想ネットワークなども併せて削除する。ジョブの実行が完了するとノード３は、ジョブの実行完了をマスタ２に報告する（Ｓ１１４）。マスタ２は、必要に応じてユーザにジョブ完了を報告する。あるいは、ユーザがスケジューラ１またはマスタ２にジョブの完了を問い合わせてもよい。 The job starts the learning process (S109) and executes the learning process while accessing the learning target data. The job writes the learning results to the cluster shared storage 4 (S110). The learning results may be written out one after another or all at once at the end. When the learning process ends (S111), the job reports execution completion to the node 3 (S112). The node 3 deletes the virtual environment of the job, etc. (S113). Further, the node 3 also deletes the virtual network for the job. When the execution of the job is completed, the node 3 reports the completion of the job execution to the master 2 (S114). The master 2 reports job completion to the user as necessary. Alternatively, the user may inquire of the scheduler 1 or master 2 about job completion.

図２１Ａ、図２１Ｂおよび図２１Ｃは、本実施形態のGPUクラスタの動作を示すシーケンス図である。これらは、クラスタ共有ストレージ４にアップロードされた学習対象データを、キャッシュクラスタ７がフェッチして利用する場合のシーケンス図である。 FIGS. 21A, 21B, and 21C are sequence diagrams showing the operation of the GPU cluster of this embodiment. These are sequence diagrams when the cache cluster 7 fetches and uses the learning target data uploaded to the cluster shared storage 4.

ジョブ登録前に学習対象データをアップロードする場合、ユーザは、ユーザストレージ６に格納している学習対象データをクラスタ共有ストレージ４にアップロードし（Ｓ１３１）、ジョブをスケジューラ１に登録する（Ｓ１３２）。ジョブの登録データには、ジョブの定義、学習対象データの格納場所、ユーザＩＤなどの認証情報などが含まれる。学習対象データの格納場所は、事前アップロードの場合はクラスタ共有ストレージ４であり、事前アップロードしない場合はユーザストレージ６である。また、事前アップロードしない場合は、ジョブの登録データには、ユーザストレージ６への閉域接続情報などが含まれる。スケジューラ１におけるユーザの認証処理については、省略する。 When uploading the learning target data before job registration, the user uploads the learning target data stored in the user storage 6 to the cluster shared storage 4 (S131), and registers the job in the scheduler 1 (S132). The job registration data includes a job definition, a storage location of learning target data, authentication information such as a user ID, and the like. The storage location of the learning target data is the cluster shared storage 4 if pre-uploaded, and the user storage 6 if not pre-uploaded. Furthermore, if the job is not uploaded in advance, the job registration data includes closed connection information to the user storage 6 and the like. The user authentication process in the scheduler 1 will be omitted.
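
The job registration data described above can be pictured as a small record. The following is a sketch under stated assumptions: the class name `JobRegistration`, its field names, and the string values for the storage location are all hypothetical, since the text lists only what the registration data contains and not how it is encoded.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobRegistration:
    """Hypothetical shape of the data a user registers with the scheduler."""
    job_definition: str
    # "cluster_shared_storage" if pre-uploaded, "user_storage" otherwise.
    data_location: str
    auth_info: str
    # Required only when the data stays on the user storage.
    closed_connection_info: Optional[dict] = None

    def validate(self) -> None:
        """Reject a registration that lacks closed connection information."""
        needs_info = self.data_location == "user_storage"
        if needs_info and self.closed_connection_info is None:
            raise ValueError(
                "closed connection information is required when the "
                "learning target data is not uploaded in advance"
            )
```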

学習対象データを事前にアップロードしない場合は、Ｓ１３１を行うことなく、後述する「閉域接続の確立処理」Ａと、「学習対象データのクラスタ格納処理」Ｂと、「閉域接続の解除処理」Ｃとが行われる。「閉域接続の確立処理」Ａは、スケジューラ１の制御によりユーザ拠点とキャッシュクラスタ７との間に閉域接続または閉域経路を接続する。「学習対象データのクラスタ格納処理」Ｂは、ユーザ拠点とキャッシュクラスタ７との間に確立された閉域接続または閉域経路を介して、ユーザストレージ６上の学習対象データをキャッシュクラスタ７上に格納する。「閉域接続の解除処理」Ｃは、スケジューラ１の制御により、ユーザ拠点とキャッシュクラスタ７との間に確立された閉域接続または閉域経路を解除する。 If the learning target data is not uploaded in advance, S131 is skipped, and the "closed connection establishment process" A, the "cluster storage process of learning target data" B, and the "closed connection release process" C, all described later, are performed. The "closed connection establishment process" A establishes a closed connection or a closed path between the user base and the cache cluster 7 under the control of the scheduler 1. The "cluster storage process of learning target data" B stores the learning target data on the user storage 6 in the cache cluster 7 via the closed connection or closed path established between the user base and the cache cluster 7. The "closed connection release process" C releases the closed connection or closed path established between the user base and the cache cluster 7 under the control of the scheduler 1.

スケジューラ１は、キャッシュクラスタ７に学習対象データのプリフェッチを指示する（Ｓ１３３）。すなわち、スケジューラ１は、キャッシュクラスタ７上の所定の格納場所に、学習対象データを格納することを指示する。キャッシュクラスタ７は、クラスタ共有ストレージ４上の学習対象データのフェッチを開始する（Ｓ１３４）。 The scheduler 1 instructs the cache cluster 7 to prefetch the learning target data (S133). That is, the scheduler 1 instructs the cache cluster 7 to store the learning target data in a predetermined storage location. The cache cluster 7 starts fetching the learning target data on the cluster shared storage 4 (S134).

全ての学習対象データをフェッチする場合、キャッシュクラスタ７は、学習対象データのプリフェッチの完了をスケジューラ１に報告する（Ｓ１３５）。スケジューラ１はGPUの空き状況などをマスタ２に確認し（Ｓ１３６）マスタ２からGPUの空き状況などを取得する（Ｓ１３７）。 When fetching all the learning target data, the cache cluster 7 reports the completion of prefetching the learning target data to the scheduler 1 (S135). The scheduler 1 checks the GPU availability status with the master 2 (S136) and obtains the GPU availability status and the like from the master 2 (S137).

全ての学習対象データをフェッチしない場合、すなわち、全ての学習対象データのキャッシュデータを待たずに、投機的にジョブの実行を開始する場合、スケジューラ１は、プリフェッチの完了を待たずに続く処理を実行する。スケジューラ１はGPUの空き状況などをマスタ２に確認し（Ｓ１３８）、マスタ２からGPUの空き状況などを取得する（Ｓ１３９）。また、スケジューラ１はフェッチ済のデータ量をキャッシュクラスタ７に確認し（Ｓ１４０）、キャッシュクラスタ７からフェッチ済みのデータ量を取得する（Ｓ１４１）。スケジューラ１は、Ｓ１３８およびＳ１３９のGPUの空き状態の確認処理と、Ｓ１４０およびＳ１４１の学習対象データのフェッチ進捗確認処理とを並行して行ってもよい。 When not all of the learning target data is fetched before the job starts, that is, when job execution is started speculatively without waiting for all of the learning target data to be cached, the scheduler 1 performs the subsequent processing without waiting for prefetching to complete. The scheduler 1 checks the GPU availability status with the master 2 (S138), and acquires the GPU availability status etc. from the master 2 (S139). Further, the scheduler 1 asks the cache cluster 7 for the amount of data fetched so far (S140), and obtains that amount from the cache cluster 7 (S141). The scheduler 1 may perform the GPU availability check of S138 and S139 and the fetch progress check of S140 and S141 in parallel.
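
Running the two checks in parallel (S138/S139 and S140/S141) can be sketched with a thread pool. This is only an illustration: the function name, the two query callbacks, and in particular the `min_fetched_bytes` threshold are assumptions, because this passage does not state how the scheduler uses the fetched amount when deciding to start a job.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def can_start_speculatively(
    query_free_gpus: Callable[[], int],
    query_fetched_bytes: Callable[[], int],
    gpus_needed: int,
    min_fetched_bytes: int,
) -> bool:
    """Query the master and the cache cluster concurrently, then decide."""
    with ThreadPoolExecutor(max_workers=2) as executor:
        # S138/S139: GPU availability from the master (hypothetical callback).
        gpu_future = executor.submit(query_free_gpus)
        # S140/S141: fetched amount from the cache cluster (hypothetical).
        fetch_future = executor.submit(query_fetched_bytes)
        free_gpus = gpu_future.result()
        fetched = fetch_future.result()
    # Assumed rule: enough GPUs are free and enough data is already cached.
    return free_gpus >= gpus_needed and fetched >= min_fetched_bytes
```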

スケジューラ１は、GPUの空き情報等を用いて、ジョブをスケジューリングし（Ｓ１４２）、マスタ２にジョブのデプロイを指示する（Ｓ１４３）。このデプロイ指示には、ジョブの定義、学習対象データの格納場所、ユーザＩＤ等の認証情報などが含まれる。マスタ２は、ノード３にジョブのデプロイを指示する（Ｓ１４４）。このデプロイ指示には、ジョブの定義、学習対象データの格納場所などが含まれる。 The scheduler 1 schedules the job using GPU availability information and the like (S142), and instructs the master 2 to deploy the job (S143). This deployment instruction includes a job definition, a storage location of learning target data, authentication information such as a user ID, and the like. Master 2 instructs node 3 to deploy the job (S144). This deployment instruction includes job definitions, storage locations for learning target data, etc.

ノード３は、ジョブの実行を開始し、ジョブの仮想環境を作成する（Ｓ１４５）。具体的には、ノード３は、Network namespace等の名前空間やコンテナなどの仮想環境を生成する。また、ノード３は、学習対象データにジョブがアクセスできるように設定する。これにより、学習対象データの格納先（キャッシュクラスタ７）がジョブからアクセス可能になる。 Node 3 starts executing the job and creates a virtual environment for the job (S145). Specifically, the node 3 generates a name space such as a network namespace and a virtual environment such as a container. Further, the node 3 is set so that the job can access the learning target data. As a result, the storage location (cache cluster 7) of the learning target data becomes accessible from the job.

ジョブは、学習処理を開始し（Ｓ１４６）、後述する「学習処理におけるキャッシュクラスタへのデータアクセス」Ｄを行い、学習対象データにアクセスしながら学習処理を実行する。ジョブは、学習結果をキャッシュクラスタ７に書き出す（Ｓ１４７）。学習結果をキャッシュクラスタ７に書き出すことで、キャッシュ管理部７２は、透過的にクラスタ共有ストレージ４に学習結果を書き出す。また、ジョブは、学習結果を直接クラスタ共有ストレージ４に書き出してもよい。その場合、Ｓ１４５でジョブの仮想環境を作成する際に、ジョブがクラスタ共有ストレージ４にアクセスできるように設定する。 The job starts the learning process (S146), performs "data access to the cache cluster in the learning process" D, which will be described later, and executes the learning process while accessing the learning target data. The job writes the learning results to the cache cluster 7 (S147). When the learning results are written to the cache cluster 7, the cache management unit 72 transparently writes them to the cluster shared storage 4. Alternatively, the job may write the learning results directly to the cluster shared storage 4. In that case, when the virtual environment for the job is created in S145, the job is also configured so that it can access the cluster shared storage 4.

ジョブは、学習処理が終了すると（Ｓ１４８）、実行完了をノード３に報告する（Ｓ１４９）。ノード３は、ジョブの仮想環境等を削除する（Ｓ１５０）。また、ノード３は、ジョブのための仮想ネットワークなども併せて削除する。ジョブの実行が完了すると、ノード３は、ジョブの実行完了をマスタ２に報告する（Ｓ１５１）。マスタ２は、必要に応じてユーザにジョブ完了を報告する。あるいは、ユーザがスケジューラ１またはマスタ２にジョブの完了を問い合わせてもよい。 When the learning process ends (S148), the job reports execution completion to node 3 (S149). The node 3 deletes the virtual environment of the job, etc. (S150). Further, the node 3 also deletes the virtual network for the job. When the execution of the job is completed, the node 3 reports the completion of the job execution to the master 2 (S151). The master 2 reports job completion to the user as necessary. Alternatively, the user may inquire of the scheduler 1 or master 2 about job completion.

スケジューラ１は、GPUの空き状況およびジョブの完了状況をマスタ２に確認し（Ｓ１５２）、マスタ２からこれらの情報を取得する（Ｓ１５３）。スケジューラ１は、学習対象データのキャッシュデータ等の削除をキャッシュクラスタ７に指示する（Ｓ１５４）。キャッシュクラスタ７は、キャッシュデータ等を削除する（Ｓ１５５）。キャッシュクラスタ７は、学習結果が一時的に格納された場合、学習結果も削除する。キャッシュクラスタ７は、削除処理に合わせて、ジョブからの書き出しデータのクラスタ共有ストレージ４への書き戻しを実行する。キャッシュクラスタ７は、削除完了をスケジューラ１に報告する（Ｓ１５６）。 The scheduler 1 checks the GPU availability status and job completion status with the master 2 (S152), and acquires this information from the master 2 (S153). The scheduler 1 instructs the cache cluster 7 to delete the cache data of the learning target data (S154). The cache cluster 7 deletes cache data and the like (S155). The cache cluster 7 also deletes the learning results when the learning results are temporarily stored. The cache cluster 7 writes back the data written from the job to the cluster shared storage 4 in accordance with the deletion process. The cache cluster 7 reports completion of deletion to the scheduler 1 (S156).
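
The cache cleanup of S154 to S156, including the write-back of job output, can be sketched as follows. This is a simplified model under stated assumptions: plain dictionaries stand in for the cache cluster 7 and the cluster shared storage 4, and `dirty_keys` (the set of entries written by the job and not yet written back) is a hypothetical bookkeeping structure that the text does not describe.

```python
from typing import Dict, Set


def delete_cache_with_write_back(
    cache: Dict[str, bytes],
    dirty_keys: Set[str],
    shared_storage: Dict[str, bytes],
) -> int:
    """Write back job output, then delete every cached entry.

    Returns the number of entries written back to shared storage.
    """
    written = 0
    for key in sorted(dirty_keys):
        if key in cache:
            # Write-back happens before deletion so no job output is lost.
            shared_storage[key] = cache[key]
            written += 1
    # S155: delete cached learning data and temporarily stored results.
    cache.clear()
    dirty_keys.clear()
    return written
```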

図２２Ａ、図２２Ｂおよび図２２Ｃは、本実施形態のGPUクラスタの動作を示すシーケンス図である。ここでは、ユーザストレージ６上の学習対象データを、キャッシュクラスタ７が直接フェッチして利用する場合のシーケンスを説明する。 22A, 22B, and 22C are sequence diagrams showing the operation of the GPU cluster of this embodiment. Here, a sequence will be described in which the cache cluster 7 directly fetches and uses the learning target data on the user storage 6.

ユーザは、ジョブをスケジューラ１に登録する（Ｓ１６１）。ジョブの登録データには、ジョブの定義、学習対象データの格納場所（ユーザストレージ６）、ユーザＩＤなどの認証情報などが含まれる。ジョブの登録データには、ユーザストレージ６への閉域接続情報などが含まれる。閉域接続情報については後述する。スケジューラ１の認証処理については省略する。次に、後述する「閉域接続の確立処理」Ａが行われる。「閉域接続の確立処理」Ａは、スケジューラ１の制御によりユーザ拠点とキャッシュクラスタ７との間に閉域接続または閉域経路を接続する。確立した閉域接続または閉域経路を介して、キャッシュクラスタ７からユーザ拠点のユーザストレージ６上の学習データにアクセス可能となる。 The user registers the job in the scheduler 1 (S161). The job registration data includes a job definition, a storage location for learning target data (user storage 6), authentication information such as a user ID, and the like. The job registration data includes closed connection information to the user storage 6 and the like. The closed connection information will be described later. The authentication process of the scheduler 1 will be omitted. Next, “closed connection establishment processing” A, which will be described later, is performed. "Closed connection establishment process" A connects a closed connection or a closed path between the user base and the cache cluster 7 under the control of the scheduler 1. The learning data on the user storage 6 at the user base can be accessed from the cache cluster 7 via the established closed connection or closed path.

スケジューラ１は、キャッシュクラスタ７に学習対象データのプリフェッチを指示する（Ｓ１６２）。すなわち、スケジューラ１は、キャッシュクラスタ７上の所定の格納場所に、学習対象データを格納することを指示する。キャッシュクラスタ７は、閉域接続または閉域経路を介して、ユーザストレージ６上の学習対象データのフェッチを開始する（Ｓ１６３）。Ｓ１６４からＳ１７１の処理は、図２１ＢのＳ１３５からＳ１４１の処理と同じであるため、ここでは説明を省略する。 The scheduler 1 instructs the cache cluster 7 to prefetch the learning target data (S162). That is, the scheduler 1 instructs the cache cluster 7 to store the learning target data in a predetermined storage location. The cache cluster 7 starts fetching the learning target data on the user storage 6 via a closed connection or a closed path (S163). The processing from S164 to S171 is the same as the processing from S135 to S141 in FIG. 21B, so a description thereof will be omitted here.

そして、図２２ＢのＳ１７２からＳ１８１の処理が行われるが、この処理は、図２１ＣのＳ１４２からＳ１５１の処理と同じであるため、ここでは説明を省略する。そして、図２２Ｂで、スケジューラ１は、GPUの空き状況およびジョブの完了状況をマスタ２に確認し（Ｓ１８２）、マスタ２からこれらの情報を取得する（Ｓ１８３）。そして、後述する「閉域接続の解除処理」Ｃが行われる。「閉域接続の解除処理」は、スケジューラ１の制御により、ユーザ拠点とキャッシュクラスタ７との間に確立された閉域接続または閉域経路を解除する。スケジューラ１は、学習対象データのキャッシュデータ等の削除をキャッシュクラスタ７に指示する（Ｓ１８４）。キャッシュクラスタ７は、キャッシュデータ等を削除する（Ｓ１８５）。キャッシュクラスタ７は、学習結果が一時的に格納された場合、学習結果も削除する。キャッシュクラスタ７は、削除処理に合わせて、ジョブからの書き出しデータのクラスタ共有ストレージ４への書き戻しを実行する。キャッシュクラスタ７は、削除完了をスケジューラ１に報告する（Ｓ１８６）。 Then, the processing from S172 to S181 in FIG. 22B is performed, but since this processing is the same as the processing from S142 to S151 in FIG. 21C, the explanation will be omitted here. Then, in FIG. 22B, the scheduler 1 checks the GPU availability status and job completion status with the master 2 (S182), and acquires this information from the master 2 (S183). Then, "closed connection release processing" C, which will be described later, is performed. The “closed connection release process” releases the closed connection or closed route established between the user base and the cache cluster 7 under the control of the scheduler 1. The scheduler 1 instructs the cache cluster 7 to delete the cache data of the learning target data (S184). The cache cluster 7 deletes cache data and the like (S185). The cache cluster 7 also deletes the learning results when the learning results are temporarily stored. The cache cluster 7 writes back the data written from the job to the cluster shared storage 4 in accordance with the deletion process. The cache cluster 7 reports completion of deletion to the scheduler 1 (S186).

図２３は、「閉域接続の確立処理」Ａの動作を示すシーケンス図である。ここでは、図１４に示す方式２の閉域接続の確立処理を説明する。ユーザ拠点には、CPE８が配置されており、CPE８とキャッシュクラスタ７との間で閉域接続を確立する。そのため、CPE８がAPIを公開していない場合は、スケジューラ１がAPIでCPE８を制御している部分については、ユーザが当該部分を設定する。CPE８は、キャリア網内にデプロイされているvCPEに置き換わる場合もある。 FIG. 23 is a sequence diagram showing the operation of the "closed connection establishment process" A. Here, the closed connection establishment process of method 2 shown in FIG. 14 is described. A CPE 8 is installed at the user base, and a closed connection is established between the CPE 8 and the cache cluster 7. Therefore, if the CPE 8 does not expose an API, the user configures the parts that the scheduler 1 would otherwise control through the API. The CPE 8 may be replaced by a vCPE deployed in the carrier network.

本処理の前提として、スケジューラ１にジョブが登録されている。ジョブの登録データに含まれる、ユーザストレージ６への閉域接続情報には、「CPEとの閉域接続の情報」と、「CPEのAPIへの接続情報」とが含まれる。ただし、CPE８がAPIを公開していなく、ユーザがCPE８の設定を行う場合は、閉域接続情報には「CPEのAPIへの接続情報」は含まれない。以下に、本処理を説明する。 As a premise of this process, a job is registered in the scheduler 1. The closed connection information to the user storage 6 included in the job registration data includes "information about the closed connection with the CPE" and "information about the connection to the API of the CPE." However, if the CPE8 does not publish its API and the user configures the CPE8, the closed connection information does not include "CPE connection information to the API." This process will be explained below.

スケジューラ１は、キャッシュクラスタ７に閉域接続の待ち受けを指示する（Ｓ１９１）。この指示には、CPE８との閉域接続の情報が含まれる。閉域接続確立後、キャッシュクラスタ７が自律的に学習対象データの取得制御を行う場合は、「学習対象データの格納場所」についても閉域接続の待ち受け指示で渡される。キャッシュクラスタ７は、閉域接続を待ち受ける設定を行う（Ｓ１９２）。これにより、閉域接続待ち受け状態が確立する。キャッシュクラスタ７は、閉域接続待ち受け処理の完了をスケジューラ１に報告する（Ｓ１９３）。キャッシュクラスタ７への閉域接続の情報は、Ｓ１９１で生成される。キャッシュクラスタ７への閉域接続の情報は、CPE８がAPIを公開していなく、ユーザがCPE８の設定を行う場合は、ジョブの登録より前段階でのユーザと事業者間での契約手続き等の事前折衝で決定され、ユーザに通知される。 The scheduler 1 instructs the cache cluster 7 to wait for a closed connection (S191). This instruction includes information on the closed connection with the CPE 8. If the cache cluster 7 autonomously controls acquisition of the learning target data after the closed connection is established, the "storage location of the learning target data" is also passed in the closed connection standby instruction. The cache cluster 7 performs settings to wait for a closed connection (S192). This establishes a closed connection standby state. The cache cluster 7 reports the completion of the closed connection standby process to the scheduler 1 (S193). Information on the closed connection to the cache cluster 7 is generated in S191. If the CPE 8 does not expose an API and the user configures the CPE 8, the information on the closed connection to the cache cluster 7 is instead determined before job registration, through prior arrangements such as contract procedures between the user and the operator, and the user is notified of it.

スケジューラ１は、閉域接続の確立をCPE８に指示する（Ｓ１９４）。CPE８は、閉域接続を設定し（Ｓ１９５）、キャッシュクラスタ７との閉域接続を開始する（Ｓ１９６）。CPE８がAPIを公開していない場合においては、ユーザからのジョブ登録と、ユーザによるCPE８への閉域接続確立のための設定が非同期で実施される。そのため、閉域接続の開始処理は閉域接続が確立するまでCPE８により繰り返し試行される。キャッシュクラスタ７は、CPE８に閉域接続を受諾する（Ｓ１９７）。これにより、閉域接続が確立される。CPE８は閉域接続の完了をスケジューラ１に報告する（Ｓ１９８）。以降、確立された閉域接続を介することで、キャッシュクラスタ７もしくはクラスタ共有ストレージ４からユーザ拠点のユーザストレージ６上の学習対象データにアクセス可能となる。 The scheduler 1 instructs the CPE 8 to establish a closed connection (S194). The CPE 8 sets up a closed connection (S195) and starts a closed connection with the cache cluster 7 (S196). If the CPE 8 does not expose an API, job registration by the user and the user's configuration of the CPE 8 for establishing the closed connection are performed asynchronously. Therefore, the CPE 8 repeatedly attempts to start the closed connection until it is established. The cache cluster 7 notifies the CPE 8 that it accepts the closed connection (S197). This establishes the closed connection. The CPE 8 reports completion of the closed connection to the scheduler 1 (S198). Thereafter, the learning target data on the user storage 6 at the user base can be accessed from the cache cluster 7 or the cluster shared storage 4 via the established closed connection.
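
The repeated connection attempts by the CPE 8 can be sketched as a retry loop. This is an illustration only: the text says the start process is repeated until the closed connection is established, and the bounded `max_attempts`, the fixed `interval_s`, and the `try_connect` callback are assumptions added to keep the sketch terminating and testable.

```python
import time
from typing import Callable


def connect_with_retry(
    try_connect: Callable[[], bool],
    max_attempts: int = 5,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Retry S196 until the cache cluster accepts; return attempts used."""
    for attempt in range(1, max_attempts + 1):
        # try_connect is a hypothetical stand-in for one S196/S197 exchange.
        if try_connect():
            return attempt
        if attempt < max_attempts:
            # The listener may not exist yet because job registration and
            # CPE configuration happen asynchronously; wait and try again.
            sleep(interval_s)
    raise TimeoutError("closed connection was not established")
```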

図２４は、「閉域接続の解除処理」Ｃの動作を示すシーケンス図である。ここでは、図１４に示す方式２の閉域接続の解除処理を説明する。ユーザ拠点のCPE８とキャッシュクラスタ７との間で閉域接続が確立されておりこれを解除する。そのため、CPE８は、閉域接続制御のためのAPIを公開していることとする。APIを公開していない場合は、スケジューラ１がAPIでCPE８を制御している部分については、ユーザが当該部分を設定する。CPE８は、キャリア網内にデプロイされているvCPEに置き換わる場合もある。 FIG. 24 is a sequence diagram showing the operation of the "closed connection release process" C. Here, the closed connection release process of method 2 shown in FIG. 14 is described. A closed connection has been established between the CPE 8 at the user base and the cache cluster 7, and this connection is released. For this purpose, the CPE 8 is assumed to expose an API for closed connection control. If it does not expose an API, the user configures the parts that the scheduler 1 would otherwise control through the API. The CPE 8 may be replaced by a vCPE deployed in the carrier network.

本処理の前提として、スケジューラ１にジョブが登録されている。ジョブの登録データに含まれる、ユーザストレージ６への閉域接続情報には、「CPEとの閉域接続の情報」と、「CPEのAPIへの接続情報」とが含まれる。ただし、CPE８がAPIを公開していなく、ユーザがCPE８の設定を行う場合は、閉域接続情報には「CPEのAPIへの接続情報」は含まれない。スケジューラ１の制御によりCPE８とキャッシュクラスタ７の間で閉域接続が確立される。スケジューラ１の制御により、キャッシュクラスタ７が学習対象データのフェッチを開始する。スケジューラ１の制御によりジョブがデプロイされ、学習を開始する。デプロイされるタイミングとしては、キャッシュクラスタ７で学習対象データを全てフェッチしてからの場合と、フェッチを継続している場合とが存在する。ジョブが完了し、ジョブの実行完了をスケジューラ１が検知する。以下に、本処理を説明する。 As a premise of this process, a job is registered in the scheduler 1. The closed connection information to the user storage 6 included in the job registration data includes "information about the closed connection with the CPE" and "information about the connection to the API of the CPE." However, if the CPE8 does not publish its API and the user configures the CPE8, the closed connection information does not include "CPE connection information to the API." A closed connection is established between the CPE 8 and the cache cluster 7 under the control of the scheduler 1. Under the control of the scheduler 1, the cache cluster 7 starts fetching the learning target data. The job is deployed under the control of the scheduler 1 and learning begins. There are two deployment timings: after the cache cluster 7 has fetched all the learning target data, and when fetching is continuing. The job is completed, and the scheduler 1 detects the completion of job execution. This process will be explained below.

キャッシュクラスタ７への閉域接続の情報は、図２３の「閉域接続の確立処理」のＳ１９３で前述したとおりである。スケジューラ１は、CPE８に閉域接続の解除を指示する（Ｓ２０１）。CPE８は、キャッシュクラスタ７に対して閉域接続の解除を開始する（Ｓ２０２）。キャッシュクラスタ７は、CPE８に閉域接続の解除を受諾する（Ｓ２０３）。これにより、CPE８とキャッシュクラスタ７との間の閉域接続が解除される。CPE８は、閉域接続を削除し（Ｓ２０４）、閉域接続の解除完了をスケジューラ１に報告する（Ｓ２０５）。 The information on the closed connection to the cache cluster 7 is as described above for S193 of the "closed connection establishment process" in FIG. 23. The scheduler 1 instructs the CPE 8 to release the closed connection (S201). The CPE 8 starts releasing the closed connection to the cache cluster 7 (S202). The cache cluster 7 notifies the CPE 8 that it accepts the release of the closed connection (S203). As a result, the closed connection between the CPE 8 and the cache cluster 7 is released. The CPE 8 deletes the closed connection (S204) and reports completion of the closed connection release to the scheduler 1 (S205).

スケジューラ１は、キャッシュクラスタ７に閉域接続の待ち受け解除を指示する（Ｓ２０６）。この指示には、CPE８との閉域接続の情報が含まれる。キャッシュクラスタ７は、閉域接続の待ち受ける設定を削除し（Ｓ２０７）、閉域接続の待ち受け解除をスケジューラ１に報告する（Ｓ２０８）。 The scheduler 1 instructs the cache cluster 7 to cancel standby for the closed connection (S206). This instruction includes information on closed connection with CPE8. The cache cluster 7 deletes the closed connection standby setting (S207), and reports the cancellation of the closed connection standby to the scheduler 1 (S208).

CPE８がAPIを公開していなく、ユーザがCPE８の設定を行う場合は、Ｓ２０６－Ｓ２０８（キャッシュクラスタの閉域接続の待ち受け解除処理）が、Ｓ２０１－Ｓ２０５（CPEの閉域接続の解除処理）より前に実行される可能性がある。その場合、「CPEの閉域接続の解除処理」では、閉域接続は既に解除されているため、Ｓ２０２（閉域接続の解除を開始）を実行せず、Ｓ２０４（閉域接続を削除）を実行する。一方、「キャッシュクラスタの閉域接続の待ち受け解除処理」では、Ｓ２０７に伴い、閉域接続の解除を開始し、CPE８から閉域接続の解除を受諾する。 If the CPE 8 does not expose an API and the user configures the CPE 8, S206 to S208 (the cache cluster's closed connection standby release process) may be executed before S201 to S205 (the CPE's closed connection release process). In that case, because the closed connection has already been released, the CPE's closed connection release process does not execute S202 (start releasing the closed connection) and executes S204 (delete the closed connection). Meanwhile, in the cache cluster's closed connection standby release process, the cache cluster 7 starts releasing the closed connection as part of S207 and receives acceptance of the release from the CPE 8.
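
The order-independent release described above can be sketched with a small state model. This is an illustrative sketch, not the patented implementation: the `ClosedConnection` class, its fields, and the returned step labels are hypothetical, and the label "release" for the release started alongside S207 is an assumption since the text gives that exchange no step number.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClosedConnection:
    """Hypothetical shared view of the connection between CPE and cache."""
    connected: bool = True
    cpe_config_present: bool = True
    cache_listener_present: bool = True


def release_on_cpe(conn: ClosedConnection) -> List[str]:
    """CPE-side release (S201 to S205); skips S202 if already released."""
    steps = []
    if conn.connected:
        conn.connected = False
        steps.append("S202")
    conn.cpe_config_present = False
    steps.append("S204")
    return steps


def release_on_cache_cluster(conn: ClosedConnection) -> List[str]:
    """Cache-side standby release (S206 to S208)."""
    steps = []
    if conn.connected:
        # Standby is released first, so the cache cluster starts the release.
        conn.connected = False
        steps.append("release")
    conn.cache_listener_present = False
    steps.append("S207")
    return steps
```

Either call order leaves the connection released and both sides' settings removed, which is the property the paragraph above relies on.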

図２５は、「閉域接続の確立処理」Ａの動作を示すシーケンス図である。ここでは、図１９に示す方式７の閉域接続の確立処理を説明する。ユーザ拠点には、CPE８が配置されており、キャリア網のGW９３がCPE８への接続インタフェースを保持している。CPE８とGW９３との間は、事前にPPPoE等により閉域接続が確立済みであり、さらにGW９３とキャッシュクラスタ７との間で閉域接続を確立することで、GW９３は、２つの閉域接続を中継する閉域経路を生成する。この閉域経路を介することで、キャッシュクラスタ７およびクラスタ共有ストレージ４からCPE８配下のユーザストレージ６にアクセス可能な状態になる。この制御のためにGW９３は閉域接続制御のためのAPIを保持している。 FIG. 25 is a sequence diagram showing the operation of "closed connection establishment process" A. Here, the closed connection establishment process of method 7 shown in FIG. 19 will be described. A CPE 8 is installed at the user base, and a GW 93 of the carrier network holds a connection interface to the CPE 8. A closed connection has already been established between the CPE 8 and the GW 93 using PPPoE, etc., and by establishing a closed connection between the GW 93 and the cache cluster 7, the GW 93 can create a closed connection that relays the two closed connections. Generate a route. By using this closed path, the user storage 6 under the CPE 8 can be accessed from the cache cluster 7 and the cluster shared storage 4. For this control, the GW93 maintains an API for closed connection control.

本処理の前提として、CPE８とGW９３との間で、PPPoE等により閉域接続が確立済みで、この閉域接続を介してCPE８はインターネットに接続可能である。スケジューラ１にジョブが登録される。ジョブの登録データに含まれる、ユーザストレージ６への閉域接続情報には、「回線識別情報」（CPE８が接続するGW９３の識別等に用いられる）が含まれる。以下に、本処理を説明する。 As a premise of this process, a closed connection has already been established between the CPE 8 and the GW 93 by PPPoE or the like, and the CPE 8 can connect to the Internet via this closed connection. A job is registered in the scheduler 1. The closed connection information to the user storage 6 included in the job registration data includes "line identification information" (used, for example, to identify the GW 93 to which the CPE 8 connects). This process is described below.

スケジューラ１は、CPE８が接続するGW９３を特定する（Ｓ２１１）。スケジューラ１は、GW９３に閉域接続の待ち受け設定と、閉域接続の中継設定とを指示する（Ｓ２１２）。この中継設定は、キャッシュクラスタ７との閉域接続確立後、CPE８とGW９３間の閉域接続と、GW９３とキャッシュクラスタ７間の閉域接続をルーティング、スイッチ等により中継して論理的なCPE８とキャッシュクラスタ７間の閉域経路を生成するための設定である。この閉域経路を利用することで、キャッシュクラスタ７およびクラスタ共有ストレージ４とCPE８配下のユーザストレージ６とは互いに接続可能となる。GW９３では、CPE８配下からのトラフィックについて、キャッシュクラスタ７もしくはクラスタ共有ストレージ４宛てのみのデータを閉域経路へ転送する。CPE８配下からインターネット接続と共用可能である。 The scheduler 1 identifies the GW 93 to which the CPE 8 connects (S211). The scheduler 1 instructs the GW 93 to configure closed connection standby and closed connection relay (S212). The relay setting applies after the closed connection with the cache cluster 7 is established: the GW 93 relays, by routing, switching, or the like, between the closed connection from the CPE 8 to the GW 93 and the closed connection from the GW 93 to the cache cluster 7, which creates a logical closed path between the CPE 8 and the cache cluster 7. By using this closed path, the cache cluster 7 and the cluster shared storage 4 can communicate with the user storage 6 under the CPE 8. For traffic from under the CPE 8, the GW 93 forwards only data addressed to the cache cluster 7 or the cluster shared storage 4 to the closed path. The line can therefore be shared with Internet access from under the CPE 8.
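
The forwarding decision at the GW 93 can be sketched as a destination-prefix check. This is a minimal sketch under stated assumptions: the function name `select_path`, the path labels, and the use of IP prefixes to recognise the cache cluster 7 and the cluster shared storage 4 are all hypothetical, since the text mentions only routing, switching, or the like. The addresses used in the usage example are documentation-only ranges.

```python
import ipaddress
from typing import Iterable


def select_path(dst_ip: str, closed_prefixes: Iterable[str]) -> str:
    """Choose the outgoing path for a packet arriving from under the CPE."""
    addr = ipaddress.ip_address(dst_ip)
    for prefix in closed_prefixes:
        # Traffic for the cache cluster or shared storage uses the closed path.
        if addr in ipaddress.ip_network(prefix):
            return "closed_path"
    # Everything else goes to the ISP tunnel as ordinary Internet access.
    return "isp_tunnel"
```

For example, with the hypothetical prefix `192.0.2.0/24` assigned to the cache cluster, `select_path("192.0.2.10", ["192.0.2.0/24"])` selects the closed path, while any other destination is forwarded to the ISP tunnel.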

GW９３は、閉域接続の待ち受け設定と、閉域接続の中継設定とを実施する（Ｓ２１３）。これにより、閉域接続の待ち受けと、閉域接続中継の待機状態が確立する。GW９３は、閉域接続の待ち受け設定と、閉域接続の中継設定の完了をスケジューラ１に報告する（Ｓ２１４）。この報告には、「GWへの閉域接続の情報」が含まれる。スケジューラ１は、閉域接続の確立をキャッシュクラスタ７に指示する（Ｓ２１５）。この指示には、「GWへの閉域接続の情報」が含まれる。閉域接続確立後、キャッシュクラスタ７が自律的に学習対象データの取得制御を行う場合は、「学習対象データの格納場所」についても閉域接続の待ち受け指示で渡される。キャッシュクラスタ７は、閉域接続を設定し（Ｓ２１６）、閉域接続の開始をGW９３に通知する（Ｓ２１７）。GW９３は、キャッシュクラスタ７に閉域接続を受諾する（Ｓ２１８）。これにより、閉域接続が確立される。GW９３による閉域接続の中継により、CPE８とキャッシュクラスタ７間の閉域経路が確立される。キャッシュクラスタ７は、閉域接続の確立完了をスケジューラ１に報告する（Ｓ２１９）。以降、確立された閉域経路を介することで、キャッシュクラスタ７もしくはクラスタ共有ストレージ４からユーザ拠点のユーザストレージ６上の学習対象データにアクセス可能となる。 The GW 93 performs the closed connection standby setting and the closed connection relay setting (S213). This puts the GW 93 in a state of waiting for a closed connection and standing by to relay it. The GW 93 reports completion of the closed connection standby setting and the closed connection relay setting to the scheduler 1 (S214). This report includes "information on the closed connection to the GW". The scheduler 1 instructs the cache cluster 7 to establish a closed connection (S215). This instruction includes "information on the closed connection to the GW". If the cache cluster 7 autonomously controls acquisition of the learning target data after the closed connection is established, the "storage location of the learning target data" is also passed in the closed connection standby instruction. The cache cluster 7 sets up a closed connection (S216) and notifies the GW 93 of the start of the closed connection (S217). The GW 93 notifies the cache cluster 7 that it accepts the closed connection (S218). This establishes the closed connection. A closed path between the CPE 8 and the cache cluster 7 is then established by the GW 93 relaying the closed connections. The cache cluster 7 reports completion of the closed connection establishment to the scheduler 1 (S219). Thereafter, the learning target data on the user storage 6 at the user base can be accessed from the cache cluster 7 or the cluster shared storage 4 via the established closed path.

図２６は、「閉域接続の解除処理」Ｃの動作を示すシーケンス図である。ここでは、図１９に示す方式７の閉域接続の解除処理を説明する。GW９３とキャッシュクラスタ７間で、閉域接続が確立されており、さらにGW９３により、CPE８とキャッシュクラスタ７間に閉域経路が確立されている。ここで、GW９３とキャッシュクラスタ７間の閉域接続を解除することで、CPE８とキャッシュクラスタ７間の閉域経路も併せて解除する。この制御のためにGW９３は閉域接続制御のためのAPIを保持している。 FIG. 26 is a sequence diagram showing the operation of "closed connection release processing" C. Here, the closed connection release process of method 7 shown in FIG. 19 will be described. A closed connection has been established between the GW 93 and the cache cluster 7, and a closed path has been established between the CPE 8 and the cache cluster 7 by the GW 93. Here, by canceling the closed connection between the GW 93 and the cache cluster 7, the closed path between the CPE 8 and the cache cluster 7 is also canceled. For this control, the GW93 maintains an API for closed connection control.

本処理の前提として、CPE８とGW９３との間で、PPPoE等により閉域接続が確立済みで、この閉域接続を介してCPE８はインターネットに接続可能である。スケジューラ１にジョブが登録される。ジョブの登録データに含まれる、ユーザストレージ６への閉域接続情報には「回線識別情報」が含まれる。スケジューラ１の制御により、GW９３とクラスタ共有ストレージ４間に閉域接続が確立される。合わせて、GW９３により、CPE８とキャッシュクラスタ７間に閉域経路が確立される。スケジューラ１の制御により、キャッシュクラスタ７が学習対象データのフェッチを開始する。スケジューラ１の制御により、ジョブがデプロイされ学習を開始する。デプロイされるタイミングとしては、キャッシュクラスタ７で学習対象データを全てフェッチしてからの場合と、フェッチを継続している場合とが存在する。以下に、本処理を説明する。 As a premise of this process, a closed connection has been established between the CPE 8 and the GW 93 by PPPoE or the like, and the CPE 8 can be connected to the Internet via this closed connection. A job is registered in scheduler 1. The closed connection information to the user storage 6 included in the job registration data includes "line identification information." Under the control of the scheduler 1, a closed connection is established between the GW 93 and the cluster shared storage 4. At the same time, a closed path is established between the CPE 8 and the cache cluster 7 by the GW 93. Under the control of the scheduler 1, the cache cluster 7 starts fetching the learning target data. Under the control of the scheduler 1, the job is deployed and learning starts. There are two deployment timings: after the cache cluster 7 has fetched all the learning target data, and when fetching is continuing. This process will be explained below.

スケジューラ１は、CPE８に閉域接続の解除を指示する（Ｓ２３１）。この指示には、「GW９３への閉域接続の情報」が含まれる。キャッシュクラスタ７は、GW９３との閉域接続の解除を開始する（Ｓ２３２）。GW９３は、閉域接続の解除をキャッシュクラスタ７に受諾する（Ｓ２３３）。これにより、閉域接続が解除され、CPE８とキャッシュクラスタ７間の閉域経路が解除される。キャッシュクラスタ７は、閉域接続を削除し（Ｓ２３４）、閉域接続の解除完了をスケジューラ１に報告する（Ｓ２３５）。スケジューラ１は、閉域接続待ち受け設定の削除と、閉域接続の中継設定の削除をGW９３に指示する（Ｓ２３６）。GW９３は、閉域接続待ち受け設定を削除し、閉域接続の中継設定を削除する（Ｓ２３７）。GW９３は、Ｓ２３７の削除完了をスケジューラ１に報告する（Ｓ２３８）。 The scheduler 1 instructs the CPE 8 to release the closed connection (S231). This instruction includes "information on closed connection to GW93". The cache cluster 7 starts canceling the closed connection with the GW 93 (S232). The GW 93 accepts the cancellation of the closed connection from the cache cluster 7 (S233). As a result, the closed connection is released, and the closed path between the CPE 8 and the cache cluster 7 is released. The cache cluster 7 deletes the closed connection (S234) and reports completion of the closed connection release to the scheduler 1 (S235). The scheduler 1 instructs the GW 93 to delete the closed connection standby setting and the closed connection relay setting (S236). The GW 93 deletes the closed connection standby setting and deletes the closed connection relay setting (S237). The GW 93 reports the completion of deletion in S237 to the scheduler 1 (S238).

図２７および図２８は、「学習対象データのクラスタ格納処理」Ｂを示すシーケンス図である。本処理では、ユーザストレージ６上の学習対象データをクラスタ共有ストレージ４上に格納する。 FIGS. 27 and 28 are sequence diagrams showing "cluster storage processing of learning target data" B. In this process, the learning target data on the user storage 6 is stored on the cluster shared storage 4.

In FIG. 27, the cache cluster 7 repeatedly reads a block of the learning target data from the user storage 6 and writes (copies) that block to the cluster shared storage 4. A block is a part of the learning target data, for example a set of one or more files, or a fixed-size portion of a single file. A closed connection or a closed path is established between the cache cluster 7 and the CPE 8, and the cache cluster 7 accesses the learning target data in the user storage 6 through whichever is established. Upon detecting that the closed connection or closed path has been established, the cache cluster 7 starts the storage process autonomously. The CPE 8 may be replaced by a vCPE located within the carrier network.

本処理の前提として、スケジューラ１にジョブが登録される。スケジューラ１の制御により、CPE８とキャッシュクラスタ７との間に閉域接続または閉域経路が確立される。キャッシュクラスタ７は、自律的に格納処理を開始するために、閉域接続または閉域経路の確立処理の中で「学習対象データの格納場所」がスケジューラ１からキャッシュクラスタ７に渡されている。以下に、本処理を説明する。 As a premise of this process, a job is registered in the scheduler 1. Under the control of the scheduler 1, a closed connection or a closed path is established between the CPE 8 and the cache cluster 7. In order for the cache cluster 7 to autonomously start the storage process, the "storage location of the learning target data" is passed from the scheduler 1 to the cache cluster 7 during the closed connection or closed route establishment process. This process will be explained below.

Triggered by the establishment of the closed connection, the cache cluster 7 reads the learning target data block by block from the user storage 6 via the closed connection or the closed path (S251), and writes each block it has read to the cluster shared storage 4 (S252). The cache cluster 7 repeats S251 and S252 until all of the learning target data is stored in the cluster shared storage 4. After storing all of the learning target data, the cache cluster 7 notifies the scheduler 1 that acquisition of the learning target data is complete (S253). This notification includes the storage location of the learning target data on the cluster shared storage 4.
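The S251 to S253 loop can be illustrated with a minimal Python sketch. The function names, the callback parameters, and the `shared://job-1` location string are hypothetical and introduced only for illustration; the embodiment does not prescribe any particular interface.

```python
from typing import Callable, Iterator


def split_into_blocks(data: bytes, block_size: int) -> Iterator[bytes]:
    """Yield fixed-size portions of the data (one possible meaning of 'block')."""
    for offset in range(0, len(data), block_size):
        yield data[offset:offset + block_size]


def store_to_cluster(read_blocks: Iterator[bytes],
                     write_block: Callable[[bytes], None],
                     notify_scheduler: Callable[[str], None],
                     location: str) -> int:
    """Copy learning target data block by block, then report completion."""
    copied = 0
    for block in read_blocks:   # S251: read one block over the closed path
        write_block(block)      # S252: write it to the cluster shared storage
        copied += len(block)
    notify_scheduler(location)  # S253: report completion with the storage location
    return copied
```

In this sketch the completion notice is sent only after the iterator is exhausted, which mirrors the text: S251 and S252 repeat until everything is stored, and S253 follows once.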

図２８では、キャッシュクラスタ７がクラスタ共有ストレージ４に対し、ユーザストレージ６上の学習対象データの取得を指示する。前提条件などについては、図２７と同じであるため、ここでは説明を省略する。以下に、本処理を説明する。 In FIG. 28, the cache cluster 7 instructs the cluster shared storage 4 to acquire learning target data on the user storage 6. Since the preconditions and the like are the same as those in FIG. 27, their explanation will be omitted here. This process will be explained below.

Triggered by the establishment of the closed connection, the cache cluster 7 instructs the cluster shared storage 4 to acquire the learning target data on the user storage 6 (S271). This instruction includes the "storage location of the learning target data". The cluster shared storage 4 acquires the learning target data from the user storage 6 via the closed connection or the closed path (S272). As a result, the learning target data in the user storage 6 is stored in the cluster shared storage 4 via the closed connection or the closed path. The cluster shared storage 4 reports completion of acquisition of the learning target data to the scheduler 1 (S273). This report includes the storage location of the learning target data on the cluster shared storage 4.

FIG. 29 is a sequence diagram showing "data access processing to the cache cluster during learning processing" D. This process applies when a job has been speculatively deployed and has started learning before caching (fetching) of the learning target data into the cache cluster 7 is fully complete. The learning target data that the cache cluster 7 is to cache is stored in the user storage 6 or the cache cluster 7.

本処理の前提として、スケジューラ１にジョブが登録される。スケジューラ１の制御により、キャッシュクラスタ７が学習対象データのキャッシュを開始する。ジョブがデプロイされる。キャッシュクラスタ７の学習対象データにジョブからアクセスが可能になる。ジョブが、学習処理を開始する（Ｓ２９１）。ジョブは、学習対象データにアクセスしながら学習処理を行う。以下に、本処理を説明する。 As a premise of this process, a job is registered in the scheduler 1. Under the control of the scheduler 1, the cache cluster 7 starts caching the learning target data. The job is deployed. The learning target data in the cache cluster 7 can be accessed from the job. The job starts learning processing (S291). The job performs learning processing while accessing learning target data. This process will be explained below.

ジョブは、学習対象データのブロック単位での読み込みをキャッシュクラスタ７に要求する（Ｓ２９２）。キャッシュクラスタ７は、キャッシュミスが発生した場合（Ｓ２９３）、学習対象データへの透過的接続（Ｓ２９４、Ｓ２９５）と、学習対象データのプリフェッチ（Ｓ２９６）とを並列処理で行う。キャッシュミスは、キャッシュクラスタ７がキャッシュ対象としているデータのうち、要求元（キャッシュクラスタ７を利用するジョブなど）が、未キャッシュのデータを読み書きしようとした状態を指す。データが存在しないため、要求元に対してデータを即時応答することができない。キャッシュクラスタ７は、要求元を待たせたまま、キャッシュ対象データをオリジンに要求し、キャッシュデータを作成してから要求元に応答するなどの処理が発生する。 The job requests the cache cluster 7 to read the learning target data in blocks (S292). When a cache miss occurs (S293), the cache cluster 7 performs transparent connection to the learning target data (S294, S295) and prefetching the learning target data (S296) in parallel. A cache miss refers to a state in which a request source (such as a job that uses the cache cluster 7) attempts to read or write uncached data among the data that is cached by the cache cluster 7. Since the data does not exist, it is not possible to immediately respond with data to the requester. The cache cluster 7 performs processing such as requesting data to be cached from the origin while keeping the requester waiting, creating cache data, and then responding to the requester.

学習対象データへの透過的接続では、キャッシュクラスタ７は、キャッシュミスしたブロックの学習対象データをオリジンから取得し（Ｓ２９４）、取得したブロックの学習対象データをジョブに返却する（Ｓ２９５）。キャッシュクラスタ７は、キャッシュミス時に学習対象データの元データにアクセスしながら、ジョブにデータを返却する。これにより、ジョブにはキャッシュミスの発生を隠ぺいしつつ、透過的に学習対象データのオリジンにアクセスさせる。なお、キャッシュクラスタ７において、ここで返却している学習対象データのブロックは、今後利用される見込みがないため、キャッシュしないことでデータ入力処理を高速化してもよい。 In the transparent connection to the learning target data, the cache cluster 7 acquires the learning target data of the block that caused the cache miss from the origin (S294), and returns the acquired learning target data of the block to the job (S295). The cache cluster 7 returns data to the job while accessing the original data of the learning target data when a cache miss occurs. This allows jobs to transparently access the origin of the learning target data while hiding the occurrence of cache misses. Note that in the cache cluster 7, since the blocks of learning target data that are being returned here are unlikely to be used in the future, data input processing may be speeded up by not caching them.

In prefetching of the learning target data, the learning target data several blocks ahead is read in advance and cached (S296). After a cache miss occurs, the cache cluster 7 accesses the origin of the learning target data and returns the response to the job and, in parallel, starts caching the blocks several blocks ahead that the job will read next. This keeps the cache miss temporary, reduces subsequent cache misses, and speeds up data input/output processing.
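A minimal Python sketch of the miss handling in S293 to S296 follows. The class `BlockCache`, its parameters, and the use of a single background thread are assumptions made for illustration; the embodiment does not specify how the cache cluster 7 is implemented. As the text permits, the block returned on a miss is not cached here.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, Optional


class BlockCache:
    """Toy cache front end; every name in this class is hypothetical."""

    def __init__(self, fetch_from_origin: Callable[[int], bytes],
                 total_blocks: int, prefetch_depth: int = 2) -> None:
        self._fetch = fetch_from_origin
        self._total = total_blocks
        self._depth = prefetch_depth
        self._cache: Dict[int, bytes] = {}
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending: Optional[Future] = None
        self.misses = 0

    def _prefetch(self, start: int) -> None:
        # S296: read ahead and cache the next few blocks.
        for index in range(start, min(start + self._depth, self._total)):
            if index not in self._cache:
                self._cache[index] = self._fetch(index)

    def read(self, index: int) -> bytes:
        if index in self._cache:
            return self._cache[index]
        self.misses += 1  # S293: cache miss
        # Start the prefetch in the background, in parallel with the origin read.
        self._pending = self._pool.submit(self._prefetch, index + 1)
        data = self._fetch(index)  # S294: transparent access to the origin
        return data                # S295: returned to the job without caching

    def wait_for_prefetch(self) -> None:
        """Block until the most recent background prefetch has finished."""
        if self._pending is not None:
            self._pending.result()

    def close(self) -> None:
        self._pool.shutdown(wait=True)
```

With a prefetch depth of 2, a miss on block 0 leads to blocks 1 and 2 being cached in the background, so the next two reads are hits once the prefetch completes.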

FIGS. 30A and 30B are sequence diagrams showing "job checkpoint processing". Checkpoint processing freezes the virtual space and processes included in a running job and saves their state to a set of files (a dump). Job checkpoint processing is implemented using, for example, CRIU (https://www.criu.org/Main_Page, https://github.com/checkpoint-restore/criu).

This process tolerates cache misses when a job reads the learning target data from the cache cluster 7. Specifically, when a cache miss occurs, the miss is detected and the job is checkpointed. This process applies when a job has been speculatively deployed and has started learning before fetching of the learning target data into the cache cluster 7 is fully complete. The learning target data that the cache cluster 7 is to cache is stored in the user storage 6 or the cache cluster 7.

本処理の前提として、ノード３は、キャッシュクラスタ７上のボリュームをジョブのダンプの格納場所としてマウントしている。ジョブのダンプの格納場所は、クラスタ共有ストレージ４上でもよい。スケジューラ１にジョブが登録される。スケジューラ１の制御により、キャッシュクラスタ７が学習対象データのキャッシュを開始する。ジョブがデプロイされる。キャッシュクラスタ７の学習対象データにジョブからアクセスが可能になる。ジョブが、学習処理を開始する（Ｓ３１１）。ジョブは、学習対象データにアクセスしながら学習処理を行う。以下に、本処理を説明する。 As a premise of this process, the node 3 mounts a volume on the cache cluster 7 as a storage location for job dumps. The job dump may be stored on the cluster shared storage 4. A job is registered in scheduler 1. Under the control of the scheduler 1, the cache cluster 7 starts caching the learning target data. The job is deployed. The learning target data in the cache cluster 7 can be accessed from the job. The job starts learning processing (S311). The job performs learning processing while accessing learning target data. This process will be explained below.

学習対象データの読み込みでキャッシュミスが連続して発生する。この場合、以下の３つの処理のいずれかの処理が行われる。 Cache misses occur continuously when reading the learning target data. In this case, one of the following three processes is performed.

「キャッシュクラスタが検知する場合」では、キャッシュクラスタ７が、所定の閾値以上の連続したキャッシュミスを検出し（Ｓ３１２）、スケジューラ１にキャッシュミスの発生を通知する（Ｓ３１３）。閾値は、クラスタ管理者が任意に決定する。閾値は、ブロックサイズや通信速度などから適切な値を決定することができる。 In "when the cache cluster detects", the cache cluster 7 detects consecutive cache misses equal to or greater than a predetermined threshold (S312), and notifies the scheduler 1 of the occurrence of cache misses (S313). The threshold value is arbitrarily determined by the cluster administrator. An appropriate threshold value can be determined based on the block size, communication speed, etc.
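The detection in S312 and S313 could be sketched in Python as a counter of consecutive misses. The class name, the callback, and the choice to notify once when the run first reaches the threshold are assumptions for illustration only.

```python
from typing import Callable


class ConsecutiveMissDetector:
    """Counts back-to-back cache misses; names are hypothetical."""

    def __init__(self, threshold: int, notify_scheduler: Callable[[int], None]) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self._threshold = threshold
        self._notify = notify_scheduler
        self._run = 0

    def record(self, hit: bool) -> None:
        if hit:
            self._run = 0  # a hit breaks the run of consecutive misses
            return
        self._run += 1
        if self._run == self._threshold:  # S312: threshold reached
            self._notify(self._run)       # S313: notify the scheduler once per run
```

The threshold itself is left to the cluster administrator, as the text states; this sketch only shows where it would be applied.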

「ジョブがキャッシュミスを検知する場合」では、ジョブが、ストレージIO帯域幅の減少等からキャッシュミスを検知し（Ｓ３１４）、スケジューラ１にキャッシュミスの発生を通知する（Ｓ３１５）。 In "When the job detects a cache miss", the job detects a cache miss from a decrease in storage IO bandwidth, etc. (S314), and notifies the scheduler 1 of the occurrence of a cache miss (S315).

「スケジューラ１がキャッシュミスを検知する場合」では、ノード３が、ジョブのストレージIO帯域幅やGPU使用率等をマスタ２に報告する（Ｓ３１６）。スケジューラ１は、マスタ２にジョブの状態を問い合わせる（Ｓ３１７）。マスタ２は、ノード３から報告されているジョブの状態を応答する（Ｓ３１８）。スケジューラ１は、ジョブの状態からジョブのストレージIO帯域幅の減少や、GPUがほぼ使用されていないことなどを確認して、キャッシュミスの発生を検知する。 In "when the scheduler 1 detects a cache miss", the node 3 reports the job's storage IO bandwidth, GPU usage rate, etc. to the master 2 (S316). Scheduler 1 inquires of master 2 about the job status (S317). Master 2 responds with the job status reported from node 3 (S318). The scheduler 1 detects the occurrence of a cache miss by checking the job status to see if the job's storage IO bandwidth has decreased or if the GPU is almost not being used.

When the scheduler 1 detects a cache miss, it instructs the master 2 to checkpoint the job (S319), and the master 2 instructs the node 3 to checkpoint the job (S320). The node 3 checkpoints the job (S321); that is, the node 3 stores the job dump on the cache cluster 7, which it has mounted in advance. The checkpoint puts the job into a suspended state. Meanwhile, the cache cluster 7 continues prefetching the uncached portion of the learning target data.

ノード３は、ジョブのチェックポイント完了をマスタ２に報告し（Ｓ３２２）、マスタ２は、ジョブのチェックポイント完了をスケジューラ１に報告する（Ｓ３２３）。この報告には、ジョブのダンプの格納場所が含まれる。そして、後述する「ジョブのリストア処理」Ｅが行われる。 The node 3 reports the checkpoint completion of the job to the master 2 (S322), and the master 2 reports the completion of the job checkpoint to the scheduler 1 (S323). This report includes the location of the job's dump. Then, "job restoration processing" E, which will be described later, is performed.

図３１は、別の「ジョブのチェックポイント処理」を示すシーケンス図である。本処理では、ジョブがキャッシュクラスタ７から学習対象データを読み込む際のキャッシュミスを防止する。具体的には、キャッシュミスの発生を事前に検知し、ジョブをチェックポイントする。なお、本処理の前提は、図３０Ａと同様であるためここでは、説明を省略する。以下に、本処理を説明する。 FIG. 31 is a sequence diagram showing another "job checkpoint process". This process prevents cache misses when a job reads learning target data from the cache cluster 7. Specifically, the occurrence of a cache miss is detected in advance and the job is checkpointed. Note that the premise of this process is the same as that in FIG. 30A, so the explanation will be omitted here. This process will be explained below.

When the job starts learning (S331), the cache cluster 7 starts monitoring cache usage. From the trend of the amount of learning target data already cached and the amount of data the job has read, it detects an impending cache miss in advance (S332). The cache cluster 7 sends the scheduler 1 an advance warning of the cache miss (S333). The scheduler 1 instructs the master 2 to checkpoint the job (S334), and the master 2 instructs the node 3 to checkpoint the job (S335). The node 3 checkpoints the job (S336); that is, the node 3 stores the job dump on the cache cluster 7, which it has mounted in advance. The checkpoint puts the job into a suspended state. Meanwhile, the cache cluster 7 continues prefetching the uncached portion of the learning target data.

ノード３は、ジョブのチェックポイント完了をマスタ２に報告し（Ｓ３３７）、マスタ２は、ジョブのチェックポイント完了をスケジューラ１に報告する（Ｓ３３８）。この報告には、ジョブのダンプの格納場所が含まれる。そして、後述する「ジョブのリストア処理」Ｅが行われる。 The node 3 reports the completion of the checkpoint of the job to the master 2 (S337), and the master 2 reports the completion of the checkpoint of the job to the scheduler 1 (S338). This report includes the location of the job's dump. Then, "job restoration processing" E, which will be described later, is performed.
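The text does not say how the advance detection in S332 is computed. One possible approach, shown here purely as an assumption, is a linear extrapolation from two monitoring samples: if the job reads faster than the cache fills, the time until the job catches up with the cached amount can be estimated, and a warning raised when that time falls within a chosen horizon. `Sample`, `seconds_until_miss`, `should_warn`, and `horizon_s` are hypothetical names. For brevity the sketch ignores the case where the cache would finish filling before the job catches up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    time_s: float      # when the sample was taken
    cached_bytes: int  # learning target data cached so far
    read_bytes: int    # data the job has read so far


def seconds_until_miss(prev: Sample, curr: Sample, total_bytes: int) -> Optional[float]:
    """Estimate when the job will catch up with the cache, or None if it will not."""
    dt = curr.time_s - prev.time_s
    if dt <= 0:
        raise ValueError("samples must be in increasing time order")
    if curr.cached_bytes >= total_bytes:
        return None  # everything is cached; no miss can occur
    gap = curr.cached_bytes - curr.read_bytes
    if gap <= 0:
        return 0.0   # the job has already caught up
    cache_rate = (curr.cached_bytes - prev.cached_bytes) / dt
    read_rate = (curr.read_bytes - prev.read_bytes) / dt
    closing_rate = read_rate - cache_rate
    if closing_rate <= 0:
        return None  # the cache fills at least as fast as the job reads
    return gap / closing_rate


def should_warn(prev: Sample, curr: Sample, total_bytes: int, horizon_s: float) -> bool:
    """True when a miss is predicted within horizon_s seconds (S333 would follow)."""
    eta = seconds_until_miss(prev, curr, total_bytes)
    return eta is not None and eta <= horizon_s
```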

FIGS. 32A, 32B, and 32C are sequence diagrams showing "job restore" E. This process covers the period from when a job is checkpointed until the job resumes execution. Job restore reconstructs a job from the dump of a checkpointed job and resumes its operation. Job restore is implemented using, for example, CRIU (https://www.criu.org/Main_Page, https://github.com/checkpoint-restore/criu).

本処理の前提として、ノード３は、キャッシュクラスタ７上のボリュームをジョブのダンプの格納場所としてマウントしている。ダンプの格納場所は、クラスタ共有ストレージ４上でもよい。スケジューラ１にジョブが登録される。スケジューラ１の制御により、キャッシュクラスタ７が学習対象データのキャッシュを開始する。ジョブがデプロイされる。キャッシュクラスタ７の学習対象データにジョブからアクセスが可能になる。ジョブが、学習処理を開始する（Ｓ３５１）。ジョブは、学習対象データにアクセスしながら、学習処理を行う。スケジューラ１がジョブをチェックポイントすることで、ジョブの実行を一次停止する。ジョブが停止された後も、学習対象データの未キャッシュ部分のプリフェッチは継続される。以下に、本処理を説明する。 As a premise of this process, the node 3 mounts a volume on the cache cluster 7 as a storage location for job dumps. The dump may be stored on the cluster shared storage 4. A job is registered in scheduler 1. Under the control of the scheduler 1, the cache cluster 7 starts caching the learning target data. The job is deployed. The learning target data in the cache cluster 7 can be accessed from the job. The job starts learning processing (S351). The job performs learning processing while accessing learning target data. The scheduler 1 temporarily stops execution of the job by checkpointing the job. Even after the job is stopped, prefetching of the uncached portion of the learning target data continues. This process will be explained below.

When execution of the job is suspended by the checkpoint, one of the following three processes is performed: "restoration standby based on cache cluster polling confirmation", "restoration standby based on time prediction", or "the case of notification from the cache cluster".

In "restoration standby based on cache cluster polling confirmation", the scheduler 1 queries the cache cluster 7 for the cache data amount at the time of the job checkpoint and the data amount of the learning target data (S352), and obtains this information from the cache cluster 7 (S353). The data amount of the learning target data may instead be obtained from the user at job registration. The scheduler 1 then queries the cache cluster 7 for the current cache data amount and obtains it (S354, S355). The scheduler 1 repeats S354 and S355 until "cache data amount" - "cache data amount at checkpoint" >= "data amount threshold".
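The polling condition can be sketched in Python as follows. The function and parameter names are hypothetical. The extra exit when everything is already cached is an assumption added here so the loop cannot spin forever when less than the threshold remains; the text only says that the total data amount is obtained, not how it is used.

```python
import time
from typing import Callable


def wait_for_cache_growth(get_cached_bytes: Callable[[], int],
                          cached_at_checkpoint: int,
                          threshold_bytes: int,
                          total_bytes: int,
                          poll_interval_s: float = 1.0,
                          sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll until the cache has grown by threshold_bytes since the checkpoint."""
    while True:
        cached = get_cached_bytes()  # S354, S355: query and obtain the amount
        if cached - cached_at_checkpoint >= threshold_bytes:
            return cached
        if cached >= total_bytes:    # assumption: nothing left to wait for
            return cached
        sleep(poll_interval_s)
```

Passing `sleep` as a parameter is only a convenience so the loop can be exercised without real delays.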

In "restoration standby based on time prediction", the scheduler 1 queries the cache cluster 7 for the cache data amount at the time of the checkpoint, the cache speed of the cache cluster 7, and the data amount of the learning target data (S356), and obtains this information (S357). The data amount of the learning target data may instead be obtained from the user at job registration. The cache speed of the cache cluster 7 is the data input throughput at which the cache cluster 7 caches the learning target data.

The scheduler 1 calculates waiting time candidate 1 (S358). Specifically, from the cache data amount at the checkpoint and the cache speed, the scheduler 1 calculates the time until the cache data amount exceeds the threshold as waiting time candidate 1. The scheduler 1 calculates waiting time candidate 2 (S359). Specifically, from the same two values, the scheduler 1 calculates the time until all of the learning target data is cached as waiting time candidate 2. The scheduler 1 compares waiting time candidate 1 with waiting time candidate 2 and waits for the shorter of the two (S360).
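The calculation in S358 to S360 can be written out as a short Python sketch. It assumes a constant cache speed and treats the threshold as an absolute cached amount; both are assumptions, since the text does not state whether the threshold is absolute or relative to the checkpoint. All names are hypothetical.

```python
from typing import Tuple


def wait_time_candidates(cached_at_checkpoint: int,
                         cache_speed: float,
                         total_bytes: int,
                         threshold_bytes: int) -> Tuple[float, float, float]:
    """Return (candidate 1, candidate 2, chosen wait), all in seconds."""
    if cache_speed <= 0:
        raise ValueError("cache_speed must be positive")
    # S358: time until the cached amount reaches the threshold.
    candidate1 = max(threshold_bytes - cached_at_checkpoint, 0) / cache_speed
    # S359: time until all of the learning target data is cached.
    candidate2 = max(total_bytes - cached_at_checkpoint, 0) / cache_speed
    # S360: wait for the shorter of the two.
    return candidate1, candidate2, min(candidate1, candidate2)
```

For example, with 200 bytes cached, a speed of 50 bytes per second, a threshold of 600 bytes, and 1000 bytes in total, the candidates are 8 and 16 seconds and the scheduler would wait 8 seconds.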

「キャッシュクラスタから通知する場合」では、スケジューラ１は、必要なキャッシュデータ量をキャッシュクラスタ７に指示する（Ｓ３６１）。キャッシュクラスタ７は、学習対象データの未キャッシュ部分をキャッシュし（Ｓ３６２）、指示されたデータ量をキャッシュしたことを契機にスケジューラ１に通知する（Ｓ３６３）。 In "the case of notification from the cache cluster", the scheduler 1 instructs the cache cluster 7 about the required amount of cache data (S361). The cache cluster 7 caches the uncached portion of the learning target data (S362), and notifies the scheduler 1 when the instructed amount of data has been cached (S363).

The scheduler 1 registers the checkpointed, suspended job in the RQ 23 (S364). The scheduler 1 queries the master 2 for GPU availability and related information (S365) and obtains it (S366). If a GPU is available, the scheduler 1 schedules the job (S367). Specifically, the scheduler 1 schedules jobs in the RQ 23 with priority over normal jobs in the DQ 24. The scheduler 1 instructs the master 2 to restore the job (S368), and the master 2 instructs the node 3 to restore the job (S369). This instruction includes the storage location of the dump. The node 3 restores the job (S370) and resumes execution of the job (S371). For example, virtual environments such as the network namespace are restored, and the job is returned to a state in which the learning process can resume from where it was at the checkpoint. The node 3 resumes the learning process (S372).
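The queue priority applied in S367 can be sketched in Python. The ordering RQ 23, then DQ 24, then OFDQ 25 follows the priorities stated in this document; the function name, the use of `deque`, and the job identifiers are hypothetical.

```python
from collections import deque
from typing import Deque, Optional


def select_next_job(rq: Deque[str], dq: Deque[str], ofdq: Deque[str],
                    free_gpus: int) -> Optional[str]:
    """Pick the next job to deploy or restore, or None if nothing can run."""
    if free_gpus <= 0:
        return None  # no GPU available, so nothing is scheduled
    # Restore queue first, then deploy queue, then over-fairness queue.
    for queue in (rq, dq, ofdq):
        if queue:
            return queue.popleft()
    return None
```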

(Effects of this embodiment)
The scheduler 1 in the GPU cluster system of the present embodiment described above includes: a first queue selector 11 that stores a submitted job in the first stage queues 13-15, which store jobs waiting to start fetching; a first job selector 12 that retrieves a job from the first stage queues 13-15, registers it in the fetching job list 30, and causes the cache cluster 7 to start fetching the data of the job stored in the storage 4; a second queue selector 21 that retrieves, from the fetching job list 30, a job whose fetched data amount exceeds a predetermined threshold and stores it in the second stage queues 23-25, which store jobs waiting to be deployed; and a second job selector 22 that retrieves a job from the second stage queues 23-25 and instructs deployment of the job. The job deployment instruction specifies the cache cluster 7 as the storage location of the data of the job, and the GPU cluster accesses the cache cluster 7 to execute the job.
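The two-stage flow summarized above can be illustrated with a compact Python sketch. It collapses the first stage queues 13-15 and the second stage queues 23-25 into one queue each, and the class, method names, and the `"cache-cluster"` location string are hypothetical; it is a simplified model, not the embodiment's implementation.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List, Optional


@dataclass
class TwoStageScheduler:
    fetch_threshold: int
    first_stage: Deque[str] = field(default_factory=deque)   # waiting to start fetching
    fetching: Dict[str, int] = field(default_factory=dict)   # job -> fetched bytes
    second_stage: Deque[str] = field(default_factory=deque)  # waiting to be deployed

    def submit(self, job: str) -> None:
        """First queue selector: store the submitted job."""
        self.first_stage.append(job)

    def start_fetch(self) -> Optional[str]:
        """First job selector: move a job to the fetching job list."""
        if not self.first_stage:
            return None
        job = self.first_stage.popleft()
        self.fetching[job] = 0  # the cache cluster would start fetching here
        return job

    def report_fetched(self, job: str, fetched_bytes: int) -> None:
        if job in self.fetching:
            self.fetching[job] = fetched_bytes

    def promote_ready(self) -> List[str]:
        """Second queue selector: move jobs whose fetched amount exceeds the threshold."""
        ready = [j for j, n in self.fetching.items() if n > self.fetch_threshold]
        for job in ready:
            del self.fetching[job]
            self.second_stage.append(job)
        return ready

    def deploy_next(self) -> Optional[Dict[str, str]]:
        """Second job selector: deploy with the cache cluster as the data location."""
        if not self.second_stage:
            return None
        job = self.second_stage.popleft()
        return {"job": job, "data_location": "cache-cluster"}
```

A job therefore becomes deployable only after its fetched amount has passed the threshold, which is what lets deployment proceed before the fetch is complete.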

これにより本実施形態では、ストレージの速度不足により発生するGPUの遊休時間を低減し、GPUの稼働率を向上させることができる。すなわち、学習対象データなどのデータの読み出しを高速化することができ、GPUクラスタシステムの提供事業者によるGPUの稼働率を高めることができる。 As a result, in this embodiment, idle time of the GPU caused by insufficient storage speed can be reduced, and the utilization rate of the GPU can be improved. In other words, it is possible to speed up the reading of data such as learning target data, and it is possible to increase the utilization rate of GPUs by GPU cluster system providers.

また、本実施形態では、実行前のジョブをフェッチングジョブリスト３０に登録し、キャッシュクラスタ７にデータのプリフェッチを開始させる。このように、GPUによるジョブの実行と並行してデータのプリフェッチを行うことで、GPUを効率的に使用することができる。 Further, in this embodiment, a job to be executed is registered in the fetching job list 30, and the cache cluster 7 is caused to start prefetching data. In this way, by prefetching data in parallel with job execution by the GPU, the GPU can be used efficiently.

また、本実施形態では、データ待ちによるGPU遊休時にジョブを一時停止し、他のジョブにGPUを譲ることで、GPUの稼働率を向上することができる。 Furthermore, in this embodiment, the GPU utilization rate can be improved by temporarily stopping a job when the GPU is idle due to waiting for data and giving the GPU to another job.

(Hardware configuration)
The scheduler 1 described above can be implemented on, for example, a general-purpose computer system as shown in FIG. 33. The illustrated computer system includes a CPU (Central Processing Unit, processor) 901, a memory 902, a storage 903 (HDD: Hard Disk Drive, SSD: Solid State Drive), a communication device 904, an input device 905, and an output device 906. The memory 902 and the storage 903 are storage devices. In this computer system, each function of the scheduler 1 is realized by the CPU 901 executing a predetermined program loaded into the memory 902.

また、スケジューラ１は、１つのコンピュータで実装されてもよく、あるいは複数のコンピュータで実装されても良い。また、スケジューラ１は、コンピュータに実装される仮想マシンであっても良い。 Furthermore, the scheduler 1 may be implemented on one computer or on multiple computers. Further, the scheduler 1 may be a virtual machine implemented in a computer.

The program for the scheduler 1 can be stored on a computer-readable recording medium such as an HDD, SSD, USB (Universal Serial Bus) memory, CD (Compact Disc), or DVD (Digital Versatile Disc), or can be distributed via a network.

なお、本発明は上記実施形態および変形例に限定されるものではなく、その要旨の範囲内で数々の変形が可能である。 Note that the present invention is not limited to the above-described embodiments and modifications, and can be modified in many ways within the scope of the gist thereof.

1: Scheduler
11: First queue selector
12: First job selector
13: Job queue (JQ)
14: Over-fairness job queue (OFJQ)
15: User excess job queue (OUJQ)
21: Second queue selector
22: Second job selector
23: Restore queue (RQ)
24: Deploy queue (DQ)
25: Over-fairness queue (OFDQ)
30: Fetching job list (FJL)
31: Account DB
32: GPU usage monitoring unit
2: Master
3: Node
4: Cluster shared storage
5: User terminal
6: User storage
7: Cache cluster

Claims

1. A scheduling method performed by a GPU cluster system, wherein
a scheduler performs the steps of:
storing a submitted job in a first stage queue that stores jobs waiting to start fetching;
retrieving a job from the first stage queue, registering it in a fetching job list, and causing a cache cluster to start fetching data of the job;
retrieving, from the fetching job list, a job whose fetched data amount exceeds a predetermined threshold and storing it in a second stage queue that stores jobs waiting to be deployed; and
retrieving a job from the second stage queue and instructing deployment of the job;
the cache cluster performs the step of:
fetching the data of the job registered in the fetching job list from the storage in which the data is stored and storing it in the cache cluster; and
a GPU cluster performs the step of:
accessing the data in the cache cluster and executing the job.

2. A scheduler for a GPU cluster system, comprising:
a first queue selector that stores a submitted job in a first stage queue that stores jobs waiting to start fetching;
a first job selector that retrieves a job from the first stage queue, registers it in a fetching job list, and causes a cache cluster to start fetching data of the job stored in storage;
a second queue selector that retrieves, from the fetching job list, a job whose fetched data amount exceeds a predetermined threshold and stores it in a second stage queue that stores jobs waiting to be deployed; and
a second job selector that retrieves a job from the second stage queue and instructs deployment of the job,
wherein the job deployment instruction specifies the cache cluster as the storage location of the data of the job, and a GPU cluster accesses the cache cluster and executes the job.

3. The scheduler according to claim 2, wherein
the first-stage queue comprises a job queue that stores jobs that do not exceed the GPU quota allocated to each user from the viewpoint of fairness, and an excess job queue that stores jobs that exceed the quota, and
the first job selector registers jobs in the job queue in the fetching job list in preference to jobs in the excess job queue.
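
As a hedged illustration of the first-stage priority described above, the sketch below routes each job to a job queue or an excess job queue and always drains the job queue first. The dictionary fields `user` and `gpus`, the `usage` and `quota` tables, and the rule "current usage plus requested GPUs must not exceed the quota" are assumptions made for this example; the claim does not state how the excess is computed.

```python
from collections import deque


def enqueue_first_stage(job, usage, quota, job_queue, excess_job_queue):
    """Route a job by its owner's GPU usage (assumed rule: usage + request <= quota)."""
    user = job["user"]
    if usage.get(user, 0) + job["gpus"] <= quota.get(user, 0):
        job_queue.append(job)          # within the user's GPU quota
    else:
        excess_job_queue.append(job)   # would exceed the user's GPU quota


def next_job_to_fetch(job_queue, excess_job_queue):
    """Pick the next job to register in the fetching job list, job queue first."""
    if job_queue:
        return job_queue.popleft()
    if excess_job_queue:
        return excess_job_queue.popleft()
    return None
```

With this ordering, a user who is already over quota can still run jobs, but only when no within-quota job is waiting to start fetching.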

4. The scheduler according to claim 2 or 3, wherein
the second-stage queue comprises a restore queue that stores jobs waiting to be restored, a deploy queue that stores jobs waiting to be deployed, and an excess queue that stores jobs that exceed the GPU quota allocated to each user from the viewpoint of fairness, and
the second job selector instructs deployment of jobs in the restore queue in preference to jobs in the deploy queue, and instructs deployment of jobs in the deploy queue in preference to jobs in the excess queue.
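
The strict priority order among the three second-stage queues (restore queue, then deploy queue, then excess queue) can be sketched as below. This is only an illustration; the function name `next_job_to_deploy` and the use of `collections.deque` for each queue are assumptions, not details taken from the claims.

```python
from collections import deque


def next_job_to_deploy(restore_queue, deploy_queue, excess_queue):
    """Return the next job to deploy, honoring restore > deploy > excess priority."""
    for queue in (restore_queue, deploy_queue, excess_queue):
        if queue:                   # first non-empty queue in priority order wins
            return queue.popleft()
    return None                     # all three queues are empty
```

For example, with `restore_queue = deque(["r1"])`, `deploy_queue = deque(["d1"])` and `excess_queue = deque(["x1"])`, three successive calls return `"r1"`, `"d1"` and `"x1"` in that order.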

5. The scheduler according to claim 4, wherein
when the restore queue, the deploy queue, and the excess queue are all empty, the second job selector activates the second queue selector, and the second queue selector stores the job with the largest fetched data amount in the fetching job list, or the first job in the fetching job list, in one of the restore queue, the deploy queue, and the excess queue.
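
One possible reading of this claim is that, when all three second-stage queues are empty, a job is promoted from the fetching job list even though it has not yet crossed the threshold, chosen either as the job with the largest fetched data amount or as the first job in the list. The sketch below illustrates that reading only; the function name, the `fetched_bytes` field, and the `largest_first` flag are assumptions, and the choice of target queue is left to the caller.

```python
def pick_fallback_job(fetching_job_list, largest_first=True):
    """Remove and return a job from the fetching job list when stage two is empty.

    largest_first=True  -> job with the largest fetched data amount
    largest_first=False -> first job in the fetching job list
    """
    if not fetching_job_list:
        return None
    if largest_first:
        index = max(range(len(fetching_job_list)),
                    key=lambda i: fetching_job_list[i]["fetched_bytes"])
    else:
        index = 0
    return fetching_job_list.pop(index)
```

Either policy keeps the GPUs from sitting idle while every job is still below the threshold; the largest-first policy favors the job that will stall least on data that has not yet been fetched.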

6. A GPU cluster system comprising the scheduler according to any one of claims 2 to 5, a cache cluster, and a GPU cluster, wherein
the cache cluster fetches the data of a job registered in the fetching job list from the storage in which the data is stored and stores the data in the cache cluster, and
the GPU cluster accesses the data in the cache cluster to execute the job.

7. A program that causes a computer to function as the scheduler according to claim 2.