JP2019035997A

JP2019035997A - Distributed synchronous processing system, distributed synchronous processing method and distributed synchronous processing program

Info

Publication number: JP2019035997A
Application number: JP2017155099A
Authority: JP
Inventors: Misao Kataoka; Hiroaki Kobayashi; Mitsuhiro Okamoto; Yudai Kitano; Takeshi Fukumoto
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2017-08-10
Filing date: 2017-08-10
Publication date: 2019-03-07
Anticipated expiration: 2037-08-10
Also published as: JP6778161B2

Abstract

PROBLEM TO BE SOLVED: To provide a distributed synchronous processing system capable of suppressing standby time and reducing processing time.

SOLUTION: A distributed synchronous processing system connects a plurality of processing servers, each operating one or more distributed processing units that are virtual machines, and performs calculation processing in synchronization between the distributed processing units. The processing server 30 includes delay control means 322 for sending a delay message to another processing server when a distributed processing unit on its own processing server is waiting for a calculation result from a distributed processing unit on the other processing server and a resource capable of operating the distributed processing unit on the other processing server is available, and replica control means 323 for constructing a replica on its own processing server 30 and causing it to perform calculation processing upon receiving, from the other processing server in response to the delay message, a replica control message instructing construction of a replica of the distributed processing unit that caused the waiting state.

SELECTED DRAWING: Figure 6

Description

The present invention relates to a distributed synchronous processing system, a distributed synchronous processing method, and a distributed synchronous processing program that execute processing by synchronizing a plurality of distributed servers.

Non-Patent Document 1 discloses MapReduce as a framework for distributed processing systems in which a plurality of servers are distributed over a network. However, MapReduce must read input data from an external data store and write out its results for every process, so it is not suited to iterative processing in which the result of one process is used in the next. BSP (Bulk Synchronous Parallel), disclosed in Non-Patent Document 2, is suitable for this type of processing.

BSP executes data processing in a distributed environment by repeatedly executing a processing unit called a "superstep" (SS). FIG. 18 is a diagram for explaining the BSP calculation model.
As shown in FIG. 18, one superstep consists of three phases (PH): "local computation" (LC) (phase PH1), "data exchange" (Com: Communication) (phase PH2), and "synchronization" (Sync) (phase PH3).

Specifically, when any one of a plurality of nodes (node 1 to node 4) receives data, that node (for example, node 1) executes calculation processing on the data (local computation [LC]) in phase PH1. Subsequently, in phase PH2, the nodes exchange the data that each node holds as the result of its local computation. Next, in phase PH3, synchronization is performed; more specifically, the nodes wait until the data exchange between all nodes has finished.
When the series of superstep processing (PH1 to PH3) is completed as superstep SS1, each node retains its calculation result and proceeds to the next series of processing, superstep SS2. Thereafter, a plurality of supersteps are repeated in the same manner.
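The three phases described above can be sketched as a small single-process simulation. This is only an illustration under stated assumptions, not the method of any cited framework: the function name `run_bsp`, the ring-shaped message pattern, and the "add the received value" computation are all invented for the example.

```python
def run_bsp(values, supersteps):
    """Simulate BSP supersteps on a ring of nodes (hypothetical example).

    values: the initial value held by each node.
    Each superstep runs PH1 (local computation), PH2 (data exchange)
    and PH3 (synchronization) in that order.
    """
    n = len(values)
    inbox = [0] * n  # messages made visible at the previous barrier
    for _ in range(supersteps):
        # PH1: local computation. Each node adds the value it received.
        values = [v + m for v, m in zip(values, inbox)]
        # PH2: data exchange. Node i sends its value to node (i + 1) % n.
        outbox = [0] * n
        for i, v in enumerate(values):
            outbox[(i + 1) % n] = v
        # PH3: synchronization. Messages become visible only after
        # every node has finished its exchange.
        inbox = outbox
    return values


print(run_bsp([1, 2, 3, 4], supersteps=2))
```

Because the inbox is swapped only at the barrier, no node can observe a value produced in the same superstep, which is the property that the synchronization phase provides.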

Conventionally, frameworks for distributed synchronous processing systems have adopted a master/worker model, in which a management server (master) that manages the system subdivides the entire target calculation into predetermined units and allocates the individual calculation processes to processing servers (workers).
Non-Patent Document 3 discloses Pregel as a framework for a distributed synchronous processing system that employs BSP. In frameworks such as Pregel, the entire process is expressed as a graph G = (V, E), which is applied to BSP and executed. Here, V is a set of vertices, and each vertex corresponds to one of the subdivided processes. E is a set of edges, and each directed edge corresponds to a path over which information is transmitted between vertices.

Here, the flow of processing in a conventional distributed synchronous processing system will be described with reference to FIG. 20. The calculation target is the graph topology (graph G) shown in FIG. 19. As shown in FIG. 20, the distributed synchronous processing system 100A consists of two workers (worker1 and worker2); of vertices V_1 to V_6, worker1 is in charge of vertices V_1 to V_3 and worker2 is in charge of vertices V_4 to V_6. The overall flow of processing is described below.

First, each worker (worker1, worker2) executes the superstep of the vertices it is in charge of, in accordance with an instruction from the master (step S100). Specifically, it executes the local computation of phase PH1 and thereby starts the superstep processing.
Next, each worker monitors the progress of the processing of the vertices it is in charge of and determines whether each vertex has completed processing up to the data exchange of phase PH2. When a worker confirms that all of its vertices have completed the processing up to phase PH2, it reports the completion information to the master (steps S101 and S102).

Subsequently, when the master has received completion information from all workers (here, worker1 and worker2), it increments the superstep by 1 and instructs each vertex to execute the next superstep (step S103).
The master and each worker then repeat the operations of steps S100 to S103.
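The reporting loop of steps S101 to S103 can be sketched as follows. This is a minimal illustration, assuming an in-process `Master` class and a `report_complete` method that are hypothetical names introduced here; the actual frameworks exchange these reports over a network.

```python
class Master:
    """Hypothetical master that advances the superstep at a global barrier."""

    def __init__(self, workers):
        self.workers = set(workers)
        self.superstep = 1
        self.reported = set()

    def report_complete(self, worker):
        """Record a completion report (S101, S102).

        Returns True when this report was the last one outstanding,
        in which case the superstep counter is incremented (S103).
        """
        self.reported.add(worker)
        if self.reported == self.workers:
            self.superstep += 1
            self.reported.clear()
            return True
        return False


master = Master(["worker1", "worker2"])
master.report_complete("worker1")   # barrier not yet reached
master.report_complete("worker2")   # barrier reached, superstep advances
print(master.superstep)
```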

The conventional distributed synchronous processing system 100A synchronizes all vertices to be calculated at every superstep. Specifically, because synchronization is performed at the global synchronization point shown in FIG. 20, synchronization completes only after the slowest vertex has finished. For example, in superstep SS1 of FIG. 20, synchronization completes after the processing of vertex V_2, the slowest of vertices V_1 to V_6. In superstep SS2, synchronization completes after the processing of vertex V_6, the slowest vertex. Therefore, if one vertex is extremely slow, synchronization completes only after that vertex has finished, and the processing of all vertices is significantly delayed.

To solve this problem, the distributed synchronous processing system of Non-Patent Document 4 does not synchronize across all vertices as in the conventional approach, but instead decides for each vertex whether to move to the next superstep. Specifically, in the distributed synchronous processing system of Non-Patent Document 4, each vertex moves to the next superstep when the following condition (adjacent synchronization) is satisfied: "the calculation and exchange processing (local computation [phase PH1] and data exchange [phase PH2]) has been completed for the vertex itself and for all vertices adjacent to it via input edges of the graph topology."

FIG. 21 shows the flow of processing executed by the conventional distributed synchronous processing system 100A described with reference to FIG. 20 (see FIG. 21(a)) and the flow of processing executed by the conventional distributed synchronous processing system 100B described in Non-Patent Document 4 (see FIG. 21(b)).
As described above, the conventional distributed synchronous processing system 100B moves to the next superstep based on adjacent synchronization, that is, when "the calculation and exchange processing (local computation [phase PH1] and data exchange [phase PH2]) has been completed for the vertex itself and for all vertices adjacent to it via input edges."

For example, focusing on vertex V_4 in FIG. 21(b), the adjacent synchronization point of vertex V_4 is the point at which both its own calculation and exchange processing and that of vertices V_5 and V_6, which are adjacent to it via input edges, have finished.
In superstep SS1, when vertex V_4 finishes its own calculation and exchange processing, the calculation and exchange processing of vertices V_5 and V_6 has not yet finished. Vertex V_4 therefore enters a standby state, and the point at which the calculation and exchange processing of vertices V_5 and V_6 finishes becomes its adjacent synchronization point.

Focusing on vertex V_1, no vertex is adjacent to vertex V_1 via an input edge. Therefore, in superstep SS1, the point at which the vertex's own calculation and transmission processing finishes becomes its adjacent synchronization point.
That is, the distributed synchronous processing system 100B can move a vertex to the next superstep at its adjacent synchronization point even if the calculation and exchange processing of all vertices has not been completed.
In this way, the distributed synchronous processing system 100B can significantly reduce the time until the synchronization of phase PH3 completes compared with the distributed synchronous processing system 100A, and can therefore improve the processing speed of the system as a whole and the utilization efficiency of each worker.
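The difference between the global synchronization of system 100A and the adjacent synchronization of system 100B can be illustrated with a short calculation. The finish times below are hypothetical values chosen for the example, not values read from FIG. 21; only the input edges of V_4 (from V_5 and V_6) and the absence of input edges for V_1 come from the description, and the remaining vertices are treated as having no input edges purely to keep the sketch small.

```python
def global_sync_point(finish):
    """Global barrier: every vertex waits for the slowest vertex."""
    return max(finish.values())


def adjacent_sync_points(finish, in_neighbors):
    """Adjacent synchronization: a vertex waits only for itself and
    for the vertices connected to it by input edges."""
    return {
        v: max([finish[v]] + [finish[u] for u in in_neighbors.get(v, [])])
        for v in finish
    }


# Hypothetical finish times (in steps) of calculation and exchange processing.
finish = {"V_1": 2, "V_2": 9, "V_3": 4, "V_4": 3, "V_5": 5, "V_6": 6}
in_neighbors = {"V_4": ["V_5", "V_6"]}  # V_1 has no input edges

points = adjacent_sync_points(finish, in_neighbors)
print(global_sync_point(finish))   # barrier shared by every vertex
print(points["V_4"], points["V_1"])
```

With these assumed numbers, V_1 and V_4 can advance well before the slowest vertex V_2 finishes, whereas under the global barrier both would wait for V_2.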

Non-Patent Document 1: Dean, J., et al., "MapReduce: Simplified Data Processing on Large Clusters," OSDI '04, 2004, p.137-149
Non-Patent Document 2: Valiant, L., et al., "A bridging model for parallel computation," Communications of the ACM, 1990, vol.33, No.8, p.103-111
Non-Patent Document 3: Malewicz, G., et al., "Pregel: A System for Large-Scale Graph Processing," Proc. of ACM SIGMOD, 2010, p.136-145
Non-Patent Document 4: Hiroaki Kobayashi, Mitsuhiro Okamoto, "A Study on Synchronous Processing of Distributed Processing Framework," co-sponsored by The Institute of Electronics, Information and Communication Engineers and the Graduate School of Information Science, Faculty of Engineering, Hokkaido University, Proceedings of the 2016 Society Conference, B-7-1, September 20, 2016

In a distributed synchronous processing system that, like conventional BSP (Bulk Synchronous Parallel), performs calculation processing in parallel in a distributed environment and synchronizes periodically, the calculation load (data volume, CPU load, and so on) fluctuates constantly. It is therefore difficult to distribute the processing in advance so that the calculation load is even across the servers.

Here, the problem with the conventional technology will be described with reference to FIG. 22. The calculation target is the graph topology (graph G) shown in FIG. 22. As shown in FIG. 22, the distributed synchronous processing system consists of two workers (worker1 and worker2); of vertices V_1 to V_3, worker1 is in charge of vertices V_1 and V_2, and worker2 is in charge of vertex V_3.
To simplify the description, it is assumed here that the operation steps of vertices V_1 to V_3 run in numerical order on one CPU core.

As shown in FIG. 22, worker1 executes the processing of superstep SS1 at vertices V_1 and V_2 in steps 1 to 12. Meanwhile, worker2 executes the processing of superstep SS1 at vertex V_3 in steps 1 to 3.
After the processing of superstep SS1 is completed, worker1 starts the processing of superstep SS2 at vertices V_1 and V_2 in step 13.
Meanwhile, after the processing of superstep SS1 is completed, worker2 starts the processing of superstep SS2 at vertex V_3. To start the processing of superstep SS2, however, worker2 must wait at vertex V_3 until the calculation of vertex V_2 on worker1 is completed.

The processing of vertex V_2 on worker1 is completed in step 8. Consequently, worker2 starts the processing of superstep SS2 at vertex V_3 in step 9. That is, during steps 4 to 8, worker2 is in a standby state waiting for the processing of vertex V_2, and standby time occurs.
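The standby time in this example follows directly from the step numbers given for FIG. 22. The short calculation below restates that arithmetic; the function names are hypothetical and introduced only for illustration.

```python
def standby_steps(own_finish, dependency_finish):
    """Number of idle steps after own_finish while waiting for a dependency."""
    return max(0, dependency_finish - own_finish)


def next_superstep_start(own_finish, dependency_finish):
    """First step in which the next superstep can begin."""
    return max(own_finish, dependency_finish) + 1


# Values from the FIG. 22 description:
# V_3 (worker2) finishes SS1 at step 3, V_2 (worker1) finishes at step 8.
print(standby_steps(3, 8))          # 5 idle steps (steps 4 to 8)
print(next_superstep_start(3, 8))   # SS2 starts at step 9
```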

The present invention has been made in view of this problem, and its object is to provide a distributed synchronous processing system, a distributed synchronous processing method, and a distributed synchronous processing program that, in a system performing distributed synchronous processing, can suppress standby time and shorten the processing time of the system as a whole.

To solve the above problem, the invention according to claim 1 is a distributed synchronous processing system in which a plurality of processing servers, each operating one or more distributed processing units that are virtual machines, are connected, and in which calculation processing is performed in synchronization between the distributed processing units according to a predetermined graph topology, wherein the processing server comprises: resource management means for managing resources; data transfer means for transmitting and receiving, between the distributed processing units, a calculation completion message including a calculation result; delay control means for transmitting a delay message indicating a processing delay to another processing server when a distributed processing unit on its own processing server is waiting for the calculation result from a distributed processing unit on the other processing server and the resources managed by the resource management means include a resource capable of operating the distributed processing unit on the other processing server; and replica control means for, upon receiving from the other processing server, in response to the delay message, a replica control message instructing construction of a replica of the distributed processing unit that caused the waiting state, constructing the replica on its own processing server and causing it to perform the calculation processing, and wherein the delay control means, upon receiving the delay message from another processing server, returns a replica control message instructing construction of the replica.

To solve the above problem, the invention according to claim 5 is a distributed synchronous processing method in which a plurality of processing servers, each operating one or more distributed processing units that are virtual machines, are connected, and in which calculation processing is performed in synchronization between the distributed processing units according to a predetermined graph topology, wherein the processing server executes: a step of transmitting a delay message indicating a processing delay to another processing server when a distributed processing unit on its own processing server is waiting for a calculation result from a distributed processing unit on the other processing server and the resources managed by resource management means include a resource capable of operating the distributed processing unit on the other processing server; and a step of, upon receiving from the other processing server, in response to the delay message, a replica control message instructing construction of a replica of the distributed processing unit that caused the waiting state, constructing the replica on its own processing server and causing it to perform the calculation processing, and wherein the processing server, upon receiving the delay message from another processing server, executes a step of returning a replica control message instructing construction of the replica.

To solve the above problem, the invention according to claim 6 is a distributed synchronous processing program for causing a computer of the processing server, in a distributed synchronous processing system in which a plurality of processing servers, each operating one or more distributed processing units that are virtual machines, are connected and in which calculation processing is performed in synchronization between the distributed processing units according to a predetermined graph topology, to function as: resource management means for managing resources; data transfer means for transmitting and receiving, between the distributed processing units, a calculation completion message including a calculation result; delay control means for transmitting a delay message indicating a processing delay to another processing server when a distributed processing unit on its own processing server is waiting for the calculation result from a distributed processing unit on the other processing server and the resources managed by the resource management means include a resource capable of operating the distributed processing unit on the other processing server; and replica control means for, upon receiving from the other processing server, in response to the delay message, a replica control message instructing construction of a replica of the distributed processing unit that caused the waiting state, constructing the replica on its own processing server and causing it to perform the calculation processing, wherein the delay control means, upon receiving the delay message from another processing server, returns a replica control message instructing construction of the replica.

According to the inventions of claims 1, 5, and 6, in the distributed synchronous processing system, a processing server that is waiting for synchronization and has free resources can start, in parallel and by means of a replica, the operation of a distributed processing unit of another processing server. As a result, the present invention can use the resources of each processing server efficiently and can shorten the processing time of the calculation processing for the system as a whole.
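The exchange of the delay message and the replica control message can be sketched as follows. This is a simplified illustration under stated assumptions, not the claimed implementation: the class `ProcessingServer`, the field `free_cores` (standing in for the resource management means), the message tag `BUILD_REPLICA`, and the use of a direct method call in place of network messaging are all hypothetical.

```python
from dataclasses import dataclass, field

BUILD_REPLICA = "BUILD_REPLICA"  # hypothetical replica control message type


@dataclass
class ProcessingServer:
    name: str
    free_cores: int  # stand-in for the resource management means
    replicas: list = field(default_factory=list)

    def delay_control(self, vertex, owner):
        """Called while waiting for `vertex`, which runs on `owner`.

        A delay message is sent only if this server has a free resource.
        """
        if self.free_cores < 1:
            return None
        reply = owner.on_delay_message(vertex)
        if reply == (BUILD_REPLICA, vertex):
            self.replica_control(vertex)
        return reply

    def on_delay_message(self, vertex):
        """The slow server answers with a replica control message."""
        return (BUILD_REPLICA, vertex)

    def replica_control(self, vertex):
        """Build the replica locally and account for the resource it uses."""
        self.free_cores -= 1
        self.replicas.append(vertex)


worker1 = ProcessingServer("worker1", free_cores=0)
worker2 = ProcessingServer("worker2", free_cores=1)
worker2.delay_control("V_2", owner=worker1)  # worker2 is waiting for V_2
print(worker2.replicas)
```

In this sketch the waiting server takes the initiative, which mirrors the description: only a server that is both idle and has spare resources offers to run another server's distributed processing unit.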

The invention according to claim 2 is the distributed synchronous processing system according to claim 1, wherein the data transfer means, when the replica on its own processing server has completed its calculation, transmits the calculation completion message including the calculation result of the replica to the other processing server on which the distributed processing unit that is the original of the replica operates, and, upon receiving the calculation completion message including the calculation result of the replica from another processing server, terminates the calculation of the original distributed processing unit and transmits the calculation completion message to the output destination of the calculation result of the original distributed processing unit.

According to the invention of claim 2, when the replica finishes its calculation processing earlier than the original distributed processing unit, the calculation result of the replica can be adopted and the calculation processing of the original distributed processing unit can be stopped. As a result, the present invention can stop calculation processing that is no longer needed at an early stage and improve the efficiency of the system as a whole.

The invention according to claim 3 is the distributed synchronous processing system according to claim 1 or claim 2, further comprising replica management means for managing the state in which a replica of an original distributed processing unit has been constructed, wherein the data transfer means, when transmitting the calculation result from a distributed processing unit on its own processing server to another distributed processing unit, transmits a replica control message instructing that the operation of the replica be stopped to the other processing server on which the replica has been constructed, if the replica management means indicates that a replica of the distributed processing unit on its own processing server has been constructed, and wherein the replica control means stops the operation of the replica upon receiving, from another processing server, the replica control message instructing that the operation of the replica be stopped.

According to the invention of claim 3, when the original distributed processing unit finishes its calculation processing earlier than the replica, the calculation processing of the replica can be stopped. As a result, the present invention can stop calculation processing that is no longer needed at an early stage and improve the efficiency of the system as a whole.
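Taken together, claims 2 and 3 describe a first-to-finish rule between the original distributed processing unit and its replica. The sketch below illustrates that rule only; the function name `resolve_race` is hypothetical, and resolving a tie in favor of the original is an assumption made for the example, since the description does not address ties.

```python
def resolve_race(original_finish, replica_finish):
    """Decide which result is adopted and which computation is stopped.

    Returns a tuple (adopted, stopped). If the replica finishes strictly
    earlier, its result is adopted and the original is stopped (claim 2).
    Otherwise the original's result is used and the replica is stopped
    (claim 3). Ties go to the original (assumption).
    """
    if replica_finish < original_finish:
        return ("replica", "original")
    return ("original", "replica")


print(resolve_race(original_finish=8, replica_finish=6))
print(resolve_race(original_finish=5, replica_finish=6))
```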

The invention according to claim 4 is the distributed synchronous processing system according to any one of claims 1 to 3, wherein the processing server operates the distributed processing units running on its own processing server in preference to the replicas running on its own processing server.

According to the invention of claim 4, the distributed processing units that belong to a processing server are operated in preference to replicas of distributed processing units that originally run on other processing servers, so that the effect of the replicas on the processing speed of the server's own distributed processing units can be kept small.
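One simple way to picture this priority rule is a scheduler that always selects an original distributed processing unit before any replica. The following is a sketch under assumptions: the task representation as `(vertex, is_replica)` tuples and the function name `pick_next` are hypothetical, and the description does not specify how the priority is actually enforced.

```python
def pick_next(runnable):
    """Pick the next task to run on this processing server.

    runnable: list of (vertex, is_replica) tuples.
    Originals (is_replica == False) are chosen before replicas; among
    tasks of equal priority the earliest entry in the list wins.
    Returns None when nothing is runnable.
    """
    return min(runnable, key=lambda task: task[1], default=None)


# worker2 hosts its own vertex V_3 and a replica of V_2.
print(pick_next([("V_2", True), ("V_3", False)]))
```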

According to the present invention, when distributed synchronous processing is performed, a processing server that is waiting for synchronization and has free resources can speculatively execute the operation of a distributed processing unit of another processing server.
As a result, the present invention can make effective use of processing servers that have free resources, shorten the standby time of processing servers that are in a waiting state, and speed up the synchronous processing calculation of the system as a whole.

FIG. 1 is a diagram (1/2) for explaining an overview of the processing of the distributed synchronous processing system according to the embodiment of the present invention.
FIG. 2 is a diagram (2/2) for explaining an overview of the processing of the distributed synchronous processing system according to the embodiment of the present invention.
FIG. 3 is a diagram showing the overall configuration of the distributed synchronous processing system according to the embodiment of the present invention.
FIG. 4 is a diagram showing the configuration of the management server of the distributed synchronous processing system according to the embodiment of the present invention.
FIG. 5 is a diagram showing the configuration of the distributed processing unit of the distributed synchronous processing system according to the embodiment of the present invention.
FIG. 6 is a diagram showing the configuration of the processing server of the distributed synchronous processing system according to the embodiment of the present invention.
FIG. 7 is a flowchart showing the overall operation of the distributed synchronous processing system according to the embodiment of the present invention.
FIG. 8 is a flowchart showing the operation of the data transfer processing of the distributed synchronous processing system according to the embodiment of the present invention.
FIG. 9 is a flowchart showing the operation of the delay control of the distributed synchronous processing system according to the embodiment of the present invention.
FIG. 10 is a flowchart showing the operation of the replica control of the distributed synchronous processing system according to the embodiment of the present invention.
FIG. 11 is a diagram illustrating the graph topology to be calculated in an application example of the distributed synchronous processing system according to the embodiment of the present invention.
FIG. 12 is a diagram (1/6) for explaining the processing procedure of the application example of the distributed synchronous processing system according to the embodiment of the present invention.
FIG. 13 is a diagram (2/6) for explaining the processing procedure of the application example of the distributed synchronous processing system according to the embodiment of the present invention.
FIG. 14 is a diagram (3/6) for explaining the processing procedure of the application example of the distributed synchronous processing system according to the embodiment of the present invention.
FIG. 15 is a diagram (4/6) for explaining the processing procedure of the application example of the distributed synchronous processing system according to the embodiment of the present invention.
FIG. 16 is a diagram (5/6) for explaining the processing procedure of the application example of the distributed synchronous processing system according to the embodiment of the present invention.
FIG. 17 is a diagram (6/6) for explaining the processing procedure of the application example of the distributed synchronous processing system according to the embodiment of the present invention.
FIG. 18 is an explanatory diagram for explaining the BSP calculation model.
FIG. 19 is a diagram illustrating the graph topology to be calculated in the BSP calculation model.
FIG. 20 is a diagram for explaining the flow of processing of a conventional distributed synchronous processing system.
FIG. 21 is a diagram for explaining the difference between the synchronization points of two conventional distributed synchronous processing systems.
FIG. 22 is a diagram for explaining the problem with the conventional distributed synchronous processing system.

Embodiments of the present invention will be described with reference to the drawings.
≪Processing overview of the distributed synchronous processing system≫
First, an overview of the processing of the distributed synchronous processing system 1 according to the embodiment of the present invention is given with reference to FIG. 1 and FIG. 2. The calculation target is the graph topology (graph G) shown in FIG. 1. As shown in FIG. 1, the distributed synchronous processing system 1 consists of two workers (worker 1 and worker 2 [processing servers]). Of the vertices V₁ to V₃ (distributed processing units), worker 1 is in charge of vertices V₁ and V₂, and worker 2 is in charge of vertex V₃.
To simplify the description, each operation step of vertices V₁ to V₃ is assumed here to run in numerical order on one core (CORE) of the CPU.

As shown in FIG. 1, worker 1 executes superstep SS1 by processing vertices V₁ and V₂ in steps 1, 2, and so on. Worker 2 executes superstep SS1 by processing vertex V₃ in steps 1 to 3. To move on to superstep SS2, worker 2 needs the calculation result of vertex V₂ on worker 1, which has not yet completed.

Therefore, if it has free resources, worker 2 constructs on itself a replica R, which is a copy of vertex V₂ of worker 1, and executes in replica R the same processing as vertex V₂ in steps 4 to 7.
In the case of FIG. 1, worker 1 has not completed the processing of the original vertex V₂ at step 7. The original vertex V₂ therefore replaces its own in-progress data with the data that is the calculation result of replica R.
This lets worker 1 finish the calculation processing of vertex V₂ early and shortens the calculation time of the system as a whole.

If the calculation processing of replica R takes longer than that of the original vertex V₂, worker 1 makes worker 2 stop the calculation processing of replica R, as shown in FIG. 2.
In the case of FIG. 2, worker 1 has completed the processing of vertex V₂ in four steps, but replica R has not completed its processing. The original vertex V₂ therefore stops the calculation of replica R.
This prevents the system from being loaded more than necessary.
The configuration and operation of the distributed synchronous processing system that realizes the processing outlined above are described in detail below.
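The two outcomes of FIG. 1 and FIG. 2 amount to a first-finisher-wins rule between the original vertex and its replica. The following is a minimal Python sketch of that rule, not the patented implementation. It assumes both completion times are measured on one shared step counter and that a tie goes to the original; neither assumption is stated in the text. The function name `resolve_race`, its return labels, and the value 9 for the original's finishing step are hypothetical.

```python
def resolve_race(original_done_step: int, replica_done_step: int) -> dict:
    """Decide which copy of a vertex supplies the superstep result.

    Each argument is the step at which that copy would finish.
    Assumption (not from the text): a tie is resolved in favor of the original.
    """
    if replica_done_step < original_done_step:
        # FIG. 1 case: the original replaces its in-progress data
        # with the replica's calculation result.
        return {"result_from": "replica",
                "action": "original adopts replica result"}
    # FIG. 2 case: the original finished first, so the replica is stopped.
    return {"result_from": "original",
            "action": "replica calculation is stopped"}


# FIG. 1: replica done at step 7, original still running (9 is a made-up value).
fig1 = resolve_race(original_done_step=9, replica_done_step=7)
# FIG. 2: original done at step 4, replica not yet done.
fig2 = resolve_race(original_done_step=4, replica_done_step=7)
```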

≪Configuration of the distributed synchronous processing system≫
As shown in FIG. 3, the distributed synchronous processing system 1 includes a management server 10 and a plurality of processing servers 30 that are each connected to the management server 10 and perform processing in parallel. One or more distributed processing units 20 (virtual machines) run on each processing server 30.

The management server 10 and the processing server 30 have the hardware of a general computer, such as a CPU (Central Processing Unit), RAM (Random Access Memory), ROM (Read Only Memory), and an HDD (Hard Disk Drive). The HDD stores an OS (Operating System), programs, various data, and the like. The OS and application programs are loaded into RAM and executed by the CPU. In FIGS. 4 to 6, the interiors of the management server 10, the distributed processing unit 20, and the processing server 30 are shown as blocks representing the functions realized by the programs loaded into RAM (the management program, the distributed processing program, and the distributed synchronous processing program).

The management server 10 (master) manages the entire system. As shown in FIG. 4, the management server 10 includes distributed arrangement means 11.
The distributed arrangement means 11 allocates the plurality of distributed processing units 20 required for the target calculation processing to the plurality of processing servers 30.
Based on a configured graph topology that indicates the connection relationships among the distributed processing units 20 (vertices) to be calculated, the distributed arrangement means 11 places the distributed processing units 20, which are virtual machines, on the processing servers 30. Here, placing a distributed processing unit 20 means transmitting the software that constitutes the virtual machine to a processing server 30 and running the distributed processing unit 20 as a virtual machine on that processing server 30.
The distributed arrangement means 11 also notifies each processing server 30 of the graph topology indicating the connection relationships among the distributed processing units 20 and of the resources that each distributed processing unit 20 requires (number of CPU cores, amount of memory, and so on).
Although not illustrated here, the management server 10 may also include an end instruction unit that, in response to an external instruction, transmits to the processing servers 30 an end instruction message for terminating the processing of the entire system.

The distributed processing unit 20 (vertex) is a virtual machine that runs on a processing server 30 and executes a distributed portion of the calculation processing. Each calculation process includes data input, calculation, message transmission and reception, and the like. When the target calculation processing is expressed as a graph G = (V, E), each distributed processing unit 20 functions as an individual vertex in the graph.
As shown in FIG. 5, the distributed processing unit 20 includes numerical calculation means 21 and message transmission/reception means 22.

The numerical calculation means 21 performs calculation processing divided into predetermined units. This processing corresponds to the local computation (LC) that forms phase PH1 of the BSP calculation model described with reference to FIG. 19.
The numerical calculation means 21 executes the calculation processing using as input the calculation results set in the calculation completion messages notified from other distributed processing units 20 via the message transmission/reception means 22, and then moves to the next superstep. If the calculation processing does not use calculation results from other distributed processing units 20 (the vertex has no input edge in the graph), the numerical calculation means 21 executes the calculation processing in sequence without waiting for input and moves to the next superstep.

If the calculation processing is one whose result is notified to other distributed processing units 20 (the vertex has an output edge in the graph), the numerical calculation means 21 sets the calculation result in a calculation completion message and passes it to the message transmission/reception means 22. If the calculation processing is one whose result is not notified to other distributed processing units 20 (the vertex has no output edge in the graph), the numerical calculation means 21 does not send a calculation completion message.
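The input-edge and output-edge rules for the numerical calculation means 21 can be sketched as two simple graph queries. This is an illustrative Python sketch only. The edge set is hypothetical (the overview confirms only that V₃ needs the result of V₂), and `Topology`, `must_wait_for_input`, and `sends_completion` are assumed names that do not appear in the source.

```python
from dataclasses import dataclass, field


@dataclass
class Topology:
    """Directed graph G = (V, E); an edge (u, v) means v consumes u's result."""
    edges: set = field(default_factory=set)

    def inputs(self, v):
        """Vertices at the far end of v's input edges."""
        return sorted(u for (u, w) in self.edges if w == v)

    def outputs(self, v):
        """Vertices at the far end of v's output edges."""
        return sorted(w for (u, w) in self.edges if u == v)

    def must_wait_for_input(self, v) -> bool:
        # No input edge: compute right away and move to the next superstep.
        return bool(self.inputs(v))

    def sends_completion(self, v) -> bool:
        # No output edge: no calculation completion message is sent.
        return bool(self.outputs(v))


# Hypothetical edges; only V2 -> V3 is implied by the processing overview.
g = Topology(edges={("V1", "V2"), ("V2", "V3")})
```

With this edge set, `V1` computes without waiting, `V2` waits for `V1`, and `V3` waits for `V2` but sends no completion message.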

The message transmission/reception means 22 exchanges messages with other distributed processing units 20. This processing corresponds to the data exchange (COM: Communication) that forms phase PH2 of the BSP calculation model described with reference to FIG. 18.
The message transmission/reception means 22 exchanges messages, via the message processing means 32 of the processing server 30 (see FIG. 6), with other distributed processing units 20 in the same processing server 30 or with distributed processing units 20 in other processing servers 30.

Once the calculation completion messages notified from other distributed processing units 20 are all present, that is, once it has received calculation completion messages for the same superstep from all the distributed processing units 20 corresponding to the input edges of the graph, the message transmission/reception means 22 outputs the calculation results of all those messages to the numerical calculation means 21.
If the numerical calculation means 21 of its own distributed processing unit 20 performs processing that requires calculation results from other distributed processing units 20 and no calculation completion message is received within a predetermined time, that is, if a wait state arises, the message transmission/reception means 22 sends a delay message to the message processing means 32. The delay message carries information (a vertex number or the like) identifying the other distributed processing unit 20 whose calculation completion message is late.
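The collect-then-release behavior and the timeout that triggers a delay message can be sketched as below. This is a hedged Python sketch, not the source's implementation. The class `Inbox`, its method names, and the use of an explicit `waited` argument in place of a real clock are all assumptions made to keep the example deterministic.

```python
from collections import defaultdict


class Inbox:
    """Collects calculation completion messages for one vertex."""

    def __init__(self, input_vertices, timeout):
        self.input_vertices = set(input_vertices)
        self.timeout = timeout  # the "predetermined time", in arbitrary units
        self.results = defaultdict(dict)  # superstep -> {source vertex: result}

    def receive(self, superstep, source, result):
        self.results[superstep][source] = result

    def poll(self, superstep, waited):
        """Return ('ready', results), ('delay', late_vertices) or ('wait', None)."""
        got = self.results[superstep]
        missing = sorted(self.input_vertices - set(got))
        if not missing:
            # All input edges are satisfied for this superstep.
            return ("ready", dict(got))
        if waited >= self.timeout:
            # The delay message names the vertices whose results are late.
            return ("delay", missing)
        return ("wait", None)


inbox = Inbox(input_vertices=["V1", "V2"], timeout=3)
inbox.receive(superstep=1, source="V1", result=10)
early = inbox.poll(superstep=1, waited=1)   # V2 missing, timeout not reached
late = inbox.poll(superstep=1, waited=3)    # V2 missing, timeout reached
inbox.receive(superstep=1, source="V2", result=20)
done = inbox.poll(superstep=1, waited=4)    # everything has arrived
```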

The processing server 30 (worker) runs one or more distributed processing units 20, each corresponding to an individual calculation process.
As shown in FIG. 6, the processing server 30 includes virtualization control means 31, message processing means 32, resource management means 33, replica management means 34, and storage means 35.

The virtualization control means 31 uses virtualization technology to construct a virtualization platform on the processing server 30 and controls the placement of a plurality of distributed processing units 20 (virtual machines).
The virtualization control means 31 receives the software constituting a virtual machine from the management server 10 and runs it on the virtualization platform.
When instructed by the replica management means 34 to construct a replica (copy) of a distributed processing unit 20 running on another processing server 30, the virtualization control means 31 acquires the software from the management server 10 and runs it on the virtualization platform. For distributed processing units 20 that, judging from the graph topology indicating their connection relationships, might be run as replicas, the software may be acquired from the management server 10 in advance and stored in the storage means 35.
When instructed by the replica management means 34 to stop a replica, the virtualization control means 31 stops the replica and discards it from the virtualization platform.

The message processing means 32 transmits and receives messages.
Here, the message processing means 32 includes data transfer means 321, delay control means 322, and replica control means 323.

The data transfer means 321 transmits and receives calculation completion messages containing calculation results between distributed processing units, based on the graph topology stored in the storage means 35 that indicates the connection relationships among the distributed processing units 20 (vertices).
The data transfer means 321 forwards a calculation completion message received from a distributed processing unit 20 running on its own or another processing server 30 to the destination given by the graph topology (the unit at the other end of the output edge), which is a distributed processing unit 20 running on its own or another processing server 30.
When the data transfer means 321 receives a calculation completion message from a distributed processing unit 20 on its own processing server 30 and a replica of that distributed processing unit 20 is running on another processing server 30, it sends that processing server 30 a replica control message containing an instruction to stop the replica. This is because the original distributed processing unit 20 completed the calculation before the replica did, so the replica's calculation is no longer needed. Whether a replica is running is obtained from the replica management means 34. Information identifying the replica (such as the vertex number of the original distributed processing unit 20) is added to the replica control message.

When the data transfer means 321 receives a calculation completion message from a replica on another processing server 30, it terminates the current superstep's calculation in the original distributed processing unit 20 on its own processing server 30 that corresponds to the replica. It then generates a calculation completion message that carries the calculation result of the received message in place of the result of the original distributed processing unit 20, and sends it to the destination.
When the data transfer means 321 receives a calculation completion message from a replica on its own processing server 30, it discards the replica via the replica management means 34 and sends the calculation completion message to the processing server 30 on which the original distributed processing unit 20 runs.

The delay control means 322 handles delay messages received from its own or another processing server 30.
When the delay control means 322 receives a delay message from a distributed processing unit 20 on its own processing server 30, that is, when that distributed processing unit 20 has entered a wait state for a calculation result, it checks whether there are resources to construct a replica of the other distributed processing unit 20 whose calculation result is output to the waiting unit. If there are, it sends the delay message to the processing server 30 that runs that other distributed processing unit 20. Information (a vertex number or the like) identifying the distributed processing unit 20 that is the subject of the delay is added to this delay message.

This notifies the other processing server 30 that the sending processing server 30 is able to run a replica. Whether there are resources to construct a replica is determined by querying the resource management means 33. If there are no resources to construct a replica, the delay control means 322 discards the received delay message.

When the delay control means 322 receives a delay message from a distributed processing unit 20 of another processing server 30, it replies to the processing server 30 that sent the message with a replica control message. The replica control message contains an instruction to construct a replica of the distributed processing unit 20 whose calculation result is destined for the distributed processing unit 20 that sent the delay message. It also carries information (a vertex number or the like) identifying the distributed processing unit 20 that is the original of the replica and, if calculation results exist from other distributed processing units 20 adjacent to the original on its input edges, those calculation results. As a result, a replica is constructed on the processing server 30 that sent the delay message, and it can start calculating.
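The two halves of delay control (the waiting server forwards or discards a delay message; the lagging server answers with a build instruction) can be sketched as follows. This Python sketch is illustrative, not the patent's implementation. The message classes, field names, and the functions `on_local_delay` and `on_remote_delay` are hypothetical, and the resource check is reduced to a boolean that a resource manager would supply.

```python
from dataclasses import dataclass, field


@dataclass
class DelayMessage:
    lagging_vertex: str   # vertex whose completion message is late
    sender_server: str    # waiting server that could host a replica


@dataclass
class ReplicaControlMessage:
    command: str          # "build" here; a "stop" command is used elsewhere
    original_vertex: str  # identifies the original of the replica
    inputs: dict = field(default_factory=dict)  # results already received on input edges


def on_local_delay(lagging_vertex, own_server, has_free_resources):
    """Waiting side: forward the delay message only if a replica could run here."""
    if not has_free_resources:
        return None  # the delay message is discarded
    return DelayMessage(lagging_vertex, own_server)


def on_remote_delay(msg, received_inputs):
    """Lagging side: reply with an instruction to build a replica."""
    return ReplicaControlMessage("build", msg.lagging_vertex, dict(received_inputs))


forwarded = on_local_delay("V2", own_server="worker2", has_free_resources=True)
dropped = on_local_delay("V2", own_server="worker2", has_free_resources=False)
reply = on_remote_delay(forwarded, received_inputs={"V1": 10})
```

Attaching the already-received inputs to the reply is what lets the replica start calculating immediately instead of waiting for those results again.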

The replica control means 323 controls the operation of replicas of distributed processing units 20 on the processing server 30.
When the replica control means 323 receives from another processing server 30 a replica control message containing an instruction to construct a replica, it outputs to the replica management means 34 an instruction to construct the replica of the distributed processing unit 20. This instruction includes information (a vertex number or the like) identifying the original distributed processing unit 20. The replica control means 323 gives the distributed processing units 20 that its own server is in charge of priority over replicas. For example, the replica control means 323 has the OS running on the processing server 30 assign a higher priority to the processes of the server's own distributed processing units 20 than to replica processes.

When the replica control means 323 receives from another processing server 30 a replica control message containing an instruction to stop a replica, it outputs to the replica management means 34 an instruction to stop the replica of the distributed processing unit 20. This instruction includes information (a vertex number or the like) identifying the original distributed processing unit 20 that corresponds to the replica.
The message processing means 32 may also include an end control unit that terminates all distributed processing units 20 when an end instruction message is received from the management server 10.

The resource management means 33 manages the resources of the processing server 30 (number of CPU cores, amount of memory, and so on).
Specifically, the resource management means 33 determines the currently free resources from the resources used by the distributed processing units 20 run by the virtualization control means 31, and answers queries from the delay control means 322 about whether there are resources to start a replica.
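The free-resource query can be sketched as a subtraction of per-unit usage from the server's totals. This is a minimal Python sketch under assumptions: the source names only CPU cores and memory as example resources and does not say how the comparison is made, so the `Resources` fields, the `ResourceManager` class, and the "every dimension must fit" rule are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpu_cores: int
    memory_mb: int


class ResourceManager:
    """Tracks what running distributed processing units use on one server."""

    def __init__(self, total: Resources):
        self.total = total
        self.in_use = {}  # vertex id -> Resources held by that unit

    def allocate(self, vertex, need: Resources):
        self.in_use[vertex] = need

    def free(self) -> Resources:
        cores = self.total.cpu_cores - sum(r.cpu_cores for r in self.in_use.values())
        mem = self.total.memory_mb - sum(r.memory_mb for r in self.in_use.values())
        return Resources(cores, mem)

    def can_host(self, need: Resources) -> bool:
        # Assumed rule: a replica fits only if every resource dimension fits.
        avail = self.free()
        return avail.cpu_cores >= need.cpu_cores and avail.memory_mb >= need.memory_mb


rm = ResourceManager(Resources(cpu_cores=2, memory_mb=4096))
rm.allocate("V3", Resources(cpu_cores=1, memory_mb=1024))
```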

The replica management means 34 manages the construction and operation of replicas.
When instructed by the replica control means 323 to construct a replica, the replica management means 34 issues to the virtualization control means 31 a replica construction instruction that includes information (a vertex number or the like) identifying the distributed processing unit 20. At this time, the replica management means 34 records that a replica has been started for the distributed processing unit 20, for example by setting a flag corresponding to the distributed processing unit 20.

When instructed by the replica control means 323 to stop a replica, the replica management means 34 issues to the virtualization control means 31 a replica stop instruction that includes information (a vertex number or the like) identifying the distributed processing unit 20. At this time, the replica management means 34 records that no replica is running for the distributed processing unit 20, for example by resetting the flag corresponding to the distributed processing unit 20.
The replica management means 34 also answers queries from the data transfer means 321 about whether a replica is running for a distributed processing unit 20, for example by referring to whether the flag is set or reset.
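The flag-based bookkeeping described for the replica management means 34 can be sketched as a small per-vertex table. This Python sketch is an assumption-laden illustration: the class and method names are hypothetical, and the calls to the virtualization control means 31 are omitted and only noted in comments.

```python
class ReplicaManager:
    """Tracks, per original vertex, whether a replica is currently running."""

    def __init__(self):
        self._running = {}  # vertex id -> flag (True = set, False = reset)

    def build(self, vertex):
        # A real system would also instruct the virtualization layer to build it.
        self._running[vertex] = True

    def stop(self, vertex):
        # A real system would also instruct the virtualization layer to stop it.
        self._running[vertex] = False

    def is_running(self, vertex) -> bool:
        # Vertices never seen before are treated as having no replica.
        return self._running.get(vertex, False)


manager = ReplicaManager()
before = manager.is_running("V2")
manager.build("V2")
during = manager.is_running("V2")
manager.stop("V2")
after = manager.is_running("V2")
```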

The storage means 35 stores various information used by the processing server 30.
The storage means 35 stores the graph topology, notified from the management server 10, that indicates the connection relationships among the distributed processing units 20, and the resources that the distributed processing units 20 require (number of CPU cores, amount of memory, and so on).
Here, the distributed synchronous processing system 1 is configured with a management server 10 and processing servers 30. Alternatively, one of the plurality of processing servers 30 may serve as a representative server, with the functions of the management server 10 provided inside that representative server.

≪Operation of the distributed synchronous processing system≫
Next, the operation of the distributed synchronous processing system 1 according to the embodiment of the present invention is described with reference to FIG. 7 (for the configuration, see FIGS. 3 to 6 as appropriate). FIG. 7 is a flowchart showing the processing flow of the distributed synchronous processing system 1 according to the present embodiment.

First, the management server 10 (master) uses the distributed arrangement means 11 to place the distributed processing units 20 on the plurality of processing servers 30 (workers), based on a preset graph topology that indicates the connection relationships among the distributed processing units 20 (vertices) (step S1). Here, the distributed arrangement means 11 transmits the software of the distributed processing units 20 to the processing servers 30.

Each processing server 30 then uses the virtualization control means 31 to construct a virtualization platform on the processing server 30 and starts the calculation processing by running the distributed processing units 20 as virtual machines on that platform (step S2). The operations from step S2 onward take place on each processing server 30.

The processing server 30 then waits, by means of the message processing means 32, until it receives a message from a distributed processing unit 20 running on its own processing server 30 or from another processing server 30 (step S3: No).
When the message processing means 32 receives a message (step S3: Yes) and the message is a calculation completion message (step S4: Yes), the data transfer means 321 performs the data transfer processing (step S5).
The data transfer processing that the data transfer means 321 performs in step S5 is described in more detail with reference to FIG. 8.

As shown in FIG. 8, the data transfer means 321 analyzes the sender of the received calculation completion message (step S20).
If the sender of the calculation completion message is a distributed processing unit 20 running on another processing server 30 (step S20: distributed processing unit of another processing server), the data transfer means 321 sends the calculation completion message to the destination distributed processing unit 20 (the unit at the other end of the output edge) running on its own processing server 30 (step S21).

If the sender of the calculation completion message is a distributed processing unit 20 running on its own processing server 30 (step S20: distributed processing unit of own processing server), the data transfer means 321 determines whether a replica of that distributed processing unit 20 is running on another processing server 30 by querying the replica management means 34 (step S22).

If a replica of that distributed processing unit 20 is running on another processing server 30 (step S22: Yes), the data transfer means 321 sends the processing server 30 that is running the replica a replica control message containing an instruction to stop the replica (step S23).

If no replica of that distributed processing unit 20 is running on another processing server 30 (step S22: No), or after step S23, the data transfer means 321 sends the calculation completion message to the destination (the unit at the other end of the output edge), which is a distributed processing unit 20 running on its own or another processing server 30 (step S24).

When the source of the calculation completion message is a replica operating on another processing server 30 (step S20: replica on another processing server), the data transfer means 321 terminates the calculation of the current superstep in the original distributed processing unit 20 corresponding to the replica, and generates a calculation completion message in which the calculation result of the received calculation completion message takes the place of the calculation result of the original distributed processing unit 20 (step S25).

The data transfer means 321 then transmits the calculation completion message generated in step S25 to the destination (the counterpart on the output edge) of the original distributed processing unit 20 (step S26).
When the source of the calculation completion message is a replica operating on its own processing server 30 (step S20: replica on own processing server), the data transfer means 321 discards the replica via the replica management means 34 (step S27).
The data transfer means 321 then transmits the calculation completion message to the processing server 30 on which the original distributed processing unit 20 operates (step S28).
Through steps S20 to S28 above, the data transfer means 321 performs the data transfer process.
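
The branching of steps S20 to S28 can be sketched as follows. This is only an illustrative sketch, not the implementation of the embodiment: the names `Source`, `Completion`, `transfer`, and `replica_hosts` (a stand-in for the replica management means 34) are assumptions introduced here, and the actions are returned as labelled tuples rather than actually sent.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Source(Enum):
    """Hypothetical classification of the sender examined in step S20."""
    REMOTE_UNIT = auto()     # distributed processing unit on another server
    LOCAL_UNIT = auto()      # distributed processing unit on this server
    REMOTE_REPLICA = auto()  # replica running on another server
    LOCAL_REPLICA = auto()   # replica running on this server


@dataclass
class Completion:
    vertex: int      # vertex number of the original distributed processing unit
    result: float    # calculation result carried by the message
    source: Source


def transfer(msg, replica_hosts):
    """Return the (step, action, argument) tuples for one completion message.

    replica_hosts maps a local vertex number to the server that runs its
    replica; it is an assumed stand-in for the replica management means 34.
    """
    actions = []
    if msg.source is Source.REMOTE_UNIT:
        actions.append(("S21", "deliver_to_local_destination", msg.vertex))
    elif msg.source is Source.LOCAL_UNIT:
        host = replica_hosts.get(msg.vertex)  # S22: is a replica running elsewhere?
        if host is not None:
            actions.append(("S23", "send_stop_replica", host))
        actions.append(("S24", "send_to_destination", msg.vertex))
    elif msg.source is Source.REMOTE_REPLICA:
        actions.append(("S25", "stop_original_and_adopt_result", msg.vertex))
        actions.append(("S26", "send_to_destination", msg.vertex))
    else:  # Source.LOCAL_REPLICA
        actions.append(("S27", "discard_replica", msg.vertex))
        actions.append(("S28", "send_to_original_server", msg.vertex))
    return actions
```

For example, `transfer(Completion(2, 1.0, Source.LOCAL_UNIT), {2: "worker2"})` yields the S23 action followed by the S24 action, matching the case in which the original finishes while a replica is still running elsewhere.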

Although not illustrated here, the distributed processing unit 20 receives the calculation completion message transmitted in step S21, S24, or S26 via the message transmission/reception means 22. The distributed processing unit 20 then uses the numerical calculation means 21 to perform its calculation with the calculation result in the received calculation completion message as input and, if there is a destination (a counterpart on the output edge), transmits a calculation completion message to that destination via the message processing means 32.

At this time, if the calculation requires a calculation result from another distributed processing unit 20 and no calculation completion message is received within a predetermined time, the distributed processing unit 20 uses the message transmission/reception means 22 to send the message processing means 32 a delay message carrying information (such as a vertex number) that identifies the other distributed processing unit 20 whose calculation completion message is late.
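
The timeout check that triggers a delay message can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the dictionary layout of the message, and the use of a monotonic clock are all hypothetical, since the embodiment only specifies "a predetermined time" and "information (such as a vertex number)".

```python
import time


def late_vertices(expected, received, started_at, timeout, now=None):
    """Return the vertex numbers whose completion message is overdue.

    expected   : vertex numbers on the input edges of the waiting unit
    received   : vertex numbers whose completion message has arrived
    started_at : time at which the unit began waiting (seconds)
    timeout    : the predetermined waiting time (seconds, assumed unit)
    """
    now = time.monotonic() if now is None else now
    if now - started_at < timeout:
        return []  # still within the predetermined time
    return sorted(set(expected) - set(received))


def make_delay_message(waiting_vertex, late_vertex):
    """Build a delay message; the field names are illustrative only."""
    return {"type": "delay",
            "waiting_vertex": waiting_vertex,
            "late_vertex": late_vertex}
```

With `expected=[2]`, nothing received, and the timeout elapsed, `late_vertices` returns `[2]`, and one delay message per late vertex would then be handed to the message processing means 32.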
Returning to FIG. 7, the description of the operation of the distributed synchronous processing system 1 will be continued.

When the message received in step S3 is a delay message (step S6: Yes), the processing server 30 performs the delay control process using the delay control means 322 (step S7).
The delay control process performed by the delay control means 322 in step S7 is described in more detail with reference to FIG. 9.

As shown in FIG. 9, the delay control means 322 analyzes the source of the received delay message (step S30).
If the source of the delay message is a distributed processing unit 20 operating on another processing server 30 (step S30: distributed processing unit of another processing server), the delay control means 322 transmits, to the source processing server 30, a replica control message containing an instruction to construct a replica whose calculation result is destined for the distributed processing unit 20 that sent the message (step S31). Information (such as a vertex number) identifying the distributed processing unit 20 that is the original of the replica is attached to this replica control message.

On the other hand, if the source of the delay message is a distributed processing unit 20 operating on its own processing server 30 (step S30: distributed processing unit of own processing server), the delay control means 322 determines whether resources are available for constructing a replica by querying the resource management means 33 (step S32).

If it determines that resources are available for constructing a replica (step S32: Yes), the delay control means 322 transmits a delay message to the processing server 30 that runs the other distributed processing unit 20 named in the delay message (step S33). Information (such as a vertex number) identifying the distributed processing unit 20 that is late is attached to this delay message. If it determines that no resources are available for constructing a replica (step S32: No), the delay control means 322 discards the delay message (not shown).
Through steps S30 to S33 above, the delay control means 322 performs the delay control process.
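
Steps S30 to S33 can be sketched as a single handler. The sketch below is illustrative only: `handle_delay`, the message fields, the `has_spare_resources` callback (standing in for the resource management means 33), and the `host_of` placement table are assumptions introduced for the example and do not appear in the embodiment.

```python
def handle_delay(msg, own_server, has_spare_resources, host_of):
    """Decide what to send in response to a delay message (steps S30 to S33).

    msg                 : dict with 'from_server', 'waiting_vertex', 'late_vertex'
    own_server          : name of the server running this handler
    has_spare_resources : callable returning True if a replica could be built here
    host_of             : assumed table mapping a vertex number to its server
    Returns the message to send, or None if the delay message is discarded.
    """
    # S30: was the delay message sent by a unit on another server?
    if msg["from_server"] != own_server:
        # S31: tell the waiting server to build a replica of the late unit.
        return {"to": msg["from_server"],
                "type": "replica_control",
                "command": "build",
                "original_vertex": msg["late_vertex"],
                "result_destination": msg["waiting_vertex"]}
    # S32: the waiting unit is local, so check for spare resources first.
    if has_spare_resources():
        # S33: forward the delay message to the server running the late unit.
        return {"to": host_of[msg["late_vertex"]],
                "type": "delay",
                "from_server": own_server,
                "waiting_vertex": msg["waiting_vertex"],
                "late_vertex": msg["late_vertex"]}
    return None  # S32: No, so the delay message is discarded
```

Note the design point this makes visible: the resource check happens on the waiting server before anything is sent, so a server without spare capacity never causes a replica to be requested.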
Returning to FIG. 7, the description of the operation of the distributed synchronous processing system 1 will be continued.

When the message received in step S3 is a replica control message (step S8: Yes), the processing server 30 performs the replica control process using the replica control means 323 (step S9).
The replica control process performed by the replica control means 323 in step S9 is described in more detail with reference to FIG. 10.

As shown in FIG. 10, the replica control means 323 analyzes the instruction contained in the received replica control message (step S40).
If the instruction in the replica control message is to construct a replica (step S40: construct), the replica control means 323 uses the replica management means 34 to construct the replica on the processing server 30 (step S41), based on the information attached to the replica control message that identifies the distributed processing unit 20 serving as the original of the replica, and starts its calculation processing (step S42).

On the other hand, if the instruction in the replica control message is to stop a replica (step S40: stop), the replica control means 323 identifies the replica from the information attached to the replica control message (such as the vertex number of the original distributed processing unit 20), and uses the replica management means 34 to stop the operation of the replica on the processing server 30 (step S43) and discard it from the virtualization platform (step S44).
Through steps S40 to S44 above, the replica control means 323 performs the replica control process.
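
Steps S40 to S44 amount to a small command dispatcher. The following is a minimal sketch, assuming a hypothetical `ReplicaControl` class that tracks running replicas in a set keyed by the vertex number of the original; a real system would start and destroy virtual machines on the virtualization platform instead.

```python
class ReplicaControl:
    """Illustrative stand-in for the replica control means 323."""

    def __init__(self):
        # Vertex numbers of the originals whose replicas run on this server.
        self.running = set()

    def handle(self, msg):
        """Process one replica control message and report what was done."""
        vertex = msg["original_vertex"]
        command = msg["command"]            # S40: analyze the instruction
        if command == "build":
            self.running.add(vertex)        # S41: construct, S42: start calculating
            return "started"
        if command == "stop":
            if vertex in self.running:
                self.running.discard(vertex)  # S43: stop, S44: discard
                return "discarded"
            return "no_replica"             # nothing to stop (assumed behaviour)
        raise ValueError(f"unknown command: {command!r}")
```

Handling a stop for a replica that does not exist is not covered by the description; returning `"no_replica"` is a choice made for this sketch.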
Returning to FIG. 7, the description of the operation of the distributed synchronous processing system 1 will be continued.

When the processing server 30 receives an end instruction message from the management server 10 (step S10: Yes), it ends the operation of the distributed processing units 20 (step S11), and the operation of the entire system ends.
On the other hand, if no end instruction message is received (step S10: No), the processing server 30 returns to step S3 and continues operating.

With the distributed synchronous processing system 1 configured and operated as described above, a processing server 30 that is waiting for synchronization and has free resources can start, in parallel by means of a replica, the operation of a distributed processing unit 20 belonging to another processing server 30 (speculative execution). As a result, the distributed synchronous processing system 1 can shorten the processing time of the system as a whole.

≪Application example of distributed synchronous processing system≫
Next, an application example in which the calculation represented by the graph topology (graph G) shown in FIG. 11 is performed by the distributed synchronous processing system 1 according to the embodiment of the present invention is described with reference to FIGS. 12 to 17.
Here, as shown in FIG. 12, the distributed synchronous processing system 1 consists of two processing servers 30 (worker 1 and worker 2); of the vertices V₁ to V₃, which are distributed processing units 20, worker 1 is responsible for vertices V₁ and V₂ and worker 2 is responsible for vertex V₃. To simplify the description, the operation steps of vertices V₁ to V₅ are assumed to run in numerical order on a single CPU core.
In the following description, to distinguish the multiple distributed processing units 20 and processing servers 30, each distributed processing unit 20 is called a vertex (vertex V₁, V₂, ...) and each processing server 30 is called a worker (worker 1, 2, ...).

As shown in FIG. 12, worker 1 executes processing for vertices V₁, V₂, and V₃ in steps 1, 2, and so on. Worker 2 executes processing for vertex V₄ in steps 1, 2, and so on. Worker 3 executes processing for vertex V₅ in steps 1, 2, and so on. A situation is thus assumed here in which the processing load is not evenly distributed among the workers.

FIG. 13 shows the state in which each worker has completed processing up to step 4. Here, vertex V₄ has completed its own calculation and transmits a calculation completion message to vertices V₁ and V₃, to which it is connected by output edges. However, vertex V₄ cannot move to the next superstep until it obtains a calculation completion message from vertex V₂, to which it is connected by an input edge, and so it enters a waiting state.
Therefore, if vertex V₄ does not receive the calculation completion message within the predetermined time and worker 2 has enough spare resources to run at least vertex V₂, worker 2 transmits to worker 1 a delay message indicating that vertex V₂ is late. If worker 2 has no spare resources, it does not transmit a delay message to worker 1. The following description assumes that worker 2 has spare resources.

FIG. 14 shows the state in which worker 1, having received the delay message, has transmitted to worker 2 a replica control message instructing construction of vertex V₂, and worker 2 has constructed a replica R of vertex V₂. The replica control message contains information identifying vertex V₂ and the information needed to run vertex V₂ (for example, from superstep SS2 onward, the calculation result of vertex V₁). This allows worker 2 to run vertex V₂ itself. At this time, worker 2 gives priority to the vertices it is itself responsible for over the replica.

FIG. 15 shows the state in which each worker has completed processing up to step 8. Here, worker 2 transmits a calculation completion message to worker 1, which is running the original vertex V₂, to report that replica R of vertex V₂ has completed its calculation. Worker 1 then interrupts the calculation of the original vertex V₂ and transmits the calculation result received from worker 2, as a calculation completion message, to vertices V₁ and V₄, which are connected to vertex V₂ by output edges.
FIG. 15 also shows a replica of vertex V₃ constructed on worker 3; because it operates in the same way as replica R of vertex V₂, its description is omitted here.

FIG. 16 shows the state in which each worker has completed processing up to step 9. Here, because worker 2 has obtained the calculation result of vertex V₂, it has moved to the next superstep SS2.
Meanwhile, on worker 1, the calculation of vertex V₂ was completed early by the replica running on worker 2, so from step 10 onward only vertices V₁ and V₃ run. This shortens the processing time of worker 1.

FIG. 17 shows the state in which the original vertex V₂ running on worker 1 has completed processing earlier than replica R. In this case, worker 1 transmits to worker 2 a replica control message instructing it to stop replica R of vertex V₂, and also transmits a calculation completion message to worker 2, on which vertex V₄, connected to vertex V₂ by an output edge, runs.
Worker 2 then stops the calculation of replica R of vertex V₂, and vertex V₄ obtains the calculation result of the original vertex V₂ and moves to the next superstep.
In this way, the distributed synchronous processing system 1 can shorten the processing time of the system as a whole even when the processing load is not evenly distributed among the workers.

1 Distributed synchronous processing system
10 Management server (master)
11 Distributed arrangement means
20 Distributed processing unit (vertex)
21 Numerical calculation means
22 Message transmission/reception means
30 Processing server (worker)
31 Virtualization control means
32 Message processing means
321 Data transfer means
322 Delay control means
323 Replica control means
33 Resource management means
34 Replica management means
35 Storage means

Claims

A distributed synchronous processing system in which a plurality of processing servers, each operating one or more distributed processing units that are virtual machines, are connected, and in which calculation processing is performed synchronously between the distributed processing units according to a predetermined graph topology,
wherein each processing server comprises:
resource management means for managing resources;
data transfer means for transmitting and receiving, between the distributed processing units, a calculation completion message including a calculation result;
delay control means for transmitting a delay message indicating a processing delay to another processing server when a distributed processing unit on the own processing server is waiting for a calculation result from a distributed processing unit on the other processing server and the resources managed by the resource management means include resources for operating the distributed processing unit of the other processing server; and
replica control means for, upon receiving from the other processing server, in response to the delay message, a replica control message instructing construction of a replica of the distributed processing unit that caused the waiting state, constructing the replica on the own processing server and performing calculation processing with the replica,
and wherein the delay control means, upon receiving the delay message from another processing server, returns a replica control message instructing construction of the replica.

The distributed synchronous processing system according to claim 1, wherein the data transfer means:
when the replica on the own processing server completes its calculation, transmits a calculation completion message including the calculation result of the replica to the other processing server on which the distributed processing unit that is the original of the replica operates; and
upon receiving from another processing server a calculation completion message including the calculation result of the replica, terminates the calculation of the original distributed processing unit and transmits the calculation completion message to the output destination of the original distributed processing unit.

The distributed synchronous processing system described above, further comprising replica management means for managing the state in which a replica of an original distributed processing unit has been constructed,
wherein, when transmitting a calculation result from a distributed processing unit on the own processing server to another distributed processing unit, the data transfer means transmits, if the replica management means indicates that a replica of that distributed processing unit has been constructed, a replica control message instructing the other processing server that has constructed the replica to stop the operation of the replica, and
the replica control means stops the operation of the replica upon receiving from another processing server a replica control message instructing it to stop the operation of the replica.

The distributed synchronous processing system according to claim 1, wherein the processing server operates a distributed processing unit operating on the own processing server in preference to a replica operating on the own processing server.

A distributed synchronous processing method for a system in which a plurality of processing servers, each operating one or more distributed processing units that are virtual machines, are connected, and in which calculation processing is performed synchronously between the distributed processing units according to a predetermined graph topology,
wherein each processing server executes:
a step of transmitting a delay message indicating a processing delay to another processing server when a distributed processing unit on the own processing server is waiting for a calculation result from a distributed processing unit on the other processing server and the resources managed by resource management means include resources for operating the distributed processing unit of the other processing server;
a step of, upon receiving from the other processing server, in response to the delay message, a replica control message instructing construction of a replica of the distributed processing unit that caused the waiting state, constructing the replica on the own processing server and performing calculation processing with the replica; and
a step of, upon receiving the delay message from another processing server, returning a replica control message instructing construction of the replica.

A distributed synchronous processing program for causing a computer of a processing server, in a distributed synchronous processing system in which a plurality of processing servers, each operating one or more distributed processing units that are virtual machines, are connected and calculation processing is performed synchronously between the distributed processing units according to a predetermined graph topology, to function as:
resource management means for managing resources;
data transfer means for transmitting and receiving, between the distributed processing units, a calculation completion message including a calculation result;
delay control means for transmitting a delay message indicating a processing delay to another processing server when a distributed processing unit on the own processing server is waiting for a calculation result from a distributed processing unit on the other processing server and the resources managed by the resource management means include resources for operating the distributed processing unit of the other processing server; and
replica control means for, upon receiving from the other processing server, in response to the delay message, a replica control message instructing construction of a replica of the distributed processing unit that caused the waiting state, constructing the replica on the own processing server and performing calculation processing with the replica,
wherein the delay control means, upon receiving the delay message from another processing server, returns a replica control message instructing construction of the replica.