JP2013069189A

JP2013069189A - Parallel distributed processing method and parallel distributed processing system

Info

Publication number: JP2013069189A
Application number: JP2011208477A
Authority: JP
Inventors: Akihiro Ito; 昭博伊藤
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2011-09-26
Filing date: 2011-09-26
Publication date: 2013-04-18

Abstract

PROBLEM TO BE SOLVED: To move entries efficiently in a parallel distributed processing system without making the user aware of the time taken to move them. SOLUTION: In a parallel distributed processing method and a parallel distributed processing system 100 that have a plurality of nodes and perform parallel processing in a plurality of entries 22, which are virtual calculation units operating on each node and are movable among the nodes, calculation processing is performed in each entry 22 and, along with it, the data of any entry 22 whose node is to be changed is moved to the destination node of that entry 22.

Description

The present invention relates to a parallel distributed processing method and a parallel distributed processing system that perform parallel processing in entries arranged on each node.

Pregel (see Non-Patent Document 1) is one framework for parallel distributed processing. It executes processing based on a graph structure. In Pregel, each entry (or vertex), which is a logical calculation unit, holds data, and entries exchange messages along edges, which are virtual connections between entries. When an entry receives a message from another entry, it performs a predetermined process (calculation processing) based on the received message and the data it holds, and updates that data. If necessary, it also transmits messages to other entries connected to it by edges. If the identifier of the destination entry is known in advance, a message can also be sent to an entry to which no edge is defined.

The entries are distributed across a plurality of nodes (logical computers). Each entry performs calculation and message transmission according to the BSP (Bulk Synchronous Parallel) model. That is, calculation processing and message transmission are performed alternately, in synchronization across the entries. After calculation processing (parallel calculation processing) has finished in all entries, each entry transmits its messages to the next entry (message transmission). When all messages have been transmitted, each entry performs parallel calculation processing again. A message transmitted as a result of one round of parallel calculation processing therefore becomes usable only in the next round. Such a system can prevent data access conflicts and guarantee that processing is deterministic (it always produces the same result).
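
The alternation of calculation and message transmission described above can be sketched as a small loop. This is only an illustration under stated assumptions, not Pregel's API: the names `run_bsp`, `compute`, and `forward_max`, and the use of plain dictionaries, are hypothetical. Messages produced in one round are collected in a separate outbox and become visible only in the next round.

```python
from collections import defaultdict

def run_bsp(values, compute, initial_messages, supersteps):
    """Run a fixed number of BSP rounds over entries keyed by identifier.

    values: dict key -> data held by the entry (hypothetical layout)
    compute: user function (key, value, messages) -> (new_value, [(dest_key, msg), ...])
    """
    inbox = defaultdict(list, {k: list(v) for k, v in initial_messages.items()})
    for _ in range(supersteps):
        outbox = defaultdict(list)
        # Calculation: every entry sees only messages sent in the previous round.
        for key in list(values):
            new_value, outgoing = compute(key, values[key], inbox.get(key, []))
            values[key] = new_value
            for dest, msg in outgoing:
                outbox[dest].append(msg)
        # Synchronization point: sent messages become usable in the next round.
        inbox = outbox
    return values

def forward_max(key, value, messages):
    # Example user program: keep the largest value seen and pass it around a ring.
    new_value = max([value] + messages)
    next_key = {"a": "b", "b": "c", "c": "a"}[key]
    return new_value, [(next_key, new_value)]

result = run_bsp({"a": 3, "b": 7, "c": 1}, forward_max, {}, supersteps=3)
```

Because each entry reads only its own value and the previous round's messages, the order in which entries are visited inside a round does not affect the result.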

Another technique, described in Patent Document 1, equalizes the load on nodes. The distributed database management system of this technique detects an imbalance in the processing load across processors and changes the distributed arrangement of the data so as to even out that imbalance. Because each node processes (performs calculations on) the data that it holds, adjusting the data arrangement equalizes the load on the nodes.

Patent Document 2 describes a technique for equalizing node load and improving fault tolerance in such a parallel distributed processing system. The distributed parallel processing system of this technique creates a replica of each piece of data on a node other than the node that originally owns the data. When a node stops because of a failure, or when the load on a node becomes large, a node holding a replica of the data owned by that node executes the processing on its behalf.

Patent Document 1: JP-A-9-218858
Patent Document 2: JP 2001-101149 A

Non-Patent Document 1: Pregel: a system for large-scale graph processing, SIGMOD 2010

The technique described in Non-Patent Document 1 uses checkpoints as a countermeasure against node failures. A checkpoint is a technique in which the system state (the state of the entries) at a given moment is recorded in a storage device; when a failure occurs, the data is read back from the storage device and the system is returned to the arrangement it had when the checkpoint was taken.
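
The record-and-restore idea behind a checkpoint can be sketched as follows. This is a minimal illustration, not the mechanism of Non-Patent Document 1: the function names, the JSON file layout, and the temporary directory standing in for the storage device are all assumptions made for the example.

```python
import json
import os
import tempfile

def take_checkpoint(path, entries):
    """Record the state of every entry in storage (JSON layout is an assumption)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(entries, f)

def restore_checkpoint(path):
    """Read the recorded state back after a failure."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

with tempfile.TemporaryDirectory() as storage:
    path = os.path.join(storage, "checkpoint.json")
    entries = {"k1": 10, "k2": 20}
    take_checkpoint(path, entries)      # taken while the entry state is settled
    entries["k1"] = 999                 # later progress that a failure would lose
    restored = restore_checkpoint(path)
```

After the restore, `restored` holds the state as of the checkpoint, so any work done after it has to be repeated.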

A checkpoint must be taken while the arrangement of entries is settled, so the system has to take it after message transmission between entries has finished and immediately before parallel calculation processing starts in each entry. In other words, processing time is needed solely for taking checkpoints, and the overall processing time increases.

Further, to equalize node load, the technique described in Non-Patent Document 1 uses a partition function to decide how entries are arranged on nodes. A partition function takes an entry identifier as its argument and outputs the identifier of the node on which that entry should be placed. With this method, the relationship between entries and nodes is fixed.
On the other hand, in a system that can execute multiple jobs at the same time, other jobs may already be running before a job starts, or may be started while the job is running, so the ratio of surplus resources among the nodes changes dynamically. In such a computing environment, a method that fixes the arrangement of entries, as in the technique described in Non-Patent Document 1, may be unable to equalize the load on the nodes.
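
A partition function of the kind described above can be sketched in a few lines. The Pregel paper describes a default of hash(ID) mod N; the CRC32 hash and the name `partition` used here are assumptions chosen only so that the example is reproducible.

```python
import zlib

def partition(entry_key: str, num_nodes: int) -> int:
    """Return the node index for an entry key (hypothetical hash-modulo scheme)."""
    return zlib.crc32(entry_key.encode("utf-8")) % num_nodes

keys = [f"k{i}" for i in range(1, 9)]
placement = {k: partition(k, 4) for k in keys}
```

The result depends only on the key and the node count. Nothing in the function can react to load that other jobs place on a node, which is the fixed entry-to-node relationship noted above.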

As described above, the technique of Patent Document 1 changes the arrangement of data on nodes in order to equalize the processing load of each node. If this technique were applied to Non-Patent Document 1, the data arrangement would have to be changed while the arrangement of entries is settled. As with checkpoint acquisition, processing time would be needed solely for changing the data arrangement, and the overall processing time would increase.

As described above, the technique of Patent Document 2 creates replicas of data on different nodes in advance and, when the load on a node becomes high, has a node holding the replicated data execute the processing on its behalf, thereby equalizing node load. However, if this technique were applied to the technique of Non-Patent Document 1, the replicas would have to be created while the arrangement of entries is settled. As with checkpoint acquisition, processing time would be needed solely for creating the replicas, and the overall processing time would increase.

A technique for efficiently moving entries in a parallel distributed processing system is therefore required.

One means of the present invention is a parallel distributed processing method that has a plurality of nodes and performs parallel processing in a plurality of entries, which are virtual calculation units operating on each node and are movable between the nodes, wherein a node performs calculation processing in the entries and, along with this, moves the data of an entry whose node is to be changed to the destination node of that entry.

Another means of the present invention is a parallel distributed processing method that has a plurality of nodes and performs parallel processing in a plurality of entries, which are virtual calculation units operating on each node and are movable between the nodes, wherein: a first node and a second node exist that are, with respect to each other, a backup source and a backup destination of the entries; the first node performs calculation processing and copies the calculation results to the second node, while the second node performs calculation processing and copies the calculation results to the first node; when all calculation results have been obtained on the first node but not on the second node, the first node holds the calculation results for the entries for which the second node has not obtained results; and when the next calculation processing starts, the first node and the second node perform calculation processing and copy calculation results to each other, and the first node also copies the held calculation results to the second node.
Other means will be described in the embodiments.

According to the present invention, entries can be moved efficiently in a parallel distributed processing system.

A diagram showing a configuration example of the parallel distributed processing system according to the present embodiment.
A hardware configuration diagram of a computer according to the present embodiment.
A diagram showing a model example of parallel processing according to the present embodiment.
A diagram showing an operation model example of parallel processing according to the present embodiment.
A diagram showing an example of the partition table according to the present embodiment.
A diagram showing an example of the data table according to the present embodiment.
A diagram showing an example of entry rearrangement according to the present embodiment.
Diagrams (parts 1 to 6) showing the operation of entry rearrangement according to the first embodiment.
Diagrams (parts 1 to 5) showing the operation of entry rearrangement according to the second embodiment.
Diagrams (parts 1 to 7) showing the operation of entry rearrangement according to the third embodiment.
A diagram showing an operation model example of parallel processing according to the fourth embodiment.

Next, modes for carrying out the present invention (referred to as "embodiments") will be described in detail with reference to the drawings as appropriate.

(System configuration)
FIG. 1 is a diagram illustrating a configuration example of the parallel distributed processing system according to the present embodiment.
The parallel distributed processing system 100 has one master node 1, a plurality of slave nodes 2, and a client node 3.
The master node 1 and the slave nodes 2 are each logical computers; a single server may run several nodes (master node 1, slave nodes 2) or just one node.
The nodes are connected to one another by a virtual network 4.

The master node 1 is a logical computer that manages the operating status of the slave nodes 2; the actual calculation processing is performed on the slave nodes 2. The master node 1 can also serve as a slave node 2.
Each job executed by the parallel distributed processing system 100 is divided into a plurality of tasks 21 for execution.

The master node 1 runs a job manager 11 and a distributed file system 12.
The job manager 11 manages the execution state of jobs and tasks 21 on the slave nodes 2. Specifically, it manages the correspondence between jobs and tasks 21, the correspondence between each task 21 and the slave node 2 that executes it, the progress rate of each task 21, and so on.
The distributed file system 12 will be described later.
A slave node 2 is a logical computer that performs the actual calculation processing. It runs at least one task 21, a task manager 23, and a distributed file system 24.
A plurality of entries 22 are executed in each task 21. The entries 22 can move between tasks 21 and between nodes.
The task manager 23 manages the execution state of the tasks 21 running on its slave node 2 (such as the progress rate of each running task 21). When executing a task 21, the task manager 23 allocates a plurality of threads to that one task 21, so that CPU resources are used effectively even when the CPU (Central Processing Unit) 201 (FIG. 2) is multi-core.

When a file is large, the single file may be divided and placed on a plurality of nodes.
In such a case, when the file body is distributed across a plurality of nodes as in the parallel distributed processing system 100, the distributed file systems 12 and 24 make all files accessible from any node under the same path name. With the distributed file systems 12 and 24, if the data to be accessed is stored not on the local node but on another node, it is accessed via the distributed file system 12 or 24 of that other node.
Because the slave node that executes a generated task 21 is decided dynamically according to the load status of the group of slave nodes 2, it is not known in advance which slave node 2 will actually perform the calculation processing. By using the distributed file systems 12 and 24, a task 21 can access the specified data position of the specified file no matter which slave node 2 executes it.

The client node 3 is a computer terminal that provides a user interface. When the user operates the client node 3 to issue a job execution instruction to the parallel distributed processing system 100, the client node 3 transmits the instruction to the job manager 11 of the master node 1. The job manager 11 divides the specified job into a plurality of tasks 21, distributes the tasks 21 to the task managers 23 of the slave nodes 2, and instructs the task managers 23 to execute them.

When a slave node 2 executes a plurality of jobs simultaneously, a plurality of tasks 21 may be generated in that one slave node 2, as shown in FIG. 1.

The job manager 11 and the task managers 23 are always resident in the parallel distributed processing system 100 and operate as service processes. That is, when no job is being executed, the job manager 11 and the task managers 23 are in a standby state waiting for a job execution request from a user. A task 21, by contrast, is a process that is generated dynamically each time a job is executed: when job execution starts, the task manager 23 dynamically generates the process corresponding to the task 21. If different jobs affect one another, the performance and stability of job execution may degrade, so the task manager 23 reduces this influence by separating different jobs into different processes.

(Hardware configuration)
FIG. 2 is a hardware configuration diagram of a computer that runs the master node or a slave node according to the present embodiment.
In the computer 200, a CPU 201, a LAN I/F (interface) 202 connected to a LAN (Local Area Network) 250, an input/output I/F 203, a memory 204, and a disk I/F 205 are connected internally via a bus 206. A keyboard 210, a mouse 220, and a display 230 are connected to the input/output I/F 203, and the user uses them to instruct the execution of jobs. An HDD (Hard Disk Drive) 240 is connected to the disk I/F 205; it stores the OS (Operating System), the programs of the service processes that run on each node, and the data to be analyzed. When the parallel distributed processing system 100 (FIG. 1) starts, each program is read from the disk device 109 into the memory 204 and executed by the CPU 201, thereby realizing the job manager 11, the tasks 21, the entries 22, the task manager 23, and so on. Because the user does not perform input/output operations directly on the computers 200 that run the master node 1 (FIG. 1) and the slave nodes 2 (FIG. 2), the keyboard 210, the mouse 220, and the display 230 can be omitted from them.

(Parallel processing model)
FIG. 3 is a diagram illustrating a model example of parallel processing according to the present embodiment. Hereinafter, the parallel distributed processing system 100 according to the present embodiment may be referred to as "the present system 100" or "the system 100" as appropriate.
In the system 100, entries 22 (22a to 22c) that hold data in key and data value form exchange messages with one another. For example, the entry 22a holds the key "k1" and the data value "v11". Similarly, the entry 22b holds the key "k2" and the data value "v21", and the entry 22c holds the key "k3" and the data value "v31".

The key is the identifier of an entry 22. It is created when the entry 22 is generated and does not change while the job is running. The data value is the data held by the entry 22 and may change in the course of parallel processing.
When an entry 22 receives a message from another entry 22, it performs calculation processing based on the data value it holds and the received message, updates the data value if necessary, and transmits new messages to other entries 22. This processing is implemented as a user program. A message is sent by specifying the key of the destination entry. The entries 22 are distributed across the slave nodes 2 in the system 100, but because the job manager 11 manages which slave node 2 each entry 22 is placed on, the user program does not need to specify the destination slave node 2 of a message.

The data value does not have to be a single value; it may be a set of several values or a nested structure. A data value update may therefore update the entire data value or only part of it.

As shown in FIG. 4, in the system 100 a series of parallel calculations and message transmissions is executed in synchronization, based on the BSP model. In FIG. 4, reference numerals 22a to 22c are the same as in FIG. 3, and their description is omitted.
For example, in "Phase 1", each entry 22 performs the processing written in the user program (calculation processing, data value update, message transmission, and so on). The entries 22 are processed in parallel, and the system waits until the processing of all entries 22 has completed (synchronization). When processing has completed in all entries 22, the system 100 moves to "Phase 2", and each entry 22 again performs the processing written in the user program. At this point, the data values "v11", "v21", and "v31" are updated to "v12", "v22", and "v32". In this way, a message transmitted by an entry 22 in "Phase 1" becomes usable only in "Phase 2". In the BSP model, each such "Phase" is called a "superstep". In the present embodiment, a "superstep" is hereinafter referred to as a "parallel processing period", and the processing performed in a distributed manner in each entry 22 during a parallel processing period (calculation processing, data value update, message transmission, and so on) is referred to as parallel processing.
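
The wait between "Phase 1" and "Phase 2" can be sketched with one thread per entry and a barrier. This is an illustrative sketch, not the implementation of the system 100: the function `run_phases`, the ring of destinations, and the fixed message value 1 are assumptions made for the example.

```python
import threading

def run_phases(initial, phases):
    """Run each entry in its own thread; a barrier separates the phases."""
    keys = list(initial)
    values = dict(initial)
    inbox = {k: [] for k in keys}     # messages usable in the current phase
    pending = {k: [] for k in keys}   # messages sent during the current phase
    lock = threading.Lock()

    def swap():
        # Runs once per phase, after every entry has reached the barrier.
        for k in keys:
            inbox[k] = pending[k]
            pending[k] = []

    barrier = threading.Barrier(len(keys), action=swap)

    def entry(key, index):
        dest = keys[(index + 1) % len(keys)]
        for _ in range(phases):
            values[key] += sum(inbox[key])   # calculation and data value update
            with lock:
                pending[dest].append(1)      # message transmission
            barrier.wait()                   # synchronization

    threads = [threading.Thread(target=entry, args=(k, i))
               for i, k in enumerate(keys)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return values

after_two = run_phases({"k1": 10, "k2": 20, "k3": 30}, phases=2)
```

With a single phase the values are unchanged, because the messages sent in that phase have not yet been delivered; the update appears only in the second phase.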

Unlike Non-Patent Document 1, the system 100 does not assume edges between entries 22, but whether or not edges exist makes no essential difference. That is, edges may also be set in the system 100. When the destination entry of a message is not known in the system 100, the source entry 22 must hold the key of that destination entry 22 as a data value, and this is equivalent to creating an edge from one entry 22 to another entry 22.

(Partition table)
FIG. 5 is a diagram illustrating an example of the partition table according to the present embodiment.
The partition table is used to manage which task 21, or which node, executes the entry 22 that has a given key.
As described with reference to FIG. 1, in the system 100 each task 21 holds a plurality of entries 22 and processes them. The list of entries 22 held by each task 21 is managed in the partition table shown in FIG. 5. Each record in the partition table corresponds to one task 21. The range given by the "start key" and the "end key" is the range of keys of the entries 22 held by the corresponding task 21. The task ID (identification) is the identifier of the task 21, and the node ID is the identifier of the node on which that task 21 resides. The destination is the receiving address of the corresponding node. A partition table is generated for each job, and the job manager 11 and the task managers 23 that manage all tasks 21 involved in that job hold the same partition table.
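
Resolving a key to its task and node with such a table can be sketched as a range lookup. The record fields follow the columns named above, but the class and function names, the sample rows, and the choice to treat the end key as inclusive are assumptions; the contents of FIG. 5 are not reproduced here.

```python
from bisect import bisect_right
from dataclasses import dataclass

@dataclass(frozen=True)
class PartitionRecord:
    start_key: str
    end_key: str
    task_id: str
    node_id: str
    destination: str  # receiving address, written here as "address:port"

def find_record(table, key):
    """Return the record whose [start_key, end_key] range contains key, else None.

    Assumes the table is sorted by start_key and the ranges do not overlap.
    """
    starts = [r.start_key for r in table]
    i = bisect_right(starts, key) - 1
    if i >= 0 and table[i].start_key <= key <= table[i].end_key:
        return table[i]
    return None

# Hypothetical rows for illustration only.
table = [
    PartitionRecord("k000", "k099", "t1", "n1", "10.0.0.1:5001"),
    PartitionRecord("k100", "k199", "t2", "n2", "10.0.0.2:5001"),
]
owner = find_record(table, "k150")
```

A sender that knows only the destination key can use the returned `destination` to address the message, which is why the user program never names a slave node.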

The job manager 11 creates the partition table (FIG. 5) at the start of a job and instructs the task managers 23 to generate the tasks 21. The destination of a task 21 consists of the address of the slave node 2 and a port number. When a plurality of jobs run at the same time, a plurality of task processes run on one slave node 2, so each task process must be assigned a different reception port. On receiving an instruction to generate a task 21, the task manager 23 assigns a reception port to the task process and then generates the task process. When generation of the task process is complete, the task manager 23 returns the destination (address and port number) of the task 21 to the job manager 11, and the job manager 11 updates the destination (FIG. 5) of the corresponding task 21 in its own partition table.
When all the tasks 21 have been generated, the job manager 11 transmits the partition table to every task manager 23 that manages tasks 21 of the running job. In this way, the same partition table is distributed to all the task managers 23 when the tasks 21 are generated.
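The partition table lends itself to a range lookup. The following is a minimal sketch, not the implementation of the embodiment: it assumes that keys can be represented as integers (“k150” as 150) and that the records are kept sorted by start key. The class name, function name, addresses, and port numbers are hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionRecord:
    start_key: int      # first key held by the task (inclusive)
    end_key: int        # last key held by the task (inclusive)
    task_id: str
    node_id: str
    destination: tuple  # (address, port) of the task process; hypothetical values below


def find_record(table, key):
    """Return the record whose key range contains `key`.

    `table` must be sorted by start_key and its ranges must not overlap.
    """
    starts = [r.start_key for r in table]
    i = bisect_right(starts, key) - 1   # last record starting at or before `key`
    if i < 0 or key > table[i].end_key:
        raise KeyError(f"no task holds key {key}")
    return table[i]


# Hypothetical three-task table used only for illustration.
TABLE = [
    PartitionRecord(1, 150, "task-1", "node-1", ("10.0.0.1", 5001)),
    PartitionRecord(151, 225, "task-2", "node-2", ("10.0.0.2", 5001)),
    PartitionRecord(226, 300, "task-3", "node-3", ("10.0.0.3", 5001)),
]
```

Because every task manager 23 holds the same table, any task can resolve the destination of a key locally, without asking the job manager 11.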

(Data table)
FIG. 6 is a diagram illustrating an example of a data table according to the present embodiment.
The data table is a table for managing which entry 22 holds which data values.
Each task manager 23 refers to the data table shown in FIG. 6 to manage the data values of the entries 22 that it holds. The data table has the following columns: the key (k) of the entry 22; “Dn”, the data value resulting from the calculation process; “Br”, the messages received during the current parallel processing; “D”, the data value held by the entry 22 before the start of the parallel processing period; and “Bu”, the messages received in the previous parallel processing. “Bu” is the buffer that holds, for the entry, the messages transmitted from other entries. Here, “Dn” = “D” + “Bu” (as described above, a message “Br” received during parallel processing cannot be used until the next parallel processing period). In practice, “Dn” often stores only the difference from “D”. The data table further has “Db”, a backup of the data value, and “Bb”, a backup of the messages. “Db” and “Bb” are used when the entry 22 is duplicated (described later in the third embodiment).
Although every column in FIG. 6 is shown blank, data values are actually stored in them.

The job manager 11 has a data table covering all the entries 22, as shown in FIG. 6. Each task manager 23 may have either a data table covering all the entries 22 in the system 100 or a data table covering only the entries 22 that it is responsible for.

(Overview of parallel processing)
An outline of the parallel processing is given here. The specific processing is described later with reference to FIG. 7 and the subsequent figures.
In a parallel processing period, each entry 22 refers to the data table and executes the user-defined calculation process using its own “D” and “Bu”. As described above, “D” is the data value held by the entry 22 at the start of the parallel processing period, and “Bu” is the messages received in the previous parallel processing period. When the data value is to be updated, the entry 22 stores the difference from the pre-update value as the calculation result “Dn”. When parallel processing has finished for all the entries 22 that a task 21 holds, the task 21 merges “Dn” into “D” (for example, by combining the data) and uses the merged value as the new “D”, which is the data value after the end of the parallel processing period.

Each task 21 transmits messages to entries 22 held by other tasks 21. To do so, the task 21 refers to the destination column (FIG. 5) of the partition table held by the task manager 23 and determines the destination slave node 2. That is, when transmitting a message, the task 21 looks up in the partition table the record whose key range contains the key of the destination entry 22, and transmits the message to the destination (FIG. 5) given in that record. The transmitted message contains the message body and the key of the destination entry 22.

A task 21 that has received a message refers to its own data table (FIG. 6) and adds the received message to the “Br” (messages received during parallel processing) of the entry 22 corresponding to the key contained in the message. There are two ways to add a message to “Br”: (a) simply appending the message, and (b) performing an aggregation calculation. In case (a), “Br” ends up as a list of the messages received from a plurality of entries 22. An example of (b) is a process in which “Br” is the sum of the messages received from all the entries 22. Because addition is associative and commutative, in case (b) the task 21 adds each message to the current value of “Br” as it arrives and stores the result as the new value of “Br”. Such an operation is an aggregation process, and any operation may be used as long as it is associative and commutative. Other examples of aggregation processes are maximum, minimum, and sum of squares. When aggregation is used, the sending task 21 may also aggregate the messages addressed to the same entry 22 in advance and transmit only the aggregated result, which reduces the amount of data transferred over the network.
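The two ways of filling “Br” and the sender-side pre-aggregation can be sketched as follows. This is an illustrative sketch only: “Br” is modeled as a plain dictionary keyed by destination key, messages are numbers, and all function names are hypothetical.

```python
from operator import add


def receive_append(br, key, message):
    """(a) Keep every message: "Br" holds a list per destination key."""
    br.setdefault(key, []).append(message)


def receive_aggregate(br, key, message, op=add):
    """(b) Fold each arriving message into a single value with `op`.

    `op` must be associative and commutative (sum, max, min, ...), so the
    result does not depend on the order in which messages arrive.
    """
    br[key] = message if key not in br else op(br[key], message)


def pre_aggregate(outgoing, op=add):
    """Sender side: combine messages bound for the same key before sending.

    `outgoing` is an iterable of (destination_key, message) pairs; the
    result has one combined message per destination key.
    """
    combined = {}
    for key, message in outgoing:
        receive_aggregate(combined, key, message, op)
    return combined
```

With pre-aggregation, the three pairs `(7, 1)`, `(8, 5)`, `(7, 4)` leave the sender as two messages instead of three, which is where the reduction in network transfer comes from.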

Because an entry 22 may receive messages from other entries 22 at any point during a parallel processing period, the value of “Br” cannot be finalized in the middle of the period. The value of “Br” updated in a parallel processing period is therefore finalized when that period ends, that is, when all the entries 22 have finished their parallel processing for that period. When a parallel processing period ends and “Br” is finalized, the task 21 moves the “Br” received during that period to “Bu” (the message data that the entry 22 holds before the start of a parallel processing period). The task 21 can then use the received messages in the next parallel processing period.
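The life cycle of one parallel processing period can be sketched as below. The sketch assumes numeric data values, an additive merge of “Dn” into “D”, and “Br” kept as a list; the source only says that the values are merged, so these choices and all names are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Row:
    d: float                                        # "D": value at period start
    bu: List[float] = field(default_factory=list)   # "Bu": messages usable now
    dn: Optional[float] = None                      # "Dn": difference from "D"
    br: List[float] = field(default_factory=list)   # "Br": messages arriving now


def compute_all(rows, compute):
    """During the period: read only "D" and "Bu", write only "Dn"."""
    for row in rows.values():
        row.dn = compute(row.d, row.bu) - row.d     # store the difference


def end_of_period(rows):
    """After every entry has finished: merge "Dn" and promote "Br" to "Bu"."""
    for row in rows.values():
        if row.dn is not None:
            row.d += row.dn                         # additive merge (assumption)
            row.dn = None
        row.bu, row.br = row.br, []                 # "Br" becomes the next "Bu"
```

Keeping “Br” separate from “Bu” is what lets messages keep arriving during the period without changing the inputs of entries that have not yet been processed.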

(Specific example of load equalization)
Next, the load equalization method in the system 100 is described with reference to FIGS. 7 to 18, referring to FIG. 1 as appropriate. The method equalizes the calculation load of the slave nodes 2 by adjusting the amount of data held by each slave node 2 according to the node performance. This requires moving data values between slave nodes 2. As described above, however, if the data is moved between one parallel processing period and the next in the BSP model, time is spent solely on data movement. The system 100 of the present embodiment therefore moves data concurrently with the execution of a parallel processing period, hiding the data movement time within the processing time of that period and reducing the overall processing time.

For example, assume that in the initial arrangement the entries 22 are assigned to the nodes as shown in the upper part of FIG. 7 (hereinafter, a slave node 2 is simply called a “node”). The horizontal axis in FIG. 7 represents the keys of the entries 22 held by each node. In the upper part of FIG. 7, “node 1” holds the entries 22 for keys “k1” to “k150”, “node 2” those for “k151” (not shown) to “k225”, and “node 3” those for “k226” (not shown) to “k300”.

If all the nodes have the same performance, the load on “node 1” is large in the upper part of FIG. 7, so the entries 22 are moved as shown in the lower part of FIG. 7 to equalize the load on the nodes. That is, the job manager 11 changes the key range assigned to each node as shown in the lower part of FIG. 7. The job manager 11 moves the entries 22 in the key range “k101” to “k150” from “node 1” to “node 2”, and moves the entries 22 in the key range “k201” to “k225” from “node 2” to “node 3”. As a result, “node 1” has the entries 22 for “k1” to “k100”, “node 2” those for “k101” (not shown) to “k200”, and “node 3” those for “k201” (not shown) to “k300”. Each node thus has 100 entries 22, and the load is equalized if the nodes have the same performance. The nodes are assumed here to have the same performance to simplify the explanation; if the performance differs, the distribution ratio of the entries 22 may be adjusted according to the ratio of node performance.

<< First Embodiment >>
Next, a first embodiment of a specific procedure for the load equalization shown in FIG. 7 is described with reference to FIGS. 8 to 13. The configuration and hardware configuration of the system 100 in the first to third embodiments are the same as those shown in FIGS. 1 and 2, so their description is omitted.
In FIGS. 8 to 25, the vertical axis indicates the keys of the entries 22 held by each node (task 21), and the data values of each entry 22 are laid out in the horizontal direction. A rectangle drawn with a broken line indicates that no data value exists, and a rectangle drawn with a solid line indicates that a data value exists.

In FIGS. 8 to 25, the subscript of the symbol shown in a rectangle indicates the version of the data value. For example, in FIG. 8, “node 1” holds, for every key in its range, “D”, the data value held by the entry 22 before the start of synchronization, and “Bu”, the messages received in the previous parallel processing period. “D” is labeled “D1” and “Bu” is labeled “B1”; the subscript “1” is the version. As described in detail later, in “node 1” of FIG. 9, parallel processing of the entries 22 has finished for the keys “k100” to “k85”, and “d2” is written in the “Dn” portion, which holds the difference of the data value. The subscript is “2” because this value was generated from the version “1” data values (“D1” and “B1”), which raises the version to “2”.

First, before redistributing the data values, each task 21 executes a parallel processing period at least once and measures the execution time of the period on its node. When the period ends, the measurement results are collected from the tasks 21 by the job manager 11, and the job manager 11 calculates from them the ratio of keys to be assigned to each node.
Specifically, when the existing distribution ratio of the entries 22 is x1 : x2 : x3 and the ratio of processing times is t1 : t2 : t3, the job manager 11 adjusts the distribution ratio of the entries 22 to x1/t1 : x2/t2 : x3/t3, which theoretically makes the ratio of the next processing times 1 : 1 : 1. The job manager 11 calculates this distribution ratio by referring to the partition table (FIG. 5).
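The ratio x1/t1 : x2/t2 : x3/t3 can be turned into whole entry counts as in the following sketch. The source gives only the ratio; keeping the total number of entries fixed and handing out the rounding remainder to the largest fractional parts are assumptions added here, and the function name is hypothetical.

```python
def rebalance(counts, times):
    """Return new per-node entry counts proportional to counts[i] / times[i].

    The total number of entries is preserved. Entries left over after
    rounding down go to the nodes with the largest fractional parts.
    """
    if len(counts) != len(times) or any(t <= 0 for t in times):
        raise ValueError("need one positive processing time per node")
    total = sum(counts)
    speeds = [x / t for x, t in zip(counts, times)]   # entries per unit time
    scale = total / sum(speeds)
    exact = [s * scale for s in speeds]               # ideal, possibly fractional
    result = [int(e) for e in exact]                  # round down
    leftovers = total - sum(result)
    by_fraction = sorted(range(len(exact)),
                         key=lambda i: exact[i] - result[i], reverse=True)
    for i in by_fraction[:leftovers]:
        result[i] += 1
    return result
```

For three nodes of equal speed holding 150, 75, and 75 entries, the measured times are in the ratio 2 : 1 : 1, and `rebalance([150, 75, 75], [3.0, 1.5, 1.5])` yields 100 entries per node.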

Assume that the assignment of entries 22 calculated in this way is, as shown in the lower part of FIG. 7, “k1” to “k100” for “node 1”, “k101” to “k200” for “node 2”, and “k201” to “k300” for “node 3”. The job manager 11 distributes the key ranges after rearrangement (that is, the partition table of FIG. 5) to each node and requests each task 21 to start the next parallel processing period.
The specific procedure for rearranging the entries 22 from the arrangement in the upper part of FIG. 7 to that in the lower part of FIG. 7 is described below.

(FIG. 8)
At the start of the parallel processing period, the entries 22 held by each node are as shown in FIG. 8. The arrangement of the entries 22 in FIG. 8 is the same as in the upper part of FIG. 7. Based on the key ranges after rearrangement distributed by the job manager 11, the task 21 of each node determines the key range that it must send to another node. Specifically, “node 1” determines that it will send “k101” to “k150” to “node 2”, and “node 2” determines that it will send “k201” to “k225” to “node 3”. Consequently, after this parallel processing period ends, “node 1” has the entries 22 for “k1” to “k100”, “node 2” those for “k101” to “k200”, and “node 3” those for “k201” to “k300”.
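Determining the key ranges to send amounts to intersecting each node's current range with the post-rearrangement ranges of the other nodes. The sketch below assumes integer keys and inclusive ranges; the function name and the node labels are hypothetical.

```python
def transfers(old, new):
    """List (source, destination, first_key, last_key) for keys changing node.

    `old` and `new` map a node name to its inclusive (first_key, last_key)
    range before and after the rearrangement.
    """
    moves = []
    for src, (o_lo, o_hi) in old.items():
        for dst, (n_lo, n_hi) in new.items():
            if src == dst:
                continue                      # keys that stay put are not moved
            lo, hi = max(o_lo, n_lo), min(o_hi, n_hi)
            if lo <= hi:                      # the two ranges overlap
                moves.append((src, dst, lo, hi))
    return sorted(moves, key=lambda m: m[2])


OLD = {"node1": (1, 150), "node2": (151, 225), "node3": (226, 300)}
NEW = {"node1": (1, 100), "node2": (101, 200), "node3": (201, 300)}
```

Applied to the arrangement of FIG. 7, `transfers(OLD, NEW)` reports the two moves named in the text: “k101” to “k150” from node 1 to node 2, and “k201” to “k225” from node 2 to node 3.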

(FIG. 9)
FIG. 9 is a diagram illustrating the arrangement of entries in each node some time after the start of the parallel processing period.
The task 21 of each node starts parallel processing of the entries 22 in the key range that it will be responsible for after the rearrangement and, at the same time, copies the data values (“D” and “Bu”) of the entries 22 to be relocated to the task 21 of the destination node.
For example, “node 2” will be responsible for the keys “k101” to “k200” after the rearrangement, but it does not yet hold the keys “k101” to “k150” (at this point they are held by “node 1”). “Node 2” therefore starts parallel processing of the entries 22 from “k200”, the largest key that it will hold after the rearrangement and also holds now, and proceeds in the direction of decreasing key value (“k200” → “k151”). In FIGS. 9 to 13, the direction in which parallel processing proceeds is indicated by solid black arrows.

At the time of FIG. 9, “node 2” has advanced parallel processing from “k200” to “k185”, and the calculation result “Dn” (“d2”) is stored for the entries 22 in the range “k200” to “k185”.

Concurrently with this calculation, each node copies the data values “D” and “Bu” to the node that will hold them after the rearrangement, in the following order. Anticipating the processing order of “node 2” described above, “node 1” gives priority to the data values closest to the pre-rearrangement key boundary with “node 2”. That is, because “node 2” processes the entries 22 in order of decreasing key value starting from “k200”, “node 1”, when copying the data values in the range “k101” to “k150” to “node 2”, starts from “k150”, the key at the pre-rearrangement boundary, and copies the data values one after another in the direction of decreasing key value (“k150” → “k101”) down to “k101”. In FIGS. 9 to 13, the direction in which data is copied out is indicated by hatched arrows, and the direction in which copied data arrives is indicated by white arrows. This allows “node 2” to process its entries 22 with as little stalling as possible.
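The two orderings can be sketched as follows. The sketch covers only the direction used in this example, where relocated keys arrive from the neighbor holding the smaller keys; integer keys and both function names are assumptions.

```python
def processing_order(old_range, new_range):
    """Keys a node computes, in order.

    Keys the node already holds come first, from the highest downward,
    followed by the keys that still have to arrive, also downward.
    """
    o_lo, o_hi = old_range
    n_lo, n_hi = new_range
    descending = range(n_hi, n_lo - 1, -1)
    held = [k for k in descending if o_lo <= k <= o_hi]
    awaited = [k for k in descending if not (o_lo <= k <= o_hi)]
    return held + awaited


def copy_order(first_key, last_key):
    """Keys a node sends, starting at the old boundary and moving downward."""
    return list(range(last_key, first_key - 1, -1))
```

For “node 2”, `processing_order((151, 225), (101, 200))` begins at 200 and reaches 150 only after fifty locally held keys, while `copy_order(101, 150)` on “node 1” sends 150 first, so the earliest key that “node 2” needs from “node 1” is also the earliest one sent.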

In FIG. 9, “node 1” has copied the data values for “k150” to “k130”, whereas “node 2” has received the data values only for “k150” to “k135”. This is due to the time lag between transmission and reception.
All the nodes follow these rules for the execution order of the entries 22 and the order in which entries 22 are sent between nodes.
Because the calculation for an entry 22 and the copying of its data values are performed asynchronously, copying of “Dn” and “D” is not necessarily complete for every key whose calculation has finished. The calculation for an entry 22 and the copying of its data values may instead be performed synchronously.

The arrangement of entries in “Dn”, “D”, and “Bu” in FIG. 9 is summarized below.
Node 1: Calculation is complete for “k100” to “k85” (“d2”). The data values (“D1”, “B1”) for “k150” to “k130” have been copied to “node 2”.
Node 2: Calculation is complete for “k200” to “k185” (“d2”). The data values (“D1”, “B1”) for “k225” to “k205” have been copied to “node 3”. The data values (“D1”, “B1”) for “k150” to “k135” have been received from “node 1”.
Node 3: Calculation is complete for “k300” to “k285” (“d2”). The data values (“D1”, “B1”) for “k225” to “k210” have been received from “node 2”.

Messages that each entry 22 sends to other entries 22 during the parallel processing period are sent according to the entry arrangement after the rearrangement. Each node therefore receives the messages for the keys in the range that it will be responsible for after the rearrangement.
For example, “node 1” will be responsible for the entries 22 in the key range “k1” to “k100” after the rearrangement, so its “Br” receives and holds data values only for this range. In FIG. 9, the “Br” of “node 1” for the key range “k1” to “k100” is shown as a solid rectangle. However, because the parallel processing period is still in progress and the value of “Br” is not yet finalized, FIG. 9 does not show a versioned data value in “Br” (for the same reason, no version number is shown for an unfinalized “Br” in the subsequent figures). Furthermore, because “node 1” will not be responsible for the entries 22 in the key range “k101” to “k150” after the rearrangement, FIG. 9 does not even show a broken-line rectangle for the corresponding part of “Br”.
For the same reason, the “Br” for the key range “k101” to “k200” in “node 2” and the “Br” for the key range “k201” to “k300” in “node 3” are shown as solid rectangles. “Br” is shown in the same way in the subsequent figures, so its description is omitted for FIGS. 10 and 11.

(FIG. 10)
FIG. 10 is a diagram illustrating the arrangement of entries in each node after further time has elapsed from FIG. 9.
The arrangement of entries in each node is as follows.
Node 1: Calculation is complete for “k100” to “k55” (“d2”). The data values (“D1”, “B1”) for “k150” to “k105” have been copied to “node 2”.
Node 2: Calculation is complete for “k200” to “k155” (“d2”). The data values (“D1”, “B1”) for “k225” to “k201” have been copied to “node 3”. The data values (“D1”, “B1”) for “k150” to “k110” have been received from “node 1”.
Node 3: Calculation is complete for “k300” to “k255” (“d2”). The data values (“D1”, “B1”) for “k225” to “k201” have been received from “node 2”.

At the time shown in FIG. 10, copying of the data values (“D” and “Bu”) from “node 2” to “node 3” is complete, but copying of the data values from “node 1” to “node 2” is not.

(FIG. 11)
FIG. 11 is a diagram illustrating the arrangement of entries in each node after further time has elapsed from FIG. 10.
The arrangement of entries in each node is as follows.
Node 1: Calculation is complete for “k100” to “k15” (“d2”). The data values (“D1”, “B1”) for “k150” to “k101” have been copied to “node 2”.
Node 2: Calculation is complete for “k200” to “k115” (“d2”). The data values (“D1”, “B1”) for “k225” to “k201” have been copied to “node 3”. The data values (“D1”, “B1”) for “k150” to “k101” have been received from “node 1”.
Node 3: Calculation is complete for “k300” to “k215” (“d2”). The data values (“D1”, “B1”) for “k225” to “k201” have been received from “node 2”.

At the time shown in FIG. 11, copying of the data values from “node 1” to “node 2” is complete, in addition to the copying of the data values (“D” and “Bu”) from “node 2” to “node 3”. However, the calculation (“d2”) at each node is not yet complete.

(FIG. 12)
FIG. 12 is a diagram illustrating the arrangement of entries in each node after further time has elapsed from FIG. 11.
In FIG. 12, both the copying of the data values and the calculation at each node have finished. At the time of FIG. 12, parallel processing in the parallel processing period ends and the system shifts to the synchronization state of the BSP model.
Because the end of the parallel processing period finalizes the state of “Br”, the value “B2”, with a version-number subscript, is written inside the “Br” rectangle of each node.
In addition, “k150” to “k101” in “node 1” and “k225” to “k201” in “node 2” are shown as broken-line rectangles with no data value, because those nodes did not perform the calculation for these keys.

(FIG. 13)
FIG. 13 is a diagram illustrating the arrangement of entries in each node after further time has elapsed from FIG. 12.
FIG. 13 shows the arrangement of entries during synchronization. During the synchronization, the task 21 of each node merges the value of “Dn” into “D” (“d2” + “D1”) and clears “Dn”. The task 21 of each node also moves “Br” to “Bu” and clears “Br”.
In FIG. 13 the value of “D” is “D2” because merging the difference “d2” into “D1”, the data value of the previous version, yields the new version “D2” (“d2” + “D1” = “D2”). Finally, the task 21 of each node updates the start key and end key in the partition table (FIG. 5) that it holds to the values after the rearrangement.

(Effect of the first embodiment)
As described above, each node copies (sends) data values while processing the entries 22 of the post-rearrangement arrangement during the parallel processing period, so the system 100 can hide the time needed to rearrange the entries 22 and shorten the overall processing time.
That is, during the parallel processing period each node performs the calculation for the entries 22 and also copies data values according to the post-rearrangement arrangement of the entries 22, so the rearrangement time is hidden behind the calculation. Copying data values uses far less of the CPU 201 (FIG. 2) and the memory 204 (FIG. 2) than the calculation does, so its effect on the parallel processing time is small.
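The effect can be expressed with a rough, idealized model that is not part of the source. It assumes that copying and calculation overlap perfectly and do not slow each other down, with T_calc denoting the calculation time of one parallel processing period and T_copy the time needed to copy the relocated data values (both symbols are introduced here for illustration):

```latex
T_{\text{separate}} = T_{\text{calc}} + T_{\text{copy}}, \qquad
T_{\text{overlapped}} \approx \max\left(T_{\text{calc}},\, T_{\text{copy}}\right) \le T_{\text{separate}}
```

Under this model the copy time disappears from the total as long as T_copy does not exceed T_calc; once copying takes longer than the calculation, the saving is limited to T_calc.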

<< Second Embodiment >>
Next, a second embodiment of a specific procedure for the load equalization shown in FIG. 7 is described with reference to FIGS. 14 to 18.
In the first embodiment, “D” and “Bu” are sent between nodes to rearrange the entries 22. When the amount of data to be copied is large, however, copying takes a long time and the reduction in overall processing time becomes small. The second embodiment therefore describes a method in which the system 100 sends “Dn” and “D” instead of “D” and “Bu” in the rearrangement.
This method is effective when the data amount of “Dn” is sufficiently smaller than that of “Bu”.

The second embodiment differs from the first only in the rearrangement of the entries 22, so only that processing is described below. As in the first embodiment, the description uses the rearrangement of entries 22 shown in FIG. 7 as an example. The processing up to the point where the task 21 of each node determines the key range of the entries 22 to be copied to another node is the same as in the first embodiment, and its description is omitted.

(FIG. 14)
FIG. 14 is a diagram illustrating the arrangement of entries in each node at the start of the parallel processing period.
FIG. 14 is the same as FIG. 8, so its description is omitted.

(Fig. 15)
When the parallel processing period starts, each node processes the entries 22 it held before the rearrangement and, for the entries 22 that belong to another node after the rearrangement, copies the calculation result “Dn” together with “D” to the rearrangement destination.
For example, “node 1” holds the entries 22 in the key range “k1” to “k150” at the start of synchronization, and the data values corresponding to the entries 22 in the key range “k101” to “k150” need to be transmitted to “node 2”. Because transmitting data values may take time depending on the state of the network 4 (FIG. 1), the system 100 gives priority to the parallel processing of the entries 22 to be transmitted (“k101” to “k150” for “node 1”) and starts copying the data values of this key range. “Node 1” therefore executes the parallel processing of the entries 22 in the key range “k101” to “k150” before that of the entries 22 that remain in “node 1” after the rearrangement (“k1” to “k100”), and copies each calculation result “Dn” (“d2”) to “node 2” in the order in which it is created.

To perform this control, “node 1” starts the parallel processing of the entries 22 from “k150” and proceeds in the direction of decreasing key values (“k150” → “k1”). That is, “node 1” starts the parallel processing of the entries 22 from the data values closest to the pre-rearrangement key boundary between the nodes (here, “k150”).
In FIG. 15, “node 1” has completed the calculation processing of the entries 22 from “k150” to “k125” and has created the corresponding “Dn” (“d2”). Because the calculation processing of the entries 22 and the copying of their data values are performed asynchronously, not every “Dn” and “D” corresponding to a key whose calculation processing has completed has necessarily been copied. Note that the calculation processing and the copying may instead be performed synchronously.
In FIG. 15, of the data values for “k150” to “k125” whose calculation processing has been completed in “node 1”, the “Dn” and “D” in the range “k150” to “k130” have been copied (see “node 2”).

In the second embodiment, the “Dn” and “D” corresponding to the same entry 22 are copied to the destination node as a pair. They do not necessarily have to be copied as a pair, however; when the network 4 has spare bandwidth, the copying of all “D” may be completed before the copying of all “Dn” is completed.
In FIGS. 15 and 16, the direction in which calculation and outgoing copying proceed is indicated by solid black arrows, and the direction in which copied data arrives is indicated by white arrows.

Similarly, in FIG. 15, the entries 22 in the range “k225” to “k201” need to be rearranged to “node 3”. “Node 2” therefore starts the parallel processing of the entries 22 from the pre-rearrangement key boundary “k225” in the direction “k225” → “k201”, and has finished the calculation processing down to “k205”. “Node 3” has completed receiving “Dn” (“d2”) and “D” (“D1”) for the range “k225” to “k210”.

The arrangement state of entries with respect to “Dn” and “D” in FIG. 15 is summarized below.
Node 1: Calculation processing has been completed for “k150” to “k125” (“d2”). The data values (“d2”, “D1”) for “k150” to “k130” have been copied to “node 2” (see “Dn” and “D” of “node 2”).
Node 2: Calculation processing has been completed for “k225” to “k205” (“d2”). The data values (“d2”, “D1”) for “k225” to “k210” have been copied to “node 3” (see “Dn” and “D” of “node 3”). The data values (“d2”, “D1”) for “k150” to “k130” have been received from “node 1”.
Node 3: Calculation processing has been completed for “k300” to “k280” (“d2”). The data values (“d2”, “D1”) for “k225” to “k210” have been received from “node 2”.
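The boundary-first ordering described for FIG. 15 can be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: the function name `run_parallel_period`, the dictionary representation of “D”, “Bu” and “Dn”, and the `compute` callback standing in for the per-entry calculation are all hypothetical, and the asynchronous sender is reduced to a queue that some other component would drain.

```python
from collections import deque


def run_parallel_period(d, bu, start_key, end_key, first_key_to_send, compute):
    """Compute entries from the old key boundary (end_key) downward.

    Keys first_key_to_send..end_key move to the neighbouring node, so they
    are visited first and their (key, Dn, D) tuples are queued for copying.
    `compute` is a hypothetical stand-in for the per-entry calculation.
    """
    dn = {}
    copy_queue = deque()  # a separate sender would drain this asynchronously
    for key in range(end_key, start_key - 1, -1):
        dn[key] = compute(d[key], bu[key])
        if key >= first_key_to_send:
            # "Dn" and "D" of the same entry are queued as a pair.
            copy_queue.append((key, dn[key], d[key]))
    return dn, copy_queue


# Scaled-down stand-in for node 1: keys 1..6, of which 4..6 move away.
d = {k: 10 for k in range(1, 7)}
bu = {k: k for k in range(1, 7)}
dn, queue = run_parallel_period(d, bu, 1, 6, 4, lambda data, msg: msg)
```

Because the entries to be moved are computed first, their results become available for copying while the node is still busy with the entries it keeps.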

In the second embodiment, the result of the calculation processing (“Dn”) is itself sent to the destination node, so the destination node does not need to perform the calculation processing using “D” and “Bu” again. For this reason, “Bu” is not copied in any node (the same applies to FIGS. 16 and 17).
“Br” is the same as in the first embodiment, and its description is omitted.

(Fig. 16)
FIG. 16 is a diagram showing the arrangement state of entries in each node when further time has elapsed from FIG. 15. As described above, the direction in which calculation and outgoing copying proceed is indicated by solid black arrows, and the direction in which copied data arrives is indicated by white arrows.
The arrangement state of entries in each node is as follows.
Node 1: Calculation processing has been completed for “k150” to “k95” (“d2”). The data values (“d2”, “D1”) for “k150” to “k101” have been copied to “node 2” (see “Dn” and “D” of “node 2”).
Node 2: Calculation processing has been completed for “k225” to “k170” (“d2”). The data values (“d2”, “D1”) for “k225” to “k201” have been copied to “node 3” (see “Dn” and “D” of “node 3”). The data values (“d2”, “D1”) for “k150” to “k101” have been received from “node 1”.
Node 3: Calculation processing has been completed for “k300” to “k226” (“d2”). The data values (“d2”, “D1”) for “k225” to “k201” have been received from “node 2”.

In FIG. 16, the copying of data values has been completed in every node, but the calculation processing is still incomplete in “node 1” and “node 2” (in “node 3” the calculation processing has also been completed).

(Fig. 17)
FIG. 17 is a diagram showing the arrangement state of entries in each node when further time has elapsed from FIG. 16.
The arrangement state of entries in each node is as follows.
Node 1: Calculation processing has been completed for “k150” to “k1” (“d2”). The data values (“d2”, “D1”) for “k150” to “k101” have been copied to “node 2” (see “Dn” and “D” of “node 2”).
Node 2: Calculation processing has been completed for “k225” to “k151” (“d2”). The data values (“d2”, “D1”) for “k225” to “k201” have been copied to “node 3” (see “Dn” and “D” of “node 3”). The data values (“d2”, “D1”) for “k150” to “k101” have been received from “node 1”.
Node 3: Calculation processing has been completed for “k300” to “k226” (“d2”). The data values (“d2”, “D1”) for “k225” to “k201” have been received from “node 2”.

At the time of FIG. 17, the calculation processing and the copying of data values have been completed in every node. The parallel processing of the parallel processing period therefore ends at this point, and the system transitions to the synchronization state of the BSP model. As in the first embodiment, the state of “Br” is fixed once the parallel processing period has ended, so the value “B2”, with the version number as a suffix, is written inside the “Br” rectangle.

(Fig. 18)
FIG. 18 is a diagram showing the arrangement state of entries in each node when further time has elapsed from FIG. 17.
FIG. 18 shows the arrangement state of entries during synchronization. During this synchronization, the task 21 of each node merges the value of “Dn” into “D” for the entries 22 it holds after the rearrangement (“d2” + “D1” = “D2”) and clears “Dn”. The task 21 of each node then moves “Br” to “Bu” and clears “Br”. The version number in “D” is handled as in the first embodiment, and its description is omitted.
Finally, the task 21 of each node updates the start key and end key of the partition table it holds to the values after the rearrangement.
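The synchronization steps described for FIG. 18 can be sketched as below. This is a minimal sketch under assumptions: the node is modelled as a plain dictionary, “merge” is modelled as numeric addition (following the “d2” + “D1” = “D2” notation), and restricting “D” to the new key range is a simplification of this example rather than something the text specifies. The name `synchronize` is hypothetical.

```python
def synchronize(node, new_start, new_end):
    """Apply the synchronization steps to one node (hypothetical model)."""
    merged = {}
    for key in range(new_start, new_end + 1):
        # Merge "Dn" into "D" for the entries held after the rearrangement.
        merged[key] = node["D"][key] + node["Dn"][key]
    node["D"] = merged
    node["Dn"] = {}                           # clear "Dn"
    node["Bu"] = node["Br"]                   # move "Br" to "Bu"
    node["Br"] = {}                           # clear "Br"
    node["partition"] = (new_start, new_end)  # update start/end keys
    return node


# Scaled-down stand-in for node 2: it owned keys 5..8, received 3..4 from
# its lower neighbour, and owns 3..6 after the rearrangement.
node2 = {
    "D": {k: 10 for k in range(3, 9)},
    "Dn": {3: 1, 4: 1, 5: 2, 6: 2, 7: 2, 8: 2},
    "Bu": {"old": 1},
    "Br": {"new": 2},
    "partition": (5, 8),
}
synchronize(node2, 3, 6)
```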

(Effect of the Second Embodiment)
According to the second embodiment, “Dn” is copied (transmitted) to the destination node instead of “Bu”. Because “Dn” is often a difference and therefore small, the amount of transmitted data can be reduced.

<< Third Embodiment >>
Next, a third embodiment of the present invention will be described with reference to FIGS. 19 to 25.
The system 100 according to the third embodiment aims to hide the data movement time while maintaining fault tolerance.
The technique described in Patent Document 2 is one technique for achieving both fault tolerance and node load equalization. As described above, however, it requires extra time for the data value replication processing, which increases the overall processing time. To address this, the system 100 according to the third embodiment replicates data values while performing the parallel processing of the entries 22, thereby reducing the overall processing time.

The third embodiment differs from the first and second embodiments in the processing of the entries 22 and the data replication processing during the parallel processing period, so the description focuses on these. In the third embodiment, the data values of every entry 22 are held by two nodes. The node that mainly manages the data values and processes the entries 22 is called the primary node (hereinafter simply “primary” where appropriate; the first node). The node that holds a replica of the data values and executes the processing of the entries 22 by proxy when its load is small (that is, a backup node) is called the replica node (hereinafter simply “replica” where appropriate; the second node). In the third embodiment, the replica also serves as the primary for another key range, and is therefore assumed to have lower calculation capability (proxy execution capability) than the primary. The replica may, however, be a dedicated node with calculation capability equivalent to that of the primary.

(Fig. 19)
FIG. 19 is a diagram showing the arrangement state of entries in the primary and the replica at the start of the parallel processing period.
As shown in FIG. 19, the data values (“D”, “Bu”) of the entries 22 for the key range “k1” to “k100” are duplicated in the primary and the replica.

(Fig. 20)
FIG. 20 is a diagram showing the arrangement state of entries in each node when some time has elapsed since the start of the parallel processing period.
When the parallel processing period starts and the processing of the entries 22 begins, the primary starts the parallel processing of the entries 22 from “k1”, the smallest key value, and proceeds toward “k100” (“k1” → “k100”). The primary copies the calculation result “Dn” (“d1”) to the replica in order, starting with the entries 22 whose calculation processing has completed. Because the calculation processing of the entries 22 and the copying of “Dn” are performed asynchronously, the key of the entry 22 being calculated does not coincide with the key of the “Dn” being copied. Note that the calculation processing and the copying of “Dn” may instead be performed synchronously. In FIG. 20, the calculation processing of the entries 22 “k1” to “k40” has been completed, and the “Dn” of “k1” to “k20” has been copied (see the replica).
In FIGS. 20 to 23, the direction in which calculation and outgoing copying proceed is indicated by solid black arrows, and the direction in which copied data arrives is indicated by white arrows.

If the replica has spare capacity and proxy execution is possible, it takes over part of the processing of the entries 22 from the primary. In the example of FIG. 20, the replica executes the processing of the entries 22 by proxy in the reverse order to the primary so that its processing does not overlap with that of the primary. That is, it starts the processing of the entries 22 from “k100”, the largest key value, and proceeds toward “k1” (“k100” → “k1”). The replica copies “Dn” (“d1”), the result of the calculation processing of each entry 22 it executed by proxy, to the primary. In the example of FIG. 20, the replica has completed the calculation processing of the entries 22 “k100” to “k85”, and of these, the data values (“Dn”) of the entries 22 “k100” to “k90” have been copied (see the primary).
As long as the parallel processing of the primary and that of the replica do not overlap, the direction of the parallel processing is not limited to the direction shown in FIG. 20. For example, each direction may be reversed, or the primary and the replica may proceed in opposite directions starting from an arbitrary entry 22.
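The opposite-direction division of work can be illustrated with a small simulation. This is a simplified sketch under assumptions: the function `split_work` and its per-round rates are hypothetical, and it models the ideal case in which each side immediately knows how far the other has progressed. The overlap caused by asynchronous copying, which the text goes on to discuss, is deliberately not modelled here.

```python
def split_work(first_key, last_key, primary_rate, replica_rate):
    """Primary walks upward from first_key, replica downward from last_key.

    Rates are the number of entries each side handles per round
    (hypothetical parameters for this illustration).
    """
    lo, hi = first_key, last_key  # next key for the primary / the replica
    primary_done, replica_done = [], []
    while lo <= hi:
        for _ in range(primary_rate):
            if lo > hi:
                break
            primary_done.append(lo)
            lo += 1
        for _ in range(replica_rate):
            if lo > hi:
                break
            replica_done.append(hi)
            hi -= 1
    return primary_done, replica_done


# A replica with one third of the primary's capacity on keys 1..100.
primary_keys, replica_keys = split_work(1, 100, 3, 1)
```

Starting from opposite ends means neither side needs a fixed split point in advance; the meeting point simply reflects how much spare capacity the replica had.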

The primary and the replica thus share the processing of the entries 22. Regardless of whether the node that executed the parallel processing of an entry 22 is the primary or the replica, messages sent by other entries 22 performing parallel processing are sent to the primary. Accordingly, no “Br” exists in the replica in FIGS. 19 to 25. Messages to other entries 22, however, are transmitted from both the primary and the replica; that is, the replica transmits the messages for the entries 22 that it calculated itself.

The arrangement state of entries in the primary and the replica at the time of FIG. 20 is summarized as follows.
Primary: Calculation processing has been completed for “k1” to “k40” (“d1”). The data values for “k1” to “k20” have been copied to the replica (see “Dn” of the replica). The data values (“d1”) for “k100” to “k90” have been received from the replica.
Replica: Calculation processing has been completed for “k100” to “k85” (“d1”). The data values for “k100” to “k90” have been copied to the primary (see “Dn” of the primary). The data values (“d1”) for “k1” to “k20” have been received from the primary.

(Fig. 21)
FIG. 21 is a diagram showing the arrangement state of entries in each node when further time has elapsed from FIG. 20.
The arrangement state of entries in the primary and the replica is as follows.
Primary: Calculation processing has been completed for “k1” to “k80” (“d1”). The data values for “k1” to “k50” have been copied to the replica (see “Dn” of the replica). The data values (“d1”) for “k100” to “k80” have been received from the replica.
Replica: Calculation processing has been completed for “k100” to “k70” (“d1”). The data values for “k100” to “k80” have been copied to the primary (see “Dn” of the primary). The data values (“d1”) for “k1” to “k50” have been received from the primary.

Assume that the parallel processing period ends at the time of FIG. 21.
At this point, the primary holds the calculation results (“d1”) for all entries 22. “k80” is the point at which the entries 22 calculated by the primary itself meet the calculation results received from the replica.
The replica's proxy execution, which started from “k100”, has progressed to “k70”, and of these results the data values for “k100” to “k80” have been copied to the primary.

That is, the entries 22 “k70” to “k80” have been calculated by both the primary and the replica. As described above, the primary and the replica each transmit messages for the entries 22 they calculated themselves, so messages for the entries 22 “k70” to “k80” are transmitted from both the primary and the replica. As a result, the primaries in other nodes receive the same message from both. Applying the same message to “Br” twice makes the value of “Br” invalid, so the primary of each node needs to detect duplicate reception of messages. For example, the primary of each node can detect duplicate reception by keeping the set of keys of the source entries of the messages it has received. That is, when the primary of each node receives a message, it records the key of the source of that message. Each time it receives a message, the primary compares the source key of that message with the keys it has recorded to determine whether the message is a duplicate.
When duplicate reception is detected, the primary performs processing such as discarding the message that was received later.
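The duplicate detection described above can be sketched as follows. This is a hedged illustration: the class `PrimaryInbox` and its methods are hypothetical, and the sketch assumes that each source entry sends at most one message to a given destination entry per parallel processing period, so that a source key seen twice for the same destination indicates a duplicate. Keeping one set of source keys per destination entry is a choice of this example.

```python
from collections import defaultdict


class PrimaryInbox:
    """Receive buffer of a primary with duplicate detection (hypothetical)."""

    def __init__(self):
        self.seen = defaultdict(set)  # destination key -> source keys seen
        self.br = defaultdict(list)   # destination key -> accepted messages

    def receive(self, source_key, dest_key, payload):
        """Store the message in "Br" unless it is a duplicate.

        Returns True if accepted, False if the later copy was discarded.
        """
        if source_key in self.seen[dest_key]:
            return False
        self.seen[dest_key].add(source_key)
        self.br[dest_key].append(payload)
        return True

    def end_of_period(self):
        """Forget recorded source keys before the next period starts."""
        self.seen.clear()


inbox = PrimaryInbox()
first = inbox.receive(75, 210, "m")   # sent by the primary that computed k75
second = inbox.receive(75, 210, "m")  # same message sent again by the replica
third = inbox.receive(76, 210, "n")   # a different source entry
```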

(Fig. 21 → Fig. 22)
As described above, the parallel processing period has ended at the time of FIG. 21, and the system enters the synchronization state of the BSP model.
During this time, the replica merges the “Dn” it holds into “D”. For the key range for which it holds no “Dn” (“k51” to “k69”), it backs up “D” and “Bu” to its own “Db” and “Bb”, respectively.

The primary merges “Dn” into “D” for all entries 22 and moves “Br” to “Bu”. The primary also backs up the “Dn” (“d1”) of “k51” to “k69”, which the replica does not hold, to its own “Db”. In addition, the primary copies the “Br” of all entries 22 to the replica in a batch. On receiving “Br” from the primary, the replica stores it in “Bu”.
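The replica-side handling during this synchronization can be sketched as below. This is a minimal sketch under assumptions: the replica is modelled as a dictionary of dictionaries, “merge” is modelled as numeric addition, and the name `replica_synchronize` is hypothetical. Moving “D” into “Db” for keys without “Dn” reflects the later statement that the replica does not hold “D” for that key range.

```python
def replica_synchronize(replica, br_from_primary):
    """Merge "Dn" where present; otherwise back up "D" and "Bu"."""
    for key in sorted(replica["D"]):
        if key in replica["Dn"]:
            # Result available: merge "Dn" into "D".
            replica["D"][key] += replica["Dn"].pop(key)
        else:
            # No result yet: keep the pre-period values as a backup.
            replica["Db"][key] = replica["D"].pop(key)
            replica["Bb"][key] = replica["Bu"][key]
    # The "Br" received from the primary becomes the new "Bu".
    replica["Bu"] = dict(br_from_primary)
    return replica


replica = {
    "D": {1: 10, 2: 10, 3: 10},
    "Dn": {1: 5, 3: 7},          # no result for key 2
    "Bu": {1: "a", 2: "b", 3: "c"},
    "Db": {},
    "Bb": {},
}
replica_synchronize(replica, {1: "x", 2: "y", 3: "z"})
```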

In this way, for the data values (“D”, “Bu”) corresponding to the key ranges “k1” to “k50” and “k70” to “k100”, the data values after execution of the parallel processing period are duplicated.
For the remaining key range, “k51” to “k69”, the primary holds the data values after execution of the parallel processing period (“D” (“D1”) and “Bu” (“B1”)), whereas the replica does not hold “D” for this key range. The replica does, however, hold the data values from before the parallel processing period (“D0”, “B0”) in “Db” and “Bb” for this key range, so it can create the “D” (“D1”) after the parallel processing period by recalculating from them (“D0” + “B0” = “d1”, “d1” + “D0” = “D1”).
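The recalculation from the backup can be written out as a short sketch. It is illustrative only: `recompute_from_backup` is a hypothetical name, `compute` is a stand-in for the per-entry calculation that derives “d1” from “D0” and “B0”, and merging is modelled as numeric addition.

```python
def recompute_from_backup(db, bb, compute):
    """Rebuild the post-period "D" from the backups "Db" and "Bb"."""
    d_after = {}
    for key, d_before in db.items():
        delta = compute(d_before, bb[key])  # "D0" + "B0" = "d1"
        d_after[key] = d_before + delta     # "d1" + "D0" = "D1"
    return d_after


db = {51: 10, 52: 20}  # "D0" values kept in "Db"
bb = {51: 1, 52: 2}    # "B0" values kept in "Bb"
restored = recompute_from_backup(db, bb, lambda data, msg: msg)
```

This is why holding the pre-period backup is treated as equivalent to holding the post-period value: the same deterministic calculation can be replayed on the replica if the primary fails.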

In other words, this is equivalent to the replica holding the data values after execution of the parallel processing period (“D1”, “B1”), so the data values can be regarded as properly duplicated.

The state after the above processing has been completed is shown in FIG. 22.

Even after the parallel processing between the primary and the replica has been completed, the calculation processing of other entries 22 may not have been completed. In that case it is necessary to wait for the calculation processing of the other entries 22 to finish, and the primary and the replica may perform the following processing during this time. In this embodiment, however, the description proceeds on the assumption that this processing was not performed.
That is, the primary may use this waiting time to copy the “Db” (“d1”) of “k51” to “k69” to the replica as appropriate.
On receiving “d1”, the replica merges it with “Db” (“D0”) and stores the merge result in “D” (“D1”). In parallel with merging and storing the merge result, the replica may sequentially clear the “Db” and “Bb” corresponding to the keys whose update has been completed. The primary may likewise sequentially clear the “Db” corresponding to the keys for which the copying of “Db” (“d1”) has been completed.

(Fig. 22 → Fig. 23)
In the next parallel processing period, there are two options for the direction in which the replica advances through the keys when executing the calculation processing of the entries 22 by proxy: (a) “k1” → “k100” and (b) “k100” → “k1”. The direction from which the proxy execution is performed needs to be set as described below.

At the time of FIG. 22, the replica does not hold the data values (“D”) of the entries 22 in the middle key range “k51” to “k69”, so it cannot execute the calculation processing of the entries 22 in this range. To be able to execute as many entries 22 as possible by proxy, the replica therefore starts the proxy execution from the side on which the key range containing “D” has more entries 22. With the arrangement state shown in FIG. 22 at the start of the next parallel processing period, the replica holds more data values “D” (“D1”) in the key range “k1” to “k50” than in the key range “k70” to “k100”. The replica therefore sets the proxy execution direction so that the proxy execution proceeds in the direction “k1” → “k100”.
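The direction choice described above can be sketched as a comparison of the two contiguous runs of keys for which the replica holds “D”. The function name `choose_proxy_direction`, the string return values, and the tie-breaking toward the ascending direction are assumptions of this example.

```python
def choose_proxy_direction(keys_with_d, first_key, last_key):
    """Pick the side with the longer contiguous run of held "D" values."""
    low_run = 0
    key = first_key
    while key <= last_key and key in keys_with_d:
        low_run += 1
        key += 1

    high_run = 0
    key = last_key
    while key >= first_key and key in keys_with_d:
        high_run += 1
        key -= 1

    # Ties go to the ascending direction (an arbitrary choice here).
    return "ascending" if low_run >= high_run else "descending"


# State of FIG. 22: the replica holds "D" for k1..k50 and k70..k100.
held = set(range(1, 51)) | set(range(70, 101))
direction = choose_proxy_direction(held, 1, 100)
```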

Because the replica also holds entries 22 that it manages as a primary, it does not start the proxy execution as soon as the parallel processing period starts. The replica decides the proxy execution direction at the start of the parallel processing period and performs the actual proxy execution when its own load is low.

Once the replica has decided the proxy execution direction, the primary performs the parallel processing of the entries 22 in the reverse order (that is, “k100” → “k1”). This prevents the primary and the replica from processing the same entries 22.
The primary also sends the remaining “Db” (“d1”) to the replica. So that the replica can continue its proxy execution for as long as possible, the primary copies “Db” in the same direction as the replica's proxy execution.
That is, in this embodiment the replica performs the proxy execution in the direction “k1” → “k100”, so the primary sends “Db” (“d1”) in the same direction, “k51” → “k69”.

(Fig. 23)
FIG. 23 is a diagram showing the arrangement state of entries when a predetermined time has elapsed since the start of the parallel processing period.
The arrangement state of entries at the time of FIG. 23 is as follows.
Primary: Calculation processing has been completed for “k100” to “k40” (“d2”). The data values (“d2”) for “k100” to “k55” have been copied to the replica (see “Dn” of the replica). The data values (“d2”) for “k1” to “k10” have been received from the replica. The “Db” (“d1”) for “k51” to “k65” has been sent to the replica.
Replica: Calculation processing has been completed for “k1” to “k10”. The data values (“d2”) for “k1” to “k10” have been copied to the primary (see “Dn” of the primary). The data values (“d2”) for “k100” to “k55” have been received from the primary. The data values (“d1”) for “k51” to “k65” have been received from the primary and merged with the replica's own “Db” (“D0”) to generate “D” (“D1”).
Here, the key range that the replica has copied to the primary is the same as the key range it has calculated (“k1” to “k10”) because the replica performs calculation and copying only when it has spare capacity.

(Fig. 24)
FIG. 24 is a diagram showing the arrangement state of entries immediately after further time has elapsed from FIG. 23 and the parallel processing period has ended.
The arrangement state of entries at the time of FIG. 24 is as follows.
Primary: Calculation processing has been completed for “k100” to “k30”. The data values (“d2”) for “k100” to “k50” have been copied to the replica (see “Dn” of the replica). The data values (“d2”) for “k1” to “k30” have been received from the replica (the primary therefore holds all data values of “Dn”). The “Db” (“d1”) for “k51” to “k69” has been sent to the replica.
Replica: Calculation processing has been completed for “k1” to “k30”. The data values (“d2”) for “k1” to “k30” have been copied to the primary (see “Dn” of the primary). The data values (“d2”) for “k100” to “k50” have been received from the primary. The data values (“d1”) for “k51” to “k69” have been received from the primary and merged with the replica's own “Db” (“D0”) to generate “D” (“D1”) (the replica therefore holds all data values of “D” (“D1”)).

Here, although the primary has completed calculation processing from “k100” to “k30”, the replica has received data values only from “k100” to “k50”. This is because the performance of the replica is lower than that of the primary.

At the time of FIG. 24, the replica holds all of the data values “D” (“D1”) in the key range “k51” to “k69”, for which processing was not completed in the previous parallel processing period. However, calculation processing for the data values “Dn” (“d2”) in the key range “k31” to “k49” was not completed in time, so the replica does not hold the data values for that range.

(FIG. 24 → FIG. 25)
If the parallel processing period ends at the time of FIG. 24, the primary and the replica perform, starting from the entry arrangement state in FIG. 24, the same processing as in FIG. 21 → FIG. 22.
That is, the replica merges the “Dn” (“d2”) of the key ranges it holds (“k1” to “k30” and “k50” to “k100”) into “D” (“D1”) and stores the merged result (“D2”) in “D”.
For the key range for which it does not hold “Dn” (“k31” to “k49”), the replica backs up “D” (“D1”) and “Bu” (“B1”) to its own “Db” and “Bb”, respectively.

The primary merges “Dn” (“d2”) into “D” (“D1”) and moves “Br” (“B2”) to “Bu”. The primary then backs up, to its own “Db”, the “Dn” (“d2”) in the key range “k31” to “k49” that the replica does not hold. The primary also transmits the “Br” (“B2”) of all key ranges to the replica together. When the replica receives “Br” (“B2”) from the primary, it assigns the received data value (“B2”) to “Bu”.
As a result of the above processing, the entry arrangement state in each entry 22 becomes as shown in FIG. 25.
The job manager 11 then updates the data table by reflecting the information on each entry 22 in it.
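The replica-side processing at the end of the period (FIG. 24 → FIG. 25) can be illustrated with a minimal sketch. It assumes that each of “D”, “Dn” and “Bu” can be modeled as a dictionary keyed by entry key, that keys “k1” to “k100” are represented by the integers 1 to 100, and that merging means the newer value replaces the older one. The function name `replica_end_of_period` and these data layouts are hypothetical and are not taken from the embodiment.

```python
def replica_end_of_period(D, Dn, Bu):
    """Sketch of the replica's end-of-period step (assumed dict-based model).

    D  : key -> data value held before the period ("D1")
    Dn : key -> newly obtained data value ("d2"); may be missing some keys
    Bu : key -> messages used in the period ("B1")
    Returns (merged D, Db backup, Bb backup).
    """
    # Keys for which no new value arrived in time (e.g. k31..k49 in FIG. 24).
    missing = sorted(k for k in D if k not in Dn)
    # Assumption: merging lets the new value replace the old one.
    merged = {**D, **Dn}
    # Back up the old data and the old messages for the missing keys only.
    Db = {k: D[k] for k in missing}
    Bb = {k: Bu.get(k, []) for k in missing}
    return merged, Db, Bb


# State corresponding to FIG. 24: the replica holds "Dn" for k1..k30 and k50..k100.
keys = range(1, 101)
D = {k: "D1" for k in keys}
Bu = {k: ["B1"] for k in keys}
Dn = {k: "d2" for k in keys if k <= 30 or k >= 50}

merged, Db, Bb = replica_end_of_period(D, Dn, Bu)
```

In this model the backed-up range is exactly “k31” to “k49”, which is the range the replica has to finish in the next parallel processing period.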

(Effect of the third embodiment)
According to the third embodiment, even if a parallel processing period ends before the replica has finished all proxy executions, the system 100 can, in the next parallel processing period, perform the proxy execution of the key range that was not finished while also performing new proxy executions. Replication can therefore be performed efficiently.

<< Fourth Embodiment >>
FIG. 26 is a diagram illustrating an outline of parallel calculation processing according to the fourth embodiment. In the fourth embodiment as well, the system configuration and the hardware configuration are the same as those in FIGS. 1 and 2, so their illustration and description are omitted.
The first to third embodiments are based on a model, shown in FIG. 4, in which the same entry group executes parallel computation in every parallel processing period.
In practice, however, as shown in FIG. 26, a different entry group may execute parallel computation in each parallel processing period. The fourth embodiment describes the case where such a parallel distributed processing system 100 is applied to the first to third embodiments.

In FIG. 26, each entry 22 has a type attribute in addition to the (key, data value) pair shown in FIGS. 3 and 4. For example, the entry 22 at the upper left of FIG. 11 is labeled “(T1, k1, v11)”, which indicates that the type is “T1”, the key is “k1”, and the data value is “v11”.
In this parallel distributed processing system 100, the entries of the same type perform parallel calculation processing in the same parallel processing period, and in the next parallel processing period after synchronization, an entry group of a type different from that of the previous period performs parallel calculation processing. As the parallel processing periods repeat, an entry group that executed processing in an earlier period performs parallel processing again.
For example, in “Phase 1” in FIG. 26, the entry group of type “T1” performs parallel calculation processing, and in the next “Phase 2”, the entry group of type “T2” performs parallel calculation processing. In “Phase 3”, the entry group of type “T1” is processed again.
Note that “w11” in the entry group of type “T2” indicates a data value.
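The alternation of types across parallel processing periods can be sketched as follows. This is only an illustration under the assumption that the types are visited in a fixed round-robin order, as in the “T1”, “T2”, “T1” sequence of FIG. 26; the function names and the entry with key “k2” and value “v12” are hypothetical.

```python
from itertools import cycle, islice


def phase_schedule(types, n_phases):
    """Return the type that is active in each phase (assumed round-robin order)."""
    return list(islice(cycle(types), n_phases))


def active_entries(entries, active_type):
    """Select the (type, key, value) entries that compute in the current phase."""
    return [e for e in entries if e[0] == active_type]


# (type, key, data value) triples; "(T1, k1, v11)" and "w11" follow the text,
# the remaining values are made up for illustration.
entries = [("T1", "k1", "v11"), ("T1", "k2", "v12"), ("T2", "k1", "w11")]

schedule = phase_schedule(["T1", "T2"], 3)
per_phase = [active_entries(entries, t) for t in schedule]
```

Here `schedule` lists the active type for Phase 1 to Phase 3, and `per_phase` lists the entries that would run in each of those phases.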

To realize such a parallel distributed processing system 100, the partition table shown in FIG. 5 and the data table shown in FIG. 6 are created for each type. As described above, the partition table and the data table were originally created for each job, so in the fourth embodiment they are created for each combination of job and type. Because the entries belonging to one type are processed in parallel, it is desirable that load be equalized on a per-type basis; this is why a partition table and a data table are created for each type.
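One way to picture tables created per combination of job and type is a registry keyed by the pair (job, type). The sketch below is an assumption-laden illustration: the class `TableRegistry`, its methods, and the layout of a partition table as a mapping from entry key to node identifier are all hypothetical, since the actual layouts of FIGS. 5 and 6 are not reproduced here.

```python
from collections import Counter, defaultdict


class TableRegistry:
    """Holds one partition table per (job, type) combination (hypothetical model)."""

    def __init__(self):
        # (job_id, entry_type) -> {entry key: node id}
        self._partition = defaultdict(dict)

    def assign(self, job_id, entry_type, key, node_id):
        """Record which node holds the entry with this key for this job and type."""
        self._partition[(job_id, entry_type)][key] = node_id

    def node_for(self, job_id, entry_type, key):
        """Look up the node that holds the entry."""
        return self._partition[(job_id, entry_type)][key]

    def load_by_node(self, job_id, entry_type):
        """Count the entries of one type on each node, for per-type load balancing."""
        return Counter(self._partition[(job_id, entry_type)].values())


registry = TableRegistry()
registry.assign("job1", "T1", "k1", "node-a")
registry.assign("job1", "T1", "k2", "node-b")
registry.assign("job1", "T2", "k1", "node-b")
```

Because the same key “k1” can map to different nodes for “T1” and “T2”, the placement of each type can be balanced independently of the other.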

Next, the processing performed when the parallel distributed processing system 100 shown in FIG. 26 is applied to each of the first to third embodiments will be described.

(When applied to the first embodiment)
In the data rearrangement processing described with reference to FIGS. 8 to 13 of the first embodiment, a message transmitted by an entry 22 was received by the entry group executing parallel processing in that parallel processing period. In the fourth embodiment, however, the message is received by the entry group of the type that executes parallel calculation processing in the next parallel processing period (that is, an entry group different from the one currently performing parallel calculation processing). Accordingly, a message transmitted by each entry 22 is not sent to the “Br” illustrated in FIGS. 8 to 13 but to the “Br” of the entry group that performs parallel calculation processing in the next parallel processing period.

At the time of rearrangement, the system 100 does not rearrange the data values of the entry group that is the message destination; it rearranges the data values of the entry group that is the message source. In other words, the system 100 rearranges the entry group currently performing parallel calculation processing, but does not rearrange the entry group that performs parallel calculation processing in the next parallel processing period. The node to which a message of an entry 22 is actually sent is therefore determined using the partition table of the type that performs parallel processing next. The other processing is the same as that described in the first embodiment.
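The message routing described above can be sketched as a lookup in the partition table of the next type. The sketch assumes that a partition table maps an entry key to a node identifier and that “Br” can be modeled as a list of messages per (node, type, key); the function `route_message` and the node names are hypothetical.

```python
def route_message(partition_tables, next_type, dest_key, message, br_buffers):
    """Deliver a message to the "Br" of the entry that computes in the next period.

    partition_tables : type -> {entry key: node id}  (assumed layout)
    br_buffers       : (node id, type, key) -> list of received messages
    Returns the node the message was sent to.
    """
    # The destination node comes from the partition table of the NEXT type,
    # not from the table of the type that is currently computing.
    node_id = partition_tables[next_type][dest_key]
    br_buffers.setdefault((node_id, next_type, dest_key), []).append(message)
    return node_id


partition_tables = {
    "T1": {"k1": "node-a", "k2": "node-b"},
    "T2": {"k1": "node-b", "k2": "node-a"},
}
br_buffers = {}

# "T1" is computing now, so its messages go to the "T2" entries.
target = route_message(partition_tables, "T2", "k1", "m1", br_buffers)
```

Because the “T2” table is not touched while “T1” is being rearranged, the lookup stays stable for the whole period in this model.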

However, when the fourth embodiment is applied to the first embodiment, once the execution time of each task has been measured in a given parallel processing period and the data placement for each task has been determined, the system 100 can rearrange the data for those tasks while it executes parallel processing of a task group of a different type. If a task group of the same type as the tasks whose execution time was first measured is to be executed before the data rearrangement is complete, that task group can be processed in parallel while the rearrangement continues. In this way, the system 100 can make the data rearrangement processing more efficient. For example, in FIG. 26, the execution time of the task group that performs parallel calculation processing of the entry group of type “T1” is calculated in “Phase 1”, and the optimal data placement for each task is determined. Then, while the entry group of “T2” performs parallel calculation processing in “Phase 2”, the entry group of “T1” is rearranged.
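The overlap between rearrangement and the processing of another type can be illustrated with a small timeline model. The sketch assumes that a fixed number of entries can be moved per parallel processing period; the function `overlap_rearrangement`, its parameters, and the numbers used are hypothetical and serve only to show the ordering of events.

```python
def overlap_rearrangement(schedule, planned_type, total_moves, moves_per_phase):
    """Timeline of a rearrangement that runs alongside later phases.

    schedule        : active type of each phase, e.g. ["T1", "T2", "T1"]
    planned_type    : type whose placement is decided in its first phase
    total_moves     : number of entries that must be moved (assumed known)
    moves_per_phase : entries that can be moved during one phase (assumption)
    Returns (log, remaining), where each log item is
    (phase number, active type, entries moved, same type is computing).
    """
    plan_phase = schedule.index(planned_type)  # phase in which times are measured
    remaining = total_moves
    log = []
    for i in range(plan_phase + 1, len(schedule)):
        if remaining == 0:
            break
        moved = min(remaining, moves_per_phase)
        remaining -= moved
        # The last field is True when the rearranged type is itself computing,
        # i.e. rearrangement continues in parallel with that task group.
        log.append((i + 1, schedule[i], moved, schedule[i] == planned_type))
    return log, remaining


log, remaining = overlap_rearrangement(["T1", "T2", "T1"], "T1", 5, 3)
```

With these made-up numbers, three entries move during Phase 2 while “T2” computes, and the last two move during Phase 3 while “T1” is already computing again.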

(When applied to the second embodiment)
Applying the parallel distributed processing system 100 according to the fourth embodiment to the second embodiment is the same as applying it to the first embodiment.
That is, the node to which a message of an entry 22 is actually sent is determined using the partition table of the type that performs parallel processing next. In FIGS. 15 to 17, received messages accumulate in the “Br” of the entries 22 that are executing parallel calculation processing in the parallel processing period. When the fourth embodiment is applied, however, the entry group executing processing in the parallel processing period does not receive messages; the entry group that performs parallel calculation processing in the next parallel processing period receives them. The other processing is the same as in the second embodiment.

The technique for making data rearrangement more efficient that was described for the case of applying the fourth embodiment to the first embodiment can also be used in the second embodiment. However, because the second embodiment calculates an entry and then transmits the calculation result to the rearrangement destination, if the data rearrangement processing is not complete by the time the next parallel calculation processing starts, each task performs parallel calculation processing on the data it holds at the start of that next parallel calculation processing and transmits the result to the rearrangement destination.

(When applied to the third embodiment)
When the parallel distributed processing system 100 according to the fourth embodiment is applied to the third embodiment, message transmission by the entries 22 is the same as when it is applied to the first and second embodiments. A message transmitted by each entry 22 is sent to the primary of the entry group that executes parallel calculation processing in the next parallel processing period. Accordingly, whereas in the transition from FIG. 21 to FIG. 22 the primary transmits the contents of “Br” to the replica, when the fourth embodiment is applied, the contents of “Br” are transmitted from the primary of the entry group that executes the next parallel processing period to the replica that executes the next parallel calculation processing.
In other words, if the execution order of parallel processing is as shown in FIG. 26, then at the end of “Phase 1” the primary of “T2” transmits the contents of “Br” to the replica of “T2”, and at the end of the next “Phase 2” the primary of “T1” transmits the contents of “Br” to the replica of “T1”.

In FIG. 22, untransmitted data values (“d2”) that were backed up remain in the “Db” of the primary. In the next parallel processing period (FIGS. 23 to 25), however, this primary and replica do not perform parallel calculation processing. During that period, the primary can therefore transmit the data values (“d2”) in its “Db” to the replica, and the replica can generate “D” from the received data values. This makes replication more efficient than in the third embodiment. The other processing is the same as in the third embodiment.

(Effect of the fourth embodiment)
According to the fourth embodiment, performing parallel calculation processing with a different entry group in each parallel processing period reduces the load on the parallel distributed processing system 100 as a whole.

1 Master node
2 Slave node (node; includes the first node and the second node)
3 Client node
11 Job manager
12 Distributed file system (master node)
21 Task
22, 22a to 22c Entry
23 Task manager
24 Distributed file system (slave node)

Claims

A parallel distributed processing method for a system having a plurality of nodes, in which parallel processing is performed in a plurality of entries that are virtual calculation units operating in each node and that are movable between the nodes, wherein
each node
performs calculation processing in the entries, and moves the data of an entry whose node is to be changed to the change destination node of that entry.

The parallel distributed processing method according to claim 1, wherein
the node performs calculation processing in the entry, and copies the data of the entry whose node is to be changed to the change destination node of the entry, and
the movement is performed by deleting the data relating to the copied entry after the calculation processing and the copying are complete.

The parallel distributed processing method according to claim 1 or 2, wherein
the node has a partition table in which at least the entry, the node having the entry, and identification information of the node are stored as a set of information.

The parallel distributed processing method according to any one of claims 1 to 3, wherein
the node starts the calculation processing from the entries to be left in its own node, and performs the calculation processing using the data moved from the change source node after the calculation processing in the entries to be left in its own node has ended.

The parallel distributed processing method according to claim 4, wherein
in the calculation processing of each entry, the node can transmit a message from each entry to a different entry, and each node has a buffer for receiving messages transmitted from other entries to the entries held by that node, and
the data in the buffer is moved to the change destination node together with the data.

The node
starts the calculation processing from the entries to be moved while copying the result of the calculation processing and the data used for the calculation processing to the change destination node,
moves the data used for the calculation processing while performing the calculation processing, and
when the calculation processing for the entries to be moved is complete, performs the calculation processing for the entries to be left and stores the result of that calculation processing in its own storage unit. The parallel distributed processing method described in 1.

A parallel distributed processing method for a system having a plurality of nodes, in which parallel processing is performed in a plurality of entries that are virtual calculation units operating in each node and that are movable between the nodes, wherein
the nodes include a first node and a second node that serve as a backup source and a backup destination of the entry for each other,
the first node performs calculation processing and copies a calculation result, which is the result of the calculation processing, to the second node,
the second node performs calculation processing and copies a calculation result, which is the result of the calculation processing, to the first node,
when all the calculation results have been obtained at the first node but some calculation results have not been obtained at the second node,
the first node holds the calculation results of the entries for which no calculation result has been obtained at the second node, and
when the next calculation processing starts, the first node and the second node each perform the calculation processing and copy the calculation results to each other, and the first node copies the held calculation results to the second node.

The parallel distributed processing method according to claim 7, wherein
in the calculation processing performed at the first node and the second node after the calculation result has been held for the entry for which no calculation result was obtained, the calculation for the entry whose calculation result is held is performed later.

The parallel distributed processing method according to any one of claims 1 to 8, wherein
the entries are classified into a plurality of types, and
each round of the parallel processing is performed with entries of a different type.

A parallel distributed processing system in which a plurality of entries, which are virtual calculation units that perform parallel processing, are arranged in a plurality of nodes, wherein
each node
performs calculation processing in the entries, and moves the data of an entry whose node is to be changed to the change destination node of that entry.

A parallel distributed processing system in which a plurality of entries, which are virtual calculation units that perform parallel processing, are arranged in a plurality of nodes, wherein
the nodes include a first node and a second node that serve as a backup source and a backup destination of the entry for each other,
the first node performs calculation processing and copies a calculation result, which is the result of the calculation processing, to the second node,
the second node performs calculation processing and copies a calculation result, which is the result of the calculation processing, to the first node,
when all the calculation results have been obtained at the first node but some calculation results have not been obtained at the second node,
the first node holds the calculation results of the entries for which no calculation result has been obtained at the second node, and
when the next calculation processing starts, the first node and the second node each perform the calculation processing and copy the calculation results to each other, and the first node copies the held calculation results to the second node.