JP2014153935A

JP2014153935A - Parallel distributed processing control device, parallel distributed processing control system, parallel distributed processing control method, and parallel distributed processing control program

Info

Publication number: JP2014153935A
Application number: JP2013023499A
Authority: JP
Inventors: Toshimori Honjo; 利守本庄
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2013-02-08
Filing date: 2013-02-08
Publication date: 2014-08-25

Abstract

PROBLEM TO BE SOLVED: To accelerate parallel distributed processing by efficiently using the I/O performance of a disk. SOLUTION: A parallel distributed processing control device includes a division unit, an allocation unit, and an integration unit. The division unit divides a received task into a number of subtasks corresponding to the number of arithmetic units. The allocation unit assigns the subtasks to the arithmetic units. The integration unit integrates the processing results obtained by executing the subtasks in parallel in the arithmetic units to which the subtasks are assigned.

Description

The present invention relates to a parallel distributed processing control device, a parallel distributed processing control system, a parallel distributed processing control method, and a parallel distributed processing control program.

Conventionally, in order to process a large amount of data at high speed, a technique is known in which the data is distributed and allocated to a plurality of information processing apparatuses, which then execute the processing in parallel. For example, as one large-scale distributed processing technology, a technique is known for processing data efficiently by distributing a large amount of data (big data), such as Web crawl data and access logs, across a large number of commodity servers.

Among these, a technology widely used as a distributed processing framework is MapReduce, proposed by Google. MapReduce provides a framework that can distribute a large amount of data over a computer cluster using a simple programming model. Hadoop (registered trademark) is one open source software implementation of MapReduce. At the core of Hadoop are MapReduce, a framework that distributes large amounts of data and enables parallel processing, and HDFS (Hadoop Distributed File System (registered trademark)), a distributed file system that provides high-throughput access to application data.

In MapReduce, parallel distributed processing is realized by writing two functions called Map and Reduce. First, the data to be processed is divided and assigned to the Map tasks. Each Map task processes its assigned data and generates Key-Value pairs, each of which is a combination of a Key and a Value. The data grouped by Key is then passed to the Reduce tasks. Each Reduce task performs processing for each Key and outputs the result.
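The Map and Reduce roles described above can be illustrated with a minimal single-process sketch. It assumes a word-count job; the function names `map_fn`, `reduce_fn`, and `map_reduce` are hypothetical and are not taken from Hadoop or from this publication.

```python
from collections import defaultdict
from typing import Iterable


def map_fn(record: str) -> list[tuple[str, int]]:
    """Map: emit one (Key, Value) pair per word in the record."""
    return [(word, 1) for word in record.split()]


def reduce_fn(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: process all values collected for one Key."""
    return key, sum(values)


def map_reduce(splits: Iterable[list[str]]) -> dict[str, int]:
    """Run Map over every split, group by Key, then Reduce per Key."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for split in splits:  # in a real cluster, each split goes to one Map task
        for record in split:
            for key, value in map_fn(record):
                grouped[key].append(value)
    # Keys are visited in sorted order, mirroring the sort before Reduce.
    return dict(reduce_fn(k, v) for k, v in sorted(grouped.items()))
```

Here the splits are processed one after another; in an actual MapReduce deployment each split would be handled by a separate Map task on a separate node.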

Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI '04), December 6-8, 2004
Hiroyuki Yamada, Kazuo Goda, and Masaru Kitsuregawa, "Experimental Study on I/O Performance Evaluation of Parallel Data Intensive Processing Platform," 4th Forum on Data Engineering and Information Management (DEIM 2012), D6-4, March 2012
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica, "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing," Technical Report No. UCB/EECS-2011-82, July 19, 2011

Conventionally, however, the entire processing of each Map task has been executed as a single process with a single thread. Each Map task includes, for example, a series of processes such as loading data from a disk (storage device), deserializing the loaded data, generating Key-Value pairs, and sorting by Key. When these are executed as a single thread, the processing speed is limited by the performance of the CPU (Central Processing Unit) that executes them.

In addition, when the processing is distributed and a plurality of Map tasks are executed simultaneously in parallel in order to process a large amount of data in a short time, increasing the degree of parallelism can be expected to improve performance only up to roughly the number of CPU cores of the server or other apparatus that performs the processing. The performance improvement obtainable by parallel processing is therefore constrained by the number and performance of the CPUs mounted on the server. It has also long been known that data deserialization and similar operations performed when loading data place a heavy load on the CPU.

As described above, in the conventional technology, CPU performance becomes a bottleneck and the I/O performance of the disk cannot be fully utilized. In other words, even when the disk has I/O performance to spare, the amount of processing is constrained by CPU performance, and the I/O performance of the disk cannot be fully utilized.

Conventionally, the I/O performance of the magnetic hard disk drives (HDDs) used for distributed processing was lower than the processing performance of the CPU, so the above problem was not pronounced. However, when disks with high I/O performance, such as SSDs (Solid State Drives), replace HDDs, the above problem becomes pronounced.

The disclosed embodiments have been made in view of the above, and an object thereof is to provide a parallel distributed processing control device, a parallel distributed processing control system, a parallel distributed processing control method, and a parallel distributed processing control program that can speed up parallel distributed processing by efficiently using the I/O performance of a disk.

In order to solve the above-described problems and achieve the object, an embodiment according to the present invention divides a received task into a number of subtasks corresponding to the number of arithmetic units, assigns the subtasks to the arithmetic units, and integrates the processing results obtained by executing the subtasks in parallel in the arithmetic units to which the subtasks are assigned.

The parallel distributed processing control device, parallel distributed processing control system, parallel distributed processing control method, and parallel distributed processing control program according to the embodiments of the present invention have the effect of speeding up parallel distributed processing by efficiently using the I/O performance of a disk.

FIG. 1 is a diagram illustrating an example of the configuration of a parallel distributed processing control device according to the first embodiment.
FIG. 2 is a flowchart illustrating an example of the processing flow in the parallel distributed processing control device according to the first embodiment.
FIG. 3 is a diagram illustrating an example of the configuration of a parallel distributed processing control system according to the second embodiment.
FIG. 4 is a flowchart illustrating an example of the flow of MapReduce processing in the parallel distributed processing control system according to the second embodiment.
FIG. 5 is a diagram illustrating an example of the configuration of a slave and an accelerator according to the second embodiment.
FIG. 6 is a diagram illustrating an example of the flow of Map task execution processing in the second embodiment.
FIG. 7 is a diagram for explaining subtask execution processing in the accelerator according to the second embodiment.
FIG. 8 is a flowchart illustrating an example of the processing flow of a Map subtask according to the second embodiment.
FIG. 9 is a flowchart illustrating an example of the processing flow of a merge sort subtask according to the second embodiment.
FIG. 10 is a diagram for explaining Map task processing in a parallel distributed processing control system according to the third embodiment.
FIG. 11 is a diagram for explaining a transfer mode of processing target data in the third embodiment.
FIG. 12 is a diagram illustrating that information processing by a parallel distributed processing control program, which is a program for executing a series of processes in the parallel distributed processing control system, is concretely realized using a computer.

Hereinafter, embodiments of a parallel distributed processing control device, a parallel distributed processing control system, a parallel distributed processing control method, and a parallel distributed processing control program according to the present invention will be described in detail with reference to the drawings. The present invention is not limited by these embodiments. The embodiments can also be combined as appropriate.

(First Embodiment)
FIG. 1 is a diagram illustrating an example of the configuration of a parallel distributed processing control device according to the first embodiment. An example of that configuration will be described with reference to FIG. 1.

[Example of Configuration of Parallel Distributed Processing Control Device 10]
The parallel distributed processing control device 10 illustrated in FIG. 1 includes a control unit 110, a storage unit 120, an accelerator 130, and a storage device (disk) 140.

The control unit 110 controls processing in the parallel distributed processing control device 10. The control unit 110 includes a task division unit 111, a subtask assignment unit 112, a subtask activation unit 113, and an integration unit 114.

The task division unit 111 divides a task assigned to the parallel distributed processing control device 10 into subtasks. For example, the task division unit 111 divides the task into a number of subtasks that depends on the number of arithmetic units (for example, cores of a CPU or the like) included in the accelerator 130 (described later). The task division unit 111 divides the task so that the number of blocks of processing target data is substantially equal across the subtasks.
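One way to obtain a substantially equal number of blocks per subtask is sketched below. The function name `split_blocks` and the choice of contiguous block ranges are assumptions made for illustration; the text above does not specify how blocks are grouped.

```python
def split_blocks(block_ids: list[int], num_subtasks: int) -> list[list[int]]:
    """Divide blocks into num_subtasks groups whose sizes differ by at most one."""
    if num_subtasks < 1:
        raise ValueError("num_subtasks must be at least 1")
    base, extra = divmod(len(block_ids), num_subtasks)
    subtasks: list[list[int]] = []
    start = 0
    for i in range(num_subtasks):
        # The first `extra` subtasks take one additional block each.
        size = base + (1 if i < extra else 0)
        subtasks.append(block_ids[start:start + size])
        start += size
    return subtasks
```

For ten blocks and four subtasks this yields groups of three, three, two, and two blocks.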

The subtask assignment unit 112 assigns the subtasks obtained by the division in the task division unit 111 to the arithmetic units included in the accelerator 130.

The subtask activation unit 113 activates each subtask obtained by the division in the task division unit 111 on the arithmetic unit to which the subtask assignment unit 112 has assigned that subtask.

When execution of all the subtasks activated by the subtask activation unit 113 is complete, the integration unit 114 integrates the execution results stored in the storage unit 120 and generates output data.

The storage unit 120 stores the result data obtained by executing the subtasks in the accelerator 130. When all the subtasks are complete, the result data stored in the storage unit 120 is integrated by the integration unit 114 to generate output data. The output data is stored in the storage device 140. Note that the execution result of each subtask stored in the storage unit 120 may already be partially integrated, with part of the integration process having been executed in each arithmetic unit.

The accelerator 130 is a processor having a plurality of arithmetic units. As the accelerator 130, for example, a many-core processor on which a large number of CPUs are mounted, a reconfigurable FPGA (Field-Programmable Gate Array), or the like can be used. In FIG. 1, the accelerator 130 is illustrated as being disposed inside the parallel distributed processing control device 10. In practice, a many-core processor may be added to an existing server, for example by connecting a many-core processor board to the server via a PCI bus.

The storage device 140 stores data to be processed by the parallel distributed processing control device 10 and data generated as a result of the processing. The storage device 140 is, for example, a hard disk drive (HDD) or a solid state drive (SSD).

[Flow of Processing in Parallel Distributed Processing Control Device 10]
FIG. 2 is a flowchart illustrating an example of the processing flow in the parallel distributed processing control device 10 according to the first embodiment. First, the parallel distributed processing control device 10 receives a task execution request (step S201). The task division unit 111 then divides the task into a plurality of subtasks (step S202). The subtask assignment unit 112 assigns each of the subtasks to an arithmetic unit of the accelerator 130 (step S203). The subtask activation unit 113 then activates each subtask on the arithmetic unit to which it has been assigned (step S204). The arithmetic units on which subtasks have been activated execute their subtasks in parallel and send the processing results to the storage unit 120 for storage (step S205). When the processing in all the arithmetic units is complete, the integration unit 114 integrates the processing results stored in the storage unit 120 to generate output data (step S206). The output data generated by the integration unit 114 is stored in the storage device 140 (step S207). The processing in the parallel distributed processing control device 10 then ends.
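The divide, assign, execute, and integrate sequence of steps S202 to S206 can be sketched as follows. This is an illustration only: worker threads stand in for the arithmetic units of the accelerator, the strided split is just one possible division, and `run_task`, `subtask_fn`, and `integrate_fn` are hypothetical names. Writing the output to the storage device (step S207) is omitted.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def run_task(data: Sequence, num_units: int,
             subtask_fn: Callable, integrate_fn: Callable):
    """Divide a task, run the subtasks in parallel, and integrate the results."""
    # S202: one subtask per arithmetic unit (strided split; an assumption).
    chunks = [data[i::num_units] for i in range(num_units)]
    # S203-S205: each worker thread plays the role of one arithmetic unit;
    # the list of results plays the role of the storage unit.
    with ThreadPoolExecutor(max_workers=num_units) as pool:
        results = list(pool.map(subtask_fn, chunks))
    # S206: integrate once every subtask has finished.
    return integrate_fn(results)
```

For example, `run_task(list(range(100)), 4, sum, sum)` sums four partial ranges in parallel and then adds the partial sums.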

[Effects of the First Embodiment]
As described above, the parallel distributed processing control device 10 according to the first embodiment includes a division unit that divides a received task into a number of subtasks corresponding to the number of arithmetic units, an assignment unit that assigns the subtasks to the arithmetic units, and an integration unit that integrates the processing results obtained by executing the subtasks in parallel in the arithmetic units to which the subtasks are assigned. Therefore, even when the amount of processing in one task is large, or when the processing capability of the arithmetic unit to which the task is assigned is low, the processing can be divided into subtasks and executed in parallel by a plurality of arithmetic units. As a result, parallel distributed processing can be sped up by efficiently using the I/O performance of the storage device without being constrained by the processing capability of an individual arithmetic unit. In addition, because the processing is executed in parallel using the arithmetic units of an accelerator, accelerators can be added as needed by connecting them via a PCI bus or the like, so the processing speed can be increased flexibly.

The parallel distributed processing control device 10 can be applied, for example, as a server constituting a distributed file system.

(Second Embodiment)
Next, as a second embodiment, an example will be described in which Map subtasks are processed in parallel on a server that executes MapReduce in a Hadoop environment.

In conventional MapReduce, each Map task executes the entire sequence of processing: loading data from the disk, generating Key-Value pairs, and sorting the output file.

In contrast, in the second embodiment, the Map processing of MapReduce, which consists of loading data from the disk, deserializing the data, generating Key-Value pairs, and sorting the output file, is first divided into two processes. Specifically, it is divided into (1) "loading data from the disk, deserializing the data, generating Key-Value pairs, and part of the sorting" and (2) sorting the (entire) output file. The former process is then executed in parallel, thereby speeding up the processing.
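The two-phase split can be illustrated with the following sketch, which assumes a word-count style Map function. Phase 1 builds and sorts the Key-Value pairs for one portion of the input, so it can run independently on each portion; phase 2 merges the already sorted runs into the final sorted output. The names `map_subtask` and `merge_sort_all` are hypothetical.

```python
import heapq


def map_subtask(records: list[str]) -> list[tuple[str, int]]:
    """Phase 1 (parallelisable): generate Key-Value pairs and sort this portion."""
    pairs = [(word, 1) for record in records for word in record.split()]
    return sorted(pairs)


def merge_sort_all(sorted_runs: list[list[tuple[str, int]]]) -> list[tuple[str, int]]:
    """Phase 2: merge the sorted runs into one output sorted by Key."""
    return list(heapq.merge(*sorted_runs))
```

Because each run is already sorted, the second phase only needs a linear-time merge rather than a full sort of all pairs.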

[Example of Configuration of Parallel Distributed Processing Control System 1]
An example of the parallel distributed processing control system according to the second embodiment will be described with reference to FIG. 3. FIG. 3 is a diagram illustrating an example of the configuration of the parallel distributed processing control system 1 according to the second embodiment. As shown in FIG. 3, the parallel distributed processing control system 1 includes a master 100, slaves 200A to 200G (hereinafter also collectively referred to as slaves 200), and accelerators 300A to 300G (hereinafter also collectively referred to as accelerators 300). The parallel distributed processing control system 1 also includes storage devices 400A to 400H (hereinafter also collectively referred to as storage devices 400).

The master 100 and the slaves 200A to 200G are each an information processing apparatus such as a commodity server. The master 100 and the slaves 200A to 200G are connected to one another via a network and exchange data with one another. The slaves 200A to 200G are connected to the accelerators 300A to 300G, respectively, by a PCI bus or the like. The master 100 is connected to the storage device 400H, and the slaves 200A to 200G are connected to the storage devices 400A to 400G, respectively; each stores data in its own subordinate storage device.

Although FIG. 3 shows one master, seven slaves, and seven accelerators, these numbers are not limited to 1, 7, and 7, respectively. The master and slaves may be configured from ten or more servers. Furthermore, the parallel distributed processing control system 1 may be built across a plurality of data centers to form an inter-cloud system.

The parallel distributed processing control system 1 is a distributed file system that stores data distributed across a plurality of nodes. In this distributed file system, data to be written is divided into blocks and stored in the storage devices 400A to 400H under the servers (the master 100 and the slaves 200A to 200G). To ensure redundancy, each block is stored as, for example, three identical copies in the storage devices 400 under different servers (the master 100 and the slaves 200A to 200G). The master 100 manages information on where the data is stored. When MapReduce is executed, the data is read from the storage devices 400 in which it is stored, and Map tasks are executed in parallel on a plurality of servers. The execution results are integrated on the server that executes the Reduce task, and a final output file is generated.
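The idea of keeping three copies of each block on different servers can be sketched as below. The round-robin rule and the name `place_replicas` are assumptions for illustration; the text does not state the placement policy, and an actual HDFS deployment uses its own placement rules.

```python
def place_replicas(num_blocks: int, servers: list[str],
                   replicas: int = 3) -> dict[int, list[str]]:
    """Map each block id to `replicas` distinct servers (round-robin; an assumption)."""
    if replicas > len(servers):
        raise ValueError("need at least as many servers as replicas")
    placement: dict[int, list[str]] = {}
    for block in range(num_blocks):
        # Consecutive servers starting at an offset that rotates per block.
        placement[block] = [servers[(block + r) % len(servers)]
                            for r in range(replicas)]
    return placement
```

The returned dictionary corresponds to the storage-location information that the master manages.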

The master 100 operates as, for example, a job tracker (JobTracker) and a name node (NameNode). The job tracker accepts a submitted job, divides it into a plurality of tasks, and assigns the tasks to nodes. The name node manages HDFS. The name node distributes the data to be written among the nodes and manages information such as the location of the files (data blocks) stored on each node.

Each of the slaves 200A to 200G operates as a task tracker (TaskTracker) and a data node (DataNode). The task tracker manages the execution of the tasks assigned by the job tracker. The data node reads the data distributed by the name node from the designated location and stores it.

[Example of MapReduce Processing Flow]
Next, an example of the flow of MapReduce processing will be described with reference to FIG. 4. FIG. 4 is a flowchart illustrating an example of the flow of MapReduce processing in the parallel distributed processing control system 1 according to the second embodiment.

First, the master 100 receives a job (step S401). That is, the master 100 accepts a designation of the data to be subjected to MapReduce processing. Next, the master 100 divides the received job into a plurality of Map tasks to be executed in parallel (step S402). That is, the master 100 assigns processing target data to each Map task. The data assigned to one Map task is hereinafter also called an input split. The master 100 then selects a slave 200 to execute each Map task and assigns the Map task to the selected slave 200 (step S403).

When assigning Map tasks, the master 100 assigns to each slave 200 a Map task whose processing target is data stored in the storage device 400 (local disk) under that slave. However, the master 100 may also assign a Map task whose processing target is data stored in a storage device under another slave. In that case, the slave reads the processing target data from the relevant storage device via the network and executes the processing.
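A locality-preferring assignment of this kind might look like the sketch below. The load-balancing rule (pick the least-loaded candidate, first one on ties) and the name `assign_map_tasks` are assumptions for illustration; the text only states that local data is preferred and that remote data is permitted.

```python
def assign_map_tasks(split_locations: dict[str, list[str]],
                     slaves: list[str]) -> dict[str, str]:
    """Assign each input split to a slave, preferring slaves holding the data locally."""
    load = {s: 0 for s in slaves}
    assignment: dict[str, str] = {}
    for split, holders in split_locations.items():
        # Prefer slaves with a local copy; otherwise any slave may read remotely.
        candidates = [s for s in holders if s in load] or slaves
        chosen = min(candidates, key=lambda s: load[s])
        assignment[split] = chosen
        load[chosen] += 1
    return assignment
```

A split with no local holder among the slaves falls back to the least-loaded slave, which would then read the data over the network.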

The slave 200 to which a Map task has been assigned executes the Map task (step S404). The execution result of the Map task is then passed to the slave that executes the Reduce task. Like the Map tasks, the Reduce task is assigned to a given slave 200 by the master 100.

The slave to which the Reduce task has been assigned executes the Reduce task (step S405). The slave then outputs the execution result as an output file (step S406). This completes the MapReduce processing.

Next, the Map task processing in the second embodiment (step S404 in FIG. 4) will be described. In the second embodiment, the Map task assigned to a slave 200 is further divided into a plurality of Map subtasks. The slave 200 then assigns the Map subtasks to a plurality of processing units (described later) provided in the accelerator 300 and has them executed in parallel. First, the configurations of the slave 200 and the accelerator 300 will be described.

[Example of Configuration of Slave 200]
FIG. 5 is a diagram illustrating an example of the configuration of the slave 200A and the accelerator 300A according to the second embodiment. In FIG. 5, the configuration of the slave 200A is shown as an example of the configuration common to the slaves 200A to 200G, and the configuration of the accelerator 300A is shown as an example of the configuration common to the accelerators 300A to 300G. As illustrated in FIG. 5, the slave 200A includes a control unit 210, a storage unit 220, and a communication unit 230.

The control unit 210 controls processing in the slave 200A. The control unit 210 is, for example, a CPU mounted on a commodity server. The control unit 210 includes a task division unit 211, a subtask assignment unit 212, a subtask activation unit 213, and an integration unit 214.

The task division unit 211 divides a Map task activated in the slave 200A into a plurality of Map subtasks. The task division unit 211 generates a number of Map subtasks that is at least one fewer than the number of cores of the accelerator 300A. The task division unit 211 also generates the Map subtasks so that the same number of processing target blocks is allocated to each core.

The number of subtasks is set lower than the number of cores of the accelerator 300A so that at least one core can execute a merge sort subtask that integrates the results of the Map subtasks.
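The core budget described above can be expressed as a small sketch. Which cores are reserved for the merge sort subtask is not stated here, so reserving the last core(s) is an assumption, as is the name `plan_cores`.

```python
def plan_cores(core_ids: list[str], merge_cores: int = 1) -> dict[str, list[str]]:
    """Reserve merge_cores cores for merge sort; the remaining cores run Map subtasks."""
    if merge_cores < 1:
        raise ValueError("at least one core must run the merge sort subtask")
    if len(core_ids) <= merge_cores:
        raise ValueError("no cores would be left for Map subtasks")
    return {
        "map": core_ids[:-merge_cores],         # one Map subtask per core
        "merge_sort": core_ids[-merge_cores:],  # reserved core(s); an assumption
    }
```

With ten cores named C1 to C10, this plan yields nine Map subtasks and one merge sort core.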

The subtask assignment unit 212 determines the cores that execute the subtasks generated by the task division unit 211. That is, the subtask assignment unit 212 determines the cores to which the Map subtasks and the merge sort subtask are assigned.

The subtask activation unit 213 activates the subtasks generated by the task division unit 211 so that each is executed on the core determined by the subtask assignment unit 212.

The integration unit 214 reads and copies the Key-Value pairs that were generated by executing the subtasks in the accelerator 300A and stored in the storage unit 220. The integration unit 214 then sorts the copied Key-Value pairs by Key and stores the sorted result in the storage device 400A. The stored result is subsequently read by the slave to which the Reduce task is assigned.

記憶部２２０は、アクセラレータ３００Ａでのサブタスクの実行結果を一時的に記憶する。 Storage unit 220 temporarily stores the execution result of the subtask in accelerator 300A.

通信部２３０は、スレーブ２００Ａと、マスタ１００、スレーブ２００Ｂ〜２００Ｇとの間でデータ等を送受信するインタフェースである。たとえば、通信部２３０は、制御部２１０の処理により生成されるデータをマスタ１００に送信する。 The communication unit 230 is an interface that transmits and receives data and the like between the slave 200A, the master 100, and the slaves 200B to 200G. For example, the communication unit 230 transmits data generated by the processing of the control unit 210 to the master 100.

［アクセラレータ３００Ａの構成の一例］ [Example of configuration of accelerator 300A]
アクセラレータ３００Ａは、スレーブ２００Ａにおいて起動されたMapタスクを分割したサブタスクを実行する処理部である。アクセラレータ３００Ａはたとえば、多数のＣＰＵ（コア）を備えたメニーコアプロセッサや、再構成可能なＦＰＧＡなどを搭載した基板で構成する。図５では、アクセラレータ３００Ａを、１０個のコアＣ１〜Ｃ１０を有するメニーコアプロセッサとして示す。 The accelerator 300A is a processing unit that executes the subtasks obtained by dividing the Map task activated in the slave 200A. The accelerator 300A is configured as, for example, a board carrying a many-core processor with a large number of CPUs (cores), a reconfigurable FPGA, or the like. In FIG. 5, the accelerator 300A is shown as a many-core processor having ten cores C1 to C10.

また、アクセラレータ３００Ａは、ローカルバッファＬＢ１〜ＬＢ１０とグローバルバッファＧＢとを備える（図５参照）。コアＣ１〜Ｃ１０各々に対してローカルバッファＬＢ１〜ＬＢ１０が割り当てられる。グローバルバッファＧＢには、複数のコアの処理結果が統合的に格納される。 The accelerator 300A includes local buffers LB1 to LB10 and a global buffer GB (see FIG. 5). Local buffers LB1 to LB10 are allocated to the cores C1 to C10, respectively. The global buffer GB stores the processing results of a plurality of cores in an integrated manner.

ここでは、アクセラレータ３００Ａとして、スレーブ２００Ａが備えるＣＰＵ（コア）の数よりも多数のＣＰＵ（コア）を備える装置を想定する。たとえば、タイレラ（Tilera）（登録商標）のメニーコアプロセッサやインテル（登録商標）Xeon Phiなどを採用することができる。また、ＦＰＧＡを用いて、Mapサブタスクをハードウェア化して実行することも可能である。 Here, the accelerator 300A is assumed to be a device having more CPUs (cores) than the slave 200A has. For example, a Tilera (registered trademark) many-core processor or an Intel (registered trademark) Xeon Phi can be employed. It is also possible to use an FPGA to implement and execute the Map subtasks in hardware.

［スレーブ２００ＡにおけるMapタスク実行処理の一例］ [Example of Map task execution process in slave 200A]
図６を参照し、スレーブ２００ＡにおけるMapタスク実行処理（図４のステップＳ４０４）について説明する。図６は、第２の実施形態におけるMapタスク実行処理の流れの一例を示す図である。 With reference to FIG. 6, the Map task execution process (step S404 in FIG. 4) in the slave 200A will be described. FIG. 6 is a diagram illustrating an example of the flow of a Map task execution process in the second embodiment.

まず、スレーブ２００ＡにMapタスクが割り当てられ、Mapタスクが起動される（ステップＳ６０１）。スレーブ２００Ａのタスク分割部２１１は、MapタスクをMapサブタスクおよびマージソートサブタスクに分割する（ステップＳ６０２）。そして、サブタスク割当部２１２は、Mapサブタスクおよびマージソートサブタスクを割り当てるアクセラレータ３００Ａのコアを決定する（ステップＳ６０３）。そして、サブタスク起動部２１３は、Mapサブタスクおよびマージソートサブタスクを割り当てられたコアにおいてサブタスクを起動し、サブタスクを実行させる（ステップＳ６０４）。統合部２１４は、サブタスクの処理結果を統合する（ステップＳ６０５）。つまり、統合部２１４は、処理結果をコピーしてKeyごとに処理結果をソートする。ソートした処理結果は記憶装置４００に記憶される（ステップＳ６０６）。 First, a Map task is assigned to the slave 200A, and the Map task is activated (step S601). The task dividing unit 211 of the slave 200A divides the Map task into a Map subtask and a merge sort subtask (Step S602). Then, the subtask assignment unit 212 determines the core of the accelerator 300A to which the Map subtask and the merge sort subtask are assigned (Step S603). Then, the subtask activation unit 213 activates the subtask in the core to which the Map subtask and the merge sort subtask are assigned, and causes the subtask to be executed (step S604). The integration unit 214 integrates the processing results of the subtasks (Step S605). That is, the integration unit 214 copies the processing results and sorts the processing results for each key. The sorted processing results are stored in the storage device 400 (step S606).
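The flow of steps S602 to S605 can be sketched end to end as below. This is a minimal sketch under loud assumptions: worker threads stand in for accelerator cores, a word-count style `map_subtask` stands in for the actual Map processing, and the names `map_subtask` and `run_map_task` are hypothetical and do not appear in the source.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def map_subtask(block):
    """Hypothetical Map subtask: one (key, value) pair per word, sorted by key."""
    return sorted((word, 1) for word in block.split())


def run_map_task(blocks, num_cores):
    # S602/S603: one Map subtask per block, capped so that at least one
    # core stays free for the merge sort subtask.
    workers = max(1, min(len(blocks), num_cores - 1))
    # S604: worker threads stand in for accelerator cores (assumption).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        runs = list(pool.map(map_subtask, blocks))
    # S605: integrate the sorted partial results by key.
    return list(heapq.merge(*runs))


result = run_map_task(["b a", "c a"], num_cores=10)
```

Because each partial result is already sorted, the integration step only needs a merge rather than a full re-sort.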

［アクセラレータ３００Ａでのサブタスク実行処理の流れ］ [Flow of subtask execution process in accelerator 300A]
次に、サブタスクを割り当てられたアクセラレータ３００Ａにおけるサブタスク実行処理（図６のステップＳ６０４）の流れを説明する。図７は、第２の実施形態に係るアクセラレータ３００Ａにおけるサブタスク実行処理を説明するための図である。図８は、第２の実施形態に係るMapサブタスクの処理の流れの一例を示すフローチャートである。図９は、第２の実施形態に係るマージソートサブタスクの処理の流れの一例を示すフローチャートである。図７乃至図９を参照して、アクセラレータ３００Ａにおけるサブタスク実行処理の一例について説明する。 Next, the flow of the subtask execution process (step S604 in FIG. 6) in the accelerator 300A to which the subtasks are assigned will be described. FIG. 7 is a diagram for explaining the subtask execution process in the accelerator 300A according to the second embodiment. FIG. 8 is a flowchart illustrating an example of the processing flow of the Map subtask according to the second embodiment. FIG. 9 is a flowchart illustrating an example of the processing flow of the merge sort subtask according to the second embodiment. An example of the subtask execution process in the accelerator 300A will be described with reference to FIGS. 7 to 9.

まず、アクセラレータ３００Ａの各コアにおいてサブタスクが起動される（ステップＳ８０１）。サブタスクが起動すると、Mapサブタスクが割り当てられたコア（ここではコアＣ１として説明する）は、Mapサブタスクの対象データを読み込む（ステップＳ８０２）。たとえば、図７の例では、一つのMapタスクが４つのMapサブタスクとマージソートサブタスクとに分割され、Mapタスクの処理対象であった４ブロックのデータＢ１〜Ｂ４がそれぞれ１ブロックずつMapサブタスクの処理対象となっている。コアＣ１は、ブロックＢ１を読み込む。なお、第２の実施形態では、各コアは、処理対象のブロックをスレーブ２００Ａを介さず直接記憶装置４００Ａから読み出す。 First, a subtask is activated in each core of the accelerator 300A (step S801). When the subtask is activated, the core to which a Map subtask is assigned (described here as the core C1) reads the target data of the Map subtask (step S802). For example, in the example of FIG. 7, one Map task is divided into four Map subtasks and a merge sort subtask, and the four data blocks B1 to B4 that were the processing target of the Map task are assigned one block each to the Map subtasks. The core C1 reads the block B1. In the second embodiment, each core reads its processing target block directly from the storage device 400A, without going through the slave 200A.

そして、コアＣ１は、読み込んだブロックＢ１をデシリアライズする（ステップＳ８０３、図７の★）。次に、コアＣ１は、読み込んだブロックＢ１のデータを１レコードずつ取り出し、Key-Valueペアを生成する（ステップＳ８０４）。生成したKey-Valueペアは、コアＣ１に割り当てられたローカルバッファＬＢ１に格納される。 Then, the core C1 deserializes the read block B1 (step S803, ★ in FIG. 7). Next, the core C1 extracts the data of the read block B1 one record at a time, and generates a key-value pair (step S804). The generated key-value pair is stored in the local buffer LB1 assigned to the core C1.

次にコアＣ１は、ローカルバッファＬＢ１が一杯になっているか否かを判定する（ステップＳ８０５）。ローカルバッファＬＢ１が一杯になっていないと判定した場合（ステップＳ８０５、否定）、コアＣ１は全てのレコードの処理が完了しているか否かを判定する（ステップＳ８０６）。全てのレコードの処理が完了していないと判定した場合（ステップＳ８０６、否定）、コアＣ１は、次のレコードを読み出して（ステップＳ８０７）、ステップＳ８０４に戻る。 Next, the core C1 determines whether or not the local buffer LB1 is full (step S805). When it is determined that the local buffer LB1 is not full (No at Step S805), the core C1 determines whether or not all records have been processed (Step S806). If it is determined that all the records have not been processed (No at Step S806), the core C1 reads the next record (Step S807) and returns to Step S804.

他方、ステップＳ８０５においてローカルバッファＬＢ１が一杯になっていると判定した場合（ステップＳ８０５、肯定）およびステップＳ８０６において全てのレコードの処理が完了していると判定した場合（ステップＳ８０６、肯定）、ステップＳ８０８に進む。ステップＳ８０８において、コアＣ１は、領域ＬＢ１内のKey-ValueペアをKeyに基づいてソートし、結果をグローバルバッファＧＢに送って格納する（ステップＳ８０９）。そして、コアＣ１は、全てのレコードの処理が完了しているか否かを判定する（ステップＳ８１０）。全てのレコードの処理が完了していないと判定した場合（ステップＳ８１０、否定）、コアＣ１はステップＳ８０７に戻る。他方、全てのレコードの処理が完了したと判定した場合（ステップＳ８１０、肯定）、コアＣ１は、マージソートサブタスクを割り当てられたコア（ここではコアＣ１０とする）に処理を渡す（ステップＳ８１１）。これによってコアＣ１におけるMapサブタスクの処理は終了する。 On the other hand, if it is determined in step S805 that the local buffer LB1 is full (Yes at step S805), or if it is determined in step S806 that all the records have been processed (Yes at step S806), the process proceeds to step S808. In step S808, the core C1 sorts the key-value pairs in the local buffer LB1 based on the key, and sends the result to the global buffer GB for storage (step S809). Then, the core C1 determines whether or not all the records have been processed (step S810). If it is determined that not all the records have been processed (No at step S810), the core C1 returns to step S807. On the other hand, if it is determined that all the records have been processed (Yes at step S810), the core C1 passes the processing to the core to which the merge sort subtask is assigned (here, the core C10) (step S811). This completes the processing of the Map subtask in the core C1.
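The loop of FIG. 8 can be sketched as follows. This is an illustrative sketch only: the function `run_map_subtask`, the `map_fn` callback that emits exactly one key-value pair per record, and a buffer capacity counted in pairs are assumptions made for the example, and Python lists stand in for the local buffer LB1 and the global buffer GB.

```python
def run_map_subtask(records, map_fn, local_capacity, global_buffer):
    """Fill a bounded local buffer and spill sorted runs to the global buffer."""
    local_buffer = []
    for record in records:                        # S804/S807: one record at a time
        local_buffer.append(map_fn(record))       # S804: generate a key-value pair
        if len(local_buffer) >= local_capacity:   # S805: local buffer full?
            local_buffer.sort(key=lambda kv: kv[0])   # S808: sort by key
            global_buffer.append(local_buffer)        # S809: store in GB
            local_buffer = []
    if local_buffer:                              # S806 affirmative: flush the rest
        local_buffer.sort(key=lambda kv: kv[0])
        global_buffer.append(local_buffer)
    return "done"                                 # S811: hand over to the merge core


gb = []
status = run_map_subtask(["b", "a", "c"], lambda r: (r, 1), 2, gb)
```

Each entry appended to `gb` is one sorted run, which is what lets the merge sort subtask combine the runs with a merge instead of sorting from scratch.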

コアＣ１におけるMapサブタスクの実行中、Mapサブタスクが割り当てられた他のコアＣ２〜Ｃ４（図７参照）も並行して上記ステップＳ８０１〜ステップＳ８１１の処理を実行する。 During the execution of the Map subtask in the core C1, the other cores C2 to C4 (see FIG. 7) to which the Map subtask is assigned also execute the processes of Steps S801 to S811 in parallel.

コアＣ１〜Ｃ４においてMapサブタスクが実行されている間、コアＣ１０はマージソートサブタスクを実行している（図７参照）。すなわち、図９に示すように、コアＣ１０は、所定時間ごとに、グローバルバッファＧＢが一杯になっているか否かを判定する（ステップＳ９０１）。グローバルバッファＧＢが一杯になっていないと判定した場合（ステップＳ９０１、否定）、コアＣ１０は、コアＣ１〜Ｃ４全部からMapサブタスクが完了した旨の通知を受け取ったか否かを判定する（ステップＳ９０２）。コアＣ１〜Ｃ４全部から受け取ってはいないと判定した場合（ステップＳ９０２、否定）、コアＣ１０は、所定時間待機した（ステップＳ９０３）後、ステップＳ９０１に戻る。 While the Map subtasks are being executed in the cores C1 to C4, the core C10 executes the merge sort subtask (see FIG. 7). That is, as shown in FIG. 9, the core C10 determines at predetermined intervals whether or not the global buffer GB is full (step S901). If it is determined that the global buffer GB is not full (No at step S901), the core C10 determines whether or not it has received a notification that the Map subtask has been completed from all of the cores C1 to C4 (step S902). If it is determined that the notification has not been received from all of the cores C1 to C4 (No at step S902), the core C10 waits for a predetermined time (step S903) and then returns to step S901.

グローバルバッファＧＢが一杯であると判定した場合（ステップＳ９０１、肯定）およびコアＣ１〜Ｃ４全てから完了通知を受け取ったと判定した場合（ステップＳ９０２、肯定）、コアＣ１０は、ステップＳ９０４に進む。ステップＳ９０４では、コアＣ１０は、グローバルバッファＧＢに蓄積されたKey-Valueペアの情報を統合してKeyをもとにソートする（ステップＳ９０４）。コアＣ１０は、ソートした情報を、Mapサブタスクを割り当てたスレーブ２００へ送信する（ステップＳ９０５）。これによって、処理を終わる。 If it is determined that the global buffer GB is full (Yes at step S901), or that the completion notification has been received from all of the cores C1 to C4 (Yes at step S902), the core C10 proceeds to step S904. In step S904, the core C10 integrates the key-value pairs accumulated in the global buffer GB and sorts them based on the key. The core C10 transmits the sorted information to the slave 200 that assigned the Map subtasks (step S905). This completes the process.
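The polling behaviour of FIG. 9 can be sketched as below. This is a sketch under assumptions, not the patent's implementation: the function `run_merge_sort_subtask`, the `all_maps_done` and `send` callbacks, and a capacity counted in sorted runs are hypothetical. The sketch also assumes the merge sort subtask keeps polling after a "buffer full" flush until every Map subtask has reported completion, and it ignores the locking that a real multi-core implementation would need around the shared buffer.

```python
import heapq
import time
from operator import itemgetter


def run_merge_sort_subtask(global_buffer, capacity, all_maps_done, send,
                           poll_interval=0.01):
    """Merge sorted runs from the global buffer and send them to the slave."""
    while True:
        buffer_full = len(global_buffer) >= capacity   # S901
        maps_done = all_maps_done()                     # S902
        if not (buffer_full or maps_done):
            time.sleep(poll_interval)                   # S903: wait, then re-check
            continue
        runs = global_buffer[:]                         # take the accumulated runs
        del global_buffer[:len(runs)]
        if runs:
            merged = list(heapq.merge(*runs, key=itemgetter(0)))  # S904
            send(merged)                                          # S905
        if maps_done:
            return


gb = [[("a", 1), ("c", 1)], [("b", 1)]]
sent = []
run_merge_sort_subtask(gb, capacity=4, all_maps_done=lambda: True, send=sent.append)
```

`heapq.merge` is used because every run in the global buffer was already sorted by the Map subtask that produced it.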

図８および図９に示すサブタスクの処理は、図７に示すようにアクセラレータ３００Ａにおいて実行される。そして、アクセラレータ３００ＡのコアＣ１０が生成した情報は、スレーブ２００Ａに送信される。 The subtask processes shown in FIGS. 8 and 9 are executed in the accelerator 300A as shown in FIG. The information generated by the core C10 of the accelerator 300A is transmitted to the slave 200A.

このように、コアＣ１０からスレーブ２００Ａには、グローバルバッファＧＢが一杯になるごとにKey-Valueペアのデータ（出力結果）が送られる。スレーブ２００Ａの統合部２１４は、こうして送信され蓄積される複数の出力結果をさらにKeyを基にして統合し、最終的なMapタスクの出力ファイルを生成する（図７参照）。 In this way, key-value pair data (an output result) is sent from the core C10 to the slave 200A each time the global buffer GB becomes full. The integration unit 214 of the slave 200A further integrates, based on the key, the plurality of output results transmitted and accumulated in this manner, and generates the final output file of the Map task (see FIG. 7).

［第２の実施形態の効果］ [Effects of the second embodiment]
このように、第２の実施形態に係る並列分散処理制御システム１は、複数の情報処理装置と、複数のコアを有するアクセラレータと、を備える。そして、複数の情報処理装置のうち１の情報処理装置は、Mapタスクを、アクセラレータが有するコアの数に応じた数のサブタスクに分割する分割部を備える。また、１の情報処理装置は、サブタスクをアクセラレータの複数のコアに割り当てる割当部と、サブタスクの処理結果を統合する統合部と、を備える。そして、アクセラレータは、複数のコアにおいてサブタスクを並列実行する。このため、コアの処理能力の制約に影響されずに記憶装置のＩ／Ｏ性能を効率的に利用して並列分散処理の高速化を実現することができる。また、アクセラレータが備えるコアを利用して処理を並列実行させるため、必要に応じてアクセラレータをＰＣＩバス等で接続して増設でき、柔軟に処理の高速化を図ることができる。 As described above, the parallel distributed processing control system 1 according to the second embodiment includes a plurality of information processing apparatuses and an accelerator having a plurality of cores. One information processing apparatus among the plurality of information processing apparatuses includes a division unit that divides the Map task into a number of subtasks corresponding to the number of cores included in the accelerator. The one information processing apparatus also includes an assigning unit that assigns the subtasks to the plurality of cores of the accelerator, and an integration unit that integrates the processing results of the subtasks. The accelerator executes the subtasks in parallel on the plurality of cores. For this reason, parallel distributed processing can be sped up by efficiently using the I/O performance of the storage device without being limited by the processing capability of the cores. In addition, because the processing is executed in parallel using the cores provided in the accelerator, accelerators can be added as needed by connecting them via a PCI bus or the like, so that processing can be sped up flexibly.

（第３の実施形態） (Third embodiment)
上記第２の実施形態では、アクセラレータ３００Ａのコアが処理対象データを格納する記憶装置４００Ａに直接アクセスしてデータを取得する例を説明した。第３の実施形態では、処理対象データを一端、Mapタスクが起動したスレーブ２００Ａの記憶部２２０に読み込む。そして、スレーブ２００Ａの記憶部２２０からさらに、アクセラレータ３００Ａ上のメモリに転送する。その後、サブタスクを実行するコアがアクセラレータ３００Ａ上のメモリから必要なデータを読み出して処理を実行する。その他の点では、第３の実施形態の並列分散処理制御システム２の構成および機能は、第２の実施形態の並列分散処理制御システム１と同様である。 In the second embodiment, an example was described in which the cores of the accelerator 300A directly access the storage device 400A that stores the processing target data and acquire the data. In the third embodiment, the processing target data is first read into the storage unit 220 of the slave 200A in which the Map task is activated. The data is then transferred from the storage unit 220 of the slave 200A to the memory on the accelerator 300A. Thereafter, the core executing a subtask reads the necessary data from the memory on the accelerator 300A and executes the processing. In other respects, the configuration and functions of the parallel distributed processing control system 2 of the third embodiment are the same as those of the parallel distributed processing control system 1 of the second embodiment.

図１０は、第３の実施形態に係る並列分散処理制御システム２におけるMapタスクの処理を説明するための図である。図１０を参照して、第３の実施形態におけるMapタスクの処理の一例を説明する。 FIG. 10 is a diagram for explaining the processing of the Map task in the parallel distributed processing control system 2 according to the third embodiment. With reference to FIG. 10, an example of the processing of the Map task in the third embodiment will be described.

まず、スレーブ２００ＡにおいてMapタスクが起動される（ステップＳ１００１）。すると、スレーブ２００Ａは記憶装置３００Ａから当該Mapタスクの処理対象である入力スプリットを読み出して記憶部２２０を経由して、アクセラレータ３００Ａに転送する。すなわち、記憶装置３００Ａから記憶部２２０に読み込み、アクセラレータ３００Ａへの送付を繰り返す（ステップS１００２）。このとき、データの転送にはダイレクトメモリアクセス（ＤＭＡ：Direct Memory Access）を用いる。 First, the Map task is activated in the slave 200A (step S1001). The slave 200A then reads the input split that is the processing target of the Map task from the storage device 400A and transfers it to the accelerator 300A via the storage unit 220. That is, reading from the storage device 400A into the storage unit 220 and sending to the accelerator 300A are repeated (step S1002). Direct memory access (DMA) is used for this data transfer.

次に、スレーブ２００Ａのタスク分割部２１１は、Mapタスクを処理内容およびアクセラレータ３００Ａ上のコアの数に応じた数のサブタスクに分割する（ステップＳ１００３）。そして、サブタスク割当部２１２は、サブタスクを割り当てるコアを決定する（ステップＳ１００４）。このとき、サブタスク割当部２１２は、Mapタスクを分割して得たMapサブタスクの割当先とともに、マージソートサブタスクの割当先を決定する。 Next, the task dividing unit 211 of the slave 200A divides the Map task into a number of subtasks according to the processing content and the number of cores on the accelerator 300A (step S1003). Then, the subtask assignment unit 212 determines the core to which the subtask is assigned (Step S1004). At this time, the subtask assignment unit 212 determines the assignment destination of the merge sort subtask together with the assignment destination of the Map subtask obtained by dividing the Map task.

そして、サブタスク起動部２１３は、サブタスク割当部２１２が決定した割当先において該当するサブタスクを起動する（ステップＳ１００５）。サブタスクが起動した各コアは、アクセラレータ３００Ａ上のメモリにアクセスしてサブタスクの処理対象データを読み出し、サブタスクを実行する（ステップＳ１００６）。すなわち、データのデシリアライゼーション、Key-Valueペアの生成、Keyに基づくソート等が実行される。その後、全てのサブタスクが完了するとスレーブ２００Ａは処理結果を統合して結果を記憶装置４００Ａに格納する（ステップＳ１００７）。 Then, the subtask activation unit 213 activates the corresponding subtask at the allocation destination determined by the subtask allocation unit 212 (step S1005). Each core in which the subtask is activated accesses the memory on the accelerator 300A, reads the processing data for the subtask, and executes the subtask (step S1006). That is, data deserialization, generation of key-value pairs, sorting based on keys, and the like are executed. Thereafter, when all subtasks are completed, the slave 200A integrates the processing results and stores the results in the storage device 400A (step S1007).

このように、第３の実施形態では、アクセラレータ３００Ａ上のコアは直接スレーブ２００Ａ配下の記憶装置４００Ａから処理対象データを読み出すのではなく、アクセラレータ３００Ａ上のメモリからデータを読み出す。図１１は、第３の実施形態における処理対象データの転送態様を説明するための図である。図１１に示すように、記憶装置４００Ａに格納された処理対象データは、一端スレーブ２００Ａ上の記憶部２２０に転送される。このとき転送はダイレクトメモリアクセスで行う。さらに、スレーブ２００Ａ上の記憶部２２０からアクセラレータ３００Ａ上のメモリに、処理対象データが転送される。やはり転送はダイレクトメモリアクセスで行う。この後、Mapサブタスクを各コアが実行する際には、アクセラレータ３００Ａ上のメモリからデータを読み出してロードし、デシリアライゼーション、Key-Valueペアの生成等を実行する。 As described above, in the third embodiment, the core on the accelerator 300A does not directly read the processing target data from the storage device 400A under the slave 200A, but reads the data from the memory on the accelerator 300A. FIG. 11 is a diagram for explaining a transfer mode of data to be processed in the third embodiment. As shown in FIG. 11, the processing target data stored in the storage device 400A is transferred to the storage unit 220 on the slave 200A. At this time, transfer is performed by direct memory access. Further, the processing target data is transferred from the storage unit 220 on the slave 200A to the memory on the accelerator 300A. Transfer is also performed by direct memory access. Thereafter, when each core executes the Map subtask, data is read from the memory on the accelerator 300A and loaded to execute deserialization, generation of a key-value pair, and the like.
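The two-stage transfer path of the third embodiment can be sketched as follows. This is only a schematic sketch under assumptions: `io.BytesIO` objects stand in for the storage device 400A and the memory on the accelerator 300A, ordinary reads and writes stand in for the DMA transfers, and the function name `staged_transfer` and the `chunk_size` parameter are hypothetical.

```python
import io


def staged_transfer(storage, accelerator_memory, chunk_size=4):
    """Move data storage device -> host buffer -> accelerator memory in chunks."""
    transferred = 0
    while True:
        # Stage 1: storage device 400A -> storage unit 220 on the slave.
        host_buffer = storage.read(chunk_size)
        if not host_buffer:
            break
        # Stage 2: storage unit 220 -> memory on the accelerator 300A.
        accelerator_memory.write(host_buffer)
        transferred += len(host_buffer)
    return transferred


source = io.BytesIO(b"B1B2B3B4")
accel = io.BytesIO()
count = staged_transfer(source, accel, chunk_size=3)
```

After the transfer, the cores would read their blocks from `accel` rather than from `source`, which mirrors the difference from the second embodiment.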

［第３の実施形態の効果］ [Effects of the third embodiment]
このように、第３の実施形態では、処理対象データをいったんスレーブ２００Ａ上の記憶部２２０に転送してから、アクセラレータ３００Ａに転送し、各コアが対象データをアクセラレータ３００Ａ上のメモリから読み出して処理を実行する。このため、Mapタスクを割り当てられたスレーブ２００Ａから、処理対象データを自装置のファイルシステムとして扱えるようになる。したがって、管理者による処理の操作が容易になる。 As described above, in the third embodiment, the processing target data is first transferred to the storage unit 220 on the slave 200A and then transferred to the accelerator 300A, and each core reads the target data from the memory on the accelerator 300A and executes the processing. As a result, the slave 200A to which the Map task is assigned can handle the processing target data as part of its own file system. This makes the processing easier for an administrator to operate.

なお、上記第２および第３の実施形態は、Hadoop環境において動作する分散型ファイルシステムを前提として説明したが、MapReduceを利用したシステムであればHadoopに限定されず、他のフレームワークにも適用することができる。 The second and third embodiments have been described on the premise of a distributed file system operating in a Hadoop environment. However, the embodiments are not limited to Hadoop and can also be applied to other frameworks, as long as the system uses MapReduce.

（第４の実施形態） (Fourth embodiment)
これまで本発明の実施形態について説明したが、本発明は上述した実施形態以外にも、その他の実施形態にて実施されてもよい。以下に、その他の実施形態を説明する。 Although the embodiments of the present invention have been described so far, the present invention may be implemented in other embodiments besides the above-described embodiments. Other embodiments will be described below.

［システム構成］ [System configuration]
また、本実施例において説明した各処理のうち、自動的に行われるものとして説明した処理の全部又は一部を手動的に行うこともでき、あるいは、手動的に行われるものとして説明した処理の全部又は一部を公知の方法で自動的に行うこともできる。この他、上述文書中や図面中に示した処理手順、制御手順、具体的名称、各種のデータやパラメータを含む情報については、特記する場合を除いて任意に変更することができる。 Among the processes described in this embodiment, all or part of the processes described as being performed automatically can also be performed manually, and all or part of the processes described as being performed manually can also be performed automatically by a known method. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters shown in the above description and drawings can be changed arbitrarily unless otherwise specified.

また、図示した各装置の各構成要素は機能概念的なものであり、必ずしも物理的に図示の如く構成されていることを要しない。すなわち、各装置の分散・統合の具体的形態は図示のものに限られず、その全部又は一部を、各種の負荷や使用状況などに応じて、任意の単位で機能的又は物理的に分散・統合して構成することができる。例えば、図５に示す例では、タスク分割部２１１、サブタスク割当部２１２をスレーブ２００Ａ内に配置したが、これを分離して複数のスレーブのタスク分割部とサブタスク割当部とを統合する情報処理装置を配置し、当該情報処理装置がアクセラレータにおける機能の割当を管理するようにしてもよい。 Each component of each illustrated apparatus is functionally conceptual and does not necessarily need to be physically configured as illustrated. That is, the specific form of distribution and integration of each apparatus is not limited to the one illustrated, and all or part of it can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. For example, in the example shown in FIG. 5, the task division unit 211 and the subtask assignment unit 212 are arranged in the slave 200A. However, they may be separated out, an information processing apparatus that integrates the task division units and subtask assignment units of a plurality of slaves may be provided, and that information processing apparatus may manage the allocation of functions in the accelerators.

［プログラム］ [Program]
図１２は、並列分散処理制御システム１，２における一連の処理を実行するプログラムである並列分散処理制御プログラムによる情報処理が、コンピュータを用いて具体的に実現されることを示す図である。図１２に例示するように、コンピュータ３０００は、たとえば、メモリ３０１０と、ＣＰＵ（Central Processing Unit）３０２０と、ハードディスクドライブ３０８０と、ネットワークインタフェース３０７０とを有する。コンピュータ３０００の各部はバス５１００によって接続される。 FIG. 12 is a diagram showing that information processing by the parallel distributed processing control program, which is a program for executing a series of processes in the parallel distributed processing control systems 1 and 2, is specifically realized using a computer. As illustrated in FIG. 12, the computer 3000 includes, for example, a memory 3010, a CPU (Central Processing Unit) 3020, a hard disk drive 3080, and a network interface 3070. The units of the computer 3000 are connected by a bus 5100.

メモリ３０１０は、図１２に例示するように、ＲＯＭ３０１１およびＲＡＭ３０１２を含む。ＲＯＭ３０１１は、たとえば、ＢＩＯＳ（Basic Input Output System）等のブートプログラムを記憶する。 The memory 3010 includes a ROM 3011 and a RAM 3012 as illustrated in FIG. The ROM 3011 stores a boot program such as BIOS (Basic Input Output System).

ここで、図１２に例示するように、ハードディスクドライブ３０８０は、たとえば、ＯＳ３０８１、アプリケーションプログラム３０８２、プログラムモジュール３０８３、プログラムデータ３０８４を記憶する。すなわち、開示の技術に係る並列分散処理制御プログラムは、コンピュータによって実行される指令が記述されたプログラムモジュール３０８３として、たとえばハードディスクドライブ３０８０に記憶される。たとえば、マスタ１００が備える制御部、スレーブ２００Ａが備える制御部２１０およびコアＣ１〜Ｃ１０の各部と同様の情報処理を実行する手順各々が記述されたプログラムモジュール３０８３が、ハードディスクドライブ３０８０に記憶される。 Here, as illustrated in FIG. 12, the hard disk drive 3080 stores, for example, an OS 3081, an application program 3082, a program module 3083, and program data 3084. In other words, the parallel distributed processing control program according to the disclosed technique is stored, for example, in the hard disk drive 3080 as the program module 3083 in which instructions executed by the computer are described. For example, the hard disk drive 3080 stores the program module 3083 describing the procedures that execute the same information processing as the control unit of the master 100, the control unit 210 of the slave 200A, and the cores C1 to C10.

また、記憶部２２０、ローカルバッファＬＢおよびグローバルバッファＧＢに記憶されるデータのように、並列分散処理制御プログラムによる情報処理に用いられるデータは、プログラムデータ３０８４として、たとえばハードディスクドライブ３０８０に記憶される。そして、ＣＰＵ３０２０が、ハードディスクドライブ３０８０に記憶されたプログラムモジュール３０８３やプログラムデータ３０８４を必要に応じてＲＡＭ３０１２に読み出し、各種の手順を実行する。 Further, data used for information processing by the parallel distributed processing control program, such as data stored in the storage unit 220, the local buffer LB, and the global buffer GB, is stored as program data 3084, for example, in the hard disk drive 3080. The CPU 3020 reads the program module 3083 and program data 3084 stored in the hard disk drive 3080 to the RAM 3012 as necessary, and executes various procedures.

なお、並列分散処理制御プログラムに係るプログラムモジュール３０８３やプログラムデータ３０８４は、ハードディスクドライブ３０８０に記憶される場合に限られない。たとえば、プログラムモジュール３０８３やプログラムデータ３０８４は、着脱可能な記憶媒体に記憶されてもよい。この場合、ＣＰＵ３０２０は、ディスクドライブなどの着脱可能な記憶媒体を介してデータを読み出す。また、同様に、プログラムモジュール３０８３やプログラムデータ３０８４は、ネットワーク（ＬＡＮ（Local Area Network）、ＷＡＮ（Wide Area Network）等）を介して接続された他のコンピュータに記憶されてもよい。この場合、ＣＰＵ３０２０は、ネットワークインタフェース３０７０を介して他のコンピュータにアクセスすることで各種データを読み出す。 Note that the program module 3083 and the program data 3084 related to the parallel distributed processing control program are not limited to being stored in the hard disk drive 3080. For example, the program module 3083 and the program data 3084 may be stored in a removable storage medium. In this case, the CPU 3020 reads data via a removable storage medium such as a disk drive. Similarly, the program module 3083 and the program data 3084 may be stored in another computer connected via a network (LAN (Local Area Network), WAN (Wide Area Network), etc.). In this case, the CPU 3020 reads various data by accessing another computer via the network interface 3070.

［その他］ [Others]
なお、本実施例で説明した並列分散処理制御プログラムは、インターネット等のネットワークを介して配布することができる。また、並列分散処理制御プログラムは、ハードディスク、フレキシブルディスク（ＦＤ）、ＣＤ−ＲＯＭ、ＭＯ、ＤＶＤなどのコンピュータで読取可能な記録媒体に記録され、コンピュータによって記録媒体から読み出されることによって実行することもできる。 The parallel distributed processing control program described in this embodiment can be distributed via a network such as the Internet. The parallel distributed processing control program can also be recorded on a computer-readable recording medium such as a hard disk, a flexible disk (FD), a CD-ROM, an MO, or a DVD, and executed by being read from the recording medium by a computer.

DESCRIPTION OF SYMBOLS
１並列分散処理制御システム 1 Parallel distributed processing control system
１０並列分散処理制御装置 10 Parallel distributed processing control apparatus
１００マスタ 100 Master
１１０制御部 110 Control unit
１１１タスク分割部 111 Task division unit
１１２サブタスク割当部 112 Subtask assignment unit
１１３サブタスク起動部 113 Subtask activation unit
１１４統合部 114 Integration unit
１２０記憶部 120 Storage unit
１３０アクセラレータ 130 Accelerator
１４０記憶装置 140 Storage device
２００Ａ〜２００Ｇスレーブ 200A to 200G Slave
２１０制御部 210 Control unit
２１１タスク分割部 211 Task division unit
２１２サブタスク割当部 212 Subtask assignment unit
２１３サブタスク起動部 213 Subtask activation unit
２１４統合部 214 Integration unit
２２０記憶部 220 Storage unit
２３０通信部 230 Communication unit
３００アクセラレータ 300 Accelerator
４００記憶装置 400 Storage device
Ｃ１〜Ｃ１０コア C1 to C10 Core
ＧＢグローバルバッファ GB Global buffer
ＬＢ１〜ＬＢ１０ローカルバッファ LB1 to LB10 Local buffer

Claims

A parallel distributed processing control device comprising:
a division unit that divides a received task into a number of subtasks according to the number of arithmetic units;
an assigning unit that assigns the subtasks to the arithmetic units; and
an integration unit that integrates processing results obtained by executing the subtasks in parallel in the arithmetic units to which the subtasks are assigned.

A parallel distributed processing control system comprising a plurality of information processing devices and an accelerator having a plurality of cores, wherein
one information processing device among the plurality of information processing devices comprises:
a division unit that divides a Map task into a number of subtasks according to the number of cores included in the accelerator;
an assigning unit that assigns the subtasks to the plurality of cores of the accelerator; and
an integration unit that integrates the processing results of the subtasks by the cores, and
the accelerator executes the subtasks in parallel in the plurality of cores.

The parallel distributed processing control system, wherein each of the plurality of cores of the accelerator reads the data block that is the processing target of its assigned subtask directly from a storage device, under an information processing device, that stores the data block.

The parallel distributed processing control system according to claim 2, wherein each of the plurality of cores of the accelerator executes the subtask by reading the data block to be processed by the assigned subtask from a memory on the accelerator, after the data block has been transferred from a storage device under the information processing apparatus that stores the data block to the memory on the accelerator via a memory of the information processing apparatus.

The parallel distributed processing control system according to claim 2, wherein each of the plurality of cores of the accelerator has a storage area allocated to it and, when the storage area is filled with key-value pairs generated by execution of the subtask, sorts the key-value pairs stored in the storage area based on the key and stores the sorting result in a global buffer on the accelerator.

The parallel distributed processing control system according to claim 2, wherein the division unit divides, into the subtasks, part of the processing among loading of the data blocks to be processed, deserialization of the loaded data, generation of key-value pairs, and sorting of the key-value pairs, and
the assigning unit assigns the subtasks so that the numbers of processing target blocks allocated to the respective cores of the accelerator are substantially equal.

A parallel distributed processing control method executed by a parallel distributed processing control system including a plurality of information processing devices and an accelerator including a plurality of cores,
A division step of dividing a Map task into a number of subtasks according to the number of cores included in the accelerator by one information processing device among the plurality of information processing devices;
An assigning step of assigning the subtask to a plurality of cores of the accelerator by the one information processing apparatus;
An execution step of executing the subtasks in parallel by a plurality of cores of the accelerator;
An integration step of integrating the processing results of the subtasks by the one information processing apparatus;
A parallel distributed processing control method comprising:

A parallel distributed processing control program that causes a computer to execute:
a division step of dividing a Map task into a number of subtasks according to the number of cores included in an accelerator;
an assigning step of assigning the subtasks to a plurality of cores of the accelerator;
an execution step of causing the plurality of cores of the accelerator to execute the subtasks in parallel; and
an integration step of integrating the processing results of the subtasks by the cores.