WO2014102917A1 - Parallel processing method and parallel computer system - Google Patents
Parallel processing method and parallel computer system
- Publication number
- WO2014102917A1 (PCT/JP2012/083546)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- processing
- node
- parallel
- key
- bucket
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
- G06F9/5066—Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
Definitions
- The present invention relates to a parallel processing method and a parallel computer system, and more particularly to the transmission of processing results.
- Non-patent document 1 discloses a calculation model called MapReduce shown in FIG. 2 as a general-purpose method for efficiently performing calculations on a parallel computing device.
- MapReduce is taken up merely as an example for the technical description; the invention of this application does not itself presuppose the MapReduce calculation model.
- The MapReduce calculation model is composed of three phases: a Map phase, a Sort phase, and a Reduce phase.
- In the Map phase, the input data is divided into a large number of processing units, read in, and fed to Map processes.
- Each Map process performs some calculation or processing on its processing unit and outputs <Key, Value> pairs.
- In the Sort phase, the <Key, Value> pairs output in the Map phase are classified (sorted) by Key, and a set of Values is output for each Key.
- In the Reduce phase, the set of Values for each Key is input to a Reduce process, which performs some calculation or processing on that set and outputs the final result.
- Each Map process and each Reduce process has no dependency on the other Map and Reduce processes, so they can be executed in parallel. Therefore, by using the MapReduce calculation model, calculation (processing) can be performed in parallel on a parallel computing device comprising a plurality of computation nodes.
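For concreteness, here is a minimal, single-process Python sketch of the three phases. This is only an illustration; the word-count Map and Reduce functions and the toy input are hypothetical and not taken from the patent.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical Map process: emit <Key, Value> pairs for one processing unit.
def map_process(line):
    for word in line.split():
        yield (word, 1)

# Sort phase: classify (sort) all Map output by Key and collect the Values per Key.
def sort_phase(pairs):
    pairs = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, [value for _, value in group]

# Hypothetical Reduce process: combine the Values collected for one Key.
def reduce_process(key, values):
    return key, sum(values)

if __name__ == "__main__":
    units = ["a b a", "b c"]                       # input split into processing units
    map_out = [kv for unit in units for kv in map_process(unit)]
    result = [reduce_process(k, vs) for k, vs in sort_phase(map_out)]
    print(result)                                  # [('a', 2), ('b', 2), ('c', 1)]
```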
- FIG. 3 is a schematic diagram showing a state in which the MapReduce calculation model is operated on a parallel computing device composed of a plurality of computation nodes.
- The Map processes of the Map phase and the Reduce processes of the Reduce phase can each be executed in parallel, so these processes can be distributed over a plurality of computation nodes and executed in parallel.
- The Map phase and the Reduce phase do not necessarily have to be performed by the same parallel computing device.
- In many cases, the total number of Map and Reduce processes is larger than the total number of computation nodes; each computation node is then necessarily responsible for a plurality of Map and Reduce processes.
- When the total number of Reduce processes (the number of distinct Keys output in the Map phase) exceeds the number of computation nodes, the Sort phase in FIG. 2 is subdivided into two phases: a Shuffle phase (inter-node communication phase) and a Local Sort phase (receiving-side sorting phase).
- In the Shuffle phase, each <Key, Value> pair output in the Map phase is transmitted to the computation node in charge of the Reduce process that is uniquely determined for its Key.
- When each computation node is in charge of a plurality of Reduce processes (a plurality of Map-output Keys), the <Key, Value> pairs sent in the Shuffle phase must be classified (sorted) by Key at the receiving computation node. This classification (sorting) is performed in the Local Sort phase (receiving-side sorting phase).
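A minimal sketch of the Shuffle and Local Sort phases, assuming, purely for illustration, that Keys are non-negative integers and that each computation node is in charge of a contiguous block of 1000 Key numbers (the block size and the function names are not from the text):

```python
KEYS_PER_NODE = 1000  # each node is in charge of 1000 Key numbers (illustrative)

# Shuffle phase: the Reduce node for a pair is uniquely determined by its Key.
def dest_node(key):
    return key // KEYS_PER_NODE          # node index 0, 1, 2, ...

def shuffle(map_output, num_nodes):
    inboxes = [[] for _ in range(num_nodes)]
    for key, value in map_output:
        inboxes[dest_node(key)].append((key, value))
    return inboxes                       # pairs arrive in arbitrary order

# Local Sort phase: each receiving node classifies (sorts) its pairs by Key
# before running its Reduce processes.
def local_sort(received_pairs):
    return sorted(received_pairs, key=lambda kv: kv[0])

inboxes = shuffle([(1500, "x"), (10, "y"), (1001, "z")], num_nodes=10)
print(local_sort(inboxes[1]))            # [(1001, 'z'), (1500, 'x')]
```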
- As described above, by using the MapReduce calculation model, calculation (processing) can be performed in parallel on a parallel computing device comprising a plurality of computation nodes. Furthermore, when the MapReduce calculation model is used on such a device, only the Map and Reduce processes perform application-specific calculation and processing; the Shuffle phase and the Local Sort phase are the same regardless of the processing content of the application. Therefore, by preparing the Shuffle phase and the Local Sort phase as a common framework in advance, a plurality of applications can be created easily by changing only the processing contents of the Map and Reduce processes.
- However, the data rearrangement performed in the Local Sort phase generally requires random access to the storage device.
- The random access memory (RAM) that serves as the temporary storage device of each computation node normally has a small capacity. When the amount of data to be handled is large, the Local Sort phase therefore cannot rely only on the RAM, which allows fast random access, but must also use storage devices such as flash memory or hard disk drives, which have large capacity but low random access speed. Consequently, once the amount of data to be handled exceeds the capacity of the RAM serving as the temporary storage device, the time required for the Local Sort increases even more sharply.
- The invention of the present application has been made in view of the above, and one of its purposes is to perform the per-Key classification (sorting) processing, or a part of it, during the Map phase and the Shuffle phase.
- In this way, a means is provided so that data delivered to the receiving computation node in the Shuffle phase is, as far as possible, already classified (sorted) by Key, and that means is used to shorten the processing time of the Local Sort phase.
- The problem to be solved by the invention of the present application occurs generally when a plurality of processing units are processed in parallel by a plurality of computation nodes; MapReduce is taken up only as an example for explanation. The invention of the present application can therefore be applied not only to the MapReduce calculation model but also to many other cases in which a plurality of processing units are processed in parallel by a plurality of computation nodes.
- FIG. 4 is a schematic diagram of a graph problem being calculated by a parallel computing device, given as an example, different from MapReduce, of parallel processing of a plurality of processing units by a plurality of computation nodes to which the invention can be applied.
- FIG. 4 shows an example in which computation nodes 1 to 4 each process the portion of the graph assigned to it.
- Typically, a graph problem consists of a plurality of calculation steps; at each step, every vertex performs some calculation that takes as input the previous-step results of the vertices that have edges directed toward it.
- When a graph problem is calculated by a parallel computing device composed of a plurality of computation nodes, it is natural, given that each vertex can be calculated independently, to assign vertices to the computation nodes and process them in parallel.
- Each time a calculation step is completed, the calculation results are transmitted to the vertices at the other ends of the edges, and the receiving computation node must classify (sort) the received data by destination vertex.
- This is the same processing as in the Shuffle phase and the Local Sort phase in the above-described MapReduce calculation model.
- The larger the number of vertices, that is, the larger the graph problem, the more time this classification (sorting) by destination vertex takes at the receiving computation node, which is the problem.
- The present invention solves the above problem with a parallel computer system having a plurality of computation nodes, in which the processing targets are divided according to a first grouping and placed on and processed by the computation nodes, the processing results are stored in a group of storage devices according to a second grouping, and the stored processing results are then transmitted to the computation nodes according to the first grouping.
- According to the present invention, the time required for the classification (sorting) processing performed at the receiving computation node can be reduced, and parallel computation can therefore be accelerated.
- FIG. 1 is a block diagram showing the configuration of an information processing system that is an embodiment of the present invention.
- FIG. 2 is a diagram explaining the calculation model called MapReduce disclosed in Non-patent Document 1.
- FIG. 3 is a conceptual diagram of the MapReduce calculation model of Non-patent Document 1 operating on a parallel computer.
- FIG. 4 is a conceptual diagram of a graph problem being processed on a parallel computing device.
- FIGS. 5 and 6 are diagrams explaining the speed-up of the receiving-side sorting (Local Sort) by the bucket sort that was the starting point for the present invention.
- FIG. 1 is a block diagram showing a configuration of an information processing system 101 according to an embodiment of the present invention.
- Each computation node CALC_NODE_x includes a central processing unit (CPU), a temporary storage device MEM, a storage device STOR, a communication device COM_DEV, and a bus BUS connecting the CPU, the storage device STOR, and the communication device COM_DEV.
- In each computation node CALC_NODE_x, the CPU reads the necessary input data from the storage device STOR, performs calculations, and, if necessary, transmits the input data or calculation results to other computation nodes using the communication device COM_DEV.
- During this processing, the CPU stores temporarily needed data, intermediate results of calculation, and the like in the temporary storage device MEM as required.
- As described later, one of the computation nodes CALC_NODE_x serves as a retransmission management node.
- A random access memory (RAM) is used as the temporary storage device MEM.
- The storage device STOR, on the other hand, uses flash memory, phase-change memory, or a hard disk drive. The temporary storage device MEM can therefore be accessed faster than the storage device STOR, but has a smaller capacity.
- In addition, the temporary storage device MEM allows fast random access, whereas the storage device STOR either allows only sequential access and no random access, or has a random access speed that is very low compared to its sequential access speed.
- In the parallel computer system shown in FIG. 1, the CPU, temporary storage device MEM, storage device STOR, and communication device COM_DEV of each computation node are not necessarily identical; their capacity and performance may differ from node to node. The case where some computation nodes do not have all of the CPU, temporary storage device MEM, storage device STOR, and communication device COM_DEV is also contemplated by the invention disclosed in the present application.
- FIGS. 5(a) and 5(b) are diagrams explaining the bucket sort method that was the starting point for the invention of the present application in solving the above-described problem.
- As an example, consider the case where the Keys output in the Map phase are integers greater than or equal to zero.
- The case of non-negative integer Keys is taken only as an example for explanation; even if the Key is not an integer, for example a character string, the same method can be applied as long as the assignment of Keys to buckets is determined in advance.
- Suppose that the computation node CALC_NODE_1, which handles part of the parallel computation, is in charge of the Reduce processes for the 1000 Key numbers 0 to 999. FIG. 5(a) shows the processing flow inside CALC_NODE_1 in this case.
- In the Shuffle phase, CALC_NODE_1 receives <Key, Value> pairs with Key numbers 0 to 999. Since nothing is guaranteed about the order in which the <Key, Value> pairs arrive, all received pairs must first be stored in the storage device STOR in preparation for the subsequent classification (sorting). After CALC_NODE_1 has received all the <Key, Value> pairs, it reads all of them back from the storage device STOR, classifies (sorts) them by Key, and performs the Reduce processing.
- However, with this method the per-Key classification (sorting) must be performed over all the <Key, Value> pairs received by the node, so the classification (sorting) takes a long time.
- Therefore, as shown in FIG. 5(b), preparing a plurality of storage devices and performing a bucket sort was also considered.
- In FIG. 5(b), CALC_NODE_1 has ten storage devices STOR1 to STOR10, which together form a group of buckets.
- The computation node CALC_NODE_1 distributes the received <Key, Value> pairs over the ten storage devices according to their Key numbers, for example storing pairs with Key numbers 0 to 99 in STOR1, pairs with Key numbers 100 to 199 in STOR2, and so on.
- After the Shuffle phase ends and all <Key, Value> pairs have been received, the pairs stored in the storage device STOR1 are first read out, classified (sorted) by Key, and Reduce-processed.
- After all the <Key, Value> pairs in the storage device STOR1 have been processed, the pairs stored in the storage device STOR2 are read out, classified (sorted) by Key, and Reduce-processed, and STOR3 through STOR10 are then classified (sorted) and Reduce-processed in the same way, in order.
- With the method of FIG. 5(b), the per-Key classification (sorting) only has to be performed separately for each of the storage devices STOR1 to STOR10, so the classification (sorting) time can be shortened compared with FIG. 5(a).
- Here, each storage device (bucket) STOR1 to STOR10 is accessed only sequentially, so storage devices with low random access performance can be used.
- The buckets (storage devices STOR1 to STOR10) have been assigned Key ranges in ascending order of Key number here, but it is sufficient that the correspondence between Key numbers and storage devices (buckets) is fixed; the assignment does not have to follow ascending Key order. The order in which the storage devices (buckets) are processed after the Shuffle phase is also arbitrary.
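A minimal sketch of the receiving-side bucket sort of FIG. 5(b); the ten storage devices (buckets) are modeled here simply as Python lists, and the 100-Key ranges follow the example given above:

```python
NUM_BUCKETS = 10
KEYS_PER_BUCKET = 100   # STOR1 holds Keys 0-99, STOR2 holds 100-199, ...

def bucket_index(key):
    return key // KEYS_PER_BUCKET

# Shuffle phase on the receiving node: append each pair to its bucket.
# Appending is a sequential write, so no random access is needed.
buckets = [[] for _ in range(NUM_BUCKETS)]
for key, value in [(250, "a"), (7, "b"), (305, "c"), (12, "d")]:
    buckets[bucket_index(key)].append((key, value))

# After the Shuffle phase: classify (sort) and Reduce-process one bucket at a time.
for stor in buckets:
    for key, value in sorted(stor):   # each sort is much smaller than one global sort
        pass                          # Reduce processing for this Key would go here
```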
- However, the method of FIG. 5(b) requires a large number of storage devices (buckets) per computation node. FIG. 6 shows an example in which each of the ten computation nodes CALC_NODE_1 to CALC_NODE_10 of the information processing system 101 is in charge of the Reduce processing for 1000 Key numbers. If ten storage devices (buckets) are placed in each computation node as in the bucket sort method of FIG. 5(b), 100 storage devices (buckets) are needed for the parallel computing device as a whole. As described above, with the bucket sort method of FIG. 5(b), after the Shuffle phase each node classifies (sorts) and Reduce-processes its buckets in order, starting from the bucket storing the <Key, Value> pairs with the smallest Key numbers.
- Unless the distribution of Keys output in the Map phase is extremely skewed, the time CALC_NODE_1 needs to process its bucket for Key numbers 0 to 99 and the time CALC_NODE_2 needs to process its bucket for Key numbers 1000 to 1099 can be expected to be roughly the same. Consequently, the time at which CALC_NODE_1 starts processing the bucket for Key numbers 100 to 199 and the time at which CALC_NODE_2 starts processing the bucket for Key numbers 1100 to 1199 are almost the same.
- The inventors of the present application found that, by sharing across the computation nodes the buckets that are needed at almost the same time, that is, the buckets whose classification (sorting) and Reduce processing start at almost the same time, the total number of buckets needed for the parallel computation as a whole can be reduced.
- FIG. 7 is a diagram for explaining an example of parallel processing operation in the information processing system 101.
- FIG. 7 shows an example in which parallel processing is executed by 10 calculation nodes CALC_NODE_1 to CALC_NODE_10, and each calculation node is responsible for Reduce processing of 1000 types of key numbers.
- As noted above, the assumptions that the Keys are integers and the particular way the Reduce processes and buckets are allocated are arbitrary and given only for explanation.
- In the parallel processing operation of FIG. 7, ten storage devices (buckets) BSTOR1 to BSTOR10 that temporarily store the <Key, Value> pairs output in the Map phase, ten computation nodes CALC_NODE_1 to CALC_NODE_10 that perform the Reduce processing, and a controller BUCKET_CONT that manages the retransmission of the buckets take part. The computation nodes CALC_NODE_1 to CALC_NODE_10, together with the storage devices STOR1 to STOR10, realize the Map processing, the per-Key classification (sorting), and the Reduce processing. As shown in FIG. 7, the storage device STOR of computation node CALC_NODE_1 is labeled STOR1, the storage device STOR of CALC_NODE_2 is labeled STOR2, and so on.
- The group of buckets BSTOR1 to BSTOR10 that temporarily store the <Key, Value> pairs is realized by reusing the storage devices STOR of the computation nodes: BSTOR1 is the storage device STOR of CALC_NODE_1, BSTOR2 is the storage device STOR of CALC_NODE_2, BSTOR3 is the storage device STOR of CALC_NODE_3, and so on.
- In other words, the buckets share the storage devices STOR of computation nodes that themselves execute the parallel computation; this allows the parallel computation to be executed with few resources.
- Alternatively, the buckets may be realized with the storage devices STOR of computation nodes that do not take part in the parallel computation, for example with BSTOR1 as the storage device STOR of CALC_NODE_11, BSTOR2 as the storage device STOR of CALC_NODE_12, and BSTOR3 as the storage device STOR of CALC_NODE_13. The number of buckets BSTOR1 to BSTOR10 and the number of computation nodes CALC_NODE_1 to CALC_NODE_10 that perform the Reduce processing may also differ. In a configuration where the numbers differ and the buckets are placed on the computation nodes, each computation node hosts an integer number (zero or more) of storage devices (buckets).
- The controller BUCKET_CONT that manages the retransmission of the buckets is realized by a retransmission management node, which is one of the computation nodes CALC_NODE_x.
- The retransmission management node may be a computation node that also takes part in the parallel computation, or a separate computation node that is not involved in it.
- In this operation, the above-described Shuffle phase (inter-node communication) is further subdivided into two phases: a bucket transmission phase and a bucket retransmission phase.
- In the bucket transmission phase, instead of sending each <Key, Value> pair output by a Map process directly to the computation node that performs the Reduce process for that Key, the pair is sent to one of the buckets BSTOR1 to BSTOR10 shared by the entire parallel computation.
- For example, a <Key, Value> pair whose Key number leaves a remainder of 0 to 99 when divided by 1000 is assigned to the destination bucket BSTOR1, and a pair whose remainder is 100 to 199 is assigned to the destination bucket BSTOR2, and so on.
- In this way, each computation node CALC_NODE_1 to CALC_NODE_10 addresses every <Key, Value> pair produced by its Map processing to one of the buckets BSTOR1 to BSTOR10.
- After the bucket transmission phase is complete, the process proceeds to the bucket retransmission phase.
- In the bucket retransmission phase, in accordance with instructions from the controller BUCKET_CONT, the <Key, Value> pairs stored in each bucket are retransmitted to their original destination computation nodes (the computation nodes that perform the Reduce processes). More specifically, the <Key, Value> pairs stored in bucket BSTOR1 are retransmitted first.
- Each computation node CALC_NODE_1 to CALC_NODE_10 stores the <Key, Value> pairs it receives in the storage device STOR of its own node (for example, STOR1 in the case of CALC_NODE_1).
- When the retransmission of a bucket is complete, the controller BUCKET_CONT notifies the computation nodes CALC_NODE_1 to CALC_NODE_10.
- Each computation node CALC_NODE_1 to CALC_NODE_10 then reads the <Key, Value> pairs stored in its own storage device (STOR1 to STOR10, respectively), classifies (sorts) them by Key, and performs the Reduce processing; the remaining buckets are retransmitted and processed in the same way, one after another.
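The two-stage Shuffle described above can be sketched as follows. This is only an illustration, not the patented implementation: communication is modeled with plain lists, and the mapping of Keys to buckets (the remainder of the Key number modulo 1000, split into ten ranges of 100) is one plausible reading of the allocation described below, chosen so that it is orthogonal to the assignment of Keys to Reduce nodes.

```python
NUM_NODES = 10
NUM_BUCKETS = 10
KEYS_PER_NODE = 1000

def reduce_node(key):          # first grouping: which node Reduce-processes this Key
    return key // KEYS_PER_NODE

def bucket_of(key):            # second grouping: which shared bucket parks the pair
    return (key % KEYS_PER_NODE) // (KEYS_PER_NODE // NUM_BUCKETS)

def two_stage_shuffle(map_output):
    # Bucket transmission phase: all nodes send their Map output to the shared buckets.
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for key, value in map_output:
        buckets[bucket_of(key)].append((key, value))

    # Bucket retransmission phase: the controller has the buckets retransmitted one
    # at a time; each node sorts what it received from that bucket and Reduce-processes
    # it before the next bucket is retransmitted.
    for bucket in buckets:
        per_node = [[] for _ in range(NUM_NODES)]
        for key, value in bucket:
            per_node[reduce_node(key)].append((key, value))
        for node_id, pairs in enumerate(per_node):
            for key, value in sorted(pairs):
                pass  # Reduce processing for this Key on node node_id would go here

two_stage_shuffle([(0, "a"), (1099, "b"), (2500, "c")])
```

Because the bucket index depends only on the remainder of the Key number, each Reduce node's Keys are spread over all ten buckets, and each retransmitted bucket delivers to every node a small, contiguous slice of its Key range.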
- In other words, whereas the computation nodes perform their processing according to the first grouping, the Map results are stored into the storage devices BSTOR1 to BSTOR10 according to the second grouping and are then retransmitted from those storage devices to the computation nodes in accordance with the first grouping.
- With this operation, the time required for the classification (sorting) processing in each computation node can be shortened to the same level as when ten buckets are prepared in every computation node (FIG. 5(b)).
- On the other hand, since inter-node communication is performed twice, once in the bucket transmission phase and once in the bucket retransmission phase, the amount of communication increases.
- However, the time required for inter-node communication is proportional to the data amount N, whereas the time required for classification (sorting) is proportional to N × log(N), so the extra communication is expected to be outweighed by the reduction in sorting time.
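A rough way to see this trade-off (an estimate for a simple comparison sort, not a claim taken from the text):

```latex
% Sorting cost at the receiving side, simple comparison sort assumed:
%   without buckets : one sort over all N pairs
%   with B buckets  : B sorts over about N/B pairs each
% Communication grows from N to 2N (bucket transmission + retransmission),
% i.e. only linearly in N, so for large N the saving of N log B dominates.
T_{\mathrm{sort}} \propto N \log N, \qquad
T'_{\mathrm{sort}} \propto B \cdot \frac{N}{B}\log\frac{N}{B} = N\left(\log N - \log B\right), \qquad
T_{\mathrm{comm}}: N \rightarrow 2N .
```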
- In the example of FIG. 7, the Keys are assigned to the computation nodes simply as consecutive blocks of 1000 Key numbers in ascending order, and the Keys are assigned to the buckets according to the remainder obtained by dividing the Key number by 1000.
- This choice is arbitrary, however, and other allocation methods can be applied.
- It is desirable that the assignment of Reduce Keys to the computation nodes and the assignment of Keys to the buckets be as orthogonal as possible; that is, the processing targets belonging to any one group of the first grouping should be distributed so that they fall into every group of the second grouping.
- The desirable allocation depends on the distribution of the Keys output in the Map phase and therefore differs by application and input data, so the application designer or user may be allowed to set both the assignment of Reduce Keys to computation nodes and the assignment of Keys to buckets. It is also possible to make the transitions between phases by exchanging the necessary information directly between the buckets and the computation nodes, without using the controller BUCKET_CONT for bucket retransmission management.
- Graph processing data can also be handled as a processing target of the information processing system 101.
- In that case, the graph is divided according to the number of computation nodes taking part in the parallel processing and assigned to them, and each computation node processes its assigned group of vertices.
- FIG. 8 is a diagram for explaining a second example of the parallel processing operation in the information processing system 101.
- The system of this example applies the idea of radix sort, a method that achieves a finer sort by performing bucket sorting several times, and reuses the same group of buckets BSTOR1 to BSTOR10; by performing the bucket sort twice, substantially the same effect is obtained as if the number of buckets were squared.
- As in FIG. 7, the description takes as an example the configuration in which the buckets BSTOR1 to BSTOR10 that temporarily store the <Key, Value> pairs are realized by reusing the storage devices STOR of the computation nodes that execute the parallel computation: BSTOR1 is the storage device STOR of CALC_NODE_1, BSTOR2 that of CALC_NODE_2, BSTOR3 that of CALC_NODE_3, and so on.
- In the first bucket transmission phase, each <Key, Value> pair output in the Map phase is transmitted to one of the buckets BSTOR1 to BSTOR10, determined on the basis of the remainder obtained by dividing the Key number by 100.
- After this transmission is complete, the contents stored in bucket BSTOR1 are read out and transmitted once more to the buckets BSTOR1 to BSTOR10.
- This time, the destination bucket is determined on the basis of the remainder obtained by dividing the Key number by 1000.
- The same retransmission is then performed sequentially for buckets BSTOR2 through BSTOR10.
- After this retransmission of all the buckets is complete, the contents stored in bucket BSTOR1 are read out once again and sent to the original destination computation nodes (the computation nodes that perform the Reduce processes), where Local Sort and Reduce processing is performed. After the processing of bucket BSTOR1 is finished, buckets BSTOR2 through BSTOR10 are likewise sent on in order, and Local Sort and Reduce processing is performed at each computation node.
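A sketch of this two-pass, radix-sort-style reuse of the same ten buckets, with each pass modeled as a routing function; the exact digit extraction ((Key mod 100) / 10 for the first pass, (Key mod 1000) / 100 for the second) is an assumption consistent with the remainders mentioned above:

```python
NUM_BUCKETS = 10

def first_pass_bucket(key):            # based on the remainder of the Key modulo 100
    return (key % 100) // 10

def second_pass_bucket(key):           # based on the remainder of the Key modulo 1000
    return (key % 1000) // 100

def radix_bucket_sort(map_output):
    # First bucket transmission: route every pair by the low-order digit group.
    first = [[] for _ in range(NUM_BUCKETS)]
    for key, value in map_output:
        first[first_pass_bucket(key)].append((key, value))

    # Retransmit the contents of BSTOR1, BSTOR2, ... through the same buckets,
    # this time routed by the next digit group; the result is roughly equivalent
    # to having 10 x 10 = 100 buckets.
    second = [[] for _ in range(NUM_BUCKETS)]
    for bucket in first:
        for key, value in bucket:
            second[second_pass_bucket(key)].append((key, value))
    return second

print(sum(len(b) for b in radix_bucket_sort([(5, "a"), (123, "b"), (987, "c")])))  # 3
```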
- FIG. 9 is a diagram for explaining a third example of the parallel processing operation in the information processing system 101.
- In the operations described so far, the buckets are retransmitted only after the bucket transmission phase is complete, that is, only after all the <Key, Value> pairs output in the Map phase have been sent to the buckets.
- The original destination computation node (the computation node that performs the Reduce process) therefore cannot receive any <Key, Value> pair until the bucket retransmission phase starts.
- If, instead, the original destination computation node (the computation node that performs the Reduce process) could receive <Key, Value> pairs from an early stage, it could process them in parallel with receiving them.
- The inventors of the present application therefore considered that the overall processing time might be shortened by performing the per-Key classification (sorting) little by little (online) in this way.
- In the operation of FIG. 9, each bucket BSTOR1 to BSTOR10 is assigned Key numbers according to the remainder obtained by dividing the Key number by 1000: remainders 91 to 181 go to BSTOR1, remainders 182 to 272 to BSTOR2, and so on.
- Each bucket is thus assigned 90 or 91 Key numbers; 90 or 91 corresponds to the number of Keys for which one computation node is in charge of the Reduce processing (1000) divided by the number of buckets plus one (10 + 1).
- The time required for all the Map processes to complete is estimated in advance, and each time one tenth of that estimated time elapses, the contents stored in the next of the buckets BSTOR1 to BSTOR10 are read out in turn and retransmitted to the original destination computation nodes (the computation nodes that perform the Reduce processes).
- In this way, the original destination computation node (the computation node that performs the Reduce process) can receive <Key, Value> pairs from an early stage, and can perform the per-Key classification (sorting) little by little (online) in parallel with receiving them.
- Specifically, the time Tmap required for all the Map processes is first estimated by some means.
- Since the number of Map processes and their processing contents are known in advance, the processing time of the entire Map phase can be estimated; even if the estimate contains some error, the following operation is not hindered.
- Each <Key, Value> pair output in the Map phase is sent to the bucket corresponding to its Key number; if no bucket corresponds to the Key number, the pair is sent directly to the original destination computation node (the computation node that performs the Reduce process). Accordingly, <Key, Value> pairs whose Key numbers leave a remainder of 0 to 90 when divided by 1000 are not sent to the bucket group at all, but are transmitted directly to their original destination computation nodes.
- When one tenth of the estimated time Tmap has elapsed, the stored contents of bucket BSTOR1 are read out and retransmitted to the original destination computation nodes (the computation nodes that perform the Reduce processes).
- From this point on, in addition to the <Key, Value> pairs whose Key numbers have a remainder of 0 to 90 and thus have no corresponding bucket, the <Key, Value> pairs output by the Map processing whose remainders are 91 to 181, the range that bucket BSTOR1 was in charge of, are also transmitted directly to the original destination computation nodes.
- Thereafter, each time another tenth of the estimated time elapses, the next bucket is retransmitted toward the original destination computation nodes, and each <Key, Value> pair output by the Map processing is sent to a bucket only if a bucket corresponding to its Key exists and that bucket has not yet been retransmitted; otherwise it is transmitted directly to the original destination computation node (the computation node that performs the Reduce process).
- Note that with this method, at the timing of its retransmission bucket BSTOR1 will have accumulated only about one tenth as many <Key, Value> pairs as bucket BSTOR10, so the buckets are not filled evenly.
- To mitigate this, the number of Keys handled by each bucket can be made larger for bucket BSTOR1 and smaller for bucket BSTOR10, or, instead of dividing Tmap into equal parts, the interval between retransmitting one bucket and retransmitting the next can be gradually shortened.
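A sketch of the routing rule used while the Map processing and the bucket retransmission overlap (FIG. 9). The representation of buckets and retransmission flags as Python lists, the callback names, and the equal split of the estimated time Tmap are simplifications made for illustration:

```python
NUM_BUCKETS = 10
RANGE = 91                 # about 1000 / (10 + 1) Key numbers per bucket

def bucket_for(key):
    r = key % 1000
    if r <= 90:
        return None        # remainders 0-90 have no bucket: send directly
    b = (r - 91) // RANGE  # remainders 91-181 -> bucket 0, 182-272 -> bucket 1, ...
    return b if b < NUM_BUCKETS else None

buckets = [[] for _ in range(NUM_BUCKETS)]
retransmitted = [False] * NUM_BUCKETS   # flipped by the controller every Tmap/10

def route(key, value, send_direct):
    """Called for every <Key, Value> pair produced by the Map processing."""
    b = bucket_for(key)
    if b is None or retransmitted[b]:
        send_direct(key, value)          # straight to the Reduce node for this Key
    else:
        buckets[b].append((key, value))  # park it in the shared bucket for now

def retransmit_next(i, send_direct):
    """Invoked by the retransmission controller after each elapsed Tmap/10."""
    for key, value in buckets[i]:
        send_direct(key, value)
    buckets[i].clear()
    retransmitted[i] = True

if __name__ == "__main__":
    out = []
    send = lambda k, v: out.append((k, v))
    route(50, "a", send)       # remainder 50 -> no bucket, sent directly
    route(1100, "b", send)     # remainder 100 -> bucket 0, parked
    retransmit_next(0, send)   # controller tick: bucket 0 drained
    print(out)                 # [(50, 'a'), (1100, 'b')]
```

In this sketch the controller simply drains bucket i and marks it retransmitted at each tick; as noted above, the bucket sizes or the tick intervals can instead be made uneven to balance how much each bucket accumulates.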
- In this way, by retransmitting the buckets in order in parallel with the Map processing, the original destination computation nodes (the computation nodes that perform the Reduce processes) can receive <Key, Value> pairs from an early stage and can perform the per-Key classification (sorting) little by little (online) in parallel with receiving them, so the overall processing time can be shortened.
- CALC_NODE_x calculation node
- COM_SW communication switch
- MEM temporary storage device
- STOR storage device
- COM_DEV communication device
- BUS bus
- BSTOR1 to BSTOR10 storage devices (buckets)
- BUCKET_CONT controller that performs retransmission management of the buckets
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
In parallel computation of a graph problem or the like, the time required for the sorting (ordering) processing performed at the receiving-side computation node becomes a problem. The present invention is a parallel computer system having a plurality of computation nodes which resolves this problem by partitioning the objects to be processed according to a first grouping and placing them at the computation nodes for processing, storing the results of the processing in a group of storage devices on the basis of a second grouping, and transmitting the stored results of the processing to the computation nodes in accordance with the first grouping.
Description
The present invention relates to a parallel processing method and a parallel computer system, and more particularly to the transmission of processing results.
As a general-purpose method for performing computation efficiently on a parallel computing device, Non-patent Document 1 discloses a calculation model called MapReduce, shown in FIG. 2. As will be described later, MapReduce is taken up here merely as an example for the technical description; the invention of this application does not itself presuppose the MapReduce calculation model.
As shown in FIG. 2, the MapReduce calculation model is composed of three phases: a Map phase, a Sort phase, and a Reduce phase. In the Map phase, the input data is divided into a large number of processing units, read in, and fed to Map processes; each Map process performs some calculation or processing on its processing unit and outputs <Key, Value> pairs. In the subsequent Sort phase, the <Key, Value> pairs output in the Map phase are classified (sorted) by Key, and a set of Values is output for each Key. In the Reduce phase, the set of Values for each Key is input to a Reduce process, which performs some calculation or processing on that set and outputs the final result.
Here, in the Map phase and the Reduce phase, each Map process and each Reduce process has no dependency on the other Map and Reduce processes, so they can be executed in parallel. Therefore, by using the MapReduce calculation model, calculation (processing) can be performed in parallel on a parallel computing device comprising a plurality of computation nodes.
FIG. 3 is a schematic diagram showing the MapReduce calculation model operating on a parallel computing device composed of a plurality of computation nodes. As described above, the Map processes of the Map phase and the Reduce processes of the Reduce phase can each be executed in parallel, so these processes can be distributed over a plurality of computation nodes and executed in parallel. Note that the Map phase and the Reduce phase do not necessarily have to be performed by the same parallel computing device. In many cases, the total number of Map and Reduce processes is larger than the total number of computation nodes; each computation node is then necessarily responsible for a plurality of Map and Reduce processes. Consider the case where the total number of Reduce processes (the number of distinct Keys output in the Map phase) is larger than the total number of computation nodes. In this case, the Sort phase in FIG. 2 is subdivided into two phases: a Shuffle phase (inter-node communication phase) and a Local Sort phase (receiving-side sorting phase). In the Shuffle phase, each <Key, Value> pair output in the Map phase is transmitted to the computation node in charge of the Reduce process that is uniquely determined for its Key. When each computation node is in charge of a plurality of Reduce processes (a plurality of Map-output Keys), the <Key, Value> pairs sent in the Shuffle phase must be classified (sorted) by Key at the receiving computation node. This classification (sorting) is performed in the Local Sort phase (receiving-side sorting phase).
As described above, by using the MapReduce calculation model, calculation (processing) can be performed in parallel on a parallel computing device comprising a plurality of computation nodes. Furthermore, when the MapReduce calculation model is used on such a device, only the Map and Reduce processes perform application-specific calculation and processing; the Shuffle phase and the Local Sort phase are the same regardless of the processing content of the application. Therefore, by preparing the Shuffle phase and the Local Sort phase as a common framework in advance, a plurality of applications can be created easily by changing only the processing contents of the Map and Reduce processes.
As described above, by using the MapReduce calculation model, calculation (processing) can be performed in parallel on a parallel computing device comprising a plurality of computation nodes. However, as the amount of data to be processed grows, and in particular as the total number of Reduce processes grows relative to the total number of computation nodes, the Local Sort phase (receiving-side sorting phase) becomes a heavy step and comes to occupy most of the overall processing time. This is because the Map, Reduce, and Shuffle phases take time proportional to the amount of data to be processed, whereas the Local Sort phase, if a simple sorting algorithm is used, takes time proportional to N × log(N) for a data amount N.
To make matters worse, the data rearrangement performed in the Local Sort phase generally requires random access to the storage device. The random access memory (RAM) that serves as the temporary storage device of each computation node normally has a small capacity, so when the amount of data to be handled is large, the Local Sort phase cannot rely only on the RAM, which allows fast random access, but must also use storage devices such as flash memory or hard disk drives, which have large capacity but particularly low random access speed. Consequently, once the amount of data to be handled exceeds the capacity of the RAM serving as the temporary storage device, the time required for the Local Sort increases even more sharply.
The invention of the present application has been made in view of the above. One of its purposes is to perform the per-Key classification (sorting) processing, or a part of it, during the Map phase and the Shuffle phase, thereby providing a means by which data delivered to the receiving computation node in the Shuffle phase is, as far as possible, already classified (sorted) by Key, and using that means to shorten the processing time of the Local Sort phase. The above and other objects and novel features of the invention of the present application will become apparent from the description of this specification and the accompanying drawings.
Note that the problem to be solved by the invention of the present application occurs generally when a plurality of processing units are processed in parallel by a plurality of computation nodes; MapReduce is taken up here only as an example for explanation. The invention of the present application is therefore applicable not only to the MapReduce calculation model but also to many other cases in which a plurality of processing units are processed in parallel by a plurality of computation nodes. FIG. 4 is a schematic diagram of a graph problem being calculated by a parallel computing device, given as an example, different from MapReduce, of such parallel processing to which the invention of the present application can be applied.
Parallel computation of a graph problem means performing various calculations on a given graph structure composed of a plurality of vertices and the edges connecting them. FIG. 4 shows an example in which computation nodes 1 to 4 each process the portion of the graph assigned to it. Typically, a graph problem consists of a plurality of calculation steps; at each step, every vertex performs some calculation that takes as input the previous-step results of the vertices that have edges directed toward it. When a graph problem is calculated by a parallel computing device composed of a plurality of computation nodes, it is natural, given that each vertex can be calculated independently, to assign vertices to the computation nodes and process them in parallel. In that case, each time a calculation step is completed, the calculation results must be transmitted to the vertices at the other ends of the edges, and the receiving computation node must classify (sort) the received data by destination vertex. This is the same kind of processing as the Shuffle phase and the Local Sort phase of the MapReduce calculation model described above, and the larger the number of vertices, that is, the larger the graph problem, the more time this classification (sorting) by destination vertex takes at the receiving computation node, which is the problem.
The present invention solves the above problem with a parallel computer system having a plurality of computation nodes, in which the processing targets are divided according to a first grouping and placed on and processed by the computation nodes, the processing results are stored in a group of storage devices according to a second grouping, and the stored processing results are then transmitted to the computation nodes according to the first grouping.
According to the present invention, the time required for the classification (sorting) processing performed at the receiving computation node can be reduced, and parallel computation can therefore be accelerated.
In the following embodiments, the description will be divided into a plurality of sections or embodiments where necessary for convenience. Unless otherwise stated, however, they are not unrelated to one another; one is a modification, detail, supplementary explanation, or the like of part or all of another. Also, in the following embodiments, when the number of elements or the like (including counts, numerical values, amounts, and ranges) is mentioned, the number is not limited to that specific number and may be greater or less than it, except where explicitly stated or where it is clearly limited to the specific number in principle.
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. In all the drawings for describing the embodiments, the same members are in principle denoted by the same reference signs, and repeated description thereof is omitted.
FIG. 1 is a block diagram showing the configuration of an information processing system 101 that is an embodiment of the present invention. The information processing system 101 is a parallel computer system comprising a plurality of computation nodes CALC_NODE_x (x = 1, 2, 3, ...) and a communication switch COM_SW that mediates communication between them.
Each computation node CALC_NODE_x includes a central processing unit (CPU), a temporary storage device MEM, a storage device STOR, a communication device COM_DEV, and a bus BUS connecting the CPU, the storage device STOR, and the communication device COM_DEV. In each computation node CALC_NODE_x, the CPU reads the necessary input data from the storage device STOR, performs calculations, and, if necessary, transmits the input data or calculation results to other computation nodes using the communication device COM_DEV. During this series of processing, the CPU stores temporarily needed data, intermediate results of calculation, and the like in the temporary storage device MEM as required. As will be described later, one of the computation nodes CALC_NODE_x serves as a retransmission management node.
Here, a random access memory (RAM) is used as the temporary storage device MEM, while flash memory, phase-change memory, or a hard disk drive is used as the storage device STOR. The temporary storage device MEM can therefore be accessed faster than the storage device STOR, but has a smaller capacity. In addition, the temporary storage device MEM allows fast random access, whereas the storage device STOR either allows only sequential access and no random access, or has a random access speed that is very low compared to its sequential access speed.
Note that in the parallel computer system shown in FIG. 1, the CPU, temporary storage device MEM, storage device STOR, and communication device COM_DEV of each computation node are not necessarily the same; their capacity and performance may differ from node to node. The case where some computation nodes do not have all of the CPU, temporary storage device MEM, storage device STOR, and communication device COM_DEV is also contemplated by the invention disclosed in the present application.
FIGS. 5(a) and 5(b) are diagrams explaining the bucket sort method that was the starting point for the invention of the present application in solving the above-described problem. As an example, consider the case where the Keys output in the Map phase are integers greater than or equal to zero. This case is taken up only as an example for explanation; even if the Key is not an integer, for example a character string, the same method can be applied as long as the assignment of Keys to buckets is determined in advance.
Suppose that the computation node CALC_NODE_1, which handles part of the parallel computation, is in charge of the Reduce processes for the 1000 Key numbers 0 to 999. FIG. 5(a) shows the processing flow inside CALC_NODE_1 in this case. In the Shuffle phase, CALC_NODE_1 receives <Key, Value> pairs with Key numbers 0 to 999. Since nothing is guaranteed about the order in which the <Key, Value> pairs arrive, all received pairs must first be stored in the storage device STOR in preparation for the subsequent classification (sorting). After CALC_NODE_1 has received all the <Key, Value> pairs, it reads all of them back from the storage device STOR, classifies (sorts) them by Key, and performs the Reduce processing.
However, with the method shown in FIG. 5(a), the per-Key classification (sorting) must be performed over all the <Key, Value> pairs received by the node, so the classification (sorting) takes a long time. Therefore, preparing a plurality of storage devices and performing a bucket sort, as shown in FIG. 5(b), was also considered.
In FIG. 5(b), CALC_NODE_1 has ten storage devices STOR1 to STOR10, which together form a group of buckets. The computation node CALC_NODE_1 distributes the received <Key, Value> pairs over the ten storage devices according to their Key numbers, for example storing pairs with Key numbers 0 to 99 in STOR1, pairs with Key numbers 100 to 199 in STOR2, and so on. After the Shuffle phase ends and all <Key, Value> pairs have been received, the pairs stored in the storage device STOR1 are first read out, classified (sorted) by Key, and Reduce-processed. After all the <Key, Value> pairs in the storage device STOR1 have been processed, the pairs stored in the storage device STOR2 are read out, classified (sorted) by Key, and Reduce-processed, and STOR3 through STOR10 are then classified (sorted) and Reduce-processed in the same way, in order.
With the method shown in FIG. 5(b), the per-Key classification (sorting) only has to be performed separately for each of the storage devices STOR1 to STOR10, so the classification (sorting) time can be shortened compared with FIG. 5(a). Here, each storage device (bucket) STOR1 to STOR10 is accessed only sequentially, so storage devices with low random access performance can be used. If a storage device that can be accessed sequentially through multiple ports at the same time is available, a single storage device can also form the group of buckets by assigning each access port to a different Key number range (bucket). Note that although the buckets (storage devices STOR1 to STOR10) have been assigned Key ranges in ascending order of Key number here, it is sufficient that the correspondence between Key numbers and storage devices (buckets) is fixed; the assignment does not have to follow ascending Key order. The order in which the storage devices (buckets) are processed after the Shuffle phase is also arbitrary.
However, the method shown in FIG. 5(b) has the problem that a large number of storage devices (buckets) must be prepared in each computation node. In response, the inventors of the present application found the following configuration.
FIG. 6 shows an example in which each of the ten computation nodes CALC_NODE_1 to CALC_NODE_10 of the information processing system 101 is in charge of the Reduce processing for 1000 Key numbers. If ten storage devices (buckets) are placed in each computation node as in the bucket sort method of FIG. 5(b), 100 storage devices (buckets) are needed for the parallel computing device as a whole. As described above, with the bucket sort method of FIG. 5(b), after the Shuffle phase each node classifies (sorts) and Reduce-processes its buckets in order, starting from the bucket storing the <Key, Value> pairs with the smallest Key numbers. Unless the distribution of Keys output in the Map phase is extremely skewed, the time CALC_NODE_1 needs to process its bucket for Key numbers 0 to 99 and the time CALC_NODE_2 needs to process its bucket for Key numbers 1000 to 1099 can be expected to be roughly the same. Consequently, the time at which CALC_NODE_1 starts processing the bucket for Key numbers 100 to 199 and the time at which CALC_NODE_2 starts processing the bucket for Key numbers 1100 to 1199 are almost the same. The inventors of the present application found that, by sharing across the computation nodes the buckets that are needed at almost the same time, that is, the buckets whose classification (sorting) and Reduce processing start at almost the same time, the number of buckets needed for the parallel computation as a whole can be reduced.
FIG. 7 is a diagram explaining an example of the parallel processing operation in the information processing system 101. In describing the embodiment, FIG. 7 shows an example in which parallel processing is executed by the ten computation nodes CALC_NODE_1 to CALC_NODE_10, each of which is in charge of the Reduce processing for 1000 Key numbers. As noted above, however, the assumptions that the Keys are integers and the particular way the Reduce processes and buckets are allocated are arbitrary.
In the parallel processing operation shown in FIG. 7, three kinds of components operate: ten storage devices (buckets) BSTOR1 to BSTOR10 that temporarily store the <Key,Value> pairs output in the Map phase, ten computation nodes CALC_NODE_1 to CALC_NODE_10 that perform Reduce processing, and a controller BUCKET_CONT that manages bucket retransmission.
The computation nodes CALC_NODE_1 to 10 implement, together with the storage devices STOR1 to STOR10, the functions of Map processing, classification (sorting) for each key, and Reduce processing. As shown in FIG. 7, the storage device STOR of computation node CALC_NODE_1 is denoted STOR1, the storage device STOR of computation node CALC_NODE_2 is denoted STOR2, and so on.
The bucket group BSTOR1 to BSTOR10 that temporarily stores <Key,Value> pairs is implemented by reusing the storage devices STOR of the computation nodes that execute the parallel computation: the storage device STOR of computation node CALC_NODE_1 serves as BSTOR1, that of CALC_NODE_2 as BSTOR2, that of CALC_NODE_3 as BSTOR3, and so on. This allows the parallel computation to be executed with few resources. Alternatively, the buckets may be implemented with the storage devices STOR of computation nodes that do not take part in the parallel computation, for example with the storage device STOR of CALC_NODE_11 as BSTOR1, that of CALC_NODE_12 as BSTOR2, that of CALC_NODE_13 as BSTOR3, and so on. The number of buckets BSTOR1 to BSTOR10 and the number of computation nodes CALC_NODE_1 to CALC_NODE_10 that perform Reduce processing may also differ. In a configuration where the numbers of buckets and computation nodes differ and the buckets are placed on the computation nodes, an integer number (zero or more) of storage devices (buckets) is placed per computation node.
The controller BUCKET_CONT that manages bucket retransmission is realized by a retransmission management node, which is one of the computation nodes CALC_NODE_x. The retransmission management node may be combined with a computation node that contributes to the parallel processing, or a separate computation node not involved in the parallel computation may be used.
In the parallel processing method shown in FIG. 7, the Shuffle phase (inter-node communication) described above is further subdivided into two phases: a bucket transmission phase and a bucket retransmission phase. In the bucket transmission phase, instead of sending each <Key,Value> pair output by a Map process directly to the computation node that performs the Reduce processing for that key, the pair is sent to the bucket group BSTOR1 to BSTOR10 shared by the entire parallel computation. At this time, if the remainder of dividing the key number by 1000 is 0 to 99 the destination is BSTOR1, if the remainder is 100 to 199 the destination is BSTOR2, and so on according to the same rule. In this way, for each <Key,Value> pair resulting from Map processing on computation nodes CALC_NODE_1 to 10, each computation node assigns a destination among BSTOR1 to BSTOR10.
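A minimal sketch of this destination rule (Python; the function name and the even 100-key split per bucket are illustrative, following the example numbers above) is:

```python
NUM_BUCKETS = 10
KEYS_PER_NODE = 1000

def send_bucket(key: int) -> int:
    """Shared bucket for the bucket transmission phase, as an index 0..9:
    remainder 0-99 -> BSTOR1, 100-199 -> BSTOR2, and so on."""
    return (key % KEYS_PER_NODE) // (KEYS_PER_NODE // NUM_BUCKETS)

assert send_bucket(42) == 0     # remainder 42  -> BSTOR1
assert send_bucket(3142) == 1   # remainder 142 -> BSTOR2
assert send_bucket(9999) == 9   # remainder 999 -> BSTOR10
```

Every Map output, on whichever node it is produced, is routed by this rule during the bucket transmission phase instead of going straight to its Reduce node.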
When all <Key,Value> pairs output in the Map phase have been stored in the bucket group BSTOR1 to BSTOR10, processing moves to the bucket retransmission phase. In the bucket retransmission phase, in accordance with instructions from the controller BUCKET_CONT, the <Key,Value> pairs stored in each bucket are retransmitted to their original destination computation nodes (the computation nodes performing the Reduce processing). Specifically, first, the <Key,Value> pairs stored in bucket BSTOR1 are retransmitted to their original destination computation nodes. At this time, each computation node CALC_NODE_1 to 10 stores the received <Key,Value> pairs in its own storage device STOR (for example, STOR1 in the case of CALC_NODE_1). When the retransmission of the <Key,Value> pairs stored in bucket BSTOR1 is complete, the controller BUCKET_CONT notifies the computation nodes CALC_NODE_1 to CALC_NODE_10. Each computation node CALC_NODE_1 to CALC_NODE_10 then reads the <Key,Value> pairs stored in its storage device STOR1 to STOR10, classifies (sorts) them by key, and performs the Reduce processing. When all computation nodes have finished the Reduce processing and the storage devices STOR1 to STOR10 are empty, the controller BUCKET_CONT starts the retransmission of the next bucket, BSTOR2. Retransmission and Reduce processing are then carried out bucket by bucket in the same way up to the last bucket, BSTOR10.
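The controller-driven loop of the bucket retransmission phase can be summarized with the in-memory sketch below (Python; lists and dictionaries stand in for the storage devices, the network and the Reduce processing, so this only illustrates the control flow, not the claimed implementation):

```python
from collections import defaultdict

def reduce_node_of(key: int) -> int:
    """Node in charge of the Reduce processing for this key (1000 keys per node)."""
    return key // 1000

def bucket_retransmission_phase(buckets, num_nodes=10):
    """Sketch of the controller-driven retransmission of FIG. 7.

    `buckets` is a list of lists of (key, value) pairs standing in for
    BSTOR1..BSTOR10.
    """
    results = defaultdict(list)                    # values gathered per key
    for bucket in buckets:                         # controller: BSTOR1, then BSTOR2, ...
        stor = [[] for _ in range(num_nodes)]      # per-node receive buffers STOR1..STOR10
        for key, value in bucket:                  # retransmit to the original destinations
            stor[reduce_node_of(key)].append((key, value))
        # controller notifies the nodes; each node sorts its buffer and "reduces" it
        for node_buffer in stor:
            for key, value in sorted(node_buffer, key=lambda kv: kv[0]):  # Local Sort by key
                results[key].append(value)         # stand-in for the Reduce processing
        # all buffers are now drained, so the controller moves on to the next bucket
    return results
```

For example, feeding this function buckets built with the `send_bucket` rule above reproduces, key by key, the list of values a Reduce process would receive.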
As described above, in the parallel processing method shown in FIG. 7, each computation node performs Map processing according to the first grouping, the processing results are sent according to the second grouping to the storage devices BSTOR1 to 10, and the results are then retransmitted from the storage devices to the computation nodes according to the first grouping. Thus, by preparing ten buckets BSTOR1 to BSTOR10 for the parallel processing as a whole, the time required for the classification (sorting) in each computation node can be shortened to the same level as when ten buckets are prepared for each computation node (the configuration of FIG. 5B). On the other hand, since communication between computation nodes is performed twice, in the bucket transmission phase and the bucket retransmission phase, the amount of communication increases. In general, however, for a data amount N the time required for inter-node communication is proportional to N, whereas the time required for classification (sorting) is proportional to N × log(N); the larger the data amount, the more effective the shortening of the classification (sorting) time is in shortening the overall processing time.
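Written as a rough cost model (a simplification of ours; $N$ denotes the data volume and $B$ the number of shared buckets, and neither symbol appears in the original text):

$$T_{\mathrm{comm}}(N) \propto N, \qquad T_{\mathrm{sort}}(N) \propto N \log N, \qquad \frac{T_{\mathrm{sort}}(N)}{T_{\mathrm{comm}}(N)} \propto \log N .$$

For large $N$ the sorting term dominates. Splitting the sort over $B$ buckets costs roughly $B \cdot \tfrac{N}{B}\log\tfrac{N}{B} = N \log\tfrac{N}{B}$, a saving of about $N \log B$, while the second round of communication only adds another term proportional to $N$.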
In the description above, the keys for which each computation node performs Reduce processing were simply assigned 1000 at a time in ascending order of key number, and keys were assigned to buckets according to the remainder of dividing the key number by 1000; these choices are arbitrary, however, and other assignments are applicable. On the other hand, to enhance the effect of the present invention, it is desirable that, as shown in FIG. 6, the assignment of keys to the computation nodes in charge of Reduce processing and the assignment of keys to the buckets be as close to orthogonal to each other as possible; the sketch after this paragraph illustrates this condition. That is, it is desirable that the processing targets included in any one group of the first grouping be distributed so that at least one is included in each group of the second grouping. Furthermore, the desirable assignment depends on the distribution of keys output in the Map phase and differs for each application and each set of input data. It is therefore conceivable to let the application designer or user configure how keys are assigned to the computation nodes in charge of Reduce processing and how they are assigned to the buckets. It is also possible to perform the state transitions between phases without the controller BUCKET_CONT that manages bucket retransmission, by exchanging the necessary information between the buckets and the computation nodes as needed.
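The orthogonality condition can be checked mechanically; the sketch below (Python, using the same illustrative node and bucket rules as the earlier sketches) verifies that every node's key range is spread over all ten shared buckets:

```python
def is_orthogonal(num_nodes=10, keys_per_node=1000, num_buckets=10):
    """True if every Reduce node's key range is spread over all shared buckets."""
    span = keys_per_node // num_buckets            # 100 keys per bucket in the example
    for node in range(num_nodes):
        node_keys = range(node * keys_per_node, (node + 1) * keys_per_node)
        buckets_hit = {(key % keys_per_node) // span for key in node_keys}
        if len(buckets_hit) < num_buckets:         # some bucket never sees this node's keys
            return False
    return True

print(is_orthogonal())   # True for the example assignment described above
```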
As described above, the processing target of the information processing system 101 may also be graph structure data. In the case of graph structure data, the graph is divided among the computation nodes that perform the parallel processing and assigned to them, and each computation node processes the vertex group assigned to it.
FIG. 8 is a diagram for explaining a second example of the parallel processing operation in the information processing system 101. The method of this embodiment applies the idea of radix sort, a method that achieves finer sorting by performing bucket sort multiple times: by reusing the same bucket group BSTOR1 to BSTOR10 and performing bucket sort twice, substantially the same effect is obtained as if the number of buckets were squared. In this embodiment, the description takes as an example the form in which the bucket group BSTOR1 to BSTOR10 that temporarily stores <Key,Value> pairs is implemented by reusing the storage devices STOR of the computation nodes that execute the parallel computation, with the storage device STOR of CALC_NODE_1 serving as BSTOR1, that of CALC_NODE_2 as BSTOR2, that of CALC_NODE_3 as BSTOR3, and so on.
In the processing method shown in FIG. 8, the <Key,Value> pairs output in the Map phase are first sent to the buckets BSTOR1 to BSTOR10 determined on the basis of the remainder of dividing the key number by 100. After all this communication is complete, each computation node reads the contents stored in bucket BSTOR1 and sends them again to buckets BSTOR1 to BSTOR10; this time the destination bucket is determined on the basis of the remainder of dividing the key number by 1000. After the retransmission of bucket BSTOR1 is complete, buckets BSTOR2 through BSTOR10 are retransmitted in order in the same way. After the retransmission of all buckets is complete, each computation node again reads the contents stored in bucket BSTOR1 and re-retransmits them to the original destination computation nodes (the computation nodes performing the Reduce processing), where Local Sort and Reduce processing are carried out. After the processing of bucket BSTOR1 is complete, buckets BSTOR2 through BSTOR10 are likewise re-retransmitted in order, with Local Sort and Reduce processing at each computation node.
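One reading of the two routing rules, under the assumption (ours) that each remainder range is split evenly over the ten buckets as in the earlier sketches, is:

```python
NUM_BUCKETS = 10

def pass1_bucket(key: int) -> int:
    """First bucket sort of FIG. 8: route by the remainder modulo 100."""
    return (key % 100) // (100 // NUM_BUCKETS)      # 0-9 -> BSTOR1, 10-19 -> BSTOR2, ...

def pass2_bucket(key: int) -> int:
    """Second bucket sort: route by the remainder modulo 1000."""
    return (key % 1000) // (1000 // NUM_BUCKETS)    # 0-99 -> BSTOR1, 100-199 -> BSTOR2, ...

# Two passes over the same ten shared buckets separate 10 * 10 = 100 key
# classes -- the same resolution as 100 physical buckets, as stated in the text.
assert len({(pass1_bucket(k), pass2_bucket(k)) for k in range(1000)}) == 100
```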
As described above, in the processing method shown in FIG. 8, the time required for the classification (sorting) in each computation node can be shortened to the same level as when 100 buckets are prepared for each computation node. The number of bucket transmissions and retransmissions can also be made larger than two. Increasing the number of retransmissions reduces the time required for the per-key classification (sorting) at each computation node but increases the amount of communication between computation nodes, so an optimal number of bucket retransmissions exists depending on the application and the input data. It is therefore conceivable to let the application designer or user configure the number of bucket retransmissions.
FIG. 9 is a diagram for explaining a third example of the parallel processing operation in the information processing system 101. In the first embodiment of the present invention shown in FIG. 7, the retransmission of the buckets is started only after the bucket transmission phase is complete, that is, after all the <Key,Value> pairs output in the Map phase have been stored in the buckets BSTOR1 to BSTOR10. Therefore, the original destination computation nodes (the computation nodes performing the Reduce processing) cannot receive any <Key,Value> pairs until the bucket retransmission phase starts. If, however, the original destination computation nodes could receive <Key,Value> pairs from an early stage, the inventors considered that the overall processing time might be shortened by performing the per-key classification (sorting) little by little in parallel (online) while receiving the pairs.
In the processing method shown in FIG. 9, the buckets BSTOR1 to BSTOR10 are assigned key numbers whose remainder after division by 1000 is 91 to 181, 182 to 272, and so on, that is, 90 or 91 key numbers each. The value 90 or 91 corresponds to the number of keys for which one computation node performs Reduce processing (1000) divided by the number of buckets plus one (10 + 1). In addition, the time required to complete all Map processing is estimated in advance, and every time one tenth of that estimated time has elapsed, the stored contents of buckets BSTOR1 to BSTOR10 are read out in order and retransmitted to their original destination computation nodes (the computation nodes performing the Reduce processing). In this way, the original destination computation nodes can receive <Key,Value> pairs from an early stage and can perform the per-key classification (sorting) little by little in parallel (online) while receiving them.
Specifically, prior to the start of the Map phase, the time Tmap required for all Map processing is estimated by some means. Since the number of Map processes and their processing contents are usually known in advance, the processing time of the entire Map phase can be estimated. Even if this estimate contains some error, the following operation is not hindered.
After the Map phase starts, each <Key,Value> pair output in the Map phase is addressed to the bucket corresponding to its key number if such a bucket exists, and to its original destination computation node (the computation node performing the Reduce processing) if no corresponding bucket exists. Therefore, <Key,Value> pairs whose key numbers have a remainder of 0 to 90 are not sent to the bucket group at all, but are transmitted directly to their original destination computation nodes.
Thereafter, at the point when Tmap/10 has elapsed from the start of the Map phase, the stored contents of bucket BSTOR1 are read out and retransmitted to their original destination computation nodes (the computation nodes performing the Reduce processing). From this point on, among the <Key,Value> pairs output by the Map processing, not only the pairs with no corresponding bucket (remainders 0 to 90) but also the pairs that bucket BSTOR1 was in charge of (remainders 91 to 181) are transmitted directly to their original destination computation nodes. Then, after a further Tmap/10 has elapsed, that is, when (Tmap/10) × 2 has elapsed from the start of the Map phase, the stored contents of bucket BSTOR2 are read out and retransmitted to their original destination computation nodes; from this point on, <Key,Value> pairs that bucket BSTOR2 was in charge of are likewise transmitted directly to their original destination computation nodes. In the same way, each time a further Tmap/10 elapses, the next bucket is retransmitted to the original destination computation nodes, and thereafter a <Key,Value> pair output by the Map processing is addressed to its bucket if a bucket corresponding to its key exists and that bucket has not yet been retransmitted, and directly to its original destination computation node otherwise.
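The per-pair routing decision during the Map phase can then be sketched as follows (Python; `bucket_of` encodes the remainder ranges 91-181, 182-272, ... of FIG. 9, and `flushed` is the set of buckets the elapsed-time schedule has already retransmitted; both names and the exact boundary arithmetic are ours):

```python
KEYS_PER_NODE = 1000
BUCKET_SPAN = 91                       # remainders 91-181 -> BSTOR1, 182-272 -> BSTOR2, ...

def bucket_of(key: int):
    """Bucket index (0..9) covering this key's remainder, or None if none does."""
    r = key % KEYS_PER_NODE
    if r < BUCKET_SPAN:                # remainders 0-90 have no bucket
        return None
    return min((r - BUCKET_SPAN) // BUCKET_SPAN, 9)

def route(key: int, flushed: set):
    """Destination of one Map output pair: a shared bucket or the Reduce node."""
    b = bucket_of(key)
    if b is not None and b not in flushed:          # bucket exists, not yet retransmitted
        return ("bucket", b)
    return ("reduce_node", key // KEYS_PER_NODE)    # otherwise send directly

assert route(50, flushed=set()) == ("reduce_node", 0)   # remainder 50: no bucket
assert route(1100, flushed=set()) == ("bucket", 0)      # remainder 100: BSTOR1
assert route(1100, flushed={0}) == ("reduce_node", 1)   # BSTOR1 already retransmitted
```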
In the above description, for simplicity, every bucket was assigned the same number of keys and the buckets were retransmitted at timings obtained by dividing the time Tmap required for all Map processing into equal intervals. In practice, however, this may reduce the utilization efficiency of the storage devices and of the network, because the number of <Key,Value> pairs held in each bucket at its retransmission time then differs from bucket to bucket: bucket BSTOR1 accumulates <Key,Value> pairs only for a time of Tmap/10, whereas bucket BSTOR10 accumulates them for the full time Tmap. Consequently, if the distribution of keys output in the Map phase is uniform, bucket BSTOR1 holds only one tenth as many <Key,Value> pairs as bucket BSTOR10 at its retransmission time. To make the numbers of <Key,Value> pairs held in the buckets at their retransmission times as equal as possible, the number of keys assigned to each bucket can be made larger for bucket BSTOR1 and smaller for bucket BSTOR10. Alternatively, instead of simply dividing Tmap into equal parts, the retransmission timing can be chosen so that the interval between retransmitting one bucket and retransmitting the next becomes gradually shorter.
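Under the additional assumption (ours, not quantified in the text) that Map outputs arrive at a uniform rate with uniformly distributed keys, the imbalance can be written down directly: a bucket with key share $s_i$ that is flushed at time $t_i$ holds about

$$C_i \propto s_i \, t_i, \qquad t_i = \frac{i}{10}\,T_{\mathrm{map}} \;\Rightarrow\; \frac{C_1}{C_{10}} = \frac{1}{10} \ \text{for equal shares},$$

which is the one-tenth figure above. Equal occupancy $C_i = \mathrm{const}$ requires $s_i \propto 1/t_i$, that is, more keys for the earlier buckets; alternatively, keeping equal shares but shortening the later intervals reduces the ratio $t_{10}/t_1$ and hence the imbalance, which corresponds to the two remedies described in the text.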
As described above, in the processing method according to the third embodiment of the present invention shown in FIG. 9, the buckets are retransmitted in order in parallel with the Map processing, so the original destination computation nodes (the computation nodes performing the Reduce processing) can receive <Key,Value> pairs from an early stage and can perform the per-key classification (sorting) little by little in parallel (online) while receiving them, shortening the overall processing time.
101: information processing system, CALC_NODE_x: computation node, COM_SW: communication switch, MEM: temporary storage device, STOR: storage device, COM_DEV: communication device, BUS: bus, BSTOR1 to 10: storage devices (buckets), BUCKET_CONT: controller that manages bucket retransmission.
Claims (10)
1. A parallel processing method in a parallel computer system having a plurality of computation nodes, the method comprising: dividing a processing target according to a first grouping, placing the divided parts on the computation nodes, and processing them; giving each processing result a destination corresponding to the second group to which it belongs; storing the processing results in storage devices that differ for each destination; and transmitting the stored processing results to the computation nodes according to the first grouping.
2. The parallel processing method according to claim 1, wherein each storage device in which the processing results are stored is arranged on a computation node.
3. The parallel processing method according to claim 1, wherein, when the stored processing results are transmitted to the computation nodes according to the first grouping, the stored processing results are transmitted with a time difference for each group of the second grouping.
4. The parallel processing method according to claim 1, wherein the processing targets included in any one group of the first grouping are distributed so that at least one of them is included in each group of the second grouping.
5. The parallel processing method according to claim 1, wherein the processing target is graph structure data.
6. A parallel computer system having a plurality of computation nodes, wherein the system divides a processing target according to a first grouping, places the divided parts on the computation nodes, and processes them; gives each processing result a destination corresponding to the second group to which it belongs; stores the processing results in storage devices that differ for each destination; and transmits the stored processing results to the computation nodes according to the first grouping.
7. The parallel computer system according to claim 6, wherein each storage device in which the processing results are stored is arranged on a computation node.
8. The parallel computer system according to claim 6, wherein, when the stored processing results are transmitted to the computation nodes according to the first grouping, the stored processing results are transmitted with a time difference for each group of the second grouping.
9. The parallel computer system according to claim 6, wherein the processing targets included in any one group of the first grouping are distributed so that at least one of them is included in each group of the second grouping.
10. The parallel computer system according to claim 6, wherein the processing target is graph structure data.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2014553925A JP5965498B2 (en) | 2012-12-26 | 2012-12-26 | Parallel processing method and parallel computer system |
PCT/JP2012/083546 WO2014102917A1 (en) | 2012-12-26 | 2012-12-26 | Parallel processing method and parallel computer system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2012/083546 WO2014102917A1 (en) | 2012-12-26 | 2012-12-26 | Parallel processing method and parallel computer system |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2014102917A1 true WO2014102917A1 (en) | 2014-07-03 |
Family
ID=51020076
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2012/083546 WO2014102917A1 (en) | 2012-12-26 | 2012-12-26 | Parallel processing method and parallel computer system |
Country Status (2)
Country | Link |
---|---|
JP (1) | JP5965498B2 (en) |
WO (1) | WO2014102917A1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2018538607A (en) * | 2015-10-28 | 2018-12-27 | グーグル エルエルシー | Calculation graph processing |
CN113094155A (en) * | 2019-12-23 | 2021-07-09 | 中国移动通信集团辽宁有限公司 | Task scheduling method and device under Hadoop platform |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2011113377A (en) * | 2009-11-27 | 2011-06-09 | Hitachi Ltd | Distributed arithmetic device and method of controlling the same |
Also Published As
Publication number | Publication date |
---|---|
JPWO2014102917A1 (en) | 2017-01-12 |
JP5965498B2 (en) | 2016-08-03 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 12891218; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2014553925; Country of ref document: JP; Kind code of ref document: A |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 12891218; Country of ref document: EP; Kind code of ref document: A1 |