CN110502337B - Optimization system for shuffling stage in Hadoop MapReduce - Google Patents

Optimization system for shuffling stage in Hadoop MapReduce

Info

Publication number
CN110502337B
CN110502337B (application CN201910627734.5A)
Authority
CN
China
Prior art keywords
shuffle
hadoop mapreduce
node
shuffling
mapreduce
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910627734.5A
Other languages
Chinese (zh)
Other versions
CN110502337A (en)
Inventor
Guan Haibing (管海兵)
Wu Zhongxuan (吴仲轩)
Ren Rui (任锐)
Qi Zhengwei (戚正伟)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201910627734.5A priority Critical patent/CN110502337B/en
Publication of CN110502337A publication Critical patent/CN110502337A/en
Application granted granted Critical
Publication of CN110502337B publication Critical patent/CN110502337B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/18: File system types
    • G06F16/1805: Append-only file systems, e.g. using logs or journals to store data
    • G06F16/1815: Journaling file systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46: Multiprogramming arrangements
    • G06F9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061: Partitioning or combining of resources
    • G06F9/5066: Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an optimization system for the shuffling stage in Hadoop MapReduce. The system runs as a daemon on the worker nodes and master node of Hadoop MapReduce and communicates with Hadoop MapReduce via inter-process communication and remote procedure calls. An optimization method based on the optimization system is also provided. Once running, the system takes over all intermediate data produced during a Hadoop MapReduce job. By pre-merging and pre-shuffling, it exploits the otherwise idle network bandwidth of the Map stage on the one hand and, by merging the intermediate data on each node, greatly reduces small-file reads and writes on the other, thereby shortening MapReduce job completion time.

Description

Optimization system for shuffling stage in Hadoop MapReduce
Technical Field
The invention relates to the technical field of big data and cloud computing, and in particular to an optimization system for the Shuffle stage in Hadoop MapReduce.
Background
MapReduce is a distributed computing framework for processing big data. Hadoop MapReduce is the best-known and most widely used open-source implementation of MapReduce. By simply writing Map and Reduce functions, a Hadoop MapReduce user can process massive data (terabytes or even petabytes) in parallel on a large cluster (up to thousands of nodes). Moreover, Hadoop MapReduce provides strong fault tolerance, ensuring that jobs complete even across thousands of nodes.
Hadoop MapReduce follows the BSP (Bulk Synchronous Parallel) model, abstracting the distributed computation into three stages: Map, Shuffle, and Reduce.
The Map stage consists of two sub-stages: Map computation and partitioning (Partition). In the Map computation sub-stage, each worker node applies the Map function of the submitted algorithm to its input data and outputs intermediate data consisting of key-value pairs. In the partitioning sub-stage, the worker node partitions the intermediate data according to the submitted partitioning function; each partition maps one-to-one to a Reduce subtask. Finally, the intermediate data output by the Map stage is written to disk.
In the Shuffle stage, each worker node redistributes the intermediate data according to the Map-stage partitioning result: intermediate data belonging to the same partition is transmitted over the network to a designated node and stored on its disk.
The Reduce stage consists of two sub-stages: sorting (Sort) and Reduce computation. In the sorting sub-stage, the worker node reads the shuffled intermediate data from disk and sorts it by key. In the Reduce computation sub-stage, the worker node applies the Reduce function of the submitted algorithm to the intermediate data and outputs the final result.
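The three BSP stages described above can be illustrated with a minimal single-process simulation of a word-count job (the function names and structure here are hypothetical teaching aids, not the patent's implementation or Hadoop's actual API):

```python
from collections import defaultdict

def map_fn(line):
    # Map: emit a (word, 1) key-value pair for every word in the input
    for word in line.split():
        yield word, 1

def partition(key, num_reduces):
    # Partition: one partition per Reduce subtask (hash partitioning)
    return hash(key) % num_reduces

def run_job(lines, num_reduces=2):
    # Map + Partition: each partition collects its intermediate pairs
    partitions = defaultdict(list)
    for line in lines:
        for k, v in map_fn(line):
            partitions[partition(k, num_reduces)].append((k, v))
    # Shuffle + Sort + Reduce: within each partition, sort by key,
    # group, and apply the Reduce function (here: sum the counts)
    result = {}
    for pairs in partitions.values():
        grouped = defaultdict(list)
        for k, v in sorted(pairs):
            grouped[k].append(v)
        for k, vs in grouped.items():
            result[k] = sum(vs)
    return result
```

Because all pairs with the same key hash to the same partition, the per-partition grouping is equivalent to a global group-by, mirroring how each Reduce subtask sees every value for its keys.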
The current implementation of Hadoop MapReduce has two drawbacks that significantly affect performance: first, the Shuffle stage is coupled to the Reduce stage, leaving network bandwidth idle during the Map stage; second, the Shuffle stage reads and writes a large number of small files, making disk I/O the bottleneck of the Shuffle stage.
Specifically:
In Hadoop MapReduce, the Shuffle stage and the Reduce stage are coupled together: a worker node must wait until the Reduce stage starts before shuffling data, so network bandwidth is completely idle during the Map stage. Research shows that in big-data processing jobs the Shuffle stage accounts for, on average, one third of total job completion time. This inefficient use of network bandwidth therefore seriously hurts Hadoop MapReduce performance.
In production use of Hadoop MapReduce, both the official documentation and industry experience recommend splitting a single MapReduce job into many small subtasks (tasks), i.e., configuring large numbers of Map and Reduce subtasks. This improves execution parallelism and reduces the impact of stragglers. Countless deployments have shown that this practice markedly shortens job completion time, especially on large clusters. However, splitting a job into many subtasks forces the Shuffle stage to read and write a large number of small files, and this small, random disk I/O becomes the Shuffle-stage bottleneck, seriously affecting job completion time.
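A back-of-envelope count makes the small-file problem concrete. In the stock shuffle, every Reduce subtask fetches one partition file from every Map subtask. If intermediate data is first merged per node (as this patent proposes), each Reduce only needs one fetch per node. The numbers below are an illustrative estimate, not figures from the patent:

```python
def shuffle_file_reads(num_maps, num_reduces, num_nodes=None):
    # Baseline Hadoop shuffle: every Reduce subtask reads one partition
    # file produced by every Map subtask -> M x R small reads.
    baseline = num_maps * num_reduces
    if num_nodes is None:
        return baseline
    # With per-node pre-merging, each node exposes a single merged
    # intermediate file, so each Reduce reads one file per node
    # -> N x R reads (illustrative model, assuming even placement).
    return num_nodes * num_reduces

# e.g. 1000 Map subtasks and 100 Reduce subtasks on a 50-node cluster:
# 100,000 small-file reads shrink to 5,000 larger sequential reads.
assert shuffle_file_reads(1000, 100) == 100_000
assert shuffle_file_reads(1000, 100, num_nodes=50) == 5_000
```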
At present, no description or report of a technology similar to the present invention has been found, nor have similar data been collected at home or abroad.
Disclosure of Invention
To address the above shortcomings of the prior art, the invention provides an optimization system for the shuffling stage in Hadoop MapReduce. The system is a distributed system with a master-slave architecture and runs as a daemon on the Hadoop MapReduce worker nodes. The system takes over all intermediate data produced during a MapReduce job; by pre-merging and pre-shuffling, it makes good use of the Map-stage network bandwidth and greatly reduces small-file reads and writes, thereby shortening MapReduce job completion time.
The invention is realized by the following technical scheme.
According to one aspect of the invention, an optimization system for the shuffling stage in Hadoop MapReduce is provided, comprising a system master node and system worker nodes, wherein:
the system master node comprises a scheduler module and a communication module a. The scheduler module schedules when partition files are pre-merged, when the pre-shuffle takes place, and where the shuffle results are sent. Communication module a uses inter-process communication and remote procedure calls to communicate with the Hadoop MapReduce master node and with the system worker nodes;
each system worker node comprises a shuffle processing module and a communication module b. The shuffle processing module pre-merges all temporary files on the same node into one large temporary file and pre-shuffles it at the time indicated by the scheduler module. Communication module b uses inter-process communication and remote procedure calls to communicate with the system master node and with the Hadoop MapReduce worker node.
Preferably, the optimization system runs as a daemon on the Hadoop MapReduce worker nodes.
According to another aspect of the invention, an optimization method for the shuffling stage in Hadoop MapReduce, based on the above optimization system, comprises:
a pre-merge process: the shuffle processing module obtains the temporary-file paths of Map computation results from the Hadoop MapReduce worker node. Each time it observes that a Map subtask has completed, it triggers one pre-merge, combining the newly obtained temporary file with the result of the previous pre-merge. Repeating this process, it eventually merges all temporary files in the worker node's file system into a single intermediate data file;
a pre-shuffle process: once the Map subtasks running in the same batch have completed and been pre-merged, the shuffle processing module triggers the pre-shuffle, shuffling the intermediate data file to the Hadoop MapReduce worker node designated by the scheduler module.
The designated Hadoop MapReduce worker node is chosen at random.
Preferably, the pre-merge process is preceded by the following:
upon receiving a newly submitted job, Hadoop MapReduce notifies the scheduler module via inter-process communication; the scheduler module notifies all system worker nodes, which then begin monitoring subtask completion on their Hadoop MapReduce worker nodes.
Preferably, the pre-shuffle process is followed by the following:
after the Reduce subtasks start, the shuffle processing module notifies each Reduce subtask of its intermediate data file path; the Reduce subtask then reads the intermediate data directly and sequentially from the local file system and performs its computation.
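The pre-merge and pre-shuffle processes above can be sketched as a small worker-daemon class. This is a simplified local simulation under stated assumptions: class and method names are hypothetical, merging is modeled as byte concatenation, and the network transfer is simulated by a local file copy:

```python
import os
import random
import tempfile

class ShuffleWorker:
    """Sketch of the shuffle processing module (hypothetical API):
    incremental pre-merge of Map outputs, then pre-shuffle of the
    merged intermediate data file to a randomly designated node."""

    def __init__(self, workdir):
        # Single per-node intermediate data file built up incrementally
        self.merged = os.path.join(workdir, "intermediate.data")

    def on_map_done(self, temp_file):
        # Pre-merge: fold the newly finished Map output into the merged
        # file (one pre-merge per completed Map subtask), then delete
        # the small temporary file.
        with open(temp_file, "rb") as src, open(self.merged, "ab") as dst:
            dst.write(src.read())
        os.remove(temp_file)

    def pre_shuffle(self, nodes):
        # Pre-shuffle: ship the merged file to a randomly chosen
        # destination node (simulated as a directory copy here).
        dest = random.choice(nodes)
        target = os.path.join(dest, "shuffled.data")
        with open(self.merged, "rb") as src, open(target, "wb") as dst:
            dst.write(src.read())
        return target
```

Because the shuffle transmits one merged file per node instead of one small file per Map subtask, the random small-file I/O the Background section describes is replaced by a single sequential transfer.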
Compared with the prior art, the invention has the following beneficial effects:
1. By pre-merging small files into one large file as soon as each Map computation finishes, the invention reduces the number of file reads and writes, shortens Shuffle-stage data-read time, and reduces the tail latency of disk I/O. Specifically, a MapReduce job is split into many Map subtasks, and the intermediate data produced by the Map subtasks on a given Hadoop MapReduce worker node all land in the same file system during the computation. After the Map computation of each subtask completes, the system worker node merges all intermediate data in that file system into a single large file, and the subsequent Shuffle stage reads intermediate data directly from the merged file.
2. By pre-shuffling, the invention overlaps the Map and Shuffle stages, fully exploiting the network bandwidth that is otherwise idle during the Map stage and thereby shortening job completion time. Specifically, in the stock Hadoop MapReduce implementation the Reduce and Shuffle stages are tightly coupled, and intermediate data is transferred only while Reduce subtasks run. In the invention, the system worker node takes over the entire Shuffle stage: as soon as the Map computation of a subtask finishes, its output is pre-merged and then immediately transmitted over the network to the worker node designated by the scheduler module. Because a MapReduce job is split into many Map subtasks that run on the worker nodes in batches, the system can effectively overlap pre-shuffling with Map computation and raise Map-stage resource utilization.
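The benefit of overlapping can be estimated with a toy makespan model. This is an illustrative back-of-envelope calculation, not a figure from the patent; it assumes Map subtasks run in equal-length batches and that each batch's shuffle fits inside the next batch's Map time:

```python
def makespan(batches, map_time, shuffle_time):
    """Toy model of job completion time (hypothetical, per-batch units).

    baseline:   shuffle only starts after Map work, so every batch pays
                both its Map time and its shuffle time serially.
    overlapped: each batch's shuffle runs during the next batch's Map
                computation (pre-shuffle); only the last batch's
                shuffle remains exposed on the critical path.
    """
    baseline = batches * map_time + batches * shuffle_time
    overlapped = batches * map_time + shuffle_time  # assumes shuffle <= map
    return baseline, overlapped
```

For example, four batches with 10 time units of Map work and 5 of shuffle per batch give a baseline of 60 units versus 45 with overlap, consistent with the observation that shuffling averages about one third of job time.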
3. The optimization system for the shuffling stage in Hadoop MapReduce provided by the invention runs as a daemon on the Hadoop MapReduce worker nodes and master node and communicates with Hadoop MapReduce via inter-process communication and remote procedure calls. Once running, it takes over all intermediate data of the Hadoop MapReduce job; by pre-merging and pre-shuffling, it exploits the idle Map-stage network bandwidth on the one hand and, by merging the intermediate data on each node, greatly reduces small-file reads and writes on the other, thereby shortening MapReduce job completion time and optimizing the existing Hadoop MapReduce.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading the following detailed description of non-limiting embodiments with reference to the drawings:
FIG. 1 is a system architecture diagram of the present invention.
FIG. 2 is a schematic comparison of the system workflow during operation of the present invention.
Detailed Description
The following embodiment illustrates the invention in detail. The embodiment is implemented on the premise of the technical solution of the invention and gives a detailed implementation and a specific operation process. It should be noted that those skilled in the art can make various changes and modifications without departing from the spirit of the invention, all of which fall within the scope of the invention.
The embodiment of the invention provides an optimization system for the shuffling stage in Hadoop MapReduce, comprising a system master node and system worker nodes, wherein:
the system master node comprises a scheduler module and a communication module a. The scheduler module schedules when partition files are pre-merged, when the pre-shuffle takes place, and where the shuffle results are sent. Communication module a uses inter-process communication and remote procedure calls to communicate with the Hadoop MapReduce master node and with the system worker nodes;
each system worker node comprises a shuffle processing module and a communication module b. The shuffle processing module pre-merges all temporary files on the same node into one large temporary file and pre-shuffles it at the time indicated by the scheduler module. Communication module b uses inter-process communication and remote procedure calls to communicate with the system master node and with the Hadoop MapReduce worker node.
Further, the optimization system runs as a daemon on the Hadoop MapReduce worker nodes.
The embodiment of the invention also provides an optimization method for the shuffling stage in Hadoop MapReduce, based on the above optimization system, comprising:
a pre-merge process: the shuffle processing module obtains the temporary-file paths of Map computation results from the Hadoop MapReduce worker node. Each time it observes that a Map subtask has completed, it triggers one pre-merge, combining the newly obtained temporary file with the result of the previous pre-merge. Repeating this process, it eventually merges all temporary files in the worker node's file system into a single intermediate data file;
a pre-shuffle process: once the Map subtasks running in the same batch have completed and been pre-merged, the shuffle processing module triggers the pre-shuffle, shuffling the intermediate data file to the Hadoop MapReduce worker node designated by the scheduler module.
Further, the pre-merge process is preceded by the following:
upon receiving a newly submitted job, Hadoop MapReduce notifies the scheduler module via inter-process communication; the scheduler module notifies all system worker nodes, which then begin monitoring subtask completion on their Hadoop MapReduce worker nodes.
Further, the pre-shuffle process is followed by the following:
after the Reduce subtasks start, the shuffle processing module notifies each Reduce subtask of its intermediate data file path; the Reduce subtask then reads the intermediate data directly and sequentially from the local file system and performs its computation.
The above embodiment of the present invention is described in further detail below with reference to the accompanying drawings.
As shown in Fig. 1, the optimization system provided by the above embodiment is a distributed system with a master-slave structure comprising a system master node and system worker nodes. The system master node contains a scheduler module and communication module a; each system worker node contains a shuffle processing module and communication module b (communication modules a and b are omitted in Fig. 1). The scheduler module schedules when the partition files are pre-merged, when the pre-shuffle takes place, and where the shuffle results are sent. Communication module a uses inter-process communication and remote procedure calls to communicate with the system worker nodes and the Hadoop MapReduce master node, obtaining temporary-file storage paths and subtask completion status. The shuffle processing module pre-merges all temporary files on its node into one large temporary file (the intermediate data file) and pre-shuffles it according to the scheduler module's instructions; the temporary-file paths and the shuffle destination are obtained from the scheduler module through communication module b.
Fig. 2 shows the workflow of the optimization method of the invention, described in detail below with reference to the drawing:
First, upon receiving a newly submitted job, Hadoop MapReduce notifies the scheduler module of the optimization system via inter-process communication. The scheduler module then notifies all system worker nodes through the communication module, and the system worker nodes begin monitoring subtask completion on their Hadoop MapReduce worker nodes.
Second, when a Map subtask is observed to have completed, the shuffle processing module obtains the temporary-file paths of the Map computation result from the Hadoop MapReduce worker node through the communication module (the number of temporary files depends on the amount of job input data and on the Map algorithm). Each completed Map subtask triggers one pre-merge, which combines the newly obtained temporary file with the result of the previous pre-merge. The shuffle processing module repeats this process and finally merges all temporary files in the node's file system into a single large intermediate data file.
Third, since Map subtasks run on the worker nodes in batches, the shuffle processing module triggers the pre-shuffle after a batch of Map subtasks has completed and been pre-merged. As directed by the scheduler module, it transmits the intermediate data file over the network to the designated worker node. Because the pre-shuffle transmits the merged intermediate data file directly, small-file reads and writes are greatly reduced.
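The batch-completion trigger in this step can be sketched as a tiny scheduler-module state machine. Class and method names are hypothetical; the real scheduler also decides destinations, which is omitted here:

```python
class Scheduler:
    """Sketch of the master-node scheduler module (hypothetical API):
    fires the pre-shuffle exactly once, when every Map subtask in the
    currently running batch has completed and been pre-merged."""

    def __init__(self, batch):
        self.pending = set(batch)   # Map subtask ids still running
        self.triggered = False

    def on_map_merged(self, task_id):
        # Called by a worker after it pre-merges one finished Map output.
        self.pending.discard(task_id)
        if not self.pending and not self.triggered:
            self.triggered = True
            return "pre-shuffle"    # instruct workers to start shuffling
        return "wait"
```

A usage example: with a batch of two Map subtasks, the first completion returns "wait" and the second returns "pre-shuffle", after which further notifications are ignored.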
Finally, after a Reduce subtask starts, the shuffle processing module notifies it of the intermediate data file path through the communication module. The Reduce subtask reads the intermediate data directly and sequentially from the local file system, performs its computation, and outputs the final result.
The optimization system for the shuffling stage in Hadoop MapReduce provided by the embodiment runs as a daemon on the Hadoop MapReduce worker nodes and master node and communicates with Hadoop MapReduce via inter-process communication and remote procedure calls. Once running, it takes over all intermediate data of the Hadoop MapReduce job; by pre-merging and pre-shuffling, it exploits the idle Map-stage network bandwidth on the one hand and, by merging the intermediate data on each node, greatly reduces small-file reads and writes on the other, thereby shortening MapReduce job completion time.
By pre-merging small files into one large file as soon as each Map computation finishes, the optimization system reduces the number of file reads and writes, shortens Shuffle-stage data read-write time, and reduces the tail latency of disk I/O.
By pre-shuffling, the optimization system overlaps the Map and Shuffle stages and fully exploits the network bandwidth that is otherwise idle during the Map stage, thereby shortening job completion time.
The foregoing describes specific embodiments of the invention. It is to be understood that the invention is not limited to the embodiments described above; those skilled in the art may make various changes and modifications within the scope of the appended claims without departing from the spirit of the invention.

Claims (4)

1. An optimization system for the shuffling stage in Hadoop MapReduce, characterized by comprising a system master node and system worker nodes, wherein:
the system master node comprises a scheduler module and a communication module a, wherein the scheduler module is configured to schedule when partition files are pre-merged, when the pre-shuffle takes place, and where the shuffle results are sent; the communication module a uses inter-process communication and remote procedure calls to communicate with the Hadoop MapReduce master node and the system worker nodes;
the system worker node comprises a shuffle processing module and a communication module b, wherein the shuffle processing module pre-merges all temporary files on the same node into one large temporary file and pre-shuffles it at the time indicated by the scheduler module; the communication module b uses inter-process communication and remote procedure calls to communicate with the system master node and the Hadoop MapReduce worker node;
the pre-merge process comprises: the shuffle processing module obtains the temporary-file paths of Map computation results from the Hadoop MapReduce worker node; each time it observes that a Map subtask has completed, it triggers one pre-merge, combining the newly obtained temporary file with the result of the previous pre-merge; repeating this process, it finally merges all temporary files in the worker node's file system into a single intermediate data file;
the pre-shuffle process comprises: when the Map subtasks running in the same batch have completed and been pre-merged, the shuffle processing module triggers the pre-shuffle; the shuffle processing module shuffles the intermediate data file to the designated Hadoop MapReduce worker node as directed by the scheduler module.
2. The optimization system for the shuffling stage in Hadoop MapReduce according to claim 1, characterized in that the optimization system runs as a daemon on the Hadoop MapReduce worker nodes.
3. The optimization system for the shuffling stage in Hadoop MapReduce according to claim 1, characterized by further comprising, before the pre-merge process:
upon receiving a newly submitted job, Hadoop MapReduce notifies the scheduler module via inter-process communication; the scheduler module notifies all system worker nodes, which then begin monitoring subtask completion on their Hadoop MapReduce worker nodes.
4. The optimization system for the shuffling stage in Hadoop MapReduce according to claim 1, characterized by further comprising, after the pre-shuffle process:
after a Reduce subtask starts, the shuffle processing module notifies it of its intermediate data file path; the Reduce subtask reads the intermediate data files directly and sequentially from the local file system and performs its computation.
CN201910627734.5A 2019-07-12 2019-07-12 Optimization system for shuffling stage in Hadoop MapReduce Active CN110502337B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910627734.5A CN110502337B (en) 2019-07-12 2019-07-12 Optimization system for shuffling stage in Hadoop MapReduce

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910627734.5A CN110502337B (en) 2019-07-12 2019-07-12 Optimization system for shuffling stage in Hadoop MapReduce

Publications (2)

Publication Number Publication Date
CN110502337A CN110502337A (en) 2019-11-26
CN110502337B true CN110502337B (en) 2023-02-07

Family

ID=68585359

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910627734.5A Active CN110502337B (en) 2019-07-12 2019-07-12 Optimization system for shuffling stage in Hadoop MapReduce

Country Status (1)

Country Link
CN (1) CN110502337B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112307008B (en) * 2020-12-14 2023-12-08 湖南蚁坊软件股份有限公司 Druid compacting method
CN113407354B (en) * 2021-08-18 2022-01-21 阿里云计算有限公司 Distributed job adjustment method, master node, system, physical machine, and storage medium

Citations (3)

Publication number Priority date Publication date Assignee Title
CN104933110A (en) * 2015-06-03 2015-09-23 电子科技大学 MapReduce-based data pre-fetching method
CN105357124A (en) * 2015-11-22 2016-02-24 华中科技大学 MapReduce bandwidth optimization method
CN106250233A (en) * 2016-07-21 2016-12-21 鄞州浙江清华长三角研究院创新中心 MapReduce performance optimization system and optimization method

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US9424274B2 (en) * 2013-06-03 2016-08-23 Zettaset, Inc. Management of intermediate data spills during the shuffle phase of a map-reduce job

Patent Citations (3)

Publication number Priority date Publication date Assignee Title
CN104933110A (en) * 2015-06-03 2015-09-23 电子科技大学 MapReduce-based data pre-fetching method
CN105357124A (en) * 2015-11-22 2016-02-24 华中科技大学 MapReduce bandwidth optimization method
CN106250233A (en) * 2016-07-21 2016-12-21 鄞州浙江清华长三角研究院创新中心 MapReduce performance optimization system and optimization method

Non-Patent Citations (2)

Title
HPMR: Prefetching and pre-shuffling in shared MapReduce computation environment; Sangwon Seo et al.; 2009 IEEE International Conference on Cluster Computing and Workshops; IEEE Xplore; 2009-09-04; full text *
Improving the Shuffle of Hadoop MapReduce; Jingui Li et al.; 2013 IEEE 5th International Conference on Cloud Computing Technology and Science; IEEE Xplore; 2013-12-05; full text *

Also Published As

Publication number Publication date
CN110502337A (en) 2019-11-26

Similar Documents

Publication Publication Date Title
US9152601B2 (en) Power-efficient nested map-reduce execution on a cloud of heterogeneous accelerated processing units
CN111367630A (en) Multi-user multi-priority distributed cooperative processing method based on cloud computing
JP2017016693A (en) Support for cluster computing for application program
CN109522108B (en) GPU task scheduling system and method based on Kernel merging
US9986018B2 (en) Method and system for a scheduled map executor
US8898422B2 (en) Workload-aware distributed data processing apparatus and method for processing large data based on hardware acceleration
CN110659278A (en) Graph data distributed processing system based on CPU-GPU heterogeneous architecture
US10970805B2 (en) Graphics processing unit operation
CN111694643B (en) Task scheduling execution system and method for graph neural network application
CN110569312B (en) Big data rapid retrieval system based on GPU and use method thereof
CN105912387A (en) Method and device for dispatching data processing operation
Frey et al. A spinning join that does not get dizzy
CN110502337B (en) Optimization system for shuffling stage in Hadoop MapReduce
Chen et al. Pipelined multi-gpu mapreduce for big-data processing
Liu et al. Optimizing shuffle in wide-area data analytics
US10326824B2 (en) Method and system for iterative pipeline
CN116302574B (en) Concurrent processing method based on MapReduce
Dai et al. Research and implementation of big data preprocessing system based on Hadoop
CN114518940A (en) Task scheduling circuit, method, electronic device and computer-readable storage medium
Duan et al. Reducing makespans of dag scheduling through interleaving overlapping resource utilization
CN113222099A (en) Convolution operation method and chip
Wu et al. Shadow: Exploiting the power of choice for efficient shuffling in mapreduce
Jin et al. A new parallelization method for K-means
Perera et al. Supercharging distributed computing environments for high performance data engineering
CN114518941A (en) Task scheduling circuit, method, electronic device and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant