CN110502337B - Optimization system for shuffling stage in Hadoop MapReduce - Google Patents

Optimization system for shuffling stage in Hadoop MapReduce

Info

Publication number
CN110502337B
CN110502337B (application CN201910627734.5A)
Authority
CN
China
Prior art keywords
shuffle
hadoop mapreduce
node
shuffling
mapreduce
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910627734.5A
Other languages
Chinese (zh)
Other versions
CN110502337A (en)
Inventor
Guan Haibing (管海兵)
Wu Zhongxuan (吴仲轩)
Ren Rui (任锐)
Qi Zhengwei (戚正伟)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201910627734.5A priority Critical patent/CN110502337B/en
Publication of CN110502337A publication Critical patent/CN110502337A/en
Application granted granted Critical
Publication of CN110502337B publication Critical patent/CN110502337B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/18: File system types
    • G06F16/1805: Append-only file systems, e.g. using logs or journals to store data
    • G06F16/1815: Journaling file systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46: Multiprogramming arrangements
    • G06F9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061: Partitioning or combining of resources
    • G06F9/5066: Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an optimization system for the shuffling stage in Hadoop MapReduce. The system runs as a daemon on the worker nodes and master node of Hadoop MapReduce and communicates with Hadoop MapReduce via inter-process communication and remote procedure calls. An optimization method based on the optimization system is also provided. Once running, the system takes over all intermediate data produced during a Hadoop MapReduce job. By pre-merging and pre-shuffling, it exploits the otherwise idle network bandwidth of the Map stage on the one hand and, by merging the intermediate data on each node, greatly reduces small-file reads and writes on the other, thereby shortening MapReduce job completion time.

Description

Optimization system for shuffling stage in Hadoop MapReduce
Technical Field
The invention relates to the technical field of big data and cloud computing, and in particular to an optimization system for the Shuffle stage in Hadoop MapReduce.
Background
MapReduce is a distributed computing framework for processing big data. Hadoop MapReduce is the best-known and most widely used open-source implementation of MapReduce. By simply writing Map and Reduce functions, a Hadoop MapReduce user can process massive data (terabytes or even petabytes) in parallel on a large cluster (up to thousands of nodes). Moreover, Hadoop MapReduce provides strong fault tolerance, ensuring that jobs complete even across thousands of nodes.
Hadoop MapReduce follows the BSP (Bulk Synchronous Parallel) model, abstracting the distributed computation into three stages: Map, Shuffle, and Reduce.
The Map stage consists of two sub-stages: Map computation and partitioning (Partition). In the Map computation sub-stage, each worker node applies the Map function of the submitted algorithm to its input data and outputs intermediate data consisting of key-value pairs. In the partitioning sub-stage, the worker node partitions the intermediate data according to the submitted partitioning function; each partition maps one-to-one to a Reduce subtask. Finally, the intermediate data output by the Map stage is written to disk.
In the Shuffle stage, each worker node redistributes the intermediate data according to the Map-stage partitioning result: intermediate data belonging to the same partition is transmitted over the network to a designated node and stored on its disk.
The Reduce stage consists of two sub-stages: sorting (Sort) and Reduce computation. In the sorting sub-stage, the worker node reads the shuffled intermediate data from disk and sorts it by key. In the Reduce computation sub-stage, the worker node applies the Reduce function of the submitted algorithm to the intermediate data and outputs the final result.
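The three BSP stages described above can be illustrated with a minimal single-process simulation of a word-count job (the function names and structure here are hypothetical teaching aids, not the patent's implementation or Hadoop's actual API):

```python
from collections import defaultdict

def map_fn(line):
    # Map: emit a (word, 1) key-value pair for every word in the input
    for word in line.split():
        yield word, 1

def partition(key, num_reduces):
    # Partition: one partition per Reduce subtask (hash partitioning)
    return hash(key) % num_reduces

def run_job(lines, num_reduces=2):
    # Map + Partition: each partition collects its intermediate pairs
    partitions = defaultdict(list)
    for line in lines:
        for k, v in map_fn(line):
            partitions[partition(k, num_reduces)].append((k, v))
    # Shuffle + Sort + Reduce: within each partition, sort by key,
    # group, and apply the Reduce function (here: sum the counts)
    result = {}
    for pairs in partitions.values():
        grouped = defaultdict(list)
        for k, v in sorted(pairs):
            grouped[k].append(v)
        for k, vs in grouped.items():
            result[k] = sum(vs)
    return result
```

Because all pairs with the same key hash to the same partition, the per-partition grouping is equivalent to a global group-by, mirroring how each Reduce subtask sees every value for its keys.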
The current implementation of Hadoop MapReduce has two drawbacks that significantly affect performance: first, the Shuffle stage is coupled to the Reduce stage, leaving network bandwidth idle during the Map stage; second, the Shuffle stage reads and writes a large number of small files, making disk I/O the bottleneck of the Shuffle stage.
Specifically:
In Hadoop MapReduce, the Shuffle stage and the Reduce stage are coupled together: a worker node must wait until the Reduce stage starts before shuffling data, so network bandwidth is completely idle during the Map stage. Research shows that in big-data processing jobs the Shuffle stage accounts for, on average, one third of total job completion time. This inefficient use of network bandwidth therefore seriously hurts Hadoop MapReduce performance.
In production use of Hadoop MapReduce, both the official documentation and industry experience recommend splitting a single MapReduce job into many small subtasks (tasks), i.e., configuring large numbers of Map and Reduce subtasks. This improves execution parallelism and reduces the impact of stragglers. Countless deployments have shown that this practice markedly shortens job completion time, especially on large clusters. However, splitting a job into many subtasks forces the Shuffle stage to read and write a large number of small files, and this small, random disk I/O becomes the Shuffle-stage bottleneck, seriously affecting job completion time.
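A back-of-envelope count makes the small-file problem concrete. In the stock shuffle, every Reduce subtask fetches one partition file from every Map subtask. If intermediate data is first merged per node (as this patent proposes), each Reduce only needs one fetch per node. The numbers below are an illustrative estimate, not figures from the patent:

```python
def shuffle_file_reads(num_maps, num_reduces, num_nodes=None):
    # Baseline Hadoop shuffle: every Reduce subtask reads one partition
    # file produced by every Map subtask -> M x R small reads.
    baseline = num_maps * num_reduces
    if num_nodes is None:
        return baseline
    # With per-node pre-merging, each node exposes a single merged
    # intermediate file, so each Reduce reads one file per node
    # -> N x R reads (illustrative model, assuming even placement).
    return num_nodes * num_reduces

# e.g. 1000 Map subtasks and 100 Reduce subtasks on a 50-node cluster:
# 100,000 small-file reads shrink to 5,000 larger sequential reads.
assert shuffle_file_reads(1000, 100) == 100_000
assert shuffle_file_reads(1000, 100, num_nodes=50) == 5_000
```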
At present, no description or report of a technology similar to the present invention has been found, nor have similar data been collected at home or abroad.
Disclosure of Invention
To address the above shortcomings of the prior art, the invention provides an optimization system for the shuffling stage in Hadoop MapReduce. The system is a distributed system with a master-slave architecture and runs as a daemon on the Hadoop MapReduce worker nodes. The system takes over all intermediate data produced during a MapReduce job; by pre-merging and pre-shuffling, it makes good use of the Map-stage network bandwidth and greatly reduces small-file reads and writes, thereby shortening MapReduce job completion time.
The invention is realized by the following technical scheme.
According to one aspect of the invention, an optimization system for the shuffling stage in Hadoop MapReduce is provided, comprising a system master node and system worker nodes, wherein:
the system master node comprises a scheduler module and a communication module a. The scheduler module schedules when partition files are pre-merged, when the pre-shuffle takes place, and where the shuffle results are sent. Communication module a uses inter-process communication and remote procedure calls to communicate with the Hadoop MapReduce master node and with the system worker nodes;
each system worker node comprises a shuffle processing module and a communication module b. The shuffle processing module pre-merges all temporary files on the same node into one large temporary file and pre-shuffles it at the time indicated by the scheduler module. Communication module b uses inter-process communication and remote procedure calls to communicate with the system master node and with the Hadoop MapReduce worker node.
Preferably, the optimization system runs as a daemon on the Hadoop MapReduce worker nodes.
According to another aspect of the invention, an optimization method for the shuffling stage in Hadoop MapReduce, based on the above optimization system, comprises:
a pre-merge process: the shuffle processing module obtains the temporary-file paths of Map computation results from the Hadoop MapReduce worker node. Each time it observes that a Map subtask has completed, it triggers one pre-merge, combining the newly obtained temporary file with the result of the previous pre-merge. Repeating this process, it eventually merges all temporary files in the worker node's file system into a single intermediate data file;
a pre-shuffle process: once the Map subtasks running in the same batch have completed and been pre-merged, the shuffle processing module triggers the pre-shuffle, shuffling the intermediate data file to the Hadoop MapReduce worker node designated by the scheduler module.
The designated Hadoop MapReduce worker node is chosen at random.
Preferably, the pre-merge process is preceded by the following:
upon receiving a newly submitted job, Hadoop MapReduce notifies the scheduler module via inter-process communication; the scheduler module notifies all system worker nodes, which then begin monitoring subtask completion on their Hadoop MapReduce worker nodes.
Preferably, the pre-shuffle process is followed by the following:
after the Reduce subtasks start, the shuffle processing module notifies each Reduce subtask of its intermediate data file path; the Reduce subtask then reads the intermediate data directly and sequentially from the local file system and performs its computation.
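The pre-merge and pre-shuffle processes above can be sketched as a small worker-daemon class. This is a simplified local simulation under stated assumptions: class and method names are hypothetical, merging is modeled as byte concatenation, and the network transfer is simulated by a local file copy:

```python
import os
import random
import tempfile

class ShuffleWorker:
    """Sketch of the shuffle processing module (hypothetical API):
    incremental pre-merge of Map outputs, then pre-shuffle of the
    merged intermediate data file to a randomly designated node."""

    def __init__(self, workdir):
        # Single per-node intermediate data file built up incrementally
        self.merged = os.path.join(workdir, "intermediate.data")

    def on_map_done(self, temp_file):
        # Pre-merge: fold the newly finished Map output into the merged
        # file (one pre-merge per completed Map subtask), then delete
        # the small temporary file.
        with open(temp_file, "rb") as src, open(self.merged, "ab") as dst:
            dst.write(src.read())
        os.remove(temp_file)

    def pre_shuffle(self, nodes):
        # Pre-shuffle: ship the merged file to a randomly chosen
        # destination node (simulated as a directory copy here).
        dest = random.choice(nodes)
        target = os.path.join(dest, "shuffled.data")
        with open(self.merged, "rb") as src, open(target, "wb") as dst:
            dst.write(src.read())
        return target
```

Because the shuffle transmits one merged file per node instead of one small file per Map subtask, the random small-file I/O the Background section describes is replaced by a single sequential transfer.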
Compared with the prior art, the invention has the following beneficial effects:
1. By pre-merging small files into one large file as soon as each Map computation finishes, the invention reduces the number of file reads and writes, shortens Shuffle-stage data-read time, and reduces the tail latency of disk I/O. Specifically, a MapReduce job is split into many Map subtasks, and the intermediate data produced by the Map subtasks on a given Hadoop MapReduce worker node all land in the same file system during the computation. After the Map computation of each subtask completes, the system worker node merges all intermediate data in that file system into a single large file, and the subsequent Shuffle stage reads intermediate data directly from the merged file.
2. By pre-shuffling, the invention overlaps the Map and Shuffle stages, fully exploiting the network bandwidth that is otherwise idle during the Map stage and thereby shortening job completion time. Specifically, in the stock Hadoop MapReduce implementation the Reduce and Shuffle stages are tightly coupled, and intermediate data is transferred only while Reduce subtasks run. In the invention, the system worker node takes over the entire Shuffle stage: as soon as the Map computation of a subtask finishes, its output is pre-merged and then immediately transmitted over the network to the worker node designated by the scheduler module. Because a MapReduce job is split into many Map subtasks that run on the worker nodes in batches, the system can effectively overlap pre-shuffling with Map computation and raise Map-stage resource utilization.
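The benefit of overlapping can be estimated with a toy makespan model. This is an illustrative back-of-envelope calculation, not a figure from the patent; it assumes Map subtasks run in equal-length batches and that each batch's shuffle fits inside the next batch's Map time:

```python
def makespan(batches, map_time, shuffle_time):
    """Toy model of job completion time (hypothetical, per-batch units).

    baseline:   shuffle only starts after Map work, so every batch pays
                both its Map time and its shuffle time serially.
    overlapped: each batch's shuffle runs during the next batch's Map
                computation (pre-shuffle); only the last batch's
                shuffle remains exposed on the critical path.
    """
    baseline = batches * map_time + batches * shuffle_time
    overlapped = batches * map_time + shuffle_time  # assumes shuffle <= map
    return baseline, overlapped
```

For example, four batches with 10 time units of Map work and 5 of shuffle per batch give a baseline of 60 units versus 45 with overlap, consistent with the observation that shuffling averages about one third of job time.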
3. The optimization system for the shuffling stage in Hadoop MapReduce provided by the invention runs as a daemon on the Hadoop MapReduce worker nodes and master node and communicates with Hadoop MapReduce via inter-process communication and remote procedure calls. Once running, it takes over all intermediate data of the Hadoop MapReduce job; by pre-merging and pre-shuffling, it exploits the idle Map-stage network bandwidth on the one hand and, by merging the intermediate data on each node, greatly reduces small-file reads and writes on the other, thereby shortening MapReduce job completion time and optimizing the existing Hadoop MapReduce.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading the following detailed description of non-limiting embodiments with reference to the drawings:
FIG. 1 is a system architecture diagram of the present invention.
FIG. 2 is a schematic comparison of the system workflow during operation of the present invention.
Detailed Description
The following embodiment illustrates the invention in detail. The embodiment is implemented on the premise of the technical solution of the invention and gives a detailed implementation and a specific operation process. It should be noted that those skilled in the art can make various changes and modifications without departing from the spirit of the invention, all of which fall within the scope of the invention.
The embodiment of the invention provides an optimization system for the shuffling stage in Hadoop MapReduce, comprising a system master node and system worker nodes, wherein:
the system master node comprises a scheduler module and a communication module a. The scheduler module schedules when partition files are pre-merged, when the pre-shuffle takes place, and where the shuffle results are sent. Communication module a uses inter-process communication and remote procedure calls to communicate with the Hadoop MapReduce master node and with the system worker nodes;
each system worker node comprises a shuffle processing module and a communication module b. The shuffle processing module pre-merges all temporary files on the same node into one large temporary file and pre-shuffles it at the time indicated by the scheduler module. Communication module b uses inter-process communication and remote procedure calls to communicate with the system master node and with the Hadoop MapReduce worker node.
Further, the optimization system runs as a daemon on the Hadoop MapReduce worker nodes.
The embodiment of the invention also provides an optimization method for the shuffling stage in Hadoop MapReduce, based on the above optimization system, comprising:
a pre-merge process: the shuffle processing module obtains the temporary-file paths of Map computation results from the Hadoop MapReduce worker node. Each time it observes that a Map subtask has completed, it triggers one pre-merge, combining the newly obtained temporary file with the result of the previous pre-merge. Repeating this process, it eventually merges all temporary files in the worker node's file system into a single intermediate data file;
a pre-shuffle process: once the Map subtasks running in the same batch have completed and been pre-merged, the shuffle processing module triggers the pre-shuffle, shuffling the intermediate data file to the Hadoop MapReduce worker node designated by the scheduler module.
Further, the pre-merge process is preceded by the following:
upon receiving a newly submitted job, Hadoop MapReduce notifies the scheduler module via inter-process communication; the scheduler module notifies all system worker nodes, which then begin monitoring subtask completion on their Hadoop MapReduce worker nodes.
Further, the pre-shuffle process is followed by the following:
after the Reduce subtasks start, the shuffle processing module notifies each Reduce subtask of its intermediate data file path; the Reduce subtask then reads the intermediate data directly and sequentially from the local file system and performs its computation.
The above embodiment of the present invention is described in further detail below with reference to the accompanying drawings.
As shown in Fig. 1, the optimization system provided by the above embodiment is a distributed system with a master-slave structure comprising a system master node and system worker nodes. The system master node contains a scheduler module and communication module a; each system worker node contains a shuffle processing module and communication module b (communication modules a and b are omitted in Fig. 1). The scheduler module schedules when the partition files are pre-merged, when the pre-shuffle takes place, and where the shuffle results are sent. Communication module a uses inter-process communication and remote procedure calls to communicate with the system worker nodes and the Hadoop MapReduce master node, obtaining temporary-file storage paths and subtask completion status. The shuffle processing module pre-merges all temporary files on its node into one large temporary file (the intermediate data file) and pre-shuffles it according to the scheduler module's instructions; the temporary-file paths and the shuffle destination are obtained from the scheduler module through communication module b.
Fig. 2 shows the workflow of the optimization method of the invention, described in detail below with reference to the drawing:
First, upon receiving a newly submitted job, Hadoop MapReduce notifies the scheduler module of the optimization system via inter-process communication. The scheduler module then notifies all system worker nodes through the communication module, and the system worker nodes begin monitoring subtask completion on their Hadoop MapReduce worker nodes.
Second, when a Map subtask is observed to have completed, the shuffle processing module obtains the temporary-file paths of the Map computation result from the Hadoop MapReduce worker node through the communication module (the number of temporary files depends on the amount of job input data and on the Map algorithm). Each completed Map subtask triggers one pre-merge, which combines the newly obtained temporary file with the result of the previous pre-merge. The shuffle processing module repeats this process and finally merges all temporary files in the node's file system into a single large intermediate data file.
Third, since Map subtasks run on the worker nodes in batches, the shuffle processing module triggers the pre-shuffle after a batch of Map subtasks has completed and been pre-merged. As directed by the scheduler module, it transmits the intermediate data file over the network to the designated worker node. Because the pre-shuffle transmits the merged intermediate data file directly, small-file reads and writes are greatly reduced.
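The batch-completion trigger in this step can be sketched as a tiny scheduler-module state machine. Class and method names are hypothetical; the real scheduler also decides destinations, which is omitted here:

```python
class Scheduler:
    """Sketch of the master-node scheduler module (hypothetical API):
    fires the pre-shuffle exactly once, when every Map subtask in the
    currently running batch has completed and been pre-merged."""

    def __init__(self, batch):
        self.pending = set(batch)   # Map subtask ids still running
        self.triggered = False

    def on_map_merged(self, task_id):
        # Called by a worker after it pre-merges one finished Map output.
        self.pending.discard(task_id)
        if not self.pending and not self.triggered:
            self.triggered = True
            return "pre-shuffle"    # instruct workers to start shuffling
        return "wait"
```

A usage example: with a batch of two Map subtasks, the first completion returns "wait" and the second returns "pre-shuffle", after which further notifications are ignored.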
Finally, after a Reduce subtask starts, the shuffle processing module notifies it of the intermediate data file path through the communication module. The Reduce subtask reads the intermediate data directly and sequentially from the local file system, performs its computation, and outputs the final result.
The optimization system for the shuffling stage in Hadoop MapReduce provided by the embodiment runs as a daemon on the Hadoop MapReduce worker nodes and master node and communicates with Hadoop MapReduce via inter-process communication and remote procedure calls. Once running, it takes over all intermediate data of the Hadoop MapReduce job; by pre-merging and pre-shuffling, it exploits the idle Map-stage network bandwidth on the one hand and, by merging the intermediate data on each node, greatly reduces small-file reads and writes on the other, thereby shortening MapReduce job completion time.
By pre-merging small files into one large file as soon as each Map computation finishes, the optimization system reduces the number of file reads and writes, shortens Shuffle-stage data read-write time, and reduces the tail latency of disk I/O.
By pre-shuffling, the optimization system overlaps the Map and Shuffle stages and fully exploits the network bandwidth that is otherwise idle during the Map stage, thereby shortening job completion time.
The foregoing describes specific embodiments of the invention. It is to be understood that the invention is not limited to the embodiments described above; those skilled in the art may make various changes and modifications within the scope of the appended claims without departing from the spirit of the invention.

Claims (4)

1. An optimization system for the shuffling stage in Hadoop MapReduce, characterized by comprising a system master node and system worker nodes, wherein:
the system master node comprises a scheduler module and a communication module a, wherein the scheduler module is configured to schedule when partition files are pre-merged, when the pre-shuffle takes place, and where the shuffle results are sent; the communication module a uses inter-process communication and remote procedure calls to communicate with the Hadoop MapReduce master node and the system worker nodes;
the system worker node comprises a shuffle processing module and a communication module b, wherein the shuffle processing module pre-merges all temporary files on the same node into one large temporary file and pre-shuffles it at the time indicated by the scheduler module; the communication module b uses inter-process communication and remote procedure calls to communicate with the system master node and the Hadoop MapReduce worker node;
the pre-merge process comprises: the shuffle processing module obtains the temporary-file paths of Map computation results from the Hadoop MapReduce worker node; each time it observes that a Map subtask has completed, it triggers one pre-merge, combining the newly obtained temporary file with the result of the previous pre-merge; repeating this process, it finally merges all temporary files in the worker node's file system into a single intermediate data file;
the pre-shuffle process comprises: when the Map subtasks running in the same batch have completed and been pre-merged, the shuffle processing module triggers the pre-shuffle; the shuffle processing module shuffles the intermediate data file to the designated Hadoop MapReduce worker node as directed by the scheduler module.
2. The optimization system for the shuffling stage in Hadoop MapReduce according to claim 1, characterized in that the optimization system runs as a daemon on the Hadoop MapReduce worker nodes.
3. The optimization system for the shuffling stage in Hadoop MapReduce according to claim 1, characterized by further comprising, before the pre-merge process:
upon receiving a newly submitted job, Hadoop MapReduce notifies the scheduler module via inter-process communication; the scheduler module notifies all system worker nodes, which then begin monitoring subtask completion on their Hadoop MapReduce worker nodes.
4. The optimization system for the shuffling stage in Hadoop MapReduce according to claim 1, characterized by further comprising, after the pre-shuffle process:
after a Reduce subtask starts, the shuffle processing module notifies it of its intermediate data file path; the Reduce subtask reads the intermediate data files directly and sequentially from the local file system and performs its computation.
CN201910627734.5A 2019-07-12 2019-07-12 Optimization system for shuffling stage in Hadoop MapReduce Active CN110502337B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910627734.5A CN110502337B (en) 2019-07-12 2019-07-12 Optimization system for shuffling stage in Hadoop MapReduce

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910627734.5A CN110502337B (en) 2019-07-12 2019-07-12 Optimization system for shuffling stage in Hadoop MapReduce

Publications (2)

Publication Number Publication Date
CN110502337A CN110502337A (en) 2019-11-26
CN110502337B true CN110502337B (en) 2023-02-07

Family

ID=68585359

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910627734.5A Active CN110502337B (en) 2019-07-12 2019-07-12 Optimization system for shuffling stage in Hadoop MapReduce

Country Status (1)

Country Link
CN (1) CN110502337B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112307008B (en) * 2020-12-14 2023-12-08 湖南蚁坊软件股份有限公司 Druid compacting method
CN113407354B (en) * 2021-08-18 2022-01-21 阿里云计算有限公司 Distributed job adjustment method, master node, system, physical machine, and storage medium

Citations (3)

Publication number Priority date Publication date Assignee Title
CN104933110A (en) * 2015-06-03 2015-09-23 电子科技大学 MapReduce-based data pre-fetching method
CN105357124A (en) * 2015-11-22 2016-02-24 华中科技大学 MapReduce bandwidth optimization method
CN106250233A (en) * 2016-07-21 2016-12-21 鄞州浙江清华长三角研究院创新中心 MapReduce performance optimization system and optimization method

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US9424274B2 (en) * 2013-06-03 2016-08-23 Zettaset, Inc. Management of intermediate data spills during the shuffle phase of a map-reduce job

Patent Citations (3)

Publication number Priority date Publication date Assignee Title
CN104933110A (en) * 2015-06-03 2015-09-23 电子科技大学 MapReduce-based data pre-fetching method
CN105357124A (en) * 2015-11-22 2016-02-24 华中科技大学 MapReduce bandwidth optimization method
CN106250233A (en) * 2016-07-21 2016-12-21 鄞州浙江清华长三角研究院创新中心 MapReduce performance optimization system and optimization method

Non-Patent Citations (2)

Title
HPMR: Prefetching and pre-shuffling in shared MapReduce computation environment; Sangwon Seo et al.; 2009 IEEE International Conference on Cluster Computing and Workshops; IEEE Xplore; 2009-09-04; full text *
Improving the Shuffle of Hadoop MapReduce; Jingui Li et al.; 2013 IEEE 5th International Conference on Cloud Computing Technology and Science; IEEE Xplore; 2013-12-05; full text *

Also Published As

Publication number Publication date
CN110502337A (en) 2019-11-26

Similar Documents

Publication Publication Date Title
US9152601B2 (en) Power-efficient nested map-reduce execution on a cloud of heterogeneous accelerated processing units
CN111367630A (en) Multi-user multi-priority distributed cooperative processing method based on cloud computing
JP2017016693A (en) Support for cluster computing for application program
CN109522108B (en) GPU task scheduling system and method based on Kernel merging
US9986018B2 (en) Method and system for a scheduled map executor
US8898422B2 (en) Workload-aware distributed data processing apparatus and method for processing large data based on hardware acceleration
CN110659278A (en) Graph data distributed processing system based on CPU-GPU heterogeneous architecture
US10970805B2 (en) Graphics processing unit operation
CN111694643B (en) Task scheduling execution system and method for graph neural network application
CN110569312B (en) Big data rapid retrieval system based on GPU and use method thereof
CN105912387A (en) Method and device for dispatching data processing operation
Frey et al. A spinning join that does not get dizzy
CN110502337B (en) Optimization system for shuffling stage in Hadoop MapReduce
Chen et al. Pipelined multi-gpu mapreduce for big-data processing
Liu et al. Optimizing shuffle in wide-area data analytics
US10326824B2 (en) Method and system for iterative pipeline
CN116302574B (en) Concurrent processing method based on MapReduce
Dai et al. Research and implementation of big data preprocessing system based on Hadoop
CN114518940A (en) Task scheduling circuit, method, electronic device and computer-readable storage medium
Duan et al. Reducing makespans of dag scheduling through interleaving overlapping resource utilization
CN113222099A (en) Convolution operation method and chip
Wu et al. Shadow: Exploiting the power of choice for efficient shuffling in mapreduce
Jin et al. A new parallelization method for K-means
Perera et al. Supercharging distributed computing environments for high performance data engineering
CN114518941A (en) Task scheduling circuit, method, electronic device and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant