CN114945902A - Shuffle reduction task with reduced I/O overhead - Google Patents
- Publication number
- CN114945902A (publication number); CN202080092843.2A (application number)
- Authority
- CN
- China
- Prior art keywords
- data
- shuffle
- memory
- reduction operation
- reduction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
- G06F9/5066—Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The shuffle reduction operation receives as input files sorted and written by different mapping tasks, fetches a batch of data from each input file, and merges and sorts the batches to form a large unified data segment. A shuffle reduction operation is applied to the unified data segment to produce output data. The shuffle reduction operation includes a commutative reduction operation that provides an amount of output data significantly less than the amount of input data. The output data is written to memory. This process is repeated for successive batches of data until the data from each input file is completely consumed and the output data has been completely formed. The shuffle reduction operation greatly reduces the size of the data that the reduction tasks must read in the shuffle operation, thus significantly reducing input/output overhead and total execution time.
Description
Technical Field
The present application relates to large-scale data processing systems, and more particularly, to a system and method for reducing the number of random-access I/O requests to a hard disk drive (HDD) and the volume of data read and written, thereby reducing total execution time during complex data processing operations.
Background
Large data analysis systems such as Spark™, Hadoop™, TensorFlow™, and Dryad™ are commonly used to perform complex data analysis operations. Such systems keep data partitions in memory for pipelined operators and persist data to hard drives across stages with wide dependencies for fault tolerance. In such systems, the all-to-all data transfer known as the shuffle operation becomes a scaling bottleneck when running the many small tasks into which each job is divided for multi-stage data analysis. Zhang et al., in "Riffle: Optimized Shuffle Service for Large-Scale Data Analytics," Proceedings of the Thirteenth EuroSys Conference, Article 43, ACM, April 23-26, 2018, observe that this bottleneck arises from the super-linear increase in disk I/O operations as data volume grows. To address the increase in disk I/O operations, Zhang et al. propose efficiently merging fragmented intermediate shuffle files into large block files, thereby converting small random disk I/O requests into large sequential requests.
Unfortunately, such a solution comes at considerable cost. For example, the Riffle™ system must read and write all of the data twice, which can itself produce large I/O overhead. It remains desirable to find additional ways to reduce the volume of data read and written, while also reducing the number of random-access I/O requests to a hard disk drive (HDD), to further reduce overall execution time during complex data processing operations such as data analysis.
Disclosure of Invention
Various examples are now described to introduce a set of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a sample embodiment, a shuffle reduction (Shuffle Reduce) task is designed and integrated into a data analysis framework. A driver module in the data analysis framework decides whether to use shuffle reduction tasks and how many. The shuffle reduction task described herein performs sequential read I/O requests instead of the large number of random-access I/O requests performed by conventional approaches, reducing I/O overhead. Compared with the Riffle™ system, the shuffle reduction task described herein also significantly reduces the size of the data that is read, resulting in less I/O overhead and improved overall execution time.
In example embodiments, the shuffle reduction module may be included in a data analysis framework driver that decides whether to invoke shuffle reduction tasks for a job (and how many) or to keep the existing execution plan. If invoked, the shuffle reduction task is made part of the job execution plan. The scheme can be applied to any dataflow execution platform.
According to a first aspect of the present disclosure, a method is provided for performing a shuffle reduction operation that groups and joins data between the mapping and reduction stages for data transformation during a data analysis process. The method includes receiving as input at least two input files from a first memory, wherein each input file has been sorted and written by a different mapping task; fetching a batch of data from each input file; and merging and sorting the batches in a second memory to form a unified data segment. A shuffle reduction operation is applied to the unified data segment to produce output data. In example embodiments, the shuffle reduction operation includes a commutative reduction operation that provides an amount of output data that is less than the amount of input data. The output data is written to a third memory, and these steps are repeated until the data from each input file is completely consumed and the analytical data output of the data analysis process has been fully formed.
According to a second aspect of the present disclosure, there is provided a data analysis system comprising: at least one processor; a first memory storing at least two input files, each input file being sorted and written by a different mapping task of the data analysis process; a second memory; a third memory; and an instruction memory storing instructions that, when executed by the at least one processor, perform a data analysis process including a shuffle reduction operation that groups and joins data between the mapping and reduction stages for data transformation during the data analysis process. In an example embodiment, the shuffle reduction operation includes: (1) receiving as input the at least two input files from the first memory; (2) fetching a batch of data from each input file; (3) merging and sorting the batch data in the second memory to form a unified data segment; (4) applying a shuffle reduction operation to the unified data segment to produce output data, wherein the shuffle reduction operation includes a commutative reduction operation that provides an amount of output data that is less than the amount of input data; (5) writing the output data to the third memory; and (6) repeating (2)-(5) until the data from each input file is completely consumed and the analytical data output of the data analysis process has been completely formed.
According to a third aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium comprising instructions that, when executed by at least one processor, cause the processor to perform a shuffle reduction operation that groups and joins data between the mapping and reduction stages for data transformation during a data analysis process, by performing operations comprising: (1) receiving as input at least two input files from a first memory, wherein each input file has been sorted and written by a different mapping task; (2) fetching a batch of data from each input file; (3) merging and sorting the batch data in a second memory to form a unified data segment; (4) applying a shuffle reduction operation to the unified data segment to produce output data, wherein the shuffle reduction operation includes a commutative reduction operation that provides an amount of output data that is less than the amount of input data; (5) writing the output data to a third memory; and (6) repeating (2)-(5) until the data from each input file is completely consumed and the analytical data output of the data analysis process has been completely formed.
In a first implementation of any of the preceding aspects, the first memory and the third memory comprise hard disk drives and the second memory comprises Dynamic Random Access Memory (DRAM).
In a second implementation of any of the preceding aspects, the first memory may include at least one hard disk drive, and the third memory may include at least one of a solid state disk and persistent memory.
In a third implementation of any of the preceding aspects, fetching a batch of data from each input file includes fetching a total amount of data that does not exceed the memory capacity allocated for the shuffle reduction operation.
In a fourth implementation of any of the preceding aspects, the driver module determines whether to perform the shuffle reduction operation for a particular job in the data analysis process based on: whether the job includes a shuffle operation, whether the workload of the job is large enough to obtain a performance gain, whether the job includes the commutative reduction operation that provides an amount of output data in an output file that is less than the amount of input data, how many tasks to start to implement the shuffle reduction operation, and when to start and stop the shuffle reduction operation.
In a fifth implementation of any of the preceding aspects, the data analysis process is implemented on a data analysis platform, and applying the shuffle reduction operation includes applying at least one of a key aggregation operation, a key grouping operation, and a key reduction operation as the commutative reduction operation.
In a sixth implementation form of any of the preceding aspects, a first task performs the steps of receiving the input files, fetching a batch of data from each input file, and merging and sorting the batch data, and a second task performs the steps of applying the shuffle reduction operation and writing the output data. In a sample embodiment, the first task and the second task communicate directly with each other independent of the first memory and the third memory.
In a seventh implementation form of any of the preceding aspects, applying the shuffle reduction operation to a unified data segment to produce output data comprises applying a plurality of tasks to implement the shuffle reduction operation.
The method may be performed and the instructions on the computer readable medium may be processed by the apparatus, and further features of the method and the instructions on the computer readable medium result from functionality of the apparatus. Moreover, the explanations provided for each aspect and its implementations apply equally to the other aspects and the corresponding implementations. The different embodiments may be implemented in hardware, software, or any combination thereof. Moreover, any of the foregoing examples may be combined with any one or more of the other foregoing examples to create new embodiments within the scope of the present disclosure.
Drawings
In the drawings, which are not necessarily drawn to scale, like numerals may describe similar components in different views. The drawings generally illustrate, by way of example and not by way of limitation, various embodiments discussed in the present document.
FIG. 1 shows a logical view of a Spark™ job that applies transformations (maps and filters) to data from two separate tables, joins the items, and aggregates them for each key (some field of an item) using the key grouping (GroupByKey) function.
FIG. 2 illustrates the Spark™ execution plan for the job illustrated in FIG. 1.
FIG. 3 illustrates the mapping between map tasks and reduce tasks during the shuffle.
FIG. 4 shows a Riffle™ data analysis system that incurs less read and write I/O overhead due to a reduced number of tasks, but must read and write the entire shuffle data twice.
FIG. 5 illustrates a sample embodiment of a shuffle reduction method for reducing I/O overhead.
FIG. 6 illustrates steps performed by a shuffle reduction operation in a sample embodiment.
FIG. 7 shows a flow diagram of a shuffle reduction operation in a sample embodiment.
FIG. 8 illustrates an example of a shuffle reduction operation on a specified input file in a sample embodiment.
FIG. 9 is a block diagram illustrating circuitry for performing a method according to a sample embodiment.
Detailed Description
It should be understood at the outset that although exemplary implementations of one or more embodiments are provided below, the disclosed systems and/or methods described with respect to FIGS. 1 through 9 may be implemented using any number of techniques, whether currently known or in existence. The disclosure should in no way be limited to the exemplary implementations, drawings, and techniques illustrated below, including the exemplary designs and implementations illustrated and described herein, but may be modified within the scope of the appended claims along with their full scope of equivalents.
In one embodiment, the functions or algorithms described herein may be implemented in software. The software may include computer-executable instructions stored on a computer-readable medium or computer-readable storage device, such as one or more non-transitory memories or other types of local or network hardware-based storage devices. Further, such functions correspond to modules, which may be software, hardware, firmware, or any combination thereof. Various functions may be performed in one or more modules as desired, and the described embodiments are merely examples. The software may be executed on a digital signal processor, ASIC, microprocessor, or other type of processor operating on a computer system, such as a personal computer, server, or other computer system, to turn such computer system into a specially programmed machine.
Large data analysis platforms such as Spark™, Hadoop™, TensorFlow™, and Dryad™ typically use a directed acyclic graph (DAG) model to define user jobs. For example, FIG. 1 shows a logical view of a Spark™ job that applies mapping transformation 110 and filter transformation 120 to input data 130, and mapping transformation 140 to input data 150, where input data 130 and 150 come from two separate tables. Items on each key (some field of an item) are joined and aggregated using, for example, join function 160. After being filtered by filter task 170, output data 180 is stored in a results table.
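By way of illustration only, the job of FIG. 1 might be expressed as the following PySpark sketch; the file paths, the "key,value" line format, and the filter predicates are hypothetical stand-ins for the generic tables and transformations of the figure:

```python
# Hypothetical PySpark sketch of the FIG. 1 job: map and filter one table,
# map a second table, join on the key, group per key, filter, and store.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fig1-job").getOrCreate()
sc = spark.sparkContext

table_a = sc.textFile("hdfs:///tables/a")  # input data 130 (path assumed)
table_b = sc.textFile("hdfs:///tables/b")  # input data 150 (path assumed)

# Narrow dependencies: map and filter are pipelined per partition.
a = (table_a
     .map(lambda line: tuple(line.split(",", 1)))  # mapping transformation 110
     .filter(lambda kv: kv[1] != ""))              # filter transformation 120
b = table_b.map(lambda line: tuple(line.split(",", 1)))  # mapping transformation 140

# Wide dependencies: join and groupByKey force a shuffle between stages.
joined = a.join(b)                                 # join function 160
grouped = joined.groupByKey()                      # aggregate items per key

result = grouped.filter(lambda kv: len(list(kv[1])) > 1)  # filter task 170
result.saveAsTextFile("hdfs:///tables/result")     # output data 180
```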
FIG. 2 illustrates the Spark™ execution plan for the job illustrated in FIG. 1. As shown, the DAG execution platform (e.g., Spark™) decides how to map these operations to physical resources. Since operations may occur across partitions, communication across partitions is required for processing. For narrow dependencies (mapping and filter operations 210 and 220), the Spark™ system pipelines the transformations and executes the operators in a single stage on each partition 230 and 240. Internally, the Spark™ system attempts to keep the intermediate data of a single task in memory during the mapping phase so that pipelined operators can be executed efficiently (e.g., filter operator 220 follows mapping operator 210 in the first stage). Operations like mapping 210 and filter 220 result in narrow dependencies between tasks (a one-to-one communication pattern); typically, these operations are bundled together and performed by the same process. On the other hand, operations like join 250, key reduction (ReduceByKey), and key grouping (GroupByKey) result in wide dependencies between different tasks at the reduction stage, before filtering by filter task 260 (a many-to-many communication pattern). Traditionally, the data exchange between the mapping phase and the reduction phase is referred to as the shuffle phase, as shown in FIG. 2.
As used herein, the shuffle phase comprises all-to-all communication for wide dependencies between stage 1 (the mapping stage) and stage 2 (the reduction stage). Each map task 210 reads from a data partition (e.g., rows of a large table), converts the data to an intermediate format using the map task operators, sorts and aggregates items by the partition function (e.g., key range) of the reduction stage to produce blocks of items, and saves the blocks to an intermediate file on disk. Map task 210 also writes a separate index file recording the offsets of the blocks corresponding to each reduce task of the reduction stage. Each reduce task gathers its designated data blocks and performs the reduce task operations: by looking up the offsets in the index files, each reduce task issues get requests for its target blocks from all map output files. Thus, data initially partitioned by table row is processed and shuffled into data partitioned by reduce key range. Between stages with wide dependencies, each reduce task needs to read a block of data from every map task output. If the intermediate shuffle files are not persisted, even a single reduce task failure may force the entire mapping stage to be recomputed. For strong fault tolerance, therefore, it is important to persist the shuffle data.
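For purposes of illustration, the map-side write and the reduce-side fetch described above can be sketched as follows. This is a simplified single-machine illustration, not the file layout of any particular system; the file naming, serialization, and hash partitioning are assumptions:

```python
# Simplified sketch: each map task sorts its records into per-reduce-partition
# blocks, writes one data file of contiguous blocks, and writes an index file
# of block offsets that reduce tasks use to seek directly to their block.
import os
import pickle

def partition_of(key, num_reducers):
    # Hypothetical partition function (the text mentions key ranges; a hash
    # partitioner is shown here for brevity).
    return hash(key) % num_reducers

def write_map_output(records, map_id, num_reducers, out_dir="."):
    buckets = [[] for _ in range(num_reducers)]
    for key, value in records:
        buckets[partition_of(key, num_reducers)].append((key, value))
    offsets = []
    with open(os.path.join(out_dir, f"map_{map_id}.data"), "wb") as f:
        for bucket in buckets:
            offsets.append(f.tell())          # start offset of this block
            pickle.dump(sorted(bucket), f)    # items sorted within the block
        offsets.append(f.tell())              # end sentinel
    with open(os.path.join(out_dir, f"map_{map_id}.index"), "wb") as f:
        pickle.dump(offsets, f)               # reduce task r reads [r, r+1)

def read_block(map_id, reduce_id, out_dir="."):
    # A reduce task's "get request": look up the offset, seek, read one block.
    with open(os.path.join(out_dir, f"map_{map_id}.index"), "rb") as f:
        offsets = pickle.load(f)
    with open(os.path.join(out_dir, f"map_{map_id}.data"), "rb") as f:
        f.seek(offsets[reduce_id])
        return pickle.load(f)
```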
Thus, the shuffle operation during the shuffle phase is extremely resource intensive. Each data block passed from a map task to a reduce task goes through data serialization, disk and network I/O, and data deserialization. Nevertheless, shuffle operations are heavily used in many types of jobs. For example, operations that partition, group, or reduce by key, or that join data, all involve shuffle operations. The shuffle phase may be used for operations such as determining how many times a particular word appears in a text, where the individual results of each map task are combined to obtain the final result.
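By way of illustration, the word-count example just mentioned takes the following familiar form in PySpark; the input and output paths are hypothetical:

```python
# Word count: the map stage emits (word, 1) pairs; the shuffle routes equal
# words to the same reduce task, whose partial sums form the final counts.
from pyspark.sql import SparkSession

sc = SparkSession.builder.appName("wordcount").getOrCreate().sparkContext

counts = (sc.textFile("hdfs:///input/text")         # path assumed
            .flatMap(lambda line: line.split())     # map stage
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))       # shuffle + reduce stage
counts.saveAsTextFile("hdfs:///output/counts")      # path assumed
```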
Previous solutions have map tasks write their output data to persistent storage for fault tolerance (if a reduce task fails, no map task needs to be rerun). For large workloads, hard disk drives (HDDs) are more popular than other types of persistent storage because they are inexpensive and easy to scale. However, as the number of map and reduce tasks increases, this initial design does not scale well. The reduce tasks need to perform M × R random-access input/output (I/O) requests to the HDD, where M is the number of map tasks and R is the number of reduce tasks. Thus, the shuffle phase incurs a large amount of I/O overhead.
This method is summarized in FIG. 3, which shows the mapping between map tasks 310 and the corresponding reduce tasks 360, 370, and 380 during the shuffle phase. As shown in FIG. 3, each map task 310 produces respective data sets 330, 340, and 350, which are directed to respective reduce tasks 360, 370, and 380: data set 330 is processed by reduce task 360, data set 340 by reduce task 370, and data set 350 by reduce task 380.
As job data sizes increase, the numbers of map tasks and reduce tasks grow proportionally. Because each of the R reduce tasks needs to fetch from all M map tasks, the number of shuffle I/O requests, M × R, grows quadratically, and the average block size fetched each time, S/(M × R), shrinks quadratically. Using fewer reduce tasks R reduces the total number of shuffle fetches, improving shuffle performance; however, using fewer tasks inevitably expands the average size of the input data and creates very large, slow tasks that must spill intermediate data to persistent storage, increasing overhead.
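To make the quadratic growth concrete, the following short calculation uses hypothetical figures; they are not taken from the disclosure:

```python
# Hypothetical illustration of the quadratic growth of shuffle fetches.
S = 1 * 2**40           # total shuffle data: 1 TiB (assumed)
for M, R in [(100, 100), (1_000, 1_000), (10_000, 10_000)]:
    fetches = M * R                 # one get request per (map, reduce) pair
    avg_block = S / fetches         # average fetched block size, S/(M*R)
    print(f"M=R={M}: {fetches:,} fetches, avg block {avg_block / 1024:.1f} KiB")
# Scaling M and R by 10x multiplies the fetch count by 100x and divides the
# average block size by 100x, turning HDD reads into tiny random I/Os.
```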
Thus, shuffle operations (e.g., joins) are among the most difficult operations for a distributed data analysis platform to perform. The shuffle operation for a large workload writes data to persistent storage so that, in the event of a reduce task failure, the entire mapping stage need not be rerun. As the most cost-effective option, a large amount of data is temporarily held on hard disk drives (HDDs); however, random-access I/O requests incur significant overhead on an HDD.
The Riffle™ system mentioned earlier provides a shuffle service that takes a different approach to the scaling bottleneck caused by the all-to-all data transfer of the shuffle operation. The Riffle™ system introduces a merge task in the middle of the shuffle phase. Rather than providing the data written by the map tasks directly to the reduce tasks, the Riffle™ system assigns merge tasks to read the fragmented intermediate shuffle files produced by the map tasks and merge them into larger block files, thereby converting small random disk I/O requests into large sequential requests. The reduce tasks consequently perform far fewer random-access I/O requests to the HDD: the Riffle™ system performs M × R/N such requests, where N is the average number of map tasks associated with each merge task.
However, the performance improvement provided by the Riffle™ system comes at considerable cost. As shown in FIG. 4, although the Riffle™ system requires less write and read I/O overhead due to the reduced number of tasks (by a factor of N), it reads and writes the entire shuffle data twice: once for the data sets 330, 340, and 350 created by map tasks 310, and once for the merged data sets 420, 430, and 440 created by merge task 410. Doubling the reads and writes of the entire shuffle data can still produce large I/O overhead, particularly for the large block files created during merge operation 410.
The systems and methods described herein further reduce I/O overhead in a cost-effective manner by implementing a shuffle phase for a data analysis system that substantially avoids reading and writing the entire shuffle data twice (as the Riffle™ system does) while still reducing the number of random-access I/O requests to the HDD (as the Riffle™ system does). Compared with conventional implementations such as the Riffle™ system, the systems and methods provided herein reduce the total execution time of the shuffle phase and reduce the number of I/O requests to the HDD by exploiting two features that user workloads commonly exhibit, namely:
1. The reduce operation is commutative.
2. The reduce operation reduces the amount of data (the output data size is much smaller than the input data size).
Based on these two features, a shuffle reduction operation is provided that combines a file-merge operation with a reduce operation. In example embodiments, the shuffle reduction operation is a commutative reduce operation, implemented during the shuffle phase, that produces less output data than it receives as input. Because the reduce operation is commutative, the final result is correct: it is the same as under the previous method. In addition, the shuffle reduction operation ultimately issues the same number of I/O requests to the HDD as the Riffle™ system. However, the second round of reads and writes covers significantly less data, because only the data resulting from the aggregation operation is stored.
FIG. 5 illustrates a sample embodiment of a shuffle reduction method for reducing I/O overhead. As shown, merge operation 410 of FIG. 4 is replaced by a shuffle reduction task 510 that merges the respective data sets 330, 340, and 350 from the respective map tasks 310 and reduces them by performing a reduce operation, producing reduced-size data sets 520, 530, and 540 corresponding to data sets 330, 340, and 350, respectively. As shown, the data sets 330, 340, and 350 and the reduced-size data sets 520, 530, and 540 are stored on the HDD for fault tolerance. Reduce tasks 360, 370, and 380 are then applied to the reduced-size data sets 520, 530, and 540, respectively. Because the sizes of data sets 520, 530, and 540 are reduced, the overall execution time of the resulting shuffle phase is significantly reduced.
FIG. 6 shows the steps performed by the shuffle reduction operation, and FIG. 7 shows a flow diagram of the shuffle reduction operation in a sample embodiment. As shown in FIGS. 6 and 7, the shuffle reduction task begins at step (1) by receiving a plurality of input files 610 from the external HDD (operation 710 in FIG. 7). In a sample embodiment, each input file 610 is written by a different map task in a different partition. At step (2), a small batch of data 620 is fetched from each input file 610, as indicated by arrow 630 in FIG. 6 (operation 720 in FIG. 7). In example embodiments, the total size of the fetched data should not exceed the dynamic random access memory (DRAM) capacity of the shuffle reduction task's process. The small batches of data are then merged and sorted together (according to their keys) at step (3) to form unified data 640 (operation 730 in FIG. 7).
At step (4), a shuffle reduction operation is applied to the unified data 640 to produce reduced data set 650 (operation 740 in FIG. 7). In a sample embodiment, the shuffle reduction operation is applied when the reduce operation is commutative and reduces the amount of data, so that the output data size is much smaller than the input data size. For example, in the Spark™ system, such reduce operations include the key aggregation (AggregateByKey), key reduction (ReduceByKey), and key grouping (GroupByKey) operations. The AggregateByKey operation in the Spark™ system uses a given combining function and a neutral "zero value" to aggregate the values for each key, and may return a value of a different type for each key. For example, the AggregateByKey operation allows tuples of a student, a subject, and multiple scores to be processed into an output data set containing each student and the highest score or percentage. The ReduceByKey operation in the Spark™ system merges the values for each key using an associative reduce function that accepts two arguments and returns a single element, and that is commutative and associative in the mathematical sense. In other words, the ReduceByKey function produces the same result regardless of the order of the elements when applied repeatedly to the same data set with multiple partitions. The GroupByKey function in the Spark™ system is very similar to the ReduceByKey function: it takes key-value pairs (K, V) as input and collects the values for each key in the form of an iterator, producing an output comprising keys and their lists of values. Thus, the GroupByKey function groups all values for a single key and returns them as an iterator. Multiple pipelined operations that reduce the amount of data may also be used for the shuffle reduction operation. Similar functions of other data analysis systems will be apparent to those skilled in the art.
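By way of illustration, the three Spark™ operations discussed above can be exercised as follows in PySpark; the small data set is hypothetical:

```python
# Illustrative uses of AggregateByKey, ReduceByKey, and GroupByKey (PySpark).
from pyspark.sql import SparkSession

sc = SparkSession.builder.appName("reduce-ops").getOrCreate().sparkContext
pairs = sc.parallelize([("math", 80), ("math", 95), ("art", 70), ("art", 60)])

# reduceByKey: merge values per key with a commutative, associative function.
totals = pairs.reduceByKey(lambda a, b: a + b)   # [("math", 175), ("art", 130)]

# aggregateByKey: a zero value plus per-partition and cross-partition
# combiners; here it keeps the highest score per key.
best = pairs.aggregateByKey(0, max, max)         # [("math", 95), ("art", 70)]

# groupByKey: collect all values per key as an iterable (note: unlike the
# other two, it does not shrink the data).
grouped = pairs.groupByKey().mapValues(list)     # [("math", [80, 95]), ...]
```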
At step (5), the reduced data set 650 is written to the external HDD, as indicated by arrow 660 (operation 750 in FIG. 7). While there is more data to be processed from the input files 610, steps (2)-(5) are repeated at step (6) until all of the input data has been processed (operation 760 in FIG. 7). When there is no more input data to process, the process ends at operation 770.
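For purposes of illustration, steps (1) through (6) can be expressed as the following minimal single-process sketch. It assumes pickled, key-sorted input files and uses addition as the commutative reduce function; it is not the implementation of any particular framework. A key whose records span two batch rounds yields partial aggregates, which is harmless because the reduce function is commutative and associative and the downstream reduce task combines the partial results:

```python
# Sketch of steps (1)-(6): fetch batches from every sorted map output, merge-
# sort them in memory, apply a commutative reduce, write the (much smaller)
# result, and repeat until all inputs are consumed.
import heapq
import pickle
from itertools import groupby
from operator import itemgetter

BATCH = 10_000  # records fetched per file per round; bounded by the DRAM budget

def read_batches(path):
    """Yield lists of (key, value) records from a key-sorted pickled file."""
    with open(path, "rb") as f:
        while True:
            batch = []
            try:
                while len(batch) < BATCH:
                    batch.append(pickle.load(f))
            except EOFError:
                if batch:
                    yield batch
                return
            yield batch

def shuffle_reduce(input_paths, output_path, reduce_fn=lambda a, b: a + b):
    streams = [read_batches(p) for p in input_paths]        # step (1)
    with open(output_path, "wb") as out:
        while streams:
            batches, alive = [], []
            for s in streams:                               # step (2)
                chunk = next(s, None)
                if chunk is not None:
                    batches.append(chunk)
                    alive.append(s)
            streams = alive
            if not batches:
                break
            # step (3): each chunk is key-sorted, so a k-way merge suffices
            # (values must be comparable when keys tie, e.g. numbers).
            unified = heapq.merge(*batches)
            for key, group in groupby(unified, key=itemgetter(0)):
                acc = None                                  # step (4)
                for _, value in group:
                    acc = value if acc is None else reduce_fn(acc, value)
                pickle.dump((key, acc), out)                # step (5)
            # step (6): loop until every input stream is exhausted
```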
Thus, the shuffle reduction operation reduces the size of the data read by the reduce tasks (compare the data size after step (6) with the initial size at step (1) in FIG. 6), which significantly reduces I/O overhead and the total execution time of the shuffle phase. Because of the reduced data size, HDDs can be used as a cost-effective solution for fault tolerance.
In addition to the shuffle reduction task described herein, sample embodiments may include a shuffle reduction module included in the data analysis framework driver. The data analysis framework driver is the master node of a data analysis application: it splits the application into tasks and schedules the tasks to run on executors, and may generate tasks across multiple partitions. In a sample embodiment, a driver module is provided on the client side to decide whether to invoke shuffle reduction tasks (and how many) for a job as described herein, or to keep the existing execution plan. In example embodiments, the driver module of a conventional data analysis framework driver is modified to determine when to invoke the shuffle reduction task by checking:
1. whether the job contains a shuffle phase;
2. whether the workload is large enough for the shuffle reduction method to yield a benefit; and
3. whether the reduce operation is known (or determined) to be commutative and to reduce the amount of data (the output data size is much smaller than the input data size).
When these conditions are met, the shuffle reduction task described herein is invoked, as sketched below.
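By way of illustration, a client-side check along these lines might look like the following; the job attributes, the set of recognized operations, and the workload threshold are hypothetical, not part of any framework's API:

```python
# Hypothetical sketch of the driver-side decision to invoke shuffle reduction.
COMMUTATIVE_REDUCERS = {"reduceByKey", "aggregateByKey", "groupByKey"}
MIN_SHUFFLE_BYTES = 64 * 2**30   # assumed threshold for a worthwhile gain

def should_use_shuffle_reduce(job):
    if not job.has_shuffle_stage:                  # condition 1
        return False
    if job.shuffle_bytes < MIN_SHUFFLE_BYTES:      # condition 2
        return False
    return job.reduce_op in COMMUTATIVE_REDUCERS   # condition 3: commutative,
                                                   # data-reducing operation
```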
A first alternative to the above embodiment divides the shuffle reduction phase into two different tasks (merge and p-reduce). This embodiment has features similar to those of the embodiment described above: all operations before the merge operation are performed by the merge task, and the rest are performed by the p-reduce task. The merge task communicates directly with the p-reduce task, rather than through the HDD.
The benefit of the first alternative is that the data analysis framework user/operator can independently scale the merge and reduce operations in the shuffle phase up or down, which provides greater flexibility and better resource utilization. The disadvantage is that this embodiment adds an extra layer of network communication, as it gives up the locality of the first embodiment.
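By way of illustration, the following minimal sketch models the split, with an in-process queue standing in for the direct network channel between the merge task and the p-reduce task; the batch contents are hypothetical:

```python
# Sketch of the first alternative: a merge task streams unified segments
# directly to a p-reduce task (a queue stands in for the network link),
# never touching the HDD in between.
import heapq
import queue
import threading

def merge_task(sorted_batches, channel):
    channel.put(list(heapq.merge(*sorted_batches)))  # unified, sorted segment
    channel.put(None)                                # end-of-stream marker

def p_reduce_task(channel, reduce_fn, results):
    while (segment := channel.get()) is not None:
        reduced = {}
        for key, value in segment:
            reduced[key] = reduce_fn(reduced[key], value) if key in reduced else value
        results.append(sorted(reduced.items()))

channel, results = queue.Queue(maxsize=4), []
batches = [[("a", 1), ("c", 3)], [("a", 2), ("b", 5)]]
t1 = threading.Thread(target=merge_task, args=(batches, channel))
t2 = threading.Thread(target=p_reduce_task, args=(channel, lambda x, y: x + y, results))
t1.start(); t2.start(); t1.join(); t2.join()
print(results)  # [[('a', 3), ('b', 5), ('c', 3)]]
```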
A second alternative to the above embodiment uses a different storage medium instead of the HDD. For cost reasons, HDDs are generally better suited to large amounts of data because they are less expensive than other options. However, since the shuffle reduction task significantly reduces the amount of data processed by the reduce tasks, one possible solution is to store that data on a more efficient medium (such as a solid state disk (SSD) or a different persistent memory technology) to obtain better performance during the shuffle reduction phase (the cost of these options is now lower because less data is stored). Thus, the second alternative provides better I/O performance at the expense of the overall cost of the architecture.
FIG. 8 illustrates an example of a shuffle reduction operation performed on specified input files in a sample embodiment. In this example, several input files 810 containing different types of fruit (apple, blueberry, grape, melon, orange, mango) are merged at merge operation 820 to produce respective files 830 and 840 stored in DRAM. p-reduce task 850 performs a ReduceByKey operation on respective files 830 and 840 to produce respective output files 860 and 870. Those skilled in the art will appreciate that ReduceByKey is a commutative operation and that the resulting files 860 and 870 are smaller than files 830 and 840. As shown, each output key-value pair comprises a fruit and its number of occurrences in the merged input files 830 and 840. In this example, combining the repeated elements into element/count key-value pairs reduces the number of key-value pairs from 12 to 8, saving memory space. Similar reductions in memory space can be achieved for other commutative reduce operations, such as the AggregateByKey or GroupByKey operations.
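For purposes of illustration, the FIG. 8 flow can be reproduced in a few lines. The specific fruit sequences below are hypothetical (the figure gives only the fruit types), chosen so that twelve input pairs reduce to eight output pairs:

```python
# Reproducing the FIG. 8 flow: two merged in-memory files of fruit records
# are reduced with a ReduceByKey-style count. Because counting is commutative,
# the order in which records were merged does not matter.
from collections import Counter

file_830 = ["apple", "apple", "grape", "melon", "orange", "orange"]
file_840 = ["mango", "blueberry", "blueberry", "grape", "apple", "mango"]

for name, fruits in [("860", file_830), ("870", file_840)]:
    pairs = [(fruit, 1) for fruit in fruits]   # (key, value) input pairs
    reduced = Counter()
    for fruit, count in pairs:
        reduced[fruit] += count
    print(f"output file {name}: {dict(reduced)}")
# 6 + 6 = 12 input pairs become 4 + 4 = 8 (fruit, count) pairs.
```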
The shuffle reduction phase reduces HDD I/O overhead because it significantly reduces the number of random-access I/O requests. The shuffle reduction task described herein also greatly reduces the size of the data that the reduce tasks need to read. As a result, the I/O overhead and overall execution time of the shuffle phase are greatly reduced.
The following table compares the performance of conventional data analysis frameworks, including the Spark™ and Riffle™ systems, with the shuffle reduction method implemented herein.
where:
M is the number of map tasks;
R is the number of reduce tasks;
N is the number of map tasks per merge or shuffle reduction task;
S is the total shuffle data size; and
r is the reduction factor (0 < r ≤ 1, depending on the workload, the operation, and N). In sample embodiments, r < 0.1, though it varies with the workload. In the worst case, r = 1, and the total read/write size is still at least as good as the read/write size of the Riffle™ system.
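By way of illustration only, the following short calculation applies these quantities with hypothetical values; the request counts (M × R versus M × R/N) and the second-pass data volume (r × S) follow the relationships stated in the description above:

```python
# Worked illustration of the comparison, using assumed values for M, R, N, S, r.
M, R, N = 1_000, 1_000, 40        # map tasks, reduce tasks, map outputs per merge
S = 10 * 2**40                    # total shuffle data: 10 TiB (assumed)
r = 0.1                           # reduction factor (workload dependent)

spark_requests = M * R            # random HDD fetches without merging
merged_requests = M * R // N      # Riffle and shuffle reduction alike

riffle_io = 2 * S + 2 * S         # full data written and read twice
shuffle_reduce_io = 2 * S + 2 * r * S  # second write+read touches only r*S

print(f"random fetches: {spark_requests:,} -> {merged_requests:,}")
print(f"Riffle total I/O:          {riffle_io / 2**40:.1f} TiB")
print(f"shuffle reduction total I/O: {shuffle_reduce_io / 2**40:.1f} TiB")
```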
FIG. 9 illustrates a general-purpose computer 900 suitable for implementing one or more embodiments of the methods disclosed herein. For example, computer 900 in FIG. 9 may implement a data analysis framework of the type described herein (e.g., Spark™, Hadoop™, TensorFlow™, or Dryad™) and the processes described above. The components described above may be implemented on any general-purpose network component, such as computer 900, having sufficient processing power, memory resources, and network throughput capability to handle the necessary workload placed upon it. Computer 900 includes a processor 910 (which may be referred to as a central processing unit or CPU) in communication with memory devices including secondary storage 920, read-only memory (ROM) 930, random access memory (RAM) 940, input/output (I/O) devices 950, and network connectivity devices 960. In example embodiments, the network connectivity devices 960 further connect the processor 910 to a client-side data analysis driver 970, which manages data analysis processes and determines when to invoke a shuffle reduction task as described herein. The processor 910 may be implemented as one or more CPU chips or may be part of one or more application-specific integrated circuits (ASICs).
The secondary storage 920 is typically comprised of one or more disk drives or tape drives and is used for non-volatile storage of data and as an overflow data storage device if the RAM 940 is not large enough to hold all working data. Secondary storage 920 may be used to store programs that are loaded into RAM 940 when such programs are selected for execution. The ROM 930 is used to store instructions and perhaps data that are read during program execution. The ROM 930 is a non-volatile memory device that typically has a small memory capacity relative to the larger memory capacity of the secondary storage 920. The RAM 940 is used to store volatile data and perhaps to store instructions. Access to both ROM 930 and RAM 940 is typically faster than access to secondary storage 920.
It should be understood that computer 900 may execute instructions from a computer-readable non-transitory medium storing computer-readable instructions, using one or more processors coupled to a memory, and that, when executing the computer-readable instructions, computer 900 is configured to perform the method steps and operations described in this disclosure with reference to FIGS. 1 through 9. Computer-readable non-transitory media include all types of computer-readable media, including magnetic storage media, optical storage media, flash memory media, and solid state storage media.
It will also be understood that software comprising one or more computer-executable instructions which facilitate the processes and operations as described above with reference to any or all of the steps of the present disclosure may be installed in and sold with one or more servers or databases. Alternatively, the software may be obtained and loaded into one or more servers or one or more databases in a manner consistent with the present disclosure, including by obtaining the software through physical media or distribution systems, including, for example, from a server owned by the software creator or from a server not owned but used by the software creator. For example, the software may be stored on a server for distribution over the internet.
Furthermore, those skilled in the art will understand that the present disclosure is not limited in its application to the details of construction and the arrangement of components set forth in the following description or illustrated in the drawings. The embodiments herein are capable of other embodiments and of being practiced or of being carried out in various ways. Also, it is to be understood that the phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting. The use of "including," "comprising," or "having" and variations thereof herein is meant to encompass the items listed thereafter and equivalents thereof as well as additional items.
The components of the exemplary devices, systems and methods employed in accordance with the illustrated embodiments may be implemented at least partially in digital electronic circuitry, analog electronic circuitry, or in computer hardware, firmware, software, or in combinations of them. For example, these means may also be embodied as a computer program product, such as a computer program, program code, or computer instructions tangibly embodied in an information carrier or machine-readable storage device for execution by, or to control the operation of, data processing apparatus, such as a programmable processor, a computer, or multiple computers.
A computer program can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network. Also, functional programs, codes, and code segments for accomplishing the systems and methods described herein may be easily construed as being within the scope of the present disclosure by programmers having ordinary skill in the art to which the present disclosure pertains. Method steps associated with example embodiments may be performed by one or more programmable processors executing a computer program, code, or instructions to perform functions (e.g., by operating on input data and generating output). For example, the method steps may also be performed by and the apparatus may be implemented as the following: special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with the following: a general purpose processor, a Digital Signal Processor (DSP), an ASIC, an FPGA or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical, or optical disks. Information carriers suitable for embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices, e.g., erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory devices, and data storage disks (e.g., magnetic disks, internal hard or removable disks, magneto-optical disks, CD-ROM disks, or DVD-ROM disks). The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
Those of skill in the art would understand that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure. A software module may reside in Random Access Memory (RAM), flash memory, ROM, EPROM, EEPROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. A sample storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. In other words, the processor and the storage medium may reside as integrated circuits or be implemented as discrete components.
As used herein, a "machine-readable medium" refers to a device capable of storing instructions and data, either temporarily or permanently, and may include, but is not limited to, Random Access Memory (RAM), Read Only Memory (ROM), cache memory, flash memory, optical media, magnetic media, cache memory, other types of memory (e.g., erasable programmable read only memory (EEPROM)), and any suitable combination thereof. The term "machine-readable medium" should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, or associated caches and servers) that are capable of storing the processor instructions. The term "machine-readable medium" shall also be taken to include any medium, or combination of media, that is capable of storing instructions for execution by one or more processors, such that the instructions, when executed by the one or more processors, cause the one or more processors to perform any one or more of the methodologies described herein. Thus, "machine-readable medium" refers to a single storage device or appliance, as well as a "cloud-based" storage system or storage network that includes multiple storage devices or appliances. As used herein, the term "machine-readable medium" does not include the signal itself.
Although several embodiments have been described in detail above, other modifications are possible. For example, the logic flows depicted in the figures do not require the particular ordering shown, or sequential ordering, to achieve desirable results. Other steps may be provided, or steps may be eliminated, from the described flows, and other components may be added to, or removed from, the described systems. Other embodiments may be within the scope of the following claims.
Claims (20)
1. A method of performing a shuffle reduction operation that groups and joins data for data transformation between mapping and reduction stages during a data analysis process, the method comprising:
(1) receiving as input at least two input files from a first memory, wherein each input file has been sorted and written by a different mapping task;
(2) fetching a batch of data from each input file;
(3) merging and sorting the batch data in a second memory to form a unified data segment;
(4) applying a shuffle reduction operation to the unified data segment to produce output data, wherein the shuffle reduction operation includes a commutative reduction operation that provides an amount of output data that is less than an amount of input data;
(5) writing the output data to a third memory; and
(6) repeating (2) - (5) until data from each input file is completely consumed and an analytical data output of the data analysis process has been completely formed.
2. The method of claim 1, wherein the first and third memories comprise hard disk drives and the second memory comprises Dynamic Random Access Memory (DRAM).
3. The method of claim 1, wherein the first memory comprises at least one hard disk drive and the third memory comprises at least one of a solid state disk and persistent storage.
4. The method of claim 1, wherein fetching a batch of data from each input file comprises fetching a total amount of data that does not exceed the memory capacity allocated for the shuffle reduction operation.
5. The method of claim 1, further comprising a driver module that determines whether to perform the shuffle reduction operation for a particular job in the data analysis process based on: whether the job includes a shuffle operation, whether a workload of the job is large enough to obtain a performance gain, whether the job includes the commutative reduction operation that provides an amount of output data in an output file that is less than an amount of input data, how many tasks must be started to implement the shuffle reduction operation, and when the shuffle reduction operation will start and stop.
6. The method of claim 1, wherein the data analysis process is implemented on a data analysis platform, and applying the shuffle reduction operation comprises applying at least one of a key aggregation operation, a key grouping operation, and a key reduction operation as the commutative reduction operation.
7. The method of claim 1, further comprising performing steps (1) - (3) by a first task and performing steps (4) - (5) by a second task, wherein the first task and the second task communicate directly with each other independent of the first memory and the third memory.
8. The method of claim 1, wherein applying the shuffle reduction operation to a unified data segment to produce output data comprises applying a plurality of tasks to implement the shuffle reduction operation.
9. A data analysis system, comprising:
at least one processor;
a first memory storing at least two input files, each input file being sorted and written by a different mapping task of the data analysis process;
a second memory;
a third memory; and
an instruction memory storing instructions that, when executed by the at least one processor, perform a data analysis process comprising a shuffle reduction operation that groups and joins data between mapping and reduction stages for data transformation during the data analysis process, the shuffle reduction operation comprising:
(1) receiving as input the at least two input files from the first memory;
(2) fetching a batch of data from each input file;
(3) merging and sorting the batch data in the second memory to form a unified data segment;
(4) applying a shuffle reduction operation to the unified data segment to produce output data, wherein the shuffle reduction operation includes a commutative reduction operation that provides an amount of output data that is less than an amount of input data;
(5) writing the output data to the third memory; and
(6) repeating (2) - (5) until data from each input file is completely consumed and an analytical data output of the data analysis process has been completely formed.
10. The system of claim 9, wherein the first and third memories comprise hard disk drives and the second memory comprises Dynamic Random Access Memory (DRAM).
11. The system of claim 9, wherein the first memory comprises at least one hard disk drive and the third memory comprises at least one of a solid state disk and persistent storage.
12. The system of claim 9, wherein a total amount of data obtained from each input file does not exceed a memory capacity allocated by the data analysis process for the shuffle reduction operation.
13. The system of claim 9, further comprising a driver module that determines whether to perform the shuffle reduction operation for a particular job in the data analysis process based on: whether the job includes a shuffle operation, whether a workload of the job is large enough to obtain a performance gain, whether the job includes the commutative reduction operation that provides an amount of output data in an output file that is less than an amount of input data, how many tasks must be started to implement the shuffle reduction operation, and when the shuffle reduction operation will start and stop.
14. The system of claim 9, further comprising a data analysis platform that implements the data analysis process and applies the shuffle reduction operation by applying at least one of a key aggregation operation, a key grouping operation, and a key reduction operation as the commutative reduction operation.
15. The system of claim 9, wherein a first task performs steps (1)-(3) and a second task performs steps (4)-(5), and wherein the first task and the second task communicate directly with each other independent of the first memory and the third memory.
16. The system of claim 9, wherein the shuffle reduction operation includes a plurality of tasks that are applied to the unified data segment to produce output data.
17. A computer-readable storage medium comprising instructions that, when executed by at least one processor, cause the processor to perform a shuffle reduction operation that groups and joins data for data transformation between mapping and reduction stages during a data analysis process by performing operations comprising:
(1) receiving as input at least two input files from a first memory, wherein each input file has been sorted and written by a different mapping task;
(2) fetching a batch of data from each input file;
(3) merging and sorting the batch data in a second memory to form a unified data segment;
(4) applying a shuffle reduction operation to the unified data segment to produce output data, wherein the shuffle reduction operation includes a commutative reduction operation that provides an amount of output data that is less than an amount of input data;
(5) writing the output data to a third memory; and
(6) repeating (2) - (5) until the data from each input file is completely consumed and the analytical data output of the data analysis process has been completely formed.
18. The medium of claim 17, wherein the data analysis process is implemented on a data analysis platform, the medium further comprising instructions that, when executed by the at least one processor, cause the processor to perform operations comprising applying at least one of a key aggregation operation, a key grouping operation, and a key reduction operation as the commutative reduction operation.
19. The medium of claim 17, further comprising instructions that, when executed by the at least one processor, cause the processor to perform steps (1) - (3) by a first task and steps (4) - (5) by a second task, wherein the first and second tasks are in direct communication with each other independent of the first and third memories.
20. The medium of claim 17, further comprising instructions that, when executed by the at least one processor, cause the processor to perform a plurality of tasks to implement the shuffle reduction operation.
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2020/013686 WO2021061183A1 (en) | 2020-01-15 | 2020-01-15 | Shuffle reduce tasks to reduce i/o overhead |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114945902A true CN114945902A (en) | 2022-08-26 |
Family
ID=69500883
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202080092843.2A Pending CN114945902A (en) | 2020-01-15 | 2020-01-15 | Shuffle reduction task with reduced I/O overhead |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN114945902A (en) |
WO (1) | WO2021061183A1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113743582B (en) * | 2021-08-06 | 2023-11-17 | Beijing University of Posts and Telecommunications | Novel channel shuffling method and device based on stack shuffling |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130167151A1 (en) * | 2011-12-22 | 2013-06-27 | Abhishek Verma | Job scheduling based on map stage and reduce stage duration |
US20150150017A1 (en) * | 2013-11-26 | 2015-05-28 | International Business Machines Corporation | Optimization of map-reduce shuffle performance through shuffler i/o pipeline actions and planning |
US20160034205A1 (en) * | 2014-08-01 | 2016-02-04 | Software Ag Usa, Inc. | Systems and/or methods for leveraging in-memory storage in connection with the shuffle phase of mapreduce |
CN108027801A (en) * | 2015-12-31 | 2018-05-11 | 华为技术有限公司 | Data processing method, device and system |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9170848B1 (en) * | 2010-07-27 | 2015-10-27 | Google Inc. | Parallel processing of data |
US8560779B2 (en) * | 2011-05-20 | 2013-10-15 | International Business Machines Corporation | I/O performance of data analytic workloads |
US9424274B2 (en) * | 2013-06-03 | 2016-08-23 | Zettaset, Inc. | Management of intermediate data spills during the shuffle phase of a map-reduce job |
- 2020-01-15: WO PCT/US2020/013686 filed (WO2021061183A1, active Application Filing)
- 2020-01-15: CN 202080092843.2A filed (CN114945902A, active Pending)
Also Published As
Publication number | Publication date |
---|---|
WO2021061183A1 (en) | 2021-04-01 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |