CN106648451B - MapReduce engine data processing method and device based on memory

MapReduce engine data processing method and device based on memory

Info

Publication number
CN106648451B
CN106648451B (application CN201610305911.4A)
Authority
CN
China
Prior art keywords
data
memory
granularity
data processing
reduce
Prior art date
Legal status
Active
Application number
CN201610305911.4A
Other languages
Chinese (zh)
Other versions
CN106648451A (en)
Inventor
张伟
王界兵
李�杰
施莹
宋泰然
韦辉华
郭宇翔
Current Assignee
Shenzhen Frontsurf Information Technology Co ltd
Original Assignee
Shenzhen Frontsurf Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Frontsurf Information Technology Co ltd
Priority to CN201610305911.4A
Publication of CN106648451A
Application granted
Publication of CN106648451B
Legal status: Active

Classifications

    • G PHYSICS; G06 COMPUTING, CALCULATING OR COUNTING; G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/0611 Improving I/O performance in relation to response time
    • G06F3/0613 Improving I/O performance in relation to throughput
    • G06F3/0629 Configuration or reconfiguration of storage systems
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G06F16/182 Distributed file systems

Abstract

The invention discloses a memory-based MapReduce engine data processing method and device. The method comprises the following steps: performing granularity cutting on the Map output data of each partition, and sorting the cut granules; shuffling the cut granules of each partition in multiple batches, copying and merging each batch of data in turn and performing pipelined reduce processing, while keeping the data handled by the reduce process in memory. By pipelining the reduce process of MapReduce purely in software, the method greatly reduces IO accesses and latency; the number of concurrent batches can be adjusted according to the amount of available memory, improving the throughput and overall performance of the MapReduce engine, with measured results 1.6-2 times those of the native engine.

Description

MapReduce engine data processing method and device based on memory
Technical Field
The invention relates to the field of MapReduce engine data processing, and in particular to a memory-based MapReduce engine data processing method and device.
Background
Since Hadoop was introduced as a big data processing technology, optimizing the performance of MapReduce, its data processing engine, has been a prominent concern in the industry.
Taking Reduce as an example: in the native process, the corresponding whole partition (e.g., P1) in each Map Output File (MOF) is copied to the Reduce end; all the copied partitions (P1, P1, P1) are then merged into one ordered total set, and finally the reduce computation is performed.
Currently, the industry's main means of reducing IO in MapReduce are compression and combiners. As for memory-based MapReduce, UDA (Unstructured Data Accelerator) from Mellanox Corporation attempted pipelining, but it is mainly a hardware-based acceleration technique; because of the design limitations of its shuffle protocol, each batch carries only a small amount of data, so it does not achieve high throughput.
Disclosure of Invention
The main aim of the invention is to provide a memory-based MapReduce engine data processing method and device that achieve high data throughput purely in software.
In order to achieve the above object, the present invention first provides a memory-based MapReduce engine data processing method, including:
performing granularity cutting on the Map output data of each partition, and sorting the cut granules;
shuffling the cut granules of each partition in multiple batches, copying and merging each batch of data in turn and performing pipelined reduce processing, while keeping the data handled by the reduce process in memory.
Further, the step of shuffling the cut granules of each partition in multiple batches, copying and merging each batch of data in turn and performing pipelined reduce processing, while keeping the data handled by the reduce process in memory, includes:
each processing stage of the Reduce process running as an autonomous sub-thread, with all sub-threads communicating through a shared first-in-first-out asynchronous message queue.
Further, the step of shuffling the cut granules of each partition in multiple batches, copying and merging each batch of data in turn and performing pipelined reduce processing, while keeping the data handled by the reduce process in memory, includes:
the pipelined data processing running in multiple concurrent batches, with the number of concurrent batches controlled as the size of the available memory divided by the memory configured for each batch.
Further, before the step of performing granularity cutting on the Map output data of each partition and sorting the cut granules, the method includes:
after a specified number of batches have been processed in the pipelined manner, synchronizing all reduces once;
and after the step of performing granularity cutting on the Map output data of each partition and sorting the cut granules, the method includes:
storing the MOF files sorted by relative bucket ID, and pre-caching them correspondingly in the same order.
Further, the step of shuffling the cut granules of each partition in multiple batches, copying and merging each batch of data in turn and performing pipelined reduce processing, while keeping the data handled by the reduce process in memory, includes:
selectively sorting the data within each granule; if the data within a granule is not sorted, sorting it at the reduce end after copying.
The invention also provides a memory-based MapReduce engine data processing device, which comprises:
a cutting unit, used for performing granularity cutting on the Map output data of each partition and sorting the cut granules;
and a pipeline unit, used for shuffling the cut granules of each partition in multiple batches, copying and merging each batch of data in turn and performing pipelined reduce processing, while keeping the data handled by the reduce process in memory.
Further, the pipeline unit includes:
a shared communication module, used for running each processing stage of the Reduce process as an autonomous sub-thread, with all sub-threads communicating through a shared first-in-first-out asynchronous message queue.
Further, the pipeline unit includes:
a concurrent operation module, used for running the pipelined data processing in multiple concurrent batches and controlling the number of concurrent batches as the size of the available memory divided by the memory configured for each batch.
Further, the memory-based MapReduce engine data processing device includes:
a synchronization unit, which synchronizes all reduces once after a specified number of batches have been processed in the pipelined manner;
and a storage unit, which stores the MOF files sorted by relative bucket ID and pre-caches them correspondingly in the same order.
Further, the pipeline unit includes:
an intra-granule data processing module, used for selectively sorting the data within each granule; if the data within a granule is not sorted, it is sorted at the reduce end after copying.
According to the memory-based MapReduce engine data processing method and device, the reduce process of MapReduce is pipelined purely in software: data is processed in concurrent batches, and the shuffle (copy), merge and reduce of each batch can be kept in memory (In-Memory), greatly reducing IO accesses and latency. The number of concurrent batches can be adjusted according to the amount of available memory, improving the throughput and overall performance of the MapReduce engine, with measured results 1.6-2 times those of the native engine.
Drawings
FIG. 1 is a schematic flow chart illustrating a memory-based MapReduce engine data processing method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a pipelined Reduce process according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart illustrating a memory-based MapReduce engine data processing method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a zookeeper-based batch-wise synchronization process between Reduce processes according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating a mapping relationship of bucket IDs according to an embodiment of the present invention;
FIG. 6 is a block diagram illustrating a schematic structure of a memory-based MapReduce engine data processing apparatus according to an embodiment of the present invention;
FIG. 7 is a block diagram illustrating the structure of a pipeline unit according to an embodiment of the present invention;
FIG. 8 is a block diagram illustrating a structure of a memory-based MapReduce engine data processing apparatus according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, an embodiment of the present invention provides a memory-based MapReduce engine data processing method, including:
s1, performing granularity cutting on the Map output result data of each partition, and sequencing the cut granularity;
and S2, performing multiple batches of shuffle on the granularity of each partition after cutting, sequentially copying and merging the data of each batch, performing pipeline data processing on the reduce, and controlling the data processed by the reduce process in a memory.
As described in step S1, the original data of each partition is cut into finer granules (buckets), and the buckets must be ordered to enable the subsequent pipelined data processing.
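As a minimal illustrative sketch (not taken from the patent text), the granularity cutting of one partition's sorted map output could look as follows in Java; the serialized-record representation and the 4 MB bucket size are assumptions:

```java
import java.util.ArrayList;
import java.util.List;

final class BucketCutter {
    static final int BUCKET_BYTES = 4 * 1024 * 1024; // assumed 4 MB granule size

    /** Each record is an already-sorted key/value pair serialized as bytes. */
    static List<List<byte[]>> cut(List<byte[]> sortedRecords) {
        List<List<byte[]>> buckets = new ArrayList<>();
        List<byte[]> current = new ArrayList<>();
        int size = 0;
        for (byte[] rec : sortedRecords) {
            if (size + rec.length > BUCKET_BYTES && !current.isEmpty()) {
                buckets.add(current);          // close the full bucket
                current = new ArrayList<>();
                size = 0;
            }
            current.add(rec);
            size += rec.length;
        }
        if (!current.isEmpty()) buckets.add(current);
        return buckets;                        // bucket IDs = list indices, in key order
    }
}
```

Because the input records are already sorted and buckets are closed in order, the resulting bucket IDs preserve the key order that the later batched shuffle relies on.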
as described in step S2, in order to implement the pipelining operation, compared to the original entire partition protocol, the new fragment protocol may perform multiple batches (pass) of a partition, and each batch only copies a subset (packet) of the original entire partition data in order, so that the data processed by the reduce process can be controlled in the memory by adjusting the size of the packet. The method comprises the steps that a pure software mode is used for carrying out pipelining (pipelining) design on the reduce process of the MapReduce, namely, data are processed In batches, a shuffle or a copy of each batch is combined (merge) and reduce can be controlled to be carried out In an internal Memory (In-Memory), and IO access and delay are greatly reduced; the number of concurrent batches can be adjusted according to the amount of available memory, so that the throughput and the overall performance of the mapreduce engine are improved.
Referring to fig. 2, in this embodiment, step S2 of shuffling the cut granules of each partition in multiple batches, copying and merging each batch of data in turn and performing pipelined reduce processing, while keeping the data handled by the reduce process in memory, includes:
S21, each processing stage of the Reduce process runs as an autonomous sub-thread, and all sub-threads communicate through a shared first-in-first-out asynchronous message queue.
As described in step S21, the merge thread (merger) is notified when the copy thread (fetcher) finishes, and the reduce thread (reducer) is notified when the merge thread finishes.
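A hedged Java sketch of this thread wiring follows; the class names and the poison-pill protocol are illustrative assumptions rather than the patent's actual implementation, but they show how shared first-in-first-out queues let a finished copy immediately trigger the merge, and a finished merge immediately trigger the reduce:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

final class ReducePipeline {
    static final byte[] EOF = new byte[0];     // poison pill that ends the stream

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<byte[]> fetched = new LinkedBlockingQueue<>();
        BlockingQueue<byte[]> merged  = new LinkedBlockingQueue<>();

        Thread fetcher = new Thread(() -> {    // copies one bucket batch per pass
            for (int pass = 0; pass < 3; pass++) {
                fetched.add(("batch-" + pass).getBytes());
            }
            fetched.add(EOF);
        });
        Thread merger = new Thread(() -> {     // merges each batch as it arrives
            try {
                for (byte[] b; (b = fetched.take()) != EOF; ) merged.add(b);
                merged.add(EOF);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        Thread reducer = new Thread(() -> {    // reduces as soon as merged data arrives
            try {
                for (byte[] b; (b = merged.take()) != EOF; )
                    System.out.println("reduce(" + new String(b) + ")");
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        fetcher.start(); merger.start(); reducer.start();
        fetcher.join(); merger.join(); reducer.join();
    }
}
```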
In this embodiment, step S2 of shuffling the cut granules of each partition in multiple batches, copying and merging each batch of data in turn and performing pipelined reduce processing, while keeping the data handled by the reduce process in memory, includes:
S22, the pipelined data processing runs in multiple concurrent batches, and the number of concurrent batches is controlled as the size of the available memory divided by the memory configured for each batch.
As described in step S22, adapting the number of concurrent batches to the available memory greatly improves throughput compared with the native method.
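Expressed as code, the batch-count rule above is a one-line computation; the helper below is a sketch with illustrative names:

```java
final class BatchPlanner {
    /** Number of concurrent passes = available memory / memory per batch, at least 1. */
    static int concurrentBatches(long availableMemoryBytes, long perBatchMemoryBytes) {
        return (int) Math.max(1, availableMemoryBytes / perBatchMemoryBytes);
    }
}
```

For example, with 4 GB of memory available and 256 MB configured per batch, 16 passes would run concurrently.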
In this embodiment, the step of shuffling the cut granules of each partition in multiple batches, copying and merging each batch of data in turn and performing pipelined reduce processing, while keeping the data handled by the reduce process in memory, includes:
S23, selectively sorting the data within each granule; if the data within a granule is not sorted, sorting it at the reduce end after copying.
As described in step S23, if the data within a granule is not sorted, it can be sorted at the reduce end after copying; this shifts some of the sorting CPU cost from the map end to the reduce end, which helps relieve jobs whose map-end CPU is the bottleneck.
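A small sketch of this deferred sort, under the assumption that granule records are raw key bytes compared in the usual unsigned lexicographic order (class and method names are illustrative):

```java
import java.util.Comparator;
import java.util.List;

final class BucketSorter {
    // unsigned lexicographic byte order, the usual raw-key order in Hadoop
    static final Comparator<byte[]> KEY_ORDER = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = Integer.compare(a[i] & 0xff, b[i] & 0xff);
            if (c != 0) return c;
        }
        return Integer.compare(a.length, b.length);
    };

    /** Sort a copied bucket on the reduce side only when the map side skipped it. */
    static void sortIfNeeded(List<byte[]> bucket, boolean sortedOnMapSide) {
        if (!sortedOnMapSide) {
            bucket.sort(KEY_ORDER);
        }
    }
}
```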
Referring to figs. 3, 4 and 5, in this embodiment, before step S1 of performing granularity cutting on the Map output data of each partition and sorting the cut granules, the method includes:
S11, after a specified number of batches have been processed in the pipelined manner, synchronizing all reduces once;
and after step S1 of performing granularity cutting on the Map output data of each partition and sorting the cut granules, the method includes:
S12, storing the MOF files sorted by relative bucket ID, and pre-caching them correspondingly in the same order.
As described in step S11, because of differences in reduce launch order, node computing capability and network conditions, the batches (passes) that different reduces are running can diverge noticeably; for example, one reduce may be running pass 0 while another is running pass 5. This poses a challenge for batch-by-batch pre-caching of MOFs: memory may not be enough to cache all the data of batches 0 to 5, so a synchronization mechanism is required to lock the batches; only after all reduces have completed the same batch is the next batch processed, ensuring that the MOF cache is hit. Referring to fig. 4, since the reduces may be distributed on different nodes, this embodiment is implemented using zookeeper, the node synchronization mechanism in the hadoop ecosystem: all reduces are synchronized once every specified number of batches, i.e. the gap between reduces never exceeds the specified number of batches.
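One way to realize this batch lock is a ZooKeeper double barrier; the sketch below uses Apache Curator's DistributedDoubleBarrier recipe, with the connection string, barrier paths, reduce count and synchronization interval all being illustrative assumptions rather than values from the patent:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.barriers.DistributedDoubleBarrier;
import org.apache.curator.retry.ExponentialBackoffRetry;

final class BatchBarrier {
    public static void main(String[] args) throws Exception {
        int reduceCount = 8;          // assumed number of reduce tasks in the job
        int syncEvery   = 2;          // assumed: synchronize every 2 batches
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "zk-host:2181", new ExponentialBackoffRetry(1000, 3));
        zk.start();
        for (int pass = 0; pass < 6; pass++) {
            processBatch(pass);       // fetch + merge + reduce for this pass
            if ((pass + 1) % syncEvery == 0) {
                DistributedDoubleBarrier barrier = new DistributedDoubleBarrier(
                        zk, "/mapreduce/pass-" + pass, reduceCount);
                barrier.enter();      // block until every reduce reaches this pass
                barrier.leave();      // then all proceed to the next batch together
            }
        }
        zk.close();
    }
    static void processBatch(int pass) { /* pipeline work for one pass */ }
}
```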
Referring to fig. 5, as described in step S12, since the multiple shuffle batches of each reduce proceed in order starting from bucket 0, the MOF files are stored sorted by relative bucket ID and correspondingly pre-cached in the same order; this new pre-caching method effectively reduces the probability of random IO and increases the probability of cache hits.
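A minimal sketch of this sequential pre-caching, assuming the MOF carries an index of per-bucket offsets and sizes (the file layout and names are assumptions for illustration):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MofPreCache {
    /** Read buckets in ascending bucket-ID order, so disk access stays sequential. */
    static ByteBuffer[] preCache(Path mofFile, long[] bucketOffsets, int[] bucketSizes)
            throws IOException {
        ByteBuffer[] cache = new ByteBuffer[bucketSizes.length];
        try (FileChannel ch = FileChannel.open(mofFile, StandardOpenOption.READ)) {
            for (int id = 0; id < bucketSizes.length; id++) {
                ByteBuffer buf = ByteBuffer.allocateDirect(bucketSizes[id]);
                ch.read(buf, bucketOffsets[id]); // one read per bucket, sketch only
                buf.flip();
                cache[id] = buf;
            }
        }
        return cache;
    }
}
```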
Because the pipelined computation is memory-based, performance is strongly affected by how memory is used and managed. Memory managed by the Java Virtual Machine (JVM) is limited by the performance of Garbage Collection, which is clearly unsuitable for the present invention; the invention therefore allocates memory directly from the system and manages it itself, instead of relying on the JVM.
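A sketch of such self-managed, GC-free buffering using direct (off-heap) ByteBuffers; the pool structure and sizing are illustrative assumptions, not the patent's actual memory manager:

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class DirectBufferPool {
    private final BlockingQueue<ByteBuffer> free;

    DirectBufferPool(int buffers, int bufferBytes) {
        free = new ArrayBlockingQueue<>(buffers);
        for (int i = 0; i < buffers; i++) {
            free.add(ByteBuffer.allocateDirect(bufferBytes)); // allocated outside the JVM heap
        }
    }
    ByteBuffer acquire() throws InterruptedException {
        return free.take();            // blocks until a buffer is recycled
    }
    void release(ByteBuffer buf) {
        buf.clear();                   // reset position/limit for reuse
        free.add(buf);
    }
}
```

Since the buffers are allocated once and recycled rather than discarded, Garbage Collection never scans the shuffle data.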
In one embodiment, a comparative experiment was performed on 4 data nodes, and the experimental data were analyzed as follows:
(1) Test environment: 4 data nodes
Hadoop software: three suppliers (CDH, HDP, MapR) gave similar results
CPU: 2 x 8 cores
RAM: 128 GB
Disk: 12 x 2 TB
(2) The results are shown in the following table:
[Results table shown as an image in the original publication]
As can be seen from the table, the data processing capacity of MapReduce is significantly improved, to about 1.6-2 times that of the native engine.
According to the memory-based MapReduce engine data processing method, the reduce process of MapReduce is pipelined purely in software: data is processed in concurrent batches, and the shuffle (copy), merge and reduce of each batch can be kept in memory (In-Memory), greatly reducing IO accesses and latency. The number of concurrent batches can be adjusted according to the amount of available memory, improving the throughput and overall performance of the MapReduce engine, with measured results 1.6-2 times those of the native engine.
Referring to fig. 6, an embodiment of the present invention further provides a memory-based MapReduce engine data processing apparatus, including:
a cutting unit 10, configured to perform granularity cutting on the Map output data of each partition and sort the cut granules;
a pipeline unit 20, configured to shuffle the cut granules of each partition in multiple batches, copy and merge each batch of data in turn and perform pipelined reduce processing, while keeping the data handled by the reduce process in memory.
As described above, the cutting unit 10 cuts the original data of each partition into finer granules (buckets), and the buckets must be ordered to enable the subsequent pipelined data processing.
As mentioned above for the pipeline unit 20, to enable pipelining, the new granule-based shuffle protocol, in contrast to the original whole-partition protocol, runs each partition in multiple batches (passes), and each batch copies, in order, only a subset (a packet of buckets) of the whole partition's data; by adjusting the packet size, the data handled by the reduce process can be kept in memory. The reduce process of MapReduce is thus pipelined purely in software: data is processed in concurrent batches, and the shuffle (copy), merge and reduce of each batch can be kept in memory (In-Memory), greatly reducing IO accesses and latency; the number of concurrent batches can be adjusted according to the amount of available memory, improving the throughput and overall performance of the MapReduce engine.
Referring to fig. 7, in this embodiment, the pipeline unit 20 includes:
a shared communication module 21, used for running each processing stage of the Reduce process as an autonomous sub-thread, with all sub-threads communicating through a shared first-in-first-out asynchronous message queue.
In the shared communication module, the merge thread (merger) is notified when the copy thread (fetcher) finishes, and the reduce thread (reducer) is notified when the merge thread finishes. Because order is preserved between batches, the reducer can start the reduce computation directly, without waiting for the remaining data to be copied; throughout the process, data access can be kept in memory, greatly reducing latency.
In this embodiment, the pipeline unit 20 includes:
a concurrent operation module 22, configured to run the pipelined data processing in multiple concurrent batches and to control the number of concurrent batches as the size of the available memory divided by the memory configured for each batch.
With the concurrent operation module, concurrency adapts to the available memory, greatly improving throughput compared with the native method.
In this embodiment, the pipeline unit 20 includes:
an intra-granule data processing module 23, used for selectively sorting the data within each granule; if the data within a granule is not sorted, it is copied and then sorted at the reduce end.
In the intra-granule data processing module 23, if the data within a granule is not sorted, it can be sorted at the reduce end after copying; this shifts some of the sorting CPU cost from the map end to the reduce end, which helps relieve jobs whose map-end CPU is the bottleneck.
Referring to fig. 8, in this embodiment, the memory-based MapReduce engine data processing apparatus further includes:
a synchronization unit 110, which synchronizes all reduces once after a specified number of batches have been processed in the pipelined manner;
a storage unit 120, which stores the MOF files sorted by relative bucket ID and correspondingly pre-caches them in the same order.
In the synchronization unit 110, because of differences in reduce launch order, node computing capability and network conditions, the batches (passes) that different reduces are running can diverge noticeably; for example, one reduce may be running pass 0 while another is running pass 5. This poses a challenge for batch-by-batch pre-caching of MOFs: memory may not be enough to cache all the data of batches 0 to 5, so a synchronization mechanism is required to lock the batches; only after all reduces have completed the same batch is the next batch processed, ensuring that the MOF cache is hit. Referring to fig. 4, since the reduces may be distributed on different nodes, this embodiment is implemented using zookeeper, the node synchronization mechanism in the hadoop ecosystem: all reduces are synchronized once every specified number of batches, i.e. the gap between reduces never exceeds the specified number of batches.
Referring to fig. 5, in the storage unit 120, since the multiple shuffle batches of each reduce proceed in order starting from bucket 0, the MOF files are stored sorted by relative bucket ID and correspondingly pre-cached in the same order; this new pre-caching method effectively reduces the probability of random IO and increases the probability of cache hits.
Because the pipelined computation is memory-based, performance is strongly affected by how memory is used and managed. Memory managed by the Java Virtual Machine (JVM) is limited by the performance of Garbage Collection, which is clearly unsuitable for the present invention; the invention therefore allocates memory directly from the system and manages it itself, instead of relying on the JVM.
In one embodiment, a comparative experiment was performed on 4 data nodes, and the experimental data were analyzed as follows:
(1) Test environment: 4 data nodes
Hadoop software: three suppliers (CDH, HDP, MapR) gave similar results
CPU: 2 x 8 cores
RAM: 128 GB
Disk: 12 x 2 TB
(2) The results are shown in the following table:
[Results table shown as an image in the original publication]
As can be seen from the table, the data processing capacity of MapReduce is significantly improved, to about 1.6-2 times that of the native engine.
According to the memory-based MapReduce engine data processing device, the reduce process of MapReduce is pipelined purely in software: data is processed in concurrent batches, and the shuffle (copy), merge and reduce of each batch can be kept in memory (In-Memory), greatly reducing IO accesses and latency. The number of concurrent batches can be adjusted according to the amount of available memory, improving the throughput and overall performance of the MapReduce engine, with measured results 1.6-2 times those of the native engine.
The above description covers only preferred embodiments of the present invention and does not limit its patent scope; all equivalent structural or process transformations made using the contents of this specification and the accompanying drawings, whether applied directly or indirectly in other related technical fields, are likewise included within the patent protection scope of the present invention.

Claims (8)

1. A memory-based MapReduce engine data processing method, characterized by comprising the following steps:
performing granularity cutting on the Map output data of each partition, and sorting the cut granules;
shuffling the cut granules of each partition in multiple batches, copying and merging each batch of data in turn and performing pipelined reduce processing, while keeping the data handled by the reduce process in memory;
wherein after the step of performing granularity cutting on the Map output data of each partition and sorting the cut granules, the method comprises:
storing the MOF files sorted by relative bucket ID, and pre-caching them correspondingly in the same order;
and the step of shuffling the cut granules of each partition in multiple batches, copying and merging each batch of data in turn and performing pipelined reduce processing, while keeping the data handled by the reduce process in memory, comprises:
each processing stage of the Reduce process running as an autonomous sub-thread, with all sub-threads communicating through a shared first-in-first-out asynchronous message queue.
2. The memory-based MapReduce engine data processing method according to claim 1, wherein the step of shuffling the cut granules of each partition in multiple batches, copying and merging each batch of data in turn and performing pipelined reduce processing, while keeping the data handled by the reduce process in memory, comprises:
the pipelined data processing running in multiple concurrent batches, with the number of concurrent batches controlled as the size of the available memory divided by the memory configured for each batch.
3. The memory-based MapReduce engine data processing method according to claim 2, wherein before the step of performing granularity cutting on the Map output data of each partition and sorting the cut granules, the method comprises:
after a specified number of batches have been processed in the pipelined manner, synchronizing all reduces once.
4. The memory-based MapReduce engine data processing method according to claim 1, wherein the step of shuffling the cut granules of each partition in multiple batches, copying and merging each batch of data in turn and performing pipelined reduce processing, while keeping the data handled by the reduce process in memory, comprises:
selectively sorting the data within each granule; if the data within a granule is not sorted, sorting it at the reduce end after copying.
5. A memory-based MapReduce engine data processing device, characterized by comprising:
a cutting unit, used for performing granularity cutting on the Map output data of each partition and sorting the cut granules;
a pipeline unit, used for shuffling the cut granules of each partition in multiple batches, copying and merging each batch of data in turn and performing pipelined reduce processing, while keeping the data handled by the reduce process in memory;
wherein the device further comprises: a storage unit that stores the MOF files sorted by relative bucket ID and correspondingly pre-caches them in the same order;
and the pipeline unit comprises:
a shared communication module, used for running each processing stage of the Reduce process as an autonomous sub-thread, with all sub-threads communicating through a shared first-in-first-out asynchronous message queue.
6. The memory-based MapReduce engine data processing device according to claim 5, wherein the pipeline unit comprises:
a concurrent operation module, used for running the pipelined data processing in multiple concurrent batches and controlling the number of concurrent batches as the size of the available memory divided by the memory configured for each batch.
7. The memory-based MapReduce engine data processing device according to claim 6, further comprising:
a synchronization unit that synchronizes all reduces once after a specified number of batches have been processed in the pipelined manner.
8. The memory-based MapReduce engine data processing device according to claim 5, wherein the pipeline unit comprises:
an intra-granule data processing module, used for selectively sorting the data within each granule; if the data within a granule is not sorted, it is sorted at the reduce end after copying.
CN201610305911.4A 2016-05-10 2016-05-10 MapReduce engine data processing method and device based on memory Active CN106648451B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610305911.4A CN106648451B (en) 2016-05-10 2016-05-10 MapReduce engine data processing method and device based on memory

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610305911.4A CN106648451B (en) 2016-05-10 2016-05-10 MapReduce engine data processing method and device based on memory

Publications (2)

Publication Number Publication Date
CN106648451A CN106648451A (en) 2017-05-10
CN106648451B true CN106648451B (en) 2020-09-08

Family

ID=58848688

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610305911.4A Active CN106648451B (en) 2016-05-10 2016-05-10 MapReduce engine data processing method and device based on memory

Country Status (1)

Country Link
CN (1) CN106648451B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101197016A (en) * 2006-12-08 2008-06-11 鸿富锦精密工业(深圳)有限公司 Batching concurrent job processing equipment and method
CN102222092A (en) * 2011-06-03 2011-10-19 复旦大学 Massive high-dimension data clustering method for MapReduce platform
CN103793425A (en) * 2012-10-31 2014-05-14 国际商业机器公司 Data processing method and data processing device for distributed system
CN104021194A (en) * 2014-06-13 2014-09-03 浪潮(北京)电子信息产业有限公司 Mixed type processing system and method oriented to industry big data diversity application
CN104598562A (en) * 2015-01-08 2015-05-06 浪潮软件股份有限公司 XML file processing method and device based on MapReduce parallel computing model

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9146830B2 (en) * 2012-10-26 2015-09-29 Jsmapreduce Corporation Hybrid local/remote infrastructure for data processing with lightweight setup, powerful debuggability, controllability, integration, and productivity features
CN103970520B (en) * 2013-01-31 2017-06-16 国际商业机器公司 Method for managing resource, device and architecture system in MapReduce frameworks
US9389994B2 (en) * 2013-11-26 2016-07-12 International Business Machines Corporation Optimization of map-reduce shuffle performance through shuffler I/O pipeline actions and planning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101197016A (en) * 2006-12-08 2008-06-11 鸿富锦精密工业(深圳)有限公司 Batching concurrent job processing equipment and method
CN102222092A (en) * 2011-06-03 2011-10-19 复旦大学 Massive high-dimension data clustering method for MapReduce platform
CN103793425A (en) * 2012-10-31 2014-05-14 国际商业机器公司 Data processing method and data processing device for distributed system
CN104021194A (en) * 2014-06-13 2014-09-03 浪潮(北京)电子信息产业有限公司 Mixed type processing system and method oriented to industry big data diversity application
CN104598562A (en) * 2015-01-08 2015-05-06 浪潮软件股份有限公司 XML file processing method and device based on MapReduce parallel computing model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Shuffle optimization and reconstruction in MapReduce; Peng Fuquan et al.; China Sciencepaper; 2012-08-02; Vol. 7, No. 4; pp. 242-243 *

Also Published As

Publication number Publication date
CN106648451A (en) 2017-05-10

Similar Documents

Publication Publication Date Title
US11734179B2 (en) Efficient work unit processing in a multicore system
US8381230B2 (en) Message passing with queues and channels
US20160132541A1 (en) Efficient implementations for mapreduce systems
WO2019223596A1 (en) Method, device, and apparatus for event processing, and storage medium
CN100375093C (en) Processing of multiroute processing element data
US20150286414A1 (en) Scanning memory for de-duplication using rdma
CN107153643B (en) Data table connection method and device
Frey et al. A spinning join that does not get dizzy
US20210311891A1 (en) Handling an input/output store instruction
Yu et al. Design and evaluation of network-levitated merge for hadoop acceleration
WO2022083197A1 (en) Data processing method and apparatus, electronic device, and storage medium
CN108139927B (en) Action-based routing of transactions in an online transaction processing system
US9317346B2 (en) Method and apparatus for transmitting data elements between threads of a parallel computer system
US8543722B2 (en) Message passing with queues and channels
US7466716B2 (en) Reducing latency in a channel adapter by accelerated I/O control block processing
GB2517780A (en) Improved checkpoint and restart
US10061725B2 (en) Scanning memory for de-duplication using RDMA
US20130110968A1 (en) Reducing latency in multicast traffic reception
US20110302377A1 (en) Automatic Reallocation of Structured External Storage Structures
Guo et al. Distributed join algorithms on multi-CPU clusters with GPUDirect RDMA
CN106648451B (en) MapReduce engine data processing method and device based on memory
US11714741B2 (en) Dynamic selective filtering of persistent tracing
CN110955461B (en) Processing method, device, system, server and storage medium for computing task
Song et al. Cascade: A Platform for Delay-Sensitive Edge Intelligence
WO2021057759A1 (en) Memory migration method, device, and computing apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant