CN106648451B - MapReduce engine data processing method and device based on memory

MapReduce engine data processing method and device based on memory

Info

Publication number
CN106648451B
CN106648451B (application CN201610305911.4A)
Authority
CN
China
Prior art keywords
data
memory
granularity
data processing
reduce
Prior art date
Legal status
Active
Application number
CN201610305911.4A
Other languages
Chinese (zh)
Other versions
CN106648451A (en)
Inventor
张伟
王界兵
李�杰
施莹
宋泰然
韦辉华
郭宇翔
Current Assignee
Shenzhen Frontsurf Information Technology Co ltd
Original Assignee
Shenzhen Frontsurf Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Frontsurf Information Technology Co ltd
Priority to CN201610305911.4A
Publication of CN106648451A
Application granted
Publication of CN106648451B
Legal status: Active

Classifications

    • G PHYSICS; G06 COMPUTING, CALCULATING OR COUNTING; G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/0611 Improving I/O performance in relation to response time
    • G06F3/0613 Improving I/O performance in relation to throughput
    • G06F3/0629 Configuration or reconfiguration of storage systems
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G06F16/182 Distributed file systems

Abstract

The invention discloses a memory-based MapReduce engine data processing method and device. The method comprises the following steps: performing granularity cutting on the Map output data of each partition, and sorting the cut granules; shuffling the cut granules of each partition in multiple batches, copying and merging each batch of data in turn and performing pipelined reduce processing, while keeping the data handled by the reduce process in memory. By pipelining the reduce process of MapReduce purely in software, the method greatly reduces IO accesses and latency; the number of concurrent batches can be adjusted according to the amount of available memory, improving the throughput and overall performance of the MapReduce engine, with measured results 1.6-2 times those of the native engine.

Description

MapReduce engine data processing method and device based on memory
Technical Field
The invention relates to the field of MapReduce engine data processing, and in particular to a memory-based MapReduce engine data processing method and device.
Background
Since Hadoop was introduced as a big data processing technology, optimizing the performance of MapReduce, its data processing engine, has been a prominent concern in the industry.
Taking Reduce as an example: in the native process, the corresponding whole partition (e.g., P1) in each Map Output File (MOF) is copied to the Reduce end; all the copied partitions (P1, P1, P1) are then merged into one ordered total set, and finally the reduce computation is performed.
Currently, the industry's main means of reducing IO in MapReduce are compression and combiners. As for memory-based MapReduce, UDA (Unstructured Data Accelerator) from Mellanox Corporation attempted pipelining, but it is mainly a hardware-based acceleration technique; because of the design limitations of its shuffle protocol, each batch carries only a small amount of data, so it does not achieve high throughput.
Disclosure of Invention
The main aim of the invention is to provide a memory-based MapReduce engine data processing method and device that achieve high data throughput purely in software.
In order to achieve the above object, the present invention first provides a memory-based MapReduce engine data processing method, including:
performing granularity cutting on the Map output data of each partition, and sorting the cut granules;
shuffling the cut granules of each partition in multiple batches, copying and merging each batch of data in turn and performing pipelined reduce processing, while keeping the data handled by the reduce process in memory.
Further, the step of shuffling the cut granules of each partition in multiple batches, copying and merging each batch of data in turn and performing pipelined reduce processing, while keeping the data handled by the reduce process in memory, includes:
each processing stage of the Reduce process running as an autonomous sub-thread, with all sub-threads communicating through a shared first-in-first-out asynchronous message queue.
Further, the step of shuffling the cut granules of each partition in multiple batches, copying and merging each batch of data in turn and performing pipelined reduce processing, while keeping the data handled by the reduce process in memory, includes:
the pipelined data processing running in multiple concurrent batches, with the number of concurrent batches controlled as the size of the available memory divided by the memory configured for each batch.
Further, before the step of performing granularity cutting on the Map output data of each partition and sorting the cut granules, the method includes:
after a specified number of batches have been processed in the pipelined manner, synchronizing all reduces once;
and after the step of performing granularity cutting on the Map output data of each partition and sorting the cut granules, the method includes:
storing the MOF files sorted by relative bucket ID, and pre-caching them correspondingly in the same order.
Further, the step of shuffling the cut granules of each partition in multiple batches, copying and merging each batch of data in turn and performing pipelined reduce processing, while keeping the data handled by the reduce process in memory, includes:
selectively sorting the data within each granule; if the data within a granule is not sorted, sorting it at the reduce end after copying.
The invention also provides a memory-based MapReduce engine data processing device, which comprises:
a cutting unit, used for performing granularity cutting on the Map output data of each partition and sorting the cut granules;
and a pipeline unit, used for shuffling the cut granules of each partition in multiple batches, copying and merging each batch of data in turn and performing pipelined reduce processing, while keeping the data handled by the reduce process in memory.
Further, the pipeline unit includes:
a shared communication module, used for running each processing stage of the Reduce process as an autonomous sub-thread, with all sub-threads communicating through a shared first-in-first-out asynchronous message queue.
Further, the pipeline unit includes:
a concurrent operation module, used for running the pipelined data processing in multiple concurrent batches and controlling the number of concurrent batches as the size of the available memory divided by the memory configured for each batch.
Further, the memory-based MapReduce engine data processing device includes:
a synchronization unit, which synchronizes all reduces once after a specified number of batches have been processed in the pipelined manner;
and a storage unit, which stores the MOF files sorted by relative bucket ID and pre-caches them correspondingly in the same order.
Further, the pipeline unit includes:
an intra-granule data processing module, used for selectively sorting the data within each granule; if the data within a granule is not sorted, it is sorted at the reduce end after copying.
According to the memory-based MapReduce engine data processing method and device, the reduce process of MapReduce is pipelined purely in software: data is processed in concurrent batches, and the shuffle (copy), merge and reduce of each batch can be kept in memory (In-Memory), greatly reducing IO accesses and latency. The number of concurrent batches can be adjusted according to the amount of available memory, improving the throughput and overall performance of the MapReduce engine, with measured results 1.6-2 times those of the native engine.
Drawings
FIG. 1 is a schematic flow chart illustrating a memory-based MapReduce engine data processing method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a pipelined Reduce process according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart illustrating a memory-based MapReduce engine data processing method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a zookeeper-based batch-wise synchronization process between Reduce processes according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating a mapping relationship of bucket IDs according to an embodiment of the present invention;
FIG. 6 is a block diagram illustrating a schematic structure of a memory-based MapReduce engine data processing apparatus according to an embodiment of the present invention;
FIG. 7 is a block diagram illustrating the structure of a pipeline unit according to an embodiment of the present invention;
FIG. 8 is a block diagram illustrating a structure of a memory-based MapReduce engine data processing apparatus according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, an embodiment of the present invention provides a memory-based MapReduce engine data processing method, including:
s1, performing granularity cutting on the Map output result data of each partition, and sequencing the cut granularity;
and S2, performing multiple batches of shuffle on the granularity of each partition after cutting, sequentially copying and merging the data of each batch, performing pipeline data processing on the reduce, and controlling the data processed by the reduce process in a memory.
As described in step S1, the original data of each partition is cut into finer granules (buckets), and the buckets must be ordered to enable the subsequent pipelined data processing.
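As a minimal illustrative sketch (not taken from the patent text), the granularity cutting of one partition's sorted map output could look as follows in Java; the serialized-record representation and the 4 MB bucket size are assumptions:

```java
import java.util.ArrayList;
import java.util.List;

final class BucketCutter {
    static final int BUCKET_BYTES = 4 * 1024 * 1024; // assumed 4 MB granule size

    /** Each record is an already-sorted key/value pair serialized as bytes. */
    static List<List<byte[]>> cut(List<byte[]> sortedRecords) {
        List<List<byte[]>> buckets = new ArrayList<>();
        List<byte[]> current = new ArrayList<>();
        int size = 0;
        for (byte[] rec : sortedRecords) {
            if (size + rec.length > BUCKET_BYTES && !current.isEmpty()) {
                buckets.add(current);          // close the full bucket
                current = new ArrayList<>();
                size = 0;
            }
            current.add(rec);
            size += rec.length;
        }
        if (!current.isEmpty()) buckets.add(current);
        return buckets;                        // bucket IDs = list indices, in key order
    }
}
```

Because the input records are already sorted and buckets are closed in order, the resulting bucket IDs preserve the key order that the later batched shuffle relies on.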
as described in step S2, in order to implement the pipelining operation, compared to the original entire partition protocol, the new fragment protocol may perform multiple batches (pass) of a partition, and each batch only copies a subset (packet) of the original entire partition data in order, so that the data processed by the reduce process can be controlled in the memory by adjusting the size of the packet. The method comprises the steps that a pure software mode is used for carrying out pipelining (pipelining) design on the reduce process of the MapReduce, namely, data are processed In batches, a shuffle or a copy of each batch is combined (merge) and reduce can be controlled to be carried out In an internal Memory (In-Memory), and IO access and delay are greatly reduced; the number of concurrent batches can be adjusted according to the amount of available memory, so that the throughput and the overall performance of the mapreduce engine are improved.
Referring to fig. 2, in this embodiment, step S2 of shuffling the cut granules of each partition in multiple batches, copying and merging each batch of data in turn and performing pipelined reduce processing, while keeping the data handled by the reduce process in memory, includes:
S21, each processing stage of the Reduce process runs as an autonomous sub-thread, and all sub-threads communicate through a shared first-in-first-out asynchronous message queue.
As described in step S21, the merge thread (merger) is notified when the copy thread (fetcher) finishes, and the reduce thread (reducer) is notified when the merge thread finishes.
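A hedged Java sketch of this thread wiring follows; the class names and the poison-pill protocol are illustrative assumptions rather than the patent's actual implementation, but they show how shared first-in-first-out queues let a finished copy immediately trigger the merge, and a finished merge immediately trigger the reduce:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

final class ReducePipeline {
    static final byte[] EOF = new byte[0];     // poison pill that ends the stream

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<byte[]> fetched = new LinkedBlockingQueue<>();
        BlockingQueue<byte[]> merged  = new LinkedBlockingQueue<>();

        Thread fetcher = new Thread(() -> {    // copies one bucket batch per pass
            for (int pass = 0; pass < 3; pass++) {
                fetched.add(("batch-" + pass).getBytes());
            }
            fetched.add(EOF);
        });
        Thread merger = new Thread(() -> {     // merges each batch as it arrives
            try {
                for (byte[] b; (b = fetched.take()) != EOF; ) merged.add(b);
                merged.add(EOF);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        Thread reducer = new Thread(() -> {    // reduces as soon as merged data arrives
            try {
                for (byte[] b; (b = merged.take()) != EOF; )
                    System.out.println("reduce(" + new String(b) + ")");
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        fetcher.start(); merger.start(); reducer.start();
        fetcher.join(); merger.join(); reducer.join();
    }
}
```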
In this embodiment, step S2 of shuffling the cut granules of each partition in multiple batches, copying and merging each batch of data in turn and performing pipelined reduce processing, while keeping the data handled by the reduce process in memory, includes:
S22, the pipelined data processing runs in multiple concurrent batches, and the number of concurrent batches is controlled as the size of the available memory divided by the memory configured for each batch.
As described in step S22, adapting the number of concurrent batches to the available memory greatly improves throughput compared with the native method.
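Expressed as code, the batch-count rule above is a one-line computation; the helper below is a sketch with illustrative names:

```java
final class BatchPlanner {
    /** Number of concurrent passes = available memory / memory per batch, at least 1. */
    static int concurrentBatches(long availableMemoryBytes, long perBatchMemoryBytes) {
        return (int) Math.max(1, availableMemoryBytes / perBatchMemoryBytes);
    }
}
```

For example, with 4 GB of memory available and 256 MB configured per batch, 16 passes would run concurrently.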
In this embodiment, the step of shuffling the cut granules of each partition in multiple batches, copying and merging each batch of data in turn and performing pipelined reduce processing, while keeping the data handled by the reduce process in memory, includes:
S23, selectively sorting the data within each granule; if the data within a granule is not sorted, sorting it at the reduce end after copying.
As described in step S23, if the data within a granule is not sorted, it can be sorted at the reduce end after copying; this shifts some of the sorting CPU cost from the map end to the reduce end, which helps relieve jobs whose map-end CPU is the bottleneck.
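A small sketch of this deferred sort, under the assumption that granule records are raw key bytes compared in the usual unsigned lexicographic order (class and method names are illustrative):

```java
import java.util.Comparator;
import java.util.List;

final class BucketSorter {
    // unsigned lexicographic byte order, the usual raw-key order in Hadoop
    static final Comparator<byte[]> KEY_ORDER = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = Integer.compare(a[i] & 0xff, b[i] & 0xff);
            if (c != 0) return c;
        }
        return Integer.compare(a.length, b.length);
    };

    /** Sort a copied bucket on the reduce side only when the map side skipped it. */
    static void sortIfNeeded(List<byte[]> bucket, boolean sortedOnMapSide) {
        if (!sortedOnMapSide) {
            bucket.sort(KEY_ORDER);
        }
    }
}
```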
Referring to figs. 3, 4 and 5, in this embodiment, before step S1 of performing granularity cutting on the Map output data of each partition and sorting the cut granules, the method includes:
S11, after a specified number of batches have been processed in the pipelined manner, synchronizing all reduces once;
and after step S1 of performing granularity cutting on the Map output data of each partition and sorting the cut granules, the method includes:
S12, storing the MOF files sorted by relative bucket ID, and pre-caching them correspondingly in the same order.
As described in step S11, because of differences in reduce launch order, node computing capability and network conditions, the batches (passes) that different reduces are running can diverge noticeably; for example, one reduce may be running pass 0 while another is running pass 5. This poses a challenge for batch-by-batch pre-caching of MOFs: memory may not be enough to cache all the data of batches 0 to 5, so a synchronization mechanism is required to lock the batches; only after all reduces have completed the same batch is the next batch processed, ensuring that the MOF cache is hit. Referring to fig. 4, since the reduces may be distributed on different nodes, this embodiment is implemented using zookeeper, the node synchronization mechanism in the hadoop ecosystem: all reduces are synchronized once every specified number of batches, i.e. the gap between reduces never exceeds the specified number of batches.
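One way to realize this batch lock is a ZooKeeper double barrier; the sketch below uses Apache Curator's DistributedDoubleBarrier recipe, with the connection string, barrier paths, reduce count and synchronization interval all being illustrative assumptions rather than values from the patent:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.barriers.DistributedDoubleBarrier;
import org.apache.curator.retry.ExponentialBackoffRetry;

final class BatchBarrier {
    public static void main(String[] args) throws Exception {
        int reduceCount = 8;          // assumed number of reduce tasks in the job
        int syncEvery   = 2;          // assumed: synchronize every 2 batches
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "zk-host:2181", new ExponentialBackoffRetry(1000, 3));
        zk.start();
        for (int pass = 0; pass < 6; pass++) {
            processBatch(pass);       // fetch + merge + reduce for this pass
            if ((pass + 1) % syncEvery == 0) {
                DistributedDoubleBarrier barrier = new DistributedDoubleBarrier(
                        zk, "/mapreduce/pass-" + pass, reduceCount);
                barrier.enter();      // block until every reduce reaches this pass
                barrier.leave();      // then all proceed to the next batch together
            }
        }
        zk.close();
    }
    static void processBatch(int pass) { /* pipeline work for one pass */ }
}
```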
Referring to fig. 5, as described in step S12, since the multiple shuffle batches of each reduce proceed in order starting from bucket 0, the MOF files are stored sorted by relative bucket ID and correspondingly pre-cached in the same order; this new pre-caching method effectively reduces the probability of random IO and increases the probability of cache hits.
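A minimal sketch of this sequential pre-caching, assuming the MOF carries an index of per-bucket offsets and sizes (the file layout and names are assumptions for illustration):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MofPreCache {
    /** Read buckets in ascending bucket-ID order, so disk access stays sequential. */
    static ByteBuffer[] preCache(Path mofFile, long[] bucketOffsets, int[] bucketSizes)
            throws IOException {
        ByteBuffer[] cache = new ByteBuffer[bucketSizes.length];
        try (FileChannel ch = FileChannel.open(mofFile, StandardOpenOption.READ)) {
            for (int id = 0; id < bucketSizes.length; id++) {
                ByteBuffer buf = ByteBuffer.allocateDirect(bucketSizes[id]);
                ch.read(buf, bucketOffsets[id]); // one read per bucket, sketch only
                buf.flip();
                cache[id] = buf;
            }
        }
        return cache;
    }
}
```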
Because the pipelined computation is memory-based, performance is strongly affected by how memory is used and managed. Memory managed by the Java Virtual Machine (JVM) is limited by the performance of Garbage Collection, which is clearly unsuitable for the present invention; the invention therefore allocates memory directly from the system and manages it itself, instead of relying on the JVM.
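A sketch of such self-managed, GC-free buffering using direct (off-heap) ByteBuffers; the pool structure and sizing are illustrative assumptions, not the patent's actual memory manager:

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class DirectBufferPool {
    private final BlockingQueue<ByteBuffer> free;

    DirectBufferPool(int buffers, int bufferBytes) {
        free = new ArrayBlockingQueue<>(buffers);
        for (int i = 0; i < buffers; i++) {
            free.add(ByteBuffer.allocateDirect(bufferBytes)); // allocated outside the JVM heap
        }
    }
    ByteBuffer acquire() throws InterruptedException {
        return free.take();            // blocks until a buffer is recycled
    }
    void release(ByteBuffer buf) {
        buf.clear();                   // reset position/limit for reuse
        free.add(buf);
    }
}
```

Since the buffers are allocated once and recycled rather than discarded, Garbage Collection never scans the shuffle data.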
In one embodiment, a comparative experiment was performed on 4 data nodes, and the experimental data were analyzed as follows:
(1) Test environment: 4 data nodes
Hadoop software: three suppliers (CDH, HDP, MapR) gave similar results
CPU: 2 x 8 cores
RAM: 128 GB
Disk: 12 x 2 TB
(2) The results are shown in the following table:
[Results table shown as an image in the original publication]
As can be seen from the table, the data processing capacity of MapReduce is significantly improved, to about 1.6-2 times that of the native engine.
According to the memory-based MapReduce engine data processing method, the reduce process of MapReduce is pipelined purely in software: data is processed in concurrent batches, and the shuffle (copy), merge and reduce of each batch can be kept in memory (In-Memory), greatly reducing IO accesses and latency. The number of concurrent batches can be adjusted according to the amount of available memory, improving the throughput and overall performance of the MapReduce engine, with measured results 1.6-2 times those of the native engine.
Referring to fig. 6, an embodiment of the present invention further provides a memory-based MapReduce engine data processing apparatus, including:
a cutting unit 10, configured to perform granularity cutting on the Map output data of each partition and sort the cut granules;
a pipeline unit 20, configured to shuffle the cut granules of each partition in multiple batches, copy and merge each batch of data in turn and perform pipelined reduce processing, while keeping the data handled by the reduce process in memory.
As described above, the cutting unit 10 cuts the original data of each partition into finer granules (buckets), and the buckets must be ordered to enable the subsequent pipelined data processing.
As mentioned above for the pipeline unit 20, to enable pipelining, the new granule-based shuffle protocol, in contrast to the original whole-partition protocol, runs each partition in multiple batches (passes), and each batch copies, in order, only a subset (a packet of buckets) of the whole partition's data; by adjusting the packet size, the data handled by the reduce process can be kept in memory. The reduce process of MapReduce is thus pipelined purely in software: data is processed in concurrent batches, and the shuffle (copy), merge and reduce of each batch can be kept in memory (In-Memory), greatly reducing IO accesses and latency; the number of concurrent batches can be adjusted according to the amount of available memory, improving the throughput and overall performance of the MapReduce engine.
Referring to fig. 7, in this embodiment, the pipeline unit 20 includes:
a shared communication module 21, used for running each processing stage of the Reduce process as an autonomous sub-thread, with all sub-threads communicating through a shared first-in-first-out asynchronous message queue.
In the shared communication module, the merge thread (merger) is notified when the copy thread (fetcher) finishes, and the reduce thread (reducer) is notified when the merge thread finishes. Because order is preserved between batches, the reducer can start the reduce computation directly, without waiting for the remaining data to be copied; throughout the process, data access can be kept in memory, greatly reducing latency.
In this embodiment, the pipeline unit 20 includes:
a concurrent operation module 22, configured to run the pipelined data processing in multiple concurrent batches and to control the number of concurrent batches as the size of the available memory divided by the memory configured for each batch.
With the concurrent operation module, concurrency adapts to the available memory, greatly improving throughput compared with the native method.
In this embodiment, the pipeline unit 20 includes:
an intra-granule data processing module 23, used for selectively sorting the data within each granule; if the data within a granule is not sorted, it is copied and then sorted at the reduce end.
In the intra-granule data processing module 23, if the data within a granule is not sorted, it can be sorted at the reduce end after copying; this shifts some of the sorting CPU cost from the map end to the reduce end, which helps relieve jobs whose map-end CPU is the bottleneck.
Referring to fig. 8, in this embodiment, the memory-based MapReduce engine data processing apparatus further includes:
a synchronization unit 110, which synchronizes all reduces once after a specified number of batches have been processed in the pipelined manner;
a storage unit 120, which stores the MOF files sorted by relative bucket ID and correspondingly pre-caches them in the same order.
In the synchronization unit 110, because of differences in reduce launch order, node computing capability and network conditions, the batches (passes) that different reduces are running can diverge noticeably; for example, one reduce may be running pass 0 while another is running pass 5. This poses a challenge for batch-by-batch pre-caching of MOFs: memory may not be enough to cache all the data of batches 0 to 5, so a synchronization mechanism is required to lock the batches; only after all reduces have completed the same batch is the next batch processed, ensuring that the MOF cache is hit. Referring to fig. 4, since the reduces may be distributed on different nodes, this embodiment is implemented using zookeeper, the node synchronization mechanism in the hadoop ecosystem: all reduces are synchronized once every specified number of batches, i.e. the gap between reduces never exceeds the specified number of batches.
Referring to fig. 5, in the storage unit 120, since the multiple shuffle batches of each reduce proceed in order starting from bucket 0, the MOF files are stored sorted by relative bucket ID and correspondingly pre-cached in the same order; this new pre-caching method effectively reduces the probability of random IO and increases the probability of cache hits.
Because the pipelined computation is memory-based, performance is strongly affected by how memory is used and managed. Memory managed by the Java Virtual Machine (JVM) is limited by the performance of Garbage Collection, which is clearly unsuitable for the present invention; the invention therefore allocates memory directly from the system and manages it itself, instead of relying on the JVM.
In one embodiment, a comparative experiment was performed on 4 data nodes, and the experimental data were analyzed as follows:
(1) Test environment: 4 data nodes
Hadoop software: three suppliers (CDH, HDP, MapR) gave similar results
CPU: 2 x 8 cores
RAM: 128 GB
Disk: 12 x 2 TB
(2) The results are shown in the following table:
[Results table shown as an image in the original publication]
As can be seen from the table, the data processing capacity of MapReduce is significantly improved, to about 1.6-2 times that of the native engine.
According to the memory-based MapReduce engine data processing device, the reduce process of MapReduce is pipelined purely in software: data is processed in concurrent batches, and the shuffle (copy), merge and reduce of each batch can be kept in memory (In-Memory), greatly reducing IO accesses and latency. The number of concurrent batches can be adjusted according to the amount of available memory, improving the throughput and overall performance of the MapReduce engine, with measured results 1.6-2 times those of the native engine.
The above description covers only preferred embodiments of the present invention and does not limit its patent scope; all equivalent structural or process transformations made using the contents of this specification and the accompanying drawings, whether applied directly or indirectly in other related technical fields, are likewise included within the patent protection scope of the present invention.

Claims (8)

1. A memory-based MapReduce engine data processing method, characterized by comprising the following steps:
performing granularity cutting on the Map output data of each partition, and sorting the cut granules;
shuffling the cut granules of each partition in multiple batches, copying and merging each batch of data in turn and performing pipelined reduce processing, while keeping the data handled by the reduce process in memory;
wherein after the step of performing granularity cutting on the Map output data of each partition and sorting the cut granules, the method comprises:
storing the MOF files sorted by relative bucket ID, and pre-caching them correspondingly in the same order;
and the step of shuffling the cut granules of each partition in multiple batches, copying and merging each batch of data in turn and performing pipelined reduce processing, while keeping the data handled by the reduce process in memory, comprises:
each processing stage of the Reduce process running as an autonomous sub-thread, with all sub-threads communicating through a shared first-in-first-out asynchronous message queue.
2. The memory-based MapReduce engine data processing method according to claim 1, wherein the step of shuffling the cut granules of each partition in multiple batches, copying and merging each batch of data in turn and performing pipelined reduce processing, while keeping the data handled by the reduce process in memory, comprises:
the pipelined data processing running in multiple concurrent batches, with the number of concurrent batches controlled as the size of the available memory divided by the memory configured for each batch.
3. The memory-based MapReduce engine data processing method according to claim 2, wherein before the step of performing granularity cutting on the Map output data of each partition and sorting the cut granules, the method comprises:
after a specified number of batches have been processed in the pipelined manner, synchronizing all reduces once.
4. The memory-based MapReduce engine data processing method according to claim 1, wherein the step of shuffling the cut granules of each partition in multiple batches, copying and merging each batch of data in turn and performing pipelined reduce processing, while keeping the data handled by the reduce process in memory, comprises:
selectively sorting the data within each granule; if the data within a granule is not sorted, sorting it at the reduce end after copying.
5. A memory-based MapReduce engine data processing device, characterized by comprising:
a cutting unit, used for performing granularity cutting on the Map output data of each partition and sorting the cut granules;
a pipeline unit, used for shuffling the cut granules of each partition in multiple batches, copying and merging each batch of data in turn and performing pipelined reduce processing, while keeping the data handled by the reduce process in memory;
wherein the device further comprises: a storage unit that stores the MOF files sorted by relative bucket ID and correspondingly pre-caches them in the same order;
and the pipeline unit comprises:
a shared communication module, used for running each processing stage of the Reduce process as an autonomous sub-thread, with all sub-threads communicating through a shared first-in-first-out asynchronous message queue.
6. The memory-based MapReduce engine data processing device according to claim 5, wherein the pipeline unit comprises:
a concurrent operation module, used for running the pipelined data processing in multiple concurrent batches and controlling the number of concurrent batches as the size of the available memory divided by the memory configured for each batch.
7. The memory-based MapReduce engine data processing device according to claim 6, further comprising:
a synchronization unit that synchronizes all reduces once after a specified number of batches have been processed in the pipelined manner.
8. The memory-based MapReduce engine data processing device according to claim 5, wherein the pipeline unit comprises:
an intra-granule data processing module, used for selectively sorting the data within each granule; if the data within a granule is not sorted, it is sorted at the reduce end after copying.
CN201610305911.4A 2016-05-10 2016-05-10 MapReduce engine data processing method and device based on memory Active CN106648451B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610305911.4A CN106648451B (en) 2016-05-10 2016-05-10 MapReduce engine data processing method and device based on memory

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610305911.4A CN106648451B (en) 2016-05-10 2016-05-10 MapReduce engine data processing method and device based on memory

Publications (2)

Publication Number Publication Date
CN106648451A CN106648451A (en) 2017-05-10
CN106648451B true CN106648451B (en) 2020-09-08

Family

ID=58848688

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610305911.4A Active CN106648451B (en) 2016-05-10 2016-05-10 MapReduce engine data processing method and device based on memory

Country Status (1)

Country Link
CN (1) CN106648451B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101197016A (en) * 2006-12-08 2008-06-11 鸿富锦精密工业(深圳)有限公司 Batching concurrent job processing equipment and method
CN102222092A (en) * 2011-06-03 2011-10-19 复旦大学 Massive high-dimension data clustering method for MapReduce platform
CN103793425A (en) * 2012-10-31 2014-05-14 国际商业机器公司 Data processing method and data processing device for distributed system
CN104021194A (en) * 2014-06-13 2014-09-03 浪潮(北京)电子信息产业有限公司 Mixed type processing system and method oriented to industry big data diversity application
CN104598562A (en) * 2015-01-08 2015-05-06 浪潮软件股份有限公司 XML file processing method and device based on MapReduce parallel computing model

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9146830B2 (en) * 2012-10-26 2015-09-29 Jsmapreduce Corporation Hybrid local/remote infrastructure for data processing with lightweight setup, powerful debuggability, controllability, integration, and productivity features
CN103970520B (en) * 2013-01-31 2017-06-16 国际商业机器公司 Method for managing resource, device and architecture system in MapReduce frameworks
US9389994B2 (en) * 2013-11-26 2016-07-12 International Business Machines Corporation Optimization of map-reduce shuffle performance through shuffler I/O pipeline actions and planning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101197016A (en) * 2006-12-08 2008-06-11 鸿富锦精密工业(深圳)有限公司 Batching concurrent job processing equipment and method
CN102222092A (en) * 2011-06-03 2011-10-19 复旦大学 Massive high-dimension data clustering method for MapReduce platform
CN103793425A (en) * 2012-10-31 2014-05-14 国际商业机器公司 Data processing method and data processing device for distributed system
CN104021194A (en) * 2014-06-13 2014-09-03 浪潮(北京)电子信息产业有限公司 Mixed type processing system and method oriented to industry big data diversity application
CN104598562A (en) * 2015-01-08 2015-05-06 浪潮软件股份有限公司 XML file processing method and device based on MapReduce parallel computing model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Shuffle optimization and reconstruction in MapReduce; Peng Fuquan et al.; China Sciencepaper; 2012-08-02; Vol. 7, No. 4; pp. 242-243 *

Also Published As

Publication number Publication date
CN106648451A (en) 2017-05-10

Similar Documents

Publication Publication Date Title
US11734179B2 (en) Efficient work unit processing in a multicore system
US8381230B2 (en) Message passing with queues and channels
US20160132541A1 (en) Efficient implementations for mapreduce systems
WO2019223596A1 (en) Method, device, and apparatus for event processing, and storage medium
CN100375093C (en) Processing of multiroute processing element data
US20150286414A1 (en) Scanning memory for de-duplication using rdma
CN107153643B (en) Data table connection method and device
Frey et al. A spinning join that does not get dizzy
US20210311891A1 (en) Handling an input/output store instruction
Yu et al. Design and evaluation of network-levitated merge for hadoop acceleration
WO2022083197A1 (en) Data processing method and apparatus, electronic device, and storage medium
CN108139927B (en) Action-based routing of transactions in an online transaction processing system
US9317346B2 (en) Method and apparatus for transmitting data elements between threads of a parallel computer system
US8543722B2 (en) Message passing with queues and channels
US7466716B2 (en) Reducing latency in a channel adapter by accelerated I/O control block processing
GB2517780A (en) Improved checkpoint and restart
US10061725B2 (en) Scanning memory for de-duplication using RDMA
US20130110968A1 (en) Reducing latency in multicast traffic reception
US20110302377A1 (en) Automatic Reallocation of Structured External Storage Structures
Guo et al. Distributed join algorithms on multi-CPU clusters with GPUDirect RDMA
CN106648451B (en) MapReduce engine data processing method and device based on memory
US11714741B2 (en) Dynamic selective filtering of persistent tracing
CN110955461B (en) Processing method, device, system, server and storage medium for computing task
Song et al. Cascade: A Platform for Delay-Sensitive Edge Intelligence
WO2021057759A1 (en) Memory migration method, device, and computing apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant