CN106648451A - Memory-based MapReduce engine data processing method and apparatus - Google Patents

Memory-based MapReduce engine data processing method and apparatus

Info

Publication number
CN106648451A
Authority
CN
China
Prior art keywords
granularity
data
reduce
data processing
memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610305911.4A
Other languages
Chinese (zh)
Other versions
CN106648451B (en)
Inventor
张伟
王界兵
李�杰
施莹
宋泰然
韦辉华
郭宇翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Frontsurf Information Technology Co Ltd
Original Assignee
Shenzhen Frontsurf Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Frontsurf Information Technology Co Ltd filed Critical Shenzhen Frontsurf Information Technology Co Ltd
Priority to CN201610305911.4A priority Critical patent/CN106648451B/en
Publication of CN106648451A publication Critical patent/CN106648451A/en
Application granted granted Critical
Publication of CN106648451B publication Critical patent/CN106648451B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061Improving I/O performance
    • G06F3/0611Improving I/O performance in relation to response time
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/18File system types
    • G06F16/182Distributed file systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061Improving I/O performance
    • G06F3/0613Improving I/O performance in relation to throughput
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0629Configuration or reconfiguration of storage systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Multi Processors (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention discloses a memory-based MapReduce engine data processing method and apparatus. The method comprises the steps of cutting the Map output result data of each partition into granules and sorting the cut granules; and performing a multi-batch shuffle on the cut granules of each partition, performing pipelined copy, merge and reduce processing on each batch of data in sequence, and controlling the data processed by the reduce process within memory. According to the method and apparatus, the reduce process of MapReduce is pipelined purely in software, so that IO accesses and latency are greatly reduced; and the number of concurrent batches can be adjusted according to the available memory capacity, so that the throughput and overall performance of the MapReduce engine are improved, measured results being 1.6 to 2 times the original.

Description

Memory-based MapReduce engine data processing method and apparatus
Technical field
The present invention relates to the field of MapReduce engine data processing, and in particular to a memory-based MapReduce engine data processing method and apparatus.
Background art
Since the big data processing technology Hadoop emerged, optimizing the performance of its data processing engine, MapReduce, has been a constant focus of industry attention.
Taking reduce as an example, the whole partition (P1) in each Map output file (MOF) is copied to the reduce end; all copies of that partition (P1, P1, P1) are then merged into one ordered total set (P1), which is finally reduced. Because the data volume is uncontrollable, memory overflow in the copy phase is inevitable, and the subsequent merge and reduce phases generate heavy hard-disk IO accesses.
At present, industry reduces IO in MapReduce mainly through compression and combination. The UDA (Unstructured Data Accelerator) of Mellanox attempted a memory-based pipelined (pipelining) MapReduce, but it relies mainly on hardware acceleration techniques, and because the design of its shuffle protocol is limited to a small amount of data per batch, it does not achieve high throughput.
Summary of the invention
The main object of the present invention is to provide a memory-based MapReduce engine data processing method and apparatus that achieve high data throughput purely in software.
To achieve the above object, the present invention first provides a memory-based MapReduce engine data processing method, comprising:
cutting the Map output result data of each partition into granules, and sorting the cut granules;
performing a multi-batch shuffle on the cut granules of each partition, performing pipelined copy, merge and reduce processing on each batch of data in sequence, and controlling the data processed by the reduce process within memory.
Further, the step of performing a multi-batch shuffle on the cut granules of each partition, performing pipelined copy, merge and reduce processing on each batch of data in sequence, and controlling the data processed by the reduce process within memory comprises:
running each processing stage of the reduce process as an autonomous sub-thread, the sub-threads communicating through shared first-in-first-out asynchronous message queues.
Further, the step of performing a multi-batch shuffle on the cut granules of each partition, performing pipelined copy, merge and reduce processing on each batch of data in sequence, and controlling the data processed by the reduce process within memory comprises:
running the pipelined data processing concurrently in multiple batches, the number of concurrent batches being controlled by dividing the size of the available memory by the memory size configured for each batch.
Further, before the step of cutting the Map output result data of each partition into granules and sorting the cut granules, the method comprises:
synchronizing all reducers once after a given number of batches of data have been pipelined;
and after the step of cutting the Map output result data of each partition into granules and sorting the cut granules, the method comprises:
storing the MOF files sorted by relative bucket ID, and correspondingly pre-caching them in the same order.
Further, the step of performing a multi-batch shuffle on the cut granules of each partition, performing pipelined copy, merge and reduce processing on each batch of data in sequence, and controlling the data processed by the reduce process within memory comprises:
optionally sorting the data inside each granule: if the data inside a granule are unsorted, sorting them at the reduce end after copying.
The present invention also provides a memory-based MapReduce engine data processing apparatus, comprising:
a cutting unit for cutting the Map output result data of each partition into granules and sorting the cut granules;
a pipelining unit for performing a multi-batch shuffle on the cut granules of each partition, performing pipelined copy, merge and reduce processing on each batch of data in sequence, and controlling the data processed by the reduce process within memory.
Further, the pipelining unit comprises:
a shared communication module in which each processing stage of the reduce process is an autonomous sub-thread and the sub-threads communicate through shared first-in-first-out asynchronous message queues.
Further, the pipelining unit comprises:
a concurrent running module for running the pipelined data processing concurrently in multiple batches, the number of concurrent batches being controlled by dividing the size of the available memory by the memory size configured for each batch.
Further, the memory-based MapReduce engine data processing apparatus also comprises:
a synchronization unit for synchronizing all reducers once after a given number of batches of data have been pipelined;
a storage unit for storing the MOF files sorted by relative bucket ID and correspondingly pre-caching them in the same order.
Further, the pipelining unit comprises:
a granule-internal data processing module for optionally sorting the data inside each granule: if the data inside a granule are unsorted, they are sorted at the reduce end after copying.
In the memory-based MapReduce engine data processing method and apparatus of the present invention, the reduce process of MapReduce is pipelined (pipelining) purely in software: data are processed concurrently in batches, and the shuffle (copy), merge and reduce of each batch can be kept in memory (in-memory), greatly reducing IO accesses and latency. The number of concurrent batches can be adjusted according to the amount of available memory, improving the throughput and overall performance of the MapReduce engine; measured results are 1.6 to 2 times the original.
Brief description of the drawings
Fig. 1 is a flow diagram of a memory-based MapReduce engine data processing method according to one embodiment of the invention;
Fig. 2 is a schematic diagram of the pipelined reduce process according to one embodiment of the invention;
Fig. 3 is a flow diagram of a memory-based MapReduce engine data processing method according to one embodiment of the invention;
Fig. 4 is a schematic diagram of zookeeper-based batch synchronization between reduce processes according to one embodiment of the invention;
Fig. 5 is a schematic diagram of the mapping relation of bucket IDs according to one embodiment of the invention;
Fig. 6 is a structural block diagram of a memory-based MapReduce engine data processing apparatus according to one embodiment of the invention;
Fig. 7 is a structural block diagram of the pipelining unit according to one embodiment of the invention;
Fig. 8 is a structural block diagram of a memory-based MapReduce engine data processing apparatus according to one embodiment of the invention.
The realization of the objects, functional characteristics and advantages of the present invention is further described below with reference to the embodiments and the accompanying drawings.
Detailed description of the embodiments
It should be appreciated that the specific embodiments described here only explain the present invention and are not intended to limit it.
With reference to Fig. 1, the embodiment of the present invention provides a memory-based MapReduce engine data processing method, comprising the steps:
S1, cutting the Map output result data of each partition into granules, and sorting the cut granules;
S2, performing a multi-batch shuffle on the cut granules of each partition, performing pipelined copy, merge and reduce processing on each batch of data in sequence, and controlling the data processed by the reduce process within memory.
As described in step S1, the data of each original partition are cut into finer granules (buckets); to enable the subsequent pipelined data processing, the buckets must be kept in order.
As described in step S2, to realize pipelined operation, the new shuffle protocol, unlike the original protocol that shuffles a whole partition at once, shuffles a partition in multiple batches (passes), each batch copying in order only a subset (bucket) of the whole partition's data. By adjusting the bucket size, the data processed by the reduce process can be controlled within memory. The reduce process of MapReduce is thus pipelined (pipelining) purely in software: data are processed in batches, and the shuffle (copy), merge and reduce of each batch can be kept in memory (in-memory), greatly reducing IO accesses and latency. The number of concurrent batches can be adjusted according to the amount of available memory, improving the throughput and overall performance of the MapReduce engine.
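As a rough illustration of this granule cutting (a hypothetical Java sketch; the record format, bucket structure and size threshold are assumptions, not the patented implementation), a partition's already-sorted records can be cut into buckets of bounded size so that each shuffle pass moves only a bounded, in-memory amount of data:

    import java.util.ArrayList;
    import java.util.List;

    /** Hypothetical sketch: cut one sorted partition into buckets of at most
     *  bucketBytes each, so that each shuffle pass copies a bounded subset. */
    public class BucketCutter {
        static List<List<byte[]>> cutIntoBuckets(List<byte[]> sortedRecords, long bucketBytes) {
            List<List<byte[]>> buckets = new ArrayList<>();
            List<byte[]> current = new ArrayList<>();
            long size = 0;
            for (byte[] record : sortedRecords) {
                if (size + record.length > bucketBytes && !current.isEmpty()) {
                    buckets.add(current);            // bucket full: start the next one
                    current = new ArrayList<>();
                    size = 0;
                }
                current.add(record);
                size += record.length;
            }
            if (!current.isEmpty()) buckets.add(current);
            return buckets;                          // bucket i is shuffled in pass i
        }
    }

Because the records are sorted before cutting, the buckets themselves remain in order, which is the precondition for the pipelined processing of step S2.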
With reference to Fig. 2, in the present embodiment, step S2 of performing a multi-batch shuffle on the cut granules of each partition, performing pipelined copy, merge and reduce processing on each batch of data in sequence, and controlling the data processed by the reduce process within memory comprises:
S21, running each processing stage of the reduce process as an autonomous sub-thread, the sub-threads communicating through shared first-in-first-out asynchronous message queues.
As described in step S21, the copy thread (fetcher) notifies the merge thread (merger) upon completion, and the merge thread in turn notifies the reduce thread (reducer). Because order is kept between batches, the reducer need not wait for the remaining data to be copied and can perform the reduce computation directly; throughout the process, data accesses can be controlled within memory, greatly reducing latency.
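The hand-off between the fetcher, merger and reducer sub-threads can be pictured with standard Java blocking queues (a minimal sketch under assumed names; the patent states only that the sub-threads share first-in-first-out asynchronous message queues, not which queue implementation is used):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    /** Minimal sketch of the fetcher -> merger -> reducer pipeline:
     *  each stage is its own thread, linked by shared FIFO queues. */
    public class ReducePipeline {
        static final Object EOS = new Object();      // end-of-stream marker

        public static void main(String[] args) {
            BlockingQueue<Object> fetched = new LinkedBlockingQueue<>();
            BlockingQueue<Object> merged  = new LinkedBlockingQueue<>();

            Thread fetcher = new Thread(() -> {
                for (int pass = 0; pass < 3; pass++) {
                    fetched.add("batch-" + pass);    // copy one batch of buckets
                }
                fetched.add(EOS);
            });
            Thread merger = new Thread(() -> {
                try {
                    for (Object b; (b = fetched.take()) != EOS; ) {
                        merged.add("merged(" + b + ")");   // in-memory merge of the batch
                    }
                    merged.add(EOS);
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
            Thread reducer = new Thread(() -> {
                try {
                    for (Object b; (b = merged.take()) != EOS; ) {
                        System.out.println("reduce " + b); // reduce as soon as a batch arrives
                    }
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
            fetcher.start(); merger.start(); reducer.start();
        }
    }

Because batch order is preserved through the queues, the reducer starts computing on batch 0 while later batches are still being copied, which is precisely what removes the waiting described above.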
In the present embodiment, step S2 of performing a multi-batch shuffle on the cut granules of each partition, performing pipelined copy, merge and reduce processing on each batch of data in sequence, and controlling the data processed by the reduce process within memory further comprises:
S22, running the pipelined data processing concurrently in multiple batches, the number of concurrent batches being controlled by dividing the size of the available memory by the memory size configured for each batch.
As described in step S22, the pipeline runs concurrently in adaptation to the available memory size, greatly improving throughput compared with the native method.
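The batch-count rule is plain division; a one-class sketch (all names are assumed for illustration):

    /** Hypothetical helper: concurrent batches = available memory / per-batch memory. */
    public class BatchCount {
        static int concurrentBatches(long freeMemoryBytes, long perBatchBytes) {
            return (int) Math.max(1, freeMemoryBytes / perBatchBytes);  // at least one batch
        }
        public static void main(String[] args) {
            // e.g. 64 GB free and 512 MB configured per batch -> 128 concurrent batches
            System.out.println(concurrentBatches(64L << 30, 512L << 20));
        }
    }

On a smaller node the same rule automatically scales the concurrency down, keeping the working set within memory.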
In the present embodiment, step S2 of performing a multi-batch shuffle on the cut granules of each partition, performing pipelined copy, merge and reduce processing on each batch of data in sequence, and controlling the data processed by the reduce process within memory further comprises:
S23, optionally sorting the data inside each granule: if the data inside a granule are unsorted, sorting them at the reduce end after copying.
As described in step S23, if the data inside a granule are not sorted, they can be sorted at the reduce end after copying; this transfers part of the CPU cost of sorting from the map end to the reduce end, which helps relieve pressure in jobs where the map-end CPU is the bottleneck.
With reference to Fig. 3, Fig. 4 and Fig. 5, in the present embodiment, before step S1 of cutting the Map output result data of each partition into granules and sorting the cut granules, the method comprises:
S11, synchronizing all reducers once after a given number of batches of data have been pipelined;
and after step S1, the method comprises:
S12, storing the MOF files sorted by relative bucket ID, and correspondingly pre-caching them in the same order.
As described in step S11, because reducers start in different orders and nodes differ in computing capability and network conditions, the batch (pass) progress of individual reducers sometimes diverges markedly: one reducer may be running pass 0 while another is already running pass 5. Pre-caching MOF data by batch then becomes a challenge, as memory may be insufficient to cache all the data from batches 0 to 5. A synchronization mechanism is therefore needed to lock the batches: after all reducers complete the same batch, they proceed to the next batch together, ensuring that the MOF cache is hit. With reference to Fig. 4, because the reducers may be distributed on different nodes, the present embodiment uses zookeeper, the node synchronization mechanism of the hadoop ecosystem, to synchronize all reducers once every specified number of batches; that is, the divergence between reducers never exceeds the specified number of batches.
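One way to realize this per-batch lock on ZooKeeper is a double barrier, sketched here with Apache Curator's recipe (an assumption for illustration: the patent states only that zookeeper is used, not which client library, recipe, batch gap or reducer count):

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.barriers.DistributedDoubleBarrier;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    /** Sketch: every SYNC_EVERY passes, all reducers meet at a ZooKeeper barrier,
     *  so no reducer runs more than SYNC_EVERY batches ahead of the slowest one. */
    public class BatchSync {
        static final int SYNC_EVERY = 2;            // assumed batch gap between syncs
        static final int NUM_REDUCERS = 4;          // assumed job-wide reducer count

        public static void main(String[] args) throws Exception {
            CuratorFramework zk = CuratorFrameworkFactory.newClient(
                    "zk-host:2181", new ExponentialBackoffRetry(1000, 3));
            zk.start();
            for (int pass = 0; pass < 10; pass++) {
                processBatch(pass);                 // copy + merge + reduce one batch
                if ((pass + 1) % SYNC_EVERY == 0) {
                    DistributedDoubleBarrier barrier = new DistributedDoubleBarrier(
                            zk, "/mr/sync/pass-" + pass, NUM_REDUCERS);
                    barrier.enter();                // block until all reducers arrive
                    barrier.leave();                // then all proceed together
                }
            }
            zk.close();
        }

        static void processBatch(int pass) { /* placeholder for one pipeline pass */ }
    }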
With reference to Fig. 5, as described in step S12, because the shuffle proceeds in multiple batches, each reducer proceeds in order starting from relative bucket 0. The new method of storing the MOF files sorted by relative bucket ID and correspondingly pre-caching (cache) them in the same order effectively reduces the probability of random IO and increases the cache hit probability.
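Since every reducer consumes buckets in the same relative order, the read-ahead becomes purely sequential; a hypothetical sketch of such a pre-cache follows (file layout, offsets and names are assumed, not taken from the patent):

    import java.io.RandomAccessFile;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    /** Hypothetical pre-cache: MOF files store buckets in relative bucket-ID
     *  order and all reducers consume them in that same order, so loading the
     *  next bucket ahead of time is sequential, not random, IO. */
    public class MofPreCache {
        private final Map<Integer, byte[]> cache = new ConcurrentHashMap<>();

        void preloadBucket(RandomAccessFile mof, int bucketId, long offset, int length)
                throws Exception {
            byte[] data = new byte[length];
            mof.seek(offset);                       // offsets grow with bucket ID
            mof.readFully(data);
            cache.put(bucketId, data);              // ready before the reducer asks
        }

        byte[] takeBucket(int bucketId) {
            return cache.remove(bucketId);          // a hit, assuming batch sync holds
        }
    }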
In the present embodiment, because the computation becomes memory-based after pipelining, memory use and management also strongly affect performance. Memory managed by the Java Virtual Machine (JVM) is constrained by garbage collection (Garbage Collection) performance and is clearly unsuitable for the present invention; the present invention therefore allocates memory directly from the system and manages it itself, instead of relying on the JVM.
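In plain Java, keeping buffers outside the garbage-collected heap is commonly done with direct byte buffers; the following is a sketch of that general technique, not the patent's own allocator:

    import java.nio.ByteBuffer;

    /** Sketch: a direct buffer lives in native memory, outside the JVM heap,
     *  so the garbage collector neither scans nor moves its contents. */
    public class OffHeapBuffer {
        public static void main(String[] args) {
            ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024 * 1024); // 64 MB off-heap
            buf.putLong(42L);                       // write into native memory
            buf.flip();
            System.out.println(buf.getLong());      // reads back 42
        }
    }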
In one embodiment, a comparative test was done on 4 nodes, and the experimental data were comparatively analyzed as follows:
(1) Test environment: 4 data nodes
Hadoop software: results were similar across the three major distributions, CDH, HDP and MAPR
CPU: 2x8 cores
RAM: 128 GB
Disk: 2 TB x 12
(2) Measured results, as in the following table:
As can be seen from the table, the invention significantly improves the data processing capability of MapReduce itself, to roughly 1.6 to 2 times the original.
In the memory-based MapReduce engine data processing method of the present embodiment, the reduce process of MapReduce is pipelined (pipelining) purely in software: data are processed concurrently in batches, and the shuffle (copy), merge and reduce of each batch can be kept in memory (in-memory), greatly reducing IO accesses and latency. The number of concurrent batches can be adjusted according to the amount of available memory, improving the throughput and overall performance of the MapReduce engine; measured results are 1.6 to 2 times the original.
With reference to Fig. 6, the embodiment of the present invention also provides a memory-based MapReduce engine data processing apparatus, comprising:
a cutting unit 10 for cutting the Map output result data of each partition into granules and sorting the cut granules;
a pipelining unit 20 for performing a multi-batch shuffle on the cut granules of each partition, performing pipelined copy, merge and reduce processing on each batch of data in sequence, and controlling the data processed by the reduce process within memory.
In the cutting unit 10, the data of each original partition are cut into finer granules (buckets); to enable the subsequent pipelined data processing, the buckets must be kept in order.
In the pipelining unit 20, to realize pipelined operation, the new shuffle protocol, unlike the original protocol that shuffles a whole partition at once, shuffles a partition in multiple batches (passes), each batch copying in order only a subset (bucket) of the whole partition's data. By adjusting the bucket size, the data processed by the reduce process can be controlled within memory. The reduce process of MapReduce is thus pipelined (pipelining) purely in software: data are processed in batches, and the shuffle (copy), merge and reduce of each batch can be kept in memory (in-memory), greatly reducing IO accesses and latency; the number of concurrent batches can be adjusted according to the amount of available memory, improving the throughput and overall performance of the MapReduce engine.
With reference to Fig. 7, in the present embodiment, the pipelining unit 20 comprises:
a shared communication module 21, in which each processing stage of the reduce process is an autonomous sub-thread and the sub-threads communicate through shared first-in-first-out asynchronous message queues.
In the shared communication module, the copy thread (fetcher) notifies the merge thread (merger) upon completion, and the merge thread in turn notifies the reduce thread (reducer). Because order is kept between batches, the reducer need not wait for the remaining data to be copied and can perform the reduce computation directly; throughout the process, data accesses can be controlled within memory, greatly reducing latency.
In the present embodiment, the pipelining unit 20 comprises:
a concurrent running module 22 for running the pipelined data processing concurrently in multiple batches, the number of concurrent batches being controlled by dividing the size of the available memory by the memory size configured for each batch.
In the concurrent running module, the pipeline runs concurrently in adaptation to the available memory size, greatly improving throughput compared with the native method.
In the present embodiment, the pipelining unit 20 comprises:
a granule-internal data processing module 23 for optionally sorting the data inside each granule: if the data inside a granule are unsorted, they are sorted at the reduce end after copying.
In the granule-internal data processing module 23, if the data inside a granule are not sorted, they can be sorted at the reduce end after copying; this transfers part of the CPU cost of sorting from the map end to the reduce end, which helps relieve pressure in jobs where the map-end CPU is the bottleneck.
With reference to Fig. 8, in the present embodiment, the memory-based MapReduce engine data processing apparatus also comprises:
a synchronization unit 110 for synchronizing all reducers once after a given number of batches of data have been pipelined;
a storage unit 120 for storing the MOF files sorted by relative bucket ID and correspondingly pre-caching them in the same order.
In the synchronization unit 110, because reducers start in different orders and nodes differ in computing capability and network conditions, the batch (pass) progress of individual reducers sometimes diverges markedly: one reducer may be running pass 0 while another is already running pass 5. Pre-caching MOF data by batch then becomes a challenge, as memory may be insufficient to cache all the data from batches 0 to 5. A synchronization mechanism is therefore needed to lock the batches: after all reducers complete the same batch, they proceed to the next batch together, ensuring that the MOF cache is hit. With reference to Fig. 4, because the reducers may be distributed on different nodes, the present embodiment uses zookeeper, the node synchronization mechanism of the hadoop ecosystem, to synchronize all reducers once every specified number of batches; that is, the divergence between reducers never exceeds the specified number of batches.
With reference to Fig. 5, in the storage unit 120, because the shuffle proceeds in multiple batches, each reducer proceeds in order starting from relative bucket 0. Storing the MOF files sorted by relative bucket ID and correspondingly pre-caching (cache) them in the same order effectively reduces the probability of random IO and increases the cache hit probability.
In the present embodiment, because the computation becomes memory-based after pipelining, memory use and management also strongly affect performance. Memory managed by the Java Virtual Machine (JVM) is constrained by garbage collection (Garbage Collection) performance and is clearly unsuitable for the present invention; the present invention therefore allocates memory directly from the system and manages it itself, instead of relying on the JVM.
In one embodiment, a comparative test was done on 4 nodes, and the experimental data were comparatively analyzed as follows:
(1) Test environment: 4 data nodes
Hadoop software: results were similar across the three major distributions, CDH, HDP and MAPR
CPU: 2x8 cores
RAM: 128 GB
Disk: 2 TB x 12
(2) Measured results, as in the following table:
As can be seen from the table, the invention significantly improves the data processing capability of MapReduce itself, to roughly 1.6 to 2 times the original.
In the memory-based MapReduce engine data processing apparatus of the present embodiment, the reduce process of MapReduce is pipelined (pipelining) purely in software: data are processed concurrently in batches, and the shuffle (copy), merge and reduce of each batch can be kept in memory (in-memory), greatly reducing IO accesses and latency. The number of concurrent batches can be adjusted according to the amount of available memory, improving the throughput and overall performance of the MapReduce engine; measured results are 1.6 to 2 times the original.
The foregoing are only preferred embodiments of the present invention and do not thereby limit its scope of claims; every equivalent structure or equivalent process transformation made using the contents of the description and drawings of the invention, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (10)

1. A memory-based MapReduce engine data processing method, characterized by comprising:
cutting the Map output result data of each partition into granules, and sorting the cut granules;
performing a multi-batch shuffle on the cut granules of each partition, performing pipelined copy, merge and reduce processing on each batch of data in sequence, and controlling the data processed by the reduce process within memory.
2. The memory-based MapReduce engine data processing method according to claim 1, characterized in that the step of performing a multi-batch shuffle on the cut granules of each partition, performing pipelined copy, merge and reduce processing on each batch of data in sequence, and controlling the data processed by the reduce process within memory comprises:
running each processing stage of the reduce process as an autonomous sub-thread, the sub-threads communicating through shared first-in-first-out asynchronous message queues.
3. The memory-based MapReduce engine data processing method according to claim 1, characterized in that the step of performing a multi-batch shuffle on the cut granules of each partition, performing pipelined copy, merge and reduce processing on each batch of data in sequence, and controlling the data processed by the reduce process within memory comprises:
running the pipelined data processing concurrently in multiple batches, the number of concurrent batches being controlled by dividing the size of the available memory by the memory size configured for each batch.
4. The memory-based MapReduce engine data processing method according to claim 3, characterized in that before the step of cutting the Map output result data of each partition into granules and sorting the cut granules, the method comprises:
synchronizing all reducers once after a given number of batches of data have been pipelined;
and after the step of cutting the Map output result data of each partition into granules and sorting the cut granules, the method comprises:
storing the MOF files sorted by relative bucket ID, and correspondingly pre-caching them in the same order.
5. The memory-based MapReduce engine data processing method according to claim 1, characterized in that the step of performing a multi-batch shuffle on the cut granules of each partition, performing pipelined copy, merge and reduce processing on each batch of data in sequence, and controlling the data processed by the reduce process within memory comprises:
optionally sorting the data inside each granule: if the data inside a granule are unsorted, sorting them at the reduce end after copying.
6. A memory-based MapReduce engine data processing apparatus, characterized by comprising:
a cutting unit for cutting the Map output result data of each partition into granules and sorting the cut granules;
a pipelining unit for performing a multi-batch shuffle on the cut granules of each partition, performing pipelined copy, merge and reduce processing on each batch of data in sequence, and controlling the data processed by the reduce process within memory.
7. The memory-based MapReduce engine data processing apparatus according to claim 6, characterized in that the pipelining unit comprises:
a shared communication module in which each processing stage of the reduce process is an autonomous sub-thread and the sub-threads communicate through shared first-in-first-out asynchronous message queues.
8. The memory-based MapReduce engine data processing apparatus according to claim 6, characterized in that the pipelining unit comprises:
a concurrent running module for running the pipelined data processing concurrently in multiple batches, the number of concurrent batches being controlled by dividing the size of the available memory by the memory size configured for each batch.
9. The memory-based MapReduce engine data processing apparatus according to claim 6, characterized by further comprising:
a synchronization unit for synchronizing all reducers once after a given number of batches of data have been pipelined;
a storage unit for storing the MOF files sorted by relative bucket ID and correspondingly pre-caching them in the same order.
10. The memory-based MapReduce engine data processing apparatus according to claim 6, characterized in that the pipelining unit comprises:
a granule-internal data processing module for optionally sorting the data inside each granule: if the data inside a granule are unsorted, they are sorted at the reduce end after copying.
CN201610305911.4A 2016-05-10 2016-05-10 MapReduce engine data processing method and device based on memory Active CN106648451B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610305911.4A CN106648451B (en) 2016-05-10 2016-05-10 MapReduce engine data processing method and device based on memory

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610305911.4A CN106648451B (en) 2016-05-10 2016-05-10 MapReduce engine data processing method and device based on memory

Publications (2)

Publication Number Publication Date
CN106648451A true CN106648451A (en) 2017-05-10
CN106648451B CN106648451B (en) 2020-09-08

Family

ID=58848688

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610305911.4A Active CN106648451B (en) 2016-05-10 2016-05-10 MapReduce engine data processing method and device based on memory

Country Status (1)

Country Link
CN (1) CN106648451B (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101197016A (en) * 2006-12-08 2008-06-11 鸿富锦精密工业(深圳)有限公司 Batching concurrent job processing equipment and method
CN102222092A (en) * 2011-06-03 2011-10-19 复旦大学 Massive high-dimension data clustering method for MapReduce platform
US20140123115A1 (en) * 2012-10-26 2014-05-01 Jsmapreduce Corporation Hybrid local/remote infrastructure for data processing with lightweight setup, powerful debuggability, controllability, integration, and productivity features
CN103793425A (en) * 2012-10-31 2014-05-14 国际商业机器公司 Data processing method and data processing device for distributed system
CN103970520A (en) * 2013-01-31 2014-08-06 国际商业机器公司 Resource management method and device in MapReduce framework and framework system with device
US20150150017A1 (en) * 2013-11-26 2015-05-28 International Business Machines Corporation Optimization of map-reduce shuffle performance through shuffler i/o pipeline actions and planning
CN104021194A (en) * 2014-06-13 2014-09-03 浪潮(北京)电子信息产业有限公司 Mixed type processing system and method oriented to industry big data diversity application
CN104598562A (en) * 2015-01-08 2015-05-06 浪潮软件股份有限公司 XML file processing method and device based on MapReduce parallel computing model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
彭辅权 (Peng Fuquan) et al.: "MapReduce中shuffle优化与重构" (Shuffle optimization and restructuring in MapReduce), 《中国科技论文》 (China Sciencepaper) *

Also Published As

Publication number Publication date
CN106648451B (en) 2020-09-08

Similar Documents

Publication Publication Date Title
Zheng et al. Distdgl: distributed graph neural network training for billion-scale graphs
US10831562B2 (en) Method and system for operating a data center by reducing an amount of data to be processed
Verma et al. Breaking the MapReduce stage barrier
Tao et al. Minimal mapreduce algorithms
CN110032442A (en) Accelerate the framework and mechanism of tuple space search using integrated GPU
US10002019B2 (en) System and method for assigning a transaction to a serialized execution group based on an execution group limit for parallel processing with other execution groups
US9986018B2 (en) Method and system for a scheduled map executor
Shatdal et al. Using shared virtual memory for parallel join processing
CN107193898B (en) The inquiry sharing method and system of log data stream based on stepped multiplexing
WO2022083197A1 (en) Data processing method and apparatus, electronic device, and storage medium
Frey et al. A spinning join that does not get dizzy
CN109828790B (en) Data processing method and system based on Shenwei heterogeneous many-core processor
Li et al. Improving the shuffle of hadoop MapReduce
Tseng et al. Accelerating open vSwitch with integrated GPU
US7466716B2 (en) Reducing latency in a channel adapter by accelerated I/O control block processing
CN114253930A (en) Data processing method, device, equipment and storage medium
Guo et al. Distributed join algorithms on multi-CPU clusters with GPUDirect RDMA
CN106648451A (en) Memory-based MapReduce engine data processing method and apparatus
CN110502337A (en) For the optimization system and method for shuffling the stage in Hadoop MapReduce
CN100499564C (en) Packet processing engine
US11582133B2 (en) Apparatus and method for distributed processing of identical packet in high-speed network security equipment
Guo et al. Handling data skew at reduce stage in Spark by ReducePartition
Vinutha et al. In-Memory Cache and Intra-Node Combiner Approaches for Optimizing Execution Time in High-Performance Computing
He et al. Ds 2: Handling data skew using data stealings over high-speed networks
Lu et al. Improving mapreduce performance by using a new partitioner in yarn

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant