CN106648451A - Memory-based MapReduce engine data processing method and apparatus - Google Patents
Memory-based MapReduce engine data processing method and apparatus
- Publication number
- CN106648451A (application CN201610305911.4A)
- Authority
- CN
- China
- Prior art keywords
- granularity
- data
- reduce
- data processing
- memory
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F3/0611—Improving I/O performance in relation to response time
- G06F16/182—Distributed file systems
- G06F3/0613—Improving I/O performance in relation to throughput
- G06F3/0629—Configuration or reconfiguration of storage systems
- G06F3/067—Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
Abstract
The invention discloses a memory-based MapReduce engine data processing method and apparatus. The method comprises: cutting the Map output data of each partition into granules (buckets) and sorting the cut granules; shuffling the cut granules of each partition in multiple batches, performing pipelined copy, merge, and reduce processing on each batch of data in turn, and keeping the data processed by the reduce process in memory. By pipelining the reduce process of MapReduce purely in software, the method greatly reduces IO accesses and latency; the number of concurrent batches can be adjusted to the capacity of the available memory, improving the throughput and overall performance of the MapReduce engine. Measured results exceed 1.6 to 2 times the original performance.
Description
Technical field
The present invention relates to the field of MapReduce engine data processing, and in particular to a memory-based MapReduce engine data processing method and apparatus.
Background
Since the big data processing technology Hadoop emerged, optimizing the performance of its data processing engine, MapReduce, has been a continuing industry concern.
Taking reduce as an example: the whole partition (P1) in each Map output file (MOF) is copied to the reduce end, then all the copied partitions (P1, P1, P1) are merged into one ordered overall set (P1), and finally reduce is performed. Because the data volume is uncontrollable, memory overflow in the copy phase is inevitable, and the subsequent merge and reduce phases produce many hard-disk IO accesses.
At present, the main industry methods for reducing IO in MapReduce are compression and combination. Mellanox's memory-based MapReduce, UDA (Unstructured Data Accelerator), attempted pipelining, but it relies mainly on hardware acceleration, and because its shuffle protocol design is limited, each shuffle batch carries only a small amount of data and high throughput is not achieved.
Summary of the invention
The main object of the present invention is to provide a memory-based MapReduce engine data processing method and apparatus that achieve high data throughput purely in software.
To achieve the above object, the present invention first provides a memory-based MapReduce engine data processing method, comprising:
cutting the Map output data of each partition into granules, and sorting the cut granules;
shuffling the cut granules of each partition in multiple batches, performing pipelined copy, merge, and reduce processing on each batch of data in turn, and keeping the data processed by the reduce process in memory.
Further, the step of shuffling the cut granules of each partition in multiple batches, performing pipelined copy, merge, and reduce processing on each batch of data in turn, and keeping the data processed by the reduce process in memory comprises:
running each processing stage of the reduce process as an independent sub-thread, the sub-threads communicating through shared first-in-first-out asynchronous message queues.
Further, the step of shuffling the cut granules of each partition in multiple batches, performing pipelined copy, merge, and reduce processing on each batch of data in turn, and keeping the data processed by the reduce process in memory comprises:
running the pipelined data processing as multiple concurrent batches, the number of concurrent batches being controlled by dividing the size of the available memory by the memory size configured for each batch.
Further, before the step of cutting the Map output data of each partition into granules and sorting the cut granules, the method comprises:
synchronizing all reduces once after every given number of pipelined batches;
and after the step of cutting the Map output data of each partition into granules and sorting the cut granules, the method comprises:
storing the MOF file sorted by relative bucket ID, and pre-caching it in the same order.
Further, the step of shuffling the cut granules of each partition in multiple batches, performing pipelined copy, merge, and reduce processing on each batch of data in turn, and keeping the data processed by the reduce process in memory comprises:
sorting the data inside a granule optionally; if the data inside a granule is unsorted, it is sorted at the reduce end after being copied.
The present invention also provides a memory-based MapReduce engine data processing apparatus, comprising:
a cutting unit, configured to cut the Map output data of each partition into granules and sort the cut granules;
a pipelining unit, configured to shuffle the cut granules of each partition in multiple batches, perform pipelined copy, merge, and reduce processing on each batch of data in turn, and keep the data processed by the reduce process in memory.
Further, the pipelining unit comprises:
a shared communication module, configured to run each processing stage of the reduce process as an independent sub-thread, the sub-threads communicating through shared first-in-first-out asynchronous message queues.
Further, the pipelining unit comprises:
a concurrent running module, configured to run the pipelined data processing as multiple concurrent batches, the number of concurrent batches being controlled by dividing the size of the available memory by the memory size configured for each batch.
Further, the memory-based MapReduce engine data processing apparatus also comprises:
a synchronization unit, configured to synchronize all reduces once after every given number of pipelined batches;
a storage unit, configured to store the MOF file sorted by relative bucket ID and pre-cache it in the same order.
Further, the pipelining unit comprises:
a granule-data processing module, configured to sort the data inside a granule optionally: if the data inside a granule is unsorted, it is sorted at the reduce end after being copied.
In the memory-based MapReduce engine data processing method and apparatus of the present invention, the reduce process of MapReduce is pipelined purely in software: data is processed concurrently in batches, and the shuffle (copy), merge, and reduce of each batch can be kept in memory (In-Memory), greatly reducing IO accesses and latency. The number of concurrent batches can be adjusted according to the amount of available memory, improving the throughput and overall performance of the MapReduce engine; measured results exceed 1.6 to 2 times the original.
Description of the drawings
Fig. 1 is a flow diagram of a memory-based MapReduce engine data processing method according to an embodiment of the invention;
Fig. 2 is a schematic diagram of the pipelined reduce process according to an embodiment of the invention;
Fig. 3 is a flow diagram of a memory-based MapReduce engine data processing method according to an embodiment of the invention;
Fig. 4 is a schematic diagram of zookeeper-based batch synchronization between reduce processes according to an embodiment of the invention;
Fig. 5 is a schematic diagram of the bucket ID mapping relations according to an embodiment of the invention;
Fig. 6 is a structural block diagram of a memory-based MapReduce engine data processing apparatus according to an embodiment of the invention;
Fig. 7 is a structural block diagram of the pipelining unit according to an embodiment of the invention;
Fig. 8 is a structural block diagram of a memory-based MapReduce engine data processing apparatus according to an embodiment of the invention.
The realization of the objects, functional characteristics, and advantages of the invention are further described below with reference to the drawings and the embodiments.
Specific embodiments
It should be understood that the specific embodiments described herein are intended only to explain the present invention, not to limit it.
With reference to Fig. 1, an embodiment of the present invention provides a memory-based MapReduce engine data processing method, comprising the steps:
S1: cutting the Map output data of each partition into granules, and sorting the cut granules;
S2: shuffling the cut granules of each partition in multiple batches, performing pipelined copy, merge, and reduce processing on each batch of data in turn, and keeping the data processed by the reduce process in memory.
In step S1, the data of each original partition is cut into finer granules (buckets); to enable the subsequent pipelined data processing, the buckets must be ordered. In step S2 of this embodiment, to enable pipelined operation, the new shuffle protocol, unlike the original protocol that shuffles whole partitions, shuffles a partition in multiple batches (passes), each batch copying in order only a subset (bucket) of the whole partition's data. By adjusting the bucket size, the data processed by the reduce process can thus be kept in memory. The reduce process of MapReduce is pipelined purely in software: data is processed in batches, and the shuffle (copy), merge, and reduce of each batch can be kept in memory (In-Memory), greatly reducing IO accesses and latency; the number of concurrent batches can be adjusted according to the amount of available memory, improving the throughput and overall performance of the MapReduce engine.
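The bucket-based multi-batch shuffle described above can be sketched as follows. This is an illustrative simplified model, not the patent's implementation: all function names are invented, and key-range bucketing is an assumption made so that consecutive passes concatenate into a globally ordered result. Each pass copies the same bucket from every map output, merges it in memory, and reduces it immediately, so only one bucket per map output is resident at a time.

```python
def split_into_buckets(records, bucket_width):
    """Cut a partition's (key, value) records into ordered key-range buckets:
    bucket i holds keys in [i*bucket_width, (i+1)*bucket_width)."""
    buckets = {}
    for key, value in records:
        buckets.setdefault(key // bucket_width, []).append((key, value))
    return buckets

def sum_reduce(batch):
    """Toy reduce function: sum the values of each key in a sorted batch."""
    out = []
    for key, value in batch:
        if out and out[-1][0] == key:
            out[-1] = (key, out[-1][1] + value)
        else:
            out.append((key, value))
    return out

def multi_pass_reduce(partitions, bucket_width, reduce_fn):
    """Shuffle in multiple passes: each pass copies one bucket from every
    map output, merges it in memory, and reduces it immediately."""
    split = [split_into_buckets(p, bucket_width) for p in partitions]
    bucket_ids = sorted(set().union(*(s.keys() for s in split)))
    results = []
    for bid in bucket_ids:                    # one batch (pass) per bucket
        batch = []
        for s in split:                       # copy phase: same bucket each
            batch.extend(s.get(bid, []))
        batch.sort(key=lambda kv: kv[0])      # in-memory merge phase
        results.extend(reduce_fn(batch))      # reduce phase, no disk spill
    return results
```

Because the buckets are key-range ordered, the per-pass results concatenate into the same ordered output that a single whole-partition shuffle would produce.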
With reference to Fig. 2, in this embodiment, step S2 of shuffling the cut granules of each partition in multiple batches, performing pipelined copy, merge, and reduce processing on each batch of data in turn, and keeping the data processed by the reduce process in memory comprises:
S21: running each processing stage of the reduce process as an independent sub-thread, the sub-threads communicating through shared first-in-first-out asynchronous message queues.
In step S21, the copy thread (fetcher) notifies the merge thread (merger) when it completes, and the merge thread notifies the reduce thread (reducer) when it completes. Because order is preserved between batches, the reducer can perform the reduce computation directly without waiting for the remaining data to be copied; throughout the process, data access can be kept in memory, and latency is greatly reduced.
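The fetcher, merger, and reducer hand-off described above can be sketched with ordinary threads linked by shared FIFO queues. This is a minimal illustration using Python's `queue.Queue` as a stand-in for the asynchronous message queues; all function names are hypothetical.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the batch stream

def fetcher(batches, merge_q):
    for batch in batches:            # "copy" each batch of bucket data
        merge_q.put(batch)
    merge_q.put(DONE)                # notify the merger we are done

def merger(merge_q, reduce_q):
    while (batch := merge_q.get()) is not DONE:
        reduce_q.put(sorted(batch))  # in-memory merge of one batch
    reduce_q.put(DONE)               # notify the reducer we are done

def reducer(reduce_q, results):
    while (batch := reduce_q.get()) is not DONE:
        results.append(sum(batch))   # reduce immediately, no waiting

def run_pipeline(batches):
    merge_q, reduce_q, results = queue.Queue(), queue.Queue(), []
    threads = [threading.Thread(target=fetcher, args=(batches, merge_q)),
               threading.Thread(target=merger, args=(merge_q, reduce_q)),
               threading.Thread(target=reducer, args=(reduce_q, results))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Each stage runs independently, so batch N can be reduced while batch N+1 is still being copied, which is the overlap that hides the copy latency.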
In this embodiment, step S2 of shuffling the cut granules of each partition in multiple batches, performing pipelined copy, merge, and reduce processing on each batch of data in turn, and keeping the data processed by the reduce process in memory comprises:
S22: running the pipelined data processing as multiple concurrent batches, the number of concurrent batches being controlled by dividing the size of the available memory by the memory size configured for each batch.
As described in step S22, running concurrently in adaptation to the available memory greatly improves throughput compared with the native method.
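The sizing rule in S22 (number of concurrent batches = available memory divided by per-batch memory) amounts to the following; parameter names are illustrative.

```python
def concurrent_batches(free_memory_bytes, per_batch_bytes):
    """Number of batches to run concurrently: free memory divided by the
    memory configured per batch, floored, with a minimum of one batch."""
    return max(1, free_memory_bytes // per_batch_bytes)
```

For example, 8 GB of free memory with 512 MB configured per batch yields 16 concurrent batches; when memory is scarce the pipeline degrades gracefully to one batch at a time.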
In this embodiment, the step of shuffling the cut granules of each partition in multiple batches, performing pipelined copy, merge, and reduce processing on each batch of data in turn, and keeping the data processed by the reduce process in memory comprises:
S23: sorting the data inside a granule optionally; if the data inside a granule is unsorted, it is sorted at the reduce end after being copied.
As described in step S23, if the data inside a granule is unsorted, it can be sorted at the reduce end after copying. This transfers some of the CPU cost of sorting from the map end to the reduce end, which relieves pressure on jobs where the map-end CPU becomes the bottleneck.
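The optional in-bucket sort can be sketched as follows (names are hypothetical, and `heapq.merge` stands in for the merge phase): when map-side buckets arrive unsorted, the reduce end sorts each copied bucket first, shifting that CPU cost off the map side.

```python
import heapq

def merge_copied_buckets(buckets, presorted):
    """Merge copied buckets into one ordered run; if the map end skipped the
    in-bucket sort, sort each bucket at the reduce end after the copy."""
    if not presorted:
        buckets = [sorted(b) for b in buckets]  # reduce-end sort, post-copy
    return list(heapq.merge(*buckets))          # streaming merge of runs
```

Either way the merge phase sees sorted runs; the only thing that moves is where the sorting CPU time is spent.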
With reference to Figs. 3, 4, and 5, in this embodiment, before step S1 of cutting the Map output data of each partition into granules and sorting the cut granules, the method comprises:
S11: synchronizing all reduces once after every given number of pipelined batches;
and after step S1, the method comprises:
S12: storing the MOF file sorted by relative bucket ID, and pre-caching it in the same order.
As described in step S11, because reduces start in different orders and nodes differ in computing capability and network conditions, the batch (pass) divergence between running reduces is sometimes obvious: one reduce may be running pass 0 while another is running pass 5. This challenges per-batch pre-caching of the MOF, because memory may be insufficient to cache all the data of batches 0 through 5. A synchronization mechanism is therefore needed to lock batches: only after all reduces complete the processing of the same batch do they proceed together to the next batch, ensuring MOF cache hits. Referring to Fig. 4, because the reduces may be distributed on different nodes, this embodiment uses the node synchronization mechanism of the hadoop ecosystem, zookeeper, to synchronize all reduces once after every given number of batches; that is, the divergence between reduces never exceeds the given number of batches.
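The batch-level lock-step can be illustrated with a local stand-in: a `threading.Barrier` plays the role of the ZooKeeper-based cross-node synchronization, and every `sync_every` batches all reducers wait for one another, bounding how far any reducer can run ahead. All names here are invented for the sketch.

```python
import threading

def run_reducers(num_reducers, num_batches, sync_every, trace):
    """Run reducers in lock-step: all wait at a barrier every sync_every
    batches, so their pass divergence stays bounded (cache stays hot)."""
    barrier = threading.Barrier(num_reducers)

    def reducer(rid):
        for batch in range(num_batches):
            trace.append((rid, batch))      # "process" one batch
            if (batch + 1) % sync_every == 0:
                barrier.wait()              # all reducers re-align here

    threads = [threading.Thread(target=reducer, args=(r,))
               for r in range(num_reducers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

With `sync_every = 2`, no reducer can start batch 2 until every reducer has finished batch 1, which is exactly the cache-hit guarantee the text describes.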
With reference to Fig. 5, as described in step S12, because of the multi-batch shuffle, each reduce proceeds in order starting from relative bucket 0. The new approach of storing the MOF file sorted by relative bucket ID and pre-caching it in the same order effectively reduces random IO and increases the cache hit probability.
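The bucket-ID-ordered layout of S12 can be sketched as follows; the on-disk format here is invented for illustration. Buckets are written in ascending relative-bucket-ID order, so a reduce that fetches buckets in order reads the file at monotonically increasing offsets, i.e. sequential IO that a pre-cache (read-ahead) can exploit.

```python
def write_mof(buckets_by_id):
    """Serialize buckets sorted by relative bucket ID; return (blob, index)
    where index maps bucket ID to (offset, length) in the blob."""
    blob, index, off = b"", {}, 0
    for bid in sorted(buckets_by_id):       # ascending relative bucket ID
        data = buckets_by_id[bid]
        index[bid] = (off, len(data))
        blob += data
        off += len(data)
    return blob, index

def sequential_reads(index):
    """Offsets touched when bucket IDs are fetched in order; they increase
    monotonically, so the access pattern is sequential, not random."""
    return [index[bid][0] for bid in sorted(index)]
```

Because the fetch order and the storage order agree, a simple read-ahead cache of the next region of the file is always the right prediction.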
In this embodiment, because computation becomes memory-based after pipelining, memory use and management also greatly affect performance. Memory managed by the Java Virtual Machine (JVM) is constrained by garbage collection performance and is clearly unsuitable for the present invention; the present invention therefore allocates memory directly from the system and manages it itself, without using the JVM.
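The self-managed allocation idea, one large region obtained from the system up front and sliced and recycled by the engine itself rather than by a garbage collector, can be sketched as a toy fixed-size buffer pool. This structure is entirely illustrative; the patent does not specify it.

```python
class BufferPool:
    """Toy free-list pool: one large backing region, fixed-size slices
    handed out and recycled without allocator or GC churn."""

    def __init__(self, num_buffers, buf_size):
        self._store = bytearray(num_buffers * buf_size)  # one big region
        self._free = list(range(num_buffers))            # free slot ids
        self.buf_size = buf_size

    def acquire(self):
        """Take a free slot; return (slot id, writable view of its slice)."""
        slot = self._free.pop()
        view = memoryview(self._store)
        return slot, view[slot * self.buf_size:(slot + 1) * self.buf_size]

    def release(self, slot):
        """Return a slot to the pool; the memory is reused, never freed."""
        self._free.append(slot)
```

Batch buffers acquired from such a pool live outside any collector's heap accounting, which is the property the text is after.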
In one embodiment, a comparison test was carried out and the experimental data comparatively analyzed, as follows:
(1) Test environment: 4 data nodes
- hadoop software: the three major distributions (CDH, HDP, MAPR) give similar results
- CPU: 2x8 cores
- RAM: 128 GB
- Disk: 12x 2 TB
(2) Measured results, as in the table below:
As can be seen from the table, the invention significantly improves the data processing capability of MapReduce itself, to roughly 1.6 to 2 times the original.
In the memory-based MapReduce engine data processing method of this embodiment, the reduce process of MapReduce is pipelined purely in software: data is processed concurrently in batches, and the shuffle (copy), merge, and reduce of each batch can be kept in memory (In-Memory), greatly reducing IO accesses and latency. The number of concurrent batches can be adjusted according to the amount of available memory, improving the throughput and overall performance of the MapReduce engine; measured results exceed 1.6 to 2 times the original.
With reference to Fig. 6, an embodiment of the present invention also provides a memory-based MapReduce engine data processing apparatus, comprising:
a cutting unit 10, configured to cut the Map output data of each partition into granules and sort the cut granules;
a pipelining unit 20, configured to shuffle the cut granules of each partition in multiple batches, perform pipelined copy, merge, and reduce processing on each batch of data in turn, and keep the data processed by the reduce process in memory.
In the cutting unit 10, the data of each original partition is cut into finer granules (buckets); to enable the subsequent pipelined data processing, the buckets must be ordered. In the pipelining unit 20 of this embodiment, to enable pipelined operation, the new shuffle protocol, unlike the original protocol that shuffles whole partitions, shuffles a partition in multiple batches (passes), each batch copying in order only a subset (bucket) of the whole partition's data. By adjusting the bucket size, the data processed by the reduce process can thus be kept in memory. The reduce process of MapReduce is pipelined purely in software: data is processed in batches, and the shuffle (copy), merge, and reduce of each batch can be kept in memory (In-Memory), greatly reducing IO accesses and latency; the number of concurrent batches can be adjusted according to the amount of available memory, improving the throughput and overall performance of the MapReduce engine.
With reference to Fig. 7, in this embodiment, the pipelining unit 20 comprises:
a shared communication module 21, configured to run each processing stage of the reduce process as an independent sub-thread, the sub-threads communicating through shared first-in-first-out asynchronous message queues.
In the shared communication module, the copy thread (fetcher) notifies the merge thread (merger) when it completes, and the merge thread notifies the reduce thread (reducer) when it completes. Because order is preserved between batches, the reducer can perform the reduce computation directly without waiting for the remaining data to be copied; throughout the process, data access can be kept in memory, and latency is greatly reduced.
In this embodiment, the pipelining unit 20 comprises:
a concurrent running module 22, configured to run the pipelined data processing as multiple concurrent batches, the number of concurrent batches being controlled by dividing the size of the available memory by the memory size configured for each batch.
As described above for the concurrent running module, running concurrently in adaptation to the available memory greatly improves throughput compared with the native method.
In this embodiment, the pipelining unit 20 comprises:
a granule-data processing module 23, configured to sort the data inside a granule optionally: if the data inside a granule is unsorted, it is sorted at the reduce end after being copied.
In the granule-data processing module 23, if the data inside a granule is unsorted, it can be sorted at the reduce end after copying. This transfers some of the CPU cost of sorting from the map end to the reduce end, which relieves pressure on jobs where the map-end CPU becomes the bottleneck.
With reference to Fig. 8, in this embodiment, the memory-based MapReduce engine data processing apparatus also comprises:
a synchronization unit 110, configured to synchronize all reduces once after every given number of pipelined batches;
a storage unit 120, configured to store the MOF file sorted by relative bucket ID and pre-cache it in the same order.
In the synchronization unit 110, because reduces start in different orders and nodes differ in computing capability and network conditions, the batch (pass) divergence between running reduces is sometimes obvious: one reduce may be running pass 0 while another is running pass 5. This challenges per-batch pre-caching of the MOF, because memory may be insufficient to cache all the data of batches 0 through 5. A synchronization mechanism is therefore needed to lock batches: only after all reduces complete the processing of the same batch do they proceed together to the next batch, ensuring MOF cache hits. Referring to Fig. 4, because the reduces may be distributed on different nodes, this embodiment uses the node synchronization mechanism of the hadoop ecosystem, zookeeper, to synchronize all reduces once after every given number of batches; that is, the divergence between reduces never exceeds the given number of batches.
With reference to Fig. 5, in the storage unit 120, because of the multi-batch shuffle, each reduce proceeds in order starting from relative bucket 0. The new approach of storing the MOF file sorted by relative bucket ID and pre-caching it in the same order effectively reduces random IO and increases the cache hit probability.
In this embodiment, because computation becomes memory-based after pipelining, memory use and management also greatly affect performance. Memory managed by the Java Virtual Machine (JVM) is constrained by garbage collection performance and is clearly unsuitable for the present invention; the present invention therefore allocates memory directly from the system and manages it itself, without using the JVM.
In one embodiment, a comparison test was carried out and the experimental data comparatively analyzed, as follows:
(1) Test environment: 4 data nodes
- hadoop software: the three major distributions (CDH, HDP, MAPR) give similar results
- CPU: 2x8 cores
- RAM: 128 GB
- Disk: 12x 2 TB
(2) Measured results, as in the table below:
As can be seen from the table, the invention significantly improves the data processing capability of MapReduce itself, to roughly 1.6 to 2 times the original.
In the memory-based MapReduce engine data processing apparatus of this embodiment, the reduce process of MapReduce is pipelined purely in software: data is processed concurrently in batches, and the shuffle (copy), merge, and reduce of each batch can be kept in memory (In-Memory), greatly reducing IO accesses and latency. The number of concurrent batches can be adjusted according to the amount of available memory, improving the throughput and overall performance of the MapReduce engine; measured results exceed 1.6 to 2 times the original.
The foregoing are only preferred embodiments of the present invention and do not thereby limit its scope of claims; every equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present invention.
Claims (10)
1. A memory-based MapReduce engine data processing method, characterized by comprising:
cutting the Map output data of each partition into granules, and sorting the cut granules;
shuffling the cut granules of each partition in multiple batches, performing pipelined copy, merge, and reduce processing on each batch of data in turn, and keeping the data processed by the reduce process in memory.
2. The memory-based MapReduce engine data processing method according to claim 1, characterized in that the step of shuffling the cut granules of each partition in multiple batches, performing pipelined copy, merge, and reduce processing on each batch of data in turn, and keeping the data processed by the reduce process in memory comprises:
running each processing stage of the reduce process as an independent sub-thread, the sub-threads communicating through shared first-in-first-out asynchronous message queues.
3. The memory-based MapReduce engine data processing method according to claim 1, characterized in that the step of shuffling the cut granules of each partition in multiple batches, performing pipelined copy, merge, and reduce processing on each batch of data in turn, and keeping the data processed by the reduce process in memory comprises:
running the pipelined data processing as multiple concurrent batches, the number of concurrent batches being controlled by dividing the size of the available memory by the memory size configured for each batch.
4. The memory-based MapReduce engine data processing method according to claim 3, characterized in that before the step of cutting the Map output data of each partition into granules and sorting the cut granules, the method comprises:
synchronizing all reduces once after every given number of pipelined batches;
and after the step of cutting the Map output data of each partition into granules and sorting the cut granules, the method comprises:
storing the MOF file sorted by relative bucket ID, and pre-caching it in the same order.
5. The memory-based MapReduce engine data processing method according to claim 1, wherein the step of
performing multi-batch shuffle on the cut granules of each partition, and subjecting the data of each batch in turn to pipelined copy, merge and
reduce processing, with the data handled by the reduce process kept in memory, comprises:
optionally sorting the data inside each granule; if the data inside a granule is unsorted, it is sorted at the reduce side
after being copied.
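The optional in-granule sort of claim 5 can be sketched as a flag carried with each granule. The granule representation (a dict with a `sorted` flag) and the function names are illustrative assumptions; the point is only that the sort happens either on the map side or at the reduce side after the copy, never twice:

```python
def make_granule(records, presort=False):
    # Map side: sorting inside a granule is optional.
    data = sorted(records) if presort else list(records)
    return {"sorted": presort, "data": data}

def reduce_side_receive(granule):
    # Reduce side: if the granule arrived unsorted, sort it here,
    # after the copy phase, before merging.
    data = granule["data"]
    return data if granule["sorted"] else sorted(data)

g = make_granule([("b", 2), ("a", 1)], presort=False)
print(reduce_side_receive(g))  # [('a', 1), ('b', 2)]
```

Deferring the sort to the reduce side lets a loaded map side skip the work, at the cost of doing it after the copy instead.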
6. A memory-based MapReduce engine data processing apparatus, comprising:
a cutting unit, configured to perform granularity cutting on the Map output result data of each partition, and to sort the cut
granules;
a pipeline unit, configured to perform multi-batch shuffle on the cut granules of each partition, and to subject the data of each batch in turn
to pipelined copy, merge and reduce processing, keeping the data handled by the reduce process in memory.
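The two units of claim 6 can be sketched end to end in a few lines. Everything here is an illustrative assumption (function names, fixed granule size, summing as the reduce function); it only shows the shape of the flow: cut a partition's map output into sorted granules, then shuffle the granules in batches, with each batch copied, merged and reduced in memory:

```python
def cut_into_granules(partition_records, granule_size):
    # Cutting unit: split one partition's Map output into fixed-size
    # granules, then sort the granules by their first key.
    granules = [partition_records[i:i + granule_size]
                for i in range(0, len(partition_records), granule_size)]
    return sorted(granules, key=lambda g: g[0][0])

def shuffle_in_batches(granules, batch_size):
    # Pipeline unit: group granules into shuffle batches; each batch is
    # copied, merged and reduced in turn, staying in memory throughout.
    for i in range(0, len(granules), batch_size):
        batch = granules[i:i + batch_size]
        merged = sorted(kv for g in batch for kv in g)   # copy + merge
        yield sum(v for _, v in merged)                  # reduce

records = [("d", 4), ("a", 1), ("c", 3), ("b", 2)]
granules = cut_into_granules(records, granule_size=2)
print(list(shuffle_in_batches(granules, batch_size=1)))  # [5, 5]
```

Cutting into granules first is what makes multi-batch shuffle possible: each batch is small enough to be merged and reduced entirely in memory.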
7. The memory-based MapReduce engine data processing apparatus according to claim 6, wherein the
pipeline unit comprises:
a shared communication module, configured so that each processing stage of the reduce process is an independently running sub-thread, the sub-threads
communicating with one another through a shared first-in-first-out asynchronous message queue.
8. The memory-based MapReduce engine data processing apparatus according to claim 6, wherein the
pipeline unit comprises:
a concurrent running module, configured to run the pipelined data processing in multiple concurrent batches, the number of concurrent
batches being controlled by dividing the size of free memory by the memory configured for each batch.
9. The memory-based MapReduce engine data processing apparatus according to claim 6, further
comprising:
a synchronization unit, configured to synchronize all reduces once after a given batch of data has been pipelined;
a storage unit, configured to store the MOF files sorted by their bucket IDs, and to pre-cache them correspondingly in the same order.
10. The memory-based MapReduce engine data processing apparatus according to claim 6, wherein the
pipeline unit comprises:
a granule-internal data processing module, configured to optionally sort the data inside each granule; if the data inside a granule is unsorted,
it is sorted at the reduce side after being copied.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610305911.4A CN106648451B (en) | 2016-05-10 | 2016-05-10 | MapReduce engine data processing method and device based on memory |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106648451A true CN106648451A (en) | 2017-05-10 |
CN106648451B CN106648451B (en) | 2020-09-08 |
Family
ID=58848688
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610305911.4A Active CN106648451B (en) | 2016-05-10 | 2016-05-10 | MapReduce engine data processing method and device based on memory |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106648451B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101197016A (en) * | 2006-12-08 | 2008-06-11 | 鸿富锦精密工业(深圳)有限公司 | Batching concurrent job processing equipment and method |
CN102222092A (en) * | 2011-06-03 | 2011-10-19 | 复旦大学 | Massive high-dimension data clustering method for MapReduce platform |
US20140123115A1 (en) * | 2012-10-26 | 2014-05-01 | Jsmapreduce Corporation | Hybrid local/remote infrastructure for data processing with lightweight setup, powerful debuggability, controllability, integration, and productivity features |
CN103793425A (en) * | 2012-10-31 | 2014-05-14 | 国际商业机器公司 | Data processing method and data processing device for distributed system |
CN103970520A (en) * | 2013-01-31 | 2014-08-06 | 国际商业机器公司 | Resource management method and device in MapReduce framework and framework system with device |
CN104021194A (en) * | 2014-06-13 | 2014-09-03 | 浪潮(北京)电子信息产业有限公司 | Mixed type processing system and method oriented to industry big data diversity application |
CN104598562A (en) * | 2015-01-08 | 2015-05-06 | 浪潮软件股份有限公司 | XML file processing method and device based on MapReduce parallel computing model |
US20150150017A1 (en) * | 2013-11-26 | 2015-05-28 | International Business Machines Corporation | Optimization of map-reduce shuffle performance through shuffler i/o pipeline actions and planning |
Non-Patent Citations (1)
Title |
---|
彭辅权 (PENG Fuquan) et al.: "Shuffle Optimization and Reconstruction in MapReduce" (MapReduce中shuffle优化与重构), China Sciencepaper (中国科技论文) *
Also Published As
Publication number | Publication date |
---|---|
CN106648451B (en) | 2020-09-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zheng et al. | Distdgl: distributed graph neural network training for billion-scale graphs | |
US10831562B2 (en) | Method and system for operating a data center by reducing an amount of data to be processed | |
Verma et al. | Breaking the MapReduce stage barrier | |
Tao et al. | Minimal mapreduce algorithms | |
CN110032442A (en) | Accelerate the framework and mechanism of tuple space search using integrated GPU | |
US10002019B2 (en) | System and method for assigning a transaction to a serialized execution group based on an execution group limit for parallel processing with other execution groups | |
US9986018B2 (en) | Method and system for a scheduled map executor | |
Shatdal et al. | Using shared virtual memory for parallel join processing | |
CN107193898B (en) | The inquiry sharing method and system of log data stream based on stepped multiplexing | |
WO2022083197A1 (en) | Data processing method and apparatus, electronic device, and storage medium | |
Frey et al. | A spinning join that does not get dizzy | |
CN109828790B (en) | Data processing method and system based on Shenwei heterogeneous many-core processor | |
Li et al. | Improving the shuffle of hadoop MapReduce | |
Tseng et al. | Accelerating open vSwitch with integrated GPU | |
US7466716B2 (en) | Reducing latency in a channel adapter by accelerated I/O control block processing | |
CN114253930A (en) | Data processing method, device, equipment and storage medium | |
Guo et al. | Distributed join algorithms on multi-CPU clusters with GPUDirect RDMA | |
CN106648451A (en) | Memory-based MapReduce engine data processing method and apparatus | |
CN110502337A (en) | For the optimization system and method for shuffling the stage in Hadoop MapReduce | |
CN100499564C (en) | Packet processing engine | |
US11582133B2 (en) | Apparatus and method for distributed processing of identical packet in high-speed network security equipment | |
Guo et al. | Handling data skew at reduce stage in Spark by ReducePartition | |
Vinutha et al. | In-Memory Cache and Intra-Node Combiner Approaches for Optimizing Execution Time in High-Performance Computing | |
He et al. | Ds 2: Handling data skew using data stealings over high-speed networks | |
Lu et al. | Improving mapreduce performance by using a new partitioner in yarn |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||