CN103699442B - Iterative data processing method under the MapReduce computing framework - Google Patents
- Publication number: CN103699442B
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
The present invention proposes an iterative data processing method under the MapReduce computing framework, comprising the following steps: S10, read the raw data and parse it into independent data items; S20, distribute the input data to each thread or process for processing using the Shuffle Grouping mechanism; S30, hash-regroup and sort the data, and distribute the sorted data to each thread or process using the Fields Grouping mechanism; S40, each thread or process sorts and groups the data in its buffer pool in real time; S50, send the data to a thread or process for processing; S60, parse the returned computation result into independent data items and repeat steps S20 to S50 until a data item indicating that iteration should stop is sent. The present invention keeps the computational performance of MapReduce from degrading with iteration and also avoids the overhead of creating and destroying virtual machines.
Description
【Technical field】
The present invention relates to an iterative data processing method under the MapReduce computing framework.
【Background technology】
In the big data era, data volumes grow explosively, which places high demands on data processing. The emergence of the Hadoop ecosystem provides a powerful tool for large-scale computation over massive data and for reliable distributed storage. Within Hadoop, MapReduce is a reliable, easy-to-use, and scalable key component for computing over massive data. Many data analysis and computation methods map naturally onto the MapReduce computing framework, which is why MapReduce is widely applied in massive data analysis. In practice, however, iterative computation under the MapReduce framework is constrained by how the Hadoop ecosystem implements execution, so the performance of iterative computation suffers.
Under the MapReduce computing framework, a massive data set is divided into several data blocks. Each Map task processes one data block and outputs a queue of key-value pairs. In the Shuffle stage, all key-value pairs are hash-regrouped and sorted by key to form key-value_list pairs. In the Reduce stage, each key-value_list pair is processed independently and its result is output.
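This batch data flow can be sketched without any framework. The following is a minimal illustration; the function names and the word-count example are ours, not part of the patent:

```python
from collections import defaultdict

def map_phase(blocks, map_fn):
    # Each Map task processes one data block and emits (key, value) pairs.
    pairs = []
    for block in blocks:
        pairs.extend(map_fn(block))
    return pairs

def shuffle_phase(pairs):
    # Hash-regroup all (key, value) pairs and sort by key,
    # yielding (key, value_list) pairs for the Reduce stage.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(grouped, reduce_fn):
    # Each (key, value_list) pair is processed independently.
    return [reduce_fn(key, values) for key, values in grouped]

# Word count, the canonical MapReduce example:
blocks = [["a", "b"], ["b", "b"]]
pairs = map_phase(blocks, lambda block: [(word, 1) for word in block])
result = reduce_phase(shuffle_phase(pairs), lambda k, vs: (k, sum(vs)))
# result == [("a", 1), ("b", 3)]
```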
Iterative computation under the MapReduce computing framework is limited in two ways: (1) the intermediate data between two MapReduce jobs must be written back to the Hadoop Distributed File System (HDFS), which costs performance; (2) Map and Reduce cannot themselves execute iteratively, so iterative computation requires chaining MapReduce jobs, which incurs the overhead of creating and destroying Java virtual machines and hurts performance. To work around this, the prior art chains multiple MapReduce jobs, but both shortcomings remain.
【Summary of the invention】
The present invention aims to address the above problems in the prior art by proposing an iterative data processing method under the MapReduce computing framework.
The iterative data processing method under the MapReduce computing framework proposed by the present invention comprises the following steps: S10, a ReadNode reads raw data from the Hadoop distributed file system, parses the raw data into independent data items, and uses the independent data items as the input data of a MapNode; S20, the MapNode uses the Shuffle Grouping mechanism to distribute the input data to each thread or process of the MapNode for processing, outputting <key,value>-formatted data for each independent data item; S30, a ShuffleNode hash-regroups the <key,value> pairs, sorts them by key, and uses the Fields Grouping mechanism to distribute the sorted <key,value> pairs to each thread or process of the ShuffleNode; S40, each thread or process of the ShuffleNode stores incoming <key,value> pairs in a local KVlist buffer pool in real time until it receives the <key,value> indicating that data sending has finished, then sorts and groups the <key,value> pairs in the KVlist buffer pool by key and outputs {i,<key,value_list>}-formatted data for each group, where i is the number of the current thread or process; S50, a ReduceNode sends each {i,<key,value_list>} to its i-th thread or process for processing, outputting <key',value'>; S60, a CoordinateNode receives and buffers <key',value'> until it receives the data item indicating that sending has finished, then returns the computation result based on <key',value'> to the ReadNode; the ReadNode parses the computation result into independent data items, and steps S20 to S50 are repeated until the ReduceNode sends a data item indicating that iteration should stop, whereupon the CoordinateNode exits.
The iterative data processing method under the MapReduce computing framework proposed by the present invention realizes an iterable MapReduce computing framework on top of stream computing, keeping the computational performance of MapReduce unaffected by iteration. The method spares intermediate data from being written back to the distributed file system, avoids the overhead of creating and destroying Java virtual machines, and supports more flexible and efficient implementations of data analysis and processing algorithms.
【Brief description of the drawings】
Fig. 1 is a flow chart of the iterative data processing method under the MapReduce computing framework proposed by the present invention.
Fig. 2 is a topological structure diagram of the iterative data processing method under the MapReduce computing framework according to a first embodiment of the present invention.
Fig. 3 is a topological structure diagram of the iterative data processing method under the MapReduce computing framework according to a second embodiment of the present invention.
Fig. 4 is a topological structure diagram of the iterative data processing method under the MapReduce computing framework according to a third embodiment of the present invention.
【Detailed description of the embodiments】
The present invention is described in further detail below with reference to specific embodiments and the accompanying drawings. Examples of the embodiments are shown in the drawings, where the same or similar reference numerals denote the same or similar elements throughout. The embodiments described below with reference to the drawings are exemplary, serve only to explain the technical solution of the present invention, and are not to be construed as limiting the present invention.
In the description of the present invention, terms such as "inner", "outer", "longitudinal", "transverse", "upper", "lower", "top", and "bottom" indicate orientations or positional relationships based on those shown in the drawings; they are used only to ease the description of the present invention, do not require the present invention to be constructed and operated in a particular orientation, and are therefore not to be construed as limiting the present invention.
The present invention provides an iterative data processing method under the MapReduce computing framework. As shown in Fig. 1, the method comprises the following steps: S10, a ReadNode reads raw data from the Hadoop distributed file system, parses the raw data into independent data items, and uses the independent data items as the input data of a MapNode; S20, the MapNode uses the Shuffle Grouping mechanism to distribute the input data to each thread or process of the MapNode for processing, outputting <key,value>-formatted data for each independent data item; S30, a ShuffleNode hash-regroups the <key,value> pairs, sorts them by key, and uses the Fields Grouping mechanism to distribute the sorted <key,value> pairs to each thread or process of the ShuffleNode; S40, each thread or process of the ShuffleNode stores incoming <key,value> pairs in a local KVlist buffer pool in real time until it receives the <key,value> indicating that data sending has finished, then sorts and groups the <key,value> pairs in the KVlist buffer pool by key and outputs {i,<key,value_list>}-formatted data for each group, where i is the number of the current thread or process; S50, a ReduceNode sends each {i,<key,value_list>} to its i-th thread or process for processing, outputting <key',value'>; S60, a CoordinateNode receives and buffers <key',value'> until it receives the data item indicating that sending has finished, then returns the computation result based on <key',value'> to the ReadNode; the ReadNode parses the computation result into independent data items, and steps S20 to S50 are repeated until the ReduceNode sends a data item indicating that iteration should stop, whereupon the CoordinateNode exits.
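Taken together, steps S20 through S60 form a feedback loop: the Reduce output is routed back through the CoordinateNode and becomes the next round's input. A minimal single-process sketch of that loop follows; the toy map, reduce, and stop functions are our own stand-ins, not the patent's:

```python
from collections import defaultdict

def run_round(items, map_fn, reduce_fn):
    # S20/S30: map each data item and hash-regroup the <key,value> output.
    groups = defaultdict(list)
    for item in items:
        for key, value in map_fn(item):
            groups[key].append(value)
    # S40/S50: sort by key, group, and reduce each <key,value_list>.
    return [reduce_fn(key, values) for key, values in sorted(groups.items())]

def iterate(items, map_fn, reduce_fn, should_stop, max_rounds=100):
    # S60: feed each round's result back in as the next round's input,
    # as the CoordinateNode does when it returns results to the ReadNode.
    for _ in range(max_rounds):
        items = run_round(items, map_fn, reduce_fn)
        if should_stop(items):
            break
    return items

# Toy fixed point: halve and merge values until every value drops below 2.
out = iterate(
    [("x", 40), ("x", 24)],
    map_fn=lambda kv: [(kv[0], kv[1] // 2)],
    reduce_fn=lambda key, values: (key, sum(values)),
    should_stop=lambda items: all(value < 2 for _, value in items),
)
# out == [("x", 1)]
```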
Specifically, reference may also be made to Fig. 2. The iterative data processing method under the MapReduce computing framework proposed by the present invention is based on stream computing: the Map, Shuffle, and Reduce stages of the MapReduce computing framework, as well as the iterator mechanism, are implemented with stream computing.
The overall topology of the method consists of five kinds of nodes: ReadNode, MapNode, ShuffleNode, ReduceNode, and CoordinateNode. The ReadNode is responsible for reading raw data from the distributed file system (Hadoop Distributed File System, HDFS) and parsing it into independent data items fed into the topology one by one. The MapNode implements the Map stage of the MapReduce computing framework; its number of threads or processes determines the number of Map tasks. The ShuffleNode implements the Shuffle stage of the MapReduce computing framework; its number of threads or processes equals that of the ReduceNode. The ReduceNode implements the Reduce stage of the MapReduce computing framework; its number of threads or processes determines the number of Reduce tasks. The CoordinateNode is responsible for data collection and data synchronization when iteration is performed.
The MapNode implements the Map stage of the MapReduce computing framework and receives the data items output by the ReadNode. Preferably, the number of threads or processes in the MapNode is exactly the number of Map tasks, unlike Hadoop's MapReduce framework, where the number of Map tasks is determined by the number of data blocks. Since the data items in a data set are independent of one another, the MapNode uses the Shuffle Grouping mechanism to distribute the data it receives to each thread or process, balancing the computational load. For each data item, the MapNode performs a computation and outputs one <key,value>; when the MapNode receives the specific data item indicating that the data set has been fully sent, it outputs a specific <key,value> indicating that data sending has finished. The ShuffleNode implements the Shuffle stage of the MapReduce computing framework and receives the data items output by the MapNode. Preferably, the number of threads or processes in the ShuffleNode equals that of the ReduceNode, and all outputs of one thread in the ShuffleNode must be received by the single corresponding thread in the ReduceNode. The ShuffleNode is responsible for hash-regrouping the received <key,value> pairs and sorting them by key, and therefore distributes the received <key,value> pairs using the Fields Grouping mechanism.
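The two grouping mechanisms route tuples differently: Shuffle Grouping balances load without regard to content, while Fields Grouping routes by key so equal keys always land on the same downstream thread. A hypothetical sketch (real Storm groupings operate on tuples and task lists; `hash` here merely stands in for a stable hash function):

```python
def fields_grouping(key, num_tasks):
    # Content-based routing: every <key,value> with the same key is sent
    # to the same downstream thread or process.
    return hash(key) % num_tasks

def shuffle_grouping(counter, num_tasks):
    # Load balancing: the target depends only on a running counter,
    # not on the tuple's content.
    return counter % num_tasks
```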
Each thread or process of the ShuffleNode first puts every received <key,value> into a local buffer pool (KVlist) until it receives the specific <key,value> indicating that all data has been sent. Once a thread or process of the ShuffleNode receives that marker, it sorts all <key,value> pairs in its KVlist by key and then groups them, placing <key,value> pairs with the same key into one group. For each group it generates one <key,value_list> and finally outputs {i,<key,value_list>}, where i is the number of the current thread or process. After a thread has processed its entire KVlist, it outputs a specific data item indicating that data sending has finished.
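The drain step of a single ShuffleNode thread, as just described, might look like the following sketch (`drain_kvlist` and its signature are our own naming, not the patent's):

```python
from itertools import groupby
from operator import itemgetter

def drain_kvlist(kvlist, i):
    # On receiving the end-of-data marker: sort the buffered <key,value>
    # pairs by key, group runs of equal keys into <key,value_list>, and
    # tag each group with the current thread/process number i.
    kvlist.sort(key=itemgetter(0))
    return [(i, (key, [value for _, value in run]))
            for key, run in groupby(kvlist, key=itemgetter(0))]

# drain_kvlist([("b", 2), ("a", 1), ("b", 3)], 0)
# == [(0, ("a", [1])), (0, ("b", [2, 3]))]
```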
The ReduceNode implements the Reduce stage of the MapReduce computing framework and receives the data items output by the ShuffleNode: each {i,<key,value_list>} is sent to the i-th thread or process of the ReduceNode.
Whenever a thread or process of the ReduceNode receives one {i,<key,value_list>}, it processes the <key,value_list> and outputs the result in <key',value'> form.
The CoordinateNode is responsible for the data buffering, data synchronization, and data computation of the iterator mechanism. How the CoordinateNode distributes received data to its internal threads or processes depends on the concrete application. When a node Node_i needs to perform an iterative operation, Node_i first sends data items to the CoordinateNode. The CoordinateNode receives and buffers all data items until it receives the specific data item indicating that sending has finished, then performs a computation on the received data items and returns the result to Node_i. When Node_i sends a specific data item indicating that iteration should stop, the CoordinateNode exits.
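The CoordinateNode's buffer-until-finished protocol could be sketched as follows; the two sentinel objects stand in for the patent's special marker data items:

```python
END_OF_DATA = object()     # stands in for the "sending finished" data item
STOP_ITERATION = object()  # stands in for the "stop iteration" data item

def coordinate(stream, compute):
    # Buffer every incoming data item until the end-of-data marker
    # arrives, then run the application-specific computation over the
    # whole buffer; the stop-iteration marker makes the node exit.
    buffer = []
    for item in stream:
        if item is STOP_ITERATION:
            return None          # CoordinateNode exits
        if item is END_OF_DATA:
            return compute(buffer)
        buffer.append(item)
    return None
```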
Fig. 2 shows the case where the CoordinateNode receives and buffers <key',value'> until it receives the data item indicating that sending has finished; the CoordinateNode then returns the computation result based on <key',value'> to the ReadNode, the ReadNode parses the result into independent data items, and steps S20 to S50 are repeated until the ReduceNode sends a data item indicating that iteration should stop, whereupon the CoordinateNode exits.
Fig. 3 shows the case where, after step S20 is performed, the CoordinateNode receives and buffers <key,value> until it receives the data item indicating that sending has finished; the CoordinateNode then returns the computation result based on <key,value> to the MapNode, and step S20 is re-executed with that result as the MapNode's input data.
Fig. 4 shows the case where, after step S40 is performed, the CoordinateNode receives and buffers {i,<key,value_list>} until it receives the data item indicating that sending has finished; the CoordinateNode then returns the computation result based on {i,<key,value_list>} to the ShuffleNode, and steps S30 and S40 are re-executed with that result as the ShuffleNode's input data.
Taking the embodiment of Fig. 3 as an example: for a data set Set, the ReadNode first receives the data and parses it into data items DataEntry_i one by one, then sends them to the MapNode. The MapNode distributes the received data items to its internal threads in the Shuffle Grouping manner. A thread in the MapNode processes one data item and outputs {type,<key,value>}, where type is a 4-bit identifier describing extra information such as whether iteration is needed and whether data sending has finished.
CoordinateNode1 and the ShuffleNode both receive the output of the MapNode. If the type in an input data item indicates that iteration is needed, the ShuffleNode ignores the received data item, while CoordinateNode1 receives the data items and caches them in an array until it receives the specific item indicating that data sending has finished. Once the data has been received, CoordinateNode1 processes the array of data items and outputs one <key,value> describing the processing result; the MapNode receives the output of CoordinateNode1 and processes it again.
When iteration ends, the MapNode must set the corresponding bits in the type of its output {type,<key,value>} to indicate that iteration has ended.
The ShuffleNode receives the output of the MapNode and likewise first checks type. If type indicates that iteration has ended, the ShuffleNode starts receiving the data items {type,<key,value>} and caching them in an array until it receives a data item whose type indicates that data sending has finished. Once the data has been received, the ShuffleNode first sorts all <key,value> pairs in the array by key, then groups them, placing <key,value> pairs with the same key into one group. For each group it generates a <key,value_list> (value_list is a linked list of values) and then outputs {type',<key,value_list>}.
The ReduceNode receives the ShuffleNode's output data items {type',<key,value_list>}, processes them, and outputs the results in <key',value'> form.
CoordinateNode2 receives the output of the ReduceNode. Like CoordinateNode1, CoordinateNode2 receives data items and caches them in an array until it receives the specific item indicating that data sending has finished. Once the data has been received, CoordinateNode2 processes the array of data items, describes the processing result in <key,value> form, and outputs it.
The ReadNode receives the output of CoordinateNode2 and feeds the data to be processed into the whole framework again.
The iterative data processing method under the MapReduce computing framework proposed by the present invention realizes an iterable MapReduce computing framework on top of stream computing, keeping the computational performance of MapReduce unaffected by iteration. The method spares intermediate data from being written back to the distributed file system, avoids the overhead of creating and destroying Java virtual machines, and supports more flexible and efficient implementations of data analysis and processing algorithms.
The iterative data processing method under the MapReduce computing framework proposed by the present invention has been implemented with the Storm stream computing tool, with good experimental results.
Although the present invention has been described with reference to the currently preferred embodiments, those skilled in the art should understand that the above preferred embodiments serve only to explain and illustrate the technical solution of the present invention and do not limit its scope of protection. Any modification, equivalent replacement, variation, or improvement made within the spirit and principles of the present invention shall be included within the scope of the claims of the present invention.
Claims (5)
1. An iterative data processing method under the MapReduce computing framework, comprising the following steps:
S10, a ReadNode reads raw data from the Hadoop distributed file system, parses the raw data into independent data items, and uses the independent data items as the input data of a MapNode;
S20, the MapNode uses the Shuffle Grouping mechanism to distribute the input data to each thread or process of the MapNode for processing, outputting <key,value>-formatted data for each independent data item;
S30, a ShuffleNode hash-regroups the <key,value> pairs, sorts them by key, and uses the Fields Grouping mechanism to distribute the sorted <key,value> pairs to each thread or process of the ShuffleNode;
S40, each thread or process of the ShuffleNode stores incoming <key,value> pairs in a local KVlist buffer pool in real time until it receives the <key,value> indicating that data sending has finished, sorts and groups the <key,value> pairs in the KVlist buffer pool by key, and outputs {i,<key,value_list>}-formatted data for each group, where i is the number of the current thread or process;
S50, a ReduceNode sends each {i,<key,value_list>} to its i-th thread or process for processing, outputting <key',value'>;
S60, a CoordinateNode receives and buffers <key',value'> until it receives the data item indicating that sending has finished, and returns the computation result based on <key',value'> to the ReadNode; the ReadNode parses the computation result into independent data items, and steps S20 to S50 are repeated until the ReduceNode sends a data item indicating that iteration should stop, whereupon the CoordinateNode exits;
the number of threads or processes in the MapNode is the number of Map tasks;
when the MapNode receives the data item indicating that the independent data items have been fully sent, it outputs a <key,value> indicating that data item sending has finished.
2. The iterative data processing method under the MapReduce computing framework according to claim 1, wherein, after step S20 is performed, the CoordinateNode receives and buffers <key,value> until it receives the data item indicating that sending has finished, the CoordinateNode returns the computation result based on <key,value> to the MapNode, and step S20 is re-executed with the computation result as the input data of the MapNode.
3. The iterative data processing method under the MapReduce computing framework according to claim 1, wherein, after step S40 is performed, the CoordinateNode receives and buffers {i,<key,value_list>} until it receives the data item indicating that sending has finished, the CoordinateNode returns the computation result based on {i,<key,value_list>} to the ShuffleNode, and steps S30 and S40 are re-executed with the computation result as the input data of the ShuffleNode.
4. The iterative data processing method under the MapReduce computing framework according to claim 1, wherein the number of threads or processes in the ShuffleNode equals the number of threads or processes in the ReduceNode.
5. The iterative data processing method under the MapReduce computing framework according to claim 4, wherein all outputs of each thread or process in the ShuffleNode are received by one thread or process in the ReduceNode.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310686716.7A CN103699442B (en) | 2013-12-12 | 2013-12-12 | Iterative data processing method under the MapReduce computing framework |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103699442A CN103699442A (en) | 2014-04-02 |
CN103699442B true CN103699442B (en) | 2018-04-17 |
Family
ID=50360981
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310686716.7A Active CN103699442B (en) | 2013-12-12 | 2013-12-12 | Iterative data processing method under the MapReduce computing framework |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103699442B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103995827B (en) * | 2014-04-10 | 2017-08-04 | Peking University | High-performance sorting method under the MapReduce computing framework |
CN105095244A (en) * | 2014-05-04 | 2015-11-25 | 李筑 | Big data algorithm for entrepreneurship cloud platform |
CN104391916A (en) * | 2014-11-19 | 2015-03-04 | 广州杰赛科技股份有限公司 | GPEH data analysis method and device based on distributed computing platform |
CN104391748A (en) * | 2014-11-21 | 2015-03-04 | 浪潮电子信息产业股份有限公司 | Mapreduce computation process optimization method |
CN105354089B (en) * | 2015-10-15 | 2019-02-01 | 北京航空航天大学 | Support the stream data processing unit and system of iterative calculation |
CN107797852A (en) * | 2016-09-06 | 2018-03-13 | 阿里巴巴集团控股有限公司 | The processing unit and processing method of data iteration |
CN114077609B (en) * | 2022-01-19 | 2022-04-22 | 北京四维纵横数据技术有限公司 | Data storage and retrieval method, device, computer readable storage medium and electronic equipment |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102137125A (en) * | 2010-01-26 | 2011-07-27 | 复旦大学 | Method for processing cross task data in distributive network system |
CN103279328A (en) * | 2013-04-08 | 2013-09-04 | 河海大学 | BlogRank algorithm parallelization processing construction method based on Haloop |
- 2013-12-12: CN application CN201310686716.7A granted as patent CN103699442B (en), status Active
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||