CN103699442B - Iterative data processing method under the MapReduce computing framework - Google Patents

Iterative data processing method under the MapReduce computing framework

Info

Publication number
CN103699442B
CN103699442B (application CN201310686716.7A)
Authority
CN
China
Prior art keywords
key
data
value
thread
data item
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310686716.7A
Other languages
Chinese (zh)
Other versions
CN103699442A (en)
Inventor
邹瑜斌
张帆
须成忠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN201310686716.7A priority Critical patent/CN103699442B/en
Publication of CN103699442A publication Critical patent/CN103699442A/en
Application granted granted Critical
Publication of CN103699442B publication Critical patent/CN103699442B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The present invention proposes an iterative data processing method under the MapReduce computing framework, comprising the following steps: S10, read the raw data and parse it into independent data items; S20, distribute the input data to each thread or process for processing using the Shuffle Grouping mechanism; S30, hash-regroup and sort the data, and distribute the sorted data to each thread or process using the Fields Grouping mechanism; S40, each thread or process sorts and groups the data in its buffer pool in real time; S50, send the data to a thread or process for processing; S60, parse the returned computation result into independent data items and repeat steps S20 to S50, until a data item indicating that iteration should stop is sent. The present invention keeps the computational performance of MapReduce unaffected by iteration and also reduces the overhead of creating and destroying virtual machines.

Description

Iterative data processing method under the MapReduce computing framework
【Technical field】
The present invention relates to an iterative data processing method under the MapReduce computing framework.
【Background technology】
In the big-data era, data volume grows explosively, which places high demands on computational processing. The Hadoop ecosystem provides a powerful tool for large-scale computation over massive data and for reliable distributed storage. Within Hadoop, MapReduce is a reliable, easy-to-use, and scalable key component for computation over massive data; the MapReduce computing framework is very friendly to many data analysis and computational methods, which gives it a wide range of applications in massive-data analysis. In practice, however, iterative computation under the MapReduce computing framework is constrained by the implementation of the Hadoop ecosystem, so the performance of iterative computation suffers.
Under the MapReduce computing framework, a data set is divided into several data blocks; each Map processes one data block and outputs a queue of key-value pairs. In the shuffle stage, all key-value pairs are hash-regrouped and sorted by key to form key-value_list pairs. In the Reduce stage, each key-value_list pair is processed separately and the result is output.
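The classic shuffle behavior described above can be sketched as follows. This is a minimal single-process illustration, not the patented implementation; the function name and the use of Python's built-in `hash` for partitioning are assumptions for the sketch.

```python
from collections import defaultdict

def shuffle(pairs, num_reducers):
    """Hash-regroup (key, value) pairs and sort each partition by key."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        # Hash regrouping: all pairs with the same key land in one partition.
        partitions[hash(key) % num_reducers][key].append(value)
    # Sort each partition by key, yielding (key, value_list) pairs.
    return [sorted(p.items()) for p in partitions]

map_output = [("b", 2), ("a", 1), ("b", 3), ("a", 4)]
result = shuffle(map_output, num_reducers=1)
# result[0] == [("a", [1, 4]), ("b", [2, 3])]
```

Each (key, value_list) pair is then handed to one Reduce, exactly as the Reduce stage above describes.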
Iterative computation under the MapReduce computing framework is limited in two ways: (1) intermediate data between two MapReduce tasks must be written back to the Hadoop Distributed File System (HDFS), which incurs a performance loss; (2) Map and Reduce cannot themselves execute iteratively, so iterative computation requires chaining MapReduce tasks, which incurs the overhead of creating and destroying Java virtual machines and hurts performance. To address these problems, the prior art chains multiple MapReduce tasks, but the same two shortcomings remain: intermediate data must still be written back to HDFS, and each chained task still pays the cost of creating and destroying Java virtual machines.
【Summary of the invention】
The present invention seeks to address the above problems in the prior art by proposing an iterative data processing method under the MapReduce computing framework.
The iterative data processing method under the MapReduce computing framework proposed by the present invention comprises the following steps. S10: ReadNode reads raw data from the Hadoop Distributed File System and parses it into independent data items, which serve as the input data of MapNode. S20: MapNode distributes the input data to each of its threads or processes for processing using the Shuffle Grouping mechanism, and outputs one <key, value> record per independent data item. S30: ShuffleNode hash-regroups the <key, value> records, sorts them by key, and distributes the sorted <key, value> records to each of its threads or processes using the Fields Grouping mechanism. S40: each thread or process of ShuffleNode stores incoming <key, value> records in a local KVlist buffer pool in real time until it receives the <key, value> marking the end of transmission; it then sorts and groups the records in the KVlist buffer pool by key and outputs one {i, <key, value_list>} record per group, where i is the number of the current thread or process. S50: ReduceNode sends each {i, <key, value_list>} to its i-th thread or process for processing and outputs <key', value'>. S60: CoordinateNode receives and buffers the <key', value'> records until it receives the data item marking the end of transmission, then returns the computation result based on the <key', value'> records to ReadNode; ReadNode parses the result into independent data items and steps S20 to S50 are repeated, until ReduceNode sends the data item indicating that iteration should stop, whereupon CoordinateNode exits.
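The control flow of steps S10 to S60 can be sketched as follows. This is an illustrative single-process reduction of the method, not the distributed topology the invention describes; `map_fn`, `reduce_fn`, and `converged` are hypothetical stand-ins for the application-specific logic.

```python
from collections import defaultdict

def map_reduce_round(items, map_fn, reduce_fn):
    # S20: MapNode emits one (key, value) per data item.
    pairs = [map_fn(item) for item in items]
    # S30/S40: ShuffleNode regroups, sorts, and builds key -> value_list.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    # S50: ReduceNode processes each (key, value_list).
    return [reduce_fn(k, vs) for k, vs in sorted(groups.items())]

def iterate(raw_items, map_fn, reduce_fn, converged, max_iters=100):
    # S10: raw data arrives as independent items; S60: CoordinateNode feeds
    # the results back as new input until a stop condition holds.
    items = raw_items
    for _ in range(max_iters):
        items = map_reduce_round(items, map_fn, reduce_fn)
        if converged(items):
            break
    return items

# Example: sum all values in a single round, stopping once one total remains.
total = iterate([1, 2],
                map_fn=lambda x: ("sum", x),
                reduce_fn=lambda k, vs: sum(vs),
                converged=lambda items: len(items) == 1 and items[0] >= 3)
# total == [3]
```

The essential difference from vanilla MapReduce is the feedback edge in `iterate`: results flow back as input without touching a distributed file system.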
The iterative data processing method under the MapReduce computing framework proposed by the present invention realizes an iterable MapReduce computing framework on top of stream computing, so the computational performance of MapReduce is not affected by iteration. The method spares intermediate data from being written back to the distributed file system, avoids the overhead of creating and destroying Java virtual machines, and can support the implementation of more flexible and more efficient data analysis and processing algorithms.
【Brief description of the drawings】
Fig. 1 is a flow chart of the iterative data processing method under the MapReduce computing framework proposed by the present invention.
Fig. 2 is a topology diagram of the iterative data processing method under the MapReduce computing framework according to a first embodiment of the present invention.
Fig. 3 is a topology diagram of the iterative data processing method under the MapReduce computing framework according to a second embodiment of the present invention.
Fig. 4 is a topology diagram of the iterative data processing method under the MapReduce computing framework according to a third embodiment of the present invention.
【Embodiment】
The present invention is described in further detail below with reference to specific embodiments and the accompanying drawings. The embodiments described with reference to the drawings are exemplary, and the same or similar reference labels denote the same or similar elements throughout. They serve only to explain the technical solution of the present invention and are not to be construed as limiting it.
In the description of the present invention, terms such as "inner", "outer", "longitudinal", "transverse", "upper", "lower", "top", and "bottom" indicate orientations or positional relationships based on those shown in the drawings. They are used only for convenience of description, do not require the present invention to be constructed and operated in a particular orientation, and are therefore not to be construed as limiting the present invention.
The present invention provides an iterative data processing method under the MapReduce computing framework. As shown in Fig. 1, the method comprises the following steps. S10: ReadNode reads raw data from the Hadoop Distributed File System and parses it into independent data items, which serve as the input data of MapNode. S20: MapNode distributes the input data to each of its threads or processes for processing using the Shuffle Grouping mechanism, and outputs one <key, value> record per independent data item. S30: ShuffleNode hash-regroups the <key, value> records, sorts them by key, and distributes the sorted <key, value> records to each of its threads or processes using the Fields Grouping mechanism. S40: each thread or process of ShuffleNode stores incoming <key, value> records in a local KVlist buffer pool in real time until it receives the <key, value> marking the end of transmission; it then sorts and groups the records in the KVlist buffer pool by key and outputs one {i, <key, value_list>} record per group, where i is the number of the current thread or process. S50: ReduceNode sends each {i, <key, value_list>} to its i-th thread or process for processing and outputs <key', value'>. S60: CoordinateNode receives and buffers the <key', value'> records until it receives the data item marking the end of transmission, then returns the computation result based on the <key', value'> records to ReadNode; ReadNode parses the result into independent data items and steps S20 to S50 are repeated, until ReduceNode sends the data item indicating that iteration should stop, whereupon CoordinateNode exits.
Specifically, reference may also be made to Fig. 2. The iterative data processing method under the MapReduce computing framework proposed by the present invention is based on stream computing: the Map stage, the Shuffle stage, and the Reduce stage of the MapReduce computing framework, as well as the iterator mechanism, are all realized with stream computing.
The whole topology of the iterative data processing method under the MapReduce computing framework proposed by the present invention consists of five kinds of nodes: ReadNode, MapNode, ShuffleNode, ReduceNode, and CoordinateNode. ReadNode is responsible for reading raw data from the distributed file system (Hadoop Distributed File System, HDFS) and parsing it into independent data items that enter the topology one by one. MapNode realizes the Map stage of the MapReduce computing framework; its number of threads or processes determines the number of Maps. ShuffleNode realizes the Shuffle stage of the MapReduce computing framework; its number of threads or processes equals that of ReduceNode. ReduceNode realizes the Reduce stage of the MapReduce computing framework; its number of threads or processes determines the number of Reduces. CoordinateNode is responsible for data collection and data synchronization when iteration is performed.
MapNode realizes the Map stage of the MapReduce computing framework and receives the data items output by ReadNode. Preferably, the number of threads or processes in MapNode is exactly the number of Maps; this differs from Hadoop's MapReduce framework, where the number of Maps is determined by the number of data blocks. Since the data items in the data set are independent of one another, MapNode distributes the data it receives to its threads or processes using the Shuffle Grouping mechanism in order to balance their computational load; for each data item, MapNode performs a computation and outputs one <key, value>. When MapNode receives the special data item indicating that the data set has been fully sent, it outputs a special <key, value> indicating that transmission has finished. ShuffleNode realizes the Shuffle stage of the MapReduce computing framework and receives the data items output by MapNode. Preferably, the number of threads or processes in ShuffleNode equals that of ReduceNode, and all outputs of a given thread in ShuffleNode must be received by one corresponding thread in ReduceNode. ShuffleNode hash-regroups the received <key, value> records and sorts them by key, and therefore distributes the received <key, value> records using the Fields Grouping mechanism. Each thread or process of ShuffleNode first places every received <key, value> into a local buffer pool (KVlist), until it receives the special <key, value> indicating that all data has been sent. It then sorts all <key, value> records in the KVlist by key and groups them, records with the same key forming one group. For each group it generates one <key, value_list>, and finally outputs {i, <key, value_list>}, where i is the number of the current thread or process. After a thread has processed its entire KVlist, it outputs a special data item indicating that transmission has finished.
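One ShuffleNode thread, as described above, can be sketched as follows. This is an assumed single-thread rendering of the described behavior; the `END` sentinel stands in for the special end-of-transmission <key, value>.

```python
END = object()  # stand-in for the special end-of-data <key, value> marker

def shuffle_thread(i, stream):
    """Buffer (key, value) pairs until END, then sort, group, and emit."""
    kvlist = []
    for item in stream:
        if item is END:
            break
        kvlist.append(item)                # S40: buffer in the local KVlist
    kvlist.sort(key=lambda kv: kv[0])      # sort by key
    groups = {}
    for k, v in kvlist:                    # same key -> one value_list
        groups.setdefault(k, []).append(v)
    # Emit one {i, <key, value_list>} per group, tagged with thread number i.
    return [(i, k, vlist) for k, vlist in sorted(groups.items())]

out = shuffle_thread(0, [("b", 1), ("a", 2), ("b", 3), END])
# out == [(0, "a", [2]), (0, "b", [1, 3])]
```

The thread-number tag i is what lets ReduceNode route each group to its i-th thread in step S50.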
ReduceNode realizes the Reduce stage of the MapReduce computing framework and receives the data items output by ShuffleNode: each {i, <key, value_list>} is sent to the i-th thread or process of ReduceNode. Whenever a thread or process of ReduceNode receives an {i, <key, value_list>}, it processes the <key, value_list> and outputs the result in <key', value'> form.
CoordinateNode is responsible for the data buffering, data synchronization, and data computation of the iterator mechanism. How CoordinateNode distributes received data to its internal threads or processes depends on the concrete application.
When a node Node_i needs to perform an iterative operation, Node_i first sends data items to CoordinateNode. CoordinateNode receives and buffers all data items until it receives the special data item indicating that transmission has finished; it then performs a computation based on the received data items and returns the result to Node_i. When Node_i sends the special data item indicating that iteration should stop, CoordinateNode exits.
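The CoordinateNode contract just described can be sketched as a generator. The `DONE` and `STOP` markers and the `compute` callback are hypothetical names for the special data items and the application-specific computation.

```python
DONE, STOP = "DONE", "STOP"  # stand-ins for the special marker data items

def coordinate(stream, compute):
    """Yield one aggregated result per DONE-terminated batch; exit on STOP."""
    buffer = []
    for item in stream:
        if item == STOP:       # node signalled end of iteration: exit
            return
        if item == DONE:       # batch complete: compute and send back
            yield compute(buffer)
            buffer = []
        else:
            buffer.append(item)

results = list(coordinate([1, 2, DONE, 3, 4, DONE, STOP], compute=sum))
# results == [3, 7]
```

Each yielded value corresponds to one iteration's result being returned to Node_i; the `return` on `STOP` models CoordinateNode exiting.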
Fig. 2 shows the following: CoordinateNode receives and buffers <key', value'> until it receives the data item indicating that transmission has finished; it then returns the computation result based on <key', value'> to ReadNode. ReadNode parses the result into independent data items and steps S20 to S50 are repeated, until ReduceNode sends the data item indicating that iteration should stop, whereupon CoordinateNode exits.
Fig. 3 shows that, after step S20 is performed, CoordinateNode receives and buffers <key, value> until it receives the data item indicating that transmission has finished; it then returns the computation result based on <key, value> to MapNode, and step S20 is re-executed with that result as the input data of MapNode.
Fig. 4 shows that, after step S40 is performed, CoordinateNode receives and buffers {i, <key, value_list>} until it receives the data item indicating that transmission has finished; it then returns the computation result based on {i, <key, value_list>} to ShuffleNode, and steps S30 and S40 are re-executed with that result as the input data of ShuffleNode.
For the embodiment of Fig. 3, for example, given a data set Set, ReadNode first receives it and parses it into data items DataEntry_i one by one, which are then sent to MapNode. MapNode distributes the received data items to its internal threads in Shuffle Grouping fashion; a thread in MapNode processes one data item and outputs {type, <key, value>}, where type is a 4-bit identifier describing extra information such as whether iteration is needed and whether transmission has finished.
CoordinateNode1 and ShuffleNode both receive the output of MapNode. If the type field of an input data item indicates that iteration is needed, ShuffleNode ignores the item, while CoordinateNode1 receives the data items and caches them in an array, until it receives the special item indicating that transmission has finished. Once reception is complete, CoordinateNode1 processes the array of data items and outputs one <key, value> describing the processing result; MapNode receives the output of CoordinateNode1 and processes it again.
When iteration ends, MapNode sets the corresponding bit of type in its {type, <key, value>} output to indicate that iteration has finished.
ShuffleNode receives the output of MapNode and likewise first checks type. If type indicates that iteration has ended, ShuffleNode starts receiving the data items {type, <key, value>} and caches them in an array, until it receives a data item whose type indicates that transmission has finished. Once reception is complete, ShuffleNode first sorts all <key, value> records in the array by key and then groups them, records with the same key being placed in one group; it then generates a <key, value_list> for each group (value_list is a linked list of values) and outputs {type', <key, value_list>}.
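The type-flag routing described above can be sketched as follows. The bit layout is an assumption: the description only says type is a 4-bit identifier encoding "needs iteration", "transmission finished", and similar status information.

```python
# Hypothetical bit assignments for the 4-bit type flag.
ITERATE = 0b0001   # item still needs iteration (goes to CoordinateNode)
EOS     = 0b0010   # end-of-stream: all data for this batch has been sent
HALT    = 0b0100   # iteration has converged; terminate

def route(items):
    """Split a MapNode output stream between ShuffleNode and CoordinateNode."""
    to_shuffle, to_coord = [], []
    for type_, kv in items:
        if type_ & ITERATE:    # ShuffleNode ignores items still iterating
            to_coord.append((type_, kv))
        else:
            to_shuffle.append((type_, kv))
    return to_shuffle, to_coord

s, c = route([(ITERATE, ("a", 1)), (0, ("b", 2)), (EOS, None)])
# s == [(0, ("b", 2)), (EOS, None)] and c == [(ITERATE, ("a", 1))]
```

Testing single bits with `&` is what lets one field carry several independent status signals, which matches the description of type as a multi-purpose identifier.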
ReduceNode receives the output data items {type', <key, value_list>} of ShuffleNode, processes them, and outputs the results in <key', value'> form.
CoordinateNode2 receives the output of ReduceNode. Like CoordinateNode1, CoordinateNode2 receives the data items and caches them in an array, until it receives the special item indicating that transmission has finished. Once reception is complete, CoordinateNode2 processes the array of data items, describes the processing result in <key, value> form, and outputs it.
ReadNode receives the output of CoordinateNode2 and again feeds the data to be processed into the whole framework.
The iterative data processing method under the MapReduce computing framework proposed by the present invention realizes an iterable MapReduce computing framework on top of stream computing, so the computational performance of MapReduce is not affected by iteration. The method spares intermediate data from being written back to the distributed file system, avoids the overhead of creating and destroying Java virtual machines, and can support the implementation of more flexible and more efficient data analysis and processing algorithms.
The iterative data processing method under the MapReduce computing framework proposed by the present invention has been implemented with the Storm stream computing tool, with good experimental results.
Although the present invention has been described with reference to the presently preferred embodiments, those skilled in the art should understand that the above embodiments serve only to explain and illustrate the technical solution of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, variation, or improvement made within the spirit and principles of the present invention shall fall within the scope of the claims of the present invention.

Claims (5)

1. An iterative data processing method under the MapReduce computing framework, comprising the following steps:
S10: ReadNode reads raw data from the Hadoop Distributed File System and parses the raw data into independent data items, which serve as the input data of MapNode;
S20: MapNode distributes the input data to each thread or process of MapNode for processing using the Shuffle Grouping mechanism, and outputs <key, value> format data for each independent data item;
S30: ShuffleNode hash-regroups the <key, value> records, performs sorting based on the key values, and distributes the sorted <key, value> records to each thread or process of ShuffleNode using the Fields Grouping mechanism;
S40: each thread or process of ShuffleNode stores the <key, value> records in a local KVlist buffer pool in real time until it receives the <key, value> indicating that data transmission has finished, then sorts and groups the <key, value> records in the KVlist buffer pool based on the key values, and outputs {i, <key, value_list>} format data for each group, wherein i is the number of the current thread or process;
S50: ReduceNode sends each {i, <key, value_list>} to its i-th thread or process for processing, and outputs <key', value'>;
S60: CoordinateNode receives and buffers the <key', value'> records until it receives the data item indicating that transmission has finished; CoordinateNode then returns the computation result based on the <key', value'> records to ReadNode, ReadNode parses the computation result into independent data items, and steps S20 to S50 are repeated, until ReduceNode sends the data item indicating that iteration should stop, whereupon CoordinateNode exits;
the number of threads or processes in MapNode is the number of Maps;
when MapNode receives the data item indicating that the independent data items have been fully sent, MapNode outputs the <key, value> indicating that transmission of data items has finished.
2. The iterative data processing method under the MapReduce computing framework according to claim 1, characterized in that, after step S20 is performed, CoordinateNode receives and buffers <key, value> until it receives the data item indicating that transmission has finished; CoordinateNode then returns the computation result based on <key, value> to MapNode, and step S20 is re-executed with the computation result as the input data of MapNode.
3. The iterative data processing method under the MapReduce computing framework according to claim 1, characterized in that, after step S40 is performed, CoordinateNode receives and buffers {i, <key, value_list>} until it receives the data item indicating that transmission has finished; CoordinateNode then returns the computation result based on {i, <key, value_list>} to ShuffleNode, and steps S30 and S40 are re-executed with the computation result as the input data of ShuffleNode.
4. The iterative data processing method under the MapReduce computing framework according to claim 1, characterized in that the number of threads or processes in ShuffleNode equals the number of threads or processes in ReduceNode.
5. The iterative data processing method under the MapReduce computing framework according to claim 4, characterized in that all outputs of each thread or process in ShuffleNode are received by one thread or process in ReduceNode.
CN201310686716.7A 2013-12-12 2013-12-12 Iterative data processing method under the MapReduce computing framework Active CN103699442B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310686716.7A CN103699442B (en) 2013-12-12 2013-12-12 Iterative data processing method under the MapReduce computing framework

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310686716.7A CN103699442B (en) 2013-12-12 2013-12-12 Iterative data processing method under the MapReduce computing framework

Publications (2)

Publication Number Publication Date
CN103699442A (en) 2014-04-02
CN103699442B true CN103699442B (en) 2018-04-17

Family

ID=50360981

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310686716.7A Active CN103699442B (en) 2013-12-12 2013-12-12 Iterative data processing method under the MapReduce computing framework

Country Status (1)

Country Link
CN (1) CN103699442B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103995827B (en) * 2014-04-10 2017-08-04 北京大学 High-performance sort method in MapReduce Computational frames
CN105095244A (en) * 2014-05-04 2015-11-25 李筑 Big data algorithm for entrepreneurship cloud platform
CN104391916A (en) * 2014-11-19 2015-03-04 广州杰赛科技股份有限公司 GPEH data analysis method and device based on distributed computing platform
CN104391748A (en) * 2014-11-21 2015-03-04 浪潮电子信息产业股份有限公司 Mapreduce computation process optimization method
CN105354089B (en) * 2015-10-15 2019-02-01 北京航空航天大学 Support the stream data processing unit and system of iterative calculation
CN107797852A (en) * 2016-09-06 2018-03-13 阿里巴巴集团控股有限公司 The processing unit and processing method of data iteration
CN114077609B (en) * 2022-01-19 2022-04-22 北京四维纵横数据技术有限公司 Data storage and retrieval method, device, computer readable storage medium and electronic equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102137125A (en) * 2010-01-26 2011-07-27 复旦大学 Method for processing cross task data in distributive network system
CN103279328A (en) * 2013-04-08 2013-09-04 河海大学 BlogRank algorithm parallelization processing construction method based on Haloop

Also Published As

Publication number Publication date
CN103699442A (en) 2014-04-02

Similar Documents

Publication Publication Date Title
CN103699442B (en) Iterative data processing method under the MapReduce computing framework
TWI680409B (en) Method for matrix by vector multiplication for use in artificial neural network
Das et al. Distributed deep learning using synchronous stochastic gradient descent
US10831773B2 (en) Method and system for parallelization of ingestion of large data sets
CN111656367A (en) System and architecture for neural network accelerator
WO2017167095A1 (en) Model training method and device
US20200159810A1 (en) Partitioning sparse matrices based on sparse matrix representations for crossbar-based architectures
KR102163209B1 (en) Method and reconfigurable interconnect topology for multi-dimensional parallel training of convolutional neural network
CN104809161B (en) A kind of method and system that sparse matrix is compressed and is inquired
Yamazaki et al. One-sided dense matrix factorizations on a multicore with multiple GPU accelerators
CN111444134A (en) Parallel PME (pulse-modulated emission) accelerated optimization method and system of molecular dynamics simulation software
CN103995827B (en) High-performance sorting method in the MapReduce computing framework
Plimpton et al. Streaming data analytics via message passing with application to graph algorithms
CN108491924B (en) Neural network data serial flow processing device for artificial intelligence calculation
KR101361080B1 (en) Apparatus, method and computer readable recording medium for calculating between matrices
JP4310500B2 (en) Important component priority calculation method and equipment
US20220382829A1 (en) Sparse matrix multiplication in hardware
Geng et al. Rima: an RDMA-accelerated model-parallelized solution to large-scale matrix factorization
Xia et al. Redundancy-free high-performance dynamic GNN training with hierarchical pipeline parallelism
TWI814734B (en) Calculation device for and calculation method of performing convolution
Shi et al. Accelerating intersection computation in frequent itemset mining with fpga
US11886347B2 (en) Large-scale data processing computer architecture
Yuan et al. Optimizing sparse matrix vector multiplication using diagonal storage matrix format
Lu et al. High-performance homomorphic matrix completion on GPUs
Song et al. Novel graph processor architecture

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant