CN103699442A - Iterable data processing method under MapReduce calculation framework - Google Patents
- Publication number
- CN103699442A (Application CN201310686716.7A)
- Authority
- CN
- China
- Prior art keywords
- key
- value
- data
- thread
- data item
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
The invention provides an iterable data processing method under the MapReduce computation framework, comprising the following steps: S10, reading raw data and parsing it into independent data items; S20, distributing the input data to the threads or processes for processing via a Shuffle Grouping mechanism; S30, performing hash recombination and sorting on the data, and distributing the sorted data to the threads or processes via a Fields Grouping mechanism; S40, each thread or process sorting and grouping the data in a buffer pool in real time; S50, sending the data to the threads or processes for processing; S60, parsing the returned calculation result into independent data items and repeating steps S20 to S50 until a data item indicating the stop of iteration is sent. With this iterable data processing method under the MapReduce computation framework, the computational performance of MapReduce is not degraded by iteration, and the overhead of creating and destroying virtual machines is avoided.
Description
[technical field]
The present invention relates to an iterable data processing method under a MapReduce computation framework.
[background technology]
In the era of big data, data volumes are growing explosively, which places high demands on data computation. The Hadoop ecosystem provides a powerful tool for large-scale computation over massive data and for reliable distributed storage. Within Hadoop, MapReduce is a key component that offers reliable, easy-to-use, and scalable computation over massive data. Many data analysis and computation methods map naturally onto the MapReduce framework, so it is widely used in massive data analysis. In practice, however, iterative computation under the MapReduce framework is constrained by how the Hadoop ecosystem is implemented, and iterative performance suffers as a result.
Under the MapReduce framework, a massive data set is divided into several data blocks. Each Map task processes one block and outputs a queue of <key, value> pairs. In the Shuffle stage, all <key, value> pairs are hash-recombined and sorted by key to form <key, value_list> pairs; in the Reduce stage, each <key, value_list> pair is processed independently and the result is output.
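The Map–Shuffle–Reduce data flow described above can be sketched in Python for illustration, using word count as the canonical example (the function names `map_reduce`, `wc_map`, and `wc_reduce` are illustrative, not part of the patent):

```python
from collections import defaultdict

def map_reduce(blocks, map_fn, reduce_fn):
    # Map stage: each block yields <key, value> pairs.
    pairs = [kv for block in blocks for kv in map_fn(block)]
    # Shuffle stage: hash-regroup pairs by key and sort the keys,
    # forming <key, value_list> pairs.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    # Reduce stage: each <key, value_list> pair is processed independently.
    return {k: reduce_fn(k, groups[k]) for k in sorted(groups)}

def wc_map(block):
    return [(w, 1) for w in block.split()]

def wc_reduce(key, values):
    return sum(values)

result = map_reduce(["a b a", "b c"], wc_map, wc_reduce)
# result == {"a": 2, "b": 2, "c": 1}
```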
Iterative computation under the MapReduce framework is subject to the following restrictions: (1) intermediate data between two MapReduce tasks must be written back to the Hadoop Distributed File System (HDFS), which costs performance; (2) Map and Reduce cannot themselves execute iteratively, so an iterative computation requires chaining two MapReduce jobs, which incurs the overhead of creating and destroying Java virtual machines and degrades performance. To overcome these problems, the prior art chains multiple MapReduce tasks in series, yet the same two shortcomings remain.
[summary of the invention]
The present invention aims to solve the above problems in the prior art by proposing an iterable data processing method under the MapReduce computation framework.
The iterable data processing method under the MapReduce computation framework proposed by the present invention comprises the following steps. S10: a ReadNode reads raw data from the Hadoop distributed file system and parses it into independent data items, which serve as the input data of a MapNode. S20: the MapNode uses a Shuffle Grouping mechanism to distribute the input data to each of its threads or processes for processing, outputting one <key, value> formatted data item per independent data item. S30: a ShuffleNode performs hash recombination on the <key, value> pairs, sorts them by key, and uses a Fields Grouping mechanism to distribute the sorted <key, value> pairs to each of its threads or processes. S40: each thread or process of the ShuffleNode deposits incoming <key, value> pairs into a local KVlist buffer pool in real time until it receives a <key, value> pair indicating that all data has been sent; it then sorts and groups the <key, value> pairs in the KVlist buffer pool by key and outputs one {i, <key, value_list>} formatted item per group, where i is the number of the current thread or process. S50: a ReduceNode sends each {i, <key, value_list>} item to its i-th thread or process for processing, outputting <key', value'>. S60: a CoordinateNode receives and buffers <key', value'> pairs until it receives a data item indicating that all data has been sent; the CoordinateNode then returns the calculation result based on the <key', value'> pairs to the ReadNode, which parses the result into independent data items, and steps S20 to S50 are repeated until the ReduceNode sends a data item indicating that iteration should stop, whereupon the CoordinateNode exits.
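Steps S10 through S60 can be illustrated as a single-process Python sketch that abstracts away the node topology (all function names, and the shortcut of feeding results back via return values rather than a CoordinateNode, are assumptions made for illustration):

```python
from collections import defaultdict

def run_iterable_mapreduce(raw, parse, map_fn, reduce_fn, done):
    """Single-process sketch of steps S10-S60; the parameters are
    caller-supplied stand-ins for the topology's nodes."""
    items = parse(raw)                                   # S10: ReadNode
    while True:
        # S20: MapNode emits <key, value> pairs per data item.
        pairs = [kv for item in items for kv in map_fn(item)]
        # S30/S40: ShuffleNode hash-regroups and sorts by key.
        groups = defaultdict(list)
        for k, v in pairs:
            groups[k].append(v)
        # S50: ReduceNode processes each <key, value_list> pair.
        results = {k: reduce_fn(k, vs) for k, vs in sorted(groups.items())}
        # S60: CoordinateNode either stops iteration or feeds back.
        if done(results):
            return results
        items = list(results.items())

# Toy usage: halve every value until all values reach 0.
out = run_iterable_mapreduce(
    "8 3",
    parse=lambda s: [("n", int(t)) for t in s.split()],
    map_fn=lambda kv: [(kv[0], kv[1] // 2)],
    reduce_fn=lambda k, vs: max(vs),
    done=lambda res: all(v == 0 for v in res.values()),
)
# out == {"n": 0}
```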
Under the MapReduce Computational frame that the present invention proposes, can iterative data processing method based on streaming, calculate that realize can iteration MapReduce Computational frame, can keep the calculated performance of MapReduce can not be affected because of iteration.The method makes intermediate data need not write back distributed file system, also avoids the expense of establishment and the destruction of java virtual machine, and can support the realization of more flexible and more efficient data analysis and process algorithm.
[accompanying drawing explanation]
Fig. 1 is a flow chart of the iterable data processing method under the MapReduce computation framework proposed by the present invention.
Fig. 2 is a topology diagram of the iterable data processing method under the MapReduce computation framework according to a first embodiment of the present invention.
Fig. 3 is a topology diagram of the iterable data processing method under the MapReduce computation framework according to a second embodiment of the present invention.
Fig. 4 is a topology diagram of the iterable data processing method under the MapReduce computation framework according to a third embodiment of the present invention.
[embodiment]
The present invention is described in further detail below with reference to specific embodiments and the accompanying drawings, in which the same or similar reference numerals denote the same or similar elements or elements with the same or similar functions throughout. The embodiments described with reference to the drawings are exemplary, serve only to explain the technical solution of the present invention, and should not be understood as limiting the present invention.
In the description of the present invention, terms such as "inner", "outer", "longitudinal", "lateral", "upper", "lower", "top", and "bottom" indicate orientations or positional relationships based on those shown in the drawings. They are used only for convenience of description and do not require the present invention to be constructed and operated in a specific orientation; they should therefore not be understood as limiting the present invention.
The invention provides an iterable data processing method under the MapReduce computation framework. As shown in Fig. 1, the method comprises the following steps. S10: a ReadNode reads raw data from the Hadoop distributed file system and parses it into independent data items, which serve as the input data of a MapNode. S20: the MapNode uses a Shuffle Grouping mechanism to distribute the input data to each of its threads or processes for processing, outputting one <key, value> formatted data item per independent data item. S30: a ShuffleNode performs hash recombination on the <key, value> pairs, sorts them by key, and uses a Fields Grouping mechanism to distribute the sorted <key, value> pairs to each of its threads or processes. S40: each thread or process of the ShuffleNode deposits incoming <key, value> pairs into a local KVlist buffer pool in real time until it receives a <key, value> pair indicating that all data has been sent; it then sorts and groups the <key, value> pairs in the KVlist buffer pool by key and outputs one {i, <key, value_list>} formatted item per group, where i is the number of the current thread or process. S50: a ReduceNode sends each {i, <key, value_list>} item to its i-th thread or process for processing, outputting <key', value'>. S60: a CoordinateNode receives and buffers <key', value'> pairs until it receives a data item indicating that all data has been sent; the CoordinateNode then returns the calculation result based on the <key', value'> pairs to the ReadNode, which parses the result into independent data items, and steps S20 to S50 are repeated until the ReduceNode sends a data item indicating that iteration should stop, whereupon the CoordinateNode exits.
More specifically, reference may also be made to Fig. 2. The iterable data processing method proposed by the present invention is based on stream computing: stream computing is used to realize the Map stage, Shuffle stage, and Reduce stage of the MapReduce framework, and also to realize the iteration mechanism.
The overall topology of the method consists of five kinds of nodes: ReadNode, MapNode, ShuffleNode, ReduceNode, and CoordinateNode. The ReadNode reads raw data from the distributed file system (Hadoop Distributed File System, HDFS) and parses it into independent data items fed one by one into the topology. The MapNode realizes the Map stage of the MapReduce framework; its thread or process count determines the number of Maps. The ShuffleNode realizes the Shuffle stage; its thread or process count equals that of the ReduceNode. The ReduceNode realizes the Reduce stage; its thread or process count determines the number of Reduces. The CoordinateNode is responsible for data collection and data synchronization during iteration.
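The five-node topology can be sketched as follows; the `Node` class and `build_topology` helper are assumptions made for illustration, and only the parallelism relationships come from the description above:

```python
# Each entry records a node's role and its degree of parallelism.
class Node:
    def __init__(self, name, parallelism):
        self.name = name
        self.parallelism = parallelism

def build_topology(num_maps, num_reduces):
    # Thread/process counts follow the constraints stated above:
    # MapNode's count determines the number of Maps, ShuffleNode's
    # count must equal ReduceNode's, which determines the Reduces.
    return [
        Node("ReadNode", 1),
        Node("MapNode", num_maps),
        Node("ShuffleNode", num_reduces),
        Node("ReduceNode", num_reduces),
        Node("CoordinateNode", 1),
    ]
```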
The MapNode realizes the Map stage of the MapReduce framework and receives the data items output by the ReadNode. Preferably, the number of threads or processes in the MapNode is exactly the number of Maps; this differs from Hadoop's MapReduce framework, where the number of Maps is determined by the number of data blocks. Since the data items of the data set are independent of one another, the MapNode distributes the received data to its threads or processes by the Shuffle Grouping mechanism so as to balance the computational load across them. For each data item, the MapNode performs a calculation and outputs one <key, value> pair; when the MapNode receives the special data item indicating that the data set has been sent, it outputs a special <key, value> pair indicating that the data has been sent. The ShuffleNode realizes the Shuffle stage of the MapReduce framework and receives the data items output by the MapNode. Preferably, the number of threads or processes in the ShuffleNode equals that of the ReduceNode, and all output of a given thread in the ShuffleNode must be received by the corresponding thread in the ReduceNode. The ShuffleNode is responsible for hash-recombining the received <key, value> pairs and sorting them by key; it therefore distributes the received <key, value> pairs using Fields Grouping. Each thread or process of the ShuffleNode first places each received <key, value> pair into a local buffer pool (KVlist) until it receives the special <key, value> pair indicating that all data has been sent. It then sorts all <key, value> pairs in the KVlist by key and groups them, pairs with the same key forming one group. For each group it generates a <key, value_list> pair and outputs {i, <key, value_list>}, where i is the number of the current thread or process. After a thread has processed a complete KVlist, it outputs a special data item indicating that the data has been sent.
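The buffer–sort–group behavior of one ShuffleNode thread might look like this minimal sketch (the `END` marker value and the function name are assumptions):

```python
from itertools import groupby

END = ("__END__", None)  # assumed marker meaning "all data sent"

def shuffle_thread(i, stream):
    """Sketch of one ShuffleNode thread: buffer <key, value> pairs in
    a local KVlist until the end marker arrives, then sort by key,
    group equal keys, and emit one (i, key, value_list) per group."""
    kvlist = []
    for kv in stream:
        if kv == END:
            break
        kvlist.append(kv)
    kvlist.sort(key=lambda kv: kv[0])          # sort by key
    return [(i, k, [v for _, v in grp])        # group equal keys
            for k, grp in groupby(kvlist, key=lambda kv: kv[0])]
```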
The ReduceNode realizes the Reduce stage of the MapReduce framework and receives the data items output by the ShuffleNode: each {i, <key, value_list>} item is sent to the i-th thread or process of the ReduceNode.
Each time a thread or process of the ReduceNode receives an {i, <key, value_list>} item, it processes the <key, value_list> pair and outputs the result in <key', value'> format.
The CoordinateNode is responsible for the data buffering, data synchronization, and data calculation of the iteration mechanism. The mechanism by which the CoordinateNode allocates received data to its internal threads or processes is determined by the concrete application.
When a node Node_i needs to perform an iterative operation, Node_i first sends its data items to the CoordinateNode. The CoordinateNode receives and buffers all data items until it receives the special data item indicating that the data has been sent; it then performs a calculation based on the received data items and returns the result to Node_i. When Node_i sends the special data item indicating that iteration should stop, the CoordinateNode exits.
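The CoordinateNode protocol just described — buffer until an end marker, compute, return the result, and exit on a stop marker — can be sketched as follows (the marker values and the generator form are assumptions):

```python
END, STOP = "__END__", "__STOP__"  # assumed control markers

def coordinate(stream, compute):
    """Sketch of the CoordinateNode: buffer items until the end marker,
    run the application-specific `compute` on the buffer and yield the
    result back to the sending node; exit when the stop marker arrives."""
    buffer = []
    for item in stream:
        if item == STOP:
            return                   # iteration stops; the node exits
        if item == END:
            yield compute(buffer)    # result returned to Node_i
            buffer = []
        else:
            buffer.append(item)
```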
Fig. 2 shows the case in which the CoordinateNode receives and buffers <key', value'> pairs until it receives the data item indicating that all data has been sent; the CoordinateNode then returns the calculation result based on the <key', value'> pairs to the ReadNode, which parses the result into independent data items, and steps S20 to S50 are repeated until the ReduceNode sends a data item indicating that iteration should stop, whereupon the CoordinateNode exits.
Fig. 3 shows the case in which, after step S20 is executed, the CoordinateNode receives and buffers <key, value> pairs until it receives the data item indicating that all data has been sent; the CoordinateNode then returns the calculation result based on the <key, value> pairs to the MapNode, and this result is used as the input data of the MapNode when step S20 is executed again.
Fig. 4 shows the case in which, after step S40 is executed, the CoordinateNode receives and buffers {i, <key, value_list>} items until it receives the data item indicating that all data has been sent; the CoordinateNode then returns the calculation result based on the {i, <key, value_list>} items to the ShuffleNode, and this result is used as the input data of the ShuffleNode when steps S30 and S40 are executed again.
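The three feedback configurations of Figs. 2 to 4 can be summarized as a routing table (the dictionary form is purely illustrative; the node names and step labels come from the description above):

```python
# Routing summary of the three feedback configurations (Figs. 2-4).
LOOPBACKS = {
    "fig2": {"collect_after": "ReduceNode",  "return_to": "ReadNode",
             "reenter_at": "S20"},
    "fig3": {"collect_after": "MapNode",     "return_to": "MapNode",
             "reenter_at": "S20"},
    "fig4": {"collect_after": "ShuffleNode", "return_to": "ShuffleNode",
             "reenter_at": "S30"},
}
```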
For the embodiment of Fig. 3, consider a data set Set as an example. The ReadNode first receives the set, parses it into individual data items DataEntry_i, and sends them to the MapNode. The MapNode distributes the received data items to its internal threads by Shuffle Grouping. A thread in the MapNode processes one data item and outputs {type, <key, value>}, where type is a 4-bit identifier carrying extra information such as whether iteration is needed and whether the data has been sent.
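For illustration only, the 4-bit type identifier could be encoded as bit flags like this; the specific bit assignments below are hypothetical, since the description only states that type carries flags such as "needs iteration" and "data sent":

```python
# Hypothetical bit assignments for the 4-bit `type` identifier.
NEED_ITER = 0b0001  # item participates in a further iteration
DATA_SENT = 0b0010  # all data has been sent
STOP_ITER = 0b0100  # iteration should stop
RESERVED  = 0b1000  # unused

def make_item(key, value, flags=0):
    return {"type": flags, "kv": (key, value)}

def needs_iteration(item):
    return bool(item["type"] & NEED_ITER)
```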
Both CoordinateNode1 and the ShuffleNode receive the output of the MapNode. If the type of an input data item indicates that iteration is needed, the ShuffleNode ignores the received data item, while CoordinateNode1 receives data items and caches them in an array until it receives the special item indicating that the data has been sent. Once all data has been received, CoordinateNode1 processes the data item array and outputs a <key, value> pair describing the result; the MapNode receives the output of CoordinateNode1 and processes it again.
When iteration is to stop, the MapNode sets the corresponding bit in the type field of its output {type, <key, value>} to indicate that iteration has finished.
The ShuffleNode receives the output of the MapNode and likewise first checks the type. If the type indicates that iteration has stopped, the ShuffleNode starts receiving the data items {type, <key, value>} and caches them in an array until it receives a data item whose type indicates that the data has been sent. Once all data has been received, the ShuffleNode first sorts all <key, value> pairs in the data item array by key and then groups them, placing pairs with the same key in one group; for each group it generates a <key, value_list> pair (value_list being a linked list of values) and outputs {type', <key, value_list>}.
The ReduceNode receives the data items {type', <key, value_list>} output by the ShuffleNode, processes them, and outputs the result in the form <key', value'>.
CoordinateNode2 receives the output of the ReduceNode. Like CoordinateNode1, it receives data items and caches them in an array until it receives the special item indicating that the data has been sent. Once all data has been received, CoordinateNode2 processes the data item array and outputs the result described in <key, value> form.
The ReadNode receives the output of CoordinateNode2 and feeds the data still to be processed back into the whole framework.
In summary, the iterable data processing method under the MapReduce computation framework proposed by the present invention realizes an iterable MapReduce computation framework on top of stream computing, so that the computational performance of MapReduce is not affected by iteration. Intermediate data need not be written back to the distributed file system, the overhead of creating and destroying Java virtual machines is avoided, and more flexible and more efficient data analysis and processing algorithms can be supported.
The iterable data processing method under the MapReduce computation framework proposed by the present invention has been implemented using the Storm stream computing tool, with good experimental results.
Although the present invention has been described with reference to the presently preferred embodiments, those skilled in the art will understand that the above embodiments serve only to explain and illustrate the technical solution of the present invention and are not intended to limit its scope of protection; any modification, equivalent replacement, variation, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the claims.
Claims (7)
- An iterable data processing method under a MapReduce computation framework, comprising the following steps: S10, a ReadNode reading raw data from the Hadoop distributed file system and parsing the raw data into independent data items, the independent data items serving as input data of a MapNode; S20, the MapNode using a Shuffle Grouping mechanism to distribute the input data to each of its threads or processes for processing, and outputting <key, value> formatted data for each independent data item; S30, a ShuffleNode performing hash recombination on the <key, value> pairs, sorting them by key, and using a Fields Grouping mechanism to distribute the sorted <key, value> pairs to each of its threads or processes; S40, each thread or process of the ShuffleNode depositing <key, value> pairs into a local KVlist buffer pool in real time until receiving a <key, value> pair indicating that the data has been sent, then sorting and grouping the <key, value> pairs in the KVlist buffer pool by key and outputting {i, <key, value_list>} formatted data for each group, where i is the number of the current thread or process; S50, a ReduceNode sending each {i, <key, value_list>} item to its i-th thread or process for processing and outputting <key', value'>; S60, a CoordinateNode receiving and buffering <key', value'> pairs until receiving a data item indicating that the data has been sent, the CoordinateNode returning the calculation result based on the <key', value'> pairs to the ReadNode, the ReadNode parsing the calculation result into independent data items, and repeating steps S20 to S50 for iteration until the ReduceNode sends a data item indicating that iteration should stop, whereupon the CoordinateNode exits.
- The iterable data processing method under a MapReduce computation framework according to claim 1, characterized in that after step S20 is executed, the CoordinateNode receives and buffers <key, value> pairs until receiving a data item indicating that the data has been sent; the CoordinateNode returns the calculation result based on the <key, value> pairs to the MapNode, and this result is used as the input data of the MapNode when step S20 is executed again.
- The iterable data processing method under a MapReduce computation framework according to claim 1, characterized in that after step S40 is executed, the CoordinateNode receives and buffers {i, <key, value_list>} items until receiving a data item indicating that the data has been sent; the CoordinateNode returns the calculation result based on the {i, <key, value_list>} items to the ShuffleNode, and this result is used as the input data of the ShuffleNode when steps S30 and S40 are executed again.
- The iterable data processing method under a MapReduce computation framework according to claim 1, characterized in that the number of threads or processes in the MapNode is the number of Maps.
- The iterable data processing method under a MapReduce computation framework according to claim 1, characterized in that when the MapNode receives a data item indicating that the independent data items have been sent, it outputs a <key, value> pair indicating that the data items have been sent.
- The iterable data processing method under a MapReduce computation framework according to claim 1, characterized in that the number of threads or processes in the ShuffleNode equals the number of threads or processes in the ReduceNode.
- The iterable data processing method under a MapReduce computation framework according to claim 6, characterized in that all output of each thread or process in the ShuffleNode is received by one thread or process in the ReduceNode.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310686716.7A CN103699442B (en) | 2013-12-12 | 2013-12-12 | Iterable data processing method under the MapReduce computation framework |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310686716.7A CN103699442B (en) | 2013-12-12 | 2013-12-12 | Iterable data processing method under the MapReduce computation framework |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103699442A true CN103699442A (en) | 2014-04-02 |
CN103699442B CN103699442B (en) | 2018-04-17 |
Family
ID=50360981
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310686716.7A Active CN103699442B (en) | 2013-12-12 | 2013-12-12 | Under MapReduce Computational frames can iterative data processing method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103699442B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104391916A (en) * | 2014-11-19 | 2015-03-04 | 广州杰赛科技股份有限公司 | GPEH data analysis method and device based on distributed computing platform |
CN104391748A (en) * | 2014-11-21 | 2015-03-04 | 浪潮电子信息产业股份有限公司 | Mapreduce calculation process optimization method |
CN105095244A (en) * | 2014-05-04 | 2015-11-25 | 李筑 | Big data algorithm for entrepreneurship cloud platform |
CN105354089A (en) * | 2015-10-15 | 2016-02-24 | 北京航空航天大学 | Streaming data processing model and system supporting iterative calculation |
CN103995827B (en) * | 2014-04-10 | 2017-08-04 | 北京大学 | High-performance sort method in MapReduce Computational frames |
CN107797852A (en) * | 2016-09-06 | 2018-03-13 | 阿里巴巴集团控股有限公司 | The processing unit and processing method of data iteration |
CN114077609A (en) * | 2022-01-19 | 2022-02-22 | 北京四维纵横数据技术有限公司 | Data storage and retrieval method, device, computer readable storage medium and electronic equipment |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102137125A (en) * | 2010-01-26 | 2011-07-27 | 复旦大学 | Method for processing cross task data in distributive network system |
CN103279328A (en) * | 2013-04-08 | 2013-09-04 | 河海大学 | BlogRank algorithm parallelization processing construction method based on Haloop |
- 2013-12-12: CN201310686716.7A granted as patent CN103699442B (status: Active)
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102137125A (en) * | 2010-01-26 | 2011-07-27 | 复旦大学 | Method for processing cross task data in distributive network system |
CN103279328A (en) * | 2013-04-08 | 2013-09-04 | 河海大学 | BlogRank algorithm parallelization processing construction method based on Haloop |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103995827B (en) * | 2014-04-10 | 2017-08-04 | 北京大学 | High-performance sort method in MapReduce Computational frames |
CN105095244A (en) * | 2014-05-04 | 2015-11-25 | 李筑 | Big data algorithm for entrepreneurship cloud platform |
CN104391916A (en) * | 2014-11-19 | 2015-03-04 | 广州杰赛科技股份有限公司 | GPEH data analysis method and device based on distributed computing platform |
CN104391748A (en) * | 2014-11-21 | 2015-03-04 | 浪潮电子信息产业股份有限公司 | Mapreduce calculation process optimization method |
CN105354089A (en) * | 2015-10-15 | 2016-02-24 | 北京航空航天大学 | Streaming data processing model and system supporting iterative calculation |
CN105354089B (en) * | 2015-10-15 | 2019-02-01 | 北京航空航天大学 | Support the stream data processing unit and system of iterative calculation |
CN107797852A (en) * | 2016-09-06 | 2018-03-13 | 阿里巴巴集团控股有限公司 | The processing unit and processing method of data iteration |
CN114077609A (en) * | 2022-01-19 | 2022-02-22 | 北京四维纵横数据技术有限公司 | Data storage and retrieval method, device, computer readable storage medium and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN103699442B (en) | 2018-04-17 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||