CN103699442B - Iterative data processing method under the MapReduce computing framework - Google Patents

Iterative data processing method under the MapReduce computing framework

Info

Publication number
CN103699442B
CN103699442B (application CN201310686716.7A)
Authority
CN
China
Prior art keywords
key
data
value
thread
data item
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310686716.7A
Other languages
Chinese (zh)
Other versions
CN103699442A (en)
Inventor
邹瑜斌
张帆
须成忠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN201310686716.7A priority Critical patent/CN103699442B/en
Publication of CN103699442A publication Critical patent/CN103699442A/en
Application granted granted Critical
Publication of CN103699442B publication Critical patent/CN103699442B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The present invention proposes an iterative data processing method under the MapReduce computing framework, comprising the following steps: S10, read the raw data and parse it into independent data items; S20, distribute the input data to each thread or process for processing using the Shuffle Grouping mechanism; S30, hash-regroup and sort the data, and distribute the sorted data to each thread or process using the Fields Grouping mechanism; S40, each thread or process sorts and groups the data in its buffer pool in real time; S50, send the data to a thread or process for processing; S60, parse the returned computation result into independent data items and repeat steps S20 to S50, until a data item indicating that iteration should stop is sent. The present invention keeps the computational performance of MapReduce unaffected by iteration and also reduces the overhead of creating and destroying virtual machines.

Description

Iterative data processing method under the MapReduce computing framework
【Technical field】
The present invention relates to an iterative data processing method under the MapReduce computing framework.
【Background technology】
In the big-data era, data volume grows explosively, which places high demands on computational processing. The Hadoop ecosystem provides a powerful tool for large-scale computation over massive data and for reliable distributed storage. Within Hadoop, MapReduce is a reliable, easy-to-use, and scalable key component for computation over massive data; the MapReduce computing framework is very friendly to many data analysis and computational methods, which gives it a wide range of applications in massive-data analysis. In practice, however, iterative computation under the MapReduce computing framework is constrained by the implementation of the Hadoop ecosystem, so the performance of iterative computation suffers.
Under the MapReduce computing framework, a data set is divided into several data blocks; each Map processes one data block and outputs a queue of key-value pairs. In the shuffle stage, all key-value pairs are hash-regrouped and sorted by key to form key-value_list pairs. In the Reduce stage, each key-value_list pair is processed separately and the result is output.
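The classic shuffle behavior described above can be sketched as follows. This is a minimal single-process illustration, not the patented implementation; the function name and the use of Python's built-in `hash` for partitioning are assumptions for the sketch.

```python
from collections import defaultdict

def shuffle(pairs, num_reducers):
    """Hash-regroup (key, value) pairs and sort each partition by key."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        # Hash regrouping: all pairs with the same key land in one partition.
        partitions[hash(key) % num_reducers][key].append(value)
    # Sort each partition by key, yielding (key, value_list) pairs.
    return [sorted(p.items()) for p in partitions]

map_output = [("b", 2), ("a", 1), ("b", 3), ("a", 4)]
result = shuffle(map_output, num_reducers=1)
# result[0] == [("a", [1, 4]), ("b", [2, 3])]
```

Each (key, value_list) pair is then handed to one Reduce, exactly as the Reduce stage above describes.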
Iterative computation under the MapReduce computing framework is limited in two ways: (1) intermediate data between two MapReduce tasks must be written back to the Hadoop Distributed File System (HDFS), which incurs a performance loss; (2) Map and Reduce cannot themselves execute iteratively, so iterative computation requires chaining MapReduce tasks, which incurs the overhead of creating and destroying Java virtual machines and hurts performance. To address these problems, the prior art chains multiple MapReduce tasks, but the same two shortcomings remain: intermediate data must still be written back to HDFS, and each chained task still pays the cost of creating and destroying Java virtual machines.
【Summary of the invention】
The present invention seeks to address the above problems in the prior art by proposing an iterative data processing method under the MapReduce computing framework.
The iterative data processing method under the MapReduce computing framework proposed by the present invention comprises the following steps. S10: ReadNode reads raw data from the Hadoop Distributed File System and parses it into independent data items, which serve as the input data of MapNode. S20: MapNode distributes the input data to each of its threads or processes for processing using the Shuffle Grouping mechanism, and outputs one <key, value> record per independent data item. S30: ShuffleNode hash-regroups the <key, value> records, sorts them by key, and distributes the sorted <key, value> records to each of its threads or processes using the Fields Grouping mechanism. S40: each thread or process of ShuffleNode stores incoming <key, value> records in a local KVlist buffer pool in real time until it receives the <key, value> marking the end of transmission; it then sorts and groups the records in the KVlist buffer pool by key and outputs one {i, <key, value_list>} record per group, where i is the number of the current thread or process. S50: ReduceNode sends each {i, <key, value_list>} to its i-th thread or process for processing and outputs <key', value'>. S60: CoordinateNode receives and buffers the <key', value'> records until it receives the data item marking the end of transmission, then returns the computation result based on the <key', value'> records to ReadNode; ReadNode parses the result into independent data items and steps S20 to S50 are repeated, until ReduceNode sends the data item indicating that iteration should stop, whereupon CoordinateNode exits.
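The control flow of steps S10 to S60 can be sketched as follows. This is an illustrative single-process reduction of the method, not the distributed topology the invention describes; `map_fn`, `reduce_fn`, and `converged` are hypothetical stand-ins for the application-specific logic.

```python
from collections import defaultdict

def map_reduce_round(items, map_fn, reduce_fn):
    # S20: MapNode emits one (key, value) per data item.
    pairs = [map_fn(item) for item in items]
    # S30/S40: ShuffleNode regroups, sorts, and builds key -> value_list.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    # S50: ReduceNode processes each (key, value_list).
    return [reduce_fn(k, vs) for k, vs in sorted(groups.items())]

def iterate(raw_items, map_fn, reduce_fn, converged, max_iters=100):
    # S10: raw data arrives as independent items; S60: CoordinateNode feeds
    # the results back as new input until a stop condition holds.
    items = raw_items
    for _ in range(max_iters):
        items = map_reduce_round(items, map_fn, reduce_fn)
        if converged(items):
            break
    return items

# Example: sum all values in a single round, stopping once one total remains.
total = iterate([1, 2],
                map_fn=lambda x: ("sum", x),
                reduce_fn=lambda k, vs: sum(vs),
                converged=lambda items: len(items) == 1 and items[0] >= 3)
# total == [3]
```

The essential difference from vanilla MapReduce is the feedback edge in `iterate`: results flow back as input without touching a distributed file system.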
The iterative data processing method under the MapReduce computing framework proposed by the present invention realizes an iterable MapReduce computing framework on top of stream computing, so the computational performance of MapReduce is not affected by iteration. The method spares intermediate data from being written back to the distributed file system, avoids the overhead of creating and destroying Java virtual machines, and can support the implementation of more flexible and more efficient data analysis and processing algorithms.
【Brief description of the drawings】
Fig. 1 is a flow chart of the iterative data processing method under the MapReduce computing framework proposed by the present invention.
Fig. 2 is a topology diagram of the iterative data processing method under the MapReduce computing framework according to a first embodiment of the present invention.
Fig. 3 is a topology diagram of the iterative data processing method under the MapReduce computing framework according to a second embodiment of the present invention.
Fig. 4 is a topology diagram of the iterative data processing method under the MapReduce computing framework according to a third embodiment of the present invention.
【Embodiment】
The present invention is described in further detail below with reference to specific embodiments and the accompanying drawings. The embodiments described with reference to the drawings are exemplary, and the same or similar reference labels denote the same or similar elements throughout. They serve only to explain the technical solution of the present invention and are not to be construed as limiting it.
In the description of the present invention, terms such as "inner", "outer", "longitudinal", "transverse", "upper", "lower", "top", and "bottom" indicate orientations or positional relationships based on those shown in the drawings. They are used only for convenience of description, do not require the present invention to be constructed and operated in a particular orientation, and are therefore not to be construed as limiting the present invention.
The present invention provides an iterative data processing method under the MapReduce computing framework. As shown in Fig. 1, the method comprises the following steps. S10: ReadNode reads raw data from the Hadoop Distributed File System and parses it into independent data items, which serve as the input data of MapNode. S20: MapNode distributes the input data to each of its threads or processes for processing using the Shuffle Grouping mechanism, and outputs one <key, value> record per independent data item. S30: ShuffleNode hash-regroups the <key, value> records, sorts them by key, and distributes the sorted <key, value> records to each of its threads or processes using the Fields Grouping mechanism. S40: each thread or process of ShuffleNode stores incoming <key, value> records in a local KVlist buffer pool in real time until it receives the <key, value> marking the end of transmission; it then sorts and groups the records in the KVlist buffer pool by key and outputs one {i, <key, value_list>} record per group, where i is the number of the current thread or process. S50: ReduceNode sends each {i, <key, value_list>} to its i-th thread or process for processing and outputs <key', value'>. S60: CoordinateNode receives and buffers the <key', value'> records until it receives the data item marking the end of transmission, then returns the computation result based on the <key', value'> records to ReadNode; ReadNode parses the result into independent data items and steps S20 to S50 are repeated, until ReduceNode sends the data item indicating that iteration should stop, whereupon CoordinateNode exits.
Specifically, reference may also be made to Fig. 2. The iterative data processing method under the MapReduce computing framework proposed by the present invention is based on stream computing: the Map stage, the Shuffle stage, and the Reduce stage of the MapReduce computing framework, as well as the iterator mechanism, are all realized with stream computing.
The whole topology of the iterative data processing method under the MapReduce computing framework proposed by the present invention consists of five kinds of nodes: ReadNode, MapNode, ShuffleNode, ReduceNode, and CoordinateNode. ReadNode is responsible for reading raw data from the distributed file system (Hadoop Distributed File System, HDFS) and parsing it into independent data items that enter the topology one by one. MapNode realizes the Map stage of the MapReduce computing framework; its number of threads or processes determines the number of Maps. ShuffleNode realizes the Shuffle stage of the MapReduce computing framework; its number of threads or processes equals that of ReduceNode. ReduceNode realizes the Reduce stage of the MapReduce computing framework; its number of threads or processes determines the number of Reduces. CoordinateNode is responsible for data collection and data synchronization when iteration is performed.
MapNode realizes the Map stage of the MapReduce computing framework and receives the data items output by ReadNode. Preferably, the number of threads or processes in MapNode is exactly the number of Maps; this differs from Hadoop's MapReduce framework, where the number of Maps is determined by the number of data blocks. Since the data items in the data set are independent of one another, MapNode distributes the data it receives to its threads or processes using the Shuffle Grouping mechanism in order to balance their computational load; for each data item, MapNode performs a computation and outputs one <key, value>. When MapNode receives the special data item indicating that the data set has been fully sent, it outputs a special <key, value> indicating that transmission has finished. ShuffleNode realizes the Shuffle stage of the MapReduce computing framework and receives the data items output by MapNode. Preferably, the number of threads or processes in ShuffleNode equals that of ReduceNode, and all outputs of a given thread in ShuffleNode must be received by one corresponding thread in ReduceNode. ShuffleNode hash-regroups the received <key, value> records and sorts them by key, and therefore distributes the received <key, value> records using the Fields Grouping mechanism. Each thread or process of ShuffleNode first places every received <key, value> into a local buffer pool (KVlist), until it receives the special <key, value> indicating that all data has been sent. It then sorts all <key, value> records in the KVlist by key and groups them, records with the same key forming one group. For each group it generates one <key, value_list>, and finally outputs {i, <key, value_list>}, where i is the number of the current thread or process. After a thread has processed its entire KVlist, it outputs a special data item indicating that transmission has finished.
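One ShuffleNode thread, as described above, can be sketched as follows. This is an assumed single-thread rendering of the described behavior; the `END` sentinel stands in for the special end-of-transmission <key, value>.

```python
END = object()  # stand-in for the special end-of-data <key, value> marker

def shuffle_thread(i, stream):
    """Buffer (key, value) pairs until END, then sort, group, and emit."""
    kvlist = []
    for item in stream:
        if item is END:
            break
        kvlist.append(item)                # S40: buffer in the local KVlist
    kvlist.sort(key=lambda kv: kv[0])      # sort by key
    groups = {}
    for k, v in kvlist:                    # same key -> one value_list
        groups.setdefault(k, []).append(v)
    # Emit one {i, <key, value_list>} per group, tagged with thread number i.
    return [(i, k, vlist) for k, vlist in sorted(groups.items())]

out = shuffle_thread(0, [("b", 1), ("a", 2), ("b", 3), END])
# out == [(0, "a", [2]), (0, "b", [1, 3])]
```

The thread-number tag i is what lets ReduceNode route each group to its i-th thread in step S50.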
ReduceNode realizes the Reduce stage of the MapReduce computing framework and receives the data items output by ShuffleNode: each {i, <key, value_list>} is sent to the i-th thread or process of ReduceNode. Whenever a thread or process of ReduceNode receives an {i, <key, value_list>}, it processes the <key, value_list> and outputs the result in <key', value'> form.
CoordinateNode is responsible for the data buffering, data synchronization, and data computation of the iterator mechanism. How CoordinateNode distributes received data to its internal threads or processes depends on the concrete application.
When a node Node_i needs to perform an iterative operation, Node_i first sends data items to CoordinateNode. CoordinateNode receives and buffers all data items until it receives the special data item indicating that transmission has finished; it then performs a computation based on the received data items and returns the result to Node_i. When Node_i sends the special data item indicating that iteration should stop, CoordinateNode exits.
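The CoordinateNode contract just described can be sketched as a generator. The `DONE` and `STOP` markers and the `compute` callback are hypothetical names for the special data items and the application-specific computation.

```python
DONE, STOP = "DONE", "STOP"  # stand-ins for the special marker data items

def coordinate(stream, compute):
    """Yield one aggregated result per DONE-terminated batch; exit on STOP."""
    buffer = []
    for item in stream:
        if item == STOP:       # node signalled end of iteration: exit
            return
        if item == DONE:       # batch complete: compute and send back
            yield compute(buffer)
            buffer = []
        else:
            buffer.append(item)

results = list(coordinate([1, 2, DONE, 3, 4, DONE, STOP], compute=sum))
# results == [3, 7]
```

Each yielded value corresponds to one iteration's result being returned to Node_i; the `return` on `STOP` models CoordinateNode exiting.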
Fig. 2 shows the following: CoordinateNode receives and buffers <key', value'> until it receives the data item indicating that transmission has finished; it then returns the computation result based on <key', value'> to ReadNode. ReadNode parses the result into independent data items and steps S20 to S50 are repeated, until ReduceNode sends the data item indicating that iteration should stop, whereupon CoordinateNode exits.
Fig. 3 shows that, after step S20 is performed, CoordinateNode receives and buffers <key, value> until it receives the data item indicating that transmission has finished; it then returns the computation result based on <key, value> to MapNode, and step S20 is re-executed with that result as the input data of MapNode.
Fig. 4 shows that, after step S40 is performed, CoordinateNode receives and buffers {i, <key, value_list>} until it receives the data item indicating that transmission has finished; it then returns the computation result based on {i, <key, value_list>} to ShuffleNode, and steps S30 and S40 are re-executed with that result as the input data of ShuffleNode.
For the embodiment of Fig. 3, for example, given a data set Set, ReadNode first receives it and parses it into data items DataEntry_i one by one, which are then sent to MapNode. MapNode distributes the received data items to its internal threads in Shuffle Grouping fashion; a thread in MapNode processes one data item and outputs {type, <key, value>}, where type is a 4-bit identifier describing extra information such as whether iteration is needed and whether transmission has finished.
CoordinateNode1 and ShuffleNode both receive the output of MapNode. If the type field of an input data item indicates that iteration is needed, ShuffleNode ignores the item, while CoordinateNode1 receives the data items and caches them in an array, until it receives the special item indicating that transmission has finished. Once reception is complete, CoordinateNode1 processes the array of data items and outputs one <key, value> describing the processing result; MapNode receives the output of CoordinateNode1 and processes it again.
When iteration ends, MapNode sets the corresponding bit of type in its {type, <key, value>} output to indicate that iteration has finished.
ShuffleNode receives the output of MapNode and likewise first checks type. If type indicates that iteration has ended, ShuffleNode starts receiving the data items {type, <key, value>} and caches them in an array, until it receives a data item whose type indicates that transmission has finished. Once reception is complete, ShuffleNode first sorts all <key, value> records in the array by key and then groups them, records with the same key being placed in one group; it then generates a <key, value_list> for each group (value_list is a linked list of values) and outputs {type', <key, value_list>}.
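The type-flag routing described above can be sketched as follows. The bit layout is an assumption: the description only says type is a 4-bit identifier encoding "needs iteration", "transmission finished", and similar status information.

```python
# Hypothetical bit assignments for the 4-bit type flag.
ITERATE = 0b0001   # item still needs iteration (goes to CoordinateNode)
EOS     = 0b0010   # end-of-stream: all data for this batch has been sent
HALT    = 0b0100   # iteration has converged; terminate

def route(items):
    """Split a MapNode output stream between ShuffleNode and CoordinateNode."""
    to_shuffle, to_coord = [], []
    for type_, kv in items:
        if type_ & ITERATE:    # ShuffleNode ignores items still iterating
            to_coord.append((type_, kv))
        else:
            to_shuffle.append((type_, kv))
    return to_shuffle, to_coord

s, c = route([(ITERATE, ("a", 1)), (0, ("b", 2)), (EOS, None)])
# s == [(0, ("b", 2)), (EOS, None)] and c == [(ITERATE, ("a", 1))]
```

Testing single bits with `&` is what lets one field carry several independent status signals, which matches the description of type as a multi-purpose identifier.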
ReduceNode receives the output data items {type', <key, value_list>} of ShuffleNode, processes them, and outputs the results in <key', value'> form.
CoordinateNode2 receives the output of ReduceNode. Like CoordinateNode1, CoordinateNode2 receives the data items and caches them in an array, until it receives the special item indicating that transmission has finished. Once reception is complete, CoordinateNode2 processes the array of data items, describes the processing result in <key, value> form, and outputs it.
ReadNode receives the output of CoordinateNode2 and again feeds the data to be processed into the whole framework.
The iterative data processing method under the MapReduce computing framework proposed by the present invention realizes an iterable MapReduce computing framework on top of stream computing, so the computational performance of MapReduce is not affected by iteration. The method spares intermediate data from being written back to the distributed file system, avoids the overhead of creating and destroying Java virtual machines, and can support the implementation of more flexible and more efficient data analysis and processing algorithms.
The iterative data processing method under the MapReduce computing framework proposed by the present invention has been implemented with the Storm stream computing tool, with good experimental results.
Although the present invention has been described with reference to the presently preferred embodiments, those skilled in the art should understand that the above embodiments serve only to explain and illustrate the technical solution of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, variation, or improvement made within the spirit and principles of the present invention shall fall within the scope of the claims of the present invention.

Claims (5)

1. An iterative data processing method under the MapReduce computing framework, comprising the following steps:
S10: ReadNode reads raw data from the Hadoop Distributed File System and parses the raw data into independent data items, which serve as the input data of MapNode;
S20: MapNode distributes the input data to each thread or process of MapNode for processing using the Shuffle Grouping mechanism, and outputs <key, value> format data for each independent data item;
S30: ShuffleNode hash-regroups the <key, value> records, performs sorting based on the key values, and distributes the sorted <key, value> records to each thread or process of ShuffleNode using the Fields Grouping mechanism;
S40: each thread or process of ShuffleNode stores the <key, value> records in a local KVlist buffer pool in real time until it receives the <key, value> indicating that data transmission has finished, then sorts and groups the <key, value> records in the KVlist buffer pool based on the key values, and outputs {i, <key, value_list>} format data for each group, wherein i is the number of the current thread or process;
S50: ReduceNode sends each {i, <key, value_list>} to its i-th thread or process for processing, and outputs <key', value'>;
S60: CoordinateNode receives and buffers the <key', value'> records until it receives the data item indicating that transmission has finished; CoordinateNode then returns the computation result based on the <key', value'> records to ReadNode, ReadNode parses the computation result into independent data items, and steps S20 to S50 are repeated, until ReduceNode sends the data item indicating that iteration should stop, whereupon CoordinateNode exits;
the number of threads or processes in MapNode is the number of Maps;
when MapNode receives the data item indicating that the independent data items have been fully sent, MapNode outputs the <key, value> indicating that transmission of data items has finished.
2. The iterative data processing method under the MapReduce computing framework according to claim 1, characterized in that, after step S20 is performed, CoordinateNode receives and buffers <key, value> until it receives the data item indicating that transmission has finished; CoordinateNode then returns the computation result based on <key, value> to MapNode, and step S20 is re-executed with the computation result as the input data of MapNode.
3. The iterative data processing method under the MapReduce computing framework according to claim 1, characterized in that, after step S40 is performed, CoordinateNode receives and buffers {i, <key, value_list>} until it receives the data item indicating that transmission has finished; CoordinateNode then returns the computation result based on {i, <key, value_list>} to ShuffleNode, and steps S30 and S40 are re-executed with the computation result as the input data of ShuffleNode.
4. The iterative data processing method under the MapReduce computing framework according to claim 1, characterized in that the number of threads or processes in ShuffleNode equals the number of threads or processes in ReduceNode.
5. The iterative data processing method under the MapReduce computing framework according to claim 4, characterized in that all outputs of each thread or process in ShuffleNode are received by one thread or process in ReduceNode.
CN201310686716.7A 2013-12-12 2013-12-12 Iterative data processing method under the MapReduce computing framework Active CN103699442B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310686716.7A CN103699442B (en) 2013-12-12 2013-12-12 Iterative data processing method under the MapReduce computing framework

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310686716.7A CN103699442B (en) 2013-12-12 2013-12-12 Iterative data processing method under the MapReduce computing framework

Publications (2)

Publication Number Publication Date
CN103699442A (en) 2014-04-02
CN103699442B true CN103699442B (en) 2018-04-17

Family

ID=50360981

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310686716.7A Active CN103699442B (en) 2013-12-12 2013-12-12 Iterative data processing method under the MapReduce computing framework

Country Status (1)

Country Link
CN (1) CN103699442B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103995827B (en) * 2014-04-10 2017-08-04 北京大学 High-performance sort method in MapReduce Computational frames
CN105095244A (en) * 2014-05-04 2015-11-25 李筑 Big data algorithm for entrepreneurship cloud platform
CN104391916A (en) * 2014-11-19 2015-03-04 广州杰赛科技股份有限公司 GPEH data analysis method and device based on distributed computing platform
CN104391748A (en) * 2014-11-21 2015-03-04 浪潮电子信息产业股份有限公司 Mapreduce computation process optimization method
CN105354089B (en) * 2015-10-15 2019-02-01 北京航空航天大学 Support the stream data processing unit and system of iterative calculation
CN107797852A (en) * 2016-09-06 2018-03-13 阿里巴巴集团控股有限公司 The processing unit and processing method of data iteration
CN114077609B (en) * 2022-01-19 2022-04-22 北京四维纵横数据技术有限公司 Data storage and retrieval method, device, computer readable storage medium and electronic equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102137125A (en) * 2010-01-26 2011-07-27 复旦大学 Method for processing cross task data in distributive network system
CN103279328A (en) * 2013-04-08 2013-09-04 河海大学 BlogRank algorithm parallelization processing construction method based on Haloop

Also Published As

Publication number Publication date
CN103699442A (en) 2014-04-02

Similar Documents

Publication Publication Date Title
CN103699442B (en) Iterative data processing method under the MapReduce computing framework
TWI680409B (en) Method for matrix by vector multiplication for use in artificial neural network
Das et al. Distributed deep learning using synchronous stochastic gradient descent
US10831773B2 (en) Method and system for parallelization of ingestion of large data sets
CN111656367A (en) System and architecture for neural network accelerator
WO2017167095A1 (en) Model training method and device
US20200159810A1 (en) Partitioning sparse matrices based on sparse matrix representations for crossbar-based architectures
KR102163209B1 (en) Method and reconfigurable interconnect topology for multi-dimensional parallel training of convolutional neural network
CN104809161B (en) A kind of method and system that sparse matrix is compressed and is inquired
Yamazaki et al. One-sided dense matrix factorizations on a multicore with multiple GPU accelerators
CN111444134A (en) Parallel PME (pulse-modulated emission) accelerated optimization method and system of molecular dynamics simulation software
CN103995827B (en) High-performance sorting method in the MapReduce computing framework
Plimpton et al. Streaming data analytics via message passing with application to graph algorithms
CN108491924B (en) Neural network data serial flow processing device for artificial intelligence calculation
KR101361080B1 (en) Apparatus, method and computer readable recording medium for calculating between matrices
JP4310500B2 (en) Important component priority calculation method and equipment
US20220382829A1 (en) Sparse matrix multiplication in hardware
Geng et al. Rima: an RDMA-accelerated model-parallelized solution to large-scale matrix factorization
Xia et al. Redundancy-free high-performance dynamic GNN training with hierarchical pipeline parallelism
TWI814734B (en) Calculation device for and calculation method of performing convolution
Shi et al. Accelerating intersection computation in frequent itemset mining with fpga
US11886347B2 (en) Large-scale data processing computer architecture
Yuan et al. Optimizing sparse matrix vector multiplication using diagonal storage matrix format
Lu et al. High-performance homomorphic matrix completion on GPUs
Song et al. Novel graph processor architecture

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant