CN106484879B - A Map-side data aggregation method based on MapReduce

Info

Publication number: CN106484879B
Application number: CN201610899802.XA
Authority: CN
Inventors: 郭方方; 朱建文; 吕宏武; 王慧强; 冯光升; 刘慧姝
Original assignee: Harbin Engineering University
Current assignee: Harbin Engineering University
Priority date: 2016-10-14
Filing date: 2016-10-14
Publication date: 2019-08-06
Anticipated expiration: 2036-10-14
Also published as: CN106484879A

Abstract

The present invention provides a Map-side data aggregation method based on MapReduce, comprising a test phase and an aggregation phase. The test phase verifies whether the algorithm in the Map function used on the Map side is suitable for inner aggregation. Inner aggregation is performed in memory during the computation of the Map function: data is aggregated as soon as part of it has been computed. Outer aggregation is performed after the Map function has computed all the data and stored the results on disk; the results are then loaded back into memory and aggregated. In the aggregation phase, if the test passes, the data computed on the Map side is aggregated using inner aggregation; if the test fails, it is aggregated using outer aggregation. Based on the characteristics of the data, and on the premise that the computed result is correct, the invention selects the appropriate aggregation mode, reducing both the number of I/O accesses and the communication volume of transmitted <Key, Value> pairs.

Description

A Map-side data aggregation method based on MapReduce

Technical field

The present invention relates to a distributed computing method, and in particular to a Map-side data aggregation method based on MapReduce.

Background technique

Two ideas are central to processing big data applications today: parallelism and divide and conquer. By reasonably splitting large-scale data into many small parts and combining parallel computation with divide and conquer, a problem can be solved relatively satisfactorily. MapReduce provides an effective and fast parallel programming framework for this purpose.

MapReduce implements two major functions: Map and Reduce. Map applies a function to every member of a collection and returns a result set based on that processing. Reduce classifies and summarizes the result sets of two or more Maps that were processed in parallel by multiple threads, processes, or independent systems.

In the MapReduce model, the user defines the Map and Reduce functions. The input is a list of key-value pairs, where each key-value pair is a two-tuple (Key, Value) consisting of a key and a value; sorting and grouping are both based on Key. The input of the Map function is a key-value pair; each key-value pair is computed, and the result is again a list of intermediate key-value pairs. Between Map and Reduce, this intermediate key-value list is grouped by key. The input of the Reduce function is the key-value pair groups formed by key. Each group is independent, so the groups can be processed in a distributed, massively parallel manner, and the total input can be far larger than the memory of a single MapReduce node.

In practice, however, processing speed is not directly proportional to the resources supplied. For example, processing data with 2 hosts does not double the efficiency of 1 host; the improvement is less than 1x. The reason is that, in big data processing, how data is transferred from the producing side (Map output) to the consuming side (Reduce input) is a critical issue: the transfer generally involves I/O operations and network transmission, both of which are relatively time-consuming. In Hadoop, intermediate results are generally first written to disk and then sent to the Reduce side over the network. Therefore, to improve algorithm efficiency and reduce communication volume, the Map side should produce as few results as possible.

To address the excessive number of key-value pairs in the Map-side results, one method is to introduce a Combine component that aggregates the intermediate results, which is equivalent to running a small Reduce. However, this incurs two I/O operations: the Map output is written to disk immediately after it is computed, and the data on disk is then read back to be aggregated by Combine.

Another method aggregates the key-value pairs of the Map-side results in memory: one computation, one I/O. The advantage of this method is speed. Inner aggregation requires the algorithm in the Map function to keep its intermediate results in memory. However, not every algorithm is suitable for aggregating results in memory. Because in-memory aggregation takes place during the computation, for an algorithm that depends on the input order of the data the final result with inner aggregation may differ from the result with outer aggregation. Therefore, before performing inner aggregation, it must be verified whether the algorithm in the Map function can use inner aggregation.
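The contrast between emitting every pair and aggregating in memory can be illustrated with a minimal Python sketch. It is not taken from the patent: the word-count workload and the names `map_plain` and `map_in_memory` are assumptions chosen only for illustration.

```python
from collections import Counter

def map_plain(line):
    # Emit one (word, 1) pair per word; nothing is merged on the Map side.
    return [(word, 1) for word in line.split()]

def map_in_memory(lines):
    # Keep intermediate results in memory and merge pairs as they are produced.
    counts = Counter()
    for line in lines:
        for word in line.split():
            counts[word] += 1
    return sorted(counts.items())

lines = ["a b a", "b a c"]
plain = [pair for line in lines for pair in map_plain(line)]
combined = map_in_memory(lines)

print(len(plain))   # 6 pairs leave the Map side without aggregation
print(combined)     # [('a', 3), ('b', 2), ('c', 1)]
```

Word counting is order-insensitive, so both approaches lead to the same final counts; the in-memory version simply hands fewer pairs to the next stage.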

In conclusion interior polymerization is i.e. in memory at present in the research to the end the Map flowcollector aggregation scheme FlowCollector based on MapReduce Middle polymerization, user oneself realizes that basic skills is that temporary variable is simply provided to deposit when being by writing MapReduce program Storage merges the key-value pair after Map is calculated, what outer polymerization was realized by the Combine component provided in Hadoop.Existing is poly- Conjunction method is too simple, and the key-value pair after cannot effectively calculating in memory Map polymerize, low efficiency.

Summary of the invention

The object of the present invention is to provide a Map-side data aggregation method based on MapReduce that can solve problems such as low Map-side aggregation efficiency and excessive Map-side results under the MapReduce computing framework.

The object of the present invention is achieved as follows:

(1) compute the corresponding results using outer aggregation and inner aggregation respectively;

(2) compare whether the two results are identical;

(3) if they are identical, perform inner aggregation; if they are not, perform outer aggregation;

The inner aggregation specifically comprises:

(3.1.1) establish an inverted index of <Key, Value>: build an inverted index from the Key of each <Key, Value> read, recording <Key, Address> in the index, where Address is the address of the <Key, Value> in memory;

(3.1.2) establish for Address an index pointing to Count, perform matching, and merge the successfully matched <Key, Value> pairs;

(3.1.3) before the next matching, check whether memory is sufficient; if memory is insufficient, write the <Key, Value> pairs in memory with small Count values back to disk; if memory is sufficient, check whether there are still uncomputed <Key, Value> pairs; if there are, load them into memory, compute them, and return to (3.1.1); if there are none, end;

The outer aggregation specifically comprises:

(3.2.1) load the <Key, Value> pairs into memory for computation and write the results to disk, denoted S_<Key,Value>;

(3.2.2) load S_<Key,Value> from disk back into memory and perform the inner aggregation operation.

The beneficial effects of the present invention are:

(1) By establishing an inverted index, the invention improves retrieval speed and allows <Key, Value> pairs to be aggregated effectively in memory; the index also speeds up matching. The invention additionally guards against memory overflow during inner aggregation: during merging it checks whether memory is sufficient and, if it is not, first writes back to disk the <Key, Value> pairs that have been merged fewer times, i.e. those with smaller Count values.

(2) Based on the characteristics of the data, and on the premise that the computed result is correct, the invention selects the appropriate aggregation mode. This reduces the number of I/O accesses and the number of <Key, Value> pairs generated, thereby reducing the communication volume of transmitted <Key, Value> pairs.

Brief description of the drawings

Fig. 1 is a flowchart of the test phase of the Map-side data aggregation method based on MapReduce.

Fig. 2 is a flowchart of the inner aggregation method of the Map-side data aggregation method based on MapReduce.

Fig. 3 is a flowchart of the outer aggregation method of the Map-side data aggregation method based on MapReduce.

Specific embodiment

The invention is further described below by way of example with reference to the accompanying drawings.

The present invention aims to solve the problems of low Map-side aggregation efficiency and excessive Map-side results under the MapReduce computing framework. It proposes a Map-side data aggregation method based on MapReduce that effectively aggregates in memory the key-value pairs computed by Map. While ensuring that the output is correct, it reduces the number of I/O accesses and the number of key-value pairs generated, thereby reducing the communication volume.

As shown in Fig. 1, the method comprises two phases: a test phase and an aggregation phase.

Test phase: the test phase verifies whether the algorithm in the Map function used on the Map side is suitable for inner aggregation. Some Map-function algorithms are sensitive to the input order of the data, which may lead to incorrect results, and the difference between inner and outer aggregation is precisely a difference in the input order of the data. Inner aggregation is performed in memory during the computation of the Map function, aggregating as soon as part of the data has been computed. Outer aggregation is performed after the Map function has computed all the data and stored the results on disk; the results are then loaded back into memory and aggregated.

Aggregation phase: if the test passes, inner aggregation is performed, i.e. the data computed on the Map side is aggregated using the inner aggregation method; if the test fails, outer aggregation is performed, i.e. the data computed on the Map side is aggregated using the outer aggregation method.

The specific steps of the two phases are as follows:

1. Test phase. The task of the test phase is to test whether the data computed on the Map side can be aggregated with the inner aggregation method. Specifically, part of the data to be computed is processed by both the inner aggregation method and the outer aggregation method, and the two results are compared to see whether they are identical. Because little data is used, the test phase takes very little time and is negligible relative to the total computation time of the whole MapReduce job. As shown in Fig. 1, the specific steps are as follows:

(1) Compute the corresponding results using outer aggregation and inner aggregation respectively.

(2) Compare whether the two results are identical.

(3) If they are identical, perform inner aggregation; if they are not, perform outer aggregation.
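The three test steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the patented implementation: the function names are hypothetical, and sorting the sample stands in for the different order in which pairs arrive once they have been written to disk and reloaded.

```python
def aggregate_streaming(pairs, merge):
    # Inner-style aggregation: merge each pair in the order it is produced.
    result = {}
    for key, value in pairs:
        result[key] = merge(result[key], value) if key in result else value
    return result

def aggregate_after_reload(pairs, merge):
    # Outer-style aggregation: all pairs are materialized first, then merged.
    # Assumption: the reload order differs from arrival order; sorted() models that.
    return aggregate_streaming(sorted(pairs), merge)

def choose_mode(sample, merge):
    # Steps (1)-(3): compute both results, compare them, pick the mode.
    same = aggregate_streaming(sample, merge) == aggregate_after_reload(sample, merge)
    return "inner" if same else "outer"

sample = [("a", 3), ("b", 2), ("a", 1)]
print(choose_mode(sample, lambda old, new: old + new))  # inner (sum ignores order)
print(choose_mode(sample, lambda old, new: new))        # outer (last value wins)
```

A sum is unaffected by the order of its inputs, so the sample passes and inner aggregation is selected; a "last value wins" merge depends on order, so the sample fails and outer aggregation is selected.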

2. Aggregation phase. This phase includes inner aggregation and outer aggregation.

(1) Inner aggregation: the role of inner aggregation is to carry out the aggregation operation in memory. First, an inverted index <Key, Address> is built on the Key of each <Key, Value> loaded into memory, where Address is the address of the <Key, Value> in memory. Second, so that <Key, Value> pairs with few matches can be promptly swapped out of memory when memory runs short, an index of the match count Count is established for Address. After matching, the matched <Key, Value> pairs are merged. Each time the inner aggregation of a <Key, Value> is completed and a new <Key, Value> is to be loaded into memory, the memory size is checked. If memory is sufficient, check whether there are still uncomputed <Key, Value> pairs and, if so, load them into memory and continue computing; if the current memory capacity is insufficient, write the <Key, Value> pairs in memory with small Count values back to disk. As shown in Fig. 2, the specific steps are as follows:

1) Establish an inverted index of <Key, Value>: build an inverted index from the Key of each <Key, Value> read, recording <Key, Address> in the index, where Address is the address of the <Key, Value> in memory.

2) Establish for Address an index pointing to Count: so that <Key, Value> pairs with few matches can be promptly swapped out of memory when memory runs short, an index of the match count Count is established for Address.

Perform matching and merge the successfully matched <Key, Value> pairs.

3) Before the next matching, check whether memory is sufficient. If memory is insufficient, write the <Key, Value> pairs in memory with small Count values back to disk. If memory is sufficient, check whether there are still uncomputed <Key, Value> pairs; if there are, load them into memory, compute them, and return to 1); if there are none, end.
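The inner aggregation steps above can be sketched in Python as follows. This is a simplified illustration, not the patented implementation, and it rests on several assumptions: memory is modeled as a dictionary of integer slots, "memory is sufficient" is reduced to a hypothetical `capacity` limit on the number of slots, the disk is a plain list, and only the single entry with the smallest Count is written back at a time.

```python
def inner_aggregate(pairs, merge, capacity):
    """Aggregate pairs in memory, spilling low-Count entries when memory is full.

    Returns (in_memory, spilled): the merged pairs still in memory and the
    pairs written back to "disk".
    """
    store = {}     # Address -> [key, value]; stands in for memory
    index = {}     # Key -> Address (the inverted index of step 1)
    count = {}     # Address -> Count (the match-count index of step 2)
    spilled = []   # stand-in for the disk
    next_address = 0

    for key, value in pairs:
        address = index.get(key)
        if address is not None:
            # Successful match: merge in place and bump the match count.
            store[address][1] = merge(store[address][1], value)
            count[address] += 1
            continue
        if len(store) >= capacity:
            # Step 3: memory is insufficient, write back the smallest Count.
            victim = min(count, key=count.get)
            victim_key, victim_value = store.pop(victim)
            del index[victim_key], count[victim]
            spilled.append((victim_key, victim_value))
        index[key] = next_address
        store[next_address] = [key, value]
        count[next_address] = 0
        next_address += 1

    return {k: v for k, v in store.values()}, spilled

pairs = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("a", 1)]
in_memory, spilled = inner_aggregate(pairs, lambda x, y: x + y, capacity=2)
print(in_memory)  # {'a': 3, 'c': 1}
print(spilled)    # [('b', 1)]
```

In this run the frequently matched key `a` stays in memory, while `b`, which was never matched, is the entry written back when `c` arrives.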

(2) Outer aggregation: the role of the <Key, Value> outer aggregation module is to perform the aggregation operation in a single pass after the Map function on the Map side has finished computing. As shown in Fig. 3, the specific steps are as follows:

1) Load the <Key, Value> pairs into memory for computation and write the results to disk, denoted S_<Key,Value>.

2) Load S_<Key,Value> from disk back into memory and perform the inner aggregation operation.
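The two outer aggregation steps can be sketched in Python as follows. This is an illustration only: `map_fn`, the temporary JSON-lines file standing in for S_<Key,Value>, and the plain dictionary merge used in place of the full inner aggregation procedure are all assumptions, not details from the patent.

```python
import json
import os
import tempfile

def outer_aggregate(records, map_fn, merge):
    # Step 1: compute every pair and write the results to disk (S_<Key,Value>).
    with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as spill:
        path = spill.name
        for record in records:
            for key, value in map_fn(record):
                spill.write(json.dumps([key, value]) + "\n")

    # Step 2: load the stored pairs back into memory and aggregate them.
    result = {}
    try:
        with open(path) as stored:
            for line in stored:
                key, value = json.loads(line)
                result[key] = merge(result[key], value) if key in result else value
    finally:
        os.remove(path)  # clean up the temporary spill file
    return result

def word_pairs(line):
    # Hypothetical Map function: one (word, 1) pair per word.
    return [(word, 1) for word in line.split()]

totals = outer_aggregate(["a b a", "b a c"], word_pairs, lambda x, y: x + y)
print(totals)  # {'a': 3, 'b': 2, 'c': 1}
```

The write in step 1 and the read in step 2 are the two I/O accesses that outer aggregation costs each Map side.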

Specific example are as follows: by 1000<Key, Value>be randomly divided into 10 parts, each end Map progress 100 that is averaged< Key, Value > relevant calculation, using traditional method, then each end Map needs the access of 2 I/O, i.e., needs 20 times altogether I/O access, using improved method then each end Map under best-case, it is only necessary to the access of 1 I/O, i.e., altogether need 10 Secondary I/O access.Although partial picture is the access for needing 2 I/O, the Average visits of overall I/O are 15 times.Simultaneously In the case where only needing 1 access I/O, 1 all<Key can be saved, when Value>load and unload in memory is consumed The time taken, the available further promotion of the overall performance of system.

Claims

1. A Map-side data aggregation method based on MapReduce, characterized in that:

(2) compare whether the two results are identical;

The inner aggregation specifically comprises:

(3.1.1) establish an inverted index of <Key, Value>: build an inverted index from the Key of each <Key, Value> read, recording <Key, Address> in the index, where Address is the address of the <Key, Value> in memory;

(3.1.2) establish for Address an index pointing to Count, i.e. an index of the match count Count, perform matching, and merge the successfully matched <Key, Value> pairs;

(3.1.3) before the next matching, check whether memory is sufficient; if memory is insufficient, write the <Key, Value> pairs in memory with small Count values back to disk; if memory is sufficient, check whether there are still uncomputed <Key, Value> pairs; if there are, load them into memory, compute them, and return to (3.1.1); if there are none, end;

The outer aggregation specifically comprises: