CN106484879B - A Map-side data aggregation method based on MapReduce - Google Patents

A Map-side data aggregation method based on MapReduce

Info

Publication number
CN106484879B
Authority
CN
China
Prior art keywords
key
value
aggregation
memory
map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610899802.XA
Other languages
Chinese (zh)
Other versions
CN106484879A (en)
Inventor
郭方方
朱建文
吕宏武
王慧强
冯光升
刘慧姝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University
Priority to CN201610899802.XA
Publication of CN106484879A
Application granted
Publication of CN106484879B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a Map-side data aggregation method based on MapReduce, comprising a test phase and an aggregation phase. The test phase verifies whether the algorithm in the Map function used on the Map side is suitable for inner aggregation. Inner aggregation is performed in memory during the computation of the Map function: results are aggregated as soon as part of the data has been computed. Outer aggregation is performed after the Map function has computed all of the data and written the results to disk: the results are then loaded back into memory and aggregated. In the aggregation phase, if the test passes, the data computed on the Map side is aggregated with the inner aggregation method; if the test fails, it is aggregated with the outer aggregation method. Based on the characteristics of the data, and on the premise that the computed result remains correct, the invention selects the appropriate aggregation mode, reducing both the number of I/O accesses and the communication volume of transmitted <Key, Value> pairs.

Description

A Map-side data aggregation method based on MapReduce
Technical field
The present invention relates to a distributed computing method, and in particular to a Map-side data aggregation method based on MapReduce.
Background art
Two ideas are important when handling big data applications today: parallelism and divide and conquer. Large-scale data is split sensibly into many small parts and processed with a method that combines the two ideas, so that the problem is solved reasonably well. MapReduce provides an effective and fast parallel programming framework.
MapReduce implements two main functions: Map and Reduce. Map applies a function to every member of a set and returns a result set based on that processing. Reduce classifies and summarizes the result sets produced by two or more Maps, which may be processed in parallel by multiple threads, processes, or independent systems.
In the MapReduce model, the user defines the Map and Reduce functions. The input is a list of key-value pairs, where each key-value pair is a tuple (Key, Value) consisting of a key and a value; sorting and grouping are both based on Key. The input of the Map function is key-value pairs; it computes on each pair and produces an intermediate list of key-value pairs. Between Map and Reduce, this intermediate list is grouped by key. The input of the Reduce function is the key-value pairs grouped by key. Each group is independent, so the groups can be processed in a distributed, massively parallel fashion, and the total input can far exceed the memory of a single MapReduce node.
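The model described above can be sketched in a few lines of Python. This is only an illustrative single-process sketch: the names `run_mapreduce`, `word_count_map`, and `word_count_reduce` are hypothetical and do not come from the patent or from Hadoop.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process illustration of the Map -> group by Key -> Reduce flow."""
    # Map: every input (Key, Value) pair yields intermediate (Key, Value) pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    # Grouping: the intermediate list is gathered by Key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce: each group is independent, so groups could be handled in parallel.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

def word_count_map(_, line):
    # Hypothetical Map function: emit (word, 1) for every word in the line.
    return [(word, 1) for word in line.split()]

def word_count_reduce(_, counts):
    # Hypothetical Reduce function: sum the counts of one word.
    return sum(counts)

result = run_mapreduce([(0, "a b a"), (1, "b c")], word_count_map, word_count_reduce)
```

Here `result` maps each word to its total count; in a real deployment the grouping step is where intermediate pairs cross the network, which is why the number of intermediate pairs matters.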
In practice, however, processing speed is not proportional to the resources invested. For example, processing data with 2 hosts is not twice as fast as with 1 host; the speedup is smaller. This is because, in big data processing, moving data from where it is produced (the Map output) to where it is consumed (the Reduce input) is an important problem: the transfer generally involves I/O operations and network transmission, both of which are time-consuming. In Hadoop, intermediate results are generally first written to disk and then sent to the Reduce side over the network. Therefore, to improve running efficiency and reduce communication volume, the Map side should produce as few results as possible.
To address the excess of key-value pairs in the Map-side results, one method is to introduce a Combine component that aggregates the intermediate results, which amounts to a small Reduce. This, however, requires two I/O passes: the Map output is written to disk as soon as it is computed, and the data is then read back from disk to be aggregated by Combine.
Another method aggregates the key-value pairs of the Map-side results in memory: one computation, one I/O. Its advantage is speed. Inner aggregation requires the algorithm in the Map function to keep intermediate results in memory. Not every algorithm is suitable for aggregating results in memory, however, because the in-memory aggregation takes place during the computation. For an algorithm that depends on the input order of the data, the final result under inner aggregation may differ from the result under outer aggregation. Therefore, before inner aggregation is used, it must be verified whether the algorithm in the Map function supports it.
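To make the order dependence concrete, the following sketch contrasts a merge operation that is insensitive to input order (addition) with one that is not (a pairwise mean). The pairwise mean is an example chosen here for illustration; the patent does not name a specific order-sensitive algorithm.

```python
from functools import reduce

def merge_sum(a, b):
    # Associative and commutative: any merge order gives the same total.
    return a + b

def merge_pairwise_mean(a, b):
    # Not associative: merging step by step depends on the order of arrival.
    return (a + b) / 2

values = [1.0, 2.0, 3.0]

sum_forward = reduce(merge_sum, values)
sum_backward = reduce(merge_sum, reversed(values))

mean_forward = reduce(merge_pairwise_mean, values)
mean_backward = reduce(merge_pairwise_mean, reversed(values))
```

With addition, both orders give the same total, so aggregating incrementally in memory is safe. With the pairwise mean, the two orders give different values, which is the kind of discrepancy a check before inner aggregation is meant to catch.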
In summary, in current research on Map-side aggregation mechanisms for MapReduce, inner aggregation, that is, aggregation in memory, is implemented by users themselves when writing the MapReduce program; the basic approach is simply to set up temporary variables that store and merge the key-value pairs computed by Map. Outer aggregation is implemented by the Combine component provided in Hadoop. Existing aggregation methods are too simple, cannot effectively aggregate in memory the key-value pairs computed by Map, and are inefficient.
Summary of the invention
The purpose of the present invention is to provide a Map-side data aggregation method based on MapReduce that addresses problems such as inefficient Map-side data aggregation and excessive Map-side results under the MapReduce computing framework.
The purpose of the present invention is achieved as follows:
(1) compute the corresponding results with the outer aggregation method and the inner aggregation method respectively;
(2) compare whether the two results are identical;
(3) if they are identical, perform inner aggregation; if they are not, perform outer aggregation;
The inner aggregation specifically comprises:
(3.1.1) building an inverted index of <Key, Value>: an inverted index is built from the Key of each <Key, Value> read in, and the index records <Key, Address>, where Address is the address of the <Key, Value> in memory;
(3.1.2) building an index from Address to Count, where Count is the number of matches; matching is performed and successfully matched <Key, Value> pairs are merged;
(3.1.3) before the next match, checking whether memory is sufficient; if memory is insufficient, the <Key, Value> pairs in memory with small Count values are written back to disk; if memory is sufficient, checking whether uncomputed <Key, Value> pairs remain; if so, the uncomputed <Key, Value> pairs are loaded into memory and computed, and execution returns to (3.1.1); if not, the procedure ends;
The outer aggregation specifically comprises:
(3.2.1) loading <Key, Value> into memory for computation and writing the results to disk, denoted S<Key,Value>;
(3.2.2) loading S<Key,Value> from disk back into memory and performing the inner aggregation operation.
The beneficial effects of the present invention are as follows:
(1) By building an inverted index, the invention speeds up retrieval and lets <Key, Value> pairs be aggregated effectively in memory; the index also speeds up matching. The invention also guards against memory overflow during inner aggregation: memory is checked during merging, and if it is insufficient, the <Key, Value> pairs that have been merged the fewest times, that is, those with the smallest Count values, are written back to disk first.
(2) Based on the characteristics of the data, and on the premise that the computed result remains correct, the invention selects the appropriate aggregation mode. This reduces the number of I/O accesses and the number of <Key, Value> pairs generated, which in turn reduces the communication volume of transmitted <Key, Value> pairs.
Brief description of the drawings
Fig. 1 is a flowchart of the test phase of the Map-side data aggregation method based on MapReduce.
Fig. 2 is a flowchart of the inner aggregation method of the Map-side data aggregation method based on MapReduce.
Fig. 3 is a flowchart of the outer aggregation method of the Map-side data aggregation method based on MapReduce.
Detailed description of embodiments
The invention is further described below by way of example with reference to the accompanying drawings.
The present invention aims to solve the problems of inefficient Map-side data aggregation and excessive Map-side results under the MapReduce computing framework. It proposes a Map-side data aggregation method based on MapReduce that effectively aggregates in memory the key-value pairs computed by Map. While ensuring that the output is correct, it reduces the number of I/O accesses and the number of key-value pairs generated, and thereby the communication volume.
As shown in Fig. 1, the method comprises two phases: a test phase and an aggregation phase.
Test phase: the test phase verifies whether the algorithm in the Map function used on the Map side is suitable for inner aggregation. The algorithms in some Map functions are sensitive to the order of the input data, which may make the result incorrect, and the difference between inner and outer aggregation is precisely a difference in the order in which data is input. Inner aggregation is performed in memory during the computation of the Map function: results are aggregated as soon as part of the data has been computed. Outer aggregation is performed after the Map function has computed all of the data and written the results to disk: the results are then loaded back into memory and aggregated.
Aggregation phase: if the test passes, inner aggregation is performed, that is, the data computed on the Map side is aggregated with the inner aggregation method; if the test fails, outer aggregation is performed, that is, the data computed on the Map side is aggregated with the outer aggregation method.
The specific steps of the two phases are as follows:
1. Test phase. The task of the test phase is to test whether the data computed on the Map side can be aggregated with the inner aggregation method. Specifically, a portion of the data to be computed is processed by both the inner and the outer aggregation method, and the two results are compared for equality. Because little data is used, the test phase takes very little time and is negligible relative to the total MapReduce computation time. As shown in Fig. 1, the specific steps are as follows:
(1) Compute the corresponding results with the outer aggregation method and the inner aggregation method respectively.
(2) Compare whether the two results are identical.
(3) If they are identical, perform inner aggregation; if they are not, perform outer aggregation.
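The three test steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the patented implementation: `sample` stands for Map output computed from a small portion of the data, inner aggregation is modelled as folding results batch by batch into a running value, and the names `inner_aggregate`, `outer_aggregate`, `choose_aggregation`, and `batch_size` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def group_by_key(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def outer_aggregate(pairs, aggregate_fn):
    # All Map output is available first, then each Key is aggregated once.
    return {k: aggregate_fn(vs) for k, vs in group_by_key(pairs).items()}

def inner_aggregate(pairs, aggregate_fn, batch_size=2):
    # Map output is folded into the running result as each batch is computed.
    merged = {}
    for start in range(0, len(pairs), batch_size):
        batch = pairs[start:start + batch_size]
        for k, vs in group_by_key(batch).items():
            merged[k] = aggregate_fn([merged[k]] + vs) if k in merged else aggregate_fn(vs)
    return merged

def choose_aggregation(sample, aggregate_fn):
    # Steps (1)-(3): compute both ways, compare, then pick the mode.
    same = inner_aggregate(sample, aggregate_fn) == outer_aggregate(sample, aggregate_fn)
    return "inner" if same else "outer"

sample = [("a", 1), ("a", 2), ("a", 3), ("b", 4)]
choice_for_sum = choose_aggregation(sample, sum)
choice_for_mean = choose_aggregation(sample, mean)
```

On this sample, summation gives identical results either way and selects inner aggregation, while the mean does not and falls back to outer aggregation. A sample-based check of this kind can only show agreement on the sample itself, so the sample should be representative of the full input.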
2. Aggregation phase. This comprises the inner aggregation method and the outer aggregation method.
(1) Inner aggregation method: inner aggregation moves the aggregation operation into memory. First, an inverted index <Key, Address> is built on the Key of each <Key, Value> loaded into memory, where Address is the address of the <Key, Value> in memory. Second, so that the <Key, Value> pairs with few matches can be moved out of memory promptly when memory runs short, an index of the match count Count is built for each Address. After matching, the matched <Key, Value> pairs are merged. Each time the inner aggregation of one <Key, Value> completes and a new <Key, Value> is to be loaded into memory, the memory capacity is checked. If memory is sufficient, the method checks whether uncomputed <Key, Value> pairs remain and, if so, loads them into memory and continues computing. If memory is insufficient, the <Key, Value> pairs in memory with small Count values are written back to disk. As shown in Fig. 2, the specific steps are as follows:
1) Build an inverted index of <Key, Value>: an inverted index is built from the Key of each <Key, Value> read in, and the index records <Key, Address>, where Address is the address of the <Key, Value> in memory.
2) Build an index from Address to Count: so that the <Key, Value> pairs with few matches can be moved out of memory promptly when memory runs short, an index of the match count Count is built for each Address.
Matching is then performed, and successfully matched <Key, Value> pairs are merged.
3) Before the next match, check whether memory is sufficient. If memory is insufficient, write the <Key, Value> pairs in memory with small Count values back to disk. If memory is sufficient, check whether uncomputed <Key, Value> pairs remain; if so, load them into memory, compute them, and return to 1); if not, the procedure ends.
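Steps 1) to 3) can be sketched roughly as below. Several modelling choices are assumptions made for this sketch rather than details from the patent: an Address is an integer slot number, "memory is sufficient" means fewer than `max_entries` pairs are held, "disk" is a Python list, and `spill_fraction` decides how many low-Count pairs are written back at once. The class name `InnerAggregator` is hypothetical.

```python
class InnerAggregator:
    """In-memory aggregation with a Key -> Address index and an Address -> Count index."""

    def __init__(self, merge_fn, max_entries, spill_fraction=0.5):
        self.merge_fn = merge_fn
        self.max_entries = max_entries        # assumed stand-in for the memory limit
        self.spill_fraction = spill_fraction  # assumed share of pairs spilled at once
        self.store = {}   # Address -> [Key, Value]
        self.index = {}   # inverted index: Key -> Address
        self.count = {}   # Address -> Count (number of successful matches)
        self.next_address = 0
        self.disk = []    # stand-in for pairs written back to disk

    def add(self, key, value):
        address = self.index.get(key)
        if address is not None:
            # Successful match: merge into the pair already held in memory.
            entry = self.store[address]
            entry[1] = self.merge_fn(entry[1], value)
            self.count[address] += 1
            return
        # A new pair needs room, so check memory before loading it.
        if len(self.store) >= self.max_entries:
            self._spill()
        address = self.next_address
        self.next_address += 1
        self.store[address] = [key, value]
        self.index[key] = address
        self.count[address] = 0

    def _spill(self):
        # Write back the pairs with the smallest Count values first.
        n = max(1, int(len(self.store) * self.spill_fraction))
        victims = sorted(self.count, key=lambda a: (self.count[a], a))[:n]
        for address in victims:
            key, value = self.store.pop(address)
            self.disk.append((key, value))
            del self.index[key]
            del self.count[address]

    def result(self):
        return sorted(tuple(entry) for entry in self.store.values())


agg = InnerAggregator(merge_fn=lambda a, b: a + b, max_entries=2)
for key, value in [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("a", 1)]:
    agg.add(key, value)
in_memory = agg.result()
spilled = agg.disk
```

In this run the frequently matched key `a` stays in memory while the never-matched key `b` is written back when `c` arrives. A spilled key that reappears later would produce a second partial pair, which a downstream Reduce would still need to combine.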
(2) Outer aggregation method: the <Key, Value> outer aggregation module places the aggregation operation after the Map function on the Map side has finished computing, where it is carried out in one pass. As shown in Fig. 3, the specific steps are as follows:
1) Load <Key, Value> into memory for computation and write the results to disk, denoted S<Key,Value>.
2) Load S<Key,Value> from disk back into memory and perform the inner aggregation operation.
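The two outer aggregation steps can be sketched as follows. The sketch assumes a temporary JSON-lines file as the on-disk form of S<Key,Value> and a simple dictionary merge for the final aggregation; the function name `outer_aggregate_via_disk` and its parameters are hypothetical.

```python
import json
import os
import tempfile

def outer_aggregate_via_disk(records, map_fn, merge_fn):
    # Step 1: run the Map computation and write every result to disk.
    with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as spill:
        for key, value in records:
            out_key, out_value = map_fn(key, value)
            spill.write(json.dumps([out_key, out_value]) + "\n")
        path = spill.name
    # Step 2: load the stored results back into memory and aggregate them.
    merged = {}
    try:
        with open(path) as stored:
            for line in stored:
                key, value = json.loads(line)
                merged[key] = merge_fn(merged[key], value) if key in merged else value
    finally:
        os.remove(path)
    return merged

totals = outer_aggregate_via_disk(
    [(0, "x"), (1, "y"), (2, "x")],
    map_fn=lambda _, word: (word, 1),
    merge_fn=lambda a, b: a + b,
)
```

The write in step 1 and the read in step 2 are the two I/O passes that inner aggregation avoids when the test phase allows it.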
A specific example: 1000 <Key, Value> pairs are randomly divided into 10 parts, so that each Map side processes 100 <Key, Value> pairs on average. With the traditional method, each Map side needs 2 I/O accesses, for 20 I/O accesses in total. With the improved method, each Map side needs only 1 I/O access in the best case, for 10 I/O accesses in total. Although some Map sides still need 2 I/O accesses, the overall average is 15 I/O accesses. Moreover, where only 1 I/O access is needed, the time otherwise spent on one extra round of loading all <Key, Value> pairs into memory and unloading them is saved, further improving overall system performance.
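The I/O counts in this example follow from simple arithmetic, sketched below. The function `total_io_accesses` is hypothetical, and the reading that the average of 15 corresponds to half of the Map sides passing the test is an assumption made here; the text does not state how the average was obtained.

```python
def total_io_accesses(map_tasks, inner_fraction):
    """Total I/O accesses when a given fraction of Map sides uses inner aggregation."""
    inner_tasks = map_tasks * inner_fraction
    outer_tasks = map_tasks - inner_tasks
    # Inner aggregation costs 1 I/O access per Map side, outer aggregation costs 2.
    return inner_tasks * 1 + outer_tasks * 2

traditional = total_io_accesses(10, 0.0)    # every Map side needs 2 accesses
best_case = total_io_accesses(10, 1.0)      # every Map side needs 1 access
half_and_half = total_io_accesses(10, 0.5)  # assumed split matching the stated average
```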

Claims (1)

1. A Map-side data aggregation method based on MapReduce, characterized in that:
(1) the corresponding results are computed with the outer aggregation method and the inner aggregation method respectively;
(2) the two results are compared to determine whether they are identical;
(3) if they are identical, inner aggregation is performed; if they are not, outer aggregation is performed;
The inner aggregation specifically comprises:
(3.1.1) building an inverted index of <Key, Value>: an inverted index is built from the Key of each <Key, Value> read in, and the index records <Key, Address>, where Address is the address of the <Key, Value> in memory;
(3.1.2) building an index from Address to Count, where Count is the number of matches; matching is performed and successfully matched <Key, Value> pairs are merged;
(3.1.3) before the next match, checking whether memory is sufficient; if memory is insufficient, the <Key, Value> pairs in memory with small Count values are written back to disk; if memory is sufficient, checking whether uncomputed <Key, Value> pairs remain; if so, the uncomputed <Key, Value> pairs are loaded into memory and computed, and execution returns to (3.1.1); if not, the procedure ends;
The outer aggregation specifically comprises:
(3.2.1) loading <Key, Value> into memory for computation and writing the results to disk, denoted S<Key,Value>;
(3.2.2) loading S<Key,Value> from disk back into memory and performing the inner aggregation operation.
CN201610899802.XA 2016-10-14 2016-10-14 A Map-side data aggregation method based on MapReduce Active CN106484879B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610899802.XA CN106484879B (en) 2016-10-14 2016-10-14 A Map-side data aggregation method based on MapReduce

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610899802.XA CN106484879B (en) 2016-10-14 2016-10-14 A Map-side data aggregation method based on MapReduce

Publications (2)

Publication Number Publication Date
CN106484879A CN106484879A (en) 2017-03-08
CN106484879B true CN106484879B (en) 2019-08-06

Family

ID=58269694

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610899802.XA Active CN106484879B (en) 2016-10-14 2016-10-14 A Map-side data aggregation method based on MapReduce

Country Status (1)

Country Link
CN (1) CN106484879B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110399397A (en) * 2018-04-19 2019-11-01 北京京东尚科信息技术有限公司 A kind of data query method and system
CN114265849B (en) * 2022-02-28 2022-06-10 杭州广立微电子股份有限公司 Data aggregation method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101183368A (en) * 2007-12-06 2008-05-21 华南理工大学 Method and system for distributed calculating and enquiring magnanimity data in on-line analysis processing
CN103198099A (en) * 2013-03-12 2013-07-10 南京邮电大学 Cloud-based data mining application method facing telecommunication service
CN103440246A (en) * 2013-07-19 2013-12-11 百度在线网络技术(北京)有限公司 Intermediate result data sequencing method and system for MapReduce

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103970520B (en) * 2013-01-31 2017-06-16 国际商业机器公司 Method for managing resource, device and architecture system in MapReduce frameworks

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101183368A (en) * 2007-12-06 2008-05-21 华南理工大学 Method and system for distributed calculating and enquiring magnanimity data in on-line analysis processing
CN103198099A (en) * 2013-03-12 2013-07-10 南京邮电大学 Cloud-based data mining application method facing telecommunication service
CN103440246A (en) * 2013-07-19 2013-12-11 百度在线网络技术(北京)有限公司 Intermediate result data sequencing method and system for MapReduce

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
《imapreduce: A distributed computing framework for iterative computation》;Yanfeng Zhang,Qixin Gao,et al.;《Journal of Grid Computing》;20120325;第10卷(第1期);47-68

Also Published As

Publication number Publication date
CN106484879A (en) 2017-03-08

Similar Documents

Publication Publication Date Title
WO2022105805A1 (en) Data processing method and in-memory computing chip
CN107360206A (en) A kind of block chain common recognition method, equipment and system
US9367359B2 (en) Optimized resource management for map/reduce computing
CN110308980A (en) Batch processing method, device, equipment and the storage medium of data
CN103581336B (en) Service flow scheduling method and system based on cloud computing platform
CN109510869A (en) A kind of Internet of Things service dynamic offloading method and device based on edge calculations
CN115150471B (en) Data processing method, apparatus, device, storage medium, and program product
CN109191287A (en) A kind of sharding method, device and the electronic equipment of block chain intelligence contract
CN106484879B (en) A kind of polymerization of the Map end data based on MapReduce
US11886969B2 (en) Dynamic network bandwidth in distributed deep learning training
CN104243531A (en) Data processing method, device and system
WO2020177488A1 (en) Method and device for blockchain transaction tracing
CN116227599A (en) Inference model optimization method and device, electronic equipment and storage medium
CN114742237A (en) Federal learning model aggregation method and device, electronic equipment and readable storage medium
CN108388471B (en) Management method based on double-threshold constraint virtual machine migration
US20230325149A1 (en) Data processing method and apparatus, computer device, and computer-readable storage medium
CN109412865A (en) A kind of virtual network resource allocation method, system and electronic equipment
CN111124439B (en) Intelligent dynamic unloading algorithm with cloud edge cooperation
CN108710538A (en) A kind of thread configuration method, computer readable storage medium and terminal device
CN109829678A (en) A kind of rollback processing method, device and electronic equipment
CN111951112A (en) Intelligent contract execution method based on block chain, terminal equipment and storage medium
CN107436812B (en) A kind of method and device of linux system performance optimization
EP4363970A1 (en) Method and system for resource governance in a multi-tenant system
CN107911484A (en) A kind of method and device of Message Processing
CN112527832B (en) Rule engine acceleration execution method, device, medium and equipment based on FPGA

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant