CN106484879B - A MapReduce-based method for aggregating Map-side data - Google Patents
- Publication number
- CN106484879B (application CN201610899802.XA; earlier publication CN106484879A)
- Authority
- CN
- China
- Prior art keywords
- key
- value
- aggregation
- memory
- map
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a MapReduce-based method for aggregating Map-side data. The method comprises a test phase and an aggregation phase. The test phase verifies whether the algorithm in the Map function used on the Map side is suitable for inner aggregation. Inner aggregation is performed in memory while the Map function is still computing: results are aggregated as soon as part of the data has been computed. Outer aggregation is performed after the Map function has computed all the data and stored the results on disk; the results are then loaded back into memory and aggregated. In the aggregation phase, if the test passes, the data computed on the Map side is aggregated by inner aggregation; if the test fails, it is aggregated by outer aggregation. Based on the characteristics of the data, the invention selects the appropriate aggregation mode while guaranteeing that the computed result is correct, which reduces both the number of I/O accesses and the traffic incurred in transmitting <Key, Value> pairs.
Description
Technical field
The present invention relates to a distributed computing method, and in particular to a MapReduce-based method for aggregating Map-side data.
Background art
When tackling big-data applications today, two ideas are central: parallelism and divide and conquer. A large data set is split sensibly into many small parts, and a computation that combines parallel processing with divide and conquer yields a reasonably satisfactory solution. MapReduce provides an effective and fast parallel programming framework for this purpose.
MapReduce implements two main functions: Map and Reduce. Map applies a function to every member of a collection and returns a result set based on that processing. Reduce classifies and summarizes the results of two or more Maps, where the result sets are produced in parallel by multiple threads, processes, or independent systems.
In the MapReduce model, the user defines the Map and Reduce functions. The input is a list of key-value pairs, where each key-value pair is a two-tuple (Key, Value) consisting of a key and a value; sorting and grouping are both based on the Key. The Map function takes key-value pairs as input and computes on each pair, and the result it produces is likewise a list of intermediate key-value pairs. Between Map and Reduce, this intermediate key-value list is grouped by key. The input of the Reduce function is the set of key-value groups formed by key. Because each group is independent, the groups can be processed in a distributed, massively parallel fashion, and the total input can far exceed the memory of a single MapReduce node.
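The model described above can be sketched in a few lines of Python. This is a minimal single-process illustration, not Hadoop code: the names `map_fn`, `reduce_fn`, and `run_mapreduce`, and the word-count task itself, are assumptions chosen for the example.

```python
from collections import defaultdict


def map_fn(key, value):
    """Emit one intermediate (word, 1) pair per word in the input line."""
    return [(word, 1) for word in value.split()]


def reduce_fn(key, values):
    """Summarize all values that share one key."""
    return key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    # Map: every input pair yields a list of intermediate pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Group the intermediate list by key (done by the framework in practice).
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce: each group is independent, so groups could run in parallel.
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


records = [(0, "a b a"), (1, "b c")]  # hypothetical (line number, line text) input
counts = run_mapreduce(records, map_fn, reduce_fn)
```

Because each key group is reduced independently, the grouping step is the only point where pairs with the same key have to be brought together.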
In practice, however, processing speed does not grow in proportion to the resources supplied. For example, processing data with two hosts does not double the efficiency of one host; the gain is less than a factor of two. This is because, in big-data processing, getting data from where it is produced (the Map output) to where it is consumed (the Reduce input) is a critical problem: the transfer generally involves both I/O operations and network transmission, and both are relatively time-consuming. In Hadoop, intermediate results are generally written back to disk first and then sent to the Reduce side over the network. Therefore, to improve running efficiency and reduce traffic, the Map side should produce as few results as possible.
One way to address the excessive number of key-value pairs in the Map-side results is to introduce a Combine component that aggregates the intermediate results, which amounts to running a small Reduce. This, however, incurs two I/O operations: the Map output is written to disk as soon as it has been computed, and the data must then be read back from disk before Combine can aggregate it.
Another approach aggregates the key-value pairs of the Map-side results in memory: one computation, one I/O operation. The advantage of this approach is speed. Inner aggregation requires the algorithm in the Map function to retain intermediate results in memory. Not every algorithm, however, is suited to aggregating its results in memory, because in-memory aggregation takes place while the computation is still running. For an algorithm that depends on the order in which data is input, the final result under inner aggregation may differ from the result under outer aggregation. Before inner aggregation is used, it is therefore necessary to verify whether the algorithm in the Map function can be aggregated this way.
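The order dependence mentioned above can be made concrete with a small Python sketch. The two merge functions and the two input orders are hypothetical; they only illustrate that some merges give the same answer in any order while others do not.

```python
from functools import reduce


def fold(values, merge):
    """Merge values left to right, as a streaming aggregator would."""
    return reduce(merge, values)


def add(a, b):
    # Commutative and associative: the input order does not matter.
    return a + b


def pairwise_mean(a, b):
    # Not associative: the result depends on the order of arrival.
    return (a + b) / 2


arrival_order = [1, 2, 3]   # order in which values are computed
reloaded_order = [3, 2, 1]  # hypothetical order seen by a later aggregation pass

sum_matches = fold(arrival_order, add) == fold(reloaded_order, add)
mean_matches = fold(arrival_order, pairwise_mean) == fold(reloaded_order, pairwise_mean)
```

Here `sum_matches` is true, while `mean_matches` is false because the pairwise mean yields 2.25 for one order and 1.75 for the other. A merge of the second kind is the sort of algorithm that would need to be detected before inner aggregation is chosen.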
In summary, in current research on MapReduce-based Map-side aggregation mechanisms, inner aggregation (aggregation in memory) is implemented by users themselves when they write a MapReduce program; the basic technique is simply to set up temporary variables that store and merge the key-value pairs produced by Map. Outer aggregation is implemented by the Combine component provided in Hadoop. These existing aggregation methods are too simple: they cannot aggregate the key-value pairs produced by Map effectively in memory, and their efficiency is low.
Summary of the invention
The object of the present invention is to provide a MapReduce-based method for aggregating Map-side data that addresses problems such as the low efficiency of Map-side data aggregation and the excessive volume of Map-side results under the MapReduce computing framework.
The object of the present invention is achieved as follows:
(1) compute a result with outer aggregation and a result with inner aggregation;
(2) compare whether the two results are identical;
(3) if they are identical, perform inner aggregation; if not, perform outer aggregation.
The inner aggregation specifically comprises:
(3.1.1) build an inverted index over the <Key, Value> pairs: the index is built from the Key of each <Key, Value> pair read in and records <Key, Address>, where Address is the memory address of the <Key, Value> pair;
(3.1.2) build an index from Address to Count, perform matching, and merge the <Key, Value> pairs that match;
(3.1.3) before the next match, check whether memory is sufficient; if memory is insufficient, write the <Key, Value> pairs with small Count values back to disk; if memory is sufficient, check whether any <Key, Value> pairs remain uncomputed; if so, load them into memory, compute them, and return to (3.1.1); if none remain, terminate.
The outer aggregation specifically comprises:
(3.2.1) load the <Key, Value> pairs into memory, compute them, and write the results to disk, denoted S<Key,Value>;
(3.2.2) load S<Key,Value> from disk back into memory and perform the inner aggregation operation.
The beneficial effects of the present invention are:
(1) By building an inverted index, the invention speeds up retrieval and matching and allows <Key, Value> pairs to be aggregated effectively in memory. It also guards against memory overflow during inner aggregation: memory is checked during merging, and if it is insufficient, the <Key, Value> pairs that have been merged least often, i.e., those with the smallest Count values, are written back to disk first.
(2) Based on the characteristics of the data, the invention selects the appropriate aggregation mode while guaranteeing that the computed result is correct. This reduces the number of I/O accesses and also the number of <Key, Value> pairs generated, and hence the traffic incurred in transmitting them.
Brief description of the drawings
Fig. 1 is a flow chart of the test phase of the MapReduce-based Map-side data aggregation method.
Fig. 2 is a flow chart of the inner aggregation method of the MapReduce-based Map-side data aggregation method.
Fig. 3 is a flow chart of the outer aggregation method of the MapReduce-based Map-side data aggregation method.
Detailed description of the embodiments
The invention is further described below by way of example with reference to the accompanying drawings.
The present invention aims to solve the low efficiency of Map-side data aggregation and the excessive volume of Map-side results under the MapReduce computing framework. It proposes a MapReduce-based method for aggregating Map-side data that aggregates the key-value pairs produced by Map effectively in memory and, while ensuring that the output is correct, reduces the number of I/O accesses. It also reduces the number of key-value pairs generated, and hence the transmission traffic.
As shown in Fig. 1, the method comprises two phases: a test phase and an aggregation phase.
Test phase: this phase verifies whether the algorithm in the Map function used on the Map side is suitable for inner aggregation. The algorithms in some Map functions are sensitive to the order of their input data, which can lead to incorrect results, and the difference between inner and outer aggregation is precisely a difference in input order. Inner aggregation is performed in memory while the Map function is computing: results are aggregated as soon as part of the data has been computed. Outer aggregation takes place after the Map function has computed all the data and stored the results on disk; the results are then loaded back into memory and aggregated.
Aggregation phase: if the test passes, inner aggregation is performed, i.e., the data computed on the Map side is aggregated with the inner aggregation method; if the test fails, outer aggregation is performed, i.e., the data is aggregated with the outer aggregation method.
The specific steps of the two phases are as follows:
1. Test phase. The task of the test phase is to determine whether the data computed on the Map side can be aggregated with the inner aggregation method. A portion of the data to be computed is processed with both inner aggregation and outer aggregation, and the two results are compared. Because only a small amount of data is used, the test phase takes very little time and is negligible relative to the total MapReduce computation time. As shown in Fig. 1, the specific steps are:
(1) compute a result with outer aggregation and a result with inner aggregation;
(2) compare whether the two results are identical;
(3) if they are identical, perform inner aggregation; if not, perform outer aggregation.
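The three test-phase steps can be sketched as follows. This is an illustrative model only: it assumes that inner aggregation merges each batch as soon as it is computed and that outer aggregation merges once after all pairs are available; the function names, the batch size, and the sample data are hypothetical.

```python
def aggregate_in_batches(pairs, merge, batch_size):
    """Inner-style aggregation: merge each batch as soon as it is computed."""
    totals = {}
    for start in range(0, len(pairs), batch_size):
        partial = {}
        for key, value in pairs[start:start + batch_size]:
            partial[key] = merge(partial[key], value) if key in partial else value
        # Fold the partial result of this batch into the running totals.
        for key, value in partial.items():
            totals[key] = merge(totals[key], value) if key in totals else value
    return totals


def aggregate_after_all(pairs, merge):
    """Outer-style aggregation: merge only once every pair is available."""
    return aggregate_in_batches(pairs, merge, batch_size=max(len(pairs), 1))


def choose_strategy(sample, merge, batch_size=2):
    # Steps (1) and (2): compute both results on the sample and compare them.
    inner = aggregate_in_batches(sample, merge, batch_size)
    outer = aggregate_after_all(sample, merge)
    # Step (3): identical results allow inner aggregation; otherwise fall back.
    return "inner" if inner == outer else "outer"


sample = [("k", 1), ("k", 2), ("k", 3), ("k", 4)]
sum_choice = choose_strategy(sample, lambda a, b: a + b)
mean_choice = choose_strategy(sample, lambda a, b: (a + b) / 2)
```

With this sample, the additive merge passes the test and selects inner aggregation, while the pairwise-mean merge produces different results and falls back to outer aggregation. A passing test on a sample is evidence rather than proof that the full data set will behave the same way.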
2. Aggregation phase. This phase comprises inner aggregation and outer aggregation.
(1) Inner aggregation: the purpose of inner aggregation is to carry out the aggregation in memory. First, an inverted index <Key, Address> is built from the Key of each <Key, Value> pair in memory, where Address is the memory address of the pair. Second, so that the <Key, Value> pairs with few matches can be moved out of memory promptly when memory runs short, an index from Address to the match count Count is built. After matching, the matched <Key, Value> pairs are merged. Each time one inner aggregation of <Key, Value> pairs completes and new pairs are about to be loaded, memory is checked. If memory is sufficient and uncomputed <Key, Value> pairs remain, they are loaded into memory and the computation continues; if the current memory capacity is insufficient, the <Key, Value> pairs with small Count values are written back to disk. As shown in Fig. 2, the specific steps are:
1) Build an inverted index over the <Key, Value> pairs: the index is built from the Key of each pair read in and records <Key, Address>, where Address is the memory address of the pair.
2) Build an index from Address to Count, so that the <Key, Value> pairs with few matches can be moved out of memory promptly when memory runs short. Perform matching and merge the <Key, Value> pairs that match.
3) Before the next match, check whether memory is sufficient. If memory is insufficient, write the <Key, Value> pairs with small Count values back to disk. If memory is sufficient, check whether any <Key, Value> pairs remain uncomputed; if so, load them into memory, compute them, and return to step 1); if none remain, terminate.
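The inner aggregation steps above can be sketched as follows. This is a simplified model under stated assumptions: memory is a Python list whose positions stand in for addresses, "sufficient memory" is modelled as a fixed number of resident keys (`capacity`), and the disk is a plain list. None of these names or choices come from the patent text.

```python
def inner_aggregate(pairs, merge, capacity):
    """Aggregate pairs in memory, spilling low-Count entries when memory is full."""
    slots = []    # simulated memory: address -> [key, value], None once spilled
    index = {}    # inverted index: Key -> Address
    counts = {}   # Address -> Count (number of successful matches)
    spilled = []  # stands in for the disk

    for key, value in pairs:
        if key in index:
            # Match found through the index: merge and bump the Count.
            addr = index[key]
            slots[addr][1] = merge(slots[addr][1], value)
            counts[addr] += 1
            continue
        if len(index) >= capacity:
            # Memory is insufficient: write the entry with the smallest Count to disk.
            victim = min(counts, key=counts.get)
            victim_key, victim_value = slots[victim]
            spilled.append((victim_key, victim_value))
            del index[victim_key]
            del counts[victim]
            slots[victim] = None
        # Store the new pair and register it in both indexes.
        slots.append([key, value])
        addr = len(slots) - 1
        index[key] = addr
        counts[addr] = 0

    in_memory = {key: slots[addr][1] for key, addr in index.items()}
    return in_memory, spilled


pairs = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("a", 1)]
in_memory, spilled = inner_aggregate(pairs, lambda x, y: x + y, capacity=2)
```

In this run the frequently matched key stays resident while the key that never matched is the one written out. How spilled pairs are later combined with the in-memory results is left open here, since the text does not specify it.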
(2) Outer aggregation: the purpose of the <Key, Value> outer aggregation module is to carry out the aggregation in one pass after the Map function on the Map side has finished computing. As shown in Fig. 3, the specific steps are:
1) Load the <Key, Value> pairs into memory, compute them, and write the results to disk, denoted S<Key,Value>.
2) Load S<Key,Value> from disk back into memory and perform the inner aggregation operation.
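The two outer aggregation steps can be sketched as follows. A temporary JSON-lines file stands in for the on-disk set S<Key,Value>; the file format, the function names, and the word-count Map function are assumptions made for illustration, and the in-memory merge is reduced to a simple per-key fold.

```python
import json
import os
import tempfile


def outer_aggregate(records, map_fn, merge):
    """Compute all pairs, write them to disk, then reload and aggregate them."""
    # Step 1: compute every <Key, Value> pair and write the results to disk.
    with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as spill:
        for record in records:
            for key, value in map_fn(record):
                spill.write(json.dumps([key, value]) + "\n")
        path = spill.name

    # Step 2: load the stored pairs back into memory and aggregate them.
    totals = {}
    try:
        with open(path) as stored:
            for line in stored:
                key, value = json.loads(line)
                totals[key] = merge(totals[key], value) if key in totals else value
    finally:
        os.remove(path)
    return totals


def word_pairs(line):
    # Hypothetical Map function: one (word, 1) pair per word.
    return [(word, 1) for word in line.split()]


totals = outer_aggregate(["a b a", "b c"], word_pairs, lambda x, y: x + y)
```

The write in step 1 and the read in step 2 are the two I/O accesses that the text attributes to this path.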
A specific example: 1000 <Key, Value> pairs are randomly divided into 10 parts, so each Map side computes on 100 <Key, Value> pairs on average. With the traditional method, each Map side needs 2 I/O accesses, for 20 in total. With the improved method, each Map side needs only 1 I/O access in the best case, for 10 in total. Although some Map sides still need 2 I/O accesses, the overall average number of I/O accesses is 15. Moreover, where only 1 I/O access is needed, the time spent loading all the <Key, Value> pairs into memory and unloading them once is saved, which further improves the overall performance of the system.
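The I/O counts in this example can be written out as a short calculation. The symbol k, the number of Map sides (out of 10) that pass the test and use inner aggregation, is introduced here for illustration; the text does not state how the average of 15 was obtained, so the last line is one reading that is consistent with it.

```latex
\begin{align*}
I_{\text{traditional}} &= 10 \times 2 = 20, \\
I_{\text{best}}        &= 10 \times 1 = 10, \\
I(k)                   &= k \cdot 1 + (10 - k) \cdot 2 = 20 - k, \qquad 0 \le k \le 10, \\
I(5)                   &= 15 = \tfrac{1}{2}\,(10 + 20).
\end{align*}
```

Under this reading, 15 is the midpoint between the best case and the traditional case, which corresponds to half of the Map sides passing the test.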
Claims (1)
1. A MapReduce-based method for aggregating Map-side data, characterized in that:
(1) a result is computed with outer aggregation and a result is computed with inner aggregation;
(2) the two results are compared to determine whether they are identical;
(3) if they are identical, inner aggregation is performed; if not, outer aggregation is performed;
The inner aggregation specifically comprises:
(3.1.1) building an inverted index over the <Key, Value> pairs: the index is built from the Key of each <Key, Value> pair read in and records <Key, Address>, where Address is the memory address of the <Key, Value> pair;
(3.1.2) building an index from Address to the match count Count, performing matching, and merging the <Key, Value> pairs that match;
(3.1.3) before the next match, checking whether memory is sufficient; if memory is insufficient, writing the <Key, Value> pairs with small Count values back to disk; if memory is sufficient, checking whether any <Key, Value> pairs remain uncomputed; if so, loading them into memory, computing them, and returning to (3.1.1); if none remain, terminating;
The outer aggregation specifically comprises:
(3.2.1) loading the <Key, Value> pairs into memory, computing them, and writing the results to disk, denoted S<Key,Value>;
(3.2.2) loading S<Key,Value> from disk back into memory and performing the inner aggregation operation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610899802.XA CN106484879B (en) | 2016-10-14 | 2016-10-14 | A MapReduce-based method for aggregating Map-side data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610899802.XA CN106484879B (en) | 2016-10-14 | 2016-10-14 | A MapReduce-based method for aggregating Map-side data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106484879A (en) | 2017-03-08 |
CN106484879B (en) | 2019-08-06 |
Family
ID=58269694
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610899802.XA (Active; CN106484879B (en)) | A MapReduce-based method for aggregating Map-side data | 2016-10-14 | 2016-10-14 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106484879B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110399397A (en) * | 2018-04-19 | 2019-11-01 | 北京京东尚科信息技术有限公司 | A kind of data query method and system |
CN114265849B (en) * | 2022-02-28 | 2022-06-10 | 杭州广立微电子股份有限公司 | Data aggregation method and system |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101183368A (en) * | 2007-12-06 | 2008-05-21 | 华南理工大学 | Method and system for distributed calculating and enquiring magnanimity data in on-line analysis processing |
CN103198099A (en) * | 2013-03-12 | 2013-07-10 | 南京邮电大学 | Cloud-based data mining application method facing telecommunication service |
CN103440246A (en) * | 2013-07-19 | 2013-12-11 | 百度在线网络技术(北京)有限公司 | Intermediate result data sequencing method and system for MapReduce |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103970520B (en) * | 2013-01-31 | 2017-06-16 | 国际商业机器公司 | Method for managing resource, device and architecture system in MapReduce frameworks |
Non-Patent Citations (1)
Title |
---|
《imapreduce: A distributed computing framework for iterative computation》;Yanfeng Zhang,Qixin Gao,et al.;《Journal of Grid Computing》;20120325;第10卷(第1期);47-68 |
Also Published As
Publication number | Publication date |
---|---|
CN106484879A (en) | 2017-03-08 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||