CN104503820B - Hadoop optimization method based on asynchronous startup - Google Patents

Hadoop optimization method based on asynchronous startup

Info

Publication number
CN104503820B
CN104503820B
Authority
CN
China
Prior art keywords
iteration
data
hadoop
map
tasks
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410757131.4A
Other languages
Chinese (zh)
Other versions
CN104503820A (en)
Inventor
赵淦森
邓运亨
王翔
何建涛
程庆年
周冠宇
周尚勤
王欣明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China Normal University
GCI Science and Technology Co Ltd
Original Assignee
South China Normal University
GCI Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China Normal University and GCI Science and Technology Co Ltd
Priority to CN201410757131.4A
Publication of CN104503820A
Application granted
Publication of CN104503820B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The invention discloses a Hadoop optimization method based on asynchronous job startup. By changing jobs from fully serial execution to partially concurrent execution, the entire job execution flow is optimized, the speed of iterative computation is greatly increased, and execution efficiency is effectively improved. The invention does not require modifying the underlying code, is easy to use, improves cluster utilization, and does not create a memory-space bottleneck. As a Hadoop optimization method based on asynchronous startup, the invention can be widely applied in Hadoop framework technology.

Description

Hadoop optimization method based on asynchronous startup
Technical field
The present invention relates to the field of computer technology, and in particular to a Hadoop optimization method based on asynchronous job startup.
Background technology
Current Hadoop processing of iterative computation suffers from the following performance bottlenecks:
Fully serial execution: each job must wait until the previous job is fully completed before it can start.
Long startup time: starting a job takes 10-15 seconds on average, which is a considerable waste of time.
Long reduce process: the reduce process computes the global center points and writes the results to HDFS; it takes about 10 seconds and is likewise a large time cost.
Randomly chosen initial center points: the initial center points have a strong influence on the k-means iteration; choosing better initial centers helps reduce the number of iterations and the total time.
Repeated I/O: the data read by each map end is essentially unchanged, yet it is read again every time a task starts.
Network bandwidth consumption: data may be read from other nodes instead of being processed locally, causing substantial network bandwidth consumption.
To address the above performance problems of Hadoop, many improvements to Hadoop iterative computation have appeared. However, most of them involve modifying Hadoop's underlying source code, so when a user needs to switch back to the original Hadoop mode, the entire Hadoop framework has to be replaced, which causes great inconvenience. In addition, some improved methods cache the repeatedly read data; although this avoids repeated I/O accesses, it also increases memory pressure on the nodes and introduces a new performance bottleneck, while changing the structure of Hadoop and losing its original characteristics.
Hadoop: a distributed big-data processing tool;
MapReduce: a distributed computing framework;
k-means algorithm: a classic partition-based clustering algorithm;
Iterative computation: computation that must be run multiple times before the iteration termination condition is reached;
HDFS: the Hadoop distributed file system.
Summary of the invention
To solve the above technical problems, the object of the present invention is to provide a Hadoop optimization method based on asynchronous startup that improves execution efficiency without requiring modification of the underlying code.
The technical solution adopted by the present invention is as follows:
A Hadoop optimization method based on asynchronous startup comprises the following steps:
A. uploading a data file to HDFS and dividing the data file into multiple data blocks;
B. copying and distributing the divided data blocks to different machines;
C. issuing a start instruction, submitting a MapReduce job, and distributing map tasks and MyReduce tasks;
D. executing the map tasks, running the Map function to process the data blocks, obtaining and sending intermediate result data, and starting the job of the next iteration;
E. executing the MyReduce tasks, receiving the intermediate result data and processing it to obtain the current iteration result, while step D is executed in the job of the next iteration;
F. judging, according to the iteration result, whether the iteration termination condition has been reached; if so, ending the iteration; otherwise, returning to step E to serve the job of the next iteration.
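To make the flow of steps A-F concrete, the following is a minimal driver sketch under stated assumptions: it uses the Hadoop 1.x "old" API (the experiments below use Hadoop 1.2.1), the class names AsyncKMeansDriver, LocalCenterMapper and MyReduceServer are illustrative (the latter two are sketched later in the detailed description), the configuration key "kmeans.prev.centers" and the 50% map-progress threshold for pre-starting the next job are only examples. The key point is that submitJob() returns immediately, so the job of iteration i+1 can be started while iteration i is still running.

```java
// Minimal sketch of an asynchronous job-chaining driver (assumed names; Hadoop 1.x API).
// Each iteration runs a map-only job; the global centers are computed by the external
// MyReduce listener rather than by a framework reduce phase.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.TextInputFormat;

public class AsyncKMeansDriver {

    // Build and submit the map-only job for one iteration; submitJob() does not block.
    static RunningJob submit(int iteration, String input, String output) throws Exception {
        JobConf conf = new JobConf(AsyncKMeansDriver.class);
        conf.setJobName("async-kmeans-iter-" + iteration);
        conf.setMapperClass(LocalCenterMapper.class); // computes local centers (sketched later)
        conf.setNumReduceTasks(0);                    // reduce is replaced by MyReduce
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputKeyClass(NullWritable.class);
        conf.setOutputValueClass(NullWritable.class);
        // Tell the mappers where the previous iteration's global centers live (assumed key).
        conf.set("kmeans.prev.centers", output + "/centers-" + (iteration - 1));
        FileInputFormat.setInputPaths(conf, new Path(input));
        FileOutputFormat.setOutputPath(conf, new Path(output + "/iter-" + iteration));
        return new JobClient(conf).submitJob(conf);
    }

    public static void main(String[] args) throws Exception {
        int i = 1;
        RunningJob current = submit(i, args[0], args[1]);
        while (true) {
            // Break the job dependence: pre-start iteration i+1 once the maps of
            // iteration i are under way, overlapping the 10-15 s job startup time.
            while (current.mapProgress() < 0.5f && !current.isComplete()) {
                Thread.sleep(1000);
            }
            RunningJob next = submit(i + 1, args[0], args[1]);

            // MyReduce (running outside MapReduce) writes iteration i's global centers
            // to HDFS and decides whether the termination condition has been reached.
            if (MyReduceServer.iterationConverged(i)) { // hypothetical helper
                next.killJob();                         // the pre-started job is no longer needed
                break;
            }
            current = next;
            i++;
        }
    }
}
```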
As a further improvement of the Hadoop optimization method based on asynchronous startup, step C comprises:
C1. issuing the start instruction and submitting the MapReduce job;
C2. sending the map tasks, the MyReduce tasks and the data blocks to each node.
As a further improvement of the Hadoop optimization method based on asynchronous startup, executing the map tasks in step D, running the Map function to process the data blocks, and obtaining and sending intermediate result data comprises:
D1. reading the data block on each node;
D2. running the Map function on the data block to compute the local center points, obtaining intermediate result data, and sending it from the map end to the MyReduce end.
As a further improvement of the Hadoop optimization method based on asynchronous startup, executing the MyReduce tasks in step E, receiving the intermediate result data, processing it and obtaining the current iteration result comprises:
E1. listening for data sent from the map end and receiving the intermediate result data;
E2. judging whether the number of local center point sets received equals the number of map tasks; if so, executing step E3; otherwise, returning to step E1;
E3. computing the global center points from the intermediate result data to obtain the current iteration result;
E4. writing the iteration result to HDFS.
As a further improvement of the Hadoop optimization method based on asynchronous startup, the local center points in step D2 are computed with the k-means algorithm.
As a further improvement of the Hadoop optimization method based on asynchronous startup, the global center points in step E3 are computed with the k-means algorithm.
The beneficial effects of the invention are as follows:
By changing jobs from fully serial execution to partially concurrent execution, the Hadoop optimization method based on asynchronous startup of the present invention optimizes the entire job execution flow, greatly increases the speed of iterative computation, and effectively improves execution efficiency. The present invention does not require modifying the underlying code, is easy to use, improves cluster utilization, and does not create a memory-space bottleneck.
Description of the drawings
Specific embodiments of the present invention are further described below with reference to the accompanying drawings:
Fig. 1 is a flow chart of the steps of the Hadoop optimization method based on asynchronous startup of the present invention;
Fig. 2 is the execution flow chart of a Hadoop job before the improvement;
Fig. 3 is the execution flow chart of the Hadoop optimization method based on asynchronous startup of the present invention.
Detailed description of embodiments
With reference to Fig. 1, the Hadoop optimization method based on asynchronous startup of the present invention comprises the following steps:
A. uploading a data file to HDFS and dividing the data file into multiple data blocks;
B. copying and distributing the divided data blocks to different machines;
C. issuing a start instruction, submitting a MapReduce job, and distributing map tasks and MyReduce tasks;
D. executing the map tasks, running the Map function to process the data blocks, obtaining and sending intermediate result data, and starting the job of the next iteration;
E. executing the MyReduce tasks, receiving the intermediate result data and processing it to obtain the current iteration result, while step D is executed in the job of the next iteration;
F. judging, according to the iteration result, whether the iteration termination condition has been reached; if so, ending the iteration; otherwise, returning to step E to serve the job of the next iteration.
As a further improvement of the Hadoop optimization method based on asynchronous startup, step C comprises:
C1. issuing the start instruction and submitting the MapReduce job;
C2. sending the map tasks, the MyReduce tasks and the data blocks to each node.
As a further improvement of the Hadoop optimization method based on asynchronous startup, executing the map tasks in step D, running the Map function to process the data blocks, and obtaining and sending intermediate result data comprises:
D1. reading the data block on each node;
D2. running the Map function on the data block to compute the local center points, obtaining intermediate result data, and sending it from the map end to the MyReduce end.
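As one concrete reading of steps D1-D2, here is an illustrative map-side sketch. All names are assumptions: the patent does not fix how the map end sends its local centers to the MyReduce end, so a plain TCP socket to a listener (see the MyReduce sketch after step E4) is assumed, the local centers are modeled as per-cluster coordinate sums and counts, and the configuration keys and the CenterStore helper are hypothetical.

```java
// Illustrative map-side sketch (assumed names and a plain-socket transport; Hadoop 1.x API).
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.net.Socket;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LocalCenterMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, NullWritable, NullWritable> {

    private double[][] centers;   // previous iteration's global centers
    private double[][] sums;      // per-cluster coordinate sums (local center data)
    private long[] counts;        // per-cluster point counts
    private String myReduceHost;
    private int myReducePort;

    @Override
    public void configure(JobConf conf) {
        // D1: read the previous iteration's global centers from HDFS; if they are
        // not available yet, the task fails and is retried by Hadoop.
        centers = CenterStore.readGlobalCenters(conf);     // hypothetical helper
        sums = new double[centers.length][centers[0].length];
        counts = new long[centers.length];
        myReduceHost = conf.get("myreduce.host", "deng0"); // assumed keys and defaults
        myReducePort = conf.getInt("myreduce.port", 9999);
    }

    // D2 (part 1): assign every point of the data block to its nearest center.
    public void map(LongWritable key, Text value,
                    OutputCollector<NullWritable, NullWritable> out, Reporter reporter)
            throws IOException {
        double[] point = parsePoint(value.toString());
        int nearest = nearestCenter(point);
        counts[nearest]++;
        for (int d = 0; d < point.length; d++) {
            sums[nearest][d] += point[d];
        }
    }

    // D2 (part 2): when the task finishes, send the local centers to the MyReduce end.
    @Override
    public void close() throws IOException {
        Socket socket = new Socket(myReduceHost, myReducePort);
        ObjectOutputStream oos = new ObjectOutputStream(socket.getOutputStream());
        oos.writeObject(sums);
        oos.writeObject(counts);
        oos.close();
        socket.close();
    }

    private int nearestCenter(double[] p) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int k = 0; k < centers.length; k++) {
            double dist = 0.0;
            for (int d = 0; d < p.length; d++) {
                double diff = p[d] - centers[k][d];
                dist += diff * diff;
            }
            if (dist < bestDist) { bestDist = dist; best = k; }
        }
        return best;
    }

    private double[] parsePoint(String line) {
        String[] parts = line.trim().split("\\s+");
        double[] p = new double[parts.length];
        for (int d = 0; d < parts.length; d++) {
            p[d] = Double.parseDouble(parts[d]);
        }
        return p;
    }
}
```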
As a further improvement of the Hadoop optimization method based on asynchronous startup, executing the MyReduce tasks in step E, receiving the intermediate result data, processing it and obtaining the current iteration result comprises:
E1. listening for data sent from the map end and receiving the intermediate result data;
E2. judging whether the number of local center point sets received equals the number of map tasks; if so, executing step E3; otherwise, returning to step E1;
E3. computing the global center points from the intermediate result data to obtain the current iteration result;
E4. writing the iteration result to HDFS.
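Correspondingly, steps E1-E4 can be sketched as a stand-alone listener running outside the MapReduce framework (in the experiments below this role is played by the deng0 machine). Again, the class name, the socket protocol and the sums/counts representation are assumptions carried over from the map-side sketch; writing the result to HDFS (step E4) and the convergence check used by the driver are only indicated in comments.

```java
// Illustrative MyReduce listener sketch (assumed names and protocol).
import java.io.IOException;
import java.io.ObjectInputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class MyReduceServer {

    // E1-E3: collect one local result per map task, then combine them into the
    // global centers of the current iteration.
    public static double[][] collectGlobalCenters(int port, int numMapTasks, int k, int dim)
            throws IOException, ClassNotFoundException {
        double[][] globalSums = new double[k][dim];
        long[] globalCounts = new long[k];
        int received = 0;

        ServerSocket server = new ServerSocket(port);
        // E1/E2: keep listening until the number of local center sets received
        // equals the number of map tasks.
        while (received < numMapTasks) {
            Socket socket = server.accept();
            ObjectInputStream in = new ObjectInputStream(socket.getInputStream());
            double[][] sums = (double[][]) in.readObject();
            long[] counts = (long[]) in.readObject();
            for (int c = 0; c < k; c++) {
                globalCounts[c] += counts[c];
                for (int d = 0; d < dim; d++) {
                    globalSums[c][d] += sums[c][d];
                }
            }
            in.close();
            socket.close();
            received++;
        }
        server.close();

        // E3: each global center is the weighted mean of the local contributions.
        double[][] centers = new double[k][dim];
        for (int c = 0; c < k; c++) {
            for (int d = 0; d < dim; d++) {
                centers[c][d] = globalCounts[c] == 0 ? 0.0 : globalSums[c][d] / globalCounts[c];
            }
        }
        // E4 (not shown): write 'centers' to HDFS with the FileSystem API so that the
        // maps of the next iteration can read them; comparing them with the previous
        // centers gives the iterationConverged() check used by the driver sketch.
        return centers;
    }
}
```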
As a further improvement of the Hadoop optimization method based on asynchronous startup, the local center points in step D2 are computed with the k-means algorithm.
As a further improvement of the Hadoop optimization method based on asynchronous startup, the global center points in step E3 are computed with the k-means algorithm.
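For concreteness, the k-means computation at the two levels can be written as follows. This is the standard distributed k-means formulation; the patent only specifies that the k-means algorithm is used, so the exact decomposition into per-cluster sums and counts is an assumption matching the sketches above.

```latex
% Map side (task m, data block B_m): assign each point to the nearest of the
% previous iteration's global centers c_1, ..., c_K and form local sums/counts.
\[
S_{m,k} = \sum_{\substack{x \in B_m \\ \mathrm{nearest}(x) = k}} x,
\qquad
n_{m,k} = \left|\{\, x \in B_m : \mathrm{nearest}(x) = k \,\}\right|
\]
% MyReduce side: combine the local results of all M map tasks into the new
% global centers for the next iteration.
\[
c_k' = \frac{\sum_{m=1}^{M} S_{m,k}}{\sum_{m=1}^{M} n_{m,k}},
\qquad k = 1, \dots, K
\]
```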
The asynchronous job startup in the present invention is intended to solve the time waste caused by serial job execution. To this end, the entire job flow is first described abstractly, the available room for improvement is then identified, and finally the improvement strategy of this application is proposed; it is verified that the improved method reduces the total time consumption.
1. Scenario description
When performing distributed cluster analysis on a data set, multiple iterations are needed, and each job is responsible for one iteration. These jobs run back to back, and each job necessarily depends on the computation result of the previous job. Therefore, in the distributed k-means algorithm, each job has to wait until the previous job is fully completed before it can start. Within each job, the phases also execute in a strict order.
To formalize the above scenario, a mathematical model is established here to represent the dependence relations between jobs and within a job during the entire clustering process, and the model is used to calculate the efficiency improvement of the whole iterative computation after asynchronous startup.
(1) Definitions
Definition 1: a complete job lasts from its startup until it is fully completed;
Definition 2: the entire iterative process lasts from the first iteration to the end of the last iteration;
Definition 3: the average time consumed per job is defined as the total time divided by the number of iterations.
(2) Preconditions
1) The entire iterative process consists of a sequence of consecutive jobs;
2) Each job has four phases: startup, map, reduce and end; each phase consumes a certain amount of time, and the execution order of the phases is fixed;
3) When a job finishes, if the iteration termination condition has not been reached, a new job is certain to be started;
4) If Job1 starts before Job2 and the two jobs are adjacent, then the local-center computation in the map tasks of Job2 can only start after the result of Job1 has been completely written to HDFS, and Job2 cannot finish before Job1;
5) A cluster can support running multiple jobs at the same time;
6) Under normal circumstances, a job does not stop halfway through its execution.
2. Mathematical model
The previous subsection described in words how the jobs of the iterative computation execute during the whole cluster analysis; this description is now abstracted into a mathematical model.
The model mainly describes the dependence between jobs, the execution dependence within a job, and the data dependence between jobs.
(1) Variables:
N: total number of iterations of the entire cluster analysis;
i: iteration index, i = 1, 2, 3, ..., N;
j: index of the phases within each job, j = 1 to 4, where 1 is the startup phase, 2 the map phase, 3 the reduce phase and 4 the end phase;
St_i: the moment at which the job of the i-th iteration starts;
Et_i: the moment at which the job of the i-th iteration is fully completed;
St_{ij}: the start moment of the j-th phase of the i-th iteration;
St_{i2}': the moment at which the map of the i-th iteration starts computing the local centers;
St_{i3}': the moment at which the reduce of the i-th iteration starts computing the global centers;
Et_{ij}: the moment at which the j-th phase of the i-th iteration is fully completed;
T: total time consumption;
t̄: average time consumption per iteration.
(2) Serial constraints on the jobs
In the distributed k-means iterative computation, the jobs execute in a strict order: there are strict dependence relations between jobs, and strict data dependence relations within each job. The variables defined above are now used to formalize these dependence relations.
The serial dependences are divided into three kinds: job dependence, execution dependence and data dependence. Job dependence refers to the execution order between jobs; execution dependence refers to the execution order of the phases within a job; data dependence refers to the data dependence relations involving the map and reduce of the jobs.
Job dependence:
The (i+1)-th iteration does not start until the i-th iteration has ended:
Et_i < St_{i+1}
The job of the (i+1)-th iteration cannot be completed earlier than the job of the i-th iteration:
Et_i < Et_{i+1}
Execution dependence: the execution order of the phases within each job is fixed, namely startup, map, reduce, end; the phases of a job never execute out of order:
St_i ≤ St_{ij} ≤ St_{i(j+1)} ≤ Et_i
This formula states that, within the job of the i-th iteration, the phases start in a strict order and never out of sequence.
Data dependence:
Only after the i-th iteration is fully completed can the map of the (i+1)-th iteration start to read the result of the previous iteration:
Et_{i4} < St_{(i+1)2}'
There is also a data dependence between the map and reduce within each job: reduce needs map to be fully completed before it starts computing the global centers:
Et_{i2} ≤ St_{i3}'
The total time consumed by the entire clustering process is then:
T = Et_N − St_1
and the average time consumption per iteration is:
t̄ = T / N
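For convenience, the serial constraints and the two time quantities above can be collected into a single display (same notation as the variable list; the approximate sum form holds because, under the job dependence constraint, the jobs run essentially back to back):

```latex
% Serial constraints of the unimproved distributed k-means, for i = 1, ..., N-1:
% job dependence, execution dependence and data dependence, followed by the
% total and average time under back-to-back serial execution.
\[
Et_i < St_{i+1}, \qquad Et_i < Et_{i+1}
\]
\[
St_i \le St_{ij} \le St_{i(j+1)} \le Et_i, \qquad j = 1, 2, 3
\]
\[
Et_{i2} \le St_{i3}', \qquad Et_{i4} < St_{(i+1)2}'
\]
\[
T = Et_N - St_1 \approx \sum_{i=1}^{N} \bigl(Et_i - St_i\bigr),
\qquad
\bar{t} = \frac{T}{N}
\]
```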
3. Improvement scheme
This subsection improves two of the constraints on the chained job execution: one is the job dependence condition and the other is the data dependence condition. Because of precondition 5 (the cluster supports running multiple jobs at the same time), a job can be started in advance.
(1) Improvement of the job dependence: the map phase of the job of the i-th iteration only needs to start before the job of the (i+1)-th iteration starts; it is no longer required that the job of the i-th iteration be fully completed before the (i+1)-th iteration starts.
The original constraint Et_i ≤ St_{i+1}
is improved to: St_i < St_{i2} < St_{i+1} < Et_i
The time saved is then at least: Δt_1 = Et_i − St_{i3}
This formula shows that the (i+1)-th iteration need not wait until the i-th iteration is fully completed before it starts; it is only necessary to ensure that the start time of the i-th iteration precedes the start time of the (i+1)-th iteration, and that the (i+1)-th job starts after the map of the i-th iteration has begun and before the reduce of the i-th iteration. The reason is that the job of the (i+1)-th iteration is started once the map of the i-th iteration has reached a certain proportion of progress; after the i-th iteration completes all its operations, the (i+1)-th iteration can read the result of the previous iteration. If the result cannot be read yet, the map tasks of the (i+1)-th iteration fail and are restarted (Hadoop's fault tolerance) until the result can be read.
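In code, this reliance on Hadoop's task retry can be as simple as failing the map attempt when the previous iteration's result is not yet visible on HDFS; the framework then re-schedules the attempt until the data appears. The sketch below is one possible shape of the CenterStore.readGlobalCenters helper assumed in the earlier map-side sketch (the path key and the one-center-per-line file format are assumptions):

```java
// Sketch: fail fast when the previous iteration's global centers (or the initial
// centers for iteration 1) are not yet on HDFS, so Hadoop retries the map attempt.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CenterStore {

    // Read the previous iteration's global centers, one center per line of
    // space-separated coordinates (assumed format).
    public static double[][] readGlobalCenters(Configuration conf) {
        try {
            Path prev = new Path(conf.get("kmeans.prev.centers")); // assumed key
            FileSystem fs = prev.getFileSystem(conf);
            if (!fs.exists(prev)) {
                // Not written yet: fail this map attempt so that Hadoop's fault
                // tolerance retries it until the result appears.
                throw new IOException("previous iteration result not ready: " + prev);
            }
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(fs.open(prev)));
            List<double[]> centers = new ArrayList<double[]>();
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.trim().split("\\s+");
                double[] c = new double[parts.length];
                for (int d = 0; d < parts.length; d++) {
                    c[d] = Double.parseDouble(parts[d]);
                }
                centers.add(c);
            }
            reader.close();
            return centers.toArray(new double[centers.size()][]);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```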
(2) Improvement of the data dependence: the second improvement concerns the data dependence constraint. The map of the (i+1)-th iteration does not need to wait until the i-th iteration is fully completed before it starts computing the local centers; it only needs the reduce of the i-th iteration to have finished before the map computation starts.
The original constraint Et_{i4} < St_{(i+1)2}'
is changed to: Et_{i3} < St_{(i+1)2}'
The time saved is then: Δt_2 = Et_{i4} − Et_{i3} > 0
This formula indicates that the map of the (i+1)-th iteration can start only after the reduce of the i-th iteration is fully completed, because the map needs the computation result of the previous iteration to compute the local centers in its map tasks.
4. Conclusion
In summary, each iteration saves, compared with the time before the improvement:
Δt = Δt_1 + Δt_2
The improved total time then satisfies: T' ≤ T − (N − 1)·Δt
Since Δt_1 > 0 and Δt_2 > 0, we have Δt > 0 and therefore T > T'. It can be deduced that the time consumption after the improvement is certainly less than the distributed k-means time consumption before the improvement, and the average time per iteration is also certainly less than before the improvement, i.e. t̄' = T'/N < T/N = t̄.
Therefore, modifying the startup-time conditions of the job dependence and the data dependence has a definite effect on improving the efficiency of the entire clustering flow; the improved jobs are chained together more tightly, which accelerates the iterative computation.
With reference to Fig. 2, there are two jobs of the iterative process in the figure, Job1 and Job2. In the figure, 1 denotes job startup, 2 denotes execution of the map tasks, 3 denotes execution of the reduce tasks, and 4 denotes the end of the job.
These are the constraints on the jobs before the improvement. There are three main constraints: job dependence, execution dependence and data dependence.
Job dependence means that Job2 can only start after Job1 is fully completed.
Execution dependence means that the execution order of the phases within each job is fixed: startup, map, reduce and end are executed in that order.
Data dependence means that the Map functions of Job2 can read their data only after Job1 is fully completed, and within each job the data read by reduce is available only after the Map functions have finished.
With reference to Fig. 3, Fig. 3 shows the execution flow of the jobs after the improvement.
First, the job dependence is broken, allowing Job2 to be started in advance, which reduces the impact of the startup time on the time consumption.
Second, the execution dependence is broken: phase 3 and phase 4 are allowed to proceed together, and MyReduce is used instead of reduce to compute the global center points, which speeds up the computation of the global center points.
Finally, the data dependence is broken, so that Job2 can read the computation result of the previous iteration as soon as phase 3 of Job1 has completed.
The specific experimental data of the present invention are as follows:
(1) Experimental environment
Three physical machines were used in the experiment; each physical machine was configured as follows:
CPU model: Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz; CPU cores: 4; Memory: 16 GB; Hard disk: 500 GB
Table 1. Physical machine configuration
The virtualization tool used was XenCenter 6.2, and six virtual machines were started on the three physical machines. Server1 ran two virtual machines: one with a Win7 system, used as the Eclipse programming environment, and one with Ubuntu 12.04, used as the Master of the Hadoop cluster and responsible for scheduling the Slave nodes. The remaining two servers each hosted two virtual machines serving as Slave nodes of the Hadoop cluster.
The configuration of the six virtual machines is shown in Table 2:
Virtual CPU cores: 2; Memory: 4 GB; Hard disk: 150 GB; Operating system: Ubuntu 12.04; Network: gigabit switch
Table 2. Virtual machine configuration
The development environment is shown in the following table:
Development language: Java; JDK version: JDK 1.6; Development environment: Eclipse + Maven; Hadoop version: 1.2.1
Table 3. Development environment
The specific division of labor among the six virtual machines in the cluster is shown in the following table: deng1 serves as the master, responsible for the Namenode and the JobTracker; four machines (deng2, deng3, deng4, deng5) serve as Datanodes and TaskTrackers; and one machine, deng0, serves as the Hadoop development environment, responsible for compiling the Java code and issuing tasks such as the job start instruction.
Machine name | IP | Task | Operating system
deng0 | 192.168.1.10 | Development environment + MyReduce end | Windows 7
deng1 | 192.168.1.11 | Master | Ubuntu 12.04
deng2 | 192.168.1.12 | Slave | Ubuntu 12.04
deng3 | 192.168.1.13 | Slave | Ubuntu 12.04
deng4 | 192.168.1.14 | Slave | Ubuntu 12.04
deng5 | 192.168.1.15 | Slave | Ubuntu 12.04
Table 4. Division of labor and basic configuration of each virtual machine
(2) Comparison of average Reduce processing times
The data and the algorithm parameters chosen for this experiment are the same as those of the first experiment of Experiment 2; the purpose is to compare the processing time of Reduce with that of MyReduce.
Table 5. Comparison of Reduce and MyReduce processing times
From the experimental results it can be found that, after the MyReduce improvement, the time spent computing the global center points drops from more than 10 seconds on average to about 2 seconds on average. Comparing the fastest processing times, MyReduce is at least 10 times faster than Reduce; comparing the slowest processing times, MyReduce is still at least 3 times faster than Reduce. Although the average processing time of MyReduce increases more noticeably as the number of points in the data set grows, it remains far below the time consumption of the Reduce process. This is mainly because MyReduce stays in a listening and computing state at all times, saving the overhead introduced by the MapReduce framework. Judging from the experimental results, MyReduce improves the efficiency of computing the global center points considerably.
(3) Comparison of cluster analysis times
The data used in this experiment are the same as in the preceding experiments, together with the matrix generated by MakeMatrix.
This experiment compares the distributed k-means algorithm before the improvement (Distributed k-means) with the distributed k-means algorithm after the three improvements (Advanced Distributed k-means), mainly analysing the average time per iteration and the total processing time.
Table 6. Efficiency comparison before and after the improvement
From the experimental results it can be found that, after the partial-concurrency and MyReduce improvements, the average time per iteration is reduced by nearly 20 seconds, because the job startup time and the time consumed by Reduce in MapReduce are reduced.
In addition, under the combined effect of the three optimizations (initial-point optimization, computing the global center points with MyReduce, and partially concurrent execution), the total time consumption of the entire distributed cluster analysis drops by 54%, 40% and 39% in the cases of 1,000,000 (100W), 4,000,000 (400W) and 8,000,000 (800W) points respectively. In the 8,000,000-point case nearly 5 hours are saved: the time drops from the original 12 hours to a little over 7 hours, reducing the waiting time by nearly 5 hours.
It was found in the experiments that, although the partially concurrent execution makes a few individual iterations take longer, the average time per iteration still decreases because the startup-time overhead is saved. In addition, the initial-point optimization reduces the number of iterations; in the case with the largest reduction, the final number of iterations is reduced by more than 100, so the total time needed to reach the iteration termination condition drops significantly. The experiments also show that, as the number of points increases, the time spent computing the center points takes up a larger share of the total time; that is, the share consumed by the Hadoop framework decreases and the processing efficiency of Hadoop rises.
Because the present invention changes jobs from serial execution to partially concurrent execution, the job startup time is saved during the cluster analysis, and the resource utilization of the Hadoop cluster is also improved. Before the optimization, a "trough" appeared in the interval between two iterations: after one iteration finished, the job of the next iteration could only start after waiting for several seconds, so a drop in CPU resource consumption was observed. Also, because the jobs executed serially and each job had to wait for the previous one to finish completely before it could start, the total time of the distributed cluster analysis became longer. After the improvement the "trough" phenomenon disappears: the jobs execute with partial concurrency and no longer wait for startup. Since the idle time of the CPU decreases, the efficiency of the cluster analysis improves, and the CPU utilization of the whole Hadoop cluster also improves (mainly because the idle waiting time is reduced).
The average CPU utilization after the improvement is 68%, while the average CPU utilization before the improvement is 48%, so the cluster's CPU utilization is 41.7% higher than before (20 percentage points more of the cluster's CPU computing resources are utilized). The time after the improvement is 60 minutes and the time before the improvement is 131 minutes, so the ratio of the overall CPU resource consumption before the improvement to that after the improvement is (131 × 48%) / (60 × 68%) = 1.54; that is, reaching the iteration termination condition before the improvement requires about 1.5 times the CPU computing resources needed after the improvement. This demonstrates that the asynchronous job startup method can effectively save time and thereby also save the cluster's overall CPU computing resources.
The above is a description of preferred embodiments of the present invention, but the invention is not limited to the above embodiments. Those skilled in the art can make various equivalent variations or replacements without departing from the spirit of the invention, and such equivalent variations or replacements are all included within the scope defined by the claims of this application.

Claims (5)

1. A Hadoop optimization method based on asynchronous startup, characterized by comprising the following steps:
A. uploading a data file to HDFS and dividing the data file into multiple data blocks;
B. copying and distributing the divided data blocks to different machines;
C. issuing a start instruction, submitting a MapReduce job, and distributing map tasks and MyReduce tasks;
D. executing the map tasks, running the Map function to process the data blocks, obtaining and sending intermediate result data, and starting the job of the next iteration;
E. executing the MyReduce tasks, receiving the intermediate result data and processing it to obtain the current iteration result, while step D is executed in the job of the next iteration;
F. judging, according to the iteration result, whether the iteration termination condition has been reached; if so, ending the iteration; otherwise, returning to step E to serve the job of the next iteration;
wherein executing the MyReduce tasks in step E, receiving the intermediate result data, processing it and obtaining the current iteration result comprises:
E1. listening for data sent from the map end and receiving the intermediate result data;
E2. judging whether the number of local center point sets received equals the number of map tasks; if so, executing step E3; otherwise, returning to step E1;
E3. computing the global center points from the intermediate result data to obtain the current iteration result;
E4. writing the iteration result to HDFS.
2. The Hadoop optimization method based on asynchronous startup according to claim 1, characterized in that step C comprises:
C1. issuing the start instruction and submitting the MapReduce job;
C2. sending the map tasks, the MyReduce tasks and the data blocks to each node.
3. The Hadoop optimization method based on asynchronous startup according to claim 1, characterized in that executing the map tasks in step D, running the Map function to process the data blocks, and obtaining and sending intermediate result data comprises:
D1. reading the data block on each node;
D2. running the Map function on the data block to compute the local center points, obtaining intermediate result data, and sending it from the map end to the MyReduce end.
4. The Hadoop optimization method based on asynchronous startup according to claim 3, characterized in that the local center points in step D2 are computed with the k-means algorithm.
5. The Hadoop optimization method based on asynchronous startup according to claim 1, characterized in that the global center points in step E3 are computed with the k-means algorithm.
CN201410757131.4A 2014-12-10 2014-12-10 Hadoop optimization method based on asynchronous startup Active CN104503820B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410757131.4A CN104503820B (en) 2014-12-10 2014-12-10 Hadoop optimization method based on asynchronous startup


Publications (2)

Publication Number Publication Date
CN104503820A CN104503820A (en) 2015-04-08
CN104503820B true CN104503820B (en) 2018-07-24

Family

ID=52945221

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410757131.4A Active CN104503820B (en) 2014-12-10 2014-12-10 Hadoop optimization method based on asynchronous startup

Country Status (1)

Country Link
CN (1) CN104503820B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104915250B * 2015-06-03 2018-04-06 电子科技大学 A method for realizing MapReduce data localization for jobs
CN107844568B (en) * 2017-11-03 2021-05-28 广东电网有限责任公司电力调度控制中心 MapReduce execution process optimization method for processing data source update
CN110795265B (en) * 2019-10-25 2021-04-02 东北大学 Iterator based on optimistic fault-tolerant method


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102222092A (en) * 2011-06-03 2011-10-19 复旦大学 Massive high-dimension data clustering method for MapReduce platform
CN103077253A (en) * 2013-01-25 2013-05-01 西安电子科技大学 High-dimensional mass data GMM (Gaussian Mixture Model) clustering method under Hadoop framework
CN103605576A (en) * 2013-11-25 2014-02-26 华中科技大学 Multithreading-based MapReduce execution system
CN103617087A (en) * 2013-11-25 2014-03-05 华中科技大学 MapReduce optimizing method suitable for iterative computations
CN103838863A (en) * 2014-03-14 2014-06-04 内蒙古科技大学 Big-data clustering algorithm based on cloud computing platform
CN104156463A (en) * 2014-08-21 2014-11-19 南京信息工程大学 Big-data clustering ensemble method based on MapReduce

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on parallel k-means algorithm based on cloud computing (基于云计算的并行k-means算法研究); 林长方 et al.; Journal of Qiqihar University (齐齐哈尔大学学报); 2014-09-30; Vol. 30, No. 5; pp. 5-9, Sections 1-2 *

Also Published As

Publication number Publication date
CN104503820A (en) 2015-04-08


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant