CN104503820B - A Hadoop optimization method based on asynchronous starting - Google Patents
Abstract
The invention discloses a Hadoop optimization method based on asynchronous starting. By changing job execution from fully serial to partially concurrent, the entire job execution process is optimized, the speed of iterative computation is greatly increased, and execution efficiency is effectively improved. The method requires no changes to the underlying Hadoop code, is easy to use, improves cluster utilization, and does not create a memory-space bottleneck. As a Hadoop optimization method based on asynchronous starting, the present invention can be widely applied in Hadoop framework technology.
Description
Technical field
The present invention relates to the field of computer technology, and more particularly to a Hadoop optimization method based on asynchronous starting.
Background art
Hadoop currently has the following performance bottlenecks when processing iterative computation:
Fully serial execution: each job must wait until the previous job is fully completed before it can start;
Long startup time: starting a job takes 10-15 seconds on average, which is a huge waste of time;
Long Reduce phase: the Reduce phase computes the global center points and writes the results to HDFS; this takes about 10 seconds, another considerable time cost;
Random selection of initial center points: the initial center points strongly influence the k-means iterations; choosing good initial center points helps reduce both the number of iterations and the total time;
Repeated I/O: the data read by each Map side is essentially unchanged, yet it is read again every time a task starts;
Network bandwidth consumption: data to be read may come from other nodes rather than being computed locally, causing heavy network bandwidth consumption.
Many improvements to Hadoop iterative computation have been proposed for the performance problems above, but most of them involve modifying Hadoop's underlying source code: when a user needs to switch back to the original Hadoop mode, the entire Hadoop framework has to be replaced, which is very inconvenient. In addition, some improved methods cache the repeatedly read data; although this avoids repeated I/O accesses, it also increases the memory pressure on each node, introduces a new performance bottleneck, and changes the structure of Hadoop itself, losing Hadoop's original characteristics.
Hadoop: a distributed big-data processing tool;
MapReduce: a distributed computing framework;
K-means algorithm: a classic partition-based clustering algorithm;
Iterative computation: computation that must run multiple jobs before the termination condition is reached;
HDFS: the Hadoop Distributed File System.
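As illustration of the k-means definition above, here is a minimal single-machine sketch of one assignment-and-update step. The function name, the use of 1-D points, and the data values are illustrative assumptions, not part of the patent:

```python
def kmeans_step(points, centers):
    """One k-means iteration: assign each point to its nearest
    center, then recompute each center as the mean of its cluster."""
    clusters = [[] for _ in centers]
    for p in points:
        # index of the nearest center (1-D distance for simplicity)
        i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
        clusters[i].append(p)
    # keep the old center if a cluster received no points
    return [sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)]

points = [1.0, 2.0, 10.0, 11.0]
centers = kmeans_step(points, [0.0, 12.0])
print(centers)  # [1.5, 10.5]
```

The distributed version in the patent splits the assignment work across map tasks (local center points) and merges the results centrally (global center points).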
Summary of the invention
In order to solve the above technical problems, the object of the present invention is to provide a Hadoop optimization method based on asynchronous starting that improves job efficiency without modifying the underlying code.
The technical solution adopted in the present invention is:
A Hadoop optimization method based on asynchronous starting comprises the following steps:
A. upload the data file to HDFS and divide the data file into multiple data blocks;
B. copy the divided data blocks and distribute them to different machines;
C. issue a start instruction, submit the MapReduce job, and distribute the map tasks and the MyReduce task;
D. execute the map tasks: run the Map function to process the data blocks, obtain intermediate result data and send it, and start the next iteration job;
E. execute the MyReduce task: receive the intermediate result data and process it to obtain the current iteration result, while step D is executed in the next iteration job;
F. judge from the iteration result whether the iteration termination condition has been reached; if so, terminate the iteration; otherwise, serve the next iteration job and return to step E.
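The termination loop of step F can be sketched as follows. The convergence test (centers moving less than a threshold) is one common iteration termination condition; the per-iteration function and all values here are illustrative assumptions, not the patent's implementation:

```python
def iterate(run_one_iteration, centers, eps=1e-3, max_iters=100):
    """Sketch of step F: repeat iterations until the centers move
    less than eps (the termination condition) or a cap is reached."""
    for _ in range(max_iters):
        new_centers = run_one_iteration(centers)   # steps D + E
        shift = max(abs(a - b) for a, b in zip(new_centers, centers))
        centers = new_centers
        if shift < eps:            # termination condition reached
            break
    return centers

# Toy "iteration": move each center halfway toward a fixed target,
# standing in for one full map + MyReduce pass.
targets = [2.0, 11.0]
step = lambda cs: [(c + t) / 2 for c, t in zip(cs, targets)]
print(iterate(step, [0.0, 0.0]))
```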
As a further improvement of the Hadoop optimization method based on asynchronous starting, step C comprises:
C1. issue a start instruction and submit the MapReduce job;
C2. send the map tasks, the MyReduce task, and the data blocks to each node.
As a further improvement, executing the map tasks in step D (running the Map function to process the data blocks, obtaining intermediate result data and sending it) comprises:
D1. read the data blocks on each node;
D2. run the Map function on the data blocks to compute the local center points, obtain the intermediate result data, and send it from the map side to the MyReduce side.
As a further improvement, executing the MyReduce task in step E (receiving the intermediate result data, processing it, and obtaining the current iteration result) comprises:
E1. listen for data sent from the map side and receive the intermediate result data;
E2. judge whether the number of local-center-point sets of the received intermediate result data equals the number of map tasks; if so, execute step E3; otherwise, return to step E1;
E3. compute the global center points from the intermediate result data to obtain the current iteration result;
E4. write the iteration result to HDFS.
As a further improvement, the local center points in step D2 are computed with the k-means algorithm.
As a further improvement, the global center points in step E3 are computed with the k-means algorithm.
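Steps E1-E3 can be sketched as follows: collect one local result per map task, wait until all have arrived, then merge them into global center points by weighted mean. This sketch uses an in-process queue in place of the patent's network listener; all names and the weighted-mean merge rule are illustrative assumptions:

```python
import queue

def my_reduce(results_q, num_map_tasks):
    """Sketch of steps E1-E3: receive one (local_centers, counts) pair
    per map task, then merge them into global centers by weighted mean."""
    received = []
    while len(received) < num_map_tasks:     # E1/E2: listen until one
        received.append(results_q.get())     # result per map task arrived
    k = len(received[0][0])
    global_centers = []
    for j in range(k):                       # E3: weighted mean per center
        total = sum(cs[j] * ns[j] for cs, ns in received)
        n = sum(ns[j] for _, ns in received)
        global_centers.append(total / n)
    return global_centers

q = queue.Queue()
q.put(([1.0, 10.0], [2, 2]))   # map task 1: local centers + point counts
q.put(([3.0, 12.0], [2, 2]))   # map task 2
print(my_reduce(q, 2))  # [2.0, 11.0]
```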
The beneficial effects of the invention are as follows:
By changing job execution from fully serial to partially concurrent, the Hadoop optimization method based on asynchronous starting of the present invention optimizes the entire job execution process, greatly increases the speed of iterative computation, and effectively improves execution efficiency. The present invention does not need to change the underlying code, is easy to use, improves cluster utilization, and does not create a memory-space bottleneck.
Description of the drawings
The specific embodiments of the present invention are further described below with reference to the accompanying drawings:
Fig. 1 is a flow chart of the steps of the Hadoop optimization method based on asynchronous starting of the present invention;
Fig. 2 is the execution flow chart of a Hadoop job before the improvement;
Fig. 3 is the execution flow chart of the Hadoop optimization method based on asynchronous starting of the present invention.
Specific embodiments
With reference to Fig. 1, the Hadoop optimization method based on asynchronous starting of the present invention comprises the following steps:
A. upload the data file to HDFS and divide the data file into multiple data blocks;
B. copy the divided data blocks and distribute them to different machines;
C. issue a start instruction, submit the MapReduce job, and distribute the map tasks and the MyReduce task;
D. execute the map tasks: run the Map function to process the data blocks, obtain intermediate result data and send it, and start the next iteration job;
E. execute the MyReduce task: receive the intermediate result data and process it to obtain the current iteration result, while step D is executed in the next iteration job;
F. judge from the iteration result whether the iteration termination condition has been reached; if so, terminate the iteration; otherwise, serve the next iteration job and return to step E.
As a further improvement, step C comprises:
C1. issue a start instruction and submit the MapReduce job;
C2. send the map tasks, the MyReduce task, and the data blocks to each node.
As a further improvement, executing the map tasks in step D comprises:
D1. read the data blocks on each node;
D2. run the Map function on the data blocks to compute the local center points, obtain the intermediate result data, and send it from the map side to the MyReduce side.
As a further improvement, executing the MyReduce task in step E comprises:
E1. listen for data sent from the map side and receive the intermediate result data;
E2. judge whether the number of local-center-point sets of the received intermediate result data equals the number of map tasks; if so, execute step E3; otherwise, return to step E1;
E3. compute the global center points from the intermediate result data to obtain the current iteration result;
E4. write the iteration result to HDFS.
As a further improvement, the local center points in step D2 are computed with the k-means algorithm.
As a further improvement, the global center points in step E3 are computed with the k-means algorithm.
The asynchronous starting of jobs in the present invention solves the time waste caused by serial job execution. To this end, the whole job is first described abstractly, the existing room for improvement is identified, the improvement strategy of this invention is then proposed, and it is finally verified that the improved method reduces total time consumption.
1. Scene description
When performing distributed cluster analysis on a data set, multiple iterations are needed, and each job is responsible for one iteration. These jobs run end to end, and each job necessarily depends on the computation result of the previous job. Therefore, in the distributed k-means algorithm, each job has to wait until the previous job is fully completed before it can start. Within each job, the stages also execute in a strict order.
To formalize this scene, a mathematical model is established here to express the dependence relations between jobs and within jobs, and the model is then used to calculate the efficiency improvement of the whole iterative computation process after asynchronous starting.
(1) Related definitions
Definition 1: a complete job runs from its startup until it is fully completed;
Definition 2: the whole iterative process runs from the first iteration to the end of the last iteration;
Definition 3: the average time consumption of a job is the total time divided by the number of iterations;
(2) Preconditions
The whole iterative process consists of several consecutive jobs;
Each job has four stages (startup, map, reduce, end); each stage consumes a certain amount of time, and the execution order of the stages is fixed;
When a job finishes, if the iteration termination condition has not been reached, a new job is certain to start;
If Job1 starts earlier than Job2 and the two jobs are adjacent, then the local-center computation in the map tasks of Job2 must wait until the result of Job1 has been completely written to HDFS before it can start, and Job2 cannot finish earlier than Job1;
A cluster can run multiple jobs at the same time;
Under normal circumstances, a job does not stop midway through its execution.
2. Mathematical model
The previous section described in words how the jobs of the iterative computation execute during the whole cluster analysis; that description is now abstracted into a mathematical model. The model mainly describes the dependence relations between jobs, the execution dependences within a job, and the data dependences between jobs.
(1) Variable description:
N: total number of iterations of the whole cluster analysis;
i: iteration index, i = 1, 2, 3, ..., N;
j: stage index within each job, j = 1 to 4, where 1 is the startup stage, 2 the map stage, 3 the reduce stage, and 4 the end stage;
St_i: the moment the job of the i-th iteration starts;
Et_i: the moment the job of the i-th iteration is fully completed;
St_ij: the start moment of the j-th stage of the i-th iteration;
St_i2': the moment the map of the i-th iteration starts computing the local center points;
St_i3': the moment the reduce of the i-th iteration starts computing the global center points;
Et_ij: the moment the j-th stage of the i-th iteration is fully completed;
T: total time consumption;
T̄: average time consumption per iteration.
(2) Serial constraints on jobs
In the distributed k-means iterative computation, the jobs execute in a strict order with strict job dependences, and the stages within each job have strict data dependences. The variables above are now used to formalize these mutual dependences.
The serial dependences of the jobs fall into three kinds: job dependence, execution dependence, and data dependence. Job dependence is the mutual execution order between jobs; execution dependence is the execution order of the stages within a job; data dependence is the data dependence between the map and reduce of two jobs.
Job dependence:
If the i-th iteration has not ended, the (i+1)-th iteration does not start:
Et_i < St_(i+1)
The job of the (i+1)-th iteration cannot be completed earlier than the job of the i-th iteration:
Et_i < Et_(i+1)
Execution dependence: the execution order of the stages within each job is fixed, namely startup, map, reduce, end; the stages are never executed out of order:
St_i ≤ St_ij ≤ St_i(j+1) ≤ Et_i
This formula states that within the job of the i-th iteration, the stages start in a strict order; no stage starts out of turn.
Data dependence:
Only after the i-th iteration is fully completed can the map of the (i+1)-th iteration start to read the result of the previous iteration:
Et_i4 < St_(i+1)2'
There is also a data dependence between the map and reduce within each job: the map must be fully completed before the reduce starts computing the global center points:
Et_i2 ≤ St_i3'
The total time consumption of the whole clustering process (from the start of the first job to the completion of the last) is then:
T = Et_N - St_1
and the average time consumption per iteration is:
T̄ = T / N
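A small numeric sketch of this serial model follows. The stage durations are assumed values chosen within the ranges mentioned in the background section, not measurements from the patent:

```python
# Illustrative timing model: each serial job = startup + map + reduce + end,
# run N times back to back (Et_i coincides with St_(i+1)).
startup, map_t, reduce_t, end_t = 12, 30, 10, 2   # seconds (assumed)
N = 5

job = startup + map_t + reduce_t + end_t   # one job, stages in fixed order
T = N * job                                # fully serial total time
T_avg = T / N                              # definition 3: T / iterations
print(T, T_avg)  # 270 54.0
```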
3. Improvement scheme
This subsection improves two of the constraints on chained job execution: one is a job-dependence condition, the other a data-dependence condition. Because of precondition 5 (the cluster supports running multiple jobs at the same time), jobs can be started in advance.
(1) Job-dependence improvement: the job of the (i+1)-th iteration only needs to start later than the map moment of the job of the i-th iteration; it does not have to wait until the job of the i-th iteration is fully completed.
The formula: Et_i ≤ St_(i+1)
is improved to: St_i < St_i2 < St_(i+1) < Et_i
The time saved is then at least: Δt1 = Et_i - St_i3
This formula shows that the job of the (i+1)-th iteration need not wait until the i-th iteration is fully completed; it only needs to start later than the start of the i-th iteration, after the map of the i-th iteration has started, and before the reduce of the i-th iteration. The reason is that once the map of the i-th iteration has progressed to a certain proportion, the job of the (i+1)-th iteration is started, and after the i-th iteration completes all its operations, the (i+1)-th iteration can read the result of the previous iteration. If the result cannot be read yet, the failed map tasks of the (i+1)-th iteration simply restart (Hadoop's fault tolerance) until it can be read.
(2) Data-dependence improvement: the second improvement relaxes the data-dependence constraint: the map of the (i+1)-th iteration need not wait until the i-th iteration is fully completed before it starts computing the local center points; it only needs to wait until the reduce of the i-th iteration has completed before starting the map computation.
The formula: Et_i4 < St_(i+1)2'
is changed to: Et_i3 < St_(i+1)2'
The time saved is then: Δt2 = Et_i4 - Et_i3 > 0
This formula states that once the reduce of the i-th iteration is fully completed, the map of the (i+1)-th iteration can start, because the map needs the computation result of the previous iteration to compute the local center points in its map task.
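A numeric sketch of the two savings, using assumed stage durations (illustrative values, not measurements from the patent):

```python
# Assumed stage times for one job: startup, map, reduce, end (seconds).
startup, map_t, reduce_t, end_t = 12, 30, 10, 2

# Job dependence: job i+1 starts during job i's map instead of at Et_i,
# saving at least the time from St_i3 (reduce start) to Et_i.
dt1 = reduce_t + end_t          # Δt1 = Et_i - St_i3

# Data dependence: the map of job i+1 reads results once the reduce of
# job i finishes, instead of waiting for the whole job i to end.
dt2 = end_t                     # Δt2 = Et_i4 - Et_i3

dt = dt1 + dt2                  # total per-iteration saving Δt
print(dt1, dt2, dt)  # 12 2 14
```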
4. Conclusion
In summary, the time saved per iteration compared with before the improvement is:
Δt = Δt1 + Δt2
The improved total time is:
T' = T - N·Δt
Since Δt1 > 0 and Δt2 > 0, we have Δt > 0 and therefore:
T > T'
It follows that the time consumption after the improvement is strictly less than that of distributed k-means before the improvement, and the average time consumption per iteration is also strictly less, i.e. T̄' = T'/N < T̄.
Therefore, modifying the startup-time conditions of the job dependence and the data dependence improves the efficiency of the whole cluster-analysis flow, connects the improved jobs more compactly, and accelerates the iterative computation.
With reference to Fig. 2, the figure shows two jobs in the iterative process, Job1 and Job2. Here 1 denotes job startup, 2 denotes executing the map tasks, 3 denotes executing the reduce tasks, and 4 denotes the end.
These are the constraints on jobs before the improvement. There are three main constraints: job dependence, execution dependence, and data dependence.
Job dependence means that Job2 can only start after Job1 is fully completed.
Execution dependence means that the execution order of the stages within each job is fixed: startup, map, reduce, end.
Data dependence means that the map function of Job2 cannot read its data until Job1 is fully completed, and that the reduce within each job reads its data only after the map functions have finished.
With reference to Fig. 3, which shows the execution flow of the jobs after the improvement:
First, the job dependence is broken: Job2 can be started in advance, reducing the effect of startup time on time consumption.
Second, the execution dependence is broken: stage 3 and stage 4 can execute together, and MyReduce replaces reduce to compute the global center points, accelerating that computation.
Finally, the data dependence is broken: Job2 can read the computation result of the previous iteration as soon as stage 3 of Job1 completes.
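The broken dependences can be sketched with threads: each job is pre-started (asynchronous starting), but its map phase waits only for the previous job's reduce to finish. This is an illustrative single-process simulation, not the patent's distributed implementation; all names are assumptions:

```python
import threading
import time

def run_jobs(n_jobs, log):
    """Pre-start every job in its own thread; each job's map blocks
    only until the previous job's reduce is done (relaxed dependence)."""
    events = [threading.Event() for _ in range(n_jobs + 1)]
    events[0].set()                          # nothing precedes job 0's map

    def job(i):
        log.append(f"job{i} started")        # startup overlaps job i-1
        events[i].wait()                     # wait for previous reduce only
        log.append(f"job{i} map")
        time.sleep(0.01)                     # simulated map work
        log.append(f"job{i} reduce")
        events[i + 1].set()                  # unblock the next job's map

    threads = [threading.Thread(target=job, args=(i,)) for i in range(n_jobs)]
    for t in threads:
        t.start()                            # asynchronous starting
    for t in threads:
        t.join()
    return log

log = run_jobs(3, [])
print(log)
```

The startup lines may interleave freely, but each "jobN map" entry is guaranteed to appear after "jobN-1 reduce", mirroring the relaxed data dependence.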
The specific experimental data of the present invention are as follows:
(1) Experimental environment
Three physical machines are used in the experiment; each is configured as follows:

CPU model | CPU cores | Memory | Hard disk
---|---|---|---
Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz | 4 | 16 GB | 500 GB

Table 1: Physical machine configuration
The virtualization tool is XenCenter 6.2, and six virtual machines are started on the three physical machines. Server1 starts two virtual machines: one runs Windows 7 as the Eclipse programming environment, and one runs Ubuntu 12.04 as the Master of the Hadoop cluster, responsible for scheduling the Slave nodes. Each of the remaining two servers hosts two virtual machines serving as Slave nodes of the Hadoop cluster.
The configuration of the six virtual machines is shown in Table 2:

Virtual CPU cores | Memory | Hard disk | Operating system | Network
---|---|---|---|---
2 | 4 GB | 150 GB | Ubuntu 12.04 | Gigabit switch

Table 2: Virtual machine configuration
The development environment is shown in the following table:

Development language | Java
---|---
JDK version | JDK 1.6
Development environment | Eclipse + Maven
Hadoop version | 1.2.1

Table 3: Development environment
The division of labor among the six virtual machines in the cluster is shown in the following table. deng1 serves as the master, running the Namenode and JobTracker; the other four (deng2, deng3, deng4, deng5) serve as Datanodes and TaskTrackers. A further machine, deng0, acts as the Hadoop development environment and is responsible for compiling the Java code and issuing tasks such as the job-start instruction.

Machine name | IP | Task | Operating system
---|---|---|---
deng0 | 192.168.1.10 | Development environment + MyReduce side | Windows 7
deng1 | 192.168.1.11 | Master | Ubuntu 12.04
deng2 | 192.168.1.12 | Slave | Ubuntu 12.04
deng3 | 192.168.1.13 | Slave | Ubuntu 12.04
deng4 | 192.168.1.14 | Slave | Ubuntu 12.04
deng5 | 192.168.1.15 | Slave | Ubuntu 12.04

Table 4: Division of labor and basic condition of each virtual machine
(2) Comparison of average Reduce processing times
The data and algorithm parameters of this experiment are consistent with the first experiment of Experiment 2; the purpose is to compare the processing time of Reduce with that of MyReduce.
Table 5: Comparison of Reduce and MyReduce processing times
From the experimental results it can be found that after the MyReduce improvement, the time consumed computing the global center points falls from an average of more than 10 seconds to about 2 seconds. In the fastest-case comparison, MyReduce is at least 10 times faster than Reduce; in the slowest-case comparison, it is still at least 3 times faster. Although the average processing time of MyReduce grows more noticeably as the number of points in the data set increases, it remains far below the time consumption of the Reduce process. This is mainly because MyReduce always stays in a listening and computing state, saving the time overhead introduced by the MapReduce framework. Judging from the experimental results, MyReduce greatly improves the efficiency of computing the global center points.
(3) Comparison of cluster-analysis time
The data used in this experiment are consistent with the previous experiments, as is the matrix generated by MakeMatrix. This experiment compares the distributed k-means algorithm before the improvement (Distributed k-means) with the distributed k-means algorithm after the three improvements (Advanced Distributed k-means), mainly analyzing the average time per iteration and the total processing time of the two.
Table 6: Efficiency comparison before and after the improvement
From the experimental results it can be found that after the partial-concurrency and MyReduce improvements, the average time per iteration is reduced by nearly 20 seconds, because both the job startup time and the time consumed by Reduce in MapReduce are reduced.
In addition, under the combined effect of the three optimizations (initial-point optimization, computing the global center points with MyReduce, and partial concurrent execution), the total time consumption of the whole distributed cluster analysis declines by 54%, 40%, and 39% in the 1,000,000-, 4,000,000-, and 8,000,000-point cases respectively. In the 8,000,000-point case nearly 5 hours are saved: the time falls from the original 12 hours to just over 7 hours, reducing the waiting time by nearly 5 hours.
The experiments also show that although partial concurrent execution makes a few individual iterations take longer, the saved startup time still lowers the average time consumption per iteration. Moreover, the initial-point optimization reduces the number of iterations; in the best case, the final number of iterations is reduced by more than 100, so the total time needed to reach the iteration termination condition declines significantly. The experiments further show that as the number of points increases, the time spent computing center points within a job accounts for a larger share of the total time; that is, the share consumed by the Hadoop framework declines, and the processing efficiency of Hadoop rises.
Because the present invention changes job execution from fully serial to partially concurrent, job startup time is saved during the cluster analysis, and the resource utilization of the Hadoop cluster is also improved. Before the optimization, a "trough" phenomenon appeared in the interval between two successive iterations: after the previous iteration, the next iteration job could start only after several seconds of waiting, so CPU resource consumption dropped. Also, because the jobs executed serially, each job had to wait for the previous job to finish completely before starting, lengthening the total time of the distributed cluster analysis. After the improvement, the "trough" phenomenon disappears: the jobs execute partially concurrently, so there is no waiting for startup. Since idle CPU time decreases, the efficiency of the cluster analysis improves, and the CPU utilization of the whole Hadoop cluster also rises (mainly because the idle waiting time is reduced).
The average CPU utilization after the improvement is 68%, versus 48% before, so the cluster's CPU utilization is 41.7% higher than before (20 percentage points more of the cluster's CPU computing resources are used). The time after the improvement is 60 minutes, versus 131 minutes before, so the overall CPU resource consumption before the improvement relative to after is:
(131 x 48%) / (60 x 68%) = 1.54
That is, reaching the iteration condition before the improvement requires about 1.5 times the CPU computing resources required after the improvement, which proves that the asynchronous job-starting method effectively saves time and thus also saves the cluster's overall CPU computing resources.
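The utilization arithmetic reported above can be checked directly:

```python
# Check the CPU-utilization figures reported in the experiments.
before_util, after_util = 0.48, 0.68   # average CPU utilization
before_min, after_min = 131, 60        # total run time, minutes

# Relative increase in CPU utilization: (68 - 48) / 48.
gain = (after_util - before_util) / before_util
print(round(gain * 100, 1))  # 41.7

# Busy-CPU minutes before vs after: (131 * 48%) / (60 * 68%).
ratio = (before_min * before_util) / (after_min * after_util)
print(round(ratio, 2))  # 1.54
```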
The above describes preferred embodiments of the present invention, but the invention is not limited to these embodiments. Those skilled in the art can make various equivalent variations or substitutions without departing from the spirit of the invention, and such equivalent variations or substitutions are all contained within the scope defined by the claims of this application.
Claims (5)
1. A Hadoop optimization method based on asynchronous starting, characterized by comprising the following steps:
A. uploading the data file to HDFS and dividing the data file into multiple data blocks;
B. copying the divided data blocks and distributing them to different machines;
C. issuing a start instruction, submitting the MapReduce job, and distributing the map tasks and the MyReduce task;
D. executing the map tasks: running the Map function to process the data blocks, obtaining intermediate result data and sending it, and starting the next iteration job;
E. executing the MyReduce task: receiving the intermediate result data and processing it to obtain the current iteration result, while step D is executed in the next iteration job;
F. judging from the iteration result whether the iteration termination condition has been reached; if so, terminating the iteration; otherwise, serving the next iteration job and returning to step E;
wherein executing the MyReduce task in step E, receiving the intermediate result data, processing it, and obtaining the current iteration result comprises:
E1. listening for data sent from the map side and receiving the intermediate result data;
E2. judging whether the number of local-center-point sets of the received intermediate result data equals the number of map tasks; if so, executing step E3; otherwise, returning to step E1;
E3. computing the global center points from the intermediate result data to obtain the current iteration result;
E4. writing the iteration result to HDFS.
2. The Hadoop optimization method based on asynchronous starting according to claim 1, characterized in that step C comprises:
C1. issuing a start instruction and submitting the MapReduce job;
C2. sending the map tasks, the MyReduce task, and the data blocks to each node.
3. The Hadoop optimization method based on asynchronous starting according to claim 1, characterized in that executing the map tasks in step D, running the Map function to process the data blocks, obtaining intermediate result data and sending it, comprises:
D1. reading the data blocks on each node;
D2. running the Map function on the data blocks to compute the local center points, obtaining the intermediate result data, and sending it from the map side to the MyReduce side.
4. The Hadoop optimization method based on asynchronous starting according to claim 3, characterized in that the local center points in step D2 are computed with the k-means algorithm.
5. The Hadoop optimization method based on asynchronous starting according to claim 1, characterized in that the global center points in step E3 are computed with the k-means algorithm.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201410757131.4A (CN104503820B) | 2014-12-10 | 2014-12-10 | A Hadoop optimization method based on asynchronous starting
Publications (2)

Publication Number | Publication Date
---|---
CN104503820A | 2015-04-08
CN104503820B | 2018-07-24
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104915250B (en) * | 2015-06-03 | 2018-04-06 | 电子科技大学 | It is a kind of to realize the method for making MapReduce data localization in the industry |
CN107844568B (en) * | 2017-11-03 | 2021-05-28 | 广东电网有限责任公司电力调度控制中心 | MapReduce execution process optimization method for processing data source update |
CN110795265B (en) * | 2019-10-25 | 2021-04-02 | 东北大学 | Iterator based on optimistic fault-tolerant method |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102222092A (en) * | 2011-06-03 | 2011-10-19 | 复旦大学 | Massive high-dimension data clustering method for MapReduce platform |
CN103077253A (en) * | 2013-01-25 | 2013-05-01 | 西安电子科技大学 | High-dimensional mass data GMM (Gaussian Mixture Model) clustering method under Hadoop framework |
CN103605576A (en) * | 2013-11-25 | 2014-02-26 | 华中科技大学 | Multithreading-based MapReduce execution system |
CN103617087A (en) * | 2013-11-25 | 2014-03-05 | 华中科技大学 | MapReduce optimizing method suitable for iterative computations |
CN103838863A (en) * | 2014-03-14 | 2014-06-04 | 内蒙古科技大学 | Big-data clustering algorithm based on cloud computing platform |
CN104156463A (en) * | 2014-08-21 | 2014-11-19 | 南京信息工程大学 | Big-data clustering ensemble method based on MapReduce |
Non-Patent Citations (1)
Title |
---|
基于云计算的并行k-means算法研究 (Research on a parallel k-means algorithm based on cloud computing); 林长方 (Lin Changfang) et al.; 《齐齐哈尔大学学报》 (Journal of Qiqihar University); 2014-09-30; Vol. 30, No. 5; pp. 5-9, Sections 1-2 * |
Legal Events
Code | Title
---|---
C06 | Publication
PB01 | Publication
C10 | Entry into substantive examination
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant