CN104503820B - Hadoop optimization method based on asynchronous startup - Google Patents

Hadoop optimization method based on asynchronous startup

Info

Publication number
CN104503820B
CN104503820B
Authority
CN
China
Prior art keywords
iteration
data
hadoop
map
tasks
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410757131.4A
Other languages
Chinese (zh)
Other versions
CN104503820A (en)
Inventor
赵淦森
邓运亨
王翔
何建涛
程庆年
周冠宇
周尚勤
王欣明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China Normal University
GCI Science and Technology Co Ltd
Original Assignee
South China Normal University
GCI Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China Normal University and GCI Science and Technology Co Ltd
Priority to CN201410757131.4A
Publication of CN104503820A
Application granted
Publication of CN104503820B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The invention discloses a Hadoop optimization method based on asynchronous job startup. By changing jobs from fully serial execution to partially concurrent execution, the entire job execution flow is optimized, the speed of iterative computation is greatly increased, and execution efficiency is effectively improved. The invention does not require modifying the underlying code, is easy to use, improves cluster utilization, and does not create a memory-space bottleneck. As a Hadoop optimization method based on asynchronous startup, the invention can be widely applied in Hadoop framework technology.

Description

Hadoop optimization method based on asynchronous startup
Technical field
The present invention relates to the field of computer technology, and in particular to a Hadoop optimization method based on asynchronous job startup.
Background technology
Current Hadoop processing of iterative computation suffers from the following performance bottlenecks:
Fully serial execution: each job must wait until the previous job is fully completed before it can start.
Long startup time: starting a job takes 10-15 seconds on average, which is a considerable waste of time.
Long reduce process: the reduce process computes the global center points and writes the results to HDFS; it takes about 10 seconds and is likewise a large time cost.
Randomly chosen initial center points: the initial center points have a strong influence on the k-means iteration; choosing better initial centers helps reduce the number of iterations and the total time.
Repeated I/O: the data read by each map end is essentially unchanged, yet it is read again every time a task starts.
Network bandwidth consumption: data may be read from other nodes instead of being processed locally, causing substantial network bandwidth consumption.
To address the above performance problems of Hadoop, many improvements to Hadoop iterative computation have appeared. However, most of them involve modifying Hadoop's underlying source code, so when a user needs to switch back to the original Hadoop mode, the entire Hadoop framework has to be replaced, which causes great inconvenience. In addition, some improved methods cache the repeatedly read data; although this avoids repeated I/O accesses, it also increases memory pressure on the nodes and introduces a new performance bottleneck, while changing the structure of Hadoop and losing its original characteristics.
Hadoop: a distributed big-data processing tool;
MapReduce: a distributed computing framework;
k-means algorithm: a classic partition-based clustering algorithm;
Iterative computation: computation that must be run multiple times before the iteration termination condition is reached;
HDFS: the Hadoop distributed file system.
Summary of the invention
To solve the above technical problems, the object of the present invention is to provide a Hadoop optimization method based on asynchronous startup that improves execution efficiency without requiring modification of the underlying code.
The technical solution adopted by the present invention is as follows:
A Hadoop optimization method based on asynchronous startup comprises the following steps:
A. uploading a data file to HDFS and dividing the data file into multiple data blocks;
B. copying and distributing the divided data blocks to different machines;
C. issuing a start instruction, submitting a MapReduce job, and distributing map tasks and MyReduce tasks;
D. executing the map tasks, running the Map function to process the data blocks, obtaining and sending intermediate result data, and starting the job of the next iteration;
E. executing the MyReduce tasks, receiving the intermediate result data and processing it to obtain the current iteration result, while step D is executed in the job of the next iteration;
F. judging, according to the iteration result, whether the iteration termination condition has been reached; if so, ending the iteration; otherwise, returning to step E to serve the job of the next iteration.
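To make the flow of steps A-F concrete, the following is a minimal driver sketch under stated assumptions: it uses the Hadoop 1.x "old" API (the experiments below use Hadoop 1.2.1), the class names AsyncKMeansDriver, LocalCenterMapper and MyReduceServer are illustrative (the latter two are sketched later in the detailed description), the configuration key "kmeans.prev.centers" and the 50% map-progress threshold for pre-starting the next job are only examples. The key point is that submitJob() returns immediately, so the job of iteration i+1 can be started while iteration i is still running.

```java
// Minimal sketch of an asynchronous job-chaining driver (assumed names; Hadoop 1.x API).
// Each iteration runs a map-only job; the global centers are computed by the external
// MyReduce listener rather than by a framework reduce phase.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.TextInputFormat;

public class AsyncKMeansDriver {

    // Build and submit the map-only job for one iteration; submitJob() does not block.
    static RunningJob submit(int iteration, String input, String output) throws Exception {
        JobConf conf = new JobConf(AsyncKMeansDriver.class);
        conf.setJobName("async-kmeans-iter-" + iteration);
        conf.setMapperClass(LocalCenterMapper.class); // computes local centers (sketched later)
        conf.setNumReduceTasks(0);                    // reduce is replaced by MyReduce
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputKeyClass(NullWritable.class);
        conf.setOutputValueClass(NullWritable.class);
        // Tell the mappers where the previous iteration's global centers live (assumed key).
        conf.set("kmeans.prev.centers", output + "/centers-" + (iteration - 1));
        FileInputFormat.setInputPaths(conf, new Path(input));
        FileOutputFormat.setOutputPath(conf, new Path(output + "/iter-" + iteration));
        return new JobClient(conf).submitJob(conf);
    }

    public static void main(String[] args) throws Exception {
        int i = 1;
        RunningJob current = submit(i, args[0], args[1]);
        while (true) {
            // Break the job dependence: pre-start iteration i+1 once the maps of
            // iteration i are under way, overlapping the 10-15 s job startup time.
            while (current.mapProgress() < 0.5f && !current.isComplete()) {
                Thread.sleep(1000);
            }
            RunningJob next = submit(i + 1, args[0], args[1]);

            // MyReduce (running outside MapReduce) writes iteration i's global centers
            // to HDFS and decides whether the termination condition has been reached.
            if (MyReduceServer.iterationConverged(i)) { // hypothetical helper
                next.killJob();                         // the pre-started job is no longer needed
                break;
            }
            current = next;
            i++;
        }
    }
}
```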
As a further improvement of the Hadoop optimization method based on asynchronous startup, step C comprises:
C1. issuing the start instruction and submitting the MapReduce job;
C2. sending the map tasks, the MyReduce tasks and the data blocks to each node.
As a further improvement of the Hadoop optimization method based on asynchronous startup, executing the map tasks in step D, running the Map function to process the data blocks, and obtaining and sending intermediate result data comprises:
D1. reading the data block on each node;
D2. running the Map function on the data block to compute the local center points, obtaining intermediate result data, and sending it from the map end to the MyReduce end.
As a further improvement of the Hadoop optimization method based on asynchronous startup, executing the MyReduce tasks in step E, receiving the intermediate result data, processing it and obtaining the current iteration result comprises:
E1. listening for data sent from the map end and receiving the intermediate result data;
E2. judging whether the number of local center point sets received equals the number of map tasks; if so, executing step E3; otherwise, returning to step E1;
E3. computing the global center points from the intermediate result data to obtain the current iteration result;
E4. writing the iteration result to HDFS.
As a further improvement of the Hadoop optimization method based on asynchronous startup, the local center points in step D2 are computed with the k-means algorithm.
As a further improvement of the Hadoop optimization method based on asynchronous startup, the global center points in step E3 are computed with the k-means algorithm.
The beneficial effects of the invention are as follows:
By changing jobs from fully serial execution to partially concurrent execution, the Hadoop optimization method based on asynchronous startup of the present invention optimizes the entire job execution flow, greatly increases the speed of iterative computation, and effectively improves execution efficiency. The present invention does not require modifying the underlying code, is easy to use, improves cluster utilization, and does not create a memory-space bottleneck.
Description of the drawings
Specific embodiments of the present invention are further described below with reference to the accompanying drawings:
Fig. 1 is a flow chart of the steps of the Hadoop optimization method based on asynchronous startup of the present invention;
Fig. 2 is the execution flow chart of a Hadoop job before the improvement;
Fig. 3 is the execution flow chart of the Hadoop optimization method based on asynchronous startup of the present invention.
Detailed description of embodiments
With reference to Fig. 1, the Hadoop optimization method based on asynchronous startup of the present invention comprises the following steps:
A. uploading a data file to HDFS and dividing the data file into multiple data blocks;
B. copying and distributing the divided data blocks to different machines;
C. issuing a start instruction, submitting a MapReduce job, and distributing map tasks and MyReduce tasks;
D. executing the map tasks, running the Map function to process the data blocks, obtaining and sending intermediate result data, and starting the job of the next iteration;
E. executing the MyReduce tasks, receiving the intermediate result data and processing it to obtain the current iteration result, while step D is executed in the job of the next iteration;
F. judging, according to the iteration result, whether the iteration termination condition has been reached; if so, ending the iteration; otherwise, returning to step E to serve the job of the next iteration.
As a further improvement of the Hadoop optimization method based on asynchronous startup, step C comprises:
C1. issuing the start instruction and submitting the MapReduce job;
C2. sending the map tasks, the MyReduce tasks and the data blocks to each node.
As a further improvement of the Hadoop optimization method based on asynchronous startup, executing the map tasks in step D, running the Map function to process the data blocks, and obtaining and sending intermediate result data comprises:
D1. reading the data block on each node;
D2. running the Map function on the data block to compute the local center points, obtaining intermediate result data, and sending it from the map end to the MyReduce end.
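As one concrete reading of steps D1-D2, here is an illustrative map-side sketch. All names are assumptions: the patent does not fix how the map end sends its local centers to the MyReduce end, so a plain TCP socket to a listener (see the MyReduce sketch after step E4) is assumed, the local centers are modeled as per-cluster coordinate sums and counts, and the configuration keys and the CenterStore helper are hypothetical.

```java
// Illustrative map-side sketch (assumed names and a plain-socket transport; Hadoop 1.x API).
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.net.Socket;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LocalCenterMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, NullWritable, NullWritable> {

    private double[][] centers;   // previous iteration's global centers
    private double[][] sums;      // per-cluster coordinate sums (local center data)
    private long[] counts;        // per-cluster point counts
    private String myReduceHost;
    private int myReducePort;

    @Override
    public void configure(JobConf conf) {
        // D1: read the previous iteration's global centers from HDFS; if they are
        // not available yet, the task fails and is retried by Hadoop.
        centers = CenterStore.readGlobalCenters(conf);     // hypothetical helper
        sums = new double[centers.length][centers[0].length];
        counts = new long[centers.length];
        myReduceHost = conf.get("myreduce.host", "deng0"); // assumed keys and defaults
        myReducePort = conf.getInt("myreduce.port", 9999);
    }

    // D2 (part 1): assign every point of the data block to its nearest center.
    public void map(LongWritable key, Text value,
                    OutputCollector<NullWritable, NullWritable> out, Reporter reporter)
            throws IOException {
        double[] point = parsePoint(value.toString());
        int nearest = nearestCenter(point);
        counts[nearest]++;
        for (int d = 0; d < point.length; d++) {
            sums[nearest][d] += point[d];
        }
    }

    // D2 (part 2): when the task finishes, send the local centers to the MyReduce end.
    @Override
    public void close() throws IOException {
        Socket socket = new Socket(myReduceHost, myReducePort);
        ObjectOutputStream oos = new ObjectOutputStream(socket.getOutputStream());
        oos.writeObject(sums);
        oos.writeObject(counts);
        oos.close();
        socket.close();
    }

    private int nearestCenter(double[] p) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int k = 0; k < centers.length; k++) {
            double dist = 0.0;
            for (int d = 0; d < p.length; d++) {
                double diff = p[d] - centers[k][d];
                dist += diff * diff;
            }
            if (dist < bestDist) { bestDist = dist; best = k; }
        }
        return best;
    }

    private double[] parsePoint(String line) {
        String[] parts = line.trim().split("\\s+");
        double[] p = new double[parts.length];
        for (int d = 0; d < parts.length; d++) {
            p[d] = Double.parseDouble(parts[d]);
        }
        return p;
    }
}
```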
As a further improvement of the Hadoop optimization method based on asynchronous startup, executing the MyReduce tasks in step E, receiving the intermediate result data, processing it and obtaining the current iteration result comprises:
E1. listening for data sent from the map end and receiving the intermediate result data;
E2. judging whether the number of local center point sets received equals the number of map tasks; if so, executing step E3; otherwise, returning to step E1;
E3. computing the global center points from the intermediate result data to obtain the current iteration result;
E4. writing the iteration result to HDFS.
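Correspondingly, steps E1-E4 can be sketched as a stand-alone listener running outside the MapReduce framework (in the experiments below this role is played by the deng0 machine). Again, the class name, the socket protocol and the sums/counts representation are assumptions carried over from the map-side sketch; writing the result to HDFS (step E4) and the convergence check used by the driver are only indicated in comments.

```java
// Illustrative MyReduce listener sketch (assumed names and protocol).
import java.io.IOException;
import java.io.ObjectInputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class MyReduceServer {

    // E1-E3: collect one local result per map task, then combine them into the
    // global centers of the current iteration.
    public static double[][] collectGlobalCenters(int port, int numMapTasks, int k, int dim)
            throws IOException, ClassNotFoundException {
        double[][] globalSums = new double[k][dim];
        long[] globalCounts = new long[k];
        int received = 0;

        ServerSocket server = new ServerSocket(port);
        // E1/E2: keep listening until the number of local center sets received
        // equals the number of map tasks.
        while (received < numMapTasks) {
            Socket socket = server.accept();
            ObjectInputStream in = new ObjectInputStream(socket.getInputStream());
            double[][] sums = (double[][]) in.readObject();
            long[] counts = (long[]) in.readObject();
            for (int c = 0; c < k; c++) {
                globalCounts[c] += counts[c];
                for (int d = 0; d < dim; d++) {
                    globalSums[c][d] += sums[c][d];
                }
            }
            in.close();
            socket.close();
            received++;
        }
        server.close();

        // E3: each global center is the weighted mean of the local contributions.
        double[][] centers = new double[k][dim];
        for (int c = 0; c < k; c++) {
            for (int d = 0; d < dim; d++) {
                centers[c][d] = globalCounts[c] == 0 ? 0.0 : globalSums[c][d] / globalCounts[c];
            }
        }
        // E4 (not shown): write 'centers' to HDFS with the FileSystem API so that the
        // maps of the next iteration can read them; comparing them with the previous
        // centers gives the iterationConverged() check used by the driver sketch.
        return centers;
    }
}
```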
As a further improvement of the Hadoop optimization method based on asynchronous startup, the local center points in step D2 are computed with the k-means algorithm.
As a further improvement of the Hadoop optimization method based on asynchronous startup, the global center points in step E3 are computed with the k-means algorithm.
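For concreteness, the k-means computation at the two levels can be written as follows. This is the standard distributed k-means formulation; the patent only specifies that the k-means algorithm is used, so the exact decomposition into per-cluster sums and counts is an assumption matching the sketches above.

```latex
% Map side (task m, data block B_m): assign each point to the nearest of the
% previous iteration's global centers c_1, ..., c_K and form local sums/counts.
\[
S_{m,k} = \sum_{\substack{x \in B_m \\ \mathrm{nearest}(x) = k}} x,
\qquad
n_{m,k} = \left|\{\, x \in B_m : \mathrm{nearest}(x) = k \,\}\right|
\]
% MyReduce side: combine the local results of all M map tasks into the new
% global centers for the next iteration.
\[
c_k' = \frac{\sum_{m=1}^{M} S_{m,k}}{\sum_{m=1}^{M} n_{m,k}},
\qquad k = 1, \dots, K
\]
```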
The asynchronous job startup in the present invention is intended to solve the time waste caused by serial job execution. To this end, the entire job flow is first described abstractly, the available room for improvement is then identified, and finally the improvement strategy of this application is proposed; it is verified that the improved method reduces the total time consumption.
1. Scenario description
When performing distributed cluster analysis on a data set, multiple iterations are needed, and each job is responsible for one iteration. These jobs run back to back, and each job necessarily depends on the computation result of the previous job. Therefore, in the distributed k-means algorithm, each job has to wait until the previous job is fully completed before it can start. Within each job, the phases also execute in a strict order.
To formalize the above scenario, a mathematical model is established here to represent the dependence relations between jobs and within a job during the entire clustering process, and the model is used to calculate the efficiency improvement of the whole iterative computation after asynchronous startup.
(1) Definitions
Definition 1: a complete job lasts from its startup until it is fully completed;
Definition 2: the entire iterative process lasts from the first iteration to the end of the last iteration;
Definition 3: the average time consumed per job is defined as the total time divided by the number of iterations.
(2) Preconditions
1) The entire iterative process consists of a sequence of consecutive jobs;
2) Each job has four phases: startup, map, reduce and end; each phase consumes a certain amount of time, and the execution order of the phases is fixed;
3) When a job finishes, if the iteration termination condition has not been reached, a new job is certain to be started;
4) If Job1 starts before Job2 and the two jobs are adjacent, then the local-center computation in the map tasks of Job2 can only start after the result of Job1 has been completely written to HDFS, and Job2 cannot finish before Job1;
5) A cluster can support running multiple jobs at the same time;
6) Under normal circumstances, a job does not stop halfway through its execution.
2. Mathematical model
The previous subsection described in words how the jobs of the iterative computation execute during the whole cluster analysis; this description is now abstracted into a mathematical model.
The model mainly describes the dependence between jobs, the execution dependence within a job, and the data dependence between jobs.
(1) Variables:
N: total number of iterations of the entire cluster analysis;
i: iteration index, i = 1, 2, 3, ..., N;
j: index of the phases within each job, j = 1 to 4, where 1 is the startup phase, 2 the map phase, 3 the reduce phase and 4 the end phase;
St_i: the moment at which the job of the i-th iteration starts;
Et_i: the moment at which the job of the i-th iteration is fully completed;
St_{ij}: the start moment of the j-th phase of the i-th iteration;
St_{i2}': the moment at which the map of the i-th iteration starts computing the local centers;
St_{i3}': the moment at which the reduce of the i-th iteration starts computing the global centers;
Et_{ij}: the moment at which the j-th phase of the i-th iteration is fully completed;
T: total time consumption;
t̄: average time consumption per iteration.
(2) Serial constraints on the jobs
In the distributed k-means iterative computation, the jobs execute in a strict order: there are strict dependence relations between jobs, and strict data dependence relations within each job. The variables defined above are now used to formalize these dependence relations.
The serial dependences are divided into three kinds: job dependence, execution dependence and data dependence. Job dependence refers to the execution order between jobs; execution dependence refers to the execution order of the phases within a job; data dependence refers to the data dependence relations involving the map and reduce of the jobs.
Job dependence:
The (i+1)-th iteration does not start until the i-th iteration has ended:
Et_i < St_{i+1}
The job of the (i+1)-th iteration cannot be completed earlier than the job of the i-th iteration:
Et_i < Et_{i+1}
Execution dependence: the execution order of the phases within each job is fixed, namely startup, map, reduce, end; the phases of a job never execute out of order:
St_i ≤ St_{ij} ≤ St_{i(j+1)} ≤ Et_i
This formula states that, within the job of the i-th iteration, the phases start in a strict order and never out of sequence.
Data dependence:
Only after the i-th iteration is fully completed can the map of the (i+1)-th iteration start to read the result of the previous iteration:
Et_{i4} < St_{(i+1)2}'
There is also a data dependence between the map and reduce within each job: reduce needs map to be fully completed before it starts computing the global centers:
Et_{i2} ≤ St_{i3}'
The total time consumed by the entire clustering process is then:
T = Et_N − St_1
and the average time consumption per iteration is:
t̄ = T / N
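For convenience, the serial constraints and the two time quantities above can be collected into a single display (same notation as the variable list; the approximate sum form holds because, under the job dependence constraint, the jobs run essentially back to back):

```latex
% Serial constraints of the unimproved distributed k-means, for i = 1, ..., N-1:
% job dependence, execution dependence and data dependence, followed by the
% total and average time under back-to-back serial execution.
\[
Et_i < St_{i+1}, \qquad Et_i < Et_{i+1}
\]
\[
St_i \le St_{ij} \le St_{i(j+1)} \le Et_i, \qquad j = 1, 2, 3
\]
\[
Et_{i2} \le St_{i3}', \qquad Et_{i4} < St_{(i+1)2}'
\]
\[
T = Et_N - St_1 \approx \sum_{i=1}^{N} \bigl(Et_i - St_i\bigr),
\qquad
\bar{t} = \frac{T}{N}
\]
```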
3. Improvement scheme
This subsection improves two of the constraints on the chained job execution: one is the job dependence condition and the other is the data dependence condition. Because of precondition 5 (the cluster supports running multiple jobs at the same time), a job can be started in advance.
(1) Improvement of the job dependence: the map phase of the job of the i-th iteration only needs to start before the job of the (i+1)-th iteration starts; it is no longer required that the job of the i-th iteration be fully completed before the (i+1)-th iteration starts.
The original constraint Et_i ≤ St_{i+1}
is improved to: St_i < St_{i2} < St_{i+1} < Et_i
The time saved is then at least: Δt_1 = Et_i − St_{i3}
This formula shows that the (i+1)-th iteration need not wait until the i-th iteration is fully completed before it starts; it is only necessary to ensure that the start time of the i-th iteration precedes the start time of the (i+1)-th iteration, and that the (i+1)-th job starts after the map of the i-th iteration has begun and before the reduce of the i-th iteration. The reason is that the job of the (i+1)-th iteration is started once the map of the i-th iteration has reached a certain proportion of progress; after the i-th iteration completes all its operations, the (i+1)-th iteration can read the result of the previous iteration. If the result cannot be read yet, the map tasks of the (i+1)-th iteration fail and are restarted (Hadoop's fault tolerance) until the result can be read.
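In code, this reliance on Hadoop's task retry can be as simple as failing the map attempt when the previous iteration's result is not yet visible on HDFS; the framework then re-schedules the attempt until the data appears. The sketch below is one possible shape of the CenterStore.readGlobalCenters helper assumed in the earlier map-side sketch (the path key and the one-center-per-line file format are assumptions):

```java
// Sketch: fail fast when the previous iteration's global centers (or the initial
// centers for iteration 1) are not yet on HDFS, so Hadoop retries the map attempt.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CenterStore {

    // Read the previous iteration's global centers, one center per line of
    // space-separated coordinates (assumed format).
    public static double[][] readGlobalCenters(Configuration conf) {
        try {
            Path prev = new Path(conf.get("kmeans.prev.centers")); // assumed key
            FileSystem fs = prev.getFileSystem(conf);
            if (!fs.exists(prev)) {
                // Not written yet: fail this map attempt so that Hadoop's fault
                // tolerance retries it until the result appears.
                throw new IOException("previous iteration result not ready: " + prev);
            }
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(fs.open(prev)));
            List<double[]> centers = new ArrayList<double[]>();
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.trim().split("\\s+");
                double[] c = new double[parts.length];
                for (int d = 0; d < parts.length; d++) {
                    c[d] = Double.parseDouble(parts[d]);
                }
                centers.add(c);
            }
            reader.close();
            return centers.toArray(new double[centers.size()][]);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```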
(2) Improvement of the data dependence: the second improvement concerns the data dependence constraint. The map of the (i+1)-th iteration does not need to wait until the i-th iteration is fully completed before it starts computing the local centers; it only needs the reduce of the i-th iteration to have finished before the map computation starts.
The original constraint Et_{i4} < St_{(i+1)2}'
is changed to: Et_{i3} < St_{(i+1)2}'
The time saved is then: Δt_2 = Et_{i4} − Et_{i3} > 0
This formula indicates that the map of the (i+1)-th iteration can start only after the reduce of the i-th iteration is fully completed, because the map needs the computation result of the previous iteration to compute the local centers in its map tasks.
4. Conclusion
In summary, each iteration saves, compared with the time before the improvement:
Δt = Δt_1 + Δt_2
The improved total time then satisfies: T' ≤ T − (N − 1)·Δt
Since Δt_1 > 0 and Δt_2 > 0, we have Δt > 0 and therefore T > T'. It can be deduced that the time consumption after the improvement is certainly less than the distributed k-means time consumption before the improvement, and the average time per iteration is also certainly less than before the improvement, i.e. t̄' = T'/N < T/N = t̄.
Therefore, modifying the startup-time conditions of the job dependence and the data dependence has a definite effect on improving the efficiency of the entire clustering flow; the improved jobs are chained together more tightly, which accelerates the iterative computation.
With reference to Fig. 2, there are two jobs of the iterative process in the figure, Job1 and Job2. In the figure, 1 denotes job startup, 2 denotes execution of the map tasks, 3 denotes execution of the reduce tasks, and 4 denotes the end of the job.
These are the constraints on the jobs before the improvement. There are three main constraints: job dependence, execution dependence and data dependence.
Job dependence means that Job2 can only start after Job1 is fully completed.
Execution dependence means that the execution order of the phases within each job is fixed: startup, map, reduce and end are executed in that order.
Data dependence means that the Map functions of Job2 can read their data only after Job1 is fully completed, and within each job the data read by reduce is available only after the Map functions have finished.
With reference to Fig. 3, Fig. 3 shows the execution flow of the jobs after the improvement.
First, the job dependence is broken, allowing Job2 to be started in advance, which reduces the impact of the startup time on the time consumption.
Second, the execution dependence is broken: phase 3 and phase 4 are allowed to proceed together, and MyReduce is used instead of reduce to compute the global center points, which speeds up the computation of the global center points.
Finally, the data dependence is broken, so that Job2 can read the computation result of the previous iteration as soon as phase 3 of Job1 has completed.
The specific experimental data of the present invention are as follows:
(1) Experimental environment
Three physical machines were used in the experiment; each physical machine was configured as follows:
CPU model: Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz; CPU cores: 4; Memory: 16 GB; Hard disk: 500 GB
Table 1. Physical machine configuration
The virtualization tool used was XenCenter 6.2, and six virtual machines were started on the three physical machines. Server1 ran two virtual machines: one with a Win7 system, used as the Eclipse programming environment, and one with Ubuntu 12.04, used as the Master of the Hadoop cluster and responsible for scheduling the Slave nodes. The remaining two servers each hosted two virtual machines serving as Slave nodes of the Hadoop cluster.
The configuration of the six virtual machines is shown in Table 2:
Virtual CPU cores: 2; Memory: 4 GB; Hard disk: 150 GB; Operating system: Ubuntu 12.04; Network: gigabit switch
Table 2. Virtual machine configuration
The development environment is shown in the following table:
Development language: Java; JDK version: JDK 1.6; Development environment: Eclipse + Maven; Hadoop version: 1.2.1
Table 3. Development environment
The specific division of labor among the six virtual machines in the cluster is shown in the following table: deng1 serves as the master, responsible for the Namenode and the JobTracker; four machines (deng2, deng3, deng4, deng5) serve as Datanodes and TaskTrackers; and one machine, deng0, serves as the Hadoop development environment, responsible for compiling the Java code and issuing tasks such as the job start instruction.
Machine name | IP | Task | Operating system
deng0 | 192.168.1.10 | Development environment + MyReduce end | Windows 7
deng1 | 192.168.1.11 | Master | Ubuntu 12.04
deng2 | 192.168.1.12 | Slave | Ubuntu 12.04
deng3 | 192.168.1.13 | Slave | Ubuntu 12.04
deng4 | 192.168.1.14 | Slave | Ubuntu 12.04
deng5 | 192.168.1.15 | Slave | Ubuntu 12.04
Table 4. Division of labor and basic configuration of each virtual machine
(2) Comparison of average Reduce processing times
The data and the algorithm parameters chosen for this experiment are the same as those of the first experiment of Experiment 2; the purpose is to compare the processing time of Reduce with that of MyReduce.
Table 5. Comparison of Reduce and MyReduce processing times
From the experimental results it can be found that, after the MyReduce improvement, the time spent computing the global center points drops from more than 10 seconds on average to about 2 seconds on average. Comparing the fastest processing times, MyReduce is at least 10 times faster than Reduce; comparing the slowest processing times, MyReduce is still at least 3 times faster than Reduce. Although the average processing time of MyReduce increases more noticeably as the number of points in the data set grows, it remains far below the time consumption of the Reduce process. This is mainly because MyReduce stays in a listening and computing state at all times, saving the overhead introduced by the MapReduce framework. Judging from the experimental results, MyReduce improves the efficiency of computing the global center points considerably.
(3) Comparison of cluster analysis times
The data used in this experiment are the same as in the preceding experiments, together with the matrix generated by MakeMatrix.
This experiment compares the distributed k-means algorithm before the improvement (Distributed k-means) with the distributed k-means algorithm after the three improvements (Advanced Distributed k-means), mainly analysing the average time per iteration and the total processing time.
Table 6. Efficiency comparison before and after the improvement
From the experimental results it can be found that, after the partial-concurrency and MyReduce improvements, the average time per iteration is reduced by nearly 20 seconds, because the job startup time and the time consumed by Reduce in MapReduce are reduced.
In addition, under the combined effect of the three optimizations (initial-point optimization, computing the global center points with MyReduce, and partially concurrent execution), the total time consumption of the entire distributed cluster analysis drops by 54%, 40% and 39% in the cases of 1,000,000 (100W), 4,000,000 (400W) and 8,000,000 (800W) points respectively. In the 8,000,000-point case nearly 5 hours are saved: the time drops from the original 12 hours to a little over 7 hours, reducing the waiting time by nearly 5 hours.
It was found in the experiments that, although the partially concurrent execution makes a few individual iterations take longer, the average time per iteration still decreases because the startup-time overhead is saved. In addition, the initial-point optimization reduces the number of iterations; in the case with the largest reduction, the final number of iterations is reduced by more than 100, so the total time needed to reach the iteration termination condition drops significantly. The experiments also show that, as the number of points increases, the time spent computing the center points takes up a larger share of the total time; that is, the share consumed by the Hadoop framework decreases and the processing efficiency of Hadoop rises.
Because the present invention changes jobs from serial execution to partially concurrent execution, the job startup time is saved during the cluster analysis, and the resource utilization of the Hadoop cluster is also improved. Before the optimization, a "trough" appeared in the interval between two iterations: after one iteration finished, the job of the next iteration could only start after waiting for several seconds, so a drop in CPU resource consumption was observed. Also, because the jobs executed serially and each job had to wait for the previous one to finish completely before it could start, the total time of the distributed cluster analysis became longer. After the improvement the "trough" phenomenon disappears: the jobs execute with partial concurrency and no longer wait for startup. Since the idle time of the CPU decreases, the efficiency of the cluster analysis improves, and the CPU utilization of the whole Hadoop cluster also improves (mainly because the idle waiting time is reduced).
The average CPU utilization after the improvement is 68%, while the average CPU utilization before the improvement is 48%, so the cluster's CPU utilization is 41.7% higher than before (20 percentage points more of the cluster's CPU computing resources are utilized). The time after the improvement is 60 minutes and the time before the improvement is 131 minutes, so the ratio of the overall CPU resource consumption before the improvement to that after the improvement is (131 × 48%) / (60 × 68%) = 1.54; that is, reaching the iteration termination condition before the improvement requires about 1.5 times the CPU computing resources needed after the improvement. This demonstrates that the asynchronous job startup method can effectively save time and thereby also save the cluster's overall CPU computing resources.
The above is a description of preferred embodiments of the present invention, but the invention is not limited to the above embodiments. Those skilled in the art can make various equivalent variations or replacements without departing from the spirit of the invention, and such equivalent variations or replacements are all included within the scope defined by the claims of this application.

Claims (5)

1. A Hadoop optimization method based on asynchronous startup, characterized by comprising the following steps:
A. uploading a data file to HDFS and dividing the data file into multiple data blocks;
B. copying and distributing the divided data blocks to different machines;
C. issuing a start instruction, submitting a MapReduce job, and distributing map tasks and MyReduce tasks;
D. executing the map tasks, running the Map function to process the data blocks, obtaining and sending intermediate result data, and starting the job of the next iteration;
E. executing the MyReduce tasks, receiving the intermediate result data and processing it to obtain the current iteration result, while step D is executed in the job of the next iteration;
F. judging, according to the iteration result, whether the iteration termination condition has been reached; if so, ending the iteration; otherwise, returning to step E to serve the job of the next iteration;
wherein executing the MyReduce tasks in step E, receiving the intermediate result data, processing it and obtaining the current iteration result comprises:
E1. listening for data sent from the map end and receiving the intermediate result data;
E2. judging whether the number of local center point sets received equals the number of map tasks; if so, executing step E3; otherwise, returning to step E1;
E3. computing the global center points from the intermediate result data to obtain the current iteration result;
E4. writing the iteration result to HDFS.
2. The Hadoop optimization method based on asynchronous startup according to claim 1, characterized in that step C comprises:
C1. issuing the start instruction and submitting the MapReduce job;
C2. sending the map tasks, the MyReduce tasks and the data blocks to each node.
3. The Hadoop optimization method based on asynchronous startup according to claim 1, characterized in that executing the map tasks in step D, running the Map function to process the data blocks, and obtaining and sending intermediate result data comprises:
D1. reading the data block on each node;
D2. running the Map function on the data block to compute the local center points, obtaining intermediate result data, and sending it from the map end to the MyReduce end.
4. The Hadoop optimization method based on asynchronous startup according to claim 3, characterized in that the local center points in step D2 are computed with the k-means algorithm.
5. The Hadoop optimization method based on asynchronous startup according to claim 1, characterized in that the global center points in step E3 are computed with the k-means algorithm.
CN201410757131.4A 2014-12-10 2014-12-10 Hadoop optimization method based on asynchronous startup Active CN104503820B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410757131.4A CN104503820B (en) 2014-12-10 2014-12-10 Hadoop optimization method based on asynchronous startup


Publications (2)

Publication Number Publication Date
CN104503820A CN104503820A (en) 2015-04-08
CN104503820B true CN104503820B (en) 2018-07-24

Family

ID=52945221

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410757131.4A Active CN104503820B (en) 2014-12-10 2014-12-10 Hadoop optimization method based on asynchronous startup

Country Status (1)

Country Link
CN (1) CN104503820B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104915250B * 2015-06-03 2018-04-06 电子科技大学 A method for realizing MapReduce data localization for jobs
CN107844568B (en) * 2017-11-03 2021-05-28 广东电网有限责任公司电力调度控制中心 MapReduce execution process optimization method for processing data source update
CN110795265B (en) * 2019-10-25 2021-04-02 东北大学 Iterator based on optimistic fault-tolerant method


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102222092A (en) * 2011-06-03 2011-10-19 复旦大学 Massive high-dimension data clustering method for MapReduce platform
CN103077253A (en) * 2013-01-25 2013-05-01 西安电子科技大学 High-dimensional mass data GMM (Gaussian Mixture Model) clustering method under Hadoop framework
CN103605576A (en) * 2013-11-25 2014-02-26 华中科技大学 Multithreading-based MapReduce execution system
CN103617087A (en) * 2013-11-25 2014-03-05 华中科技大学 MapReduce optimizing method suitable for iterative computations
CN103838863A (en) * 2014-03-14 2014-06-04 内蒙古科技大学 Big-data clustering algorithm based on cloud computing platform
CN104156463A (en) * 2014-08-21 2014-11-19 南京信息工程大学 Big-data clustering ensemble method based on MapReduce

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on parallel k-means algorithm based on cloud computing (基于云计算的并行k-means算法研究); 林长方 et al.; Journal of Qiqihar University (齐齐哈尔大学学报); 2014-09-30; Vol. 30, No. 5; pp. 5-9, Sections 1-2 *

Also Published As

Publication number Publication date
CN104503820A (en) 2015-04-08


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant