CN106202431A

CN106202431A - Machine-learning-based method and system for automatic tuning of Hadoop parameters

Info

Publication number: CN106202431A
Application number: CN201610550098.7A
Authority: CN
Inventors: 施展; 冯丹; 于瑞丽; 童颖; 王子毅; 彭亚妹
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2016-07-13
Filing date: 2016-07-13
Publication date: 2016-12-07
Anticipated expiration: 2036-07-13
Also published as: CN106202431B

Abstract

The invention belongs to the technical field of big data processing and relates to a machine-learning-based method and system for automatic tuning of Hadoop parameters. The invention clusters applications into groups according to their resource consumption features and builds a separate performance model for each group, so that it automatically identifies the parameters with the greatest impact on each class of application and gives quantitative recommended values for them. The system comprises an offline module and an online module. The offline module comprises a Hadoop data collector, a clusterer and a performance model building submodule; the online module comprises a job manager, an optimizer, a resource consumption feature matcher and a job resource manager. The invention effectively addresses two problems of existing methods based on rules of experience: their heavy dependence on user experience and the limits of purely qualitative parameter advice. The invention also separates the parameter optimization system from the Hadoop system, which reduces coupling, reduces manual overhead, avoids human misjudgment, and makes the system easier to extend and maintain.

Description

Machine-learning-based method and system for automatic tuning of Hadoop parameters

Technical field

The invention belongs to the technical field of big data processing and, more specifically, relates to a machine-learning-based method and system for automatic tuning of Hadoop parameters.

Background

Hadoop is an open-source implementation of MapReduce, the widely used framework for parallel processing of large data sets, and offers good scalability and fault tolerance. Users can easily scale applications with Hadoop, but they must configure parameters for each specific application, and different parameter configurations can have a large effect on system performance. Optimizing the performance of a Hadoop job is a multidimensional optimization problem; the main influencing factors are the size of the data set, the machine hardware configuration, the job's resource utilization characteristics, and the scheduling algorithm. First, parameter settings differ greatly from one cluster to another, so each Hadoop system must be configured according to the characteristics of the cluster it runs on in order to optimize performance. Second, the relationships among parameters are very complex: a parameter may interact with one or more others, and parameters may constrain or depend on each other. Unreasonable settings can cause resource contention and reduce the overall performance of the system, so it is important to balance the mutually constraining configuration parameters. In addition, different application tasks place different requirements on the execution environment, so the system must be configured to match the task in order to improve overall performance.

In the traditional tuning method based on rules of experience, Hadoop users identify, through extensive experiments and analysis of the system itself, the parameters that most affect the performance of a given class of Hadoop jobs, and then adjust those parameter values in practice. This method requires users to have a deep understanding of the Hadoop system and substantial hands-on experience, and its results vary widely from user to user. Every parameter adjustment requires repeated tests, which consume large amounts of hardware resources and time. Vaidya (see: Vaidya. [Online]. Available: http://hadoop.apache.org/mapreduce/docs/r0.21.0/vaidya.html), a rule-based MapReduce job performance diagnostic tool, is one example: after a job has run, it parses information such as the parameters in the job configuration file, the job statistics and the JobHistory logs, applies predefined rules to find performance problems in the job, and gives qualitative rather than quantitative advice.

Shivnath et al. proposed a tuning method based on code rewriting: by modifying the code of Hadoop itself, it adds a parameter optimization module to the Hadoop source code and borrows the idea of query optimization from databases to achieve automatic tuning of Hadoop parameters (see: Babu S. Towards Automatic Optimization of MapReduce Programs. In: Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC'10). ACM, 2010. 137-142.). However, it cannot tune in a targeted way according to the characteristics of the application itself and the resources available in the system. Moreover, handing the work of configuring parameters to the Hadoop system makes the underlying Hadoop source code more complex and harder to maintain.

K. Kambatla proposed a tuning method based on resource consumption features (see: Kambatla K, Pathak A, Pucha H. Towards optimizing Hadoop provisioning in the cloud. In: HotCloud Workshop in conjunction with USENIX Annual Technical Conference. USENIX Association, 2009.). The method compares the resource consumption features of the current job with those of other applications stored in a database, finds the most similar one, and reads the corresponding optimal parameter configuration scheme as the configuration for the current job, thereby configuring parameters automatically for new Hadoop jobs. It lets a job achieve the best performance while keeping resource consumption as low as possible. The key to the method is to collect the resource consumption features of different typical Hadoop applications, to find the optimal parameter configuration scheme for each corresponding MapReduce job by repeatedly modifying the configuration parameters, and then to save each Hadoop application together with its resource consumption features and optimal parameter configuration scheme in the database. This tuning approach resembles the approach based on rules of experience, except that "experience" here means the applications already stored in the database. Its advantage is that it is relatively easy to implement. Its disadvantages are that it mainly targets the setting of two parameters, map.tasks.maximum and reduce.tasks.maximum, so few parameters are optimized and the optimization effect is limited, and that finding the optimal parameter configuration scheme for each application requires a large number of experiments, which wastes substantial system hardware resources and time.

Summary of the invention

In view of the above deficiencies of existing Hadoop tuning methods, the invention provides a machine-learning-based method and system for automatic tuning of Hadoop parameters. Its purpose is to cluster applications into groups according to their resource consumption features and to build a separate performance model for each group, so as to automatically identify the parameters with the greatest impact on each class of application and to give quantitative recommended values for them. This effectively addresses the heavy dependence on user experience and the limits of qualitative parameter advice in existing methods based on rules of experience, and it addresses the small number of optimized parameters and the limited optimization effect of existing methods based on resource consumption features. The invention also separates the parameter optimization system from the Hadoop system, which reduces coupling.

The invention provides a machine-learning-based method for automatic tuning of Hadoop parameters, comprising an offline process and an online process, wherein the offline process comprises the following steps:

S1. Collect the execution time, input data set size, MapReduce parameter configuration and resource consumption time series of the historical jobs run in the current cluster;

S2. Normalize the collected resource consumption time series of the historical jobs, then build the resource consumption feature vectors;

S3. Compute the distance between the resource consumption feature vectors of different jobs to measure their similarity, and cluster the jobs so that jobs with similar resource consumption features fall into the same group;

S4. According to the clustering result, use the configuration parameters, input data size and execution time of the historical jobs in each group to build a job execution time training set for each group;

S5. For each group of jobs, use stepwise regression to select the best predictors, that is, the factors strongly correlated with job execution time;

S6. For each group of jobs, perform SVM regression analysis on the result of the stepwise regression analysis, select a suitable kernel function, and build an SVM performance model;

The online process comprises the following steps:

S7. For a newly submitted job, run the job in the clustering cluster using the default parameter configuration and part of the input data set, collect the resource consumption time series, and build the resource consumption feature vector by the method of step S2;

S8. Match the resource consumption feature vector of the newly submitted job against each cluster centre in the clustering result of step S3 by distance, then use the performance model of the matched job class to predict the execution time under different parameter configurations and input data set sizes; these predictions form the search space for parameter optimization;

S9. Use a search algorithm to search for the optimal parameter configuration, and output it;

S10. Using the optimal parameter configuration scheme obtained in step S9 and the specified input data set, run the newly submitted job in the current cluster.

As an improvement of the above technical solution, the distance formula used in step S3 to measure job similarity makes the similarity between runs of the same job higher than the similarity between different jobs; the cosine distance formula is preferred.

As a further improvement of the above technical solution, the clustering algorithm used for grouping in step S3 is an unsupervised clustering algorithm whose principle is that the jobs with the most similar resource features are automatically clustered into one group; the K-means algorithm is preferred.

As a further improvement of the above technical solution, in step S6 cross-validation is used to assess the prediction accuracy of the SVM performance model.
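The cross-validation mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's procedure: the source does not give the number of folds or the error metric, so k = 5 and mean absolute error are assumptions, and `fit_mean_baseline` is a hypothetical placeholder model standing in for the SVM performance model.

```python
def k_fold_mean_abs_error(samples, targets, fit, k=5):
    """Average absolute prediction error over k held-out folds.

    `fit(train_x, train_y)` must return a function that predicts a target
    for one sample. Assumes len(samples) >= k so that no fold is empty.
    """
    n = len(samples)
    fold_errors = []
    for fold in range(k):
        # Every k-th sample (offset by `fold`) is held out for testing.
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        predict = fit([samples[i] for i in train_idx],
                      [targets[i] for i in train_idx])
        errors = [abs(predict(samples[i]) - targets[i]) for i in test_idx]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k


def fit_mean_baseline(train_x, train_y):
    """Hypothetical placeholder model (NOT an SVM): predicts the training mean."""
    mean = sum(train_y) / len(train_y)
    return lambda x: mean


# Hypothetical training set: one predictor per job, execution time in seconds.
samples = [[float(i)] for i in range(10)]
constant_times = [5.0] * 10
growing_times = [float(i) for i in range(10)]
```

A lower cross-validated error suggests the model generalizes better to jobs it was not trained on, which is the property the search in step S9 relies on.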

As a further improvement of the above technical solution, the principle for selecting the search algorithm in step S9 is that it should quickly find, within the search space, the optimal parameter configuration that minimizes job execution time. Specifically, the predicted execution times of the job under the different parameter configurations can be sorted in ascending order, the parameter configurations corresponding to the first few predicted execution times taken, and the mean of each parameter over those configurations used as the optimal parameter value.
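The sort-and-average search described above can be sketched in a few lines. The function name, the dictionary-based parameter grid and the `predict_time` callable are assumptions made for illustration; in the method itself the predictions would come from the SVM performance model of the matched group.

```python
from itertools import product


def recommend_parameters(param_grid, predict_time, input_size_gb, top_n=10):
    """Average the top_n configurations with the shortest predicted time.

    param_grid maps each parameter name to its candidate values;
    predict_time(config, input_size_gb) returns a predicted execution time.
    """
    names = sorted(param_grid)
    # Enumerate every combination of candidate values (the search space).
    candidates = [dict(zip(names, values))
                  for values in product(*(param_grid[n] for n in names))]
    # Sort by predicted execution time, shortest first.
    ranked = sorted(candidates, key=lambda cfg: predict_time(cfg, input_size_gb))
    best = ranked[:top_n]
    # The recommended value of each parameter is its mean over the best configs.
    return {name: sum(cfg[name] for cfg in best) / len(best) for name in names}


def toy_predictor(cfg, input_size_gb):
    """Hypothetical stand-in for a trained model, fastest at P4=6, P9=200."""
    return (cfg["P4"] - 6) ** 2 + (cfg["P9"] - 200) ** 2 / 100 + input_size_gb


grid = {"P4": [2, 4, 6, 8, 10], "P9": [100, 200, 300]}
recommended = recommend_parameters(grid, toy_predictor, input_size_gb=10, top_n=3)
```

Averaging can produce values that are not in the candidate grid or are not integers, so an implementation would presumably round or clamp them before writing them into a Hadoop configuration; the source does not discuss this.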

The machine-learning-based system for automatic tuning of Hadoop parameters provided by the invention comprises an offline module and an online module, wherein:

The offline module comprises a Hadoop data collector, a clusterer and a performance model building submodule;

The Hadoop data collector collects, from the log files of the Hadoop master node of the current cluster, the execution time, input data set size, MapReduce parameter configuration and resource consumption time series of the historical jobs;

The clusterer preprocesses the resource consumption time series of the historical jobs run in the current cluster, as collected by the Hadoop data collector, builds the resource consumption feature vectors, and clusters the jobs into groups;

The performance model building submodule uses the input data size, parameter configuration and execution time of each group of jobs as a training set, trains and builds a performance model for each group, and provides the performance models to the optimizer for selection;

The online module comprises a job manager, an optimizer, a resource consumption feature matcher and a job resource manager;

The job manager submits a newly given job to the Hadoop master node. It first submits the new job to the clustering cluster, specifying an input data set and the default parameter configuration, so that the new job can be assigned to a group; it then submits the new job a second time to the current cluster, specifying the input data set and the optimal parameter configuration;

The job resource manager collects the resource consumption time series while the new job first submitted by the job manager is running, preprocesses them, builds the resource consumption feature vector, and provides it to the resource consumption feature matcher;

The resource consumption feature matcher compares the distance between this feature vector and the resource consumption feature vector of each cluster centre, obtains the job group to which the job belongs, and provides it to the optimizer;

The optimizer selects the corresponding performance model according to the grouping result, then computes and searches for the most reasonable parameter configuration so as to improve resource utilization and job running efficiency. Finally, it provides the optimal parameter configuration and the input data set to the job manager, which submits the new job, together with its optimal parameter configuration and input data set, to the Hadoop master node of the current cluster to run.

Given the high configurability of Hadoop and a fixed set of cluster hardware resources, the invention uses a machine-learning performance model to tune parameters automatically for newly submitted jobs and thereby optimize system performance. In general, compared with prior-art schemes, the method of the invention has the following advantages:

1. Compared with existing methods based on rules of experience, the method of the invention places no high demands on user experience and automatically provides a detailed parameter configuration scheme with explicit numerical values;

2. Compared with existing methods based on code rewriting, the method of the invention keeps the parameter tuning part separate from the Hadoop system and does not increase the complexity of the underlying Hadoop source code. In addition, the method clusters applications into groups according to their use of system resources, builds a performance model for each group, and proposes a different optimal parameter configuration scheme for each, so that parameters can be tuned in a targeted way according to the characteristics of different types of application;

3. Compared with existing methods based on resource consumption features, the method of the invention clusters jobs into groups according to their resource consumption features and, after reducing dimensionality by stepwise regression analysis on each group, builds an SVM performance model that fits the resource consumption features of that group. The predictors of the model differ from group to group, so the types and number of parameters optimized depend on the clustering result rather than being limited to a fixed set of parameters, which effectively improves the optimization effect. Moreover, the method uses the SVM performance model to predict execution times under different input data sizes and parameter configurations in order to build the search space, without actually running a large number of experiments, which effectively saves system hardware resources and time.

Brief description of the drawings

Fig. 1 is a flowchart of the machine-learning-based method for automatic tuning of Hadoop parameters proposed by the invention;

Fig. 2 is an architecture diagram of the machine-learning-based system for automatic tuning of Hadoop parameters proposed by the invention.

Detailed description of the invention

To make the purpose, technical solution and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it. In addition, the technical features involved in the embodiments described below may be combined with one another as long as they do not conflict.

Fig. 1 is the flowchart of the Hadoop parameter automatic tuning method of the invention, which specifically comprises the following steps:

S1. Collect the execution time, input data set size, MapReduce parameter configuration and resource consumption time series of the historical jobs run in the current cluster, including the time series of CPU utilization, disk utilization, network transfer rate and so on.

S2. Normalize the collected time series of CPU utilization, disk utilization and network transfer rate for each historical job, then take the mean of each series and use it as one dimension of that job's resource consumption feature vector, thereby building the resource consumption feature vector.

S3. Using the cosine distance formula, compute the distance between the resource consumption feature vectors of different jobs to measure their similarity, and use the K-means clustering algorithm to group the jobs so that jobs with similar resource consumption features are automatically placed in the same group.

The distance formula used to measure job similarity must comply with an objective experimental rule: the similarity between runs of the same job is higher than the similarity between different jobs. Experiments verified that the cosine distance formula satisfies this rule, whereas the Euclidean distance formula does not.
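The two distance formulas can be compared on a small example. The feature vectors below are hypothetical (CPU, disk, network) values chosen for illustration. One possible intuition, which the source does not state, is that cosine distance ignores the overall magnitude of resource use and compares only its profile, so two runs of the same job at different load levels still look alike.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v); 0 means the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical resource consumption feature vectors (CPU, disk, network).
job_small = [0.2, 0.1, 0.05]   # a job at light load
job_large = [0.8, 0.4, 0.2]    # same profile, four times the load
job_other = [0.1, 0.3, 0.2]    # a disk- and network-heavy profile
```

With these values, cosine distance places `job_small` next to `job_large`, while Euclidean distance places it closer to `job_other`.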

The jobs are clustered with an unsupervised clustering algorithm whose principle is that the jobs with the most similar resource features are automatically clustered into one group. This algorithm can be the K-means algorithm.
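A minimal sketch of K-means grouping with cosine distance follows. Several choices here are assumptions rather than details from the source: classical K-means assigns points by Euclidean distance, so this is a cosine-based variant; the starting centres are picked by a deterministic farthest-first rule; and centres are updated as the plain mean of their members. All vectors are assumed to be non-zero.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def farthest_first_centres(vectors, k):
    """Start from the first vector, then add the vector farthest from all chosen centres."""
    centres = [list(vectors[0])]
    while len(centres) < k:
        nxt = max(vectors, key=lambda v: min(cosine_distance(v, c) for c in centres))
        centres.append(list(nxt))
    return centres


def kmeans_cosine(vectors, k, max_iterations=50):
    """Group vectors into k clusters; returns (labels, centres)."""
    centres = farthest_first_centres(vectors, k)
    labels = [-1] * len(vectors)
    for _ in range(max_iterations):
        # Assignment step: each job joins the nearest centre.
        new_labels = [min(range(k), key=lambda c: cosine_distance(v, centres[c]))
                      for v in vectors]
        if new_labels == labels:
            break  # converged: no job changed group
        labels = new_labels
        # Update step: each centre becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centre if a cluster is empty
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centres


# Hypothetical feature vectors: three CPU-heavy jobs, then three disk-heavy jobs.
jobs = [[0.9, 0.1, 0.1], [0.8, 0.2, 0.1], [0.7, 0.1, 0.2],
        [0.1, 0.8, 0.3], [0.2, 0.9, 0.2], [0.1, 0.7, 0.4]]
labels, centres = kmeans_cosine(jobs, k=2)
```

The returned centres are what the online process later matches a new job against.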

S6. For each group of jobs, perform SVM regression analysis on the result of the stepwise regression analysis, select a suitable kernel function, and build an SVM performance model (cross-validation is used to assess the prediction accuracy of the model).
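For reference, the standard form of a support vector regression predictor is given below. The notation is an assumption introduced here and does not appear in the source: $\mathbf{x}$ is the vector of predictors kept by stepwise regression (for example input data size and the selected parameters), $\hat{T}(\mathbf{x})$ is the predicted execution time, $\mathbf{x}_i$ are the $n$ training samples, and $\alpha_i$, $\alpha_i^{*}$ and $b$ are learned during training. The kernel shown is the RBF kernel, which the embodiment below selects; $\gamma$ is its width parameter.

```latex
\hat{T}(\mathbf{x}) = \sum_{i=1}^{n} \left(\alpha_i - \alpha_i^{*}\right) K(\mathbf{x}_i, \mathbf{x}) + b,
\qquad
K(\mathbf{x}_i, \mathbf{x}) = \exp\!\left(-\gamma \,\lVert \mathbf{x}_i - \mathbf{x} \rVert^{2}\right)
```

Only the training samples with $\alpha_i - \alpha_i^{*} \neq 0$ (the support vectors) contribute to the prediction.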

The size of the part of the input data set used in step S7 is generally chosen, with a view to saving resources and time, as an empirical value that is fairly small but still larger than the current main memory. The clustering cluster used in step S7 can simply be the current cluster, but to save resources it can also be a part of the current cluster.

S8. Match the resource consumption feature vector of the newly submitted job against each cluster centre in the clustering result of S3 by distance, then use the performance model of the matched job class to predict the execution time under different parameter configurations and input data set sizes; these predictions form the search space for parameter optimization.

S9. Use a suitable search algorithm to search for the optimal parameter configuration, and output it.

The principle for selecting the search algorithm is that it should quickly find, within the search space, the optimal parameter configuration that minimizes job execution time.

S10. Using the parameter configuration scheme obtained in S9 and the specified input data set, run the newly submitted job in the current cluster.

Example:

The invention provides an embodiment that illustrates how the method is carried out. It uses a 6-node Hadoop cluster (1 NameNode and 5 DataNodes; each DataNode is configured with 2 Map slots and 2 Reduce slots, each task has 300 MB of memory, and the data block size is 64 MB), selects 9 parameters with a large impact on Hadoop jobs (denoted P1 to P9), and builds the models from the information of 200 historical jobs in this cluster. The embodiment specifically comprises the following steps:

Step 1: Collect the execution time, input data set size, MapReduce parameter configuration and resource consumption time series of the historical jobs run in the current cluster (1 NameNode, 5 DataNodes), such as the time series of CPU utilization, disk utilization and network transfer rate.

Step 2: Normalize the collected time series of CPU utilization, disk utilization and network transfer rate for each historical job, then take the mean of each series and use it as one dimension of that job's resource consumption feature vector, thereby building the resource consumption feature vector.

Step 3: Using the cosine distance formula, compute the distance between the resource consumption feature vectors of different jobs, and use the K-means clustering algorithm to cluster the jobs into groups. Suppose jobs 1 to 100 are clustered into one group and jobs 101 to 200 into another, that is, the job numbers within each group are consecutive.

Step 4: According to the clustering result, use the configuration parameters, input data size and execution time of jobs 1 to 100 to build the job execution time training set of that group, and use the configuration parameters, input data size and execution time of jobs 101 to 200 to build the job execution time training set of that group.

Step 5: For each group of jobs, use stepwise regression to select the best predictors, that is, the factors strongly correlated with job execution time. Suppose the best predictors for the group containing jobs 1 to 100 are input data size, P9, P8, P2, P1 and P4, and the best predictors for the group containing jobs 101 to 200 are input data size, P9, P8, P7 and P4.
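Predictor selection of this kind can be sketched with a simplified forward selection. This is a stand-in rather than full stepwise regression: textbook stepwise regression uses statistical tests to add and remove factors, whereas the sketch below only adds factors and stops when the relative drop in the residual sum of squares (RSS) falls below a threshold. The `min_gain` value, the function names and the sample data are all assumptions made for illustration.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def residual_sum_of_squares(rows, y, columns):
    """RSS of a least-squares fit of y on the chosen columns plus an intercept."""
    design = [[1.0] + [row[c] for c in columns] for row in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(p)]
    beta = solve_linear(xtx, xty)  # normal equations; assumes no collinear columns
    return sum((t - sum(bi * di for bi, di in zip(beta, d))) ** 2
               for d, t in zip(design, y))


def forward_select(rows, y, names, min_gain=0.05):
    """Greedily add the factor that cuts RSS most; stop when the gain is small."""
    chosen = []
    total = residual_sum_of_squares(rows, y, chosen)
    current = total
    # Stop early once the fit is already (numerically) exact.
    while len(chosen) < len(names) and current > 1e-9 * total:
        rest = [c for c in range(len(names)) if c not in chosen]
        best = min(rest, key=lambda c: residual_sum_of_squares(rows, y, chosen + [c]))
        best_rss = residual_sum_of_squares(rows, y, chosen + [best])
        if (current - best_rss) / current < min_gain:
            break
        chosen.append(best)
        current = best_rss
    return [names[c] for c in chosen]


# Hypothetical jobs: columns are input size (GB), P4 and P7; time ignores P7.
rows = [[1, 5, 1], [2, 1, 0], [3, 4, 0], [4, 2, 1],
        [5, 3, 1], [6, 1, 0], [7, 5, 1], [8, 2, 0]]
times = [3 * r[0] + 2 * r[1] for r in rows]
selected = forward_select(rows, times, ["input_size", "P4", "P7"])
```

In this toy data the execution time depends only on input size and P4, so P7 is left out of the model, which mirrors how different groups can end up with different predictor sets.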

Step 6: For each group of jobs, perform SVM regression analysis on the result of the stepwise regression analysis, select the RBF kernel function, build the SVM performance model, and use cross-validation to assess the prediction accuracy of the model.

Step 7: Suppose a newly submitted job x has an input data set of 10 GB. Run the job on part of the input data set (say 1 GB) with the default parameter settings in the clustering cluster environment (1 NameNode, 2 DataNodes), collect the resource consumption time series, and build the resource consumption feature vector by the method of step 2.

Step 8: Match the resource consumption feature vector of the newly submitted job x against each cluster centre in the clustering result of step 3 by distance. Suppose it matches the group containing jobs 101 to 200; then use the performance model of that group to predict the execution time of job x under different parameter configurations at the input data set size (10 GB). These predictions form the search space for parameter optimization.

Step 9: To search for the optimal parameter configuration, sort the predicted execution times of job x under the different parameter configurations in ascending order, take the parameter configurations corresponding to the first 10 predicted execution times, use the mean of each parameter as its optimal value, and output the optimal values of parameters P9, P8, P7 and P4.

Step 10: Using the parameter configuration scheme obtained in step 9 and the specified input data set (10 GB), run the newly submitted job x in the current cluster (1 NameNode, 5 DataNodes).
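The profiling and matching part of the online phase (steps 7 and 8) can be sketched as below. The normalization shown, dividing each sample by a per-resource capacity, is an assumption, since the source only says the time series are normalized; the resource names, capacities, group labels and centre values are likewise hypothetical.

```python
import math


def feature_vector(series_by_resource, capacity_by_resource):
    """Mean of each capacity-normalized time series, in sorted resource order."""
    return [
        sum(x / capacity_by_resource[name] for x in series) / len(series)
        for name, series in sorted(series_by_resource.items())
    ]


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def match_group(vector, centres):
    """Return the label of the cluster centre nearest to the vector."""
    return min(centres, key=lambda g: cosine_distance(vector, centres[g]))


# Hypothetical measurements from the profiling run of job x on 1 GB of input.
probe = {"cpu": [80, 90, 70], "disk": [10, 20, 15], "net": [5, 10, 15]}
capacity = {"cpu": 100, "disk": 100, "net": 100}  # percent, percent, MB/s

# Hypothetical cluster centres from the offline phase, ordered (cpu, disk, net).
centres = {"jobs_1_100": [0.2, 0.7, 0.3], "jobs_101_200": [0.9, 0.1, 0.1]}

vector_x = feature_vector(probe, capacity)
group_x = match_group(vector_x, centres)
```

The matched group label then determines which performance model predicts the execution times that make up the search space in step 9.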

Those skilled in the art will readily understand that the foregoing describes only preferred embodiments of the invention and is not intended to limit it. Any modification, equivalent substitution or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims

1. A machine-learning-based method for automatic tuning of Hadoop parameters, comprising an offline process and an online process, wherein

the offline process comprises the following steps:

S1. Collect the execution time, input data set size, MapReduce parameter configuration and resource consumption time series of the historical jobs run in the current cluster;

S2. Normalize the collected resource consumption time series of the historical jobs, then build the resource consumption feature vectors;

S3. Compute the distance between the resource consumption feature vectors of different jobs to measure their similarity, and cluster the jobs so that jobs with similar resource consumption features fall into the same group;

S4. According to the clustering result, use the configuration parameters, input data size and execution time of the historical jobs in each group to build a job execution time training set for each group;

S5. For each group of jobs, use stepwise regression to select the best predictors, that is, the factors strongly correlated with job execution time;

S6. For each group of jobs, perform SVM regression analysis on the result of the stepwise regression analysis, select a suitable kernel function, and build an SVM performance model;

the online process comprises the following steps:

S7. For a newly submitted job, run the job in the clustering cluster using the default parameter configuration and part of the input data set, collect the resource consumption time series, and build the resource consumption feature vector by the method of step S2;

S8. Match the resource consumption feature vector of the newly submitted job against each cluster centre in the clustering result of step S3 by distance, then use the performance model of the matched job class to predict the execution time under different parameter configurations and input data set sizes; these predictions form the search space for parameter optimization;

S10. Using the optimal parameter configuration scheme obtained in step S9 and the specified input data set, run the newly submitted job in the current cluster.

2. The machine-learning-based method for automatic tuning of Hadoop parameters according to claim 1, characterized in that the distance formula used in step S3 to measure job similarity makes the similarity between runs of the same job higher than the similarity between different jobs.

3. The machine-learning-based method for automatic tuning of Hadoop parameters according to claim 1 or 2, characterized in that the distance formula used in step S3 to measure job similarity is the cosine distance formula.

4. The machine-learning-based method for automatic tuning of Hadoop parameters according to claim 1 or 2, characterized in that the clustering algorithm used for grouping in step S3 is an unsupervised clustering algorithm whose principle is that jobs with similar resource features are automatically clustered into one group.

5. The machine-learning-based method for automatic tuning of Hadoop parameters according to claim 1 or 2, characterized in that the grouping in step S3 uses the K-means algorithm.

6. The machine-learning-based method for automatic tuning of Hadoop parameters according to claim 1 or 2, characterized in that in step S6 cross-validation is used to assess the prediction accuracy of the SVM performance model.

7. The machine-learning-based method for automatic tuning of Hadoop parameters according to claim 1 or 2, characterized in that the principle for selecting the search algorithm in step S9 is that it should quickly find, within the search space, the optimal parameter configuration that minimizes job execution time.

8. The machine-learning-based method for automatic tuning of Hadoop parameters according to claim 1 or 2, characterized in that the search algorithm in step S9 is: sort the predicted execution times of the job under the different parameter configurations in ascending order, take the parameter configurations corresponding to the first few predicted execution times, and use the mean of each parameter as the optimal parameter value.

9. The machine-learning-based method for automatic tuning of Hadoop parameters according to claim 1 or 2, characterized in that the clustering cluster in step S7 is part or all of the current cluster.

10. A machine-learning-based system for automatic tuning of Hadoop parameters, comprising an offline module and an online module, wherein:

the Hadoop data collector collects, from the log files of the Hadoop master node of the current cluster, the execution time, input data set size, MapReduce parameter configuration and resource consumption time series of the historical jobs;

the clusterer preprocesses the resource consumption time series of the historical jobs run in the current cluster, as collected by the Hadoop data collector, builds the resource consumption feature vectors, and clusters the jobs into groups;

the performance model building submodule uses the input data size, parameter configuration and execution time of each group of historical jobs as a training set, trains and builds a performance model for each group, and provides the performance models to the optimizer for selection;

the job manager submits a newly given job to the Hadoop master node; it first submits the new job to the clustering cluster, specifying an input data set and the default parameter configuration, so that the new job can be assigned to a group, and then submits the new job a second time to the current cluster, specifying the input data set and the optimal parameter configuration;

the job resource manager collects the resource consumption time series while the new job first submitted by the job manager is running, preprocesses them, builds the resource consumption feature vector, and provides it to the resource consumption feature matcher;

the resource consumption feature matcher compares the distance between this feature vector and the resource consumption feature vector of each cluster centre, obtains the job group to which the job belongs, and provides it to the optimizer;

the optimizer selects the corresponding performance model according to the grouping result, then computes and searches for the most reasonable parameter configuration so as to improve resource utilization and job running efficiency; finally, it provides the optimal parameter configuration and the input data set to the job manager, which submits the new job, together with its optimal parameter configuration and input data set, to the Hadoop master node of the current cluster to run.