CN106202431A - Method and system for automatic Hadoop parameter tuning based on machine learning

Method and system for automatic Hadoop parameter tuning based on machine learning

Info

Publication number
CN106202431A
Authority
CN
China
Prior art keywords
parameter
hadoop
cluster
resource consumption
job
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610550098.7A
Other languages
Chinese (zh)
Other versions
CN106202431B (en)
Inventor
施展
冯丹
于瑞丽
童颖
王子毅
彭亚妹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN201610550098.7A
Publication of CN106202431A
Application granted
Publication of CN106202431B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 Distributed queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/217 Database tuning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention belongs to the technical field of big data processing and relates to a machine-learning-based method and system for automatically tuning Hadoop parameters. The invention clusters applications into groups according to their resource consumption features and builds a separate performance model for each group, automatically identifying the parameters that most strongly affect each class of application and giving quantitative recommended values for them. The system comprises an offline module and an online module. The offline module comprises a Hadoop data collector, a clusterer, and a performance model building submodule; the online module comprises a job manager, an optimizer, a resource consumption feature matcher, and a job explorer. The invention effectively addresses two problems of existing methods based on rules of thumb: their heavy dependence on user experience and the limitation that they offer only qualitative parameter advice. It also separates the parameter optimization system from the Hadoop system, which reduces coupling, lowers manual effort, avoids human misjudgment, and makes the system easier to extend and maintain.

Description

Method and system for automatic Hadoop parameter tuning based on machine learning
Technical field
The invention belongs to the technical field of big data processing and, more specifically, relates to a machine-learning-based method and system for automatically tuning Hadoop parameters.
Background
Hadoop is an open-source implementation of MapReduce, a widely used framework for large-scale parallel data processing, and offers good scalability and fault tolerance. Users can easily scale applications with Hadoop, but they must configure parameters for each specific application, and different parameter configurations can have a huge impact on system performance. Optimizing the performance of a Hadoop job is a multidimensional optimization problem; the main influencing factors are the data set size, the hardware configuration of the machines, the job's resource utilization characteristics, and the scheduling algorithm. First, the appropriate parameter settings differ greatly from cluster to cluster, so each Hadoop system must be configured according to the characteristics of its own cluster to optimize performance. Second, the relationships among parameters are very complex: a parameter may interact with one or more other parameters. Parameters constrain or depend on one another, and unreasonable settings may cause resource contention and lower the overall performance of the whole system, so balancing the mutual constraints among configuration parameters is extremely important. In addition, different application tasks place different requirements on the execution environment, so the system must be configured to match the task in order to improve overall performance.
In traditional tuning methods based on rules of thumb, Hadoop users identify, through extensive experiments and analysis of the system itself, the parameters that most affect the performance of a certain class of Hadoop jobs, and then adjust those parameter values in practice. This approach requires users to have a deep understanding of the Hadoop system and substantial practical experience, and its results vary widely from user to user. Each parameter adjustment requires repeated tests that consume large amounts of hardware resources and time. For example, Vaidya (see: Vaidya. [Online]. Available: http://hadoop.apache.org/mapreduce/docs/r0.21.0/vaidya.html), a rule-based MapReduce job performance diagnostic tool, parses information from a completed job, such as the parameters in the job configuration file, job statistics, and the JobHistory logs, applies predefined rules to it to find performance problems in the job, and gives qualitative rather than quantitative advice.
Shivnath et al. proposed a tuning method based on code rewriting: by modifying Hadoop's own code, a parameter optimization module is added to the Hadoop source, borrowing ideas from database query optimization to achieve automatic tuning of Hadoop parameters (see: Babu S. Towards Automatic Optimization of MapReduce Programs. In: Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC '10). ACM, 2010. 137-142.). However, this method cannot tune in a targeted way according to the differing characteristics of applications and the differing resource conditions of the system. Moreover, handing the work of parameter configuration to the Hadoop system itself makes Hadoop's underlying source code more complex and harder to maintain.
K. Kambatla proposed a tuning method based on resource consumption features (see: Kambatla K, Pathak A, Pucha H. Towards optimizing Hadoop provisioning in the cloud. In: HotCloud Workshop in conjunction with USENIX Annual Technical Conference. USENIX Association, 2009.). The resource consumption features of the current job are compared with those of other applications stored in a database, the most similar application is found, and its optimal parameter configuration is read and used as the configuration of the current job, so that new Hadoop jobs are configured automatically. The goal is for the job to achieve the best performance while consuming as little as possible. The key to this method is to collect the resource consumption features of different typical Hadoop applications, find the optimal parameter configuration for each corresponding MapReduce job by repeatedly modifying the configuration parameters, and then save each Hadoop application together with its resource consumption features and optimal parameter configuration in the database. This tuning approach resembles the rule-of-thumb approach, except that the "experience" here is the set of applications already in the database. Its advantage is that it is relatively easy to implement. Its shortcomings are that it mainly targets just two parameters, map.tasks.maximum and reduce.tasks.maximum, so few parameters are optimized and the optimization effect is limited; and that finding the optimal parameter configuration for each application requires extensive experimental testing, which wastes large amounts of system hardware resources and time.
Summary of the invention
To address the above shortcomings of existing Hadoop tuning methods, the present invention provides a machine-learning-based method and system for automatically tuning Hadoop parameters. The aim is to cluster applications into groups according to their resource consumption features, build a separate performance model for each group, automatically identify the parameters that most strongly affect each class of application, and give quantitative recommended parameter values. This effectively solves the heavy dependence on user experience and the limitation of purely qualitative parameter advice in existing rule-of-thumb methods; it solves the problem that existing methods based on resource consumption features optimize few parameters and therefore have limited effect; and it separates the parameter optimization system from the Hadoop system, reducing coupling.
The present invention provides a machine-learning-based method for automatically tuning Hadoop parameters, comprising an offline process and an online process, wherein the offline process comprises the following steps:
S1. Collect, for the historical jobs run on the current cluster, the execution time, the input data set size, the MapReduce parameter configuration, and the time series of each type of resource consumption;
S2. Normalize the collected resource consumption time series of the historical jobs, then build resource consumption feature vectors;
S3. Compute the distances between the resource consumption feature vectors of different jobs as a measure of job similarity, and cluster the jobs into groups so that jobs with similar resource consumption features fall into the same group;
S4. Based on the clustering result, use the configuration parameters, input data size, and execution time of the historical jobs in each group to build a job execution time training set per group;
S5. For each group of jobs, use stepwise regression to select the best predictors, i.e., the factors strongly correlated with job execution time;
S6. For each group of jobs, perform SVM regression analysis on the result of the stepwise regression analysis, select a suitable kernel function, and build an SVM performance model;
The online process comprises the following steps:
S7. For a newly submitted job, run the job on the clustering cluster using the default parameter configuration and a portion of the input data set, collect the time series of each type of resource consumption, and build a resource consumption feature vector by the method of step S2;
S8. Match the resource consumption feature vector of the newly submitted job against the center of each cluster in the clustering result of step S3 by distance, then use the performance model of the matched job class to predict execution times under different parameter configurations and the input data set size; these predictions form the search space for parameter optimization;
S9. Use a search algorithm to search for the optimal parameter configuration and output it;
S10. Using the optimal parameter configuration obtained in step S9 and the specified input data set, run the newly submitted job on the current cluster.
As an improvement to the above technical solution, the distance formula used in step S3 to measure job similarity must make the similarity between identical jobs higher than the similarity between different jobs; the cosine distance formula is preferred.
As a further improvement to the above technical solution, the clustering algorithm used for grouping in step S3 is an unsupervised clustering algorithm, whose principle is that the jobs with the most similar resource features are automatically clustered into one group; the K-means algorithm is preferred.
As a further improvement to the above technical solution, in step S6, cross-validation is used to assess the prediction accuracy of the SVM performance model.
As a further improvement to the above technical solution, in step S9, the principle for choosing the search algorithm is that it should quickly find, within the search space, the optimal parameter configuration that minimizes job execution time. Specifically, the predicted execution times of the job under different parameter configurations can be sorted in ascending order, the parameter configurations corresponding to the first several predicted execution times taken, and the mean of each parameter used as the optimal parameter value.
The machine-learning-based system for automatically tuning Hadoop parameters provided by the invention comprises an offline module and an online module, wherein:
The offline module comprises a Hadoop data collector, a clusterer, and a performance model building submodule;
The Hadoop data collector collects, from the log files of the Hadoop master node of the current cluster, the execution time, input data set size, MapReduce parameter configuration, and resource consumption time series of historical jobs;
The clusterer preprocesses the resource consumption time series, collected by the Hadoop data collector, of the historical jobs run on the current cluster, builds resource consumption feature vectors, and clusters the jobs into groups;
The performance model building submodule uses the input data size, parameter configuration, and execution time of each group of jobs as a training set, trains and builds a performance model for each group, and makes the performance models available for the optimizer to select from;
The online module comprises a job manager, an optimizer, a resource consumption feature matcher, and a job explorer;
The job manager submits a newly given job to the Hadoop master node. The first time, it submits the new job to the clustering cluster with a specified input data set and the default parameter configuration, so that the new job can be assigned to a group; the second time, it submits the new job to the current cluster with a specified input data set and the optimal parameter configuration;
The job explorer collects the resource consumption time series while the new job first submitted by the job manager runs, preprocesses it, builds a resource consumption feature vector, and passes the vector to the resource consumption feature matcher;
The resource consumption feature matcher compares the distance between this feature vector and the resource consumption feature vector of each cluster center, determines the job group to which the job belongs, and passes the result to the optimizer;
The optimizer selects the corresponding performance model according to the grouping result and searches for the most reasonable parameter configuration to improve resource utilization and job running efficiency; finally it passes the optimal parameter configuration and the input data set to the job manager, which submits the new job, together with its optimal parameter configuration and input data set, to the Hadoop master node of the current cluster to run.
Given Hadoop's high configurability, and with the cluster hardware resources fixed, the invention uses a machine learning performance model to automatically tune parameters for newly submitted jobs and thereby optimize system performance. In general, compared with prior-art solutions, the method of the invention has the following advantages:
1. Compared with existing methods based on rules of thumb, the method does not place high demands on user experience and can automatically provide a detailed parameter configuration with explicit numerical values;
2. Compared with the existing method based on code rewriting, the method keeps the parameter tuning component separate from the Hadoop system and does not increase the complexity of Hadoop's underlying source code; moreover, it clusters applications into groups according to how they use system resources, builds a performance model for each group, and proposes a different optimal parameter configuration for each, so that parameters can be tuned in a targeted way according to the characteristics of different types of application;
3. Compared with the existing method based on resource consumption features, the method clusters jobs into groups according to their resource consumption features and, after reducing dimensionality for each group through stepwise regression analysis, builds an SVM performance model that fits the resource consumption features of that group. The predictors of the model differ from group to group; that is, the kinds and number of parameters optimized vary with the clustering result rather than being limited to a fixed set, which effectively improves the optimization effect. In addition, the method uses the SVM performance model to predict execution times under different input data sizes and parameter configurations to build the search space, without a large number of actual experimental runs, which effectively saves system hardware resources and time.
Brief description of the drawings
Fig. 1 is a flow chart of the machine-learning-based method for automatically tuning Hadoop parameters proposed by the invention;
Fig. 2 is an architecture diagram of the machine-learning-based system for automatically tuning Hadoop parameters proposed by the invention.
Detailed description
To make the purpose, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it. In addition, the technical features involved in the embodiments described below may be combined with one another as long as they do not conflict.
Fig. 1 is a flow chart of the method of the invention for automatically tuning Hadoop parameters, which comprises the following steps:
S1. Collect, for the historical jobs run on the current cluster, the execution time, the input data set size, the MapReduce parameter configuration, and the time series of each type of resource consumption, including the time series of CPU utilization, disk utilization, network transfer rate, and so on.
S2. Normalize the collected time series of CPU utilization, disk utilization, and network transfer rate of each historical job, then compute the mean of each series; each mean serves as one dimension of that job's resource consumption feature vector.
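Step S2 can be sketched as follows. The text does not say which normalization is used, so this minimal sketch assumes each metric is scaled by a fixed upper bound; the `METRIC_MAX` values, the metric names, and the sample data are all hypothetical.

```python
from statistics import mean

# Hypothetical upper bounds used to scale each metric into [0, 1]:
# CPU and disk utilization in percent, network transfer rate in MB/s.
METRIC_MAX = {"cpu": 100.0, "disk": 100.0, "net": 125.0}
METRICS = ("cpu", "disk", "net")


def normalize(series, upper):
    """Scale raw samples into [0, 1], clipping anything outside the bound."""
    return [min(max(x / upper, 0.0), 1.0) for x in series]


def feature_vector(job_series):
    """One dimension per metric: the mean of its normalized time series."""
    return [mean(normalize(job_series[m], METRIC_MAX[m])) for m in METRICS]


# Made-up samples for a single historical job.
job = {
    "cpu": [80.0, 90.0, 100.0],
    "disk": [20.0, 40.0, 30.0],
    "net": [25.0, 50.0, 0.0],
}
vec = feature_vector(job)
print(vec)
```

Scaling by a fixed bound rather than by each job's own minimum and maximum keeps the vectors of different jobs comparable, which the clustering in step S3 depends on.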
S3. Using the cosine distance formula, compute the distances between the resource consumption feature vectors of different jobs as a measure of job similarity, and use the K-means clustering algorithm to group the jobs, so that jobs with similar resource consumption features are automatically placed in the same group.
The distance formula used to measure job similarity must satisfy an objective empirical rule: the similarity between identical jobs is higher than the similarity between different jobs. Experiments verified that the formula can be the cosine distance formula but cannot be the Euclidean distance formula.
The clustering algorithm used to group jobs is an unsupervised clustering algorithm, whose principle is that the jobs with the most similar resource features are automatically clustered into one group. The algorithm can be the K-means algorithm.
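Step S3 can be sketched as a small K-means loop that assigns jobs by cosine distance. This is an illustration rather than the patent's implementation: the deterministic initialization from the first `k` vectors, the use of the arithmetic mean as the cluster center, and the sample vectors are assumptions.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: an all-zero vector is treated as orthogonal
    return 1.0 - dot / (norm_a * norm_b)


def kmeans_cosine(vectors, k, max_iters=50):
    """Group vectors into k clusters, assigning by cosine distance."""
    centers = [list(v) for v in vectors[:k]]  # assumed deterministic start
    labels = None
    for _ in range(max_iters):
        new_labels = [
            min(range(k), key=lambda c: cosine_distance(v, centers[c]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Made-up feature vectors (cpu, disk, net): two CPU-heavy, two disk-heavy jobs.
vectors = [[0.9, 0.2, 0.1], [0.8, 0.3, 0.1], [0.2, 0.9, 0.3], [0.1, 0.8, 0.4]]
labels, centers = kmeans_cosine(vectors, k=2)
print(labels)
```

In practice the number of groups `k` and the initialization would need to be chosen with care; neither is specified in the text.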
S4. Based on the clustering result, use the configuration parameters, input data size, and execution time of the historical jobs in each group to build a job execution time training set per group;
S5. For each group of jobs, use stepwise regression to select the best predictors, i.e., the factors strongly correlated with job execution time;
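The predictor selection in step S5 can be sketched as a forward-only stepwise procedure. This is a simplification: the text does not state the entry or removal criterion, so the sketch assumes a candidate is added only if it raises R² by at least `min_gain`. The data, the names `input_size`, `P1`, `P2`, and the threshold are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def rss(rows, y, cols):
    """Residual sum of squares of a least-squares fit (with intercept)."""
    design = [[1.0] + [row[c] for c in cols] for row in rows]
    p = len(cols) + 1
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, d)) for d in design]
    return sum((t - f) ** 2 for t, f in zip(y, fitted))


def forward_stepwise(rows, y, names, min_gain=0.01):
    """Greedily add the predictor that most reduces RSS, while R^2 gains enough."""
    total = rss(rows, y, [])  # total sum of squares around the mean
    if total == 0.0:
        return []
    selected, best = [], total
    while len(selected) < len(names):
        scores = {c: rss(rows, y, selected + [c])
                  for c in range(len(names)) if c not in selected}
        c = min(scores, key=scores.get)
        if (best - scores[c]) / total < min_gain:
            break
        selected.append(c)
        best = scores[c]
    return [names[c] for c in selected]


# Made-up training rows: (input size, P1, P2); time depends on the first two.
rows = [(1, 3, 5), (2, 7, 3), (3, 1, 8), (4, 5, 1),
        (5, 8, 4), (6, 2, 7), (7, 6, 2), (8, 4, 6)]
noise = [0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3]
y = [10 * s + 5 * p1 + e for (s, p1, _), e in zip(rows, noise)]
chosen = forward_stepwise(rows, y, ["input_size", "P1", "P2"])
print(chosen)
```

A full stepwise regression would normally use a significance test for entry and also allow previously selected predictors to be dropped; both are omitted here for brevity.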
S6. For each group of jobs, perform SVM regression analysis on the result of the stepwise regression analysis, select a suitable kernel function, and build an SVM performance model (the prediction accuracy of the model is assessed by cross-validation);
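For orientation, the general shape of such a model can be written down using the standard support vector regression formulation; the text itself gives no formulas, so the symbols below are assumptions. Here $\mathbf{x}$ collects the selected predictors (input data size and the chosen parameters), $\hat{t}(\mathbf{x})$ is the predicted execution time, $\mathbf{x}_i$ are the training jobs of the group, $\alpha_i$, $\alpha_i^{*}$ and $b$ are learned by the SVM, and the kernel shown is the RBF kernel that the embodiment below selects, with width parameter $\gamma$.

```latex
$$
K(\mathbf{x}, \mathbf{x}') = \exp\!\left(-\gamma \,\lVert \mathbf{x} - \mathbf{x}' \rVert^{2}\right),
\qquad
\hat{t}(\mathbf{x}) = \sum_{i=1}^{n} \left(\alpha_i - \alpha_i^{*}\right) K(\mathbf{x}_i, \mathbf{x}) + b
$$
```

One such model is built per job group, so each group has its own training set, predictors, and learned coefficients.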
S7. For a newly submitted job, run the job on the clustering cluster using the default parameter configuration and a portion of the input data set, collect the time series of each type of resource consumption, and build a resource consumption feature vector by the method of step S2;
The size of the portion of the input data set used here is generally chosen, with a view to saving resources and time, as an empirical value that is relatively small yet larger than the current main memory. The clustering cluster here can be the current cluster itself, but to save resources a part of the current cluster can be used instead.
S8. Match the resource consumption feature vector of the newly submitted job against the center of each cluster in the S3 clustering result by distance, then use the performance model of the matched job class to predict execution times under different parameter configurations and the input data set size; these predictions form the search space for parameter optimization.
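Step S8 can be sketched as a nearest-center lookup followed by enumerating candidate configurations. The cluster centers, the parameter grid, and `toy_predict` (a stand-in for the group's trained performance model) are hypothetical.

```python
import itertools
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def match_group(vec, centers):
    """Index of the cluster center closest to the new job's feature vector."""
    return min(range(len(centers)), key=lambda g: cosine_distance(vec, centers[g]))


def build_search_space(predict, grid, input_size_gb):
    """Predict the execution time of every configuration in the grid."""
    names = sorted(grid)
    space = []
    for values in itertools.product(*(grid[n] for n in names)):
        config = dict(zip(names, values))
        space.append((config, predict(config, input_size_gb)))
    return space


def toy_predict(config, size_gb):
    """Stand-in for a trained model; a made-up formula, not a real one."""
    return size_gb * 60.0 / config["P4"] + abs(config["P9"] - 4) * 10.0


centers = [[0.9, 0.2, 0.1], [0.2, 0.9, 0.3]]  # hypothetical cluster centers
group = match_group([0.15, 0.85, 0.35], centers)
space = build_search_space(toy_predict, {"P4": [1, 2, 4], "P9": [2, 4, 8]}, 10)
print(group, len(space))
```

An exhaustive grid is only practical for a handful of parameters and candidate values; how the candidate configurations are generated is not specified in the text.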
S9. Use a suitable search algorithm to search for the optimal parameter configuration and output it.
The principle for choosing the search algorithm is that it should quickly find, within the search space, the optimal parameter configuration that minimizes job execution time.
S10. Using the parameter configuration obtained in S9 and the specified input data set, run the newly submitted job on the current cluster.
Example:
The invention provides an embodiment that illustrates how the method is carried out, taking as an example a 6-node Hadoop cluster (1 NameNode and 5 DataNodes; each DataNode is configured with 2 map slots and 2 reduce slots, each task has 300 MB of memory, and the data block size is 64 MB), the 9 parameters with the greatest impact on Hadoop jobs (denoted P1 to P9), and a model built from the information of 200 historical jobs on this cluster. The embodiment comprises the following steps:
Step 1: Collect, for the historical jobs run on the current cluster (1 NameNode, 5 DataNodes), the execution time, the input data set size, the MapReduce parameter configuration, and the time series of each type of resource consumption, such as the time series of CPU utilization, disk utilization, and network transfer rate.
Step 2: Normalize the collected time series of CPU utilization, disk utilization, and network transfer rate of each historical job, then compute the mean of each series; each mean serves as one dimension of that job's resource consumption feature vector.
Step 3: Using the cosine distance formula, compute the distances between the resource consumption feature vectors of different jobs, and use the K-means clustering algorithm to cluster the jobs into groups. Assume that jobs 1 to 100 are clustered into one group and jobs 101 to 200 into another, i.e., that the job numbers within each group are consecutive.
Step 4: Based on the clustering result, use the configuration parameters, input data size, and execution time of jobs 1 to 100 to build the job execution time training set of that group, and use the configuration parameters, input data size, and execution time of jobs 101 to 200 to build the job execution time training set of the other group.
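Building the per-group training sets in step 4 amounts to partitioning the job records by cluster label. In this minimal sketch the record layout (`input_gb`, `params`, `exec_seconds`) and the sample values are hypothetical.

```python
from collections import defaultdict


def build_training_sets(jobs, labels):
    """Split (features, execution time) pairs by the cluster label of each job."""
    sets = defaultdict(list)
    for job, label in zip(jobs, labels):
        # Feature order: input size first, then parameters in name order.
        features = [job["input_gb"]] + [job["params"][p] for p in sorted(job["params"])]
        sets[label].append((features, job["exec_seconds"]))
    return dict(sets)


# Made-up historical job records and the labels a clustering step assigned.
jobs = [
    {"input_gb": 10, "params": {"P1": 2, "P4": 4}, "exec_seconds": 310.0},
    {"input_gb": 20, "params": {"P1": 4, "P4": 2}, "exec_seconds": 650.0},
    {"input_gb": 5, "params": {"P1": 1, "P4": 8}, "exec_seconds": 120.0},
]
labels = [0, 1, 0]
training_sets = build_training_sets(jobs, labels)
print(training_sets[0])
```

Each value in `training_sets` is then the input for that group's predictor selection and model fitting.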
Step 5: For each group of jobs, use stepwise regression to select the best predictors, i.e., the factors strongly correlated with job execution time. Assume that the best predictors for the group containing jobs 1 to 100 are the input data size, P9, P8, P2, P1, and P4, and that the best predictors for the group containing jobs 101 to 200 are the input data size, P9, P8, P7, and P4.
Step 6: For each group of jobs, perform SVM regression analysis on the result of the stepwise regression analysis, select the RBF kernel function, build the SVM performance model, and assess the prediction accuracy of the model by cross-validation.
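The cross-validation in step 6 can be sketched independently of the model being assessed. The sketch below assumes k-fold cross-validation scored by mean absolute error; the text names neither the variant nor the metric. `fit_ratio` is a deliberately trivial stand-in for training the SVM model.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k interleaved folds (assumes k <= n)."""
    return [list(range(i, n, k)) for i in range(k)]


def cross_validate(xs, ys, fit, k=5):
    """Mean absolute error over k held-out folds.

    `fit(train_x, train_y)` must return a callable that maps x to a prediction.
    """
    errors = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)
        errors.extend(abs(model(xs[i]) - ys[i]) for i in test_idx)
    return sum(errors) / len(errors)


def fit_ratio(train_x, train_y):
    """Toy model: execution time proportional to the first feature."""
    slope = sum(train_y) / sum(x[0] for x in train_x)
    return lambda x: slope * x[0]


# Made-up data in which time is exactly 10 seconds per GB of input.
xs = [[1], [2], [3], [4]]
ys = [10, 20, 30, 40]
mae = cross_validate(xs, ys, fit_ratio, k=2)
print(mae)
```

Passing the training routine in as `fit` means the same loop could score any candidate kernel or hyperparameter setting on a group's training set.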
Step 7: Assume a newly submitted job x whose input data set is 10 GB. Run the job on the clustering cluster (1 NameNode, 2 DataNodes) using a portion of the input data set (say 1 GB) and the default parameter settings, collect the time series of each type of resource consumption, and build a resource consumption feature vector by the method of step 2.
Step 8: Match the resource consumption feature vector of the newly submitted job x against each cluster center in the clustering result of step 3 by distance. Assuming it matches the group containing jobs 101 to 200, use the performance model of that group to predict the execution time of job x under different parameter configurations at the input data set size (10 GB); these predictions form the search space for parameter optimization.
Step 9: When searching for the optimal parameter configuration, sort the predicted execution times of job x under the different parameter configurations in ascending order, take the parameter configurations corresponding to the 10 smallest predicted execution times, and use the mean of each parameter as its optimal value; output the optimized values of parameters P9, P8, P7, and P4.
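The search in step 9 reduces to a sort, a slice, and a per-parameter mean. In this minimal sketch the search space entries and the value of `top_k` are made up for illustration.

```python
def recommend(space, top_k=10):
    """Average each parameter over the top_k configurations with the
    smallest predicted execution time."""
    ranked = sorted(space, key=lambda item: item[1])[:top_k]
    names = list(ranked[0][0])
    return {n: sum(cfg[n] for cfg, _ in ranked) / len(ranked) for n in names}


# Made-up (configuration, predicted seconds) pairs.
space = [
    ({"P4": 2, "P9": 4}, 120.0),
    ({"P4": 4, "P9": 8}, 100.0),
    ({"P4": 1, "P9": 2}, 300.0),
]
best = recommend(space, top_k=2)
print(best)
```

Because the result is a mean, parameters that only accept integers or a fixed set of values would need rounding or snapping to a valid value before the job is submitted; the text does not address this.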
Step 10: Using the parameter configuration obtained in step 9 and the specified input data set (10 GB), run the newly submitted job x on the current cluster (1 NameNode, 5 DataNodes).
Those skilled in the art will readily understand that the foregoing is only a preferred embodiment of the invention and is not intended to limit it; any modification, equivalent substitution, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (10)

1. A machine-learning-based method for automatically tuning Hadoop parameters, comprising an offline process and an online process, wherein
the offline process comprises the following steps:
S1. Collect, for the historical jobs run on the current cluster, the execution time, the input data set size, the MapReduce parameter configuration, and the time series of each type of resource consumption;
S2. Normalize the collected resource consumption time series of the historical jobs, then build resource consumption feature vectors;
S3. Compute the distances between the resource consumption feature vectors of different jobs as a measure of job similarity, and cluster the jobs into groups so that jobs with similar resource consumption features fall into the same group;
S4. Based on the clustering result, use the configuration parameters, input data size, and execution time of the historical jobs in each group to build a job execution time training set per group;
S5. For each group of jobs, use stepwise regression to select the best predictors, i.e., the factors strongly correlated with job execution time;
S6. For each group of jobs, perform SVM regression analysis on the result of the stepwise regression analysis, select a suitable kernel function, and build an SVM performance model;
the online process comprises the following steps:
S7. For a newly submitted job, run the job on the clustering cluster using the default parameter configuration and a portion of the input data set, collect the time series of each type of resource consumption, and build a resource consumption feature vector by the method of step S2;
S8. Match the resource consumption feature vector of the newly submitted job against the center of each cluster in the clustering result of step S3 by distance, then use the performance model of the matched job class to predict execution times under different parameter configurations and the input data set size; these predictions form the search space for parameter optimization;
S9. Use a search algorithm to search for the optimal parameter configuration and output it;
S10. Using the optimal parameter configuration obtained in step S9 and the specified input data set, run the newly submitted job on the current cluster.
2. The Hadoop parameter automated tuning method based on machine learning according to claim 1, characterized in that the distance formula used in step S3 to measure job similarity makes the similarity between identical jobs higher than the similarity between different jobs.
3. The Hadoop parameter automated tuning method based on machine learning according to claim 1 or 2, characterized in that the distance formula used in step S3 to measure job similarity is the cosine distance formula.
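For reference, cosine distance between two feature vectors can be sketched in Python as below. This is the standard definition (one minus cosine similarity); the handling of an all-zero vector is an assumption of this example, not something the claim specifies.

```python
from math import sqrt


def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b); 0 means identical direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: treat an all-zero vector as maximally dissimilar.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)
```

Because the measure depends only on direction, two runs of the same job whose resource curves differ by a constant scale factor come out at a distance close to zero.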
4. The Hadoop parameter automated tuning method based on machine learning according to claim 1 or 2, characterized in that the clustering algorithm used for grouping in step S3 is an unsupervised clustering algorithm, whose principle is that jobs with similar resource features are automatically clustered into one group.
5. The Hadoop parameter automated tuning method based on machine learning according to claim 1 or 2, characterized in that the grouping in step S3 uses the K-means algorithm.
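A minimal sketch of K-means (Lloyd's algorithm) over job feature vectors is given below in Python. It uses squared Euclidean distance and takes the first k points as initial centers so that the result is deterministic; both choices are assumptions of this example, and the claims do not state how the algorithm is initialized or combined with the cosine distance.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100):
    """Cluster points into k groups; returns (centers, labels)."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Assumption: deterministic initialization with the first k points.
    centers = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are final
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels
```

The returned centers are what a newly submitted job's feature vector would later be matched against.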
6. The Hadoop parameter automated tuning method based on machine learning according to claim 1 or 2, characterized in that, in step S6, cross-validation is used to assess the prediction accuracy of the SVM performance model.
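The cross-validation idea can be sketched in Python as follows. To stay self-contained, the example plugs in a trivial mean predictor where the SVM regression model would go; the `fit`/`predict` callables, the contiguous unshuffled folds, and the mean-absolute-error metric are all assumptions of this sketch.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds whose sizes differ by at most one."""
    if not 1 < k <= n:
        raise ValueError("k must satisfy 1 < k <= n")
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(xs, ys, k, fit, predict):
    """Return the mean absolute error averaged over k held-out folds."""
    errors = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)
        fold_error = sum(abs(predict(model, xs[i]) - ys[i]) for i in fold) / len(fold)
        errors.append(fold_error)
    return sum(errors) / len(errors)


# Stand-in model (not an SVM): always predicts the mean training execution time.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)


def predict_mean(model, x):
    return model
```

In practice the rows would usually be shuffled before splitting, and `fit`/`predict` would wrap the regression model actually trained for each job group.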
7. The Hadoop parameter automated tuning method based on machine learning according to claim 1 or 2, characterized in that, in step S9, the search algorithm is selected on the principle that it quickly finds, within the search space, the optimal parameter configuration that minimizes job execution time.
8. The Hadoop parameter automated tuning method based on machine learning according to claim 1 or 2, characterized in that, in step S9, the search algorithm is: sort the predicted execution times of the job under different parameter configurations in ascending order, take the parameter configurations corresponding to the first several predicted execution times, and take the mean of each parameter as the optimal parameter value.
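The search described in this claim can be sketched in a few lines of Python. The function name, the `top_n` argument, and the representation of a configuration as a dictionary are assumptions of this example; the claim only says "the first several" configurations are averaged.

```python
def average_top_configs(predictions, top_n):
    """predictions: list of (config_dict, predicted_seconds) pairs.

    Sort by predicted execution time, keep the top_n fastest configurations,
    and return the per-parameter mean as the suggested configuration.
    """
    if not 0 < top_n <= len(predictions):
        raise ValueError("top_n must be between 1 and the number of predictions")
    ranked = sorted(predictions, key=lambda item: item[1])
    best = [config for config, _ in ranked[:top_n]]
    return {name: sum(c[name] for c in best) / top_n for name in best[0]}
```

For example, with two standard MapReduce property names chosen purely for illustration:

```python
candidates = [
    ({"mapreduce.task.io.sort.mb": 100, "mapreduce.job.reduces": 4}, 120.0),
    ({"mapreduce.task.io.sort.mb": 200, "mapreduce.job.reduces": 8}, 90.0),
    ({"mapreduce.task.io.sort.mb": 300, "mapreduce.job.reduces": 16}, 95.0),
    ({"mapreduce.task.io.sort.mb": 400, "mapreduce.job.reduces": 2}, 150.0),
]
suggested = average_top_configs(candidates, top_n=2)
```

Averaged values are floats, so integer-valued parameters would need rounding before being written into a job configuration.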
9. The Hadoop parameter automated tuning method based on machine learning according to claim 1 or 2, characterized in that the cluster denoted cluster in step S7 is part or all of the current cluster.
10. A Hadoop parameter automated tuning system based on machine learning, comprising an offline module and an online module, wherein:
The offline module comprises a Hadoop data collector, a clusterer, and a performance model building submodule;
The Hadoop data collector collects, from the log files of the Hadoop master node of the current cluster, the execution time, input data set size, MapReduce parameter configuration, and time-series information on each type of resource consumption of the historical jobs;
The clusterer preprocesses the time-series information, gathered by the Hadoop data collector, on each type of resource consumption of the historical jobs run in the current cluster, builds resource consumption feature vectors, and clusters the jobs into groups;
The performance model building submodule uses the input data size, parameter configuration, and execution time of each group of historical jobs as a training set, trains and builds a performance model for each group, and provides the performance models to the optimizer for selection;
The online module comprises a job manager, an optimizer, a resource consumption feature matcher, and a job resource manager;
The job manager submits a newly given job to the Hadoop master node: the first time, it submits the new job to the cluster denoted cluster, specifying the input data set and the default parameter configuration, so that the new job can be assigned to a group; the second time, it submits the new job to the current cluster, specifying the input data set and the optimal parameter configuration;
The job resource manager collects the time-series information on resource consumption while the new job submitted by the job manager for the first time is running, preprocesses it, builds a resource consumption feature vector, and provides the vector to the resource consumption feature matcher;
The resource consumption feature matcher compares the distance between this feature vector and the resource consumption feature vector of each cluster center, determines the job group to which the job belongs, and provides the result to the optimizer;
The optimizer selects the corresponding performance model according to the grouping result, and computes and searches for the most reasonable parameter configuration so as to improve resource utilization and job running efficiency; finally, it provides the optimal parameter configuration and input data set to the job manager, which submits the newly submitted job, together with its optimal parameter configuration and input data set, to the Hadoop master node of the current cluster to run.
CN201610550098.7A 2016-07-13 2016-07-13 A kind of Hadoop parameter automated tuning method and system based on machine learning Active CN106202431B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610550098.7A CN106202431B (en) 2016-07-13 2016-07-13 A kind of Hadoop parameter automated tuning method and system based on machine learning

Publications (2)

Publication Number Publication Date
CN106202431A true CN106202431A (en) 2016-12-07
CN106202431B CN106202431B (en) 2019-06-28

Family

ID=57477667

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610550098.7A Active CN106202431B (en) 2016-07-13 2016-07-13 A kind of Hadoop parameter automated tuning method and system based on machine learning

Country Status (1)

Country Link
CN (1) CN106202431B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130254196A1 (en) * 2012-03-26 2013-09-26 Duke University Cost-based optimization of configuration parameters and cluster sizing for hadoop
CN102929667A (en) * 2012-10-24 2013-02-13 曙光信息产业(北京)有限公司 Method for optimizing hadoop cluster performance
CN103064664A (en) * 2012-11-28 2013-04-24 华中科技大学 Hadoop parameter automatic optimization method and system based on performance pre-evaluation
WO2015066979A1 (en) * 2013-11-07 2015-05-14 浪潮电子信息产业股份有限公司 Machine learning method for mapreduce task resource configuration parameters
CN103713935A (en) * 2013-12-04 2014-04-09 中国科学院深圳先进技术研究院 Method and device for managing Hadoop cluster resources in online manner
CN103701635A (en) * 2013-12-10 2014-04-02 中国科学院深圳先进技术研究院 Method and device for configuring Hadoop parameters on line
CN103942108A (en) * 2014-04-25 2014-07-23 四川大学 Resource parameter optimization method under Hadoop homogenous cluster
CN104750780A (en) * 2015-03-04 2015-07-01 北京航空航天大学 Hadoop configuration parameter optimization method based on statistic analysis
CN105184424A (en) * 2015-10-19 2015-12-23 国网山东省电力公司菏泽供电公司 Mapreduced short period load prediction method of multinucleated function learning SVM realizing multi-source heterogeneous data fusion

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
TIANYAO SUN ET AL: "Accelerating Support Vector Machine Learning with GPU-Based MapReduce", 《 2015 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS》 *
曾林西: "基于性能预估的Hadoop参数自动调优系统", 《中国优秀硕士学位论文全文数据库信息科技辑》 *
王皎等: "Hadoop集群参数的自动调优", 《电脑知识与技术》 *
郑晓薇等: "Hadoop 集群性能参数自动调优信息库系统构建", 《小型微型计算机系统》 *
陶杭: "基于Hadoop的SVM算法优化及在文本分类中的应用", 《中国优秀硕士学位论文全文数据库信息科技辑》 *

Also Published As

Publication number Publication date
CN106202431B (en) 2019-06-28

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant