CN106202431A - Method and system for automatic tuning of Hadoop parameters based on machine learning - Google Patents
- Publication number
- CN106202431A CN201610550098.7A
- Authority
- CN
- China
- Prior art keywords
- parameter
- hadoop
- cluster
- resource consumption
- job
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2471—Distributed queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/217—Database tuning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Fuzzy Systems (AREA)
- Mathematical Physics (AREA)
- Probability & Statistics with Applications (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention belongs to the technical field of big data processing and relates to a machine-learning-based method and system for automatically tuning Hadoop parameters. The invention clusters applications into groups according to their resource consumption features and builds a separate performance model for each group, automatically identifying the parameters that most affect each class of application and recommending quantitative parameter values. The system comprises an offline module and an online module. The offline module includes a Hadoop data collector, a clusterer, and a performance model building submodule; the online module includes a job manager, an optimizer, a resource consumption feature matcher, and a job explorer. The invention effectively addresses two limitations of existing rule-of-thumb methods: their heavy dependence on user experience and their restriction to qualitative parameter advice. The invention also separates the parameter optimization system from the Hadoop system, which lowers system coupling, reduces manual effort, avoids human misjudgment, and makes the system easier to extend and maintain.
Description
Technical field
The invention belongs to the technical field of big data processing and, more specifically, relates to a machine-learning-based method and system for automatically tuning Hadoop parameters.
Background art
Hadoop, an open-source implementation of the widely used MapReduce framework for large-scale parallel data processing, offers good scalability and fault tolerance. Users can easily scale applications with Hadoop, but they must configure parameters for each specific application, and different parameter configurations can have a huge impact on system performance. Optimizing the performance of a Hadoop job is a multidimensional optimization problem. The main influencing factors are the size of the data set, the hardware configuration of the machines, the resource utilization characteristics of the job, and the scheduling algorithm. First, the appropriate parameter settings differ greatly from cluster to cluster, so each Hadoop system must be configured according to the characteristics of its own cluster to optimize performance. Second, the relationships among parameters are very complex: a parameter may interact with one or more other parameters, and parameters may constrain or depend on one another. Unreasonable settings can cause resource contention and degrade the overall performance of the whole system, so balancing the mutual constraints among configuration parameters is extremely important. In addition, different applications place different requirements on the execution environment, so the system must be configured to match the job in order to improve overall performance.
Traditional tuning methods based on rules of thumb rely on Hadoop users to identify, through extensive experiments and analysis of the system itself, the parameters that most affect the performance of a given class of Hadoop jobs, and then to adjust those parameter values in practice. This approach requires users to have a deep understanding of Hadoop and substantial practical experience, and its results vary widely from user to user. Every parameter adjustment requires repeated testing, which consumes large amounts of hardware resources and time. For example, Vaidya (see: Vaidya. [Online]. Available: http://hadoop.apache.org/mapreduce/docs/r0.21.0/vaidya.html), a rule-based MapReduce job performance diagnostic tool, parses the parameters in the configuration file of a job that has just finished running, together with its statistics and JobHistory logs, applies predefined rules to this information to find performance problems in the job, and gives qualitative rather than quantitative advice.
Shivnath et al. proposed a tuning method based on code rewriting: by modifying the code of Hadoop itself, a parameter optimization module is added to the Hadoop source code, and ideas from database query optimization are borrowed to achieve automatic tuning of Hadoop parameters (see: Babu S. Towards Automatic Optimization of MapReduce Programs. In: Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC '10). ACM, 2010. 137-142.). However, this method cannot tune in a targeted way according to the characteristics of each application and the resources available in the system. Moreover, handing the work of configuring parameters to the Hadoop system itself makes the underlying Hadoop source code more complex and harder to maintain.
K. Kambatla proposed a tuning method based on resource consumption features (see: Kambatla K, Pathak A, Pucha H. Towards optimizing Hadoop provisioning in the cloud. In: HotCloud Workshop in conjunction with USENIX Annual Technical Conference. USENIX Association, 2009.). The method compares the resource consumption features of the current job with those of other applications stored in a database, finds the most similar one, and reads its optimal parameter configuration to use as the configuration of the current job, thereby configuring parameters automatically for new Hadoop jobs. It aims to let a job achieve the best performance while keeping the consumption of the program as low as possible. The key to the method is to collect the resource consumption features of different typical Hadoop applications, find the optimal parameter configuration of each corresponding MapReduce job by repeatedly modifying the configuration parameters, and then save each Hadoop application together with its resource consumption features and optimal parameter configuration in the database. This tuning approach resembles the rule-of-thumb approach, except that the experience here is the set of applications already stored in the database. Its advantage is that it is relatively easy to implement. Its drawbacks are that it mainly sets only two parameters, map.tasks.maximum and reduce.tasks.maximum, so few parameters are optimized and the optimization effect is limited; and that finding the optimal parameter configuration of each application requires extensive experiments, which wastes large amounts of hardware resources and time.
Summary of the invention
To address the above deficiencies of existing Hadoop tuning methods, the invention provides a machine-learning-based method and system for automatically tuning Hadoop parameters. The aim is to cluster applications into groups according to their resource consumption features, build a separate performance model for each group, and automatically identify the parameters that most affect each class of application together with quantitative recommended values. This effectively solves the heavy dependence on user experience and the restriction to qualitative parameter advice of existing rule-of-thumb methods, and it solves the problem that existing methods based on resource consumption features optimize few parameters and therefore have limited effect. The invention also separates the parameter optimization system from the Hadoop system, which lowers system coupling.
The invention provides a machine-learning-based method for automatically tuning Hadoop parameters, comprising an offline process and an online process. The offline process comprises the following steps:
S1. Collect the execution time, input data set size, MapReduce parameter configuration, and time series information of each type of resource consumption for the historical jobs run in the current cluster;
S2. Normalize the collected time series information of each type of resource consumption of the historical jobs, then build resource consumption feature vectors;
S3. Compute the distances between the resource consumption feature vectors of different jobs to measure their similarity, and cluster the jobs into groups so that jobs with similar resource consumption features fall into the same group;
S4. According to the clustering result, use the configuration parameters, input data sizes, and execution times of the historical jobs in each group to build a job execution time training set for each group;
S5. For each group of jobs, use stepwise regression to select the best predictors, i.e. the factors strongly correlated with job execution time;
S6. For each group of jobs, perform SVM regression analysis on the result of the stepwise regression analysis, select a suitable kernel function, and build an SVM performance model;
The online process comprises the following steps:
S7. For a newly submitted job, run the job in the clustering cluster (the cluster used to profile the job so that it can be grouped) using the default parameter configuration and a part of the input data set, collect the time series information of each type of resource consumption, and build a resource consumption feature vector according to the method of step S2;
S8. Match the resource consumption feature vector of the newly submitted job by distance against each cluster center in the clustering result of step S3, then use the performance model of the matched job class to predict the execution time under different parameter configurations and the input data set size, as the search space for parameter optimization;
S9. Use a search algorithm to search for the optimal parameter configuration, and output it;
S10. Using the optimal parameter configuration obtained in step S9 and the specified input data set, run the newly submitted job in the current cluster.
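Step S8 can be pictured as enumerating candidate configurations and asking the matched group's performance model for a predicted execution time for each one. The following Python sketch shows one way to do that under stated assumptions: the grid of candidate values, the names `build_search_space` and `toy_model`, and the formula inside `toy_model` are hypothetical stand-ins, because the source does not say how candidate configurations are generated and the real predictor would be the trained SVM performance model.

```python
from itertools import product


def build_search_space(param_grid, input_size_gb, predict_time):
    """Pair every candidate configuration with its predicted execution time.

    param_grid maps a parameter name to the candidate values to try.
    predict_time(config, input_size_gb) stands in for the group's performance model.
    """
    names = sorted(param_grid)
    space = []
    for values in product(*(param_grid[name] for name in names)):
        config = dict(zip(names, values))
        space.append((config, predict_time(config, input_size_gb)))
    return space


def toy_model(config, input_size_gb):
    # Hypothetical predictor for illustration only, NOT the SVM model of the source.
    return input_size_gb * 10 / config["P4"] + config["P8"]


# Assumed candidate values for two of the parameters named P1..P9 in the source.
space = build_search_space({"P4": [1, 2, 4], "P8": [0, 5]}, 10, toy_model)
fastest_config, fastest_time = min(space, key=lambda entry: entry[1])
print(len(space), fastest_config, fastest_time)
```

An exhaustive grid is only practical for a handful of parameters with few candidate values each; with more parameters, sampling the grid would be one possible alternative.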
As an improvement of the above technical solution, the distance formula used in step S3 to measure the similarity of different jobs must make the similarity between identical jobs higher than the similarity between different jobs; the cosine distance formula is preferred.
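As an illustration of the preferred distance measure, the following Python sketch computes the cosine distance between two resource consumption feature vectors. The function name and the sample vectors (mean CPU, disk, and network usage) are assumptions made for this example, not values from the source.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v); 0 means the same direction."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical feature vectors: (mean CPU, mean disk, mean network).
cpu_bound_small = [0.90, 0.10, 0.05]
cpu_bound_large = [0.45, 0.05, 0.025]  # same profile at half the magnitude
io_bound = [0.10, 0.80, 0.60]

print(cosine_distance(cpu_bound_small, cpu_bound_large))
print(cosine_distance(cpu_bound_small, io_bound))
```

Cosine distance depends only on the direction of the vectors, so two profiles with the same resource mix but different magnitudes come out as nearly identical. This is one plausible reason why it suits the requirement stated above, though the source does not give its reasoning.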
As a further improvement of the above technical solution, the clustering algorithm used for grouping in step S3 is an unsupervised clustering algorithm whose principle is that the jobs with the most similar resource features are automatically clustered into one group; the K-means algorithm is preferred.
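A minimal Python sketch of K-means grouping over feature vectors follows. The source does not spell out how K-means is combined with the cosine distance, so this sketch makes two assumptions: points are assigned to the nearest center by cosine distance, and centers are updated with the ordinary arithmetic mean. The initial centers are passed in explicitly to keep the example reproducible; the sample vectors are hypothetical.

```python
import math


def cosine_distance(u, v):
    # Assumes non-zero vectors of equal length.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def kmeans(points, initial_centers, max_iter=100):
    """Group points around centers; returns (labels, centers)."""
    centers = [list(c) for c in initial_centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda j: cosine_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers will not move either
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a group becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical feature vectors: two CPU-heavy jobs and two I/O-heavy jobs.
jobs = [[0.9, 0.1, 0.05], [0.8, 0.2, 0.1], [0.1, 0.8, 0.6], [0.2, 0.9, 0.5]]
labels, centers = kmeans(jobs, initial_centers=[jobs[0], jobs[2]])
print(labels)
```

In practice the number of groups and the initial centers would have to be chosen by some other means; the source does not say how.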
As a further improvement of the above technical solution, in step S6 cross-validation is used to assess the prediction accuracy of the SVM performance model.
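The sketch below shows k-fold cross-validation in plain Python for any regressor exposed as a `fit` function. It is an assumption-laden illustration: the source does not name the number of folds or the error metric, so `k` and the mean absolute error are choices made here, and `fit_mean_model` is a trivial stand-in for the SVM regressor.

```python
def k_fold_mae(samples, targets, fit, k=5):
    """Mean absolute error averaged over k folds.

    fit(train_x, train_y) must return a predict(x) callable.
    """
    n = len(samples)
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and the number of samples")
    fold_errors = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # every k-th sample is held out
        train_x = [samples[i] for i in range(n) if i not in test_idx]
        train_y = [targets[i] for i in range(n) if i not in test_idx]
        predict = fit(train_x, train_y)
        errors = [abs(predict(samples[i]) - targets[i]) for i in sorted(test_idx)]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k


def fit_mean_model(train_x, train_y):
    # Stand-in for the SVM regressor: always predicts the training mean.
    mean_y = sum(train_y) / len(train_y)
    return lambda x: mean_y


# Hypothetical training set: one feature per job, execution time as target.
features = [[1], [2], [3], [4]]
exec_times = [0.0, 2.0, 0.0, 2.0]
score = k_fold_mae(features, exec_times, fit_mean_model, k=2)
print(score)
```

A lower score means better held-out predictions, so the same function could be used to compare kernel functions for a group's model.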
As a further improvement of the above technical solution, the principle for choosing the search algorithm in step S9 is that it quickly finds, within the search space, the optimal parameter configuration that minimizes job execution time. Specifically, the predicted execution times of the job under different parameter configurations can be sorted in ascending order, the parameter configurations corresponding to the first several (shortest) predicted execution times taken, and the mean of each parameter used as the optimal parameter value.
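The sort-and-average search described above can be sketched in Python as follows. The function name, the `top_k` argument, the candidate values, and `toy_predictor` are hypothetical; in the method itself the predicted times would come from the SVM performance model of the matched group.

```python
def recommend_parameters(candidates, predict_time, top_k=10):
    """Average each parameter over the top_k configurations with the
    shortest predicted execution time.

    candidates: list of dicts mapping parameter name to a numeric value.
    predict_time: callable taking one such dict and returning a float.
    """
    if not candidates:
        raise ValueError("search space is empty")
    ranked = sorted(candidates, key=predict_time)  # ascending predicted time
    best = ranked[:top_k]
    return {name: sum(cfg[name] for cfg in best) / len(best) for name in best[0]}


def toy_predictor(cfg):
    # Hypothetical stand-in for the trained performance model.
    return abs(cfg["P4"] - 6) * 10 + abs(cfg["P9"] - 200)


search_space = [
    {"P4": p4, "P9": p9} for p4 in range(2, 11, 2) for p9 in (100, 200, 300)
]
recommended = recommend_parameters(search_space, toy_predictor, top_k=3)
print(recommended)
```

Averaging can produce non-integer values; the source does not say how these are handled for parameters that only accept integers, so any rounding rule would be an additional assumption.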
The invention also provides a machine-learning-based system for automatically tuning Hadoop parameters, comprising an offline module and an online module, wherein:
The offline module includes a Hadoop data collector, a clusterer, and a performance model building submodule;
The Hadoop data collector collects, from the log files of the Hadoop master node of the current cluster, the execution time, input data set size, MapReduce parameter configuration, and time series information of each type of resource consumption of the historical jobs;
The clusterer preprocesses the time series information of each type of resource consumption of the historical jobs run in the current cluster, as collected by the Hadoop data collector, builds resource consumption feature vectors, and clusters the jobs into groups;
The performance model building submodule takes the input data size, parameter configuration, and execution time information of each group of jobs as a training set, trains and builds a performance model for each group, and provides the performance models to the optimizer for selection;
The online module includes a job manager, an optimizer, a resource consumption feature matcher, and a job explorer;
The job manager submits newly given jobs to the Hadoop master node. It first submits the new job to the clustering cluster, specifying the input data set and the default parameter configuration, so that the new job can be assigned to a group; it then submits the new job a second time, to the current cluster, specifying the input data set and the optimal parameter configuration;
The job explorer collects the time series information of resource consumption while the new job first submitted by the job manager is running, preprocesses it, builds a resource consumption feature vector, and provides the vector to the resource consumption feature matcher;
The resource consumption feature matcher compares the distance between this feature vector and the resource consumption feature vector of each group's cluster center, obtains the job group to which the job belongs, and provides it to the optimizer;
The optimizer selects the corresponding performance model according to the grouping result and searches for the most reasonable parameter configuration to improve resource utilization and job running efficiency. Finally, it provides the optimal parameter configuration and input data set to the job manager, which submits the newly submitted job together with its optimal parameter configuration and input data set to the Hadoop master node of the current cluster to run.
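The interaction of the four online components can be sketched as a single Python function that wires them together. Everything here is hypothetical scaffolding: `submit`, `explore`, `match`, and `optimize` are placeholder callables standing for the job manager, job explorer, resource consumption feature matcher, and optimizer, and their signatures are assumptions rather than interfaces defined by the source.

```python
def online_tuning(job, sample_input, full_input, default_config,
                  submit, explore, match, optimize):
    """Run the two-submission online flow and return (group, best_config)."""
    # First submission: a part of the input, default parameters, clustering cluster.
    run_id = submit(job, sample_input, default_config)
    # Job explorer: turn the observed resource usage into a feature vector.
    feature_vector = explore(run_id)
    # Matcher: find the job group whose cluster center is closest.
    group = match(feature_vector)
    # Optimizer: pick that group's model and search for the best configuration.
    best_config = optimize(group, full_input)
    # Second submission: full input, tuned parameters, current cluster.
    submit(job, full_input, best_config)
    return group, best_config


# Usage with stub components that only record what they were asked to do.
submissions = []


def fake_submit(job, input_data, config):
    submissions.append((input_data, config))
    return len(submissions)


group, best_config = online_tuning(
    "job_x", "sample_1GB", "full_10GB", {"P4": 2},
    submit=fake_submit,
    explore=lambda run_id: [0.2, 0.7, 0.5],
    match=lambda vector: 1,
    optimize=lambda grp, input_data: {"P4": 6},
)
print(group, best_config, submissions)
```

Passing the components in as callables mirrors the separation between the tuning system and Hadoop described above: only `submit` would need to talk to the Hadoop master node.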
In view of Hadoop's high configurability, and with the cluster hardware resources fixed, the invention uses a machine learning performance model to automatically tune the parameters of newly submitted jobs and thereby optimize system performance. In general, compared with prior-art solutions, the method of the invention has the following advantages:
1. Compared with existing rule-of-thumb methods, the method does not demand much user experience, and it automatically provides a detailed parameter configuration with explicit numerical values;
2. Compared with existing methods based on code rewriting, the method separates the parameter tuning component from the Hadoop system and does not increase the complexity of the underlying Hadoop source code. The method also clusters applications into groups according to how they use system resources, builds a performance model for each group, and proposes different optimal parameter configurations, so parameters can be tuned in a targeted way according to the characteristics of different types of application;
3. Compared with existing methods based on resource consumption features, the method clusters jobs into groups by resource consumption features and, after reducing dimensionality through repeated regression analysis on each group, builds an SVM performance model matched to the resource consumption features of each group. The predictors of this model differ from group to group, i.e. the types and number of parameters optimized vary with the clustering result rather than being limited to a fixed set, which effectively improves the optimization effect. In addition, the method uses the SVM performance model to predict execution times under different input data sizes and parameter configurations in order to build the search space, without actually running large numbers of experiments, which effectively saves hardware resources and time.
Brief description of the drawings
Fig. 1 is a flow chart of the machine-learning-based Hadoop parameter automatic tuning method proposed by the invention;
Fig. 2 is an architecture diagram of the machine-learning-based Hadoop parameter automatic tuning system proposed by the invention.
Detailed description of the invention
To make the purpose, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it. In addition, the technical features involved in the embodiments described below can be combined with one another as long as they do not conflict.
Fig. 1 is a flow chart of the Hadoop parameter automatic tuning method of the invention, which specifically comprises the following steps:
S1. Collect the execution time, input data set size, MapReduce parameter configuration, and time series information of each type of resource consumption for the historical jobs run in the current cluster, including time series information such as CPU utilization, disk utilization, and network transfer rate.
S2. Normalize the collected time series of CPU utilization, disk utilization, and network transfer rate of each historical job, then take the mean of each series and use it as one dimension of that job's resource consumption feature vector, thereby building the resource consumption feature vector.
S3. Using the cosine distance formula, compute the distances between the resource consumption feature vectors of different jobs to measure their similarity, and use the K-means clustering algorithm to group the jobs so that jobs with similar resource consumption features are automatically placed in the same group.
The distance formula used to measure job similarity must conform to an objective experimental regularity: the similarity between identical jobs is higher than the similarity between different jobs. Experiments verified that the cosine distance formula can be used, whereas the Euclidean distance formula cannot.
The clustering algorithm for grouping jobs is an unsupervised clustering algorithm whose principle is that the jobs with the most similar resource features are automatically clustered into one group. The K-means algorithm can be used.
S4. According to the clustering result, use the configuration parameters, input data sizes, and execution times of the historical jobs in each group to build a job execution time training set for each group;
S5. For each group of jobs, use stepwise regression to select the best predictors, i.e. the factors strongly correlated with job execution time;
S6. For each group of jobs, perform SVM regression analysis on the result of the stepwise regression analysis, select a suitable kernel function, and build an SVM performance model (the prediction accuracy of the model is assessed by cross-validation);
S7. For a newly submitted job, run the job in the clustering cluster using the default parameter configuration and a part of the input data set, collect the time series information of each type of resource consumption, and build a resource consumption feature vector according to the method of step S2;
With a view to saving resources and time, the size of the part of the input data set used here is typically chosen as an empirical value that is relatively small but still larger than the current main memory. The clustering cluster here can be the current cluster itself, but to save resources a part of the current cluster can also be used.
S8. Match the resource consumption feature vector of the newly submitted job by distance against each cluster center in the clustering result of S3, then use the performance model of the matched job class to predict the execution time under different parameter configurations and the input data set size, as the search space for parameter optimization.
S9. Use a suitable search algorithm to search for the optimal parameter configuration, and output it.
The principle for choosing the search algorithm is that it quickly finds, within the search space, the optimal parameter configuration that minimizes job execution time.
S10. Using the parameter configuration obtained in S9 and the specified input data set, run the newly submitted job in the current cluster.
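The feature vector construction of step S2 can be sketched in Python as follows. The source says only that the time series are normalized and then averaged, so the normalization used here (dividing each sample by an assumed capacity for that resource and capping at 1) is an assumption, as are the function name, the dictionary keys, and the sample numbers.

```python
RESOURCES = ("cpu", "disk", "network")  # one vector dimension per resource


def build_feature_vector(series_by_resource, capacity_by_resource):
    """Normalize each resource time series and use its mean as one dimension."""
    vector = []
    for resource in RESOURCES:
        series = series_by_resource[resource]
        if not series:
            raise ValueError("empty time series for " + resource)
        capacity = capacity_by_resource[resource]
        # Assumed normalization: fraction of capacity, capped at 1.0.
        normalized = [min(sample / capacity, 1.0) for sample in series]
        vector.append(sum(normalized) / len(normalized))
    return vector


# Hypothetical samples: CPU in percent, disk in percent, network in MB/s.
samples = {"cpu": [50, 100], "disk": [20, 40], "network": [0, 125]}
capacities = {"cpu": 100, "disk": 100, "network": 125}
feature_vector = build_feature_vector(samples, capacities)
print(feature_vector)
```

Normalizing against a fixed capacity rather than each job's own minimum and maximum keeps the absolute level of usage visible in the mean, which is why it was chosen for this sketch.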
Example:
The invention provides an embodiment that illustrates the implementation of the method using a 6-node Hadoop cluster (1 NameNode and 5 DataNodes; each DataNode is configured with 2 Map slots and 2 Reduce slots, 300 MB of memory per task, and a 64 MB data block size). Nine parameters with a relatively large impact on Hadoop jobs are selected (denoted P1 to P9), and the models are built from the information of 200 historical jobs in this cluster. The procedure comprises the following steps:
Step 1: Collect the execution time, input data set size, MapReduce parameter configuration, and time series information of each type of resource consumption for the historical jobs run in the current cluster (1 NameNode, 5 DataNodes), such as the time series of CPU utilization, disk utilization, and network transfer rate.
Step 2: Normalize the collected time series of CPU utilization, disk utilization, and network transfer rate of each historical job, then take the mean of each series as one dimension of that job's resource consumption feature vector, thereby building the resource consumption feature vector.
Step 3: Using the cosine distance formula, compute the distances between the resource consumption feature vectors of different jobs, and use the K-means clustering algorithm to group the jobs. Suppose jobs 1 to 100 are clustered into one group and jobs 101 to 200 into another, so that the job numbers within each group are consecutive.
Step 4: According to the clustering result, use the configuration parameters, input data sizes, and execution times of jobs 1 to 100 to build the job execution time training set of that group, and use the configuration parameters, input data sizes, and execution times of jobs 101 to 200 to build the job execution time training set of that group.
Step 5: For each group of jobs, use stepwise regression to select the best predictors, i.e. the factors strongly correlated with job execution time. Suppose the best predictors for the group containing jobs 1 to 100 are the input data size, P9, P8, P2, P1, and P4, and the best predictors for the group containing jobs 101 to 200 are the input data size, P9, P8, P7, and P4.
Step 6: For each group of jobs, perform SVM regression analysis on the result of the stepwise regression analysis, select the RBF kernel function, build an SVM performance model, and assess the prediction accuracy of the model by cross-validation.
Step 7: Suppose a new job x with a 10 GB input data set is submitted. Run the job in the clustering cluster (1 NameNode, 2 DataNodes) using a part of the input data set (say 1 GB) and the default parameter settings, collect the time series information of each type of resource consumption, and build a resource consumption feature vector according to the method of step 2.
Step 8: Match the resource consumption feature vector of the newly submitted job x by distance against each cluster center in the clustering result of step 3. Suppose it matches the group containing jobs 101 to 200. Then use the performance model of that group to predict the execution time of job x under different parameter configurations and the input data set size (10 GB), as the search space for parameter optimization.
Step 9: To search for the optimal parameter configuration, sort the predicted execution times of job x under different parameter configurations in ascending order, take the parameter configurations corresponding to the first 10 predicted execution times, use the mean of each parameter as the optimal parameter value, and output the optimal configuration values of parameters P9, P8, P7, and P4.
Step 10: Using the parameter configuration obtained in step 9 and the specified input data set (10 GB), run the newly submitted job x in the current cluster (1 NameNode, 5 DataNodes).
Those skilled in the art will readily understand that the foregoing is merely a preferred embodiment of the invention and is not intended to limit it. Any modification, equivalent substitution, or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.
Claims (10)
1. A machine-learning-based method for automatically tuning Hadoop parameters, comprising an offline process and an online process, wherein the offline process comprises the following steps:
S1. Collect the execution time, input data set size, MapReduce parameter configuration, and time series information of each type of resource consumption for the historical jobs run in the current cluster;
S2. Normalize the collected time series information of each type of resource consumption of the historical jobs, then build resource consumption feature vectors;
S3. Compute the distances between the resource consumption feature vectors of different jobs to measure their similarity, and cluster the jobs into groups so that jobs with similar resource consumption features fall into the same group;
S4. According to the clustering result, use the configuration parameters, input data sizes, and execution times of the historical jobs in each group to build a job execution time training set for each group;
S5. For each group of jobs, use stepwise regression to select the best predictors, i.e. the factors strongly correlated with job execution time;
S6. For each group of jobs, perform SVM regression analysis on the result of the stepwise regression analysis, select a suitable kernel function, and build an SVM performance model;
The online process comprises the following steps:
S7. For a newly submitted job, run the job in the clustering cluster using the default parameter configuration and a part of the input data set, collect the time series information of each type of resource consumption, and build a resource consumption feature vector according to the method of step S2;
S8. Match the resource consumption feature vector of the newly submitted job by distance against each cluster center in the clustering result of step S3, then use the performance model of the matched job class to predict the execution time under different parameter configurations and the input data set size, as the search space for parameter optimization;
S9. Use a search algorithm to search for the optimal parameter configuration, and output it;
S10. Using the optimal parameter configuration obtained in step S9 and the specified input data set, run the newly submitted job in the current cluster.
A kind of Hadoop parameter automated tuning method based on machine learning the most according to claim 1, it is characterised in that
Described step S3 is weighed the distance computing formula that different work similarity is used, makes similarity between identical operation make than difference
Between industry, similarity is high.
A kind of Hadoop parameter automated tuning method based on machine learning the most according to claim 1 and 2, its feature exists
In, the distance computing formula that in described step S3, measurement different work similarity is used is COS distance formula.
A kind of Hadoop parameter automated tuning method based on machine learning the most according to claim 1 and 2, its feature exists
In, the clustering algorithm that in described step S3, Clustering is used is Unsupervised clustering algorithm, and principle is: make resource characteristic
Similar operation is automatically clustered into one group.
A kind of Hadoop parameter automated tuning method based on machine learning the most according to claim 1 and 2, its feature exists
In, in described step S3, Clustering uses K-means algorithm.
6. The Hadoop parameter automated tuning method based on machine learning according to claim 1 or 2, characterized in that, in step S6, cross-validation is used to assess the prediction accuracy of the SVM performance model.
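The cross-validation procedure can be sketched as a k-fold loop. To keep the example self-contained, the SVM is replaced by a stand-in model that simply predicts the mean training execution time; the `fit`/`predict` callables, the contiguous fold split, the mean-absolute-error metric, and the sample data are all assumptions made for illustration.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(xs, ys, fit, predict, k=3):
    """Return the mean absolute error averaged over k held-out folds."""
    fold_errors = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)
        errors = [abs(predict(model, xs[i]) - ys[i]) for i in test_idx]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / len(fold_errors)


# Stand-in for the SVM: always predicts the mean training execution time.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)


def predict_mean(model, x):
    return model


# Hypothetical samples: (parameter value, input size) -> execution time.
xs = [[1, 64], [2, 64], [1, 128], [2, 128], [4, 128], [4, 64]]
ys = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
mae = cross_validate(xs, ys, fit_mean, predict_mean, k=3)
```

In practice the samples would normally be shuffled before splitting, and a real SVM regressor would be passed in through `fit` and `predict` in place of the mean predictor.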
7. The Hadoop parameter automated tuning method based on machine learning according to claim 1 or 2, characterized in that, in step S9, the search algorithm is selected on the principle that it quickly finds, within the search space, the optimal parameter configuration that minimizes the job execution time.
8. The Hadoop parameter automated tuning method based on machine learning according to claim 1 or 2, characterized in that, in step S9, the search algorithm is: sort the predicted execution times of the job under the different parameter configurations in ascending order, take the parameter configurations corresponding to the first several predicted execution times, and use the mean of each parameter over those configurations as the optimal parameter value.
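The sort-and-average search can be sketched as follows. The parameter names `sort_buffer_mb` and `reduce_tasks`, the candidate grid, and the toy predictor standing in for the trained performance model are hypothetical and chosen only to make the example runnable.

```python
def average_of_best(candidates, predict_time, top_n=3):
    """Average each parameter over the top_n fastest predicted configurations."""
    ranked = sorted(candidates, key=predict_time)  # ascending predicted time
    best = ranked[:top_n]
    names = best[0].keys()
    return {name: sum(cfg[name] for cfg in best) / len(best) for name in names}


# Toy stand-in for the performance model: fastest at (200, 4).
def toy_predict(cfg):
    return abs(cfg["sort_buffer_mb"] - 200) + 10 * abs(cfg["reduce_tasks"] - 4)


# Hypothetical candidate grid of parameter configurations.
candidates = [
    {"sort_buffer_mb": s, "reduce_tasks": r}
    for s in (100, 200, 300)
    for r in (2, 4, 8)
]
optimal = average_of_best(candidates, toy_predict, top_n=3)
```

Averaging can produce fractional values, so integer-valued parameters would presumably need rounding before the configuration is submitted; the claim does not say how this is handled.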
9. The Hadoop parameter automated tuning method based on machine learning according to claim 1 or 2, characterized in that the clustering cluster in step S7 is part or all of the current cluster.
10. A Hadoop parameter automated tuning system based on machine learning, comprising an offline module and an online module, wherein:
the offline module includes a Hadoop data collector, a clusterer, and a performance model building submodule;
the Hadoop data collector collects, from the Hadoop master node log files of the current cluster, the execution time, input data set size, MapReduce parameter configuration, and time-series information on each type of resource consumption of historical jobs;
the clusterer preprocesses the time-series information, collected by the Hadoop data collector, on each type of resource consumption of the historical jobs run on the current cluster, builds resource consumption feature vectors, and clusters the jobs into groups;
the performance model building submodule uses the input data size, parameter configuration, and execution time information of each group of historical jobs as a training set, trains and builds a performance model for each group, and provides the performance models to the optimizer for selection;
the online module includes a job manager, an optimizer, a resource consumption feature matcher, and a job explorer;
the job manager submits a given new job to the Hadoop master node: the first time, it submits the new job to the clustering cluster with a specified input data set and the default parameter configuration, in order to group the new job; the second time, it submits the new job to the current cluster with a specified input data set and the optimal parameter configuration;
the job explorer collects the time-series information on resource consumption while the new job first submitted by the job manager is running, preprocesses it, builds a resource consumption feature vector, and provides the vector to the resource consumption feature matcher;
the resource consumption feature matcher compares the distance between this feature vector and the resource consumption feature vector of each group's cluster center, obtains the job group to which the job belongs, and provides it to the optimizer;
the optimizer selects the corresponding performance model according to the grouping result, and computes and searches for the most reasonable parameter configuration so as to improve resource utilization and job execution efficiency; finally, it provides the optimal parameter configuration and the input data set to the job manager, which submits the newly submitted job, together with its optimal parameter configuration and input data set, to the Hadoop master node of the current cluster to run.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610550098.7A CN106202431B (en) | 2016-07-13 | 2016-07-13 | A kind of Hadoop parameter automated tuning method and system based on machine learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610550098.7A CN106202431B (en) | 2016-07-13 | 2016-07-13 | A kind of Hadoop parameter automated tuning method and system based on machine learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106202431A | 2016-12-07 |
CN106202431B | 2019-06-28 |
Family
ID=57477667
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610550098.7A Active CN106202431B (en) | 2016-07-13 | 2016-07-13 | A kind of Hadoop parameter automated tuning method and system based on machine learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106202431B (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102929667A (en) * | 2012-10-24 | 2013-02-13 | 曙光信息产业(北京)有限公司 | Method for optimizing hadoop cluster performance |
CN103064664A (en) * | 2012-11-28 | 2013-04-24 | 华中科技大学 | Hadoop parameter automatic optimization method and system based on performance pre-evaluation |
US20130254196A1 (en) * | 2012-03-26 | 2013-09-26 | Duke University | Cost-based optimization of configuration parameters and cluster sizing for hadoop |
CN103701635A (en) * | 2013-12-10 | 2014-04-02 | 中国科学院深圳先进技术研究院 | Method and device for configuring Hadoop parameters on line |
CN103713935A (en) * | 2013-12-04 | 2014-04-09 | 中国科学院深圳先进技术研究院 | Method and device for managing Hadoop cluster resources in online manner |
CN103942108A (en) * | 2014-04-25 | 2014-07-23 | 四川大学 | Resource parameter optimization method under Hadoop homogenous cluster |
WO2015066979A1 (en) * | 2013-11-07 | 2015-05-14 | 浪潮电子信息产业股份有限公司 | Machine learning method for mapreduce task resource configuration parameters |
CN104750780A (en) * | 2015-03-04 | 2015-07-01 | 北京航空航天大学 | Hadoop configuration parameter optimization method based on statistic analysis |
CN105184424A (en) * | 2015-10-19 | 2015-12-23 | 国网山东省电力公司菏泽供电公司 | Mapreduced short period load prediction method of multinucleated function learning SVM realizing multi-source heterogeneous data fusion |
Non-Patent Citations (5)
Title |
---|
TIANYAO SUN ET AL: "Accelerating Support Vector Machine Learning with GPU-Based MapReduce", 《 2015 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS》 * |
曾林西: "基于性能预估的Hadoop参数自动调优系统", 《中国优秀硕士学位论文全文数据库信息科技辑》 * |
王皎等: "Hadoop集群参数的自动调优", 《电脑知识与技术》 * |
郑晓薇等: "Hadoop 集群性能参数自动调优信息库系统构建", 《小型微型计算机系统》 * |
陶杭: "基于Hadoop的SVM算法优化及在文本分类中的应用", 《中国优秀硕士学位论文全文数据库信息科技辑》 * |
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107025141A (en) * | 2017-05-18 | 2017-08-08 | 成都海天数联科技有限公司 | A kind of dispatching method based on big data mixture operation model |
CN107025141B (en) * | 2017-05-18 | 2020-09-01 | 成都海天数联科技有限公司 | Scheduling method based on big data mixed operation model |
CN107229693A (en) * | 2017-05-22 | 2017-10-03 | 哈工大大数据产业有限公司 | The method and system of big data system configuration parameter tuning based on deep learning |
CN107229693B (en) * | 2017-05-22 | 2018-05-01 | 哈工大大数据产业有限公司 | The method and system of big data system configuration parameter tuning based on deep learning |
CN109144716A (en) * | 2017-06-28 | 2019-01-04 | 中兴通讯股份有限公司 | Operating system dispatching method and device, equipment based on machine learning |
CN111652380B (en) * | 2017-10-31 | 2023-12-22 | 第四范式(北京)技术有限公司 | Method and system for optimizing algorithm parameters aiming at machine learning algorithm |
CN111652380A (en) * | 2017-10-31 | 2020-09-11 | 第四范式(北京)技术有限公司 | Method and system for adjusting and optimizing algorithm parameters aiming at machine learning algorithm |
CN108052425A (en) * | 2017-12-07 | 2018-05-18 | 郑州云海信息技术有限公司 | A kind of method and device of the lifting system performance based on kernel tuning |
CN108376180A (en) * | 2018-04-03 | 2018-08-07 | 哈工大大数据(哈尔滨)智能科技有限公司 | Influence the key parameter lookup method and device of big data system performance |
CN108376180B (en) * | 2018-04-03 | 2020-09-01 | 哈工大大数据(哈尔滨)智能科技有限公司 | Key parameter searching method and device influencing performance of big data system |
CN110427356B (en) * | 2018-04-26 | 2021-08-13 | 中移(苏州)软件技术有限公司 | Parameter configuration method and equipment |
CN110427356A (en) * | 2018-04-26 | 2019-11-08 | 中移(苏州)软件技术有限公司 | One parameter configuration method and equipment |
CN109710499A (en) * | 2018-11-13 | 2019-05-03 | 平安科技(深圳)有限公司 | The recognition methods of computer equipment performance and device |
CN109710499B (en) * | 2018-11-13 | 2023-01-17 | 平安科技(深圳)有限公司 | Computer equipment performance identification method and device |
CN110188804B (en) * | 2019-05-16 | 2022-12-20 | 武汉工程大学 | Method for searching optimal classification model parameters of support vector machine based on MapReduce framework |
CN110188804A (en) * | 2019-05-16 | 2019-08-30 | 武汉工程大学 | The method of support vector machines optimal classification model parameter search based on MapReduce frame |
WO2021051578A1 (en) * | 2019-09-17 | 2021-03-25 | 平安科技(深圳)有限公司 | Method and device for performance feature dimensionality reduction, electronic device, and storage medium |
CN111858003B (en) * | 2020-07-16 | 2021-05-28 | 山东大学 | Hadoop optimal parameter evaluation method and device |
CN111858003A (en) * | 2020-07-16 | 2020-10-30 | 山东大学 | Hadoop optimal parameter evaluation method and device |
CN112884017A (en) * | 2021-01-28 | 2021-06-01 | 平安科技(深圳)有限公司 | Data analysis method based on data space and computer equipment |
CN113010312A (en) * | 2021-03-11 | 2021-06-22 | 山东英信计算机技术有限公司 | Hyper-parameter tuning method, device and storage medium |
CN113010312B (en) * | 2021-03-11 | 2024-01-23 | 山东英信计算机技术有限公司 | Super-parameter tuning method, device and storage medium |
CN113032367A (en) * | 2021-03-24 | 2021-06-25 | 安徽大学 | Dynamic load scene-oriented cross-layer configuration parameter collaborative tuning method and system for big data system |
CN113553057A (en) * | 2021-07-22 | 2021-10-26 | 中国电子科技集团公司第十五研究所 | Optimization system for parallel computing of GPUs with different architectures |
CN113553057B (en) * | 2021-07-22 | 2022-09-09 | 中国电子科技集团公司第十五研究所 | Optimization system for parallel computing of GPUs with different architectures |
CN114169651A (en) * | 2022-02-14 | 2022-03-11 | 中国空气动力研究与发展中心计算空气动力研究所 | Active prediction method for supercomputer operation failure based on application similarity |
CN114861781A (en) * | 2022-04-25 | 2022-08-05 | 北京科杰科技有限公司 | Automatic parameter adjustment optimization method and device and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN106202431B (en) | 2019-06-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106202431A (en) | A kind of Hadoop parameter automated tuning method and system based on machine learning | |
US20170330078A1 (en) | Method and system for automated model building | |
Azzalini et al. | Clustering via nonparametric density estimation: The R package pdfCluster | |
CN109492774B (en) | Deep learning-based cloud resource scheduling method | |
CN102799486B (en) | Data sampling and partitioning method for MapReduce system | |
Hasan et al. | A machine learning approach to sparql query performance prediction | |
Ruan et al. | Workload time series prediction in storage systems: a deep learning based approach | |
Li et al. | Production task queue optimization based on multi-attribute evaluation for complex product assembly workshop | |
CN103605662A (en) | Distributed computation frame parameter optimizing method, device and system | |
CN111966495B (en) | Data processing method and device | |
CN107357652A (en) | A kind of cloud computing method for scheduling task based on segmentation sequence and standard deviation Dynamic gene | |
CN105786681A (en) | Server performance evaluating and server updating method for data center | |
CN108921324A (en) | Platform area short-term load forecasting method based on distribution transforming cluster | |
CN110147808A (en) | A kind of novel battery screening technique in groups | |
CN107908536A (en) | To the performance estimating method and system of GPU applications in CPU GPU isomerous environments | |
CN103605711A (en) | Construction method and device, classification method and device of support vector machine | |
CN106100922A (en) | The Forecasting Methodology of the network traffics of TCN and device | |
CN112579273A (en) | Task scheduling method and device and computer readable storage medium | |
CN110825526B (en) | Distributed scheduling method and device based on ER relationship, equipment and storage medium | |
CN112819246A (en) | Energy demand prediction method for optimizing neural network based on cuckoo algorithm | |
WO2012111235A1 (en) | Information processing device, information processing method, and storage medium | |
CN111831418A (en) | Big data analysis job performance optimization method based on delay scheduling technology | |
Azimzadeh et al. | Multi-objective job scheduling algorithm in cloud computing based on reliability and time | |
CN110647682A (en) | Associated recommendation system for transaction data | |
CN112328332B (en) | Database configuration optimization method for cloud computing environment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||